NeuroAgent

How to Merge Data Without ID in JavaScript Web Scraping

Complete guide to merging data without common IDs in JavaScript web scraping. Learn fuzzy matching, Fuse.js, and Jaro-Winkler algorithms.

Question

How to combine disparate data when parsing a website without a common identifier in JavaScript?

Good day. I’m collecting data from a website for my application and I’ve encountered a problem with combining information from different sources. It’s more convenient for me to use JavaScript for parsing external pages, but working with internal pages containing specifications is more challenging.

Problem: I have two data arrays that need to be merged, but they are not connected through a common ID:

  1. Data from external pages (with assigned IDs):
javascript
[
    {
        'id':1,
        "name": "Intel Core i3-12100F OEM",
        "img": "img",
        "price": "6099"
    },
    // ...
]
  2. Data from internal pages (with detailed information):
javascript
{
  "data": {
    "code": "5444958",
    "name": "Процессор Intel Core i3-14100F OEM",
    "specs": "[LGA 1700, 4 x 3.5 ГГц, L2 - 5 МБ, L3 - 12 МБ, 2 х DDR4, DDR5-4800 МГц, TDP 110 Вт]",
    "description": "4-ядерный процессор...",
    "price": 8299,
    "bonus": 0,
    "imageUrl": "img",
    "characteristics": {
      "Общие параметры": [
        {
          "title": "Модель",
          "value": "Intel Core i3-14100F"
        },
        // ...
      ]
    },
    // ...
  }
}

Questions:

  • What approaches are available for matching records based on name or other fields?
  • Should I use loops or libraries like pandas-js for merging data?
  • How to efficiently combine disparate data in JavaScript when parsing a website?
NeuroAgent

Merging Disparate Data in Web Scraping Without Common Identifiers in JavaScript

When merging disparate data from web scraping without a common identifier in JavaScript, several approaches can be used, including fuzzy matching, synonym indexing, and weighted fields. The main methods include fuzzy search libraries like Fuse.js, as well as string comparison algorithms like Jaro-Winkler.

Table of Contents

  • Basic Approaches to Data Merging
  • Libraries for Fuzzy Search in JavaScript
  • Data Matching Algorithms
  • Practical Implementation of Data Merging
  • Performance Optimization
  • Code Example for Data Merging
  • Recommendations and Best Practices
  • Sources
  • Conclusion

Basic Approaches to Data Merging

When there’s no common identifier for merging data from different sources, you can use the following strategies:

1. Name-based fuzzy matching

The primary approach is to use fuzzy search algorithms to match product names. This is the most effective method when names differ slightly across sources.

2. Multi-factor matching

Use a combination of fields to improve accuracy:

  • Product name
  • Price (with an acceptable range)
  • Image (comparing hashes; see the sketch after this list)
  • Other unique characteristics
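
For the image factor, the example data in the question only contains placeholder URLs, so the code later in this guide falls back to a simple URL equality check. With real image URLs, a rough sketch of hash-based comparison could look like the following (assuming Node 18+ with a global fetch; the function name is illustrative, and a perceptual-hash library would be more robust against re-encoded images):

javascript
const crypto = require('crypto');

// Rough sketch: treat two images as equal when their content hashes match.
// Assumes Node 18+ global fetch; re-encoded or resized copies of the same
// image will NOT match with this approach.
async function imagesLikelyEqual(urlA, urlB) {
  const hashOf = async (url) => {
    const buf = Buffer.from(await (await fetch(url)).arrayBuffer());
    return crypto.createHash('sha256').update(buf).digest('hex');
  };
  const [a, b] = await Promise.all([hashOf(urlA), hashOf(urlB)]);
  return a === b;
}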

3. Synonym indexing

Create alternative indexes for each product, including possible typos and abbreviations.
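
As a sketch of this idea (assuming externalData is the array from the question; the alias rules below are illustrative only and would normally come from known abbreviations and frequent typos):

javascript
// Build a lookup from every known alias of a product to its canonical record
function buildSynonymIndex(items, aliasesFor) {
  const index = new Map();
  items.forEach(item => {
    // The canonical name and all of its aliases point at the same record
    [item.name, ...aliasesFor(item)].forEach(alias => {
      index.set(alias.toLowerCase().trim(), item);
    });
  });
  return index;
}

// Usage: resolve shorthand forms to the full catalog entry
const synonymIndex = buildSynonymIndex(externalData, item => [
  item.name.replace(' OEM', ''),        // "Intel Core i3-12100F"
  item.name.replace('Intel Core ', '')  // "i3-12100F OEM"
]);

synonymIndex.get('intel core i3-12100f'); // -> the record with id 1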

Important: According to research from DataScienceCentral, when working with fuzzy matching, two key aspects must be considered: how to weight proxy fields and how to measure false positive and false negative errors.
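
As a rough illustration of measuring those errors against a hand-labeled sample (the labeledPairs shape and matchFn are assumptions, standing in for your own labeled data and scoring function):

javascript
// Evaluate a match-scoring function against a hand-labeled sample.
// labeledPairs: [{ external, internal, isMatch: true|false }, ...] (assumed shape)
function evaluateMatcher(labeledPairs, matchFn, threshold) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;

  labeledPairs.forEach(({ external, internal, isMatch }) => {
    const predicted = matchFn(external, internal) >= threshold;
    if (predicted && isMatch) truePositives++;
    if (predicted && !isMatch) falsePositives++; // false positive
    if (!predicted && isMatch) falseNegatives++; // false negative
  });

  return {
    precision: truePositives / ((truePositives + falsePositives) || 1),
    recall: truePositives / ((truePositives + falseNegatives) || 1)
  };
}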


Libraries for Fuzzy Search in JavaScript

Fuse.js

Fuse.js is a lightweight JavaScript library for fuzzy search that provides various algorithms for string matching.

javascript
const Fuse = require('fuse.js');

const options = {
  keys: ['name'],
  threshold: 0.3, // Match threshold: 0.0 requires an exact match, 1.0 matches anything
  distance: 100, // How far from the expected location a match is allowed to occur
  includeScore: true
};

const fuse = new Fuse(externalData, options);

// Search for matches
const result = fuse.search(internalData.name);

FuzzySet.js

FuzzySet.js provides a data structure for performing full-text search with probable typo detection.

javascript
const FuzzySet = require('fuzzyset.js');

const fuzzySet = FuzzySet();

// Add names from external data
externalData.forEach(item => {
  fuzzySet.add(item.name);
});

// Search for matches
const matches = fuzzySet.get(internalData.name);

Library Comparison

| Library | Features | Speed | Accuracy |
|---------|----------|-------|----------|
| Fuse.js | Flexible configuration, supports multi-field arrays | Medium | High |
| FuzzySet.js | Easy to use, good for typos | High | Medium |
| Custom Jaro-Winkler | Full control over the algorithm | Low | High |

Data Matching Algorithms

Jaro-Winkler Algorithm

This algorithm is well-suited for comparing short strings like product names.

javascript
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1.0;
  
  const m1 = s1.length;
  const m2 = s2.length;
  if (m1 === 0 || m2 === 0) return 0.0;
  
  const matchDistance = Math.floor(Math.max(m1, m2) / 2) - 1;
  const s1Matches = new Array(m1).fill(false);
  const s2Matches = new Array(m2).fill(false);
  let matches = 0;
  let transpositions = 0;
  
  // Find matches
  for (let i = 0; i < m1; i++) {
    const start = Math.max(0, i - matchDistance);
    const end = Math.min(i + matchDistance + 1, m2);
    
    for (let j = start; j < end; j++) {
      if (!s2Matches[j] && s1[i] === s2[j]) {
        s1Matches[i] = true;
        s2Matches[j] = true;
        matches++;
        break;
      }
    }
  }
  
  if (matches === 0) return 0.0;
  
  // Count transpositions
  let k = 0;
  for (let i = 0; i < m1; i++) {
    if (s1Matches[i]) {
      while (!s2Matches[k]) k++;
      if (s1[i] !== s2[k]) transpositions++;
      k++;
    }
  }
  
  const jaro = (matches / m1 + matches / m2 + (matches - transpositions / 2) / matches) / 3;
  // Winkler boost: length of the common prefix, capped at 4 characters
  let prefixLength = 0;
  const maxPrefix = Math.min(4, m1, m2);
  while (prefixLength < maxPrefix && s1[prefixLength] === s2[prefixLength]) {
    prefixLength++;
  }
  
  return jaro + prefixLength * 0.1 * (1 - jaro);
}

Combining TF-IDF with Jaro-Winkler

As discussed on Stack Overflow, you can replace exact token matches in TF-IDF with approximate matches based on the Jaro-Winkler measure, so that similar tokens still contribute to the score.
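
A rough sketch of that idea, reusing the jaroWinkler function above and a corpus of product names to derive IDF weights (the 0.9 token-match threshold is an illustrative choice):

javascript
// Build an IDF lookup from a corpus of product names
function buildIdf(names) {
  const docFreq = new Map();
  names.forEach(name => {
    new Set(name.toLowerCase().split(/\s+/)).forEach(token => {
      docFreq.set(token, (docFreq.get(token) || 0) + 1);
    });
  });
  return token => Math.log(1 + names.length / (1 + (docFreq.get(token) || 0)));
}

// "Soft" TF-IDF: a token counts if its best Jaro-Winkler match in the other
// string exceeds theta, weighted by how rare the token is across the corpus
function softTfIdfSimilarity(a, b, idf, theta = 0.9) {
  const tokensA = a.toLowerCase().split(/\s+/);
  const tokensB = b.toLowerCase().split(/\s+/);
  let score = 0;
  let norm = 0;

  tokensA.forEach(tokenA => {
    const best = Math.max(...tokensB.map(tokenB => jaroWinkler(tokenA, tokenB)));
    const weight = idf(tokenA);
    norm += weight;
    if (best >= theta) score += weight * best;
  });

  return norm === 0 ? 0 : score / norm;
}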

Field Weighting

To improve matching accuracy, you can use weighted fields:

javascript
function calculateMatchScore(external, internal) {
  let score = 0;
  let maxScore = 0;
  
  // Name (weight 40%)
  const nameScore = jaroWinkler(external.name, internal.name);
  score += nameScore * 0.4;
  maxScore += 0.4;
  
  // Price (weight 30%)
  const priceDiff = Math.abs(parseFloat(external.price) - parseFloat(internal.price));
  const priceScore = priceDiff < 1000 ? 1 - (priceDiff / 1000) : 0;
  score += priceScore * 0.3;
  maxScore += 0.3;
  
  // Image (weight 20%)
  const imageScore = external.img === internal.imageUrl ? 1 : 0;
  score += imageScore * 0.2;
  maxScore += 0.2;
  
  // Specifications (weight 10%)
  const specsScore = jaroWinkler(external.specs || '', String(internal.specs || ''));
  score += specsScore * 0.1;
  maxScore += 0.1;
  
  return score / maxScore;
}

Practical Implementation of Data Merging

Step 1: Data Preparation

First, normalize the data to improve matching quality:

javascript
function normalizeString(str) {
  return str.toLowerCase()
    .replace(/[^\w\s]/g, '') // Remove special characters
    .replace(/\s+/g, ' ')    // Normalize whitespace
    .trim();
}

function prepareData(data) {
  return data.map(item => ({
    ...item,
    normalizedName: normalizeString(item.name),
    normalizedSpecs: normalizeString(item.specs || '')
  }));
}

Step 2: Creating an Index

Optimize search by creating an index of external data:

javascript
function createIndex(externalData) {
  const index = {};
  
  externalData.forEach(item => {
    const key = item.normalizedName.split(' ')[0]; // First word as key
    if (!index[key]) {
      index[key] = [];
    }
    index[key].push(item);
  });
  
  return index;
}

Step 3: Main Merging Algorithm

javascript
function mergeDatasets(externalData, internalData, threshold = 0.7) {
  // Internal items are assumed to already be unwrapped from the { data: {...} }
  // envelope, e.g. via internalData.map(item => item.data)
  const preparedExternal = prepareData(externalData);
  const preparedInternal = prepareData(internalData);
  const index = createIndex(preparedExternal);
  
  const result = [];
  
  preparedInternal.forEach(internal => {
    const firstWord = internal.normalizedName.split(' ')[0];
    const candidates = index[firstWord] || [];
    
    let bestMatch = null;
    let bestScore = 0;
    
    candidates.forEach(external => {
      const score = calculateMatchScore(external, internal);
      if (score > bestScore) {
        bestScore = score;
        bestMatch = external;
      }
    });
    
    if (bestScore >= threshold) {
      result.push({
        ...bestMatch,
        ...internal,
        confidence: bestScore,
        matchedFields: getMatchedFields(bestMatch, internal)
      });
    } else {
      // No good enough match
      result.push({
        id: null,
        confidence: 0,
        ...internal,
        matchedFields: []
      });
    }
  });
  
  return result;
}

Step 4: Processing Results

javascript
function getMatchedFields(external, internal) {
  const matchedFields = [];
  
  if (jaroWinkler(external.name, internal.name) > 0.8) {
    matchedFields.push('name');
  }
  
  if (Math.abs(parseFloat(external.price) - parseFloat(internal.price)) < 100) {
    matchedFields.push('price');
  }
  
  if (external.img === internal.imageUrl) {
    matchedFields.push('image');
  }
  
  return matchedFields;
}

Performance Optimization

Caching Results

javascript
const crypto = require('crypto');

const mergeCache = new Map();

// Content hash of a dataset, used to build a stable cache key
function hashData(data) {
  return crypto.createHash('md5').update(JSON.stringify(data)).digest('hex');
}

function cachedMerge(externalData, internalData) {
  const cacheKey = JSON.stringify({
    externalHash: hashData(externalData),
    internalHash: hashData(internalData)
  });
  
  if (mergeCache.has(cacheKey)) {
    return mergeCache.get(cacheKey);
  }
  
  const result = mergeDatasets(externalData, internalData);
  mergeCache.set(cacheKey, result);
  return result;
}

Parallel Processing

javascript
const { Worker } = require('worker_threads');

function parallelMerge(externalData, internalData, chunkSize = 100) {
  return new Promise((resolve) => {
    const chunks = [];
    for (let i = 0; i < internalData.length; i += chunkSize) {
      chunks.push(internalData.slice(i, i + chunkSize));
    }
    
    const workers = [];
    const results = [];
    
    chunks.forEach((chunk, index) => {
      // mergeWorker.js is expected to run the merge on its chunk and post the
      // result back (see the sketch after this example)
      const worker = new Worker('./mergeWorker.js', {
        workerData: {
          externalData,
          internalData: chunk
        }
      });
      
      worker.on('message', (result) => {
        results[index] = result;
        if (results.filter(r => r !== undefined).length === chunks.length) {
          resolve(results.flat());
        }
      });
      
      workers.push(worker);
    });
  });
}
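
The worker script itself is not shown above. A minimal sketch of what ./mergeWorker.js might contain, assuming mergeDatasets is exported from a local module (the ./merge path is an assumption):

javascript
// mergeWorker.js - runs the merge on one chunk and posts the result back
const { parentPort, workerData } = require('worker_threads');
const { mergeDatasets } = require('./merge'); // assumed local module exporting mergeDatasets

const { externalData, internalData } = workerData;
parentPort.postMessage(mergeDatasets(externalData, internalData));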

Code Example for Data Merging

javascript
// Main data merging script
const Fuse = require('fuse.js');

async function mergeProductData() {
  // Data from external pages
  const externalData = [
    {
      'id': 1,
      "name": "Intel Core i3-12100F OEM",
      "img": "img1",
      "price": "6099"
    },
    {
      'id': 2,
      "name": "AMD Ryzen 5 5600G",
      "img": "img2", 
      "price": "8299"
    }
  ];

  // Data from internal pages
  const internalData = [
    {
      "data": {
        "code": "5444958",
        "name": "Processor Intel Core i3-14100F OEM",
        "specs": "[LGA 1700, 4 x 3.5 GHz, L2 - 5 MB, L3 - 12 MB, 2 x DDR4, DDR5-4800 MHz, TDP 110W]",
        "description": "4-core processor...",
        "price": 8299,
        "bonus": 0,
        "imageUrl": "img1",
        "characteristics": {
          "General Parameters": [
            {
              "title": "Model",
              "value": "Intel Core i3-14100F"
            }
          ]
        }
      }
    }
  ];

  // Fuse.js configuration
  const fuseOptions = {
    keys: [
      { name: 'name', weight: 0.4 },
      { name: 'price', weight: 0.3 },
      { name: 'img', weight: 0.2 },
      { name: 'specs', weight: 0.1 }
    ],
    threshold: 0.4,
    distance: 100,
    includeScore: true,
    minMatchCharLength: 3
  };

  // Create Fuse instance
  const fuse = new Fuse(externalData, fuseOptions);

  // Merge data
  const mergedData = internalData.map(internal => {
    const searchResult = fuse.search(internal.data.name);
    
    if (searchResult.length > 0) {
      const bestMatch = searchResult[0];
      return {
        ...bestMatch.item,
        ...internal.data,
        // Fuse.js scores are "lower is better", so invert to get a confidence value
        confidence: 1 - bestMatch.score,
        matchedFields: Object.keys(bestMatch.item).filter(key => 
          bestMatch.item[key] === internal.data[key]
        )
      };
    }
    
    return {
      id: null,
      confidence: 0,
      ...internal.data,
      matchedFields: []
    };
  });

  console.log('Merged data:', mergedData);
  return mergedData;
}

// Execution
mergeProductData().catch(console.error);

Recommendations and Best Practices

1. Choosing the Right Algorithm

  • For exact names: Fuse.js with low threshold
  • For typos: Jaro-Winkler
  • For large datasets: Indexing + caching

2. Configuring Weight Coefficients

Adjust weights based on the importance of each field:

javascript
const weights = {
  name: 0.5,      // Name is most important
  price: 0.2,     // Price is secondary
  image: 0.15,    // Image for additional verification
  specs: 0.1,     // Specifications for final check
  description: 0.05 // Description rarely used
};

3. Error Handling

javascript
function safeMerge(externalData, internalData) {
  try {
    return mergeDatasets(externalData, internalData);
  } catch (error) {
    console.error('Error merging data:', error);
    return [];
  }
}

4. Result Validation

javascript
function validateMergedData(mergedData) {
  return mergedData.filter(item => {
    // Check for complete data
    const hasRequiredFields = item.id && item.name && item.price;
    
    // Check confidence
    const hasGoodConfidence = item.confidence > 0.6;
    
    return hasRequiredFields && hasGoodConfidence;
  });
}

5. Integration with Scraping

For effective integration with the scraping process, you can use the following approach:

javascript
class DataMerger {
  constructor(options = {}) {
    this.options = {
      threshold: 0.7,
      weights: {
        name: 0.4,
        price: 0.3,
        image: 0.2,
        specs: 0.1
      },
      ...options
    };
    
    this.externalData = [];
    this.fuse = null;
  }
  
  addExternalData(data) {
    this.externalData = data;
    this.updateIndex();
  }
  
  updateIndex() {
    this.fuse = new Fuse(this.externalData, {
      keys: Object.keys(this.options.weights).map(key => ({
        name: key,
        weight: this.options.weights[key]
      })),
      threshold: this.options.threshold,
      includeScore: true
    });
  }
  
  mergeInternalData(internalData) {
    return internalData.map(internal => {
      const result = this.fuse.search(internal.name)[0];
      
      // Fuse.js already filters candidates by the configured threshold, and a
      // lower score means a better match, so invert it into a confidence value
      if (result) {
        return {
          ...result.item,
          ...internal,
          confidence: 1 - result.score
        };
      }
      
      return {
        id: null,
        confidence: 0,
        ...internal
      };
    });
  }
}

// Usage
const merger = new DataMerger({
  threshold: 0.6,
  weights: {
    name: 0.5,
    price: 0.3,
    image: 0.2
  }
});

merger.addExternalData(externalData);
const merged = merger.mergeInternalData(internalData);

Sources

  1. [Fuzzy Merge - Guides](https://povertyaction.github.io/guides/cleaning/04 Data Aggregation/02 Fuzzy Merge/) - Comprehensive guide to fuzzy data merging
  2. A Comprehensive Guide to Matching Web-Scraped Data | Crawlbase - Methods for matching web-scraped data
  3. Detailed Guide to Data Matching - Detailed guide to data matching
  4. JavaScript fuzzy search that makes sense - Stack Overflow - Discussion of fuzzy search algorithms
  5. Fuzzy Bootstrap Matching - DataScienceCentral.com - Techniques for merging data files without key fields
  6. Fast, accurate and multilingual fuzzy search library for the frontend - Reddit - Libraries for fuzzy search
  7. How to Implement Fuzzy Search in JavaScript | Codementor - Practical implementation of fuzzy search
  8. Fuse.js | Lightweight fuzzy-search library - Official Fuse.js documentation
  9. Fuzzy Search in JavaScript - GeeksforGeeks - Overview of fuzzy search in JavaScript
  10. fuzzyset.js - a fuzzy string set for javascript - GitHub Pages - FuzzySet.js implementation

Conclusion

To merge disparate data from web scraping effectively without a common identifier in JavaScript, it's recommended to:

  1. Use specialized libraries like Fuse.js or FuzzySet.js for fuzzy string matching
  2. Implement a multi-factor evaluation system considering name, price, image, and specifications
  3. Create indexes to optimize performance when working with large data volumes
  4. Configure thresholds and weight coefficients based on the specific task
  5. Implement caching mechanisms and parallel processing to improve performance

The main challenge lies in balancing accuracy and performance. It’s advisable to start with simple algorithms and gradually increase complexity as needed. It’s also important to implement validation and error handling mechanisms to ensure system reliability.