How to combine disparate data when parsing a website without a common identifier in JavaScript?
Good day. I'm collecting data from a website for my application and have run into a problem combining information from different sources. It's convenient for me to use JavaScript for parsing the external listing pages, but working with the internal pages that contain specifications is more challenging.
Problem: I have two data arrays that need to be merged, but they are not connected through a common ID:
- Data from external pages (with assigned IDs):
[
  {
    "id": 1,
    "name": "Intel Core i3-12100F OEM",
    "img": "img",
    "price": "6099"
  },
  // ...
]
- Data from internal pages (with detailed information):
{
  "data": {
    "code": "5444958",
    "name": "Processor Intel Core i3-14100F OEM",
    "specs": "[LGA 1700, 4 x 3.5 GHz, L2 - 5 MB, L3 - 12 MB, 2 x DDR4, DDR5-4800 MHz, TDP 110 W]",
    "description": "4-core processor...",
    "price": 8299,
    "bonus": 0,
    "imageUrl": "img",
    "characteristics": {
      "General Parameters": [
        {
          "title": "Model",
          "value": "Intel Core i3-14100F"
        },
        // ...
      ]
    },
    // ...
  }
}
Questions:
- What approaches are available for matching records based on name or other fields?
- Should I use loops or libraries like pandas-js for merging data?
- How to efficiently combine disparate data in JavaScript when parsing a website?
Merging Disparate Data in Web Scraping Without Common Identifiers in JavaScript
When merging disparate data from web scraping without a common identifier in JavaScript, several approaches can be used, including fuzzy matching, synonym indexing, and weighted fields. The main methods include fuzzy search libraries like Fuse.js, as well as string comparison algorithms like Jaro-Winkler.
Table of Contents
- Basic Approaches to Data Merging
- Libraries for Fuzzy Search in JavaScript
- Data Matching Algorithms
- Practical Implementation of Data Merging
- Performance Optimization
- Code Example for Data Merging
- Recommendations and Best Practices
Basic Approaches to Data Merging
When there’s no common identifier for merging data from different sources, you can use the following strategies:
1. Name-based fuzzy matching
The primary approach is to use fuzzy search algorithms to match product names. This is the most effective method when names differ slightly across sources.
2. Multi-factor matching
Use a combination of fields to improve accuracy:
- Product name
- Price (with an acceptable range)
- Image (comparing hashes)
- Other unique characteristics
3. Synonym indexing
Create alternative indexes for each product, including possible typos and abbreviations.
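A synonym index can be sketched as a plain `Map` from normalized aliases to records. The helper name and the alias data below are invented for illustration; in practice the alias lists would come from your own domain knowledge (common abbreviations, known typos):

```javascript
// Hypothetical sketch: index each product under its normalized name plus a
// hand-maintained list of aliases/abbreviations (the alias data is invented).
function buildSynonymIndex(items, aliases = {}) {
  const index = new Map();
  for (const item of items) {
    const keys = [item.name, ...(aliases[item.name] || [])];
    for (const key of keys) {
      index.set(key.toLowerCase().replace(/\s+/g, ' ').trim(), item);
    }
  }
  return index;
}

const synonymIndex = buildSynonymIndex(
  [{ id: 1, name: 'Intel Core i3-12100F OEM' }],
  { 'Intel Core i3-12100F OEM': ['i3-12100F', 'i3 12100F OEM'] }
);

// A lookup by an abbreviated alias now resolves to the full record
const hit = synonymIndex.get('i3-12100f');
```

The trade-off is that this only handles variants you anticipated; fuzzy matching covers the rest.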
Important: According to research from DataScienceCentral, when working with fuzzy matching, two key aspects must be considered: how to weight proxy fields and how to measure false positive and false negative errors.
Libraries for Fuzzy Search in JavaScript
Fuse.js
Fuse.js is a lightweight JavaScript library for fuzzy search that provides various algorithms for string matching.
const Fuse = require('fuse.js');

const options = {
  keys: ['name'],
  threshold: 0.3,   // Match threshold: 0.0 requires a perfect match, 1.0 matches anything
  distance: 100,    // How far from the expected position a match may still occur
  includeScore: true
};

const fuse = new Fuse(externalData, options);

// Search for matches using the name from the internal record
const result = fuse.search(internalData.data.name);
FuzzySet.js
FuzzySet.js provides a data structure for performing full-text search with probable typo detection.
const FuzzySet = require('fuzzyset.js');

const fuzzySet = FuzzySet();

// Add names from the external data
externalData.forEach(item => {
  fuzzySet.add(item.name);
});

// Search for matches using the name from the internal record
const matches = fuzzySet.get(internalData.data.name);
Library Comparison
| Library | Features | Speed | Accuracy |
|---|---|---|---|
| Fuse.js | Flexible configuration, supports multi-field arrays | Medium | High |
| FuzzySet.js | Easy to use, good for typos | High | Medium |
| Custom Jaro-Winkler | Full control over algorithm | Low | High |
Data Matching Algorithms
Jaro-Winkler Algorithm
This algorithm is well-suited for comparing short strings like product names.
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1.0;
  const m1 = s1.length;
  const m2 = s2.length;
  if (m1 === 0 || m2 === 0) return 0.0;
  const matchDistance = Math.floor(Math.max(m1, m2) / 2) - 1;
  const s1Matches = new Array(m1).fill(false);
  const s2Matches = new Array(m2).fill(false);
  let matches = 0;
  let transpositions = 0;
  // Find matching characters within the allowed distance
  for (let i = 0; i < m1; i++) {
    const start = Math.max(0, i - matchDistance);
    const end = Math.min(i + matchDistance + 1, m2);
    for (let j = start; j < end; j++) {
      if (!s2Matches[j] && s1[i] === s2[j]) {
        s1Matches[i] = true;
        s2Matches[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0.0;
  // Count transpositions
  let k = 0;
  for (let i = 0; i < m1; i++) {
    if (s1Matches[i]) {
      while (!s2Matches[k]) k++;
      if (s1[i] !== s2[k]) transpositions++;
      k++;
    }
  }
  const jaro = (matches / m1 + matches / m2 + (matches - transpositions / 2) / matches) / 3;
  // Winkler modification: boost scores for strings sharing a common prefix (up to 4 characters)
  let prefixLength = 0;
  const maxPrefix = Math.min(4, m1, m2);
  while (prefixLength < maxPrefix && s1[prefixLength] === s2[prefixLength]) prefixLength++;
  return jaro + prefixLength * 0.1 * (1 - jaro);
}
Combining TF-IDF with Jaro-Winkler
As noted in a Stack Overflow discussion, you can replace exact token matching in TF-IDF with approximate matching scored by the Jaro-Winkler scheme.
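The idea can be sketched as follows: each token of one name is scored against the most similar token of the other name, instead of requiring exact token equality as plain TF-IDF does. Any string-similarity function can be plugged in; the trivial common-prefix measure below is only a self-contained stand-in, and in practice you would pass the jaroWinkler() function defined above:

```javascript
// Stand-in similarity: fraction of the longer string covered by the common prefix.
function prefixSimilarity(a, b) {
  const len = Math.min(a.length, b.length);
  let common = 0;
  while (common < len && a[common] === b[common]) common++;
  return common / Math.max(a.length, b.length);
}

// "Soft" token matching: average, over the tokens of nameA, the best
// similarity each token achieves against any token of nameB.
function softTokenScore(nameA, nameB, sim = prefixSimilarity) {
  const tokensA = nameA.toLowerCase().split(/\s+/);
  const tokensB = nameB.toLowerCase().split(/\s+/);
  let total = 0;
  for (const tokenA of tokensA) {
    total += Math.max(...tokensB.map(tokenB => sim(tokenA, tokenB)));
  }
  return total / tokensA.length; // 1.0 when every token has an exact twin
}
```

With this scheme, "Intel Core i3-12100F OEM" scores 1.0 against "Processor Intel Core i3-12100F OEM", since every token of the shorter name appears verbatim in the longer one; the extra "Processor" prefix no longer hurts the score.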
Field Weighting
To improve matching accuracy, you can use weighted fields:
function calculateMatchScore(external, internal) {
  let score = 0;
  let maxScore = 0;
  // Name (weight 40%)
  const nameScore = jaroWinkler(external.name, internal.name);
  score += nameScore * 0.4;
  maxScore += 0.4;
  // Price (weight 30%), scaled linearly within an acceptable range
  const priceDiff = Math.abs(parseFloat(external.price) - parseFloat(internal.price));
  const priceScore = priceDiff < 1000 ? 1 - priceDiff / 1000 : 0;
  score += priceScore * 0.3;
  maxScore += 0.3;
  // Image (weight 20%), exact URL match only
  const imageScore = external.img === internal.imageUrl ? 1 : 0;
  score += imageScore * 0.2;
  maxScore += 0.2;
  // Specifications (weight 10%); either side may lack this field
  const specsScore = jaroWinkler(external.specs || '', internal.specs || '');
  score += specsScore * 0.1;
  maxScore += 0.1;
  return score / maxScore;
}
Practical Implementation of Data Merging
Step 1: Data Preparation
First, normalize the data to improve matching quality:
function normalizeString(str) {
  return (str || '')
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '') // Remove punctuation (a plain \w would also strip Cyrillic letters)
    .replace(/\s+/g, ' ')             // Collapse whitespace
    .trim();
}

function prepareData(data) {
  return data.map(item => ({
    ...item,
    normalizedName: normalizeString(item.name),
    normalizedSpecs: normalizeString(item.specs || '')
  }));
}
Step 2: Creating an Index
Optimize search by creating an index of external data:
function createIndex(externalData) {
  const index = {};
  externalData.forEach(item => {
    const key = item.normalizedName.split(' ')[0]; // First word as the blocking key
    if (!index[key]) {
      index[key] = [];
    }
    index[key].push(item);
  });
  return index;
}
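First-word blocking is fragile when one source prefixes names (e.g. "Processor Intel …" vs "Intel …"): such pairs land in different buckets and are never compared. A sketch of a more forgiving alternative is to index every token, so a record becomes a candidate whenever any token overlaps. The function names here are illustrative, and items are assumed to already carry a normalizedName field:

```javascript
// Index each record under every token of its normalized name.
function createTokenIndex(items) {
  const index = new Map();
  for (const item of items) {
    for (const token of item.normalizedName.split(' ')) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(item);
    }
  }
  return index;
}

// Collect every record sharing at least one token with the query name.
function candidatesFor(index, normalizedName) {
  const found = new Set();
  for (const token of normalizedName.split(' ')) {
    for (const item of index.get(token) || []) found.add(item);
  }
  return [...found];
}
```

This yields more candidates per query than first-word blocking, so it trades a little speed for recall; the scoring step then filters out the false candidates.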
Step 3: Main Merging Algorithm
function mergeDatasets(externalData, internalData, threshold = 0.7) {
  // Internal records are assumed to be flattened (the contents of each "data" object)
  const preparedExternal = prepareData(externalData);
  const preparedInternal = prepareData(internalData);
  const index = createIndex(preparedExternal);
  const result = [];
  preparedInternal.forEach(internal => {
    const firstWord = internal.normalizedName.split(' ')[0];
    const candidates = index[firstWord] || [];
    let bestMatch = null;
    let bestScore = 0;
    candidates.forEach(external => {
      const score = calculateMatchScore(external, internal);
      if (score > bestScore) {
        bestScore = score;
        bestMatch = external;
      }
    });
    if (bestScore >= threshold) {
      result.push({
        ...bestMatch,
        ...internal,
        confidence: bestScore,
        matchedFields: getMatchedFields(bestMatch, internal)
      });
    } else {
      // No sufficiently confident match found
      result.push({
        id: null,
        confidence: 0,
        ...internal,
        matchedFields: []
      });
    }
  });
  return result;
}
Step 4: Processing Results
function getMatchedFields(external, internal) {
  const matchedFields = [];
  if (jaroWinkler(external.name, internal.name) > 0.8) {
    matchedFields.push('name');
  }
  if (Math.abs(parseFloat(external.price) - parseFloat(internal.price)) < 100) {
    matchedFields.push('price');
  }
  if (external.img === internal.imageUrl) {
    matchedFields.push('image');
  }
  return matchedFields;
}
Performance Optimization
Caching Results
// Simple djb2 hash over the serialized data, used to build cache keys
function hashData(data) {
  const str = JSON.stringify(data);
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) + hash + str.charCodeAt(i)) | 0;
  }
  return hash;
}

const mergeCache = new Map();

function cachedMerge(externalData, internalData) {
  const cacheKey = `${hashData(externalData)}:${hashData(internalData)}`;
  if (mergeCache.has(cacheKey)) {
    return mergeCache.get(cacheKey);
  }
  const result = mergeDatasets(externalData, internalData);
  mergeCache.set(cacheKey, result);
  return result;
}
Parallel Processing
const { Worker } = require('worker_threads');

function parallelMerge(externalData, internalData, chunkSize = 100) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    for (let i = 0; i < internalData.length; i += chunkSize) {
      chunks.push(internalData.slice(i, i + chunkSize));
    }
    const results = [];
    let completed = 0;
    chunks.forEach((chunk, index) => {
      const worker = new Worker('./mergeWorker.js', {
        workerData: {
          externalData,
          internalData: chunk
        }
      });
      worker.on('message', (result) => {
        results[index] = result;
        completed++;
        if (completed === chunks.length) {
          resolve(results.flat());
        }
      });
      worker.on('error', reject);
    });
  });
}
Code Example for Data Merging
// Main data merging script
const Fuse = require('fuse.js');

async function mergeProductData() {
  // Data from external pages
  const externalData = [
    {
      "id": 1,
      "name": "Intel Core i3-12100F OEM",
      "img": "img1",
      "price": "6099"
    },
    {
      "id": 2,
      "name": "AMD Ryzen 5 5600G",
      "img": "img2",
      "price": "8299"
    }
  ];

  // Data from internal pages
  const internalData = [
    {
      "data": {
        "code": "5444958",
        "name": "Processor Intel Core i3-14100F OEM",
        "specs": "[LGA 1700, 4 x 3.5 GHz, L2 - 5 MB, L3 - 12 MB, 2 x DDR4, DDR5-4800 MHz, TDP 110W]",
        "description": "4-core processor...",
        "price": 8299,
        "bonus": 0,
        "imageUrl": "img1",
        "characteristics": {
          "General Parameters": [
            {
              "title": "Model",
              "value": "Intel Core i3-14100F"
            }
          ]
        }
      }
    }
  ];

  // Fuse.js configuration
  const fuseOptions = {
    keys: [
      { name: 'name', weight: 0.4 },
      { name: 'price', weight: 0.3 },
      { name: 'img', weight: 0.2 },
      { name: 'specs', weight: 0.1 }
    ],
    threshold: 0.4,
    distance: 100,
    includeScore: true,
    minMatchCharLength: 3
  };

  // Create a Fuse instance over the external data
  const fuse = new Fuse(externalData, fuseOptions);

  // Merge data
  const mergedData = internalData.map(internal => {
    const searchResult = fuse.search(internal.data.name);
    if (searchResult.length > 0) {
      const bestMatch = searchResult[0];
      return {
        ...bestMatch.item,
        ...internal.data,
        // Fuse.js scores range from 0 (perfect match) to 1, so invert for confidence
        confidence: 1 - bestMatch.score,
        matchedFields: Object.keys(bestMatch.item).filter(key =>
          bestMatch.item[key] === internal.data[key]
        )
      };
    }
    return {
      id: null,
      confidence: 0,
      ...internal.data,
      matchedFields: []
    };
  });

  console.log('Merged data:', mergedData);
  return mergedData;
}

// Execution
mergeProductData().catch(console.error);
Recommendations and Best Practices
1. Choosing the Right Algorithm
- For exact names: Fuse.js with low threshold
- For typos: Jaro-Winkler
- For large datasets: Indexing + caching
2. Configuring Weight Coefficients
Adjust weights based on the importance of each field:
const weights = {
  name: 0.5,        // Name is most important
  price: 0.2,       // Price is secondary
  image: 0.15,      // Image for additional verification
  specs: 0.1,       // Specifications for a final check
  description: 0.05 // Description is rarely decisive
};
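A generic scorer driven by such a weight table can be sketched as follows. Each field gets its own similarity function; fields without one are simply skipped, so weights and similarity functions can evolve independently. The function names and the simplified per-field similarities below are illustrative stand-ins, not a fixed API:

```javascript
// Weighted average of per-field similarities, normalized by the weights actually used.
function weightedScore(external, internal, weights, sims) {
  let score = 0;
  let totalWeight = 0;
  for (const [field, weight] of Object.entries(weights)) {
    const sim = sims[field];
    if (!sim) continue; // No similarity function defined for this field
    score += sim(external, internal) * weight;
    totalWeight += weight;
  }
  return totalWeight > 0 ? score / totalWeight : 0;
}

// Simplified stand-in similarity functions for three of the fields
const sims = {
  name: (e, i) => (e.name === i.name ? 1 : 0),
  price: (e, i) => {
    const diff = Math.abs(parseFloat(e.price) - parseFloat(i.price));
    return diff < 1000 ? 1 - diff / 1000 : 0;
  },
  image: (e, i) => (e.img === i.imageUrl ? 1 : 0)
};

const score = weightedScore(
  { name: 'Intel Core i3-12100F OEM', price: '6099', img: 'img1' },
  { name: 'Intel Core i3-12100F OEM', price: 6149, imageUrl: 'img1' },
  { name: 0.5, price: 0.2, image: 0.15, specs: 0.1, description: 0.05 },
  sims
);
```

Normalizing by the weights actually used keeps the score comparable even when some fields (here specs and description) have no similarity function yet.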
3. Error Handling
function safeMerge(externalData, internalData) {
  try {
    return mergeDatasets(externalData, internalData);
  } catch (error) {
    console.error('Error merging data:', error);
    return [];
  }
}
4. Result Validation
function validateMergedData(mergedData) {
  return mergedData.filter(item => {
    // Require the key fields to be present
    const hasRequiredFields = item.id && item.name && item.price;
    // Require sufficient matching confidence
    const hasGoodConfidence = item.confidence > 0.6;
    return hasRequiredFields && hasGoodConfidence;
  });
}
5. Integration with Scraping
For effective integration with the scraping process, you can use the following approach:
class DataMerger {
  constructor(options = {}) {
    this.options = {
      threshold: 0.7,
      weights: {
        name: 0.4,
        price: 0.3,
        image: 0.2,
        specs: 0.1
      },
      ...options
    };
    this.externalData = [];
    this.fuse = null;
  }

  addExternalData(data) {
    this.externalData = data;
    this.updateIndex();
  }

  updateIndex() {
    this.fuse = new Fuse(this.externalData, {
      keys: Object.keys(this.options.weights).map(key => ({
        name: key,
        weight: this.options.weights[key]
      })),
      threshold: this.options.threshold,
      includeScore: true
    });
  }

  mergeInternalData(internalData) {
    // Internal records are assumed flattened, with a top-level "name" field
    return internalData.map(internal => {
      const result = this.fuse.search(internal.name)[0];
      // Fuse.js scores are distances: 0 is a perfect match, 1 is no match
      const confidence = result ? 1 - result.score : 0;
      if (result && confidence >= this.options.threshold) {
        return {
          ...result.item,
          ...internal,
          confidence
        };
      }
      return {
        id: null,
        confidence: 0,
        ...internal
      };
    });
  }
}

// Usage
const merger = new DataMerger({
  threshold: 0.6,
  weights: {
    name: 0.5,
    price: 0.3,
    image: 0.2
  }
});

merger.addExternalData(externalData);
const merged = merger.mergeInternalData(internalData);
Sources
- [Fuzzy Merge - Guides](https://povertyaction.github.io/guides/cleaning/04 Data Aggregation/02 Fuzzy Merge/) - guide to fuzzy data merging
- A Comprehensive Guide to Matching Web-Scraped Data (Crawlbase) - methods for matching web-scraped data
- Detailed Guide to Data Matching - overview of record-matching techniques
- JavaScript fuzzy search that makes sense (Stack Overflow) - discussion of fuzzy search algorithms
- Fuzzy Bootstrap Matching (DataScienceCentral.com) - techniques for merging data files without key fields
- Fast, accurate and multilingual fuzzy search library for the frontend (Reddit) - library recommendations for fuzzy search
- How to Implement Fuzzy Search in JavaScript (Codementor) - practical implementation of fuzzy search
- Fuse.js - official documentation for the lightweight fuzzy-search library
- Fuzzy Search in JavaScript (GeeksforGeeks) - overview of fuzzy search in JavaScript
- fuzzyset.js - a fuzzy string set for JavaScript (GitHub Pages) - FuzzySet.js implementation
Conclusion
To effectively merge disparate web-scraped data without a common identifier in JavaScript, it's recommended to:
- Use specialized libraries like Fuse.js or FuzzySet.js for fuzzy string matching
- Implement a multi-factor evaluation system considering name, price, image, and specifications
- Create indexes to optimize performance when working with large data volumes
- Configure thresholds and weight coefficients based on the specific task
- Implement caching mechanisms and parallel processing to improve performance
The main challenge lies in balancing accuracy and performance. It’s advisable to start with simple algorithms and gradually increase complexity as needed. It’s also important to implement validation and error handling mechanisms to ensure system reliability.