NeuroAgent

How to merge data when web scraping without a common ID

Learn effective methods for merging disparate data when web scraping using JavaScript. From basic loops to specialized libraries for data merging without a common ID.

Question

How to combine disparate data when parsing a website with JavaScript? I’m collecting data from a website for an application and have encountered a problem: I find it more convenient to parse external pages using JavaScript, but this becomes inconvenient for collecting characteristics from internal pages. I have two data arrays: one from the external page (with basic product information) and another from the internal page (with detailed characteristics). The problem is that the data isn’t linked to a common ID. How can I efficiently combine this data? Should I use loops or can I consider a library like pandas-js? I’m also interested in what approaches exist for solving such tasks in JavaScript.

NeuroAgent

Combining Disparate Data in Web Scraping with JavaScript

Disparate data collected while scraping with JavaScript can be combined in several ways, from simple loops that match records on shared field values to specialized libraries. For your case, two arrays (basic product information and detailed characteristics) with no common ID, the practical options are fuzzy matching, composite keys built from several fields, or a data-merging library.

Basic Methods for Combining Data Without Common IDs

When datasets have no common identifier, there are still several ways to combine them. The underlying principle is to create a logical link between the datasets through matching values in other fields.

The most common methods include:

  1. Matching by a single field - if both arrays have a field with the same value (e.g., product name, SKU)
  2. Composite matching - using a combination of multiple fields to create a unique key
  3. Spatial matching - for geospatial data or visual elements
  4. Temporal matching - for data linked by time

For your task with basic information and detailed product characteristics, the most suitable approach would be one based on matches in multiple fields simultaneously.

Important: when there’s no common ID, the quality of matching directly depends on data accuracy and the chosen combination strategy. It’s recommended to first analyze the fields to identify the most reliable matching criteria.
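
The field analysis recommended above can be sketched as a small report. This is a minimal illustration, not a definitive method: the helper name `fieldOverlap`, the metric names, and the sample records are assumptions made for the example. It reports how unique a candidate field is on each side and what share of external values also occur in the internal data.

```javascript
// Normalize a value so trivial formatting differences do not hide a match
const normalize = value => String(value ?? '').trim().toLowerCase();

// Hypothetical helper: rate one candidate field pair as a matching criterion
function fieldOverlap(externalData, internalData, externalField, internalField) {
    const externalValues = externalData.map(item => normalize(item[externalField]));
    const internalValues = new Set(internalData.map(item => normalize(item[internalField])));
    const shared = externalValues.filter(value => internalValues.has(value)).length;

    return {
        // 1 means every record has a distinct value in this field
        externalUnique: new Set(externalValues).size / externalValues.length,
        internalUnique: internalValues.size / internalData.length,
        // share of external records whose value also exists in the internal data
        coverage: shared / externalValues.length
    };
}

// Illustrative data (both arrays are assumed to be non-empty)
const external = [{ name: 'iPhone 13' }, { name: 'MacBook Pro' }];
const internal = [{ title: 'iphone 13 ' }, { title: 'Galaxy S21' }];
const report = fieldOverlap(external, internal, 'name', 'title');
console.log(report);
```

A field with high uniqueness on both sides and high coverage is a good candidate for exact matching; low coverage suggests that fuzzy matching or a different field is needed.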

Data Matching Approaches Based on Similarities

Exact Matching

A simple and fast method that works well when data has exact matches in specific fields:

javascript
function exactMerge(externalData, internalData, matchField) {
    const internalMap = new Map();
    
    // Create a map for quick access by field
    internalData.forEach(item => {
        internalMap.set(item[matchField], item);
    });
    
    // Combine data
    return externalData.map(external => {
        const match = internalMap.get(external[matchField]);
        return match ? { ...external, ...match } : external;
    });
}
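
The function above assumes both arrays use the same property name and identical formatting. In scraped data the matching field is often named differently on each side (the product example later in this answer uses name and title) and differs in case or spacing. A possible variant, where `exactMergeBy` and `normalizeKey` are hypothetical names introduced for this sketch:

```javascript
// Collapse case and whitespace differences before comparing
function normalizeKey(value) {
    return String(value ?? '').toLowerCase().replace(/\s+/g, ' ').trim();
}

// Exact match on normalized values, with a separate field name per side
function exactMergeBy(externalData, internalData, externalField, internalField) {
    const internalMap = new Map();
    for (const item of internalData) {
        internalMap.set(normalizeKey(item[internalField]), item);
    }

    return externalData.map(external => {
        const match = internalMap.get(normalizeKey(external[externalField]));
        return match ? { ...external, ...match } : external;
    });
}

// Illustrative data: the names differ only in case and spacing
const merged = exactMergeBy(
    [{ name: 'iPhone  13', price: 799 }],
    [{ title: 'iphone 13', storage: '128GB' }],
    'name',
    'title'
);
console.log(merged);
```

Normalizing before the lookup keeps the merge linear in the number of records while still tolerating the most common formatting noise.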

Fuzzy Matching

A more complex but flexible approach for cases where exact matches aren’t possible:

javascript
const fuzzy = require('fuzzy');

// Note: fuzzy matches the search term as an in-order character subsequence and
// its scores are not normalized to 0..1, so calibrate the threshold on your data
function fuzzyMerge(externalData, internalData, matchField, threshold = 0.6) {
    // Build the candidate list once instead of on every iteration
    const candidates = internalData.map(item => String(item[matchField] ?? ''));
    
    return externalData.map(external => {
        const searchTerm = String(external[matchField] ?? '');
        // Results are sorted best-first; each carries the index of its candidate
        const [bestMatch] = fuzzy.filter(searchTerm, candidates);
        
        if (bestMatch && bestMatch.score >= threshold) {
            return { ...external, ...internalData[bestMatch.index] };
        }
        return external;
    });
}

Composite Matching

When a single field isn’t enough, you can use a combination of multiple fields to create a unique key:

javascript
function compositeMerge(externalData, internalData, fields) {
    const createKey = item => fields.map(field => item[field]).join('|');
    
    const internalMap = new Map();
    internalData.forEach(item => {
        internalMap.set(createKey(item), item);
    });
    
    return externalData.map(external => {
        const key = createKey(external);
        const match = internalMap.get(key);
        return match ? { ...external, ...match } : external;
    });
}
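
A composite key only helps if it is actually unique: a Map silently overwrites earlier records that produce the same key. The sketch below checks for that before merging. The names `compositeKey` and `findDuplicateKeys` are hypothetical, and the key is built with JSON.stringify rather than a joined string so that a delimiter appearing inside a value cannot make two different records look identical.

```javascript
// Build a key from several fields; JSON.stringify keeps field boundaries unambiguous
function compositeKey(item, fields) {
    const parts = fields.map(field => String(item[field] ?? '').trim().toLowerCase());
    return JSON.stringify(parts);
}

// Return every key that is produced by more than one record
function findDuplicateKeys(data, fields) {
    const counts = new Map();
    for (const item of data) {
        const key = compositeKey(item, fields);
        counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    return [...counts].filter(([, count]) => count > 1).map(([key]) => key);
}

// Illustrative data: the title alone is ambiguous, title plus storage is not
const details = [
    { title: 'MacBook Pro', storage: '512GB' },
    { title: 'MacBook Pro', storage: '1TB' }
];
const byTitle = findDuplicateKeys(details, ['title']);
const byTitleAndStorage = findDuplicateKeys(details, ['title', 'storage']);
console.log(byTitle, byTitleAndStorage);
```

If the check returns duplicates, add another field to the key or handle those records separately instead of letting one overwrite the other.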

Using Specialized Libraries

Lodash for Collection Operations

Lodash provides powerful tools for working with arrays and objects:

javascript
const _ = require('lodash');

function lodashMerge(externalData, internalData, matchField) {
    const internalMap = _.keyBy(internalData, matchField);
    return _.map(externalData, external => ({
        ...external,
        ...internalMap[external[matchField]]
    }));
}

Pandas-js for Data Analysis

For those familiar with the Python ecosystem, there’s pandas-js, which provides functionality similar to pandas:

javascript
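// pandas-js exposes DataFrame as a named export
const { DataFrame } = require('pandas-js');

function pandasMerge(externalData, internalData, matchField) {
    const df1 = new DataFrame(externalData);
    const df2 = new DataFrame(internalData);
    
    // Assumption: merge(other, on, how) takes an array of column names and
    // to_json({ orient: 'records' }) returns plain row objects; verify both
    // against the pandas-js version you install. An inner join also drops
    // external rows that have no match.
    return df1.merge(df2, [matchField], 'inner').to_json({ orient: 'records' });
}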

D3.js for Complex Merge Operations

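D3.js is mainly a visualization library, but its array helpers can support a join. Note that d3.merge only flattens an array of arrays into a single array, so it concatenates records rather than joining them. To join by key, index one side and look up each record:

javascript
const d3 = require('d3');

function d3Merge(externalData, internalData, matchField) {
    // d3.index builds a Map keyed by the accessor result
    // (it throws if two records share the same key)
    const internalIndex = d3.index(internalData, d => d[matchField]);
    
    return externalData.map(external => ({
        ...external,
        ...internalIndex.get(external[matchField])
    }));
}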

Practical JavaScript Implementation Examples

Example 1: Combining Product Data

Let’s look at a practical example of combining product data from external and internal sources:

javascript
// Data from external pages (basic information)
const externalProducts = [
    { name: 'iPhone 13', price: 799, category: 'smartphones' },
    { name: 'Samsung Galaxy S21', price: 699, category: 'smartphones' },
    { name: 'MacBook Pro', price: 1999, category: 'laptops' }
];

// Data from internal pages (detailed characteristics)
const internalDetails = [
    { title: 'iPhone 13', storage: '128GB', camera: '12MP', battery: '3240mAh' },
    { title: 'MacBook Pro', storage: '512GB', camera: '720p', battery: '58.2Wh' },
    { title: 'Samsung Galaxy S21', storage: '256GB', camera: '64MP', battery: '4000mAh' }
];

// Function for merging with fuzzy matching
function mergeProductData(external, internal) {
    const internalMap = {};
    internal.forEach(item => {
        internalMap[item.title.toLowerCase()] = item;
    });
    
    return external.map(product => {
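        // Compare names only: the internal keys are built from titles alone, so
        // appending the category would push every score below the 0.7 threshold
        const key = product.name.toLowerCase();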
        const bestMatch = findBestMatch(key, Object.keys(internalMap));
        
        if (bestMatch.score > 0.7) {
            return { ...product, ...internalMap[bestMatch.value] };
        }
        return product;
    });
}

// Simple fuzzy search implementation
function findBestMatch(searchTerm, options) {
    let bestScore = 0;
    let bestOption = null;
    
    options.forEach(option => {
        const score = calculateSimilarity(searchTerm, option);
        if (score > bestScore) {
            bestScore = score;
            bestOption = option;
        }
    });
    
    return { value: bestOption, score: bestScore };
}

function calculateSimilarity(str1, str2) {
    const longer = str1.length > str2.length ? str1 : str2;
    const shorter = str1.length > str2.length ? str2 : str1;
    
    if (longer.length === 0) return 1;
    
    const editDistance = getEditDistance(longer, shorter);
    return (longer.length - editDistance) / longer.length;
}

function getEditDistance(str1, str2) {
    const matrix = [];
    
    for (let i = 0; i <= str2.length; i++) {
        matrix[i] = [i];
    }
    
    for (let j = 0; j <= str1.length; j++) {
        matrix[0][j] = j;
    }
    
    for (let i = 1; i <= str2.length; i++) {
        for (let j = 1; j <= str1.length; j++) {
            if (str2.charAt(i - 1) === str1.charAt(j - 1)) {
                matrix[i][j] = matrix[i - 1][j - 1];
            } else {
                matrix[i][j] = Math.min(
                    matrix[i - 1][j - 1] + 1,
                    matrix[i][j - 1] + 1,
                    matrix[i - 1][j] + 1
                );
            }
        }
    }
    
    return matrix[str2.length][str1.length];
}

// Using the function
const mergedProducts = mergeProductData(externalProducts, internalDetails);
console.log(mergedProducts);

Example 2: Using a Specialized Library

javascript
// Using the fast-fuzzy library for efficient fuzzy search
const { search } = require('fast-fuzzy');

function mergeWithFuzzyLibrary(external, internal, matchFields, threshold = 0.6) {
    const internalMap = new Map();
    
    // Create keys based on multiple fields
    internal.forEach(item => {
        const key = matchFields.map(field => item[field]).join(' | ');
        internalMap.set(key, item);
    });
    const keys = Array.from(internalMap.keys());
    
    return external.map(externalItem => {
        const searchKey = matchFields.map(field => externalItem[field]).join(' | ');
        
        // search() drops candidates scoring below the threshold and sorts the
        // rest best-first; returnMatchData exposes the matched string as `item`
        const results = search(searchKey, keys, { threshold, returnMatchData: true });
        
        if (results.length > 0) {
            return { ...externalItem, ...internalMap.get(results[0].item) };
        }
        
        return externalItem;
    });
}

Performance Optimization When Working with Large Data Volumes

When working with large data volumes, performance becomes a critical factor. Here are several optimization strategies:

1. Using Map Instead of Objects

javascript
// Fast index creation
function createIndex(data, keyField) {
    const index = new Map();
    data.forEach(item => {
        index.set(item[keyField], item);
    });
    return index;
}

// Efficient merging
function fastMerge(external, internal, keyField) {
    const internalIndex = createIndex(internal, keyField);
    return external.map(externalItem => {
        const match = internalIndex.get(externalItem[keyField]);
        return match ? { ...externalItem, ...match } : externalItem;
    });
}

2. Batch Data Processing

javascript
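// Relies on createIndex from the previous example
async function batchMerge(externalData, internalData, keyField, batchSize = 1000) {
    // Build the lookup index once; every batch reuses it
    const internalIndex = createIndex(internalData, keyField);
    const results = [];
    
    for (let i = 0; i < externalData.length; i += batchSize) {
        const batch = externalData.slice(i, i + batchSize);
        for (const externalItem of batch) {
            const match = internalIndex.get(externalItem[keyField]);
            results.push(match ? { ...externalItem, ...match } : externalItem);
        }
        
        // Yield to the event loop between batches so other work is not blocked
        if (i + batchSize < externalData.length) {
            await new Promise(resolve => setTimeout(resolve, 0));
        }
    }
    
    return results;
}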

3. Caching Results

javascript
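// Cache lookup indexes per source array so repeated merges skip re-indexing.
// The WeakMap is keyed by the array itself: a key built only from array
// lengths would return stale results for different data of the same size.
const indexCache = new WeakMap();

function cachedMerge(external, internal, keyField) {
    let byField = indexCache.get(internal);
    if (!byField) {
        byField = new Map();
        indexCache.set(internal, byField);
    }
    if (!byField.has(keyField)) {
        byField.set(keyField, new Map(internal.map(item => [item[keyField], item])));
    }
    
    // Caveat: the cached index goes stale if `internal` is mutated afterwards
    const index = byField.get(keyField);
    return external.map(item => {
        const match = index.get(item[keyField]);
        return match ? { ...item, ...match } : item;
    });
}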
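
The strategies above speed up exact lookups, but fuzzy matching still compares every external record with every internal one. One common way to cut that cost is to group records into blocks and compare only within a block. The sketch below is an illustration under an assumption, not a general rule: it uses the first word of the name as the block key, which fits product names that start with a brand or model family. The names `blockKey`, `groupByBlock`, and `candidatePairs` are hypothetical.

```javascript
// Hypothetical block key: the first word of the lowercased name
const blockKey = name => String(name ?? '').trim().toLowerCase().split(/\s+/)[0];

// Group records by block key
function groupByBlock(items, field) {
    const blocks = new Map();
    for (const item of items) {
        const key = blockKey(item[field]);
        if (!blocks.has(key)) blocks.set(key, []);
        blocks.get(key).push(item);
    }
    return blocks;
}

// For each external record, return only the internal records worth comparing
function candidatePairs(externalData, internalData, externalField, internalField) {
    const blocks = groupByBlock(internalData, internalField);
    return externalData.map(external => ({
        external,
        candidates: blocks.get(blockKey(external[externalField])) ?? []
    }));
}

// Illustrative data: only the Samsung record lands in the same block
const pairs = candidatePairs(
    [{ name: 'Samsung Galaxy S21' }],
    [{ title: 'samsung galaxy s21 5g' }, { title: 'iPhone 13' }],
    'name',
    'title'
);
console.log(pairs[0].candidates);
```

The trade-off is recall: records whose first word differs are never compared, so the block key should be chosen after looking at how names vary in the actual data.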

Recommendations for Choosing the Optimal Approach

When to use loops and basic methods:

  1. Small data volumes (up to a few thousand records) - simple loops with Maps are fast enough and easy to debug
  2. Well-defined matching criteria - when there are exact matches in fields
  3. Limited environment - when external libraries can’t be used

When to use specialized libraries:

  1. Large data volumes (from 10,000 records) - comparing every pair of records for fuzzy matching becomes expensive, so a library with optimized search helps
  2. Fuzzy matching - when flexible search criteria are needed
  3. Complex operations - when aggregation, grouping, and other complex operations are required

For your specific task with product data:

  1. Preliminary data analysis - determine which fields most reliably connect external and internal data
  2. Create composite keys - often a combination of name, category, and price provides unique correspondence
  3. Use fuzzy search - for cases where product names may differ slightly
  4. Step-by-step merging - start with exact matching, then apply fuzzy matching only to the records that remain unmatched
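
The steps above can be sketched as a two-pass merge. This is a minimal illustration under stated assumptions: `twoPassMerge`, the 0.8 threshold, and the sample records are made up for the example, similarity is a normalized Levenshtein distance, and the fuzzy pass is greedy (each internal record is used at most once, first come first served).

```javascript
const norm = value => String(value ?? '').toLowerCase().replace(/\s+/g, ' ').trim();

// Levenshtein distance computed with a rolling row
function editDistance(a, b) {
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

// Similarity in the range 0..1, where 1 means identical strings
function similarity(a, b) {
    const longest = Math.max(a.length, b.length);
    return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

function twoPassMerge(externalData, internalData, externalField, internalField, threshold = 0.8) {
    const remaining = new Map(internalData.map(item => [norm(item[internalField]), item]));
    const merged = new Array(externalData.length);
    const pending = [];

    // Pass 1: exact match on the normalized key
    externalData.forEach((external, i) => {
        const key = norm(external[externalField]);
        if (remaining.has(key)) {
            merged[i] = { ...external, ...remaining.get(key) };
            remaining.delete(key); // each internal record is used at most once
        } else {
            pending.push(i);
        }
    });

    // Pass 2: fuzzy match only the leftovers against unused internal records
    for (const i of pending) {
        const external = externalData[i];
        const key = norm(external[externalField]);
        let best = null;
        let bestScore = 0;
        for (const candidate of remaining.keys()) {
            const score = similarity(key, candidate);
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        if (best !== null && bestScore >= threshold) {
            merged[i] = { ...external, ...remaining.get(best) };
            remaining.delete(best);
        } else {
            merged[i] = external; // left unmatched for manual review
        }
    }
    return merged;
}

// Illustrative data: one exact match, one near match, one without details
const result = twoPassMerge(
    [
        { name: 'iPhone 13', price: 799 },
        { name: 'Samsung Galaxy S21', price: 699 },
        { name: 'MacBook Pro', price: 1999 }
    ],
    [
        { title: 'iPhone 13', storage: '128GB' },
        { title: 'Samsung Galaxy S-21', storage: '256GB' }
    ],
    'name',
    'title'
);
console.log(result);
```

Running the exact pass first keeps the expensive pairwise comparison limited to the records that really need it, and removing used internal records prevents two products from receiving the same details.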

Sources

  1. Data manipulation, cleaning, and processing in JavaScript - Learn JS Data
  2. Combining Data (Learn JS Data) / dakoop - Observable
  3. Data Blending: What It Is, Steps, Benefits & Best Practices - Matillion
  4. Stack Overflow: Joining two datasets using javascript
  5. Web Scraping with JavaScript and Node.js - ScrapingBee
  6. Web Scraping with Javascript (NodeJS) - ScrapingAnt
  7. A Quick Guide to CSS and jQuery Selectors for Web Scraper - Web Scraper
  8. Web Scraping with XPath and CSS Selectors - Crawlbase

Conclusion

Combining disparate data when web scraping without a common ID is a solvable task using various JavaScript approaches. For your situation with product data, the following approach is recommended:

  1. Start with exact matching based on available fields (name, category, price)
  2. Use composite keys to improve matching accuracy
  3. Apply fuzzy search to the records that remain unmatched
  4. Optimize performance using Maps and batch processing

For small data volumes (up to a few thousand records), basic loops with Maps are usually sufficient. For larger arrays, a specialized library such as fast-fuzzy helps with fuzzy search, and lodash keeps the merge code concise. Always analyze the data first and test different matching strategies on a sample before merging the full set.