How to combine disparate data when parsing a website without a common identifier in JavaScript?
Good day. I'm collecting data from a website for my application and have run into a problem combining information from different sources. It's convenient for me to use JavaScript for parsing the external listing pages, but working with the internal pages that contain specifications is more challenging.
Problem: I have two data arrays that need to be merged, but they are not connected through a common ID:
- Data from external pages (with assigned IDs):
[
  {
    "id": 1,
    "name": "Intel Core i3-12100F OEM",
    "img": "img",
    "price": "6099"
  },
  // ...
]
- Data from internal pages (with detailed information):
{
  "data": {
    "code": "5444958",
    "name": "Processor Intel Core i3-14100F OEM",
    "specs": "[LGA 1700, 4 x 3.5 GHz, L2 - 5 MB, L3 - 12 MB, 2 x DDR4, DDR5-4800 MHz, TDP 110 W]",
    "description": "4-core processor...",
    "price": 8299,
    "bonus": 0,
    "imageUrl": "img",
    "characteristics": {
      "General Parameters": [
        {
          "title": "Model",
          "value": "Intel Core i3-14100F"
        },
        // ...
      ]
    },
    // ...
  }
}
Questions:
- What approaches are available for matching records based on name or other fields?
- Should I use loops or libraries like pandas-js for merging data?
- How to efficiently combine disparate data in JavaScript when parsing a website?
Merging Disparate Data in Web Scraping Without Common Identifiers in JavaScript
When merging disparate data from web scraping without a common identifier in JavaScript, several approaches can be used, including fuzzy matching, synonym indexing, and weighted fields. The main methods include fuzzy search libraries like Fuse.js, as well as string comparison algorithms like Jaro-Winkler.
Table of Contents
- Basic Approaches to Data Merging
- Libraries for Fuzzy Search in JavaScript
- Data Matching Algorithms
- Practical Implementation of Data Merging
- Performance Optimization
- Code Example for Data Merging
- Recommendations and Best Practices
Basic Approaches to Data Merging
When there’s no common identifier for merging data from different sources, you can use the following strategies:
1. Name-based fuzzy matching
The primary approach is to use fuzzy search algorithms to match product names. This is the most effective method when names differ slightly across sources.
2. Multi-factor matching
Use a combination of fields to improve accuracy:
- Product name
- Price (with an acceptable range)
- Image (comparing hashes)
- Other unique characteristics
3. Synonym indexing
Create alternative indexes for each product, including possible typos and abbreviations.
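A synonym index can be sketched as a plain `Map` from normalized aliases to records. The helper name and the alias data below are invented for illustration; in practice the alias lists would come from your own domain knowledge (common abbreviations, known typos):

```javascript
// Hypothetical sketch: index each product under its normalized name plus a
// hand-maintained list of aliases/abbreviations (the alias data is invented).
function buildSynonymIndex(items, aliases = {}) {
  const index = new Map();
  for (const item of items) {
    const keys = [item.name, ...(aliases[item.name] || [])];
    for (const key of keys) {
      index.set(key.toLowerCase().replace(/\s+/g, ' ').trim(), item);
    }
  }
  return index;
}

const synonymIndex = buildSynonymIndex(
  [{ id: 1, name: 'Intel Core i3-12100F OEM' }],
  { 'Intel Core i3-12100F OEM': ['i3-12100F', 'i3 12100F OEM'] }
);

// A lookup by an abbreviated alias now resolves to the full record
const hit = synonymIndex.get('i3-12100f');
```

The trade-off is that this only handles variants you anticipated; fuzzy matching covers the rest.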
Important: According to research from DataScienceCentral, when working with fuzzy matching, two key aspects must be considered: how to weight proxy fields and how to measure false positive and false negative errors.
Libraries for Fuzzy Search in JavaScript
Fuse.js
Fuse.js is a lightweight JavaScript library for fuzzy search that provides various algorithms for string matching.
const Fuse = require('fuse.js');

const options = {
  keys: ['name'],
  threshold: 0.3,   // Match threshold: 0.0 requires a perfect match, 1.0 matches anything
  distance: 100,    // How far from the expected position a match may still occur
  includeScore: true
};

const fuse = new Fuse(externalData, options);

// Search for matches using the name from the internal record
const result = fuse.search(internalData.data.name);
FuzzySet.js
FuzzySet.js provides a data structure for performing full-text search with probable typo detection.
const FuzzySet = require('fuzzyset.js');

const fuzzySet = FuzzySet();

// Add names from the external data
externalData.forEach(item => {
  fuzzySet.add(item.name);
});

// Search for matches using the name from the internal record
const matches = fuzzySet.get(internalData.data.name);
Library Comparison
| Library | Features | Speed | Accuracy |
|---|---|---|---|
| Fuse.js | Flexible configuration, supports multi-field arrays | Medium | High |
| FuzzySet.js | Easy to use, good for typos | High | Medium |
| Custom Jaro-Winkler | Full control over algorithm | Low | High |
Data Matching Algorithms
Jaro-Winkler Algorithm
This algorithm is well-suited for comparing short strings like product names.
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1.0;
  const m1 = s1.length;
  const m2 = s2.length;
  if (m1 === 0 || m2 === 0) return 0.0;
  const matchDistance = Math.floor(Math.max(m1, m2) / 2) - 1;
  const s1Matches = new Array(m1).fill(false);
  const s2Matches = new Array(m2).fill(false);
  let matches = 0;
  let transpositions = 0;
  // Find matching characters within the allowed distance
  for (let i = 0; i < m1; i++) {
    const start = Math.max(0, i - matchDistance);
    const end = Math.min(i + matchDistance + 1, m2);
    for (let j = start; j < end; j++) {
      if (!s2Matches[j] && s1[i] === s2[j]) {
        s1Matches[i] = true;
        s2Matches[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0.0;
  // Count transpositions
  let k = 0;
  for (let i = 0; i < m1; i++) {
    if (s1Matches[i]) {
      while (!s2Matches[k]) k++;
      if (s1[i] !== s2[k]) transpositions++;
      k++;
    }
  }
  const jaro = (matches / m1 + matches / m2 + (matches - transpositions / 2) / matches) / 3;
  // Winkler modification: boost scores for strings sharing a common prefix (up to 4 characters)
  let prefixLength = 0;
  const maxPrefix = Math.min(4, m1, m2);
  while (prefixLength < maxPrefix && s1[prefixLength] === s2[prefixLength]) prefixLength++;
  return jaro + prefixLength * 0.1 * (1 - jaro);
}
Combining TF-IDF with Jaro-Winkler
As noted in a Stack Overflow discussion, you can replace exact token matching in TF-IDF with approximate matching scored by the Jaro-Winkler scheme.
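The idea can be sketched as follows: each token of one name is scored against the most similar token of the other name, instead of requiring exact token equality as plain TF-IDF does. Any string-similarity function can be plugged in; the trivial common-prefix measure below is only a self-contained stand-in, and in practice you would pass the jaroWinkler() function defined above:

```javascript
// Stand-in similarity: fraction of the longer string covered by the common prefix.
function prefixSimilarity(a, b) {
  const len = Math.min(a.length, b.length);
  let common = 0;
  while (common < len && a[common] === b[common]) common++;
  return common / Math.max(a.length, b.length);
}

// "Soft" token matching: average, over the tokens of nameA, the best
// similarity each token achieves against any token of nameB.
function softTokenScore(nameA, nameB, sim = prefixSimilarity) {
  const tokensA = nameA.toLowerCase().split(/\s+/);
  const tokensB = nameB.toLowerCase().split(/\s+/);
  let total = 0;
  for (const tokenA of tokensA) {
    total += Math.max(...tokensB.map(tokenB => sim(tokenA, tokenB)));
  }
  return total / tokensA.length; // 1.0 when every token has an exact twin
}
```

With this scheme, "Intel Core i3-12100F OEM" scores 1.0 against "Processor Intel Core i3-12100F OEM", since every token of the shorter name appears verbatim in the longer one; the extra "Processor" prefix no longer hurts the score.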
Field Weighting
To improve matching accuracy, you can use weighted fields:
function calculateMatchScore(external, internal) {
  let score = 0;
  let maxScore = 0;
  // Name (weight 40%)
  const nameScore = jaroWinkler(external.name, internal.name);
  score += nameScore * 0.4;
  maxScore += 0.4;
  // Price (weight 30%), scaled linearly within an acceptable range
  const priceDiff = Math.abs(parseFloat(external.price) - parseFloat(internal.price));
  const priceScore = priceDiff < 1000 ? 1 - priceDiff / 1000 : 0;
  score += priceScore * 0.3;
  maxScore += 0.3;
  // Image (weight 20%), exact URL match only
  const imageScore = external.img === internal.imageUrl ? 1 : 0;
  score += imageScore * 0.2;
  maxScore += 0.2;
  // Specifications (weight 10%); either side may lack this field
  const specsScore = jaroWinkler(external.specs || '', internal.specs || '');
  score += specsScore * 0.1;
  maxScore += 0.1;
  return score / maxScore;
}
Practical Implementation of Data Merging
Step 1: Data Preparation
First, normalize the data to improve matching quality:
function normalizeString(str) {
  return (str || '')
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '') // Remove punctuation (a plain \w would also strip Cyrillic letters)
    .replace(/\s+/g, ' ')             // Collapse whitespace
    .trim();
}

function prepareData(data) {
  return data.map(item => ({
    ...item,
    normalizedName: normalizeString(item.name),
    normalizedSpecs: normalizeString(item.specs || '')
  }));
}
Step 2: Creating an Index
Optimize search by creating an index of external data:
function createIndex(externalData) {
  const index = {};
  externalData.forEach(item => {
    const key = item.normalizedName.split(' ')[0]; // First word as the blocking key
    if (!index[key]) {
      index[key] = [];
    }
    index[key].push(item);
  });
  return index;
}
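First-word blocking is fragile when one source prefixes names (e.g. "Processor Intel …" vs "Intel …"): such pairs land in different buckets and are never compared. A sketch of a more forgiving alternative is to index every token, so a record becomes a candidate whenever any token overlaps. The function names here are illustrative, and items are assumed to already carry a normalizedName field:

```javascript
// Index each record under every token of its normalized name.
function createTokenIndex(items) {
  const index = new Map();
  for (const item of items) {
    for (const token of item.normalizedName.split(' ')) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(item);
    }
  }
  return index;
}

// Collect every record sharing at least one token with the query name.
function candidatesFor(index, normalizedName) {
  const found = new Set();
  for (const token of normalizedName.split(' ')) {
    for (const item of index.get(token) || []) found.add(item);
  }
  return [...found];
}
```

This yields more candidates per query than first-word blocking, so it trades a little speed for recall; the scoring step then filters out the false candidates.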
Step 3: Main Merging Algorithm
function mergeDatasets(externalData, internalData, threshold = 0.7) {
  // Internal records are assumed to be flattened (the contents of each "data" object)
  const preparedExternal = prepareData(externalData);
  const preparedInternal = prepareData(internalData);
  const index = createIndex(preparedExternal);
  const result = [];
  preparedInternal.forEach(internal => {
    const firstWord = internal.normalizedName.split(' ')[0];
    const candidates = index[firstWord] || [];
    let bestMatch = null;
    let bestScore = 0;
    candidates.forEach(external => {
      const score = calculateMatchScore(external, internal);
      if (score > bestScore) {
        bestScore = score;
        bestMatch = external;
      }
    });
    if (bestScore >= threshold) {
      result.push({
        ...bestMatch,
        ...internal,
        confidence: bestScore,
        matchedFields: getMatchedFields(bestMatch, internal)
      });
    } else {
      // No sufficiently confident match found
      result.push({
        id: null,
        confidence: 0,
        ...internal,
        matchedFields: []
      });
    }
  });
  return result;
}
Step 4: Processing Results
function getMatchedFields(external, internal) {
  const matchedFields = [];
  if (jaroWinkler(external.name, internal.name) > 0.8) {
    matchedFields.push('name');
  }
  if (Math.abs(parseFloat(external.price) - parseFloat(internal.price)) < 100) {
    matchedFields.push('price');
  }
  if (external.img === internal.imageUrl) {
    matchedFields.push('image');
  }
  return matchedFields;
}
Performance Optimization
Caching Results
// Simple djb2 hash over the serialized data, used to build cache keys
function hashData(data) {
  const str = JSON.stringify(data);
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) + hash + str.charCodeAt(i)) | 0;
  }
  return hash;
}

const mergeCache = new Map();

function cachedMerge(externalData, internalData) {
  const cacheKey = `${hashData(externalData)}:${hashData(internalData)}`;
  if (mergeCache.has(cacheKey)) {
    return mergeCache.get(cacheKey);
  }
  const result = mergeDatasets(externalData, internalData);
  mergeCache.set(cacheKey, result);
  return result;
}
Parallel Processing
const { Worker } = require('worker_threads');

function parallelMerge(externalData, internalData, chunkSize = 100) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    for (let i = 0; i < internalData.length; i += chunkSize) {
      chunks.push(internalData.slice(i, i + chunkSize));
    }
    const results = [];
    let completed = 0;
    chunks.forEach((chunk, index) => {
      const worker = new Worker('./mergeWorker.js', {
        workerData: {
          externalData,
          internalData: chunk
        }
      });
      worker.on('message', (result) => {
        results[index] = result;
        completed++;
        if (completed === chunks.length) {
          resolve(results.flat());
        }
      });
      worker.on('error', reject);
    });
  });
}
Code Example for Data Merging
// Main data merging script
const Fuse = require('fuse.js');

async function mergeProductData() {
  // Data from external pages
  const externalData = [
    {
      "id": 1,
      "name": "Intel Core i3-12100F OEM",
      "img": "img1",
      "price": "6099"
    },
    {
      "id": 2,
      "name": "AMD Ryzen 5 5600G",
      "img": "img2",
      "price": "8299"
    }
  ];

  // Data from internal pages
  const internalData = [
    {
      "data": {
        "code": "5444958",
        "name": "Processor Intel Core i3-14100F OEM",
        "specs": "[LGA 1700, 4 x 3.5 GHz, L2 - 5 MB, L3 - 12 MB, 2 x DDR4, DDR5-4800 MHz, TDP 110W]",
        "description": "4-core processor...",
        "price": 8299,
        "bonus": 0,
        "imageUrl": "img1",
        "characteristics": {
          "General Parameters": [
            {
              "title": "Model",
              "value": "Intel Core i3-14100F"
            }
          ]
        }
      }
    }
  ];

  // Fuse.js configuration
  const fuseOptions = {
    keys: [
      { name: 'name', weight: 0.4 },
      { name: 'price', weight: 0.3 },
      { name: 'img', weight: 0.2 },
      { name: 'specs', weight: 0.1 }
    ],
    threshold: 0.4,
    distance: 100,
    includeScore: true,
    minMatchCharLength: 3
  };

  // Create a Fuse instance over the external data
  const fuse = new Fuse(externalData, fuseOptions);

  // Merge data
  const mergedData = internalData.map(internal => {
    const searchResult = fuse.search(internal.data.name);
    if (searchResult.length > 0) {
      const bestMatch = searchResult[0];
      return {
        ...bestMatch.item,
        ...internal.data,
        // Fuse.js scores range from 0 (perfect match) to 1, so invert for confidence
        confidence: 1 - bestMatch.score,
        matchedFields: Object.keys(bestMatch.item).filter(key =>
          bestMatch.item[key] === internal.data[key]
        )
      };
    }
    return {
      id: null,
      confidence: 0,
      ...internal.data,
      matchedFields: []
    };
  });

  console.log('Merged data:', mergedData);
  return mergedData;
}

// Execution
mergeProductData().catch(console.error);
Recommendations and Best Practices
1. Choosing the Right Algorithm
- For exact names: Fuse.js with low threshold
- For typos: Jaro-Winkler
- For large datasets: Indexing + caching
2. Configuring Weight Coefficients
Adjust weights based on the importance of each field:
const weights = {
  name: 0.5,        // Name is most important
  price: 0.2,       // Price is secondary
  image: 0.15,      // Image for additional verification
  specs: 0.1,       // Specifications for a final check
  description: 0.05 // Description is rarely decisive
};
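A generic scorer driven by such a weight table can be sketched as follows. Each field gets its own similarity function; fields without one are simply skipped, so weights and similarity functions can evolve independently. The function names and the simplified per-field similarities below are illustrative stand-ins, not a fixed API:

```javascript
// Weighted average of per-field similarities, normalized by the weights actually used.
function weightedScore(external, internal, weights, sims) {
  let score = 0;
  let totalWeight = 0;
  for (const [field, weight] of Object.entries(weights)) {
    const sim = sims[field];
    if (!sim) continue; // No similarity function defined for this field
    score += sim(external, internal) * weight;
    totalWeight += weight;
  }
  return totalWeight > 0 ? score / totalWeight : 0;
}

// Simplified stand-in similarity functions for three of the fields
const sims = {
  name: (e, i) => (e.name === i.name ? 1 : 0),
  price: (e, i) => {
    const diff = Math.abs(parseFloat(e.price) - parseFloat(i.price));
    return diff < 1000 ? 1 - diff / 1000 : 0;
  },
  image: (e, i) => (e.img === i.imageUrl ? 1 : 0)
};

const score = weightedScore(
  { name: 'Intel Core i3-12100F OEM', price: '6099', img: 'img1' },
  { name: 'Intel Core i3-12100F OEM', price: 6149, imageUrl: 'img1' },
  { name: 0.5, price: 0.2, image: 0.15, specs: 0.1, description: 0.05 },
  sims
);
```

Normalizing by the weights actually used keeps the score comparable even when some fields (here specs and description) have no similarity function yet.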
3. Error Handling
function safeMerge(externalData, internalData) {
  try {
    return mergeDatasets(externalData, internalData);
  } catch (error) {
    console.error('Error merging data:', error);
    return [];
  }
}
4. Result Validation
function validateMergedData(mergedData) {
  return mergedData.filter(item => {
    // Require the key fields to be present
    const hasRequiredFields = item.id && item.name && item.price;
    // Require sufficient matching confidence
    const hasGoodConfidence = item.confidence > 0.6;
    return hasRequiredFields && hasGoodConfidence;
  });
}
5. Integration with Scraping
For effective integration with the scraping process, you can use the following approach:
class DataMerger {
  constructor(options = {}) {
    this.options = {
      threshold: 0.7,
      weights: {
        name: 0.4,
        price: 0.3,
        image: 0.2,
        specs: 0.1
      },
      ...options
    };
    this.externalData = [];
    this.fuse = null;
  }

  addExternalData(data) {
    this.externalData = data;
    this.updateIndex();
  }

  updateIndex() {
    this.fuse = new Fuse(this.externalData, {
      keys: Object.keys(this.options.weights).map(key => ({
        name: key,
        weight: this.options.weights[key]
      })),
      threshold: this.options.threshold,
      includeScore: true
    });
  }

  mergeInternalData(internalData) {
    // Internal records are assumed flattened, with a top-level "name" field
    return internalData.map(internal => {
      const result = this.fuse.search(internal.name)[0];
      // Fuse.js scores are distances: 0 is a perfect match, 1 is no match
      const confidence = result ? 1 - result.score : 0;
      if (result && confidence >= this.options.threshold) {
        return {
          ...result.item,
          ...internal,
          confidence
        };
      }
      return {
        id: null,
        confidence: 0,
        ...internal
      };
    });
  }
}

// Usage
const merger = new DataMerger({
  threshold: 0.6,
  weights: {
    name: 0.5,
    price: 0.3,
    image: 0.2
  }
});

merger.addExternalData(externalData);
const merged = merger.mergeInternalData(internalData);
Sources
- [Fuzzy Merge - Guides](https://povertyaction.github.io/guides/cleaning/04 Data Aggregation/02 Fuzzy Merge/) - guide to fuzzy data merging
- A Comprehensive Guide to Matching Web-Scraped Data (Crawlbase) - methods for matching web-scraped data
- Detailed Guide to Data Matching - overview of record-matching techniques
- JavaScript fuzzy search that makes sense (Stack Overflow) - discussion of fuzzy search algorithms
- Fuzzy Bootstrap Matching (DataScienceCentral.com) - techniques for merging data files without key fields
- Fast, accurate and multilingual fuzzy search library for the frontend (Reddit) - library recommendations for fuzzy search
- How to Implement Fuzzy Search in JavaScript (Codementor) - practical implementation of fuzzy search
- Fuse.js - official documentation for the lightweight fuzzy-search library
- Fuzzy Search in JavaScript (GeeksforGeeks) - overview of fuzzy search in JavaScript
- fuzzyset.js - a fuzzy string set for JavaScript (GitHub Pages) - FuzzySet.js implementation
Conclusion
To effectively merge disparate web-scraped data without a common identifier in JavaScript, it's recommended to:
- Use specialized libraries like Fuse.js or FuzzySet.js for fuzzy string matching
- Implement a multi-factor evaluation system considering name, price, image, and specifications
- Create indexes to optimize performance when working with large data volumes
- Configure thresholds and weight coefficients based on the specific task
- Implement caching mechanisms and parallel processing to improve performance
The main challenge lies in balancing accuracy and performance. It’s advisable to start with simple algorithms and gradually increase complexity as needed. It’s also important to implement validation and error handling mechanisms to ensure system reliability.