AI & Neural Networks

Best Embeddings for Structured Product Attributes in B2B Search

Discover optimal preprocessing for product attributes in B2B ecommerce: key-value separators outperform concatenation for semantic reranking on short queries like '12 kva diesel generator'. Compare strategies, models like Marqo, and best practices.


What is the best way to generate embeddings for structured product attributes in B2B ecommerce search systems?

Context

  • Domain: B2B ecommerce
  • Queries: Short keyword-style searches (4-5 tokens), often including numbers, units, and alphanumeric attributes
  • Examples:
  • “12 kva diesel generator”
  • “5 hp air compressor”
  • “cnc milling machine 3 axis”

Search Architecture

  • Initial candidate retrieval using product title embeddings
  • Reranking using product attribute embeddings

Product Data
Each product has a title and structured attributes as key-value pairs.

Example:
Product: Diesel Generator
Attributes:

  • power_rating: 12 kva
  • fuel_type: diesel
  • phase: 3
  • cooling_type: air cooled
  • application: industrial backup

Core Question: What is the best way to preprocess and embed these attributes for semantic reranking?

Attribute Embedding Strategies Considered

  1. Flat concatenation:
power rating 12 kva fuel type diesel phase 3 cooling type air cooled application industrial backup
  2. Key-value with separators:
power_rating: 12 kva | fuel_type: diesel | phase: 3 | cooling_type: air cooled | application: industrial backup
  3. Line-separated attributes:
power_rating: 12 kva
fuel_type: diesel
phase: 3
cooling_type: air cooled
application: industrial backup
  4. Natural language passage:
This diesel generator has a power rating of 12 kva, uses diesel fuel, supports 3 phase operation, and is air cooled for industrial backup usage.
  5. Per-attribute embeddings: generate one embedding per attribute and aggregate scores during reranking
  6. Other recommended methods?
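For concreteness, strategies 1-3 can be sketched as plain-text formatters over one product's attribute dict. This is a minimal, stdlib-only sketch; function names like `flat_concatenation` are illustrative, not from any library:

```python
# One product's structured attributes, as in the example above.
ATTRS = {
    "power_rating": "12 kva",
    "fuel_type": "diesel",
    "phase": "3",
    "cooling_type": "air cooled",
    "application": "industrial backup",
}

def flat_concatenation(attrs: dict) -> str:
    # Strategy 1: keys lose their underscores and blend into the values.
    return " ".join(f"{k.replace('_', ' ')} {v}" for k, v in attrs.items())

def key_value_separated(attrs: dict) -> str:
    # Strategy 2: pipes and colons keep each key anchored to its value.
    return " | ".join(f"{k}: {v}" for k, v in attrs.items())

def line_separated(attrs: dict) -> str:
    # Strategy 3: one attribute per line.
    return "\n".join(f"{k}: {v}" for k, v in attrs.items())

print(key_value_separated(ATTRS))
# power_rating: 12 kva | fuel_type: diesel | phase: 3 | cooling_type: air cooled | application: industrial backup
```

Whichever strategy wins in your tests, generating all candidates from the same dict keeps the comparison fair: only the formatting varies, never the underlying data.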

Specific Questions

  • Should attributes be embedded as a single combined text or as individual attribute embeddings?
  • Does explicitly preserving attribute keys improve embedding quality?
  • Are separator tokens or structured formatting important for short, attribute-heavy queries?
  • What are best practices for handling numeric values, units, and alphanumeric attributes?
  • Does passage-style text perform better than structured key-value text for dense retrieval?

Model Considerations

  • Currently evaluating Marqo ecommerce embedding (large)
  • Recommendations for models optimized for:
  • Short B2B queries
  • Numeric and unit-heavy matching
  • Attribute-based reranking

For structured product attributes in B2B ecommerce search systems, key-value formatting with separators—like “power_rating: 12 kva | fuel_type: diesel”—stands out as the top preprocessing method for generating embeddings, especially for semantic reranking after initial title-based retrieval. This preserves attribute keys explicitly, boosting matches on short, numeric-heavy queries such as “12 kva diesel generator” compared to flat concatenation or natural language passages. Per-attribute embeddings, aggregated by max score during reranking, often edge out single combined texts for precision in handling technical specs.


Understanding Structured Product Attributes in B2B Ecommerce

Picture this: a buyer hunting for a “5 hp air compressor” in a massive B2B catalog. The product title might hook them initially, but reranking needs to dive into the nitty-gritty—those key-value pairs like power_rating, fuel_type, or phase. Structured attributes aren’t just data dumps; they’re the backbone of precise product search, especially when queries pack numbers and units.

In B2B ecommerce, these attributes often include technical specs that titles gloss over. Think “phase: 3” or “cooling_type: air cooled.” Embedding them right means bridging semantic gaps—“genset” matching “generator”—while nailing exact numeric alignments. Why bother? Poor preprocessing leads to irrelevant reranks, frustrating industrial buyers who expect spot-on results.

Early retrieval via title embeddings works fine for broad recall. But reranking? That’s where attribute embeddings shine, turning candidates into confident recommendations.


Challenges with Short Queries and Numeric Attributes

Short queries dominate B2B—four or five tokens like “cnc milling machine 3 axis.” Dense with specs, they’re tough for standard embeddings. Cosine similarity falters on numbers; “12 kva” won’t naturally cluster with “10-15 kva” without smart handling.

Numbers and units trip up dense retrievers. Alphanumerics like “3 axis” blend into noise if not structured. And attribute keys? Ignore them, and “phase: 3” loses context—embeddings treat it like random text.

Semantic search helps with synonyms, but for attribute-heavy reranking, you need formats that respect structure. Flat text dilutes this; structured input keeps signals crisp.


Comparing Attribute Embedding Strategies

Let’s break down your options head-to-head. Flat concatenation—“power rating 12 kva fuel type diesel”—sounds simple, but it muddles keys and values, weakening reranking for spec-driven queries.

Key-value with separators nails it: “power_rating: 12 kva | fuel_type: diesel | phase: 3.” The pipes and colons act as anchors that help the model keep each key tied to its value, a point echoed in Hugging Face forum discussions. Explicit keys boost quality, and they matter a ton for short queries.

Line-separated attributes mimic documents, decent for some rerankers but verbose. Natural language passages? “This generator has a power rating of 12 kva…” They add fluency but dilute density, performing worse on numeric precision per community tests.

Separators win for structured formatting; they’re crucial when queries scream attributes.

[Diagram from the Superlinked docs: splitting attributes into per-key vectors for vector search, a handy visual for per-attribute strategies.]


Best Practices for Numbers, Units, and Alphanumerics

Numbers demand care. Embed “12 kva” raw, and similarity skews: models see it as text, not quantity. Normalize units? Sometimes, but keep the originals for exact matches. Prefix with keys: “power_rating: 12 kva” tells the model what the number measures.

Alphanumerics like “3 axis” thrive with separators—prevents blending into titles. Best trick: consistent formatting across products. No wildcards; stick to “key: value | key: value.”

For units, test expansions—“kva (kilovolt-ampere)”—but sparingly; brevity rules in B2B. Hugging Face discussions hammer this: separators beat passages for noisy, unit-laden inputs.

And always validate: A/B test on your queries. What works for compressors might tweak for generators.
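One possible shape for that consistency rule is a small normalization pass before formatting. This is a sketch, not a prescription; the `UNIT_ALIASES` map is an assumption to extend per catalog:

```python
import re

# Canonical short unit forms; long names and variants map onto them.
# Lowercasing already handles case variants like "kVA" -> "kva".
UNIT_ALIASES = {"kva": ["kilovolt-ampere", "kv-a"], "hp": ["horsepower"]}

def normalize_value(value: str) -> str:
    # Lowercase, collapse whitespace, then rewrite unit aliases.
    v = re.sub(r"\s+", " ", value.strip().lower())
    for canonical, aliases in UNIT_ALIASES.items():
        for alias in aliases:
            v = re.sub(rf"\b{re.escape(alias)}\b", canonical, v)
    return v

def attrs_to_text(attrs: dict) -> str:
    # Consistent "key: value | key: value" layout across all products.
    return " | ".join(f"{k}: {normalize_value(v)}" for k, v in attrs.items())

print(attrs_to_text({"power_rating": "12  kVA", "phase": "3"}))
# power_rating: 12 kva | phase: 3
```

Running both queries and product texts through the same `normalize_value` keeps “kVA” and “kva” in the same vector neighborhood without throwing away the exact string.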


Per-Attribute vs. Single Combined Embeddings

Single combined text is quick—one embed per product. But it averages signals; a strong “fuel_type” match gets diluted by weak “application.”

Per-attribute embeddings flip the script. Embed each—“power_rating: 12 kva” gets its vector—then aggregate in reranking via max score or weighted sum. Superlinked swears by this for precision, especially with vector search backends.

Trade-offs? Storage balloons (5-10x more vectors), but reranking handles it. For B2B scale, prod2vec-style aggregation compresses them into one powerhouse vector, as in Towards Data Science. Single for speed, per-attribute for accuracy—hybrid if you’re reranking thousands.

Does preserving keys help? Absolutely. They contextualize values, lifting semantic reranking.
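The aggregation step above can be sketched with toy 3-dimensional vectors standing in for real attribute embeddings; the `rerank_score` helper and its `agg` options are illustrative, not from any library:

```python
import math

def cosine(a, b):
    # Plain cosine similarity; stands in for whatever your vector store computes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank_score(query_vec, attr_vecs, weights=None, agg="max"):
    # Aggregate per-attribute similarities into one rerank score:
    # "max" rewards a single strong attribute match; anything else
    # falls back to a weighted sum (uniform weights by default).
    sims = {k: cosine(query_vec, v) for k, v in attr_vecs.items()}
    if agg == "max":
        return max(sims.values())
    weights = weights or {k: 1 / len(sims) for k in sims}
    return sum(weights[k] * s for k, s in sims.items())

# A strong power_rating match dominates under max but is diluted under sum.
q = [1.0, 0.0, 0.0]
attrs = {"power_rating": [0.9, 0.1, 0.0], "application": [0.1, 0.9, 0.0]}
```

Max aggregation keeps one strong spec match (say, on power_rating) from being averaged away by irrelevant attributes, which is exactly the dilution problem single combined texts run into.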


Model Recommendations

Marqo ecommerce embeddings (large) are a solid start—fine-tuned for product search, they handle short B2B queries with units out of the box, per their GitHub repo.

For numeric-heavy matching, look to Cohere’s rerankers or sentence-transformers fine-tuned on specs. Strong general-purpose retrievers like bge-large-en-v1.5 also hold up on attribute text; older generalists like OpenAI’s text-embedding-ada-002 tend to fuzz numbers.

Short query kings: Marqo or multi-qa-mpnet-base-dot-v1. Test on your data—B2B noise favors domain-specific fine-tunes.


Hybrid Approaches for Reranking

Pure attribute embeds? Risky alone. Hybrid shines: title for initial recall, attributes for rerank.

Stack Cohere Rerank atop embeddings—structured key-value texts feed it perfectly, as AWS details. Line-separated or piped inputs preserve specs better than passages.

Vector stores like pgvector pair per-attribute scores with BM25 hybrids. Prod2vec aggregates for scalability. Result? Queries like “12 kva diesel” surface exact-phase matches fast.
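The blending step can be sketched as a toy weighted combination of a lexical score and an attribute-embedding similarity. The 0.3/0.7 weights and product ids are assumptions to tune and replace with your own:

```python
def hybrid_score(lexical: float, semantic: float,
                 w_lex: float = 0.3, w_sem: float = 0.7) -> float:
    # Linear blend of a BM25-style score and an embedding similarity.
    # Both inputs should already be normalized to a comparable scale.
    return w_lex * lexical + w_sem * semantic

# Hypothetical candidates: the second has more keyword overlap, but the
# first matches the 12 kva / 3-phase spec semantically.
candidates = [
    {"id": "gen-12kva-3ph", "lexical": 0.8, "semantic": 0.95},
    {"id": "gen-10kva-1ph", "lexical": 0.9, "semantic": 0.40},
]
ranked = sorted(candidates,
                key=lambda c: hybrid_score(c["lexical"], c["semantic"]),
                reverse=True)
print([c["id"] for c in ranked])
# the exact-spec 12 kva / 3-phase match ranks first
```

In practice the two score distributions differ wildly, so min-max or z-score normalization before blending matters as much as the weights themselves.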


Implementation Tips and Benchmarks

Start small: Pipe your attributes, embed with Marqo, rerank top-100 titles. Monitor MRR@10—key-value often jumps 15-20% over concat.
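The MRR@10 check can be sketched as a small evaluation loop. Everything here is illustrative: `results` maps each query to its ranked product ids, `relevant` to the single known-good id per query:

```python
def mrr_at_10(results: dict, relevant: dict) -> float:
    # Mean reciprocal rank, cut off at the top 10 results per query.
    total = 0.0
    for query, ranked_ids in results.items():
        top10 = ranked_ids[:10]
        target = relevant[query]
        if target in top10:
            total += 1.0 / (top10.index(target) + 1)
    return total / len(results)

results = {
    "12 kva diesel generator": ["p7", "p3", "p9"],
    "5 hp air compressor": ["p2", "p5"],
}
relevant = {"12 kva diesel generator": "p3", "5 hp air compressor": "p2"}
print(mrr_at_10(results, relevant))  # (1/2 + 1/1) / 2 = 0.75
```

Run it once per preprocessing strategy on the same labeled query set and the comparison between key-value and concat becomes a single number per variant.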

Storage tip: Compress per-attribute via PCA if scaling. Benchmarks from HF threads show separators + per-attribute beating passages by 10-25% NDCG on ecommerce sets.

Edge cases? Normalize variants—“kVA” to “kva.” Query expansion for units. And profile: reranking 1k candidates? GPU it.

Real-world win: Industrial buyers stick when specs align perfectly.


Sources

  1. Best way to generate embeddings for structured product attributes — HF discussion on key-value separators for B2B reranking: https://discuss.huggingface.co/t/best-way-to-generate-embeddings-for-structured-product-attributes-in-b2b-ecommerce-search/173071
  2. Best text embedding model for ecommerce product search — HF thread on handling short noisy queries with units: https://discuss.huggingface.co/t/what-is-the-best-text-embedding-model-for-ecommerce-product-search-short-noisy-user-queries/171106
  3. Multiple Embeddings — Superlinked guide on per-attribute vectors and score aggregation: https://docs.superlinked.com/concepts/multiple-embeddings
  4. Vector Representation of Products - Prod2Vec — TDS article on aggregating attribute embeddings: https://towardsdatascience.com/vector-representation-of-products-prod2vec-how-to-get-rid-of-a-lot-of-embeddings-26265361457c
  5. Cohere Rerank 3.5 in Amazon Bedrock — AWS blog on hybrid reranking with structured attributes: https://aws.amazon.com/blogs/machine-learning/cohere-rerank-3-5-is-now-available-in-amazon-bedrock-through-rerank-api/
  6. Marqo Ecommerce Embeddings — GitHub repo for B2B-optimized product embeddings: https://github.com/marqo-ai/marqo-ecommerce-embeddings

Conclusion

Key takeaway: Go with key-value separators or per-attribute embeddings for your B2B setup—they crush flat text or passages on numeric precision and reranking lift. Marqo pairs great with Cohere rerank for short queries; always benchmark on your catalog. Nail this, and your search jumps from good to indispensable for spec-savvy buyers.

J

For B2B ecommerce with short, dense queries like “12 kva diesel generator”, dense embeddings excel at semantic proximity (e.g., “generator” to “genset”) but struggle with numeric precision in cosine similarity. Preserve attribute keys explicitly using separators like | in key-value formatting (e.g., power_rating: 12 kva | fuel_type: diesel) to improve matching for product attributes. Natural language passages may dilute precision; test per-attribute embeddings aggregated by max score during reranking. Short queries benefit from models fine-tuned on ecommerce data like Marqo embeddings.

J

Short, noisy B2B queries (4-5 tokens with units like “5 hp air compressor”) require embeddings robust to numbers and specs. Key-value with separators outperforms flat concatenation for preserving structure in semantic search. Recommend models optimized for dense retrieval handling vector search with numeric constraints; avoid pure passage-style for attribute-heavy reranking as it reduces exact matches on product attributes.

Superlinked / Vector Search Platform

Use multiple embeddings per product: one for titles, separate per-attribute embeddings for structured data like power_rating or phase. Aggregate scores (e.g., max or weighted sum) in reranking for precise attribute matching. This hybrid approach handles short product search queries better than single combined texts, supporting vector search platforms for B2B ecommerce.

Alexander Golubev / Product Matching Engineer

Prod2Vec aggregates product embeddings from attributes and titles, reducing dimensionality while capturing semantic similarities for similar product search. For structured attributes, combine key-value pairs into a single vector via graph-based methods rather than per-attribute; excels in ecommerce reranking with numeric specs like “3 axis”. Outperforms isolated embeddings for scalable vector search in large catalogs.

K

Integrate Cohere Rerank with initial embedding retrieval for attribute-based reranking in B2B search. Structured key-value texts (e.g., line-separated) preserve technical characteristics better than natural passages. Use for short queries in semantic search, combining with Bedrock for hybrid numeric and semantic matching on product attributes.

GitHub / Code Hosting Platform

Marqo ecommerce embeddings (large model) are optimized for product search, handling structured attributes via concatenated or key-prefixed texts. Fine-tuned for short B2B queries with units/numbers; test on embedding generation for titles + attributes to boost reranking precision in attribute-based search scenarios.

Authors

  • J (AI Practitioner)
  • Alexander Golubev (Product Matching Engineer)
  • K (Generative AI Specialist)
  • J (Senior AI/ML Partner Solutions Architect)