Extract Book Titles, Images & Prices with lxml XPath

Extract book titles, image URLs and prices from books.toscrape.com using lxml XPath in Python. Includes sample code, urljoin for images, and error handling.

How can I extract book titles, image URLs (src) and prices from the <article> elements returned by lxml? I'm parsing https://books.toscrape.com/catalogue/page-1.html and I use:

all_product_information = tree.xpath('//article[@class="product_pod"]')

This returns a list of Element objects (e.g. <Element article at 0x7008b358d240>). When I iterate over all_product_information or access items by index I still get Element objects, not the contained text or attributes. What is the correct way to extract, for each element, the book title, image URL and price (for example using element.xpath('.//h3/a/@title'), element.xpath('.//img/@src') and element.xpath('.//p[@class="price_color"]/text()')), or should I use lxml Element methods instead?

Your XPath query all_product_information = tree.xpath('//article[@class="product_pod"]') returns a list of lxml Element objects, which you then iterate over to run relative XPath queries: product.xpath('.//h3/a/@title')[0] for titles, product.xpath('.//img/@src')[0] for image src, and product.xpath('.//p[@class="price_color"]/text()')[0] for prices. That approach is spot-on for lxml XPath extraction in Python; just remember XPath always returns a list, so grab the first item with [0] (and check the list isn't empty first to dodge IndexError). It makes short work of a site like books.toscrape.com, pulling structured data fast without the overhead of heavier libraries.
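A minimal sketch of that loop (assuming requests and lxml are installed, and that the page layout still matches books.toscrape.com):

python
import requests
from lxml import html

response = requests.get('https://books.toscrape.com/catalogue/page-1.html')
tree = html.fromstring(response.content)

for product in tree.xpath('//article[@class="product_pod"]'):
    titles = product.xpath('.//h3/a/@title')  # XPath always returns a list
    if titles:
        print(titles[0])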


Why lxml XPath for Web Scraping Python

Ever tried scraping a page and watched your script crawl to a halt? lxml XPath changes that. Built on libxml2, lxml parses massive HTML docs in milliseconds, way snappier than the alternatives for parsing HTML in Python. Picture this: you're hitting books.toscrape.com, a perfect scraping playground with consistent <article class="product_pod"> blocks holding titles, images, and prices.

What makes it shine? XPath lets you pinpoint elements surgically, like .//h3/a/@title for book names. No fumbling through DOM trees. And those Element objects you mentioned? They’re gold—they let you chain queries from any starting point. According to the GeeksforGeeks guide on lxml web scraping, this workflow fetches pages with requests, parses via html.fromstring(), and queries away. But heads up: XPath returns lists every time. Empty list? No match. Single item? Still a list with one entry. Hack it with [0].

Why not stick to Element methods like .get()? XPath’s more powerful for nested hunts. Your instinct on relative paths (starting with .) is dead right—keeps queries scoped to each product pod.
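For a quick contrast (a sketch, where product is one Element from the query above): Element methods read attributes of the current node only, while relative XPath reaches into its descendants.

python
# Element method: reads an attribute on this node only
print(product.get('class'))             # 'product_pod'

# Relative XPath: digs into descendants of this node
print(product.xpath('.//h3/a/@title'))  # e.g. ['A Light in the Attic']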


Setting Up Your lxml Parser

First things first: snag the page. Fire up requests:

python
import requests
from lxml import html

response = requests.get('https://books.toscrape.com/catalogue/page-1.html')
tree = html.fromstring(response.content)

Boom—tree is your parsed DOM. lxml handles malformed HTML like a champ, auto-fixing quirks that trip up stricter parsers. (Pro tip: toss in response.encoding = 'utf-8' if accents glitch.)

Now your line: all_product_information = tree.xpath('//article[@class="product_pod"]'). Spot on. This grabs every matching <article class="product_pod"> element. You'll get something like [<Element article at 0x...>, <Element article at 0x...>]. Not text, not attrs, just raw Elements. That's normal. Iterate like:

python
for product in all_product_information:
    # Extract here; next sections dive in
    pass

Stuck on Elements? You’re not alone. Stack Overflow threads nail it: these are containers; drill down with .xpath() or .get() on specific bits. But XPath rules for precision.


Finding Product Pods on Books.toscrape

Books.toscrape mimics real e-commerce: 20 products per page, each in a tidy <article class="product_pod">. Your XPath nails it: //article[@class="product_pod"] skips wrappers and lasers straight to the goods.

Inspect the HTML quickly. Each pod nests an <h3> containing an <a> whose title attribute holds the full book name, a thumbnail <img> with a relative src, and a <p class="price_color"> with the price text.

Relative XPath from the product Element keeps it clean: .// searches inside that pod only. Absolute paths? Messy for loops. Test in browser dev tools first; XPath tester extensions save headaches.

What if pods vary? Add predicates: //article[@class="product_pod" and .//img]. Robust.
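As a sketch, combining predicates keeps only pods that carry everything you plan to extract (worth verifying against the live page first):

python
# Keep only pods that contain both a thumbnail and a price
products = tree.xpath(
    '//article[@class="product_pod" and .//img and .//p[@class="price_color"]]'
)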


Extracting Book Titles

Titles hide in @title attrs on the <a> tags inside each <h3>. From a product Element:

python
title = product.xpath('.//h3/a/@title')[0]

Why [0]? XPath @title returns ['A Light in the Attic']—a list. Pop the first (it’s always one per pod). Empty? IndexError. Guard it:

python
titles = product.xpath('.//h3/a/@title')
if titles:
    title = titles[0]
else:
    title = 'N/A'

Fancy one-liner? title = product.xpath('.//h3/a/@title')[0] if product.xpath('.//h3/a/@title') else None (note it runs the query twice). Or next(iter(product.xpath('.//h3/a/@title')), 'N/A'): elegant, single query, no manual indexing.
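If you use that pattern a lot, a tiny helper keeps loops readable (a hypothetical convenience function, not part of lxml):

python
def first(results, default=None):
    # Return the first XPath match, or the default when the list is empty
    return results[0] if results else default

title = first(product.xpath('.//h3/a/@title'), 'N/A')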

The Stack Overflow answer on img src extraction echoes this for books.toscrape: same pattern, same gotcha. Titles are reliable here, but real sites? Always validate.


Grabbing Image URLs and Prices

Images next: img_src = product.xpath('.//img/@src')[0]. Relative src like '../../media/cache/34bd/...jpg'. Fix with from urllib.parse import urljoin; full_img = urljoin(response.url, img_src).
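Here urljoin resolves the relative path against the page URL (the src value below is a made-up example; real cache paths on the site differ):

python
from urllib.parse import urljoin

page_url = 'https://books.toscrape.com/catalogue/page-1.html'
img_src = '../media/cache/fe/72/example.jpg'  # hypothetical relative src
print(urljoin(page_url, img_src))
# https://books.toscrape.com/media/cache/fe/72/example.jpg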

Prices: price = product.xpath('.//p[@class="price_color"]/text()')[0]. Grabs '£51.77'. Strip currency? price.strip('£') or regex for floats.
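One way to turn '£51.77' into a number (a sketch assuming prices are always a currency symbol plus a decimal):

python
import re

price_text = '£51.77'
match = re.search(r'\d+\.\d+', price_text)
price = float(match.group()) if match else None
print(price)  # 51.77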

Batch 'em:

python
base_url = response.url  # base for resolving relative image URLs
data = {
    'title': product.xpath('.//h3/a/@title')[0],
    'img': urljoin(base_url, product.xpath('.//img/@src')[0]),
    'price': product.xpath('.//p[@class="price_color"]/text()')[0]
}

Pitfalls? Relative URLs break without urljoin. Multiple prices? [0] picks first. No match? Check XPath in inspector. lxml’s fault-tolerant, but sites change—add try/except.


Full Code Example and Pro Tips

Tie it together. Full scraper for page 1:

python
import requests
from lxml import html
from urllib.parse import urljoin

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)
tree = html.fromstring(response.content)

products = tree.xpath('//article[@class="product_pod"]')
books = []

for product in products:
    try:
        title = product.xpath('.//h3/a/@title')[0]
        img_src = urljoin(url, product.xpath('.//img/@src')[0])
        price = product.xpath('.//p[@class="price_color"]/text()')[0]
        books.append({'title': title, 'img': img_src, 'price': price})
    except IndexError:
        # Log a snippet of the offending pod so it can be inspected
        print(f"Missing data in pod: {html.tostring(product, encoding='unicode')[:80]}")

print(books[:2])  # Sample: [{'title': 'A Light in the Attic', ...}]

Scaling to more pages? Loop over f'page-{i}.html'; see the sketch below. Rate limit with time.sleep(1). Pandas? pd.DataFrame(books).to_csv(). Debug with print(html.tostring(product)).
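A minimal pagination sketch (assumes the extraction logic above is wrapped in a hypothetical scrape_page() helper, and that the catalogue has 50 pages; check before hardcoding):

python
import time
import pandas as pd

all_books = []
for i in range(1, 51):  # assumption: 50 catalogue pages
    page_url = f'https://books.toscrape.com/catalogue/page-{i}.html'
    all_books.extend(scrape_page(page_url))  # hypothetical helper wrapping the loop above
    time.sleep(1)  # polite rate limit: one request per second

pd.DataFrame(all_books).to_csv('books.csv', index=False)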

Pro tips: Namespaces trip XPath? Pass a namespaces= dict to .xpath(). Huge pages? etree.iterparse(). Anti-bot? Send browser-like headers with a User-Agent (sketch below). This lxml XPath setup flies for web scraping in Python.
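For the headers tip, a minimal sketch (the User-Agent string is just an example):

python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'}
response = requests.get('https://books.toscrape.com/catalogue/page-1.html', headers=headers)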


Sources

  1. Web Scraping using lxml and XPath in Python - GeeksforGeeks
  2. How do I extract img src from HTML via lxml XPath? - Stack Overflow

Conclusion

Mastering lxml XPath for web scraping in Python boils down to relative queries on Element objects, indexing result lists with [0], and gluing URLs together properly; your XPath paths are already correct for books.toscrape titles, images, and prices. Run the full code above, tweak the error handling, and you've got a solid extractor. Scale it across pages and you're scraping catalogs like a pro. Questions on pagination or proxies? Dive deeper, but this foundation rocks.
