What’s the optimal way to extract og:image meta tags for a list of articles from different websites? I need to build an “About Us” news section listing press articles in which our company is mentioned. For each article, I need to automatically load a preview image (similar to how Telegram, VK and other social networks do it). My current method, a PHP function that parses each page’s HTML to find og:image, is too resource-intensive when processing a large number of articles. I’m considering saving the image at the moment each news item is created. What alternative solutions exist for efficiently obtaining og:image previews?
Optimal Extraction of og:image Meta Tags for Large-Scale Article Processing
Efficiently extracting og:image meta tags for a large number of articles requires caching strategies, alternative data extraction methods, and techniques that minimize server load. The main recommendations are a multi-level caching system, specialized services for Open Graph data extraction, and asynchronous processing that keeps the main execution thread from blocking.
Table of Contents
- Analysis of Current Approach
- Data Caching Strategies
- Alternative og:image Extraction Methods
- Performance Optimization
- Practical PHP Solutions
- Approach Comparison
- Implementation Recommendations
Analysis of Current Approach
The current method, which relies on parsing HTML code from each page to find the og:image meta tag, is indeed inefficient when processing a large number of articles for several reasons:
- High CPU load: HTML parsing requires significant computational resources, especially when processing multiple pages simultaneously.
- Execution time limitations: Most hosting providers limit maximum script execution time, making it impossible to process hundreds or thousands of pages in a single request.
- HTML structure dependency: Different websites use different structures and formats for meta tags, requiring complex processing and regular code adaptation.
- Network delays: Each request to an external site takes time to establish a connection and retrieve data, which multiplies when processing a large number of URLs.
Example problem: When processing 100 articles with a 3-second delay for each, the total time would be approximately 5 minutes, which is unacceptable for a user interface.
Data Caching Strategies
Multi-level Caching
Implement a multi-level caching system to minimize repeated requests to external resources:
```php
class OpenGraphCache {
    private $cacheDir;
    private $cacheLifetime;

    public function __construct($cacheDir = '/tmp/og_cache', $cacheLifetime = 86400) {
        $this->cacheDir = $cacheDir;
        $this->cacheLifetime = $cacheLifetime;
        if (!file_exists($this->cacheDir)) {
            mkdir($this->cacheDir, 0755, true);
        }
    }

    public function get($url) {
        $cacheFile = $this->getCacheFileName($url);
        if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $this->cacheLifetime) {
            return json_decode(file_get_contents($cacheFile), true);
        }
        return false;
    }

    public function set($url, $data) {
        $cacheFile = $this->getCacheFileName($url);
        file_put_contents($cacheFile, json_encode($data));
    }

    private function getCacheFileName($url) {
        return $this->cacheDir . '/' . md5($url) . '.json';
    }
}
```
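A minimal read-through usage pattern might look like this; `fetchOgData()` is a hypothetical stand-in for whichever extraction method from the sections below you choose:

```php
// Read-through cache usage; fetchOgData() is a hypothetical placeholder
// for any of the extraction methods described below
$cache = new OpenGraphCache();
$og = $cache->get($url);
if ($og === false) {
    $og = ['image' => fetchOgData($url)]; // cache miss: fetch and store
    $cache->set($url, $og);
}
```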
Types of Caching
- File caching: Simple to implement, suitable for small to medium projects.
- Database caching: A more reliable solution with the ability to manage cache expiration times.
- Redis/Memcached: Optimal for high-traffic projects with minimal access latency (see the sketch after this list).
- HTTP caching: Using Cache-Control and ETag headers for caching at the web-server level.
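For illustration, a minimal Redis-backed variant of the file cache above, assuming the phpredis extension is installed and Redis listens on localhost with default settings:

```php
// Redis-backed cache; assumes the phpredis extension and a local Redis
// instance on the default port
class RedisOpenGraphCache {
    private $redis;
    private $ttl;

    public function __construct($ttl = 86400) {
        $this->redis = new Redis();
        $this->redis->connect('127.0.0.1', 6379);
        $this->ttl = $ttl;
    }

    public function get($url) {
        $data = $this->redis->get('og:' . md5($url));
        return $data !== false ? json_decode($data, true) : false;
    }

    public function set($url, $data) {
        // SETEX stores the value and its expiration time in one command
        $this->redis->setEx('og:' . md5($url), $this->ttl, json_encode($data));
    }
}
```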
Caching Recommendations
- Use URL hash as cache filename to prevent collisions
- Set reasonable cache lifetimes (24-72 hours for news content)
- Implement background cache refresh for popular articles
- Add a cache invalidation mechanism for when data changes (see the helper sketch below)
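As a sketch, invalidation for the file cache above can be as simple as deleting the cached entry; this assumes the same md5-based file naming as OpenGraphCache:

```php
// Drop a single cached entry so the next request re-fetches fresh data;
// assumes the md5-based naming used by OpenGraphCache
function invalidateOgCache($url, $cacheDir = '/tmp/og_cache') {
    $cacheFile = $cacheDir . '/' . md5($url) . '.json';
    if (file_exists($cacheFile)) {
        unlink($cacheFile);
    }
}
```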
Alternative og:image Extraction Methods
Specialized Services
Use ready-made APIs for extracting Open Graph data:
```php
// Example using the OpenGraph.io API
function getOgImageViaService($url) {
    $apiKey = 'YOUR_API_KEY';
    // The target URL must be URL-encoded when embedded in the API path
    $apiUrl = 'https://opengraph.io/api/1.1/site/' . urlencode($url) . '?app_id=' . $apiKey;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $apiUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $response = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($response, true);
    if (isset($data['hybridGraph']['image'])) {
        return $data['hybridGraph']['image'];
    }
    return null;
}
```
Advantages of specialized services:
- Optimized data extraction algorithms
- Guaranteed performance
- Error handling and redirect management
- Service-side caching
Headless Browsers
For complex sites with dynamic content, use headless browsers:
```php
// Example: render the page with a locally installed headless Chrome/Chromium
// and extract og:image from the resulting DOM. --dump-dom prints the DOM
// after JavaScript has executed; adjust the binary name for your system.
function getOgImageWithHeadlessBrowser($url) {
    $command = sprintf(
        'google-chrome --headless --disable-gpu --dump-dom %s 2>/dev/null',
        escapeshellarg($url)
    );
    $html = shell_exec($command);
    if (!$html) {
        return null;
    }
    $pattern = '/<meta[^>]*property=["\']og:image["\'][^>]*content=["\']([^"\']+)["\']/i';
    return preg_match($pattern, $html, $matches) ? $matches[1] : null;
}
```
Proxying Through Services
Use proxying through services such as:
- Readability API: For extracting main content and images
- Embedly API: For getting structured data from web pages
- Iframely API: For extracting media content from various sources
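Most of these providers follow the same request shape: a single GET returning JSON. A hedged generic sketch follows; the endpoint and response field names below are placeholders, so substitute the values from your provider's documentation:

```php
// Generic proxy-service call; $endpoint and the response field name are
// placeholders to be replaced per the provider's documentation
function getOgImageViaProxy($articleUrl, $endpoint, $apiKey) {
    $requestUrl = $endpoint
        . '?api_key=' . urlencode($apiKey)
        . '&url=' . urlencode($articleUrl);
    $json = @file_get_contents($requestUrl);
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    // oEmbed-style APIs commonly expose the preview as thumbnail_url
    return isset($data['thumbnail_url']) ? $data['thumbnail_url'] : null;
}
```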
Performance Optimization
Asynchronous Processing
Implement asynchronous processing to prevent blocking the main thread:
```php
class AsyncImageFetcher {
    private $maxConcurrent = 10;
    private $timeout = 5;

    public function fetchMultiple($urls) {
        $urls = array_values($urls);
        $results = [];
        $mh = curl_multi_init();
        $handles = [];      // active handles, keyed by URL index
        $nextIndex = 0;     // next URL waiting to be scheduled

        // Seed the pool with the first $maxConcurrent requests
        while ($nextIndex < count($urls) && count($handles) < $this->maxConcurrent) {
            $handles[$nextIndex] = $this->createHandle($urls[$nextIndex]);
            curl_multi_add_handle($mh, $handles[$nextIndex]);
            $nextIndex++;
        }

        do {
            $status = curl_multi_exec($mh, $active);
            if ($status != CURLM_OK) {
                break;
            }
            // Collect completed requests and refill the pool
            while ($info = curl_multi_info_read($mh)) {
                $i = array_search($info['handle'], $handles, true);
                if ($i !== false) {
                    $response = curl_multi_getcontent($info['handle']);
                    $results[$i] = $this->processResponse($response);
                    curl_multi_remove_handle($mh, $info['handle']);
                    curl_close($info['handle']);
                    unset($handles[$i]);
                    // Schedule the next pending URL, if any
                    if ($nextIndex < count($urls)) {
                        $handles[$nextIndex] = $this->createHandle($urls[$nextIndex]);
                        curl_multi_add_handle($mh, $handles[$nextIndex]);
                        $nextIndex++;
                    }
                }
            }
            if ($active) {
                // Wait for activity instead of busy-looping
                curl_multi_select($mh, 1.0);
            }
        } while ($active || !empty($handles));

        curl_multi_close($mh);
        ksort($results); // restore input order
        return $results;
    }

    private function createHandle($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        return $ch;
    }

    private function processResponse($html) {
        // Extract og:image from the fetched HTML (same regex as below)
        $pattern = '/<meta[^>]*property=["\']og:image["\'][^>]*content=["\']([^"\']+)["\']/i';
        return ($html && preg_match($pattern, $html, $m)) ? $m[1] : null;
    }
}
```
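Hypothetical usage, assuming $articleUrls holds the article links collected for the news section:

```php
// Fetch up to 10 pages in parallel; results are keyed by input order
$fetcher = new AsyncImageFetcher();
$images = $fetcher->fetchMultiple($articleUrls);
```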
Network Request Optimization
- HTTP/2 usage: For simultaneous processing of multiple requests to the same domain.
- Data compression: Enabling gzip compression to reduce the amount of data transferred.
- Timeout optimization: Setting reasonable timeout values to prevent long waits.
- Parallel processing: Using multithreading or asynchronous requests.

The first three points translate into the cURL options shown below.
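As a sketch, applied to an existing cURL handle $ch (HTTP/2 support requires a cURL build with nghttp2; CURL_HTTP_VERSION_2TLS silently falls back to HTTP/1.1 otherwise):

```php
// Options implementing HTTP/2, compression and tight timeouts on an
// existing cURL handle $ch
curl_setopt_array($ch, [
    CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_2TLS, // HTTP/2 over TLS when the server supports it
    CURLOPT_ENCODING       => '',                     // empty string: accept all encodings curl supports (gzip, deflate, ...)
    CURLOPT_CONNECTTIMEOUT => 3,                      // fail fast on unreachable hosts
    CURLOPT_TIMEOUT        => 5,                      // total per-request budget in seconds
]);
```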
Resource-Efficient Parsers
Use specialized parsers instead of general HTML parsers:
```php
// Targeted regex search: cheap, but sensitive to attribute order, so
// both orders are checked
function extractOgImageWithRegex($html) {
    $patterns = [
        '/<meta[^>]*property=["\']og:image["\'][^>]*content=["\']([^"\']+)["\']/i',
        '/<meta[^>]*content=["\']([^"\']+)["\'][^>]*property=["\']og:image["\']/i',
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $matches)) {
            return $matches[1];
        }
    }
    return null;
}
```
Practical PHP Solutions
Hybrid Approach
Combine multiple methods to achieve optimal performance:
```php
class HybridImageExtractor {
    private $cache;
    private $services;

    public function __construct() {
        $this->cache = new OpenGraphCache();
        // OpenGraphService, ReadabilityService and RegexExtractor are
        // application-specific wrappers sharing an extract($url) method;
        // they stand in for the techniques shown in the previous section
        $this->services = [
            'opengraph'   => new OpenGraphService(),
            'readability' => new ReadabilityService(),
            'fallback'    => new RegexExtractor()
        ];
    }

    public function extract($url) {
        // Check cache first
        $cached = $this->cache->get($url);
        if ($cached) {
            return $cached;
        }
        // Try each service in priority order
        foreach ($this->services as $serviceName => $service) {
            try {
                $result = $service->extract($url);
                if ($result) {
                    $this->cache->set($url, $result);
                    return $result;
                }
            } catch (Exception $e) {
                // Log the error and fall through to the next service
                error_log("Service {$serviceName} failed for {$url}: " . $e->getMessage());
            }
        }
        return null;
    }

    public function batchExtract($urls) {
        $results = [];
        $missingUrls = [];
        // Serve everything possible from cache
        foreach ($urls as $url) {
            $cached = $this->cache->get($url);
            if ($cached) {
                $results[$url] = $cached;
            } else {
                $missingUrls[] = $url;
            }
        }
        // Fetch the cache misses in one batch
        if (!empty($missingUrls)) {
            $newResults = $this->services['opengraph']->batchExtract($missingUrls);
            foreach ($newResults as $url => $image) {
                $results[$url] = $image;
                $this->cache->set($url, $image);
            }
        }
        return $results;
    }
}
```
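Hypothetical usage for rendering the news section (the URLs are placeholders):

```php
// Resolve preview images for a batch of press-mention articles
$extractor = new HybridImageExtractor();
$images = $extractor->batchExtract([
    'https://example.com/press/article-1',
    'https://example.com/press/article-2',
]);
foreach ($images as $url => $image) {
    echo "{$url} => {$image}\n";
}
```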
Optimized Parser with Limited Response Size
```php
class OptimizedOgExtractor {
    private $maxContentLength = 102400; // 100 KB
    private $timeout = 3;

    public function extract($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        curl_setopt($ch, CURLOPT_HEADER, false);
        // Note: CURLOPT_MAXFILESIZE is honored only when the server sends
        // a Content-Length header; see the write-callback sketch below for
        // a cap that works without it
        curl_setopt($ch, CURLOPT_MAXFILESIZE, $this->maxContentLength);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ImageExtractor/1.0)');
        $response = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($httpCode !== 200 || !$response) {
            return null;
        }
        // Quick search for og:image without building a DOM
        return $this->findOgImage($response);
    }

    private function findOgImage($html) {
        // Locate the attribute value, then back up to the enclosing <meta tag
        $pos = strpos($html, 'og:image');
        if ($pos === false) {
            return null;
        }
        $metaPos = strrpos(substr($html, 0, $pos), '<meta');
        if ($metaPos === false) {
            return null;
        }
        $metaEnd = strpos($html, '>', $metaPos);
        if ($metaEnd === false) {
            return null;
        }
        $metaTag = substr($html, $metaPos, $metaEnd - $metaPos + 1);
        // Pull the content attribute out of the isolated tag
        if (preg_match('/content=["\']([^"\']+)["\']/', $metaTag, $matches)) {
            return $matches[1];
        }
        return null;
    }
}
```
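Since CURLOPT_MAXFILESIZE is ignored for responses without a Content-Length header (e.g. chunked transfers), a write callback gives a hard cap that always applies. A sketch for an existing handle $ch:

```php
// Abort the transfer once maxContentLength bytes have arrived; returning
// a value different from the chunk length makes curl stop with an error,
// but $buffer keeps everything received so far
$buffer = '';
$maxContentLength = 102400;
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $maxContentLength) {
    $buffer .= $chunk;
    return strlen($buffer) > $maxContentLength ? 0 : strlen($chunk);
});
curl_exec($ch); // may return false after the deliberate abort; use $buffer
```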
Integration with Processing Queues
For processing a large number of articles, use queues:
```php
class ImageProcessingQueue {
    private $queue;

    public function __construct() {
        // RedisQueue is a placeholder for your queue backend (e.g. a thin
        // wrapper over Redis lists or a library such as php-resque)
        $this->queue = new RedisQueue('image_processing');
    }

    public function addToQueue($articleId, $url) {
        $job = [
            'id'         => uniqid('og_', true),
            'article_id' => $articleId,
            'url'        => $url,
            'created_at' => time(),
            'attempts'   => 0
        ];
        $this->queue->push($job);
    }

    public function processQueue() {
        while ($job = $this->queue->pop()) {
            try {
                $extractor = new OptimizedOgExtractor();
                $imageUrl = $extractor->extract($job['url']);
                if ($imageUrl) {
                    $this->saveImage($job['article_id'], $imageUrl);
                }
                $this->queue->acknowledge($job['id']);
            } catch (Exception $e) {
                // Retry up to three times, then mark the job as failed
                $job['attempts']++;
                if ($job['attempts'] < 3) {
                    $this->queue->push($job);
                } else {
                    $this->queue->fail($job['id'], $e->getMessage());
                }
            }
        }
    }
}
```
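Hypothetical wiring: enqueue when an article entry is created, and drain the queue from a background worker (a cron job or a long-running CLI script):

```php
// In the "save article" handler: schedule the preview-image fetch
$queue = new ImageProcessingQueue();
$queue->addToQueue($articleId, $articleUrl);

// worker.php, run from cron or supervisord: drain pending jobs
$queue = new ImageProcessingQueue();
$queue->processQueue();
```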
Approach Comparison
| Approach | Speed | Reliability | Implementation Complexity | Recommended Load |
|---|---|---|---|---|
| Direct HTML parsing | Low | Medium | Low | Several dozen articles per hour |
| Regular expressions | Medium | Low | Medium | Several hundred articles per hour |
| Specialized services | High | High | Medium | Thousands of articles per hour |
| Headless browsers | Low | High | High | Specific cases |
| Multi-level cache | Very high | Depends on implementation | High | Optimal solution for most cases |
Implementation Recommendations
Step-by-Step Implementation
- Initial phase:
  - Implement basic file system caching
  - Use optimized parsers with limited requests
  - Add error handling and timeouts
- System development:
  - Implement specialized services as the primary method
  - Add multi-level caching (Redis + file-based)
  - Implement asynchronous processing in batches
- Scaling:
  - Use processing queues for background tasks
  - Implement load balancing between multiple services
  - Add performance monitoring and logging
Performance Configuration
```php
// Reasonable PHP settings for batch image processing
ini_set('max_execution_time', 30);
ini_set('memory_limit', '256M');
ini_set('max_input_time', 30);

// Baseline cURL settings
$options = [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS      => 3,
    CURLOPT_TIMEOUT        => 5,
    CURLOPT_CONNECTTIMEOUT => 3,
    // Keep TLS verification enabled; disabling it exposes the fetcher
    // to man-in-the-middle attacks
    CURLOPT_SSL_VERIFYPEER => true,
    CURLOPT_SSL_VERIFYHOST => 2,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; NewsImageBot/1.0)',
    CURLOPT_HTTPHEADER     => [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
    ]
];
```
Monitoring and Optimization
- Performance tracking (see the timing sketch after this list):
  - Request execution time
  - Ratio of successful to unsuccessful extractions
  - Memory and CPU usage
- Efficiency analysis:
  - Cache hit rate
  - Average request processing time
  - System load during peak times
- Automatic optimization:
  - Dynamic timeout adjustment based on load
  - Automatic switching between extraction methods
  - Load prediction and peak prevention
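As a minimal sketch of the tracking point, a timing wrapper that logs one parseable line per extraction (any PSR-3 logger could replace error_log here):

```php
// Measure one extraction and log a metrics line for later analysis
function timedExtract(callable $extractor, $url) {
    $start = microtime(true);
    $result = $extractor($url);
    $elapsedMs = (microtime(true) - $start) * 1000;
    error_log(sprintf(
        'og_extract url=%s success=%d time_ms=%.1f',
        $url,
        $result ? 1 : 0,
        $elapsedMs
    ));
    return $result;
}
```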
Conclusion
An optimal system for extracting og:image meta tags should be based on a multi-level architecture using caching, specialized services, and asynchronous processing. Key recommendations:
- Implement multi-level caching using Redis for frequently requested URLs and file-based caching for other cases.
- Use specialized services as the primary data extraction method, with fallback to optimized parsers for cases where services are unavailable.
- Implement asynchronous processing using queues to prevent blocking the main execution thread and distribute load evenly.
- Optimize network requests by limiting response sizes, setting reasonable timeouts, and using parallel processing.
- Add monitoring and logging to track performance and promptly detect issues.
This approach will enable efficient processing of thousands of articles with minimal server load and ensure fast response times for the user interface when loading preview images for the news section.