Question

What’s the optimal way to extract og:image meta tags for a list of articles from different websites? I need to build an “About Us” news section listing articles in which our company is mentioned. For each article, a preview image should load automatically (the way Telegram, VK, and other social networks do it). My current method, a PHP function that parses each page’s HTML looking for og:image, is too resource-intensive when processing a large number of articles. I’m considering saving the image once, when the article entry is created. What alternative solutions exist for efficiently obtaining og:image images?

NeuroAgent

Optimal Extraction of og:image Meta Tags for Large-Scale Article Processing

Efficiently extracting og:image meta tags for a large number of articles comes down to caching, smarter extraction methods, and minimizing server load. The main recommendations are a multi-level caching system, specialized Open Graph extraction services, and asynchronous batch processing so that page rendering is never blocked.


Analysis of Current Approach

The current method, which relies on parsing HTML code from each page to find the og:image meta tag, is indeed inefficient when processing a large number of articles for several reasons:

  1. High CPU load: HTML parsing requires significant computational resources, especially when processing multiple pages simultaneously.

  2. Execution time limitations: Most hosting providers limit maximum script execution time, making it impossible to process hundreds or thousands of pages in a single request.

  3. HTML structure dependency: Different websites use different structures and formats for meta tags, requiring complex processing and regular code adaptation.

  4. Network delays: Each request to an external site takes time to establish a connection and retrieve data, which multiplies when processing a large number of URLs.

Example: fetching 100 articles sequentially at roughly 3 seconds each takes about 5 minutes, which is unacceptable for a user-facing page.


Data Caching Strategies

Multi-level Caching

Implement a multi-level caching system to minimize repeated requests to external resources:

php
class OpenGraphCache {
    private $cacheDir;
    private $cacheLifetime;
    
    public function __construct($cacheDir = '/tmp/og_cache', $cacheLifetime = 86400) {
        $this->cacheDir = $cacheDir;
        $this->cacheLifetime = $cacheLifetime;
        if (!file_exists($this->cacheDir)) {
            mkdir($this->cacheDir, 0755, true);
        }
    }
    
    public function get($url) {
        $cacheFile = $this->getCacheFileName($url);
        
        if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $this->cacheLifetime) {
            return json_decode(file_get_contents($cacheFile), true);
        }
        
        return false;
    }
    
    public function set($url, $data) {
        $cacheFile = $this->getCacheFileName($url);
        file_put_contents($cacheFile, json_encode($data));
    }
    
    private function getCacheFileName($url) {
        return $this->cacheDir . '/' . md5($url) . '.json';
    }
}
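
Usage is straightforward; extractOgData below is a hypothetical stand-in for whichever extraction method you choose:

php
$cache = new OpenGraphCache();

$data = $cache->get($url);
if ($data === false) {
    $data = extractOgData($url); // hypothetical extraction call
    $cache->set($url, $data);
}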

Types of Caching

  1. File caching: Simple to implement, suitable for small to medium projects.

  2. Database caching: More reliable solution with the ability to manage cache expiration times.

  3. Redis/Memcached: Optimal solution for high-traffic projects with minimal access latency (a minimal sketch follows this list).

  4. HTTP caching: Using Cache-Control and ETag headers for web server-level caching.
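
For option 3, a minimal Redis-backed variant might look like the sketch below. It assumes the phpredis extension is installed; the class name and key prefix are illustrative:

php
class RedisOpenGraphCache {
    private $redis;
    private $ttl;
    
    public function __construct($host = '127.0.0.1', $port = 6379, $ttl = 86400) {
        $this->redis = new Redis();
        $this->redis->connect($host, $port);
        $this->ttl = $ttl;
    }
    
    public function get($url) {
        $value = $this->redis->get('og:' . md5($url));
        return $value !== false ? json_decode($value, true) : false;
    }
    
    public function set($url, $data) {
        // setex stores the value with an expiration, so stale entries
        // disappear without a separate cleanup job
        $this->redis->setex('og:' . md5($url), $this->ttl, json_encode($data));
    }
}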

Caching Recommendations

  • Use URL hash as cache filename to prevent collisions
  • Set reasonable cache lifetimes (24-72 hours for news content)
  • Implement background cache refresh for popular articles (see the sketch after this list)
  • Add cache invalidation mechanism when data changes
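
A background refresh pass can be a short cron-driven script. The sketch below assumes each cached JSON payload stores its source url, which the file cache above would need to save:

php
// Re-fetch cache entries older than a "soft" TTL so readers keep getting
// cached data while it is refreshed in the background
function refreshStaleEntries($cacheDir, $softTtl, callable $refetch) {
    foreach (glob($cacheDir . '/*.json') as $file) {
        if (time() - filemtime($file) > $softTtl) {
            $entry = json_decode(file_get_contents($file), true);
            if (!empty($entry['url'])) {
                $refetch($entry['url']); // re-extracts and overwrites the entry
            }
        }
    }
}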

Alternative og:image Extraction Methods

Specialized Services

Use ready-made APIs for extracting Open Graph data:

php
// Example using the OpenGraph.io API; check the provider's documentation
// for the current endpoint and response format
function getOgImageViaService($url) {
    $apiKey = 'YOUR_API_KEY';
    $apiUrl = 'https://opengraph.io/api/1.1/site/' . urlencode($url) . '?app_id=' . $apiKey;
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $apiUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    
    $response = curl_exec($ch);
    curl_close($ch);
    
    $data = json_decode($response, true);
    
    if (isset($data['hybridGraph']['image'])) {
        return $data['hybridGraph']['image'];
    }
    
    return null;
}

Advantages of specialized services:

  • Extraction algorithms maintained and optimized by the provider
  • Predictable response times
  • Built-in error handling and redirect following
  • Service-side caching

Headless Browsers

For sites that inject meta tags with JavaScript, render the page with a headless browser first. The sketch below shells out to headless Chrome's --dump-dom mode; calling a Node.js Puppeteer script from PHP is another common setup:

php
// Example: render the page with headless Chrome and read og:image from the
// resulting DOM. Assumes a local Chrome/Chromium binary; adjust the command
// name for your system (e.g. chromium or google-chrome-stable)
function getOgImageWithHeadlessBrowser($url) {
    // --dump-dom prints the rendered DOM to stdout after the page loads
    $command = sprintf(
        'google-chrome --headless --disable-gpu --dump-dom %s 2>/dev/null',
        escapeshellarg($url)
    );
    $html = shell_exec($command);
    
    if (!$html) {
        return null;
    }
    
    if (preg_match('/<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']/i', $html, $matches)) {
        return $matches[1];
    }
    
    return null;
}

Proxying Through Services

Alternatively, delegate extraction to a third-party service such as:

  • Readability API: For extracting main content and images
  • Embedly API: For getting structured data from web pages
  • Iframely API: For extracting media content from various sources

Performance Optimization

Asynchronous Processing

Fetch pages concurrently with curl_multi so that network waits overlap instead of accumulating:

php
class AsyncImageFetcher {
    private $maxConcurrent = 10;
    private $timeout = 5;
    
    public function fetchMultiple($urls) {
        $urls = array_values($urls);
        $results = [];
        $handles = [];   // active cURL handles keyed by URL index
        $nextIndex = 0;
        $mh = curl_multi_init();
        
        // Fill the initial window of concurrent requests
        while ($nextIndex < count($urls) && count($handles) < $this->maxConcurrent) {
            $handles[$nextIndex] = $this->createHandle($urls[$nextIndex]);
            curl_multi_add_handle($mh, $handles[$nextIndex]);
            $nextIndex++;
        }
        
        do {
            $status = curl_multi_exec($mh, $active);
            if ($status !== CURLM_OK) {
                break;
            }
            
            // Harvest completed requests and refill the window
            while ($info = curl_multi_info_read($mh)) {
                $i = array_search($info['handle'], $handles, true);
                if ($i !== false) {
                    $response = curl_multi_getcontent($info['handle']);
                    $results[$i] = $this->processResponse($response);
                    
                    curl_multi_remove_handle($mh, $info['handle']);
                    curl_close($info['handle']);
                    unset($handles[$i]);
                    
                    // Start the next pending URL, keeping the window full
                    if ($nextIndex < count($urls)) {
                        $handles[$nextIndex] = $this->createHandle($urls[$nextIndex]);
                        curl_multi_add_handle($mh, $handles[$nextIndex]);
                        $nextIndex++;
                    }
                }
            }
            
            // Wait for activity on any handle instead of busy-looping
            if ($active) {
                curl_multi_select($mh, 1.0);
            }
        } while ($active || !empty($handles));
        
        curl_multi_close($mh);
        ksort($results);
        return $results;
    }
    
    private function createHandle($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        return $ch;
    }
    
    private function processResponse($html) {
        // Lightweight og:image lookup in the fetched HTML
        if ($html && preg_match('/<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']/i', $html, $m)) {
            return $m[1];
        }
        return null;
    }
}

Network Request Optimization

  1. HTTP/2 usage: Multiplexes multiple requests to the same domain over a single connection.

  2. Data compression: Enable gzip compression to reduce the amount of data transferred (both points are shown in the cURL snippet after this list).

  3. Timeout optimization: Setting reasonable timeout values to prevent long waits.

  4. Parallel processing: Using multithreading or asynchronous requests.
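
Points 1 and 2 map to two cURL options. HTTP/2 requires a libcurl built with HTTP/2 support; cURL silently falls back to HTTP/1.1 otherwise:

php
$ch = curl_init($url);
// Negotiate HTTP/2 over TLS where the server supports it
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2TLS);
// An empty string tells libcurl to offer every encoding it supports
// (gzip, deflate, br) and decompress the response transparently
curl_setopt($ch, CURLOPT_ENCODING, '');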

Resource-Efficient Parsers

When you only need a single tag, use targeted extraction instead of a general HTML parser:

php
// Example using a simple regular expression; [^>]+ tolerates extra
// attributes, though the pattern still assumes property precedes content
function extractOgImageWithRegex($html) {
    $pattern = '/<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']/i';
    if (preg_match($pattern, $html, $matches)) {
        return $matches[1];
    }
    return null;
}

Practical PHP Solutions

Hybrid Approach

Combine multiple methods to achieve optimal performance:

php
class HybridImageExtractor {
    private $cache;
    private $services;
    
    public function __construct() {
        $this->cache = new OpenGraphCache();
        // Thin wrapper classes (not shown) that each expose extract($url),
        // ordered from most to least reliable
        $this->services = [
            'opengraph' => new OpenGraphService(),
            'readability' => new ReadabilityService(),
            'fallback' => new RegexExtractor()
        ];
    }
    
    public function extract($url) {
        // Check cache
        $cached = $this->cache->get($url);
        if ($cached) {
            return $cached;
        }
        
        // Try extraction through services in order
        foreach ($this->services as $serviceName => $service) {
            try {
                $result = $service->extract($url);
                if ($result) {
                    $this->cache->set($url, $result);
                    return $result;
                }
            } catch (Exception $e) {
                // Log error and continue to next service
                error_log("Service {$serviceName} failed for {$url}: " . $e->getMessage());
            }
        }
        
        return null;
    }
    
    public function batchExtract($urls) {
        $results = [];
        $missingUrls = [];
        
        // Check cache for all URLs
        foreach ($urls as $url) {
            $cached = $this->cache->get($url);
            if ($cached) {
                $results[$url] = $cached;
            } else {
                $missingUrls[] = $url;
            }
        }
        
        // Process URLs missing from cache
        if (!empty($missingUrls)) {
            $newResults = $this->services['opengraph']->batchExtract($missingUrls);
            foreach ($newResults as $url => $image) {
                $results[$url] = $image;
                $this->cache->set($url, $image);
            }
        }
        
        return $results;
    }
}
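
Rendering the news section then reduces to a single batch call ($articleUrls here stands for your list of article URLs):

php
$extractor = new HybridImageExtractor();
$images = $extractor->batchExtract($articleUrls); // url => image URL or null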

Optimized Parser with a Capped Download Size

php
class OptimizedOgExtractor {
    private $maxContentLength = 102400; // 100 KB
    private $timeout = 3;
    
    public function extract($url) {
        $buffer = '';
        $limit = $this->maxContentLength;
        
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ImageExtractor/1.0)');
        // Abort the transfer once the cap is reached: og:image lives in <head>,
        // so the first 100 KB are almost always enough. Unlike CURLOPT_MAXFILESIZE,
        // this also works when the server sends no Content-Length header.
        curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $limit) {
            $buffer .= $chunk;
            return strlen($buffer) > $limit ? 0 : strlen($chunk);
        });
        
        curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        
        if ($httpCode !== 200 || $buffer === '') {
            return null;
        }
        
        // Quick search for og:image
        return $this->findOgImage($buffer);
    }
    
    private function findOgImage($html) {
        // Optimized search without full parsing
        $pos = strpos($html, 'og:image');
        if ($pos === false) {
            return null;
        }
        
        // Extract the nearest meta tag
        $metaPos = strrpos(substr($html, 0, $pos), '<meta');
        if ($metaPos === false) {
            return null;
        }
        
        $metaEnd = strpos($html, '>', $metaPos);
        if ($metaEnd === false) {
            return null;
        }
        
        $metaTag = substr($html, $metaPos, $metaEnd - $metaPos + 1);
        
        // Extract content from meta tag
        if (preg_match('/content=["\']([^"\']+)["\']/', $metaTag, $matches)) {
            return $matches[1];
        }
        
        return null;
    }
}

Integration with Processing Queues

For processing a large number of articles, use queues:

php
class ImageProcessingQueue {
    private $queue;
    
    public function __construct() {
        // RedisQueue stands in for your queue abstraction (e.g. a thin
        // wrapper over Redis lists, Beanstalkd, or RabbitMQ)
        $this->queue = new RedisQueue('image_processing');
    }
    
    public function addToQueue($articleId, $url) {
        $job = [
            'id' => uniqid('job_', true),
            'article_id' => $articleId,
            'url' => $url,
            'created_at' => time(),
            'attempts' => 0
        ];
        
        $this->queue->push($job);
    }
    
    public function processQueue() {
        while ($job = $this->queue->pop()) {
            try {
                $extractor = new OptimizedOgExtractor();
                $imageUrl = $extractor->extract($job['url']);
                
                if ($imageUrl) {
                    $this->saveImage($job['article_id'], $imageUrl);
                }
                
                $this->queue->acknowledge($job['id']);
            } catch (Exception $e) {
                $job['attempts']++;
                
                if ($job['attempts'] < 3) {
                    $this->queue->push($job);
                } else {
                    $this->queue->fail($job['id'], $e->getMessage());
                }
            }
        }
    }
}
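
A typical entry point is a small worker script run from cron or a process supervisor such as supervisord:

php
// worker.php: drains the queue in the background, outside web requests
$queue = new ImageProcessingQueue();
$queue->processQueue();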

Approach Comparison

| Approach             | Speed     | Reliability               | Implementation Complexity | Recommended Load                  |
|----------------------|-----------|---------------------------|---------------------------|-----------------------------------|
| Direct HTML parsing  | Low       | Medium                    | Low                       | Several dozen articles per hour   |
| Regular expressions  | Medium    | Low                       | Medium                    | Several hundred articles per hour |
| Specialized services | High      | High                      | Medium                    | Thousands of articles per hour    |
| Headless browsers    | Low       | High                      | High                      | Specific cases                    |
| Multi-level cache    | Very high | Depends on implementation | High                      | Optimal solution for most cases   |

Implementation Recommendations

Step-by-Step Implementation

  1. Initial phase:

    • Implement basic file system caching
    • Use optimized parsers with limited requests
    • Add error handling and timeouts
  2. System development:

    • Implement specialized services as the primary method
    • Add multi-level caching (Redis + file-based)
    • Implement asynchronous processing in batches
  3. Scaling:

    • Use processing queues for background tasks
    • Implement load balancing between multiple services
    • Add performance monitoring and logging

Performance Configuration

php
// Optimal PHP settings for image processing
ini_set('max_execution_time', 30);
ini_set('memory_limit', '256M');
ini_set('max_input_time', 30);

// Optimal cURL settings
$options = [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 3,
    CURLOPT_TIMEOUT => 5,
    CURLOPT_CONNECTTIMEOUT => 3,
    CURLOPT_SSL_VERIFYPEER => true,  // keep TLS certificate verification on;
    CURLOPT_SSL_VERIFYHOST => 2,     // disabling it exposes you to MITM attacks
    CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; NewsImageBot/1.0)',
    CURLOPT_HTTPHEADER => [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
    ]
];

Monitoring and Optimization

  1. Performance tracking (a minimal instrumentation sketch follows this list):

    • Request execution time
    • Ratio of successful/unsuccessful extractions
    • Memory and CPU usage
  2. Efficiency analysis:

    • Cache hit rate
    • Average request processing time
    • System load during peak times
  3. Automatic optimization:

    • Dynamic timeout adjustment based on load
    • Automatic switching between extraction methods
    • Load prediction and peak prevention
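
A minimal instrumentation sketch for the performance-tracking point, assuming the APCu extension is available for counters:

php
function extractWithMetrics(HybridImageExtractor $extractor, $url) {
    $start = microtime(true);
    $result = $extractor->extract($url);
    $elapsedMs = (microtime(true) - $start) * 1000;
    
    // Track success/failure counts and log slow extractions
    $counter = $result ? 'og_extract_success' : 'og_extract_failure';
    apcu_add($counter, 0); // ensure the counter exists
    apcu_inc($counter);
    
    if ($elapsedMs > 1000) {
        error_log(sprintf('Slow og:image extraction for %s: %.0f ms', $url, $elapsedMs));
    }
    
    return $result;
}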

Conclusion

An optimal system for extracting og:image meta tags should be based on a multi-level architecture using caching, specialized services, and asynchronous processing. Key recommendations:

  1. Implement multi-level caching using Redis for frequently requested URLs and file-based caching for other cases.

  2. Use specialized services as the primary data extraction method, with fallback to optimized parsers for cases where services are unavailable.

  3. Implement asynchronous processing using queues to prevent blocking the main execution thread and distribute load evenly.

  4. Optimize network requests by limiting response sizes, setting reasonable timeouts, and using parallel processing.

  5. Add monitoring and logging to track performance and promptly detect issues.

This approach will enable efficient processing of thousands of articles with minimal server load and ensure fast response times for the user interface when loading preview images for the news section.