Question

What’s the optimal way to extract og:image meta tags for a list of articles from different websites? I need to build an “About Us” news section listing articles in which our company is mentioned. For each article, a preview image should load automatically (the way Telegram, VK, and other social networks do it). My current method, a PHP function that parses each page’s HTML looking for og:image, is too resource-intensive when processing a large number of articles. I’m considering saving the image once, when the article entry is created. What alternative solutions exist for efficiently obtaining og:image images?

NeuroAgent

Optimal Extraction of og:image Meta Tags for Large-Scale Article Processing

Efficiently extracting og:image meta tags for a large number of articles comes down to caching, smarter extraction methods, and minimizing server load. The main recommendations are a multi-level caching system, specialized Open Graph extraction services, and asynchronous batch processing so that page rendering is never blocked.


Analysis of Current Approach

The current method, which relies on parsing HTML code from each page to find the og:image meta tag, is indeed inefficient when processing a large number of articles for several reasons:

  1. High CPU load: HTML parsing requires significant computational resources, especially when processing multiple pages simultaneously.

  2. Execution time limitations: Most hosting providers limit maximum script execution time, making it impossible to process hundreds or thousands of pages in a single request.

  3. HTML structure dependency: Different websites use different structures and formats for meta tags, requiring complex processing and regular code adaptation.

  4. Network delays: Each request to an external site takes time to establish a connection and retrieve data, which multiplies when processing a large number of URLs.

Example: fetching 100 articles sequentially at roughly 3 seconds each takes about 5 minutes, which is unacceptable for a user-facing page.


Data Caching Strategies

Multi-level Caching

Implement a multi-level caching system to minimize repeated requests to external resources:

php
class OpenGraphCache {
    private $cacheDir;
    private $cacheLifetime;
    
    public function __construct($cacheDir = '/tmp/og_cache', $cacheLifetime = 86400) {
        $this->cacheDir = $cacheDir;
        $this->cacheLifetime = $cacheLifetime;
        if (!file_exists($this->cacheDir)) {
            mkdir($this->cacheDir, 0755, true);
        }
    }
    
    public function get($url) {
        $cacheFile = $this->getCacheFileName($url);
        
        if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $this->cacheLifetime) {
            return json_decode(file_get_contents($cacheFile), true);
        }
        
        return false;
    }
    
    public function set($url, $data) {
        $cacheFile = $this->getCacheFileName($url);
        file_put_contents($cacheFile, json_encode($data));
    }
    
    private function getCacheFileName($url) {
        return $this->cacheDir . '/' . md5($url) . '.json';
    }
}
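
Usage is straightforward; extractOgData below is a hypothetical stand-in for whichever extraction method you choose:

php
$cache = new OpenGraphCache();

$data = $cache->get($url);
if ($data === false) {
    $data = extractOgData($url); // hypothetical extraction call
    $cache->set($url, $data);
}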

Types of Caching

  1. File caching: Simple to implement, suitable for small to medium projects.

  2. Database caching: More reliable solution with the ability to manage cache expiration times.

  3. Redis/Memcached: Optimal solution for high-traffic projects with minimal access latency (a minimal sketch follows this list).

  4. HTTP caching: Using Cache-Control and ETag headers for web server-level caching.
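
For option 3, a minimal Redis-backed variant might look like the sketch below. It assumes the phpredis extension is installed; the class name and key prefix are illustrative:

php
class RedisOpenGraphCache {
    private $redis;
    private $ttl;
    
    public function __construct($host = '127.0.0.1', $port = 6379, $ttl = 86400) {
        $this->redis = new Redis();
        $this->redis->connect($host, $port);
        $this->ttl = $ttl;
    }
    
    public function get($url) {
        $value = $this->redis->get('og:' . md5($url));
        return $value !== false ? json_decode($value, true) : false;
    }
    
    public function set($url, $data) {
        // setex stores the value with an expiration, so stale entries
        // disappear without a separate cleanup job
        $this->redis->setex('og:' . md5($url), $this->ttl, json_encode($data));
    }
}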

Caching Recommendations

  • Use URL hash as cache filename to prevent collisions
  • Set reasonable cache lifetimes (24-72 hours for news content)
  • Implement background cache refresh for popular articles (see the sketch after this list)
  • Add cache invalidation mechanism when data changes
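
A background refresh pass can be a short cron-driven script. The sketch below assumes each cached JSON payload stores its source url, which the file cache above would need to save:

php
// Re-fetch cache entries older than a "soft" TTL so readers keep getting
// cached data while it is refreshed in the background
function refreshStaleEntries($cacheDir, $softTtl, callable $refetch) {
    foreach (glob($cacheDir . '/*.json') as $file) {
        if (time() - filemtime($file) > $softTtl) {
            $entry = json_decode(file_get_contents($file), true);
            if (!empty($entry['url'])) {
                $refetch($entry['url']); // re-extracts and overwrites the entry
            }
        }
    }
}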

Alternative og:image Extraction Methods

Specialized Services

Use ready-made APIs for extracting Open Graph data:

php
// Example using the OpenGraph.io API; check the provider's documentation
// for the current endpoint and response format
function getOgImageViaService($url) {
    $apiKey = 'YOUR_API_KEY';
    $apiUrl = 'https://opengraph.io/api/1.1/site/' . urlencode($url) . '?app_id=' . $apiKey;
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $apiUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    
    $response = curl_exec($ch);
    curl_close($ch);
    
    $data = json_decode($response, true);
    
    if (isset($data['hybridGraph']['image'])) {
        return $data['hybridGraph']['image'];
    }
    
    return null;
}

Advantages of specialized services:

  • Extraction algorithms maintained and optimized by the provider
  • Predictable response times
  • Built-in error handling and redirect following
  • Service-side caching

Headless Browsers

For sites that inject meta tags with JavaScript, render the page with a headless browser first. The sketch below shells out to headless Chrome's --dump-dom mode; calling a Node.js Puppeteer script from PHP is another common setup:

php
// Example: render the page with headless Chrome and read og:image from the
// resulting DOM. Assumes a local Chrome/Chromium binary; adjust the command
// name for your system (e.g. chromium or google-chrome-stable)
function getOgImageWithHeadlessBrowser($url) {
    // --dump-dom prints the rendered DOM to stdout after the page loads
    $command = sprintf(
        'google-chrome --headless --disable-gpu --dump-dom %s 2>/dev/null',
        escapeshellarg($url)
    );
    $html = shell_exec($command);
    
    if (!$html) {
        return null;
    }
    
    if (preg_match('/<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']/i', $html, $matches)) {
        return $matches[1];
    }
    
    return null;
}

Proxying Through Services

Alternatively, delegate extraction to a third-party service such as:

  • Readability API: For extracting main content and images
  • Embedly API: For getting structured data from web pages
  • Iframely API: For extracting media content from various sources

Performance Optimization

Asynchronous Processing

Fetch pages concurrently with curl_multi so that network waits overlap instead of accumulating:

php
class AsyncImageFetcher {
    private $maxConcurrent = 10;
    private $timeout = 5;
    
    public function fetchMultiple($urls) {
        $urls = array_values($urls);
        $results = [];
        $handles = [];   // active cURL handles keyed by URL index
        $nextIndex = 0;
        $mh = curl_multi_init();
        
        // Fill the initial window of concurrent requests
        while ($nextIndex < count($urls) && count($handles) < $this->maxConcurrent) {
            $handles[$nextIndex] = $this->createHandle($urls[$nextIndex]);
            curl_multi_add_handle($mh, $handles[$nextIndex]);
            $nextIndex++;
        }
        
        do {
            $status = curl_multi_exec($mh, $active);
            if ($status !== CURLM_OK) {
                break;
            }
            
            // Harvest completed requests and refill the window
            while ($info = curl_multi_info_read($mh)) {
                $i = array_search($info['handle'], $handles, true);
                if ($i !== false) {
                    $response = curl_multi_getcontent($info['handle']);
                    $results[$i] = $this->processResponse($response);
                    
                    curl_multi_remove_handle($mh, $info['handle']);
                    curl_close($info['handle']);
                    unset($handles[$i]);
                    
                    // Start the next pending URL, keeping the window full
                    if ($nextIndex < count($urls)) {
                        $handles[$nextIndex] = $this->createHandle($urls[$nextIndex]);
                        curl_multi_add_handle($mh, $handles[$nextIndex]);
                        $nextIndex++;
                    }
                }
            }
            
            // Wait for activity on any handle instead of busy-looping
            if ($active) {
                curl_multi_select($mh, 1.0);
            }
        } while ($active || !empty($handles));
        
        curl_multi_close($mh);
        ksort($results);
        return $results;
    }
    
    private function createHandle($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        return $ch;
    }
    
    private function processResponse($html) {
        // Lightweight og:image lookup in the fetched HTML
        if ($html && preg_match('/<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']/i', $html, $m)) {
            return $m[1];
        }
        return null;
    }
}

Network Request Optimization

  1. HTTP/2 usage: Multiplexes multiple requests to the same domain over a single connection.

  2. Data compression: Enable gzip compression to reduce the amount of data transferred (both points are shown in the cURL snippet after this list).

  3. Timeout optimization: Setting reasonable timeout values to prevent long waits.

  4. Parallel processing: Using multithreading or asynchronous requests.
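
Points 1 and 2 map to two cURL options. HTTP/2 requires a libcurl built with HTTP/2 support; cURL silently falls back to HTTP/1.1 otherwise:

php
$ch = curl_init($url);
// Negotiate HTTP/2 over TLS where the server supports it
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2TLS);
// An empty string tells libcurl to offer every encoding it supports
// (gzip, deflate, br) and decompress the response transparently
curl_setopt($ch, CURLOPT_ENCODING, '');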

Resource-Efficient Parsers

When you only need a single tag, use targeted extraction instead of a general HTML parser:

php
// Example using a simple regular expression; [^>]+ tolerates extra
// attributes, though the pattern still assumes property precedes content
function extractOgImageWithRegex($html) {
    $pattern = '/<meta[^>]+property=["\']og:image["\'][^>]+content=["\']([^"\']+)["\']/i';
    if (preg_match($pattern, $html, $matches)) {
        return $matches[1];
    }
    return null;
}

Practical PHP Solutions

Hybrid Approach

Combine multiple methods to achieve optimal performance:

php
class HybridImageExtractor {
    private $cache;
    private $services;
    
    public function __construct() {
        $this->cache = new OpenGraphCache();
        // Thin wrapper classes (not shown) that each expose extract($url),
        // ordered from most to least reliable
        $this->services = [
            'opengraph' => new OpenGraphService(),
            'readability' => new ReadabilityService(),
            'fallback' => new RegexExtractor()
        ];
    }
    
    public function extract($url) {
        // Check cache
        $cached = $this->cache->get($url);
        if ($cached) {
            return $cached;
        }
        
        // Try extraction through services in order
        foreach ($this->services as $serviceName => $service) {
            try {
                $result = $service->extract($url);
                if ($result) {
                    $this->cache->set($url, $result);
                    return $result;
                }
            } catch (Exception $e) {
                // Log error and continue to next service
                error_log("Service {$serviceName} failed for {$url}: " . $e->getMessage());
            }
        }
        
        return null;
    }
    
    public function batchExtract($urls) {
        $results = [];
        $missingUrls = [];
        
        // Check cache for all URLs
        foreach ($urls as $url) {
            $cached = $this->cache->get($url);
            if ($cached) {
                $results[$url] = $cached;
            } else {
                $missingUrls[] = $url;
            }
        }
        
        // Process URLs missing from cache
        if (!empty($missingUrls)) {
            $newResults = $this->services['opengraph']->batchExtract($missingUrls);
            foreach ($newResults as $url => $image) {
                $results[$url] = $image;
                $this->cache->set($url, $image);
            }
        }
        
        return $results;
    }
}
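
Rendering the news section then reduces to a single batch call ($articleUrls here stands for your list of article URLs):

php
$extractor = new HybridImageExtractor();
$images = $extractor->batchExtract($articleUrls); // url => image URL or null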

Optimized Parser with a Capped Download Size

php
class OptimizedOgExtractor {
    private $maxContentLength = 102400; // 100 KB
    private $timeout = 3;
    
    public function extract($url) {
        $buffer = '';
        $limit = $this->maxContentLength;
        
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ImageExtractor/1.0)');
        // Abort the transfer once the cap is reached: og:image lives in <head>,
        // so the first 100 KB are almost always enough. Unlike CURLOPT_MAXFILESIZE,
        // this also works when the server sends no Content-Length header.
        curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $limit) {
            $buffer .= $chunk;
            return strlen($buffer) > $limit ? 0 : strlen($chunk);
        });
        
        curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        
        if ($httpCode !== 200 || $buffer === '') {
            return null;
        }
        
        // Quick search for og:image
        return $this->findOgImage($buffer);
    }
    
    private function findOgImage($html) {
        // Optimized search without full parsing
        $pos = strpos($html, 'og:image');
        if ($pos === false) {
            return null;
        }
        
        // Extract the nearest meta tag
        $metaPos = strrpos(substr($html, 0, $pos), '<meta');
        if ($metaPos === false) {
            return null;
        }
        
        $metaEnd = strpos($html, '>', $metaPos);
        if ($metaEnd === false) {
            return null;
        }
        
        $metaTag = substr($html, $metaPos, $metaEnd - $metaPos + 1);
        
        // Extract content from meta tag
        if (preg_match('/content=["\']([^"\']+)["\']/', $metaTag, $matches)) {
            return $matches[1];
        }
        
        return null;
    }
}

Integration with Processing Queues

For processing a large number of articles, use queues:

php
class ImageProcessingQueue {
    private $queue;
    
    public function __construct() {
        // RedisQueue stands in for your queue abstraction (e.g. a thin
        // wrapper over Redis lists, Beanstalkd, or RabbitMQ)
        $this->queue = new RedisQueue('image_processing');
    }
    
    public function addToQueue($articleId, $url) {
        $job = [
            'id' => uniqid('job_', true),
            'article_id' => $articleId,
            'url' => $url,
            'created_at' => time(),
            'attempts' => 0
        ];
        
        $this->queue->push($job);
    }
    
    public function processQueue() {
        while ($job = $this->queue->pop()) {
            try {
                $extractor = new OptimizedOgExtractor();
                $imageUrl = $extractor->extract($job['url']);
                
                if ($imageUrl) {
                    $this->saveImage($job['article_id'], $imageUrl);
                }
                
                $this->queue->acknowledge($job['id']);
            } catch (Exception $e) {
                $job['attempts']++;
                
                if ($job['attempts'] < 3) {
                    $this->queue->push($job);
                } else {
                    $this->queue->fail($job['id'], $e->getMessage());
                }
            }
        }
    }
}
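
A typical entry point is a small worker script run from cron or a process supervisor such as supervisord:

php
// worker.php: drains the queue in the background, outside web requests
$queue = new ImageProcessingQueue();
$queue->processQueue();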

Approach Comparison

| Approach             | Speed     | Reliability               | Implementation Complexity | Recommended Load                  |
|----------------------|-----------|---------------------------|---------------------------|-----------------------------------|
| Direct HTML parsing  | Low       | Medium                    | Low                       | Several dozen articles per hour   |
| Regular expressions  | Medium    | Low                       | Medium                    | Several hundred articles per hour |
| Specialized services | High      | High                      | Medium                    | Thousands of articles per hour    |
| Headless browsers    | Low       | High                      | High                      | Specific cases                    |
| Multi-level cache    | Very high | Depends on implementation | High                      | Optimal solution for most cases   |

Implementation Recommendations

Step-by-Step Implementation

  1. Initial phase:

    • Implement basic file system caching
    • Use optimized parsers with limited requests
    • Add error handling and timeouts
  2. System development:

    • Implement specialized services as the primary method
    • Add multi-level caching (Redis + file-based)
    • Implement asynchronous processing in batches
  3. Scaling:

    • Use processing queues for background tasks
    • Implement load balancing between multiple services
    • Add performance monitoring and logging

Performance Configuration

php
// Optimal PHP settings for image processing
ini_set('max_execution_time', 30);
ini_set('memory_limit', '256M');
ini_set('max_input_time', 30);

// Optimal cURL settings
$options = [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 3,
    CURLOPT_TIMEOUT => 5,
    CURLOPT_CONNECTTIMEOUT => 3,
    CURLOPT_SSL_VERIFYPEER => true,  // keep TLS certificate verification on;
    CURLOPT_SSL_VERIFYHOST => 2,     // disabling it exposes you to MITM attacks
    CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; NewsImageBot/1.0)',
    CURLOPT_HTTPHEADER => [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
    ]
];

Monitoring and Optimization

  1. Performance tracking (a minimal instrumentation sketch follows this list):

    • Request execution time
    • Ratio of successful/unsuccessful extractions
    • Memory and CPU usage
  2. Efficiency analysis:

    • Cache hit rate
    • Average request processing time
    • System load during peak times
  3. Automatic optimization:

    • Dynamic timeout adjustment based on load
    • Automatic switching between extraction methods
    • Load prediction and peak prevention
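
A minimal instrumentation sketch for the performance-tracking point, assuming the APCu extension is available for counters:

php
function extractWithMetrics(HybridImageExtractor $extractor, $url) {
    $start = microtime(true);
    $result = $extractor->extract($url);
    $elapsedMs = (microtime(true) - $start) * 1000;
    
    // Track success/failure counts and log slow extractions
    $counter = $result ? 'og_extract_success' : 'og_extract_failure';
    apcu_add($counter, 0); // ensure the counter exists
    apcu_inc($counter);
    
    if ($elapsedMs > 1000) {
        error_log(sprintf('Slow og:image extraction for %s: %.0f ms', $url, $elapsedMs));
    }
    
    return $result;
}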

Conclusion

An optimal system for extracting og:image meta tags should be based on a multi-level architecture using caching, specialized services, and asynchronous processing. Key recommendations:

  1. Implement multi-level caching using Redis for frequently requested URLs and file-based caching for other cases.

  2. Use specialized services as the primary data extraction method, with fallback to optimized parsers for cases where services are unavailable.

  3. Implement asynchronous processing using queues to prevent blocking the main execution thread and distribute load evenly.

  4. Optimize network requests by limiting response sizes, setting reasonable timeouts, and using parallel processing.

  5. Add monitoring and logging to track performance and promptly detect issues.

This approach will enable efficient processing of thousands of articles with minimal server load and ensure fast response times for the user interface when loading preview images for the news section.