NeuroAgent

Extracting Text from Low-Contrast Images: A Complete Guide

Complete guide to extracting text from low-contrast and noisy images using OpenCV and machine learning. Optimized for vector format.

How can I quickly and simply extract text from an image with a low-contrast background and noise for subsequent use in vector format, when quick selection and color-based selection methods are ineffective and I cannot find a higher quality image?

Fast and Simple Text Extraction from Images with Low-Contrast Background and Noise

Extracting text from images with low-contrast backgrounds and noise requires image preprocessing before OCR. The main steps are contrast enhancement, noise removal, adaptive binarization, and post-processing to improve recognition accuracy. Both classical OpenCV methods and neural network approaches are viable options.

Basic Preprocessing Methods for Low-Contrast Images

Effective text extraction from low-contrast, noisy images depends on thorough preprocessing. Increasing the contrast between the text and the background is a key step that significantly improves recognition quality [source 1].

The main preprocessing methods include:

  1. Contrast correction - converting color images (RGB) to black and white using various algorithms [source 1]
  2. Noise removal using Gaussian or median filters [source 3]
  3. Adaptive binarization - converting the image to binary format considering local features [source 8]
  4. Histogram equalization - improving pixel brightness distribution [source 2]

Low contrast can lead to poor OCR results, so increase contrast and density before running recognition [source 4].


Step-by-Step Text Extraction from Noisy Images

Step 1: Loading and Basic Image Processing

python
import cv2
import numpy as np

def load_and_basic_preprocess(image_path):
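    # Load image (cv2.imread returns None if the file cannot be read)
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")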
    
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    return img, gray

Step 2: Noise Removal

python
def remove_noise(image):
    # Gaussian filter for noise removal
    denoised = cv2.GaussianBlur(image, (3, 3), 0)
    
    # Alternative: median filter
    # denoised = cv2.medianBlur(image, 3)
    
    return denoised

Step 3: Contrast Enhancement

python
def enhance_contrast(image, method='clahe'):
    if method == 'linear':
        # Method 1: Linear transformation (output = alpha * input + beta, saturated to 0..255)
        return cv2.convertScaleAbs(image, alpha=1.5, beta=0)
    
    # Method 2 (default): Adaptive histogram equalization (CLAHE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(image)
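
The linear transformation uses a fixed gain (`alpha`) and offset (`beta`), which may not suit every image. One possible alternative, sketched here with NumPy only, derives both values from the image itself by stretching a chosen percentile range to the full 0..255 scale. The function name `stretch_contrast` and the 2nd/98th percentile defaults are assumptions, not values from the sources.

```python
import numpy as np

def stretch_contrast(gray, low_pct=2, high_pct=98):
    """Linearly map the [low_pct, high_pct] percentile range of a grayscale image to 0..255."""
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:
        # Completely flat image: nothing to stretch
        return gray.copy()
    stretched = (gray.astype(np.float32) - lo) * (255.0 / (hi - lo))
    return np.clip(stretched, 0, 255).astype(np.uint8)

# Hypothetical faded image whose values only span 100..140
faded = np.linspace(100, 140, 64 * 64).reshape(64, 64).astype(np.uint8)
result = stretch_contrast(faded)
print(faded.min(), faded.max(), "->", result.min(), result.max())
```

Pixels outside the percentile range are clipped, so a handful of very dark or very bright outliers does not limit the stretch.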

Step 4: Binarization

python
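def binarize_image(image, method='adaptive'):
    # Both branches use THRESH_BINARY_INV, so dark text becomes white on black
    if method == 'otsu':
        # Global thresholding with an automatically chosen (Otsu) threshold
        _, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        return binary
    
    # Adaptive thresholding (better for low-contrast images): block size 11, constant C = 2
    binary = cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_MEAN_C, 
                                  cv2.THRESH_BINARY_INV, 11, 2)
    
    return binary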

Step 5: Morphological Operations

python
def apply_morphology(image):
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    
    # Remove small noise
    cleaned = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
    
    # Enhance text
    enhanced = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
    
    return enhanced

Step 6: Text Recognition

python
import pytesseract

def extract_text(processed_image):
    # Configure Tesseract for Russian and English (both language packs must be installed)
    config = '--oem 3 --psm 6 -l rus+eng'
    
    # Text recognition
    text = pytesseract.image_to_string(processed_image, config=config)
    
    return text
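
On noisy images, some of the recognized words are likely to be garbage. pytesseract can also return per-word confidences through `image_to_data`, and low-confidence words can then be dropped. The filter below is a sketch that works on the dictionary returned with `output_type=pytesseract.Output.DICT`; the function name `confident_words`, the threshold of 60, and the sample data are assumptions for illustration.

```python
def confident_words(ocr_data, min_conf=60.0):
    """Return words from an image_to_data dictionary whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(ocr_data['text'], ocr_data['conf']):
        # conf may be an int, float, or str depending on the pytesseract version;
        # entries that are not words carry a confidence of -1
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words

# Hand-written sample shaped like the 'text' and 'conf' fields of image_to_data output
sample = {'text': ['', 'Invoice', 'N0', '42'], 'conf': ['-1', '96', '31', '88.5']}
kept = confident_words(sample)
print(kept)
```

With a real image, the dictionary would come from `pytesseract.image_to_data(processed_image, config=config, output_type=pytesseract.Output.DICT)`.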

Modern Approaches Using Machine Learning

For complex cases with low-contrast and noisy images, deep learning approaches can be used. One option is to use pre-trained networks to extract features from noisy images.

Creating Vector Features

Creating 5x5 (25-dimensional) vector features from a noisy image and extracting the target value (the cleaned pixel) from the corresponding reference image is an effective approach for training image cleaning models [source 2].
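
The 5x5 feature construction described above can be sketched with NumPy as follows. This is an illustrative version under stated assumptions, not the exact code from the source: the function name `patch_features`, the edge-replicating padding, and the scaling to 0..1 are choices made here, and the noisy/clean pair is random synthetic data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def patch_features(noisy, clean, size=5):
    """Build one (size*size)-dimensional feature row per pixel of `noisy`,
    with the matching pixel of `clean` as the regression target."""
    pad = size // 2
    padded = np.pad(noisy, pad, mode='edge')               # so border pixels get full patches
    windows = sliding_window_view(padded, (size, size))    # shape (H, W, size, size)
    features = windows.reshape(-1, size * size).astype(np.float32) / 255.0
    targets = clean.reshape(-1).astype(np.float32) / 255.0
    return features, targets

# Synthetic stand-in for a noisy scan and its clean reference
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(12, 16), dtype=np.uint8)
noise = rng.integers(-20, 21, size=clean.shape)
noisy = np.clip(clean.astype(np.int16) + noise, 0, 255).astype(np.uint8)

X, y = patch_features(noisy, clean)
print(X.shape, y.shape)
```

Each row of `X` is one flattened 5x5 neighborhood and the corresponding entry of `y` is the clean center pixel, so any regression model can be fitted on the pair.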

Using Pre-trained Networks

For detection and processing of low-contrast images, you can use:

  1. ResNet networks trained on ImageNet
  2. Simple pre-trained networks trained on MNIST/EMNIST
  3. Extracting and merging flattened weight vectors at the end of the network [source 5]

Full Pipeline Using Neural Networks

python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def create_denoising_model():
    model = models.Sequential([
        layers.Input(shape=(None, None, 1)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')
    ])
    
    model.compile(optimizer='adam', loss='mse')
    return model

def preprocess_for_ml(image):
    # Normalization
    image = image.astype('float32') / 255.0
    
    # Add channel dimension
    image = np.expand_dims(image, axis=-1)
    
    return image

Practical Python Code Examples

Comprehensive Example of Processing Low-Contrast Image

python
import cv2
import numpy as np
import pytesseract
import re

def full_ocr_pipeline(image_path):
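    # Load image (cv2.imread returns None if the file cannot be read)
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)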
    
    # Step 1: Noise removal
    denoised = cv2.GaussianBlur(gray, (3, 3), 0)
    
    # Step 2: Contrast enhancement
    enhanced = cv2.convertScaleAbs(denoised, alpha=1.5, beta=10)
    
    # Step 3: Adaptive binarization
    binary = cv2.adaptiveThreshold(enhanced, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                                  cv2.THRESH_BINARY_INV, 11, 2)
    
    # Step 4: Morphological operations
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    
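    # Step 5: Text recognition. Tesseract 4+ works best with dark text on a light
    # background, so the white-on-black result is inverted before OCR.
    config = '--oem 3 --psm 6 -l rus+eng'
    text = pytesseract.image_to_string(cv2.bitwise_not(cleaned), config=config)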
    
    return {
        'original': img,
        'processed': cleaned,
        'text': clean_text(text)
    }

def clean_text(text):
    # Clean common OCR errors
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^\w\s,.!?;:()"\']', '', text)  # Preserve common punctuation
    return text.strip()

# Example usage
result = full_ocr_pipeline('low_contrast_image.jpg')
print(result['text'])

Detecting Low-Contrast Images

python
def is_low_contrast(image, fraction_threshold=0.05):
    """Determine if an image has low contrast"""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    std = np.std(gray)
    return std < fraction_threshold * 255

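def basic_ocr_pipeline(image_path):
    # Minimal fallback (an assumption): grayscale conversion followed directly by OCR
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    text = pytesseract.image_to_string(gray, config='--oem 3 --psm 6 -l rus+eng')
    return {'original': gray, 'processed': gray, 'text': clean_text(text)}

def detect_and_process_low_contrast(image_path):
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    
    if is_low_contrast(img):
        print("Low-contrast image detected. Applying enhanced processing...")
        return full_ocr_pipeline(image_path)
    else:
        print("Image has sufficient contrast.")
        return basic_ocr_pipeline(image_path)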
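
The standard-deviation check above is one way to measure contrast. Another option, similar in spirit to the percentile-based check in scikit-image (`skimage.exposure.is_low_contrast`), looks at the spread between a low and a high percentile of brightness. The NumPy-only sketch below is an approximation of that idea, not the library's code; the function name and the test images are assumptions.

```python
import numpy as np

def is_low_contrast_percentile(gray, fraction_threshold=0.05, lower_pct=1, upper_pct=99):
    """Flag an 8-bit grayscale image whose 1st-to-99th percentile brightness range
    covers less than fraction_threshold of the full 0..255 scale."""
    lo, hi = np.percentile(gray, [lower_pct, upper_pct])
    return bool((hi - lo) / 255.0 < fraction_threshold)

# Nearly flat image with a single bright outlier, and a full-range ramp
flat = np.full((50, 50), 120, dtype=np.uint8)
flat[0, 0] = 255
ramp = np.tile(np.arange(0, 250, 5, dtype=np.uint8), (50, 1))

print(is_low_contrast_percentile(flat), is_low_contrast_percentile(ramp))
```

Because percentiles ignore the extreme tails, a few stray bright or dark pixels do not hide an otherwise flat image.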

Optimization for Vector Format

After extracting text from an image, it needs to be prepared for use as a vector representation (for example TF-IDF or word embeddings). This includes the following stages:

1. Text Cleaning and Normalization

python
import re

def clean_text_for_vector(text):
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra spaces
    text = ' '.join(text.split())
    
    return text

2. Conversion to Vector Formats

python
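from sklearn.feature_extraction.text import TfidfVectorizer

def text_to_vector_formats(text):
    # TF-IDF vectorization. Fitting on a single document makes the IDF part
    # uninformative; fit on the whole corpus of extracted texts when one is available.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text])
    
    # Word2Vec (optional; `model` must be a trained gensim Word2Vec model loaded beforehand)
    # from gensim.models import Word2Vec
    # tokens = text.split()
    # word_vectors = model.wv[tokens]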
    
    return {
        'tfidf': tfidf_matrix,
        'text': text,
        'tokens': text.split()
    }
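
Fitting TF-IDF on a single text has a limitation that a small worked example makes visible. The sketch below uses the smoothed IDF formula that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`, in plain Python. The function name `smooth_idf` and the token lists are made up for illustration.

```python
import math

def smooth_idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

# One OCR result on its own, and the same result inside a small corpus
single = [['invoice', 'total', 'total']]
corpus = [['invoice', 'total', 'total'], ['invoice', 'date'], ['invoice', 'address']]

single_idf = {term: smooth_idf(term, single) for term in set(single[0])}
corpus_idf = {term: smooth_idf(term, corpus) for term in set(corpus[0])}
print(single_idf)
print(corpus_idf)
```

With one document every term receives the same IDF, so the result reduces to normalized term counts. With several documents, a term that appears in only one of them ("total") is weighted above a term that appears in all of them ("invoice").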

3. Saving in Various Vector Formats

python
import json
import pandas as pd
from scipy.sparse import save_npz

def save_vector_formats(data, base_filename):
    # Save text and tokens as JSON (the sparse TF-IDF matrix is not JSON-serializable)
    serializable = {'text': data['text'], 'tokens': data['tokens']}
    with open(f'{base_filename}.json', 'w', encoding='utf-8') as f:
        json.dump(serializable, f, ensure_ascii=False, indent=2)
    
    # Save as CSV
    df = pd.DataFrame([{'text': data['text'], 'tokens': ' '.join(data['tokens'])}])
    df.to_csv(f'{base_filename}.csv', index=False, encoding='utf-8')
    
    # Save the sparse TF-IDF matrix in SciPy's .npz format
    save_npz(f'{base_filename}_tfidf.npz', data['tfidf'])

Sources

  1. Improve OCR Accuracy With Advanced Image Preprocessing - DocParser
  2. Using Machine Learning to Denoise Images for Better OCR Accuracy - PyImageSearch
  3. Image processing to improve tesseract OCR accuracy - Stack Overflow
  4. Improve OCR accuracy using advanced preprocessing techniques - Nitor Infotech
  5. How to verify if the image contains noise in background before 'OCR’ing - Stack Overflow
  6. How to Use Image Preprocessing to Improve the Accuracy of Tesseract OCR - Bomberbot
  7. How do you handle noise, distortion, and background in OCR images? - LinkedIn
  8. OCR Tools — Solving Real-World Problems with Low-Quality Images - Medium
  9. Unleashing the Power of EasyOCR in Python: A Comprehensive Guide - CodeRivers
  10. Detecting low contrast images with OpenCV, scikit-image, and Python - GeeksforGeeks
  11. Changing the contrast and brightness of an image! - OpenCV Documentation
  12. How to change the contrast and brightness of an image using OpenCV in Python - Tutorialspoint

Conclusion

For fast and simple text extraction from images with low-contrast background and noise, the following approach is recommended:

  1. Use comprehensive preprocessing with contrast enhancement, noise removal, and adaptive binarization
  2. Apply OpenCV and pytesseract for quick implementation without needing deep learning
  3. Optimize processing parameters depending on the specific type of noise
  4. Automate detection of low-contrast images for applying enhanced methods
  5. Systematize the conversion process to vector formats for further use

For complex cases, machine learning methods can be considered, but the proposed OpenCV-based pipeline already provides good results for most practical tasks of text extraction from low-quality images.