How can I quickly and simply extract text from an image with a low-contrast background and noise for subsequent use in vector format, when quick selection and color-based selection methods are ineffective and I cannot find a higher quality image?
Fast and Simple Text Extraction from Images with Low-Contrast Background and Noise
Extracting text from images with low-contrast backgrounds and noise requires specialized image preprocessing methods before using OCR. The main approaches include contrast enhancement, noise removal, adaptive binarization, and post-processing to improve recognition accuracy, with both classical OpenCV methods and modern neural network approaches being viable options.
Contents
- Basic Preprocessing Methods for Low-Contrast Images
- Step-by-Step Text Extraction from Noisy Images
- Modern Approaches Using Machine Learning
- Practical Python Code Examples
- Optimization for Vector Format
Basic Preprocessing Methods for Low-Contrast Images
Text on a low-contrast, noisy background needs several preprocessing passes before OCR can read it reliably. Increasing the contrast between the text and its background is the key step and significantly improves recognition quality [source 1].
The main preprocessing methods include:
- Contrast correction - converting color images (RGB) to black and white using various algorithms [source 1]
- Noise removal using Gaussian or median filters [source 3]
- Adaptive binarization - converting the image to binary format considering local features [source 8]
- Histogram equalization - improving pixel brightness distribution [source 2]
Low contrast is a common cause of poor OCR results, so increase contrast and density before running recognition [source 4].
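To make the histogram equalization item above concrete, here is a minimal NumPy-only sketch of the global variant. The function name `equalize_histogram` and the synthetic test image are illustrative assumptions, not taken from the sources; in an OpenCV pipeline you would more likely call `cv2.equalizeHist` or use CLAHE.

```python
import numpy as np

def equalize_histogram(gray):
    """Global histogram equalization for an 8-bit grayscale image (NumPy only)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    total = cdf[-1]
    if total == cdf_min:               # flat image: nothing to stretch
        return gray.copy()
    # Build a lookup table that makes the output CDF approximately linear
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# Synthetic low-contrast image: values confined to 100..139 (assumption for the demo)
rng = np.random.default_rng(0)
low = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
eq = equalize_histogram(low)
print(low.min(), low.max(), "->", eq.min(), eq.max())
```

The darkest level present is mapped to 0 and the brightest to 255, so the narrow band of gray values is spread over the full range. Noise is stretched along with the text, which is why denoising usually comes first.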
Step-by-Step Text Extraction from Noisy Images
Step 1: Loading and Basic Image Processing
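import cv2
import numpy as np

def load_and_basic_preprocess(image_path):
    # Load image (cv2.imread returns None instead of raising if the file cannot be read)
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return img, gray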
Step 2: Noise Removal
def remove_noise(image):
    # Gaussian filter for noise removal
    denoised = cv2.GaussianBlur(image, (3, 3), 0)
    # Alternative: median filter
    # denoised = cv2.medianBlur(image, 3)
    return denoised
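The choice between the two filters matters for isolated bright or dark specks. The sketch below is a NumPy-only 3x3 median filter, written to show why a median removes such outliers outright where a Gaussian blur only smears them. `median_filter_3x3` and the test patch are illustrative assumptions; `cv2.medianBlur` is the practical choice.

```python
import numpy as np

def median_filter_3x3(gray):
    """3x3 median filter with edge replication (NumPy-only sketch)."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the nine shifted views of the image, one per neighbourhood position
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0).astype(gray.dtype)

# Flat gray patch with one "salt" and one "pepper" pixel
patch = np.full((7, 7), 120, dtype=np.uint8)
patch[2, 3] = 255
patch[4, 1] = 0
filtered = median_filter_3x3(patch)
print(filtered.min(), filtered.max())
```

Every 3x3 window here contains at most two outliers among nine values, so the median stays at the background level and both specks disappear.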
Step 3: Contrast Enhancement
def enhance_contrast(image, method="clahe"):
    if method == "linear":
        # Method 1: Linear transformation (output = |alpha * pixel + beta|, saturated to 0..255)
        return cv2.convertScaleAbs(image, alpha=1.5, beta=0)
    # Method 2: Contrast-limited adaptive histogram equalization (CLAHE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(image)
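The fixed `alpha=1.5` in the linear method is an arbitrary starting value. One possible way to choose the gain and offset from the image itself is a percentile-based stretch, sketched here in plain NumPy. The function name, the 1st/99th percentile defaults, and the synthetic image are assumptions made for illustration.

```python
import numpy as np

def stretch_contrast(gray, low_pct=1.0, high_pct=99.0):
    """Choose alpha/beta so the given percentiles map to 0 and 255."""
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:                       # flat image: leave unchanged
        return gray.copy(), 1.0, 0.0
    alpha = float(255.0 / (hi - lo))   # gain
    beta = float(-lo * alpha)          # offset
    out = np.clip(np.rint(alpha * gray.astype(np.float32) + beta), 0, 255)
    return out.astype(np.uint8), alpha, beta

# Synthetic faint image with values between 110 and 149
rng = np.random.default_rng(1)
faint = rng.integers(110, 150, size=(50, 50)).astype(np.uint8)
stretched, alpha, beta = stretch_contrast(faint)
print(f"alpha={alpha:.2f}, beta={beta:.1f}, range {stretched.min()}..{stretched.max()}")
```

The returned `alpha` and `beta` can be passed to `cv2.convertScaleAbs` if you prefer to stay inside OpenCV.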
Step 4: Binarization
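def binarize_image(image, adaptive=True):
    if not adaptive:
        # Global thresholding (Otsu picks one threshold for the whole image)
        _, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        return binary
    # Adaptive thresholding (better for low-contrast images);
    # THRESH_BINARY_INV turns dark text into white pixels on a black background
    return cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY_INV, 11, 2)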
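To show what the block size of 11 and the constant 2 do, the following NumPy-only sketch approximates the mean-based, inverted adaptive threshold: a pixel becomes foreground when it is darker than the mean of its 11x11 neighbourhood minus 2. The function name and the synthetic gradient image are assumptions, and rounding and border handling may differ slightly from `cv2.adaptiveThreshold`.

```python
import numpy as np

def adaptive_mean_threshold_inv(gray, block_size=11, c=2):
    """Mark pixels darker than (local mean - c) as foreground (255)."""
    k = block_size
    r = k // 2
    padded = np.pad(gray.astype(np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    # Sum over each k x k window from four corner lookups
    window_sum = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
                  - integral[k:k + h, :w] + integral[:h, :w])
    local_mean = window_sum / (k * k)
    return np.where(gray <= local_mean - c, 255, 0).astype(np.uint8)

# Background brightens from left to right; a faint "stroke" is 20 levels darker
h, w = 40, 120
img = np.tile(np.linspace(90, 200, w), (h, 1))
img[18:22, 10:110] -= 20
img = img.astype(np.uint8)
mask = adaptive_mean_threshold_inv(img)
print(mask[18:22, 10:110].min(), mask[:10].max())
```

In this image the right end of the stroke (around 170) is brighter than the left end of the background (90), so no single global threshold can separate them. The local comparison still can.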
Step 5: Morphological Operations
def apply_morphology(image):
    # Assumes white text on a black background (the output of THRESH_BINARY_INV)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    # Opening removes small white specks
    cleaned = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
    # Closing fills small gaps inside the text strokes
    enhanced = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
    return enhanced
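For readers who want to see what an opening does, here is a NumPy-only sketch of erosion followed by dilation with a square kernel. It uses a 3x3 kernel instead of the 2x2 one above, purely to keep the window centered; the helper names are illustrative assumptions, not OpenCV functions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode(mask, k=3):
    # A pixel survives only if every pixel in its k x k window is foreground
    padded = np.pad(mask, k // 2, mode="constant", constant_values=255)
    return sliding_window_view(padded, (k, k)).min(axis=(2, 3))

def dilate(mask, k=3):
    # A pixel becomes foreground if any pixel in its k x k window is foreground
    padded = np.pad(mask, k // 2, mode="constant", constant_values=0)
    return sliding_window_view(padded, (k, k)).max(axis=(2, 3))

def opening(mask, k=3):
    return dilate(erode(mask, k), k)

mask = np.zeros((20, 20), dtype=np.uint8)
mask[5:11, 4:16] = 255      # a thick "stroke"
mask[2, 2] = 255            # isolated noise speck
opened = opening(mask)
print(np.count_nonzero(mask), "->", np.count_nonzero(opened))
```

The speck disappears and the thick stroke is restored to its original size. Anything thinner than the kernel is removed as well, so the kernel must stay smaller than the thinnest stroke of the text.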
Step 6: Text Recognition
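import pytesseract

def extract_text(processed_image):
    # Configure Tesseract: default engine, single uniform block of text, Russian + English
    # (the rus and eng language data must be installed)
    config = '--oem 3 --psm 6 -l rus+eng'
    # Tesseract 4+ generally works best with dark text on a light background,
    # so invert masks produced with THRESH_BINARY_INV before passing them in
    text = pytesseract.image_to_string(processed_image, config=config)
    return text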
Modern Approaches Using Machine Learning
When classical filtering is not enough, learned models can help: a model can be trained to denoise the image before OCR [source 2], and pre-trained networks can be used to extract features from noisy images [source 5].
Creating Vector Features
An effective approach for training image-cleaning models is to build 25-dimensional feature vectors from 5x5 pixel windows of the noisy image and take the target value (the cleaned pixel) from the corresponding reference image [source 2].
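The 5x5 window idea can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function name, edge padding, scaling to 0..1, and the synthetic image pair are mine, and the source does not prescribe a particular regressor to train on the result.

```python
import numpy as np

def extract_patch_features(noisy, clean, size=5):
    """Return (X, y): flattened size x size windows and the clean center pixels."""
    r = size // 2
    padded = np.pad(noisy.astype(np.float32) / 255.0, r, mode="edge")
    # windows has shape (H, W, size, size): one window per original pixel
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    X = windows.reshape(-1, size * size)
    y = clean.astype(np.float32).ravel() / 255.0
    return X, y

# Synthetic pair: a dark bar on white, plus Gaussian noise (assumption for the demo)
clean = np.full((32, 32), 255, dtype=np.uint8)
clean[14:18, 4:28] = 0
rng = np.random.default_rng(0)
noisy = np.clip(clean + rng.normal(0, 25, clean.shape), 0, 255).astype(np.uint8)
X, y = extract_patch_features(noisy, clean)
print(X.shape, y.shape)
```

Each row of `X` is one 25-dimensional feature vector and the matching entry of `y` is the clean value of the window's center pixel, so any regression model can be fit on the pair and then applied pixel by pixel to a new noisy image.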
Using Pre-trained Networks
For detection and processing of low-contrast images, you can use:
- ResNet networks trained on ImageNet
- Simple pre-trained networks trained on MNIST/EMNIST
- Extracting and merging flattened weight vectors at the end of the network [source 5]
Full Pipeline Using Neural Networks
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def create_denoising_model():
    # Small fully convolutional network. It is untrained: fit it on pairs of
    # noisy and clean images before using it to denoise anything.
    model = models.Sequential([
        layers.Input(shape=(None, None, 1)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

def preprocess_for_ml(image):
    # Normalization to the 0..1 range (matches the sigmoid output)
    image = image.astype('float32') / 255.0
    # Add channel dimension: (H, W) -> (H, W, 1)
    image = np.expand_dims(image, axis=-1)
    return image
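The model above needs pairs of noisy and clean images to train on. If no real pairs are available, one option is to synthesize them from clean text images. The sketch below is a NumPy-only illustration; the function name, the Gaussian noise model, and all parameter values are assumptions, and real scan noise may look quite different.

```python
import numpy as np

def make_training_patches(clean, patch=32, n=64, noise_sigma=0.1, seed=0):
    """Cut random patches from a clean image and add Gaussian noise to copies."""
    rng = np.random.default_rng(seed)
    clean = clean.astype(np.float32) / 255.0
    h, w = clean.shape
    ys = rng.integers(0, h - patch + 1, size=n)
    xs = rng.integers(0, w - patch + 1, size=n)
    targets = np.stack([clean[y:y + patch, x:x + patch] for y, x in zip(ys, xs)])
    noisy = np.clip(targets + rng.normal(0.0, noise_sigma, targets.shape), 0.0, 1.0)
    # Add the trailing channel axis expected by Conv2D layers: (n, patch, patch, 1)
    return noisy[..., None].astype(np.float32), targets[..., None]

# Synthetic "page": white background with one dark bar
page = np.full((64, 64), 255, dtype=np.uint8)
page[20:24, 8:56] = 0
noisy, targets = make_training_patches(page)
print(noisy.shape, targets.shape)
# With TensorFlow installed, these arrays could be passed to
# create_denoising_model().fit(noisy, targets, ...)
```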
Practical Python Code Examples
Comprehensive Example of Processing Low-Contrast Image
import cv2
import numpy as np
import pytesseract
import re

def full_ocr_pipeline(image_path):
    # Load image (cv2.imread returns None if the file cannot be read)
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Step 1: Noise removal
    denoised = cv2.GaussianBlur(gray, (3, 3), 0)
    # Step 2: Contrast enhancement
    enhanced = cv2.convertScaleAbs(denoised, alpha=1.5, beta=10)
    # Step 3: Adaptive binarization (text becomes white on black)
    binary = cv2.adaptiveThreshold(enhanced, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    # Step 4: Morphological operations
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    # Step 5: Text recognition. Tesseract 4+ generally works best with dark
    # text on a light background, so invert the mask first.
    ocr_input = cv2.bitwise_not(cleaned)
    config = '--oem 3 --psm 6 -l rus+eng'
    text = pytesseract.image_to_string(ocr_input, config=config)
    return {
        'original': img,
        'processed': cleaned,
        'text': clean_text(text)
    }

def clean_text(text):
    # Clean common OCR errors
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    text = re.sub(r'[^\w\s,.!?;:()"\']', '', text)  # Preserve common punctuation
    return text.strip()

# Example usage
result = full_ocr_pipeline('low_contrast_image.jpg')
print(result['text'])
Detecting Low-Contrast Images
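def is_low_contrast(image, fraction_threshold=0.05):
    """Determine if an image has low contrast"""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    std = np.std(gray)
    return std < fraction_threshold * 255

def detect_and_process_low_contrast(image_path):
    # Uses full_ocr_pipeline, clean_text, and the imports from the comprehensive example
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    if is_low_contrast(img):
        print("Low-contrast image detected. Applying enhanced processing...")
        return full_ocr_pipeline(image_path)
    print("Image has sufficient contrast.")
    # The original called an undefined basic_ocr_pipeline; running Tesseract
    # directly on the grayscale image is used here instead
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray, config='--oem 3 --psm 6 -l rus+eng')
    return {'original': img, 'processed': gray, 'text': clean_text(text)}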
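The standard-deviation test above can be fooled by a small number of extreme pixels. An alternative, modeled loosely on the percentile-range check in scikit-image's `is_low_contrast`, is sketched here in NumPy. The function name and the synthetic image are assumptions, and the sketch expects an already-grayscale 8-bit array.

```python
import numpy as np

def is_low_contrast_percentile(gray, fraction_threshold=0.05,
                               lower_percentile=1, upper_percentile=99):
    """Low contrast if the 1st..99th percentile range covers under 5% of 0..255."""
    lo, hi = np.percentile(gray, [lower_percentile, upper_percentile])
    return (hi - lo) / 255.0 < fraction_threshold

# Flat gray image with a few black and white specks (0.8% of pixels each)
img = np.full((100, 100), 128, dtype=np.uint8)
img[0, :80] = 0
img[-1, -80:] = 255

std_says_low = np.std(img) < 0.05 * 255
pct_says_low = is_low_contrast_percentile(img)
print("std-based:", std_says_low, "percentile-based:", pct_says_low)
```

The specks inflate the standard deviation enough that the std-based test reports sufficient contrast, while the percentile range ignores them and still flags the image as low contrast.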
Optimization for Vector Format
Once the text has been extracted, it can be prepared for use as numeric feature vectors, which is the sense of "vector format" used in this section. This includes the following stages:
1. Text Cleaning and Normalization
import re

def clean_text_for_vector(text):
    # Remove special characters (\w is Unicode-aware, so Cyrillic letters are kept)
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra spaces
    text = ' '.join(text.split())
    return text
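One side effect of deleting punctuation outright is that hyphenated words are fused: "low-contrast" becomes "lowcontrast". If that is not wanted, a possible variant replaces punctuation with a space instead. The function name and sample string below are illustrative assumptions.

```python
import re

def clean_text_keep_word_breaks(text):
    """Variant: replace punctuation with a space so 'low-contrast' stays two tokens."""
    text = re.sub(r'[^\w\s]', ' ', text)
    return ' '.join(text.lower().split())

sample = "Low-contrast  ТЕКСТ, scanned!"
cleaned_sample = clean_text_keep_word_breaks(sample)
print(cleaned_sample)  # low contrast текст scanned
```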
2. Conversion to Vector Formats
def text_to_vector_formats(text):
    # TF-IDF vectorization. Note: fitting on a single document makes every IDF
    # weight identical; fit on a collection of documents for meaningful weights.
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text])
    # Word2Vec (assumes `model` is an already-loaded gensim Word2Vec model)
    # from gensim.models import Word2Vec
    # tokens = text.split()
    # word_vectors = model.wv[tokens]
    return {
        'tfidf': tfidf_matrix,
        'text': text,
        'tokens': text.split()
    }
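To see why TF-IDF needs more than one document, here is a stdlib-only sketch of the computation. It uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is believed to match scikit-learn's default settings; the whitespace tokenization, function name, and sample documents are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2 normalization over whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = ["invoice total due", "invoice number", "total amount due"]
single = tfidf(["invoice total due"])[0]
vectors = tfidf(docs)
print(single)
print(vectors[1])
```

With a single document, every term gets the same weight, so the vector carries no more information than normalized term counts. With several documents, the rarer term "number" outweighs the common term "invoice".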
3. Saving in Various Vector Formats
def save_vector_formats(data, base_filename):
    import json
    import pandas as pd
    from scipy.sparse import save_npz
    # Save as JSON (text fields only; the sparse TF-IDF matrix is not JSON-serializable)
    json_data = {'text': data['text'], 'tokens': data['tokens']}
    with open(f'{base_filename}.json', 'w', encoding='utf-8') as f:
        json.dump(json_data, f, ensure_ascii=False, indent=2)
    # Save as CSV
    df = pd.DataFrame([{'text': data['text'], 'tokens': ' '.join(data['tokens'])}])
    df.to_csv(f'{base_filename}.csv', index=False, encoding='utf-8')
    # Save vector representations
    save_npz(f'{base_filename}_tfidf.npz', data['tfidf'])
Sources
- Improve OCR Accuracy With Advanced Image Preprocessing - DocParser
- Using Machine Learning to Denoise Images for Better OCR Accuracy - PyImageSearch
- Image processing to improve tesseract OCR accuracy - Stack Overflow
- Improve OCR accuracy using advanced preprocessing techniques - Nitor Infotech
- How to verify if the image contains noise in background before 'OCR’ing - Stack Overflow
- How to Use Image Preprocessing to Improve the Accuracy of Tesseract OCR - Bomberbot
- How do you handle noise, distortion, and background in OCR images? - LinkedIn
- OCR Tools — Solving Real-World Problems with Low-Quality Images - Medium
- Unleashing the Power of EasyOCR in Python: A Comprehensive Guide - CodeRivers
- Detecting low contrast images with OpenCV, scikit-image, and Python - GeeksforGeeks
- Changing the contrast and brightness of an image! - OpenCV Documentation
- How to change the contrast and brightness of an image using OpenCV in Python - Tutorialspoint
Conclusion
For fast and simple text extraction from images with low-contrast background and noise, the following approach is recommended:
- Use comprehensive preprocessing with contrast enhancement, noise removal, and adaptive binarization
- Apply OpenCV and pytesseract for quick implementation without needing deep learning
- Optimize processing parameters depending on the specific type of noise
- Automate detection of low-contrast images for applying enhanced methods
- Systematize the conversion process to vector formats for further use
For complex cases, machine learning methods can be considered, but the OpenCV-based pipeline above is usually a sufficient starting point for extracting text from low-quality images, provided its parameters are tuned to the image at hand.