I’m encountering an IndexError when trying to fine-tune the Qwen 2.5 3B VLM model on the Trelis/chess_pieces dataset from Hugging Face. The error occurs at the trainer.train() line with the following traceback:
IndexError: index 2 is out of bounds for dimension 0 with size 2
The error happens in the processing_qwen2_5_vl.py file at line 177:
num_image_tokens = image_grid_thw[index].prod() // merge_length
Here’s my collate function:
def collate_fn(samples):
texts = []
images = []
for sample in samples:
prompt_text = processor.apply_chat_template(sample['text'], tokenize=False)
texts.append(prompt_text)
images.append(sample['image'])
inputs = processor(
text=texts,
images=images,
return_tensors='pt',
padding=True
)
labels=inputs['input_ids'].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
inputs['labels'] = labels
return inputs
I’ve tried various solutions but haven’t been able to resolve this indexing issue. Can someone help me identify what’s causing this error and how to fix it when fine-tuning Qwen 2.5 3B VLM on image datasets?
The IndexError you’re encountering with Qwen 2.5 3B VLM fine-tuning is a common issue related to the image_grid_thw tensor dimension mismatch. This error typically occurs when the model expects a specific tensor structure but receives data with incompatible dimensions, causing the indexing operation to fail.
Contents
- Understanding the Error Origin
- Root Causes Analysis
- Solutions and Fixes
- Prevention Strategies
- Debugging Approach
- Complete Implementation Example
Understanding the Error Origin
The error index 2 is out of bounds for dimension 0 with size 2 occurs in the processing script at line 177:
num_image_tokens = image_grid_thw[index].prod() // merge_length
As noted in the GitHub issue, this happens because the image_grid_thw tensor doesn’t have the expected shape. The model expects image_grid_thw to contain grid dimensions in the format (T, H, W), but your tensor might only have 2 elements instead of the expected 3.
Root Causes Analysis
1. Missing Image Grid Dimensions
The image_grid_thw tensor is crucial for Qwen2.5-VL as it contains the spatial grid dimensions that the model uses to process images. When this tensor is missing or incorrectly shaped, the indexing operation fails.
2. Collate Function Limitations
Your current collate function doesn’t explicitly handle image_grid_thw:
def collate_fn(samples):
texts = []
images = []
for sample in samples:
prompt_text = processor.apply_chat_template(sample['text'], tokenize=False)
texts.append(prompt_text)
images.append(sample['image'])
inputs = processor(
text=texts,
images=images,
return_tensors='pt',
padding=True
)
3. Image Resolution Issues
Images must be properly resized to dimensions divisible by 14 (the patch size for Qwen2.5-VL). If images have incompatible dimensions, the grid calculation will produce incorrect results.
4. Batch Processing Inconsistencies
The error might occur when processing batches with mixed image types or when some samples don’t contain valid image data.
Solutions and Fixes
Solution 1: Enhanced Collate Function with Image Grid Handling
Modify your collate function to properly handle image_grid_thw:
def collate_fn(samples):
texts = []
images = []
valid_samples = []
# Filter out invalid samples first
for sample in samples:
try:
if 'image' in sample and sample['image'] is not None:
prompt_text = processor.apply_chat_template(sample['text'], tokenize=False)
texts.append(prompt_text)
images.append(sample['image'])
valid_samples.append(sample)
except Exception as e:
print(f"Skipping invalid sample: {e}")
if not valid_samples:
raise ValueError("No valid samples found in batch")
# Process with processor
inputs = processor(
text=texts,
images=images,
return_tensors='pt',
padding=True
)
# Calculate image_grid_thw manually if not provided
if 'image_grid_thw' not in inputs:
batch_size = len(images)
image_grid_thw = []
for img in images:
# Get image dimensions
if hasattr(img, 'size'): # PIL Image
w, h = img.size
else: # Tensor
h, w = img.shape[-2:]
# Calculate grid dimensions (T=1 for single images, H//14, W//14)
grid_h = h // 14
grid_w = w // 14
image_grid_thw.append([1, grid_h, grid_w])
inputs['image_grid_thw'] = torch.tensor(image_grid_thw, dtype=torch.long)
# Set up labels
labels = inputs['input_ids'].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
inputs['labels'] = labels
return inputs
Solution 2: Image Preprocessing and Validation
Add validation and preprocessing steps:
def preprocess_image(image):
"""Ensure image has compatible dimensions for Qwen2.5-VL"""
if hasattr(image, 'size'): # PIL Image
w, h = image.size
else: # Tensor
h, w = image.shape[-2:]
# Check if dimensions are divisible by 14
if h % 14 != 0 or w % 14 != 0:
# Resize to nearest multiple of 14
new_h = ((h + 13) // 14) * 14
new_w = ((w + 13) // 14) * 14
image = image.resize((new_w, new_h)) if hasattr(image, 'resize') else F.interpolate(image.unsqueeze(0), size=(new_h, new_w)).squeeze(0)
return image
Solution 3: Batch Size and Mixed Precision Handling
# In your training script
training_args = TrainingArguments(
output_dir="./output",
per_device_train_batch_size=1, # Start with batch size 1
gradient_accumulation_steps=4,
fp16=True, # Mixed precision can help with memory issues
remove_unused_columns=False,
... # other arguments
)
# Override the data collator
data_collator = collate_fn # Your enhanced collate function
Prevention Strategies
1. Dataset Validation
Before training, validate your dataset:
def validate_dataset(dataset):
issues = []
for i, sample in enumerate(dataset):
try:
if 'image' not in sample or sample['image'] is None:
issues.append(f"Sample {i}: Missing image")
continue
img = sample['image']
if hasattr(img, 'size'):
w, h = img.size
else:
h, w = img.shape[-2:]
if h % 14 != 0 or w % 14 != 0:
issues.append(f"Sample {i}: Image dimensions {h}x{w} not divisible by 14")
except Exception as e:
issues.append(f"Sample {i}: {str(e)}")
if issues:
print("Dataset validation issues:")
for issue in issues[:10]: # Show first 10 issues
print(f" {issue}")
if len(issues) > 10:
print(f" ... and {len(issues) - 10} more issues")
else:
print("Dataset validation passed!")
return len(issues) == 0
2. Memory Management
# Use gradient checkpointing to save memory
model.gradient_checkpointing_enable()
# Enable memory-efficient attention
from transformers import AutoModelForVision2Seq
model = AutoModelForVision2Seq.from_pretrained(
"Qwen/Qwen2.5-VL-3B-Instruct",
torch_dtype=torch.float16,
device_map="auto",
use_cache=False # Disable cache for training
)
Debugging Approach
Step-by-Step Debugging
- Check batch composition:
# Add debugging to your collate function
def debug_collate_fn(samples):
print(f"Processing batch of {len(samples)} samples")
for i, sample in enumerate(samples):
if 'image' in sample:
img = sample['image']
if hasattr(img, 'size'):
print(f"Sample {i}: Image size {img.size}")
else:
print(f"Sample {i}: Tensor shape {img.shape}")
- Inspect image_grid_thw tensor:
# Add this after your collate function call
batch = collate_fn(dataset[:2]) # Test with small batch
print(f"image_grid_thw shape: {batch['image_grid_thw'].shape}")
print(f"image_grid_thw values: {batch['image_grid_thw']}")
- Check processor output:
# Test processor individually
test_samples = dataset[:2]
processor_outputs = processor(
text=[sample['text'] for sample in test_samples],
images=[sample['image'] for sample in test_samples],
return_tensors='pt'
)
print("Processor keys:", list(processor_outputs.keys()))
Complete Implementation Example
Here’s a complete working example incorporating all the fixes:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, TrainingArguments, Trainer
from torch.utils.data import DataLoader
from PIL import Image
import torchvision.transforms as transforms
# Enhanced collate function
def safe_collate_fn(samples):
texts = []
images = []
valid_indices = []
# Filter valid samples
for i, sample in enumerate(samples):
try:
if ('image' in sample and sample['image'] is not None and
'text' in sample and sample['text'] is not None):
# Validate image
img = sample['image']
if hasattr(img, 'size'):
w, h = img.size
if h % 14 != 0 or w % 14 != 0:
# Resize image
new_h = ((h + 13) // 14) * 14
new_w = ((w + 13) // 14) * 14
img = img.resize((new_w, new_h))
sample['image'] = img
texts.append(processor.apply_chat_template(sample['text'], tokenize=False))
images.append(img)
valid_indices.append(i)
except Exception as e:
print(f"Skipping sample {i}: {e}")
if not valid_indices:
raise ValueError("No valid samples in batch")
# Process with processor
inputs = processor(
text=texts,
images=images,
return_tensors='pt',
padding=True,
truncation=True,
max_length=2048
)
# Ensure image_grid_thw is present
if 'image_grid_thw' not in inputs:
batch_size = len(images)
image_grid_thw = []
for img in images:
if hasattr(img, 'size'):
w, h = img.size
else:
h, w = img.shape[-2:]
grid_h = h // 14
grid_w = w // 14
image_grid_thw.append([1, grid_h, grid_w])
inputs['image_grid_thw'] = torch.tensor(image_grid_thw, dtype=torch.long)
# Set up labels
labels = inputs['input_ids'].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
inputs['labels'] = labels
return inputs
# Model and processor setup
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
use_cache=False
)
# Training configuration
training_args = TrainingArguments(
output_dir="./chess_finetuned",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
learning_rate=2e-5,
num_train_epochs=3,
fp16=True,
logging_steps=10,
save_steps=500,
remove_unused_columns=False,
report_to="none"
)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=safe_collate_fn,
)
# Start training
trainer.train()
Sources
- vLLM Forums - IndexError: list index out of range (Qwen/Qwen2.5-VL-3B-Instruct)
- QwenLM GitHub - Fine-tuning Qwen2.5-VL-7B using Llamafactory fails with IndexError
- F22 Labs - Complete Guide to Fine-tuning Qwen2.5 VL Model
- Roboflow - How to Fine-Tune Qwen2.5-VL with a Custom Dataset
- HuggingFace Transformers - Fine tuning qwen2.5 error
Conclusion
The IndexError you’re experiencing with Qwen 2.5 3B VLM fine-tuning is primarily caused by improper handling of the image_grid_thw tensor and dimension mismatches in image processing. By implementing the enhanced collate function, adding proper image validation, and using debugging techniques, you can resolve this issue and successfully fine-tune the model on your chess pieces dataset.
Key takeaways:
- Always validate your dataset before training to ensure image dimensions are compatible
- Implement proper error handling in your collate function to skip invalid samples
- Calculate
image_grid_thwmanually if the processor doesn’t provide it - Start with smaller batch sizes and gradually increase as you validate the process
- Use mixed precision and memory optimization techniques to handle the 3B model efficiently
With these fixes, you should be able to successfully fine-tune the Qwen 2.5 3B VLM model on your chess pieces dataset without encountering IndexError issues.