How to fix an error when using Docling for PDF conversion?
I’m trying to use the Docling library to convert a PDF document to Markdown, but I’m getting an error. The code I’m running is based on the official documentation:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
However, when running this code, the following error occurs:
Traceback (most recent call last):
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\datamodel\document.py", line 171, in __init__
self._init_doc(backend, path_or_stream)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\datamodel\document.py", line 215, in _init_doc
self._backend = backend(self, path_or_stream=path_or_stream)
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\backend\docling_parse_v4_backend.py", line 206, in __init__
self.parser = DoclingPdfParser(loglevel="fatal")
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling_parse\pdf_parser.py", line 579, in __init__
self.parser = pdf_parser_v2(level=loglevel)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
RuntimeError: filename does not exists: C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling_parse\pdf_resources_v2/glyphs//standard/additional.dat
2025-10-28 20:18:53,881 - INFO - Going to convert document batch...
Traceback (most recent call last):
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\docling_test\docling1.py", line 10, in <module>
result = converter.convert(source)
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\pydantic\_internal\_validate_call.py", line 39, in wrapper_function
return wrapper(*args, **kwargs)
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\pydantic\_internal\_validate_call.py", line 136, in __call__
res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\document_converter.py", line 237, in convert
return next(all_res)
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\document_converter.py", line 260, in convert_all
for conv_res in conv_res_iter:
^^^^^^^^^^^^^
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\document_converter.py", line 332, in _convert
for item in map(
~~~^
process_func,
^^^^^^^^^^^^^
input_batch,
^^^^^^^^^^^^
):
^
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\document_converter.py", line 379, in _process_document
conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
File "C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling\document_converter.py", line 415, in _execute_pipeline
raise ConversionError(f"Input document {in_doc.file} is not valid.")
docling.exceptions.ConversionError: Input document 2408.09869v5.pdf is not valid.
-> Cannot close object, library is destroyed. This may cause a memory leak!
Interestingly, the same code runs successfully on another computer. What could be the problem and how can I fix it?
Problem Details:
- Library: Docling
- Operation: PDF to Markdown conversion
- Error: RuntimeError: filename does not exists and ConversionError: Input document is not valid
- Context: The code works on one computer but not on another
Possible Causes:
- Problem with Docling installation or configuration
- Missing required resources or files
- Library version conflicts
- File access permission issues
RuntimeError: filename does not exists Error in Docling
The RuntimeError: filename does not exists: C:\Users\Макс\Desktop\VS Code Projects\RAG\.venv\Lib\site-packages\docling_parse\pdf_resources_v2/glyphs//standard/additional.dat error is raised when docling-parse, the PDF parsing backend used by Docling, cannot open one of its own resource files. The traceback shows the failure inside DoclingPdfParser, before any page is read, so the problem lies in the local installation rather than in the input PDF. The ConversionError: Input document is not valid message that follows is only a consequence of that first failure.
Table of Contents
- Main Causes of the Error
- Solutions to the Problem
- Checking and Updating Dependencies
- Setting Up Offline Mode
- Alternative Conversion Approaches
- Checking Version Compatibility
- Conclusion
Main Causes of the Error
The parser cannot find its resource files in the pdf_resources_v2 directory of the installed docling_parse package. Likely causes include:
- Incomplete model installation: Docling requires downloading additional resources on first run, but the process may be interrupted due to network issues or limitations.
- Access restrictions: In some corporate networks or servers with limited internet access, downloading models from the Hugging Face Hub may be blocked.
- Version conflicts: Incompatibility between Docling, docling-parse, and other dependency versions can lead to resource loading issues.
- Access permission problems: Resource files may be missing due to write permission issues in the installation directory.
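Before trying fixes, it may help to confirm whether the file named in the traceback is really absent. The sketch below locates the installed docling_parse package without importing it and checks for the resource. The relative path is copied from the traceback and may differ between versions. The non-ASCII flag is included only as an assumption worth ruling out, since the failing path contains a Cyrillic user name; the function names are hypothetical helpers, not part of Docling.

```python
from importlib.util import find_spec
from pathlib import Path

# Relative path copied from the traceback; other docling-parse versions may differ.
RESOURCE = Path("pdf_resources_v2") / "glyphs" / "standard" / "additional.dat"


def locate_package(name: str = "docling_parse"):
    """Return the package directory without importing the package, or None."""
    spec = find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def inspect_resource(package_dir: Path, relative: Path = RESOURCE) -> dict:
    """Report whether the resource exists and whether its path is pure ASCII."""
    target = package_dir / relative
    return {
        "path": str(target),
        "exists": target.is_file(),
        # Assumption to rule out: native code may mishandle non-ASCII paths.
        "non_ascii_path": not str(target).isascii(),
    }


if __name__ == "__main__":
    pkg = locate_package()
    if pkg is None:
        print("docling_parse is not installed in this environment")
    else:
        for key, value in inspect_resource(pkg).items():
            print(f"{key}: {value}")
```

If the file exists but the path is not pure ASCII, recreating the virtual environment under an ASCII-only path (for example C:\venvs\rag) is a cheap experiment; if the file is missing, a reinstall of docling-parse is the more likely fix.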
Solutions to the Problem
1. Manual Model Download
The most reliable method is to manually download the required models:
import os
from pathlib import Path
# Create the models directory if it doesn't exist
models_dir = Path.home() / ".cache" / "docling" / "models"
models_dir.mkdir(parents=True, exist_ok=True)
# Point Docling at the local models (DOCLING_ARTIFACTS_PATH is the variable named in the Docling docs)
os.environ["DOCLING_ARTIFACTS_PATH"] = str(models_dir)
Then use the command to download models:
docling-tools models download rapidocr
2. Reinstall Docling with Cache Cleanup
A complete reinstall with cache cleanup may solve the problem:
# Remove existing installation
pip uninstall docling docling-parse -y
# Clear cache
pip cache purge
# Install the latest version
pip install "docling[rapidocr]"
3. Using an Alternative Backend
Try using a different PDF processing backend:
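from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
# Use the pypdfium2 backend instead of the default docling-parse backend
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)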
source = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source)
print(result.document.export_to_markdown())
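If several backends or converters are available, the "try another one" idea can be automated. The following is a minimal sketch with a hypothetical helper, first_successful, that takes named callables and returns the first result that does not raise. It assumes each callable raises an exception on failure, as converter.convert does in the traceback above; the callables themselves are left to the caller.

```python
from typing import Any, Callable, Iterable, Tuple


def first_successful(
    source: str,
    converters: Iterable[Tuple[str, Callable[[str], Any]]],
) -> Tuple[str, Any]:
    """Try each (name, convert) pair in order and return the first success."""
    errors = []
    for name, convert in converters:
        try:
            return name, convert(source)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all converters failed:\n" + "\n".join(errors))
```

A call might look like first_successful(source, [("default", default_converter.convert), ("pypdfium2", fallback_converter.convert)]), where both converter objects are built by the caller.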
Checking and Updating Dependencies
Ensure all dependencies are updated to compatible versions:
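# List installed versions (Linux/macOS)
pip list | grep -E "(docling|torch|numpy)"
# Same check on Windows, where grep is usually unavailable
pip list | findstr /i "docling torch numpy"
# Update main dependencies
pip install --upgrade torch numpy
pip install --upgrade docling docling-parse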
Note: PyTorch and NumPy versions must match. Older PyTorch builds (for example 2.2.x) were compiled against NumPy 1.x and can fail or emit warnings when NumPy 2.x is installed, so check the combination after upgrading.
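Because the same code works on another computer, comparing the installed versions on both machines is a direct way to spot a mismatch. The sketch below uses only the standard library and works the same on Windows and Linux; installed_versions is a hypothetical helper name, and the package list simply repeats the names used in this answer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    packages = ["docling", "docling-parse", "torch", "numpy"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in both environments and diffing the output shows at a glance whether the two machines use the same docling and docling-parse releases.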
Setting Up Offline Mode
If you’re working in an environment without internet access, you need to download all required models beforehand:
- Download models on a machine with internet:
# Download the default models; unless configured otherwise they are stored under ~/.cache/docling/models
docling-tools models download
# Also fetch the RapidOCR models if that OCR engine is used
docling-tools models download rapidocr
- Transfer models to the target machine and set the environment variable:
import os
os.environ["DOCLING_ARTIFACTS_PATH"] = "/path/to/your/models"
- Run Docling with the correctly configured model path.
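A transferred models directory that is empty or mistyped tends to produce confusing errors later, so it can be worth validating it before use. This is a minimal sketch under stated assumptions: use_local_models is a hypothetical helper, and the variable name DOCLING_ARTIFACTS_PATH follows the Docling documentation for pre-fetched models. It is passed as a parameter so it can be changed if your version reads a different one.

```python
import os
from pathlib import Path


def use_local_models(models_dir, env_var: str = "DOCLING_ARTIFACTS_PATH") -> Path:
    """Check that models_dir exists and is non-empty, then export it via env_var."""
    path = Path(models_dir).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"models directory not found: {path}")
    if not any(path.iterdir()):
        raise FileNotFoundError(f"models directory is empty: {path}")
    os.environ[env_var] = str(path)
    return path
```

To be safe, call it before Docling is imported, for example use_local_models("/path/to/your/models") at the top of the script, so the setting is in place when the library reads its configuration.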
Alternative Conversion Approaches
1. Using Docker Container
A Docker container can solve dependency and model path issues:
# Dockerfile
FROM python:3.11-slim
# Install Docling and download models at build time
RUN pip install --no-cache-dir "docling[rapidocr]" && \
    docling-tools models download rapidocr
# Run the conversion when the container starts
CMD ["python", "-c", "from docling.document_converter import DocumentConverter; print(DocumentConverter().convert('https://arxiv.org/pdf/2408.09869').document.export_to_markdown())"]
2. Conversion via CLI
Try using the Command Line Interface:
docling https://arxiv.org/pdf/2408.09869 --to md --output ./output
3. Using Alternative Libraries
If the problem persists, consider alternative approaches:
# Alternative: pdf2image + OCR
from pdf2image import convert_from_path
import pytesseract
# Convert PDF to images
images = convert_from_path("document.pdf")
# Process each image
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image, lang='rus+eng')
    print(f"Page {i+1}: {text}")
Checking Version Compatibility
Check the compatibility of component versions:
| Component | Recommended Version | Minimum Version |
|---|---|---|
| Docling | 2.3+ | 2.0+ |
| docling-parse | Latest | 1.0+ |
| PyTorch | 2.2.2+ | 2.0+ |
| NumPy | 2.0+ | 1.21+ |
To check compatibility, run:
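from importlib.metadata import PackageNotFoundError, version
# packaging is a third-party module that is normally present alongside pip
from packaging.requirements import Requirement
# Check version compatibility
required_packages = [
    'docling>=2.3',
    'docling-parse>=1.0',
    'torch>=2.0',
    'numpy>=1.21'
]
for spec in required_packages:
    req = Requirement(spec)
    try:
        installed = version(req.name)
    except PackageNotFoundError:
        print(f"✗ {req.name} not found")
        continue
    # Compare the installed version against the requirement specifier
    if req.specifier.contains(installed, prereleases=True):
        print(f"✓ {req.name} {installed} satisfies {spec}")
    else:
        print(f"✗ Version conflict: {req.name} {installed} does not satisfy {spec}")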
Conclusion
The RuntimeError: filename does not exists error in Docling is usually related to missing model resources. The main solutions include:
- Manually downloading models using docling-tools models download
- Complete reinstall with cache cleanup
- Setting up offline mode for working without internet
- Using alternative backends or approaches
- Checking version compatibility of all dependencies
Since the code works on one computer but not another, the most likely cause is a difference in the installation environment: either incomplete resource or model files, or dependency version conflicts. It’s recommended to start with manual model download and checking version compatibility.
Sources
- Running docling offline with pre-downloaded models - GitHub Issue #232
- Installation - Docling Documentation
- Manual download of default models - GitHub Discussion #2089
- docling-tools models download rapidocr not actually being used by default DocumentConverter - Issue #2500
- FAQ - Docling Documentation