NeuroAgent

How to Download Files in Python: Complete Guide

Learn how to download files over HTTP in Python using urllib and requests libraries. Complete guide with error handling, streaming, and podcast utility examples.

How to download a file over HTTP using Python?

I have a utility that downloads an MP3 file from a website on a schedule and then builds/updates a podcast XML file for iTunes. The XML processing is written in Python, but I currently use wget in a Windows .bat file to download the MP3 file. I want to replace wget with Python code to have the entire utility in Python.

How can I download files using Python instead of wget?

NeuroAgent

You can download files over HTTP using Python’s built-in urllib library or the popular requests library. The requests library is generally recommended for its simplicity and robust error handling, while urllib is built into Python and requires no additional dependencies. Both methods can easily replace wget in your podcast utility, allowing you to have the entire process in Python.




Basic Download Methods

Python offers several built-in ways to download files over HTTP. The most straightforward approach uses urllib.request, which is part of Python’s standard library.

Using urllib.request.urlretrieve

The urlretrieve function is the simplest way to download a file, similar to wget:

python
import urllib.request

url = 'https://example.com/podcast_episode.mp3'
filename = 'podcast_episode.mp3'

urllib.request.urlretrieve(url, filename)

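This call blocks until the download finishes and saves the file directly to the specified filename. Internally it copies the response to disk in small blocks, so the file is not held in memory. As noted on Real Python, this approach is suitable for straightforward file downloading tasks when you don’t need progress feedback or advanced features. The Python documentation lists urlretrieve as a legacy interface that might become deprecated in the future, so urlopen or requests is the safer long-term choice.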
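If occasional feedback is still wanted, urlretrieve accepts an optional reporthook callback, which it calls once before the first block and once after each block it reads. The following is a minimal sketch; the function name download_with_hook and the printed format are illustrative assumptions, not part of the original utility.

```python
import urllib.request


def download_with_hook(url, filename):
    """Download url to filename, printing a rough percentage as blocks arrive."""

    def report(block_num, block_size, total_size):
        # total_size is -1 when the server sends no Content-Length header
        if total_size > 0:
            done = min(block_num * block_size, total_size)
            print(f"\r{done * 100 // total_size}% of {total_size} bytes", end="")

    path, _headers = urllib.request.urlretrieve(url, filename, reporthook=report)
    print()  # finish the progress line
    return path
```

Calling download_with_hook('https://example.com/podcast_episode.mp3', 'podcast_episode.mp3') would behave like the plain urlretrieve call, with a percentage printed while the file arrives.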

Using urllib.request.urlopen

For more control over the download process, you can use urlopen:

python
import urllib.request

url = 'https://example.com/podcast_episode.mp3'
filename = 'podcast_episode.mp3'

# response.read() loads the whole body into memory, which is fine for small files
with urllib.request.urlopen(url, timeout=30) as response, open(filename, 'wb') as out_file:
    out_file.write(response.read())

According to Stack Overflow, this method gives you more flexibility as it works with file-like objects and allows you to process the response before writing to disk.
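As one illustration of processing the response before writing, the sketch below inspects the Content-Type header first and then copies the body to disk in chunks rather than reading it all at once. The function name, the expected_prefix parameter, and the 64 KB chunk size are assumptions chosen for this example. Some servers label MP3 files as application/octet-stream, so the prefix may need adjusting.

```python
import shutil
import urllib.request


def download_if_type_matches(url, filename, expected_prefix="audio/", timeout=30):
    """Save url to filename only if its Content-Type starts with expected_prefix.

    Returns True if the file was written, False if the type did not match.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        if not content_type.startswith(expected_prefix):
            return False
        with open(filename, "wb") as out_file:
            # Copy in fixed-size chunks instead of reading the whole body at once
            shutil.copyfileobj(response, out_file, length=64 * 1024)
    return True
```

A check like this can stop an HTML error page from being saved under an .mp3 name.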

Using the requests Library

The requests library is a third-party package that provides a more elegant and powerful API for HTTP requests. It’s not included in Python’s standard library, so you’ll need to install it first:

bash
pip install requests

Basic Download with requests.get

python
import requests

url = 'https://example.com/podcast_episode.mp3'
filename = 'podcast_episode.mp3'

response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop on 4xx/5xx instead of saving an error page
with open(filename, 'wb') as file:
    file.write(response.content)

As Stack Overflow explains, the requests package has a very easy API to start with and is preferred by many developers for HTTP-related tasks.

Streaming Download for Large Files

For large files like MP3s, it’s better to stream the download rather than loading the entire file into memory:

python
import requests

def download_file(url, filename):
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

url = 'https://example.com/large_podcast_episode.mp3'
download_file(url, 'large_episode.mp3')

Advanced Features

Progress Bar with tqdm

For better user experience, especially with large files, you can add a progress bar using the tqdm library:

bash
pip install tqdm

python
import requests
from tqdm import tqdm

def download_with_progress(url, filename):
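    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # Do not write an error page to disk
    # content-length may be missing, in which case total_size is 0
    total_size = int(response.headers.get('content-length', 0))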
    
    with open(filename, 'wb') as file, tqdm(
        desc=filename,
        total=total_size,
        unit='B',
        unit_scale=True,
        unit_divisor=1024,
    ) as progress_bar:
        for chunk in response.iter_content(chunk_size=8192):
            size = file.write(chunk)
            progress_bar.update(size)

# Usage
download_with_progress('https://example.com/podcast_episode.mp3', 'episode.mp3')

As Alpharithms explains, this implementation provides visual feedback during downloads, which is helpful for long downloads.

Authentication for Protected Files

If your podcast files are behind authentication:

python
import requests

def download_protected_file(url, filename, username, password):
    response = requests.get(url, auth=(username, password), stream=True)
    response.raise_for_status()
    
    with open(filename, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

According to HeyCoach, authentication can be handled seamlessly using the auth parameter.
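When auth is given a (username, password) tuple, requests sends HTTP Basic authentication. For comparison, a rough standard-library equivalent builds the Authorization header by hand. The helper name basic_auth_request and the choice of UTF-8 for encoding the credentials are assumptions made for this sketch.

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Return a urllib Request carrying an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
```

The returned object can be passed to urllib.request.urlopen in place of the URL string. Basic credentials are only base64-encoded, not encrypted, so this should be used over HTTPS.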

Error Handling and Best Practices

Proper Error Handling

Always implement proper error handling for network operations:

python
import requests
import urllib.request
import urllib.error

def safe_download_with_requests(url, filename):
    try:
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # Raises HTTPError for bad responses
        
        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        return True
    except requests.exceptions.RequestException as e:
        print(f"Download failed: {e}")
        return False

def safe_download_with_urllib(url, filename):
    try:
        urllib.request.urlretrieve(url, filename)
        return True
    except urllib.error.URLError as e:
        print(f"Download failed: {e}")
        return False
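One gap the functions above leave open is that a download interrupted halfway can leave a truncated file under the final name. A possible way to handle that, sketched here under the assumption that the chunks come from something like response.iter_content(), is to write to a temporary file in the same directory and rename it only on success. The helper name save_chunks_atomically and the .part suffix are hypothetical.

```python
import os
import tempfile


def save_chunks_atomically(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path via a temporary file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            for chunk in chunks:
                tmp_file.write(chunk)
        # os.replace swaps the finished file into place in one step
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the partial file, then let the original error propagate
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With requests, the call might look like save_chunks_atomically(response.iter_content(chunk_size=8192), filename), so that an existing episode file is only replaced by a fully downloaded one.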

Best Practices from Medium

  1. Always use raise_for_status() to catch HTTP errors
  2. Stream large files using stream=True and iter_content()
  3. Add proper error handling for network issues
  4. Use progress bars for better user experience
  5. Validate downloaded files
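What "validate" means depends on the file type. For MP3s, a cheap check is sketched below: compare the size on disk with the expected size if one is known (for example from the Content-Length header), then confirm the file starts with either an ID3v2 tag or an MPEG audio frame header. The function name is hypothetical, and this is only a sanity check, not a full MP3 parser.

```python
import os


def looks_like_complete_mp3(path, expected_size=None):
    """Cheap sanity check: size matches (if known) and the file starts like an MP3."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    with open(path, "rb") as f:
        head = f.read(3)
    if head == b"ID3":  # ID3v2 tag at the start of the file
        return True
    # Otherwise expect an MPEG audio frame header: 11 sync bits set
    return len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0
```

A check of this kind would catch the common case where a server returns an HTML error page with a 200 status.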

Complete Podcast Utility Example

Here’s a complete example that replaces your wget approach:

python
import requests
import os
import xml.etree.ElementTree as ET
from datetime import datetime
from urllib.parse import urljoin
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class PodcastUpdater:
    def __init__(self, base_url, output_dir='podcasts'):
        self.base_url = base_url
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
    
    def download_episode(self, episode_url, episode_id):
        """Download a podcast episode with progress logging and error handling"""
        filename = os.path.join(self.output_dir, f"{episode_id}.mp3")
        
        try:
            # Stream the download and log progress as it goes
            with requests.get(episode_url, stream=True, timeout=30) as response:
                response.raise_for_status()
                
                # Get total file size (0 if the server sends no Content-Length)
                total_size = int(response.headers.get('content-length', 0))
                
                logger.info(f"Downloading {episode_id} ({total_size/1024/1024:.1f} MB)")
                
                with open(filename, 'wb') as file:
                    downloaded = 0
                    next_report = 10  # next percentage at which to log
                    for chunk in response.iter_content(chunk_size=8192):
                        if chunk:  # filter out keep-alive chunks
                            file.write(chunk)
                            downloaded += len(chunk)
                            
                            # Log each time another 10% boundary is crossed
                            if total_size > 0:
                                progress = (downloaded / total_size) * 100
                                if progress >= next_report:
                                    logger.info(f"Progress: {progress:.0f}%")
                                    next_report = (int(progress) // 10 + 1) * 10
                
                logger.info(f"Successfully downloaded {filename}")
                return True
                
        except requests.exceptions.RequestException as e:
            logger.error(f"Failed to download {episode_id}: {e}")
            return False
    
    def update_podcast_xml(self, episodes_config):
        """Update the podcast XML file with new episodes"""
        xml_file = os.path.join(self.output_dir, 'podcast.xml')
        
        # Create XML structure (simplified example)
        root = ET.Element('rss', version='2.0')
        channel = ET.SubElement(root, 'channel')
        
        # Add basic podcast info
        ET.SubElement(channel, 'title').text = "My Podcast"
        ET.SubElement(channel, 'description').text = "A great podcast"
        ET.SubElement(channel, 'link').text = self.base_url
        
        # Add episodes
        for episode in episodes_config:
            if self.download_episode(episode['url'], episode['id']):
                item = ET.SubElement(channel, 'item')
                ET.SubElement(item, 'title').text = episode['title']
                ET.SubElement(item, 'description').text = episode['description']
                # Use the local time with its real UTC offset instead of labelling local time as GMT
                ET.SubElement(item, 'pubDate').text = datetime.now().astimezone().strftime('%a, %d %b %Y %H:%M:%S %z')
                ET.SubElement(item, 'enclosure', {
                    'url': urljoin(self.base_url, f"{episode['id']}.mp3"),
                    'type': 'audio/mpeg',
                    'length': str(episode.get('size', 0))
                })
        
        # Write XML file
        tree = ET.ElementTree(root)
        tree.write(xml_file, encoding='utf-8', xml_declaration=True)
        logger.info(f"Updated podcast XML: {xml_file}")

# Example usage
if __name__ == "__main__":
    podcast = PodcastUpdater('https://my-podcast-website.com')
    
    episodes = [
        {
            'id': 'episode001',
            'title': 'First Episode',
            'description': 'This is my first episode',
            'url': 'https://my-podcast-website.com/episodes/episode001.mp3',
            'size': 52428800  # 50MB
        },
        {
            'id': 'episode002',
            'title': 'Second Episode',
            'description': 'This is my second episode',
            'url': 'https://my-podcast-website.com/episodes/episode002.mp3',
            'size': 78643200  # 75MB
        }
    ]
    
    podcast.update_podcast_xml(episodes)
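The class above rebuilds the XML from scratch on every run. Since the original utility updates an existing feed, a sketch of appending a single episode to a file that may already exist could look like the following. The function name append_episode and its parameters are hypothetical, and the code assumes an existing file has the usual rss/channel layout.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def append_episode(xml_path, title, enclosure_url, size_bytes):
    """Add one <item> to an RSS file, creating the file if it does not exist.

    Returns False (and changes nothing) if the enclosure URL is already listed.
    """
    if os.path.exists(xml_path):
        tree = ET.parse(xml_path)
        channel = tree.getroot().find("channel")
    else:
        root = ET.Element("rss", version="2.0")
        channel = ET.SubElement(root, "channel")
        tree = ET.ElementTree(root)

    # Skip episodes that are already in the feed
    for enclosure in channel.iter("enclosure"):
        if enclosure.get("url") == enclosure_url:
            return False

    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    # format_datetime produces an RFC 2822 date string
    ET.SubElement(item, "pubDate").text = format_datetime(datetime.now(timezone.utc))
    ET.SubElement(item, "enclosure", {
        "url": enclosure_url,
        "type": "audio/mpeg",
        "length": str(size_bytes),
    })
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)
    return True
```

If the existing feed uses namespaced tags such as the itunes: elements, the prefix should be registered with ET.register_namespace before writing, otherwise ElementTree emits generic prefixes such as ns0.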

Choosing the Right Approach

When to Use urllib

Use Python’s built-in urllib when:

  • You want to avoid external dependencies
  • You’re working in a restricted environment
  • You need simple downloads without advanced features
  • Your utility needs to be self-contained

As Tutorialspoint notes, urllib.request is suitable for straightforward file downloading tasks.

When to Use requests

Use the requests library when:

  • You need better error handling
  • You want cleaner, more readable code
  • You require advanced features like streaming, sessions, or authentication
  • You’re working with complex HTTP scenarios
  • You want to add progress bars easily

According to AskPython, the requests library provides a concise and efficient way to handle HTTP requests in Python.

Recommendations for Your Use Case

For your podcast utility, I recommend using the requests library with streaming and proper error handling because:

  1. MP3 files can be large, so streaming prevents memory issues
  2. Error handling is crucial for automated scheduled downloads
  3. Progress logging leaves a record of each unattended run
  4. The code will be more maintainable and readable
  5. You can easily extend it for authentication if needed in the future
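For a scheduled job, a transient network failure is often worth a second attempt before giving up. The sketch below is one possible retry wrapper with exponential backoff. The name with_retries and its parameters are assumptions. It catches OSError because both urllib.error.URLError and requests.exceptions.RequestException derive from it.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return func()
        except OSError:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

It could wrap the streaming function shown earlier, for example with_retries(lambda: download_file(url, 'episode.mp3')). The wrapped function must raise on failure rather than return False. Note that this simple version also retries permanent errors such as a 404, which a fuller implementation would probably exclude.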

Conclusion

Downloading files over HTTP in Python is straightforward and can easily replace wget in your podcast utility. Here are the key takeaways:

  1. Both urllib and requests can handle file downloads, with requests offering better error handling and more features
  2. Use streaming downloads (stream=True with iter_content()) for large files like MP3s to avoid memory issues
  3. Always implement proper error handling with raise_for_status() and try/except blocks
  4. Add progress bars using tqdm for better user experience with large downloads
  5. Consider authentication if your podcast files are behind protected access

For your specific podcast utility, the requests library with streaming, error handling, and progress tracking provides the most robust solution. The complete example above shows how to integrate file downloading with XML processing in a single Python script, eliminating the need for external tools like wget.


Sources

  1. Real Python - How to Download Files From URLs With Python
  2. Stack Overflow - How to download a file over HTTP?
  3. Stack Overflow - Download file from web in Python 3
  4. Tutorialspoint - Download a file over HTTP in Python
  5. AskPython - Python HTTP File Download: Using the Requests Library
  6. Alpharithms - Progress Bars for Python Downloads
  7. HeyCoach - Handling File Downloads In Python
  8. Medium - Downloading Files from URLs in Python
  9. WebScraping.AI - How do I handle file downloads during web scraping with Python?
  10. Geekflare - 5 Ways to Download Files from a URL Using Python