
Fix PySpark Pytest Py4JJavaError on Windows 11 SparkSession

Resolve Py4JJavaError in PySpark pytest fixtures on Windows 11 by adding spark.driver.bindAddress=127.0.0.1 to SparkSession. Includes working conftest.py, winutils setup, env vars checklist for local tests.


PySpark pytest fixture fails with Py4JJavaError on Windows 11 startup: How to fix SparkSession in local tests?

I’m testing PySpark transformations locally on Windows 11 with Python 3.11.9, pytest, the latest Spark release, hadoop.dll and winutils in place, and Microsoft JDK 17, with the relevant environment variables set.

Running the test fails with:

ERROR tests/test_customers.py::test_enrich_customers_logic - py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

conftest.py:

python
import sys
import pytest
import socketserver

if sys.platform == 'win32':
    socketserver.UnixStreamServer = socketserver.TCPServer
    socketserver.UnixStreamHandler = socketserver.StreamRequestHandler


from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    builder = (SparkSession.builder
               .master("local[1]")
               .appName("Windows-PyTest-Local"))

    session = builder.getOrCreate()
    yield session
    session.stop()

test_customers.py:

python
from transformations.customer_transforms import enrich_customers


def test_enrich_customers_logic(spark):
    df = spark.createDataFrame([
        (1, " John Doe ", "123-344-455"),
        (2, "Jane Smith", "(987) 654-3210")
    ], ["customer_id", "customer_name", "phone"])

    results = enrich_customers(df).collect()

    assert results[0]["first_name"] == "John"
    assert results[0]["last_name"] == "Doe"
    assert results[0]["phone"] == "123344455"

What configuration is missing to resolve the Py4JJavaError and run PySpark tests successfully on Windows?

Add a driver bind/host configuration to your SparkSession builder when running PySpark tests on Windows: set spark.driver.bindAddress to 127.0.0.1 (and optionally spark.driver.host and a fixed spark.driver.port) in your pytest fixture. Also confirm that HADOOP_HOME points to your winutils folder and that PYSPARK_PYTHON, SPARK_HOME, and JAVA_HOME are correct; together these changes resolve the Py4JJavaError on Windows 11 and let the local[1] SparkSession start reliably.


Quick fix — add spark.driver.bindAddress to SparkSession

The single configuration that’s most often missing is telling the Spark driver which interface to bind to. On Windows with multiple adapters (or VPNs/WSL/virtual NICs) Spark’s Java driver can try to bind to an address that’s not usable, causing the Py4JJavaError during JavaSparkContext startup. Add this to your fixture:

python
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .master("local[1]")
           .appName("Windows-PyTest-Local")
           .config("spark.driver.bindAddress", "127.0.0.1")  # REQUIRED on many Windows setups
           .config("spark.driver.host", "127.0.0.1")         # helps in some environments
           # .config("spark.driver.port", "5050")            # optional: pick a free port if needed
           .config("spark.sql.shuffle.partitions", "1"))     # useful for deterministic, fast tests
spark = builder.getOrCreate()

Why 127.0.0.1? Because binding the Java gateway explicitly to loopback avoids ambiguity when Windows has multiple IPs. The community has repeatedly recommended this fix for Windows PySpark pytest issues (see examples from StackOverflow and a practical write‑up) — and the official PySpark testing docs show the same pattern for Windows fixtures. See the practical writeup at https://towardsdatascience.com/spark-fix-cant-assign-driver-32406580375/ and the StackOverflow thread at https://stackoverflow.com/questions/41049330/windows-error-while-running-standalone-pyspark.
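If you’re curious how many candidate addresses your machine actually exposes, a quick standard-library check (purely illustrative, not part of the fix) makes the ambiguity visible:

python
import socket

# List every IPv4 address the local hostname resolves to. On a Windows box with
# Wi-Fi, Ethernet, VPN and WSL/Docker adapters this is often several entries,
# and the Spark driver may not pick the one you expect.
hostname = socket.gethostname()
_, _, addresses = socket.gethostbyname_ex(hostname)
print(f"{hostname} resolves to: {addresses}")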


Why this causes Py4JJavaError on Windows

Short version: the Spark driver (a JVM process) must open a server socket for the Python side (Py4J) to connect. On Linux that usually works fine because the host network setup is simple. On Windows there are often many network interfaces (Ethernet, Wi‑Fi, VPN tunnels, Docker/WSL virtual adapters). If the driver attempts to bind to an IP that either doesn’t exist or is blocked, Java throws a BindException and Py4J reports:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

You might also see errors referencing PythonUtils or gateway startup failures. For many users the permanent fix is to pin the driver to loopback (127.0.0.1) so the JVM binds to a predictable, local interface. The StackOverflow solution and community repos confirm this pattern for Windows 10/11 and recent Spark versions: https://stackoverflow.com/questions/41049330/windows-error-while-running-standalone-pyspark and https://github.com/BoltMaud/Pyspark_pytest.
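If you suspect the loopback interface itself is blocked (firewall or security software), a quick bind test outside Spark can confirm it. This is only a sanity check to run by hand, not something Spark does:

python
import socket

# Open a listening socket on loopback and let the OS choose a free port.
# If this raises OSError, something local (firewall/AV) is interfering with
# loopback binds, and the Spark driver will hit the same wall.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))
    s.listen(1)
    print("Loopback bind OK on port", s.getsockname()[1])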


Working conftest.py example (pytest + Windows 11)

Here’s a minimal, tested conftest.py pattern you can drop into your repo. It keeps your original socketserver hack (needed only on Windows), pins the driver bind address, makes tests fast/deterministic, and yields a session-scoped SparkSession:

python
import os
import sys
import socketserver
import pytest

if sys.platform == "win32":
    # keep this before importing pyspark to avoid Unix-socket usage
    socketserver.UnixStreamServer = socketserver.TCPServer
    socketserver.UnixStreamHandler = socketserver.StreamRequestHandler

# (Optional) enforce the python interpreter used by PySpark
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)

from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    builder = (SparkSession.builder
               .master("local[1]")
               .appName("Windows-PyTest-Local")
               .config("spark.driver.bindAddress", "127.0.0.1")
               .config("spark.driver.host", "127.0.0.1")
               .config("spark.sql.shuffle.partitions", "1"))
    spark = builder.getOrCreate()
    yield spark
    spark.stop()

Notes:

  • You can add .config("spark.driver.port", "5050") if you want to force a specific port (make sure it’s free; a sketch for picking a free port programmatically follows these notes). If you leave it unset, Spark picks an available port.
  • Keep .master("local[1]") for single-threaded deterministic tests. If tests are CPU-bound, increase cores but expect non-deterministic ordering.
  • This pattern follows the official testing guidance: https://spark.apache.org/docs/latest/api/python/getting_started/testing_pyspark.html.
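If you do want a fixed-but-free driver port, one option (an illustrative sketch, not something the official docs prescribe) is to ask the OS for an unused port right before building the session:

python
import socket

from pyspark.sql import SparkSession


def find_free_port() -> int:
    """Ask the OS for an unused TCP port on loopback and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# Note: there is a small race window between closing the probe socket and
# Spark binding the port, but for local test runs this is rarely a problem.
driver_port = find_free_port()
spark = (SparkSession.builder
         .master("local[1]")
         .appName("Windows-PyTest-Local")
         .config("spark.driver.bindAddress", "127.0.0.1")
         .config("spark.driver.host", "127.0.0.1")
         .config("spark.driver.port", str(driver_port))
         .getOrCreate())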

Windows checklist: winutils, env vars, Java and Python

Beyond bindAddress, these pieces commonly cause trouble on Windows. Tick them off:

  • HADOOP_HOME: point it at your winutils folder (e.g., C:\winutils) and ensure %HADOOP_HOME%\bin contains winutils.exe (and hadoop.dll where your setup needs it). See community examples at https://github.com/BoltMaud/Pyspark_pytest.
  • SPARK_HOME / PATH: if you manage a local Spark binary, make sure SPARK_HOME is set and %SPARK_HOME%\bin is on your PATH.
  • JAVA_HOME: point to Microsoft JDK 17 or another compatible JDK that your Spark build supports.
  • PYSPARK_PYTHON: set it to the Python interpreter that runs pytest (helpful when you work inside a venv or conda environment).
  • pyspark version compatibility: match pyspark Python package to your Spark distribution when possible (many users use pyspark==3.5.x with Spark 3.5.x).
  • findspark: optional helper if you don’t set SPARK_HOME; docs show how to init findspark early to locate Spark (https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/).
  • Antivirus / firewall: ensure local loopback ports aren’t blocked; add exceptions or test with Windows Defender temporarily disabled.

If you already have winutils/hadoop.dll and environment variables set (as you stated), then the bindAddress change is usually the missing piece.
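If you want to turn that checklist into code, a small sanity check at the top of conftest.py can fail fast with readable messages. This is a sketch under the assumptions listed above (winutils.exe under %HADOOP_HOME%\bin, a JDK on JAVA_HOME); adjust it to your setup:

python
import os
import shutil
import sys

# Fail fast with clear messages if the Windows prerequisites are missing.
# These checks mirror the checklist above; relax or extend them as needed.
if sys.platform == "win32":
    hadoop_home = os.environ.get("HADOOP_HOME")
    assert hadoop_home, "HADOOP_HOME is not set (point it at your winutils folder)"
    assert os.path.exists(os.path.join(hadoop_home, "bin", "winutils.exe")), \
        "winutils.exe not found under %HADOOP_HOME%\\bin"
    assert os.environ.get("JAVA_HOME"), "JAVA_HOME is not set (point it at a compatible JDK)"
    assert shutil.which("java"), "no java executable found on PATH"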


Troubleshooting tips: ports, VPNs, firewall and logs

  • Still failing after bindAddress? Try also setting spark.driver.host to the machine’s loopback name (localhost) or to an explicit IP you know is reachable from local processes.
  • Check for port conflicts: run netstat -ano | findstr :<port> for any port you pinned, or let Spark pick a port and inspect the error log for the port it tried to bind.
  • VPNs and virtualization: disable VPNs or Docker/WSL virtual adapters temporarily to see if they’re the culprit. They often change the default interface selection.
  • Logs: enable more detail via Spark’s log4j configuration, or call spark.sparkContext.setLogLevel("DEBUG") once the SparkSession starts (that won’t help with pre-start BindExceptions). The Java stack trace in the pytest failure shows the bind exception; look for "Address already in use" or "Cannot assign requested address."
  • Firewall: add inbound/outbound rules for java.exe or the chosen port, or test with firewall off.
  • If you see Py4J errors referencing PythonUtils.getEncryptionEnabled or similar, ensure PYTHONPATH includes Spark’s python directories or use findspark; see https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/.

One more trick: if you need reproducible local tests across CI agents, pin driver.host and driver.port per-agent in environment variables and document them in the test suite.
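One way to wire that up (a sketch; SPARK_DRIVER_HOST and SPARK_DRIVER_PORT are hypothetical variable names you would define per agent) is to read the values in the fixture with loopback defaults:

python
import os

from pyspark.sql import SparkSession

# SPARK_DRIVER_HOST / SPARK_DRIVER_PORT are illustrative names: define them per
# CI agent, or leave them unset to fall back to the defaults below.
driver_host = os.environ.get("SPARK_DRIVER_HOST", "127.0.0.1")
driver_port = os.environ.get("SPARK_DRIVER_PORT", "")  # empty means "let Spark choose"

builder = (SparkSession.builder
           .master("local[1]")
           .appName("Windows-PyTest-Local")
           .config("spark.driver.bindAddress", driver_host)
           .config("spark.driver.host", driver_host))
if driver_port:
    builder = builder.config("spark.driver.port", driver_port)

spark = builder.getOrCreate()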


Conclusion

In short: pin the Spark driver to loopback in your pytest fixture by adding spark.driver.bindAddress = "127.0.0.1" (and usually spark.driver.host = "127.0.0.1" or a fixed spark.driver.port). Combine that with a correct HADOOP_HOME/winutils setup, SPARK_HOME, PYSPARK_PYTHON, and a compatible JDK, and your local PySpark tests will stop failing with the Py4JJavaError on Windows 11.
