Python 3.14 and the Fall of the GIL: How My Code Finally Used All Eight Cores

The Ghost of the GIL

For as long as I’ve written Python, there’s been one phrase that inevitably shows up in every discussion about performance — the GIL. The Global Interpreter Lock, or GIL, has been both a guardian and a curse for Python developers. It made memory management simple and safe, but it also tied Python’s hands. Because of it, only one thread could run Python bytecode at a time. No matter how powerful your machine was, Python never truly used more than one core.

That limitation quietly shaped how we wrote Python for decades. Whenever we needed parallel performance, we reached for workarounds — multiprocessing, offloading to C extensions, or even rewriting critical paths in Rust or C++. It worked, but it was never elegant. It always felt like Python was jogging while the hardware beneath it was sprinting.
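
Concretely, the classic workaround looked something like the sketch below: a ProcessPoolExecutor fanning the work out to separate processes, each with its own interpreter and its own GIL. It parallelizes, but you pay for process startup and for pickling data across process boundaries. (The workload here is a stand-in for illustration.)

from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(n):
    """Pure-Python CPU work: sum of squares from 0 to n-1."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Eight separate processes: parallel even under the GIL,
    # but with spawn and pickling overhead and no shared memory.
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(sum_of_squares, [10_000_000] * 8))
    print(results)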

This year, that changed.

With Python 3.14, released in October 2025, CPython’s free-threaded build, a version of the interpreter with the GIL disabled entirely, became an officially supported option (it first shipped as an experimental build in 3.13). For the first time, threads can run truly in parallel across multiple cores. It is arguably the biggest change to Python’s runtime in the language’s history.

Putting It to the Test

I wanted to see what this really meant in practice, so I set up two environments on my MacBook using pyenv. One was Python 3.12, the standard build with the GIL. The other was Python 3.14, compiled in free-threading mode with the --disable-gil configure flag.

(The default 3.14 installer still includes the GIL, so developers need to opt into the free-threaded build manually.)
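
If you are unsure which interpreter you ended up with, a quick sanity check is possible from Python itself. The sysconfig variable Py_GIL_DISABLED and sys._is_gil_enabled() both exist as of Python 3.13; on older versions the latter is simply absent, hence the hasattr guard.

import sys
import sysconfig

# 1 on free-threaded builds (configured with --disable-gil), 0 or None otherwise
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# Reports whether the GIL is actually active right now (Python 3.13+).
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())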

To keep things simple, I wrote a small script that creates one thread per core, eight on my MacBook, and asks each of them to calculate the sum of squares up to one hundred million. It’s pure Python math: heavy on CPU, no I/O, no C extensions, no tricks.

Here’s a snippet of what I ran:

import threading
import time
import multiprocessing

def sum_of_squares(n):
    """Calculate sum of squares from 0 to n-1."""
    s = 0
    for i in range(n):
        s += i * i
    return s

def worker(worker_id, n):
    """Worker function to calculate sum of squares."""
    print(f"Worker {worker_id} starting")
    result = sum_of_squares(n)
    print(f"Worker {worker_id} calculated sum of squares for n={n}, result={result}")
    return result

def main():
    """Main function to coordinate the multi-threaded sum of squares calculation."""
    start_time = time.perf_counter()  # monotonic clock, better suited to benchmarking

    # Get the number of CPU cores
    num_cores = multiprocessing.cpu_count()
    print(f"Number of CPU cores: {num_cores}")

    # Workload per thread: pure-Python sum of squares up to n
    n = 100_000_000

    threads = []
    # Create and start threads equal to the number of cores
    for i in range(num_cores):
        thread = threading.Thread(target=worker, args=(i, n))
        threads.append(thread)
        thread.start()

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

    # Calculate and print the total execution time
    end_time = time.perf_counter()
    total_time = end_time - start_time
    print(f"All workers completed in {total_time:.2f} seconds")

if __name__ == "__main__":
    main()

Each thread does the same job, crunching numbers as fast as it can. The only variable was the Python version.

The Results: Eight Cores, One Revelation

Let’s see how long it takes to run under the standard Python 3.12 build:

❯ python sum_of_squares.py 
Number of CPU cores: 8
Worker 0 starting
Worker 1 starting
Worker 2 starting
Worker 3 starting
Worker 4 starting
Worker 5 starting
Worker 6 starting
Worker 7 starting
Worker 1 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 6 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 0 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 3 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 7 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 4 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 2 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 5 calculated sum of squares for n=100000000, result=333333328333333350000000
All workers completed in 36.05 seconds

Now, with the GIL-free version:

❯ python sum_of_squares.py 
Number of CPU cores: 8
Worker 0 starting
Worker 1 starting
Worker 2 starting
Worker 3 starting
Worker 4 starting
Worker 5 starting
Worker 6 starting
Worker 7 starting
Worker 0 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 7 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 6 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 5 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 1 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 2 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 3 calculated sum of squares for n=100000000, result=333333328333333350000000
Worker 4 calculated sum of squares for n=100000000, result=333333328333333350000000
All workers completed in 11.57 seconds

On Python 3.12, the result was exactly what we’ve all come to expect. Every thread started and finished, but they never truly ran together. The program took 36.05 seconds, and CPU utilization barely exceeded a single core.

Then I switched to Python 3.14’s free-threaded build. I ran the same code on the same hardware with the same logic, and the difference was instant. My CPU monitor lit up across all eight cores. The threads worked in parallel, and the computation finished in just 11.57 seconds. That’s just over three times faster, purely because the interpreter no longer forced everything through one thread at a time.

Why This Changes Everything

It’s hard to exaggerate how transformative this is. For years, Python’s threading library has existed mostly for I/O-bound tasks — reading files, handling web requests, waiting for network responses — but it was never useful for true CPU-bound work. Now, it finally is.

With the GIL gone, you no longer have to juggle process pools or shared memory queues just to take advantage of all your cores. You can write regular threaded Python code and see real parallel performance. For developers working on AI pipelines, vector processing, simulations, or data transformations, this is a game changer. Tasks like tokenization, embedding computation, or agent reasoning can finally scale with the number of cores, often close to linearly for pure-Python work.
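
As a rough sketch of what that looks like in practice, here is the same sum-of-squares workload split across a plain ThreadPoolExecutor. The chunking scheme is my own illustration rather than a prescribed pattern: on the free-threaded build the chunks run in parallel, while on a GIL build the identical code quietly serializes.

import os
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(bounds):
    """Sum of squares over a half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers=None):
    # Split 0..n-1 into contiguous chunks, one per worker thread.
    workers = workers or os.cpu_count() or 1
    bounds = [(n * w // workers, n * (w + 1) // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, bounds))

if __name__ == "__main__":
    print(parallel_sum_of_squares(100_000_000))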

It also changes how we think about Python servers. Frameworks that mix async and threading, from FastAPI to custom ML inference backends, will now see better throughput without spawning extra processes. And since threads share memory, the overhead stays much lower than in multiprocess setups.
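
For instance, an async handler can now push a CPU-heavy step onto a thread with asyncio.to_thread and get genuine parallelism instead of cooperative queueing. The tokenize function below is a hypothetical stand-in for whatever heavy step your service runs.

import asyncio

def tokenize(text):
    """Hypothetical CPU-heavy step; imagine real tokenization here."""
    return text.split()

async def handle_request(texts):
    # On a free-threaded build these calls run on truly parallel threads;
    # under the GIL they would take turns.
    return await asyncio.gather(*(asyncio.to_thread(tokenize, t) for t in texts))

if __name__ == "__main__":
    print(asyncio.run(handle_request(["hello world", "free threads"])))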

A New Chapter for Python

Of course, there are caveats. Single-threaded programs might run a little slower, typically 5 to 10 percent, because reference counting and other interpreter internals now have to be thread-safe. Some C extensions still need small updates to adapt to the new model. But those are transitional details. The real story is that Python can finally grow into the multi-core world we’ve been living in for years.
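
If you want to put a number on that overhead for your own workload, running the same single-threaded micro-benchmark under both interpreters is usually enough. A minimal sketch, with an arbitrary pure-Python loop:

import timeit

# Run this under both builds and compare; the free-threaded interpreter
# will typically come out a few percent slower on single-threaded code.
elapsed = timeit.timeit("sum(i * i for i in range(1_000_000))", number=20)
print(f"{elapsed:.2f} s total for 20 runs")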

For me, watching those eight threads blaze through their tasks in parallel wasn’t just about speed. It felt symbolic — like watching Python shed the last vestige of its old limitations. For decades, the GIL has been the punchline in every performance joke about Python. With 3.14, that’s beginning to fade.

This isn’t just an optimization. It’s liberation. Python 3.14 doesn’t merely make your code faster — it lets the language finally use all the power your machine has been offering all along.
