**************
Free threading
**************

The free-threaded (sometimes known as "nogil") build of Python is an experimental mode available from Python 3.13 onwards. It aims to disable the "Global Interpreter Lock" and allow multiple Python threads to run truly concurrently. Cython 3.1 and upwards has some basic support for this build of Python. Note that this support *is experimental* and is planned to remain experimental for at least as long as the free-threaded build is experimental in the CPython interpreter.

This section documents the extent of the support and the known pitfalls.

Useful links
============

* `PEP 703 `_ - the initial proposal that led to this feature existing in Python.
* `Python documentation for free-threaded extensions `_.
* `Quansight labs' documentation of the status of free-threading `_.

Status
======

.. note::

   All of this is experimental and subject to change/removal!

Cython 3.1 is able to build extension modules that are compatible with free-threaded builds of Python. However, by default these extension modules don't indicate their compatibility. Therefore, importing one of these extension modules will result in the interpreter re-enabling the GIL. The result is that the extension module will work, but you will lose the benefits of the free-threaded interpreter!

The module-level directive ``# cython: freethreading_compatible = True`` declares that the module is fully compatible with the free-threaded interpreter. When you specify this directive, importing the module will not cause the interpreter to re-enable the GIL. The directive itself does not do anything to ensure compatibility - it is simply a way for you to indicate that you have tested your module and are confident that it works.
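
When testing a module, it can help to confirm at runtime whether the GIL ended up enabled after the import. The following is a minimal sketch, not part of Cython: the helper name ``gil_status`` is made up for illustration, and it assumes the CPython facilities ``sysconfig.get_config_var("Py_GIL_DISABLED")`` and ``sys._is_gil_enabled()`` (the latter only exists on Python 3.13+, hence the fallback).

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter.

    Hypothetical helper for illustration only.
    """
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; on older versions the
    # attribute is missing and the GIL is always enabled.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if is_gil_enabled is None else bool(is_gil_enabled())
    return free_threaded_build, gil_enabled


if __name__ == "__main__":
    # Import the extension module under test before this call, so that any
    # re-enabling of the GIL has already happened.
    print(gil_status())
```

On a free-threaded build, a second value of ``True`` after importing your module suggests that some imported extension did not declare itself compatible.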

If you want to temporarily force Python not to re-enable the GIL, irrespective of whether extension modules claim to support it, then you can either:

* set ``PYTHON_GIL=0`` as an environment variable,
* run Python with ``-Xgil=0`` as a command-line argument.

These options are mainly useful for testing.

Tools for Thread-safety
=======================

Cython is gradually adding tools to help you write thread-safe code. These are described here.

Critical Sections
-----------------

`Critical Sections `_ are a feature provided by Python to generate a local lock based on some Python object. Cython allows you to use critical sections with a convenient syntax::

    o = object()
    ...
    with cython.critical_section(o):
        ...

Critical sections can take one or two Python objects as arguments. You are required to hold the GIL on entry to a critical section (you can release the GIL inside the critical section, but that also temporarily releases the critical section so is unlikely to be a useful thing to do).

We suggest reading the Python documentation to understand how critical sections work.

* It is guaranteed that the lock will be held when executing code within the critical section. However, there is no guarantee that the code block will be executed in one atomic action. This is very similar to the guarantee provided by a ``with gil`` block.
* Operations on another Python object may end up temporarily releasing the critical section in favour of a critical section based on that object.

On non-freethreading builds ``cython.critical_section`` does nothing - you get the same guarantees simply from the fact you hold the GIL. Our current experience is that this provides slightly less thread-safety than you get in freethreading builds, simply because Python releases the GIL more readily than it releases a critical section.

Locks
-----

Cython provides ``cython.pymutex`` as a more robust lock type.
Unlike ``cython.critical_section`` this will never release the lock unless you explicitly ask it to (at the cost of losing ``critical_section``'s inbuilt protection against deadlocks). ``cython.pymutex`` supports two operations: ``acquire`` and ``release``. ``cython.pymutex`` can also be used in a ``with`` statement::

    cdef cython.pymutex l

    with l:
        ...  # perform operations with the lock

    # or manually
    l.acquire()
    ...  # perform operations with the lock
    l.release()

``acquire`` will avoid deadlocks if the GIL is held (only relevant in non-freethreading versions of Python). However, you are at risk of deadlock if you attempt to acquire the GIL while holding a ``cython.pymutex`` lock. Be aware that it is also possible for Cython to acquire the GIL implicitly (for example by raising an exception) and this is also a deadlock risk.

On Python 3.13+, ``cython.pymutex`` is just a `PyMutex `_ and so is very low-cost. On earlier versions of Python, it uses the (undocumented) ``PyThread_type_lock``.

``cython.pythread_type_lock`` exposes the same interface but always uses ``PyThread_type_lock``. It is intended for sharing locks between modules with the Limited API (since ``PyMutex`` is unavailable in the Limited API). Note that unlike the "raw" ``PyThread_type_lock`` our wrapping will avoid deadlocks with the GIL.

As an alternative syntax, ``cython.critical_section`` can be used as a decorator on a function taking at least one argument. In this case the critical section lasts the duration of the function and locks on the first argument::

    @cython.cclass
    class C:
        @cython.critical_section
        def func(self, *args):
            ...

        # equivalent to:
        def func(self, *args):
            with cython.critical_section(self):
                ...

Our expectation is that this will be most useful for locking on the ``self`` argument of methods in C classes.

Pitfalls
========

Building on Windows
-------------------

As of the Python 3.13 beta releases, building a free-threaded Cython extension module on Windows is tricky because Python provides a single header file shared between the free-threaded and regular builds. You therefore need to manually define the C macro ``Py_GIL_DISABLED=1``. Cython attempts to detect cases where this wasn't done correctly and will try to raise an ``ImportError`` instead of crashing. However, if you are seeing crashes immediately after you import a Cython extension module, this is the most likely explanation.

Thread safety
-------------

Cython extension modules don't yet try to ensure any significant level of thread safety. This means that if you have multiple threads manipulating the same object attribute of a ``cdef class`` (for example) then it is likely that the reference counting will end up inconsistent and the interpreter will crash.

.. note::

   When running pure Python code directly in the Python interpreter itself, the interpreter should ensure that reference counting is at least consistent and that the interpreter does not crash. Cython doesn't currently even go this far.

By itself "not crashing" is not a useful level of thread safety for most algorithms. It will always be your own responsibility to use appropriate synchronization mechanisms so that your own algorithms work as you intend.

Running concurrent Cython functions that do not interact with the same data is expected to be safe.

What is likely to be extremely unsafe is code like::

    for idx in cython.parallel.prange(n, nogil=True):
        with gil:
            ...

In regular non-free-threaded builds only one thread will run the ``with gil`` block at once. In free-threaded builds multiple threads will be able to run simultaneously. It is extremely likely that these multiple threads will be operating on the same data in unsafe ways.
We recommend against this kind of code in free-threaded builds at the moment (and even with future improvements in Cython, such code is likely to require extreme care to make it work correctly).

.. note::

   It is a common mistake to assume that a ``with gil`` block runs "atomically" (i.e. all in one go, without switching to another thread) on non-free-threaded builds. Many operations can cause the GIL to be released. Some more detail is in the section :ref:`gil_as_lock`.

Opinionated Suggestions
=======================

This section contains our views on how to use Cython effectively with free-threaded Python. It may evolve as our understanding grows.

Interaction between threads
---------------------------

Multi-threaded programs generally work best if you can minimize the interaction between threads. It's optimal if the different threads perform completely isolated blocks of work which are only collected at the end. Python code is no exception - especially since Python's reference counting means that even apparent "read-only" operations can actually involve both reading and writing.

As an example consider a program that collects unique words from multiple files.
In this case it would probably be best to read each file to a separate ``set`` and then combine them at the end::

    import concurrent.futures

    def read_from_files_good(filenames):
        def read_from_file(filename):
            # each call builds its own private set
            out = set()
            with open(filename, 'r') as f:
                for line in f:
                    words = line.split()
                    for word in words:
                        out.add(word)
            return out

        overall_result = set()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # results are combined in the calling thread only
            for file_result in executor.map(read_from_file, filenames):
                overall_result.update(file_result)
        return overall_result

rather than updating one ``set`` from all threads::

    import concurrent.futures

    def read_from_files_bad(filenames):
        overall_result = set()
        def read_from_file(filename):
            with open(filename, 'r') as f:
                for line in f:
                    words = line.split()
                    for word in words:
                        # every thread writes to the same shared set
                        overall_result.add(word)
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for _ in executor.map(read_from_file, filenames):
                pass
        return overall_result

The less your threads interact, the less chance there is for bugs, the less need there is for locking to control their interaction, and the less likely they are to slow each other down by invalidating the CPU cache for other threads.

Should you use ``prange``?
--------------------------

Although ``prange`` is the parallelization mechanism *built in* to Cython, it is not the only option, and probably should not be your default option.

``prange`` is a fairly thin wrapper over OpenMP's "parallel for". This means it is ideal for problems where you have a big loop, every iteration is basically the same, and the result of each iteration is independent of any other iteration. If this does *not* describe your problem then ``prange`` is probably not the solution.

Remember that all the threading options available in Python are also available in Cython. For example, you can start threads with ``threading.Thread`` or ``concurrent.futures.ThreadPoolExecutor``. They are much more flexible than ``prange``. Similarly, the synchronization tools in the ``threading`` module are also available in Cython.
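
As a rough illustration of the standard-library route, the sketch below starts plain ``threading.Thread`` workers and uses a ``threading.Lock`` only for the brief moment when each worker merges its result. The function name ``parallel_sum_of_squares`` and its parameters are made up for this example; the same code should be usable unchanged from a Cython module, although that is an expectation rather than something tested here.

```python
import threading


def parallel_sum_of_squares(values, n_threads=4):
    """Sum x*x over ``values`` using plain Python threads (illustrative only)."""
    total = 0
    total_lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        # Do the bulk of the work on data private to this thread...
        partial = sum(x * x for x in chunk)
        # ...and only touch the shared total briefly, under the lock.
        with total_lock:
            total += partial

    # Strided slices give each thread a disjoint share of the input.
    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Keeping the per-thread work isolated and merging once at the end follows the same "minimize interaction" advice as the word-counting example.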

Try to avoid Python code in ``prange``
--------------------------------------

``prange`` has some slightly unintuitive behaviour about which data is shared and which isn't. Typically C variables (e.g. ``int``, ``double``) are treated as "thread-local" and so each thread has its own copy. However, Python object variables are treated as shared between all the threads. This means that::

    cdef int i
    cdef int total = 0
    for i in cython.parallel.prange(10, nogil=True):
        tmp = i**2
        total += tmp

should work fine - each thread has its own ``tmp`` and ``total`` is a "reduction" (so treated in an efficient thread-safe way). However::

    cdef int i
    cdef int total = 0
    cdef object tmp
    for i in cython.parallel.prange(10, nogil=True):
        with gil:
            tmp = i**2
            total += tmp

In this case, there is only a single value of ``tmp`` shared between all the threads. They are continuously overwriting each other's values. Additionally, Cython does not currently ensure that ``tmp`` is even reference-counted in a thread-safe way, so you are at risk of crashes or memory leaks in addition to getting a nonsense answer.

If you do want to work with Python objects, then it is best to move them into a function and just have the loop call the function::

    cdef int square(int x):
        cdef object tmp = x**2
        cdef int result = tmp
        return result

    # ...

    cdef int i
    cdef int total = 0
    for i in cython.parallel.prange(10, nogil=True):
        with gil:
            total += square(i)

Since ``tmp`` is now local to the function scope, each function call has its own copy and thus there is no conflict of Python objects between threads.

Use C++ for low-level synchronization primitives
------------------------------------------------

When you must have threads interact with each other, you usually need to use special data types to control the access to shared data. Python provides many of these in the ``threading`` module.
However, sometimes it is useful to either:

* avoid the Python-call overhead of the threading module,
* use atomic variables to update numeric types in a controlled way without locking.

For this our recommendation is to use the C++ standard library. Most of these are available simply by "cimporting" from ``libcpp``. In the event that Cython hasn't already wrapped what you want to use then you can do it yourself - our ``libcpp`` is provided for convenience but it does nothing that can't be done with regular Cython code.

The C standard library also provides some of these features (e.g. atomic variables and mutexes). However, compiler support for the C++ standard library is better (in particular for MSVC) and the C++ standard library is more fully featured, so we recommend this first.

One difficulty is with types that are not default constructible or movable (e.g. ``latch``, ``semaphore``, ``barrier``). These are difficult to stack-allocate because of how Cython's code generation works, so you need to heap-allocate them::

    from libcpp.latch cimport latch

    l = new latch(2)
    try:
        with nogil:  # avoid deadlocks!
            ...  # use the latch
    finally:
        del l

Be careful not to hold the GIL while performing blocking operations with the C or C++ standard library threading tools. Unlike the Python standard library, they are not aware of the GIL/Python thread state. Therefore you have a very high probability of deadlock (even on free-threaded builds, which do occasionally switch to a GIL-locked mode when running certain operations).

It is also possible to use C++ to create new threads (for example, using the ``std::jthread`` class). This works, but we generally recommend creating threads through Python instead.
A C++-created thread must be registered with the interpreter by entering a ``with gil:`` block before it uses any Python objects, and this will not work reliably with multiple subinterpreters. This recommendation is therefore mainly to future-proof your code and not restrict where it can be used from. It is a fairly soft suggestion though, so feel free to ignore it if you have good reason to.

Available library facilities include:

* spawning threads (both C and C++, as of Cython 3.1 only the C version is wrapped),
* atomic numeric types (both C and C++, wrapped for C++ in Cython 3 and C for Cython 3.1),
* mutexes (regular, timed and recursive) (both C and C++, wrapped in Cython 3.1+),
* shared mutexes providing many threads with read access or a single thread with write access (C++, wrapped in Cython 3.1+),
* condition variables, allowing one thread to wait until a condition is met (C and C++, as of Cython 3.1 only the C version is wrapped),
* ``call_once`` allowing an initialization function to be called safely from many threads (C and C++, wrapped in Cython 3.1+),
* semaphores, representing a way of counting resource ownership (C++ only, wrapped in Cython 3.1+),
* barriers and latches, which mark points where threads wait for each other (C++ only, wrapped in Cython 3.1+),
* promises and futures - a way of transmitting a single "result" between threads (C++ only, wrapped in Cython 3.1+),
* stop tokens, a convenient way of signaling a request to stop work (C++ only, wrapped in Cython 3.1+).

This list is non-exhaustive, and you can also use third-party libraries outside the language standard libraries for more options.

``cython.critical_section`` vs GIL
----------------------------------

Understanding what protection a ``critical_section`` provides is important to being able to use it safely, and it's also worth comparing it to the guarantees that the GIL provides.
Unfortunately some of this is very much an implementation detail of Python at the moment, so may be subject to change.

What is guaranteed to be safe for both of ``critical_section`` and the GIL (on non-freethreading builds) is reading and writing to ``cdef`` attributes of extension types::

    cdef class C:
        cdef object attr

    ...
    cdef C c_instance = C()
    with cython.critical_section(c_instance):
        c_instance.attr = something

    with cython.critical_section(c_instance):
        something = c_instance.attr

The first and most obvious place that both a ``critical_section`` and the GIL can be interrupted is a ``with nogil:`` block. This is hopefully absolutely obvious for the GIL but it's worth noting that a critical section only applies when the Python thread state is held.

In principle, both a ``critical_section`` and the GIL can be interrupted by executing arbitrary Python code. Arbitrary Python code can notably include the finalizers of any objects being destroyed. This means that reassigning a Python attribute can trigger arbitrary code (but typically only after the new value has been put in place). Additionally, triggering the GC can result in arbitrary code being executed. On Python <3.12 any Python memory allocation can trigger the GC so be wary of this if you aim to support multithreading in those versions (the first free-threaded interpreters were in Python 3.13 so the GC is harder to trigger from Cython code in them).

For example, in the following code (which uses the definition of ``C`` from the previous example)::

    with cython.critical_section(c_instance):
        c_instance.attr = c_instance.attr + 1

the addition gets expanded to something like

.. code-block:: C

    temp1 = c_instance->attr;
    // May trigger arbitrary Python code:
    //  1. If ``temp1`` is a class with an "__add__" method
    //  2.
    //     If the allocation of the result triggers the GC on Python <3.12
    temp2 = PyNumber_Add(temp1, const_1);
    // this section is hidden inside a ``Py_SETREF`` or similar
    {
        temp3 = c_instance->attr;
        c_instance->attr = temp2;
        // May trigger arbitrary Python code through finalizers
        Py_DECREF(temp3);
    }

(we show normal addition rather than in-place addition for ease of explanation, but the result is similar).

Practically there are some differences between ``critical_section`` and the GIL:

* Releasing the GIL happens at fairly regular intervals after a certain number of bytecode instructions.
* Interrupting a ``critical_section`` only happens if the interpreter hits a deadlock (i.e. some other operation tries to get a critical section on the same object).

The upshot is that if you're sure that no other code will have a reference to ``c_instance``, the example above is safe in a free-threaded interpreter (although arbitrary code may run, it won't interact with ``c_instance``) but unsafe in a GIL-enabled interpreter.

As an example of some practical results:

* if ``c_instance.attr`` is a Python integer then the code above *seems* to execute correctly (i.e. gives the expected answer consistently) in both free-threaded and GIL builds (although this was in a simplified test where no garbage was available to collect).
* if ``c_instance.attr`` is a ``fractions.Fraction`` object the code above consistently gives the expected answer in free-threaded builds but not in GIL builds. ``fractions.Fraction.__add__`` will execute arbitrary code, but not code that interferes with the ``critical_section``. Again, beware the caveat that our simplified test had no garbage to collect.

However, be wary of code like::

    cdef class C:
        cdef object attr

        cdef void add_one(self):
            with cython.critical_section(self):
                self.attr += 1

    ...
    c_instance = C()
    with cython.critical_section(c_instance):
        ...
        c_instance.add_one()
        ...


The nested ``critical_section`` blocks represent a potential deadlock so may interrupt the outer ``critical_section``.

Avoid ``cython.critical_section`` on non-extension types
--------------------------------------------------------

Python-attribute access does hit a deadlock and will interrupt the ``critical_section``. The code below will return incorrect results on both free-threading and GIL builds::

    # regular class
    class C:
        def __init__(self):
            self.attr = 1

    ...
    c_instance = C()
    with cython.critical_section(c_instance):
        c_instance.attr += 1
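
One possible alternative for regular Python classes, which the text above does not itself prescribe, is an explicit lock that is never released implicitly. The sketch below uses ``threading.Lock`` from the standard library; the ``Counter`` class and the ``hammer`` helper are hypothetical names introduced only for this illustration.

```python
import threading


class Counter:
    """Regular Python class whose attribute is guarded by an explicit lock."""

    def __init__(self):
        self.attr = 0
        self._lock = threading.Lock()

    def add_one(self):
        # The lock is held for the whole read-modify-write; unlike a
        # critical section, it is not released while the attribute is accessed.
        with self._lock:
            self.attr += 1


def hammer(counter, n_threads=8, n_increments=10_000):
    """Increment ``counter`` from several threads at once and return the total."""
    def worker():
        for _ in range(n_increments):
            counter.add_one()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.attr
```

The trade-off is the one already described for ``cython.pymutex``: an explicit lock gives a firmer guarantee but no built-in protection against deadlocks, so avoid re-acquiring it from code that already holds it.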