Multiprocessing in Python

Reuven M. Lerner
Mon, 04/16/2018 – 09:20

Python’s “multiprocessing” module feels like threads, but actually launches
processes.

Many people, when they start to work with Python, are excited to hear
that the language supports threading. And, as I’ve discussed in previous
articles,
Python does indeed support native-level threads
with an easy-to-use and convenient interface.

However, there is a downside to these threads, namely the global
interpreter lock (GIL), which ensures that only one thread executes
Python bytecode at a time. A thread releases the GIL whenever it
performs I/O, which means that although threads are a bad idea in
CPU-bound Python programs, they’re a good idea for I/O-bound ones.
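A quick way to see this in action: ten threads that each sleep (and
sleeping, like real I/O, releases the GIL) finish in roughly the time of
a single sleep. This sketch is mine, not from the original article, and
the 0.2-second delay is an arbitrary stand-in for a network or disk
call:

```python
#!/usr/bin/env python3

import threading
import time

def fake_io():
    time.sleep(0.2)   # sleeping releases the GIL, much like real I/O

start = time.time()
threads = [threading.Thread(target=fake_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

elapsed = time.time() - start
# ten overlapping 0.2s waits take ~0.2s, not ~2s
print("Elapsed: {0:.2f} seconds".format(elapsed))
```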

But even when you’re using lots of I/O, you might prefer to take full
advantage of a multicore system. And in the world of Python, that
means using processes.

In my article “Launching
External Processes in Python”
, I described how you can launch processes from within a Python
program, but those examples all involved running an external program in
a new process. When people talk about using processes for concurrency,
though, they usually mean something that works much as threads do,
running their own Python code, except that processes are more
independent of one another (and carry more overhead as well).

So, it’s something of a dilemma: do you launch easy-to-use
threads, even though they don’t really run in parallel? Or, do you
launch new processes, over which you have little control?

The answer is somewhere in the middle. The Python standard library
comes with “multiprocessing”, a module that gives the feeling of
working with threads, but that actually works with processes.

So in this article, I look at the “multiprocessing” library and describe some of the basic things it can do.

Multiprocessing Basics

The “multiprocessing” module is designed to look and feel like the
“threading” module, and it largely succeeds in doing so. For example,
the following is a simple example of a multithreaded program:


#!/usr/bin/env python3

import threading
import time
import random

def hello(n):
    time.sleep(random.randint(1,3))
    print("[{0}] Hello!".format(n))

for i in range(10):
    threading.Thread(target=hello, args=(i,)).start()

print("Done!")

In this example, there is a function (hello) that prints
“Hello!”
along with whatever argument it was passed. The main program then runs
a for loop that invokes hello ten times, each time in an independent thread.

But wait. Before the function prints its output, it first sleeps for a
few seconds. When you run this program, you then end up with output that
demonstrates how the threads are running in parallel, and not
necessarily in the order they are invoked:


$ ./thread1.py
Done!
[2] Hello!
[0] Hello!
[3] Hello!
[6] Hello!
[9] Hello!
[1] Hello!
[5] Hello!
[8] Hello!
[4] Hello!
[7] Hello!

If you want to be sure that “Done!” is printed after all the threads
have finished running, you can use join. To do that, you need to grab
each instance of threading.Thread, put it in a list, and then invoke
join on each thread:


#!/usr/bin/env python3

import threading
import time
import random

def hello(n):
    time.sleep(random.randint(1,3))
    print("[{0}] Hello!".format(n))

threads = [ ]
for i in range(10):
    t = threading.Thread(target=hello, args=(i,))
    threads.append(t)
    t.start()

for one_thread in threads:
    one_thread.join()

print("Done!")

The only difference in this version is that it puts each thread
object in a list (“threads”) and then iterates over that list,
joining the threads one by one.

But wait a second—I promised that I’d talk about
“multiprocessing”, not threading. What gives?

Well, “multiprocessing” was designed to give the feeling of working
with threads. This is so true that I can basically do a
search-and-replace on the program I just presented:

  • threading → multiprocessing
  • Thread → Process
  • threads → processes
  • thread → process

The result is as follows, with one small addition: I’ve wrapped the
launching code in an "if __name__ == '__main__'" check, which
“multiprocessing” requires on platforms (such as Windows) that start
each new process by spawning a fresh Python interpreter and
re-importing the file:


#!/usr/bin/env python3

import multiprocessing
import time
import random

def hello(n):
    time.sleep(random.randint(1,3))
    print("[{0}] Hello!".format(n))

if __name__ == "__main__":
    processes = [ ]
    for i in range(10):
        t = multiprocessing.Process(target=hello, args=(i,))
        processes.append(t)
        t.start()

    for one_process in processes:
        one_process.join()

    print("Done!")

In other words, you can run a function in a new process, with full
concurrency and the ability to take advantage of multiple cores, using
multiprocessing.Process. It works very much like a thread, including
the use of join on the Process objects you create. Each instance of
Process represents a process running on the computer, which
you can see
using ps, and which you can (in theory) stop with
kill.

What’s the Difference?

What’s amazing to me is that the API is almost identical, and yet two
very different things are happening behind the scenes. Let me try to
make the distinction clearer with another pair of examples.

Perhaps the biggest difference, at least to anyone programming with
threads and processes, is the fact that threads share global
variables. By contrast, processes are completely independent of one
another; one process cannot affect another’s variables. (In a future
article, I plan to look at how to get around that.)

Here’s a simple example of how a function running in a thread can
modify a global variable (note that I’m doing this only to prove a
point; if you really want to modify global variables from within a
thread, you should use a lock):


#!/usr/bin/env python3

import threading
import time
import random

mylist = [ ]

def hello(n):
    time.sleep(random.randint(1,3))
    mylist.append(threading.get_ident())   # bad in real code!
    print("[{0}] Hello!".format(n))

threads = [ ]
for i in range(10):
    t = threading.Thread(target=hello, args=(i,))
    threads.append(t)
    t.start()

for one_thread in threads:
    one_thread.join()

print("Done!")
print(len(mylist))
print(mylist)

The program is basically unchanged, except that it defines a new, empty
list (mylist) at the top. The function appends the current thread’s
identifier to that list and then prints its message.

Now, the way that I’m doing this isn’t so wise, because Python’s data
structures aren’t guaranteed to be thread-safe, and modifying a list
from within multiple threads without a lock eventually will catch up
with you. But the point here isn’t to demonstrate threads, but rather
to contrast them with processes.

When I run the above code, I get:


$ ./th-update-list.py
[0] Hello!
[2] Hello!
[6] Hello!
[3] Hello!
[1] Hello!
[4] Hello!
[5] Hello!
[7] Hello!
[8] Hello!
[9] Hello!
Done!
10
[123145344081920, 123145354592256, 123145375612928,
 123145359847424, 123145349337088, 123145365102592,
 123145370357760, 123145380868096, 123145386123264,
 123145391378432]

So, you can see that the global variable mylist is shared by the
threads, and that when one thread modifies the list, that change is
visible to all the other threads.
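As an aside, if you really do want multiple threads to update shared
data, the standard approach is to guard the mutation with a
threading.Lock. Here’s a minimal sketch of that pattern (mine, not
from the article):

```python
#!/usr/bin/env python3

import threading

mylist = [ ]
lock = threading.Lock()

def hello(n):
    with lock:                 # only one thread appends at a time
        mylist.append(threading.get_ident())

threads = [ ]
for i in range(10):
    t = threading.Thread(target=hello, args=(i,))
    threads.append(t)
    t.start()

for one_thread in threads:
    one_thread.join()

print(len(mylist))
```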

But if you change the program to use “multiprocessing”, the output
looks a bit different:


#!/usr/bin/env python3

import multiprocessing
import time
import random
import os

mylist = [ ]

def hello(n):
    time.sleep(random.randint(1,3))
    mylist.append(os.getpid())
    print("[{0}] Hello!".format(n))

if __name__ == "__main__":   # required where new processes are
                             # spawned, rather than forked
    processes = [ ]
    for i in range(10):
        t = multiprocessing.Process(target=hello, args=(i,))
        processes.append(t)
        t.start()

    for one_process in processes:
        one_process.join()

    print("Done!")
    print(len(mylist))
    print(mylist)

Aside from the switch to multiprocessing, the biggest change in this
version of the program is the use of os.getpid to get the current
process ID.

The output from this program is as follows:


$ ./proc-update-list.py
[0] Hello!
[4] Hello!
[7] Hello!
[8] Hello!
[2] Hello!
[5] Hello!
[6] Hello!
[9] Hello!
[1] Hello!
[3] Hello!
Done!
0
[]

Everything seems great until the end when it checks the value of
mylist. What happened to it? Didn’t the program append to it?

Sort of. The thing is, there is no single “it” in this program. Each
process created with “multiprocessing” gets its own copy of the global
mylist list. Each process thus appends to its own copy, which
disappears when that process exits.

This means the call to mylist.append succeeds, but it succeeds in
ten different processes. When the function returns and its process
exits, no trace of that process’s list remains. The mylist variable in
the main process stays empty, because no one ever appended to it.

Queues to the Rescue

In the world of threaded programs, even when you’re able to append to
the global mylist variable, you shouldn’t do it. That’s because
Python’s data structures aren’t guaranteed to be thread-safe. One data
structure that is designed for safe concurrent use is the Queue class:
it lives in the “queue” module for use with threads, and
“multiprocessing” provides its own, nearly identical, Queue for use
with processes.

Queues are FIFOs (that is, “first in, first out”). Whoever wants to add
data to a queue invokes the put method on it, and whoever wants to
retrieve data invokes the get method.
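A tiny single-threaded sketch (my own, not from the article) shows the
FIFO behavior directly; items come out in the order they went in:

```python
#!/usr/bin/env python3

import queue

q = queue.Queue()
for word in ["first", "second", "third"]:
    q.put(word)

while not q.empty():
    print(q.get())    # prints the words in their original order
```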

Now, queues in the world of multithreaded programs prevent issues
having to do with thread safety. But in the world of multiprocessing,
queues allow you to bridge the gap among your processes, sending data
back to the main process. For example:


#!/usr/bin/env python3

import multiprocessing
import time
import random
import os
from multiprocessing import Queue

q = Queue()

def hello(n, q):
    time.sleep(random.randint(1,3))
    q.put(os.getpid())
    print("[{0}] Hello!".format(n))

if __name__ == "__main__":
    processes = [ ]
    for i in range(10):
        # pass the queue to each child explicitly; this works even
        # when children are spawned, rather than forked
        t = multiprocessing.Process(target=hello, args=(i, q))
        processes.append(t)
        t.start()

    for one_process in processes:
        one_process.join()

    mylist = [ ]
    while not q.empty():
        mylist.append(q.get())

    print("Done!")
    print(len(mylist))
    print(mylist)

In this version of the program, I don’t create mylist until late in
the game. However, I create an instance of
multiprocessing.Queue
very early on. That Queue instance is designed to be shared across the
different processes. Moreover, it can handle any type of Python data
that can be stored using “pickle”, which basically means any data
structure.

In the hello function, the call to
mylist.append is replaced with
one to q.put, placing the current process’s ID number on the queue.
Each of the ten processes adds its own PID to the queue.

Note that this program takes place in stages. First it launches ten
processes; then they all do their work in parallel; then it waits
for them to complete (with join), so that it can process the
results. It pulls data off the queue, appends it to mylist, and then
prints the length and contents of the list.

The implementation of queues is so smooth and easy to work with,
it’s easy to forget that they rely on some serious
behind-the-scenes operating-system machinery to keep things
coordinated. It’s easy to think that you’re working with threading,
but that’s precisely the point of multiprocessing; it might feel like
threads, but each process runs separately. This gives you true
parallelism within your program, something Python threads cannot offer.

Conclusion

Threading is easy to work with, but threads don’t truly execute in
parallel. Multiprocessing is a module that provides an API almost
identical to that of threads, while actually running your code in
separate processes. This doesn’t paper over all of the differences
between threads and processes, but it goes a long way toward keeping
those differences manageable.
