While working on a recent project, I realized that heavy processes for Python, like scraping, could be made easier through Python's multiprocessing library. The documentation and community content around multiprocessing is fairly sparse, so I wanted to share some of my learnings through an example project of scraping the PokéAPI. Below I wrote a bit of code that pulls all of the available Pokémon while minding the API's limit of 100 calls per 60 seconds. You'll see that the iteration is fairly slow, as there are 964 Pokémon the API returns.

Before Multiprocessing

I simply create three functions: get_number_pokemon, get_pokemon, and get_all_pokemon. The first, get_number_pokemon, returns all of the URLs from the API for the next function, get_all_pokemon, to iterate over; get_all_pokemon then pulls the information together by using get_pokemon to fetch each URL's response data.

```python
import requests as req
import timeit
import time
import pandas as pd
from IPython.display import Image, HTML
import random
from tqdm import tqdm
from ratelimit import limits, sleep_and_retry

## Rate limit to help with overcalling
## pokemon api is 100 calls per 60 seconds max
@sleep_and_retry
@limits(calls=100, period=60)
def call_api(url):
    response = req.get(url)
    if response.status_code == 404:
        return 'Not Found'
    if response.status_code != 200:
        print('here', response.status_code, url)
        raise Exception('API response: {}'.format(response.status_code))
    return response

API_POKEMON = 'https://pokeapi.co/api/v2/pokemon/{pokemon}'

def get_number_pokemon():
    res = req.get(API_POKEMON.format(pokemon=''))
    number_pokemon = res.json()['count']
    res_url = call_api(API_POKEMON.format(pokemon='?offset=0&limit={limit}'.format(limit=str(number_pokemon))))
    pokemon_links_values = [link['url'] for link in res_url.json()['results']]
    return pokemon_links_values

def get_pokemon(link=''):
    info = None
    resolved = False
    try:
        # Keep retrying the same URL until we get a usable response
        while not resolved:
            res = None
            tooManyCalls = False
            try:
                res = call_api(link)
                if res == 'Not Found':
                    resolved = True
                    break
            except Exception as e:
                print(e)
                if 'too many calls' in str(e):
                    tooManyCalls = True
            if tooManyCalls:
                time.sleep(60)
            elif res.status_code < 300:
                pokemon_info = res.json()
                info = {
                    'Image': pokemon_info['sprites']['front_default'],
                    'id': pokemon_info['id'],
                    'name': pokemon_info['name'],
                    'height': pokemon_info['height'],
                    'base_experience': pokemon_info['base_experience'],
                    'weight': pokemon_info['weight'],
                    'species': pokemon_info['species']['name']
                }
                resolved = True
            elif res.status_code == 429:
                time.sleep(60)
            else:
                sleep_val = random.randint(1, 10)
                time.sleep(sleep_val)
    except Exception as e:
        print(e)
        return info
    finally:
        return info

def get_all_pokemon(links_pokemon=None):
    list_pokemon = []
    for link in tqdm(links_pokemon):
        pokemon = get_pokemon(link)
        if pokemon != None:
            list_pokemon.append(pokemon)
        time.sleep(0.3)
    pd.set_option('display.max_colwidth', None)
    df_pokemon = pd.DataFrame(list_pokemon)
    return df_pokemon

def image_formatter(im):
    return f'<img src="{im}">'

def main_pokemon_run():
    links_pokemon = get_number_pokemon()
    df_pokemon = get_all_pokemon(links_pokemon=links_pokemon)
    df_pokemon.sort_values(['id'], inplace=True)
    return df_pokemon, HTML(df_pokemon.iloc[0:4].to_html(formatters={'Image': image_formatter}, escape=False))
```

As you can see from the output below, the API call took roughly 9 minutes and 30 seconds at a rate of 1.69 iterations per second. We can improve this drastically by adding multiprocessing.
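The iteration rate above comes straight from tqdm's progress bar. If you want to reproduce the overall timing yourself, a minimal sketch like the following works; the use of time.perf_counter here is my addition, not necessarily how the original notebook measured it:

```python
import time

start = time.perf_counter()
df_pokemon, preview = main_pokemon_run()   # returns the DataFrame and an HTML preview
elapsed = time.perf_counter() - start
print(f"Fetched {len(df_pokemon)} Pokémon in {elapsed / 60:.1f} minutes")
```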
With Multiprocessing

Now with multiprocessing we can turn the get_all_pokemon step into a multiprocessing pool function. We use the built-in multiprocessing function cpu_count() to define the number of workers needed. Since we want to get this done as quickly as possible, using cpu_count() - 1 lets the program run optimally without taking up all of our CPUs, and max(cpu_count() - 1, 1) ensures the worker count can never be 0. Next, I use multiprocessing's built-in Manager to share a global list of our returned Pokémon across processes. While you can use the return value of the multiprocessing map directly, a manager is a nice way to use multiprocessing for other, more complicated tasks. Check the documentation for the available multiprocessing Manager types.

Another step needed is to create a partial function. Because pool.imap passes a single iterable to the worker, we need to bind the other parameters to our get_pokemon_multiprocess function before we iterate over it. functools.partial allows us to do this by creating a partial function to send to the pool map. I use tqdm in this example to see the progress of the pool; I also show in the code the same call without it.

Once the pool has run, we need to ensure that it is both closed and joined. You'll see I do this immediately after the loop and again in the finally block, so that even if there is an error, the process workers are closed and terminated. The join call is necessary so the worker processes are joined back together after pooling; it also ensures the manager has completely finished collecting the results.

```python
import requests as req
import timeit
import time
import pandas as pd
from IPython.display import Image, HTML
import random
from tqdm import tqdm
from ratelimit import limits, sleep_and_retry
from multiprocessing import Pool, Manager, cpu_count
from functools import partial

API_POKEMON = 'https://pokeapi.co/api/v2/pokemon/{pokemon}'

# To see how it ran
# def infoDebugger(title):
#     print(title)
#     print('module name:', __name__)
#     if hasattr(os, 'getppid'):
#         print('parent process:', os.getppid())
#     print('process id:', os.getpid())

@sleep_and_retry
@limits(calls=100, period=60)
def call_api(url):
    response = req.get(url)
    if response.status_code == 404:
        return 'Not Found'
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    return response

# https://docs.python.org/2/library/multiprocessing.html
def get_number_pokemon():
    res = req.get(API_POKEMON.format(pokemon=''))
    number_pokemon = res.json()['count']
    res_url = call_api(API_POKEMON.format(pokemon='?offset=0&limit={limit}'.format(limit=str(number_pokemon))))
    pokemon_links_values = [link['url'] for link in res_url.json()['results']]
    return pokemon_links_values

def get_pokemon_multiprocess(listManager=None, links_pokemon=None, process=0):
    # print('Called Pokemon', process)
    link = links_pokemon[process]
    info = None
    resolved = False
    # print(link)
    try:
        while not resolved:
            res = None
            tooManyCalls = False
            try:
                res = call_api(link)
                if res == 'Not Found':
                    resolved = True
                    break
            except Exception as e:
                print(e)
                if 'too many calls' in str(e):
                    tooManyCalls = True
            if tooManyCalls:
                time.sleep(60)
            elif res.status_code < 300:
                pokemon_info = res.json()
                info = {
                    'Image': pokemon_info['sprites']['front_default'],
                    'id': pokemon_info['id'],
                    'name': pokemon_info['name'],
                    'height': pokemon_info['height'],
                    'base_experience': pokemon_info['base_experience'],
                    'weight': pokemon_info['weight'],
                    'species': pokemon_info['species']['name']
                }
                resolved = True
            elif res.status_code == 429:
                print(res.status_code)
                time.sleep(60)
            else:
                print(res.status_code)
                sleep_val = random.randint(1, 10)
                time.sleep(sleep_val)
    except Exception as e:
        print(e)
    finally:
        # Append to the shared manager list so the parent process sees the result
        if info != None:
            listManager.append(info)
        time.sleep(0.5)
        return
```
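Before wiring everything together: if the Manager/partial/imap combination above feels abstract, here is a stripped-down sketch of the same pattern with a toy worker. The names (square_into and friends) are made up purely for illustration, and it is meant to be run as a plain script so the worker function can be pickled by the pool:

```python
from functools import partial
from multiprocessing import Pool, Manager, cpu_count

def square_into(shared_list, values, index):
    # Worker: look up its input by index and append the result to the shared list
    shared_list.append(values[index] ** 2)

if __name__ == '__main__':
    values = list(range(10))
    manager = Manager()
    shared_list = manager.list()
    # Bind the shared list and the input array; imap only needs to supply the index
    worker = partial(square_into, shared_list, values)
    pool = Pool(max(cpu_count() - 1, 1))
    try:
        for _ in pool.imap(worker, range(len(values))):
            pass
    finally:
        pool.close()
        pool.join()
    print(sorted(shared_list))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```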
With those helpers in place, the driver function creates the worker pool, binds the manager list and the list of URLs into a partial function, and feeds the pool an index range to iterate over:

```python
def image_formatter(im):
    return f'<img src="{im}">'

def main_pokemon_run_multiprocessing():
    ## cannot be 0, so max(NUMBER,1) solves this
    workers = max(cpu_count() - 1, 1)
    ## Need a manager to help get the values async, the values will be updated after join
    manager = Manager()
    listManager = manager.list()
    ## create the pool
    pool = Pool(workers)
    try:
        links_pokemon = get_number_pokemon()
        part_get_clean_pokemon = partial(get_pokemon_multiprocess, listManager, links_pokemon)
        # could do this; the below is to visualize the rate, success, etc.
        # pool.imap(part_get_clean_pokemon, list(range(0, len(links_pokemon))))
        # using tqdm to see progress, imap works
        for _ in tqdm(pool.imap(part_get_clean_pokemon, list(range(0, len(links_pokemon)))), total=len(links_pokemon)):
            pass
        pool.close()
        pool.join()
    finally:
        pool.close()
        pool.join()
    pokemonList = list(listManager)
    df_pokemon = pd.DataFrame(pokemonList)
    df_pokemon.sort_values(['id'], inplace=True)
    return df_pokemon, HTML(df_pokemon.iloc[0:4].to_html(formatters={'Image': image_formatter}, escape=False))
```

You'll see in the snapshot below that the pooling worked successfully and improved the runtime drastically, from roughly 9 minutes and 30 seconds down to about 1 minute, and the iteration rate from 1.69 to 14.97 iterations per second.

Hope this was helpful. To see the full code, check out my notebook on GitHub. Let me know if you come up with any other useful applications for this, and please share any feedback on my sample if you have any.

Greyson Nevins-Archer