Python: For Loops vs. Vectorization. Make Your Code Run 2000x Faster

Python has a bad reputation for being slow compared to optimized C. But compared to C, Python is very easy to write, flexible, and has a wide variety of uses. So how do you combine the flexibility of Python with the speed of C? This is where the packages Pandas and NumPy come in. If you have done any sort of data analysis or machine learning in Python, I'm pretty sure you have used these packages. They make it very convenient to deal with huge datasets.

In this post we will look at just how fast you can process huge datasets using Pandas and NumPy, and how well they perform compared to other commonly used looping methods in Python. We will be testing the following methods:

  1. Regular for loops
  2. Looping with iterrows()
  3. Using apply()
  4. Vectorization with Pandas and Numpy arrays

To compare these methods, we will use a function that computes the distance between two coordinates on the surface of the Earth. The code is as follows.

import numpy as np

def calculate_distance(lt1, ln1, lt2, ln2):
    # Approximate radius of the Earth in kilometres
    R = 6373.0

    # Convert all coordinates from degrees to radians
    lat1 = np.deg2rad(lt1)
    lon1 = np.deg2rad(ln1)
    lat2 = np.deg2rad(lt2)
    lon2 = np.deg2rad(ln2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula
    a = (np.sin(dlat / 2)**2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2)
    c = 2 * np.arcsin(np.sqrt(a))
    distance = R * c
    return distance
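As a quick sanity check, the function can be called on points whose distance is easy to work out by hand. The sketch below restates the function so that it runs on its own; the test points are illustrative assumptions and are not part of the benchmark.

```python
import numpy as np

def calculate_distance(lt1, ln1, lt2, ln2):
    R = 6373.0  # approximate Earth radius in km, same value as above
    lat1, lon1 = np.deg2rad(lt1), np.deg2rad(ln1)
    lat2, lon2 = np.deg2rad(lt2), np.deg2rad(ln2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return R * 2 * np.arcsin(np.sqrt(a))

# The same point twice: the distance should be zero.
same_point = calculate_distance(11.111111, 121.222222, 11.111111, 121.222222)

# One degree of longitude along the equator should be R * pi / 180 km.
one_degree = calculate_distance(0.0, 0.0, 0.0, 1.0)
expected_one_degree = 6373.0 * np.pi / 180

print(same_point, one_degree, expected_one_degree)
```

If the two printed values for one degree agree, the radian conversion and the formula are wired up correctly.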

We are going to use a function to generate Pandas DataFrames filled with random coordinates, with 10,000, 100,000 and 1,000,000 rows, to compare the efficiency of these methods.

import numpy as np
import pandas as pd

def generate_data(rows):
    # Random coordinates between roughly 10 and 100 degrees
    df = pd.DataFrame(columns=['lat', 'lon'])
    df['lat'] = np.random.randint(9999999, 99999999, rows) / 1000000
    df['lon'] = np.random.randint(9999999, 99999999, rows) / 1000000
    return df

rows = 10000
df = generate_data(rows)

Now that everything has been set up, let's start the test. The results shown below are for processing 1,000,000 rows of data.
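The post does not say how the timings were measured. One possible way, sketched here under that assumption, is a small helper built on time.perf_counter. The name time_call and the repeats parameter are hypothetical and do not come from the original benchmark.

```python
import time

def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

# Example: time a trivial workload (summing 100,000 integers).
elapsed = time_call(sum, range(100_000))
print(f"{elapsed:.6f} s")
```

Taking the best of several runs reduces noise from other processes; for the slow loops a single run is probably enough.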

1. Regular for loop:

latitude = 11.111111
longitude = 121.222222
for i in range(0, len(df)):
    d = calculate_distance(latitude, longitude,
                           df.iloc[i]['lat'], df.iloc[i]['lon'])

The regular for loop takes 187 seconds to run 1,000,000 rows through the calculate_distance function. To some of you this might not seem like a lot of time to process 1 million rows. Let us make this our benchmark for comparing speed.

2. Using iterrows():

latitude = 11.111111
longitude = 121.222222
for index, row in df.iterrows():
    calculate_distance(latitude, longitude, row['lat'], row['lon'])

iterrows() is a better way to loop through a Pandas DataFrame. Indexing into a DataFrame row by row inside a regular for loop is very inefficient. Using iterrows(), the entire dataset was processed in 65.5 seconds, almost 3 times faster than the regular for loop. Although iterrows() still visits every row just like the for loop, it is designed for iterating over DataFrames, hence the improvement in speed. Part of the difference is that the for loop above calls df.iloc[i] twice per row, building a new row object each time, while iterrows() builds it once.
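The loop above throws its results away. A minimal sketch of keeping them is shown below; the three-row DataFrame and the new "distance" column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def calculate_distance(lt1, ln1, lt2, ln2):
    R = 6373.0  # approximate Earth radius in km
    lat1, lon1 = np.deg2rad(lt1), np.deg2rad(ln1)
    lat2, lon2 = np.deg2rad(lt2), np.deg2rad(ln2)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * np.arcsin(np.sqrt(a))

# Tiny illustrative dataset (not the benchmark data).
df = pd.DataFrame({"lat": [11.111111, 12.0, 45.5],
                   "lon": [121.222222, 121.0, 90.25]})

latitude = 11.111111
longitude = 121.222222

distances = []
for index, row in df.iterrows():
    # `row` is a pandas Series built for this one row; constructing it
    # is part of the per-row cost of iterrows().
    distances.append(
        calculate_distance(latitude, longitude, row["lat"], row["lon"]))

df["distance"] = distances
print(df)
```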

3. Using apply():

latitude = 11.111111
longitude = 121.222222
df.apply(lambda row: calculate_distance(
    latitude, longitude, row['lat'], row['lon']), axis=1)

iterrows() beats a regular for loop, but with the apply() method you do not write the loop yourself at all. This method applies a function along a specific axis (meaning either rows or columns) of a DataFrame. With axis=1, Pandas still calls the function once per row internally, but in this test that was considerably more efficient than looping explicitly. The time taken using this method is just 6.8 seconds, 27.5 times faster than a regular for loop.
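With axis=1, apply() returns a Series with one value per row, so the result can be assigned straight to a new column. The sketch below shows this on a small DataFrame; the sample rows and the "distance" column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def calculate_distance(lt1, ln1, lt2, ln2):
    R = 6373.0  # approximate Earth radius in km
    lat1, lon1 = np.deg2rad(lt1), np.deg2rad(ln1)
    lat2, lon2 = np.deg2rad(lt2), np.deg2rad(ln2)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * np.arcsin(np.sqrt(a))

# Tiny illustrative dataset (not the benchmark data).
df = pd.DataFrame({"lat": [11.111111, 12.0, 45.5],
                   "lon": [121.222222, 121.0, 90.25]})

latitude = 11.111111
longitude = 121.222222

# axis=1 passes each row to the lambda as a Series.
df["distance"] = df.apply(
    lambda row: calculate_distance(
        latitude, longitude, row["lat"], row["lon"]),
    axis=1)
print(df)
```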

4. Using Vectorization on Pandas and Numpy arrays:

latitude = 11.111111
longitude = 121.222222
calculate_distance(
    latitude, longitude,
    df['lat'].values, df['lon'].values)

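Now this is where the game completely changes. Vectorization is by far the most efficient way to process huge datasets in Python. Instead of calling the function once per row, we pass entire NumPy arrays to it, so each NumPy operation runs over all rows in compiled code. Using vectorization, 1,000,000 rows of data were processed in 0.0765 seconds, roughly 2,460 times faster than a regular for loop.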
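It is worth confirming that the vectorized call gives the same numbers as a row-by-row loop. The following is a minimal sketch of such a check; the seeded 1,000-row dataset is an assumption chosen so the check runs quickly, and is not the benchmark data.

```python
import numpy as np
import pandas as pd

def calculate_distance(lt1, ln1, lt2, ln2):
    R = 6373.0  # approximate Earth radius in km
    lat1, lon1 = np.deg2rad(lt1), np.deg2rad(ln1)
    lat2, lon2 = np.deg2rad(lt2), np.deg2rad(ln2)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * np.arcsin(np.sqrt(a))

# Small seeded dataset so the comparison is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({"lat": rng.uniform(-90, 90, 1000),
                   "lon": rng.uniform(-180, 180, 1000)})

latitude = 11.111111
longitude = 121.222222

# Vectorized: one call, whole columns passed in as NumPy arrays.
vectorized = calculate_distance(
    latitude, longitude, df["lat"].values, df["lon"].values)

# Reference: one call per row.
looped = np.array([
    calculate_distance(latitude, longitude, lat, lon)
    for lat, lon in zip(df["lat"], df["lon"])
])

print(np.allclose(vectorized, looped))
```

The same function body serves both cases because every NumPy call in it accepts either a scalar or an array.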

These tests were also run with 10,000 and 100,000 rows of data. The results are as follows; relative speed is computed from the 1,000,000-row timings.

+------------------+-------------+--------------+----------------+----------------+
| Method           | 10,000 rows | 100,000 rows | 1,000,000 rows | Relative speed |
+------------------+-------------+--------------+----------------+----------------+
| Regular for loop | 1,820 ms    | 19 s         | 187 s          | 1x             |
+------------------+-------------+--------------+----------------+----------------+
| iterrows()       | 600 ms      | 6.4 s        | 65.5 s         | 2.9x           |
+------------------+-------------+--------------+----------------+----------------+
| apply()          | 60 ms       | 0.69 s       | 6.8 s          | 27.5x          |
+------------------+-------------+--------------+----------------+----------------+
| Vectorization    | 2 ms        | 0.007 s      | 0.076 s        | 2460x          |
+------------------+-------------+--------------+----------------+----------------+
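The relative speed column can be reproduced from the 1,000,000-row timings by dividing the for-loop time by each method's time. A short sketch of that arithmetic, using the figures from the table:

```python
# Timings in seconds for 1,000,000 rows, taken from the table above.
timings_seconds = {
    "Regular for loop": 187.0,
    "iterrows()": 65.5,
    "apply()": 6.8,
    "Vectorization": 0.076,
}

baseline = timings_seconds["Regular for loop"]

# Speedup = baseline time / method time.
speedups = {method: baseline / seconds
            for method, seconds in timings_seconds.items()}

for method, speedup in speedups.items():
    print(f"{method}: {speedup:.1f}x")
```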

SUMMARY

As we proceed further into the twenty-first century, we are going through an explosion in the size of data. Traditional methods like for loops struggle to process this much data, especially in a slow programming language like Python. Vectorization or similar methods have to be used in order to handle this load more efficiently. Developers who use Python-based frameworks like Django can use these methods to optimize their existing backend operations.

There are also other methods, like writing a custom Cython routine, but that is more complicated and in most cases not worth the effort. Vectorization should usually be the first choice. However, there are a few cases that cannot be vectorized in obvious ways. Furthermore, on a very small DataFrame, other methods may yield better performance.
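One case that may not look vectorizable at first is per-row branching. As a sketch, assume a hypothetical rule (not part of the benchmark above) that labels each distance "near" or "far" against a 1,000 km threshold. np.where can replace the if/else inside the loop:

```python
import numpy as np

# Illustrative distances in km (made-up values).
distances = np.array([12.5, 850.0, 1000.0, 4300.2])
threshold_km = 1000.0  # hypothetical cut-off

# Loop version: one Python-level branch per element.
labels_loop = []
for d in distances:
    if d < threshold_km:
        labels_loop.append("near")
    else:
        labels_loop.append("far")

# Vectorized version: the comparison yields a boolean array, and
# np.where picks "near" or "far" for every element in one call.
labels_vec = np.where(distances < threshold_km, "near", "far")

print(labels_loop)
print(labels_vec.tolist())
```

Logic that depends on the result of the previous row is harder to express this way, and that is where loops or compiled routines may still be needed.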

Hello fellow Developers, my name's Pranoy. I'm a 24 year old programmer living in Kerala, India.
