Let’s learn how to perform memory-efficient operations in Pandas with large datasets.
Preparation
As this tutorial centers on the Pandas package, you should have it installed. We will also use the NumPy package, so install them both.
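If you manage packages with pip, the following command installs both (adjust it to whatever environment manager you use):

pip install pandas numpy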
Then, let’s get into the central part of the tutorial.
Perform Memory-Efficient Operations with Pandas
Pandas is typically not known for processing large datasets, as memory-intensive operations with the package can take too much time or even swallow all of your RAM. However, there are ways to improve the efficiency of Pandas operations.
In this tutorial, we will walk you through ways to enhance your experience with large datasets in Pandas.
First, try loading the dataset with memory-optimization parameters: specify memory-friendly data types up front and drop any unnecessary columns at read time.
import pandas as pd

# Read only the columns you need and request a compact 32-bit integer dtype at load time
df = pd.read_csv('some_large_dataset.csv', low_memory=True,
                 dtype={'col1': 'int32'}, usecols=['col1', 'col2'])
Converting integer and float columns to the smallest type that can hold their values helps reduce the memory footprint. Casting categorical columns with a small number of unique values to the category type also helps, as does keeping fewer columns overall.
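As a minimal sketch of this downcasting (the column names below are placeholders for your own data), pd.to_numeric with the downcast argument picks the smallest type that fits the values, and memory_usage(deep=True) lets you verify the savings:

# Downcast numeric columns to the smallest type that fits their values
df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
df['float_column'] = pd.to_numeric(df['float_column'], downcast='float')

# Cast a low-cardinality string column to the memory-friendly category type
df['categorical_column'] = df['categorical_column'].astype('category')

# Check per-column memory usage before and after the conversions
print(df.memory_usage(deep=True))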
Next, we can process the data in chunks to avoid loading everything into memory at once; working through the dataset iteratively is far more memory-efficient. For example, say we want the mean of a column, but the dataset is too big to load whole. We can process 100,000 rows at a time and combine the results.
chunk_results = []

def column_stats(chunk):
    # Return the chunk's sum and row count so the final mean weights every chunk correctly
    return chunk['target_column'].sum(), len(chunk)

chunksize = 100000
for chunk in pd.read_csv('some_large_dataset.csv', chunksize=chunksize):
    chunk_results.append(column_stats(chunk))

# The last chunk may hold fewer rows, so divide the accumulated sum by the
# accumulated count rather than averaging the per-chunk means
final_result = sum(s for s, n in chunk_results) / sum(n for s, n in chunk_results)
Additionally, avoid using the .apply method with lambda functions, as it can be slow and memory-intensive. It’s better to use vectorized operations or, where .apply is unavoidable, to pass it a normal named function.
# A vectorized operation works on the whole column at once, with no Python-level loop
df['new_column'] = df['existing_column'] * 2
For conditional operations in Pandas, it’s also faster to use np.where rather than a lambda function with .apply.
import numpy as np

# np.where evaluates the condition on the whole column at once
df['new_column'] = np.where(df['existing_column'] > 0, 1, 0)
Then, using inplace=True in many Pandas operations can be more memory-efficient than assigning the result back to the DataFrame variable, since assigning it back creates a separate DataFrame before it is bound to the same name.
# Modify the DataFrame in place instead of assigning the result back
df.drop(columns=['column_to_drop'], inplace=True)
Lastly, filter the data early, before any heavy operations, if possible. This limits the amount of data we need to process.
# Keep only the rows you need; 'threshold' is a placeholder value you define
df = df[df['filter_column'] > threshold]
Try to master these tips to improve your Pandas experience with large datasets.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.