Pandas Memory Optimization
We frequently come across the Pandas Python package, which provides a rich collection of tools for data wrangling that are crucial for data analytics. While the pandas library provides the majority of the tools required for daily data analytics tasks, it is not well optimized in some cases.
Memory optimization is undoubtedly one of the primary limitations of the pandas library. While pandas developers have released several upgrades over the years, there is still a long way to go.
So what is the problem, you may ask?
The issue starts with Python. Unlike most other programming languages such as C, C++, and Java, Python does not require an explicit data type declaration for variables, so it is left to the Python interpreter to decide which data type a variable should be given. Most of the time, the interpreter assigns a large amount of storage (64 bits) to each numeric variable, which gets the job done, but what if you don't need all 64 bits to store your data?
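To see this in action, here is a minimal sketch of how pandas defaults to 64-bit numeric types and how downcasting to a smaller type shrinks the footprint (the variable names are illustrative, not from the article):

```python
import pandas as pd

# pandas stores these integers as int64 (8 bytes per value) by default,
# even though every value would fit comfortably in a single byte.
s = pd.Series([1, 2, 3])
print(s.dtype)                         # int64
print(s.memory_usage(index=False))     # 24 bytes of data (3 values * 8 bytes)

# pd.to_numeric with downcast="integer" picks the smallest integer type
# that can hold the values -- here int8, 1 byte per value.
small = pd.to_numeric(s, downcast="integer")
print(small.dtype)                     # int8
print(small.memory_usage(index=False)) # 3 bytes (3 values * 1 byte)
```

The same idea applies to floats (`downcast="float"` yields `float32` where precision allows); on a DataFrame with millions of rows this kind of per-column downcasting can cut memory use several-fold.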
The situation worsens when you have a huge dataset (say, 1 GB+) and limited computational resources (for example, the free tier of the Google Colab platform). Users have frequently reported that their Google Colab session fails in the middle of a computation and prompts them to upgrade to Colab Pro. But do you really need the extra resources, or is there another way to reduce the amount of resources you use?
The answer is yes. There are many ways in which you can optimize the code to reduce…