Data Platform 101: Data Partitioning
With the global datasphere growing exponentially over the past 10 years, with an estimated 181 zettabytes of data created in 2025, data storage and processing are becoming more challenging every day.
Data Partitioning comes as the go-to solution for many data needs, from database management to data engineering performance. Data partitioning has allowed data professionals to unlock storage and processing methods that can keep up with the data size of the modern world.
Data Partitioning
In this section, we will explain what Data Partitioning is, why it is needed, and its different types.
Understanding Data Partitioning
The main issue Data Partitioning solves is fast access to large amounts of data. In the following diagrams, we will walk through a simplified use case of data partitioning to better understand what it looks like.
Database with no partitioning:
Usage example*:
- t=12:00 — us-west service fetches all users living in western US states
- t=12:02 — us-west service receives 2 users
- t=12:03 — us-east service fetches all users living in eastern US states
- t=12:04 — Europe service fetches all users living in Europe
- t=12:05 — us-east service receives 2 users
- t=12:06 — Europe service receives 1 user
note*: Assuming the database takes 1 minute to read 1 record and ignoring all other operational latencies.
We can see here how the Europe service had to wait longer than necessary because the database was busy fulfilling another request before it got to the Europe service’s request.
Database with 3 partitions:
Let’s look at the same requests from a 3-partition perspective:
Usage example*:
- t=12:00 — us-west service fetches all users living in western US states
- t=12:02 — us-west service receives 2 users
- t=12:03 — us-east service fetches all users living in eastern US states
- t=12:04 — Europe service fetches all users living in Europe
- t=12:05 — us-east service receives 2 users / Europe service receives 1 user
The us-east database partition was still busy processing the us-east service request when the Europe request came in, but this had no impact on the Europe service’s performance, as it collected its data directly from the Europe partition.
Data Partitioning allows immediate access to data matching certain criteria without needing to search for it: those criteria define the partition itself, and the column (or columns) they are based on is known as the partition key.
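The timing walk-through above can be sketched as a toy simulation. The region names and record counts follow the example (one minute per record, matching the clock times shown); the functions themselves are illustrative stand-ins, not a real database.

```python
def serial_finish_times(requests):
    """Single database: requests queue behind each other."""
    finish, clock = {}, 0
    for arrival, region, records in requests:
        start = max(clock, arrival)   # wait for the previous request
        clock = start + records       # 1 minute per record
        finish[region] = clock
    return finish

def partitioned_finish_times(requests):
    """One partition per region: requests never queue across regions."""
    return {region: arrival + records for arrival, region, records in requests}

# (arrival in minutes after 12:00, region, records to read)
requests = [(0, "us-west", 2), (3, "us-east", 2), (4, "europe", 1)]

print(serial_finish_times(requests))       # europe finishes at t=6 (12:06)
print(partitioned_finish_times(requests))  # europe finishes at t=5 (12:05)
```

With a single database, the Europe request waits behind us-east; with per-region partitions, its finish time depends only on its own arrival and record count.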
The Need for Data Partitioning
Data Partitioning offers a lot of power to data professionals by making massive amounts of data accessible. It doesn’t come at a cheap price, however. Data partitioning poses many challenges, from load balancing to data shuffling and synchronisation (not covered in this 101 article).
One might wonder here: why not triple the resources and computation in the database instead of going through all the trouble of partitioning? While this approach has served us well in the past, computing power is not catching up fast enough with data needs.
We could probably use this approach (called vertical scaling) at first, tripling memory and using a faster CPU, but at some point we will need to scale by 100 or 1,000 times.
With the exponential explosion of data in the world, even the fastest processors and supercomputers cannot handle heavy computations on petabytes of data every day! Parallel computing, unlocked by data partitioning, is the most viable scalability option that can withstand data growth over the years.
The Types of Data Partitioning
Horizontal Partitioning
The most common partitioning mechanism is horizontal partitioning. Using business knowledge and familiarity with the data distribution in a table, the Data Engineer can identify a column (or multiple columns) to partition the table by. Each value in that column will represent a unique partition that can be accessed separately from the rest of the table.
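A minimal sketch of the idea, grouping rows by the value of a chosen partition key (the column name and sample rows here are invented for illustration):

```python
from collections import defaultdict

rows = [
    {"user_id": 1, "name": "Ana",  "region": "us-west"},
    {"user_id": 2, "name": "Bob",  "region": "us-east"},
    {"user_id": 3, "name": "Cleo", "region": "europe"},
    {"user_id": 4, "name": "Dan",  "region": "us-west"},
]

# Horizontal partitioning: one partition per distinct partition-key value.
partitions = defaultdict(list)
for row in rows:
    partitions[row["region"]].append(row)

# Reading a region now touches only its own partition:
print([r["name"] for r in partitions["us-west"]])  # ['Ana', 'Dan']
```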
Hash-Based Partitioning
A variation of horizontal partitioning that eliminates the need for business knowledge is hash-based partitioning. This method adds a new column to the table, computed by calling a hashing function (SHA-2, for example) on the values of each data record. The hashed values can then be split evenly into the desired number of partitions by taking each hashed value modulo the number of partitions.
Here’s an example table and how we can generate the hash_partition column for it:
SELECT UserID, Name, Email FROM users;
SELECT UserID, Name, Email, CONV(SUBSTRING(SHA2(CONCAT(UserID, Name, Email), 256), 1, 16), 16, 10) % 3 AS hash_partition FROM users;
While this method eliminates the need for business knowledge to achieve a balanced partitioning, it sacrifices operational performance as the data is partitioned arbitrarily and cross-partition operations will occur more often.
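The same hash-modulo idea can be sketched in Python: hash the record’s fields with SHA-256, keep the first 16 hex characters as an integer (mirroring the SQL’s SUBSTRING), and take it modulo the number of partitions. The sample record is invented for illustration.

```python
import hashlib

def hash_partition(record_fields, num_partitions=3):
    """Assign a record to a partition via SHA-256 modulo num_partitions."""
    concatenated = "".join(str(f) for f in record_fields)
    digest = hashlib.sha256(concatenated.encode()).hexdigest()
    # Like SUBSTRING(..., 1, 16) in the SQL: first 16 hex chars as an integer.
    return int(digest[:16], 16) % num_partitions

print(hash_partition([42, "Ana", "ana@example.com"]))  # a value in 0..2
```

Because the hash is deterministic, the same record always lands in the same partition, while unrelated records spread roughly evenly across the three.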
DateTime Partitioning
Date and time partitioning is a variation of horizontal partitioning where the column chosen for setting partitions is a date or time representation. This format allows data to be accessed by day (or hour, month, and so on), which can be very helpful in accessing and analysing transactional data, time series, logs, or change data capture.
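A sketch of date-based partitioning as commonly laid out in data lakes: each record lands under a dt=YYYY-MM-DD partition derived from its timestamp (the event records here are invented for illustration).

```python
from collections import defaultdict
from datetime import datetime

events = [
    {"event": "login",  "ts": "2024-05-01T09:30:00"},
    {"event": "logout", "ts": "2024-05-01T17:05:00"},
    {"event": "login",  "ts": "2024-05-02T08:55:00"},
]

# Partition each event by the calendar day of its timestamp.
by_day = defaultdict(list)
for e in events:
    day = datetime.fromisoformat(e["ts"]).date().isoformat()
    by_day[f"dt={day}"].append(e)   # e.g. partition "dt=2024-05-01"

print(sorted(by_day))  # ['dt=2024-05-01', 'dt=2024-05-02']
```

A query for one day’s logs then reads a single partition instead of scanning the whole event history.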
Vertical Partitioning
Though less common, vertical partitioning (where subsets of a table’s columns are stored separately) can be very useful for multi-purpose tables that can have several hundred columns defined. This allows easy access to certain selections of columns that Data Engineers pre-define based on the business needs.
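A sketch of vertical partitioning: a wide row is split into two narrower column groups sharing the primary key, so a query touching only profile fields never reads the preference columns. The column groups here are hypothetical examples.

```python
wide_rows = [
    {"user_id": 1, "name": "Ana", "email": "ana@example.com",
     "theme": "dark", "newsletter": True},
]

# Pre-defined column groups, each keeping the shared key (user_id).
PROFILE_COLS = ("user_id", "name", "email")
PREFS_COLS   = ("user_id", "theme", "newsletter")

profile = [{c: r[c] for c in PROFILE_COLS} for r in wide_rows]
prefs   = [{c: r[c] for c in PREFS_COLS} for r in wide_rows]

print(profile[0])  # only the profile columns
```

Rejoining the two vertical partitions on user_id reconstructs the original wide row when a query really does need every column.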
Data Partitioning Use Cases
Data Partitioning has many use cases and serves several Data needs.
Database
Data partitioning can be used in different ways depending on the data needs. For databases, it can balance the read/write load by distributing data points according to how frequently they are accessed, ensuring the load is spread across the multiple database partitions.
Analytics
For analytics, partitioning is used for faster access to the data depending on the analytics needs. An example of such a need is aggregating sales data per state to analyse local sales.
If the data is partitioned per state, the pipeline becomes much simpler: each aggregation reads only its own state’s partition.
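That per-state aggregation can be sketched as follows: with sales partitioned by state, the total for one state touches only the matching partition instead of scanning every record (the figures are invented for illustration).

```python
# Partition key: state. Each partition holds only that state's sale amounts.
sales_by_state = {
    "CA": [120.0, 80.0, 45.5],
    "NY": [200.0, 15.0],
    "TX": [99.9],
}

def total_sales(state):
    # Partition pruning: only one partition is ever read.
    return sum(sales_by_state[state])

print(total_sales("CA"))  # 245.5
```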
Partitioning in Data Platforms
Understanding Stakeholder Needs
While choosing the appropriate partitioning columns is generally owned by Data Engineers, providing the right partitioning tooling in the Data Platform is extremely important to enable our stakeholders to apply the partitioning most suitable for the business.
Understanding the basics of partitioning is therefore a necessary skill for Data Platform Engineers, enabling them to understand the needs of their users and optimise the platform for performance and reliability.
Providing The Best Tools for Stakeholders
While Data Engineers are busy understanding and analysing the data to ensure we get the best business value out of it, Data Platform Engineers need to focus on providing reliable and fast data tools to serve the growing needs of any data-driven business.
In Data Platforms, partitioning and parallel processing are the bread and butter for scalability and serving the data needs of the business with limited delays.
Next Steps for Data Platform Engineers
After understanding the basics of partitioning, I invite you to continue the journey with the following topics (a quick search will get you started):
- Kafka partitioning
- Spark Executors
- Flink Task Slots and Task Managers
- Cassandra Nodes and Virtual Nodes
- MySQL Sharding
- Parquet file format