Build Your First Data Platform
With AI and Big Data growing beyond all expectations, one can only wonder how all of this data is processed and handled! In our previous article, we discussed why data platforms matter to data-driven companies, allowing them to handle and process hundreds of petabytes of data within minutes. We also touched on the critical role data platform engineers play in establishing a robust, reliable infrastructure that serves ever-growing data use cases.
Today, I want us to dive deeper into the data platform engineer role. In this article, I’ll show you just how accessible the field is and guide you through building your first-ever data platform.
Taking The First Step
While going through some data platform designs shared by companies like Uber and Netflix can be both exciting and intimidating, becoming a data platform engineer is actually much more accessible than these articles might suggest.
Trying out data platform engineering and taking the first step is easier than you might think. In fact, by the end of this article, you’ll have taken your first step as a data platform engineer and will have a clear plan for your second.
Data Platform: Use Case
Glossary
Let’s briefly define some key terms:
- MySQL: An open-source relational database management system.
- Doris: An open-source, high-performance, real-time analytical database.
- Flink: A framework and distributed processing engine for stateful computations over bounded and unbounded data streams.
Business scenario
Picture an online retailer selling products on their website. They want to analyse user activity, generate product recommendations, and produce sales forecasts.
While their application’s back-end benefits from a MySQL database, their analysts will struggle: MySQL is not optimised for high-volume reads and complex analytical queries at scale.
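To make this concrete, here is the kind of analytical query an analyst might want to run. The orders and shipments tables here are illustrative (similar to the ones we create later in this guide), and the join plus aggregation is exactly the workload an OLTP database struggles with:
-- Hypothetical analyst query (illustrative schema): total revenue per shipping city
SELECT s.city, SUM(o.price) AS total_revenue
FROM orders o
JOIN shipments s ON s.id = o.id
GROUP BY s.city
ORDER BY total_revenue DESC;
Running queries like this directly against the production MySQL instance competes with the application’s transactional workload, which is why we offload them to an analytical warehouse.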
Data Platform: The Solution
The business hires you as their data platform engineer to address this issue. You identify that the relevant tables need to be replicated into an analytical data warehouse, where analysts can work on the data efficiently and drive value for the business in record time.
After some research, you propose the following design:
This design allows the application’s back-end to keep running smoothly on the MySQL database, while our connector synchronizes every data change in MySQL to Doris in real time, giving analysts an up-to-date view of the business’s data at all times.
Build Your First Data Platform
Pre-requirements
Before starting, ensure you have the following tools installed on your machine:
- Docker & Docker Compose: For containerizing and orchestrating our services
- MySQL client: For connecting to our MySQL database and to Doris from the command line
- Java: Flink is JVM-based, so you need a Java runtime installed. Make sure your JAVA_HOME is configured.
Flink CDC
The tool we will be using today is Flink CDC, which lets us connect to databases and process data, all through a simple YAML file.
Implementation Steps
Note: these steps are adapted from the official documentation on the Flink CDC website.
Start Your Flink Cluster
Let’s start by getting a Flink cluster up and running.
Step 1: Download and Install Flink
Using your machine’s command line, run these commands:
# Download Flink
curl https://archive.apache.org/dist/flink/flink-1.18.0/flink-1.18.0-bin-scala_2.12.tgz -o flink-1.18.0-bin-scala_2.12.tgz
# Extract the archive
tar -xvzf flink-1.18.0-bin-scala_2.12.tgz -C ./
# Enter the Flink folder
cd flink-1.18.0/
Step 2: Configure your Flink cluster
Edit the file conf/flink-conf.yaml using a file editor or IDE:
# Add the following line to enable Flink Checkpoints
execution.checkpointing.interval: 3000
# Edit the following line to allow enough task slots
taskmanager.numberOfTaskSlots: 4
- Flink Checkpoints: Periodic snapshots for state recovery in Flink.
- Flink Task Slots: Resource units in Flink for executing tasks.
Step 3: Start the Flink Cluster
./bin/start-cluster.sh
Once the cluster starts, you can open the Flink UI in your browser at http://localhost:8081/ and you should see the Flink dashboard.
Deploy Your Database and Data Warehouse
Next, let’s deploy our MySQL database and Doris data warehouse.
Step 1: Host Configuration
Since Doris requires memory mapping support for operation, execute the following command on your machine:
sudo sysctl -w vm.max_map_count=2000000
Note: For Mac users, the above command might not work because containers are implemented differently on macOS. Follow these steps instead.
Step 2: Create the docker-compose File
Create docker-compose.yaml file with the following configuration:
version: '2.1'
services:
  doris:
    image: yagagagaga/doris-standalone
    ports:
      - "8030:8030"
      - "8040:8040"
      - "9030:9030"
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
Step 3: Deploy Your Environment
docker-compose up -d
Note: Newer versions of Docker will use the syntax:
docker compose up -d
Create MySQL Data
With our environment all set, we can simulate some MySQL data.
Step 1: Access the MySQL Database
docker-compose exec mysql mysql -uroot -p123456
Step 2: Create MySQL Data
-- create database
CREATE DATABASE app_db;
USE app_db;
-- create orders table
CREATE TABLE `orders` (
  `id` INT NOT NULL,
  `price` DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (`id`)
);
-- create shipments table
CREATE TABLE `shipments` (
  `id` INT NOT NULL,
  `city` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`id`)
);
-- create products table
CREATE TABLE `products` (
  `id` INT NOT NULL,
  `product` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`id`)
);
-- insert records
INSERT INTO `orders` (`id`, `price`) VALUES (1, 4.00);
INSERT INTO `orders` (`id`, `price`) VALUES (2, 100.00);
-- insert records
INSERT INTO `shipments` (`id`, `city`) VALUES (1, 'beijing');
INSERT INTO `shipments` (`id`, `city`) VALUES (2, 'xian');
-- insert records
INSERT INTO `products` (`id`, `product`) VALUES (1, 'Beer');
INSERT INTO `products` (`id`, `product`) VALUES (2, 'Cap');
INSERT INTO `products` (`id`, `product`) VALUES (3, 'Peanut');
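Before moving on, it’s worth a quick sanity check that the tables and rows were created as expected. Still inside the same MySQL session, run:
-- list the tables we just created
SHOW TABLES;
-- spot-check the inserted rows
SELECT * FROM orders;
SELECT * FROM products;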
Create Doris Target Database
Now, let’s take a look at our data warehouse and set it up.
Step 1: Use Doris UI
The Doris UI can be accessed at http://localhost:8030/
Log in with root as the username and leave the password empty.
Step 2: Create the Database
Navigate to the Playground tab, select the mysql database and execute the query:
create database app_db;
Alternatively, you can access Doris from your terminal using your local MySQL client:
mysql -uroot -P9030 -h127.0.0.1
Using the MySQL CLI:
mysql> create database app_db;
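Since Doris speaks the MySQL protocol, you can confirm the database was created with a familiar command from the same session:
-- confirm the new database exists in Doris
SHOW DATABASES;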
Install and Deploy MySQL to Doris Connector
With everything in place, we can now deploy our data connector.
Step 1: Install requirements
To deploy our connector, we first need a few packages:
- Flink CDC: This dependency contains the secret sauce of deploying Flink jobs using a YAML file
- Flink CDC MySQL Connector: This dependency allows the Flink CDC job to connect to our MySQL database
- Flink CDC Doris Connector: This dependency allows the Flink CDC job to connect to our Doris data warehouse
- MySQL Java Connector: This dependency is required by the Flink cluster to establish MySQL connections
# Inside the Flink cluster directory ([...]/flink-1.18.0)
# Install Flink CDC
curl https://dlcdn.apache.org/flink/flink-cdc-3.1.0/flink-cdc-3.1.0-bin.tar.gz -o flink-cdc-3.1.0-bin.tar.gz
tar -xvzf flink-cdc-3.1.0-bin.tar.gz
# Install Flink CDC MySQL Connector
curl https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-mysql/3.1.0/flink-cdc-pipeline-connector-mysql-3.1.0.jar -o flink-cdc-3.1.0/lib/flink-cdc-pipeline-connector-mysql-3.1.0.jar
# Install Flink CDC Doris Connector
curl https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-doris/3.1.0/flink-cdc-pipeline-connector-doris-3.1.0.jar -o flink-cdc-3.1.0/lib/flink-cdc-pipeline-connector-doris-3.1.0.jar
# Install MySQL Java Connector
curl https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar -o lib/mysql-connector-java-8.0.27.jar
# Restart the Flink Cluster to load the new dependencies
./bin/stop-cluster.sh
./bin/start-cluster.sh
Step 2: Create the Job Configuration
Create the file mysql-to-doris.yaml and paste the following configuration into it:
###########################################################################
# Description: Sync MySQL all tables to Doris
###########################################################################
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*
  server-id: 5400-5404
  server-time-zone: UTC

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
  table.create.properties.light_schema_change: true
  table.create.properties.replication_num: 1

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
Step 3: Run MySQL to Doris Connector
bash flink-cdc-3.1.0/bin/flink-cdc.sh mysql-to-doris.yaml --flink-home './'
Now we can go back to the Flink UI and observe that our job has started and is processing the data.
If we navigate to the Doris UI, we can see that our tables have been created.
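You can also confirm the sync from a Doris SQL session (the terminal connection from earlier, or the Playground tab); the synchronized tables and rows should match what we inserted into MySQL:
-- list the tables synchronized into Doris
SHOW TABLES FROM app_db;
-- spot-check the replicated rows
SELECT * FROM app_db.orders;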
Live Schema and Data Changes
With our connector live, let’s see how it performs when we modify a table’s schema and apply some data updates.
Step 1: Access MySQL Database
docker-compose exec mysql mysql -uroot -p123456
Step 2: Apply a Schema Change
ALTER TABLE app_db.orders ADD amount varchar(100) NULL;
If we go to the Doris UI, we can see that the column has been added to the table.
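If you prefer the command line, the same schema change should also be visible from a Doris SQL session:
-- the new amount column should appear on the synchronized table
DESC app_db.orders;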
Step 3: Apply a Data Change
UPDATE app_db.orders SET price=100.00, amount=100.00 WHERE id=1;
DELETE FROM app_db.orders WHERE id=2;
Observe the changes in the Doris UI.
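Likewise, a quick query from a Doris SQL session should show the row with id 1 carrying the updated price and amount, and the row with id 2 gone:
-- verify the update and delete were replicated to Doris
SELECT * FROM app_db.orders;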
Shut Down and Cleanup
Before we wrap up, don’t forget to shut down and clean up the environment we created.
./bin/stop-cluster.sh
docker-compose down
Closing Thoughts
Congratulations on deploying and running your first data platform. You have now taken the first step towards building your profile as a Data Platform Engineer.
Before we part ways, I’d like to leave you with two thoughts.
What Does This Look Like at Scale?
Data platforms are optimised for scalability by design; we can already spot scalable resources in the example we went through (e.g., Flink task slots).
Analytical data warehouses like Doris are also designed to handle large-scale data processing. With optimised column-based storage and distributed query execution, these platforms are well suited for scalability.
By leveraging cloud computing and scalable container orchestration tools like Kubernetes, data platforms can unlock exceptional computing power that keeps up with the growing needs of big data.
Next Steps?
Now that we have officially deployed our first data platform, we can take it to the next level and start building a solid foundation for our skillset:
- Docker and Kubernetes: DevOps basics are essential to data platform work; as we optimise for performance, scalability becomes a must.
- MySQL and MySQL Replication: MySQL fundamentals are a great asset to a data platform engineer, particularly the basics of sharding, replication, and Change Data Capture.
- Cloud-Based Data Warehouses and Data Lakes: Analytics tools are the bread and butter of your data platform; getting exposure to cloud-based tools will help you understand the value they offer at scale and make your data platform portfolio more attractive.
- Flink and Spark: Data processing at scale. Looking into how these tools handle and process data in a distributed way, understanding the fundamentals of partitioning and shuffling, and running a few examples will help you grasp their potential.
I hope you enjoyed this guide as much as I enjoyed making it!
Keep me honest: If you spot any factual inaccuracies or misrepresented information, please let me know by leaving a comment or emailing me at ma.turki@dataplatformhub.com.