Build Your First Data Platform
With AI and Big Data growing beyond all expectations, one can only wonder how all of this data is processed and handled! In our previous article, we discussed why data platforms matter to data-driven companies, allowing them to handle and process hundreds of petabytes of data within minutes. We also touched on the critical role data platform engineers play in establishing a robust, reliable infrastructure that serves ever-growing data use cases.
Today, I want us to dive deeper into the data platform engineer role. In this article, I’ll show you just how accessible the field is and guide you through building your first-ever data platform.
Taking The First Step
While going through some data platform designs shared by companies like Uber and Netflix can be both exciting and intimidating, becoming a data platform engineer is actually much more accessible than these articles might suggest.
Trying out data platform engineering and taking the first step is easier than you might think. In fact, by the end of this article, you’ll have taken your first step as a data platform engineer and will have a clear plan for your second.
Data Platform: Use Case
Glossary
Let’s briefly define some key terms:
- MySQL: An open-source relational database management system.
- Doris: An open-source, high-performance, real-time analytical database.
- Flink: A framework and distributed processing engine for stateful computations over bounded and unbounded data streams.
Business scenario
Picture an online retailer selling products on their website. They want to analyse user activity, generate product recommendations, and produce sales forecasts.
While their application’s back-end benefits from a MySQL database, their analysts will struggle: MySQL is not optimised for high-volume reads and complex analytical queries at scale.
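To make this concrete, here is the kind of analytical query an analyst might want to run. The orders and shipments tables here are illustrative (similar to the ones we create later in this guide), and the join plus aggregation is exactly the workload an OLTP database struggles with:
-- Hypothetical analyst query (illustrative schema): total revenue per shipping city
SELECT s.city, SUM(o.price) AS total_revenue
FROM orders o
JOIN shipments s ON s.id = o.id
GROUP BY s.city
ORDER BY total_revenue DESC;
Running queries like this directly against the production MySQL instance competes with the application’s transactional workload, which is why we offload them to an analytical warehouse.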
Data Platform: The Solution
The business hires you as their data platform engineer to address this issue. You identify that the relevant tables need to be replicated into an analytical data warehouse, where analysts can work on the data efficiently and drive value for the business in record time.
After some research, you propose the following design:
This design allows the application’s back-end to keep running smoothly on the MySQL database, while our connector synchronizes every data change in MySQL to Doris in real time, giving analysts an up-to-date view of the business’s data at all times.
Build Your First Data Platform
Pre-requirements
Before starting, ensure you have the following tools installed on your machine:
- Docker & Docker Compose: For containerizing and orchestrating our services
- MySQL client: For connecting to our MySQL database and to Doris from the command line
- Java: Flink is JVM-based, so you need a Java runtime installed. Make sure your JAVA_HOME is configured.
Flink CDC
The tool we will be using today is Flink CDC, which lets us connect to databases and process data, all through a simple YAML file.
Implementation Steps
Note: these steps are adapted from the official documentation on the Flink CDC website.
Start Your Flink Cluster
Let’s start by getting a Flink cluster up and running.
Step 1: Download and Install Flink
Using your machine’s command line, run these commands:
# Download Flink
curl https://archive.apache.org/dist/flink/flink-1.18.0/flink-1.18.0-bin-scala_2.12.tgz -o flink-1.18.0-bin-scala_2.12.tgz
# Extract the archive
tar -xvzf flink-1.18.0-bin-scala_2.12.tgz -C ./
# Enter the Flink folder
cd flink-1.18.0/
Step 2: Configure your Flink cluster
Edit the file conf/flink-conf.yaml using a file editor or IDE:
# Add the following line to enable Flink Checkpoints
execution.checkpointing.interval: 3000
# Edit the following line to allow enough task slots
taskmanager.numberOfTaskSlots: 4
- Flink Checkpoints: Periodic snapshots for state recovery in Flink.
- Flink Task Slots: Resource units in Flink for executing tasks.
Step 3: Start the Flink Cluster
./bin/start-cluster.sh
Once the cluster starts, you can open the Flink UI in your browser at http://localhost:8081/ and you should see the Flink dashboard.
Deploy Your Database and Data Warehouse
Next, let’s deploy our MySQL database and Doris data warehouse.
Step 1: Host Configuration
Since Doris requires memory mapping support for operation, execute the following command on your machine:
sudo sysctl -w vm.max_map_count=2000000
Note: For Mac users, the above command might not work because containers are implemented differently on macOS. Follow these steps instead.
Step 2: Create the docker-compose File
Create docker-compose.yaml file with the following configuration:
version: '2.1'
services:
  doris:
    image: yagagagaga/doris-standalone
    ports:
      - "8030:8030"
      - "8040:8040"
      - "9030:9030"
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
Step 3: Deploy Your Environment
docker-compose up -d
Note: Newer versions of Docker will use the syntax:
docker compose up -d
Create MySQL Data
With our environment all set, we can simulate some MySQL data.
Step 1: Access the MySQL Database
docker-compose exec mysql mysql -uroot -p123456
Step 2: Create MySQL Data
-- create database
CREATE DATABASE app_db;
USE app_db;
-- create orders table
CREATE TABLE `orders` (
  `id` INT NOT NULL,
  `price` DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (`id`)
);
-- create shipments table
CREATE TABLE `shipments` (
  `id` INT NOT NULL,
  `city` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`id`)
);
-- create products table
CREATE TABLE `products` (
  `id` INT NOT NULL,
  `product` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`id`)
);
-- insert records
INSERT INTO `orders` (`id`, `price`) VALUES (1, 4.00);
INSERT INTO `orders` (`id`, `price`) VALUES (2, 100.00);
-- insert records
INSERT INTO `shipments` (`id`, `city`) VALUES (1, 'beijing');
INSERT INTO `shipments` (`id`, `city`) VALUES (2, 'xian');
-- insert records
INSERT INTO `products` (`id`, `product`) VALUES (1, 'Beer');
INSERT INTO `products` (`id`, `product`) VALUES (2, 'Cap');
INSERT INTO `products` (`id`, `product`) VALUES (3, 'Peanut');
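Before moving on, it’s worth a quick sanity check that the tables and rows were created as expected. Still inside the same MySQL session, run:
-- list the tables we just created
SHOW TABLES;
-- spot-check the inserted rows
SELECT * FROM orders;
SELECT * FROM products;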
Create Doris Target Database
Now, let’s take a look at our data warehouse and set it up.
Step 1: Use Doris UI
The Doris UI can be accessed at http://localhost:8030/
Log in with root as the username and leave the password empty.
Step 2: Create the Database
Navigate to the Playground tab, select the mysql database and execute the query:
create database app_db;
Alternatively, you can access Doris from your terminal using your local MySQL client:
mysql -uroot -P9030 -h127.0.0.1
Using the MySQL CLI:
mysql> create database app_db;
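Since Doris speaks the MySQL protocol, you can confirm the database was created with a familiar command from the same session:
-- confirm the new database exists in Doris
SHOW DATABASES;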
Install and Deploy MySQL to Doris Connector
With everything in place, we can now deploy our data connector.
Step 1: Install requirements
To deploy our connector, we first need a few packages:
- Flink CDC: This dependency contains the secret sauce of deploying Flink jobs using a YAML file
- Flink CDC MySQL Connector: This dependency allows the Flink CDC job to connect to our MySQL database
- Flink CDC Doris Connector: This dependency allows the Flink CDC job to connect to our Doris data warehouse
- MySQL Java Connector: This dependency is required by the Flink cluster to establish MySQL connections
# Inside the Flink cluster directory ([...]/flink-1.18.0)
# Install Flink CDC
curl https://dlcdn.apache.org/flink/flink-cdc-3.1.0/flink-cdc-3.1.0-bin.tar.gz -o flink-cdc-3.1.0-bin.tar.gz
tar -xvzf flink-cdc-3.1.0-bin.tar.gz
# Install Flink CDC MySQL Connector
curl https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-mysql/3.1.0/flink-cdc-pipeline-connector-mysql-3.1.0.jar -o flink-cdc-3.1.0/lib/flink-cdc-pipeline-connector-mysql-3.1.0.jar
# Install Flink CDC Doris Connector
curl https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-doris/3.1.0/flink-cdc-pipeline-connector-doris-3.1.0.jar -o flink-cdc-3.1.0/lib/flink-cdc-pipeline-connector-doris-3.1.0.jar
# Install MySQL Java Connector
curl https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar -o lib/mysql-connector-java-8.0.27.jar
# Restart the Flink Cluster to load the new dependencies
./bin/stop-cluster.sh
./bin/start-cluster.sh
Step 2: Create the Job Configuration
Create the file mysql-to-doris.yaml and paste the following configuration into it:
###########################################################################
# Description: Sync MySQL all tables to Doris
###########################################################################
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*
  server-id: 5400-5404
  server-time-zone: UTC

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
  table.create.properties.light_schema_change: true
  table.create.properties.replication_num: 1

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
Step 3: Run MySQL to Doris Connector
bash flink-cdc-3.1.0/bin/flink-cdc.sh mysql-to-doris.yaml --flink-home './'
Now we can go back to the Flink UI and observe that our job has started and is processing the data.
If we navigate to the Doris UI, we can see that our tables have been created.
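You can also confirm the sync from a Doris SQL session (the terminal connection from earlier, or the Playground tab); the synchronized tables and rows should match what we inserted into MySQL:
-- list the tables synchronized into Doris
SHOW TABLES FROM app_db;
-- spot-check the replicated rows
SELECT * FROM app_db.orders;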
Live Schema and Data Changes
With our connector live, let’s see how it performs when we modify a table’s schema and apply some data updates.
Step 1: Access MySQL Database
docker-compose exec mysql mysql -uroot -p123456
Step 2: Apply a Schema Change
ALTER TABLE app_db.orders ADD amount varchar(100) NULL;
If we go to the Doris UI, we can see that the column has been added to the table.
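If you prefer the command line, the same schema change should also be visible from a Doris SQL session:
-- the new amount column should appear on the synchronized table
DESC app_db.orders;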
Step 3: Apply a Data Change
UPDATE app_db.orders SET price=100.00, amount=100.00 WHERE id=1;
DELETE FROM app_db.orders WHERE id=2;
Observe the changes in the Doris UI.
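Likewise, a quick query from a Doris SQL session should show the row with id 1 carrying the updated price and amount, and the row with id 2 gone:
-- verify the update and delete were replicated to Doris
SELECT * FROM app_db.orders;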
Shut Down and Cleanup
Before we wrap up, don’t forget to shut down and clean up the environment we created.
./bin/stop-cluster.sh
docker-compose down
Closing Thoughts
Congratulations on deploying and running your first data platform. You have now taken the first step towards building your profile as a Data Platform Engineer.
Before we part ways, I’d like to leave you with two thoughts.
What Does This Look Like at Scale?
Data platforms are optimised for scalability by design; we can already spot scalable resources in the example we went through (e.g., Flink task slots).
Analytical data warehouses like Doris are also designed to handle large-scale data processing. With optimised column-based storage and distributed query execution, these platforms are well suited for scalability.
By leveraging cloud computing and scalable container orchestration tools like Kubernetes, data platforms can unlock exceptional computing power that keeps up with the growing needs of big data.
Next Steps?
Now that we have officially deployed our first data platform, we can take it to the next level and start building a solid foundation for our skillset:
- Docker and Kubernetes: DevOps basics are essential to data platform work; as we optimise for performance, scalability becomes a must.
- MySQL and MySQL Replication: MySQL fundamentals are a great asset to a data platform engineer, particularly the basics of sharding, replication, and Change Data Capture.
- Cloud-Based Data Warehouses and Data Lakes: Analytics tools are the bread and butter of your data platform; getting exposure to cloud-based tools will help you understand the value they offer at scale and make your data platform portfolio more attractive.
- Flink and Spark: Data processing at scale. Looking into how these tools handle and process data in a distributed way, understanding the fundamentals of partitioning and shuffling, and running a few examples will help you grasp their potential.
I hope you enjoyed this guide as much as I enjoyed making it!
Keep me honest: If you spot any factual inaccuracies or misrepresented information, please let me know by leaving a comment or emailing me at ma.turki@dataplatformhub.com.