Simplify operational data processing in data lakes using AWS Glue and Apache Hudi


The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. It focuses on defining standards and patterns to integrate data producers and consumers and move data between data lakes and purpose-built data stores securely and efficiently. Out of the many data producer systems that feed data to a data lake, operational databases are most prevalent, where operational data is stored, transformed, analyzed, and finally used to enhance business operations of an organization. With the emergence of open storage formats such as Apache Hudi and its native support from AWS Glue for Apache Spark, many AWS customers have started adding transactional and incremental data processing capabilities to their data lakes.

AWS has invested in native service integration with Apache Hudi and published technical content to enable you to use Apache Hudi with AWS Glue (for example, refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started). In AWS ProServe-led customer engagements, the use cases we work on usually come with technical complexity and scalability requirements. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.

Use case overview

AnyCompany Travel and Hospitality wanted to build a data processing framework to seamlessly ingest and process data coming from operational databases (used by reservation and booking systems) in a data lake before applying machine learning (ML) techniques to deliver a personalized experience to its users. Due to the sheer volume of direct and indirect sales channels the company has, its booking and promotions data are organized in hundreds of operational databases with thousands of tables. Of those tables, some are larger (such as in terms of record volume) than others, and some are updated more frequently than others. In the data lake, the data is to be organized in the following storage zones:

  1. Source-aligned datasets – These have an identical structure to their counterparts at the source
  2. Aggregated datasets – These datasets are created based on one or more source-aligned datasets
  3. Consumer-aligned datasets – These are derived from a combination of source-aligned, aggregated, and reference datasets enriched with relevant business and transformation logic, usually fed as inputs to ML pipelines or any consumer applications

The following are the data ingestion and processing requirements:

  1. Replicate data from operational databases to the data lake, including insert, update, and delete operations.
  2. Keep the source-aligned datasets up to date (typically within the range of 10 minutes to a day) in relation to their counterparts in the operational databases, ensuring analytics pipelines refresh consumer-aligned datasets for downstream ML pipelines in a timely fashion. Moreover, the framework should consume compute resources as optimally as possible per the size of the operational tables.
  3. To minimize DevOps and operational overhead, the company wanted to templatize the source code wherever possible. For example, to create source-aligned datasets in the data lake for 3,000 operational tables, the company didn't want to deploy 3,000 separate data processing jobs. The smaller the number of jobs and scripts, the better.
  4. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of a primary Region failure.

As you can guess, the Apache Hudi framework can solve the first requirement. Therefore, we will put our emphasis on the other requirements. We begin with a data lake reference architecture followed by an overview of the operational data processing framework. By showing you our open-source solution on GitHub, we delve into framework components and walk through their design and implementation aspects. Finally, by testing the framework, we summarize how it meets the aforementioned requirements.

Data lake reference architecture

Let's begin with the big picture: a data lake solves a variety of analytics and ML use cases dealing with internal and external data producers and consumers. The following diagram represents a generic data lake architecture. To ingest data from operational databases to an Amazon Simple Storage Service (Amazon S3) staging bucket of the data lake, either AWS Database Migration Service (AWS DMS) or any AWS partner solution from AWS Marketplace that has support for change data capture (CDC) can fulfill the requirement. AWS Glue is used to create source-aligned and consumer-aligned datasets, and separate AWS Glue jobs perform the feature engineering part of ML engineering and operations. Amazon Athena is used for interactive querying and AWS Lake Formation is used for access controls.

Data Lake Reference Architecture

Operational data processing framework

The operational data processing (ODP) framework contains three components: File Manager, File Processor, and Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. We have open-sourced this framework on GitHub—you can clone the code repo and inspect it while we walk you through the design and implementation of the framework components. The source code is organized in three folders, one for each component, and should you customize and adopt this framework for your use case, we recommend promoting these folders as separate code repositories in your version control system. Consider using the following repository names:

  1. aws-glue-hudi-odp-framework-file-manager
  2. aws-glue-hudi-odp-framework-file-processor
  3. aws-glue-hudi-odp-framework-config-manager

With this modular approach, you can independently deploy the components to your data lake environment by following your preferred CI/CD processes. As illustrated in the preceding diagram, these components are deployed in conjunction with a CDC solution.

Component 1: File Manager

File Manager detects files emitted by a CDC process such as AWS DMS and tracks them in an Amazon DynamoDB table. As shown in the following diagram, it consists of an Amazon EventBridge event rule, an Amazon Simple Queue Service (Amazon SQS) queue, an AWS Lambda function, and a DynamoDB table. The EventBridge rule uses Amazon S3 Event Notifications to detect the arrival of CDC files in the S3 bucket. The event rule forwards the object event notifications to the SQS queue as messages. The File Manager Lambda function consumes those messages, parses the metadata, and inserts the metadata into the DynamoDB table odpf_file_tracker. These records will then be processed by File Processor, which we discuss in the next section.
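The following is a minimal sketch of what such a Lambda handler could look like. The event parsing, item attribute names, and environment variable are illustrative assumptions; the actual implementation lives in the GitHub repository.

```python
import json
import os
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
# Table name defaults to odpf_file_tracker; the environment variable is an assumption.
file_tracker = dynamodb.Table(os.environ.get("FILE_TRACKER_TABLE", "odpf_file_tracker"))


def lambda_handler(event, context):
    """Consume S3 event notifications delivered via SQS and track each CDC file."""
    for sqs_record in event["Records"]:
        # EventBridge forwards the S3 Object Created event as the SQS message body.
        body = json.loads(sqs_record["body"])
        detail = body.get("detail", {})
        bucket = detail.get("bucket", {}).get("name")
        key = detail.get("object", {}).get("key")
        if not bucket or not key:
            continue
        # Attribute names below are illustrative, not the framework's exact schema.
        file_tracker.put_item(
            Item={
                "source_table_name": key.split("/")[-2],
                "file_id": key,
                "bucket_name": bucket,
                "file_ingestion_status": "raw_file_landed",
                "created_at": datetime.now(timezone.utc).isoformat(),
            }
        )
    return {"status": "ok"}
```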

ODPF Component: File Manager

Component 2: File Processor

File Processor is the workhorse of the ODP framework. It processes files from the S3 staging bucket, creates source-aligned datasets in the raw S3 bucket, and adds or updates metadata for the datasets (AWS Glue tables) in the AWS Glue Data Catalog.

We use the following terminology when discussing File Processor:

  1. Refresh cadence – This represents the data ingestion frequency (for example, 10 minutes). It usually goes with AWS Glue worker type (one of G.1X, G.2X, G.4X, G.8X, G.025X, and so on) and batch size.
  2. Table configuration – This includes the Hudi configuration (primary key, partition key, pre-combine key, and table type (Copy on Write or Merge on Read)), table data storage mode (historical or current snapshot), S3 bucket used to store source-aligned datasets, AWS Glue database name, AWS Glue table name, and refresh cadence.
  3. Batch size – This numeric value is used to split tables into smaller batches and process their respective CDC files in parallel. For example, a configuration of 50 tables with a 10-minute refresh cadence and a batch size of 5 results in a total of 10 AWS Glue job runs, each processing CDC files for 5 tables (see the sketch after this list).
  4. Table data storage mode – There are two options:
    • Historical – This table in the data lake stores historical updates to records (always append).
    • Current snapshot – This table in the data lake stores the latest versioned records (upserts) with the ability to use Hudi time travel for historical updates.
  5. File processing state machine – It processes CDC files that belong to tables that share a common refresh cadence.
  6. EventBridge rule association with the file processing state machine – We use a dedicated EventBridge rule for each refresh cadence with the file processing state machine as target.
  7. File processing AWS Glue job – This is a configuration-driven AWS Glue extract, transform, and load (ETL) job that processes CDC files for multiple tables.
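The batching arithmetic itself is simple. The following sketch (our illustrative assumption, not the framework's exact code) shows how 18 tables with a batch size of 5 split into four batches:

```python
def split_into_batches(table_names, batch_size):
    """Split a list of operational table names into fixed-size batches."""
    return [
        table_names[i : i + batch_size]
        for i in range(0, len(table_names), batch_size)
    ]


# 18 tables with a batch size of 5 yield 4 batches: 5 + 5 + 5 + 3 tables.
tables = [f"table_{i}" for i in range(1, 19)]
batches = split_into_batches(tables, batch_size=5)
print([len(batch) for batch in batches])  # [5, 5, 5, 3]
```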

File Processor is implemented as a state machine using AWS Step Functions. Let's use an example to understand this. The following diagram illustrates running the File Processor state machine with a configuration that includes 18 operational tables, a refresh cadence of 10 minutes, a batch size of 5, and an AWS Glue worker type of G.1X.

ODP framework component: File Processor

The workflow includes the following steps:

  1. The EventBridge rule triggers the File Processor state machine every 10 minutes.
  2. Being the first state in the state machine, the Batch Manager Lambda function reads configurations from DynamoDB tables.
  3. The Lambda function creates four batches: three of them will be mapped to five operational tables each, and the fourth one is mapped to three operational tables. Then it feeds the batches to the Step Functions Map state.
  4. For each item in the Map state, the File Processor Trigger Lambda function will be invoked, which in turn runs the File Processor AWS Glue job.
  5. Each AWS Glue job performs the following actions (a simplified sketch of the core logic follows this list):
    • Checks the status of an operational table and acquires a lock when it is not being processed by any other job. The odpf_file_processing_tracker DynamoDB table is used for this purpose. When a lock is acquired, it inserts a record in the DynamoDB table with the status updating_table for the first time; otherwise, it updates the record.
    • Processes the CDC files for the given operational table from the S3 staging bucket and creates a source-aligned dataset in the S3 raw bucket. It also updates technical metadata in the AWS Glue Data Catalog.
    • Updates the status of the operational table to completed in the odpf_file_processing_tracker table. In case of processing errors, it updates the status to refresh_error and logs the stack trace.
    • It also inserts this record into the odpf_file_processing_tracker_history DynamoDB table along with additional details such as insert, update, and delete row counts.
    • Moves the records that belong to successfully processed CDC files from odpf_file_tracker to the odpf_file_tracker_history table with file_ingestion_status set to raw_file_processed.
    • Moves on to the next operational table in the given batch.
    • Note: a failure to process CDC files for one of the operational tables of a given batch doesn't impact the processing of other operational tables.
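To make these steps concrete, the following is a heavily simplified sketch of the two central operations inside the File Processor AWS Glue job: acquiring a table-level lock through a DynamoDB conditional write, and upserting CDC records into a Hudi table in the raw bucket. The attribute names, the Hudi options (the sketch assumes a non-partitioned Copy on Write table), and the control flow are illustrative assumptions; the GitHub repository contains the actual implementation.

```python
import boto3
from pyspark.sql import SparkSession

dynamodb = boto3.resource("dynamodb")
tracker = dynamodb.Table("odpf_file_processing_tracker")

spark = SparkSession.builder.getOrCreate()


def acquire_table_lock(table_name):
    """Acquire a lock by writing a status record only if no other job holds one."""
    try:
        tracker.put_item(
            Item={"source_table_name": table_name, "status": "updating_table"},
            # Succeed only if the table is not currently locked by another run.
            ConditionExpression="attribute_not_exists(source_table_name) OR #s <> :updating",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":updating": "updating_table"},
        )
        return True
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        return False


def upsert_cdc_files(cdc_file_paths, hudi_table_path, database, table):
    """Read CDC Parquet files and upsert them into a Hudi table in the raw bucket."""
    cdc_df = spark.read.parquet(*cdc_file_paths)
    hudi_options = {
        # Keys and table type come from odpf_raw_table_config; values here are assumptions.
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": "record_id",
        "hoodie.datasource.write.precombine.field": "cdc_timestamp",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        # Non-partitioned table for simplicity.
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        # Sync technical metadata to the AWS Glue Data Catalog via Hive sync.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.hive_sync.mode": "hms",
    }
    cdc_df.write.format("hudi").options(**hudi_options).mode("append").save(hudi_table_path)
```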

Component 3: Configuration Manager

Configuration Manager is used to insert configuration details into the odpf_batch_config and odpf_raw_table_config tables. To keep this post concise, we provide two architecture patterns in the code repo and leave the implementation details to you.
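For illustration, a configuration entry can be inserted with a few lines of boto3. The attribute names and values below are assumptions chosen for readability rather than the framework's exact schema, which is defined in the code repo.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Illustrative batch-level configuration: refresh cadence, batch size, and worker type.
dynamodb.Table("odpf_batch_config").put_item(
    Item={
        "refresh_cadence": "10_minutes",
        "batch_size": 5,
        "glue_worker_type": "G.1X",
    }
)

# Illustrative table-level configuration for one operational table.
dynamodb.Table("odpf_raw_table_config").put_item(
    Item={
        "source_table_name": "table_1",
        "hudi_primary_key": "record_id",
        "hudi_precombine_key": "cdc_timestamp",
        "hudi_table_type": "COPY_ON_WRITE",
        "storage_mode": "current_snapshot",
        "raw_bucket_name": "odpf-demo-raw-EXAMPLE-BUCKET",
        "glue_database_name": "odpf_demo_taxi_trips_raw",
        "glue_table_name": "table_1",
        "refresh_cadence": "10_minutes",
    }
)
```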

Solution overview

Let's test the ODP framework by replicating data from 18 operational tables to a data lake and creating source-aligned datasets with a 10-minute refresh cadence. We use Amazon Relational Database Service (Amazon RDS) for MySQL to set up an operational database with 18 tables, upload the New York City Taxi – Yellow Trip Data dataset, set up AWS DMS to replicate data to Amazon S3, process the files using the framework, and finally validate the data using Amazon Athena.

Create S3 buckets

For instructions on creating an S3 bucket, refer to Creating a bucket. For this post, we create the following buckets:

  1. odpf-demo-staging-EXAMPLE-BUCKET – You will use this to migrate operational data using AWS DMS
  2. odpf-demo-raw-EXAMPLE-BUCKET – You will use this to store source-aligned datasets
  3. odpf-demo-code-artifacts-EXAMPLE-BUCKET – You will use this to store code artifacts

Deploy File Manager and File Processor

Deploy File Manager and File Processor by following the instructions from this README and this README, respectively.

Set up Amazon RDS for MySQL

Complete the following steps to set up Amazon RDS for MySQL as the operational data source:

  1. Provision Amazon RDS for MySQL. For instructions, refer to Create and Connect to a MySQL Database with Amazon RDS.
  2. Connect to the database instance using MySQL Workbench or DBeaver.
  3. Create a database (schema) by running the SQL command CREATE DATABASE taxi_trips;.
  4. Create 18 tables by running the SQL commands in the ops_table_sample_ddl.sql script.

Populate data to the operational data source

Complete the following steps to populate data to the operational data source:

  1. To download the New York City Taxi – Yellow Trip Data dataset for January 2021 (Parquet file), navigate to NYC TLC Trip Record Data, expand 2021, and choose Yellow Taxi Trip records. A file called yellow_tripdata_2021-01.parquet will be downloaded to your computer.
  2. On the Amazon S3 console, open the bucket odpf-demo-staging-EXAMPLE-BUCKET and create a folder called nyc_yellow_trip_data.
  3. Upload the yellow_tripdata_2021-01.parquet file to the folder.
  4. Navigate to the bucket odpf-demo-code-artifacts-EXAMPLE-BUCKET and create a folder called glue_scripts.
  5. Download the file load_nyc_taxi_data_to_rds_mysql.py from the GitHub repo and upload it to the folder.
  6. Create an AWS Identity and Access Management (IAM) policy called load_nyc_taxi_data_to_rds_mysql_s3_policy. For instructions, refer to Creating policies using the JSON editor. Use the odpf_setup_test_data_glue_job_s3_policy.json policy definition.
  7. Create an IAM role called load_nyc_taxi_data_to_rds_mysql_glue_role. Attach the policy created in the previous step.
  8. On the AWS Glue console, create a connection for Amazon RDS for MySQL. For instructions, refer to Adding a JDBC connection using your own JDBC drivers and Setting up a VPC to connect to Amazon RDS data stores over JDBC for AWS Glue. Name the connection odpf_demo_rds_connection.
  9. In the navigation pane of the AWS Glue console, choose Glue ETL jobs, Python Shell script editor, and Upload and edit an existing script under Options.
  10. Choose the file load_nyc_taxi_data_to_rds_mysql.py and choose Create.
  11. Complete the following steps to create your job (a scripted alternative with boto3 is sketched after this list):
    • Provide a name for the job, such as load_nyc_taxi_data_to_rds_mysql.
    • For IAM Role, choose load_nyc_taxi_data_to_rds_mysql_glue_role.
    • Set Data processing units to 1/16 DPU.
    • Under Advanced properties, Connections, select the connection you created earlier.
    • Under Job parameters, add the following parameters:
      • input_sample_data_path = s3://odpf-demo-staging-EXAMPLE-BUCKET/nyc_yellow_trip_data/yellow_tripdata_2021-01.parquet
      • schema_name = taxi_trips
      • table_name = table_1
      • rds_connection_name = odpf_demo_rds_connection
    • Choose Save.
  12. On the Actions menu, run the job.
  13. Go back to MySQL Workbench or DBeaver and validate the record count by running the SQL command select count(1) row_count from taxi_trips.table_1. You will get an output of 1369769.
  14. Populate the remaining 17 tables by running the SQL commands from the populate_17_ops_tables_rds_mysql.sql script.
  15. Get the row count from the 18 tables by running the SQL commands from the ops_data_validation_query_rds_mysql.sql script. The following screenshot shows the output.
    Record volumes (for 18 Tables) in Operational Database
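If you prefer to script the job creation and run from steps 9–12, the following boto3 sketch mirrors the same console settings. It is an illustrative alternative that assumes a Python 3.9 shell job; the script location and parameter values simply reflect the demo setup described above.

```python
import boto3

glue = boto3.client("glue")

# Create the Python shell job with 1/16 DPU and the RDS connection (illustrative).
glue.create_job(
    Name="load_nyc_taxi_data_to_rds_mysql",
    Role="load_nyc_taxi_data_to_rds_mysql_glue_role",
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://odpf-demo-code-artifacts-EXAMPLE-BUCKET/glue_scripts/load_nyc_taxi_data_to_rds_mysql.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=0.0625,  # 1/16 DPU
    Connections={"Connections": ["odpf_demo_rds_connection"]},
    DefaultArguments={
        "--input_sample_data_path": "s3://odpf-demo-staging-EXAMPLE-BUCKET/nyc_yellow_trip_data/yellow_tripdata_2021-01.parquet",
        "--schema_name": "taxi_trips",
        "--table_name": "table_1",
        "--rds_connection_name": "odpf_demo_rds_connection",
    },
)

# Start a run of the job (equivalent to choosing Run on the Actions menu).
glue.start_job_run(JobName="load_nyc_taxi_data_to_rds_mysql")
```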

Configure DynamoDB tables

Complete the following steps to configure the DynamoDB tables:

  1. Download the file load_ops_table_configs_to_ddb.py from the GitHub repo and upload it to the folder glue_scripts in the S3 bucket odpf-demo-code-artifacts-EXAMPLE-BUCKET.
  2. Create an IAM policy called load_ops_table_configs_to_ddb_ddb_policy. Use the odpf_setup_test_data_glue_job_ddb_policy.json policy definition.
  3. Create an IAM role called load_ops_table_configs_to_ddb_glue_role. Attach the policy created in the previous step.
  4. On the AWS Glue console, choose Glue ETL jobs, Python Shell script editor, and Upload and edit an existing script under Options.
  5. Choose the file load_ops_table_configs_to_ddb.py and choose Create.
  6. Complete the following steps to create a job:
    • Provide a name, such as load_ops_table_configs_to_ddb.
    • For IAM Role, choose load_ops_table_configs_to_ddb_glue_role.
    • Set Data processing units to 1/16 DPU.
    • Under Job parameters, add the following parameters:
      • batch_config_ddb_table_name = odpf_batch_config
      • raw_table_config_ddb_table_name = odpf_demo_taxi_trips_raw
      • aws_region = e.g., us-west-1
    • Choose Save.
  7. On the Actions menu, run the job.
  8. On the DynamoDB console, get the item count from the tables. You will find 1 item in the odpf_batch_config table and 18 items in the odpf_demo_taxi_trips_raw table.

Set up a database in AWS Glue

Complete the following steps to create a database:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Databases.
  2. Create a database called odpf_demo_taxi_trips_raw.

Set up AWS DMS for CDC

Complete the following steps to set up AWS DMS for CDC:

  1. Create an AWS DMS replication instance. For Instance class, choose dms.t3.medium.
  2. Create a source endpoint for Amazon RDS for MySQL.
  3. Create a target endpoint for Amazon S3. To configure the S3 endpoint settings, use the JSON definition from dms_s3_endpoint_setting.json.
  4. Create an AWS DMS task (a scripted alternative using boto3 is sketched after this list).
    • Use the source and target endpoints created in the previous steps.
    • To create the AWS DMS task mapping rules, use the JSON definition from dms_task_mapping_rules.json.
    • Under Migration task startup configuration, select Automatically on create.
  5. When the AWS DMS task starts running, you will see a task summary similar to the following screenshot.
    DMS Task Summary
  6. In the Table statistics section, you will see an output similar to the following screenshot. Here, the Full load rows and Total rows columns are important metrics whose counts should match the record volumes of the 18 tables in the operational data source.
    DMS Task Statistics
  7. As a result of successful full load completion, you will find Parquet files in the S3 staging bucket—one Parquet file per table in a dedicated folder, similar to the following screenshot. Similarly, you will find 17 such folders in the bucket.
    DMS Output in S3 Staging Bucket for Table 1
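If you prefer to create the replication task programmatically, the following boto3 sketch shows one way to do it. The table mapping is a generic selection rule for every table in the taxi_trips schema and is not necessarily identical to dms_task_mapping_rules.json in the repo; the endpoint and replication instance ARNs are placeholders.

```python
import json

import boto3

dms = boto3.client("dms")

# Generic selection rule for all tables in the taxi_trips schema (illustrative only;
# the repo's dms_task_mapping_rules.json is the authoritative definition).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-taxi-trips",
            "object-locator": {"schema-name": "taxi_trips", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="odpf-demo-task",
    SourceEndpointArn="arn:aws:dms:us-west-1:111122223333:endpoint:SOURCE",  # placeholder
    TargetEndpointArn="arn:aws:dms:us-west-1:111122223333:endpoint:TARGET",  # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-west-1:111122223333:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```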

File Manager output

The File Manager Lambda function consumes messages from the SQS queue, extracts metadata for the CDC files, and inserts one item per file into the odpf_file_tracker DynamoDB table. When you check the items, you will find 18 items with file_ingestion_status set to raw_file_landed, as shown in the following screenshot.

CDC Files in File Tracker DynamoDB Table

File Processor output

  1. At the next 10th minute (since the activation of the EventBridge rule), the event rule triggers the File Processor state machine. On the Step Functions console, you will notice that the state machine is invoked, as shown in the following screenshot.
    File Processor State Machine Run Summary
  2. As shown in the following screenshot, the Batch Generator Lambda function creates four batches and constructs a Map state for parallel running of the File Processor Trigger Lambda function.
    File Processor State Machine Run Details
  3. Then, the File Processor Trigger Lambda function runs the File Processor Glue job, as shown in the following screenshot.
    File Processor Glue Job Parallel Runs
  4. Then, you will notice that the File Processor Glue job runs create source-aligned datasets in Hudi format in the S3 raw bucket. For Table 1, you will see an output similar to the following screenshot. There will be 17 such folders in the S3 raw bucket.
    Data in S3 raw bucket
  5. Finally, in the AWS Glue Data Catalog, you will notice 18 tables created in the odpf_demo_taxi_trips_raw database, similar to the following screenshot.
    Tables in Glue Database

Data validation

Complete the following steps to validate the data:

  1. On the Amazon Athena console, open the query editor, and select a workgroup or create a new workgroup.
  2. Choose AwsDataCatalog for Data source and odpf_demo_taxi_trips_raw for Database.
  3. Run the raw_data_validation_query_athena.sql SQL query (a scripted alternative is sketched after this list). You will get an output similar to the following screenshot.
    Raw Data Validation via Amazon Athena
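You can also submit the validation query through the Athena API. The query string below is a trimmed, illustrative per-table count rather than the full raw_data_validation_query_athena.sql from the repo, and the result output location is a placeholder.

```python
import boto3

athena = boto3.client("athena")

# Trimmed, illustrative query; the repo's raw_data_validation_query_athena.sql
# validates all 18 tables.
query = "SELECT count(*) AS row_count FROM odpf_demo_taxi_trips_raw.table_1"

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "odpf_demo_taxi_trips_raw"},
    ResultConfiguration={
        # Placeholder bucket and prefix for Athena query results.
        "OutputLocation": "s3://odpf-demo-code-artifacts-EXAMPLE-BUCKET/athena-results/"
    },
)
print(response["QueryExecutionId"])
```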

Validation summary: The counts in Amazon Athena match the counts of the operational tables, which proves that the ODP framework has processed all the files and records successfully. This concludes the demo. To test additional scenarios, refer to Extended Testing in the code repo.

Results

Let's review how the ODP framework addressed the aforementioned requirements.

  1. As discussed earlier in this post, by logically grouping tables by refresh cadence and associating them with EventBridge rules, we ensured that the source-aligned tables are refreshed by the File Processor AWS Glue jobs. With the AWS Glue worker type configuration setting, we selected the appropriate compute resources while running the AWS Glue jobs (the instances of the AWS Glue job).
  2. By applying table-specific configurations (from odpf_batch_config and odpf_raw_table_config) dynamically, we were able to use one AWS Glue job to process CDC files for 18 tables.
  3. You can use this framework to support a variety of data migration use cases that require quicker data migration from on-premises storage systems to data lakes or analytics platforms on AWS. You can reuse File Manager as is and customize File Processor to work with other storage frameworks such as Apache Iceberg and Delta Lake, and purpose-built data stores such as Amazon Aurora and Amazon Redshift.
  4. To understand how the ODP framework met the company's disaster recovery (DR) design criterion, we first need to understand the DR architecture strategy at a high level. The DR architecture strategy has the following aspects:
    • One AWS account and two AWS Regions are used for the primary and secondary environments.
    • The data lake infrastructure in the secondary Region is kept in sync with the one in the primary Region.
    • Data is stored in S3 buckets, metadata is stored in the AWS Glue Data Catalog, and access controls in Lake Formation are replicated from the primary to the secondary Region.
    • The data lake source and target systems have their respective DR environments.
    • CI/CD tooling (version control, CI server, and so on) is to be made highly available.
    • The DevOps team needs to be able to deploy CI/CD pipelines of analytics frameworks (such as this ODP framework) to either the primary or secondary Region.
    • As you can imagine, disaster recovery on AWS is a vast subject, so we keep our discussion to the general design aspects.

By designing the ODP framework with three components and externalizing operational table configurations to DynamoDB global tables, the company was able to deploy the framework components to the secondary Region (in the rare event of a single-Region failure) and continue to process CDC files from the point it last processed in the primary Region. Because the CDC file tracking and processing audit data is replicated to the DynamoDB replica tables in the secondary Region, the File Manager microservice and File Processor can seamlessly run.
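As a concrete illustration of that design choice, a DynamoDB table created with the current global tables version (2019.11.21) can be extended with a replica in the secondary Region through a single API call. The table name and Regions below are placeholders.

```python
import boto3

# Primary Region client (placeholder Region names).
dynamodb = boto3.client("dynamodb", region_name="us-west-1")

# Add a replica of the file tracker table in the secondary Region; DynamoDB then
# keeps the replica in sync automatically.
dynamodb.update_table(
    TableName="odpf_file_tracker",
    ReplicaUpdates=[{"Create": {"RegionName": "us-east-1"}}],
)
```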

Clean up

When you're finished testing this framework, you can delete the provisioned AWS resources to avoid any further charges.

Conclusion

In this post, we took a real-world operational data processing use case and presented you the framework we developed at AWS ProServe. We hope this post and the operational data processing framework using AWS Glue and Apache Hudi will expedite your journey in integrating operational databases into your modern data platforms built on AWS.


About the authors

Ravi Itha is a Principal Consultant at AWS Professional Services with a specialization in data and analytics and a generalist background in application development. Ravi helps customers with enterprise data strategy initiatives across the insurance, airlines, pharmaceutical, and financial services industries. In his 6-year tenure at Amazon, Ravi has helped the AWS builder community by publishing approximately 15 open-source solutions (accessible via his GitHub handle), four blogs, and reference architectures. Outside of work, he is passionate about reading India Knowledge Systems and practicing Yoga Asanas.

Srinivas Kandi is a Data Architect at AWS Professional Services. He leads customer engagements related to data lakes, analytics, and data warehouse modernizations. He enjoys learning about history and civilizations.
