Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication


Streaming data has become an indispensable resource for organizations worldwide because it provides real-time insights that are crucial for data analytics. The escalating velocity and volume of collected data has created a demand for real-time analytics. This data originates from diverse sources, including social media, sensors, logs, and clickstreams, among others. With streaming data, organizations gain a competitive edge by promptly responding to real-time events and making well-informed decisions.

In streaming applications, a prevalent approach involves ingesting data through Apache Kafka and processing it with Apache Spark Structured Streaming. However, managing, integrating, and authenticating the processing framework (Apache Spark Structured Streaming) with the ingestion framework (Kafka) poses significant challenges, necessitating a managed and serverless framework. For example, integrating and authenticating a client like Spark Structured Streaming with Kafka brokers and ZooKeeper using a manual TLS method requires certificate and keystore management, which is not a straightforward task and requires good knowledge of TLS setup.

To address these issues effectively, we propose using Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed Apache Kafka service that provides a seamless way to ingest and process streaming data. In this post, we use Amazon MSK Serverless, a cluster type for Amazon MSK that makes it possible for you to run Apache Kafka without having to manage and scale cluster capacity. To further enhance security and streamline authentication and authorization, MSK Serverless lets you handle both authentication and authorization with AWS Identity and Access Management (IAM) on your cluster. This integration eliminates the need for separate mechanisms for authentication and authorization, simplifying and strengthening data protection. For example, when a client tries to write to your cluster, MSK Serverless uses IAM to check whether that client is an authenticated identity and also whether it is authorized to produce to your cluster.
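
For illustration, the IAM identity used by a producer typically needs kafka-cluster permissions scoped to the cluster and topic. The following policy is a minimal sketch; the cluster name, account ID, and Region are placeholders rather than values created by this post's templates.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:CreateTopic",
        "kafka-cluster:DescribeTopic",
        "kafka-cluster:WriteData"
      ],
      "Resource": [
        "arn:aws:kafka:us-east-1:123456789012:cluster/my-serverless-cluster/*",
        "arn:aws:kafka:us-east-1:123456789012:topic/my-serverless-cluster/*/msk-serverless-blog"
      ]
    }
  ]
}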

To process data effectively, we use AWS Glue, a serverless data integration service that uses the Spark Structured Streaming framework and enables near-real-time data processing. An AWS Glue streaming job can handle large volumes of incoming data from MSK Serverless with IAM authentication. This powerful combination ensures that data is processed securely and swiftly.

This post demonstrates how to build an end-to-end implementation that processes data from MSK Serverless using an AWS Glue streaming extract, transform, and load (ETL) job with IAM authentication to connect to MSK Serverless from the AWS Glue job, and then queries the data using Amazon Athena.

Solution overview

The following diagram illustrates the architecture that you implement in this post.

The workflow consists of the following steps:

  1. Create an MSK Serverless cluster with IAM authentication and an EC2 Kafka client as the producer to ingest sample data into a Kafka topic. For this post, we use the kafka-console-producer.sh Kafka console producer client.
  2. Set up an AWS Glue streaming ETL job to process the incoming data. This job extracts data from the Kafka topic, loads it into Amazon Simple Storage Service (Amazon S3), and creates a table in the AWS Glue Data Catalog. By continuously consuming data from the Kafka topic, the ETL job stays synchronized with the latest streaming data. Moreover, the job uses checkpointing, which tracks the processed records and enables it to resume processing seamlessly from the point of interruption in the event of a job run failure. (A minimal sketch of such a streaming script follows this list.)
  3. Following the data processing, the streaming job stores data in Amazon S3 and generates a Data Catalog table. This table acts as a metadata layer for the data. To interact with the data stored in Amazon S3, you can use Athena, a serverless and interactive query service. Athena enables running SQL-like queries on the data, facilitating seamless exploration and analysis.
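
The following is a minimal sketch of the kind of PySpark script such a streaming job runs; it is not the exact script deployed by gluejob-setup.yaml. The database and table names match the values used later in this post, while the output path and window size are placeholder assumptions.
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Kafka topic through the Data Catalog table that is backed by the
# IAM-authenticated AWS Glue Kafka connection (names assumed from this post).
streaming_df = glue_context.create_data_frame.from_catalog(
    database="glue_kafka_blog_db",
    table_name="blog_kafka_tbl",
    additional_options={"startingOffsets": "earliest", "inferSchema": "true"},
)

def process_batch(data_frame, batch_id):
    # Write each non-empty micro-batch to Amazon S3 as Parquet (bucket is a placeholder).
    if data_frame.count() > 0:
        dynamic_frame = DynamicFrame.fromDF(data_frame, glue_context, "from_kafka")
        glue_context.write_dynamic_frame.from_options(
            frame=dynamic_frame,
            connection_type="s3",
            connection_options={"path": "s3://<your-output-bucket>/output/"},
            format="parquet",
        )

# Checkpointing under the job's temporary directory lets the job resume from the
# point of interruption after a failure.
glue_context.forEachBatch(
    frame=streaming_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args["TempDir"] + "/checkpoint/",
    },
)
job.commit()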

For this post, we create the solution resources in the us-east-1 Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.

Configure resources with AWS CloudFormation

In this post, you use the following two CloudFormation templates. The advantage of using two different templates is that you can decouple the resource creation of the ingestion and processing parts according to your use case, for example if you have requirements to create only specific processing resources.

  • vpc-mskserverless-client.yaml – This template sets up the data ingestion service resources, such as a VPC, MSK Serverless cluster, and S3 bucket
  • gluejob-setup.yaml – This template sets up the data processing resources, such as the AWS Glue table, database, connection, and streaming job

Create data ingestion resources

The vpc-mskserverless-client.yaml stack creates a VPC, private and public subnets, security groups, an S3 VPC endpoint, an MSK Serverless cluster, an EC2 instance with a Kafka client, and an S3 bucket. To create the solution resources for data ingestion, complete the following steps:

  1. Launch the stack vpc-mskserverless-client using the CloudFormation template:
  2. Provide the parameter values as listed in the following table.
Parameters | Description | Sample value
EnvironmentName | Environment name that is prefixed to resource names. | .
PrivateSubnet1CIDR | IP range (CIDR notation) for the private subnet in the first Availability Zone. | .
PrivateSubnet2CIDR | IP range (CIDR notation) for the private subnet in the second Availability Zone. | .
PublicSubnet1CIDR | IP range (CIDR notation) for the public subnet in the first Availability Zone. | .
PublicSubnet2CIDR | IP range (CIDR notation) for the public subnet in the second Availability Zone. | .
VpcCIDR | IP range (CIDR notation) for this VPC. | .
InstanceType | Instance type for the EC2 instance. | t2.micro
LatestAmiId | AMI used for the EC2 instance. | /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2
  1. When the stack creation is complete, retrieve the EC2 instance PublicDNS from the vpc-mskserverless-client stack's Outputs tab.

The stack creation process can take around 15 minutes to complete.

  1. On the Amazon EC2 console, access the EC2 instance that you created using the CloudFormation template.
  2. Choose the EC2 instance whose InstanceId is shown on the stack's Outputs tab.

Next, you log in to the EC2 instance using Session Manager, a capability of AWS Systems Manager.

  1. On the Amazon EC2 console, select the instance, and on the Session Manager tab, choose Connect.


After you log in to the EC2 instance, you create a Kafka topic in the MSK Serverless cluster from the EC2 instance.

  1. In the following export command, provide the MSKBootstrapServers value from the vpc-mskserverless-client stack output for your endpoint:
    $ sudo su - ec2-user
    $ BS=<your-msk-serverless-endpoint (e.g.) boot-xxxxxx.yy.kafka-serverless.us-east-1.amazonaws.com:9098>

  2. Run the following command on the EC2 instance to create a topic called msk-serverless-blog. The Kafka client is already installed in the ec2-user home directory (/home/ec2-user).
    $ /home/ec2-user/kafka_2.12-2.8.1/bin/kafka-topics.sh \
    --bootstrap-server $BS \
    --command-config /home/ec2-user/kafka_2.12-2.8.1/bin/client.properties \
    --create --topic msk-serverless-blog \
    --partitions 1

    Created topic msk-serverless-blog
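
Both the topic creation and producer commands reference client.properties, which configures the Kafka client for IAM authentication. Assuming the Amazon MSK IAM auth library (the aws-msk-iam-auth JAR) is on the client's classpath, the file typically looks like the following; the exact contents on the EC2 instance created by the stack may differ slightly.
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler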

After you confirm the topic creation, you can push data to MSK Serverless.

  1. Run the following command on the EC2 instance to start a console producer that produces records to the Kafka topic. (For source data, we use nycflights.csv, downloaded to the ec2-user home directory /home/ec2-user.)
$ /home/ec2-user/kafka_2.12-2.8.1/bin/kafka-console-producer.sh \
--broker-list $BS \
--producer.config /home/ec2-user/kafka_2.12-2.8.1/bin/client.properties \
--topic msk-serverless-blog < nycflights.csv
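
Optionally, you can verify that the records reached the topic by running a console consumer from the same EC2 instance. This verification step is not part of the stack-driven procedure; press Ctrl+C to exit once records appear.
$ /home/ec2-user/kafka_2.12-2.8.1/bin/kafka-console-consumer.sh \
--bootstrap-server $BS \
--consumer.config /home/ec2-user/kafka_2.12-2.8.1/bin/client.properties \
--topic msk-serverless-blog --from-beginning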

Next, you set up the data processing service resources, namely the AWS Glue components such as the database, table, and streaming job, to process the data.

Create data processing resources

The gluejob-setup.yaml CloudFormation template creates a database, table, AWS Glue connection, and AWS Glue streaming job. Retrieve the values for VpcId, GluePrivateSubnet, GlueconnectionSubnetAZ, SecurityGroup, S3BucketForOutput, and S3BucketForGlueScript from the vpc-mskserverless-client stack's Outputs tab to use in this template. Complete the following steps:

  1. Launch the stack gluejob-setup:

  1. Provide parameter values as listed in the following table.
Parameters | Description | Sample value
EnvironmentName | Environment name that is prefixed to resource names. | Gluejob-setup
VpcId | ID of the VPC for the security group. Use the VPC ID created with the first stack. | Refer to the first stack's output.
GluePrivateSubnet | Private subnet used for creating the AWS Glue connection. | Refer to the first stack's output.
SecurityGroupForGlueConnection | Security group used by the AWS Glue connection. | Refer to the first stack's output.
GlueconnectionSubnetAZ | Availability Zone for the first private subnet used for the AWS Glue connection. | .
GlueDataBaseName | Name of the AWS Glue Data Catalog database. | glue_kafka_blog_db
GlueTableName | Name of the AWS Glue Data Catalog table. | blog_kafka_tbl
S3BucketNameForScript | Bucket name for the AWS Glue ETL script. Use the S3 bucket name from the previous stack. | For example, aws-gluescript-${AWS::AccountId}-${AWS::Region}-${EnvironmentName}
GlueWorkerType | Worker type for the AWS Glue job. For example, G.1X. | G.1X
NumberOfWorkers | Number of workers in the AWS Glue job. | 3
S3BucketNameForOutput | Bucket name for writing data from the AWS Glue job. | aws-glueoutput-${AWS::AccountId}-${AWS::Region}-${EnvironmentName}
TopicName | MSK topic name that needs to be processed. | msk-serverless-blog
MSKBootstrapServers | Kafka bootstrap server. | boot-30vvr5lg.c1.kafka-serverless.us-east-1.amazonaws.com:9098

The stack creation process can take around 1–2 minutes to complete. You can check the Outputs tab for the stack after the stack is created.

In the gluejob-setup stack, we created a Kafka-type AWS Glue connection, which consists of broker information such as the MSK bootstrap server, topic name, and the VPC in which the MSK Serverless cluster is created. Most importantly, it specifies the IAM authentication option, which allows AWS Glue to authenticate and authorize using IAM while consuming data from the MSK topic. For further clarity, you can examine the AWS Glue connection and the associated AWS Glue table generated through AWS CloudFormation.
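
For reference, a Kafka-type AWS Glue connection with IAM authentication can be declared in CloudFormation roughly as follows. This is a sketch rather than an excerpt from gluejob-setup.yaml: the resource name and parameter references are assumptions, while KAFKA_BOOTSTRAP_SERVERS, KAFKA_SSL_ENABLED, and KAFKA_SASL_MECHANISM set to AWS_MSK_IAM are the documented connection properties that enable IAM authentication.
GlueKafkaConnection:
  Type: AWS::Glue::Connection
  Properties:
    CatalogId: !Ref AWS::AccountId
    ConnectionInput:
      Name: !Sub ${EnvironmentName}-msk-serverless-connection   # assumed name
      ConnectionType: KAFKA
      ConnectionProperties:
        KAFKA_BOOTSTRAP_SERVERS: !Ref MSKBootstrapServers
        KAFKA_SSL_ENABLED: "true"
        KAFKA_SASL_MECHANISM: AWS_MSK_IAM   # authenticate and authorize with IAM
      PhysicalConnectionRequirements:
        SubnetId: !Ref GluePrivateSubnet
        SecurityGroupIdList:
          - !Ref SecurityGroupForGlueConnection
        AvailabilityZone: !Ref GlueconnectionSubnetAZ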

After successfully creating the CloudFormation stack, you can proceed with processing data using the AWS Glue streaming job with IAM authentication.

Run the AWS Glue streaming job

To process the data from the MSK topic using the AWS Glue streaming job that you set up in the previous section, complete the following steps:

  1. On the CloudFormation console, choose the stack gluejob-setup.
  2. On the Outputs tab, retrieve the name of the AWS Glue streaming job from the GlueJobName row. In the following screenshot, the name is GlueStreamingJob-glue-streaming-job.

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Search for the AWS Glue streaming job named GlueStreamingJob-glue-streaming-job.
  3. Choose the job name to open its details page.
  4. Choose Run to start the job. (You can also start the job from the AWS CLI, as shown after these steps.)
  5. On the Runs tab, confirm the job ran without failure.
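
As an alternative to the console, you can start the streaming job and check its most recent run status from the AWS CLI. The job name below assumes the value retrieved from the stack's Outputs tab.
$ aws glue start-job-run --job-name GlueStreamingJob-glue-streaming-job
$ aws glue get-job-runs --job-name GlueStreamingJob-glue-streaming-job --max-results 1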

  1. Retrieve the OutputBucketName from the gluejob-setup stack outputs.
  2. On the Amazon S3 console, navigate to the S3 bucket to verify the data.

  1. On the AWS Glue console, choose the AWS Glue streaming job you ran, then choose Stop job run.

Because this is a streaming job, it will continue to run indefinitely until manually stopped. After you verify that the data is present in the S3 output bucket, you can stop the job to save cost.
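
You can also stop the run from the AWS CLI; the run ID below is a placeholder that you can retrieve from the get-job-runs output shown earlier.
$ aws glue batch-stop-job-run --job-name GlueStreamingJob-glue-streaming-job --job-run-ids <your-job-run-id>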

Validate the data in Athena

After the AWS Glue streaming job has successfully created the table for the processed data in the Data Catalog, follow these steps to validate the data using Athena:

  1. On the Athena console, navigate to the query editor.
  2. Choose the Data Catalog as the data source.
  3. Choose the database and table that the AWS Glue streaming job created.
  4. To validate the data, run the following query to find the flight number, origin, and destination that covered the highest distance in a year:
SELECT DISTINCT(flight), distance, origin, dest, year FROM "glue_kafka_blog_db"."output" WHERE distance = (SELECT MAX(distance) FROM "glue_kafka_blog_db"."output")

The following screenshot shows the output of our example query.

Clean up

To clean up your resources, complete the following steps:

  1. Delete the CloudFormation stack gluejob-setup.
  2. Delete the CloudFormation stack vpc-mskserverless-client.

Conclusion

In this post, we demonstrated a use case for building a serverless ETL pipeline for streaming with IAM authentication, which lets you focus on the outcomes of your analytics. You can also modify the AWS Glue streaming ETL code in this post with transformations and mappings to ensure that only valid data gets loaded to Amazon S3. This solution enables you to harness the power of AWS Glue streaming, seamlessly integrated with MSK Serverless through the IAM authentication method. It's time to act and revolutionize your streaming processes.

Appendix

This section provides more information about how to create the AWS Glue connection on the AWS Glue console, which establishes the connection to the MSK Serverless cluster and allows the AWS Glue streaming job to authenticate and authorize using IAM while consuming data from the MSK topic.

  1. On the AWS Glue console, in the navigation pane, under Data catalog, choose Connections.
  2. Choose Create connection.
  3. For Connection name, enter a unique name for your connection.
  4. For Connection type, choose Kafka.
  5. For Connection access, select Amazon managed streaming for Apache Kafka (MSK).
  6. For Kafka bootstrap server URLs, enter a comma-separated list of bootstrap server URLs. Include the port number. For example, boot-xxxxxxxx.c2.kafka-serverless.us-east-1.amazonaws.com:9098.

  1. For Authentication, choose IAM Authentication.
  2. Select Require SSL connection.
  3. For VPC, choose the VPC that contains your data source.
  4. For Subnet, choose the private subnet within your VPC.
  5. For Security groups, choose a security group to allow access to the data store in your VPC subnet.

Security groups are associated with the ENI attached to your subnet. You must choose at least one security group with a self-referencing inbound rule for all TCP ports.

  1. Choose Save changes.

After you create the AWS Glue connection, you can use the AWS Glue streaming job to consume data from the MSK topic using IAM authentication.


About the authors

Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru specializing in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workloads and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham loves to spend time with his family and travel around the world.

Nitin Kumar is a Cloud Engineer (ETL) at AWS with a specialization in AWS Glue. He is dedicated to helping customers resolve issues related to their ETL workloads and build scalable data processing and analytics pipelines on AWS.
