End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue


Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To develop the power of data at scale for the future, it's highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:

  • Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
  • Are there recommended approaches to provisioning components for data integration?
  • How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
  • What is the best practice for moving from a pre-production environment to production?

To address these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.

End-to-end development lifecycle for a data integration pipeline

Today, it's common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.

In this section, we discuss each phase in the context of a data integration pipeline.

Plan

In the planning phase, developers collect requirements from stakeholders, such as end users, to define data requirements. These could include what the use cases are (for example, ad hoc queries, dashboards, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency is acceptable before the data becomes queryable (for example, 15 minutes), and so on.
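Those example figures (1 TB per day, 15-minute latency) already pin down how much data each processing cycle must handle. A quick back-of-the-envelope sketch (the numbers below are just this section's example values, not a recommendation):

```python
# Rough sizing: if 1 TB arrives per day and results must be queryable
# within 15 minutes, each micro-batch handles roughly this much data.
daily_volume_gb = 1024          # 1 TB per day, in GB
batch_interval_min = 15         # acceptable data latency

batches_per_day = 24 * 60 // batch_interval_min   # number of batches per day
gb_per_batch = daily_volume_gb / batches_per_day  # average volume per batch

print(f"{batches_per_day} batches/day, ~{gb_per_batch:.1f} GB per batch")
```

Numbers like these feed directly into the design phase, for example when sizing AWS Glue job capacity.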

Design

In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating those services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating, preprocessing, and enriching the data. You may then want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end users.

Implement

In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks that achieve the final result. The code includes the following:

  • AWS resource definitions
  • Data integration logic

When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement the AWS resource definitions using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, as well as the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.

Test

In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking whether it meets the requirements. Because many teams immediately test the code they write, the testing phase often runs in parallel with the development phase. There are different types of testing:

  • Unit testing
  • Integration testing
  • Performance testing

For unit testing, even for data integration, you can rely on a standard testing framework such as pytest or ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
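As a minimal illustration of that approach (a generic sketch, not part of this post's template — the join_members function and sample records are hypothetical), factoring transformation logic into a pure function makes it testable with pytest without any AWS dependency:

```python
# test_join_members.py -- unit-testing a transformation as a pure function.
def join_members(orgs, memberships):
    """Attach each organization's name to its membership records."""
    org_names = {org["id"]: org["name"] for org in orgs}
    return [
        {**m, "org_name": org_names[m["organization_id"]]}
        for m in memberships
        if m["organization_id"] in org_names
    ]

def test_join_members():
    orgs = [{"id": "o1", "name": "Senate"}]
    memberships = [
        {"person_id": "p1", "organization_id": "o1"},
        {"person_id": "p2", "organization_id": "o2"},  # unknown org: dropped
    ]
    assert join_members(orgs, memberships) == [
        {"person_id": "p1", "organization_id": "o1", "org_name": "Senate"}
    ]
```

Because the logic is a plain function, the same test runs unchanged on a laptop or inside the Glue Docker container described later in this post.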

Deploy

When data engineers develop a data integration pipeline, they code and test on a different copy of the product than the one that end users have access to. The environment that end users use is called production, while other copies are said to be in the development or pre-production environment.

Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it's being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.

The following components are deployed through the AWS CDK or AWS CloudFormation:

  • AWS resources
  • Data integration job scripts for AWS Glue

AWS CodePipeline helps you build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads the script files included in the commit to Amazon S3.

Maintain

Even after you deploy your solution to a production environment, it's not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.

Solution overview

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:

  • Pipeline account – This hosts the end-to-end pipeline
  • Dev account – This hosts the integration pipeline in the development environment
  • Prod account – This hosts the data integration pipeline in the production environment

If you want, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:

  • AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
  • Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

  • AWS Glue jobs
  • AWS Glue job scripts

The following diagram illustrates this architecture.

At the time of publishing this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it's straightforward to define and manage AWS Glue resources with it. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.

The pipeline stack provisions the entire CI/CD pipeline in the pipeline account.

The following diagram illustrates the pipeline workflow.

Every time the business requirements change (such as adding data sources or changing data transformation logic), you make changes to the AWS Glue app stack and re-provision the stack to reflect your changes. You do this by committing your changes in the AWS CDK template to the CodeCommit repository; CodePipeline then reflects the changes on AWS resources using CloudFormation change sets.

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.

Prerequisites

You need the following resources:

Initialize the project

To initialize the project, complete the following steps:

  1. Clone the baseline template to your workspace:
    $ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
    $ cd aws-glue-cdk-baseline

  2. Create a Python virtual environment specific to the project on the client machine:
    $ python3 -m venv .venv

We use a virtual environment in order to isolate the Python environment for this project and avoid installing software globally.

  3. Activate the virtual environment according to your OS:
    • On macOS and Linux, use the following command:
      $ source .venv/bin/activate

    • On Windows, use the following command:
      % .venv\Scripts\activate.bat

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

  4. Install the required dependencies described in requirements.txt to the virtual environment:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  5. Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
    pipelineAccount:
      awsAccountId: 123456789101
      awsRegion: us-east-1

    devAccount:
      awsAccountId: 123456789102
      awsRegion: us-east-1

    prodAccount:
      awsAccountId: 123456789103
      awsRegion: us-east-1

  6. Run pytest to initialize the snapshot test files by running the following command:
    $ python3 -m pytest --snapshot-update

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

  1. In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:
    $ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

  2. In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:
    $ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>

  3. In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:
    $ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>

When you use only one account for all environments, you can simply run the cdk bootstrap command one time.

Deploy your AWS resources

Run the following command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

$ cdk deploy --profile <PIPELINE-PROFILE>

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the cdk deploy command is complete, let's verify the pipeline using the pipeline account.

On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, and DeployDev have succeeded, and that DeployProd is pending. This can take about 15 minutes.

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resources on the AWS CloudFormation console in the dev account.

At this step, the AWS Glue app stack is deployed only in the dev account. You can try running the AWS Glue job ProcessLegislators to see how it works.

Configure your Git repository with CodeCommit

In an earlier step, you cloned the Git repository from GitHub. Although it's possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer those third-party Git providers, configure the connections and edit pipeline_stack.py to define the variable source to use the target Git provider using CodePipelineSource.

Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the repository from CodeCommit to your local machine. Run the following commands:

$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline

In the next step, we make changes to this local copy of the CodeCommit repository.

End-to-end development lifecycle

Now that the environment has been successfully created, you're ready to start developing a data integration pipeline using this baseline template. Let's walk through the end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let's assume the use case is to add a new AWS Glue job, with a new job script, that reads multiple S3 locations and joins them.

Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.

Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:

  1. Start Docker.
  2. Pull the Docker image that has the local development environment using the AWS Glue ETL library:
    $ docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01

  3. Run the following command to define the AWS named profile name:
    $ PROFILE_NAME="<DEV-PROFILE>"

  4. Run the following commands to make the baseline template accessible:
    $ cd aws-glue-cdk-baseline/
    $ WORKSPACE_LOCATION=$(pwd)

  5. Run the Docker container:
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true \
    --rm -p 4040:4040 -p 18080:18080 \
    --name glue_pyspark public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark

  6. Start Visual Studio Code.
  7. Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01.

If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.

Then you will see a view similar to the following screenshot.

Optionally, you can install the AWS Toolkit for Visual Studio Code and start Amazon CodeWhisperer to enable code recommendations powered by a machine learning model. For example, in aws_glue_cdk_baseline/job_scripts/process_legislators.py, you can add a comment like "# Write a DataFrame in Parquet format to S3", press Enter, and CodeWhisperer will recommend a code snippet similar to the following:

CodeWhisperer on Visual Studio Code

Now install the required dependencies described in requirements.txt in the container environment.

  8. Run the following commands in the terminal in Visual Studio Code:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  9. Implement the code.

Now let's make the required changes for the new AWS Glue job.

  10. Edit the file aws_glue_cdk_baseline/glue_app_stack.py. Add the following new code block after the existing job definition of ProcessLegislators in order to add the new AWS Glue job JoinLegislators:
            self.new_glue_job = glue.Job(self, "JoinLegislators",
                executable=glue.JobExecutable.python_etl(
                    glue_version=glue.GlueVersion.V4_0,
                    python_version=glue.PythonVersion.THREE,
                    script=glue.Code.from_asset(
                        path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
                    )
                ),
                description="a new example PySpark job",
                default_arguments={
                    "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                    "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                    "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
                },
                tags={
                    "environment": self.environment,
                    "artifact_id": self.artifact_id,
                    "stack_id": self.stack_id,
                    "stack_name": self.stack_name
                }
            )

Here, you added three job parameters for different S3 locations using the variable config. This is the dictionary generated from default-config.yaml. In this baseline template, we use this central config file to manage parameters for all the Glue jobs in the structure <stage name>/jobs/<job name>/<parameter name>. In the subsequent steps, you provide these locations through the AWS Glue job parameters.
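In isolation, that lookup pattern looks like the following (a sketch — the dictionary below stands in for the parsed default-config.yaml, using the same hierarchy and the placeholder S3 paths from the unit-test config later in this post):

```python
# Sketch of the <stage name>/jobs/<job name>/<parameter name> lookup.
# `config` mimics what parsing default-config.yaml would produce.
config = {
    "dev": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationOrgs": "s3://path_to_data_orgs",
                "inputLocationPersons": "s3://path_to_data_persons",
                "inputLocationMemberships": "s3://path_to_data_memberships",
            }
        }
    }
}

stage = "dev"  # selected per deployment target (dev or prod)
default_arguments = {
    "--input_path_orgs": config[stage]["jobs"]["JoinLegislators"]["inputLocationOrgs"],
    "--input_path_persons": config[stage]["jobs"]["JoinLegislators"]["inputLocationPersons"],
    "--input_path_memberships": config[stage]["jobs"]["JoinLegislators"]["inputLocationMemberships"],
}
print(default_arguments["--input_path_orgs"])
```

Keeping every job's parameters under one stage key is what lets the same stack code deploy to dev and prod with different S3 locations.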

  11. Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions


    class JoinLegislators:
        def __init__(self):
            params = []
            if '--JOB_NAME' in sys.argv:
                params.append('JOB_NAME')
                params.append('input_path_orgs')
                params.append('input_path_persons')
                params.append('input_path_memberships')
            args = getResolvedOptions(sys.argv, params)
    
            self.context = GlueContext(SparkContext.getOrCreate())
            self.job = Job(self.context)
    
            if 'JOB_NAME' in args:
                jobname = args['JOB_NAME']
                self.input_path_orgs = args['input_path_orgs']
                self.input_path_persons = args['input_path_persons']
                self.input_path_memberships = args['input_path_memberships']
            else:
                jobname = "test"
                self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
                self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
                self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
            self.job.init(jobname, args)
        
        def run(self):
            dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
            df = dyf.toDF()
            df.printSchema()
            df.show()
            print(df.count())
    
    def read_dynamic_frame_from_json(glue_context, path):
        return glue_context.create_dynamic_frame.from_options(
            connection_type="s3",
            connection_options={
                'paths': [path],
                'recurse': True
            },
            format="json"
        )
    
    def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
        orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
        persons = read_dynamic_frame_from_json(glue_context, path_persons)
        memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
        orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
        dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
        return dynamicframe_joined
    
    if __name__ == '__main__':
        JoinLegislators().run()

  12. Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py:
    import pytest
    import sys
    import join_legislators
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    
    @pytest.fixture(scope="module", autouse=True)
    def glue_context():
        sys.argv.append('--JOB_NAME')
        sys.argv.append('test_count')
    
        args = getResolvedOptions(sys.argv, ['JOB_NAME'])
        context = GlueContext(SparkContext.getOrCreate())
        job = Job(context)
        job.init(args['JOB_NAME'], args)
    
        yield(context)
    
    def test_counts(glue_context):
        dyf = join_legislators.join_legislators(glue_context, 
            "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
            "s3://awsglue-datasets/examples/us-legislators/all/persons.json", 
            "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
        assert dyf.toDF().count() == 10439

  13. In default-config.yaml, add the following under both prod and dev:
     JoinLegislators:
          inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
          inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
          inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

  14. Add the following under "jobs" in the variable config in tests/unit/test_glue_app_stack.py, tests/unit/test_pipeline_stack.py, and tests/snapshot/test_snapshot_glue_app_stack.py (no need to replace the S3 locations):
    ,
                "JoinLegislators": {
                    "inputLocationOrgs": "s3://path_to_data_orgs",
                    "inputLocationPersons": "s3://path_to_data_persons",
                    "inputLocationMemberships": "s3://path_to_data_memberships"
                }

  15. Choose Run at the top right to run the individual job scripts.

If the Run button is not shown, install Python in the container through Extensions in the navigation pane.

  16. For local unit testing, run the following commands in the terminal in Visual Studio Code:
    $ cd aws_glue_cdk_baseline/job_scripts/
    $ python3 -m pytest

Then you can verify that the newly added unit test passed successfully.

  17. Run pytest again to update the snapshot test files by running the following command:
    $ cd ../../
    $ python3 -m pytest --snapshot-update

Deploy to the development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

  1. Set up access to CodeCommit.
  2. Commit and push your changes to the CodeCommit repo:
    $ git add .
    $ git commit -m "Add the second Glue job"
    $ git push

You can see that the pipeline is successfully triggered.

Integration take a look at

Nothing extra is required to run the integration test for the newly added AWS Glue job. The integration test script integ_test_glue_app_stack.py runs all the jobs carrying a specific tag, then verifies their state and duration. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
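The shape of that state-and-duration check can be sketched as a small predicate (a simplified stand-in for the real assertions in integ_test_glue_app_stack.py; the field names follow the JobRun structure returned by the Glue GetJobRun API, and the 600-second threshold is an assumed example value):

```python
# Simplified integration-test assertion: a job run passes when it
# succeeded and finished within a duration threshold (in seconds).
def run_passed(job_run, max_execution_seconds=600):
    """job_run mimics the JobRun structure from Glue's GetJobRun API."""
    return (
        job_run["JobRunState"] == "SUCCEEDED"
        and job_run["ExecutionTime"] <= max_execution_seconds
    )

print(run_passed({"JobRunState": "SUCCEEDED", "ExecutionTime": 120}))  # True
print(run_passed({"JobRunState": "FAILED", "ExecutionTime": 120}))     # False
```

In the real test, the job run dictionaries come from the Glue API after the tagged jobs finish; tightening max_execution_seconds is how you would turn the duration check into a basic performance guardrail.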

Deploy to the production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

  1. On the CodePipeline console, navigate to GluePipeline.
  2. Choose Review under the DeployProd stage.
  3. Choose Approve.

Wait for the DeployProd stage to complete; then you can verify the AWS Glue app stack resources in the prod account.

Clean up

To clean up your resources, complete the following steps:

  1. Run the following command using the pipeline account:
    $ cdk destroy --profile <PIPELINE-PROFILE>

  2. Delete the AWS Glue app stack in the dev account and prod account.

Conclusion

In this post, you learned how to define the development lifecycle for data integration, and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.


About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
