Introducing AWS Glue crawler and create table support for Apache Iceberg format

August 17, 2023

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time.

AWS Glue crawlers now support Iceberg tables, enabling you to use the AWS Glue Data Catalog and migrate from other Iceberg catalogs more easily. AWS Glue crawlers extract schema information and update the location of Iceberg metadata and schema updates in the Data Catalog. You can then query the Data Catalog Iceberg tables across all analytics engines and apply AWS Lake Formation fine-grained permissions.

The Iceberg catalog helps you manage a collection of Iceberg tables and tracks the table’s current metadata. Iceberg provides several implementation options for the Iceberg catalog, including the AWS Glue Data Catalog, Hive Metastore, and JDBC catalogs. Customers prefer using or migrating to the AWS Glue Data Catalog because of its integrations with AWS analytical services such as Amazon Athena, AWS Glue, Amazon EMR, and Lake Formation.

With today’s launch, you can create and schedule an AWS Glue crawler to catalog existing Iceberg tables in the Data Catalog. You can then provide one or multiple S3 paths where the Iceberg tables are located. You have the option to provide the maximum depth of S3 paths that the crawler can traverse. With each crawler run, the crawler inspects each of the S3 paths and catalogs the schema information, such as new tables, deletes, and updates to schemas in the Data Catalog. Crawlers support schema merging across all snapshots and update the latest metadata file location in the Data Catalog that AWS analytical engines can directly use.

Additionally, AWS Glue is launching support for creating new (empty) Iceberg tables in the Data Catalog using the AWS Glue console or AWS Glue CreateTable API. Before the launch, customers who wanted to adopt the Iceberg table format were required to generate Iceberg’s metadata.json file on Amazon S3 using PutObject separately, in addition to CreateTable. Often, customers have used the create table statement on analytics engines such as Athena, AWS Glue, and so on. The new CreateTable API eliminates the need to create the metadata.json file separately, and automates generating metadata.json based on the given API input. Also, customers who manage deployments using AWS CloudFormation templates can now create Iceberg tables using the CreateTable API. For more details, refer to Creating Apache Iceberg tables.

For accessing the data using Athena, you can also use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with Lake Formation. For source data in Amazon S3 and metadata that isn’t registered with Lake Formation, access is determined by AWS Identity and Access Management (IAM) permissions policies for Amazon S3 and AWS Glue actions.

Solution overview

For our example use case, a customer uses Amazon EMR for data processing and Iceberg format for the transactional data. They store their product data in Iceberg format on Amazon S3 and host the metadata of their datasets in Hive Metastore on the EMR primary node. The customer wants to make product data accessible to analyst personas for interactive analysis using Athena. Many AWS analytics services don’t integrate natively with Hive Metastore, so we use an AWS Glue crawler to populate the metadata in the AWS Glue Data Catalog. Athena supports Lake Formation permissions on Iceberg tables, so we apply fine-grained access for data access.

We configure the crawler to onboard the Iceberg schema to the Data Catalog and use Lake Formation access control for crawling. We apply Lake Formation grants on the database and crawled table to enable analyst users to query the data and verify using Athena.

After we populate the schema of the existing Iceberg dataset to the Data Catalog, we onboard new Iceberg tables to the Data Catalog and load data into the newly created table using Athena. We apply Lake Formation grants on the database and newly created table to enable analyst users to query the data and verify using Athena.

The following diagram illustrates the solution architecture.

Set up resources with AWS CloudFormation

To set up the solution resources using AWS CloudFormation, complete the following steps:

Log in to the AWS Management Console as IAM administrator.
Choose Launch Stack to deploy a CloudFormation template.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

The CloudFormation template generates the following resources:

VPC, subnet, and security group for the EMR cluster
Data lake bucket to store Iceberg table data and metadata
IAM roles for the crawler and Lake Formation registration
EMR cluster and steps to create an Iceberg table with Hive Metastore
Analyst role for data access
Athena bucket path for results

When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
Note down the values of EmrClusterId, DataLakeBucketName, LFRegisterLocationServiceRole, AWSGlueServiceRole, AthenaBucketName, and LFBusinessAnalystRole.
Navigate to the Amazon EMR console and choose the EMR cluster you created.
Navigate to the Steps tab and verify that the steps were run.

This script run creates the database icebergcrawlerblogdb using Hive and the Iceberg table product. It uses the Hive Metastore server on Amazon EMR as the metastore and stores the data on Amazon S3.

Navigate to the S3 bucket you created and verify that the data and metadata are created for the Iceberg table.

Some of the resources that this stack deploys incur costs when in use.

Now that the data is on Amazon S3, we can register the bucket with Lake Formation to implement access control and centralize the data governance.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps to update the Data Catalog settings to use Lake Formation permissions to control Data Catalog resources instead of IAM-based access control:

Sign in to the Lake Formation console as admin.
- If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
In the navigation pane, under Data catalog, choose Settings.
Deselect Use only IAM access control for new databases.
Deselect Use only IAM access control for new tables in new databases.
Choose Version 3 for Cross account version settings.
Choose Save.

Now you can set up Lake Formation permissions.
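For readers who script their setup, the console settings above can be sketched as a transformation of the current data lake settings. This is a minimal sketch, not the method used in this post: it assumes the settings dictionary has the shape returned by the Lake Formation GetDataLakeSettings API, and the function name use_lake_formation_permissions and the sample admin ARN are hypothetical.

```python
import copy


def use_lake_formation_permissions(current_settings: dict) -> dict:
    """Return a copy of the settings with IAM-only defaults switched off.

    Assumption: `current_settings` has the shape of the DataLakeSettings
    structure returned by GetDataLakeSettings.
    """
    updated = copy.deepcopy(current_settings)
    # Empty default permission lists correspond to deselecting the two
    # "Use only IAM access control" check boxes in the console.
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    # Cross account version 3, as chosen in the console steps.
    updated.setdefault("Parameters", {})["CROSS_ACCOUNT_VERSION"] = "3"
    return updated


iam_only = [{
    "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    "Permissions": ["ALL"],
}]
example_settings = {
    # Hypothetical admin ARN, for illustration only.
    "DataLakeAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/Admin"}
    ],
    "CreateDatabaseDefaultPermissions": iam_only,
    "CreateTableDefaultPermissions": list(iam_only),
}
example_updated = use_lake_formation_permissions(example_settings)
# With boto3 this would likely be sent as:
#   boto3.client("lakeformation").put_data_lake_settings(
#       DataLakeSettings=example_updated)
```

Starting from the existing settings rather than a fresh dictionary keeps the current list of data lake administrators intact, because the put call replaces the settings as a whole.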

Register the data lake S3 bucket with Lake Formation

To register the data lake S3 bucket, complete the following steps:

On the Lake Formation console, in the navigation pane, choose Data lake locations.
Choose Register location.
For Amazon S3 path, enter the data lake bucket path.
For IAM role, choose the role noted from the CloudFormation template for LFRegisterLocationServiceRole.
Choose Register location.
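The registration above can also be expressed as an API request. The following is a rough sketch that only builds the request dictionary; the parameter names follow the Lake Formation RegisterResource API as we understand it, and the bucket name, account ID, and helper names are placeholders invented for illustration.

```python
def s3_uri_to_arn(s3_uri: str) -> str:
    """Convert s3://bucket/prefix/ into arn:aws:s3:::bucket/prefix."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return "arn:aws:s3:::" + s3_uri[len(prefix):].rstrip("/")


def build_register_request(s3_uri: str, role_arn: str) -> dict:
    """Build a request that registers an S3 location with a custom role."""
    return {
        "ResourceArn": s3_uri_to_arn(s3_uri),
        # A custom registration role is used instead of the
        # service-linked role, matching the console steps.
        "UseServiceLinkedRole": False,
        "RoleArn": role_arn,
    }


# Placeholder values; substitute the ones noted from the stack.
register_request = build_register_request(
    "s3://example-datalake-bucket/",
    "arn:aws:iam::111122223333:role/LFRegisterLocationServiceRole",
)
# With boto3 this would likely be sent as:
#   boto3.client("lakeformation").register_resource(**register_request)
```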

Grant crawler role access to the data location

To grant access to the crawler, complete the following steps:

On the Lake Formation console, in the navigation pane, choose Data locations.
Choose Grant.
For IAM users and roles, choose the role for the crawler.
For Storage locations, enter the data lake bucket path.
Choose Grant.

Create database and grant access to the crawler role

Complete the following steps to create your database and grant access to the crawler role:

On the Lake Formation console, in the navigation pane, choose Databases.
Choose Create database.
Provide the name icebergcrawlerblogdb for the database.
Make sure the Use only IAM access control for new tables in this database option is not selected.
Choose Create database.
On the Action menu, choose Grant.
For IAM users and roles, choose the role for the crawler.
Leave the database specified as icebergcrawlerblogdb.
Select Create table, Describe, and Alter for Database permissions.
Choose Grant.
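The two grants to the crawler role described above (data location access, then Create table, Describe, and Alter on the database) might look like the following when expressed as Lake Formation GrantPermissions requests. This is a sketch under assumptions: the role ARN and bucket name are placeholders, and build_crawler_grants is a hypothetical helper, not part of any AWS SDK.

```python
def build_crawler_grants(role_arn: str, bucket: str, database: str) -> list:
    """Return the two grant requests the crawler role needs."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {
            # Access to the registered S3 data location.
            "Principal": principal,
            "Resource": {
                "DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket}"}
            },
            "Permissions": ["DATA_LOCATION_ACCESS"],
        },
        {
            # Create table, Describe, and Alter on the target database.
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["CREATE_TABLE", "DESCRIBE", "ALTER"],
        },
    ]


crawler_grants = build_crawler_grants(
    "arn:aws:iam::111122223333:role/AWSGlueServiceRole",  # placeholder
    "example-datalake-bucket",                            # placeholder
    "icebergcrawlerblogdb",
)
# With boto3 each request would likely be sent as:
#   lf = boto3.client("lakeformation")
#   for grant in crawler_grants:
#       lf.grant_permissions(**grant)
```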

Configure the crawler for Iceberg

To configure your crawler for Iceberg, complete the following steps:

On the AWS Glue console, in the navigation pane, choose Crawlers.
Choose Create crawler.
Enter a name for the crawler. For this post, we use icebergcrawler.
Under Data source configuration, choose Add data source.
For Data source, choose Iceberg.
For S3 path, enter s3://<datalakebucket>/icebergcrawlerblogdb.db/.
Choose Add an Iceberg data source.

Support for Iceberg tables is available through the CreateCrawler and UpdateCrawler APIs by adding the additional IcebergTarget as a target, with the following properties:

connectionId – If your Iceberg tables are stored in buckets that require VPC authorization, you can set your connection properties here
icebergTables – This is an array of icebergPaths strings, each indicating the folder in which the metadata for an Iceberg table resides

See the following code:

{
    "IcebergTarget": {
        "connectionId": "iceberg-connection-123",
        "icebergMetaDataPaths": [
            "s3://bucketA/",
            "s3://bucketB/",
            "s3://bucket3/financedb/financetable/"
        ],
        "exclusions": ["departments/**", "employees/folder/**"],
        "maximumDepth": 5
    }
}

Choose Next.
For Existing IAM role, enter the crawler role created by the stack.
Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
Choose Next.
Under Set output and scheduling, specify the target database as icebergcrawlerblogdb.
Choose Next.
Choose Create crawler.
Run the crawler.
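The console configuration above can be approximated as a CreateCrawler request. In this sketch the field names (IcebergTargets, Paths, Exclusions, ConnectionName, MaximumTraversalDepth, LakeFormationConfiguration) follow the AWS Glue API as we understand it, and they differ from the illustrative property names in the JSON shown earlier, so check them against the current API reference. The role ARN, bucket name, and the function build_iceberg_crawler_request are placeholders for illustration.

```python
import json


def build_iceberg_crawler_request(name, role, database, paths,
                                  max_depth=None, exclusions=None,
                                  connection_name=None):
    """Build a crawler request with a single Iceberg target."""
    target = {"Paths": list(paths)}
    if max_depth is not None:
        target["MaximumTraversalDepth"] = max_depth
    if exclusions:
        target["Exclusions"] = list(exclusions)
    if connection_name:
        target["ConnectionName"] = connection_name
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"IcebergTargets": [target]},
        # Mirrors "Use Lake Formation credentials for crawling S3 data source".
        "LakeFormationConfiguration": {"UseLakeFormationCredentials": True},
    }


crawler_request = build_iceberg_crawler_request(
    name="icebergcrawler",
    role="arn:aws:iam::111122223333:role/AWSGlueServiceRole",  # placeholder
    database="icebergcrawlerblogdb",
    paths=["s3://example-datalake-bucket/icebergcrawlerblogdb.db/"],
)
print(json.dumps(crawler_request, indent=2))
# With boto3 this would likely be sent as:
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_request)
#   glue.start_crawler(Name="icebergcrawler")
```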

During each crawl, for each icebergTable path provided, the crawler calls the Amazon S3 List API to find the most recent metadata file under that Iceberg table metadata folder and updates the metadata_location parameter to the latest manifest file.
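To make the idea of picking the most recent metadata file concrete, here is one plausible way to do it from a list of S3 object keys. This is not the crawler's actual logic, which is not described in this post. It assumes metadata files are named with a zero-padded version number followed by a hyphen, such as 00002-<uuid>.metadata.json; tables written with other naming schemes would need a different rule. The function name and sample keys are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme: <digits>-<anything>.metadata.json
_METADATA_RE = re.compile(r"(?:^|/)(\d+)-[^/]*\.metadata\.json$")


def latest_metadata_key(keys: Iterable[str]) -> Optional[str]:
    """Return the key with the highest metadata version, or None."""
    best_version = -1
    best_key = None
    for key in keys:
        match = _METADATA_RE.search(key)
        if match is None:
            continue  # Not a metadata file (for example, a manifest).
        version = int(match.group(1))
        if version > best_version:
            best_version, best_key = version, key
    return best_key


# Hypothetical listing of one table's metadata folder.
sample_keys = [
    "icebergcrawlerblogdb.db/product/metadata/00000-aaa.metadata.json",
    "icebergcrawlerblogdb.db/product/metadata/00002-ccc.metadata.json",
    "icebergcrawlerblogdb.db/product/metadata/00001-bbb.metadata.json",
    "icebergcrawlerblogdb.db/product/metadata/snap-123.avro",
]
newest = latest_metadata_key(sample_keys)
```

Comparing the parsed integer rather than the raw string avoids surprises if the version number ever outgrows its zero padding.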

The following screenshot shows the details after a successful run.

The crawler was able to crawl the S3 data source and successfully populate the schema for Iceberg data in the Data Catalog.

Now you can start using the Data Catalog as your primary metastore and create new Iceberg tables directly in the Data Catalog or using the CreateTable API.

Create a new Iceberg table

To create an Iceberg table in the Data Catalog using the console, complete the steps in this section. Alternatively, you can use a CloudFormation template to create an Iceberg table using the following code:

Type: AWS::Glue::Table
Properties:
  CatalogId: "<account_id>"
  DatabaseName: "icebergcrawlerblogdb"
  TableInput:
    Name: "product_details"
    StorageDescriptor:
      Columns:
        - Name: "product_id"
          Type: "string"
        - Name: "manufacture_name"
          Type: "string"
        - Name: "product_rating"
          Type: "int"
      Location: "s3://<datalakebucket>/icebergcrawlerblogdb.db/"
    TableType: "EXTERNAL_TABLE"
  OpenTableFormatInput:
    IcebergInput:
      MetadataOperation: "CREATE"
      Version: "2"
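The same table definition can be sketched as a CreateTable API request. The example below only builds the request dictionary, with parameter names taken from the AWS Glue CreateTable API as we understand it; the bucket name is a placeholder and build_iceberg_create_table_request is a hypothetical helper.

```python
def build_iceberg_create_table_request(database, table, location, columns):
    """Build a request that creates an empty Iceberg table.

    `columns` is a sequence of (name, type) pairs.
    """
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [{"Name": n, "Type": t} for n, t in columns],
                "Location": location,
            },
        },
        # Asks the service to generate the Iceberg metadata.json,
        # so no separate PutObject call is needed.
        "OpenTableFormatInput": {
            "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
        },
    }


create_table_request = build_iceberg_create_table_request(
    database="icebergcrawlerblogdb",
    table="product_details",
    # Placeholder bucket name.
    location="s3://example-datalake-bucket/icebergcrawlerblogdb.db/",
    columns=[
        ("product_id", "string"),
        ("manufacture_name", "string"),
        ("product_rating", "int"),
    ],
)
# With boto3 this would likely be sent as:
#   boto3.client("glue").create_table(**create_table_request)
```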

Grant the IAM role access to the data location

First, grant the IAM role access to the data location:

On the Lake Formation console, in the navigation pane, choose Data locations.
Choose Grant.
Select Admin IAM role for IAM users and roles.
For Storage location, enter the data lake bucket path.
Choose Grant.

Create the Iceberg table

Complete the following steps to create the Iceberg table:

On the Lake Formation console, in the navigation pane, choose Tables.
Choose Create table.
For Name, enter product_details.
Choose icebergcrawlerblogdb for Database.
Select Apache Iceberg table for Table format.
Provide the path <datalakebucket>/icebergcrawlerblogdb.db/ for Table location.

Provide the following schema and choose Add schema:

[
     {
         "Name": "product_id",
         "Type": "string"
     },
     {
         "Name": "manufacture_name",
         "Type": "string"
     },
     {
         "Name": "product_rating",
         "Type": "int"
     }
 ]

Choose Submit to create the table.

Add a record to the new Iceberg table

Complete the following steps to add a record to the Iceberg table:

On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
Choose Save.

Run the following query to add a record to the table:

insert into icebergcrawlerblogdb.product_details values('00001','ABC Firm',10)
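If you prefer to submit the statement programmatically rather than through the query editor, the request could be assembled along these lines. This is only a sketch: the parameter names follow the Athena StartQueryExecution API as we understand it, the results bucket is a placeholder for the AthenaBucketName value, and build_query_request is a hypothetical helper.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build a request that runs one SQL statement in Athena."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Where Athena writes query results.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


insert_request = build_query_request(
    sql=(
        "insert into icebergcrawlerblogdb.product_details "
        "values('00001','ABC Firm',10)"
    ),
    database="icebergcrawlerblogdb",
    output_location="s3://example-athena-results-bucket/",  # placeholder
)
# With boto3 this would likely be sent as:
#   boto3.client("athena").start_query_execution(**insert_request)
```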

Configure Lake Formation permissions on the Iceberg table in the Data Catalog

Athena supports Lake Formation permissions on Iceberg tables, so for this post, we show you how to set up fine-grained access on the tables and query them using Athena.

Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalystRole-IcebergBlog IAM role via the Lake Formation console.

Grant the role access to the database and describe permissions

To grant the LFBusinessAnalystRole-IcebergBlog IAM role access to the database with describe permissions, complete the following steps:

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases.
Select Describe for Database permissions.
Choose Grant to apply the permissions.

Grant column access to the role

Next, grant column access to the LFBusinessAnalystRole-IcebergBlog IAM role:

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product for Tables.
Choose Select for Table permissions.
Under Data permissions, select Column-based access.
Select Include columns and choose product_name and price.
Choose Grant to apply the permissions.

Grant table access to the role

Finally, grant table access to the LFBusinessAnalystRole-IcebergBlog IAM role:

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product_details for Tables.
Choose Select and Describe for Table permissions.
Choose Grant to apply the permissions.
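Taken together, the three analyst grants above (Describe on the database, column-based Select on product, and Select plus Describe on product_details) could be written as Lake Formation GrantPermissions requests roughly as follows. This is a sketch under assumptions: the role ARN is a placeholder, the included columns are assumed to be product_name and price, and build_analyst_grants is a hypothetical helper.

```python
def build_analyst_grants(role_arn: str,
                         database: str = "icebergcrawlerblogdb") -> list:
    """Return the database, column-level, and table-level grants."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
        },
        {
            # Column-based access: only the listed columns are readable.
            "Principal": principal,
            "Resource": {
                "TableWithColumns": {
                    "DatabaseName": database,
                    "Name": "product",
                    # Assumed column names.
                    "ColumnNames": ["product_name", "price"],
                }
            },
            "Permissions": ["SELECT"],
        },
        {
            # Full-table access on the newly created table.
            "Principal": principal,
            "Resource": {
                "Table": {"DatabaseName": database, "Name": "product_details"}
            },
            "Permissions": ["SELECT", "DESCRIBE"],
        },
    ]


analyst_grants = build_analyst_grants(
    "arn:aws:iam::111122223333:role/LFBusinessAnalystRole-IcebergBlog"
)
# With boto3 each request would likely be sent as:
#   lf = boto3.client("lakeformation")
#   for grant in analyst_grants:
#       lf.grant_permissions(**grant)
```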

Verify the tables using Athena

To verify the tables using Athena, switch to the LFBusinessAnalystRole-IcebergBlog role and complete the following steps:

On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
Choose Save.
Run the queries on product and product_details to validate access.

The following screenshot shows column permissions on product.

The following screenshot shows table permissions on product_details.

We’ve successfully crawled the Iceberg dataset created from Hive Metastore with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. We registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. We granted Lake Formation permissions on the database and table to the analyst user and validated access to the data using Athena.

Clean up

To avoid unwanted charges to your AWS account, delete the AWS resources:

Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
Delete the CloudFormation stack you created.

Conclusion

With the support for Iceberg crawlers, you can quickly move to using the AWS Glue Data Catalog as your primary Iceberg table catalog. You can automatically register Iceberg tables into the Data Catalog by running an AWS Glue crawler, which doesn’t require any DDL or manual schema definition. You can start building your serverless transactional data lake on AWS using the AWS Glue crawler, create a new table using the Data Catalog, and use Lake Formation fine-grained access controls for querying Iceberg tables with Athena.

Refer to Working with other AWS services for Lake Formation support for Iceberg tables across various AWS analytical services.

Special thanks to everyone who contributed to this crawler and CreateTable feature launch: Theo Xu, Kyle Duong, Anshuman Sharma, Atreya Srivathsan, Eric Wu, Jack Ye, Himani Desai, Masoud Shahamiri and Sachet Saurabh.

If you have questions or suggestions, submit them in the comments section.

About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Mahesh Mishra is a Principal Product Manager with the AWS Lake Formation team. He works with many of AWS’s largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including strong support for transactional data lakes.