Query your Apache Hive metastore with AWS Lake Formation permissions

July 21, 2023


Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. There are two key components to Apache Hive: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive metastore to retrieve metadata to run queries. The Hive metastore can be hosted on an Apache Hadoop cluster or can be backed by a relational database that is external to a Hadoop cluster. Although the Hive metastore stores the metadata of tables, the actual data of the table could be residing on Amazon Simple Storage Service (Amazon S3), the Hadoop Distributed File System (HDFS) of the Hadoop cluster, or any other Hive-supported data stores.

Because Apache Hive was built on top of Apache Hadoop, many organizations have been using the software from the time they have been using Hadoop for big data processing. Also, the Hive metastore provides flexible integration with many other open-source big data software like Apache HBase, Apache Spark, Presto, and Apache Impala. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore. A metastore is a critical part of a data lake, and having this information available, wherever it resides, is important. However, many AWS analytics services don’t integrate natively with the Hive metastore, and therefore, organizations have had to migrate their data to the AWS Glue Data Catalog to use these services.

AWS Lake Formation has launched support for managing user access to Apache Hive metastores through a federated AWS Glue connection. Previously, you could use Lake Formation to manage user permissions on AWS Glue Data Catalog resources only. With the Hive metastore connection from AWS Glue, you can connect to a database in a Hive metastore external to the Data Catalog, map it to a federated database in the Data Catalog, apply Lake Formation permissions on the Hive database and tables, share them with other AWS accounts, and query them using services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL (extract, transform, and load). For additional details on how the Hive metastore integration with Lake Formation works, refer to Managing permissions on datasets that use external metastores.

Use cases for Hive metastore integration with the Data Catalog include the following:

An external Apache Hive metastore used for legacy big data workloads like on-premises Hadoop clusters with data in Amazon S3
Transient Amazon EMR workloads with underlying data in Amazon S3 and the Hive metastore on Amazon Relational Database Service (Amazon RDS) clusters

In this post, we demonstrate how to apply Lake Formation permissions on a Hive metastore database and tables and query them using Athena. We illustrate a cross-account sharing use case, where a Lake Formation steward in producer account A shares a federated Hive database and tables using LF-Tags to consumer account B.

Solution overview

Producer account A hosts an Apache Hive metastore in an EMR cluster, with underlying data in Amazon S3. We launch the AWS Glue Hive metastore connector from AWS Serverless Application Repository in account A and create the Hive metastore connection in account A’s Data Catalog. After we create the HMS connection, we create a database in account A’s Data Catalog (called the federated database) and map it to a database in the Hive metastore using the connection. The tables from the Hive database are then accessible to the Lake Formation admin in account A, just like any other tables in the Data Catalog. The admin continues to set up Lake Formation tag-based access control (LF-TBAC) on the federated Hive database and share it to account B.

The data lake users in account B will access the Hive database and tables of account A, just like querying any other shared Data Catalog resource using Lake Formation permissions.

The following diagram illustrates this architecture.

The solution consists of steps in both accounts. In account A, perform the following steps:

Create an S3 bucket to host the sample data.
Launch an EMR 6.10 cluster with Hive. Download the sample data to the S3 bucket. Create a database and external tables, pointing to the downloaded sample data, in its Hive metastore.
Deploy the application GlueDataCatalogFederation-HiveMetastore from AWS Serverless Application Repository and configure it to use the Amazon EMR Hive metastore. This will create an AWS Glue connection to the Hive metastore that shows up on the Lake Formation console.
Using the Hive metastore connection, create a federated database in the AWS Glue Data Catalog.
Create LF-Tags and associate them to the federated database.
Grant permissions on the LF-Tags to account B. Grant database and table permissions to account B using LF-Tag expressions.

In account B, perform the following steps:

As a data lake admin, review and accept the AWS Resource Access Manager (AWS RAM) invites for the shares from account A.
The data lake admin then sees the shared database and tables. The admin creates a resource link to the database and grants fine-grained permissions to a data analyst in this account.
Both the data lake admin and the data analyst query the Hive tables that are available to them using Athena.

Account A has the following personas:

hmsblog-producersteward – Manages the data lake in the producer account A

Account B has the following personas:

hmsblog-consumersteward – Manages the data lake in the consumer account B
hmsblog-analyst – A data analyst who needs access to selected Hive tables

Prerequisites

To follow the tutorial in this post, you need the following:

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin. Complete the following steps:

Sign in to the AWS Management Console and choose the us-west-2 Region.
On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
Choose Manage Administrators in the Data lake administrators section.
Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
Choose Launch Stack to deploy the CloudFormation template:
Choose Next.
Provide a name for the stack and choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation takes about 10 minutes. The stack establishes the producer account A setup as follows:

Creates an S3 data lake bucket
Registers the data lake bucket to Lake Formation with the Enable catalog federation flag
Launches an EMR 6.10 cluster with Hive and runs two steps in Amazon EMR:
- Downloads the sample data from a public S3 bucket to the newly created bucket
- Creates a Hive database and four external tables for the data in Amazon S3, using an HQL script
Creates an IAM user (hmsblog-producersteward) and sets this user as Lake Formation administrator
Creates LF-Tags (LFHiveBlogCampaignRole = Admin, Analyst)

Review CloudFormation stack output in account A

To review the output of your CloudFormation stack, complete the following steps:

Log in to the console as the IAM admin user you used earlier to run the CloudFormation template.
Open the CloudFormation console in another browser tab.
Review and note down the stack Outputs tab details.
Choose the link under Value for ProducerStewardCredentials.

This will open the AWS Secrets Manager console.

Choose Retrieve value and note down the credentials of hmsblog-producersteward.

Set up a federated AWS Glue connection in account A

To set up a federated AWS Glue connection, complete the following steps:

Open the AWS Serverless Application Repository console in another browser tab.
In the navigation pane, choose Available applications.
Select Show apps that create custom IAM roles or resource policies.
In the search bar, enter Glue.

This will list various applications.

Choose the application named GlueDataCatalogFederation-HiveMetastore.

This will open the AWS Lambda console configuration page for a Lambda function that runs the connector application code.

To configure the Lambda function, you need details of the EMR cluster launched by the CloudFormation stack.

On another tab of your browser, open the Amazon EMR console.
Navigate to the cluster launched for this post and note down the following details from the cluster details page:
1. Primary node public DNS
2. Subnet ID
3. Security group ID of the primary node
Back on the Lambda configuration page, under Review, configure, and deploy, in the Application settings section, provide the following details. Leave the rest as the default values.
1. For GlueConnectionName, enter hive-metastore-connection.
2. For HiveMetastoreURIs, enter thrift://<Primary-node-public-DNS-of-your-EMR>:9083. For example, thrift://ec2-54-70-203-146.us-west-2.compute.amazonaws.com:9083, where 9083 is the Hive metastore port in the EMR cluster.
3. For VPCSecurityGroupIds, enter the security group ID of the EMR primary node.
4. For VPCSubnetIds, enter the subnet ID of the EMR cluster.
Choose Deploy.

Wait for the Create Completed status of the Lambda application. You can review the details of the Lambda application on the Lambda console.
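The four application settings above are easy to mistype, especially the Thrift URI. The following is a minimal Python sketch that assembles them from the three EMR values noted earlier. The function name and the subnet and security group IDs are hypothetical placeholders; only the setting names, the connection name, the example DNS name, and port 9083 come from this post.

```python
def build_hms_connector_settings(primary_dns, subnet_id, security_group_id,
                                 connection_name="hive-metastore-connection",
                                 port=9083):
    """Return the application settings listed in the steps above as a dict.

    Hypothetical helper: it only formats values, it does not call AWS.
    """
    if not primary_dns or "://" in primary_dns:
        raise ValueError("pass the bare DNS name, without a scheme")
    return {
        "GlueConnectionName": connection_name,
        # The Hive metastore listens on the Thrift port of the primary node.
        "HiveMetastoreURIs": f"thrift://{primary_dns}:{port}",
        "VPCSecurityGroupIds": security_group_id,
        "VPCSubnetIds": subnet_id,
    }


settings = build_hms_connector_settings(
    "ec2-54-70-203-146.us-west-2.compute.amazonaws.com",
    subnet_id="subnet-0123456789abcdef0",       # placeholder value
    security_group_id="sg-0123456789abcdef0",   # placeholder value
)
for key, value in settings.items():
    print(f"{key} = {value}")
```

Printing the values before pasting them into the console makes it easier to spot a missing port or a stray scheme prefix.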

Open the Lake Formation console and in the navigation pane, choose Data sharing.

You should see hive-metastore-connection under Connections.

Choose it and review the details.
In the navigation pane, under Administrative roles and tasks, choose LF-Tags.

You should see the created LF-Tag LFHiveBlogCampaignRole with two values: Analyst and Admin.

Choose LF-Tag permissions and choose Grant.
Choose IAM users and roles and enter hmsblog-producersteward.
Under LF-Tags, choose Add LF-Tag.
Enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
Under Permissions, select Describe and Associate for LF-Tag permissions and Grantable permissions.
Choose Grant.

This gives LF-Tag permissions to the producer steward.

Log out as the IAM administrator user.

Grant Lake Formation permissions as producer steward

Complete the following steps:

Sign in to the console as hmsblog-producersteward, using the credentials from the CloudFormation stack Outputs tab that you noted down earlier.
On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
Under Database creators, choose Grant.
Add hmsblog-producersteward as a database creator.
In the navigation pane, choose Data sharing.
Under Connections, choose the hive-metastore-connection link.
On the Connection details page, choose Create database.
For Database name, enter federated_emrhivedb.

This is the federated database in the local AWS Glue Data Catalog that will point to a Hive metastore database. This is a one-to-one mapping of a database in the Data Catalog to a database in the external Hive metastore.

For Database identifier, enter the name of the database in the EMR Hive metastore that was created by the Hive SQL script. For this post, we use emrhms_salesdb.
Once created, select federated_emrhivedb and choose View tables.

This will fetch the database and table metadata from the Hive metastore on the EMR cluster and display the tables created by the Hive script.
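The one-to-one mapping between the federated database and the Hive database can also be written down as data. The sketch below builds a payload shaped like the `DatabaseInput` structure of the AWS Glue `CreateDatabase` API, which has a `FederatedDatabase` field with `Identifier` and `ConnectionName`. It is an assumption, not something this post states, that the console step corresponds to this API call and that the identifier is the plain Hive database name; check the Glue documentation before relying on it. The helper name is hypothetical.

```python
import json


def federated_database_input(name, hive_database, connection_name):
    """Build a DatabaseInput-style dict mapping one Data Catalog database
    to one database in the external Hive metastore (assumed shape)."""
    for label, value in (("name", name),
                         ("hive_database", hive_database),
                         ("connection_name", connection_name)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return {
        "Name": name,
        "FederatedDatabase": {
            # Assumption: the identifier is the Hive database name,
            # as entered for "Database identifier" in the console.
            "Identifier": hive_database,
            "ConnectionName": connection_name,
        },
    }


database_input = federated_database_input(
    name="federated_emrhivedb",
    hive_database="emrhms_salesdb",
    connection_name="hive-metastore-connection",
)
print(json.dumps(database_input, indent=2))

# With boto3 installed and credentials configured, the dict could be passed on:
#   boto3.client("glue").create_database(DatabaseInput=database_input)
```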

Now you associate the LF-Tags created by the CloudFormation script on this federated database and share it to the consumer account B using LF-Tag expressions.

In the navigation pane, choose Databases.
Select federated_emrhivedb and on the Actions menu, choose Edit LF-Tags.
Choose Assign new LF-Tag.
Enter LFHiveBlogCampaignRole for Assigned keys and Admin for Values, then choose Save.
In the navigation pane, choose Data lake permissions.
Choose Grant.
Select External accounts and enter the account ID of consumer account B.
Under LF-Tags or catalog resources, choose Resource matched by LF-Tags.
Choose Add LF-Tag.
Enter LFHiveBlogCampaignRole for Key and Admin for Values.
In the Database permissions section, select Describe for Database permissions and Grantable permissions.
In the Table permissions section, select Select and Describe for Table permissions and Grantable permissions.
Choose Grant.
In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.
Choose Grant.
Select External accounts and enter the account ID of consumer account B.
Under LF-Tags, enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
Under Permissions, select Describe and Associate under LF-Tag permissions and Grantable permissions.
Choose Grant and verify that the granted LF-Tag permissions display correctly.
In the navigation pane, choose Data lake permissions.

You can review and verify the permissions granted to account B.

In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.

You can review and verify the permissions granted to account B.

Log out of account A.
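The cross-account grants made in this section can be summarized as three requests: database permissions and table permissions matched by an LF-Tag expression, plus permissions on the LF-Tag itself. The sketch below builds dictionaries shaped like the arguments of the Lake Formation `GrantPermissions` API (`Principal`, `Resource`, `Permissions`, `PermissionsWithGrantOption`). The helper name and the account ID are hypothetical placeholders, and the mapping from the console steps to these requests is an assumption rather than something this post confirms.

```python
TAG_KEY = "LFHiveBlogCampaignRole"


def lf_tag_grant_requests(consumer_account_id):
    """Return three GrantPermissions-style request dicts for account B."""
    principal = {"DataLakePrincipalIdentifier": consumer_account_id}
    # Resources are matched by the expression LFHiveBlogCampaignRole = Admin.
    expression = [{"TagKey": TAG_KEY, "TagValues": ["Admin"]}]

    def tag_policy_grant(resource_type, permissions):
        return {
            "Principal": principal,
            "Resource": {"LFTagPolicy": {"ResourceType": resource_type,
                                         "Expression": expression}},
            "Permissions": permissions,
            "PermissionsWithGrantOption": permissions,
        }

    lf_tag_grant = {
        "Principal": principal,
        "Resource": {"LFTag": {"TagKey": TAG_KEY,
                               "TagValues": ["Analyst", "Admin"]}},
        "Permissions": ["DESCRIBE", "ASSOCIATE"],
        "PermissionsWithGrantOption": ["DESCRIBE", "ASSOCIATE"],
    }
    return [
        tag_policy_grant("DATABASE", ["DESCRIBE"]),
        tag_policy_grant("TABLE", ["SELECT", "DESCRIBE"]),
        lf_tag_grant,
    ]


requests = lf_tag_grant_requests("111122223333")  # placeholder account ID
for request in requests:
    print(sorted(request["Resource"]), request["Permissions"])

# With boto3 installed, each dict could be passed as keyword arguments:
#   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the grantable permissions identical to the granted ones mirrors the console steps, where both columns are selected so that the consumer steward can pass access on to the analyst.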

Lake Formation and AWS CloudFormation setup in account B

To keep the setup simple, we use an IAM admin registered as the data lake admin.

Sign in to the AWS Management Console of account B and select the us-west-2 Region.
On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
Choose Manage Administrators in the Data lake administrators section.
Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
Choose Launch Stack to deploy the CloudFormation template:
Choose Next.
Provide a name for the stack and choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation should take about 5 minutes. The stack establishes the consumer account B setup as follows:

Creates an IAM user hmsblog-consumersteward and sets this user as Lake Formation administrator
Creates another IAM user hmsblog-analyst
Creates an S3 data lake bucket to store Athena query results, with ListBucket and write object permissions to both hmsblog-consumersteward and hmsblog-analyst

Note down the stack output details.

Accept resource shares in account B

Sign in to the console as hmsblog-consumersteward and complete the following steps:

On the AWS CloudFormation console, navigate to the stack Outputs tab.
Choose the link for ConsumerStewardCredentials to be redirected to the Secrets Manager console.
On the Secrets Manager console, choose Retrieve secret value and copy the password for the consumer steward user.
Use the ConsoleIAMLoginURL value from the CloudFormation template Output to log in to account B with the consumer steward user name hmsblog-consumersteward and the password you copied from Secrets Manager.
Open the AWS RAM console in another browser tab.
In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from producer account A: one for a database-level share and one for a table-level share.

Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.

Open the Lake Formation console in another browser tab.
In the navigation pane, choose Databases.

You should see the shared database federated_emrhivedb from producer account A.

Choose the database and choose View tables to review the list of tables shared under that database.

You should see the four tables of the Hive database that is hosted on the EMR cluster in the producer account.
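Accepting the two pending invitations could also be scripted. The sketch below takes any client object that offers `get_resource_share_invitations` and `accept_resource_share_invitation`, the method names of the boto3 AWS RAM client, and accepts only the pending invitations from the producer account. The `FakeRamClient` class, the ARNs, and the account IDs are made-up stand-ins so the example runs without AWS access; pagination is not handled.

```python
def accept_pending_invitations(ram_client, sender_account_id):
    """Accept every PENDING invitation sent by the given account.

    Returns the ARNs of the invitations that were accepted.
    """
    accepted = []
    response = ram_client.get_resource_share_invitations()
    for invitation in response.get("resourceShareInvitations", []):
        if (invitation.get("status") == "PENDING"
                and invitation.get("senderAccountId") == sender_account_id):
            arn = invitation["resourceShareInvitationArn"]
            ram_client.accept_resource_share_invitation(
                resourceShareInvitationArn=arn)
            accepted.append(arn)
    return accepted


class FakeRamClient:
    """Stand-in for boto3.client("ram"), for illustration only."""

    def __init__(self, invitations):
        self.invitations = invitations

    def get_resource_share_invitations(self):
        return {"resourceShareInvitations": self.invitations}

    def accept_resource_share_invitation(self, resourceShareInvitationArn):
        for invitation in self.invitations:
            if invitation["resourceShareInvitationArn"] == resourceShareInvitationArn:
                invitation["status"] = "ACCEPTED"
        return {}


producer = "111122223333"  # placeholder ID for producer account A
client = FakeRamClient([
    {"resourceShareInvitationArn": "arn:example:db-share",
     "senderAccountId": producer, "status": "PENDING"},
    {"resourceShareInvitationArn": "arn:example:table-share",
     "senderAccountId": producer, "status": "PENDING"},
    {"resourceShareInvitationArn": "arn:example:unrelated",
     "senderAccountId": "444455556666", "status": "PENDING"},
])
accepted = accept_pending_invitations(client, producer)
print(accepted)
```

Filtering on the sender account keeps the script from accepting unrelated shares that happen to be pending in the same account.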

Grant permissions in account B

To grant permissions in account B, complete the following steps as hmsblog-consumersteward:

On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
Under Database creators, choose Grant.
For IAM users and roles, enter hmsblog-consumersteward.
For Catalog permissions, select Create database.
Choose Grant.

This allows hmsblog-consumersteward to create a database resource link.

In the navigation pane, choose Databases.
Select federated_emrhivedb and on the Actions menu, choose Create resource link.
Enter rl_federatedhivedb for Resource link name and choose Create.
Choose Databases in the navigation pane.
Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant.
Choose hmsblog-analyst for IAM users and roles.
Under Resource link permissions, select Describe, then choose Grant.
Select Databases in the navigation pane.
Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant on target.
Choose hmsblog-analyst for IAM users and roles.
Choose hms_productcategory and hms_supplier for Tables.
For Table permissions, select Select and Describe, then choose Grant.
In the navigation pane, choose Data lake permissions and review the permissions granted to hmsblog-analyst.

Query the Apache Hive database of the producer from the consumer Athena

Complete the following steps:

On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket.
Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
Choose Save.

hmsblog-consumersteward has access to all four tables under federated_emrhivedb from the producer account.

In the Athena query editor, choose the database rl_federatedhivedb and run a query on any of the tables.

You were able to query an external Apache Hive metastore database of the producer account through the AWS Glue Data Catalog and Lake Formation permissions using Athena from the recipient consumer account.

Sign out of the console as hmsblog-consumersteward and sign back in as hmsblog-analyst.
Use the same method as explained earlier to get the login credentials from the CloudFormation stack Outputs tab.

hmsblog-analyst has Describe permissions on the resource link and access to two of the four Hive tables. You can verify that you see them on the Databases and Tables pages on the Lake Formation console.

On the Athena console, you now configure the Athena query results bucket, similar to how you configured it as hmsblog-consumersteward.

In the query editor, choose Edit settings.
Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
Choose Save.
In the Athena query editor, choose the database rl_federatedhivedb and run a query on the two tables.
Sign out of the console as hmsblog-analyst.

You were able to restrict sharing the external Apache Hive metastore tables using Lake Formation permissions from one account to another and query them using Athena. You can also query the Hive tables using Redshift Spectrum, Amazon EMR, and AWS Glue ETL from the consumer account.
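The queries run in this section can be expressed as request parameters shaped like those of the Athena `StartQueryExecution` API (`QueryString`, `QueryExecutionContext`, `ResultConfiguration`). The sketch below is illustrative only: the helper name, the account ID, and the `SELECT * ... LIMIT` query are assumptions, while the resource link name, the table name, and the results bucket naming pattern come from this post.

```python
def athena_query_request(account_id, table,
                         database="rl_federatedhivedb", limit=10):
    """Build StartQueryExecution-style parameters for a sample query."""
    # Reject anything that is not a simple identifier before it is
    # interpolated into the SQL text.
    for identifier in (database, table):
        if not identifier.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {identifier!r}")
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            # Bucket naming pattern used by the account B stack in this post.
            "OutputLocation": f"s3://hmsblog-athenaresults-{account_id}-us-west-2/",
        },
    }


request = athena_query_request("444455556666",  # placeholder account B ID
                               "hms_productcategory")
print(request["QueryString"])

# With boto3 installed and credentials for hmsblog-analyst configured:
#   boto3.client("athena").start_query_execution(**request)
```

Run as hmsblog-analyst, a request for a table other than hms_productcategory or hms_supplier would be expected to fail, because Lake Formation grants access to only those two tables.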

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

Clean up resources in account A

There are two CloudFormation stacks associated with producer account A. You need to delete the dependencies and the two stacks in the correct order.

Log in as the admin user to producer account A.
On the Lake Formation console, choose Data lake permissions in the navigation pane.
Choose Grant.
Grant Drop permissions to your role or user on federated_emrhivedb.
In the navigation pane, choose Databases.
Select federated_emrhivedb and on the Actions menu, choose Delete to delete the federated database that is associated with the Hive metastore connection.

This makes the AWS Glue connection’s CloudFormation stack ready to be deleted.

In the navigation pane, choose Administrative roles and tasks.
Under Database creators, select Revoke and remove hmsblog-producersteward permissions.
On the CloudFormation console, delete the stack named serverlessrepo-GlueDataCatalogFederation-HiveMetastore first.

This is the one created by your AWS SAM application for the Hive metastore connection. Wait for it to complete deletion.

Delete the CloudFormation stack that you created for the producer account setup.

This deletes the S3 buckets, EMR cluster, custom IAM roles and policies, and the LF-Tags, database, tables, and permissions.

Clean up resources in account B

Complete the following steps in account B:

Revoke permission to hmsblog-consumersteward as database creator, similar to the steps in the previous section.
Delete the CloudFormation stack that you created for the consumer account setup.

This deletes the IAM users, S3 bucket, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we showed you how to launch the AWS Glue Hive metastore federation application from AWS Serverless Application Repository, configure it with a Hive metastore running on an EMR cluster, create a federated database in the AWS Glue Data Catalog, and map it to a Hive metastore database on the EMR cluster. We illustrated how to share and access the Hive database tables for a cross-account scenario and the benefits of using Lake Formation to restrict permissions.

All Lake Formation features such as sharing to IAM principals within the same account, sharing to external accounts, sharing to external account IAM principals, restricting column access, and setting data filters work on federated Hive databases and tables. You can use any of the AWS analytics services that are integrated with Lake Formation, such as Athena, Redshift Spectrum, AWS Glue ETL, and Amazon EMR, to query the federated Hive database and tables.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore Lake Formation permissions on your Hive database and tables. Please comment on this post or talk to your AWS Account Team to share feedback on this feature.

For more details, see Managing permissions on datasets that use external metastores.

About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.
