Introduction
For more than a decade now, the Hive table format has been a ubiquitous presence in the big data ecosystem, managing petabytes of data with remarkable efficiency and scale. But as data volumes, data variety, and data usage grow, users face many challenges when using Hive tables because of its antiquated directory-based table format. Some of the common issues include constrained schema evolution, static partitioning of data, and long planning times caused by S3 directory listings.
Apache Iceberg is a modern table format that not only addresses these problems but also offers additional features like time travel, partition evolution, table versioning, schema evolution, strong consistency guarantees, object store file layout (the ability to distribute the files of a single logical partition across many prefixes to avoid object store throttling), hidden partitioning (users don’t need to be intimately aware of partitioning), and more. Therefore, the Apache Iceberg table format is poised to replace the traditional Hive table format in the coming years.
However, as there are already 25 million terabytes of data stored in the Hive table format, migrating existing tables in the Hive table format to the Iceberg table format is necessary for performance and cost. Depending on the size and usage patterns of the data, several different strategies can be pursued to achieve a successful migration. In this blog, I will describe a few strategies one could undertake for various use cases. While these instructions are written for Cloudera Data Platform (CDP), Cloudera Data Engineering (CDE), and Cloudera Data Warehouse (CDW), one can extrapolate them easily to other services and other use cases as well.
There are a few scenarios that one might encounter. One or more of these use cases might fit your workload, and you might be able to mix and match the potential solutions provided to suit your needs. They are meant to be a general guide. In all the use cases we are trying to migrate a table named “events.”
Approach 1
You have the ability to stop your clients from writing to the respective Hive table for the duration of your migration. This is ideal because it means you don’t have to change any of your client code. Sometimes this is the only choice available if you have hundreds of clients that can potentially write to a table. It could be much easier to simply stop all those jobs rather than allow them to continue during the migration process.
In-place table migration
Solution 1A: using Spark’s migrate procedure
Iceberg’s Spark extensions provide a built-in procedure called “migrate” to migrate an existing table from the Hive table format to the Iceberg table format. They also provide a “snapshot” procedure that creates an Iceberg table with a different name but the same underlying data. You can first create a snapshot table, run sanity checks on it, and confirm that everything is in order.
Once you are satisfied, you can drop the snapshot table and proceed with the migration using the migrate procedure. Keep in mind that the migrate procedure creates a backup table named “events__BACKUP__.” As of this writing, the “__BACKUP__” suffix is hardcoded. There is an effort underway to let the user pass a custom backup suffix in the future.
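For reference, here is a minimal sketch of what the snapshot and migrate calls can look like in Spark SQL (with the Iceberg extensions enabled). The catalog name (spark_catalog), database name (db), and snapshot table name (events_snapshot) are assumptions; adjust them to your environment.

```sql
-- Create a temporary Iceberg table over the same underlying data for sanity checking
CALL spark_catalog.system.snapshot('db.events', 'db.events_snapshot');

-- After verifying the snapshot, drop it and migrate the original table in place;
-- migrate renames the original Hive table to db.events__BACKUP__
DROP TABLE db.events_snapshot;
CALL spark_catalog.system.migrate('db.events');
```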
Keep in mind that both the migrate and snapshot procedures do not modify the underlying data: they perform an in-place migration. They simply read the underlying data (not even a full read; they just read the Parquet headers) and create the corresponding Iceberg metadata files. Since the underlying data files are not changed, you may not be able to take full advantage of the benefits offered by Iceberg right away. You can optimize your table now or at a later stage using the “rewrite_data_files” procedure, which will be discussed in a later blog. Now let’s discuss the pros and cons of this approach.
PROS:
- Can do the migration in stages: first do the migration and then carry out the optimization later using the rewrite_data_files procedure (blog to follow).
- Relatively fast, as the underlying data files are kept in place. You don’t have to worry about creating a temporary table and swapping it later; the procedure does that for you atomically once the migration is finished.
- Since a Hive backup is available, one can revert the change entirely by dropping the newly created Iceberg table and renaming the Hive backup table (__BACKUP__) to its original name, as sketched below.
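A rough sketch of that rollback, assuming the database name db; verify in a test environment that dropping the Iceberg table does not purge the underlying data files in your setup before relying on this.

```sql
-- Remove the newly created Iceberg table (confirm this does not purge data in your environment)
DROP TABLE db.events;

-- Restore the Hive backup created by the migrate procedure to its original name
ALTER TABLE db.events__BACKUP__ RENAME TO db.events;
```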
CONS:
- If the underlying data is not optimized, or has a lot of small files, those disadvantages can be carried forward to the Iceberg table as well. Query engines (Impala, Hive, Spark) can mitigate some of these problems by using Iceberg’s metadata files, but the underlying data file locations will not change. So if the file path prefixes are shared across many files, we may continue to suffer from S3 throttling (see Object Store File Layout for how to configure it properly; a sketch follows below).
- In CDP we only support migrating external tables; Hive managed tables cannot be migrated. Also, the underlying file format for the table has to be one of avro, orc, or parquet.
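If object store throttling is a concern, Iceberg’s object store file layout can be enabled through a table property so that newly written files are distributed across different prefixes (it does not relocate files that already exist). A minimal sketch, assuming Spark SQL and the table names used above:

```sql
-- Enable hash-prefixed file paths for files written from this point on
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.object-storage.enabled' = 'true'
);
```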
Note: there is also a SparkAction in the Java API.
Solution 1B: using Hive’s “ALTER TABLE” command
Cloudera implemented an easy way to do the migration in Hive. All you have to do is alter the table properties to set the storage handler to “HiveIcebergStorageHandler.”
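A sketch of that Hive command, based on Cloudera’s documentation at the time of writing; double-check the exact property value against your CDP version:

```sql
ALTER TABLE events
SET TBLPROPERTIES ('storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
```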
The pros and cons of this approach are essentially the same as those of Solution 1A. The migration is done in place and the underlying data files are not modified. Hive creates Iceberg’s metadata files for the same exact table.
Shadow table migration
Solution 1C: using the CTAS statement
This solution is the most generic, and it can potentially be used with any processing engine (Spark/Hive/Impala) that supports SQL-like syntax: create a new Iceberg table and populate it with a CREATE TABLE … AS SELECT from the source table.
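A minimal CTAS sketch in Spark SQL (assuming an Iceberg-enabled catalog); the target table name iceberg_events and the partition column event_date are illustrative, and Hive and Impala use STORED BY ICEBERG / STORED AS ICEBERG instead of the USING clause:

```sql
-- Create a new Iceberg table and copy the data into it
CREATE TABLE db.iceberg_events
USING iceberg
PARTITIONED BY (event_date)   -- hypothetical partition column; keep your existing scheme
AS SELECT * FROM db.events;
```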
You can run basic sanity checks on the data to see if the newly created table is sound.
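Sanity checks could be as simple as comparing row counts and spot-checking a partition; event_date below is a hypothetical column:

```sql
-- Row counts should match between the old and the new table
SELECT COUNT(*) FROM db.events;
SELECT COUNT(*) FROM db.iceberg_events;

-- Spot-check a partition (event_date is a hypothetical partition column)
SELECT * FROM db.iceberg_events WHERE event_date = '2023-01-01' LIMIT 10;
```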
Once you are satisfied with your sanity checking, you can rename your “events” table to a “backup_events” table and then rename your “iceberg_events” to “events.” Keep in mind that in some cases the rename operation can trigger a rename of the underlying data directory. If that is the case and your underlying data store is an object store like S3, that will trigger a full copy of your data and could be very expensive. If the location clause is specified while creating the Iceberg table, then renaming the Iceberg table will not cause the underlying data files to move; the name will change only in the Hive metastore. The same applies to Hive tables as well. If your original Hive table was not created with the location clause specified, then the rename to backup will trigger a directory rename. In that case, if your filesystem is object store based, it might be best to drop it altogether. Given the nuances around table renames, it is important to test with dummy tables in your system and check that you are seeing your desired behavior before you perform these operations on critical tables.
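The swap itself is just a pair of renames. A sketch, assuming the tables live in a database named db and that, as discussed above, the renames only update the metastore:

```sql
-- Move the original Hive table out of the way, then promote the Iceberg table
ALTER TABLE db.events RENAME TO db.backup_events;
ALTER TABLE db.iceberg_events RENAME TO db.events;
```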
You can drop your “backup_events” table when you wish.
Your clients can now resume their read/write operations on “events,” and they don’t even need to know that the underlying table format has changed. Now let’s discuss the pros and cons of this approach.
PROS:
- The newly created data is well optimized for Iceberg and will be well distributed.
- Any existing small files will be coalesced automatically.
- The procedure is common across all the engines.
- The newly created data files can take advantage of Iceberg’s Object Store File Layout, so that the file paths have different prefixes, thus reducing object store throttling. Please see the linked documentation for how to take advantage of this feature.
- This approach is not necessarily limited to migrating a Hive table. One could use the same approach to migrate tables from other formats such as Delta, Hudi, etc.
- You can change the data format, say from “orc” to “parquet.”
CONS:
- This triggers a full read and write of the data, which can be an expensive operation.
- Your entire dataset will be duplicated. You need to have sufficient storage space available. This shouldn’t be a problem in a public cloud backed by an object store.
Approach 2
You don’t have the luxury of long downtime to do your migration. You want to let your clients or jobs continue writing data to the table. This requires some planning and testing, but it is possible with some caveats. Here is one way you can do it with Spark; you can potentially extrapolate the ideas presented to other engines.
- Create an Iceberg table with the desired properties. Keep in mind that you have to keep the partitioning scheme the same for this to work correctly.
- Modify your clients or jobs to write to both tables, so that they write to the “iceberg_events” table and the “events” table, but for now they only read from the “events” table. Capture the timestamp from which your clients started writing to both tables.
- Programmatically list all the files in the Hive table that were inserted before the timestamp you captured in step 2.
- Add all the files captured in step 3 to the Iceberg table using the “add_files” procedure, which simply registers the files with your Iceberg table without rewriting them (see the sketch after this list). You also might be able to take advantage of your table’s partitioning scheme to skip step 3 entirely and add files to your newly created Iceberg table using the “add_files” procedure.
- If you don’t have access to Spark, you can simply read each of the files listed in step 3 and insert them into “iceberg_events.”
- Once you successfully add all the data files, you can stop your clients from reading/writing to the old “events” table and switch them to the new “iceberg_events.”
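A sketch of the backfill step with the add_files procedure in Spark SQL. The catalog, database, and table names are illustrative, and the partition_filter variant shows how you could restrict the call to partitions written before your cutover point (event_date is a hypothetical partition column):

```sql
-- Register the existing Hive data files with the Iceberg table without rewriting them
CALL spark_catalog.system.add_files(
  table        => 'db.iceberg_events',
  source_table => 'db.events'
);

-- Or restrict the backfill to one partition at a time
CALL spark_catalog.system.add_files(
  table            => 'db.iceberg_events',
  source_table     => 'db.events',
  partition_filter => map('event_date', '2023-01-01')
);
```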
Some caveats and notes
- In step 2, you’ll be able to management which tables your purchasers/jobs should write to utilizing some flag that may be fetched from exterior sources like setting variables, some database (like Redis) pointer, and properties recordsdata, and so on. That method you solely have to switch your shopper/job code as soon as and don’t need to hold modifying it for every step.
- In step 2, you might be capturing a timestamp that shall be used to calculate recordsdata wanted for step 3; this could possibly be affected by clock drift in your nodes. So that you may wish to sync all of your nodes earlier than you begin the migration course of.
- In case your desk is partitioned by date and time (as most actual world information is partitioned), as in all new information coming will go to a brand new partition on a regular basis, then you definitely may program your purchasers to begin writing to each the tables from a selected date and time. That method you simply have to fret about including the information from the previous desk (“occasions”) to the brand new desk (“Iceberg_events”) from that date and time, and you’ll make the most of your partitioning scheme and skip step 3 totally. That is the method that needs to be used at any time when potential.
Conclusion
Any large migration is tough and has to be thought through carefully. Thankfully, as discussed above, there are multiple strategies at our disposal to do it effectively depending on your use case. If you have the ability to stop all your jobs while the migration is happening, it is relatively straightforward, but if you want to migrate with minimal to no downtime then that requires some planning and careful thinking about your data layout. You can use a combination of the above approaches to best suit your needs.
To study extra:
- For more on table migration, please refer to the respective online documentation for Cloudera Data Warehouse (CDW) and Cloudera Data Engineering (CDE).
- Watch our webinar Supercharge Your Analytics with Open Data Lakehouse Powered by Apache Iceberg. It includes a live demo recording of Iceberg capabilities.
- Try Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML) by signing up for a 60-day trial, or test drive CDP. You can also schedule a demo by clicking here, or if you are interested in chatting about Apache Iceberg in CDP, contact your account team.