As the use of ChatGPT becomes more prevalent, I frequently encounter customers and data users citing ChatGPT’s responses in their discussions. I love the enthusiasm surrounding ChatGPT and the eagerness to learn about modern data architectures such as data lakehouses, data meshes, and data fabrics. ChatGPT is an excellent resource for gaining high-level insights and building awareness of any technology. However, caution is necessary when delving deeper into a particular technology. ChatGPT is trained on historical data and, depending on how a question is phrased, it may offer inaccurate or misleading information.
I took the free version of ChatGPT on a test drive (in March 2023) and asked some simple questions about the data lakehouse and its components. Here are some responses that weren’t exactly right, along with our explanation of where and why they went wrong. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself while counting towards my 2023 contribution to social good.

I thought this was a fairly comprehensive list. The one key component that is missing is a common, shared table format that can be used by all analytic services accessing the lakehouse data. When implementing a data lakehouse, the table format is a critical piece because it acts as an abstraction layer, making it easy for any engine or application to access all the structured and unstructured data in the lakehouse simultaneously. Using a schema or metadata definition, the table format provides the structure that a data lake lacks, bringing it closer to a data warehouse. Some of the popular table formats are Apache Iceberg, Delta Lake, Hudi, and Hive ACID.
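To make the abstraction-layer idea a little more concrete, here is a minimal Python sketch of what a table format contributes: a schema plus a versioned list of data files that every engine reads through. It is a toy model only. The names `TableMetadata`, `Snapshot`, `sql_engine_plan`, and `ml_engine_plan`, as well as the file paths, are hypothetical and are not the API of Apache Iceberg, Delta Lake, or any other real table format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: the data files visible at that point."""
    snapshot_id: int
    data_files: tuple


@dataclass
class TableMetadata:
    """Toy stand-in for a table format's metadata layer (hypothetical, not a real API)."""
    schema: dict  # column name -> type name
    snapshots: list = field(default_factory=list)

    def commit(self, new_files):
        """Add files by recording a new snapshot; earlier snapshots stay readable."""
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + tuple(new_files))
        self.snapshots.append(snapshot)
        return snapshot

    def current_files(self):
        """Files that make up the latest version of the table."""
        return self.snapshots[-1].data_files if self.snapshots else ()


def sql_engine_plan(table):
    """A hypothetical SQL engine asks the metadata which files to scan."""
    return sorted(table.current_files())


def ml_engine_plan(table):
    """A hypothetical ML job resolves the same table through the same metadata."""
    return sorted(table.current_files())


# Made-up paths: the files can live in different stores behind one table definition.
orders = TableMetadata(schema={"order_id": "long", "amount": "double"})
orders.commit(["s3://bucket/orders/part-0.parquet"])
orders.commit(["hdfs://namenode/orders/part-1.parquet"])
```

Because both hypothetical engines resolve the table through the same metadata, they agree on the schema and on which files belong to the current version, without either one listing storage directories on its own. Real table formats add much more on top of this, such as atomic commits and statistics for pruning.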
Also, the data lake layer is not limited to cloud object stores. Many companies still have massive amounts of data on premises, and data lakehouses are not limited to public clouds. They can be built on premises or as hybrid deployments leveraging private clouds, HDFS stores, or Apache Ozone.
At Cloudera, we also provide machine learning as part of our lakehouse, so data scientists get easy access to reliable data in the data lakehouse to quickly launch new machine learning projects and build and deploy new models for advanced analytics.

I like how ChatGPT started this answer, but it quickly jumps into features and even gives an incorrect response on the feature comparison. Features are not the only way of deciding which is the better table format. The choice depends on compatibility, openness, versatility, and other factors that guarantee broader usage for diverse data users, guarantee security and governance, and future-proof your architecture.
Here is a high-level feature comparison chart if you want to go into the details of what is available on Delta Lake versus Apache Iceberg.


This response is a bit dangerous because of its incorrectness, and it demonstrates why I feel these tools are not ready for deeper analysis. At first glance it may look like a reasonable response, but its premise is wrong, which makes you doubt the entire response and other responses as well. Saying “Delta Lake is built on top of Apache Iceberg” is incorrect: the two are completely different, unrelated table formats, and one has nothing to do with the conception of the other. They were created by different organizations to solve common data problems.

I’m impressed that ChatGPT got this one right, although it made a few mistakes with our product names and missed a few that are critical for a lakehouse implementation.
CDP’s components that support a data lakehouse architecture include:
- Apache Iceberg table format, which is integrated into CDP to provide structure to the massive amounts of structured and unstructured data in your data lake.
- Data services, including a cloud-native data warehouse called CDW, a data engineering service called CDE, a data streaming service called data in motion, and a machine learning service called CML.
- Cloudera Shared Data Experience (SDX), which provides a unified data catalog with automatic data profilers, unified security, and unified governance over all your data in both the public and private cloud.

ChatGPT is a great tool for getting a high-level view of new technologies, but I’d say use it carefully, validate its responses, and use it only for the awareness stage of the buying cycle. As you go into the consideration or comparison stage, it’s not reliable yet.
Also, answers on ChatGPT keep updating, so hopefully it corrects itself before you read this blog.
To learn more about Cloudera’s lakehouse, visit the webpage, and if you are ready to get started, watch the Cloudera Now demo.
