
The Data Lake House Challengers - Hudi, Delta Lake and Iceberg.



Qualitatively speaking, all three are storage middle layers for a data lake, and their data management capabilities are built on a set of meta files. These meta files play a role similar to a database catalog, handling schema management, transaction management and data management.


Unlike in a database, these meta files are stored in the storage engine alongside the data files, where users can see them directly. This approach inherits the big data tradition of keeping data visible to users, but it also increases the risk of data being destroyed by accident. If a user accidentally deletes the meta directory, the table is destroyed and is very difficult to restore.


The meta files contain the table's schema information, so the system can track schema changes by itself and support schema evolution. The meta files also act as a transaction log (which requires the file system to support atomicity and consistency). Every change to the table generates a new meta file, which gives the system ACID guarantees and multi-version support and lets it provide access to historical versions. In these respects, the three are the same. Let's talk about the differences between the three.
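Before turning to the differences, here is a minimal Python sketch of the shared idea that a table is defined by a series of meta files. It is a toy illustration, not the on-disk format of Hudi, Iceberg or Delta: the `ToyTable` class, the `_meta` directory name and the JSON layout are all assumptions made up for this example.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table whose state is defined only by numbered meta files."""

    def __init__(self, root):
        # "_meta" is a hypothetical directory name chosen for this sketch.
        self.meta_dir = os.path.join(root, "_meta")
        os.makedirs(self.meta_dir, exist_ok=True)

    def _versions(self):
        # Meta files are named like 00000003.json; the number is the version.
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.meta_dir))

    def _path(self, version):
        return os.path.join(self.meta_dir, f"{version:08d}.json")

    def commit(self, schema, data_files):
        """Record a change by writing a new meta file, never by editing an old one."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        # Mode "x" fails if the file already exists, so two writers cannot
        # both claim the same version (this relies on the file system).
        with open(self._path(version), "x") as f:
            json.dump({"schema": schema, "data_files": data_files}, f)
        return version

    def snapshot(self, version=None):
        """Read the latest version, or an older one to access history."""
        if version is None:
            version = self._versions()[-1]
        with open(self._path(version)) as f:
            return json.load(f)


with tempfile.TemporaryDirectory() as root:
    table = ToyTable(root)
    v0 = table.commit(["id", "name"], ["part-0.parquet"])
    # Schema evolution: the new meta file simply carries the new schema.
    v1 = table.commit(["id", "name", "email"], ["part-0.parquet", "part-1.parquet"])
    latest = table.snapshot()
    old = table.snapshot(v0)
```

Because every commit adds a file and nothing is rewritten, the old snapshot stays readable after the schema changes. It also shows why deleting the meta directory is fatal: the data files alone no longer say which of them belong to the table.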


Hudi

Let me talk about Hudi first. Hudi's design goal is right there in its name, Hadoop Upserts Deletes and Incrementals (formerly Hadoop Upserts anD Incrementals): it mainly supports upserts, deletes and incremental data processing. Its main writing tools are the Spark HudiDataSource API and its own DeltaStreamer, and it supports three write operations: UPSERT, INSERT and BULK_INSERT.


Deletes are likewise performed by specifying certain options when writing; there is no standalone delete interface.

A typical usage is to write upstream data into Hudi through Kafka or Sqoop by way of DeltaStreamer. DeltaStreamer is a resident service that continuously pulls data from upstream and writes it to Hudi. Writes are divided into batches, and the scheduling interval between batches can be configured. The default interval is 0, which is similar to the as-soon-as-possible strategy of Spark Streaming.


As data continues to be written, small files are generated. DeltaStreamer can automatically trigger a task to merge these small files.

In terms of queries, Hudi supports Hive, Spark and Presto.

In terms of performance, Hudi designed HoodieKey, something similar to a primary key. HoodieKey has Min/Max statistics and a BloomFilter, which are used to quickly locate the file containing a record. During an upsert, if the HoodieKey is not in the BloomFilter, an insert is executed; otherwise, Hudi confirms whether the HoodieKey really exists, and if it does, an update is executed. This HoodieKey + BloomFilter approach to upserts is comparatively efficient; without it, a full-table join would be needed to implement upserts.
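The upsert decision described above can be sketched in plain Python. This is only an illustration of the key + Bloom filter idea, not Hudi's implementation: `BloomFilter`, `ToyFile` and `upsert` are hypothetical names, and the filter size and hash count are arbitrary.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: can say 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToyFile:
    """Stand-in for one data file that keeps a Bloom filter over its record keys."""

    def __init__(self):
        self.records = {}
        self.bloom = BloomFilter()

    def write(self, key, value):
        self.records[key] = value
        self.bloom.add(key)


def upsert(data_file, key, value):
    # A negative Bloom answer is definitive: the key is new, so insert
    # without looking at the data at all.
    if not data_file.bloom.might_contain(key):
        data_file.write(key, value)
        return "insert"
    # A positive answer may be a false positive, so confirm against the data.
    action = "update" if key in data_file.records else "insert"
    data_file.write(key, value)
    return action


data_file = ToyFile()
first = upsert(data_file, "user-1", {"name": "Ann"})
second = upsert(data_file, "user-1", {"name": "Anna"})
```

The saving comes from the first branch: most new keys are rejected by the filter alone, so only the "possibly present" keys pay for a real lookup.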


For query performance, the general approach is to generate filter conditions from query predicates and push them down to the data source. Hudi has not done much work in this area, and its performance relies entirely on the engine's built-in predicate pushdown and partition pruning. Another major feature of Hudi is its support for Copy On Write and Merge On Read. The former merges data at write time, so write performance is slightly worse but read performance is higher. The latter merges at read time, so read performance is worse, but data is written promptly, which lets it provide near real-time data analysis.
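The trade-off between the two modes can be illustrated with a toy model in Python. The classes below are hypothetical and hold data in dictionaries; they only show where the merge cost lands, not how Hudi stores base and log files.

```python
class CopyOnWriteTable:
    """Merges at write time: every write rewrites the base data."""

    def __init__(self):
        self.base = {}

    def write(self, updates):
        merged = dict(self.base)  # copy the existing data
        merged.update(updates)    # apply the changes
        self.base = merged        # publish the rewritten version

    def read(self):
        # Nothing left to merge, so reads are cheap.
        return dict(self.base)


class MergeOnReadTable:
    """Merges at read time: writes only append a delta."""

    def __init__(self):
        self.base = {}
        self.delta_log = []

    def write(self, updates):
        # Cheap append, so new data becomes available quickly.
        self.delta_log.append(dict(updates))

    def read(self):
        # The reader pays for merging the base with all pending deltas.
        merged = dict(self.base)
        for delta in self.delta_log:
            merged.update(delta)
        return merged

    def compact(self):
        # Fold the deltas into the base so later reads are cheap again.
        self.base = self.read()
        self.delta_log = []


cow = CopyOnWriteTable()
mor = MergeOnReadTable()
for batch in ({"k1": "a", "k2": "b"}, {"k2": "B", "k3": "c"}):
    cow.write(batch)
    mor.write(batch)

pending_deltas = len(mor.delta_log)
cow_view = cow.read()
mor_view = mor.read()
mor.compact()
```

Both tables return the same rows; the difference is only whether the writer or the reader does the merging.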


Finally, Hudi provides a script named run_sync_tool to synchronize the table schema to Hive. Hudi also provides a command-line tool for managing Hudi tables.


Iceberg

Iceberg has no design similar to HoodieKey, and it does not emphasize primary keys. As mentioned above, without a primary key, operations such as update/delete/merge must be implemented through a join, and a join requires a SQL-like execution engine. Iceberg is not bound to any particular engine, nor does it have its own, so Iceberg does not support update/delete/merge. If users need to update data, the best way is to work out which partitions need updating and then rewrite those partitions with an overwrite.
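The update-by-overwrite workaround might look like the following Python sketch. It models a table as a dictionary from partition value to rows; `update_by_overwrite` and the `id` and `day` column names are assumptions for illustration and have nothing to do with Iceberg's actual API.

```python
def update_by_overwrite(table, updates, key="id", partition_col="day"):
    """Apply row updates by rewriting every partition they touch.

    table:   dict mapping partition value -> list of row dicts
    updates: list of row dicts carrying the new version of each row
    """
    # Step 1: find out which partitions need to be updated.
    touched = {row[partition_col] for row in updates}

    new_table = dict(table)
    for part in touched:
        # Step 2: recompute the full contents of the partition.
        by_key = {row[key]: row for row in table.get(part, [])}
        for row in updates:
            if row[partition_col] == part:
                by_key[row[key]] = row
        # Step 3: overwrite the whole partition with the recomputed rows.
        new_table[part] = list(by_key.values())
    return new_table, touched


table = {
    "2020-05-01": [
        {"id": 1, "day": "2020-05-01", "amount": 10},
        {"id": 2, "day": "2020-05-01", "amount": 20},
    ],
    "2020-05-02": [{"id": 3, "day": "2020-05-02", "amount": 30}],
}
updates = [{"id": 2, "day": "2020-05-01", "amount": 25}]
new_table, touched = update_by_overwrite(table, updates)
```

Changing one row rewrites its whole partition, while untouched partitions are left alone. That is the cost of not having a record-level key.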


The quick start and Spark documentation on Iceberg's official website mention only writing data to Iceberg through the Spark DataFrame API, and no other data ingestion methods. As for writing with Spark Streaming, the code implements the corresponding StreamWriteSupport, so streaming writes should be supported, but the official website does not seem to say so explicitly. Supporting streaming writes brings the small-file problem, and the official website does not mention how to merge small files either. I suspect that Iceberg's streaming writes and small-file merging may not be ready for production, which would explain why they go unmentioned (purely a personal guess).

In terms of queries, Iceberg supports Spark and Presto.

Iceberg has done a lot of work on query performance. What is worth mentioning is its hidden partition feature. With hidden partitioning, the user picks some columns of the incoming data and applies a suitable transformation (Transform) to derive a new column that serves as the partition column. This partition column exists only to partition the data and is not directly reflected in the table's schema. For example, if the user has a timestamp column, a new partition column timestamp_hour can be generated with hour(timestamp). timestamp_hour is not visible to users; it is only used to organize data.
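As a rough illustration of the hour(timestamp) example, the Python sketch below derives the partition value from the timestamp column at write time without ever adding it to the rows or the schema. `HiddenPartitionTable`, `hour_transform` and `scan_hour` are hypothetical names for this example, not Iceberg classes.

```python
from collections import defaultdict
from datetime import datetime


def hour_transform(ts):
    """Truncate a timestamp to the hour; plays the role of hour(timestamp)."""
    return ts.replace(minute=0, second=0, microsecond=0)


class HiddenPartitionTable:
    def __init__(self, schema, source_column, transform):
        self.schema = list(schema)          # user-visible columns only
        self.source_column = source_column  # column the partition is derived from
        self.transform = transform
        self.partitions = defaultdict(list)

    def append(self, row):
        # The derived value decides where the row is stored, but it is
        # not written into the row itself.
        partition_value = self.transform(row[self.source_column])
        self.partitions[partition_value].append(row)

    def scan_hour(self, hour_start):
        """Read only the partition for one hour; other partitions are not touched."""
        return list(self.partitions.get(hour_start, []))


table = HiddenPartitionTable(["event_id", "timestamp"], "timestamp", hour_transform)
table.append({"event_id": 1, "timestamp": datetime(2020, 5, 1, 9, 15)})
table.append({"event_id": 2, "timestamp": datetime(2020, 5, 1, 9, 45)})
table.append({"event_id": 3, "timestamp": datetime(2020, 5, 1, 10, 5)})

nine_oclock = table.scan_hour(datetime(2020, 5, 1, 9))
```

The writer never supplies a timestamp_hour value and the reader never sees one, yet the data is still laid out by hour.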


Iceberg keeps statistics for the partition column, such as the range of data each partition contains. At query time, partitions can be pruned based on these statistics.

In addition to hidden partitions, Iceberg also collects statistics on ordinary columns. These statistics are quite complete, including the size of the column, its value count, its null value count, and its maximum and minimum values. This information can be used to filter data during queries.

Iceberg provides an API for table creation. Users can use this API to specify the table name, schema, partition information and so on, and then complete the table creation in the Hive catalog.
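To make the column statistics mentioned above concrete, here is a small Python sketch of how per-file value counts, null counts and min/max values can rule files out before any data is read. The function names and the `price` column are assumptions for illustration; Iceberg's real metadata is richer than this.

```python
def collect_stats(rows, column):
    """Compute simple per-file statistics for one column."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "value_count": len(values),
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def files_to_scan(file_stats, lower, upper):
    """Keep only files whose [min, max] range can overlap [lower, upper]."""
    return [
        name
        for name, stats in file_stats.items()
        if stats["min"] is not None
        and stats["max"] >= lower
        and stats["min"] <= upper
    ]


files = {
    "file-a": [{"price": 5}, {"price": 12}, {"price": None}],
    "file-b": [{"price": 40}, {"price": 55}],
}
file_stats = {name: collect_stats(rows, "price") for name, rows in files.items()}

# Query: price BETWEEN 30 AND 60. file-a tops out at 12, so it is skipped.
candidates = files_to_scan(file_stats, 30, 60)
```

The statistics are computed once at write time and stored with the metadata, so the filtering step costs far less than opening the files.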


Delta

Let's finally talk about Delta. Delta is positioned as a data lake storage layer that unifies streaming and batch, and it supports update/delete/merge. Since it comes from Databricks, all of Spark's write paths are supported, including DataFrame-based batch and streaming writes as well as SQL INSERT, INSERT OVERWRITE and so on (SQL writes are not currently supported in the open source version, but EMR does support them). Like Iceberg, Delta does not emphasize primary keys, so its update/delete/merge implementations are all based on Spark's join. In terms of data writing, Delta is tightly bound to Spark. This differs from Hudi, whose data writing is not bound to Spark (you can use Spark or Hudi's own writing tool).
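A join-based merge of the kind described above can be sketched in plain Python as a hash join between the target table and the incoming rows. `merge_by_join` is a hypothetical helper for illustration; it is not Delta's or Spark's API, and real engines distribute this work across a cluster.

```python
def merge_by_join(target, source, key):
    """Upsert source rows into target by joining the two on `key`."""
    # Build side of the hash join: index the incoming rows by key.
    source_by_key = {row[key]: row for row in source}

    matched = set()
    merged = []
    # Probe side: every target row is scanned, since there is no key index.
    for row in target:
        k = row[key]
        if k in source_by_key:
            merged.append(source_by_key[k])  # matched: take the new version
            matched.add(k)
        else:
            merged.append(row)               # unmatched target row: keep as is

    # Source rows that matched nothing are new, so they are inserted.
    for k, row in source_by_key.items():
        if k not in matched:
            merged.append(row)
    return merged


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = merge_by_join(target, source, "id")
```

Compared with the key-plus-BloomFilter approach in the Hudi section, the whole target is scanned on every merge here, which is why the join-based approach needs an execution engine behind it.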


In terms of queries, open source Delta currently supports Spark and Presto, but Spark is indispensable because processing the delta log requires Spark. This means that to query Delta with Presto, you also need to run a Spark job. What's more painful is that Presto queries rely on SymlinkTextInputFormat: before querying, a Spark job has to run to generate the symlink file. If the table data is updated in real time, you have to run SparkSQL before every query and then run Presto. In that case, why not do it all in SparkSQL? This is a very painful design.


For this reason, EMR has made improvements in this area and supports DeltaInputFormat, so users can query Delta data directly with Presto without starting a Spark task in advance.

In terms of query performance, open source Delta has hardly any optimization. Never mind Iceberg's hidden partitions; it does not even have statistics for ordinary columns. Databricks has held back the Data Skipping technology it is so proud of, which, I have to say, does not help Delta's adoption. The EMR team is doing some work in this area, hoping to fill the gap.


Delta is inferior to Hudi in terms of data merging and inferior to Iceberg in terms of query performance. Does that mean Delta is useless? Not at all. One of Delta's major advantages is its integration with Spark (still imperfect at present, but it should be much better after Spark 3.0). In particular, its unified stream-batch design, combined with multi-hop data pipelines, can support various scenarios such as analytics, machine learning and CDC. Flexible use and complete scenario support are its biggest advantages over Hudi and Iceberg. In addition, Delta claims to be an improvement on the Lambda and Kappa architectures, so users need not care whether a workload is streaming or batch, or which architecture to choose. In this respect, Hudi and Iceberg fall short.



Conclusion

From the above analysis, it can be seen that the scenarios the three engines were originally designed for are not exactly the same. Hudi targets incremental upserts, Iceberg is positioned for high-performance analysis and reliable data management, and Delta is positioned for data processing that unifies streaming and batch.


These differences in target scenarios also led to differences in design. Hudi in particular has a more distinctive design than the other two. Over time, the three keep filling in their missing capabilities, and they may converge in the future and move into each other's territory. Of course, it is also possible that each will focus on the scenarios it does best and build its own competitive barriers. Therefore, it is still unknown who will win and who will lose in the end.


Source: Some content may be drawn from the Internet. A few concepts may be outdated owing to recent releases and updates of the individual software.



