Why the Lakehouse is here to stay





Introduction

The past few years witnessed a contraposition between two different ecosystems: data warehouses and data lakes. The former were designed as the core for analytics and business intelligence, generally SQL-centred; the latter provided the backbone for advanced processing and AI/ML, operating on a wide variety of languages, ranging from Scala to Python, R and SQL.

Despite the contraposition between the respective market leaders (think, for example, of Snowflake vs Databricks), the emerging pattern also shows a convergence between these two core architectural patterns. The lakehouse is the new concept that moves data lakes closer to data warehouses, making them able to compete in the BI and analytical world.

Of course, as with any emerging technical innovation, it is hard to separate the marketing hype from the actual technological value, which, ultimately, only time and adoption can prove. While it is undeniable that marketing is playing an important role in spreading the concept, there is a lot more to it than buzzwords.

Indeed, the lakehouse architecture was introduced separately, and basically in parallel, by three important and trustworthy companies, with three different implementations. Databricks published its seminal paper on the lakehouse, followed by open-sourcing the Delta Lake framework. In parallel, Netflix, in collaboration with Apple, introduced Iceberg, while Uber introduced Hudi (pronounced "Hoodie"), both becoming top-tier Apache projects in May 2020. Moreover, all major data companies are competing to support it, from AWS to Google Cloud, passing through Dremio, Snowflake and Cloudera, and the list is growing.

In this article, I will try to explain, in plain language, what a lakehouse is, why it is generating so much hype, and why it is rapidly becoming a centerpiece of modern data platform architectures.


In a single sentence, a lakehouse is a "data lake" on steroids, unifying the concepts of "data lake" and "data warehouse". In practice, the lakehouse leverages a new metadata layer, providing a "table abstraction" and some features typical of data warehouses, on top of a classical data lake. This new layer is built on top of existing technologies, in particular on a binary, often columnar, file format (Parquet, ORC or Avro) and on a storage layer.

Therefore, the main building blocks of a lakehouse platform (see figure 1.x), from a bottom-up perspective, are:

- A file storage layer, generally cloud-based, for example AWS S3, GCP Cloud Storage or Azure Data Lake Storage Gen2.
- A binary file format like Parquet or ORC, used to store data and metadata.
- The new table format layer: Delta Lake, Apache Iceberg or Apache Hudi.
- A processing engine supporting the above table format, for example Spark, Presto or Athena.

These four layers are illustrated in the sketch below.
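To make the layering concrete, here is a minimal sketch using PySpark with Delta Lake as the table format. It assumes the delta-spark package is installed (pip install delta-spark) and, if you target a real bucket, that the cluster is configured for S3 access; the bucket name, table path and sample data are hypothetical placeholders, not taken from the paper.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Processing engine layer: a Spark session with Delta's SQL
# extensions and catalog registered (the documented quickstart config).
builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Storage layer: a cloud object store path (hypothetical bucket);
# a local path such as "/tmp/lakehouse/events" also works for testing.
table_path = "s3a://my-data-lake/events"

# File format + table format layers: Delta persists the rows as Parquet
# files, plus a _delta_log/ directory holding the transactional metadata.
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
df.write.format("delta").mode("append").save(table_path)

# Table abstraction: ACID reads and time travel over plain files.
latest = spark.read.format("delta").load(table_path)
v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
latest.show()
```

Each building block from the list shows up here: the bucket path is the storage layer, the Parquet files underneath are the file format, the _delta_log directory is the table format layer, and Spark is the processing engine.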


The second generation, data lakes

The growing volume of data to handle, along with the need to deal with unstructured data (i.e. images, videos, text documents, logs, etc.), made data warehouses more and more expensive and inefficient. To overcome these problems, second-generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems providing a file-like API (see the sketch at the end of this section). Data lakes started with MapReduce and Hadoop (even if the name "data lake" came later) and were successively followed by cloud data lakes, such as those based on S3, ADLS and GCS.

Lakes did not replace warehouses: they were complementary, each of them addressed different needs and use cases. Lakes feature low-cost storage, higher speed, and greater scalability but, on the other hand, they gave up many of the advantages of warehouses.
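As a small illustration of that file-like API, here is a sketch using boto3 against S3 (the same pattern applies to ADLS and GCS with their respective SDKs). The bucket and key names are hypothetical, and pandas with pyarrow is assumed for the Parquet serialization.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Offload raw data as-is: the lake stores bytes under a key,
# with no notion of schema, transactions, or tables.
buf = io.BytesIO()
pd.DataFrame({"id": [1, 2], "event": ["click", "view"]}).to_parquet(buf)
s3.put_object(Bucket="my-data-lake",
              Key="raw/events.parquet",
              Body=buf.getvalue())

# Reading back is a plain GET of an object, just like reading a file.
obj = s3.get_object(Bucket="my-data-lake", Key="raw/events.parquet")
df = pd.read_parquet(io.BytesIO(obj["Body"].read()))
```

Note what is missing compared to a warehouse: no schema enforcement, no transactions, no table semantics; the store just moves bytes. That gap is exactly what the lakehouse metadata layer described above fills.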






