In recent years, the Data Lakehouse architecture has become the primary data architecture for cloud-based data platforms, and the Medallion architecture (bronze, silver, gold) has become the de facto standard when building a Lakehouse. Until now, Microsoft’s solution for cloud-based data platforms has been the Azure Synapse Analytics PaaS offering or Databricks on Azure. In November 2023, Microsoft released a new SaaS-based analytics platform called Microsoft Fabric. If Fabric is not yet familiar to you, check out the Microsoft Learn overview.
Below are some of our thoughts on Fabric related to Data Lakehouse implementations and why organizations should adopt Fabric.
Synapse vs Fabric
The purpose of Synapse Analytics was to bring the data services available in Azure under one umbrella. In practice it did, but under the hood the services did not always work together seamlessly. Examples include Spark and SQL workloads not natively communicating with each other, connections between different services not working out of the box, and needing to store multiple copies of data to use it optimally in different workloads. These are not insurmountable issues, but they cause extra work that does not add value for end users. Of course, many of these things can be managed with automation, as the template-based solutions we have developed at Islet do. In both solutions, Spark notebooks, which perform the actual data handling, are at the heart of the Lakehouse architecture. For this purpose, we have developed our own Spark libraries, which make implementation faster and more quality-conscious, and they are fully compatible with Fabric’s notebooks.
How does Fabric change the picture?
Fabric does not change the basic principles of the Data Lakehouse and Medallion architecture, but it provides a completely new type of platform for building them. Since it’s a SaaS service, its setup and maintenance require less work than the PaaS-based Synapse.
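The Medallion layering mentioned above can be illustrated with a minimal, framework-agnostic sketch. In a real Lakehouse these steps would be Spark jobs writing Delta tables; here plain Python stands in for the transformations, and all table names and data are hypothetical:

```python
# A minimal sketch of the Medallion (bronze/silver/gold) refinement steps.
# Plain Python dicts stand in for tables; a real implementation would use
# Spark DataFrames and Delta tables.

RAW_EVENTS = [  # bronze: raw data as ingested, warts and all
    {"customer": " Alice ", "amount": "120.50", "currency": "EUR"},
    {"customer": "Bob", "amount": "80.00", "currency": "EUR"},
    {"customer": "alice", "amount": "oops", "currency": "EUR"},  # bad row
]

def to_silver(bronze_rows):
    """Silver: cleanse and standardize - trim names, cast types, drop bad rows."""
    silver = []
    for row in bronze_rows:
        try:
            silver.append({
                "customer": row["customer"].strip().title(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            continue  # skip rows that fail validation
    return silver

def to_gold(silver_rows):
    """Gold: business-level aggregate - total spend per customer."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

if __name__ == "__main__":
    print(to_gold(to_silver(RAW_EVENTS)))  # {'Alice': 120.5, 'Bob': 80.0}
```

The point of the layering is that each step has a single responsibility: bronze preserves the raw input, silver enforces quality and types, and gold serves business-ready aggregates.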
Fabric’s common interface for all workloads is naturally a good thing: it reduces the number of tools needed and the switching between them. As for the workloads, Fabric has a lot to choose from: Data Factory, Data Engineering + Lakehouse, Data Warehouse, Data Science, Real-Time Analytics, Data Activator, and Power BI. You can read more about these in the Fabric introduction linked above. Not all workloads need to be used, of course; instead, the most suitable tool is chosen for each need. The same things can be implemented with different workloads, for example in either a low-code or a code-first manner.
However, the most important feature is the OneLake storage layer under the hood and the Delta Lake storage format used by all Fabric workloads. Behind OneLake is Azure Data Lake Storage Gen2, which means that OneLake supports the same features as Data Lake Storage. Delta Lake, in turn, is an open storage format that supports ACID transactions and data versioning, and the same format is also used by Databricks. Synapse’s notebooks can likewise use Delta Lake, and Serverless SQL Pool can read it, but in Fabric all workloads both read and write Delta Lake natively. This naturally makes it easier to share data between workloads and for people in different roles to utilize the data on the platform, which is exactly what a modern data platform should enable.
With the unified Delta Lake format, the need to copy the same data in different formats for different tools or use cases is significantly reduced. In addition, Fabric has completely new features such as shortcuts and database mirroring, which allow existing data, for example from AWS S3, Azure Storage, or Azure SQL and Snowflake databases, to be linked into OneLake without necessarily needing to be copied there separately. Each case should of course be studied in more detail and the most suitable solution chosen for the specific need.
Among the new features, Power BI’s Direct Lake mode deserves a separate mention: it can read data from OneLake in real time and very efficiently, essentially combining the best aspects of DirectQuery and Import mode connections: an up-to-date data model and query performance.
In addition to the above, Fabric has numerous other new features, and the product is continuously developing. It’s important to note that although Microsoft released a production-ready (GA) version of Fabric in November 2023, there are still gaps in its feature set. However, these are being filled at a rapid pace, and new features are announced weekly.
When is a good time to start using Fabric?
Organizations that are just starting to transition to a cloud-based data platform should definitely consider Fabric as a primary option. On the other hand, those organizations that have already built their data platforms on Synapse or Databricks are in no hurry to transfer already completed parts to Fabric, as Synapse will continue to be fully supported. However, for these organizations, it may be an interesting option to implement Fabric for a certain area of use and thus gain experience with the new platform.
There are indications that Microsoft will provide tools for migrations at some point. If an organization’s current Synapse-based solution is a Lakehouse built with Spark notebooks, as Islet’s implementation model is, the migration to Fabric will be a fairly light operation, regardless of whether it happens now or in a few years.
In summary, what benefits does Fabric bring to an organization?
Under the same service, you can now find everything related to data and analytics needs, from data integration to transformation, storage, and reporting, as well as machine learning and AI tools.
Since all Fabric tools recognize the centralized OneLake and use the same data storage format, it’s easy for people in different roles to utilize the information stored on the platform. Time and money are saved when individuals do not have to figure out how to read the data they need.
Likewise, work is made more efficient by Copilot. It is integrated into all Fabric workloads and has the same visibility into the data on the platform as developers do, so developers can ask Copilot to write code, calculate formulas, or analyze data, for example.
Fabric’s costs are based on capacity units, which all workloads consume. As the amount of data grows and usage needs expand, more capacity is purchased, or vice versa. Power BI licenses, however, are still purchased separately unless using at least an F64 capacity, i.e. the equivalent of the former Power BI Premium.
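As a rough illustration of the capacity model, pay-as-you-go cost scales linearly with the capacity size. The SKU scheme below is real (F64 corresponds to 64 capacity units), but the hourly rate is a placeholder assumption for illustration only, not an actual price:

```python
# Back-of-the-envelope Fabric capacity cost sketch. SKU names follow the
# real F-SKU scheme (F64 = 64 capacity units); RATE_PER_CU_HOUR is a
# placeholder assumption - check current Azure pricing for real figures.

RATE_PER_CU_HOUR = 0.20  # hypothetical $/CU/hour, NOT an actual price

def monthly_cost(capacity_units: int, hours: float = 730.0) -> float:
    """Estimated monthly cost for a capacity running the given hours."""
    return capacity_units * RATE_PER_CU_HOUR * hours

for sku, cus in [("F2", 2), ("F16", 16), ("F64", 64)]:
    print(f"{sku}: ~${monthly_cost(cus):,.0f}/month")
```

The linear scaling is the point here: doubling the capacity doubles the cost, so organizations can start small and grow as usage expands.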
Islet and Fabric
At Islet, we have been implementing Data Lakehouse architectures for a long time instead of traditional data warehouses, as in the case of Wihuri. We have developed generic, repeatable models and libraries for the efficient implementation of the Medallion architecture and use Delta Lake as the data storage format. Considering these, the transition to Fabric doesn’t greatly change our way of implementing Lakehouse, but it brings many new possibilities and features for building the data platform and utilizing data.
– – – – –
The blog’s author Mika Kuivanen is a data architect at Islet with over 15 years of experience in databases, data & analytics, and consulting.
#MicrosoftFabric #Azure #lakehouse #deltalake #powerBI #data #analytics #AI #onelake #Microsoft