Introduction to Databricks
Databricks, a US-based software company, was founded by the creators of Apache Spark, an open-source analytics engine. Its flagship product, the Databricks Lakehouse Platform, is a cloud-based data and analytics platform powered by Spark and Delta Lake. The platform combines the best elements of data lakes and data warehouses, providing tools for data engineers, data scientists, and business intelligence analysts to collaboratively develop everything from small, highly customized data solutions to enterprise-level data platforms. With its roots in the Big Data world, Databricks is particularly adept at handling large volumes of data, typically stored in cloud object storage. Databricks is available on all major public cloud platforms (Azure, AWS, and GCP) and integrates seamlessly with their storage, security, and compute infrastructure while offering administration capabilities to users.
Under the hood, Databricks relies on distributed, parallel computing powered by Apache Spark. Users define the types of clusters required for computation, while Databricks provisions the necessary machines, starts them on demand or according to a predetermined schedule, and shuts them down when no longer needed. Data resides in the customer’s own cloud storage, such as Azure Data Lake Storage Gen2 or AWS S3, separate from the compute layer. The total cost of ownership is the sum of storage and compute, and the two scale independently of each other.
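As a concrete illustration, a cluster with autoscaling and automatic termination can be defined through the Databricks Clusters REST API. The sketch below is a minimal example rather than a recommended setup; the workspace URL, access token, node type, and runtime version are placeholder assumptions.

```python
import requests

# Placeholder values -- replace with your own workspace URL, token,
# node type, and Databricks Runtime version.
WORKSPACE_URL = "https://adb-1234567890.12.azuredatabricks.net"
TOKEN = "dapi..."  # personal access token

cluster_spec = {
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",        # Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",          # Azure VM type for the workers
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,              # shut the cluster down when idle
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(response.json())  # returns the cluster_id on success
```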
Data processing with Databricks
During data processing, Spark DataFrames, which resemble database tables, serve as the typical unit of processing and are managed by the Spark engine. DataFrame processing happens in the cluster’s memory, and DataFrames are manipulated with Python, Scala, R, or SQL in data processing pipelines. Pipeline development takes place in Databricks Notebooks, which consist of one or more command cells executed sequentially; different languages can be mixed within a single notebook by writing, for example, Python or SQL in individual cells. Typically, a single data processing pipeline comprises several sequentially run Notebooks, each focusing on a specific type of processing, such as raw data cleaning, key generation, or dimensional model generation. The final output of each processing pipeline is stored in the cloud, typically in the Delta Lake format, the de facto standard for modern data lake analytics. Although the code runs on a distributed, parallel computing platform, development is technically quite straightforward, and developers don’t necessarily need to delve into the intricacies of the Spark engine. Full customization is nevertheless available for developers who require it.
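As a minimal sketch of what one notebook cell in such a pipeline might look like (the source path, columns, and target table are invented for illustration; the `spark` session is provided by the Databricks notebook runtime):

```python
from pyspark.sql import functions as F

# Read raw data into a DataFrame (path and columns are illustrative assumptions)
raw_df = (
    spark.read
    .option("header", "true")
    .csv("abfss://raw@mystorageaccount.dfs.core.windows.net/sales/")
)

# Clean and enrich the data with DataFrame transformations
clean_df = (
    raw_df
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .filter(F.col("amount") > 0)
)

# Persist the result in Delta Lake format for downstream notebooks
clean_df.write.format("delta").mode("overwrite").saveAsTable("bronze.sales_orders")
```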
Workflow generation and dependency management
Delta Live Tables (DLT), a recent addition to Databricks, offers significant added value for data solutions. Its most essential feature is automatic workflow generation as a directed acyclic graph (DAG) together with dependency management, which ensures that tables are loaded and processed in the correct order. The DAG is visible to users, providing data lineage from raw data to reporting-ready tables. Additionally, DLT enables declarative data pipeline development, allowing the same pipeline and code to process source tables either in batch mode or as a stream, which simplifies the architecture and reduces code complexity. Quality control is also highly automated: data is validated during processing against predefined rules and conditions, and the results (e.g., the number of rows failing a condition) are visually accessible to the user.
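As a hedged sketch of this declarative style (the table names, source path, and quality rule are invented for illustration; `dlt` and `spark` are provided by the DLT runtime), tables are defined as Python functions, expectations attach quality rules, and DLT infers the DAG from the dependencies between the functions:

```python
import dlt
from pyspark.sql import functions as F

# Streaming bronze table: ingest raw JSON files as they arrive
# (the source path is an illustrative assumption).
@dlt.table(comment="Raw sensor readings ingested as a stream")
def raw_readings():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/sensor-readings/")
    )

# Silver table: DLT sees the dependency on raw_readings and orders the DAG accordingly.
# Rows violating the expectation are dropped and reported in the pipeline UI.
@dlt.table(comment="Cleaned sensor readings")
@dlt.expect_or_drop("valid_reading", "reading IS NOT NULL AND reading >= 0")
def clean_readings():
    return (
        dlt.read_stream("raw_readings")
        .withColumn("ingested_at", F.current_timestamp())
    )
```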
Data management using Unity Catalog
Among the interesting features introduced last year, Databricks Unity Catalog offers visibility and manageability across multiple source systems and end-user groups, all centralized to ensure consistent data governance throughout the data lifecycle. Unity Catalog lets end-users find data based on metadata from all registered sources and share data through the Delta Sharing feature within the same Databricks instance, within the same cloud, or even outside the cloud environment. Data is easily accessible yet controlled: access to everything registered in the Catalog is managed, audited, and monitored centrally in one place.
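In practice, Unity Catalog organizes data into a three-level namespace (catalog.schema.table), and access is granted centrally with standard SQL statements. The minimal sketch below runs the statements through spark.sql to keep the examples in Python; the catalog, schema, table, and group names are assumptions.

```python
# Three-level namespace: catalog.schema.table (names are illustrative,
# and the orders table is assumed to exist already).
spark.sql("CREATE CATALOG IF NOT EXISTS sales_platform")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_platform.curated")

# Grant read access to an account-level group; the grant is enforced,
# audited, and visible centrally in Unity Catalog.
spark.sql("GRANT SELECT ON TABLE sales_platform.curated.orders TO `bi_analysts`")

# Review who has access to the table
spark.sql("SHOW GRANTS ON TABLE sales_platform.curated.orders").show()
```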
Data publishing for consumption with Databricks
Historically, Databricks’ challenge has been on the data distribution side: making ready-made data available to BI developers and business analysts through familiar methods and tools. Today, Databricks offers SQL Warehouse, which exposes the data through an interface resembling a traditional relational database. With this feature, BI developers and analysts can query ready-made, modeled data directly from a web browser using the built-in Databricks SQL editor or from external tools such as DBeaver. Widely used BI tools like Power BI, Tableau, and Qlik have native connectors that integrate easily with Databricks SQL Warehouse. At the time of writing, SQL Warehouse requires a running cluster to operate, but a serverless option is already in public preview.
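Beyond BI tools, the same SQL Warehouse can also be queried programmatically, for example with the databricks-sql-connector Python package. In the sketch below, the hostname, HTTP path, access token, and queried table are placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

# Connection details come from the SQL Warehouse's connection details view;
# the values below are placeholders.
with sql.connect(
    server_hostname="adb-1234567890.12.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token="dapi...",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT order_id, amount FROM sales_platform.curated.orders LIMIT 10"
        )
        for row in cursor.fetchall():
            print(row)
```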
Typical use cases for Databricks
Databricks is suitable for a variety of use cases, including:
- Large data volumes:
  - Distributed computing, a wide range of cluster options, and rich programming language support provide all the necessary tools to work with large volumes of data efficiently and cost-effectively.
- Extensive and diverse data platforms:
  - The versatility of Databricks allows numerous use cases to be implemented within a single environment. Data engineers handle raw data retrieval, cleaning, and Data Vault/dimensional modeling; data scientists develop and train machine learning models on these datasets; and BI analysts consume the data with commonly used visualization tools through the SQL Warehouse interface.
- Specialized solutions with demanding requirements:
  - For example, data from IoT devices may overwhelm many data platforms built with traditional tools (e.g., relational databases) in both volume and update frequency. Delta Live Tables, among other options, provides easy-to-use streaming tables for exactly these purposes.
  - In some cases, raw data arrives in such a challenging form that solutions built with traditional tools eventually run into performance or maintainability issues. Here, Spark DataFrames together with Python’s modularity and reusability (compared to pure SQL) enable significantly more efficient and maintainable solutions; see the sketch after this list.
- Building a new data platform from scratch:
  - Databricks offers powerful tools for developing data platform solutions with familiar tools for developers. Choosing Databricks as the implementation technology from the start avoids inadvertently excluding any use cases or forcing the later integration of incompatible components.
- Machine Learning:
  - Machine Learning (ML) has always been central to Databricks. Its ML capabilities are built on top of an open lakehouse architecture, with features such as AutoML and MLflow supporting the development, lifecycle management, and monitoring of machine learning models; a small MLflow sketch follows after this list.
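To illustrate the reusability point made above (the function, columns, and table names are made up; `spark` is provided by the notebook runtime), a cleaning step written once as a Python function can be applied to any number of DataFrames, something that tends to turn into copy-pasted boilerplate in pure SQL:

```python
from pyspark.sql import DataFrame, functions as F

def standardize(df: DataFrame, timestamp_col: str) -> DataFrame:
    """Reusable cleaning step: trim string columns, parse the timestamp column,
    and drop exact duplicates. Written once, applied to any source table."""
    string_cols = [f.name for f in df.schema.fields if f.dataType.simpleString() == "string"]
    for col in string_cols:
        df = df.withColumn(col, F.trim(F.col(col)))
    return (
        df.withColumn(timestamp_col, F.to_timestamp(timestamp_col))
          .dropDuplicates()
    )

# The same function is reused across different source tables
orders_clean = standardize(spark.table("bronze.orders"), "created_at")
devices_clean = standardize(spark.table("bronze.devices"), "last_seen_at")
```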
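And as a small sketch of the MLflow side (the model, data, and metric are purely illustrative), experiment tracking in a Databricks notebook boils down to logging parameters, metrics, and the model itself as part of a run:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data purely for illustration
X, y = make_regression(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Parameters, metrics, and the model are stored with the run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```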
What has Islet done with Databricks?
Our team has built an enterprise-level data platform using Databricks, where data engineers leveraged Delta Live Tables functionality to process data from raw inputs into a complete, distributable dimensional model delivered to end-users via the SQL Warehouse interface. The solution integrated SAP S/4HANA data from a large Finnish company into Azure using Databricks Notebooks. Raw data was retrieved from SAP and stored in ADLS Gen2 using Aecorsoft Data Integrator. The implementation is completely metadata-driven: when new data is integrated, only the basic information of the new tables is written to a configuration file, after which Databricks Notebooks retrieve the new raw data, clean it, and perform the transformations needed for further processing. Finally, the dimensional model designed in collaboration with the customer is generated using SQL queries hosted in an Azure DevOps repository and read by Delta Live Tables pipelines. The model is then distributed to Power BI and analysts through SQL Warehouse.
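The actual implementation is customer-specific and not reproduced here, but as a rough, simplified sketch of the metadata-driven idea (the configuration format, paths, and table names below are invented; `spark` is provided by the notebook runtime), each new source table is described by a configuration entry and a generic notebook loops over the entries:

```python
import json
from pyspark.sql import functions as F

# Illustrative configuration entries, one per source table; the real solution's
# configuration format and contents are customer-specific.
config = json.loads("""
[
  {"source_path": "/mnt/raw/sap/mara/", "target_table": "bronze.sap_mara", "keys": ["MATNR"]},
  {"source_path": "/mnt/raw/sap/vbak/", "target_table": "bronze.sap_vbak", "keys": ["VBELN"]}
]
""")

for entry in config:
    df = (
        spark.read.format("parquet").load(entry["source_path"])
        .dropDuplicates(entry["keys"])
        .withColumn("_loaded_at", F.current_timestamp())
    )
    # Write the cleaned table as Delta; downstream pipelines pick it up from here
    df.write.format("delta").mode("overwrite").saveAsTable(entry["target_table"])
```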
To learn more about Databricks’ capabilities, our experiences, or to brainstorm about Lakehouse architecture, please contact us!
The authors of the blog, Aku Rantala and Mika Rönkkö, are ISLET’s Lead Cloud Data Architects who have managed a large number of projects. They have a wide range of skills across a variety of tools and technologies; Databricks and Microsoft Azure in particular are among their strengths.