
Introduction to Databricks

Databricks, a US-based software company, was founded by the creators of Apache Spark, an open-source analytics engine. Its flagship product, the Databricks Lakehouse Platform, is a cloud-based data and analytics platform powered by Spark and Delta Lake. The platform combines the best elements of data lakes and data warehouses, giving data engineers, data scientists, and business intelligence analysts the tools to collaboratively build everything from small-scale, highly customized data solutions to enterprise-level data platforms. With its roots in the Big Data world, Databricks is particularly adept at handling large volumes of data, which typically reside in cloud storage. Databricks is available on all major public cloud platforms (Azure, AWS, and GCP) and integrates seamlessly with their storage, security, and compute infrastructure while offering administration capabilities to users.

Under the hood, Databricks relies on distributed, parallel computing powered by Apache Spark. Users define the types of clusters required for computation, while Databricks procures the necessary machines, starts them on demand or according to a predetermined schedule, and shuts them down when no longer needed. Data is stored in the cloud, in the customer's own storage solutions such as Azure Data Lake Storage Gen2 or AWS S3, separate from compute. The total cost of ownership is made up of storage and compute resources, and the two scale independently.
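
To make the compute model concrete, here is a minimal sketch of creating an autoscaling, auto-terminating cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and node type below are placeholders rather than values from any real environment.

# Sketch: create a cluster that scales between 2 and 8 workers and shuts itself
# down after 30 idle minutes. All identifiers below are placeholders.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace
TOKEN = "dapi-..."  # personal access token (placeholder)

cluster_spec = {
    "cluster_name": "etl-nightly",
    "spark_version": "13.3.x-scala2.12",   # example runtime; pick one listed in your workspace
    "node_type_id": "Standard_DS3_v2",     # Azure VM type; node types differ on AWS/GCP
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,         # the cluster stops itself when idle
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])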

Data processing with Databricks

During data processing, Spark DataFrames, which resemble database tables, serve as the typical unit of processing and are managed by the Spark engine. DataFrame processing occurs in the cluster's memory and is expressed in Python, Scala, R, or SQL within data processing pipelines. Pipeline development takes place in Databricks Notebooks, which consist of one or more command cells executed sequentially. Different languages can be mixed within a single notebook, with code (e.g., Python or SQL) written in individual command cells. Generally, a single data processing pipeline comprises several sequentially run notebooks, each focusing on a specific type of processing, such as raw data cleaning, key generation, or dimensional model generation. The final output of each processing pipeline is stored in the cloud, typically in the Delta Lake format, which has become the de facto standard for modern data lake analytics. Although the code runs on a distributed, parallel computing platform, development at a technical level is quite straightforward, and developers don't necessarily need to delve into the intricacies of the Spark engine. However, full customization is available for developers who require it.
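
As an illustration of how little Spark plumbing a typical notebook cell needs, here is a minimal PySpark sketch of one step in such a pipeline: read raw files, clean them, and persist the result as a Delta table. It assumes the notebook-provided spark session; the paths, columns, and table names are invented for the example.

# Sketch of a single cleaning step in a notebook; identifiers are illustrative only.
from pyspark.sql import functions as F

raw_df = (
    spark.read
    .option("header", "true")
    .csv("abfss://raw@mystorageaccount.dfs.core.windows.net/sales/")  # hypothetical ADLS Gen2 path
)

clean_df = (
    raw_df
    .dropDuplicates(["order_id"])                        # assumed business key
    .withColumn("order_ts", F.to_timestamp("order_ts"))  # fix the timestamp type
    .filter(F.col("order_id").isNotNull())
)

# Delta is the default table format on Databricks; saveAsTable registers the result
# so later notebooks (key generation, dimensional modelling) can pick it up.
clean_df.write.format("delta").mode("overwrite").saveAsTable("bronze.sales_orders")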

Workflow generation and dependency management

Delta Live Tables (DLT), a recent addition to Databricks, offers significant added value for data solutions. The most essential feature is the automatic directed acyclic graph (DAG) workflow generation and dependency management, which maintains the proper processing order and ensures the correct sequence of table loading. DAG chains are visible to users, providing data lineage visibility from raw data to reporting-ready tables. Additionally, DLT enables declarative data pipeline development, allowing the same pipeline and code to process different source tables in batch mode or as a stream. This simplifies the architecture and reduces code complexity. Quality control is highly automated in DLT data pipelines, with data validation occurring during processing based on predefined rules and conditions, and the results (e.g., the number of rows not meeting conditions) are visually accessible to the user.
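
The sketch below shows what a declarative DLT pipeline can look like in Python: two table definitions, a dependency that DLT resolves into a DAG automatically, and an expectation whose results surface in the pipeline UI. The source path and column names are assumptions, not from the original post.

# Sketch of a Delta Live Tables pipeline definition; paths and columns are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw sensor readings ingested incrementally as a stream")
def sensor_raw():
    return (
        spark.readStream
        .format("cloudFiles")                # Auto Loader for incremental file ingestion
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/iot/")               # hypothetical source location
    )

@dlt.table(comment="Cleaned readings ready for downstream modelling")
@dlt.expect_or_drop("valid_reading", "temperature IS NOT NULL AND device_id IS NOT NULL")
def sensor_clean():
    # dlt.read_stream declares the dependency on sensor_raw; DLT builds the DAG from this.
    return (
        dlt.read_stream("sensor_raw")
        .withColumn("ingest_date", F.to_date("event_time"))
    )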

Data management using Unity Catalog

Among the interesting features introduced last year, the Databricks Unity Catalog offers visibility and manageability for multiple source systems and end-user groups, all centralized to ensure consistent data governance throughout the data lifecycle. Unity Catalog allows end-users to find data based on metadata from all registered sources and to share data through the Delta Sharing feature within the same Databricks instance, within the same cloud, or even outside the cloud environment. Data is easily accessible yet controlled, ensuring that access to data within the catalog is centrally managed, audited, and monitored in one place.
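
From a notebook, Unity Catalog's three-level namespace (catalog.schema.table) and centralized grants look roughly like the sketch below, run through the notebook-provided spark session. The catalog, schema, and group names are placeholders.

# Sketch of Unity Catalog objects and grants; names are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales.curated")

# Access control is managed centrally in the metastore and audited in one place.
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `bi_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA sales.curated TO `bi_analysts`")
spark.sql("GRANT SELECT ON SCHEMA sales.curated TO `bi_analysts`")

# Metadata-driven discovery: list what is registered in the schema (display is a notebook built-in).
display(spark.sql("SHOW TABLES IN sales.curated"))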

Data publishing for consumption with Databricks

Historically, Databricks' challenge has been on the data distribution side: ensuring that BI developers and business analysts can access ready-made data using familiar methods and tools. Today, Databricks offers SQL Warehouse, which exposes data through an interface resembling a traditional relational database. With this feature, BI developers and analysts can query ready-made, modeled data directly from a web browser using the Databricks SQL editor or from supported IDEs such as DBeaver. Widely used BI tools like Power BI, Tableau, and Qlik have native connectors that integrate easily with Databricks SQL Warehouse. At the time of writing, SQL Warehouse requires a running cluster to operate, but a serverless option is already in public preview.
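
For programmatic access, the open-source databricks-sql-connector package can query a SQL Warehouse over the same endpoint the BI tools use. The hostname, HTTP path, token, and table below are placeholders copied-by-shape from a warehouse's connection details, not real values.

# Sketch: query a SQL Warehouse from Python (pip install databricks-sql-connector).
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",   # placeholder workspace host
    http_path="/sql/1.0/warehouses/abcdef1234567890",               # placeholder warehouse HTTP path
    access_token="dapi-...",                                        # placeholder token
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT region, SUM(revenue) AS revenue "
            "FROM sales.curated.fact_orders GROUP BY region"         # hypothetical modeled table
        )
        for row in cursor.fetchall():
            print(row)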

Typical use cases for Databricks

Databricks is suitable for a variety of use cases, including:

  1. Large data volumes: 
    • Distributed computing, a wide range of cluster options, and rich programming language support provide all the tools needed to work with large volumes of data efficiently and cost-effectively.
  2. Extensive and diverse data platforms: 
    • The versatility of Databricks allows numerous use cases to be implemented within a single environment. Data engineers handle raw data retrieval, cleansing, and Data Vault or dimensional modeling; data scientists develop and train machine learning models on these datasets; and BI analysts serve data to commonly used visualization tools through SQL Warehouse.
  3. Precision solutions with challenging requirements: 
    • For example, data from IoT devices may overwhelm many data platforms built with traditional tools (e.g., relational databases) in terms of data volume and update frequency. Delta Live Tables, among other options, provides easy-to-use streaming tables for exactly this purpose.
    • In some cases, raw data arrives in such a challenging form that solutions built with traditional tools eventually run into performance or maintainability issues. Here, Spark DataFrames and the modularity and reusability of Python over SQL enable significantly more efficient and maintainable solutions (see the sketch after this list).
  4. Building a new data platform from scratch: 
    • Databricks offers powerful tools for developing data platform solutions with technology that is familiar to developers. Choosing Databricks as the implementation technology from the start avoids inadvertently excluding use cases or having to bolt on incompatible components later.
  5. Machine Learning:
    • Machine Learning (ML) has always been central to Databricks. Its ML capabilities are built on top of the open lakehouse architecture, with features such as AutoML and MLflow supporting the development, lifecycle management, and monitoring of machine learning models.
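
As a concrete illustration of the modularity point in item 3, the hedged sketch below wraps two cleaning steps in ordinary Python functions and reuses them across several source tables, something that gets awkward to express as ever-growing SQL. It assumes the notebook-provided spark session; the column names, rules, and source tables are invented.

# Sketch: reusable transformation helpers applied to any number of DataFrames.
from typing import List
from pyspark.sql import DataFrame, functions as F

def standardise_keys(df: DataFrame, key_cols: List[str]) -> DataFrame:
    """Trim, upper-case, and de-duplicate the given business-key columns."""
    for col in key_cols:
        df = df.withColumn(col, F.upper(F.trim(F.col(col))))
    return df.dropDuplicates(key_cols)

def add_audit_columns(df: DataFrame) -> DataFrame:
    """Attach the same load metadata to every table."""
    return df.withColumn("load_ts", F.current_timestamp())

# The same helpers cover every source table and can be unit-tested in isolation.
customers = add_audit_columns(standardise_keys(spark.table("raw.customers"), ["customer_id"]))
devices = add_audit_columns(standardise_keys(spark.table("raw.devices"), ["device_id"]))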

What has Islet done with Databricks?

Our team has built an enterprise-level data platform using Databricks, where data engineers leveraged Delta Live Tables functionality to process data from raw inputs into a complete, distributable dimensional model delivered to end-users via the SQL Warehouse interface. The solution integrated SAP S/4HANA data from a large Finnish company into Azure using Databricks Notebooks. Raw data was retrieved from SAP and stored in ADLS Gen2 using Aecorsoft Data Integrator. The implementation was completely metadata-driven: when integrating new data, only the basic information about the new tables is written to the configuration file, after which Databricks notebooks retrieve the new raw data as it becomes available, clean it, and perform the necessary transformations for further processing. Finally, the dimensional model designed in collaboration with the customer is generated using SQL queries hosted in an Azure DevOps repository and read by Delta Live Tables pipelines. This model is then distributed to Power BI and analysts through SQL Warehouse.
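
A heavily simplified sketch of that metadata-driven pattern might look like the following: table-level settings live in a configuration file, and a single generic notebook loops over them to ingest and clean each source. The config structure, paths, and names here are hypothetical and not the customer's actual implementation.

# Sketch of metadata-driven ingestion; the config layout and paths are hypothetical.
import json
from pyspark.sql import functions as F

# e.g. [{"name": "vbak", "keys": ["vbeln"]}, ...] -- dbutils is a notebook built-in.
config = json.loads(dbutils.fs.head("/mnt/config/tables.json"))

for table in config:
    raw_df = spark.read.format("parquet").load(f"/mnt/raw/sap/{table['name']}/")
    clean_df = (
        raw_df
        .dropDuplicates(table["keys"])                 # business keys come from the config
        .withColumn("load_ts", F.current_timestamp())  # shared audit column
    )
    clean_df.write.format("delta").mode("overwrite").saveAsTable(f"bronze.{table['name']}")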

To learn more about Databricks' capabilities, our experiences, or to brainstorm about Lakehouse architecture, please contact us!

Janne Anttila, CBO, Data and Analytics, Isletter | janne.anttila@isletgroup.fi | +358 45 672 8569

The authors of the blog, Aku Rantala and Mika Rönkkö, are ISLET's Lead Cloud Data Architects who have managed a large number of projects. They have a wide range of skills with a variety of tools and technologies; Databricks and Microsoft Azure in particular are among their strengths.

#IsletGroup #data #analytics #DataBricks #DataLakehousePlatform #ApacheSpark