At WiseAnalytics, we help our clients unlock the value of their data: we build systems that drive our clients’ topline growth, help them manage their cost base, or fulfil their regulatory obligations. As a professional services provider working with Databricks, we are exposed to a number of different implementations and use cases. We have found that our clients typically choose Databricks over alternative platforms for a variety of reasons, but a few stand out:
The first thing to note is that Databricks is, to a large extent, built upon Apache Spark, an open source analytics engine. Although not all of Databricks’ features are open source, Databricks has made a significant effort to contribute new features such as Delta Lake back to the open source community.
Being open source has a number of advantages, but two are particularly worth highlighting:
Apache Spark was released in May 2014, almost 10 years ago, and became one of the de facto tools for operating on large datasets. Besides Databricks, Apache Spark is integrated into a number of cloud solutions, such as Google Dataproc, Amazon EMR and Athena, and Azure HDInsight, and can also be installed on on-premise or Kubernetes clusters.
Spark knowledge is widespread in the data engineering community, and finding engineers proficient in its intricacies and optimizations isn’t difficult. This is also reflected in the number of knowledge-base articles available online, making it easier to find a solution to an existing problem.
Databricks is a managed solution that is cloud-first but cloud-agnostic. Databricks can run on top of Azure, Google Cloud, or AWS, and is able to run in multi-cloud or hybrid-cloud environments. This setup provides a seamless interface and environment for operating on all of a company’s cloud data, while keeping the processing closest to the original data source and avoiding potentially expensive migration and duplication of raw data across clouds.
Databricks offers a number of features made for enterprises, be it the governance and management features required by larger organizations, such as data masking, audit logs, policy controls, single sign-on, and role-based access controls, as well as a number of additional security and compliance features. One of the relatively new features offered by Databricks in that space is Unity Catalog, which provides a centralized way to manage access control, auditing, lineage, and data discovery.
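To give a flavour of what that looks like in practice, here is a minimal sketch of Unity Catalog access control run from a Databricks notebook (where the `spark` session is predefined); the catalog, schema, table, and group names are hypothetical, not from any real deployment:

```python
# Minimal sketch, assuming a Unity Catalog-enabled workspace.
# The three-level namespace (catalog.schema.table) and the
# `analysts` group are illustrative.
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Grants are managed and audited centrally, rather than per workspace.
spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()
```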
Compared to most Spark installations, be they native or cloud-based, Databricks makes it comparatively easy to meet these requirements.
Databricks is at the forefront of innovation for Spark: it created Delta Lake and the lakehouse architecture, and brought real-time innovation with Delta Live Tables.
Delta tables provide ACID guarantees on big data, along with schema enforcement, “time travel” functionality, and constraints and checks, making them more than just a concatenation of Parquet (or other format) files, which was commonplace in data lakes before the advent of the lakehouse. This bridges the feature gap between the data lake and the data warehouse.
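A minimal sketch of these guarantees in PySpark, assuming a Databricks runtime (or a Spark session with the delta-spark package installed); the table path and columns are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/lake/orders"  # hypothetical Delta table location

# Version 0: each write is an atomic, ACID-compliant transaction.
spark.range(5).withColumnRenamed("id", "order_id") \
    .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending data whose schema does not match the
# table's raises an error instead of silently corrupting the files.

# "Time travel": read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```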
Delta Live Tables is a framework built on top of Spark Structured Streaming that is meant to facilitate the implementation of both “Lambda” and “Kappa” styles of real-time architecture. Delta Live Tables also brings a number of novel functionalities, such as expectations, which embed data quality rules as part of the structure of the datasets, and built-in change data capture functionality.
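A minimal sketch of a pipeline with an expectation, assuming it runs inside a Delta Live Tables pipeline where the `dlt` module is available; the source table name and quality rule are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleaned events with an embedded data quality rule.")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")  # failing rows are dropped and counted
def clean_events():
    # Incrementally reads from an upstream (hypothetical) streaming table.
    return dlt.read_stream("raw_events").select(
        "user_id", "event_type", col("ts").cast("timestamp")
    )
```

Because expectations are part of the dataset definition, rule violations are recorded in the pipeline’s event log rather than checked in a separate process.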
Databricks provides an optimized runtime for Spark called “Photon”. Photon is a vectorized engine for the SQL and DataFrame APIs, written in C++ to achieve higher performance. Photon increases the parallelism of CPU processing and provides a significantly more performant runtime than vanilla Spark. It is, however, more constrained in the features it supports and comes at an additional cost; still, when performance is key, it provides an additional lever beyond the traditional options of scaling the number of machines or the size of the instances.
While Apache Spark does offer an alternative C++ engine improving its performance through the Gluten/Velox project, that project is relatively less mature than Photon: its white paper was released in August 2022, almost a month after Photon reached general availability, and it is not yet integrated as part of any cloud’s managed offering.
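Opting into Photon is a cluster-level choice rather than a code change. A minimal sketch using the Databricks Clusters REST API, assuming the `runtime_engine` field; the workspace URL, token, runtime version, and node type are placeholders:

```python
import requests

payload = {
    "cluster_name": "photon-demo",        # hypothetical name
    "spark_version": "13.3.x-scala2.12",  # any Photon-capable runtime
    "node_type_id": "i3.xlarge",          # cloud-specific instance type
    "num_workers": 2,
    "runtime_engine": "PHOTON",           # "STANDARD" for vanilla Spark
}
resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())
```

Existing SQL and DataFrame workloads run unchanged on such a cluster; operators Photon does not support fall back to the standard engine.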
Migrating to Databricks is often a two-step approach: one of lift and shift, and one of modernization.
Our clients typically find migrating to Databricks fairly easy: the underlying engine, Spark, being extremely versatile, is able both to read the typical data formats used on most Hadoop installations and to connect to external databases such as Snowflake or MongoDB. The lift and shift migration is supported by Spark’s ANSI-compliant SQL dialect, by its support for the pandas API on Spark (formerly Koalas), and by the fact that Spark has been a de facto standard of data lakes for the past 10 years.
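For teams migrating pandas code, the pandas API on Spark keeps familiar syntax while executing distributed. A minimal sketch; the path and column names are illustrative:

```python
import pyspark.pandas as ps

# pandas-style syntax, executed by Spark under the hood.
pdf = ps.read_parquet("/mnt/lake/transactions")  # hypothetical path
by_region = pdf.groupby("region")["amount"].sum().sort_values(ascending=False)
print(by_region.head(10))

# Drop down to the underlying Spark DataFrame when needed.
sdf = pdf.to_spark()
```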
Databricks offers a number of features supporting a lift and shift migration, be it in the area of data federation or data replication. And while both areas are supported, our clients often prefer to migrate some of their data directly to Databricks for a diverse set of reasons (modernization needs, cost considerations, latency and performance …); this is an area where we expect Databricks to further strengthen its offering with the recent acquisition of Arcion.
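The federation-style pattern amounts to querying an external database in place before deciding what to replicate. A minimal sketch through Spark’s generic JDBC source, assuming the driver is available on the cluster; the URL, table, and credentials are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")  # placeholder
    .option("dbtable", "public.customers")                      # placeholder
    .option("user", "reader")
    .option("password", "<secret>")
    .load()
)

# Query the external table alongside lake data without copying it first.
customers.createOrReplaceTempView("customers")
spark.sql("SELECT country, COUNT(*) AS n FROM customers GROUP BY country").show()
```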
Databricks offers a full ecosystem for data scientists and engineers to operate with: from notebook and pipeline management functionality, to end-to-end machine learning integration with MLflow, which provides training, tracking, and serving of both features and models. Back in September 2023, Databricks also introduced Lakeview dashboards, providing improved dashboarding capabilities.
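A minimal MLflow tracking sketch on synthetic data, to illustrate the train/track/serve loop; the model choice, run name, and parameters are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="rf-baseline"):  # hypothetical run name
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    mlflow.log_param("n_estimators", 100)       # tracked parameter
    mlflow.log_metric("mse", mean_squared_error(y_te, model.predict(X_te)))
    mlflow.sklearn.log_model(model, "model")    # logged artifact, servable later
```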
You don’t need to build everything from scratch when you use Databricks. Databricks offers a suite of “Solution Accelerators”, providing cookie-cutter templates for a number of use cases, ranging from customer base management (churn prediction, retention management, RFM segmentation, customer segmentation, lifetime value), to operations (stock management, optimized order picking, on-shelf availability, supply chain distribution optimization), media (media mix modelling, multi-touch attribution), and compliance (AML/KYC, PHI data removal).
These Solution Accelerators help bootstrap the different use cases and projects without reinventing the wheel, and help deliver proofs of concept without a high initial investment.
A data platform should provide seamless integrations and a rich ecosystem of tools. In addition to standard Spark integration, Databricks integrates with most of the popular data tools, such as dbt; dashboarding tools such as Preset, Tableau, Power BI, and Grafana; and lineage tools such as OpenMetadata, DataHub, and Collibra.
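Most of these tools connect through the same SQL endpoint that is exposed to custom code. A minimal sketch using the databricks-sql-connector Python package; the hostname, HTTP path, and token are placeholders:

```python
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        # The same interface that BI and lineage tools build on.
        cur.execute("SELECT current_catalog(), current_date()")
        print(cur.fetchall())
```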