Five years ago, I wrote On the evolution of data engineering, a blog post covering some of the history and evolution of the data engineering role. That blog post got quite a bit of traction and coverage, even becoming an included reading in the Udacity Data Engineering Nanodegree.
It covered how the role of data engineers evolved through time, moving away from the traditional data warehouse and taking on skills from backend software engineering to address the challenges of big data, data orchestration, machine learning, real-time processing and the cloud. Five years on, that observation still stands.
However, over this period, the technology landscape has changed.
Kubernetes came into the picture and became part of the toolbox of some data engineers. Software such as Spark, Airflow, SageMaker, Kubeflow and more started supporting Kubernetes as their container orchestration engine. Kubernetes offered data engineers the promise of pushing any type of compute work to a single “engine”, be it a web application, a map-reduce job or a machine learning workflow, unencumbered by licenses and able to run on any cloud or on-premise. From a technical perspective Kubernetes delivered on its promise; however, it is a complex system, and many companies found it difficult to hire and retain engineers able to operate it.
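To make the idea concrete, here is a minimal sketch of what pushing a Spark workload to Kubernetes can look like from Python. The API server endpoint, container image, namespace and input path are placeholders, not values from the original post, and in practice such a job would often be launched via spark-submit or a Kubernetes operator instead:

```python
from pyspark.sql import SparkSession

# Minimal sketch: point Spark at a Kubernetes API server so executors run as pods.
# The endpoint, image and namespace below are placeholders.
spark = (
    SparkSession.builder
    .appName("wordcount-on-k8s")
    .master("k8s://https://kubernetes.example.com:6443")                 # placeholder API server
    .config("spark.kubernetes.container.image", "example/spark:3.5.0")   # placeholder image
    .config("spark.kubernetes.namespace", "data-eng")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# The same cluster could just as well host Airflow, web applications or ML training jobs.
counts = (
    spark.read.text("s3a://example-bucket/logs/")      # placeholder input path
         .selectExpr("explode(split(value, ' ')) AS word")
         .groupBy("word")
         .count()
)
counts.show()
```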
At the same time, platforms such as Databricks and Snowflake have emerged, trying to blend the functionality of big data, machine learning, orchestration, real-time and cloud into one integrated platform. Data engineers using these platforms still rely on familiar programming languages such as Python or Scala, be it for coding in Spark on Databricks or using Snowpark on Snowflake. The direction of travel, however, is clear: integrating all data concerns into a single environment, the data platform, away from the tooling previously used by backend engineers.
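As an illustration of how similar the day-to-day code looks across these platforms, here is a small sketch of the same aggregation written with PySpark (as you might run it on Databricks) and with Snowpark for Python. The orders table, column names and connection parameters are assumptions made for the example:

```python
# --- PySpark (e.g. on Databricks); assumes an "orders" table is already registered ---
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("revenue-by-region").getOrCreate()
revenue_spark = (
    spark.table("orders")
         .filter(F.col("status") == "SHIPPED")
         .groupBy("region")
         .agg(F.sum("amount").alias("revenue"))
)
revenue_spark.show()

# --- Snowpark for Python; assumes valid connection parameters and an ORDERS table ---
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

connection_parameters = {"account": "...", "user": "...", "password": "..."}  # placeholders
session = Session.builder.configs(connection_parameters).create()
revenue_snowpark = (
    session.table("ORDERS")
           .filter(col("STATUS") == "SHIPPED")
           .group_by("REGION")
           .agg(sum_("AMOUNT").alias("REVENUE"))
)
revenue_snowpark.show()
```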
These data platforms have become something of a one-stop shop tackling most of the problems faced by data engineers, going as far as integrating container application services within their offering.
The emergence of these data platforms raises questions about the roles of both data and backend engineers as data becomes an increasingly important part of our software applications.
Two main architecture paradigms are driving the current data modernization movement: the data mesh and the data fabric.
Data platforms have looked to provide features supporting these different architectures, but the data mesh has so far been supported to a higher level of maturity, with features such as data federation or no-copy data sharing on Snowflake or Databricks. The data fabric architecture does have supporting components such as the feature store or other machine learning deployment models, however there is still a lot of room for improvement.
Imagine, for instance, a transactional database like Postgres hosted on one of these data platforms, made directly available in real time for analytical processing through a change data capture feed managed by the platform, driven by a simple piece of metadata on the transactional table, something like a tag: usage=[transactional, analytical].
A lot of the integration would be natively managed, reducing the need for data engineers to set up multiple integration jobs; the data itself would become an API, with the managed platform offering seamless integration of real-time data feeds between the transactional and analytical realms.
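As a purely illustrative sketch of how such a tag-driven feed might be declared, here is what the developer experience could look like. The data_platform module, Catalog client and set_tag/cdc_feed calls below are invented for illustration; no real platform exposes exactly this API today:

```python
# Hypothetical sketch only: the data_platform module and its API are invented
# to illustrate the idea of metadata-driven, platform-managed change data capture.
from data_platform import Catalog

catalog = Catalog.connect(account="my-account")          # hypothetical connection

# Tag the transactional table so the platform knows it should also be
# exposed for analytical processing.
catalog.table("payments.transactions").set_tag(
    "usage", ["transactional", "analytical"]
)

# The platform would then provision and manage the CDC feed itself,
# exposing the same data for analytical queries in near real time.
feed = catalog.table("payments.transactions").cdc_feed()
print(feed.status())   # e.g. "streaming into analytics.payments_transactions"
```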
This is an area that Customer Data Platforms (CDPs) have been trying to tackle for a long time, albeit specializing in a particular problem space. These platforms have been designed to be marketer friendly rather than to support data engineers and developers, but the space is changing rapidly with the increased need to process high volumes of data, growing complexity and regulation. Within the CDP space, the current direction is towards “composable” CDPs: platforms that sit on top of existing data platforms and address only the specific needs of customer data management, such as segmentation, ID resolution and data integration into marketing systems, rather than being fully fledged data platforms themselves.
With the trend towards a data fabric, the roles of data and backend engineers are likely to change significantly:
Data engineers would be able to focus more on the value-added tasks of data modelling, data quality, implementing business logic and machine learning decisioning systems, rather than on integration, dependency management and the management of infrastructure components.
This will help data engineers deal with some of the increased complexity in these areas as innovation moves past current methodologies and tooling. Data engineers are entering a time of blurred lines: in real-time data transformation, the distinction between what is a transformation and what is a table isn't always clear, with features such as Kafka tiered storage abstracting data storage from data access, and future “intelligent tiering” capabilities that could be built on top of these platforms. This evolution is similar to how distributed engines abstracted away much of the work needed for data processing.
Data engineers nonetheless need to understand how these systems and platforms behave and how best to interact with them.
Backend engineers will have to contend with a world of “data platforms”, whichever direction it moves towards. They will need to understand whether they can leverage existing platforms for their use cases and build upon them, or risk having to deal with costly external integrations.
Five years on, the future of data engineering seems to lead back to its roots: the data warehouse. The data warehouse of today is, however, quite different from the data warehouse of the past.
Today's data warehouse, the “lakehouse”, has been rebuilt from scratch to tackle the challenges of today's data volumes, carrying with it some of the best practices and approaches picked up along the way.
But if the past is any indication of the future… we should look back at the previous era of data warehouses to understand where our future might lie.
Just like in the days when Oracle databases were used for data warehousing but also to power operational applications built on PL/SQL, offering a full set of capabilities such as real-time triggers, queuing, and statistical and ML extensions, the new data platforms of this era are trying to be ubiquitous in usage, this time with their own marketplaces, big data handling, MLOps capabilities and more.
We are entering the cycle once more: long live the data warehouse.