Data modeling seems to have become a lost art among data engineers. What was once a central part of the job seems to have been relegated to a secondary concern.
Shaping data, which means developing an understanding of the underlying data and the business processes behind it, no longer seems nearly as important as the ability to move data around.
In a large number of organizations, the role of a data engineer has been transformed from a data shaper to a data mover.
Data Engineering has been a fast-evolving role, growing both in complexity and in scope of impact. The trends driving that evolution are also driving its shift from a data shaper role to a data mover role.
The data lake introduced a new paradigm for dealing with data: the "ELT" process. Instead of being deliberate about what data ended up being stored, people started treating the Data Lake as a place to dump all of their data and make it accessible to the rest of the organization.
This approach has a few advantages, but one in particular impacts the role of the Data Engineer: the organization is no longer bottlenecked by Data Engineers needing to transform the data before it can become accessible. Data Scientists and Analysts can access the raw data and do their own transformations on top of it, without having to file a ticket and wait for a Data Engineer to pick it up.
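To make this concrete, here is a minimal PySpark sketch of the ELT pattern, assuming illustrative paths and column names (event_type, event_ts, user_id): data is landed in the lake as-is, and a consumer later derives their own aggregate straight from the raw layer.

```python
# Minimal ELT sketch in PySpark. Paths and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# "Extract & Load": land the source data in the lake without reshaping it.
raw = spark.read.json("s3://lake/landing/events/")
raw.write.mode("append").parquet("s3://lake/raw_events/")

# "Transform": an analyst derives their own aggregate straight from the raw layer.
daily_signups = (
    spark.read.parquet("s3://lake/raw_events/")
    .filter(F.col("event_type") == "signup")
    .groupBy(F.to_date("event_ts").alias("day"))
    .agg(F.countDistinct("user_id").alias("signups"))
)
daily_signups.write.mode("overwrite").parquet("s3://lake/analytics/daily_signups/")
```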
The move away from RDBMSs also made it harder to enforce certain types of constraints, such as the key uniqueness needed for dimensional modeling.
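As a hedged illustration, here is one way key uniqueness can be checked or restored in the pipeline itself when the storage layer will not enforce it; the customer_dim table, customer_id key, and updated_at column are hypothetical.

```python
# Enforcing key uniqueness in the pipeline when storage cannot do it for us.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
dim = spark.read.parquet("s3://lake/dims/customer_dim/")

# Option 1: fail the load if the key is not unique.
dupes = dim.groupBy("customer_id").count().filter(F.col("count") > 1)
assert dupes.count() == 0, "customer_id is not unique in customer_dim"

# Option 2: restore uniqueness by keeping the latest record per key.
latest_first = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
deduped = (
    dim.withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")
    .drop("rn")
)
```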
Depending on the skills of the data consumers and the complexity of the data, it might be more important for Data Engineers to leverage a Data Lake/ELT approach to simply make the data accessible than to spend time shaping it.
The development of the cloud has also contributed to the shift away from shaping data. Migrations from on-premises solutions to cloud-based solutions called for data engineers adept at data movement, while SaaS data integration solutions such as Stitch or Fivetran offer pre-modeled datasets for a number of integrations.
Other contributing factors are the added need to understand the infrastructure and the new tools on offer, as well as the need to develop new skills such as infrastructure as code.
The rise in importance of Machine Learning and the development of AutoML have also pushed Data Engineers further towards being data movers.
Machine Learning became yet another area that Data Engineers need to understand and integrate data into, while at the same time AutoML reduced the need for modeling in favor of a programmatic way to engineer features.
The rise of streaming and real-time processing has also changed the way Data Engineers think about transformations. While it is still possible to do aggregations on streams, leveraging systems like Spark or Flink, most of the work performed on streams tends to be data filtering (triggers) and enrichment rather than data shaping.
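A rough Structured Streaming sketch of that pattern, filtering a stream and enriching it with a static dimension rather than reshaping it; the Kafka topic, broker address, schema, and storage paths are all illustrative.

```python
# Filter a stream and enrich it with a static dimension table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Static dimension used for enrichment.
customers = spark.read.parquet("s3://lake/dims/customer_dim/")

orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(
        F.from_json(
            F.col("value").cast("string"),
            "order_id STRING, customer_id STRING, amount DOUBLE",
        ).alias("e")
    )
    .select("e.*")
)

enriched = (
    orders
    .filter(F.col("amount") > 0)             # filtering / trigger-style logic
    .join(customers, "customer_id", "left")  # enrichment with dimension data
)

query = (
    enriched.writeStream.format("parquet")
    .option("path", "s3://lake/enriched/orders/")
    .option("checkpointLocation", "s3://lake/checkpoints/orders/")
    .start()
)
```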
The need for real-time processing also pushes some of the initial transformations away from data engineering to the core application.
Even in the age of data movers, data modeling is still relevant. It empowers more advanced data contracts and data warehousing and BI use cases, and it unlocks more advanced real-time use cases.
Data models can be seen as an extension of standard data contracts. In most instances, when exchanging data, restrictions are primarily placed on the schema; data models obtained through dimensional modeling typically add further restrictions, such as granularity, to these contracts. They are, in the end, just another API.
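A small, hypothetical sketch of what such a contract could look like in code: it validates not only the schema but also the grain the dimensional model promises (here, one row per order_id). The column names, the GRAIN constant, and the validate() helper are all illustrative.

```python
# Hypothetical contract check: schema plus grain.
from pyspark.sql import DataFrame, functions as F

EXPECTED_COLUMNS = {"order_id", "order_date", "customer_key", "amount"}
GRAIN = ["order_id"]  # the promised granularity: exactly one row per order


def validate(df: DataFrame) -> None:
    # Schema part of the contract: all expected columns must be present.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Granularity part of the contract: no key may appear more than once.
    dupes = df.groupBy(*GRAIN).count().filter(F.col("count") > 1).count()
    if dupes:
        raise ValueError(f"{dupes} keys violate the declared grain {GRAIN}")
```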
The dimensional methodology is, in general, well suited to the data warehouse. Tech companies such as GitLab, Shopify, or Picnic still make use of dimensional methodologies such as Kimball to model their data. And while newer methodologies such as "Functional Data Engineering" exist at other companies, some form of dimensional modeling still tends to be present to simplify discoverability and ease of access to the data.
Dimensional methodologies make it easier to understand how the data works and how it can be leveraged; the resulting models end up close to the users' needs.
Besides the greater discoverability and interpretability offered by properly modeled data, advanced analytics can also benefit from dimensional modeling: techniques such as denormalization provide better read performance and facilitate greater re-use of the data. This often contrasts with some of the more modern "proof of concept" approaches that data scientists choose, creating features out of raw data and integrating them directly into a feature store.
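For illustration, a minimal PySpark sketch of that kind of denormalization: the fact table is joined to its dimensions once and materialized as a wide, read-optimized table that can be re-used without repeating the joins. All table and column names are assumptions.

```python
# Illustrative denormalization: join the fact table to its dimensions once.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fact_orders = spark.read.parquet("s3://lake/marts/fact_orders/")
dim_customer = spark.read.parquet("s3://lake/marts/dim_customer/")
dim_date = spark.read.parquet("s3://lake/marts/dim_date/")

orders_wide = (
    fact_orders
    .join(dim_customer, "customer_key")
    .join(dim_date, "date_key")
)

# Analysts and feature pipelines read this table directly, without re-doing joins.
orders_wide.write.mode("overwrite").parquet("s3://lake/marts/orders_wide/")
```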
Data modeling is, however, not just useful for providing a data source from which to compute features; it also helps in engineering the features that will be incorporated into models. Being able to embed business logic typically yields higher model performance at lower computational cost.
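A hedged example of that idea: because the dimensional model already encodes a business rule (here, a hypothetical is_returning flag on dim_customer), features can be assembled from modeled tables instead of being re-derived from raw events by every model.

```python
# Illustrative feature assembly on top of modeled tables. fact_orders has a
# declared grain (one row per order) and dim_customer already encodes the
# business definition of a "returning" customer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

fact_orders = spark.read.parquet("s3://lake/marts/fact_orders/")
dim_customer = spark.read.parquet("s3://lake/marts/dim_customer/")

features = (
    fact_orders.groupBy("customer_key")
    .agg(
        F.sum("amount").alias("lifetime_value"),
        F.count("*").alias("order_count"),
    )
    .join(dim_customer.select("customer_key", "is_returning"), "customer_key")
)
```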
Dimensional modeling techniques can also be applied to real-time use cases, for instance by leveraging real-time fact tables and a lambda architecture.
The introduction of stream-stream joins allows dimensional updates to be handled and reduces the need for reconciliation patterns, while mixing streaming with a NoSQL store has unlocked quite a few additional use cases. Streaming databases take this a step further, making real-time data streams an integral part of the data model. This architecture, named the "Kappa Architecture", has its pros and cons, but what it does offer is a simplified approach for operating on real-time data, at the cost of lower accuracy.
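As a sketch of the stream-stream join pattern mentioned above, here is what joining a stream of orders to a stream of dimension updates could look like in Spark Structured Streaming; topics, schemas, and watermark durations are illustrative, and both sides need watermarks so the engine can bound its join state.

```python
# Stream-stream join applying dimension updates to a fact stream, instead of
# reconciling facts against the dimension in a later batch job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def read_topic(topic, schema):
    # Helper (hypothetical): read a Kafka topic and parse its JSON payload.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", topic)
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("v"))
        .select("v.*")
    )

orders = read_topic(
    "orders", "order_id STRING, o_customer_id STRING, order_ts TIMESTAMP"
).withWatermark("order_ts", "1 hour")

updates = read_topic(
    "customer_updates", "c_customer_id STRING, segment STRING, update_ts TIMESTAMP"
).withWatermark("update_ts", "24 hours")

# Join each order to a customer update that preceded it by at most 24 hours.
joined = orders.join(
    updates,
    F.expr(
        """
        o_customer_id = c_customer_id AND
        update_ts BETWEEN order_ts - INTERVAL 24 HOURS AND order_ts
        """
    ),
)
```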
With the arrival of data lake table formats such as Delta Lake, Hudi, and Iceberg, and MPP databases such as Snowflake, the distinction between data lakes and data warehouses is becoming fuzzier.
These new systems, sometimes referred to as "Lakehouses", bring some traditionally warehouse features onto the data lake, such as ACID transactions, or temporal tables that make point-in-time queries easier.
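For example, Delta Lake exposes point-in-time reads through its time travel options; the path, timestamp, and version number below are illustrative, and Hudi and Iceberg offer similar "as of" reads through their own mechanisms.

```python
# Point-in-time reads with Delta Lake time travel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as it looked at a given instant...
dim_as_of_time = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-01-01 00:00:00")
    .load("s3://lake/marts/dim_customer/")
)

# ...or pin the read to a specific table version.
dim_as_of_version = (
    spark.read.format("delta")
    .option("versionAsOf", 42)
    .load("s3://lake/marts/dim_customer/")
)
```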
The convergence is still not complete: these systems are still missing certain features of the data warehouse, such as the locking of primary keys, multi-table transactions, or the handling of foreign keys.
Overall, these new features bring the world of the data lake closer to the data warehouse and ease some of the modeling use cases. Still, compared to traditional data warehousing, some level of adaptation will be needed to work around the missing features.
These trends and the increasing complexity of the data engineer's role have pushed it towards broader coverage of potential use cases, rather than mastery and depth in data modeling and ETL transformations.
Coupled with emerging tools such as dbt that make it easier to orchestrate a series of transformations, other roles such as the Analytics Engineer have emerged to take on part of the modeling needs. This role, while typically less technical than that of a data engineer, still requires an engineering mindset and knowledge of engineering best practices, and people who fit it are still hard to find.
Data and dimensional modeling are still relevant these days. Even in the age of the data lake and of real-time processing, there are still plenty of use cases for engineers who know how to model data rather than just move it. These engineers are, however, harder to find amid the ever-increasing responsibilities of Data Engineers.