WiseAnalytics | The advent of the (Big) Data Architect


The advent of the (Big) Data Architect


By Julien Kervizic

About three years ago, Maxime Beauchemin wrote “The Rise of the Data Engineer”. Since then, the Data Engineer job has become more and more complex. Domain-specific expertise has pushed for separate job functions such as Machine Learning Engineer, and the cloud has pushed the boundary of the role towards DevOps and DataOps.

At the same time, the need for data from a business and product perspective has pushed the role from being reporting- and analysis-oriented to being production- and action-oriented. Data is now increasingly leveraged in real time, for example to fire marketing triggers, while machine learning models are built to produce product recommendations, …

This increase in the demand to leverage data, and in the amount of resources committed to that goal, has created the need for a new role meant to oversee this transformation in the use of data from insights into production. This has given rise to the Data Architect who, now faced with a plethora of options, is responsible for the technology and integration choices that make this dream a reality.

While the role of a Data Architect can vary depending on the organization, it usually encompasses some of these components:

Help define the right choice of technology stack
Define the data structures and data-flows
Act as a technical product manager
People Management

The role of the Data Architect requires being very close to both the business and the development side of the equation.

Technology Stack

DataStores:

Moving away from the traditional days of the RDBMS, the choice of data stores has increased more than tenfold, and the options now differ along multiple dimensions that need to be evaluated:

Databases come in SQL and NoSQL variants; NoSQL variants are further split into different types such as column, key-value, document, or graph databases.
Databases can have SMP (Symmetric Multi-Processing) or MPP (Massively Parallel Processing) engines, Presto being an example of the latter
Can be in-memory, such as Redis, MemSQL or Apache Ignite, or leverage on-disk data to handle larger datasets
Can have variants written in different programming languages, such as Cassandra (Java) and ScyllaDB (C++), offering a trade-off between feature richness and performance
Can offer varying read and write throughput, and differ in how they balance fast access against processing latency
Can favor different guarantees of the CAP theorem (consistency, availability, partition tolerance)
Can be offered as a fully managed service from cloud providers, such as Amazon DynamoDB, or be self-hosted
...

The number of available data stores keeps growing, and figuring out which one is fit for purpose for a specific project or product, and for the organization, needs careful consideration. The role of the Data Architect is to understand the trade-offs between the different systems, how they apply to the product or project, and the impact of introducing new components into the ecosystem, and then to suggest the right course of action.
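One common way to make such a trade-off discussion explicit is a weighted decision matrix. The sketch below is only an illustration: the criteria, the weights, the candidate names (store_a, store_b) and their scores are all hypothetical placeholders, not measurements of any real product.

```python
from dataclasses import dataclass

# Hypothetical criteria and weights (they sum to 1.0); real ones come from
# the requirements of the project at hand.
WEIGHTS = {
    "read_latency": 0.3,
    "write_throughput": 0.3,
    "operational_effort": 0.2,
    "team_familiarity": 0.2,
}


@dataclass
class Candidate:
    name: str
    scores: dict  # criterion -> score from 1 (poor) to 5 (good)


def weighted_score(candidate, weights):
    """Sum of weight * score over all criteria; fails loudly on missing scores."""
    missing = set(weights) - set(candidate.scores)
    if missing:
        raise ValueError(f"{candidate.name} is missing scores for: {sorted(missing)}")
    return sum(weights[c] * candidate.scores[c] for c in weights)


def rank(candidates, weights):
    """Return candidates ordered from best to worst weighted score."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)


# Placeholder candidates with made-up scores, for illustration only.
candidates = [
    Candidate("store_a", {"read_latency": 5, "write_throughput": 2,
                          "operational_effort": 4, "team_familiarity": 3}),
    Candidate("store_b", {"read_latency": 3, "write_throughput": 5,
                          "operational_effort": 2, "team_familiarity": 4}),
]

for candidate in rank(candidates, WEIGHTS):
    print(f"{candidate.name}: {weighted_score(candidate, WEIGHTS):.2f}")
```

The value of such a matrix lies less in the final number than in forcing the criteria and their relative importance to be written down and discussed.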

Message Brokers and Streams

Message brokers and streams are a cornerstone of real-time data processing. There are a few main message brokers and streaming platforms out there:

Apache Kafka and RabbitMQ
Amazon SQS and Kinesis
Azure Service Bus and EventHub
Google Pub/Sub

Data Architects need to understand the different features and trade-offs between these message brokers, know how they should be configured, and establish patterns for using them.
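As an example of such a usage pattern, here is a minimal sketch of an idempotent consumer, which is useful whenever a broker may deliver the same message more than once. It assumes an in-memory stand-in (InMemoryBroker) rather than any real broker client, and the message_id and payload field names are hypothetical.

```python
import json
from collections import deque


class InMemoryBroker:
    """Stand-in for a real broker (an assumption for this sketch).

    Every published message is enqueued twice to mimic duplicate delivery
    under at-least-once semantics.
    """

    def __init__(self):
        self._queue = deque()

    def publish(self, message: dict) -> None:
        payload = json.dumps(message)
        self._queue.append(payload)
        self._queue.append(payload)  # simulated redelivery

    def poll(self):
        """Yield messages until the queue is empty."""
        while self._queue:
            yield json.loads(self._queue.popleft())


class IdempotentConsumer:
    """Processes each message_id at most once, ignoring redeliveries."""

    def __init__(self):
        self._seen_ids = set()
        self.processed = []

    def handle(self, message: dict) -> bool:
        message_id = message["message_id"]
        if message_id in self._seen_ids:
            return False  # duplicate: skip side effects
        self._seen_ids.add(message_id)
        self.processed.append(message["payload"])
        return True


broker = InMemoryBroker()
broker.publish({"message_id": "order-1", "payload": {"order_total": 40}})
broker.publish({"message_id": "order-2", "payload": {"order_total": 15}})

consumer = IdempotentConsumer()
for message in broker.poll():
    consumer.handle(message)

print(consumer.processed)
```

In a real system, the set of seen identifiers would need to live in durable storage shared by all consumer instances; keeping it in memory is a simplification of this sketch.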

Processing Layer

There are different choices to be made in the way the processing layer is set up: should jobs and applications run on bare metal, on VMs, on an environment like Mesos or YARN, on a containerized environment such as Kubernetes, or through serverless functions or applications? Each of these has its own trade-offs in terms of performance, ease of management, and the level of control it offers.

On top of this, decisions need to be made on the type of processing framework to use: should a well-known but not extremely scalable framework such as Pandas be used, should a more distributed abstraction such as Dask or Apache Spark be used, or would MapReduce be more appropriate?
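To make the MapReduce option more concrete, the sketch below shows the programming model (map, shuffle, reduce) on a single machine in plain Python. It illustrates the model only, not a distributed framework, and the function names are chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


records = ["kafka spark kafka", "spark dask"]
counts = map_reduce(records)
print(counts)
```

In a distributed setting, the map and reduce phases run in parallel on different nodes and the shuffle moves data across the network, which is where much of the cost of this model tends to sit.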

ETL & Data Orchestration

There are quite a few choices to be made around how data is sourced and ETL/ELT is done and orchestrated.

Do you use a built-in ETL tool such as Microsoft SSIS or Oracle Data Integrator, a cloud-based ETL tool such as Azure Data Factory or AWS Glue, an automated ETL service such as Fivetran or Alooma to get your raw data, or an open-source tool such as Singer?

Or do you rather rely on custom code, in which case a choice needs to be made between cron jobs, triggers such as file-drop triggers, a workflow tool such as Airflow, Luigi or Prefect, or a cloud workflow management tool such as Azure Logic Apps?
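What workflow tools add over plain cron jobs is, at their core, dependency-aware ordering of tasks. The sketch below illustrates that idea with Python's standard graphlib module (Python 3.9+). It is not how any of the tools above are implemented, and the pipeline and task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_staging": {"extract_orders", "extract_customers"},
    "build_report": {"load_staging"},
}


def run_pipeline(dag, tasks):
    """Run every task after all of its dependencies; return the execution order."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Placeholder task bodies; real tasks would extract, load or transform data.
tasks = {name: (lambda n=name: print(f"running {n}")) for name in PIPELINE}
order = run_pipeline(PIPELINE, tasks)
```

A full orchestrator layers scheduling, retries, backfills and monitoring on top of this ordering, which is usually what justifies adopting one over custom scripts.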

Peripheral Applications:

In certain situations, the Data Architect might be responsible for some peripheral applications that either consume or surface data (or both). These can include business intelligence tooling, marketing automation tools, micro-services, customer and data management platforms, …

Handling the choice of these components:

To handle the choice of these components, it is important for Data Architects to trial some of them through proofs of concept, and to leverage their experience to understand which solution would be more appropriate for both current and future requirements.

Data Structures, Entities and flows

In this data era, data flows are growing in complexity and the need for integration keeps increasing, along with the number of integration patterns available. Data Architects are there to help define how the data should be stored and how it should flow between the different systems.

Data Structures

There is an increasing need for structuring the data available throughout companies, from the data available client-side through a dataLayer, to how data is stored within the different systems of the company or how it is transferred between them.

DataLayer: be it to provide data to Google Analytics or to a different set of tags, such as Facebook or Google Ads, there is a need to structure the information, particularly when dealing with multiple websites. Defining a common set of data structures helps avoid mistakes and a lot of rework due to customization.

Database structure: Architects are responsible for defining what data needs to be stored within the databases and how it should be stored, for example using normalized or denormalized schemas. This is to ensure that the applications can meet both current and future requirements, while remaining consistent and performant.
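The difference between a normalized and a denormalized schema can be sketched with an in-memory SQLite database. The tables, columns and rows below are hypothetical and only serve to show the same question answered against both layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: customer attributes are stored once and referenced by key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        country TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount REAL NOT NULL
    );

    -- Denormalized: the country is copied onto every order row.
    CREATE TABLE orders_wide (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        country TEXT,
        amount REAL
    );

    INSERT INTO customers VALUES (1, 'NL'), (2, 'FR');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 8.0);
    INSERT INTO orders_wide
        SELECT o.order_id, o.customer_id, c.country, o.amount
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")

# Normalized layout: revenue per country requires a join.
normalized = conn.execute("""
    SELECT c.country, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country ORDER BY c.country
""").fetchall()

# Denormalized layout: the same question is answered from a single table.
denormalized = conn.execute("""
    SELECT country, SUM(amount)
    FROM orders_wide
    GROUP BY country ORDER BY country
""").fetchall()

print(normalized)
print(denormalized)
```

Both queries return the same totals. The normalized layout keeps a single source of truth for the customer's country at the cost of a join, while the denormalized layout reads from one table but has to be kept in sync whenever a customer attribute changes.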

Message structures: Data needs to be transferred between applications to achieve various tasks. Along with engineers and external parties (where applicable), Data Architects help define the structure of these messages.
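A message structure of this kind can be written down as an explicit, versioned contract. The sketch below assumes a hypothetical OrderCreated event serialized as JSON; the field names and validation rules are illustrative, not a prescribed standard.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OrderCreated:
    """Hypothetical event contract agreed between a producer and its consumers."""

    schema_version: int
    order_id: str
    customer_id: str
    amount_cents: int
    occurred_at: str  # ISO 8601 timestamp

    def __post_init__(self):
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")
        # Raises ValueError if the timestamp is not valid ISO 8601.
        datetime.fromisoformat(self.occurred_at)


def serialize(event: OrderCreated) -> str:
    """Encode the event as JSON with stable key ordering."""
    return json.dumps(asdict(event), sort_keys=True)


def deserialize(raw: str) -> OrderCreated:
    """Decode and validate; unknown or missing fields raise TypeError."""
    return OrderCreated(**json.loads(raw))


event = OrderCreated(
    schema_version=1,
    order_id="o-42",
    customer_id="c-7",
    amount_cents=1999,
    occurred_at=datetime(2020, 1, 1, tzinfo=timezone.utc).isoformat(),
)
wire = serialize(event)
print(wire)
```

Carrying an explicit schema_version field is one possible way to let producers and consumers evolve the structure independently; dedicated schema formats and registries are an alternative not shown here.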

Integration patterns

The Data Architect is also responsible for the choice of integration patterns.

These choices involve the means of data transfer, i.e. should data be transferred by FTP, API, message brokers or streams, direct access to database tables, or front-end tags?

But they also involve lower-level details. For instance, if using an API, should it be inbound or outbound, should it be stateless, and should it be a REST, SOAP, or GraphQL API?

Integration flow

The architect should be responsible for designing the integration flow: whether the data needs to be provided to a message broker first, needs to land in a staging table, or needs to be written back to the source systems.

Technical Product Manager

Handle communication, collaboration and coordination

Architects need to communicate with both their business and development counterparts. They serve as the translation layer between what is required from a business point of view and how it should be accomplished in practice; as such, one of the roles of the Architect is to translate business requirements into technical requirements.

Architects make use of different architecture artifacts to communicate these requirements to the development team, such as context, application, sequence, or class/object diagrams.

They are also responsible for aligning roadmaps and requirements between internal teams and third parties.

Handle the strategic vision

The role also requires architects to manage the strategic vision for the applications and platforms being set up and developed.

They need to think about how they will cater for future and perhaps unforeseen requirements, think about what capabilities to enable and understand how the business or development team would benefit from them.

Technical Requirements

Architects set forth the technical requirements. In the case of an integration, for instance, they provide the structure of the data feeds and define the integration pattern to use; in the case of systems like APIs, they define the integration flow to handle. Architects are also there to handle potential trade-offs arising during the implementation phase, make decisions, and adjust the requirements accordingly.

Review and Validation

Architects review the work done by the development team, through code and manual tests, to validate that the integration is working. They ensure that the application creates the expected message with the appropriate mapping and transformation, and that the approach taken during development satisfies the non-functional requirements, such as properly handling the expected load for the application, being resource-efficient, and being up to the security standard…

People Management

Generally, architects do not tend to be people managers. However, this is not always the case: in some situations, data architects, as senior IT professionals, need to manage people.

To some extent, the role can be defined as more of a Data Engineering or software engineering Manager. In this sense, Architects need to provide guidance and mentor more junior employees.

Wrap up

The role of an architect can be quite varied, but the overall goal is to lead the development effort in terms of how the data is transferred, processed, and stored. Architects need to work alongside business stakeholders and the development teams to guide that effort.
