WiseAnalytics | Real-time Data Pipelines — Complexities & Considerations

Insights

Real-time Data Pipelines — Complexities & Considerations

29 min read

By Julien Kervizic

A conceptual illustration of real-time data pipelines, showing interconnected nodes and data streams represented by glowing lines, all connected to a central hub symbolizing processing and analytics. The background is a mix of dark and vibrant colors to highlight complexity, with icons symbolizing data storage, servers, and cloud computing. The scene is futuristic and clean, evoking high-tech and efficient systems.

‍

The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data flows brings a paradigm shift and an added layer of complexity compared to traditional integration and processing methods (i.e., batch).

There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storing, and serving of that data. It brings about specific operational needs and a change in the way data engineers work. These should be taken into account when considering embarking on a real-time journey.

Use cases for leveraging Real-time Data

Streaming data integration is the foundation for leveraging streaming analytics. Specific use cases such as Fraud detection, contextual marketing triggers, Dynamic pricing all rely on leveraging a data feed or real-time data. If you cannot source the data in real-time, there is very little value to be gained in attempting to tackle these use cases.

Besides enabling new use cases, real-time data ingestion brings other sets of benefits, such as a decreased time to land the data, need to handle dependencies, and some other operational aspects:

If you don’t have a real-time streaming system, you have to deal with things like, okay, so data arrives every day. I’m going to take it in here. I’m going to add it over there. Well, how do I reconcile? What if some of that data is late? I need to join two tables, but that table is not here. So, maybe I’ll wait a little bit, and I’ll rerun it again. — Ali Ghodsi on a16z

Infrastructure for Real-time Data flows

ClickStream Ingestion: Ingesting clickstream data often require a specific infrastructure component to be present to facilitate that. Snowplow and Dilvote are two open-source clickstream collectors. Simultaneously, Google Analytics 360 allows raw data export of clickstream data to BigQuery, and some CDPs like Segment or Tealium allow to capture and export clickstream data to streams or databases.
Ingestion framework: Frameworks such as Apache Flumes, Apache Nifi, offering features such as data buffering and backpressure, help integrate data onto message queues/stream.

“When I introduce Nifi to people, I usually say that Nifi is the perfect gateway to get the data in.” — Pierre Villard, Senior Product Manager at Cloudera

Message Bus / Streams: A message bus, streams is the component that will serve to transfer the data across the different components of the real-time data ecosystem. Some of the typical technologies used are Kafka, Pulsar, Kinesis, Google Pub/Sub, Azure Service Bus, Azure Event Hub, and Rabbit MQ, to name just a few.
Processing: Different processing framework exists to simplify computation on data streams. Technologies such as Apache Beam, Flink, Apache Storm, Spark Streaming can significantly help with the more complicated processing of data streams.
Stream querying: It is possible to query streams directly using SQL, like the type of languages. Azure Event Hub supports Azure Stream Analytics, Kafka KSQL, and Spark offers Spark Structured Streaming to query multiple types of message streams.
Decision Engine: real-time actions need real-time data and a way to process this information systematically. Decision engines help make the incoming flow of data actionable. There are two main types of decision engines stateless (e.g., CLIPS, Easy Rules, Gandalf) and stateful decision engines (e.g., Drools).
ML Framework + Processing: Machine learning models can be leveraged within a real-time architecture. They can help make better decisions by calculating scores, such as the propensity to fraud. Different types of framework exists with varying degrees of sophistication, such as XGBoost, Tensorflow or Spark MLib.
Data Store: Depending on the specific integration needs, leveraging real-time data might require some fit for purpose data stores. Specific OLAP type of database such as Druid, might be required to do slice and dice analytic on the incoming data, HTAP datastore such as Kudu, Cassandra or Ignite might be required for handling specific enrichments, Elastic Search for needle in the haystack type of queries, S3 for long term archival purposes, RDMBS or even leveraging the stream directly (using Kafka directly for instance).
Query Federation: With such a diverse ecosystem of datastores, having the ability to query them using the same interface and tool becomes a growing need. Tools such as Spark and Presto provide this type of query federation.
Dashboarding: Different types of dashboards are available to handle real-time use cases. While it is still possible to leverage traditional dashboarding solutions such as Tableau, solutions such as Grafana or Kibana are usually more appropriate.

Ingestion

Source of data

There are different sources of data that can be leveraged in a real-time pipeline. Data can be sourced from external services, internal Back-end applications, front-end applications, or databases—the type of source dictating the available integration patterns.

External Applications: There are different ways external applications might be integrated into a real-time pipeline. The typical ways rely on webhooks, creating a specific API consumer, or having them publish directly onto a stream/message queue in a “firehose” manner.

Internal backend Applications: Internal back-end applications have quite a few ways to publish events to other applications by calling an API, connecting directly to a stream, or leveraging an integration SDK.

Front-end: Real-time event ingestion from the front-end is typically handled by combining an event ingestion framework (e.g., snowplow), Tracking Pixel, and Tracking Script. Besides allowing the capture of granular click data, this type of approach has the added benefit of allowing some ad blockers' bypass.

Database: To ingest data in real-time from Databases, it is possible to leverage the Database bin logs. Database bin logs contain the records of all the changes that happened on the database. Bin logs have traditionally been used in database replication but can also be used for more generic real-time data ingestion.

Infrastructure

To ingest streaming data, including from the front end, several components are needed:

A collector application, essentially an API that is there to receive data from the front-end and any back-end application that wishes to call it through a web service.
a message broker to transport the data across applications in real-time
A schema registry to validate the events coming in.
a (typically front-end) SDK to send the information in a structured way to the event collection pipeline
A tracking pixel to track activity where Javascript might not (always) be enabled.

Collector Application: There are multiple ways to set up the infrastructure for Streaming ingestion. Basic collector applications can be set up in ~10 minutes (including schema validation) using low/no-code kind of toolings. More extensive setup can be obtained through leverage open source components such as Snowplow, for instance.

Message Brokers: Regarding the message brokers, they come into different variations; some, for instance, are more oriented towards “stream” processing, more easily supporting the replay of events. This is the case of solutions like Kafka and Kinesis; they are particularly well suited for computation of real-time aggregates and dealing with architecture patterns, such as the Kappa Architecture, an architecture pattern dealing with sourcing data directly from historical streaming data.

The main drawback of dealing with this type of message broker is their poor “transactional” handling, making it ill-suited to deal with high-value messages type of integration. For instance, if you were looking to integrate into a CRM system, it would be beneficial to know that a given order couldn’t be pushed to the target system and that after 10 retries, it had still failed to reach. More transactional types of message brokers such as Service Bus offer these kinds of features such as transactional locks and dead letter queues.

Dealing with different types of messages also has implications as to how the messages broker would be configured. Message brokers such as Kafka can offer different “delivery guarantees,” providing guarantees as to whether a message can be processed at least, at most, or exactly once. Depending on the options chosen, it may lead to a higher likelihood of event duplications.

For messages brokers, not strong in the transactional aspect. In case messages are not getting processed correctly, they need to be handled into a separate stream and stored for further processing, hoping that they will then be processed correctly by that further stream after the fact or “curated” ad-hoc, increasing overhead.

Schema validation: With regards to schema validation, different types of tooling exist to help with the management of schema and enable schema validation in real-time such as iglu, confluent schema registry, … Kafka, for instance, has direct integration with a schema registry and provides a built-in way to do schema validation in which case, validation errors can be thrown directly to the producer providing feedback as to the issues. Other approaches rely on a more asynchronous way of validation.

SDK: A SDK is particularly useful to simplify the integration of data flowing through the real-time pipelines. This is particularly the case when dealing with front-end events. SDK simplifies the integration of events into a common structure, automatically providing attributes to these events based on context (think about attributes such as whether or not the environment is a production one, what is the userId associated with the events, what is the IP of the client). Many of these enhancements are already pre-build in SDKs such as Google Analytics’, but quite a few data-driven companies take it to themselves to develop their own to power their data collection initiatives. For instance, this is the case of slack, which creates specific libraries for tackling front-end event data acquisition.

Tracking Pixel: A tracking pixel provides a way to extend one’s tracking off the site, where Javascript might not be supported. This is, for instance, the case in email clients in which Javascript is very rarely supported.

It is worth noting that the decisions as to what component to include within the infrastructure is highly dependent on whether or not there is some specific need to track front-end clickstream data directly and if there is an advantage of getting extra speed from leveraging an existing ecosystem.

For instance, if there is only the need to source data from back-end applications, there might not be a need for an event collector, SDK, or tracking pixel. The backend application can be responsible for validating its front-end component and integrating directly into a message topic/stream. Most message servers such as Kafka can handle concurrent write from diverse applications. Therefore, they can support receiving messages directly from multiple applications without the need for a centralizing “event collector.”

When looking to source clickstream data from the front-end, applications such as snowplow offer additional feature already built in such as Geolocation lookup, tracking scripts, pixel, and plugins in applications such as DBT to offer common types of processing on the data, or google analytics to provide a simple hook to onto GA’s tracking to provide granular website visit data.

Event content

Handle different types of events differently.

It is good to differentiate between the different types of events, be them technical events, front-end events or backend ”business events,” or surrogate database update events.

It is important to provide a structure that can easily differentiate these different event kinds, to, as much as possible, handle their ingestion and changes programmatically.

Different types of events also need different kinds of information. Database events to be fully traceable require to have fields related to the date of creation and update. Front-end data might need specific data such as UserAgents, IP address, etc., to be enriched to be fully effective (e.g., geolocation, devices …) or client time to more accurately calculate.

They also have different storage needs; technical events most likely only need to be stored in hot (such as Elastic Search) and cold(er) storage (such as S3), often only requiring summary metrics to be exposed into “warm” storage. While business events and database entities changes most often need to be stored in warm storage (like a data warehouse).

Standardization and Planning

To minimize the work needed to maintain the real-time dataflows, it is important to take several steps towards a schema definition and establish data contracts between the originating applications and the consuming applications such as the data warehouse.

In this context, special attention needs to be paid to the event structure. Setting up the right event structure can avoid extra work, make the data more accessible and improve the general data quality.

The data planning: This is the first step towards defining a common understanding of the data that is going to be sent. This step allows making sure that the requirements are going to be fulfilled by the data being sent. The added visibility allows to create the appropriate downstream table structures properly, implementing the data quality checks, and making the right inference of the resulting data.

It is important during the data planning phase to understand the different business processes around the data. This is needed to provide the right insights as to how to process the data. Walmart, for instance, explained that as part of its process for preparing for an event-driven data.

Structuring Event Schemas: There are multiple trade-offs to handle when defining schemas for events, their generalization vs. specificity, how the information contained within them will ultimately be processed and accessed, the flexibility of adding new information to the messages without impacting running processes, and the adaptability of messages to future requirements.

Although created for a completely different purpose (use of structured data for SEO purposes), the schemas provided by schema.org often provide a good starting point to provide the structure for an event, and it’s attributes.

Depending on how the data will be processed, there should be some consideration as to whether to allow deeply nested structures within the schema or whether it would be needed to flatten them or promote some of their key attributes to the root of the event. SQL, for instance, is not particularly friendly with dealing with deeply nested structures.

Providing an extra_data property within event schemas allows events to handle a certain degree of structure, for instance, for properties that need to be generalized across multiple sources, events, or that need a strong enforcement.

Event Generalization: As the amount of data and its complexity grows, it is important to make sure that data is sent generically to mitigate the downstream impact of changes.

The generalization makes it easier to apply enrichments or processing downstream without too many adaptations, for example, to calculate A/B test experimentation results out of the ingested data.

It also makes it easier to analyze directly from raw data; imagine the comparison between having to deal with an event called “customer_interaction”, containing all channel interactions (channel in: email, phone, SMS, in-store…) and type of interactions (e.g., WELCOME, PURCHASE…), versus having to go through 10s of events like SMS_WELCOME or STORE_ADVICE. Having particular events makes it much harder to analyze and ensure that every channel/interaction type will be included in the analysis.

Depending on the context, the implementation of this type of event generalization typically happens either through the setup of specific contracts (specialized schema/APIs) or through the incorporation within specific logging framework / SDKs of specialized entities target. The importance of a specific logging framework is not to be under-estimated. They provide the starting point towards enabling and enforcing specific event structures.

Naming conventions: Naming conventions help people get a quicker and better understanding of the data and assist the downstream processing of events and ad-hoc analysis. They apply to both the event fields as well as the content of these fields.

Take, for instance, the field utm_campaign. The field name itself indicates that it relates to the campaign that brought the visitor to the website. Its’ content, a string, is likely as well to be obtained through naming conventions. An utm_campaign parameter such as ENNL_FB_Retargeting_Cat_BuyOnline can, for instance, indicate the language of the campaign (EN), the country of targeting (NL), the platform of targeting (FB), the mode of targeting (re-targeting in contrast, for instance, with demographic or interest targeting), the mode of delivery (Catalogue ads on FB) and the type of the Campaign BuyOnline.

In the above example, these naming conventions make the information easier to consume than containing merely an identifier and does not require some metadata table to be loaded in to be interpretable.

Event Grammar & Vocabulary: Building an event grammar is the next step to make event data more understandable and increase the data's degree of generalization. The event grammar describes the different interactions and the different entities. The activity stream standard provides a structure and vocabulary for capturing different events “Actions”.

Technical Events

Technical events often need to be ingested and surfaced so that the engineering team has access to the proper log data to debug their code. Sometimes, given the type of integration and nature of the application, it might be necessary to leverage historical logs and look back at what happens a month or a year ago for a given person or process.

This can be the case when dealing with a subscription type of product. Questions, such as there was consent given to store a payment token, did it end up getting saved properly, what was its’ expiration date' questions that may arise when dealing with a payment failure now.

Some developers advocate for logging everything that’s happening within an application, including trace and unique sequence numbers, capturing every hop's meta information, such as how long a specific request took.

The way technical events tend to get accessed, store, or process does not usually require the same strictness of structure that business or database change events would require, instead favoring the ability to search through them for specific occurrences. These requirements should be reflected in a more flexible schema, and a need for less strict field level validations than for business events.

Front-end Data and tie into google analytics

When looking to capture front-end events, a question that often comes up is how to ensure that we can capture and gain access to all the raw events pushed to Google Analytics.

There are different approaches to achieving such a goal, all with a different tradeoff between flexibility and integration cost. Depending on the approach taken, leveraging streaming data ingestion from the front-end may duplicate work or may end up in some events not being propagated to the Data lake.

Google Analytics 360 — Provides a way to export the raw events data from Google Analytics in near real-time. Google Analytics 360 has direct integration with Big Query making the data directly and easily accessible. This approach has the same restriction regarding what data can transit through this integration (e.g., no PII) as Google analytics and requires the upfront cost for the Google Analytics 360 suites (~$150k/year).
Google Analytics Custom Task: Possible integration patterns such as creating custom tasks for google analytics, to add another receiver to the data exist, allowing to hook into Google Analytics — but provide less flexibility in terms of providing additional attributes to be sent to the backend and putting the tracking script at the same consent restriction as google analytics. Google Analytics tracking scripts are furthermore often blocked by several adblockers, providing poor events coverage.
Tag Manager-ish: middle-ground approach such as using a Tag Manager, Custom Tracking script, and leveraging the same Data Layer as Google Analytics, provide added flexibility in terms of the selection of attributes, and on how to deal with GDPR consent restrictions, at the cost of added configuration in the Tag Manager. A similar approach can also be made server-side when leveraging specific Customer data platforms such as Segment, which allows routing specific data such as Google Analytics and a Database.
Specific Integration: Creating a separate integration provides the most flexibility of all, decreases the likelihood of having the tracking data blocked by adblockers, but comes at the cost of an increased integration effort.

Business Backend Events

For “business” backend events, i.e., specific “factual” events, it is important to leverage some type of standardization when dealing with events. When integrating specific types of data flows, leveraging a common unified or canonical model helps secure a perennial integration dataflow for each type of event.

For instance, an e-commerce website might be interested in specific events such as Order Purchased, Order/Item Shipped. These events can be standardized, to a certain extent, across different solutions into these canonical models. This standardization also applies to the minimum set of attributes provided for each event, such as the specific shop id, for instance.

Other types of information are usually necessary to provide when pushing these types and needing to correlate them across time, particularly when dealing with a distributed application context.

It is important to make sure that instead of relying on a created_at timestamp either generated by the applications/event processing pipeline but rather on leveraging sessions/actions and sequence numbers. Relying on created_at timestamp can lead to the wrong assumption of ordering in a distributed context. The impact of relying directly on create_at timestamp is highly dependent on 1) the data velocity of the application, 2) on the overall ingestion speed, 3) on the technical infrastructure setup — whether, for instance, session affinity has been configured for the infrastructure.

Database change events

Database change types of events should be provided in an event programming CDC type of way. There are a few reasons for leveraging an event programming approach to CDC, rather than leveraging bin-logs directly (typical CDC):

1. Decoupling the implementation of the operational data store and the dataflows.

2. Providing additional information that we might not want/need stored within the operational datastore.

In this approach, the underlying database entity changes end up being reflected in the data platform through events operation such as “creation”, “update”, “deletion”. Therefore, there should be a separation between the different event names, database target entity, and the type of change being applied.

The way updates are performed on the databases should be provided consistently, i.e., the fields propagated as events should either contain only the changed fields or a holistic view of the entity (last snapshot) or provide a clear distinction of how they should be processed.

The logging framework / SDK should force the inclusion of certain data fields, such as “created” and “updated” timestamp originating from the database entity.

Having a consistent set of operations, and these different timestamps and means of integration, allows to propagate and integrate the changes originating on the different domains back onto the data platform.

Control structure

There needs to be a certain level of control and checks — Data Ops to leverage events and ensure a decent data quality level.

End 2 End testing

When leveraging real-time ingestion, it is crucial to put the right safeguards to avoid having regressions on the data feeding into the data platform. End to End tests encompassing the data production journey from the originating application to the datastore should be performed to safeguard the quality of the data ending up in the data platform.

Validations, Data Quality checks, and Data Monitoring

Handling data being pushed through events requires some validation in place. Schema and attribute should be validated. Schemas can be validated through native message broker capabilities (e.g., Kafka) or through specific applications (e.g., Snowplow); validating attributes is a more complex affair.

We need to distinguish static attribute validation, which can be included in a schema such as AVRO/JSON or Protobuf for validation. More dynamic types of attributes require another form of validation altogether.

To a large extent, in dynamic validation, most of the onus for validating the data should be on the originating application. Nevertheless, a process should be set up on the receiving end (i.e., the data platform) to ensure that it conforms to expectations.

To ensure that the data platform can perform its own share of the validation process, it is important to have specific rather than generic definitions or the different events being sent, for instance, to ensure that that field truly corresponds to its content (e.g., an order_id ending up in a cart_id field).

Data validation and monitoring do not just stop at Schema and attribute validations. Automated checks should be performed to identify whether certain attributes haven’t been sent in a period of X days or referential integrity between the different events. Regressions happen in code, and setting up a proper control structure help minimize the impact, or change in the application might not have been properly communicated or handled.

When dealing with events type of sources, validations and checks should happen in multiple layers :

at the originating source: To ensure that the data received conforms to expected values and/or the business logic
at ingestion time: Ensuring that schema is expected and that the data platform is not receiving unexpected events.
and post-ingestion: Checking for things such as referential integrity, as these assumptions end up being relaxed during the transfer of data, or lifecycle checks to ensure that all the events were received for a given series of actions.

Processing

Real-time processing

There needs to be a certain level of control and checks — Data Ops to leverage events and ensure a decent data quality level.

Control structure

Stream / Message Enrichment

Streams typically need to be enriched to provide additional data meant to be used in realtime. They can either do lookups on additional services, databases, do first stage ETL transformations, or add machine learning scores onto the stream.

Enrichments of messages typically happens through a producer/consumer or publisher/subscriber type of pattern.

These enriched applications can be coded in any language and often do not require a specific framework for this type of enrichment. Although specialized frameworks and tooling exist, such as Spark Streaming, Flink, or Storm, for most use cases, a normal service application would be able to perform adequately without the overhead, complexity, or the specific expertise of a streaming computation framework.

Stateful Enrichment and Cleanup

Stateful enrichments and cleanup of the data might be needed to be used downstream.

Stateful enrichment: Event-driven applications might need to consume data containing data enriched with historical data (i.e., state). Think of a potential trigger providing you a discount to purchase a product if you have been visited the website at least three times in the past 24 hours without ordering. The downstream application that will ultimately decide whether to offer you the discount will need to know that you are currently visiting the website (real-time event), your history of visit (X visits in 24 hours), and your order history (has ordered in the past 24hours) to decide to offer a discount or not.

Stateful cleanup: This can be the case when attempting to use customer data coming from multiple sources to be used in CRM systems that want to leverage a 360 view of the customer, for instance, to leverage contextual marketing triggers:

“The success of a digital business is all about being relevant to the customer at every interaction. Relevance is contextual. Therefore we start with a fundamental requirement of being real-time and be able to respond to events in the customer’s timeline rather than the marketing campaign timeline. That’s why the Customer 360 golden record must be made into a Customer Movie (timeline events by each unique customer) and that becomes the core of event-driven data architecture” — Lourdes Arulselvan, Head of Data Archictecture at Grandvision, Former Product Manager Decisioning at Pegasystems

In that specific case, some initial merging and unification of customer data might be required before feeding into downstream applications. A more thorough processing could happen afterward, in an offline/batch propagating the different applications' changes.

Stateful deduplication: Some message brokers offer at least once delivery option, creating the need to deduplicate events. Depending on the specific option chosen, some solutions such as Azure Service Bus offer native messages deduplication option while others might require an external stateful deduplication.

Aggregations

There are two main use cases for performing aggregations on data streams.

1. Providing up to date real-time analytics data, for example, through dashboards.

2. Speeding up the downstream — typically batch — computations for very large datasets.

These operations can typically be handled through applications spark (see spark structured streaming) or Presto, which can perform time window aggregation.

Rule Engines, Complex Event Processing, and Triggers

Rules engines, complex event processing (CEP), and real-time triggers, help convert the collected data into operational intelligence.

There are different rule engine types; CLIPS, Pyke, Pyknow, and Drools are just some of the different open source rule engines available. Rule engines come in different flavors and support different standards and languages. Some are stateless; some are stateful; they can support rule language such as OSP5, Yasp, CLIPS, JESS, or their own language construct and standards such as RuleML or DMN.

Business rules management systems that leverage the DMN standard can benefit from a wide set of editors, which allow modeling, visualize and export the decision logic in the DMN format without requiring code to be written for its implementation. Thus, allowing for stronger collaboration with analysts and the business in the implementation of complex event-triggered logic. The drawback of the DMN format is that it relies on stateless computation.

There are two main types of technologies that help support this type of technologies, decision engine/business rules management systems, that incorporate building decision logic and directly leveraging stateful streams processing applications such as Apache Flink or Storm.

Some decision engine such as Drools, can incorporate stateful decision and handle the requirements of CEP. Drools support stateful competition when leveraging its’ .drl format rather than the DMN format.

Frameworks such as Flink or Storm allows to code the specific application logic to generate a decision. For instance, Flink provides the example of Fraud detection, which generates different alerts based on the different rules that have triggered exceptions.

Real-Time Machine Learning

Another type of processing done on the data is that of Machine Learning. Machine learning models can be trained using specific “online learners” or generating predictions based on the incoming data.

Storing & Serving

Real-time data brings about different challenges in terms of storing and serving collected and process data. Data tends to have different access patterns, latency, or consistency requirements, impacting how data needs to be stored and served.

Sinks

Sinks typically relate to the last stage of the data pipeline, in which the data is exported to a target system. A given data flow can have multiple sinks, should you need to export data to different systems from the same data source. This is something quite typical in real-time data, where fit for purpose integrations are common-place.

Storage Layer

To properly handle the different needs arising from real-time processing, it is important to have the appropriate systems to handle the type of workload and access pattern for the data. Depending on how the data is consumed and the volume/velocity of data, there might to complement the data platform with OLAP, OLTP, HTAP, or search engine systems.

For dashboard/aggregations type of use case, OLAP types of databases offer lower latency than typical data lake/data warehouse components. Facebook previously described how they used their in-house database Scuba, to support query deep dives and real-time dashboard aggregations for use cases such as performance monitoring, trend analysis, and pattern mining. Netflix discussed their use of Druid — an open source OLAP database, for powering their real-time insights. Other well-known open-source alternatives are Clickhouse and Pinot.

OLTP for Online transactional processing is a type of database that would typically be leveraged as part of a core production system. In contrast to OLAP databases, OLTP databases are made to deal with comparatively simple queries and handle well both read and write operations. The typical types of databases fitting within this category are the typical relational databases (RDBMS) or NoSQL databases such as MongoDB or DynamoDB.

HTAP for Hybrid Transactional/Analytical Processing offer the ability to handle both OLTP and OLAP type of workloads. Apache Kudu, whose goal is to “enable fast analytics on fast data”, is one of the open-source engine fitting within this category.

Search Engines such as Elastic search can be integrated with dashboard solutions such as Grafana or Kibana for real-time analytics, but the real value is derived when performing specific search analytics, significantly decreasing the queries' latency run.

Data warehouses: There are two main architectures for dealing with real-time data, the lambda and kappa architecture. When wanting to leverage the benefit of real-time events and consistency guaranteed of batch processing, there is a need for higher complexity in leveraging a lambda architecture. The lambda architecture requires a partial duplication of code for the batch and real-time “speed” layer. The other architecture is that of Kappa, which does not provide such a high consistency guarantee. Still, it provides a much simpler computation; both the current (i.e., real-time) and historical data are served from the same flow of events.

DATA LAKE / WAREHOUSE

Staging Layer

When ingesting data in a staging layer of the data warehouse, different approaches are considered when looking to structure the event data.

Different approaches exist, each offering different levels of specificity and unification. The three main approaches are 1) Specific surfaced table for each type of event collected 2) leveraging a common structure but specific attributes in JSON fields (such as an extra data approach 3) leveraging a nested structure to store specific data.

The decision on which approach to take should be balanced for your specific context; how different events relate to each other, how they are integrated into the data warehouse and the level of processing and unification done beforehand, how they are typically accessed and used,…

Data quality Layer

Attribute Lookups /Keys

It is necessary to create common lookup tables within the staging structures to ensure the appropriate data quality. There are many reasons why events/messages may need to be enriched through these lookup tables:

1. Change of Schema, schema can change drastically between different application release — there often need to be some schema unification to complement missing information in some schema versions.

2. Data quality issues come from regressions in the code. Data can stop being provided downstream due to regressions in code, and appropriate data quality enrichment needs to be performed — where possible to complement these regressions.

3. A-posteriori enrichment of the data (e.g., mapping a cart_id to an order_id after purchase), not all the data we want to leverage is present at the time of its’ generation, and the received events/messages only contains the immutable representation of the data at the time they were generating. It is necessary to complement the data a-posteriori to provide the necessary lookup keys to fetch additional information within the data warehouse.

Change in field meaning

The meaning of an attribute within an event field may change over time.

In certain fields, additional post-processing of the data may be necessary to isolate the different field variations properly. Think of an order_id field that once contained the order_id from the ERP system as the front-end directly integrated with it, which started to represent the order_id of the webshop instead. In the collected events, both order_ids might be represented in the same order_id field but might be parsable into an erp_order_id and shop_order_id, based on either timestamp (the change occurred then) or datatypes; for instance, one could be an INT and another a GUID.

Besides splitting a field into different fields based on their different meaning, it is also sometimes necessary to remap previously received events based on a new categorization (e.g., a customer service tool contained previously 12 categories, now only 6).

Analytical Layer

Within the analytical layer of a data lake/warehouse, it is important to restructure the events appropriately and reconstruct DIM, HDIM, and FCT tables from the events rather than leverage them directly. Doing this provides an interface to the technical implementation of the events, which may change over time and make it easier to consume by the diverse users of the data warehouse.

Overall, when trying to build an analytical layer out of events data, there are multiple considerations to have:

Attributes within the different events tend to be denormalized in nature, and values tend to be surfaced rather than identifiers. This fits well for use in a real-time processing use case, but when needing to maintain a historical view of the data, it is often needed to leverage some form of normalization. Referential and Master data should be handled separately from the rest of the information contained within the events. The best approach often turns out to include both attribute values and identifiers. This ensures that the data stream will be useful for both real-time and long term use case.

Attributes within the different events tend to be denormalized in nature, and values tend to be surfaced rather than identifiers. This fits well for use in a real-time processing use case, but when needing to maintain a historical view of the data, it is often needed to leverage some form of normalization. In this case, ensuring that events are provided with both attribute values and identifiers when they are propagated ensures that the data stream will be useful for real-time and long-term use cases.

Another aspect that needs to be considered when building the analytical layer out of events data is record or attribute “Survivorship”. There is a need to define what values will be represented in the “golden” record.

Serving Layer

There are multiple ways to integrate real-time data; the most typical are through Dashboards, Query Interface, APIs, Webhook, Firehose, or Pub/Sub and directly integrating into OLTP databases. The particular method the data will be served through will be heavily dependent on the nature of the use case intended.

For instance, when needing to integrate onto a production application, different options are available, offering an API, publishing events through webhook, firehose, or a pub/sub mechanism; alternatively directly integrating onto an OLTP database.

Analysts, on the other hand, might find a dashboard or a query interface fitter for purpose.

Operational considerations

There are specific operational considers when looking at setting up a real-time event processing system; it carries the need for 24/7 infrastructure monitoring and maintenance, whereas batch processes might not require any specific SLA.

The need for operational support is further accentuated when needing to elastically scale based on the volume of incoming data or throttle ingestion or processing needs. These types of considerations are highly dependent on the specific infrastructure and architecture selected, both on the data platform side and core /product platform, and the required ingestion/processing speed.

Leveraging real-time pipelines also causes a higher operational burden when dealing with replay of data (e.g., data changed at the source), when dealing with reconciliation processes (e.g., calling APIs) reprocessing of data that were not validated, enriched, or integrated properly.

Data Engineers need to adapt to this dichotomy and build the appropriate data structures to event log of data and its immutable properties, both in terms of types of transformation and modeling methodology used. Modeling methodology such a Kimball is usually a good fit for this type of data if the rate of change in grain/relationship, while the data vault methodology can prove appropriate when dealing with immutable log data with a high rate of change.

Dealing with immutable data increase the complexity for data engineers to do certain types of transformations while at the same time providing a more accurate and more easily accessible historization. For instance, historization of this type of data can easily be constructed through applying an analytical transformation on top of the event log data, while in more traditional data warehousing, it would have required snapshotting the data. Simultaneously, slowly dimension 1 table might need to be recreated out of a set of “create” or “update” events. This could have been directly accessible through a database extract without the need to perform any transformation.

The work of the data engineering team in the context of real-time pipeline

Leverage real-time ingestions and pipelines require some decent shift in practices for data engineers. There are different aspects of their work that changes when applied to event logs rather than the more traditional type of data engineering.

The overall push architecture of a real-time decreases the need for the data engineering team to work on ingesting particular datasets — for instance, calling APIs, setting up CRON jobs with dependencies management etc.… Instead of placing the onus on infrastructure maintenance, applying transformation on immutable data feeds, parsing and structuring unstructured or semi-structured data, and supporting the underlying data changes through remodeled tables.

Working with immutable data feed, through events and real-time pipelines, place additional responsibility of re-apply schema changes to the data platform — if handling things like CDC, be it a direct CDC or programmatic — allows these changes to be reflected onto the target database as an automated process.

Leveraging real-time data pipelines also has the effect of pushing the responsibility for cleaning data closer to the source rather than relying on the Datawarehouse/Data Lake as the place where cleaning can happen after the fact. The opportunities to clean the data in real-time being limited, and the need for certain use cases such as real-time analytics dashboard and other applications feeding from these data flow, such as CRM applications to have clean data to function, pushes the responsibility to do some initial data cleaning top power these use cases.

Should you implement real-time pipelines?

There are many reasons why you might not want to invest in setting up real-time pipelines now, be it the reliance on legacy systems not being able to push data in real-time, not having a team with the right skill set to truly take advantage of real-time yet or not yet having the right use cases or systems to really take advantage of feeding in real-time data.

Moving towards real-time data flows is, first and foremost, a question of maturity. Real-time data is becoming increasingly important. Setting up the right foundation for ingestion and processing this type of data is primordial if there is the will to one day venture towards use cases requiring real-time processing and integration. Failure to have set up the right foundation can result in high re-platforming and re-integration needs, ultimately causing delays and extra costs, sometimes making certain use-cases financially unviable.

There are, however, cases where it might not make sense to embark on a real-time data journey. If you already have a well-established batch platform, and you do not intend to tackle any use case that requires true contextual or real-time integration, it is possible to get the benefit of fresher data at a more minimal cost than adopting real-time pipelines. Speeding up ingestion and processing through mini or micro-batches can lead to great results in some cases without the need to significantly change your code, infrastructure, or processes.

Where is real-time data headed?

The increased importance of event-driven microservices and event sourcing brings real-time data flows at the center stage.

Some are already advocating for stream-oriented databases, sharing the benefits of real-time stream processors and databases. These, coupled with SQL Frameworks such as Puma, KSQL, or Spark structured streaming, will continue to make real-time data even more accessible.

‍