There are four key pillars that govern the lifecycle of analytics projects: acquiring the data, processing it, surfacing it, and actioning on it. Each of these contributes a significant part of the analytics value chain.
The data acquisition pillar consists of the wide range of tasks, systems, and technology knowledge one needs to possess to acquire the required data effectively. What is required from an analytics professional is very domain-dependent.
Types of data
With respect to data acquisition, we can consider four main types of data: clickstream, databases, APIs, and logs. Each has its own challenges and its own ways of handling data collection.
Clickstream data is generally obtained through integration with a tool such as Google Analytics or Adobe Analytics. The role of clickstream data is to provide an understanding of user behavior on the sites or apps that we are running. The default approach is to extract raw data from such a source through Google BigQuery (for Analytics 360). If this is unavailable, open-source tools such as Snowplow or Divolte can help collect raw clickstream data into a big data platform. The role of the analytics practitioner here is to define the metrics for collection, set up goals within the analytics tools, help set up extra logging, analyze user paths, handle experiment setup, and perform deep dives.
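As a minimal sketch of what pulling raw clickstream data can look like, the snippet below queries a BigQuery export table with the google-cloud-bigquery Python client. The project, dataset, table, and column names are illustrative placeholders; the real schema depends on your analytics tool's export.

```python
from google.cloud import bigquery

# Hypothetical project and export dataset; replace with your own export setup.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT
        user_pseudo_id,
        event_name,
        event_timestamp
    FROM `my-analytics-project.analytics_export.events_20240101`
    WHERE event_name IN ('page_view', 'purchase')
"""

# Run the query and pull the raw events into a pandas DataFrame for deep dives.
events = client.query(query).to_dataframe()
print(events.head())
```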
Databases are normally the source of internal system information that needs to be persisted. They usually contain transactional information, relationships between different objects, profile information, and so on. The traditional method for extracting data out of databases is SQL. Datasets are typically queried and extracted either as a snapshot or as a sequence of events, depending on the form of the data.
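A minimal sketch of both extraction styles, assuming a hypothetical orders table on a read replica reachable through SQLAlchemy; the connection string and columns are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection string; in practice this points at a replica, not production.
engine = create_engine("postgresql://analytics_ro@db-replica:5432/shop")

# Snapshot extraction: the full current state of the orders table.
snapshot = pd.read_sql(
    text("SELECT order_id, customer_id, status, total_amount, updated_at FROM orders"),
    engine,
)

# Event-style extraction: only the rows that changed since the last run.
changes = pd.read_sql(
    text("SELECT order_id, status, updated_at FROM orders WHERE updated_at >= :since"),
    engine,
    params={"since": "2024-01-01"},
)
```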
The role of the analytics practitioner with respect to data acquisition in this domain is to model and structure the data that needs to be exported from these databases. The practitioner needs to merge and deep-dive into the datasets to extract valuable data out of them, the results being reports, dashboards, extracts for statistical modeling, and so on. Practitioners extracting data out of these systems also need to work with the engineering team to make sure the right data attributes are captured within these databases, that there is a quality process on the different data inputs, and so forth.
API calls are the usual way of acquiring data when dealing with external systems. This can range from acquiring data from operating a webshop on Amazon, to getting metrics related to your ad spend on Facebook, to getting information from an external data provider.
The analytics practitioner trying to acquire these types of data needs to spend significant time looking at the different API endpoints, prototyping API calls, building an understanding of the external data, and structuring it into a format that is usable for internal use cases.
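A minimal sketch of such a prototype, using a hypothetical ad-spend endpoint and token; real providers differ in authentication, pagination, and payload shape.

```python
import requests
import pandas as pd

# Hypothetical endpoint and credentials.
BASE_URL = "https://api.example.com/v1/ad-spend"
HEADERS = {"Authorization": "Bearer <token>"}

def fetch_ad_spend(start_date: str, end_date: str) -> pd.DataFrame:
    """Call the external API and flatten the JSON payload into a tabular format."""
    response = requests.get(
        BASE_URL,
        headers=HEADERS,
        params={"start_date": start_date, "end_date": end_date},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # Structure the nested records into columns usable for internal use cases.
    return pd.json_normalize(payload["data"])

spend = fetch_ad_spend("2024-01-01", "2024-01-31")
```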
Logs are another source of data, usually captured within internal systems to store and analyze event data. They are traditionally handled as data streams (with terms such as data firehose) and stored in big data platforms.
The analytics practitioner trying to acquire log data typically spends a lot of time optimizing their code to process the data streams and monitoring the ingestion of the data. They work with the engineering team on capturing essential information within these data streams, or on making attributes directly available, to make data collection a more efficient endeavor.
Architecture for Data Collection
Clickstream data acquisition: Analytics practitioners trying to acquire new clickstream data typically define new events or attributes to be collected within the tag management system. Data is then sent from users visiting a page hosted by an external webserver to an external cloud, such as Google Cloud or Adobe's, for data collection and processing purposes. Users can typically access the raw and processed data directly from these clouds in a seamless manner.
Database data acquisition: Internal servers store information within a database as part of their processes. Development teams need to make sure the right information is stored within those databases. To ingest the data for analytics purposes, these production databases often need to be replicated, and operations need to be performed on the replicas to extract the right kind of information.
API data acquisition: Calls to external servers need to be made in this case; a worker needs to be built that calls the different API endpoints and structures the data for ingestion. The data is then placed either in a database, usually a data mart, or onto a big data file system.
Logs data acquisition: Log data is collected either as part of a process that puts events onto an event bus from an API, or through an internal log collection process. Once on an event bus, events can be pushed to a big data platform, or potentially to a database, through a data sink connector.
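As a small sketch of the "put events onto an event bus" part, the snippet below uses the kafka-python client; the broker address, topic name, and event payload are assumptions, and the downstream sink connector is owned by engineering.

```python
import json
from kafka import KafkaProducer

# Hypothetical broker and topic.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Push an application event onto the bus; a sink connector can later move it
# to the big data platform or a database.
producer.send("app-events", {"event": "checkout_completed", "user_id": 42})
producer.flush()
```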
Technology Knowledge
Each of these data domains requires specific knowledge to be able to fully execute a data acquisition process.
Clickstreams: For obtaining clickstream data, JavaScript knowledge, mainly jQuery and tag management systems, is useful for defining what type of events or attributes need to be ingested by the systems.
Databases: To extract data from a database, thorough knowledge of SQL is needed; for more advanced operations, knowledge of ETL orchestration tools such as Airflow might be useful to have.
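A sketch of what such orchestration can look like, assuming Airflow 2.x; the DAG name, schedule, and extraction callable are placeholders standing in for the SQL extraction shown earlier.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    # Placeholder for the snapshot or changed-rows extraction against the replica.
    pass

with DAG(
    dag_id="daily_orders_extract",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_orders", python_callable=extract_orders)
```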
APIs: Knowledge of how to interact with APIs, including authorization and SOAP and REST APIs, as well as programming knowledge, is needed to interface with these types of data sources.
Logs: Interacting with log data tends to be a bit more technical than with the other data sources mentioned before. Log data tends to arrive at very high volume and is sometimes acquired and processed in real time. The typical frameworks for operating on log data are Spark and Kafka, with the data pushed onto a Hadoop platform for long-term storage or offline processing.
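A minimal Structured Streaming sketch of that pattern: read a Kafka topic with PySpark and land it on HDFS. The broker, topic, schema, and paths are assumptions, and the job needs the spark-sql-kafka connector available on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("log_ingestion").getOrCreate()

# Hypothetical log schema; real log payloads are usually richer.
schema = StructType([
    StructField("event", StringType()),
    StructField("user_id", StringType()),
    StructField("ts", TimestampType()),
])

logs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "app-logs")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("log"))
    .select("log.*")
)

# Land the stream on the Hadoop platform for long-term storage / offline processing.
query = (
    logs.writeStream.format("parquet")
    .option("path", "hdfs:///data/raw/app_logs")
    .option("checkpointLocation", "hdfs:///checkpoints/app_logs")
    .start()
)
```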
The data processing pillar is responsible for transforming raw data and refining it into informative data. It consists of different sub-tasks that need to be performed on datasets: cleansing, combining and structuring the datasets, handling aggregation, and performing any additional advanced analytics processing on top of the data.
Cleansing
Data cleansing is a task everybody working in the field of analytics must do. It requires deep-diving into the data, looking for potential gaps or anomalies, and trying to structure the data in a way that tackles most of these problems.
At the heart of data cleansing, a few recurring types of issues, such as the gaps and anomalies mentioned above, need to be identified and handled. The identification and cleaning of each of these cases is a time-consuming effort that needs to be performed to a certain degree within each data source: performing data audits, replicating where possible the cause of the bad data, and working with the engineering teams to fix the problem in the long term.
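A minimal cleansing sketch on a hypothetical orders extract; the actual rules (which keys to require, which values to cap) come out of the data audit itself.

```python
import pandas as pd

# Hypothetical raw extract produced by one of the acquisition steps above.
raw = pd.read_csv("orders_extract.csv")

clean = (
    raw
    .drop_duplicates(subset="order_id")            # remove duplicated records
    .dropna(subset=["customer_id"])                # drop rows missing a key attribute
    .assign(total_amount=lambda df: df["total_amount"].clip(lower=0))  # cap anomalies
)

# Quantify what was removed so the root cause can be raised with engineering.
print(f"Dropped {len(raw) - len(clean)} problematic rows out of {len(raw)}")
```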
Merging and Denormalizing
Another step of the data processing pillar is the merging and denormalization of datasets: combining different datasets to produce more actionable and easily queryable datasets.
Understanding the concept of the grain of a dataset, and the normal forms of data, helps in operating this part. Another part of the merging and denormalization process is to set up the dimensions and metrics. The purpose of merging and denormalization is to produce a dataset, available for further use, that contains the relevant information needed for further processing in an easily accessible way.
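A sketch of denormalizing two hypothetical extracts into one wide dataset at the order grain, with dimensions and metrics side by side; the file and column names are placeholders.

```python
import pandas as pd

orders = pd.read_parquet("orders.parquet")        # grain: one row per order
customers = pd.read_parquet("customers.parquet")  # grain: one row per customer

# Denormalize: bring customer dimensions onto the order grain so downstream
# users do not need to re-join the normalized tables themselves.
order_wide = orders.merge(
    customers[["customer_id", "country", "segment"]],  # dimensions
    on="customer_id",
    how="left",
)[["order_id", "customer_id", "order_date", "country", "segment", "total_amount"]]
```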
Aggregation
Different levels of aggregation are needed for different purposes: aggregates can feed dashboards and standard reports, OLAP cubes, or further advanced analytics processing.
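A sketch of two aggregates at different grains, assuming a denormalized orders dataset like the one above; the grains and measures are illustrative.

```python
import pandas as pd

# Assumes a denormalized order-grain dataset like the one sketched earlier.
order_wide = pd.read_parquet("order_wide.parquet")
order_wide["order_date"] = pd.to_datetime(order_wide["order_date"])

# Reporting-level aggregate: daily revenue per segment, feeding dashboards.
daily_by_segment = (
    order_wide.groupby([pd.Grouper(key="order_date", freq="D"), "segment"])
    .agg(orders=("order_id", "count"), revenue=("total_amount", "sum"))
    .reset_index()
)

# Modeling-level aggregate: per-customer features for the advanced analytics step.
per_customer = (
    order_wide.groupby("customer_id")
    .agg(order_count=("order_id", "count"), total_spend=("total_amount", "sum"))
    .reset_index()
)
```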
Advanced Analytics Processing
Different advanced analytics and machine learning methods can be applied on top of the aggregates that have been computed, ranging from clustering methods to propensity modeling using methods such as random forests. The purpose of the advanced analytics step is to create synthetic data that can have predictive power and serve a purpose in decisioning.
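A minimal propensity-modeling sketch with scikit-learn on hypothetical per-customer aggregates; the feature names and the churn label are assumptions about what the upstream processing produced.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-customer aggregate with a churn label computed upstream.
data = pd.read_parquet("per_customer_features.parquet")
features = ["order_count", "total_spend", "days_since_last_order"]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["churned"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# The propensity score becomes a synthetic attribute that feeds decisioning.
data["churn_propensity"] = model.predict_proba(data[features])[:, 1]
print("Holdout accuracy:", model.score(X_test, y_test))
```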
Informative data needs to be surfaced effectively to be meaningful. Different methods of data surfacing exist, ranging from making data available in a dashboard or standard report, an analysis deck, or an OLAP cube, to simply exposing the data as a service.
Dashboards and standard reports
Dashboards and standard reports tend to be the first way to share processed information. They usually sit within the performance measurement part of the role of an analytics professional. I previously explained how the measurement process benefits from a data surfacing strategy; dashboards and standard reports are an integral part of this.
Dashboards and reports tend to be the first user-facing deliverable of analytics practitioners; they can help in getting buy-in and in giving stakeholders confidence that an analytics project is on track.
Analysis decks
Analysis decks and reports tend to be another way to share the insights gleaned during the different phases of the analytical process. Depending on the technicality of the task and the intended audience, the report tends to be shared as a PowerPoint, a Word document, or a plain Jupyter notebook.
OLAP datasets
OLAP cubes allow for slice-and-dice processing of data and are a particularly effective tool for highly dimensional datasets. Open-source tools such as Druid enable this type of processing. From a data-surfacing point of view, setting up aggregates that can be dived into easily by business or product teams empowers those teams while removing certain sets of questions and tasks normally tackled by the analytics professional.
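Druid itself is configured rather than scripted, so as a small stand-in, the pandas pivot below illustrates the kind of slice-and-dice an OLAP aggregate enables (filter to a segment, then revenue by country and month); the dataset and column names are the hypothetical ones used earlier.

```python
import pandas as pd

# Assumes the denormalized order dataset sketched earlier.
order_wide = pd.read_parquet("order_wide.parquet")
order_wide["month"] = pd.to_datetime(order_wide["order_date"]).dt.to_period("M")

# Slice: restrict to one segment. Dice: revenue broken down by country and month.
cube_slice = order_wide[order_wide["segment"] == "enterprise"]
revenue_by_country_month = pd.pivot_table(
    cube_slice,
    index="country",
    columns="month",
    values="total_amount",
    aggfunc="sum",
)
```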
System Integration
While the other methods of data surfacing focus on surfacing data and information directly to humans, this one is intended directly for machines. Integrating aggregates and predictions into production systems, be it by offering an API, storing them in database tables, or surfacing file exports … is another way an analytics professional can surface data, in this case to be used directly within products or processes.
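A minimal sketch of the API option using Flask; the scores here are held in memory for illustration, whereas a production system would read them from a database table or feature store populated by the advanced analytics step.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical scores; in production these would be looked up from a table.
CHURN_PROPENSITY = {"42": 0.83, "43": 0.12}

@app.route("/propensity/<customer_id>")
def propensity(customer_id):
    score = CHURN_PROPENSITY.get(customer_id)
    if score is None:
        return jsonify({"error": "unknown customer"}), 404
    return jsonify({"customer_id": customer_id, "churn_propensity": score})

if __name__ == "__main__":
    app.run()
```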
We sometimes see analytics divided into three separate subdomains: descriptive, predictive, and prescriptive analytics. This separation is, in my view, quite restrictive. Analytics, to be useful, should be prescriptive, but it can use statistical or modeling techniques that are descriptive or predictive, for instance.
Providing a churn propensity, for instance, without putting it into the context of a decision rule that actions this information falls short: analytics without action is just research. There has been a lot of talk about analytics needing to provide actionable insights. In my view, it is not actionable insights that should be the holy grail of analytics but the conversion of these insights into practical actions. McKinsey Consulting has advised focusing on the last mile of analytics and embedding analytics into the decision-making process of the organization, which helps with this conversion. However, analytics professionals should still be the flag bearers of this process rather than relying on it purely from an organizational perspective.
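To make the churn example concrete, here is a sketch of the kind of decision rule that converts a propensity score into an action; the thresholds and actions are illustrative assumptions, to be set with the business through cost/benefit analysis and experimentation.

```python
def next_best_action(churn_propensity: float) -> str:
    """Convert a churn propensity score into a concrete retention action."""
    # Illustrative thresholds; in practice they come out of cost/benefit analysis
    # and experimentation with the business teams.
    if churn_propensity >= 0.8:
        return "call_from_account_manager"
    if churn_propensity >= 0.5:
        return "send_retention_offer"
    return "no_action"

print(next_best_action(0.83))  # -> call_from_account_manager
```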