About a month back I wrote about the evolution of data engineering, a role whose requirements have been changing rapidly, and decided to take a similar look at what a data scientist is in 2018.
Looking at the current requirements and orientations across five key axes of data science, namely Data Acquisition, Experimentation, Predictive Modeling, Data Visualization and Productionization, we can see that there is a spectrum of roles and functions within data science, and that more often than not one data scientist's job function is not the same as another's.
Within the realm of data acquisition, I see three different advanced orientations for data scientists: digital analytics, backend engineering and data engineering.
For those more oriented towards UX and digital analytics, data acquisition tends to gravitate towards setting up custom logging in Google Analytics and implementing data layers and tags. Knowledge of custom events and of tag management systems allows them to bypass the traditional engineering development and release cycle and shorten the time to market for data analysis. These practitioners tend to develop their code in JavaScript/jQuery and benefit from dashboards that need no further integration in Google Data Studio, as well as a deep-dive tool in Google Analytics that lets them query specific events and segments. Those whose company has a Google Analytics 360 license further benefit from a direct export of event data into Google BigQuery, and those running Snowplow or Divolte with the Google Analytics plugin functionality can likewise export their event data to an alternative big data platform.
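Client-side, this is usually a few lines of JavaScript pushed through the tag manager, but the same kind of custom event can also be sent from a script via Google Analytics' Measurement Protocol. A minimal sketch in Python, where the tracking ID, client ID and event names are placeholders:

```python
# Minimal sketch: sending a custom event to Universal Analytics via the
# Measurement Protocol. Tracking ID, client ID and event fields are placeholders.
import requests

payload = {
    "v": "1",               # protocol version
    "tid": "UA-XXXXXXX-1",  # placeholder tracking ID
    "cid": "555",           # anonymous client ID
    "t": "event",           # hit type
    "ec": "checkout",       # event category
    "ea": "add_to_cart",    # event action
    "el": "product_123",    # event label
    "ev": "1",              # event value
}

response = requests.post("https://www.google-analytics.com/collect", data=payload)
print(response.status_code)  # the collect endpoint returns 200 even for malformed hits
```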
Data scientists with more of a backend engineering orientation are generally experts at sourcing data from APIs and scraping websites. They benefit from knowledge of parsing frameworks such as Beautiful Soup, of REST and SOAP APIs, and of authentication protocols such as OAuth. The recent explosion in popularity of GraphQL and of graph databases adds another layer of complexity to these tasks. These data scientists traditionally work almost entirely within the Python ecosystem, leveraging its different libraries.
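To give an idea, a minimal sketch of this kind of sourcing, pulling a page and parsing it with requests and Beautiful Soup; the URL and CSS selector here are hypothetical:

```python
# Minimal sketch: fetch a page and extract product names with requests
# and Beautiful Soup. URL and selector are hypothetical.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.select("h2.product-name")]
print(names)
```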
Those of a more data engineering persuasion typically master ETL frameworks such as Airflow and Luigi and learn to source their data from file dumps and from database and data-warehouse systems such as Hadoop/Hive. They are more expert at creating multi-step pipelines handling the different stages of data transformation, and traditionally work with a mix of Python and SQL for data acquisition. The focus for this orientation is typically efficiency.
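A minimal sketch of what such a pipeline skeleton looks like in Airflow, with placeholder extract and load steps and an illustrative daily schedule:

```python
# Minimal sketch of an Airflow DAG chaining an extract and a load step.
# Task logic, DAG name and schedule are illustrative only.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    # e.g. pull yesterday's partition from Hive or a file dump
    pass

def load():
    # e.g. write the transformed result into the warehouse
    pass

dag = DAG("daily_acquisition", start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # load runs only after extract succeeds
```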
The experimentation space itself is home to different classes of focus, ranging from front-end development to analysis and statistical methods.
Data scientists can sometimes take an active development role in setting up the full experiment; this is often true for those embedded in the digital analytics space, especially those focused on UX/CRO. For them the aim is the quickest possible turnaround on experimentation, and skipping an external dependency is sometimes the fastest route to market. They leverage tools such as Google Optimize, Optimizely or Qubit and develop experiments using HTML and jQuery to test improvements to UX flows and conversion rates, with results typically analysed in these tools themselves or in Google Analytics.
Those of a more product analytics orientation would help set up A/B tests, integrating different metrics into an experimentation framework and helping define the right metrics, sample size, target segment and overall population for the experiments. They would deep dive into the results, trying to get a sense of how the experiment impacted different classes of users and coming up with potential recommendations for future features or experiments. The focus for them is one of exploration and discovery across the different datasets.
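For instance, a minimal sketch of sizing an A/B test with statsmodels' power calculations; the baseline conversion rate and target uplift are illustrative:

```python
# Minimal sketch: sample size for detecting a conversion-rate uplift,
# using statsmodels' power calculations. Rates below are examples.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # current conversion rate
target = 0.11     # smallest rate we want to be able to detect
effect_size = proportion_effectsize(baseline, target)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_group))  # users needed in each variant
```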
For data scientists focusing more on the statistical side of the job, experimentation centers on using advanced statistical techniques to get a more accurate measurement of uplift, and/or on automating the scaling up and down of experiments. The techniques used for this purpose tend to revolve around synthetic control methods, CUPED-adjusted metrics or one-/multi-armed bandits, for instance, rather than a mere A/B test.
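To illustrate one of these, a minimal sketch of a CUPED adjustment, which uses a pre-experiment covariate to reduce the variance of the in-experiment metric; the data here is simulated:

```python
# Minimal sketch of CUPED: adjust the in-experiment metric with a
# pre-experiment covariate to shrink its variance. Data is simulated.
import numpy as np

def cuped_adjust(metric, pre_metric):
    """Return the CUPED-adjusted metric given a pre-experiment covariate."""
    theta = np.cov(pre_metric, metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())

np.random.seed(0)
pre = np.random.normal(10, 2, size=1000)               # pre-experiment spend
post = 0.8 * pre + np.random.normal(0, 1, size=1000)   # correlated in-experiment spend

adjusted = cuped_adjust(post, pre)
print(post.var(), adjusted.var())  # the adjusted metric has lower variance
```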
Data scientists from a more statistical background are adept at using principal and independent component analysis, mixed-effect models and diverse time series models to make predictions. They can also take a more Bayesian, probabilistic approach to computation using frameworks such as Stan or PyMC. They tend to gravitate towards the more research-oriented data science roles, or towards roles in the finance industry, where more accurate high-level predictions have a significant impact.
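As a small example of that probabilistic style, a minimal sketch of a Bayesian conversion-rate estimate in PyMC3, on a toy observed dataset:

```python
# Minimal sketch: Bayesian estimation of a conversion rate with PyMC3.
# The observed data is a toy array of successes/failures.
import numpy as np
import pymc3 as pm

data = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])  # toy conversions

with pm.Model():
    rate = pm.Beta("rate", alpha=1, beta=1)        # uninformative prior
    pm.Bernoulli("obs", p=rate, observed=data)     # likelihood
    trace = pm.sample(2000, tune=1000)             # MCMC sampling

print(trace["rate"].mean())  # posterior mean conversion rate
```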
Data scientists from a machine learning background, on the other hand, will be familiar with propensity modeling, clustering and recommender systems. They use techniques such as random forests, gradient boosted trees, DBSCAN and matrix factorization, and normally leverage libraries such as Spark's MLlib, XGBoost or CatBoost to produce their predictions. More often than not, they tend to work for highly digital companies, where data is abundant.
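For instance, a minimal sketch of a propensity model using XGBoost's scikit-learn interface; the feature matrix and conversion labels are simulated:

```python
# Minimal sketch: a propensity-to-convert model with XGBoost on toy data.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X = np.random.rand(1000, 5)                             # e.g. recency, frequency, spend...
y = (X[:, 0] + np.random.rand(1000) > 1).astype(int)    # toy conversion label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]  # propensity scores
print(roc_auc_score(y_test, scores))
```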
Quite different knowledge is required from those working on image or text processing: deep learning and/or natural language processing are needed for these use cases. The technical stack for these practitioners is also different, with TensorFlow, Keras or NLTK being the skills to have. These roles tend to be either heavily research-focused or highly data-product driven.
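As an illustration, a minimal sketch of a small convolutional network in Keras, of the kind used for image classification; the input shape and number of classes are placeholders:

```python
# Minimal sketch: a small Keras CNN for image classification.
# Input shape and class count are placeholders.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(10, activation="softmax"),  # 10 placeholder classes
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```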
Finally, a different type of data scientist with respect to modeling comes from a more operations research background; their focus tends to be heavily geared towards optimisation and simulation. Their mathematical skills are centered on linear and non-linear optimization, and their tools of choice tend to be FICO Xpress, CPLEX or Octave. Much of their work is found in manufacturing and supply chain.
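To give a flavour, a minimal sketch of a tiny linear program, using scipy's linprog as an open-source stand-in for solvers like CPLEX or Xpress; the costs and constraints are made up:

```python
# Minimal sketch: a toy production-planning LP solved with scipy's linprog,
# standing in for commercial solvers. All numbers are illustrative.
from scipy.optimize import linprog

# Minimise production cost 2*x1 + 3*x2
c = [2, 3]
# Subject to: x1 + x2 >= 100 (demand), written as -x1 - x2 <= -100
A_ub = [[-1, -1]]
b_ub = [-100]
bounds = [(0, 80), (0, 80)]  # capacity per production line

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x, result.fun)  # optimal plan and its cost
```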
With respect to data visualization, setting aside those who only use charting for exploratory purposes, there are two kinds of orientation: front-end data visualization and business intelligence.
Data scientists with a front-end orientation towards data visualization focus on building visualizations with D3.js or vis.gl, both of which leverage a knowledge of JavaScript to build interactive visualizations. Data scientists with this orientation traditionally work more towards building front-end-facing data products.
The data visualization focus for data scientists with a more business intelligence orientation revolves around setting up dashboards in tools such as Tableau, Looker or Superset. These tools allow for a quick setup, often requiring little more than a SQL query and a few drag-and-drops to get started with some visualizations. The focus of this orientation is to provide different internal stakeholders with the right information to make decisions.
While certain roles within data science are still focused on analysis, for the "type B" data scientist the goal is clearly to put the fruit of their work into production. In this context two approaches emerge: one focused on data engineering and integration, the other on exposing real-time model delivery through micro-services and serverless functions.
On the productionisation side, data scientists can have an orientation towards data engineering and integration; these practitioners normally take a mixed data engineering and backend engineering approach to productionizing their work. Tools such as Airflow, which lets them write pipelines marrying ETL and software into directed acyclic graphs (DAGs), allow for the creation of dynamic pipelines that support re-usability and extensibility. Integration into customer profiles, databases or CRM systems can easily be achieved with the different tools in use. The focus, however, tends to be more around batch pre-computation than on delivering instantly calculated scores.
Those more focused on real-time model delivery and prediction concentrate on setting up micro-services using Flask, or on a serverless approach using tools such as AWS Lambda. Depending on the nature of the role, there might also be a focus on subscribing or publishing to an event bus such as Kafka or Kinesis; this is usually the case when the data scientist's work requires generating advanced analytics triggers, for instance for CRM decisioning. Practitioners in this domain tend to have developed a backend and dev-ops approach to their data science craft.
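As an illustration, a minimal sketch of a Flask micro-service exposing a trained model behind a /predict endpoint; the pickled model path and payload format are placeholders:

```python
# Minimal sketch: a Flask micro-service serving real-time predictions.
# Model file name and request payload shape are placeholders.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pkl", "rb") as f:   # hypothetical serialized model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. [[0.1, 3, 42.0]]
    scores = model.predict_proba(features)[:, 1]  # real-time propensity scores
    return jsonify(predictions=scores.tolist())

if __name__ == "__main__":
    app.run(port=5000)
```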
There is, however, some development in this area with the "notebook-erization" of data science: companies such as Dataiku and Databricks are working to democratize the deployment of data science to production. They do not cater to every productionalization need, but they help bridge the knowledge gap and reduce the time and effort required to put a model into production.
Across each of these axes of data science skills there is a variety of orientations and inclinations as to what a given data scientist can do, and it is important to really understand what type of work a given data scientist should be doing. Even for data scientists who can be considered full-stack across each axis, there is still a certain affinity and degree of knowledge within each subdomain. We should therefore consider data science a broad class of jobs rather than a single job title.