With everyone trying to get everything out the door before the holiday season, November was a busy month for the data world: Airflow 2.0 moved to beta status, Google released new tooling to help with machine learning in the areas of NLP and managing ML model bias, and Apple released some benchmarks of their new M1 chip's performance on ML workloads.
SQL got some attention this month; Google upgraded their managed Postgres offering to the latest version, Postgres 13. Databricks released SQL Analytics, providing a familiar SQL interface for querying Delta Lake tables; it offers both a workspace and connectors to popular BI solutions to facilitate the work of analysts. Amazon also joined the SQL game by introducing a SQL-compatible query language for DynamoDB.
Amazon introduced a managed service for data workflows leveraging Apache Airflow. Speaking of Airflow, Airflow 2.0 entered beta this November and is due to be upgraded to release-candidate status in December. Airflow also received a new provider for Great Expectations.
Data quality was a focus for more than Airflow this month. A case study of Great Expectations at Heineken was released. dbt got its own port of Great Expectations in the form of dbt-expectations. Simultaneously, Airbnb gave an overview of their work on improving data quality in a two-part blog post [1] [2].
Testing with dbt was the focus of an article from Shopify Engineering introducing their Seamster framework for performing unit tests on dbt's SQL models, as well as leveraging tests defined through Great Expectations. Testing was also the focus of an article on towardsdatascience introducing us to synthetic data generation using the Synthetic Data Vault (SDV) library. The folks at lakeFS had a different idea of the scope of testing, attempting to bring Netflix's Chaos Monkey approach to the data lake.
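The core idea behind unit-testing SQL models can be sketched without any framework: build a tiny fixture table, run the model's SQL against it, and assert on the output. This is a minimal illustration in the spirit of that approach (not Seamster's actual API); the table and column names are hypothetical, and SQLite stands in for the warehouse.

```python
# Framework-free sketch of unit-testing a SQL model: load a small
# fixture, run the model's query, and assert on the result rows.
# All table/column names here are hypothetical.
import sqlite3

ORDERS_MODEL_SQL = """
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_spent
FROM raw_orders
GROUP BY customer_id
"""

def run_model_test():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (customer_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [("a", 10.0), ("a", 5.0), ("b", 7.5)],
    )
    rows = {cid: (n, total) for cid, n, total in conn.execute(ORDERS_MODEL_SQL)}
    # Unit-test style assertions on the model's output
    assert rows["a"] == (2, 15.0)
    assert rows["b"] == (1, 7.5)
    return rows

run_model_test()
```

The benefit over testing against production data is determinism: the fixture pins down exactly which edge cases the model is expected to handle.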
Along the lines of data management, quality, and governance, Uber introduced Databook to the world. Databook is their solution for facilitating dataset discovery and providing dataset metadata to end-users.
Data mesh, data lake, and data warehouse architectures were also in the spotlight, with Picnic introducing us to their Lakeless Data Warehouse. At the same time, Pinterest provided us with insights on their shift from a Lambda to a Kappa architecture for their visual signals infrastructure.
Real-time processing was the focus of an article from Cloudera providing an overview of the real-time solutions on their platform. Walmart shared their views on how best to prepare event-driven data, the typical integration pattern in real-time architectures. Slack shared how they created their analytics logging library using React to facilitate sending events for customer behavior tracking.
Stopford's article looked beyond the current stream processing paradigm towards a hybrid of tables and streams such as provided through ksqlDB. The article makes a case for the creation of better tooling to support this increasingly important use case.
On the machine learning front, Apple published the article Leveraging ML Compute for Accelerated Training on Mac, showing how the new M1-based MacBooks improve upon previous MacBooks for machine learning workloads.
Machine learning lifecycle and governance — Salesforce provided some insights on their training and experimentation platform, the engine behind Salesforce Einstein. Meanwhile, an article on towardsdatascience covered the engineering practices for the machine learning lifecycle at Google and Microsoft. H2O provided an overview of their H2O AutoDoc, which helps automate machine learning model documentation. Microsoft shared some insights on how to manage machine learning governance at scale.
Machine learning training and tuning — machine learning model bias has been a topic of concern. To help mitigate its impact, Google released MinDiff, a new regularization technique in the TensorFlow library. Meanwhile, Expedia gave us more practical examples of tuning and regularizing decision trees for learning to rank.
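The intuition behind MinDiff-style regularization can be sketched in a few lines: add a penalty to the training loss when the model's score distributions differ between two groups of examples. This is a deliberate toy simplification, not TensorFlow's actual MinDiff implementation (which uses a kernel-based MMD loss); the mean-gap penalty below is only meant to convey the idea.

```python
# Toy illustration of MinDiff-style regularization: penalize the gap
# between mean model scores for two groups of examples. The real
# MinDiff in TensorFlow Model Remediation uses an MMD kernel loss;
# this simple mean-difference penalty is an assumption for clarity.

def mean(xs):
    return sum(xs) / len(xs)

def min_diff_loss(task_loss, scores_group_a, scores_group_b, weight=1.0):
    """Combined loss = task loss + weight * |mean score gap between groups|."""
    gap = abs(mean(scores_group_a) - mean(scores_group_b))
    return task_loss + weight * gap

# Similar score distributions across groups incur no extra penalty...
balanced = min_diff_loss(0.5, [0.2, 0.8], [0.8, 0.2])
# ...while a systematic gap between groups inflates the loss,
# pushing the optimizer toward more uniform score distributions.
skewed = min_diff_loss(0.5, [0.9, 0.8], [0.2, 0.1])
assert skewed > balanced
```

During training, minimizing this combined loss trades a little task accuracy for more similar score behavior across the two groups.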
On the NLP front, Google introduced us to their Language Interpretability Tool (LIT) for understanding NLP models. Amazon provided us with an overview of how to make AI better at reading comprehension.
Deep learning — Google showed how GANs can automatically create fantastical chimera images. They demonstrated how 3D creature models built by digital artists could generate the volume of images needed to train the GAN, and how fine-tuning the perceptual loss could result in drastically different renderings.
On the machine learning education front, Google launched a Machine Learning Engineer certification.
For web analytics, two articles caught my attention this month. The first, from Pinterest, "A better clickthrough rate," provides insights into some of the pitfalls of using click-through rate as a metric and what treatments can be applied to mitigate some of its shortcomings.
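One well-known pitfall of raw click-through rate is that items with few impressions get wildly noisy estimates, and a common treatment is to shrink them toward a prior. The sketch below illustrates that generic idea with Beta-prior (pseudo-count) smoothing; it is not necessarily the treatment Pinterest's article describes, and the prior values are arbitrary.

```python
# Raw CTR is unstable at low impression counts: 1 click on
# 1 impression reads as a "100% CTR" item. Bayesian smoothing adds
# pseudo-impressions at a prior CTR, shrinking low-volume estimates
# toward the prior. Prior values here are illustrative assumptions.

def smoothed_ctr(clicks, impressions, prior_ctr=0.05, prior_weight=100):
    """CTR shrunk toward prior_ctr; prior_weight acts as pseudo-impressions."""
    return (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)

raw = 1 / 1                  # raw CTR: 100%, clearly not trustworthy
shrunk = smoothed_ctr(1, 1)  # (1 + 5) / 101, close to the 5% prior
assert shrunk < 0.06

# With real volume, the data dominates the prior:
popular = smoothed_ctr(5_000, 100_000)  # stays near 5,000/100,000
assert abs(popular - 0.05) < 0.001
```

The prior weight controls the trade-off: higher values demand more evidence before an item's CTR can deviate from the baseline.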
The second, by Murat Ova on LinkedIn, discusses the potential of combining behavioral economics with web analytics to optimize and influence purchasing decisions through recommendations. While I am not certain of the practicality of influencing customer behavior this way versus optimizing recommendations for increased AOV (average order value) with a machine learning model, I would be curious to see the results of implementing such a theory.