I recently wrote about using clickstream event collectors, such as Snowplow or Divolte, to power more reliable and deeper analytics. It is, however, possible to create your own clickstream event collector in a few clicks using Microsoft’s Cloud.
Besides the tracking script, the Azure stack can handle all the functions of a clickstream collector with three components: Logic App, EventHub, and DataLake Storage.
1. Tracking Script: A tracking script is a piece of JavaScript downloaded by the browser; it tracks the different user interactions and sends the corresponding events to the clickstream collector.
2. Logic App: The role of the logic app is only to capture the incoming message and push it to an event hub “topic”.
3. EventHub: EventHub hosts the events for real-time processing; a specific setting called data capture allows the data to be exported to Azure Blob Storage or DataLake Storage.
4. DataLake Storage: Provides long-term storage for the data pushed to EventHub.
Once the data is in DataLake Storage, it is possible to query it using U-SQL through Azure Data Lake Analytics.
There are a few pros and cons to leveraging this type of serverless solution for capturing clickstream data.
The clickstream collector can be set up in a three-step process: first, set up the storage layer (Blob or DataLake); then set up EventHub with data capture; and finally, create a logic app that will forward incoming events to the event hub.
1. The first step is to create a blob storage or data lake storage account. This is where the data will ultimately be hosted.
2. The second step is to create an event hub with data capture turned on. This will export the ingested data to blob/data lake storage at specific (configurable) intervals; a CLI sketch of this step follows the list.
3. The third step is the creation of the logic app. The logic app needs only two components: an HTTP Request trigger that receives the event, and an action that sends it to the event hub.
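For step two, a rough sketch of how an event hub with data capture could be created from the Azure CLI is shown below. All resource names are placeholders, and the exact flags may vary between CLI versions, so treat this as an illustration rather than a recipe:

```bash
# Sketch only: create an event hub with data capture enabled,
# exporting to a blob container every 300 seconds (placeholder names).
az eventhubs eventhub create \
  --resource-group clickstream-rg \
  --namespace-name clickstream-ns \
  --name clickstream-hub \
  --enable-capture true \
  --capture-interval 300 \
  --destination-name EventHubArchive.AzureBlockBlob \
  --storage-account clickstreamstore \
  --blob-container capture
```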
The HTTP Request component of the logic app provides an option for schema validation. The basic logic app event schema that we are using in this example is provided below:
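This is a minimal sketch of such a schema, assuming the collector receives a single base64-encoded body string (the "body" field name is an assumption, matching the encoding requirement discussed below):

```json
{
  "type": "object",
  "properties": {
    "body": {
      "type": "string"
    }
  },
  "required": ["body"]
}
```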
The following Python code can push data to the logic app, where the URI variable is the hosted HTTP endpoint provided by the logic app. It is worth noting that, to be able to push to EventHub, the content needs to be encoded in base64.
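Here is a minimal sketch of such a producer using the requests library; the endpoint URL and the event fields (event_type, page_url, and so on) are illustrative assumptions, not a fixed contract:

```python
import base64
import json

import requests

# Placeholder: the HTTP endpoint shown in the logic app's HTTP Request trigger.
URI = "https://<region>.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke?<sas-params>"

# Illustrative event payload; adapt the fields to your own tracking needs.
event = {
    "event_type": "page_view",
    "page_url": "https://example.com/",
    "timestamp": "2019-01-01T00:00:00Z",
}

# EventHub expects the content to be base64 encoded, so the event is
# serialized to JSON and wrapped in a base64 "body" string.
payload = {"body": base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")}

response = requests.post(URI, json=payload)
print(response.status_code)  # 200/202 indicates the logic app accepted the event
```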
A tracking script can be created in JavaScript in the same manner; an example JavaScript implementation is shown below.
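This is a minimal sketch of a tracking snippet that mirrors the Python example; the endpoint and event fields are, again, assumptions:

```javascript
// Placeholder endpoint: the URL of the logic app's HTTP Request trigger.
var URI = "https://<region>.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke?<sas-params>";

function trackEvent(eventType) {
  // Illustrative event payload; field names mirror the Python example.
  var event = {
    event_type: eventType,
    page_url: window.location.href,
    referrer: document.referrer,
    timestamp: new Date().toISOString()
  };
  // Base64 encode the content so it can be pushed to EventHub.
  // Note: btoa assumes the serialized event is ASCII-safe.
  var payload = JSON.stringify({ body: btoa(JSON.stringify(event)) });
  fetch(URI, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: payload
  });
}

// Track the initial page view once the script is loaded.
trackEvent("page_view");
```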
If the logic app has been able to push the data to EventHub, the runs should show as succeeded within the Logic Apps UI.
After the data capture interval has elapsed, a new file should appear in the data storage.
Using an AVRO reader, we can read the content of the file and see how the data is stored:
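As a sketch, assuming the fastavro library and a capture file downloaded locally (the filename is a placeholder), the events can be decoded like this:

```python
import base64
import json

from fastavro import reader

# Placeholder filename: a capture file downloaded from blob/data lake storage.
with open("capture-file.avro", "rb") as f:
    for record in reader(f):
        # EventHub data capture wraps each event in a record whose
        # "Body" field holds the raw bytes that were sent to the hub.
        raw_body = record["Body"]
        # The content was base64 encoded before being pushed, so decode
        # it back into the original JSON event.
        event = json.loads(base64.b64decode(raw_body))
        print(event)
```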
There are a few things that can be done to productionize this setup.
Using the combination of Logic App/EventHub/DataLake Storage provides a quick way to deploy and gather clickstream data. There is still some work that might be required to make it more production-ready. Still, by default, Logic Apps handle availability, scalability, retry logic, and logging.
One of the main factors to consider is the pricing model, which is per execution. Sites with a low volume of visits would benefit price-wise from relying on this type of integration, but sites with a high volume of visits might want to look at a different approach.