How Miro migrated its analytics event tracking system

In this post, we’ll take you behind the scenes of how we migrated our analytics event system — responsible for handling ~3.5 billion events per day — without losing a single event. We’ll go into the technical details and the strategy that allowed us to leverage Kubernetes, Kafka, Apache Spark, and Airflow to safeguard the events that power some of Miro’s key business decisions.

In the beginning: 2017–2021

By 2017 Miro had reached 1M users. In September of that year, Miro put its in-house analytics event tracking system, called Event Hub, into production. On its first day it received its first 1.5 million events. Event Hub consists of the following:

  • A Spring Boot API that collects events from various backend and frontend clients and writes them to a Kafka topic (a producer sketch follows this list);
  • A Kafka cluster that is used as a durable message broker;
  • An Apache Spark Data Pipeline, written in Java, that runs once a day. It conforms the events into a common format, applies enrichments, and makes the events available for downstream use cases.
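
As a rough sketch of the producer side of this setup, the snippet below writes one collected event to a Kafka topic using the plain Kafka client. The topic name, event shape, keying, and configuration are invented for illustration; they are not Miro's actual implementation.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object EventProducerSketch {
  def main(args: Array[String]): Unit = {
    // Broker address and settings are illustrative.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    props.put("acks", "all") // favour durability over latency for analytics events

    val producer = new KafkaProducer[String, String](props)

    // An event as the API might receive it from a client (the shape is assumed).
    val event = """{"name":"board_opened","clientId":"web","ts":1690000000}"""

    // Keying by client keeps events from the same client ordered within a partition.
    producer.send(new ProducerRecord[String, String]("events", "web", event))

    producer.flush()
    producer.close()
  }
}
```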

The initial deployment of Event Hub was a raw deployment on EC2, with only some parts of it managed as infrastructure-as-code. Even so, it turned out to be remarkably stable, with very few incidents.

During Hypergrowth: 2021–2022

During 2021–2022, Miro experienced dramatic growth, reaching 30M registered users in 2021. In turn, our Event Hub instance was receiving around 2.5 billion events per day.

Miro also grew its data team during this time, which helped us improve our data capabilities. The data pipeline was simplified, re-implemented in Scala Spark, migrated to Databricks, and orchestrated via Airflow. This let the team increase the pipeline's frequency to hourly, which both gave downstream analytics use cases fresher data and reduced our mean time to recovery when something failed.
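
As a rough Scala Spark illustration of what one hourly batch could look like, the sketch below reads an hour of raw events, conforms a couple of fields into a common shape, and writes a partitioned output. The paths, column names, and enrichment are assumptions rather than Miro's actual schema, and the hour to process would normally be passed in by the Airflow run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HourlyEventPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hourly-event-pipeline-sketch").getOrCreate()

    // The hour to process; in practice this comes from the orchestrator (e.g. Airflow).
    val hour = args.headOption.getOrElse("2023-08-01T10")

    // Illustrative input location, partitioned by hour.
    val raw = spark.read.json(s"s3://raw-events/hour=$hour/")

    // Conform events into a common format and add a simple enrichment.
    val conformed = raw
      .withColumn("event_name", lower(trim(col("name"))))
      .withColumn("received_at", from_unixtime(col("ts")).cast("timestamp")) // ts assumed to be epoch seconds
      .withColumn("processing_hour", lit(hour))

    // Illustrative output location for downstream use cases.
    conformed.write
      .mode("overwrite")
      .partitionBy("processing_hour")
      .parquet("s3://conformed-events/")

    spark.stop()
  }
}
```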

In this period, Miro also adopted Kubernetes and outlined a golden path for teams to deploy their services.

Decision time: 2023

By the beginning of 2023 we were processing around 3 billion events per day. At the same time, a few circumstances coincided that together led us to decide it was time to migrate:

  1. Event quality was beginning to suffer, which hurt Miro's ability to make reliable data-informed decisions. Miro had deliberately chosen a permissive model for events at the beginning, but once the organization scaled this became unsustainable. We wanted to introduce Event Validation to improve quality, but doing so on the existing infrastructure would have been risky (see the validation sketch after this list).
  2. We were asked to bring the Event Hub deployment in line with Miro best practices and leverage all the internal tooling around Kubernetes, Istio, and Terraform.
  3. While Event Hub had very few incidents during its lifespan, maintenance and upgrades were a delicate matter. It was now running on several EC2 instances, and we knew that losing even just one of them would have meant hours of downtime. The number of moving parts was destined to increase as Miro's user base grew, and so was the probability of a catastrophic failure.
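
The validation sketch referenced in point 1 above: a minimal example, in Scala, of checking an incoming event for required fields and expected types before accepting it. The field names, types, and rejection behaviour are assumptions; the post does not describe how Miro implemented Event Validation.

```scala
object EventValidationSketch {
  final case class ValidationError(field: String, message: String)

  /** Accept an event only if the required fields are present with the expected types.
    * Field names and types are assumed for the sketch. */
  def validate(event: Map[String, Any]): Either[List[ValidationError], Map[String, Any]] = {
    def requireField(field: String)(expected: PartialFunction[Any, Unit]): Option[ValidationError] =
      event.get(field) match {
        case None                                => Some(ValidationError(field, "missing"))
        case Some(v) if !expected.isDefinedAt(v) => Some(ValidationError(field, s"unexpected type: ${v.getClass.getSimpleName}"))
        case _                                   => None
      }

    val errors = List(
      requireField("name")     { case _: String => () },
      requireField("clientId") { case _: String => () },
      requireField("ts")       { case _: Long   => () }
    ).flatten

    if (errors.isEmpty) Right(event) else Left(errors)
  }

  def main(args: Array[String]): Unit = {
    // Accepted: all required fields are present with the expected types.
    println(validate(Map("name" -> "board_opened", "clientId" -> "web", "ts" -> 1690000000L)))
    // Rejected: clientId is missing and ts has the wrong type.
    println(validate(Map("name" -> "board_opened", "ts" -> "not-a-timestamp")))
  }
}
```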

The migration: 2023

We decided to migrate, but now we needed to understand how to make it happen.

“Plans are of little importance, but planning is essential” — Winston Churchill.


1. Define the plan

We started the process knowing that we needed a plan, but we all agreed that we would not have all the knowledge we needed at the beginning, so we had to be ready to adapt the plan as we progressed. A regular sync on progress helped us understand when the plan needed updating.

2. Create infrastructure

We needed somewhere to deploy Event Hub, so we first created the infrastructure needed for our deployment. We could reuse some parts of the existing infrastructure, like Kubernetes, but other parts, like Kafka, had to be created.

3. Observability

The previous iteration of Event Hub had been running for years, and the engineering team that cared lovingly for this system grew intimately familiar with its usage profile. But the significant design changes we were about to make were a source of uncertainty, so observability proved crucial to proceeding with confidence. Luckily, a lot of instrumentation at the API gateway, Kubernetes, and infrastructure levels was provided out of the box by Miro's internal tooling. Moreover, our experience with the previous system told us where to focus our attention first.

4. Migrate Event Hub to staging

Once we had the necessary infrastructure in the staging environment, we could deploy Event Hub to it. At this point we were able to go end to end: post an event to the API, move it through Kafka, and process it with the Spark Data Pipeline.
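
For illustration, an end-to-end check at this stage can start with something as simple as posting a test event to the staging API and then confirming it shows up in Kafka and in the pipeline output. The host, path, and payload below are placeholders, not Miro's real API.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object StagingSmokeCheck {
  def main(args: Array[String]): Unit = {
    // Endpoint and payload are illustrative.
    val payload = """{"name":"smoke_test","clientId":"staging-check","ts":1690000000}"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://eventhub.staging.example.com/events"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // A 2xx status only proves ingestion; the event still has to be observed
    // in Kafka and in the Spark pipeline output to call the path end to end.
    println(s"status=${response.statusCode()}")
  }
}
```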

5. Load test

We were now running Event Hub in staging, but on brand-new infrastructure: we were unsure how it would perform and whether it was correctly sized. Combined with the observability work highlighted above, this let us run a load test at a realistic production volume of events. It took several iterations to size the infrastructure for the necessary performance, and the test also gave us valuable insight into how to define our auto-scaling policies.
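
The post does not say which load-testing tool we used; purely as an illustration of the approach, here is what a constant-throughput scenario might look like in Gatling's Scala DSL. The endpoint, payload, and rates are assumptions (roughly 40,000 events per second works out to about 3.5 billion events per day).

```scala
import scala.concurrent.duration._

import io.gatling.core.Predef._
import io.gatling.http.Predef._

// Illustrative only: the tool choice, endpoint, payload, and rates are assumptions.
class EventHubLoadTest extends Simulation {

  private val httpProtocol = http
    .baseUrl("https://eventhub.staging.example.com")
    .contentTypeHeader("application/json")

  private val postEvents = scenario("post-events")
    .exec(
      http("post event")
        .post("/events")
        .body(StringBody("""{"name":"load_test","clientId":"gatling","ts":1690000000}"""))
        .check(status.in(200, 202))
    )

  // Ramp up to a production-like constant rate and hold it long enough to
  // observe sizing headroom and auto-scaling behaviour.
  setUp(
    postEvents.inject(
      rampUsersPerSec(100).to(40000).during(10.minutes),
      constantUsersPerSec(40000).during(30.minutes)
    )
  ).protocols(httpProtocol)
}
```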

6. Deploy to production

We were now confident that we could deploy Event Hub to production and begin the process of switching the event tracking over to it.

7. Prepare the data pipelines

Before we could switch the traffic, we had to change the Spark Data Pipeline so it could process events from both the old and the new Event Hub deployments.
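
A sketch of the kind of change this involved: read raw events from both deployments, conform them to the same shape, and union them so downstream consumers keep seeing a single dataset. The paths and columns are placeholders, and unionByName with allowMissingColumns needs Spark 3.1 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object DualSourcePipelineSketch {
  // Bring events from either deployment into one common shape before unioning.
  private def conform(events: DataFrame, source: String): DataFrame =
    events
      .withColumn("event_name", lower(trim(col("name"))))
      .withColumn("source_deployment", lit(source))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dual-source-pipeline-sketch").getOrCreate()

    // Illustrative locations for the two deployments' raw events.
    val oldEvents = spark.read.json("s3://old-event-hub/raw/")
    val newEvents = spark.read.json("s3://new-event-hub/raw/")

    // unionByName tolerates differing column order and, with allowMissingColumns,
    // columns that exist in only one of the two sources.
    val combined = conform(oldEvents, "legacy")
      .unionByName(conform(newEvents, "kubernetes"), allowMissingColumns = true)

    combined.write.mode("overwrite").parquet("s3://conformed-events/")
    spark.stop()
  }
}
```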

8. Switch Clients

Now we could switch over the event tracking. We had identified ~40 different services/clients sending events to Event Hub, and each one needed to be updated to point to the new infrastructure. We started with the clients with the lowest priority and/or least critical events, which allowed us to validate the full end-to-end flow while minimizing the risk if we did hit a problem. Switching over all the clients took about six weeks and a lot of coordination with different engineering teams.

9. Success

By August 2023 we had fully switched to the new Event Hub deployment and were processing ~3.5 billion events per day, without losing a single event or delaying the data we deliver to downstream use cases.

The lessons we learned

During the migration we learned a few valuable lessons:

  • Make your work visible

When many teams, streams, and stakeholders are involved, you must be transparent about the work in progress and what has been completed.

  • Individuals reaching out made a big difference

While we aligned on expectations and priorities, we hit the inevitable problem that teams had conflicting priorities, deadlines, and so on. We found that individuals reaching out to people made a big difference: we could give each team the context and the specific help it needed to complete the work faster.

  • Bake in event quality from the beginning

If you're starting from scratch, make event quality a first-class concern from the beginning. Going back and adding it later was much more difficult and carried more risk.

  • Lean on internal tooling

A lot of the machinery that allowed us to pull this off was available because of the excellent work done by other teams at Miro. Event Hub evolved from being directly exposed to the public internet through a simple load balancer to benefitting from state-of-the-art scalable infrastructure and DDoS protection.

Big Data LDN 2019: Taking control of user analytics with Snowplow

User analytics is well established at Auto Trader, but breakthroughs in cloud, big data, and streaming technologies offer clear benefits above and beyond what we could do today. This talk describes how we migrated to Snowplow on Google Cloud for our user-analytics platform. Snowplow is an event streaming pipeline that provides many features such as a unified log of your users, event enrichments, and a schema registry to enforce data integrity. I'll walk through why we chose to migrate, the overall Snowplow platform, the architectural benefits, and how using Google BigQuery and Dataflow has been a huge success for Auto Trader.

Continuous deployment of machine learning models

This talk describes how Auto Trader launched a suite of new machine learning models with the ability to serve low-latency predictions in real time. These models are automatically retrained and redeployed using continuous deployment pipelines in our existing deployment infrastructure, making use of technology including Apache Spark, Airflow, Docker, and Kubernetes. Since models are deployed without manual intervention, we developed a robust testing strategy to ensure deployments will not cause a drop in model performance, including accuracy and coverage.

How we use Architectural Decision Records (ADRs) on Data Engineering

When I joined our data engineers over a year ago they had already adopted Architectural Decision Records (ADRs) to document architectural decisions made whilst building Auto Trader's Data Platform. ADRs are listed in ThoughtWorks' Technology Radar as a technique to adopt; our data engineers were the first team I've worked on to use them. They have allowed us to capture the context and consequences of the decisions we make, in a way that provides transparency and allows the whole team, and the wider organisation, to contribute.

What is an ADR?

ADRs are short text files that capture a single decision; see here for a more in-depth description. The ADRs are numbered sequentially and monotonically, with no number reused.

We have adopted the below convention for our ADRs (a sketch of an example ADR follows the list):

  • Title: A descriptive title in the following format: <ADR_ID> - <Title>
  • Status: It must have one of the below statuses.
[Diagram: the permitted ADR status transitions]

ADRs can only change status along the permitted transitions shown in the diagram above.

  • Context: The facts about why we need to make this decision. For example, they could be technological, such as we need to provide a certain capability; or they could be legal, such as compliance requirements.
  • Decision: The actual decision. We provide a description of what we decided and provide justifications if needed. The decision should be an action we will undertake.
  • Consequences: We provide a description of all the potential consequences; be they good, bad or neutral.
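
To make the convention concrete, here is what a hypothetical ADR might look like. The number, wording, and "Accepted" status are invented for illustration; the subject is borrowed from the data-lake partitioning example mentioned later in this post.

```markdown
# 0042 - Partition the data lake by event date

## Status

Accepted

## Context

Downstream jobs almost always filter by event date, and full scans of the data
lake are becoming slow and expensive as data volumes grow.

## Decision

We will partition all data lake tables by event date (dt=YYYY-MM-DD) and
backfill existing tables to match.

## Consequences

Date-bounded queries become cheaper and faster, and backfills that rewrite a
single day only touch one partition. Queries that filter on other columns gain
nothing and may still require a full scan. The backfill is a one-off cost.
```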

How are ADRs used in Data Engineering?

Our data engineers maintain a website on GitHub Pages using Jekyll. This was already used to share common links, jargon busting, and build statuses, so the ADRs were placed there too. The squad website is readable by anyone at Auto Trader and contributions are welcome from anyone.

There are ADRs covering a wide range of subjects: how to partition data in the data lake, the choice of business intelligence tool, proposed changes to metric algorithms, and so on.

Having a place to record the ADRs is not enough. Consensus needs to be reached on whether ADRs should change status or not. We either address this at standup or with specific meetings for more in-depth discussions.

As the Data Platform was built, lots of architectural decisions were made. We've found it worthwhile to revisit those decisions and see if they still hold, given what we've learnt as the platform has grown.

Benefits of ADRs

One: Clearly reasoned decisions

The format of the ADRs enforces consistency in approach to documenting decisions. This makes it easier for both the author and the reader. The author knows what information to provide to produce an ADR, and the reader knows what to expect when reading an ADR and where to find it.

As the Data Platform is built, new technology is continuously adopted. The ADRs allow us to show stakeholders that we are in control of technology adoption and change.

Two: Good for new team members

I found the ADRs invaluable when I first joined Data Engineering. They allowed me to get up to speed quickly on why the team had made certain decisions. I could read the ADRs to gain a broad understanding of the architecture and to understand the context of the decision. The ADRs told me the history of the decisions made by Data Engineering.

Three: Version control for the history of team decisions

Given we are using GitHub Pages to store our ADRs we get a versioned history of their changes for free. I have found it useful to review the history of an ADR to understand its evolution or to know who to ask for clarification if something is unclear.

Four: Keeping them lightweight

We don’t embrace ‘big design up front’ so the ADRs are not mammoth documents that nobody will ever read. The theme used for Jekyll has a built-in reading time estimator; the longest ADR to date has an estimated reading time of 4 minutes.

Challenges of ADRs

One: Deciding when to create one

There are no defined rules for when to create an ADR. Individuals are free to propose one when they feel it is necessary, the hope being that this way we capture all the ADRs we need. I think over time, and with experience, we've gotten better at ensuring ADRs are written when required. There have been only a few occasions where I felt that an ADR was missed.

Two: Keeping them up-to-date

The environment our data engineers work in is reasonably fast-paced, and lots of decisions can get made in a short space of time. I have found, on occasion, that the ADRs are out of date.

If we fail to write them in a timely manner then we reduce the value of the ADRs; future potential readers have no way of knowing what was not recorded. This was one of the reasons we now have semi-regular catch-ups to discuss ADRs.

Three: Keeping them lightweight

The preference for keeping the ADRs lightweight can make them challenging to write. You want to give a future reader enough information, but to do so concisely. That requires good, clear writing, which isn't the easiest thing to achieve when documenting a detailed technical decision.

Conclusions

We’ve found great value in ADRs as both a historical record of decisions and a way to capture the thinking behind architectural choices. Their number continues to grow.

They were incredibly useful to me when I first joined and I continue to find them a valuable tool.
