Using Akita for Discovery and Monitoring Across Our API at Flickr
This is a guest post by Jeremy Ruppel, an Engineering Manager on the frontend team at Flickr.
As an Engineering Manager responsible for a cross-functional team of engineers, my job is to make sure my team can productively add features to a code base older than Billie Eilish. While we spend most of our time building new features, my team is also responsible for maintenance work, such as refactoring legacy data models.
As part of moving fast without breaking old code, I am responsible for having good technical and people processes in place. Part of this involves making sure not only that we’re monitoring what we’re supposed to, but that we’re spending an appropriate amount of time on it. I’m working with a small but mighty DevOps team and no Developer Productivity team, so many of the solutions out there that assume we can have an engineer spend a quarter building dashboards simply are not accessible to us.
Because monitoring is so critical and our bandwidth is limited, ease of installation and ease of maintenance are key to its success. For the whole time I’ve been at Flickr, I’ve been looking for a solution that could give us "hands-free" monitoring and observability, without requiring a lot of time to install and learn. We were recently able to install Akita quite easily, in large part due to their non-invasive monitoring approach.
In this blog post, I’ll talk about the difficulties of adopting tooling in a system with any amount of legacy code, the monitoring and observability challenges we’ve encountered during my tenure at Flickr, and the straightforward integration process we had with Akita. I worked on this integration in collaboration with my former colleague Nick Scheiblauer, an Engineering Director in charge of Flickr’s backend.
Monitoring and observability challenges in a legacy system
The challenge with introducing any kind of new tool in a legacy system is that there’s a whole lot of code that now needs to adapt to the new tooling.
At Flickr, we’re on relatively modern infrastructure: AWS, ECS, and Docker. But we have pre-REST API endpoints that were first introduced in the early 2000s, and the longer they persist, the less likely it is we’ll rewrite everything using the hot new standard. While we have plans to write new parts of our system with protocols like GraphQL, we’ve accepted that the legacy subsystems are here to stay.
The need to instrument code has been a blocker to improving our monitoring and observability. For instance, in order to adopt AWS X-Ray for analyzing and debugging production code, we needed to find a third-party PHP client, wrap it and fix a few things, then instrument certain parts of the code. We now have a partial integration covering only DB and cache utilization, and it's challenging to invest in it further. Even tools that simply require dropping in certain libraries can quickly become prohibitively expensive for us to adopt, since adopting them can mean transitively upgrading a large amount of code that nobody has touched in years.
And while you may think that Flickr faces these problems more acutely than many other companies, it turns out we’re fairly typical for companies of our size and scale. First, many applications have large amounts of legacy code, some of it now decades old, including software you might use on a regular basis from banks, retail companies, and more. Second, even if a company is building from scratch with no legacy code, writing any amount of code without these tools in mind means the organization is building up tooling debt.
What a drop-in solution means to us
Given my team’s experience with monitoring and observability tools in our system, I became convinced of the value of a drop-in solution. To us, this means three things:
- Ease of initial installation. The tool needs to install without requiring us to instrument code or pull in libraries.
- Ease of ongoing use. We’d like our teams to be able to move quickly without needing to be trained in how to instrument code to log the correct stats to the tool.
- Ease of interpreting results. Our engineers are smart, but they are busy and do not necessarily have the time to become DevOps experts.
For these reasons, when we first encountered Akita, we were excited that it might fulfill these drop-in requirements. Akita passively watches API traffic via an agent, which means that one should theoretically be able to drop Akita into any system without needing to change any code or include any libraries. Akita also automatically models the API traffic, so developers don’t have to put any annotations in the code, and so it can produce higher-level, easier-to-digest summaries of what is going on with our APIs.
Many tools claim to be easy to install. Over the last couple of months, my colleagues and I put Akita to the test.
Our Akita setup
After initially trying Akita out on test traffic, I started getting set up to try Akita in production. Because Akita works by automatically watching traffic, integration simply means running the Akita agent container alongside our production containers, where it can observe traffic; there is no code to integrate. The tradeoff is production resources rather than engineering time: the overhead hasn’t been significant for us, since the Akita agent doesn’t sit in the data path, but it does need CPU and memory to run.
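To make that concrete, here’s roughly what the sidecar arrangement looks like. This is a simplified, docker-compose-style sketch rather than our actual ECS task definition, and the agent image, command, and credential variables are based on Akita’s public docs as I remember them, so treat every name and number here as illustrative rather than authoritative.

```yaml
# Illustrative sidecar sketch only. Our real deployment is an ECS task
# definition; check Akita's documentation for the exact image and flags.
services:
  flickr-api:                          # hypothetical app container name
    image: example/flickr-api:latest   # placeholder image
    ports:
      - "8080:8080"

  akita-agent:
    image: akitasoftware/cli:latest     # Akita's agent container (per their docs)
    command: apidump --service flickr-api
    network_mode: "service:flickr-api"  # share the app's network namespace so
                                        # the agent can passively observe traffic
    environment:
      AKITA_API_KEY_ID: ${AKITA_API_KEY_ID}          # injected at deploy time
      AKITA_API_KEY_SECRET: ${AKITA_API_KEY_SECRET}
    deploy:
      resources:
        limits:
          cpus: "0.25"   # the agent is off the data path, but it still
          memory: 256M   # needs its own CPU and memory headroom
```

The important part is what’s missing: there’s no change to the application container at all. The agent just needs to be able to see the same network traffic.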
Here’s how our production deployment works. Our canary takes about one hundredth of production traffic, and because Flickr operates at scale, that’s still plenty to get a good sample of traffic. Traffic is split randomly across the production and canary clusters, making the canary, for all intents and purposes, a normal production environment. Deploys hit the canary first, so if things go sideways there, we can respond before there’s widespread user impact. Our goal was to run Akita on canary deployments to get a quick overview of our deploys.
To set up Akita, most of the bottlenecks turned out to be administrative rather than technical.
This almost feels anticlimactic (and it should; that's the whole point!) but after the plugin was written and we showed that we could get useful specifications, this is what happened on our side:
- My colleague Nick Scheiblauer wrote a project requirements doc detailing the work and the potential risks and rewards.
- Thus commenced a lot of meetings.
- Integrating Akita was simply a matter of following the instructions, along with a minor refactor of our config to factor canary out from production (there’s a sketch of this after the list).
- We spent a couple of weeks adjusting Akita’s rate limiting to meet scale demands.
DONE.
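For the config refactor and the rate-limiting step above, here’s a hedged sketch of how "agent on canary only" can be expressed with a compose-style override file. Again, our real setup is ECS, the rate-limit flag is from Akita’s docs as I recall them, and the file names and numbers are placeholders.

```yaml
# docker-compose.canary.yml -- hypothetical canary-only override.
# Production runs the base file alone; canary layers this file on top,
# so the Akita agent only ever runs next to canary containers.
services:
  akita-agent:
    image: akitasoftware/cli:latest
    # --rate-limit caps how much traffic the agent reports upstream; confirm
    # the exact flag name and a sensible value against Akita's current docs.
    command: apidump --service flickr-canary --rate-limit 1000
    network_mode: "service:flickr-api"
    environment:
      AKITA_API_KEY_ID: ${AKITA_API_KEY_ID}
      AKITA_API_KEY_SECRET: ${AKITA_API_KEY_SECRET}
```

Canary deploys would then run something like `docker compose -f docker-compose.yml -f docker-compose.canary.yml up -d` (or, in our case, register a separate ECS task definition), while production never references the override at all.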
One of our major takeaways here is that adding a container to the environment is a lot less effort than instrumenting endless amounts of code. As I mentioned, AWS X-Ray did not have a PHP client when we adopted it, so we needed to cobble together our own solution. Akita was a lot more straightforward than that. In practice, because teams are often a mix of application developers and systems engineers, any task that can be entirely done by a systems engineer is intrinsically easier than a task requiring a handoff. One person doing something within their wheelhouse is fast, and that’s how the Akita integration went down.
Now that it's integrated, it's clear how valuable this approach is. We’ve already gotten value out of Akita’s continuously generated API maps, and they’ve been helping us modernize our architecture. The best part is that the API maps and per-endpoint metrics stay up to date as we add new code, without us having to do anything.
Engineering availability is the most difficult constraint to overcome, even on large teams. A tool that makes engineers more effective, but costs a ton of engineering time to set up, can be really self-defeating. Yeah, it could be worth it in the long run. But the short run matters too!