Congratulations, It’s Your Problem Now: Legacy Systems and API Observability
There are a host of reasons you might find yourself managing the production deployment for a system you didn’t write.
Maybe you’ve newly joined a team and are responsible for keeping a software product healthy. You’ve inherited a pile of code and are scrambling to figure out exactly what you’re responsible for and what your customers actually use.
Maybe the people who originally built your product have left your organization—and with them the institutional knowledge of how they kept it working. Maybe they’ve gone as far as instrumenting the code, but nobody really understands what all those metrics mean. How do you tell whether everything is OK?
Maybe your team doesn't have the devops headcount that you really need. Your organization can’t afford a dedicated ops team or spare a developer to babysit a troublesome service. You need a way to focus on what really matters, but without spending a quarter building the right dashboards.
At Akita, we’ve found that it’s easy for developers to end up under-resourced for the scope and expectations associated with the service they’re running, especially if they’re managing production code they did not write. And we’ve heard a common set of stories from what we call the “99% developers,” the developers working outside of FAANG (Facebook, Amazon, Apple, Netflix, and Google) and outside of what the developer-influencers are heralding. You need answers, not another JIRA ticket to “implement observability.” And you’re not alone.
In this post, I’ll talk about common issues I’ve seen among developers who have become responsible for maintaining other people’s code in production. I’ll talk about how API observability, especially as my team and I are working on it, addresses the issue of running legacy services. I believe this is especially important to talk about because few people seem to realize how common the problems are!
Understanding your “legacy” system
Are you responsible for a few key APIs, or a sprawling set of dozens or even hundreds of endpoints? Do you know the entire set of endpoints your users interact with?
It seems almost silly to ask. Some API frameworks are self-documenting and you can download an OpenAPI spec directly. Others will have a clearly defined router and you’ll be able to walk your way down a list.
But, as we’ve noticed with our users, legacy systems can contain plenty of surprises, even if “legacy” means they’re just a few months old. Maybe your middleware is intercepting or rewriting certain paths. Perhaps the paths are annotated on methods spread across hundreds of files, and only at run time is it clear which are actually available for use. Worse, your API dispatch could be a tangled mess of regular expressions and nested branches left behind by an engineer solving a problem whose context has long since been lost. Or you simply aren’t using a toolchain that makes it easy to see your routes.
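If the framework you’ve inherited does expose its routing table at run time, a quick script can give you a first honest inventory. Here’s a minimal sketch assuming a Flask application, with a hypothetical `myservice` module name; other frameworks have similar introspection hooks:

```python
# list_routes.py -- dump every route the framework knows about at run time.
# Assumes a Flask app importable as `app` from `myservice` (a hypothetical name;
# substitute your own application module).
from myservice import app

for rule in app.url_map.iter_rules():
    methods = sorted(rule.methods - {"HEAD", "OPTIONS"})
    print(f"{', '.join(methods):20s} {rule.rule}")
```

Even this only tells you what the framework registered; it won’t show you the paths your middleware rewrites or swallows, which is exactly where observing the running system earns its keep.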
Knowing the set of endpoints is usually not enough, though. You will want to know which API endpoints actually see use. Which are common, and which are rare? Is something still in active use, or can you sunset it?
For example, at one of my former companies, our UI was originally written against a SOAP backend and gradually migrated to REST APIs. Not even the UI engineers could be certain that all the SOAP calls were gone without turning them off in the backend and seeing what happened!
If there’s one thing we’ve learned over the course of working on our API observability tool at Akita, it’s that it’s much harder than people might hope or expect to even understand what endpoints are in use and what the usage is like.
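Where you have access logs, even a blunt tally is a useful first read on usage. A rough sketch, assuming a combined-format log file (note that it counts `/users/42` and `/users/99` as separate paths, a problem real endpoint inference has to solve):

```python
# endpoint_tally.py -- count requests per (method, path) from an access log.
# Assumes a Common/Combined Log Format file named access.log (illustrative path).
import re
from collections import Counter

REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')

counts = Counter()
with open("access.log") as f:
    for line in f:
        match = REQUEST.search(line)
        if match:
            counts[(match.group("method"), match.group("path"))] += 1

for (method, path), n in counts.most_common(20):
    print(f"{n:8d}  {method} {path}")
```

The long tail of that list is usually where the “can we sunset this?” conversations start.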
Understanding which endpoints need attention
Understanding the urgent issues in one’s system, especially a system written by other people, is not easy. Your service as a whole might look fine, but that could disguise a high error rate in one or two important APIs, as described in my blog post, “The Case for API-Centric Observability.” A 5% error rate during checkout is really bad news, even if your site as a whole is 99.9% reliable. Similarly, a good average latency across the entire service may disguise very high latencies or timeouts that are causing user frustration.
On the other hand, we’ve also heard from users who know a particular API is bad and have already made the decision that it doesn’t matter to their business. You will need to be able to exclude that endpoint from your overall picture of system reliability.
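One way to keep both concerns straight is to compute error rates per endpoint rather than per service, with room to exclude the endpoints you’ve consciously decided not to fix. A sketch, assuming you already have per-request records with an endpoint label and a status code (the record shape and the excluded endpoint are made up for illustration):

```python
# error_rates.py -- per-endpoint error rates, with a deliberate exclusion list.
from collections import defaultdict

EXCLUDED = {"GET /legacy/report"}  # known-bad endpoints you've decided don't matter

def error_rates(requests):
    totals, errors = defaultdict(int), defaultdict(int)
    for record in requests:
        endpoint = record["endpoint"]
        totals[endpoint] += 1
        if record["status"] >= 500:
            errors[endpoint] += 1
    # A service-wide 99.9% success rate can hide a 5% failure rate on checkout
    # when checkout is a small slice of total traffic.
    return {ep: errors[ep] / totals[ep] for ep in totals if ep not in EXCLUDED}

sample = [
    {"endpoint": "POST /checkout", "status": 500},
    {"endpoint": "POST /checkout", "status": 200},
    {"endpoint": "GET /health", "status": 200},
]
print(error_rates(sample))  # {'POST /checkout': 0.5, 'GET /health': 0.0}
```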
Once the most egregious examples are identified and taken care of – or, let’s be honest, put on a list to take care of – you will probably still have a lot of APIs whose behavior you don’t entirely understand. What does normal look like in terms of call volume, errors, and latency? You’d like to know when one of your endpoints starts behaving differently.
As a real-world example, we recently rolled out a new API client, and a well-behaved API started becoming less well-behaved. Although its relative error rate was still small (less than 0.03%), this was way more than the usual number of errors. The new client had exposed a latent bug in the API, and our monitoring told us something had changed.
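The check that catches this kind of shift doesn’t need to be sophisticated. Here’s a minimal sketch of the idea, comparing an endpoint’s current error rate against its own trailing baseline (the window size and threshold are arbitrary illustrations, not tuned values):

```python
# drift_check.py -- flag an endpoint whose error rate departs from its own history.
from statistics import mean, stdev

def is_anomalous(history, current, min_sigma=3.0):
    """history: recent daily error rates for one endpoint; current: today's rate."""
    if len(history) < 7:  # not enough history to establish a baseline
        return False
    mu, sigma = mean(history), stdev(history)
    # Even a tiny absolute rate (0.03%) can sit many sigma above a near-zero baseline.
    return current > mu + min_sigma * max(sigma, 1e-6)

print(is_anomalous([1e-4, 2e-4, 1e-4, 1e-4, 2e-4, 1e-4, 1e-4], 3e-4))  # True
```

Of course, hand-tuning checks like this for hundreds of endpoints is exactly the kind of busywork you’d rather not own.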
Especially in a system you did not write, answering these questions can be quite challenging! Even in systems you did write, my take is that it’s not your job to educate the monitoring tool on what is usual or unusual – it’s the monitoring tool’s job to tell you.
Getting your head around what you don’t know
Here’s a pattern I’ve noticed when people talk about instrumenting their systems for observability:
1. The production system has an incident.
2. A developer instruments their code with additional metrics, logs, or trace spans in order to find the error (see the sketch after this list).
3. The team identifies a root cause, and is now protected against this specific kind of incident in the future.
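Step 2 usually looks something like the following, shown here with OpenTelemetry as one concrete option; the span name, attribute, and `apply_discounts` helper are invented for illustration:

```python
# Assumes the opentelemetry-api package and an already-configured tracer provider.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def apply_discounts(cart):
    ...  # stand-in for the existing (hypothetical) business logic

def handle_checkout(cart):
    # Added after the incident: wrap the suspect section in a span so the next
    # person can see where the time goes and what kind of cart triggered it.
    with tracer.start_as_current_span("checkout.apply_discounts") as span:
        span.set_attribute("cart.item_count", len(cart.items))
        return apply_discounts(cart)
```

The catch is that you can only add a span like this once you already have a rough idea of where to look.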
Incidents can be particularly challenging when they happen in systems you’re not entirely familiar with. You may not have the immediate know-how to add the instrumentation that would find the problem, or you may be unfamiliar with the semantics of the instrumentation left over from the last time a developer went through this process. And when you’re working with a legacy system, properties of the code or technology stack itself can make it impossible to add that level of instrumentation quickly.
Here are a few reasons why it’s hard to track down issues in any system and especially legacy systems:
- Multiple endpoints are likely involved. It is rare for a client to need only a single API endpoint. Most pages are assembled from several API calls. Or, a user transaction spanning multiple pages will be connected together by a sequence of API calls. Ultimately, to improve performance and user experience, you’ll need to understand these patterns in client behavior that span multiple endpoints.
- Multiple components are likely involved. Your APIs probably depend on other APIs – ones for the SaaS services you use, or the infrastructure on which you run, or the databases you use for storage. When services are built from a variety of in-house and external libraries, they contain dependencies that are buried or nonobvious. In time, you’ll need to understand how these dependencies affect your system’s performance, and what your critical path looks like. (And not all those dependencies are under your control, so you can’t put OpenTelemetry spans in everything!)
- Multiple endpoint usage conventions are likely involved. The various endpoints in a service should work together as a unit and provide consistent interfaces for their clients. But your API will often contain more than one set of competing conventions – for example, sometimes a timestamp is a string and other times it’s an epoch time given as a number (see the sketch after this list). Many cross-component issues come from misunderstandings about how interfaces are to be used.
- Subtle invariants may be the culprit. Sometimes, the crucial piece of information is not which API is being called but what parameters or fields are present. An endpoint may misbehave because an optional value was included, or a list was too long. Once upon a time, your team may have known “don’t do that unless X” or “avoid calling Y before Z”, but that institutional knowledge has probably faded. Without some way to introspect the behavior of your API, these pitfalls may take a long time to puzzle out.
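To make the conventions point concrete, here is the kind of defensive glue that accumulates when one endpoint returns a timestamp as a string and another as an epoch number; the response shapes below are invented examples of the mismatch, not a real API:

```python
# Two endpoints in the same service, two conventions for the same concept.
from datetime import datetime, timezone

def parse_created_at(value):
    """Accept either an epoch number or an ISO-formatted string timestamp."""
    if isinstance(value, (int, float)):        # e.g. 1700000000
        return datetime.fromtimestamp(value, tz=timezone.utc)
    return datetime.fromisoformat(value)       # e.g. "2023-11-14T22:13:20+00:00"

order = {"created_at": 1700000000}                    # one endpoint's convention
refund = {"created_at": "2023-11-14T22:13:20+00:00"}  # another endpoint's convention

print(parse_created_at(order["created_at"]))
print(parse_created_at(refund["created_at"]))
```

Helpers like this paper over the mismatch, but they also hide it, and the misunderstanding resurfaces the next time someone writes a client against the “other” convention.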
How drop-in API observability helps with legacy systems
As I’ve written about before, I believe that a solution for bridging the API monitoring gap must:
1. focus on API endpoints,
2. focus on developer needs,
3. be drop-in, and
4. work across frameworks.
These requirements help any team understand their systems better, but they are crucial if you weren’t the original author of the code you’re running. If you’re the one responsible for an API-based system, especially a legacy system, then you need a developer tool – not a business analytics tool – that tells you which endpoints need your attention. But the last thing you want to do is start your tenure by making a bunch of changes; you need something you can use without first having to deeply understand the code.
The API observability solution that my team and I have been building at Akita is designed to make it easy to understand any system, whether you wrote it yourself or not. Our solution is a drop-in agent that passively observes network traffic – all traffic, to all API endpoints, no matter their implementation technology. Setting up the Akita agent requires no code changes and no proxies.
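To give a feel for what passive observation means in practice, here is a toy sketch of the general technique (not Akita’s implementation): watch a service’s port and print the request lines that go by. It assumes the scapy package, packet-capture privileges, and unencrypted HTTP on port 8080:

```python
# passive_peek.py -- toy illustration of observing API traffic without touching the service.
from scapy.all import Raw, sniff

def handle(pkt):
    if not pkt.haslayer(Raw):
        return
    first_line = bytes(pkt[Raw].load).split(b"\r\n", 1)[0]
    parts = first_line.split(b" ")
    # An HTTP request line looks like: GET /users/42 HTTP/1.1
    if len(parts) == 3 and parts[2].startswith(b"HTTP/"):
        print(parts[0].decode(errors="replace"), parts[1].decode(errors="replace"))

sniff(filter="tcp dst port 8080", prn=handle, store=False)
```

A production agent has to do far more than this (reassemble streams, parse responses, infer endpoint structure), but the shape of the approach is the same: observe the traffic, leave the code alone.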
Even if you’re unfamiliar with the underlying code, Akita’s passive observation shows you the truth about what’s actually in your API, so that you know the scope of your responsibility. Akita also shows you a service graph that maps the patterns of communication across services in your organization, as well as per-endpoint call volume, latency, and errors, so you can tell what’s behaving normally versus what needs attention. In our beta, we’re actively working on better ways to automatically recommend and set thresholds for the hundreds of endpoints you have to manage. This gives you the confidence to deploy, knowing that even less-frequently-used endpoints are being monitored and deviations will be highlighted.
Whether you’re taking over an existing API, or maintaining a service with fewer team members, or building a system from scratch with limited resources – we’ve been where you are. Our goal is to tell you the things that are most critical for you to do your job: the APIs you are responsible for, their behavior, and how they’re being used. And we believe we’re part of a movement of new tools that work with your existing systems, without requiring you to touch every part of your code in order to get the monitoring and observability you need.
If you’d like help getting a handle on your legacy system, we’d love to have you join our beta.