The Surprising Complexities of Building Audit Logs
Audit logs are a feature every enterprise customer wants in every product they use. Customers need to know who changed which settings and at what time. They need to know when someone creates a user account in their company’s instance of your product, who accessed what data, and more. If something goes wrong, they need to track down what happened and who is responsible. They not only want these logs for their own internal usage, but they often also need them for compliance.
Originally posted November 2020 on the Split blog
Building Audit Logs at Split
I was lucky enough to be the technical lead for our project to vastly expand our audit logs at Split to include all administrative activity. We also added a webhook for subscribing to these admin audit logs. While we were already adding and exposing many event types, we also took the opportunity to standardize and simplify our entire system.
Our audit logs story is similar to many other companies out there. Initially, we got requests for audit logs for one or two small things. We were small and strapped for time, so we built something quickly without a detailed design. Over time, customers requested more and more audit log types and what was a small, simple system quickly became large, complicated, and convoluted. By the time my project came into the picture, we already had a few dozen audit log types. Things were already starting to get complex and were only going to worsen as we more than doubled the types of events we were logging. To keep things under control, we took the opportunity to simplify what we already had and to plan for the future. We learned a lot in the process and what follows are our most important takeaways.
Audit Logs Are Forever
One of the tricky things about audit logs is that they last forever. Once a user creates a log, most companies want that log for years, often seven or more if they care about compliance. You absolutely cannot delete that log, lose that log, or stop showing that log because it is in an old format. If a record disappears for any reason, even if you still have the data, you risk providing incomplete and misleading information and upsetting your customers.
Because you need to continue to show old logs, if you ever change your log format, you will either need a database migration (to update your old logs) or conversion code to convert old data. Neither of these options is ideal. If you have a considerable volume of data, running a migration will be slow and challenging. If you instead write conversion code, you will have unneeded complexity and indirection in your codebase. If you change formats more than once, this code will quickly become unmanageable.
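To make the conversion-code option concrete, here is a minimal sketch in Python. The three record formats, their field names, and the `upgrade` helper are all hypothetical and are not Split's actual schema; the point is only to show how each format change adds another link to the chain that every old record must pass through when it is read.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]

CURRENT_VERSION = 3


def _v1_to_v2(record: Record) -> Record:
    # Hypothetical v1 stored flat "user" and "type" keys; v2 nests the actor.
    return {
        "version": 2,
        "actor": {"id": record["user"]},
        "action": record["type"],
        "timestamp": record["timestamp"],
    }


def _v2_to_v3(record: Record) -> Record:
    # Hypothetical v3 adds a list of related objects; old logs never recorded any.
    upgraded = dict(record)
    upgraded["version"] = 3
    upgraded["targets"] = []
    return upgraded


# One converter per historical format. This table grows with every change.
UPGRADERS: Dict[int, Callable[[Record], Record]] = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(record: Record) -> Record:
    """Convert a stored record of any known version to the current format."""
    version = record.get("version", 1)  # assume unversioned records are v1
    while version < CURRENT_VERSION:
        record = UPGRADERS[version](record)
        version = record["version"]
    return record
```

Every read path (UI, API, export) has to call something like `upgrade`, which is the indirection described above. A second or third format change means another converter to write, test, and keep for as long as the oldest log exists.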
Design For Generic
As I mentioned, it is common for products to start with a few logs, then add a few more, and a few more. Before our most recent change, Split had logs for splits, segments, and metrics. We were also logging many more events without exposing them. However, even if you already have logs for all of your functionality, this is only the start. As you add new features, you will be continuously adding new logs.
You will never get a chance to look at all of your logs at once. At best, you can look at what you have now and what you anticipate having. Therefore, it is easy to end up with different event types, data stored in unique ways, and inconsistencies in general. The abundance of event types and inconsistencies is problematic for you and your customers. A plethora of inconsistent logs results in complex branching code to support it.
Initially, logs are deceptively simple. They are often thought of as an action taken on an object by a subject. However, sometimes there are other objects involved. For example, if I move a file from one folder to another, I also want to record those folders. Now I have three objects involved. Furthermore, how do I categorize the action itself? For this case, would it be something like folder.addFile, or both? What constitutes an object? If I add an API key tied to a given user, is the action user.addApiKey? Does it matter if the user who created the apiKey is different from the user who will own the apiKey?
With an increasing number of complex scenarios, it can be very tempting to have a bunch of unique log structures. However, even if you find a way to store and process logs with a variety of data structures, you will still need to find a way to display all of those logs consistently and understandably in your UI. Additionally, if you build an API for your logs, you will need to find a way to make your logs consistent in that API response. No developer wants to use an API that has different response types for every single log type, not to mention that developer has no idea what response types you might add in the future.
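One way to picture a generic structure is a single envelope that every log type shares, with the objects involved stored as a list of references tagged by role. The sketch below is an illustration under my own assumptions, not the format Split shipped: the class names, field names, role labels, and action strings such as `file.move` are all hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ObjectRef:
    """A reference to anything involved in an event (user, file, folder, ...)."""
    type: str
    id: str
    role: str = "target"  # hypothetical roles: target, source, destination, owner


@dataclass(frozen=True)
class AuditEvent:
    """One envelope for every log type; only the values differ."""
    id: str
    timestamp_ms: int
    actor: ObjectRef
    action: str
    objects: List[ObjectRef] = field(default_factory=list)
    details: Dict[str, str] = field(default_factory=dict)


def to_api_response(event: AuditEvent) -> Dict[str, Any]:
    """Serialize any event to the same response shape."""
    return asdict(event)


# Moving a file involves three objects, yet needs no special structure.
move = AuditEvent(
    id="evt_1",
    timestamp_ms=1604188800000,
    actor=ObjectRef("user", "u_42", role="actor"),
    action="file.move",
    objects=[
        ObjectRef("file", "f_1"),
        ObjectRef("folder", "dir_a", role="source"),
        ObjectRef("folder", "dir_b", role="destination"),
    ],
)

# An API key created by one user and owned by another fits the same envelope.
api_key = AuditEvent(
    id="evt_2",
    timestamp_ms=1604188900000,
    actor=ObjectRef("user", "u_42", role="actor"),
    action="apiKey.create",
    objects=[ObjectRef("apiKey", "k_7"), ObjectRef("user", "u_99", role="owner")],
)
```

Because both events serialize to the same set of keys, a UI component or an API client can handle a log type it has never seen before without new branching code.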
The other key reason to design as generically as possible is related to the fact that your logs last forever. Even if you have data structures that are flexible enough to cover what you currently log, you may still run into trouble when you add more log types. When this happens, you will need to either migrate your data or repurpose a data structure in a way it was never intended to be used.
Design for API
Even if you don’t get the request for APIs immediately for audit logs, that request is definitely coming. Customers want two main API types: an events stream, such as the webhooks we built as a part of this release, and an on-demand search API. A third common ask is a bulk download. While bulk download could be built without an API, it would still require a consistent format for the logs themselves.
In the webhook use case, customers want to take actions programmatically based on what happens in your product. Perhaps they want to send particular actions to their analytics system or create a Jira ticket every time a user adds a split. Meanwhile, for the search API and the bulk download use cases, they want to be able to run audits on their usage. If something goes wrong in their account, they want to track down what happened and who caused it. Even if these are not part of your initial audit log requirements, they are coming, so design for all of these API use cases up front.
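From the customer's side, consuming a webhook usually comes down to routing each incoming event by its type. The following is a rough sketch of such a consumer, assuming each payload is a dictionary with `id` and `action` keys; the `WebhookRouter` class and the `split.create` action name are hypothetical and not part of any Split API.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Payload = Dict[str, Any]
Handler = Callable[[Payload], None]


class WebhookRouter:
    """Routes incoming audit log events to handlers registered per action."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, action: str, handler: Handler) -> None:
        self._handlers[action].append(handler)

    def dispatch(self, payload: Payload) -> int:
        """Run every handler for the payload's action; return how many ran."""
        handlers = self._handlers.get(payload.get("action", ""), [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


# Stand-in for "create a Jira ticket every time a user adds a split".
tickets: List[str] = []
router = WebhookRouter()
router.on("split.create", lambda event: tickets.append(event["id"]))

router.dispatch({"id": "evt_1", "action": "split.create"})  # handled
router.dispatch({"id": "evt_2", "action": "user.create"})   # ignored
```

A consumer like this only stays simple if every event carries its type in the same place, which is one more reason to settle on a consistent format before exposing the webhook.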
Similar to logs living forever, APIs live forever. When you combine the two, audit log APIs are pretty much immortal. Because there is a strong use case for using audit log APIs, they will get used. As more customers use your APIs, it becomes very hard to change these APIs. Ever.
All of this is to say that taking a couple of extra weeks to design your APIs now can save you a lot of headaches later. Also, since these APIs will get used, their design will either please or frustrate your customers.
Focus On Audit Log Use Cases
In many companies, the audit log use case is slightly different from other product use cases. It is easy to think that audit logs have pages of results similar to everywhere else in the product. It can be tempting to treat these pages of results the same way you do elsewhere. Furthermore, it is easy to reuse front-end components, back-end components, and even infrastructure. In fact, this is what you usually want to do. However, there are a few key differentiators between audit logs and most of the other use cases at Split. These differentiators mean that using different components and infrastructure may make sense.
The first is volume. I will have far more logs than any other object. Assuming that I log (at a minimum) object creation, update, and deletion, it is guaranteed that I have at least one log per application object even if nothing is ever updated or deleted. In some cases, we also want access logs — who viewed those items? You can see how this quickly adds up, and we are guaranteed to have many more logs than any other object type in our system¹.
The second key difference is the access pattern. When I search for a file in a folder, I may want to sort by created date, file name, or the date it was last updated. Likewise, if I view all of my feature flags, I want to be able to sort on several different attributes — name, created time, when the rollout last changed, how much traffic it is getting, and more. For logs, however, I rarely want to sort by anything other than date (with the newest first). Similarly, unless I am downloading all logs associated with some particular audit event, it is unlikely that I care about more logs than the first few pages of results. If I do want older logs, I want to be able to filter down to a particular time window before I look at any results. In the rare cases where I want more than a couple of pages of results, I will want those results programmatically from an API, not through the UI. It is impractical to go through that much data manually. Audit logs are a rare instance where filtering alone, and not sorting, can solve the use case.
Digging into this access pattern a bit more: if a user is downloading logs from an API, there is a good chance they are downloading a large volume. Pagination is an important mechanism to protect your storage systems. In most cases, a user will not page through more than a few pages, so the pagination style largely doesn’t matter, but accessing logs through the API is one case where a user may access a huge number of pages. In these cases, marker-based pagination provides better performance than offset-based pagination, because the database can seek directly to the marker instead of reading and discarding every row that comes before the requested offset.
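As an illustration of marker-based pagination, here is a small sketch using Python's built-in `sqlite3` module. The table name, columns, and the `fetch_page` helper are assumptions made for the example. The marker is the `(timestamp, id)` pair of the last row already returned, and the id acts as a tie-breaker for events that share a timestamp.

```python
import sqlite3
from typing import List, Optional, Tuple

Row = Tuple[int, int, str]   # (id, ts, action)
Marker = Tuple[int, int]     # (ts, id) of the last row already returned


def make_db(n: int) -> sqlite3.Connection:
    """Build an in-memory table of n sample logs; pairs of rows share a timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_log ("
        "id INTEGER PRIMARY KEY, ts INTEGER NOT NULL, action TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_audit_ts_id ON audit_log (ts, id)")
    conn.executemany(
        "INSERT INTO audit_log (id, ts, action) VALUES (?, ?, ?)",
        [(i, 1000 + i // 2, "split.update") for i in range(1, n + 1)],
    )
    return conn


def fetch_page(
    conn: sqlite3.Connection, limit: int, marker: Optional[Marker] = None
) -> Tuple[List[Row], Optional[Marker]]:
    """Return one page, newest first, plus the marker for the next page."""
    if marker is None:
        rows = conn.execute(
            "SELECT id, ts, action FROM audit_log "
            "ORDER BY ts DESC, id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        ts, last_id = marker
        # Seek past the marker instead of counting rows with OFFSET.
        rows = conn.execute(
            "SELECT id, ts, action FROM audit_log "
            "WHERE ts < ? OR (ts = ? AND id < ?) "
            "ORDER BY ts DESC, id DESC LIMIT ?",
            (ts, ts, last_id, limit),
        ).fetchall()
    # A short page means there is nothing left to fetch.
    next_marker = (rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_marker
```

A client keeps calling `fetch_page` with the marker it received until the marker comes back as `None`. Unlike an offset, the marker also stays valid when new logs arrive at the top of the list between requests, so pages neither skip nor repeat rows.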
Within log types, there can even be different use cases, data volumes, and access patterns. For example, at Split, we have general and admin logs. General logs are related to what a user did to splits, segments, and metrics. These events include changing the rollout percentage on a split or adding a userId to a segment. Admin logs, meanwhile, are associated with anything that happens within the admin panel. These events include creating a user, turning on a 2FA requirement, or configuring an integration. As you can imagine, the general logs have a significantly higher volume. They are also more likely to be used for things like an events engine both for internal and external applications. For these logs, any given use case typically only cares about super recent logs and only a very filtered subset of log types (for example, only split changes). By comparison, the admin logs happen less frequently and are accessed much less often. They are primarily accessed for debugging or for audit purposes.
These differences may not matter at lower volumes, and creating interfaces that allow for lots of flexibility is useful. However, when you reach higher volumes, some of these differences matter. Some things to consider here include how you are storing the data — if the logs are living for a long time, that can be a lot of data, which can get costly. Would it make sense to only have the last month or two in fast-access storage? Many companies may require you to keep 7 or 10 years of logs, but do they need instant access to those? Or is it good enough if you can send the logs to them in an hour? Does it have to be through the UI? Can older logs be accessed in some other way? What guarantees do you need around that log getting delivered to the webhook? Is there an SLA on when that delivery needs to happen relative to the time of the event? While it is nice to sort results on basically any field, sorting, particularly in conjunction with paging, can become extremely difficult on large data sets. Can you get away with only sorting by date?
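The storage question can be sketched as a simple age-based split. The example below assumes a 60-day fast-access window (one reading of "the last month or two") and two tiers named `hot` and `archive`; both the window and the tier names are illustrative choices, not Split's configuration.

```python
from datetime import datetime, timedelta, timezone
from typing import List

HOT_WINDOW = timedelta(days=60)  # assumed fast-access window


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Decide where a log of the given age should live."""
    return "hot" if now - event_time <= HOT_WINDOW else "archive"


def plan_query(start: datetime, end: datetime, now: datetime) -> List[str]:
    """Return the tiers a query over the window [start, end] has to touch."""
    cutoff = now - HOT_WINDOW
    tiers: List[str] = []
    if end >= cutoff:
        tiers.append("hot")
    if start < cutoff:
        tiers.append("archive")
    return tiers


now = datetime(2020, 11, 1, tzinfo=timezone.utc)
recent = plan_query(now - timedelta(days=7), now, now)
historic = plan_query(now - timedelta(days=400), now - timedelta(days=300), now)
```

Because users filter to a time window before looking at results, a planner like this can answer most UI requests from fast storage alone and route the rare historical query to a slower, cheaper path such as an asynchronous export.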
Take the Time to Get Logging Right
Audit logs are an important, powerful, and often requested feature. They are trickier than they seem at first and are extremely hard to change later. Therefore, when you are building them, it is worth taking a little extra time to make sure you get the design right. It is challenging to find a generic design that also works well for APIs, but taking the time to do so can save you a lot of trouble and headaches later.
¹It may be worth noting that we actually have more impressions and events, which are both effectively customer-produced logs. We do not create logging events for these logs and therefore they actually have more objects in our system than our audit logs. They are a special case though.