All Things Authorization
A Brief Overview of Split’s Authorization Framework Investigation
I’ve previously written several articles on Box’s rewrite of their authorization framework. I was on the team there that evaluated the landscape, chose a solution and built out the initial implementation. I previously discussed at a high level what access control frameworks are, took a more in-depth look at what Box chose, and explained how we made the decision to go with Balana. Now that I’m working at Split, I find myself again exploring options to revamp an authorization framework. While I already had deep knowledge of the landscape, I wanted to make sure I did my due diligence before pushing us toward any solution. Many things are unchanged from when we did the research at Box, but as with any technology, some things have changed.
I’ve found that people often confuse authentication (sometimes called authN) and authorization (sometimes called authZ), and it really doesn’t help that the words are so similar. I even occasionally mix up the words myself. Authentication is the process of verifying the identity of the user, while authorization is determining the access rights or privileges that user has to given resources. To use an example, when you get carded going into a bar and the bouncer looks at the picture and tries to decide if the ID is yours, that’s authentication. When the bouncer then looks at your birth date, that’s the bouncer trying to decide if he should authorize you to access the bar. Access control is the restriction of access, while access management is the process of restricting access. Identity and Access Management (also called Identity Management) is the framework of policies and technologies encompassing both authentication and authorization. It’s worth noting that while these are the definitions of these terms, they are often used interchangeably and imprecisely.
When I started the project at Split, I looked at both authorization and authentication. I found that while there are a number of products that do both, most products focus on one or the other. I quickly determined that while we had needs in both areas, authorization was a more pressing issue for us. Given that, I largely ignored authentication in this exploration.
Before I get further into the details, there are three levels of authorization when building most applications: the user-facing authorization features, the application-level framework, and the underlying system and infrastructure authorization. User-facing authorization features include things like allowing a user to make their profile public or private or allowing a user to add a collaborator on a file. These are features that allow end users to set limited permissions on objects in the application. They change the access to a particular object, but the user does so through some feature that is a part of the product. Meanwhile, the application-level framework is the mechanism by which we decide if the current user can perform a given action on an item. For example, can I actually read that file I just asked for? Can I update it? Can I change the owner? This takes into account anything set by the user-facing features but can also take into account a number of other things as well. System and infrastructure authorization is the authorization that determines which of our developers are allowed to update config on a machine or which servers are allowed to access which other servers. This is the internal authorization for the systems and infrastructure that run our application and is not a part of the actual code we ship.
Within a single application, all three of these levels of authorization will likely be stored in different ways, and that’s fine. Both the project at Box and the project at Split focused on the application-level framework, so that is what I’m going to focus on in this post. It’s worth noting quickly that just because I pick one type of access control for this framework doesn’t mean that I can’t build user-facing features utilizing something different. The only caveat is that the user-facing features can only be as flexible as the application-level framework, so if I pick a less flexible framework for my application level, it will be extremely difficult to build something more complex as a user-facing feature on top of it. This means that we still need to take into account the user-facing features, but we won’t need to rewrite any existing features, and we can largely ignore the system and infrastructure level needs when picking an application-level framework.
Types of Access Control
There are two dimensions over which access control types are categorized. The first is based on who has access to or controls the policies themselves. On this dimension, there are two main types — Mandatory Access Control (MAC) and Discretionary Access Control (DAC). With mandatory access control, all policies are controlled by a central policy administrator and individual users cannot override policies. In most applications, this would take the form of an administrative user restricting access for their organization in some way. For example, in Box, an enterprise admin can prevent users in their enterprise from uploading files at their root level. Individual users could not override this. Meanwhile, with discretionary access control, users can grant access to objects to other users. The formal definition doesn’t say anything about owners, but in most systems, this takes the form of objects having an owner and that owner granting or removing access to others. To again use the Box example, when I upload a file to my folder, I own that file. I can then add a friend as a collaborator. That friend now has access because I’ve granted it to them. It’s worth noting that while mandatory and discretionary access control are distinct, they are often both present within an application. Any authorization framework you use will likely need to be able to handle both (and most of them do).
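To make the distinction concrete, here is a minimal Python sketch of the Box example above. Everything in it is an assumption for illustration: the `Enterprise` and `Folder` classes and the `can_upload` function are hypothetical names, not Box’s actual data model or API. The admin setting plays the mandatory role, and the owner-managed collaborator list plays the discretionary role.

```python
from dataclasses import dataclass, field


@dataclass
class Enterprise:
    # Mandatory policy: set by an admin, cannot be overridden by users.
    block_root_uploads: bool = False


@dataclass
class Folder:
    owner: str
    is_root: bool = False
    # Discretionary policy: the owner decides who else gets access.
    collaborators: set = field(default_factory=set)

    def add_collaborator(self, acting_user: str, new_user: str) -> None:
        if acting_user != self.owner:
            raise PermissionError("only the owner may grant access")
        self.collaborators.add(new_user)


def can_upload(user: str, folder: Folder, enterprise: Enterprise) -> bool:
    # The mandatory check runs first and wins over any user-granted access.
    if folder.is_root and enterprise.block_root_uploads:
        return False
    return user == folder.owner or user in folder.collaborators
```

Note that a single check consults both kinds of policy, which is why a framework usually has to handle the two together.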
The other dimension that categorizes access control is based on how the policies themselves are written or modeled. The types you’ll hear of in this space include:
- Rule-Set Based Access Control (RSBAC)
- Policy Based Access Control (PBAC)
- Access Control Lists (ACLs)
- Role Based Access Control (RBAC)
- Attribute Based Access Control (ABAC)
- AWS Identity and Access Management (IAM): this one isn’t quite a framework the way the others are, but it comes up enough when people talk about this space that I do want to talk a bit about it.
Rule-Set Based Access Control is an implementation of both MAC and RBAC and is specific to current Linux kernels, so I’m not going to get into it here.
Policy Based Access Control is a rebranding of ABAC. I’ve come across sites arguing that they have an awesome PBAC system that is so much better than ABAC and that their data is much more flexible than ABAC attributes, etc. Don’t be fooled, they’re just new names for the same things. It is true that ABAC is typically implemented using the XACML standard, which traditionally uses XML, but that isn’t actually part of the ABAC definition. If you come across any PBAC solutions, they’re going to be the same as ABAC from a conceptual standpoint, so I won’t spend more time talking about PBAC here.
Access Control Lists or ACLs store the access of each user or group per object. This means that lookup is super fast because, assuming I’ve set up my indexes correctly, I can look up by objectId and userId to get a near instantaneous response. There are two major problems with ACLs. One is that if you have a lot of users, a lot of objects and a lot of permission types, you quickly end up needing to store a very large amount of data. The most common use-case for ACLs is file systems, where typically there are at most a few users with access, so this isn’t a problem there. The other major problem with ACLs is that if you update one thing (say Organization B didn’t pay and we want to disable their account), we suddenly need to figure out all of the objects that might be affected by that change and update the stored permissions for each of those objects. This means that for a single action, we could end up modifying thousands, if not millions, of records. This is both slow and error prone. If it’s too slow, we also end up with a security vulnerability where there is a window of time between when the user thought access was revoked and when it’s actually gone.
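A small Python sketch shows both sides of that trade-off. It assumes an in-memory dictionary standing in for an indexed table, and the names (`acl`, `user_org`, `disable_org`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# (object_id, user_id) -> set of permissions; the key acts like a composite index.
acl = defaultdict(set)
# Hypothetical lookup table mapping users to their organization.
user_org = {"ann": "org_a", "bo": "org_b", "bea": "org_b"}


def grant(object_id: str, user_id: str, permission: str) -> None:
    acl[(object_id, user_id)].add(permission)


def is_allowed(object_id: str, user_id: str, permission: str) -> bool:
    # Single keyed lookup: fast regardless of how many entries exist.
    return permission in acl.get((object_id, user_id), set())


def disable_org(org_id: str) -> int:
    # Revoking an organization means finding and rewriting every stored entry.
    stale = [key for key in acl if user_org.get(key[1]) == org_id]
    for key in stale:
        del acl[key]
    return len(stale)
```

The check is one lookup, but `disable_org` has to touch every affected record, and until it finishes, the old entries still grant access.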
Role Based Access Control or RBAC is one of the most common access control frameworks. In RBAC, you create roles, assign users to those roles and associate those roles with sets of permissions. For example, I might have a manager role that has access to particular things that a manager should be able to do. I might also have an engineer role that has access to everything an engineer should be able to do. I can add people to one or both of these roles. If they’re added to both, they get the union of the permissions allowed to either. This allows you to update the permissions for everyone with a particular role very quickly. Likewise, it allows you to remove or add all needed permissions to a user very quickly: when someone is promoted to manager, I just add that role to them and they immediately have access to all of the things they should. One of the big problems with RBAC comes if you try to model something like ACLs as a user-facing feature on top. With a feature like that, the question is no longer whether I have permission to edit everything, but whether I can edit one particular item. Because users can pick and choose which items I can edit, representing exactly what I can do may require a very specific role that allows access to item3 and item6 but nothing else, and I easily end up with a huge explosion of roles. RBAC often works very well for systems and infrastructure level authorization, and it can work well for simpler authorization schemes at the framework level. However, for both Box and Split, it wasn’t flexible enough for our use-cases. As a side note here, despite the fact I’ve never picked RBAC as a solution, if you can make it fit your needs and don’t anticipate any future use-cases where it will be problematic, I would highly recommend using RBAC.
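As a rough illustration of the manager and engineer example, here is a minimal Python sketch. The role names come from the text, but the permission names and the helper functions are hypothetical.

```python
# Hypothetical permission names, for illustration only.
ROLE_PERMISSIONS = {
    "manager": {"approve_expense", "view_reports"},
    "engineer": {"deploy_service", "view_reports"},
}

user_roles = {"dana": {"engineer"}}


def permissions_for(user: str) -> set:
    # A user with several roles gets the union of each role's permissions.
    granted = set()
    for role in user_roles.get(user, set()):
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def can(user: str, permission: str) -> bool:
    return permission in permissions_for(user)


def promote_to_manager(user: str) -> None:
    # One role assignment grants every manager permission at once.
    user_roles.setdefault(user, set()).add("manager")
```

Nothing here can say "edit item3 and item6 only" without inventing a role for exactly that combination, which is where the role explosion comes from.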
Attribute Based Access Control or ABAC is the most flexible of these options. Unlike ACLs or RBAC, ABAC doesn’t store permissions, but instead calculates those permissions on demand based on a number of attributes. These attributes can be anything and can either be passed in with the request or looked up on the fly. The most common implementation of ABAC follows the XACML standard. With this, attributes are categorized in one of four ways: Subject (data related to the user), Resource (data related to the object we’re trying to access), Action (the operation being attempted) and Environment (literally anything else; it could be if we’re getting too many requests or if the temperature is too hot outside right now). ABAC’s two biggest strengths are its flexibility and the fact that if I update a policy, that change takes effect immediately since no cascading needs to happen to get it applied to previously stored permissions. It has two primary downsides. First, because permissions are calculated dynamically, this takes time on each request (usually small, but non-zero). Second, if attributes are needed to make a decision and any of the downstream services I use to fetch my attributes are unavailable, the permissions decision can’t be made. Despite the drawbacks, ABAC is the framework that both Split and Box decided to go with, primarily because of its flexibility and fit to our use-cases.
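The sketch below shows the general shape of an on-demand decision in Python. It is not XACML and not how Balana or any other product evaluates policies; the `Request` class, the two policy functions and the permit-if-any-policy-matches rule are all simplifying assumptions of mine.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    subject: dict    # data about the user
    resource: dict   # data about the object being accessed
    action: str      # what the user is trying to do
    environment: dict = field(default_factory=dict)  # anything else


def owner_can_do_anything(req: Request) -> bool:
    return req.subject["id"] == req.resource["owner_id"]


def same_org_can_read_when_not_overloaded(req: Request) -> bool:
    return (
        req.action == "read"
        and req.subject["org"] == req.resource["org"]
        and req.environment.get("requests_per_second", 0) < 1000
    )


POLICIES = [owner_can_do_anything, same_org_can_read_when_not_overloaded]


def is_permitted(req: Request) -> bool:
    # Nothing is stored per user or per object; the decision is computed now.
    return any(policy(req) for policy in POLICIES)
```

Editing `POLICIES` changes every subsequent decision immediately, with no stored permissions to migrate. The cost is that every call runs the policies, and any attribute that has to be fetched from another service must be available at that moment.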
AWS Identity and Access Management (IAM), as I mentioned, doesn’t quite fit here since, to my knowledge, there are no usable frameworks that claim to support specifically this. Instead this is AWS’s user-facing feature authorization, which, if I had to guess, implements something like ABAC on the back end. That said, it gets brought up a lot, so what is it? IAM at a very simplistic level has identity policies and resource policies. An identity policy defines what actions a user or group of users can perform on which resources under what conditions. A resource policy, meanwhile, is attached to a specific resource and defines what actions are available to which users under what conditions. In the same account, having both largely doesn’t make sense because they define the same thing in different ways. The usefulness of both comes up if you have two separate accounts talking to each other. In this case, you want both the calling account to be able to have control (if EmployeeA was fired, account A knows this and can have them removed from their policy while it may take account B a while to find out) and the receiving account to have control (if I want to suddenly restrict nearly all access to an object I own, I should be able to do that on my end without having to contact and then trust those calling it to do that). When two accounts are in play, in order for a user to be able to perform the action on the object, they must be allowed to do so by both the identity policy and the resource policy.
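The two-account rule can be sketched in a few lines of Python. This is a deliberately simplified model of my own and not AWS’s policy language or evaluation logic: a policy here is just a set of (principal, action, resource) grants, and the names are hypothetical.

```python
def allows(policy: set, principal: str, action: str, resource: str) -> bool:
    # A policy here is just a set of (principal, action, resource) grants.
    return (principal, action, resource) in policy


def cross_account_allowed(identity_policy: set, resource_policy: set,
                          principal: str, action: str, resource: str) -> bool:
    # Both the caller's account and the resource owner's account must agree.
    return (allows(identity_policy, principal, action, resource)
            and allows(resource_policy, principal, action, resource))


# Account A controls what its own users may do.
account_a_identity_policy = {("employee_a", "read", "account_b/report")}
# Account B controls who may touch its resource.
account_b_resource_policy = {("employee_a", "read", "account_b/report")}
```

Either side can revoke access on its own: removing the grant from account A’s identity policy or from account B’s resource policy is enough to make the check fail.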
No matter which of these access control types is used, there are a couple of options for how this framework fits into the architecture. In the first option, we use an API Gateway and that gateway calls out to the authorization service. API Gateways are common in microservice architectures to assist in routing and typically serve as the single entry-point into the back-end architecture. As such, this is a common point to add the authentication checks and can also be a natural point to add authorization checks. In a variation of this, we co-locate the authorization service in the API Gateway itself, either as something fully built in or as a sidecar service. By co-locating, we save on network traffic time, but can somewhat overcomplicate the service. The other issue with co-locating is that the only way to access the authorization service is through the API Gateway, which can become problematic because service-to-service calls typically do not go through the API Gateway. This means that all authorization needs to be done on the initial call. Authorization up front can be useful in that it ensures that it will happen, and it can be very effective in simpler authorization schemes. However, it is less configurable by individual teams and makes authorization in other parts of the stack more complicated and awkward.
Another option is to have authorization as a separate service that is called by your other services. This is most commonly done through filter chain logic — either incoming or outgoing (or both). The use-case for using an incoming filter chain is that we want to check if the currently logged in user has access to the endpoint they’re trying to call with the parameters they’re using. The use-case for outgoing is for cases where I may have access to request a list of items from an endpoint but then I want to check that I actually have access to each of the individual items that were found before returning them. Almost all use-cases can be fit into either an incoming or outgoing filter-chain. However, if you’re anything like us at Split, your existing code has some calls from the middle. While these could be re-written, we don’t want to try to do that at the same time that we export our authorization service and instead want to do it over the course of a couple of steps. Given this, it may be common, at least temporarily, to have calls to the authorization service from the middle of a code path inside a service as well.
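Here is a minimal Python sketch of an incoming and an outgoing filter wrapped around an endpoint. The `authorize` function is a hypothetical stand-in for a call to the authorization service, backed by a hard-coded set so the example runs on its own.

```python
# Hypothetical stand-in for decisions made by the authorization service.
ALLOWED = {("sam", "list_items"), ("sam", "item:1"), ("sam", "item:3")}


def authorize(user: str, target: str) -> bool:
    return (user, target) in ALLOWED


def list_items_handler() -> list:
    # The endpoint itself knows nothing about permissions.
    return [{"id": 1}, {"id": 2}, {"id": 3}]


def handle_request(user: str) -> list:
    # Incoming filter: may this user call the endpoint at all?
    if not authorize(user, "list_items"):
        raise PermissionError("forbidden")
    items = list_items_handler()
    # Outgoing filter: drop any individual items the user may not see.
    return [item for item in items if authorize(user, f"item:{item['id']}")]
```

In this sketch the handler stays free of authorization logic. A call from the middle of a code path would simply be another `authorize` call inside `list_items_handler`.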
Similar to the API Gateway case, instead of having the authorization service as a separate service, we can also host it as a sidecar to each service. There are some definite pros and cons to each approach. Hosting as a sidecar allows us to separate authorization policies and clearly store those with code specific to the service. This keeps clean lines of ownership and limits team dependencies. Additionally, it saves on network call time and is a good way to load balance requests — ensuring that if one service is overwhelming the authorization service, another can still access it. The biggest downside is that if multiple authorization sub-requests are done as a part of a user request, none of the common attributes that were looked up as a part of the request will be cached locally. If each of the policies I access needs some extra data stored on the user object, I will look that up each time and since it’s coming from a different service each time, I won’t have that object already. Some of this can be mitigated by looking up and passing around objects that are commonly used. However, this starts to add a dependency between the caller and the authorization policy (i.e. the caller needs to have some idea of what data the authorization policy may need and pass that in). Whether a separate service or a sidecar is the better option will largely come down to your specific use-cases and access patterns.
We looked into a lot of solutions, at least at a very cursory level. The things we looked at from at least a high level are in the image to the left. The dollar signs indicate paid solutions while the others are open source. WSO2 Identity Server is listed as one dollar sign because they have a free solution but have a paid option to get support and updates. From this list, because we had determined that we were looking for an ABAC solution, we eliminated everything in the RBAC and Other categories pretty quickly. It’s worth noting that Oracle and PlainID also appear to support ABAC. However, the Oracle solution is targeting a slightly different use-case from ours. Several things wouldn’t fit well: for example, they call from the API Gateway layer and they host the solution within the Oracle cloud (meaning this wouldn’t just be a network call but an external call). Similarly, PlainID seemed to be targeting the use-case of restricting access to various applications for employees within an enterprise. They also had very little documentation and there seemed to be almost nothing on sites like Stack Overflow, which signals to me that integration would be more challenging.
Of the remaining ABAC options, we eliminated everything in my list below OPA largely because they either had almost zero recent activity on their source code (these are open source solutions) or they had very little documentation and little community discussion anywhere. We wanted a solution that would have at least a low level of support (even if that’s only through Stack Overflow) and that was still actively maintained at some level. This left four options: WSO2 Identity Server, Balana, Axiomatics and OPA.
Both Identity Server and Balana come out of WSO2. We suspect that Balana may be the underlying authorization framework under Identity Server, but I don’t know this for a fact. Identity Server is their slightly better supported solution. Identity Server also encompasses more than just authorization, and the authorization that it does support is simplified from what is available in Balana. As with most simplifications, this makes it easier to use but also more limited than Balana. Between the two, we determined that Identity Server’s somewhat more rigid solution wouldn’t support all of the use-cases we needed. Balana, meanwhile, is probably the best open source ABAC solution implementing the XACML standard. It has been around a while but still gets (limited) updates to its code and has reasonable documentation.
Axiomatics, meanwhile, is the leader in this space from the paid solution standpoint. They are a well established company with a number of big name customers. They seem to work really well with older tech stacks, but haven’t invested as much in newer tech stacks, so things like containerization aren’t yet fully supported. They really shine in the policy administration point space, meaning being able to edit policies in a GUI. I’m not sure I remember any other products with much in this space and Axiomatics does a good job here. However, at least for how we’re planning to use the system, this is lower on our priorities list.
The final main option that we evaluated was Open Policy Agent or OPA. OPA is by far the newest solution. It came across our radar when I was at Box, but we didn’t really consider it there because the project had just exited stealth mode at the point when we were making a decision and we didn’t want anything quite that cutting edge. Now that they have several years under their belt, they’ve had time to mature and prove themselves a bit more. They’re one of the ABAC solutions that does not implement XACML, and in fact they don’t really claim to be ABAC, but conceptually they fit that pattern. Their most common use-case is authorization with Kubernetes, which fits into the system and infrastructure authorization that I talked about and isn’t what we were investigating. That said, they do also support the application framework use-case and have documentation talking through that. Of any of the solutions we looked into, OPA had by far the most hype, documentation and active online community. It’s also a fairly flexible solution. They’re the only solution to fully support both the separate server and the sidecar architectures. They integrate best with Go and have a Go library option (in addition to the separate server and sidecar options). However, with the available REST endpoint, any language can be used (this is true of Axiomatics and Balana as well).
To be honest, we haven’t yet picked which of these solutions we will go with. Balana, Axiomatics and OPA are all solid options. We will likely go with either Balana or OPA largely because, as a somewhat money-conscious startup, Axiomatics doesn’t have enough of a differentiator in the areas we care about to justify the cost. OPA seems the most promising, but we also know the least about it — I worked with Balana previously at Box and I’ve been able to talk extensively with the engineers there who are still working with it. Given that, we’re hoping to do a brief proof of concept with OPA to verify that it meets our needs and to hopefully reveal any of the gaps it will have before making a final decision.
Authorization doesn’t quite have a one size fits all option, and I strongly suspect that a large number of companies are using in-house solutions for this space. Some are probably really good, and some, like what we ended up with at Box (before we started the transition to Balana), slowly grew organically over time until they became almost too complex to understand. The good news is that there are good options available whether you’re looking for RBAC (largely not included here because we just did a lot less research there), a paid solution, something that lets non-developers edit policies easily, an open source solution or anything in between.