Two Wrongs

Fast SQL for Inheritance in a Django Hierarchy

Fast SQL for Inheritance in a Django Hierarchy

Imagine, for a moment, that you want to emulate some sort of traditional file system on a Django powered page. This means you want a model for directories, and directories should be able to contain sub-directories, and so on. In short – a tree structure. I've talked about tree structures on this blog before, and it's not a coincidence; trees are surprisingly useful things.

Prerequisites

How would you make trees of existing models in a Django application? Why, Django MPTT, of course! MPTT is fantastic at just plugging itself into your models and giving them a tree structure with minimal work on your end.

All is well until you also want to assign different groups in your application access to different directories in the tree. The Django permission system can by default only give groups access to either all directories, or none of them. That doesn't cut it, so we bring in Django Guardian, which allows you to set view access to groups for particular directories. (The docs for Guardian is written by an awesome Polish guy, by the way, and his English isn't perfect, but in a very charming way. When I read the docs I get the feeling that the eponymous "guardian" is an angry Pole with a baseball bat.)

After we've put in Guardian, everything is cool, no? Not quite. We quickly realise that in an ordinary file system, we expect a certain amount of inheritance in the permission set. If only the www-data group can access /var/www, then we expect that only the www-data group should be able to access /var/www/two-wrongs as well, unless another group is explicitly assigned to that directory, in which case both groups should have access to it.

Inheritance of Permissions in Python

It's somewhat easy to solve this problem for a single directory in Python with the convenience methods MPTT gives us. (Note that this is assuming directories have a group manager, which might not be a great idea. I'll leave it as an exercise to the reader – I haven't had time to do it myself yet – to figure out a better way of getting all the groups associated with viewing a directory.)

ancestors = directory.get_ancestors(include_self=True)

accumulated = set()
for ancestor in ancestors:
    ancestor_groups = set(ancestor.groups.all())

    # This ancestor does not have any groups assigned, so we just keep
    # the group set from the previous iteration.
    if not ancestor_groups:
        continue

    # If we haven't come across any group restrictions yet or if this
    # ancestor has a strictly tighter set of restrictions just use the
    # restrictions from this ancestor.
    elif not accumulated or accumulated > ancestor_groups:
        accumulated = ancestor_groups

    # Since we have accumulated groups from the ancestors before this
    # one and this one has groups set as well, the result should be
    # the inclusive combination of both restrictions.
    else:
        accumulated |= ancestor_groups

# If there are no group restrictions in the path to this directory,
# or if the group restrictions intersect with the current users
# groups, allow access to the directory
if not accumulated or accumulated & set(request.user.groups.all()):
    return True

This is decent code, but it breaks down completely if you want to give a user a list of the directories they are allowed to view. The reason is that it needs to execute one database query for every directory in the file system. Not ideal. There are some ways to cache the results and speed the whole process up, but it'll still be a significant number of queries.

One solution would be to denormalise the group restriction information. If each directory entry in the database stores the group restrictions it has inherited from its parents, the problem goes away. Or, rather, the problem moves to another part of the code. Keeping that information updated will be a pain, because basically any change in the file system will propagate permission changes throughout the rest of the file system. And getting it right will be tough. So it's time to look for other options.

PostgreSQL Recursive Queries

Since the problem of inheriting permissions is inherently an iterative problem, we can actually make use of the so-called "recursive queries" that Postgres provide for us.

The idea is that if we can take the loop we just wrote in Python, and somehow convert it to something Postgres understands, then Postgres can run that for all directories in your file system simultaneously, and in the end give us back a list of the directories a user has access to with a simple WHERE clause.

How the Postgres recursive queries work is actually fairly well documented, although you might need to experiment a bit to wrap your head around it. The basic idea is that Postgres starts by populating a temporary table with some base information, and then for every iteration of the query it updates that temporary table based on information it found in the temporary table. It's really just a loop of queries.

Our Python code, translated somewhat directly to a recursive query which works on all directories at once, looks something like

WITH RECURSIVE
  accumulated(id, parent_id, group_id) AS (
    SELECT
      filesystem_directory.id, parent_id,
      COALESCE(groups.group_ids, ARRAY[]::integer[])
    FROM filesystem_directory LEFT OUTER JOIN groups
           ON filesystem_directory.id=groups.object_pk
    WHERE filesystem_directory.id=?
  UNION ALL
    SELECT
      filesystem_directory.id, filesystem_directory.parent_id,
      CASE WHEN groups.group_ids = '{}'
                THEN accumulated.group_id
           WHEN accumulated.group_id = '{}' OR
                accumulated.group_id @> groups.group_ids
                THEN groups.group_ids
           ELSE ARRAY(SELECT unnest(accumulated.group_id) UNION
                      SELECT unnest(groups.group_ids))
      END
    FROM
      filesystem_directory LEFT OUTER JOIN groups
        ON filesystem_directory.id=groups.object_pk,
      accumulated
    WHERE filesystem_directory.parent_id=accumulated.id
  )
SELECT id
FROM accumulated
WHERE group_id = '{}' OR
      ARRAY(SELECT unnest(group_id) INTERSECT
            SELECT unnest(?::integer[])) != '{}';

There are some things worth noting here. This pretends it has access to a table called groups, which is just a mapping between directories (groups.object_pk) and their group restrictions (groups.group_ids), the latter in the form of a Postgres array.

The first part of the recursive query (between WITH RECURSIVE and UNION ALL) provides the "base data" that the temporary table is initially filled with. In this case, this represents the permissions for the root directory (the ID of the root directory is the query parameter.) I coalesce groups.groud_ids with an empty array to make sure that if no group restrictions are set for the root directory, its gets an empty list of restrictions instead of a null value.

The second part of the recursive query (from UNION ALL to the end parenthesis) is the iterative bit. That draws on information from the groups table as well as the accumulated table (currently filled with the base information from the first part) to fill the accumulated table with new information about the children of the directories already in it.

The big-ass CASE … END clause contains the logic from the Python loop body, and it performs the same basic set operations to assign the correct inherited group restrictions to each directory in the accumulated table.

The last WHERE clause just ensures to return only the directories which either have no group restrictions specified or which has group restrictions specified which intersects with the user's groups. (That's the second query parameter – thankfully Psycopg allows us to pass in Python lists as query parameters and they get converted to Postgres arrays.)

groups Table

The only thing that's really missing from this is the groups table, which is easily constructed by joining a few Django-specific tables relating to content type, permissions and Guardian. Most of this should be self-explainatory if you managed to keep up with the previous one.

groups(object_pk, group_ids) AS (
    SELECT object_pk, array_agg(group_id) AS group_ids
    FROM
      guardian_groupobjectpermission
      JOIN django_content_type
        ON guardian_groupobjectpermission.content_type_id =
           django_content_type.id
      JOIN auth_permission
        ON guardian_groupobjectpermission.permission_id =
           auth_permission.id
    WHERE
      django_content_type.app_label='filesystem' AND
      django_content_type.model='directory' AND
      auth_permission.codename='view_directory'
    GROUP BY object_pk
  )

Potential Follow-up

I realise this article has been a lot of SQL and very little explanation. This post is mostly just for my own sake to remember what I discovered when figuring this out. I might do a follow-up later where I build from the ground up. Not sure about that yet because it will require a lot of work. But we'll see.

I do also plan on making an article on how Django default and Guardian permissions work, what kind of problems they solve, what problems they don't solve, and so on, in case anyone else found it just as confusing as I did at first.