NOTE - OBSOLETE

This document is obsolete. User namespaces are now fully implemented as of 3.12. The approach used differs from the one detailed below. It is based on 1-1 mappings from userspace uids to kernel 'kuids'. For instance uid 0 in a container maps to uid 100000 on the host, naturally insulating the host from any privilege leaks in the container.
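As a rough illustration of the 1-1 mapping mentioned above, here is a minimal Python sketch. The (inside, outside, count) triple follows the layout of /proc/<pid>/uid_map; the function name and the range length of 65536 are assumptions made for the example, not something this page specifies.

```python
def map_uid_to_host(uid, uid_map):
    """Translate a uid seen inside a container to the host-side id.

    uid_map is a list of (inside_start, outside_start, count) triples.
    Returns None when the uid falls outside every mapped range.
    """
    for inside_start, outside_start, count in uid_map:
        if inside_start <= uid < inside_start + count:
            return outside_start + (uid - inside_start)
    return None


# Assumed example mapping: container uids 0..65535 -> host uids 100000..165535.
container_map = [(0, 100000, 65536)]

host_root = map_uid_to_host(0, container_map)       # container root
host_user = map_uid_to_host(1000, container_map)    # an ordinary container user
unmapped = map_uid_to_host(70000, container_map)    # outside the mapped range
```

With this mapping, root inside the container is an unprivileged id on the host, which is the insulation described above.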

I'm leaving this document up for historical reference.

User Namespaces in Linux

User namespaces in Linux have a few goals:

Status

Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID capabilities. What this gets you is a whole new set of userids, meaning that user 500 will have a different 'struct user' in your namespace than in other namespaces. So any accounting information stored in struct user will be unique to your namespace.
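The effect on accounting can be modelled with a toy Python sketch. This is not kernel code: the class, the method name, and the "processes" counter are illustrative stand-ins for a namespace and for whatever is tracked in struct user.

```python
from collections import defaultdict


class UserNamespace:
    """Toy stand-in for a user namespace with its own per-uid records."""

    def __init__(self, name):
        self.name = name
        # One record per uid, private to this namespace (models 'struct user').
        self.users = defaultdict(lambda: {"processes": 0})

    def account_fork(self, uid):
        """Charge one new process to uid within this namespace only."""
        self.users[uid]["processes"] += 1
        return self.users[uid]["processes"]


init_ns = UserNamespace("init")
child_ns = UserNamespace("child")

init_ns.account_fork(500)
init_ns.account_fork(500)
child_ns.account_fork(500)
```

User 500 is charged twice in one namespace and once in the other, because each namespace holds its own record for that uid.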

However, throughout the kernel there are checks which

As a result, the lxc implementation at lxc.sf.net does not use user namespaces. This is actually helpful, because it leaves us free to develop user namespaces in a way that may leave them of little practical use for some time.

Currently developed patchset

Work is under way to continue the development of user namespaces. It is being stored at http://kernel.ubuntu.com/git?p=serge/natty-userns.git;a=summary. Note that this tree is frequently completely recreated (unavoidable as it sits on the Ubuntu development tree which itself is rebased) and only intended as a patch cache.

Design

The design of user namespaces can be described in several parts:

flexible uid mapping

This is very open to change, but reflects the latest discussions with Eric as of 2008.

First, the inode->uid (and ->gid) will reflect the init_user_ns owners of the file. Xattrs will list the (uid,userns) owners of the file. So if a task written as (500:0, 0:1), meaning uid 500 in namespace 0 (init_user_ns) and uid 0 in child namespace 1, creates a file, the file will have inode->uid 500 and a (0:user1_ns) xattr.

But remember that user namespaces are not named. This will not change, because if it did we would need a namespace of user namespaces. So next, a simple policy will dictate the names which specific users may use for their child user namespaces. For instance:

    [domains]
    INIT    1
    serge   2
    vs2     3

    [owners]
    serge serge.INIT
    vs2 root.INIT

So init_user_ns is called '1', a user_ns 'serge' is '2', and one called 'vs2' is '3'. User serge in the init_user_ns can create a user_ns called 'serge', and user root in init_user_ns can create a user_ns called 'vs2'. If (serge:1, 0:2) creates a file, it will have inode->uid=serge and an xattr (0:2), since '2' is userns 'serge'.
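To make the policy concrete, here is a small Python sketch that parses the example above and answers "may this user create that namespace?". The parser, the function names, and the reading of an [owners] line as "<namespace> <user>.<parent namespace>" are assumptions made for illustration; no such tool or file format was ever finalized.

```python
POLICY = """\
[domains]
INIT    1
serge   2
vs2     3

[owners]
serge serge.INIT
vs2 root.INIT
"""


def parse_policy(text):
    """Return (domains, owners) parsed from the two-section policy text."""
    domains, owners = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        name, value = line.split()
        if section == "domains":
            domains[name] = int(value)          # namespace name -> id
        elif section == "owners":
            user, parent_ns = value.split(".")
            owners[name] = (user, parent_ns)    # who may create this namespace
    return domains, owners


def may_create(user, current_ns, new_ns, owners):
    """True if 'user' in 'current_ns' is allowed to create 'new_ns'."""
    return owners.get(new_ns) == (user, current_ns)


def owner_xattr(uid_in_ns, ns_name, domains):
    """The (uid, userns id) pair that would be recorded in the file's xattr."""
    return (uid_in_ns, domains[ns_name])


domains, owners = parse_policy(POLICY)
```

For example, owner_xattr(0, "serge", domains) yields the (0:2) pair from the text, since '2' is the id assigned to userns 'serge'.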

The mechanism for the association of a name with a created user_ns is not yet certain. It may simply be done using a mount flag.

By way of implementation, not much should need to be done. The fs can fix its getattr() to return the (uid, gid) which should be valid in current's userns, and then use those for judging permission.
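A toy model of that getattr() idea, in Python rather than kernel C, might look like the following. The Inode class, the dictionary standing in for the xattrs, and the fallback to uid 65534 when the file has no owner in the caller's namespace are all assumptions; the text above does not say what an unmapped owner should look like. Only uids are modelled; gids would work the same way.

```python
from dataclasses import dataclass, field

INIT_USER_NS = 1        # id of init_user_ns, following the example policy
OVERFLOW_UID = 65534    # assumed fallback for "no owner in this namespace"


@dataclass
class Inode:
    uid: int                                        # owner in init_user_ns
    ns_owners: dict = field(default_factory=dict)   # userns id -> uid (the xattrs)


def getattr_uid(inode, current_ns):
    """Return the owner uid as it should appear in the caller's namespace."""
    if current_ns == INIT_USER_NS:
        return inode.uid
    return inode.ns_owners.get(current_ns, OVERFLOW_UID)


def is_owner(inode, current_uid, current_ns):
    """Permission checks then compare against the translated uid."""
    return getattr_uid(inode, current_ns) == current_uid


# A file created by (500:1, 0:2): uid 500 in init_user_ns, root in userns 2.
example = Inode(uid=500, ns_owners={2: 0})
```

Here the same file appears to be owned by uid 500 from init_user_ns and by uid 0 from userns 2, and a task in any other namespace sees the fallback owner.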

Discussion

So right now you can clone with CLONE_NEWUSER and end up with a process which can be useful on the system or even for a full container, but have separate accounting for userids. It's not particularly useful, but not completely useless.

With the current development patchset, functionality is very different. When you clone with CLONE_NEWUSER, you certainly cannot start a container. However, the resulting task can be much better contained.

Development Next steps

I've been thinking about how to best approach the development of the remaining features. I intend to do it in 3 steps:

At the end of the first step, we may have a user namespace which is safe for unprivileged users to unshare.

At the end of the second step, we should have something which full containers are able to use.

At the end of the third step, we have something which more complicated application containers (which bind-mount part of the hostfs into themselves) can use, and which users can safely use to mount removable filesystems from other hosts with different userid mappings. Furthermore, I believe we'll have full in-kernel support for what the 'fakeroot' utility currently does.

Links:

UserNamespace (last edited 2014-08-12 22:15:03 by serge-hallyn)