IRC log from a user namespaces discussion on #lxcmeeting on Feb 7, 2013 (17:00 UTC)
15:57 <idkfa> hi there
17:01 <serue> Hey
17:01 <serue> ok, when this is done, i will post the log (perhaps sanitized) and send a link to the list
17:02 <serue> I'm going to blab a bit first about user namespaces, then we can mosey over to an ec2 instance for some demo
17:02 <serue> So first off, I assume anyone interested enough to be here knows this, but the state of things right now
17:03 <serue> is that when you start a container, a userid in a container is the same user as it is on the host
17:03 <serue> we contain it somewhat with LSMs (apparmor/selinux), capability bounding sets, seccomp, and cgroups,
17:03 <serue> and we try to keep it from getting access to some things by not mapping them in a namespace,
17:03 <serue> but the userid is the same.
17:04 <serue> Uh, just to be sure, can someone ack that I'm actually showing up?
17:04 <dengen> ack
17:04 <dash17291> ack
17:04 <serue> woot thx
17:04 <serue> ok, so Eric and I have been talking since I think 2007 about how we wanted user namespaces to look
17:04 <serue> some of the goals we had were:
17:04 <serue> one user on the host should be able to use multiple uids in the container
17:05 <serue> ability to map uids on disk
17:05 <serue> ability to use privilege in the container
17:05 <serue> safety of the host from privilege in the container
17:05 <serue> uid on host which created the userns should own everything in the userns
17:05 <serue> sane lifetimes for userids (this was a problem before uids)
17:05 <serue> sane conversion of userid comparisons in the kernel - this one especially was a huge problem before kuids
17:06 <serue> so about a year ago I think, Eric came up with the brilliant idea we're using now:
17:06 <serue> uids are mapped 1-1, by ranges
17:06 <serue> so for instance, uids 0-99 in the container might map to userids 10100-10199 on the host
17:06 <serue> uids are split into two types: the old uid_t, and a kuid_t for kernel uids
17:07 <serue> in the example above, uid_t is the uid in the container, i.e. 0
17:07 <serue> kuid_t is the uid as it is seen on the host, i.e. 10100
17:07 <serue> The kuids->uid_t are mapped at the kernel->user boundary
17:07 <serue> So if I'm writing to disk from inside a container, and my uid_t 0 is mapped to kuid_t 100000,
17:08 <serue> then when I do a stat, I'll see st_uid as 0, but on disk it will read 100000.
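
The range mapping described above can be illustrated with a minimal sketch. This is not kernel code: the names IdRange, to_host and to_container are hypothetical, and the single range (container uids 0-9999 onto host uids 100000-109999) is only an assumed example consistent with the numbers mentioned in the discussion.

```python
from typing import NamedTuple, Optional, Sequence


class IdRange(NamedTuple):
    ns_start: int    # first id as seen inside the container (uid_t)
    host_start: int  # first id as seen on the host (kuid_t)
    count: int       # number of consecutive ids covered by the range


def to_host(uid: int, ranges: Sequence[IdRange]) -> Optional[int]:
    """Translate a container uid to the host uid, or None if unmapped."""
    for r in ranges:
        if r.ns_start <= uid < r.ns_start + r.count:
            return r.host_start + (uid - r.ns_start)
    return None


def to_container(kuid: int, ranges: Sequence[IdRange]) -> Optional[int]:
    """Translate a host uid back to the container uid, or None if unmapped."""
    for r in ranges:
        if r.host_start <= kuid < r.host_start + r.count:
            return r.ns_start + (kuid - r.host_start)
    return None


# Assumed example mapping: container 0-9999 <-> host 100000-109999.
ranges = [IdRange(ns_start=0, host_start=100000, count=10000)]
```

With this mapping, to_host(0, ranges) gives 100000 (what the host and the disk see) and to_container(100000, ranges) gives 0 (what stat reports inside the container). A host uid outside every range, such as 1000, has no container equivalent in this sketch.
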
17:08 <serue> The user who creates a namespace owns a namespace, and root in a namespace owns all the resources "owned" by the container
17:08 <serue> If I create a new user_ns, and just that, then I won't have privilege over the nics
17:09 <serue> but if I now create a new network ns, then in that ns I can create and manage new nics
17:09 <serue> Creating a uid mapping requires privilege to the parent user_ns
17:09 <serue> So ignoring containers for now, Eric's shadow patch (still pending) introduces a new /etc/subuid (and /etc/subgid) file,
17:09 <serue> which lists the uids which a uid on the host can use,
17:10 <serue> for instance uid 1000, hallyn, can use uids 100000 through 110000
17:10 <serue> then, a setuid-root program (pair) newuidmap/newgidmap, can be used by user hallyn to map ids in that range to whatever he wants in a new userns
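
A rough sketch of the check a helper like newuidmap would have to make before writing a mapping. The owner:start:count line format and the function names parse_subid and may_map are assumptions for illustration; the shadow patch was still pending at the time of this discussion, and this is not its implementation.

```python
from collections import defaultdict


def parse_subid(text):
    """Parse 'owner:start:count' lines into {owner: [(start, count), ...]}.

    Skipping blank lines and '#' lines is an assumption of this sketch.
    """
    table = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        owner, start, count = line.split(":")
        table[owner].append((int(start), int(count)))
    return dict(table)


def may_map(owner, host_start, count, table):
    """True if [host_start, host_start + count) lies inside one delegated range."""
    if count <= 0:
        return False
    for start, size in table.get(owner, []):
        if start <= host_start and host_start + count <= start + size:
            return True
    return False


# Assumed example entry: hallyn may use 10000 ids starting at 100000.
table = parse_subid("hallyn:100000:10000\n")
```

Here may_map("hallyn", 100000, 10000, table) is allowed, while a request that reaches past the delegated range, or one from a user with no entry, is refused.
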
17:10 <serue> Any questions so far?
17:11 <serue> While we wait to see if there are, please go ahead and log in as user guest, password lxcguest, to 18.104.22.168
17:11 <serue> you'll be sent into a shared (ro) screen session
17:11 <dengen> in your example above stat sees uid 0 in container, host stat sees 100000?
17:11 <serue> yup
17:12 <serue> ok, so on that host I've installed a kernel and lxc version from ppa:serge-hallyn/userns-natty. The kernel is pretty much all stuff that will be upstream in 3.9
17:12 <serue> and the lxc patch, if I'm not mistaken, actually is already in staging, so I ought to be able to use stgraber's daily ppa (but didn't)
17:12 <serue> So I create a regular container,
17:13 <stgraber> serue: yep, the daily should work
17:13 <serue> Then I will use a script container-userns-convert (from package nsexec in my userns-natty ppa)
17:13 <serue> let's look at that script real quick,
17:14 <serue> So it calls a program called 'uidmapshift' on the container rootfs,
17:14 <serue> which just shifts the uids/gids of all files into the mapped range,
17:14 <serue> so for instance if we've mapped 0-10000 to 100000-110000, it'll chown root owned files to 100000
17:15 <serue> (and re-set setuid/setgid if needed)
17:15 <serue> The only other thing the script does is to add the lxc.id_map entries to the container configuration
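
The uid/gid shift described above could look roughly like the following. This is a sketch of the idea only, not the actual uidmapshift tool: shift_id, iter_paths and shift_tree are hypothetical names, and dry_run defaults to True so that nothing is changed unless asked. Actually applying the change needs privilege on the host.

```python
import os
import stat


def shift_id(value, ns_start, host_start, count):
    """Return the shifted id, or None if value is outside the mapped range."""
    if ns_start <= value < ns_start + count:
        return host_start + (value - ns_start)
    return None


def iter_paths(rootfs):
    """Yield rootfs itself and everything below it, without following symlinks."""
    yield rootfs
    for dirpath, dirnames, filenames in os.walk(rootfs):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def shift_tree(rootfs, ns_start, host_start, count, dry_run=True):
    """Plan (and optionally apply) the ownership shift for a container rootfs."""
    planned = []
    for path in iter_paths(rootfs):
        st = os.lstat(path)
        new_uid = shift_id(st.st_uid, ns_start, host_start, count)
        new_gid = shift_id(st.st_gid, ns_start, host_start, count)
        if new_uid is None and new_gid is None:
            continue  # neither id is in the mapped range: leave the file alone
        # -1 tells chown to keep the existing owner or group.
        uid = -1 if new_uid is None else new_uid
        gid = -1 if new_gid is None else new_gid
        planned.append((path, uid, gid))
        if not dry_run:
            os.lchown(path, uid, gid)
            # chown can clear setuid/setgid bits, so restore the original mode.
            if not stat.S_ISLNK(st.st_mode):
                os.chmod(path, stat.S_IMODE(st.st_mode))
    return planned
```

Running shift_tree on a rootfs with dry_run=True returns the list of planned (path, uid, gid) changes, which is a cheap way to review a conversion before applying it.
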
17:15 <serue> So I run the script,
17:15 <serue> and start the container,
17:15 <serue> log in,
17:15 <serue> I'm userid 1000, ubuntu
17:16 <serue> (my screen shortcuts aren't working quite right)
17:16 <serue> you see processes in the container are actually owned by uid 100000
17:16 <serue> but network is working, files are looking fine,
17:17 <serue> the one weirdness i've not yet figured out (something to do with sessions i guess),
17:17 <serue> if I switch back to the /dev/console login,
17:17 <serue> you see it isn't particularly happy,
17:17 <serue> and in fact if i ctrl-c now, it'll reboot the container,
17:17 <stgraber> serue: I don't think the screen guests follow window changes, at least mine kept showing the /dev/console tty
17:17 <serue> stgraber: feh!
17:18 <serue> ok, was what you saw helpful enough, or should i redo more slowly?
17:18 <serue> in the meantime, i switched back to the /dev/tty1 login, from where I was able to run sudo shutdown just fine
17:19 <dengen> helpful enough here
17:19 <serue> cool.
17:19 <serue> ok the other thing i could do is show a more manual setup of a user ns, but I'm afraid I'll get bogged down in details and not help anything -
17:19 <serue> so if someone wants to see that, pls shout,
17:20 <serue> Meanwhile, this shows that basic user namespace exploitation by containers now works,
17:20 <serue> but there's a next step we want to take,
17:20 <serue> which is to have unprivileged users create and run them
17:21 <serue> There are several issues there - they may all be trivial small steps each, but I haven't started them:
17:21 <serue> 1. unprivileged user somehow has to create the rootfs owned by the uids in the container
17:21 <serue> I don't know if just running tar -xvf from inside a mapped userns will suffice
17:21 <serue> (just haven't tried it)
17:22 <serue> 2. hooking the network up at the host end.
17:22 <stgraber> 1) that'd be a nice trick, and on paper, it "should" work
17:22 <serue> that will need some way for host admins to dole out a bridge to hook into
17:23 <serue> stgraber: Right, there may be a bootstrapping problem though, I'm not sure
17:23 <serue> also it'd be nice to be able to do something with block devices,
17:23 <stgraber> 2) can we even create a veth pair without privileges? I vaguely assumed you couldn't create a network interface in the host without being uid 0 there
17:23 <serue> but for the moment that's moot as you can't mount anything but ext2 anyway
17:24 <serue> stgraber: right, you'd have to have either a privileged tiny helper on the host create it, or create it in the guest and pass it back somehow
17:25 <serue> The newuidmap/newgidmap shadow model may be one we can follow somewhat
17:25 <serue> have a new file in /etc listing the bridge a user can hook to, and how many veth's he can create/hook up?
17:25 <stgraber> serue: ok, so we'd need /etc/lxc/lxc.conf to store a list of allowed bridges to hook into, then a setuid helper that'll be poked by the unprivileged lxc to create the veth pair, move one into the new netns and hook the other into the bridge
17:25 <serue> Then leave it up to something/someone else to deal with any firewalling they want to do
17:25 <serue> right
17:26 <serue> Another more basic problem we have is discoverability of containers by lxc-ls if they are in per-user locations,
17:26 <serue> and corresponding naming issues
17:26 <serue> maybe we don't really care, I guess the question becomes,
17:26 <stgraber> set_path was meant to help with that
17:26 <serue> do we want lxc to be more like a host-wide management suite,
17:27 <serue> or a tool that anyone on the system can use/configure as they like
17:27 <serue> perhaps the set_path should be overrideable on command line,
17:27 <serue> and we completely democratize this
17:27 <serue> so I can do lxc-ls --lxcpath=/home/serue/lxcpath1
17:28 <serue> root doesn't see those and has no easy way to discover them, but really is there anything special about containers versus just running 'unshare' by hand
17:28 <serue> if admin wants to throttle me they can either ask me to do it, or just start killing my process trees
17:28 <serue> <shrug>
17:29 <serue> what do you think? fully democratized versus system-wide container listings and system-wide unique container names?
17:30 <stgraber> so my vague plan was to do it the "upstart way" where we now have per-user upstart and per-system upstart
17:30 <stgraber> that'd mean that all commands would grow a new --system and --user option which wouldn't be required by default
17:30 <stgraber> if uid != 0, you default to --user and it uses whatever default path we agree on
17:31 <stgraber> if uid == 0, it defaults to system mode, where it uses /var/lib/lxc (or whatever the path is)
17:31 <serue> that seems potentially limiting,
17:31 <serue> (with no advantage)
17:31 <serue> i.e. I may want separate container lists for juju and for manual lxc
17:31 <serue> I really don't want lxc-ls showing me all that juju crap unless I ask for it
17:31 <serue> with lxc-ls --lxcpath=~/jujulxc
17:32 <serue> heh, rather than stating it so, lemme ask - what is the advantage to trying to enforce a common path?
17:32 <stgraber> so people don't need to pass it on the command line in 90% of the cases
17:32 <serue> (cause I'm talking off the top of my head on this, and have not sufficiently thought it through)
17:32 <serue> sure - we can have a *default*
17:32 <dengen> i've wondered if we need container groups
17:32 <serue> I'm all for sane simple easy defaults
17:33 <serue> ah, cgroups,
17:33 <serue> my feeling is we want to have /sys/fs/cgroup/$cgroup/$user anyway,
17:33 <serue> Then lxc cgroups would go into /sys/fs/cgroup/$cgroup/$user/lxc/$container,
17:33 <serue> which belies what I said a few lines ago
17:33 <dengen> but do we want a general facility to group sets of containers?
17:33 <stgraber> so yeah, I suppose we could make sure all the commands grow a --lxcpath which overrides our default (where the default differs if you're root or non-root as we don't want to end up with containers in /root/.something)
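
The path selection being proposed here (an explicit --lxcpath wins, otherwise a default that depends on whether the caller is root) can be sketched as below. The function resolve_lxcpath is hypothetical, /var/lib/lxc is the system path mentioned in the discussion, and the per-user suffix is only a placeholder since no default had been agreed on.

```python
import os

SYSTEM_LXCPATH = "/var/lib/lxc"            # system path named in the discussion
USER_LXCPATH_SUFFIX = ".local/share/lxc"   # hypothetical placeholder, not agreed on


def resolve_lxcpath(uid, home, override=None):
    """Pick the container directory: explicit override, else a per-uid default."""
    if override:
        return override          # e.g. the value given to --lxcpath
    if uid == 0:
        return SYSTEM_LXCPATH    # root defaults to system mode
    return os.path.join(home, USER_LXCPATH_SUFFIX)
```

For example, resolve_lxcpath(1000, "/home/serue", "/home/serue/lxcpath1") returns the override unchanged, while resolve_lxcpath(0, "/root") falls back to /var/lib/lxc so that root's containers do not end up under /root.
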
17:34 <dengen> $user is certainly one logical grouping
17:34 <serue> dengen: we'd at least have a system (root) group and a group for each user, with what i describe above,
17:34 <stgraber> I suppose we could grow a new container config option to prefix the cgroups, which would give you cgroup grouping
17:35 <serue> yes, and perhaps even group-wide settings
17:35 <stgraber> so if you set lxc.cgroup.prefix = blah, you'd end up with /sys/fs/cgroup/memory/stgraber/lxc/blah/ as the cgroup memory path
17:35 <serue> Note that upstream is working hard to make hierarchy in cgroups more sensible
17:35 <stgraber> well, /blah/containername
17:35 <dengen> yeah it might make use_hierarchy interesting
17:35 <serue> stgraber: I like the idea
17:36 <serue> dengen: would that suffice for what you're thinking, or is there more you'd want?
17:36 <dengen> well, i'm wondering if people want to set for instance a memory limit on a set of containers, so how we should group them
17:37 <dengen> i think that would work and you just have to set the limit at the right level of the tree
17:37 <serue> dengen: actually combining stgraber's idea with a per-user libcgroup configuration might work
17:37 <serue> So from libcgroup we would want (assuming we don't want to roll our own, which I don't think we do)
17:38 <serue> libcgroup-pam being usable, and per-user limits set by libcgroup-pam on first login
17:38 <serue> (The former we've talked about recently and is on jbernard's list, the latter not yet)
17:40 <serue> everyone is happy with that as a starting design?
17:40 <stgraber> can we delegate (chown) cgroups? and if we can, how do we prevent the user raising a limit?
17:40 <serue> hierarchies is how we prevent raising the limits
17:41 <dengen> i think that works for users, i'm wondering if we need to allow for groupings based on other criteria, but i haven't thought it through yet
17:41 <serue> dengen: when you think it through some more, can you send an email to the list?
17:41 <dengen> yep
17:41 <serue> stgraber: and yeah we chown cgroups
17:41 <stgraber> serue: right, but then we'd want to set the limit below the one we chown right?
17:41 <serue> but since hierarchies still suck we haven't bothered so far
17:42 <serue> above? <tries to get his mental picture straight>
17:42 <stgraber> heh, yeah, above
17:42 <serue> (you can play with that now btw with cgroup-bin, which lets you specify user/group to own a cgroup you create with cgcreate)
17:42 <serue> (then you can enter it unprivileged with cgexec)
17:43 <stgraber> so you have /sys/fs/cgroup/memory/stgraber chowned to me, I can change /sys/fs/cgroup/memory/stgraber/<whatever> so the only way to prevent me from raising the limit is to set it in /sys/fs/cgroup/memory/
17:43 <stgraber> if that's the case, then you need two levels per user, one where you set the limit and one you chown to the user
17:43 <serue> yes
17:44 <serue> in fact that's why we use two levels per container in the cgroup-enabling mounthook
17:44 <stgraber> so in your current plan, you'd only chown /lxc and keep /<user> owned by root?
17:45 <serue> I don't think so, because that's too lxc specific
17:45 <serue> I'd rather do /sys/fs/cgroup/memory/stgraber/user/ or something like that
17:45 <serue> where /sys/fs/cgroup/memory/stgraber has the limits and .../user is chowned
17:45 <serue> then lxc would be under that
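
Why the two levels are needed can be shown with a small model. It assumes a controller that honours the hierarchy, so a group is bounded by the tightest limit on itself or any ancestor; the function effective_limit and the limit values are hypothetical, and paths are written relative to /sys/fs/cgroup/memory.

```python
def effective_limit(path, limits):
    """Smallest limit set on path or any of its ancestors; None if unlimited."""
    parts = path.strip("/").split("/")
    best = None
    for i in range(1, len(parts) + 1):
        node = "/" + "/".join(parts[:i])
        value = limits.get(node)
        if value is not None and (best is None or value < best):
            best = value
    return best


MiB = 2**20
limits = {
    # Owned by root: the user cannot write here, so this cap is firm.
    "/stgraber": 512 * MiB,
    # Inside the chowned subtree: the user may set anything, even a huge value.
    "/stgraber/user/lxc/c1": 4096 * MiB,
}
```

In this model effective_limit("/stgraber/user/lxc/c1", limits) is still 512 MiB: whatever the user writes below the chowned .../user directory, the root-owned parent keeps the cap.
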
17:45 <serue> it's a long name, yes, but we should move toward using cgroup-bin or lxc-cgroup to set things
17:46 <stgraber> that's ugly and long (my cgroup htop column will completely overflow) but yeah, that's the only reasonable way to implement this
17:46 <serue> especially since we're moving toward only one hierarchy mount per controller,
17:46 <dengen> is there a performance penalty for deep nesting?
17:46 <serue> there is, actually
17:47 <serue> I'm hoping they will work on that
17:47 <serue> we might need to discuss with upstream and get guidance
17:47 <dengen> (particularly with hierarchy i guess)
17:47 <serue> right. libvirt in fact has recently shortened their depth for that reason
17:48 <serue> Yeah we'll need to ping upstream and see if that is expected to get better or not
17:48 <serue> (for the moment it's moot, until all cgroups respect hierarchy, anyway)
17:48 <serue> So that brings up a short-term workaround we'll need,
17:48 <stgraber> sounds like a case where you could trade memory for performance and do a bit of caching so you don't need to resolve the whole hierarchy to check the limits
17:49 <serue> well, again i need to think on it more, but we might want a privileged helper to set up the cgroups for now
17:50 <serue> s/might/will/
17:50 <serue> it's not safe to have containers write to cgroups yet, and as long as we have to prevent that with LSM anyway,
17:50 <serue> we may as well use a shortened depth and keep it all owned by root
17:50 <serue> so perhaps /sys/fs/cgroup/memory/user.serge.lxc1 ?
17:51 <serue> u.serge.lxc1 ?
17:51 <serue> This means we'd have to have an executable, setuid-root, to set that up for us during lxc-start
17:52 <serue> but that would be more immediately usable than the pie-in-the-sky hierarchy plan
17:52 <stgraber> lxc.$USER may make more sense, that way we have /sys/fs/cgroup/memory/lxc for the system one and /sys/fs/cgroup/memory/lxc.$USER for the user ones
17:52 <serue> yeah that's good
17:53 <stgraber> and AFAIK we only had the cgroup paths hardcoded in the python code and that's been fixed with the new get/set_cgroup_item API calls, so we can change the paths without risking breaking anything (in theory ;))
17:53 <serue> I'm also not sure about seccomp - we may not be able to set that without privilege,
17:53 <serue> cool
17:54 <serue> ok, so thus far we will require callouts to setuid-root helpers to do: newuidmap, cgroup setup, and nic hookup,
17:54 <stgraber> serue: I don't care all that much about seccomp as the biggest reason for it was for distros without apparmor or userns trying to make their containers safer
17:55 <stgraber> serue: so lacking seccomp support for userns initially shouldn't really be a problem
17:55 <serue> stgraber: heh, I don't feel quite the same way - I think it's an extra layer still below apparmor, for when some compat_xyz syscall gets 0wned
17:55 <serue> but fact is I don't think anyone is using it, so...
17:55 <stgraber> oh, and once we have the wrapper land, I'd strongly suggest we burn lxc-setcap and lxc-setuid to their death, they've only been a source of problems
17:55 <serue> it shouldn't hold us back for now.
17:56 <serue> I wouldn't object to dropping those out of the source
17:56 <serue> the idea was nice, but no one has tracked them so I doubt they work, or if they do, i don't think they're safe...
17:56 <stgraber> serue: well, I see seccomp as being mostly useful for an admin who doesn't trust some syscalls and wants to block them, but if it's a multi-user machine with random users being able to login, well they can just as well call those syscalls outside of lxc...
17:57 <stgraber> well, setuid should work, but I don't trust it. I'm pretty sure setcap is broken as nobody really updated it in a while (that I can remember)
17:57 <serue> yes, but what if my user decides to start a container on which to host some service which he announces on freenode
17:57 <serue> isn't lxc-setuid just to un-do lxc-setcap?
17:57 <serue> oh, no, i guess not
17:58 <serue> Anyway! is there anything else about userns we should discuss?
17:58 <stgraber> nah, lxc-setuid actually sets the binaries setuid root IIRC
17:58 <stgraber> (I intend to drop lxc-setcap and lxc-setuid when we rebase the Ubuntu package on alpha3, other distros can keep them for now if they wish though)
17:59 <serue> ok - I will post a log of this on wiki.ubuntu.com/ somewhere, and post a link to the mailing list.
17:59 <serue> After some thinking, I/we can come up with a design and plan for enabling unprivileged user-ns in lxc
18:00 <dengen> i think they don't have man pages, so i'm in favor of dropping them
18:00 <serue> oh, right, that reminds me, one other thing we'll need to decide is how to specify lxc.id_map at lxc-create time
18:01 <stgraber> oh, I may end up doing that now then, as I said I'd write manpages for any tool without one, that'd be two less to do
18:01 <serue> it's complicated and verbose enough that just requiring the use of a custom lxc.conf is fine with me, but that does feel like more work
18:01 <serue> so I'd like lxc-create -t ubuntu -n r1 -map 100000:10000 perhaps
18:01 <serue> anyway, go forth and cull dangerous binaries from the package
18:01 <dengen> stgraber: exactly
18:02 <serue> thanks guys, ttyl
18:07 <stgraber> serue: thanks!