http://ramirose.wix.com/ramirosen1/121
Rami Rosen
Haifux, May 2013
www.haifux.org
Resource management: Linux kernel Namespaces and cgroups
TOC
PID namespaces
cgroups
Note: All code examples are from for_3_10 branch of cgroup git tree (3.9.0-rc1, April 2013)
links
Mounting cgroups
user namespaces
UTS namespace
Network Namespace
Mount namespace
General
The presentation deals with two Linux process resource management solutions: namespaces and cgroups.
We will look at:
● Kernel implementation details.
● What was added/changed, in brief.
● User space interface.
● Some working examples.
● Usage of namespaces and cgroups in other projects.
● Is process virtualization indeed lightweight compared to OS virtualization?
● Comparison to VMware/qemu/ScaleMP or even to Xen/KVM.
Namespaces
● Namespaces - lightweight process virtualization.
– Isolation: Enable a process (or several processes) to have different
views of the system than other processes.
– 1992: “The Use of Name Spaces in Plan 9”
– http://www.cs.bell-labs.com/sys/doc/names.html
● Rob Pike et al, ACM SIGOPS European Workshop 1992.
– Much like Zones in Solaris.
– No hypervisor layer (as in OS virtualization like KVM, Xen)
– Only one system call was added (setns())
– Used in Checkpoint/Restart
● Developers: Eric W. Biederman, Pavel Emelyanov, Al Viro, Cyrill Gorcunov, more.
Namespaces - contd
There are currently 6 namespaces:
● mnt (mount points, filesystems)
● pid (processes)
● net (network stack)
● ipc (System V IPC)
● uts (hostname)
● user (UIDs)
Namespaces - contd
It was intended that there will be 10 namespaces: the following 4
namespaces are not implemented (yet):
● security namespace
● security keys namespace
● device namespace
● time namespace.
– There was a time namespace patch – but it was not applied.
– See: PATCH 0/4 - Time virtualization:
– http://lwn.net/Articles/179825/
● see ols2006, "Multiple Instances of the Global Linux Namespaces" Eric
W. Biederman
Namespaces - contd
● Mount namespaces were the first type of namespace to be
implemented on Linux by Al Viro, appearing in 2002.
– Linux 2.4.19.
● CLONE_NEWNS flag was added (stands for “new namespace”; at
that time, no other namespace was planned, so it was not called
new mount...)
● User namespace was the last to be implemented. A number of Linux
filesystems are not yet user-namespace aware
Implementation details
● Implementation (partial):
– 6 CLONE_NEW* flags were added (include/linux/sched.h).
● These flags (or a combination of them) can be used in the clone() or unshare() syscalls to create a namespace.
● In setns(), the flags are optional.
Flag            Kernel version   Required capability
CLONE_NEWNS     2.4.19           CAP_SYS_ADMIN
CLONE_NEWUTS    2.6.19           CAP_SYS_ADMIN
CLONE_NEWIPC    2.6.19           CAP_SYS_ADMIN
CLONE_NEWPID    2.6.24           CAP_SYS_ADMIN
CLONE_NEWNET    2.6.29           CAP_SYS_ADMIN
CLONE_NEWUSER   3.8              No capability is required
Implementation - contd
● Three system calls are used for namespaces:
● clone() - creates a new process and a new namespace; the
process is attached to the new namespace.
– Process creation and process termination methods, fork() and exit() methods,
were patched to handle the new namespace CLONE_NEW* flags.
● unshare() - does not create a new process; creates a new
namespace and attaches the current process to it.
– unshare() was added in 2005, not only for namespaces but also for security;
see “new system call, unshare”: http://lwn.net/Articles/135266/
● setns() - a new system call was added, for joining an existing
namespace.
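To make the flags concrete, here is a minimal sketch that lists the six CLONE_NEW* values from include/uapi/linux/sched.h and wraps unshare() through ctypes. The helper names flags_for and unshare_namespaces are hypothetical, chosen only for this illustration. The call needs CAP_SYS_ADMIN for every namespace except the user namespace.

```python
import ctypes
import os

# Flag values as defined in include/uapi/linux/sched.h.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUTS = 0x04000000   # UTS namespace
CLONE_NEWIPC = 0x08000000   # IPC namespace
CLONE_NEWUSER = 0x10000000  # user namespace
CLONE_NEWPID = 0x20000000   # PID namespace
CLONE_NEWNET = 0x40000000   # network namespace

# Keys follow the names used under /proc/<pid>/ns.
FLAGS = {
    "mnt": CLONE_NEWNS,
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "user": CLONE_NEWUSER,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}


def flags_for(names):
    """OR together the CLONE_NEW* flags for the given namespace names."""
    mask = 0
    for name in names:
        mask |= FLAGS[name]
    return mask


def unshare_namespaces(names):
    """Hypothetical wrapper: detach the calling process into new namespaces."""
    libc = ctypes.CDLL(None, use_errno=True)  # the C library of this process
    if libc.unshare(flags_for(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, flags_for(["net", "user"]) gives the combined mask that would be passed to clone() or unshare() to request both a network and a user namespace in one call.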
Nameless namespaces
From man (2) clone:
...
int clone(int (*fn)(void *), void *child_stack,
int flags, void *arg, ...
/* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
...
● flags holds the CLONE_* flags, including the namespace CLONE_NEW* flags. There are more than 20 flags in total.
● See include/uapi/linux/sched.h
● There is no parameter for a namespace name.
● How do we know if two processes are in the same namespace?
● Namespaces do not have names.
● Six entries (inodes) were added under /proc/<pid>/ns, one for each namespace (in kernel 3.8 and higher).
● Each namespace has a unique inode number.
● The inode number of each namespace is assigned when the namespace is created.
Nameless namespaces
● ls -al /proc/<pid>/ns
lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
You can also use readlink.
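Since namespaces have no names, comparing these link targets is how two processes can be checked for a shared namespace. The sketch below assumes the type:[inode] link format shown in the listing; the function names are illustrative only, not part of any existing tool.

```python
import os
import re

# Link targets look like "uts:[4026531838]".
_NS_LINK = re.compile(r"^([a-z]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % (target,))
    return match.group(1), int(match.group(2))


def namespace_id(pid, ns_type):
    """Read /proc/<pid>/ns/<ns_type> and return its (type, inode) pair."""
    return parse_ns_link(os.readlink("/proc/%s/ns/%s" % (pid, ns_type)))


def same_namespace(pid_a, pid_b, ns_type):
    """True when both processes point at the same namespace inode."""
    return namespace_id(pid_a, ns_type) == namespace_id(pid_b, ns_type)
```

Reading another user's /proc/<pid>/ns entries may require root, so same_namespace() can raise PermissionError for processes you do not own.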
Implementation - contd
● A member named nsproxy was added to the process descriptor, struct task_struct.
● A method named task_nsproxy(struct task_struct *tsk) was added to access the nsproxy of a specified process (include/linux/nsproxy.h).
● nsproxy includes 5 inner namespaces:
● uts_ns, ipc_ns, mnt_ns, pid_ns, net_ns;
● Notice that the user namespace is missing from this list;
● it is a member of the credentials object (struct cred), which is a member of the process descriptor, task_struct.
● There is an initial, default namespace for each namespace.
Implementation - contd
● Kernel config items: CONFIG_NAMESPACES
CONFIG_UTS_NS
CONFIG_IPC_NS
CONFIG_USER_NS
CONFIG_PID_NS
CONFIG_NET_NS
● User space additions:
● IPROUTE package
● some additions like ip netns add/ip netns del and more.
● util-linux package
● unshare util with support for all the 6 namespaces.
● nsenter – a wrapper around setns().
UTS namespace
● uts - (Unix timesharing)
– Very simple to implement.
Added a member named uts_ns (uts_namespace object) to the nsproxy.
process descriptor (task_struct)
  nsproxy
    uts_ns (uts_namespace object)
      name (new_utsname object)
        sysname
        nodename
        release
        version
        machine
        domainname
(the last six are the fields of the new_utsname struct)
UTS namespace - contd
The old implementation of gethostname():
asmlinkage long sys_gethostname(char __user *name, int len)
{
...
if (copy_to_user(name, system_utsname.nodename, i))
... errno = -EFAULT;
}
(system_utsname is a global)
kernel/sys.c, Kernel v2.6.11.5
UTS namespace - contd
A method called utsname() was added:
static inline struct new_utsname *utsname(void)
{
return &current->nsproxy->uts_ns->name;
}
The new implementation of gethostname():
SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
{
struct new_utsname *u;
...
u = utsname();
if (copy_to_user(name, u->nodename, i))
errno = -EFAULT;
...
}
Similar approach in uname() and sethostname() syscalls.
UTS namespace - Example
We have a machine where the hostname is myoldhostname.
uname -n
myoldhostname
unshare -u /bin/bash
This creates a UTS namespace with the unshare() syscall and calls execvp() to invoke bash.
Then:
hostname mynewhostname
uname -n
mynewhostname
Now, from a different terminal, we run uname -n and we will see myoldhostname.
UTS namespace - Example
nsexec
nsexec is a package by Serge Hallyn; it consists of a program called nsexec.c which creates tasks in new namespaces (there are some more utils in it) by clone() or by unshare() with fork().
https://launchpad.net/~serge-hallyn/+archive/nsexec
Again we have a machine where the hostname is myoldhostname.
uname -n
myoldhostname
IPC namespaces
The same principle as UTS; nothing special, just more code.
Added a member named ipc_ns
(ipc_namespace object) to the nsproxy.
●CONFIG_POSIX_MQUEUE or CONFIG_SYSVIPC must be set
Network Namespaces
● A network namespace is logically another copy of the network stack,
with its own routes, firewall rules, and network devices.
● The network namespace is struct net. (defined in
include/net/net_namespace.h)
struct net includes all network stack ingredients, like:
– Loopback device.
– SNMP stats. (netns_mib)
– All network tables: routing, neighboring, etc.
– All sockets
– /procfs and /sysfs entries.
Implementation guidelines
• A network device belongs to exactly one network namespace.
● Added to struct net_device structure:
● struct net *nd_net;
for the network namespace this network device is inside.
● Added a method, dev_net(const struct net_device *dev), to access the nd_net namespace of a network device.
• A socket belongs to exactly one network namespace.
● Added sk_net to struct sock (also a pointer to struct net), for the network namespace this socket is inside.
● Added sock_net() and sock_net_set() methods (get/set the network namespace of a socket).
Network Namespaces - contd
● Added a system wide linked list of all namespaces: net_namespace_list, and a macro to traverse it (for_each_net())
● The initial network namespace, init_net (instance of struct net), includes
the loopback device and all physical devices, the networking tables, etc.
● Each newly created network namespace includes only the loopback device.
● There are no sockets in a newly created namespace:
netstat -nl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node Path
Example
● Create two namespaces, called "myns1" and "myns2":
● ip netns add myns1
● ip netns add myns2
– (In fedora 18, ip netns is included in the iproute package).
● This triggers:
● creation of /var/run/netns/myns1,/var/run/netns/myns2 empty folders
● calling the unshare() system call with CLONE_NEWNET.
– unshare() does not trigger cloning of a process; it does create a new namespace (a network namespace, because of the CLONE_NEWNET flag).
● see netns_add() in ipnetns.c (iproute2)
● You can use the file descriptor of /var/run/netns/myns1 with the setns() system call.
● From man 2 setns:
...
int setns(int fd, int nstype);
DESCRIPTION
Given a file descriptor referring to a namespace, reassociate the calling
thread with that namespace.
...
● If you pass 0 as nstype, no check is done on the type of namespace the fd refers to.
● If you pass some nstype, like CLONE_NEWNET or CLONE_NEWUTS, the method verifies that the specified nstype corresponds to the specified fd.
Network Namespaces - delete
● You delete a namespace by:
● ip netns del myns1
– This unmounts and removes /var/run/netns/myns1
– see netns_delete() in ipnetns.c
– Will not delete a network namespace if there is one or more processes attached to it.
● Notice that after deleting a namespace, all its migratable network devices
are moved to the default network namespace;
● unmoveable devices (devices that have NETIF_F_NETNS_LOCAL in their features) and virtual devices are not moved to the default network namespace.
● (The semantics of migratable network devices and unmoveable devices are taken from the default_device_exit() method, net/core/dev.c).
NETIF_F_NETNS_LOCAL
● NETIF_F_NETNS_LOCAL is a network device feature
– (a member of net_device struct, of type netdev_features_t)
● It is set for devices that are not allowed to move between network namespaces; sometimes these devices are named "local devices".
● Examples of local devices (where NETIF_F_NETNS_LOCAL is set):
– Loopback, VXLAN, ppp, bridge.
– You can see it with ethtool (by ethtool -k, or ethtool --show-features)
– ethtool -k p2p1
netns-local: off [fixed]
For the loopback device:
ethtool -k lo
netns-local: on [fixed]
VXLAN
● Virtual eXtensible Local Area Network.
● VXLAN is a standard protocol to transfer layer 2 Ethernet packets
over UDP.
● Why do we need it ?
● There are firewalls which block tunnels and allow, for example, only
TCP/UDP traffic.
● developed by Stephen Hemminger.
– drivers/net/vxlan.c
– IANA assigned port is 4789
– Linux default is 8472 (legacy)
When trying to move a device with the NETIF_F_NETNS_LOCAL flag, like VXLAN, from one namespace to another, we will encounter an error:
ip link add myvxlan type vxlan id 1
ip link set myvxlan netns myns1
We will get: RTNETLINK answers: Invalid argument
int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat)
{
int err;
err = -EINVAL;
if (dev->features & NETIF_F_NETNS_LOCAL)
goto out;
...
}
● You list the network namespaces (which were added via “ip netns add”) by:
● ip netns list
– this simply reads the namespaces under /var/run/netns
● You can find the pid (or list of pids) in a specified net namespace by:
– ip netns pids namespaceName
● You can find the net namespace of a specified pid by:
– ip netns identify #pid
You can monitor addition/removal of network
namespaces by:
ip netns monitor
- prints one line for each addition/removal event it sees
● Assigning p2p1 interface to myns1 network namespace:
● ip link set p2p1 netns myns1
– This triggers changing the network namespace of the net_device to “myns1”.
– It is handled by dev_change_net_namespace(), net/core/dev.c.
● Now, running:
● ip netns exec myns1 bash
● will transfer me to the myns1 network namespace; so if I run there:
● ifconfig -a
● I will see p2p1 (and the loopback device);
– Also under /sys/class/net, there will be only the p2p1 and lo folders.
● But if I open a new terminal and type ifconfig -a, I will not see p2p1.
● Also, going to the second namespace by running:
● ip netns exec myns2 bash
● will transfer me to the myns2 network namespace; but if we run there:
● ifconfig -a
– We will not see p2p1; we will only see the loopback device.
● We move a network device back to the default, initial namespace by:
ip link set p2p1 netns 1
● In that namespace, network applications which look for files under /etc will first look in /etc/netns/myns1/, and then in /etc.
● For example, if we add the entry "192.168.2.111 www.dummy.com"
● in /etc/netns/myns1/hosts, and run:
● ping www.dummy.com
● we will see that we are pinging 192.168.2.111.
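The lookup order described above can be modelled in a few lines. This is only a sketch of the convention (a per-namespace file, when present, wins over the one in /etc); to my understanding, ip netns exec actually gets this effect by bind-mounting the files from /etc/netns/<name>/ over /etc inside a private mount namespace. The function name resolve_config and the injectable exists parameter are assumptions made for illustration.

```python
import os


def resolve_config(netns_name, filename, exists=os.path.exists):
    """Return the path a namespace-aware lookup of /etc/<filename> would use.

    A per-namespace file under /etc/netns/<netns_name>/ takes precedence;
    otherwise the system-wide file under /etc is used.
    """
    candidate = os.path.join("/etc/netns", netns_name, filename)
    if exists(candidate):
        return candidate
    return os.path.join("/etc", filename)
```

The exists argument is there so the precedence rule can be exercised without touching the real filesystem.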
veth
● You can communicate between two network namespaces by:
● creating a pair of network devices (veth) and moving one to another network namespace.
● Veth (Virtual Ethernet) is like a pipe.
● unix sockets (use paths on the filesystem).
Example with veth:
Create two namespaces, myns1 and myns2:
ip netns add myns1
ip netns add myns2
veth
ip netns exec myns1 bash
- open a shell in the myns1 net namespace
ip link add name if_one type veth peer name if_one_peer
- create a veth interface pair, if_one and if_one_peer
- ifconfig running in myns1 will show if_one and if_one_peer and lo (the loopback device)
- ifconfig running in myns2 will show only lo (the loopback device)
Run from the myns1 shell:
ip link set dev if_one_peer netns myns2
- move if_one_peer to myns2
- now ifconfig running in myns2 will show if_one_peer and lo (the loopback device)
- Now set IP addresses on if_one (myns1) and if_one_peer (myns2) and you can send traffic.
unshare util
● The unshare utility
● The recent util-linux git tree has the unshare utility with support for all six namespaces:
http://git.kernel.org/cgit/utils/util-linux/util-linux.git
./unshare --help
...
Options:
 -m, --mount   unshare mounts namespace
 -u, --uts     unshare UTS namespace (hostname etc)
 -i, --ipc     unshare System V IPC namespace
 -n, --net     unshare network namespace
 -p, --pid     unshare pid namespace
 -U, --user    unshare user namespace
● For example:
● Type:
● ./unshare --net bash
– A new network namespace was created and the bash process was started inside that namespace.
● Now run ifconfig -a
● You will see only the loopback device.
– With the unshare util, no folder is created under /var/run/netns; also, network applications in the net namespace we created do not look under /etc/netns
– If you kill this bash or exit from it, the network namespace will be freed.
This is not the case as with ip netns exec myns1 bash; in that case, killing/exiting the bash does not trigger destroying the namespace.
For implementation details, look in put_net(struct net *net) and the reference count (named “count”) of the network namespace struct net.
Mount namespaces
● Added a member named mnt_ns (mnt_namespace object) to the nsproxy.
● We copy the mount namespace of the calling process using a generic filesystem method (see copy_tree() in dup_mnt_ns()).
● In the new mount namespace, all previous mounts will be visible; and from now on:
● mounts/unmounts in that mount namespace are invisible to the rest of the system.
● mounts/unmounts in the global namespace are visible in that namespace.
● pam_namespace module uses mount namespaces (with unshare(CLONE_NEWNS)) (modules/pam_namespace/pam_namespace.c)
mount namespaces: example 1
Example 1 (tested on Ubuntu):
Verify that /dev/sda3 is not mounted:
mount | grep /dev/sda3
should give nothing.
unshare -m /bin/bash
mount /dev/sda3 /mnt/sda3
now run
mount | grep sda3
We will see:
/dev/sda3 on /mnt/sda3 type ext3 (rw)
readlink /proc/$$/ns/mnt
mnt:[4026532114]
From another terminal run
readlink /proc/$$/ns/mnt
mnt:[4026531840]
The result shows that we are in a different namespace.
Now run:
mount | grep sda3
/dev/sda3 on /mnt/sda3 type ext3 (rw)
Why? We are in a different mount namespace, so we should not have seen the mount which was done from another namespace!
The answer is simple: running mount is not good enough when working with mount namespaces.
The reason is that mount reads /etc/mtab, which was updated by the mount command; the mount command does not access the kernel structures.
What is the solution?
To access the kernel data structures directly, you should run:
cat /proc/mounts | grep sda3
(/proc/mounts is in fact a symbolic link to /proc/self/mounts).
Now you will get no results, as expected.
mount namespaces: example 2
Example 2: tested on Fedora 18
Verify that /dev/sdb3 is not mounted:
mount | grep sdb3
should give nothing.
unshare -m /bin/bash
mount /dev/sdb3 /mnt/sdb3
now run
mount | grep sdb3
You will see:
/dev/sdb3 on /mnt/sdb3 type ext4 (rw,relatime,data=ordered)
readlink /proc/$$/ns/mnt
mnt:[4026532381]
From another terminal run:
readlink /proc/$$/ns/mnt
mnt:[4026531840]
This shows that we are in a different namespace.
Now run:
mount | grep sdb3
/dev/sdb3 on /mnt/sdb3 type ext4 (rw,relatime,data=ordered)
- We know now that we should use cat /proc/mounts (and not mount) to get the right answer when working with namespaces; so:
cat /proc/mounts | grep sdb3
/dev/sdb3 /mnt/sdb3 ext4 rw,relatime,data=ordered 0 0
Why is it so? We should have seen no results, as in the previous example.
Answer: Fedora runs systemd; systemd uses the shared flag for mounting /.
From systemd source code: (src/core/mount-setup.c)
int mount_setup(bool loaded_policy) {
...
if (mount(NULL, "/", NULL, MS_REC|MS_SHARED, NULL) < 0)
log_warning("Failed to set up the root directory for shared mount propagation: %m");
...
}
(MS_REC stands for recursive mount)
How do I know whether we have the shared flag?
cat /proc/self/mountinfo | grep shared
we will see:
...
33 1 8:3 / / rw,relatime shared:1 - ext4 /dev/sda3 rw,data=ordered
...
What to do?
mount --make-rprivate -o remount / /dev/sda3
This changes the shared flag to private, recursively.
--make-rprivate – set the private flag recursively
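The grep on /proc/self/mountinfo shown earlier can also be scripted. The sketch below assumes the documented mountinfo layout, in which the optional propagation fields (such as shared:1) sit between the mount options and a lone "-" separator; the function names are hypothetical.

```python
def propagation_tags(mountinfo_line):
    """Return the optional fields (e.g. ['shared:1']) of one mountinfo line."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed: mount ID, parent ID, major:minor, root,
    # mount point, mount options. Optional fields follow, up to "-".
    separator = fields.index("-", 6)
    return fields[6:separator]


def is_shared(mountinfo_line):
    """True when the mount belongs to a shared peer group."""
    return any(tag.startswith("shared:") for tag in propagation_tags(mountinfo_line))
```

A private mount simply has no optional fields, so propagation_tags() returns an empty list for it.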
Shared subtrees
(Diagram: root / with children /bin, /mnt and /users; /users contains /users/user1 and /users/user2.)
Now, we want the user1 and user2 folders to see the whole filesystem; we will run
mount --bind / /users/user1
mount --bind / /users/user2
By default, the filesystem is mounted as private, unless the shared mount flag is set explicitly.
Shared subtrees - contd
(Diagram: the same tree after the two bind mounts; /users/user1 and /users/user2 each now show a copy of the root tree, with /bin, /mnt and /users (user1, user2) beneath them.)
Shared subtrees – Quiz
Quiz: Now, we mount a USB disk-on-key on /mnt/dok. Will it be seen in /users/user1/mnt or /users/user2/mnt?
Shared subtrees - contd
The answer is no, since by default, the filesystem is mounted as private. To make the dok visible also under /users/user1/mnt or /users/user2/mnt, we should mount the filesystem as shared:
mount / --make-rshared
And then mount the USB disk-on-key again.
The shared subtrees patch is from 2005, by Ram Pai.
It adds some mount flags like --make-slave, --make-rslave, --make-unbindable, --make-runbindable and more.
The patch added these kernel mount flags: MS_UNBINDABLE, MS_PRIVATE, MS_SLAVE and MS_SHARED
The shared flag is in use by the fuse filesystem.
PID namespaces
● Added a member named pid_ns (pid_namespace object) to the nsproxy.
● Processes in different PID namespaces can have the same process ID.
● When creating the first process in a new namespace, its PID is 1.
● Behavior like the “init” process:
– When a process dies, all its orphaned children will now have the process with PID 1 as their parent (child reaping).
– Sending SIGKILL signal does not kill process 1, regardless of which namespace the command was issued from (initial namespace or other pid namespace).
PID namespaces - contd
● When a new namespace is created, we cannot see from it the PID of the parent namespace; running getppid() from the new pid namespace will return 0.
● But all PIDs which are used in this namespace are visible to the parent namespace.
● pid namespaces can be nested, up to 32 nesting levels. (MAX_PID_NS_LEVEL).
● See: multi_pidns.c, Michael Kerrisk, from http://lwn.net/Articles/532745/.
● When trying to run multi_pidns with 33, you will get:
– clone: Invalid argument
User Namespaces
● Added a member named user_ns (user_namespace object) to the credentials object (struct cred); as noted earlier, it is not part of the nsproxy.
● include/linux/user_namespace.h
● Includes a pointer named parent to the user_namespace that created it.
● struct user_namespace *parent;
● Includes the effective uid of the process that created it:
● kuid_t owner;
● A process will have a distinct set of UIDs, GIDs and capabilities.
User Namespaces
Creating a new user namespace is done by passing CLONE_NEWUSER to clone() or unshare().
Example:
Running from some user account
id -u
1000 // 1000 is the effective user ID.
id -g
1000 // 1000 is the effective group ID.
(usually the first user added gets uid/gid of 1000)
User Namespaces - example
Capabilities:
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
In order to create a user namespace and start a shell, we will run from that non-root account:
./nsexec -cU /bin/bash
● The c flag is for using clone
● The U flag is for using user namespace (CLONE_NEWUSER flag for clone())
User Namespaces - example - contd
Now from the new shell run
id -u
65534
id -g
65534
● These are default values for the eUID and eGID in the new namespace.
● We will get the same results for effective user id and effective group id also when running ./nsexec -cU /bin/bash as root.
● The defaults can be changed by: /proc/sys/kernel/overflowuid, /proc/sys/kernel/overflowgid
● In fact, the user namespace that was created had full capabilities, but the call to exec() with bash removed them.
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
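The Cap* lines in /proc/self/status are hexadecimal bitmasks with one bit per capability. A small sketch for decoding them follows; the helper names are illustrative, and CAP_SYS_ADMIN is bit 21 in include/uapi/linux/capability.h.

```python
CAP_SYS_ADMIN = 21  # bit number from include/uapi/linux/capability.h


def capability_bits(mask_hex):
    """Return the list of capability bit numbers set in a Cap* hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def has_capability(mask_hex, cap):
    """True when capability number `cap` is set in the hex mask."""
    return bool((int(mask_hex, 16) >> cap) & 1)
```

Applied to the output above, the CapBnd value 0000001fffffffff has 37 bits set, while the all-zero CapEff has none, so the shell holds no effective capabilities, CAP_SYS_ADMIN included.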
User Namespaces - contd
Now run:
echo $$ (get the bash pid)
Now, from a different root terminal, we set the uid_map:
First, we can see that uid_map is uninitialized by:
cat /proc/<pid>/uid_map
Then:
echo 0 1000 10 > /proc/<pid>/uid_map
(<pid> is the pid of the bash process from the previous step).
An entry in uid_map is of the following format:
namespace_first_uid host_first_uid number_of_uids
So this sets the first uid in the new namespace (which corresponds to uid 1000 in the outside world) to be 0; the second will be 1; and so forth, for 10 entries.
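The three-column uid_map format can be worked through in code. The sketch below parses entries of the form namespace_first_uid host_first_uid number_of_uids and translates a uid seen inside the namespace to the corresponding outside uid; the function names are hypothetical.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (ns_first, host_first, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            ns_first, host_first, count = (int(part) for part in line.split())
            entries.append((ns_first, host_first, count))
    return entries


def to_host_id(ns_id, entries):
    """Translate an id inside the namespace to the outside id, or None."""
    for ns_first, host_first, count in entries:
        if ns_first <= ns_id < ns_first + count:
            return host_first + (ns_id - ns_first)
    return None  # unmapped ids show up as the overflow uid/gid
```

With the entry "0 1000 10", uid 0 inside the namespace maps to 1000 outside and uid 9 maps to 1009; uid 10 falls outside the 10-entry range and is unmapped.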
User Namespaces - contd
Note: you can set the uid_map only once for a specific process. Further attempts will fail.
run
id -u
You will get 0.
whoami
root
● User namespace is the only namespace which can be created without the CAP_SYS_ADMIN capability
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
The CapEff (Effective Capabilities) is 1fffffffff -> this is 37 bits of '1', which means all capabilities.
Quiz: Will unshare --net bash work now?
Answer: no
unshare --net bash
unshare: cannot set group id: Invalid argument
But after running, from a different terminal, as root:
echo 0 1000 10 > /proc/2429/gid_map
It will work.
ls /root will fail however:
ls /root/
ls: cannot open directory /root/: Permission denied
Short quiz 1:
I am a regular user, not root. Will clone() with (CLONE_NEWNET) work?
Short quiz 2:
Will clone() with (CLONE_NEWNET | CLONE_NEWUSER) work?
● Quiz 1: No.
● In order to use CLONE_NEWNET we need to have CAP_SYS_ADMIN.
unshare --net bash
unshare: unshare failed: Operation not permitted
● Quiz 2: Yes.
The namespaces code guarantees that the user namespace is the first to be created. For creating a user namespace we don't need CAP_SYS_ADMIN. The user namespace is created with full capabilities, so we can create the network namespace successfully.
./unshare --net --user /bin/bash
No errors!
Quiz 3:
If you run, from a non-root user, unshare --user bash
And then
cat /proc/self/status | grep CapEff
CapEff: 0000000000000000
This means no capabilities. So how was the net namespace, which needs CAP_SYS_ADMIN, created?
Answer: we first do unshare; it is first done with the user namespace. This enables all capabilities. Then we create the namespace. Afterwards, we call exec for the shell; exec removes capabilities.
From unshare.c of util-linux:
if (-1 == unshare(unshare_flags))
err(EXIT_FAILURE, _("unshare failed"));
...
exec_shell();
Anatomy of a user namespaces vulnerability
By Michael Kerrisk, March 2013
About CVE 2013-1858 - exploitable security vulnerability
http://lwn.net/Articles/543273/
cgroups
● The cgroups (control groups) subsystem is a resource management solution providing a generic process-grouping framework.
● This work was started by engineers at Google (primarily Paul Menage and Rohit Seth) in 2006 under the name "process containers"; in 2007, it was renamed to “Control Groups”.
● Maintainers: Li Zefan (Huawei) and Tejun Heo;
● The memory controller (memcg) is maintained separately (4 maintainers)
● Probably the most complex.
– Namespaces provide a per-process resource isolation solution.
– Cgroups provide a resource management solution (handling groups).
● Available in the Fedora 18 kernel and the Ubuntu 12.10 kernel (also some previous releases).
– Fedora systemd uses cgroups.
– Ubuntu does not have systemd. Tip: do tests with Ubuntu, and also make sure that cgroups are not mounted after boot by looking with mount (packages such as cgroup-lite can exist).
● The implementation of cgroups requires a few, simple hooks into the rest
of the kernel, none in performance-critical paths:
– In the boot phase (init/main.c), to perform various initializations.
– In the process creation and destruction methods, fork() and exit().
– A new file system of type "cgroup" (VFS)
– Process descriptor additions (struct task_struct)
– Add procfs entries:
● For each process: /proc/pid/cgroup.
● System-wide: /proc/cgroups
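As an illustration of the per-process procfs entry, the sketch below parses /proc/<pid>/cgroup, whose lines have the form hierarchy-ID:controller-list:cgroup-path. The function name and the sample lines in the usage note are made up for the example.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup text into (hierarchy_id, controllers, path)."""
    result = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only; the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        result.append((int(hierarchy_id), names, path))
    return result
```

For a hypothetical process whose file contains "3:cpuacct,cpu:/" and "2:memory:/mygroup", this returns one tuple per mounted hierarchy, showing which controllers are attached to it and which cgroup the process belongs to.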
– The cgroup modules are not located in one folder but
scattered in the kernel tree according to their functionality:
● memory: mm/memcontrol.c
● cpuset: kernel/cpuset.c.
● net_prio: net/core/netprio_cgroup.c
● devices: security/device_cgroup.c.
● And so on.
cgroups and kernel namespaces
Note that cgroups is not dependent upon namespaces; you can build cgroups without namespaces kernel support.
There was an attempt in the past to add an "ns" subsystem (ns_cgroup, namespace cgroup subsystem); with this, you could mount a namespace subsystem by:
mount -t cgroup -ons.
This code was removed in 2011 (by a patch by Daniel Lezcano).
See:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a77aea92010acf54ad785047234418d5d68772e2
http://ramirose.wix.com/ramirosen74/121
cgroups VFS
● Cgroups use a virtual file system.
– Entries created in it are not persistent; they are gone after a
reboot.
● All cgroups actions are performed via filesystem actions
(create/remove directory, reading/writing to files in it,
mounting/mount options).
● For example:
– cgroup inode_operations for cgroup mkdir/rmdir.
– cgroup file_system_type for cgroup mount/unmount.
– cgroup file_operations for reading/writing to control files.
http://ramirose.wix.com/ramirosen75/121
Mounting cgroups
In order to use a filesystem (browse it, attach tasks to cgroups, etc.) it must be mounted.
A control group hierarchy can be mounted anywhere on the filesystem. Systemd uses /sys/fs/cgroup.
When mounting, we can specify with mount options (-o) which subsystems we want to use.
There are 11 cgroup subsystems (controllers) (kernel 3.9.0-rc4, April 2013); two can be built as modules. (All subsystems are instances of the cgroup_subsys struct.)
cpuset_subsys - defined in kernel/cpuset.c.
freezer_subsys - defined in kernel/cgroup_freezer.c.
mem_cgroup_subsys - defined in mm/memcontrol.c; Aka memcg - memory control groups.
blkio_subsys - defined in block/blk-cgroup.c.
net_cls_subsys - defined in net/sched/cls_cgroup.c (can be built as a kernel module)
net_prio_subsys - defined in net/core/netprio_cgroup.c (can be built as a kernel module)
devices_subsys - defined in security/device_cgroup.c.
perf_subsys (perf_event) - defined in kernel/events/core.c
hugetlb_subsys - defined in mm/hugetlb_cgroup.c.
cpu_cgroup_subsys - defined in kernel/sched/core.c
cpuacct_subsys - defined in kernel/sched/core.c
http://ramirose.wix.com/ramirosen76/121
Mounting cgroups – contd.
In order to mount a cgroup subsystem, first mount a tmpfs on some root folder, here /cgroup:
● mount -t tmpfs tmpfs /cgroup
Then create a folder for the subsystem under /cgroup and mount the subsystem on it.
Mounting of the memory subsystem, for example, is done thus:
● mkdir /cgroup/memtest
● mount -t cgroup -o memory test /cgroup/memtest/
Note that instead of “test” you can use any text; this text is not
handled by the cgroups core. Its only use is when displaying the mount
with the “mount” command or with cat /proc/mounts.
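The mount command line above can also be assembled programmatically. The following is a small Python sketch; cgroup_mount_argv is a hypothetical helper, and rejecting an empty subsystem list is a choice of this sketch, not a rule taken from the slides.

```python
from typing import Iterable, List


def cgroup_mount_argv(subsystems: Iterable[str], mount_point: str,
                      label: str = "test") -> List[str]:
    """Build the argv that mounts a cgroup hierarchy with the given subsystems.

    `label` is the free-text "device" argument; it only shows up in the
    output of mount and /proc/mounts.
    """
    names = list(subsystems)
    if not names:
        # Restriction of this sketch only: require an explicit subsystem list.
        raise ValueError("at least one subsystem is required")
    return ["mount", "-t", "cgroup", "-o", ",".join(names), label, mount_point]
```

The resulting list can be passed to subprocess.run(argv, check=True) by a root process, after the mount point directory has been created.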
http://ramirose.wix.com/ramirosen77/121
Mounting cgroups – contd.
● Mount creates a cgroupfs_root object + a cgroup (top_cgroup) object.
● Mounting another path with the same subsystems (the same
subsys_mask) reuses the same cgroupfs_root object.
● mkdir increments number_of_cgroups, rmdir decrements number_of_cgroups.
● cgroup1 - created by mkdir /cgroup/memtest/cgroup1.
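cgroupfs_root members (from the diagram):
struct super_block *sb - the super block being used (in memory).
struct cgroup top_cgroup
unsigned long subsys_mask - bitmask of subsystems attached to this hierarchy.
int number_of_cgroups
[Diagram: a cgroupfs_root with its top cgroup; cgroup1, cgroup2 and cgroup3 each link to a parent cgroup through the parent member, and a cgroup links to its hierarchy through cgroupfs_root *root.]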
http://ramirose.wix.com/ramirosen78/121
Mounting a set of subsystems
From Documentation/cgroups/cgroups.txt:
If an active hierarchy with exactly the same set of subsystems
already exists, it will be reused for the new mount.
If no existing hierarchy matches, and any of the requested
subsystems are in use in an existing hierarchy, the mount will fail
with -EBUSY.
Otherwise, a new hierarchy is activated, associated with the
requested subsystems.
http://ramirose.wix.com/ramirosen79/121
First case: Reuse
● mount -t tmpfs test1 /cgroup/test1
● mount -t tmpfs test2 /cgroup/test2
● mount -t cgroup -ocpu,cpuacct test1 /cgroup/test1
● mount -t cgroup -ocpu,cpuacct test2 /cgroup/test2
● This will work; the mount method recognizes that we want to
use the same mask of subsystems in the second case.
– (Behind the scenes, sget(), called from cgroup_mount(), returns an
already allocated superblock; sget() makes sure that the mask of that
sb and the requested mask are identical.)
– Both will use the same cgroupfs_root object.
● This is exactly the first case described in Documentation/cgroups/cgroups.txt
http://ramirose.wix.com/ramirosen80/121
Second case: any of the requested
subsystems are in use
● mount -t tmpfs tmpfs /cgroup/tst1/
● mount -t tmpfs tmpfs /cgroup/tst2/
● mount -t tmpfs tmpfs /cgroup/tst3/
● mount -t cgroup -o freezer tst1 /cgroup/tst1/
● mount -t cgroup -o memory tst2 /cgroup/tst2/
● mount -t cgroup -o freezer,memory tst3 /cgroup/tst3
– The last command will give an error (-EBUSY).
The reason: these subsystems (controllers) have already been
mounted separately, and no existing hierarchy has exactly this set.
● This is exactly the second case described in Documentation/cgroups/cgroups.txt
http://ramirose.wix.com/ramirosen81/121
Third case - no existing hierarchy
No existing hierarchy matches, and none of the requested
subsystems are in use in an existing hierarchy:
mount -t cgroup -o net_prio netpriotest /cgroup/net_prio/
This will succeed.
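The three cases can be summarized as a small decision function. This is a toy Python model of the rule quoted from Documentation/cgroups/cgroups.txt, not the kernel's implementation; mount_outcome and its return strings are hypothetical names.

```python
from typing import FrozenSet, Iterable, List


def mount_outcome(active: List[FrozenSet[str]], requested: Iterable[str]) -> str:
    """Return "reuse", "EBUSY" or "new" for a cgroup mount request.

    `active` holds the subsystem set of every hierarchy already mounted.
    """
    wanted = frozenset(requested)
    # First case: a hierarchy with exactly the same set exists and is reused.
    if wanted in active:
        return "reuse"
    # Second case: no exact match, but a requested subsystem is already in use.
    in_use = frozenset().union(*active)
    if wanted & in_use:
        return "EBUSY"
    # Third case: nothing requested is in use, so a new hierarchy is activated.
    return "new"
```

With freezer and memory mounted separately, mount_outcome([frozenset({"freezer"}), frozenset({"memory"})], ["freezer", "memory"]) gives "EBUSY", matching the second case.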
http://ramirose.wix.com/ramirosen82/121
– under each new cgroup which is created, these 4 files are always created:
● tasks
– list of pids which are attached to this group.
● cgroup.procs.
– list of thread group IDs (listed by TGID) attached to this group.
● cgroup.event_control.
– Example in following slides.
● notify_on_release (boolean).
– For a newly created cgroup, the value of notify_on_release is inherited
from its parent; however, changing notify_on_release in the parent does not
change the value in its existing children.
– Example in following slides.
– For the topmost cgroup root object only, there is also a release_agent – a
command which will be invoked when the last process of a cgroup terminates; the
notify_on_release flag must be set for it to be invoked.
http://ramirose.wix.com/ramirosen83/121
● Each subsystem adds specific control files for its own needs, besides
these 4 files. All control files created by cgroup subsystems are given a
prefix corresponding to their subsystem name. For example:
cpuset subsystem
cpuset.cpus
cpuset.mems
cpuset.cpu_exclusive
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.memory_pressure_enabled
devices subsystem
devices.allow
devices.deny
devices.list
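Because of this prefix convention, the files in a cgroup directory can be sorted by owning subsystem with a few lines of code. A minimal Python sketch follows; group_by_subsystem is a hypothetical helper, and collecting unprefixed names under an empty key is a choice of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_subsystem(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group control-file names by the prefix before the first dot.

    Names without a dot (such as "tasks") are collected under the key "".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in file_names:
        prefix, dot, _rest = name.partition(".")
        groups[prefix if dot else ""].append(name)
    return dict(groups)
```

Passing it the result of os.listdir() on a cgroup directory gives one list per subsystem, plus the core files under "cgroup" and "".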
http://ramirose.wix.com/ramirosen84/121
cpu subsystem
cpu.shares (only if CONFIG_FAIR_GROUP_SCHED is set)
cpu.cfs_quota_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.cfs_period_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.stat (only if CONFIG_CFS_BANDWIDTH is set)
cpu.rt_runtime_us (only if CONFIG_RT_GROUP_SCHED is set)
cpu.rt_period_us (only if CONFIG_RT_GROUP_SCHED is set)
http://ramirose.wix.com/ramirosen85/121
memory subsystem
memory.usage_in_bytes
memory.max_usage_in_bytes
memory.limit_in_bytes
memory.soft_limit_in_bytes
memory.failcnt
memory.stat
memory.force_empty
memory.use_hierarchy
memory.swappiness
memory.move_charge_at_immigrate
memory.oom_control
memory.numa_stat (only if CONFIG_NUMA is set)
memory.kmem.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.slabinfo (only if CONFIG_SLABINFO is set)
memory.memsw.usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.max_usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.limit_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.failcnt (only if CONFIG_MEMCG_SWAP is set)
The memory subsystem has up to 25 control files.
http://ramirose.wix.com/ramirosen86/121
blkio subsystem
blkio.weight_device
blkio.weight
blkio.leaf_weight_device
blkio.leaf_weight
blkio.time
blkio.sectors
blkio.io_service_bytes
blkio.io_serviced
blkio.io_service_time
blkio.io_wait_time
blkio.io_merged
blkio.io_queued
blkio.time_recursive
blkio.sectors_recursive
blkio.io_service_bytes_recursive
blkio.io_serviced_recursive
blkio.io_service_time_recursive
blkio.io_wait_time_recursive
blkio.io_merged_recursive
blkio.io_queued_recursive
blkio.avg_queue_size (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.group_wait_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.idle_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.empty_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.dequeue (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.unaccounted_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.throttle.read_bps_device
blkio.throttle.write_bps_device
blkio.throttle.read_iops_device
blkio.throttle.write_iops_device
blkio.throttle.io_service_bytes
blkio.throttle.io_serviced
http://ramirose.wix.com/ramirosen87/121
netprio
net_prio.ifpriomap
net_prio.prioidx
Note that netprio_cgroup.ko must be loaded (insmod)
for the mount to succeed. Moreover, rmmod will
fail while netprio is mounted.
http://ramirose.wix.com/ramirosen88/121
– When mounting a cgroup subsystem (or a set of cgroup subsystems), all processes in the system belong to it (the top cgroup object).
● After mount -t cgroup -o memory test /cgroup/memtest/
– you can see all tasks by: cat /cgroup/memtest/tasks
– When creating new child cgroups in that hierarchy, none of them has
any tasks initially.
– Example:
– mkdir /cgroup/memtest/group1
– mkdir /cgroup/memtest/group2
– cat /cgroup/memtest/group1/tasks
● Shows nothing.
– cat /cgroup/memtest/group2/tasks
● Shows nothing.
http://ramirose.wix.com/ramirosen89/121
● Any task can be a member of exactly one cgroup in a specific hierarchy.
● Example:
● echo $$ > /cgroup/memtest/group1/tasks
● cat /cgroup/memtest/group1/tasks
● cat /cgroup/memtest/group2/tasks
● Will show that task only in group1/tasks.
● After:
● echo $$ > /cgroup/memtest/group2/tasks
● The task is moved to group2; we will see it only in group2/tasks.
http://ramirose.wix.com/ramirosen90/121
Removing a child group
Removing a child group is done by rmdir.
We cannot remove a child group in these two cases:
● When it has processes attached to it.
● When it has children.
We will get an -EBUSY error in both cases.
Example 1 - processes attached to a group:
echo $$ > /cgroup/memtest/group1/tasks
rmdir /cgroup/memtest/group1
rmdir: failed to remove `/cgroup/memtest/group1': Device or resource busy
Example 2 - group has children:
mkdir /cgroup/memtest/group2/childOfGroup2
cat /cgroup/memtest/group2/tasks
- to make sure that there are no processes in group2.
rmdir /cgroup/memtest/group2/
rmdir: failed to remove `/cgroup/memtest/group2/': Device or resource busy
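The membership and removal rules of the last three slides fit in a few lines. The following is a toy in-memory model in Python, written only to illustrate the rules (one cgroup per task in a hierarchy, empty children, -EBUSY on rmdir); the Hierarchy class is hypothetical and does not touch the real cgroup filesystem.

```python
import errno
from typing import Dict, Set


class Hierarchy:
    """Toy in-memory model of one cgroup hierarchy (not the kernel's code)."""

    def __init__(self, all_pids: Set[int]) -> None:
        # After mount, every task belongs to the top cgroup "/".
        self.tasks: Dict[str, Set[int]] = {"/": set(all_pids)}

    def mkdir(self, path: str) -> None:
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.tasks:
            raise OSError(errno.ENOENT, "No such file or directory")
        if path in self.tasks:
            raise OSError(errno.EEXIST, "File exists")
        # A new child cgroup starts without any tasks.
        self.tasks[path] = set()

    def attach(self, path: str, pid: int) -> None:
        target = self.tasks[path]
        # A task is in exactly one cgroup: remove it from its old one first.
        for members in self.tasks.values():
            members.discard(pid)
        target.add(pid)

    def rmdir(self, path: str) -> None:
        has_children = any(p.startswith(path + "/") for p in self.tasks)
        if self.tasks[path] or has_children:
            raise OSError(errno.EBUSY, "Device or resource busy")
        del self.tasks[path]
```

Attaching a pid to group1 and then to group2 leaves it only in group2, and rmdir raises an OSError with errno EBUSY until the group has neither tasks nor children.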
http://ramirose.wix.com/ramirosen91/121
● Nesting is allowed:
– mkdir /cgroup/memtest/0/FirstSon
– mkdir /cgroup/memtest/0/SecondSon
– mkdir /cgroup/memtest/0/ThirdSon
● However, some subsystems emit a kernel warning when trying to nest; in these subsystems, the .broken_hierarchy boolean member of cgroup_subsys is set explicitly to true.
For example:
struct cgroup_subsys devices_subsys = {
.name = "devices",
...
.broken_hierarchy = true,
};
BTW, a recent patch removed it; in the latest git for-3.10 tree, the only subsystem with broken_hierarchy is blkio.
http://ramirose.wix.com/ramirosen92/121
broken_hierarchy example
● Typing:
● mkdir /sys/fs/cgroup/devices/0
● Will emit no error, but if afterwards we type:
● mkdir /sys/fs/cgroup/devices/0/firstSon
● We will see this warning in the kernel log:
● cgroup: mkdir (4730) created nested cgroup for controller "devices"
which has incomplete hierarchy support. Nested cgroups may
change behavior in the future.
http://ramirose.wix.com/ramirosen93/121
● In this way, we can mount any one of the 11 cgroup subsystems
(controllers) under it:
● mkdir /cgroup/cpuset
● mount -t cgroup -ocpuset cpuset_group /cgroup/cpuset/
● Here too, “cpuset_group” is only a label for the mount command.
– So this will also work:
– mkdir /cgroup2/
– mount -t tmpfs cgroup2_root /cgroup2
– mkdir /cgroup2/cpuset
– mount -t cgroup -ocpuset mytest /cgroup2/cpuset
http://ramirose.wix.com/ramirosen94/121
devices
● Also referred to as: devcg (devices control group)
● The devices cgroup enforces restrictions on open and mknod operations
on device files.
● 3 files: devices.allow, devices.deny, devices.list.
– devices.allow can be considered as the devices whitelist.
– devices.deny can be considered as the devices blacklist.
– devices.list shows the available devices.
● Each entry has 4 fields:
– type: can be a (all), c (char device), or b (block device).
● All means all types of devices, and all major and minor numbers.
– Major number.
– Minor number.
– Access: composition of 'r' (read), 'w' (write) and 'm' (mknod).
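The four fields can be pulled apart with a short parser. This is a Python sketch under the assumption that an entry is written as "type major:minor access" (for example "c 1:3 rwm"); DeviceRule and parse_device_rule are hypothetical names, and a lone "a" written without the other fields is not handled here.

```python
from typing import NamedTuple


class DeviceRule(NamedTuple):
    dev_type: str  # "a" (all), "c" (char device) or "b" (block device)
    major: str     # major number, or "*" for any
    minor: str     # minor number, or "*" for any
    access: str    # any combination of "r", "w" and "m"


def parse_device_rule(line: str) -> DeviceRule:
    """Parse one entry such as "c 1:3 rwm" or "a *:* rwm"."""
    dev_type, numbers, access = line.split()
    major, sep, minor = numbers.partition(":")
    if dev_type not in ("a", "b", "c") or not sep:
        raise ValueError(f"malformed device entry: {line!r}")
    if set(access) - set("rwm"):
        raise ValueError(f"bad access field: {access!r}")
    return DeviceRule(dev_type, major, minor, access)
```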
http://ramirose.wix.com/ramirosen95/121
devices - example
/dev/null major number is 1 and minor number is 3 (You can fetch the major/minor number from
Documentation/devices.txt)
mkdir /sys/fs/cgroup/devices/0
By default, for a new group, you have full permissions:
cat /sys/fs/cgroup/devices/0/devices.list
a *:* rwm
echo 'c 1:3 rmw' > /sys/fs/cgroup/devices/0/devices.deny
This denies rwm access to the /dev/null device.
echo $$ > /sys/fs/cgroup/devices/0/tasks
echo "test" > /dev/null
bash: /dev/null: Operation not permitted
echo a > /sys/fs/cgroup/devices/0/devices.allow
This adds the 'a *:* rwm' entry to the whitelist.
echo "test" > /dev/null
Now there is no error.
http://ramirose.wix.com/ramirosen96/121
cpuset
● Creating a cpuset group is done with:
– mkdir /sys/fs/cgroup/cpuset/0
● You must be root to run this; as a non-root user, you will get
the following error:
– mkdir: cannot create directory ‘/sys/fs/cgroup/cpuset/0’:
Permission denied
● cpusets provide a mechanism for assigning a set of CPUs and
Memory Nodes to a set of tasks.
http://ramirose.wix.com/ramirosen97/121
cpuset example
On Fedora 18, cpuset is mounted after boot on /sys/fs/cgroup/cpuset.
cd /sys/fs/cgroup/cpuset
mkdir test
cd test
/bin/echo 1 > cpuset.cpus
/bin/echo 0 > cpuset.mems
cpuset.cpus and cpuset.mems are not initialized in a new group; these two
initializations are mandatory.
/bin/echo $$ > tasks
The last command moves the shell process to the new cpuset cgroup.
You cannot move a list of pids in a single command; you must issue a separate
command for each pid.
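The one-pid-per-write rule is easy to respect from a script. A Python sketch follows; write_control_file and attach_pids are hypothetical helpers, and the writer is injectable so the loop can be exercised without a mounted cpuset.

```python
import os
from typing import Callable, Iterable


def write_control_file(path: str, value: str) -> None:
    """Write one value to a control file with a single write() call."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, value.encode())
    finally:
        os.close(fd)


def attach_pids(tasks_path: str, pids: Iterable[int],
                write: Callable[[str, str], None] = write_control_file) -> int:
    """Attach each pid with its own write; return how many were written."""
    count = 0
    for pid in pids:
        # One pid per write, mirroring one echo command per pid.
        write(tasks_path, str(pid))
        count += 1
    return count
```

On a real system, attach_pids("/sys/fs/cgroup/cpuset/test/tasks", pids) would be run as root after cpuset.cpus and cpuset.mems have been set.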
http://ramirose.wix.com/ramirosen98/121
memcg (memory control groups)
Example:
mkdir /sys/fs/cgroup/memory/0
echo $$ > /sys/fs/cgroup/memory/0/tasks
echo 10M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
You can disable the out of memory killer with memcg:
echo 1 > /sys/fs/cgroup/memory/0/memory.oom_control
This disables the oom killer.
cat /sys/fs/cgroup/memory/0/memory.oom_control
oom_kill_disable 1
under_oom 0
http://ramirose.wix.com/ramirosen99/121
● Now run some memory hogging process in this cgroup, one which is
known to be killed by the OOM killer in the default namespace.
● This process will not be killed.
● After some time, the value of under_oom will change to 1.
● After enabling the OOM killer again:
echo 0 > /sys/fs/cgroup/memory/0/memory.oom_control
You will soon get the OOM “Killed” message.
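The two values shown by memory.oom_control can be polled from a script to watch for this state. A minimal Python sketch, assuming the "name value" line format shown above; parse_oom_control and waiting_under_oom are hypothetical helper names.

```python
from typing import Dict


def parse_oom_control(text: str) -> Dict[str, int]:
    """Parse memory.oom_control output into {"oom_kill_disable": 0 or 1, ...}."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            values[parts[0]] = int(parts[1])
    return values


def waiting_under_oom(values: Dict[str, int]) -> bool:
    """True when the OOM killer is disabled and the group is under OOM."""
    return values.get("oom_kill_disable") == 1 and values.get("under_oom") == 1
```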
http://ramirose.wix.com/ramirosen100/121
Notification API
● There is an API which enables us to get notifications about the changing
status of a cgroup. It uses the eventfd() system call.
● See man 2 eventfd
● It uses the fd of cgroup.event_control
● Following is a simple userspace app, “eventfd” (error handling was
omitted for brevity)
http://ramirose.wix.com/ramirosen101/121
Notification API – example
char buf[256];
int event_fd, control_fd, oom_fd, wb;
uint64_t u;
event_fd = eventfd(0, 0);
control_fd = open("cgroup.event_control", O_WRONLY);
oom_fd = open("memory.oom_control", O_RDONLY);
/* register the event: "<event_fd> <fd of memory.oom_control>" */
wb = snprintf(buf, 256, "%d %d", event_fd, oom_fd);
write(control_fd, buf, wb);
close(control_fd);
for (;;) {
    /* blocks until the kernel signals an OOM event on event_fd */
    read(event_fd, &u, sizeof(uint64_t));
    printf("oom event received from mem_cgroup\n");
}
http://ramirose.wix.com/ramirosen102/121
Notification API – example (contd)
● Now run this program (eventfd) thus:
● From /sys/fs/cgroup/memory/0
./eventfd cgroup.event_control memory.oom_control
From a second terminal run:
cd /sys/fs/cgroup/memory/0/
echo $$ > /sys/fs/cgroup/memory/0/tasks
echo 10M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
Then run a memory hog program.
When the OOM killer is invoked, you will get the message from the eventfd userspace program:
“oom event received from mem_cgroup”.
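The same registration sequence can be sketched in Python. This is an illustration only, assuming a cgroup v1 memory hierarchy and Python 3.10 or later (for os.eventfd); registration_line, decode_eventfd_counter and wait_for_oom are hypothetical helper names.

```python
import os
import struct


def registration_line(event_fd: int, target_fd: int) -> bytes:
    """Build the "<event_fd> <target_fd>" string written to cgroup.event_control."""
    return f"{event_fd} {target_fd}".encode()


def decode_eventfd_counter(raw: bytes) -> int:
    """Decode the 8-byte host-order counter that read() on an eventfd returns."""
    (count,) = struct.unpack("=Q", raw)
    return count


def wait_for_oom(cgroup_dir: str) -> int:
    """Block until an OOM event arrives for cgroup_dir; return the counter.

    Assumes cgroup v1 memcg files and os.eventfd (Python 3.10+, Linux only).
    """
    event_fd = os.eventfd(0)
    oom_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    control_fd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"),
                         os.O_WRONLY)
    try:
        # Same registration as the C example: event fd, then the target fd.
        os.write(control_fd, registration_line(event_fd, oom_fd))
        return decode_eventfd_counter(os.read(event_fd, 8))
    finally:
        for fd in (control_fd, oom_fd, event_fd):
            os.close(fd)
```

Calling wait_for_oom("/sys/fs/cgroup/memory/0") as root would block in the same way as the read() loop of the C program.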
http://ramirose.wix.com/ramirosen103/121
release_agent example
● The release_agent is invoked when the last process of a cgroup terminates.
● The cgroup sysfs notify_on_release entry should be set so that release_agent will be invoked.
● A short script, /work/dev/t/date.sh:
#!/bin/sh
date >> /work/log.txt
Run a simple process, which simply sleeps forever; let's say its PID is pidSleepingProcess.
echo 1 > /sys/fs/cgroup/memory/notify_on_release
echo /work/dev/t/date.sh > /sys/fs/cgroup/memory/release_agent
mkdir /sys/fs/cgroup/memory/0/
echo pidSleepingProcess > /sys/fs/cgroup/memory/0/tasks
kill -9 pidSleepingProcess
This activates the release_agent, so we will see that the current time and date were written to /work/log.txt
http://ramirose.wix.com/ramirosen104/121
Systemd and cgroups
● Systemd – developed by Lennart Poettering, Kay Sievers, others.
● Replacement for the Linux init scripts and daemon. Adopted by Fedora (since Fedora 15), openSUSE, others.
● Udev was integrated into systemd.
● systemd uses control groups only for process grouping; not for anything else like allocating resources like block io bandwidth, etc.
release_agent is a mount option on Fedora 18:
mount | grep systemd
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
http://ramirose.wix.com/ramirosen105/121
cgroup-agent is a short program (cgroups-agent.c); all it does is send a dbus message via the DBUS API:
dbus_message_new_signal()/dbus_message_append_args()/dbus_connection_send()
systemd Lightweight Containers, a new feature in Fedora 19:
https://fedoraproject.org/wiki/Features/SystemdLightweightContainers
http://ramirose.wix.com/ramirosen106/121
ls /sys/fs/cgroup/systemd/system
abrtd.service               crond.service           rpcbind.service
abrt-oops.service           cups.service            rsyslog.service
abrt-xorg.service           dbus.service            sendmail.service
accounts-daemon.service     firewalld.service       smartd.service
atd.service                 getty@.service          sm-client.service
auditd.service              iprdump.service         sshd.service
bluetooth.service           iprinit.service         systemd-fsck@.service
cgroup.clone_children       iprupdate.service       systemd-journald.service
cgroup.event_control        ksmtuned.service        systemd-logind.service
cgroup.procs                mcelog.service          systemd-udevd.service
colord.service              NetworkManager.service  tasks
configure-printer@.service  notify_on_release       udisks2.service
console-kit-daemon.service  polkit.service          upower.service
We have here 34 services.
http://ramirose.wix.com/ramirosen107/121
Example for bluetooth systemd entry:
ls /sys/fs/cgroup/systemd/system/bluetooth.service/
cgroup.clone_children cgroup.event_control cgroup.procs notify_on_release tasks
cat /sys/fs/cgroup/systemd/system/bluetooth.service/tasks
709
There are services which have more than one pid in the tasks control file.
http://ramirose.wix.com/ramirosen108/121
● With Fedora 18, the default location of the cgroup mounts is: /sys/fs/cgroup
● We have 9 controllers:
● /sys/fs/cgroup/blkio
● /sys/fs/cgroup/cpu,cpuacct
● /sys/fs/cgroup/cpuset
● /sys/fs/cgroup/devices
● /sys/fs/cgroup/freezer
● /sys/fs/cgroup/memory
● /sys/fs/cgroup/net_cls
● /sys/fs/cgroup/perf_event
● /sys/fs/cgroup/systemd
● In boot, systemd parses /sys/fs/cgroup and mounts all entries.
http://ramirose.wix.com/ramirosen109/121
/proc/cgroups
In Fedora 18, cat /proc/cgroups gives:
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        2          1            1
cpu           3          37           1
cpuacct       3          37           1
memory        4          1            1
devices       5          1            1
freezer       6          1            1
net_cls       7          1            1
blkio         8          1            1
perf_event    9          1            1
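The four-column table of /proc/cgroups is simple to parse, and grouping by hierarchy ID shows which subsystems share a mount (cpu and cpuacct in the listing above). A Python sketch follows; SubsysInfo, parse_proc_cgroups and co_mounted are hypothetical names, and the columns are assumed to be whitespace-separated.

```python
from typing import Dict, List, NamedTuple


class SubsysInfo(NamedTuple):
    hierarchy: int    # hierarchy ID the subsystem is attached to
    num_cgroups: int  # number of cgroups in that hierarchy
    enabled: bool     # True when the enabled column is 1


def parse_proc_cgroups(text: str) -> Dict[str, SubsysInfo]:
    """Parse /proc/cgroups into {subsystem name: SubsysInfo}."""
    table = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the "#subsys_name ..." header
        name, hierarchy, num_cgroups, enabled = line.split()
        table[name] = SubsysInfo(int(hierarchy), int(num_cgroups), enabled == "1")
    return table


def co_mounted(table: Dict[str, SubsysInfo]) -> Dict[int, List[str]]:
    """Group subsystem names by hierarchy ID; a shared ID means a shared mount."""
    groups: Dict[int, List[str]] = {}
    for name, info in table.items():
        groups.setdefault(info.hierarchy, []).append(name)
    return groups
```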
http://ramirose.wix.com/ramirosen110/121
Libcgroup
libcgroup is a library that abstracts the control group file system in Linux.
libcgroup-tools package provides tools for performing cgroups actions.
Ubuntu: apt-get install cgroup-bin (tried on Ubuntu 12.10)
Fedora: yum install libcgroup
cgcreate creates new cgroup; cgset sets parameters for given cgroup(s); and cgexec runs a task in specified control groups.
Example:
cgcreate -g cpuset:/test
cgset -r cpuset.cpus=1 /test
cgset -r cpuset.mems=0 /test
cgexec -g cpuset:/test bash
http://ramirose.wix.com/ramirosen111/121
One of the advantages of the cgroups framework is that it is simple to add kernel modules which work with it. There are only two callbacks which we must implement, css_alloc() and css_free(). And there is no need to patch the kernel unless you do something special.
Thus, net/core/netprio_cgroup.c is only 322 lines of code and net/sched/cls_cgroup.c is 332 lines of code.
http://ramirose.wix.com/ramirosen112/121
Checkpoint/Restart
Checkpointing is the operation of saving the state of a group of processes to a single file or several files.
Restart is the operation of restoring these processes at some future time by reading and parsing that file/files.
Attempts to merge Checkpoint/Restart in the Linux kernel failed:
Attempts to merge CKPT of openVZ failed:
Oren Laadan spent about three years implementing checkpoint/restart in the kernel; this code was not merged either.
Checkpoint and Restore In Userspace (CRIU)
● A project of OpenVZ
● sponsored and supported by Parallels.
Uses some kernel patches
http://criu.org/Main_Page
http://ramirose.wix.com/ramirosen113/121
● Workman: (workload management)
It aims to provide high-level resource allocation and management. It is implemented as a library but provides bindings for more languages (it depends on the GObject framework, which allows all the library APIs to be exposed to non-C languages like Perl, Python, JavaScript, Vala).
https://gitorious.org/workman/pages/Home
● Pax Controla Groupiana – a document:
● Tries to define precautions that a software or user can take to avoid breaking or confusing other users of the cgroup filesystem.
http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
● aka "How to behave nicely in the cgroupfs trees"
http://ramirose.wix.com/ramirosen114/121
Note: in this presentation, we refer to two userspace packages, iproute and util-linux. The examples are based on the most recent git source code of these packages.
You can check namespaces and cgroups support on your machine by running:
lxc-checkconfig
(from the lxc package)
In Fedora 18 and Ubuntu 13.04, there is no support for User Namespaces, though the kernel is 3.8.
http://ramirose.wix.com/ramirosen115/121
● On Android - Samsung Mini Galaxy:
– cat /proc/mounts | grep cgroup
none /acct cgroup rw,relatime,cpuacct 0 0
none /dev/cpuctl cgroup rw,relatime,cpu 0 0
http://ramirose.wix.com/ramirosen116/121
Links
Namespaces in operation series by Michael Kerrisk, January 2013:
part 1: namespaces overview
http://lwn.net/Articles/531114/
part 2: the namespaces API
http://lwn.net/Articles/531381/
part 3: PID namespaces
http://lwn.net/Articles/531419/
part 4: more on PID namespaces
http://lwn.net/Articles/532748/
part 5: User namespaces
http://lwn.net/Articles/532593/
part 6: more on user namespaces
http://lwn.net/Articles/540087/
http://ramirose.wix.com/ramirosen117/121
Links - contd
Stepping closer to practical containers: "syslog" namespaces
http://lwn.net/Articles/527342/
● tree /sys/fs/cgroup/
● Devices implementation.
● Serge Hallyn nsexec
http://ramirose.wix.com/ramirosen118/121
Capabilities - appendix
include/uapi/linux/capability.h
CAP_CHOWN              CAP_DAC_OVERRIDE
CAP_DAC_READ_SEARCH    CAP_FOWNER
CAP_FSETID             CAP_KILL
CAP_SETGID             CAP_SETUID
CAP_SETPCAP            CAP_LINUX_IMMUTABLE
CAP_NET_BIND_SERVICE   CAP_NET_BROADCAST
CAP_NET_ADMIN          CAP_NET_RAW
CAP_IPC_LOCK           CAP_IPC_OWNER
CAP_SYS_MODULE         CAP_SYS_RAWIO
CAP_SYS_CHROOT         CAP_SYS_PTRACE
CAP_SYS_PACCT          CAP_SYS_ADMIN
CAP_SYS_BOOT           CAP_SYS_NICE
CAP_SYS_RESOURCE       CAP_SYS_TIME
CAP_SYS_TTY_CONFIG     CAP_MKNOD
CAP_LEASE              CAP_AUDIT_WRITE
CAP_AUDIT_CONTROL      CAP_SETFCAP
CAP_MAC_OVERRIDE       CAP_MAC_ADMIN
CAP_SYSLOG             CAP_WAKE_ALARM
CAP_BLOCK_SUSPEND
See: man 8 setcap / man 8 getcap
http://ramirose.wix.com/ramirosen119/121
Summary
● Namespaces
– Implementation
– UTS namespace
– Network Namespaces
● Example
– PID namespaces
● cgroups
– Cgroups and kernel namespaces
– CGROUPS VFS
– CPUSET
– cpuset example
– release_agent example
– memcg
– Notification API
– devices
– Libcgroup
● Checkpoint/Restart
http://ramirose.wix.com/ramirosen120/121
Links
cgroups kernel mailing list archive:
http://blog.gmane.org/gmane.linux.kernel.cgroups
cgroup git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
http://ramirose.wix.com/ramirosen121/121
Thank you!