Copyright©2017 NTT Corp. All Rights Reserved.
Akihiro Suda (@_AkihiroSuda_)
NTT Software Innovation Center
FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable
Container Image Layout
github.com/AkihiroSuda/filegrain
Open Source Summit North America (September 11, 2017)
Last update: September 11, 2017
• Software Engineer at NTT
  • Largest telco in Japan
• Docker Moby Core maintainer
• BuildKit initial maintainer (github.com/moby/buildkit)
  • Next-generation `docker build`
  • Executes DAG vertices of the Dockerfile-equivalent concurrently
In April, Docker [as a project] transitioned into Moby. Docker [as a product] is now developed as one of the downstreams of Moby.
Docker : Moby ≒ RHEL : Fedora
Who am I
• New image format that allows `docker run` to start before `docker pull` has fully finished
• More benefits: deduplication, etc.
• OCI compatible
You only need to pull 20% to run the official OpenJDK 8 image! (The remaining 80% can be pulled on demand.)
Figure: 672 MB (total) = 137 MB (/bin/sh, minimal JRE, minimal JDK) + 535 MB (optional stuff)
Summary
• In July 2017, the Open Container Initiative (OCI) announced the first release of its industry-standard container image spec
  • CoreOS's appc/ACI spec maintainers have also joined OCI
• The data structure is almost identical to Docker Image Manifest V2.2, but is decoupled from the Docker Registry HTTP API
  • Agnostic to the distribution protocol; any protocol with directory-like semantics will work, e.g. HTTP, NFS, IPFS
  • Some extra coordinator is needed, though (e.g. a lock for the image index)
  • I'm +1 for IPFS (P2P filesystem)
• Note: the spec is unrelated to Dockerfile syntax / instructions
OCI Image Spec
• Composed of TAR archives of AUFS layers
  • The AUFS-style tar format is used as the lingua franca across different snapshot drivers (e.g. OverlayFS, Devicemapper, BtrFS, ZFS)
Structure of Docker / OCI image format
Figure: Dockerfile instructions and the TAR layers they produce
  FROM ubuntu:17.04            → TAR 0 (base layer): /bin/bash, /bin/ls, ...
  RUN apt-get install foobar   → TAR 1 (added & modified files, plus file deletion info, i.e. "whiteout"): /usr/bin/foobar, /usr/lib/libfoobar.so, ..., /var/lib/.wh.baz
  COPY foobar.conf /etc        → TAR 2: /etc/foobar.conf
The layers are stacked:
  mount -t overlay -o lowerdir=0,upperdir=1 ..
  mount -t overlay -o lowerdir=1,upperdir=2 ..
• Blobs such as TARs are stored in a Merkle tree structure
• Generally distributed via a Docker Registry, with (e.g.) an S3/Swift backend
Structure of Docker / OCI image format
/index.json (JSON):
{"schemaVersion": 2, "manifests": [
  {"mediaType": "application/vnd.oci.image.manifest.v1+json", "size": 7143, "digest": "sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f"}
]}
/blobs/sha256/e692418e... (JSON; see the next slide)
/blobs/sha256/e692418e... (manifest JSON):
{
  "schemaVersion": 2,
  "config": {"mediaType": "application/vnd.oci.image.config.v1+json", "size": 7023, "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7"},
  "layers": [
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 33554432, "digest": "sha256:61be55a8e2f6b4e172338bddf184d6dbee29c98853e0a0485ecee7f27b9af0b4"},
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 1073741824, "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b"},
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 73109, "digest": "sha256:ec4b8955958665577945c89419d1af06b5f7636b4ac3da7f12184802ad867736"}
  ]
}
/blobs/sha256/b5b2b2c5... (JSON): environment variables and so on
/blobs/sha256/61be55a8... (TAR): base layer
/blobs/sha256/3c3a4604... (TAR): diff layer
/blobs/sha256/ec4b8955... (TAR): diff layer
• The Merkle tree ensures reproducibility of an environment (`docker pull foobar@sha256:e692418e..`)
Structure of Docker / OCI image format
Merkle tree:
/index.json (Index)
  /blobs/sha256/e692418e... (Manifest JSON)
    /blobs/sha256/b5b2b2c5... (Config JSON: env vars and so on)
    /blobs/sha256/61be55a8... (AUFS layer TAR)
    /blobs/sha256/3c3a4604... (AUFS layer TAR)
    /blobs/sha256/ec4b8955... (AUFS layer TAR)
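The digest chain above can be checked mechanically. The following is a minimal Python sketch, assuming an in-memory dict stands in for the /blobs/sha256 directory; the helper names (`digest_of`, `put_blob`, `verify_tree`) and the sample blob contents are hypothetical and not part of the OCI spec or any tool.

```python
import hashlib
import json

def digest_of(data: bytes) -> str:
    """Return the content address of a blob in "sha256:<hex>" form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def put_blob(store: dict, data: bytes) -> dict:
    """Store data under its own digest and return a descriptor for it."""
    d = digest_of(data)
    store[d] = data
    return {"digest": d, "size": len(data)}

def verify_tree(store: dict, manifest_digest: str) -> bool:
    """Check the manifest and every blob it references against its digest."""
    raw = store[manifest_digest]
    if digest_of(raw) != manifest_digest:
        return False
    manifest = json.loads(raw)
    descriptors = [manifest["config"]] + manifest["layers"]
    return all(digest_of(store[d["digest"]]) == d["digest"] for d in descriptors)

# Hypothetical sample image: one config blob and two layer blobs.
store = {}
config = put_blob(store, b'{"Env": ["PATH=/usr/bin"]}')
layers = [put_blob(store, b"base layer tar bytes"),
          put_blob(store, b"diff layer tar bytes")]
manifest_bytes = json.dumps(
    {"schemaVersion": 2, "config": config, "layers": layers}).encode()
root = put_blob(store, manifest_bytes)["digest"]
print(root, verify_tree(store, root))
```

Because the root digest covers the manifest, and the manifest lists the digests of all other blobs, pinning the root digest pins the whole image.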
• TAR = Tape ARchiver, which appeared circa 1979 (UNIX 7th Edition)
• TAR was originally designed for magnetic tapes → not suitable for Docker/OCI workloads in 2017
Problems of Docker / OCI image format
PDP-11, the target architecture of UNIX in the 1970s, with TU56 DECtape drives
https://en.wikipedia.org/wiki/PDP-11
Without scanning the whole "tape"...
• Files cannot be listed → the TAR can't be mounted as a filesystem
• File offsets cannot be determined → files can't be accessed
  • We could create a separate index, but it is useless when the TAR is gzipped, because a gzip stream is not seekable
Problem 1: TAR requires scanning the whole "tape"
TAR layout:
[Metadata 0][File 0][Metadata 1][File 1] ... [Metadata {n-1}][File {n-1}][Terminal zero bytes]
Metadata: file name, permission, ...
File: content
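To illustrate why the whole "tape" must be scanned, here is a small Python sketch that builds a ustar archive in memory and then walks its 512-byte headers by hand. The helper names and sample files are hypothetical; the header offsets used (name in bytes 0-99, octal size in bytes 124-135) are those of the ustar format. The point is that the position of header k is only known after reading the size field of header k-1.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are aligned to 512-byte blocks

def build_tar(files):
    """Build an uncompressed ustar archive from [(name, bytes)] in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def scan_tar(raw):
    """Return [(name, data_offset, size)] by walking headers sequentially."""
    entries = []
    pos = 0
    while pos + BLOCK <= len(raw):
        header = raw[pos:pos + BLOCK]
        if header == bytes(BLOCK):  # terminal zero block
            break
        name = header[0:100].rstrip(b"\0").decode()
        size = int(header[124:136].rstrip(b"\0 ").decode() or "0", 8)
        entries.append((name, pos + BLOCK, size))
        # The next header starts after the data, rounded up to a full block.
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return entries

raw = build_tar([("bin/sh", b"x" * 700), ("etc/foobar.conf", b"hello")])
entries = scan_tar(raw)
for name, offset, size in entries:
    print(name, offset, size)
```

With a gzipped TAR, even this sequential walk requires decompressing everything before the entry of interest.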
Problem 1: TAR requires scanning the whole "tape"
• If we could solve this problem, we could mount(2) an image and start a container without pulling the whole image
  • Only metadata is needed for mount(2)
  • Files are lazily pulled on demand
• → Shorter start-up time & less network traffic
  • New containers can start immediately on newly added hosts
  • Faster testing and deployment cycle
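The lazy-pull idea can be sketched in a few lines of Python. This is an illustration only, not the FILEgrain implementation: `LazyImage`, the `fetch_blob` callback, and the sample files are hypothetical, and a dict plays the role of the remote registry.

```python
import hashlib

class LazyImage:
    """Serve file reads from a path -> digest map, fetching blobs on demand."""

    def __init__(self, manifest, fetch_blob):
        self.manifest = manifest      # path -> hex digest (metadata only)
        self.fetch_blob = fetch_blob  # callable: hex digest -> bytes
        self.cache = {}
        self.pulled_bytes = 0

    def listdir(self):
        # Listing needs no file content at all, only the metadata.
        return sorted(self.manifest)

    def read(self, path):
        digest = self.manifest[path]
        if digest not in self.cache:
            data = self.fetch_blob(digest)
            if hashlib.sha256(data).hexdigest() != digest:
                raise ValueError("digest mismatch for " + path)
            self.cache[digest] = data
            self.pulled_bytes += len(data)
        return self.cache[digest]

# Hypothetical image contents and a dict standing in for the remote.
files = {"/bin/sh": b"#!shell", "/usr/lib/jvm/rt.jar": b"J" * 1000}
remote = {hashlib.sha256(c).hexdigest(): c for c in files.values()}
manifest = {p: hashlib.sha256(c).hexdigest() for p, c in files.items()}

image = LazyImage(manifest, remote.__getitem__)
print(image.listdir(), image.pulled_bytes)   # nothing pulled yet
image.read("/bin/sh")
print(image.pulled_bytes)                    # only /bin/sh was pulled
```

Only the files a container actually touches contribute to `pulled_bytes`, which is the effect the slide describes as shorter start-up time and less network traffic.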
Problem 1: TAR requires scanning the whole "tape"
Detailed use cases:
• Web apps with a huge number of HTML and graphic files
• Jupyter Notebook with various big data samples
  • Academic papers will be immediately reproducible by just running `docker run some-single-huge-image@sha256:deadbeef..`
• Full GNOME/KDE/Windows (potentially) desktop
• Java / .NET runtimes
• Integration testing environments
• ...
• A registry might contain very similar images
  • Different versions
  • Different architectures
  • Different configurations
• TARs of these images are likely to contain identical files, which wastes storage because there is no data deduplication
Problem 2: No deduplication
Example: four similar Dockerfiles
  (1) FROM ubuntu:17.04
      RUN apt-get install foo
  (2) FROM ubuntu:17.04
      RUN apt-get install foo bar
  (3) FROM debian:9
      RUN apt-get install foo
  (4) FROM ubuntu:17.04
      RUN echo … > /etc/apt/sources.list
      RUN apt-get install foo
• It might be good to pull a large TAR layer over multiple connections
  • Especially when multiple servers are available
• But not all protocols allow that
  • RFC 7233 says HTTP/1.1 Range Requests are OPTIONAL
Problem 3: No concurrency
Range 0-1023 | Range 1024-2047
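As a small illustration of the ranges shown above, the following Python sketch splits a blob of a given size into HTTP `Range` header values that could be requested over separate connections. The function name `byte_ranges` is hypothetical; the `bytes=first-last` syntax with inclusive end positions follows RFC 7233. Whether a server honors such requests is exactly the optional part.

```python
def byte_ranges(total_size, chunk_size):
    """Return HTTP Range header values covering bytes [0, total_size)."""
    ranges = []
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1  # end is inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges

print(byte_ranges(2048, 1024))
```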
1. TAR requires scanning the whole "tape"
2. No deduplication
3. No concurrency
Problems of Docker / OCI image format
Image: https://en.wikipedia.org/wiki/Magnetic_tape
New image format: FILEgrain
https://github.com/AkihiroSuda/filegrain
• Single large TAR blob → many small blobs
• Metadata blob with content-addressability
• Only the metadata blob is needed for mounting the image
• File blobs can be lazily pulled on demand
FILEgrain overview
TAR (single blob):
[Metadata 0][File 0][Metadata 1][File 1] ... [Metadata {n-1}][File {n-1}][Terminal zero bytes]
FILEgrain (many blobs):
continuity blob: [Metadata 0][Metadata 1] ... [Metadata {n-1}] (file name, permission bits, SHA256 digest, ...)
content-addressable file blobs: [File 0] [File 1] ... [File {n-1}]
• continuity: a filesystem metadata manifest system used in the containerd / Moby community (github.com/containerd/continuity)
  • Serializes file names, permission bits, XAttrs, and digest values as a Protocol Buffers message structure
• Similar format: mtree(8)
Metadata format: continuity
message Resource {
  repeated string path;
  int64 uid;
  int64 gid;
  uint32 mode;
  uint64 size;
  repeated string digest;
  repeated XAttr xattr;
  ...
}
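The split into one metadata blob plus many content-addressed file blobs can be sketched as follows. This is a simplified Python illustration, not the continuity library: plain dicts mimic a few of the `Resource` fields shown above, and `split_into_blobs` and the sample files are hypothetical.

```python
import hashlib

def split_into_blobs(files):
    """files: path -> (mode, content). Returns (manifest, blobs)."""
    manifest = []  # stands in for the continuity metadata blob
    blobs = {}     # digest -> content, i.e. the file blobs
    for path in sorted(files):
        mode, content = files[path]
        digest = "sha256:" + hashlib.sha256(content).hexdigest()
        blobs[digest] = content  # identical content is stored only once
        manifest.append({"path": [path], "mode": mode,
                         "size": len(content), "digest": [digest]})
    return manifest, blobs

# Hypothetical rootfs with two files that share the same content.
files = {
    "/bin/sh": (0o755, b"shell binary"),
    "/usr/share/doc/a/LICENSE": (0o644, b"license text"),
    "/usr/share/doc/b/LICENSE": (0o644, b"license text"),
}
manifest, blobs = split_into_blobs(files)
print(len(manifest), "resources,", len(blobs), "blobs")
```

Because file blobs are keyed by digest, identical files within one image, or across images that share a blob store, are stored and transferred once.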
• A single OCI image can contain both FILEgrain manifests and traditional OCI manifests
Highly compatible with traditional OCI spec
/index.json (Index): {"v1-filegrain": "sha256:a8e3..", "v1": "sha256:e692.."}
  `docker pull foo:v1` → /blobs/sha256/e692... (Traditional OCI Manifest)
    /blobs/sha256/61be..., /blobs/sha256/3c3a..., /blobs/sha256/ec4b... (AUFS layer TARs)
  `docker pull foo:v1-filegrain` → /blobs/sha256/a8e3... (FILEgrain Manifest)
    /blobs/sha256/de81... (continuity)
    /blobs/sha256/583f..., /blobs/sha256/3af1..., /blobs/sha256/5c2a..., /blobs/sha256/39c1..., /blobs/sha256/12ea... (bunch of files)
  /blobs/sha256/b5b2... (common config, such as env vars)
• FILEgrain layers can be put on top of traditional OCI layers
  • Traditional OCI layers might still be suitable for frequently used files or a large number of small files
Highly compatible with traditional OCI spec
{...
  "layers": [
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 33554432, "digest": "sha256:e692..."},
    {"mediaType": "application/vnd.continuity.layer.v0+pb", "size": 16724, "digest": "sha256:a18b..."}
  ]
}
First layer (TAR): traditional OCI layer (needs to be completely pulled before starting a container)
Second layer (continuity): FILEgrain layer (can be lazy-pulled on demand)
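A client could decide what to pull up front by looking at each layer's media type. The Python sketch below assumes the two media types shown in the manifest above; the function `plan_pull` is hypothetical and the digests are left truncated as on the slide.

```python
TAR_LAYER = "application/vnd.oci.image.layer.v1.tar+gzip"
CONTINUITY_LAYER = "application/vnd.continuity.layer.v0+pb"

def plan_pull(manifest):
    """Split layers into eager (TAR) and lazy (continuity) groups.

    Returns (eager, lazy, upfront_bytes). upfront_bytes counts the full TAR
    layers plus the continuity metadata blobs themselves; the file blobs
    referenced by a continuity layer are not counted, since they can be
    pulled on demand.
    """
    eager = [l for l in manifest["layers"] if l["mediaType"] == TAR_LAYER]
    lazy = [l for l in manifest["layers"] if l["mediaType"] == CONTINUITY_LAYER]
    upfront_bytes = sum(l["size"] for l in eager + lazy)
    return eager, lazy, upfront_bytes

manifest = {"layers": [
    {"mediaType": TAR_LAYER, "size": 33554432, "digest": "sha256:e692..."},
    {"mediaType": CONTINUITY_LAYER, "size": 16724, "digest": "sha256:a18b..."},
]}
eager, lazy, upfront_bytes = plan_pull(manifest)
print(len(eager), len(lazy), upfront_bytes)
```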
1. TAR requires scanning the whole "tape" → Only the metadata (continuity) blob is needed for mounting. Other stuff can be lazily pulled on demand.
2. No deduplication → Deduplication at file granularity
3. No concurrency → Concurrency at file granularity
The problems are solved!
1. Larger number of blobs
  • The /blobs/sha256 directory will contain a huge number of files when laid out on a filesystem, which may make readdir(3) slow
  • Generally, this issue could be solved by "sharding" /blobs/sha256/deadbeef.. into /blobs/sha256/de/deadbeef.., but not in this case, for compatibility with OCI
2. More RPC overhead
  • A single traditional TAR is still suitable for small files, as it reduces the number of RPCs
3. Lower compression ratio
  • A single traditional TAR is still suitable when the layer contains similar files
In any case, these cons can be easily mitigated by composing FILEgrain layers on top of traditional OCI layers
Cons
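For reference, the "sharding" mentioned under the first item can be sketched as below. Both helper names are hypothetical; the flat layout is the one used on these slides, while the sharded layout is the generic technique that FILEgrain does not use, for compatibility with OCI.

```python
def flat_path(digest):
    """Flat layout: every blob lives directly under /blobs/<algorithm>/."""
    algo, hexval = digest.split(":", 1)
    return f"/blobs/{algo}/{hexval}"

def sharded_path(digest, prefix_len=2):
    """Sharded layout: a prefix of the hex digest becomes a subdirectory."""
    algo, hexval = digest.split(":", 1)
    return f"/blobs/{algo}/{hexval[:prefix_len]}/{hexval}"

print(flat_path("sha256:deadbeef"))
print(sharded_path("sha256:deadbeef"))
```

With a two-character hex prefix, blobs spread over up to 256 subdirectories, which keeps each directory listing short.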
Alternative 1: Use some external blob store?
→ No, because it depends on certain protocols. Also, it is unlikely to work with the OCI Merkle tree.
Alternative ideas?
Related work:
Harter, Tyler, et al. "Slacker: Fast Distribution with Lazy Docker Containers." FAST. 2016.
• Loopback-mounts an ext4 image located on NFS
• Supports lazy pulling and deduplication at block granularity
Related work:
Lestaris, George. "Alternatives to layer-based image distribution: using CERN filesystem for images." Container Camp UK. 2016.
Blomer, Jakob, et al. "A Novel Approach for Distributing and Managing Container Images: Integrating CernVM File System and Mesos." MesosCon NA. 2016.
• CernVM FS
  • CernVM FS has its own Merkle tree
• Supports lazy pulling and deduplication at file granularity (as in FILEgrain)
• IPFS is also an attractive protocol
  • P2P, content-addressable
• FILEgrain is made agnostic to the transport protocol, as is the OCI image spec; it works with HTTP / NFS / CernVM FS / IPFS / whatever
Alternative 2: Seek TAR?
→ No
• Compressed TAR is not seekable
• Even uncompressed TAR is sometimes not seekable, depending on the transport protocol
• A large number of requests is needed to fetch the whole metadata
Alternative ideas?
TAR layout:
[Metadata 0][File 0][Metadata 1][File 1] ... [Metadata {n-1}][File {n-1}][Terminal zero bytes]
Can metadata be pre-loaded without reading file contents? (No)
Alternative 3: Use (e.g.) ZIP instead of TAR?
→ No
• Still unseekable, depending on the transport protocol
• Poor metadata support
Alternative ideas?
ZIP layout:
[Extra metadata 0][Compressed file 0][Extra metadata 1][Compressed file 1] ... [Extra metadata {n-1}][Compressed file {n-1}][Footer: Metadata 0, Metadata 1, ..., Metadata {n-1}]
Can metadata be pulled all at once first? (No)
• Implemented as a read-only FUSE filesystem
• No support for write operations
  • "Cattle" containers should be immutable; it is an anti-pattern to do any write operation against the rootfs
    • /tmp and /run should be mounted as tmpfs
    • Persistent data should be written to bind-mounted Ext4/XFS
  • Actually, even Docker's built-in storage drivers (overlayfs, AUFS) don't fully support write operations (github.com/AkihiroSuda/issues-docker)
    • e.g. Yum is known not to work on overlayfs without installing yum-plugin-ovl (a bug of neither overlayfs nor yum!)
Implementation
• Docker doesn't support running a container with a rootfs that is not managed by the Docker daemon
• So the current FILEgrain POC is evaluated using runc
• TODO: reimplement as a containerd plugin
Implementation
See https://github.com/AkihiroSuda/filegrain/issues/17 for detailed information
Images for evaluation
Image | Digest | Description | rootfs size (after tar+gzip expansion)
openjdk:8 | sha256:5da842d59f76009fa27ffde06888ebd560c7ad17607d7cd1e52fc0757c9a45fb | Debian 9.1, OpenJDK 8u141 | 671.7MB
kdeneon/all | sha256:e3e7f216a5f8f1fdcff4eab8807d7afcd291c050099ab3e8a8355b7b28a19247 | Ubuntu 16.04, KDE Plasma 5.10, Firefox 54.. | 4.8GB
kaggle/python | sha256:335103c998aea22a5608c2eeca7dcf109e0828ed233b75f5098182c5b058fe98 | Debian 8.5, various machine learning frameworks, NLTK (natural language toolkit) dataset.. | 8.3GB
• openjdk:8 (blobs in total = 671.7MB + metadata 5.4MB)
  • mount: 5.4MB (2 blobs)
    • only metadata is needed (manifest and continuity)
  • `sh`: 7.3MB (8 blobs) in total
  • `java -version`: 88.2MB (30 blobs) in total
  • `javac HelloWorld.java`: 137.3MB (50 blobs) in total → 1/5 of the image
• kdeneon/all (4.8GB + 34.5MB)
  • mount: 34.5MB (2 blobs)
  • `sh`: 36.7MB (8 blobs)
  • `startkde`: 742.7MB (4,267 blobs)
  • `firefox`: 866.6MB (4,506 blobs) → less than 1/5 of the image
Note: commands are executed sequentially
The evaluation data in the abstract text (633MB Java image) is outdated
Evaluation: blobs needed to start a container (uncompressed)
• kaggle/python (8.3GB + 38.2MB)
  • mount: 38.2MB (2 blobs)
  • `sh`: 40.1MB (8 blobs)
  • `ipython -c 'echo("hello")'`: 75.4MB (1,033 blobs)
  • `ipython -c 'import nltk'`: 352.0MB (2,799 blobs) → less than 1/24 of the image
Evaluation: blobs needed to start a container (uncompressed)
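The fractions quoted in this evaluation can be recomputed from the reported sizes. The Python sketch below does the arithmetic under one explicit assumption: 1GB is taken as 1024MB, which the slides do not state. With 1GB = 1000MB the kaggle/python fraction comes out slightly above 1/24 rather than below it.

```python
MB_PER_GB = 1024  # assumption: the slides do not say whether GB is binary

# (MB pulled after the last command, total MB = rootfs + metadata)
measurements = {
    "openjdk:8, javac HelloWorld.java": (137.3, 671.7 + 5.4),
    "kdeneon/all, firefox": (866.6, 4.8 * MB_PER_GB + 34.5),
    "kaggle/python, import nltk": (352.0, 8.3 * MB_PER_GB + 38.2),
}

fractions = {name: pulled / total
             for name, (pulled, total) in measurements.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f} of the image (about 1/{1 / fraction:.1f})")
```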
Evaluation: compression
Image | rootfs | TAR at once + gzip -9 | FILEgrain + gzip -9 against each blob
openjdk:8 | 671.7MB | 261.3MB | 260.7MB (-645,604B)
kdeneon/all | 4.8GB | 2.1GB | 2.1GB (-489,361B)
kaggle/python | 8.3GB | 3.6GB | 3.6GB (+4,345,701B)
No more than a rounding error in these cases. (If an image contained similar files, its compression ratio would get worse.)
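The caveat about similar files can be reproduced on synthetic data. The Python sketch below is not the talk's benchmark: it compares gzip over a plain concatenation (standing in for a TAR) with gzip over each blob separately, using hypothetical helper names and pseudo-random file contents generated from SHA-256.

```python
import gzip
import hashlib

def pseudo_random_bytes(seed: bytes, length: int) -> bytes:
    """Deterministic, poorly compressible bytes derived from a seed."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def whole_vs_per_blob(files):
    """Return (size when gzipped at once, total size when gzipped per blob)."""
    whole = len(gzip.compress(b"".join(files), compresslevel=9))
    per_blob = sum(len(gzip.compress(f, compresslevel=9)) for f in files)
    return whole, per_blob

similar = [pseudo_random_bytes(b"same", 1024)] * 20            # 20 identical files
distinct = [pseudo_random_bytes(bytes([i]), 1024) for i in range(20)]

print("similar :", whole_vs_per_blob(similar))
print("distinct:", whole_vs_per_blob(distinct))
```

When files repeat each other, compressing them together lets gzip reference earlier copies, which per-blob compression cannot do; when files are unrelated, the two approaches differ only by per-blob header overhead.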
Evaluation: deduplication
Figure: overlap between kdeneon/all (4.8GB) and kaggle/python (8.3GB): 75.4MB deduplicated
(KDE and Kaggle are mutually unrelated, but have some common Debian files)
Evaluation: FUSE overhead
Figure: Time required for archiving /usr of openjdk (X: n-th experiment, 1 to 10; Y: seconds, log scale from 0.1 to 100)
Series: Docker (overlay2), FILEgrain FUSE
Environment: Docker 17.06 / Fedora 26 / 2 vCPUs, 2GB RAM, 2GB swap (VMware Fusion, MacBook Pro 2016)
For FILEgrain, the first data point includes the time for pulling the blobs needed to start a container (but no network overhead, as localhost is the remote in this evaluation).
The Docker data does not include the time for pulling, as a container can be started only after pulling the whole image.
Docker (overlay2): getting faster due to in-kernel caching.
FILEgrain FUSE: for some implementation reason, the in-kernel cache doesn't seem to be working. But this cache should be easy to enable, as the filesystem is always read-only by design.
• Overhead of the larger number of RPCs for pulling blobs
  • Depends on the transport protocol
  • TODO: implement a Docker Registry API client and evaluate the overhead
    • The current POC just uses a local directory as a mock registry
Evaluation: others
• Even finer granularity (CHUNKgrain?)
  • For large files, it might be good to compute SHA256 digests for each chunk and serialize the digests in some format
  • Probably, this serialization would be separate from the continuity manifest itself (continuity#85)
• Use traditional OCI TAR for files that are very likely to be needed immediately to start a container
  • We can easily detect such files by starting a container and capturing FUSE calls before pushing the image
Future work
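The chunk-level idea could look roughly like the Python sketch below. It is only an illustration of per-chunk SHA256 digests with fixed-size chunks; the function name, the chunk size, and the sample data are hypothetical, and no serialization format is implied.

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int):
    """Return the SHA256 digest of each fixed-size chunk of data."""
    return ["sha256:" + hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

# Two versions of a hypothetical large file that differ in the second half.
old = b"a" * 4096 + b"b" * 4096
new = b"a" * 4096 + b"c" * 4096

old_chunks = chunk_digests(old, 4096)
new_chunks = chunk_digests(new, 4096)
changed = [i for i, (x, y) in enumerate(zip(old_chunks, new_chunks)) if x != y]
print(changed)  # indexes of the chunks that would need to be pulled again
```

With chunk digests, a client that already holds the old file would only need the chunks whose digests changed, and a reader of a large file would only need the chunks it actually touches.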
• containerd is the next standard runtime in the container industry
Integration to the ecosystem
Before: Docker, (Kubernetes) → containerd v0.2 → runc
After: Moby Core (successor to Docker daemon), Kubernetes (via CRI-containerd) → containerd v1.0 → runc
• The containerd plugin system allows new features to be added without modifying upstream containerd
Integration to the ecosystem
containerd v1.0
  runtime service: runc plugin
  snapshot service: overlayfs plugin, btrfs plugin
  differ service: generic plugin
  content service: generic plugin
• FILEgrain will be reimplemented as a set of containerd plugins
  • Can be easily integrated into the higher-level engines without modification (Moby/Docker, Kubernetes, ..)
Integration to the ecosystem
containerd v1.0
  runtime service: runc plugin
  snapshot service: overlayfs plugin, btrfs plugin
  differ service: generic plugin
  content service: generic plugin
  + FILEgrain plugins
• New image format that allows `docker run` to start before `docker pull` has fully finished
• More benefits: deduplication, etc.
• OCI compatible
You only need to pull 20% to run the official OpenJDK 8 image! (The remaining 80% can be pulled on demand.)
Figure: 672 MB (total) = 137 MB (/bin/sh, minimal JRE, minimal JDK) + 535 MB (optional stuff)
Recap