CERN Summer Student Program 2016
Evaluation of the Suitability of Alluxio for Hadoop Processing Frameworks
Author:
Christopher Lawrie
University of Warwick
Supervisor:
Prasanth Kothuri
CERN IT-DB
September 16, 2016
1 Introduction
1.1 What is Alluxio?
Alluxio [1] is an open-source, memory-speed, virtual distributed storage platform. It sits between
the storage layer and the processing-framework layer in a big-data stack, and claims to substantially
improve performance when data must be read or written at high throughput, for example when a
dataset is used by many jobs simultaneously. This report evaluates the viability of using Alluxio
at CERN for Hadoop processing frameworks.
1.2 Hardware and Software Configuration
Throughout the project, a cluster of 6 nodes with the following configuration was used to test
Alluxio:
Memory (per node): 7 GB
CPU (per node): 2.4GHz (4 cores)
OS: CentOS Linux release 7.2.1511 (Core)
Software: Cloudera Manager: 5.8.1, Spark: 1.6.0, Hadoop: 2.6.0-cdh5.7.1, Alluxio: 1.2.0.
2 Installation
It is assumed that one already has a cluster set up running Cloudera Manager, with all of
the appropriate software (listed above) installed, ready to test Alluxio. To install Alluxio on a cluster
with HDFS as the UnderFS storage (under storage filesystem) one must follow a combination
of two guides [2, 3]. The “alluxio” user will be used to install Alluxio and should have access
to run sudo commands. Where appropriate in the installation process, commands were run as
the alluxio user.
2.1 Obtain Alluxio
On each node of the cluster one must download and extract Alluxio, which can be obtained from
the downloads directory of the Alluxio website [4]. Care was taken to ensure that Alluxio was extracted
into the home directory of the alluxio user (/opt/alluxio), and that this user had
correct ownership of, and access to, all Alluxio files and directories.
2.2 Build Alluxio
Alluxio with HDFS as the underFS must be built against the correct version of Hadoop on each
machine. First, check the Hadoop version, then change the “hadoop.version” tag in the “pom.xml”
file located in the root of the Alluxio directory. Then build Alluxio from its home directory:
$ hadoop version
$ vim alluxio home/pom.xml # Change "hadoop.version" tag
$ cd alluxio home && sudo mvn clean package -DskipTests
2.3 Quick Configuration
Alluxio can be configured on the master machine and these changes then copied over to the
workers. To quickly configure Alluxio run the following in the Alluxio directory of the master
machine:
$ ./bin/alluxio bootstrapConf $HOSTNAME hdfs
On the master machine, in the alluxio-env.sh configuration file, list the address of the HDFS
running on the cluster and specify the Alluxio mount point within this HDFS (the port that
HDFS is running on can be found in Cloudera Manager as the value of “namenode port”). We must
also specify the amount of memory to assign to Alluxio on each worker.
$ vim "alluxio path/conf/alluxio-env.sh"
ALLUXIO_UNDERFS_ADDRESS=${ALLUXIO_UNDERFS_ADDRESS:-"hdfs://master address:hdfs port/alluxio mount point"}
ALLUXIO_WORKER_MEMORY_SIZE=${ALLUXIO_WORKER_MEMORY_SIZE:-"2000MB"}
For testing purposes, Alluxio was mounted to “/alluxio” on the HDFS.
Again on the master machine, we must list the worker hostnames in the workers configuration
file:
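The workers file holds one worker hostname per line. For example, with hypothetical hostnames standing in for the real cluster nodes:

```
$ vim alluxio home/conf/workers
worker1.cern.ch
worker2.cern.ch
worker3.cern.ch
worker4.cern.ch
worker5.cern.ch
```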
Finally, copy this configuration to each of the workers by running from the Alluxio home
directory on the master machine:
$ ./bin/alluxio copyDir conf
Note that this will require passwordless SSH between the root user on the master
and the root user on each worker, and between the alluxio user on the master and the alluxio user
on each worker. For my particular setup I also had to disable the requiretty option on
each machine.
2.4 Allow Alluxio to access the HDFS
Alluxio runs its filesystem commands as the root user, so we must give Alluxio operations full
access to the HDFS directory where Alluxio is mounted. First, in Cloudera Manager, enable
the configuration option “dfs.namenode.acls.enabled”. Then run the following on the namenode:
$ sudo -u hdfs hadoop dfs -mkdir "/alluxio"
$ sudo -u hdfs hadoop dfs -chown -R root:supergroup "/alluxio"
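Since “dfs.namenode.acls.enabled” has just been switched on, an alternative sketch (an assumption on my part, not taken from the guides) is to grant the root user access with an HDFS ACL rather than changing ownership:

```
$ sudo -u hdfs hdfs dfs -setfacl -R -m user:root:rwx "/alluxio"
```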
2.5 Start Alluxio
We can add the Alluxio commands to our “PATH” environment variable:
$ export PATH=$PATH:alluxio home/bin
Alluxio is now ready to be started. Run the following commands:
$ alluxio format
$ alluxio-start.sh all Mount
This formats the Alluxio mount point and starts the master and all workers. To check that Alluxio is
running you can go to “http://master hostname:19999”, where you will find the Alluxio web UI.
Occasionally there can be discrepancies between files in the underFS and the Alluxio filesys-
tem due to files not being correctly removed. This should not occur if permissions have been
correctly configured; when it does occur, one must usually format the Alluxio filesystem to resolve it.
2.6 Quick Boot with Shell Script
Best practice for tweaking Alluxio and rebooting is to make any changes on the master and then
copy those changes over to the workers. A shell script can be written to quickly apply these
changes and restart Alluxio.
$ cd /opt/alluxio/alluxio-1.2.0
$ alluxio copyDir conf
$ alluxio-stop.sh all
$ alluxio-start.sh all Mount
2.7 Alluxio and Spark Hostname Issue
There is a documented issue concerning how Spark and Alluxio interact with each other on
nodes [5]. The issue is that Spark often uses IP addresses rather than
hostnames to reference workers, whereas Alluxio uses hostnames. The easiest solution is to force
Spark to use hostnames instead. This is achieved by including an extra line in the spark-env.sh
config file of each machine in the Spark cluster:
$ vim spark home/conf/spark-env.sh
SPARK_LOCAL_HOSTNAME=`hostname`
This change should be made on the Cloudera Manager web UI if the cluster is managed by
Cloudera.
One can check that there is no issue between Spark and Alluxio by running a Spark job
which uses a file in Alluxio (for example, run a count on a file in Alluxio). Then, in the Spark
job history web UI, check the locality level of the executions. If there is no issue then it should
always read “NODE_LOCAL”, as opposed to something like “RACK_LOCAL”.
3 Alluxio Features
3.1 Further Configuration Options
Alluxio has many configuration options [6]. The two main configuration files are located at “al-
luxio home/conf/alluxio-env.sh” and “alluxio home/conf/alluxio-site.properties”. Each of these
files has a template showing the settings that can be changed within it.
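In the Alluxio 1.2.0 distribution these templates sit alongside the configuration files; if a configuration file does not exist yet, it can be created by copying its template into place before editing:

```
$ cd alluxio home/conf
$ cp alluxio-env.sh.template alluxio-env.sh
$ cp alluxio-site.properties.template alluxio-site.properties
```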
3.1.1 Example: Round-Robin Scheduling
To change the scheduling system for writing files to worker memory from its default value, a line
must be added to the ”alluxio-site.properties” file:
$ vim alluxio home/conf/alluxio-site.properties
alluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy
Upon restarting Alluxio and loading a file into memory the file will be distributed across workers
in a round-robin fashion.
3.1.2 Example: UnderFS Synchronisation
When writing files to Alluxio directly we can choose how Alluxio and the UnderFS storage
interact with each other. By default the parameter “alluxio.user.file.writetype.default” is set to
“MUST_CACHE”. With this value a file written directly to Alluxio will not be written to the
underFS. We can change the value of this parameter to “CACHE_THROUGH”.
$ vim alluxio home/conf/alluxio-site.properties
alluxio.user.file.writetype.default=CACHE_THROUGH
$ cd alluxio home && alluxio copyDir conf # Sync changes.
Now when a file is written directly to Alluxio it is synchronously written to the underFS.
Alluxio can also write files asynchronously to the underFS; this feature is in alpha, but it is
enabled by setting “alluxio.user.file.writetype.default” to “ASYNC_THROUGH”.
3.2 Command Line Interface
Alluxio can be used via a library of built-in command line tools which have a very similar
structure to HDFS commands [7]. These tools work straight away after installing Alluxio. To
see a full list of them with descriptions run:
$ alluxio fs
3.2.1 Example: Loading a File into Alluxio
We can either load a file directly into Alluxio memory, or into the underFS out of memory but
such that it is still visible to Alluxio.
To load directly into memory we can use:
$ alluxio fs copyFromLocal /path to local file /alluxio storage path
We can see that the file is now visible to Alluxio and that the file has been distributed
across the worker memory. The writing of the file into Alluxio is done according to the “writetype.default” configuration.
To load the file out of memory we can use:
$ hadoop dfs -put /path to file /alluxio/alluxio storage path
The file is now still visible to Alluxio, but it is not loaded into memory. The file now shows up
the same as before in the web UI except with 0% In-Memory. Note that we have to reference
the Alluxio folder (where Alluxio is mounted) when loading via hdfs commands.
3.2.2 Example: Removing a File from Alluxio Memory
When a file is loaded in Alluxio memory we can either remove the file completely from Alluxio
and the underFS, or just take the file out of memory and keep it in the underFS (assuming that
it is persisted).
$ alluxio fs free /path to free # Keep file within underFS.
$ alluxio fs rm -R /path to delete # Remove file from underFS.
3.3 Unified Namespace
When files are persisted through to the underFS from Alluxio memory (e.g. when
“CACHE_THROUGH” is set), all files and directories relative to the underFS mount point are kept
consistent. For example, deleting or renaming a file via the Alluxio command line interface
will perform the same operation on the respective underFS directories. This keeps the Alluxio
filesystem and the underFS synchronised.
3.4 Web Interface
Alluxio has a nice web UI which can often be easier to use than the command line interface. To
access it go to “http://master hostname:19999” in a web browser. Note that running the correct
version of Java when starting Alluxio is important for the web UI to work:
I found that version 1.8.0 caused the web UI to crash but version 1.7.0 was fine.
3.4.1 Example: File Browsing
To view the files available to Alluxio on the web UI, click on the browse tab. Here you can see a
number of useful pieces of information about these files. We can see all the files and directories
visible to Alluxio, their sizes, percentage of them stored in memory etc. Clicking on a file shows
the locality of individual blocks.
3.4.2 Example: Alluxio Configuration
The web UI also provides a very easy way to check that your changes to the Alluxio configuration
have been made correctly; view this by clicking on the “configuration” tab:
3.5 Filesystem API
Instead of accessing Alluxio via the command line, we can also use the Filesystem API from
within Java and Scala. In the context of a Spark framework I have not found any use case
where the filesystem API serves better than CLI or Spark commands. A possible use is for more
general distributed programs, written outside of a framework, which interface with Alluxio.
3.5.1 Java and Scala Compilation with Alluxio
To access the Alluxio classes from Java/Scala we need only add the Alluxio prebuilt jar to the
classpath:
$ export CLASSPATH="$CLASSPATH:path to alluxio/core/client/target/alluxio-core-client-1.2.0-jar-with-dependencies.jar"
This jar was first generated when we built Alluxio with maven during the installation process.
Let us create a file called test stored in Alluxio memory with the filesystem API:
import alluxio.client.file.FileSystem
import alluxio.AlluxioURI

object test {
  def main(args: Array[String]) {
    val fs = FileSystem.Factory.get()
    val path = new AlluxioURI("/test")
    val out = fs.createFile(path)
    out.write(120)
    out.close()
  }
}
This creates a file called “test” in the root of the Alluxio storage directory. It writes the letter
“x” in this file and then closes it. To compile and execute the file we run the following:
$ scalac "test.scala"
$ scala "test"
An almost identical method is used for Java.
3.5.2 Scala Shell and Alluxio
The Scala shell can be useful for playing around with Alluxio instead of having to compile a file
every time you want to do something. To access Alluxio with the Scala shell, simply have the
path to the prebuilt Alluxio jar in your classpath (see the above section) and run the Scala shell.
You must then import all necessary Alluxio classes in the shell in order to perform your Alluxio
commands.
It is worth noting that the Scala shell can be used to view all available Alluxio classes:
just press tab after typing “import alluxio.” in the shell:
3.5.3 Spark Shell and Alluxio
The Spark shell is also able to access the Alluxio filesystem. This is done in the same way as the
Scala shell. Ensure that the Spark classpath contains the Alluxio jar and start the Spark shell:
$ export SPARK_CLASSPATH=".:path to alluxio/core/client/target/alluxio-core-client-1.2.0-jar-with-dependencies.jar"
$ spark-shell --master yarn
We can read/write files from/to Alluxio using Spark wrappers. Suppose we have a file called
“test” saved in our Alluxio storage directory (or the HDFS /alluxio directory). We can load the
file into Spark and count the number of lines in it:
val file = sc.textFile("alluxio://namenode address:19998/test")
file.count
Or we can take a file in the HDFS and save it into Alluxio. Suppose our file “test” is in the root
directory of our hdfs (where Alluxio is not mounted).
val file = sc.textFile("hdfs://namenode address:hdfs namenode port/test")
file.saveAsTextFile("alluxio://namenode address:19998/test")
3.6 Submit a Spark Job with Alluxio Using Maven
To submit an application to Spark which depends on Alluxio classes we must include the Alluxio
dependencies in our build tool (in this case Maven). Along with the standard Spark, Scala and
HDFS dependencies for a normal Spark Maven build, we add the following two dependencies
(available in the Maven repository) to the “pom.xml” file:
<dependency>
  <groupId>org.alluxio</groupId>
  <artifactId>alluxio-core-common</artifactId>
  <version>1.2.0</version>
</dependency>
<dependency>
  <groupId>org.alluxio</groupId>
  <artifactId>alluxio-core-client</artifactId>
  <version>1.2.0</version>
</dependency>
3.6.1 Example: Accessing Alluxio from Spark Via Two Methods
We can use the following code to check our maven project build:
package package name

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import alluxio.client.file.FileSystem
import alluxio.AlluxioURI

object MainExample {
  def main(arg: Array[String]) {
    // Method 1 - filesystem API.
    val fs = FileSystem.Factory.get()
    val path = new AlluxioURI("/test")
    val out = fs.createFile(path)
    out.write(120)
    out.close()

    // Method 2 - Alluxio wrapper.
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val file = sc.textFile("alluxio://hostname.cern.ch:19998/test")
    file.saveAsTextFile("alluxio://hostname.cern.ch:19998/test2")
  }
}
Now we can package and submit the build:
$ mvn package
$ spark-submit --class com.path to code --master yarn target/build name.jar
Assuming that we have the Spark classpath correctly set up (see above section) then we will
have access to the filesystem API and the Alluxio Spark wrapper. Note the wrapper is provided
by the prebuilt Alluxio Jar in the spark classpath and the Alluxio Filesystem API is provided
by the dependencies in the Maven build. I have not been able to find out how to include the
Alluxio-Spark wrapper dependencies in the “pom.xml” file, although this should be possible.
3.7 Tiered Storage
Tiered storage [8] is a very powerful feature of Alluxio which allows the capacity of Alluxio to
overflow from memory into a set of other storage levels. This is useful for several reasons.
Firstly, memory can be expensive, so designing a whole cluster with its entire primary data
store in memory is impractical. Also, memory I/O speeds are so high that jobs often become
throttled by processing times instead, so reducing the storage speed by some margin will not cause a huge
performance decrease.
This feature was tested by extending Alluxio to a second tier of SSD storage of 15GB per
worker. This is done by adding lines to the “alluxio-site.properties” file on each worker.
$ vim alluxio home/conf/alluxio-site.properties
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=${alluxio.worker.memory.size}
alluxio.worker.tieredstore.level0.reserved.ratio=0.1
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/ssd
alluxio.worker.tieredstore.level1.dirs.quota=30GB
alluxio.worker.tieredstore.level1.reserved.ratio=0.1
$ cd alluxio home && alluxio copyDir conf # Sync changes made
Now in the Alluxio web UI we see that the capacity of Alluxio has been greatly increased.
The MEM/SSD separation is completely internal to Alluxio. Any program which interfaces
with Alluxio will still work exactly as it had before. Internally Alluxio has allocators and evictors
which organise data into their respective tiers of storage. Allocators choose which level of storage
and which directory data is written into for a worker. Evictors choose which blocks should change
their storage level when there is a shortage of space in a tier which the allocator has decided to
write to. The default allocators/evictors always prioritise maximising the usage of the lowest tier
of storage (memory in this case).
I have found that these default allocators and evictors do not allow files to be optimally
loaded into Alluxio. By default new blocks are prioritised for memory over older blocks and so
there is a constant cycle of blocks being written to memory and then being pushed down to SSD
storage. Custom allocators/evictors can however be written which could improve this.
Files can also be pinned, to prevent them from being moved out of memory by evictors.
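Pinning is done through the command line interface; the pin and unpin subcommands appear in the listing printed by “alluxio fs”:

```
$ alluxio fs pin /path to pin    # Evictors will now leave this file in place.
$ alluxio fs unpin /path to pin  # Allow the file to be evicted again.
```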
3.8 Metrics
Alluxio provides a wealth of metrics on jobs which have been performed in a running instance
of the program. A full list of available metrics, along with supported sinks, is available online
[9].
3.8.1 Example: Comparing Data Locality and Fault-Tolerance for Two Methods
We can track how well Alluxio and the framework running on top are optimising data locality
in jobs. The two main methods which can be used to load files into Alluxio memory are via the
CLI and the Spark saveAsTextFile command.
Writing a file to Alluxio via the saveAsTextFile command in Spark by default produces a
non-replicated, unpersisted file which is in memory. This is because when the Spark shell loads
and accesses Alluxio, it loads the Alluxio default configuration; the Spark shell has no knowledge
of the extra configuration files for Alluxio.
When running a count on this file no replication is enforced and so no data is written remotely
to other nodes. The “Blocks Read Remotely” value does not increase from its initial value (which
is non-zero due to the initial saveAsTextFile command to write into Alluxio), whereas the Blocks
Read Locally value does. I have run the count a number of times to show that the Blocks Read
Remotely value stays small but the Blocks Read Locally value continues to increase. This hidden
default behaviour from within Spark means that no blocks can be copied from the HDFS to new
nodes during processing.
This behaviour is consistent no matter how many executors or cores are running. However,
fault-tolerance is not enabled: if a node is killed during the count command then the job fails.
This is because there is no persistence to the underFS by default (MUST_CACHE enabled).
A second method for loading a file into memory is via the Alluxio CLI which behaves ac-
cording to the Alluxio configuration files setup.
This loads a non-replicated file into Alluxio which is persisted (clearly, as the file was
already on the underFS). When running a count on this file, replication increases in memory.
Because memory I/O is so fast, it is often quicker to copy a file from the HDFS to a new node
than to wait for another node to finish its current task and start the new one. When running
the count with 5 executors and 3 cores per executor, the replication was kept down to a factor
of two.
However, when running with 5 executors and 2 cores per executor, replication
increased, as more files were copied while waiting for executors to finish (they have fewer cores).
Fault tolerance is also enabled now. Note, this is not due to the replication in memory; it
is due to the persistence in the underFS. Spark reports that an executor has failed but the job
finishes: a node in the cluster was killed and the blocks contained on this node were redistributed
to new nodes from the underFS.
It appears that persistence to the underFS dictates the fault-tolerance of Spark jobs, and
that there is some other hidden behaviour, which I have not been able to understand, dictating where blocks
are replicated across workers in memory to increase job performance by reducing
executor wait times. This behaviour could also come from Spark rather than Alluxio.
3.9 Controlling Alluxio Behaviour from within Spark
The previous section has shown that it is the persistence to the underFS from Alluxio which
enables fault-tolerance. It has also been noted that the default behaviour of Spark with Alluxio
is to run Alluxio with its out-of-the-box configuration; however, this can be changed [10].
To change an Alluxio configuration parameter from its default value in Spark, one must pass
it as an additional Java option. This is done by adding a new line to the “spark-defaults.conf”
file:
$ vim spark home/conf/spark-defaults.conf
spark.executor.extraJavaOptions=-Dalluxio parameter=value
For example, to enable fault tolerance on the Alluxio files created by the saveAsTextFile
command, I must enforce persistence to the underFS. In Cloudera Manager I add the corresponding
line to the “spark-conf/spark-defaults.conf client config safety valve” configuration option,
and Cloudera propagates this change across all node configurations.
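Assuming the parameter being set is the write type from section 3.1.2, the line placed in the safety valve would be:

```
spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH
```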
Now files are persisted when written using saveAsTextFile in Spark. We may use this method
for any of the Alluxio configuration options.
3.10 Lineage API
Alluxio has its own fault-tolerance solution which is independent of the framework used [11].
Since it is currently in the alpha stages of development, and Spark already provides fault-tolerance
which is compatible with Alluxio, this has not been investigated further.
4 Testing
4.1 Tiered Storage vs. Standard CERN Cluster
Much distributed computing at CERN is done on HDD clusters, which are inevitably slow and
throttled by the data storage I/O speeds. We therefore tested the viability of a new cluster
architecture for processing, with a core memory-speed data store and a secondary SSD data store.
Each node had 3000MB of memory storage and 30GB of SSD storage, and 100.69GB of pagecount
data from Wikipedia was loaded into Alluxio.
A standard count was performed on these files; this was done in Spark with 5 executors,
1GB per executor, and 1GB for the ApplicationMaster:
Clearly the MEM + SSD cluster is faster; however, in this case it was only twice as fast.
We would expect a greater improvement than this. It is suspected that this is due to only
having 4 cores per node, meaning the jobs were processing-bound. Alluxio should utilise the full SSD (and
memory) throughput given more cores to test with.
The implementation of the tiered storage architecture was incredibly easy. Configuration
takes no more than a few command lines (assuming you have the hardware ready to go, see 3.7).
Changing the code from running the test on Alluxio to running it on the HDFS was just a matter
of referencing “alluxio://hostname:alluxio port/path” instead of “hdfs://hostname:hdfs port/path”.
Tiered storage is an incredibly powerful feature which can make Alluxio a very
viable solution for high-capacity distributed storage. It allows a mixture of high-throughput
and cheaper hardware to store data in a distributed manner, giving control over hardware
speed vs. practicality.
For more information see [12].
4.2 Comparison to HDFS Caching
HDFS caching is an extension to HDFS which allows files to be cached in a memory pool
reserved for the storage system. Alluxio and HDFS caching are both very similar solutions for
in-memory distributed storage. However, Alluxio improves upon HDFS caching in a number of
ways.
Both filesystems are referenced from within Spark similarly:
val file = sc.textFile("hdfs://hostname:8020/fileName")
val file2 = sc.textFile("alluxio://hostname:19998/fileName")
Alluxio files are loaded into memory usually via the command line:
$ alluxio fs copyFromLocal /fileName
Or by the Spark saveAsTextFile command. HDFS caching requires a cache pool to be created,
then a directive is added to the pool:
$ hadoop cacheadmin -addPool <name>
$ hadoop cacheadmin -addDirective -path <path> -pool <pool name>
Both tools require the user to preallocate a volume of memory to be used for caching. They
both also allow full control over which files are cached into memory, and both let
the user see exactly which files are cached and how much of each is cached.
To compare the speeds of the two solutions, a 13GB text file was loaded into memory and a
filter was performed. The file was kept small to ensure that the whole thing could be stored in memory
on the testing cluster. The same resources were given to Spark as in the tiered storage example.
For completeness, the task was also performed on a standard HDD HDFS filesystem.
The key results to take from these tests are that:
• HDFS caching and Alluxio caching are both much faster than a standard HDFS HDD
cluster.
• On run 1, Alluxio caching is much faster than HDFS caching. Frequent one-time access
from multiple jobs to a file could be a common use case, so this is potentially significant.
• Times agree upon a second run and show memory-speed throughput.
Alluxio improves on HDFS caching in two main ways: it has a fully controllable tiered
storage feature, which has been described previously, and it also allows writing files directly
into memory. Thus, in general, Alluxio is a better solution than HDFS caching for in-memory
distributed data storage.
4.3 Comparison to Spark Caching
There are not many similarities between Spark caching and Alluxio caching. Spark caching (and
persistence in general) is intended to be used to checkpoint data which will be frequently
accessed within a job's lifecycle. A cached RDD cannot persist across Spark jobs and is obviously only
accessible from Spark. Furthermore, an action has to be performed on the RDD for it to be cached initially,
so there is no ability to write directly to memory.
I found it very difficult to control how much memory was preallocated to caching in Spark,
although the documentation suggests that this is possible. When a file is only partially
cached in Spark, performance is actually significantly reduced and can be inconsistent. The 13GB
text file test from above was also run with Spark caching, where the file was not fully cached:
To test Spark caching on a file which could be fully cached, I used a 3300MB text file stored
in the HDFS. I tested Spark caching in a Spark shell with 21GB of resources: 1GB for the
application manager and around 20GB for the executors.
I ran a simple filter job 40 times and recorded the time taken to complete it. This was
compared with the same files cached with Alluxio with the Spark saveAsTextFile command and
uncached in the HDFS. The same resources were given to the Spark shell each time.
The difference in speed between Alluxio and Spark caching lies in the overhead required
for Spark to connect to the Alluxio filesystem. This effect is very pronounced in this test because we have
run a filter on a small file many times, so Spark connects to the filesystem many times. If we
could compare much larger files in memory then this overhead should be far less
obvious; unfortunately, the test environment available does not allow this.
In the previous test I could have loaded the file into Alluxio outside of Spark.
$ alluxio fs load /bigFile
The behaviour of this loading method is different in a couple of ways. Firstly, given enough space
in memory, there will be replication of blocks in Alluxio memory, which I believe is to minimise
time wasted waiting for executors to finish certain processes. I allocated Alluxio more memory
for this option so that there was enough memory for this replication to occur.
Secondly, this file is automatically persisted to the underFS, whereas the previous method
produced a file only present in memory. This behaviour is due to Spark loading Alluxio default
values unless specified otherwise (see 3.9). Persisted files in Alluxio allow Spark to operate in a fault-
tolerant manner. The job times were essentially the same as with the previous Alluxio method:
One of the major advantages of using Alluxio is that cached data can be accessed by multiple
jobs. In the first example, where a file was stored in Alluxio using Spark, I can access this data
from another application. Access from a different framework or job should not add any overhead
to processing times.
5 Further Remarks
While Alluxio may provide a performance increase and further features on top of alternative
solutions, there are some issues which should be mentioned.
The Alluxio documentation is currently poor, and so the learning curve for the tool is steep.
Uptake also appears to be slow, so finding useful tutorials online is difficult.
While the GitHub repository for the software is available online, understanding the inner workings of
Alluxio is still difficult. For example, I have not been able to fully understand the replication behaviour
of the saveAsTextFile command from Spark to Alluxio (however, this may be Spark
behaviour rather than Alluxio behaviour).
That being said, the software is still new, and development is still active. Many of these
issues may disappear with time. Also, once some time is spent with the software and a certain
level of understanding has been achieved, the software becomes very easy to use.
Finally, it is worth noting that Alluxio supports being managed by YARN and also has some
security options which are compatible with HDFS security. These features have not been inves-
tigated.
6 Conclusions
When looking at solutions for in-memory distributed data storage, Alluxio clearly has the most
potential and its features appear the most powerful. HDFS caching does not work nearly as well,
and Spark caching is not a cross-platform solution at all. So for implementing a cluster utilising
in-memory storage, Alluxio should be the clear choice.
However, Alluxio is still in its very early stages and, as described above, it does not appear
ready to be deployed as a large-scale solution. There are many issues still to be fixed and
many features are still in alpha.
If Alluxio continues to be developed, and its shortcomings addressed, then it has the potential
to be very beneficial for a number of CERN use cases.
References
[1] Alluxio Homepage, http://www.alluxio.org/, 04/08/2016.
[2] Alluxio Installation with HDFS UnderFS, http://www.alluxio.org/docs/master/en/Configuring-Alluxio-with-HDFS.html, 04/08/2016.
[3] Alluxio Installation with a Cluster, http://www.alluxio.org/docs/master/en/Running-Alluxio-on-a-Cluster.html, 04/08/2016.
[4] Alluxio download directory, http://alluxio.org/downloads/files/, 08/08/2016.
[5] Alluxio-Spark Hostname Issue, https://issues.apache.org/jira/browse/SPARK-10149, 31/08/2016.
[6] Alluxio configuration options, http://www.alluxio.org/docs/master/en/Configuration-Settings.html, 09/08/2016.
[7] Alluxio command line interface, http://www.alluxio.org/docs/master/en/Command-Line-Interface.html, 09/08/2016.
[8] Alluxio Tiered Storage, http://www.alluxio.org/docs/master/en/Tiered-Storage-on-Alluxio.html, 31/08/2016.
[9] Alluxio metrics, http://www.alluxio.org/docs/master/en/Metrics-System.html, 11/08/2016.
[10] Accelerating On-Demand Data Analytics with Alluxio, http://www.alluxio.com/assets/uploads/2016/08/Accelerating_OnDemand_Data_Analytics_w_Alluxio.pdf, 15/09/2016.
[11] Alluxio Lineage API, http://www.alluxio.org/docs/master/en/Lineage-API.html, 31/08/2016.
[12] Tiered Storage in Alluxio, https://db-blog.web.cern.ch/blog/christopher-lawrie/2016-09-using-tiered-storage-alluxio, 31/08/2016.
[13] Experience of Using Alluxio, https://db-blog.web.cern.ch/blog/christopher-lawrie/2016-08-experiences-using-alluxio-spark, 31/08/2016.