Using Hadoop as a platform for quick and efficient pattern
matching of large scale biometric databases
Sneha Mallampati
Problem Report submitted
to the College of Engineering and Mineral Resources
at West Virginia University
in partial fulfillment of the requirements for the degree of
Master of Science in
Computer Science
Dr. Vinod Kulathumani, Ph.D., Chair
Dr. Elaine M. Eschen, Ph.D.
Dr. Roy S Nutter, Ph.D.
Lane Department of Computer Science and Electrical Engineering
Morgantown, West Virginia
2014
Keywords: Hadoop, MapReduce, HDFS, Biometrics Systems, Fuzzy Matching
ABSTRACT
Using Hadoop as a platform for quick and efficient pattern
matching of large scale biometric databases
Sneha Mallampati
Biometric systems have evolved today as a reliable mechanism for the accurate determination of an
individual's identity in the context of several applications like access control, personnel
screening and criminal identification. Several biometric modalities are in use, such as iris
recognition, facial recognition, fingerprint recognition and keystroke dynamics.
Over time, the sheer volume of biometric data being stored and processed for identification
purposes continues to grow. Despite this extreme scale, these systems are required to maintain
high accuracy and quick information retrieval. Traditional parallel architectures and database
systems are largely inadequate for efficient biometric pattern matching over such huge datasets.
In this report, Hadoop is used as an alternative architecture for implementing biometric systems.
Hadoop is a popular open source framework known for its massive cluster-based storage, and it
provides a good platform on which to build a new distributed fuzzy matching system. Hadoop uses
key/value pairs as its basic unit for working with unstructured data types. It implements a
MapReduce framework to process vast amounts of data in parallel on large clusters in a
reliable and fault-tolerant manner. The data distribution and replication features of the Hadoop
Distributed File System (HDFS), the distributed storage system used by Hadoop, provide a
reliable storage system with enough parallelism to support fast queries running on multiple
machines. This report describes how a biometrics system can be implemented on Hadoop. To this
end, a large-scale pattern matching technique called fuzzy matching was performed on Hadoop to
identify similar images. Due to a lack of access to large scale biometric databases, ordinary
images were used to implement the fuzzy matching technique, but the same procedure can be
applied to biometric databases.
ACKNOWLEDGEMENTS
I would like to thank Dr. Vinod Kulathumani for his patience and constant support throughout
my research. He helped me a lot by providing good guidance and lots of valuable suggestions.
I would also like to acknowledge Dr. Elaine Eschen and Dr. Roy Nutter for being on the
committee.
I would like to thank all my friends and family members for their motivation and continuous
support throughout my project.
Finally, I would like to thank the most important people in my life, my parents and my beloved
brother for their encouragement, love and trust towards me.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
1. INTRODUCTION
2. HADOOP OVERVIEW
2.1 Apache Hadoop
2.2 Hadoop Components
2.2.1 Hadoop MapReduce
2.2.2 Hadoop Distributed File System (HDFS)
2.3 The Building Blocks of Hadoop
2.3.1 NameNode
2.3.2 DataNode
2.3.3 JobTracker
2.3.4 TaskTracker
2.3.5 Secondary NameNode
2.4 Hadoop Distributed File System Client
2.4.1 Reading a File
2.4.2 Writing to a File
2.5 Replication Management
2.6 Different Operational Modes of Hadoop
2.6.1 Local (Standalone) Mode
2.6.2 Pseudo-Distributed Mode
2.6.3 Fully Distributed Mode
2.7 Web-based Cluster UI
2.8 Advantages and Disadvantages of Hadoop
2.8.1 Advantages
2.8.2 Disadvantages
3. BIOMETRIC SYSTEMS
3.1 Modes of Biometric System
3.1.1 Verification Mode
3.1.2 Identification Mode
3.2 Biometrics System Design
4. EFFICIENT PATTERN MATCHING OF LARGE SCALE BIOMETRICS DATA
4.1 Fuzzy Matching
4.2 Hadoop and Biometrics System
4.3 Development Method
4.3.1 Clustering on Hadoop
4.3.2 Low Latency Fuzzy Matching
5. HADOOP IMPLEMENTATION
5.1 Development Method
5.2 Hadoop Cluster Setup
5.3 Creating Vectors from Images
5.4 Running Clustering on Hadoop
5.4.1 Apache Mahout and Apache Maven
5.5 Fuzzy Matching
6. CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work
REFERENCES
LIST OF FIGURES
Figure 2-1: MapReduce Framework
Figure 2-2: HDFS Architecture
Figure 2-3: NameNode/DataNode interaction in HDFS
Figure 2-4: JobTracker and TaskTracker interaction
Figure 2-5: Topology of a Hadoop cluster
Figure 2-6: A Client Reading Data from HDFS
Figure 2-7: A Client Writing Data to HDFS
Figure 2-8: A snapshot of the HDFS web interface
Figure 2-9: A snapshot of the MapReduce web interface (JobTracker)
Figure 2-10: A snapshot of the MapReduce web interface (TaskTracker)
Figure 3-1: Enrollment, Verification and Identification stages of a Biometric System
Figure 4-1: Fuzzy matching of fingerprint images lifted from a crime scene and from the law enforcement database
Figure 4-2: Bulk clustering and real-time classification
Figure 4-3: Client communication with the master server while performing low latency fuzzy matching
Figure 4-4: Client communication with the data server while performing low latency fuzzy matching
1. INTRODUCTION
Biometric systems have evolved today as a reliable mechanism for the accurate determination of an
individual's identity in the context of several applications like criminal identification, personnel
screening and access control (replacing passwords and tokens), because of their reliability and
uniqueness. Several biometric modalities are in use, such as fingerprint, face recognition, palm
print, iris recognition and retina scanning (which relate to the shape of the body), as well as
typing rhythm, voice and so on (which relate to a person's pattern of behavior).
The United States Department of Homeland Security checks a person's biometrics against its
entire database to determine whether that person is using a fraudulent identity. Currently, the
US-VISIT program maintains 148 million identities of individuals, and around 188,000 identities are
enrolled or verified daily [14]. The Unique Identification Authority of India (UIDAI) also
plans to maintain a database of over 1 billion residents of India containing biometric (face,
fingerprint and iris) images along with other data [11]. Such massive biometric systems require
sophisticated networks of computers to identify individuals accurately, reliably, and quickly.
In traditional High-Performance Computing (HPC) systems, data is stored on large, shared,
centralized storage systems such as a Storage Area Network (SAN) or Network Attached Storage
(NAS). In such systems, when a job is executed, data is fetched from the central storage
system and, after processing, the results are written back to it. This can cause collisions
when many workers try to fetch the same data at the same time, and with large data sets it
quickly leads to bandwidth contention.
Typical distributed systems also use MPI (Message Passing Interface) based architectures
for parallel computing. They may perform well on a small number of machines, but as machines
are added their performance scales sub-linearly [12].
Hadoop, by contrast, is designed to provide a flat scalability curve. A program
written in a distributed framework other than Hadoop may require large amounts of
refactoring when scaling from ten to one hundred or one thousand machines, possibly
involving several rewrites of the program [13].
Traditional databases index records (structured data) in alphabetical or numerical
order for efficient retrieval, but biometric data, which is unstructured, has no natural
sorting order [16].
So, traditional parallel architectures and database systems are largely inadequate for
efficient biometric pattern matching over such huge datasets (often referred to as Big Data).
Problem Statement: In order to handle such massive data sets and to perform searches in a
reasonable amount of time, we explored a cloud-based system and software framework called
Apache Hadoop. The main objective of this report is to explain how the Hadoop framework can
be leveraged to solve this problem.
Hadoop is an open-source cloud computing environment designed for the distributed processing of
massive amounts of data (in the terabyte range). It is a cluster-based data storage system on
which vast amounts of data can be processed in parallel across large clusters using the MapReduce
principle. A MapReduce job splits the input data into independent chunks so that they can be
processed by map tasks in a completely parallel manner. The framework sorts the output of the map
tasks, which is then fed as input to the reduce tasks.
Hadoop also uses a distributed file system called the Hadoop Distributed File System (HDFS), which
is based on the Google File System, to divide files among several nodes so that each node's
processor works on its own storage. HDFS is highly fault-tolerant and provides high throughput for
applications with large data sets when compared to other distributed file systems. In Hadoop,
the computation is performed on the machine where the block is stored, i.e. the computation logic
is moved across the network rather than the data blocks.
Hadoop is also designed to have a very flat scalability curve. The underlying Hadoop platform
manages the data and hardware resources and provides performance growth proportionate
to the number of machines available. A Hadoop program that runs well on a few nodes also
scales to many nodes, with only minor changes in code [13].
In Hadoop, data can originate in any form (semi-structured or unstructured), but it is eventually
transformed into key/value pairs for the processing functions to work on [6].
Thus we expect that Hadoop would be a good platform for large-scale biometric search, i.e. for
implementing a highly scalable and low-latency method of fuzzy matching (returning the most
similar items when there is no exact match); the challenge is to see how the MapReduce
framework can be leveraged for biometric databases.
The ability to run a MapReduce program over the entire biometric database helps in clustering
it into several bins using the canopy and k-means clustering algorithms from the Apache Mahout
project. Each bin then contains biometrics that are statistically similar, along with a
'mean biometric' representing the average of the biometrics contained in that bin. During
querying, only the bins closest to the query biometric are searched. This allows us to avoid
searching a large portion of the database and to search only the bins that contain the items
most similar to the query biometric [2].
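The binning idea above can be sketched in plain Python, independently of Hadoop or Mahout. The bins, toy 2-D feature vectors and the Euclidean distance metric are all illustrative assumptions; a real system would use high-dimensional biometric feature vectors and Mahout's clustering output.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """The 'mean biometric' of a bin: the coordinate-wise average."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_bin_search(bins, query, k=1):
    """Search only the bin whose mean is closest to the query,
    instead of scanning the whole database."""
    best_bin = min(bins, key=lambda b: dist(mean_vector(b), query))
    return sorted(best_bin, key=lambda v: dist(v, query))[:k]

# Two pre-clustered bins of toy 2-D feature vectors (hypothetical data).
bins = [
    [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]],   # bin near the origin
    [[9.0, 9.1], [9.2, 8.9], [8.8, 9.0]],   # bin near (9, 9)
]
print(nearest_bin_search(bins, [9.1, 9.0]))  # → [[9.0, 9.1]]
```

Only the second bin's three records are compared against the query; the first bin is skipped entirely, which is what makes the approach scale.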
The goal of this report is to give a brief overview of Hadoop, of how to perform efficient
pattern matching of large scale biometric data, and of how to implement this low latency
pattern matching (fuzzy matching) technique on Hadoop. Due to a lack of access to large scale
biometric databases, ordinary images were used in this project to implement the fuzzy matching
technique, but the same procedure can be applied to biometric databases.
2. HADOOP OVERVIEW
2.1 Apache Hadoop
Apache Hadoop is an open source software framework used for the distributed processing of
large amounts of data across clusters of computers using a simple programming paradigm called
the MapReduce programming model. The MapReduce framework provides automatic parallelization,
fault tolerance, and the scale to process hundreds of terabytes of data in a single job over
thousands of machines. Hadoop also uses a distributed file system called the Hadoop Distributed
File System (HDFS) to support large-scale, data-intensive, distributed processing applications.
Hadoop runs the processing logic provided by the user on the machine where the data resides
rather than moving the data across the network, i.e. Hadoop is designed on the premise that
programs (code) are smaller than the data and therefore easier to move around [6].
2.2 Hadoop Components
2.2.1 Hadoop MapReduce
Hadoop MapReduce is an implementation of the MapReduce programming paradigm introduced
by Google in 2004, used for processing large amounts of data in parallel on clusters of computers
(nodes). The term MapReduce refers to two independent and distinct tasks, the
"map" and "reduce" tasks. A MapReduce job splits the input data into a number of independent
chunks (key/value pairs), which are processed by different map tasks in a completely parallel
manner. The framework sorts the outputs of the map tasks (sets of independent key/value pairs),
and they are given as input to the reduce tasks. The reducer reduces each set of
key/value pairs sharing a common key to a smaller set of values. The outputs of all reduce
tasks are combined to produce the complete output of the MapReduce job [4]. The main
advantage of the MapReduce programming model is its distributed processing of the map and
reduce operations. Since each task is completely independent of the others, all map tasks can be
performed in parallel, and hence the total computation time can be reduced.
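The map, sort and reduce stages described above can be sketched as a tiny single-process simulation in Python. This is a conceptual sketch only: real Hadoop MapReduce runs the map calls on different nodes and shuffles the sorted pairs over the network; here the classic word-count example stands in for any map/reduce computation.

```python
from itertools import groupby
from operator import itemgetter

def map_task(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.split()]

def reduce_task(key, values):
    """Reduce: collapse all values sharing one key into a single total."""
    return (key, sum(values))

def mapreduce(records):
    # Each record could be handled by an independent map task in parallel.
    intermediate = [pair for r in records for pair in map_task(r)]
    # The framework sorts the map output by key before the reduce stage.
    intermediate.sort(key=itemgetter(0))
    return [reduce_task(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))]

print(mapreduce(["iris face iris", "face iris"]))
# → [('face', 2), ('iris', 3)]
```

The key point is that `map_task` calls never depend on one another, which is exactly the independence that lets Hadoop run them in parallel across a cluster.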
2.2.2 Hadoop Distributed File System (HDFS)
HDFS is a file system designed to support large scale distributed data processing under the
MapReduce framework. HDFS also provides high-throughput, streaming reads and writes of
very large files, i.e. it can store a dataset of almost 100 TB as a single file, which would be an
overwhelming task for other file systems [20]. Hadoop takes its input from HDFS and also
stores its output on HDFS.
HDFS is similar to traditional file systems (NTFS, EXT3, HFS Plus, etc.) in many ways: files
are stored as data blocks, and metadata exists that keeps track of the mapping between filenames
and data block locations, the directory tree structure, permissions, and so on. However, while
those file systems can span multiple disks, they cannot span multiple computers as HDFS does.
Also, traditional file systems are implemented as kernel modules, i.e. they are tightly linked
to their operating system's kernel; HDFS, in contrast, runs in user space, so HDFS can be used
on any operating system supported by Java.
Another major difference between HDFS and traditional file systems is its block size: the HDFS
block size is 64 MB by default, but users can configure it to 128 MB, 256 MB or larger as
application requirements dictate, whereas other file systems typically use a 4 KB or 8 KB block
size. A larger block size means data can be written in larger continuous chunks on disk.
This results in better performance for applications that read and write data in large
sequential operations, by minimizing drive seek operations [8].
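As an illustration, the block size can be raised in hdfs-site.xml. The property name below is the classic Hadoop 1.x name (`dfs.block.size`, in bytes); the chosen value of 128 MB is an example, not a recommendation.

```xml
<!-- hdfs-site.xml (illustrative): raise the block size from 64 MB to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 bytes -->
</property>
```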
Figure 2-1: MapReduce Framework (input data flows to parallel map tasks, which emit intermediate (key, value) pairs; these are grouped and passed to reducers, which produce the final output)
HDFS replicates data blocks across multiple machines in the cluster to make the system
fault-tolerant without depending on any specialized data-protection subsystem. By default, each
data block is replicated three times. Data storage in HDFS is write-once, read-many:
once written, a file cannot be modified, which helps in maintaining consistency between
the replicas.
2.3 The Building Blocks of Hadoop
Hadoop implements the concepts of distributed storage and distributed computation using a set
of daemons which run on different servers in the Hadoop network. Hadoop employs a master/slave
architecture for both distributed storage (HDFS) and distributed computation (Hadoop
MapReduce).
The three daemons that make up an HDFS cluster are the NameNode, the Secondary NameNode and
the DataNode; the two major daemons in Hadoop MapReduce are the JobTracker and the
TaskTracker.
Figure 2-2: HDFS Architecture (clients perform metadata operations, e.g. "/home/foo/data, 3 replicas", against the NameNode, and read or write block data directly on DataNodes spread across racks; blocks are replicated between racks)
These daemons have distinct roles (either master or slave); some exist on only one server
and some exist across multiple servers.
2.3.1 NameNode
The NameNode is the master of HDFS and is responsible for maintaining the file system
metadata. A Hadoop cluster contains a single NameNode but multiple DataNodes. The NameNode
keeps track of how files are broken down into data blocks, and it knows which
DataNode stores which blocks. All the DataNodes report their status to the NameNode through
regular heartbeats, so the NameNode has a complete view of the DataNodes in the cluster and
their current health. After a certain period without heartbeats, a DataNode is assumed to be
dead.
The file system metadata is stored entirely in the main memory of the NameNode
machine for fast lookup and retrieval [6, 8].
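The heartbeat-based liveness check can be sketched as follows. The 30-second timeout, the node names and the timestamps are all hypothetical; Hadoop's actual expiry interval is a configurable, much longer value.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a node is presumed dead (illustrative)

def live_datanodes(last_heartbeat, now):
    """Return the DataNodes whose most recent heartbeat is recent enough.

    last_heartbeat maps a node name to the timestamp of its last report."""
    return {node for node, t in last_heartbeat.items()
            if now - t <= HEARTBEAT_TIMEOUT}

heartbeats = {"dn1": 100.0, "dn2": 62.0, "dn3": 99.5}
print(live_datanodes(heartbeats, now=100.0))  # dn2 is 38s stale → presumed dead
```

Any node missing from the returned set would have its blocks treated as under-replicated and re-replicated elsewhere.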
2.3.2 DataNode
The DataNode is the slave of HDFS and is responsible for storing and retrieving block data.
A Hadoop cluster may contain one to many DataNodes. In HDFS a file is split into blocks, and
when a client wants to read or write a file it first contacts the NameNode, which tells it
which DataNode holds each block. The client then communicates directly with the DataNode
servers and requests them to process the read or write operations. Based on instructions from
the NameNode, DataNodes perform block creation, deletion and replication. Data blocks are
stored in replicated form on different DataNodes to maintain redundancy; this also ensures
that if any DataNode crashes or becomes unavailable over the network, the files can still be
accessed from the other DataNodes containing those blocks. Each DataNode constantly informs
the NameNode of the blocks it is currently storing [6, 8].
2.3.3 JobTracker
The JobTracker is the master process of the MapReduce cluster and acts as a liaison between the
client application and Hadoop. There is only one JobTracker per MapReduce cluster. Once a user
submits a job, the JobTracker determines the complete execution plan, i.e. how many
TaskTrackers and subordinate tasks should be created; it assigns the different tasks to
different TaskTrackers and verifies whether all tasks are running. Similar to the DataNode and
NameNode, each TaskTracker frequently reports its execution status and completed tasks to
the JobTracker. If a task fails, the JobTracker automatically re-assigns the failed task to a
different node. The JobTracker communicates with each TaskTracker and with the client using
remote procedure calls (RPC) [6, 8].
2.3.4 TaskTracker
The TaskTracker is the slave process of the MapReduce cluster; it is responsible for locally
executing the tasks assigned by the JobTracker, and it reports its status back to the
JobTracker periodically by means of a heartbeat. If the heartbeat is not received by the
JobTracker within a specified amount of time, it assumes that the TaskTracker is dead and
reassigns its tasks to another node in the cluster. Even though each node contains only a
single TaskTracker, each TaskTracker can execute many map and reduce tasks in parallel [6, 8].
Figure 2-3: NameNode/DataNode interaction in HDFS (the NameNode holds file metadata, e.g. /user/chuck/data1 → blocks 1, 2, 3 and /user/james/data2 → blocks 4, 5, while the DataNodes store the replicated blocks themselves)
2.3.5 Secondary NameNode
As the name indicates, this is an assistant daemon for monitoring the HDFS cluster. Each cluster
contains a single Secondary NameNode. Since the NameNode is a single point of failure for a
Hadoop cluster, the Secondary NameNode is also used to store the filesystem metadata log so that
it can be preserved. The Secondary NameNode communicates regularly with the NameNode to take
snapshots of the HDFS metadata [6].
Figure 2-4: JobTracker and TaskTracker interaction (a client submits a job to the JobTracker, which assigns map and reduce tasks to the TaskTrackers)
Usually the NameNode and JobTracker run on the same machine, which acts as the master node,
whereas a DataNode and TaskTracker run together on each slave node.
2.4 Hadoop Distributed File System Client
2.4.1 Reading a File
To read a file in HDFS, the client first contacts the NameNode, indicating the name of the file
it would like to read. The client's identity is validated, and then the ownership and
permissions of the file are checked. If the file exists and the client has access to it, the
NameNode returns to the client the list of DataNodes holding copies of the file's blocks. The
client then connects directly to the nearest DataNode and reads the data block it needs. This
process repeats until all the data blocks have been read, and then the client closes the file
stream [8].
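The read protocol above can be simulated in a few lines. The metadata layout, node names and file contents below are toy assumptions, and "nearest" is simplified to "first in the list"; the real client picks the topologically closest replica.

```python
def namenode_lookup(metadata, filename):
    """NameNode step: return each block's list of DataNodes holding a replica."""
    if filename not in metadata:
        raise FileNotFoundError(filename)
    return metadata[filename]

def read_file(metadata, datanodes, filename):
    """Client: ask the NameNode for block locations, then fetch every
    block directly from the first (notionally nearest) DataNode."""
    content = b""
    for block_id, hosts in namenode_lookup(metadata, filename):
        content += datanodes[hosts[0]][block_id]   # direct DataNode read
    return content

# Toy cluster state: foo.txt is split into two blocks, each replicated twice.
metadata = {"foo.txt": [(1, ["dn1", "dn2"]), (2, ["dn3", "dn1"])]}
datanodes = {"dn1": {1: b"hello ", 2: b"world"},
             "dn2": {1: b"hello "},
             "dn3": {2: b"world"}}
print(read_file(metadata, datanodes, "foo.txt"))  # b'hello world'
```

Note that the NameNode never touches the data itself; it only hands out locations, which is why it can serve metadata for the whole cluster from memory.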
Figure 2-5: Topology of a Hadoop cluster (a master node hosting the NameNode and JobTracker, a backup node hosting the Secondary NameNode, and slave nodes each running a DataNode and a TaskTracker)
2.4.2 Writing to a File
To write a file to HDFS, the client first creates an empty file without any blocks of data; to
do this the client must have write permission. The NameNode then creates the metadata entry for
the new file and allocates a set of DataNodes to which replicas of the first block should be
written. A replication pipeline is formed as the client makes a direct connection to the first
DataNode in the list, which connects to the second DataNode, and so on. Each DataNode in
the replication pipeline acknowledges each data block it has successfully written. When the
first block is complete, the client requests new DataNodes to store the replicas of the next
block. Thus a new pipeline is created, and this process repeats until the client closes the
data stream, indicating that it has finished sending data. If a DataNode in a pipeline fails,
the failed DataNode is removed from the pipeline and new DataNodes are chosen to write the
remaining data blocks [8].
Figure 2-6: A Client Reading Data from HDFS (the client application asks the NameNode to open /user/chuck/foo.txt; the NameNode responds with the block list and the DataNodes hosting each block; the Hadoop library then requests each block, e.g. block 1, directly from a DataNode, which returns the data)
2.5 Replication Management
The NameNode is responsible for replication management, which is one of the main factors
determining HDFS reliability, availability and performance. The NameNode must detect whether a
block is over-replicated or under-replicated. If a block is over-replicated, the NameNode
chooses a replica to delete. While performing any action (i.e. either adding or deleting a
replica), the NameNode's first priority is to place each replica on a unique rack, so as to
prevent data loss on an entire-rack failure. When deleting, the next priority is to remove the
replica from the DataNode with the least amount of available disk space.
Figure 2-7: A Client Writing Data to HDFS (the client application opens /user/chuck/foo.txt through the NameNode, writes block 1 to the first DataNode, which forwards it along the replication pipeline to the remaining DataNodes; acknowledgements propagate back up the pipeline, and the client finally closes the file)

If a block becomes under-replicated, it is kept in a replication priority queue; the block
with the lowest replication count is given the highest priority. These priority values are
considered while placing a new replica. If a data block has only one replica, a new rack is
chosen to place the new replica on. If the block has two replicas on the same rack, the third
replica is placed on a different rack; otherwise the third replica is placed on the same rack
but on a different DataNode.
The NameNode must also ensure that the replicas of a data block are not all located on one
rack. If the NameNode detects such a block, it treats the block as under-replicated,
replicates the data block to a different rack and then deletes the old replica [21].
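The NameNode's repair decisions can be condensed into a small decision function. This is a simplified sketch: the target of 3, the rack names and the returned action strings are illustrative, and the real policy also weighs disk usage and pipeline state.

```python
def replication_action(replica_racks, target=3):
    """Decide how the NameNode might repair a block, given the rack of
    each existing replica (e.g. ["rack1", "rack1", "rack2"])."""
    n = len(replica_racks)
    if n > target:
        return "delete a replica (prefer the most-filled DataNode)"
    if n < target or len(set(replica_racks)) == 1:
        # Under-replicated, or every replica sits on one rack:
        # add a replica, preferring a rack not yet used.
        return "replicate to a new rack"
    return "ok"

print(replication_action(["rack1"]))                    # one replica → replicate
print(replication_action(["rack1", "rack1", "rack1"]))  # single-rack → replicate
print(replication_action(["rack1", "rack1", "rack2"]))  # healthy → ok
```

The single-rack case is treated exactly like under-replication, mirroring the rule stated above: replicate to a different rack first, then delete the redundant copy.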
2.6 Different Operational Modes of Hadoop
The majority of Hadoop settings are contained in three XML configuration files:
core-site.xml, hdfs-site.xml, and mapred-site.xml. By adjusting these configuration files,
Hadoop can be made to work in different operational modes, described as follows:
2.6.1 Local (Standalone) Mode
This is the default mode for Hadoop, in which all three XML configuration files are empty.
Hadoop runs completely on the local machine as a single Java process. In this mode Hadoop
does not use HDFS, because there is no need to communicate with other nodes, and it does not
launch any Hadoop daemons. Its major use is therefore for developing and debugging the
MapReduce application logic without the additional complexity of interacting with the daemons
[6].
2.6.2 Pseudo-Distributed Mode
In pseudo-distributed mode, Hadoop runs on a single node with each daemon running as a separate Java process. Simple XML files configure the single machine in this mode: in core-site.xml and mapred-site.xml we specify the hostname and port number of the NameNode and JobTracker, respectively, and in hdfs-site.xml we specify the default replication factor for HDFS, which is one because everything runs on a single machine. All the daemons running on the single machine communicate with each other as if they were distributed over a cluster [6].
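For example, a minimal pseudo-distributed configuration might look like the following (property names are those of the Hadoop 1.x line used in this report; the port numbers shown are conventional defaults, not requirements):

```xml
<!-- core-site.xml: where clients find the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single machine, so one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: where clients find the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```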
2.6.3 Fully Distributed Mode
The fully distributed mode runs Hadoop on clusters ranging from a few nodes to thousands of nodes. It is the mode in which we get the actual benefits of distributed computing and distributed storage. Each cluster has a master node hosting the NameNode and JobTracker; a backup server hosting the Secondary NameNode daemon; and slave nodes running both the DataNode and TaskTracker daemons. The default replication factor for HDFS is 3, and this value can be increased or decreased in hdfs-site.xml based on the application [6].
2.7 Web-based Cluster UI
Hadoop provides web interfaces that help monitor the health of a Hadoop cluster. These browser interfaces let the user access information much faster than digging through logs and directories.
The NameNode hosts, on port 50070, a general report containing the status of each DataNode in the cluster. The storage available on each individual node can also be determined through this web interface.
Figure 2-8: A snapshot of HDFS web interface
Hadoop provides a similar status overview of MapReduce jobs, hosted at port 50030 by the JobTracker. This report contains the status of ongoing MapReduce tasks as well as reports on completed jobs.
Figure 2-9: A snapshot of the MapReduce web interface (JobTracker)
Similarly the TaskTracker hosts a report about running and non-running tasks at port 50060.
Figure 2-10: A snapshot of the MapReduce web interface (TaskTracker)
2.8 Advantages and Disadvantages of Hadoop
2.8.1 Advantages
- Hadoop is a platform that provides both distributed storage and distributed computation.
- All tasks (map and reduce) are independent, so any failed task can be restarted without affecting other tasks; independent tasks also enable parallel, fast execution of a job.
- The major advantage of Hadoop over other distributed systems is its flat scalability curve. Some distributed frameworks, such as MPI (Message Passing Interface), perform better on a small number of machines, but as the number of machines increases their performance degrades non-linearly [13]. Hadoop, in contrast, is a highly scalable platform whose performance grows almost directly in proportion to the number of machines available, with little re-work [13].
- Hadoop offers cost-effective storage for terabytes of data using HDFS.
- HDFS can store data reliably, even in the presence of failures.
- HDFS uses large block sizes (64 MB, 128 MB, etc.) and gives better performance when manipulating large files.
- HDFS can store and process different types of data, both structured and unstructured, unlike traditional relational database systems, which can process only structured data.
- HDFS replicates files a user-specified number of times (the default is 3), which makes it tolerant to hardware and software failures; this data is also distributed among several nodes in the cluster, which helps in parallel processing.
- Another major advantage of Hadoop is that the computation logic is moved to the data, rather than the data being moved close to the computation logic.
2.8.2 Disadvantages
- The master nodes of HDFS and MapReduce, i.e., the NameNode and JobTracker, are single points of failure, so there is a risk of data loss.
- Joining multiple datasets (structured, semi-structured, and unstructured) is tricky and slow.
- HDFS is not efficient at handling large numbers of small files.
- Hadoop has a Write Once Read Many (WORM) architecture and hence is not suitable for applications whose data needs to be modified over time.
- Hadoop is a batch processing system designed for high throughput at the expense of high latency, and hence is not suitable for real-time systems that require extremely low latency.
3. BIOMETRIC SYSTEMS
A biometric system operates by capturing biometric data from an individual, performing feature extraction on the captured data, and then comparing the resulting feature set with the template set in the database. Biometric modalities fall into two main categories: physical (or passive) and behavioral (or active). Physical modalities, which relate to physical traits of the person, include fingerprint, iris recognition, face recognition, DNA, palm print, retina, and hand geometry. These are highly stable, change very little over time, and hence rarely require re-enrollment. Behavioral modalities, which relate to a person's patterns of behavior, include typing rhythm, voice, and gait. These change easily, and hence the accuracy ranges for behavioral modalities are much wider than those for physical modalities [11].
3.1 Modes of Biometric System
Biometric systems can be operated in two different ways: Verification mode or Identification
mode [9].
3.1.1 Verification Mode
In this mode the system checks the user's identity by performing a one-to-one comparison of the captured biometric with that user's own biometric template(s) stored in the system database, in order to confirm that the user matches the claimed identity. A user who wishes to be verified must provide a primary identifier such as a personal identification number (PIN), a user name, or an ID card [9].
3.1.2 Identification Mode
In this mode the system has no primary identifier for the individual, but recognizes him by comparing the captured biometric template against the entire database for a match. These systems are much slower because of the overhead of performing a one-to-many comparison to identify an individual [9].
Figure 3-1: Enrollment, Verification and Identification stages of a Biometric System
3.2 Biometrics System Design
All biometric identity systems are designed using the following four main modules [9]:
- Sensor module (data collection), which captures the biometric data of an individual using biometric sensor hardware.
- Feature extraction module, in which feature extraction methods and algorithms are applied to the acquired biometric data in order to extract a set of salient or discriminatory features.
- Matcher module, in which the query biometric template is compared against the stored biometric templates in order to generate a matching score. That score is used to make decisions: in verification systems, if the similarity score exceeds a given threshold, the system decides that the two templates belong to the same person, whereas in identification systems the templates with the highest similarity scores are selected to determine whether anyone in the stored biometric database was identified.
- System database module, which stores the biometric templates of the users enrolled in the biometric system. After feature extraction, the biometric data is organized into a characteristic vector, so that it is small and can be reliably processed, rather than being stored in its original format (e.g., a digital image).
4. EFFICIENT PATTERN MATCHING OF LARGE SCALE
BIOMETRICS DATA
This report considers a biometrics system operated in identification mode, i.e., the query biometric image is compared against the entire database to find the most similar items. In response to growing biometric databases there is a need for a highly scalable, low latency method of performing fuzzy matching, i.e., returning the most similar items when there is no exact match [2].
4.1 Fuzzy Matching
Fuzzy matching is an operation that can compare two high-dimensional items and determine
whether they are similar to each other or not [2].
Fuzzy matching of biometric images involves extracting feature vectors from those images and
then performing a distance measure over the resulting feature vectors to produce a similarity
score.
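As a sketch of the distance step, a Euclidean distance over two feature vectors can be computed as follows. The feature values below are made-up numbers for illustration, not real fingerprint features:

```java
public class FuzzyDistance {
    /** Euclidean distance between two equal-length feature vectors. */
    public static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two hypothetical feature vectors extracted from fingerprint images.
        double[] query  = {0.12, 0.80, 0.45};
        double[] stored = {0.10, 0.75, 0.50};
        double score = euclidean(query, stored);
        // The pair is considered a match if the score falls below a
        // user-chosen threshold, as in the matching stage described above.
        System.out.println("distance = " + score);
    }
}
```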
Figure 4-1: Fuzzy matching of fingerprint images lifted from a crime scene and from the law enforcement database
4.2 Hadoop and Biometrics System
It was clear that Hadoop would provide a good platform to perform low latency fuzzy matching
on the biometric databases [1].
- The Mahout and MapReduce frameworks are used for indexing and clustering similar biometric images to reduce the overall search space (Apache Mahout, which runs on top of Hadoop, is used to implement the clustering algorithms).
- HDFS, with its data distribution and replication features, is used as a reliable file storage system for the cluster points generated by Apache Mahout/MapReduce. These features help provide enough parallelism to support fast queries running on multiple machines.
4.3 Development Method
The dataset used here for fuzzy matching consists of 400 ordinary images, for a database size of around 500 MB. The fuzzy matching process matches one query image against the entire dataset. In this section we demonstrate the development process.
The development process of Fuzzy matching on Hadoop is divided into two steps –
1. A clustering process to reduce the total search space for each query.
2. Performing low latency fuzzy matching in parallel across the Hadoop instance to retrieve
the most similar images.
4.3.1 Clustering on Hadoop
Clustering is a machine learning technique that groups data items which are similar to each other. A clustering process was implemented here in order to reduce the total search space for each query. For that we used two clustering algorithms, canopy clustering and k-means clustering, from the Apache Mahout project. These clustering algorithms assign each biometric to a bin. The 'mean biometric' of a bin represents the average of the biometrics in that bin.
4.3.1.1 Canopy Clustering
Canopy clustering is a fast, approximate clustering technique, i.e., it can create clusters quickly, within a single pass over the data. The algorithm uses a fast distance measure (EuclideanDistanceMeasure, TanimotoDistanceMeasure, etc.) and two distance thresholds T1 and T2, with T1 > T2. It begins with a set of points and an empty list of canopies, removes one data point at random, and adds it to the canopy list as a center. It then iterates through the remaining points in the data set one by one; for each data point it calculates the distance to all the centers in the list. If the distance is less than T1, the point is added to that canopy. If the distance is less than T2, the point is also removed from the data point list, which prevents it from forming a new canopy. This process is repeated until the data set list is empty [7].
The main weakness of canopy clustering is that it may not give accurate and precise clusters (because it generates clusters within a single pass over the data). Its advantage over k-means, however, is that it generates a reasonable number of clusters without the number of clusters, k, having to be specified.
Canopy clustering in action:
- Map phase: all points within distance T2 of a center are removed from the data set and prevented from forming new canopies. Points within distance T1 are placed in the same canopy but remain available to other canopies. The mapper does all of this in a single pass [7].
- Reduce phase: the reducer computes the average of the centroids and merges close canopies [7].
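The loop described above can be sketched as a plain, single-machine routine. This is an illustrative simplification, not the Mahout MapReduce implementation; to keep it deterministic it takes the first remaining point as each new center instead of a random one:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CanopySketch {
    /** One canopy: a center plus the points that fell within T1 of it. */
    public static class Canopy {
        public final double[] center;
        public final List<double[]> members = new ArrayList<>();
        Canopy(double[] center) { this.center = center; }
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Single-pass canopy generation with thresholds t1 > t2. */
    public static List<Canopy> cluster(List<double[]> points, double t1, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<Canopy> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Take the next point as a new canopy center.
            double[] center = remaining.remove(0);
            Canopy canopy = new Canopy(center);
            canopy.members.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = euclidean(center, p);
                if (d < t1) canopy.members.add(p); // loosely belongs to this canopy
                if (d < t2) it.remove();           // too close to seed another canopy
            }
            canopies.add(canopy);
        }
        return canopies;
    }
}
```

With well-separated points, each group collapses into one canopy after a single pass, which is exactly the property exploited to seed k-means below.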
4.3.1.2 K-means Clustering
Given n data points, the k-means algorithm requires the user to supply the number of clusters, k, and k initial centroid points as input parameters. The algorithm starts by assigning each data point to the closest cluster, using a distance measure. A new center is then calculated for each cluster by averaging all the data points assigned to it. This process is repeated until the maximum iteration count is reached or until the centroids converge to a fixed point from which they no longer move [7].
K-means clustering in action:
- Map phase: the mapper assigns each point to the cluster nearest to it [7].
- Reduce phase: the reducer calculates the average of all data points assigned to a cluster, thus determining the new centroid [7].
After each iteration, the output of the reduce phase is fed back into the same loop until the centroids converge to a fixed point.
The main disadvantage of k-means clustering is that the user has to specify the value of k, the number of clusters, as well as the initial centroids. Given those initial conditions, however, k-means will try to move the centers to their optimal positions.
4.3.1.3 Finding the perfect k using Canopy Clustering
For most real-world clustering problems, the number of clusters is not known beforehand. Good initial centroid estimation is also important because it greatly affects the run time of k-means: a good estimate of the centroid points helps the algorithm converge in fewer passes. To overcome this problem we combined the canopy and k-means clustering algorithms, i.e., the canopy centroids generated by canopy clustering are given as the initial centroids to the k-means algorithm [7].
4.3.1.4 K-means Clustering using MPI (Message Passing Interface)
The following section describes how k-means clustering is implemented on typical distributed systems that use MPI (Message Passing Interface), and how it differs from clustering on Hadoop. An MPI program is loaded into the local memory of each machine, and every processor/process has a unique ID. MPI allows the same program to be executed on multiple data. When needed, a process can communicate and synchronize with others by calling MPI library routines [17].
At first, the master machine chooses a set of initial cluster centers and broadcasts them to all local machines. Each machine assigns each of its points to the most similar center by calculating a distance measure between its points and the centers. The local sums of all clusters are then calculated without any inter-machine communication. The master machine then obtains the sum of all points in each cluster to calculate the new centers, and broadcasts them to all the machines (most of the communication occurs here; this is a reduction operation in MPI). Since the local sums on each node are k vectors of length k, the communication cost per k-means iteration is on the order of k^2. So for k-means, each iteration transfers O(k^2) values after O(k^2 * (n/p)) operations spent calculating the distances between n/p points and k cluster centers, where p is the number of local machines. MPI is thus more complicated due to its various communication functions; instead of using disk I/O, a function sends and receives data to and from a node's memory [17].
In the Hadoop implementation, a mapper is responsible for loading the current set of clusters from disk in each iteration. Once the clusters are in memory, the mapper reads through the list of points assigned to it one at a time and identifies which cluster each point belongs to. Once a point has been assigned to a cluster, it is written out to a file. The overall cost of the Hadoop map operation is kR + nkRW, where k is the number of clusters, n is the number of points a given mapper is responsible for clustering, and R and W are the read and write times to disk [18].
The Hadoop reducer first reads in the current set of clusters from file and then reads in the output from all the mappers. As the reducer reads in the data, it writes out a modified cluster once it has finished processing all points assigned to that cluster. The overall cost of the Hadoop reduce operation is kR + mnRW, where m is the number of mappers in the system [18].
The mapper runtime can thus be considered an O(nk) operation and the reducer runtime an O(n) operation, i.e., the majority of the computation is spent in the mappers, while the role of the reducers in the overall computation is relatively small [18].
Figure 4-2: Bulk clustering and real time classification
4.3.2 Low Latency Fuzzy Matching
This process performs low latency fuzzy matching of the query biometric against the biometrics in a cluster, i.e., it returns the most similar items together with a similarity score. The process involves two stages:
1. In the first stage we find the cluster(s) that a given query biometric is closest to, by comparing the distance between the query biometric and each cluster centroid to a threshold value set by the user. If the distance is less than the threshold value, the cluster is considered closest to the query biometric, and its cluster ID is saved for the next stage, so that only the clusters closest to the query biometric are searched.
The client then looks up the locations of all blocks/chunks of those closest clusters by querying the master server. These blocks are hosted by several HDFS DataNodes, so a single cluster can be searched by several machines at once.
Figure 4-3: Client Communication with Master server while performing Low latency fuzzy
matching
2. In this stage the client submits the query biometric and the ID of the closest cluster to the DataServers running alongside the HDFS DataNodes. The DataServers calculate the distance between the query biometric and the cluster points. This distance is also compared against the threshold value set by the user: if the distance is less than the threshold, the items are considered the most similar, and they are returned as output.
Figure 4-4: Client Communication with Data server while performing Low latency fuzzy matching
5. HADOOP IMPLEMENTATION
5.1 Development Method
Due to a lack of access to large-scale biometric databases, ordinary images were used in this project to implement the fuzzy matching technique, but the same procedure can be applied to biometric databases. Our work on developing large-scale pattern matching of biometric images on Hadoop is organized as follows:
- Setting up a Hadoop cluster in pseudo-distributed mode.
- Creating vectors from images.
- Implementing clustering algorithms from the Apache Mahout project to assign each image vector to a cluster.
- Performing the actual fuzzy matching process, i.e., comparing the query image against the closest cluster(s) to return the most similar items.
5.2 Hadoop Cluster Setup
We ran this experiment with Hadoop on Ubuntu Linux in a single-node setup, i.e., in pseudo-distributed mode with all daemons (NameNode, DataNode, JobTracker, TaskTracker) running on that single machine. Since Hadoop runs on one node, the replication factor is set to 1.
5.3 Creating Vectors from Images
Since Hadoop works on records, the images are first converted into vectors and stored in HDFS as text files. One of the best ways to convert images to vectors is to use feature extraction/feature selection techniques such as PCA (Principal Component Analysis), SIFT (Scale-Invariant Feature Transform), or Harris corner detection, which yield high-quality discriminant features of the images [19].
5.4 Running Clustering on Hadoop
5.4.1 Apache Mahout and Apache Maven
Mahout is an open source machine learning library whose algorithms are implemented as MapReduce jobs that can execute on a Hadoop cluster. Mahout provides several clustering algorithms; canopy clustering and k-means clustering are two of them.
Along with Mahout we also installed Apache Maven, a command-line tool used to build and manage projects: it handles dependencies and compiles and executes code. Using the project's Maven configuration file, pom.xml, we can declare both Hadoop's and Mahout's dependencies, so that we can easily compile and execute Java programs that use classes from both Hadoop and Mahout.
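For instance, the dependency section of the pom.xml might contain entries like the following. The version numbers shown are plausible for the Hadoop 1.x / Mahout era of this report and should be adjusted to the versions actually installed:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
  </dependency>
</dependencies>
```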
5.4.1.1 Preparing Vectors for use by Mahout
To be clustered, objects must first be converted into vectors. (In Mahout, vectors are implemented as three different classes: DenseVector, RandomAccessSparseVector, and SequentialAccessSparseVector.) For k-means clustering, which calculates the magnitudes of vectors repeatedly, SequentialAccessSparseVector is a better choice than RandomAccessSparseVector and DenseVector, which suit algorithms that perform random insertions and updates of vector values.
Now we need to convert the text documents containing feature vectors to the vectors of type
SequentialAccessSparseVector. For that we need to use two important tools:
1. The first is the SequenceFilesFromDirectory tool, which converts the text documents into SequenceFiles. SequenceFile is a format from the Hadoop library consisting of a series of key/value pairs.
Mahout command: the command that converts text documents containing image vectors into SequenceFiles is as follows:
$ bin/mahout seqdirectory -c UTF-8 -i /user/sneha/bioimg-vectors -o /user/sneha/bioimg-sequence
where -i is the input directory of text documents containing image vectors and -o is the output folder that will hold the SequenceFile form of those documents.
SequenceFile output: the SequenceFile output, which is in the form of key/value pairs, can be inspected using the Mahout utility SeqDumper:
$ bin/mahout seqdumper -i /user/sneha/bioimg-sequence/part-m-00000 | more
2. The second is the SparseVectorsFromSequenceFiles tool, which converts the SequenceFile-format documents into sparse vectors.
Mahout command: the command that converts SequenceFile data into sparse vectors is as follows:
$ bin/mahout seq2sparse -i /user/sneha/bioimg-sequence -o /user/sneha/bioimg-seqvectors
where -i is the input SequenceFile directory and -o is the output sparse-vector directory.
Sparse vector output: SeqDumper can also be used to inspect the sparse vectors:
$ bin/mahout seqdumper -i /user/sneha/bioimg-seqvectors/tfidf-vectors/part-r-00000 | more
5.4.1.2 Canopy Clustering
Running canopy clustering:
The canopy clustering algorithm can be run by invoking CanopyDriver.main, a built-in class in Mahout. The command-line invocation is as follows:
$ bin/mahout canopy -i /user/sneha/bioimg-seqvectors/tfidf-vectors/ -o /user/sneha/bioimg-canopy-centroids -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 110 -t2 100
where -i is the input vectors directory, -o is the output directory containing the centroids, -dm is the distance measure, -t1 is the T1 threshold value, and -t2 is the T2 threshold value (T1 must be greater than T2).
5.4.1.3 K-means Clustering
Running k-means clustering:
The k-means clustering algorithm can be run by invoking KMeansDriver.main, a built-in class in Mahout. The command-line invocation is as follows:
$ bin/mahout kmeans -i /user/sneha/bioimg-seqvectors/tfidf-vectors/ -o /user/sneha/bioimg-kmeans -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -c /user/sneha/bioimg-canopy-centroids/clusters-0-final/ -cd 0.1 -ow -x 20 -cl
where -i is the input vectors directory, -o is the output directory containing the cluster points, -dm is the distance measure, -c is the input clusters directory (here, the output directory of canopy clustering), -cd is the convergence delta, -x is the maximum number of iterations, and -cl ensures that the assignment of each point to its cluster is written out, i.e., a clusteredPoints directory is created inside the specified output directory.
5.4.1.4 Cluster Output
The output of both the canopy and k-means clustering algorithms from the previous step is a set of clusters, each containing biometrics that are statistically similar to each other. In Mahout the cluster output can be inspected using the ClusterDumper tool, executed as follows:
$ bin/mahout clusterdump -dt sequencefile -d /user/sneha/bioimg-seqvectors/dictionary.file-0 -i /user/sneha/bioimg-kmeans/clusters-20-final/ -o bioimg-kmeans
The output of ClusterDumper is a block of information including each cluster's top-weighted terms and its centroid vector (the average of all the points in that cluster). The dimensions of the vector are translated to the corresponding words in the dictionary, and this output can be stored in a text file, bioimg-kmeans.
5.5 Fuzzy Matching
The final step in our experiment is to perform fuzzy matching of the query image against the images in the closest cluster(s), i.e., returning the most similar items with a similarity score. This process involves two stages:
1. In the first stage we find the cluster(s) that a given query image is closest to, by comparing the distance between the query image and each cluster centroid to a threshold value set by the user.

double d = measure1.distance(clusters.get(i).getCenter(), value.get());
// Euclidean distance between the vector generated from the
// query biometric and each cluster centroid
if (d <= threshold) {
    clussearch[c] = i;  // store the IDs of clusters closest to the query biometric
    c++;
}
If the distance is less than the threshold value, the cluster is considered closest to the query image, and its cluster ID is saved for the next stage, so that only the clusters closest to the query image are searched.
2. In this stage we calculate the distance between the query image and the points (stored images) of the clusters saved in the previous stage. This distance is also compared against the threshold value set by the user: if the distance is less than the threshold, the items are considered the most similar and are returned as output.

double d = measure1.distance(value1.getVector(), value2.get());
if (d <= threshold) {
    System.out.println("distancepoint: " + d + " clusterId: " + key1.get());
    System.out.println("clusterpoint: " + value1.getVector());
}
6. CONCLUSION AND FUTURE WORK
6.1 Conclusion
Hadoop is a versatile tool, best known for distributed storage and distributed computation over massive amounts of data. Hadoop was chosen as the framework for processing biometric data because of its ability to handle large datasets of unstructured data; because it suits the write-once-read-many model (biometric data, once collected, is generally used for read operations); and because it provides high scalability (biometric data collection is increasing day by day, continually increasing the data size, and Hadoop can absorb such additional data by adding more nodes). Hadoop also moves code to the data, rather than transferring large data sets to the computation logic. Another primary reason to use Hadoop is its high degree of fault tolerance.
We also developed a distributed fuzzy matching technique that returns the most similar items. A clustering process was implemented in order to reduce the total search space for each query; we used the canopy and k-means clustering algorithms from Apache Mahout to assign each image to a bin. When answering a query, we first look for the clusters that the given image is closest to, and then perform fuzzy matching against the points in those closest clusters to return the most similar items.
Along with its many advantages, the Hadoop framework also has limitations, mainly its centralized namespace storage, which results in a single point of failure. It is also unsuitable for applications that use large numbers of small files. While Hadoop works well for research and non-critical applications, it is not suitable for latency-critical real-world applications, as each job incurs startup time (at least 30 seconds).
6.2 Future Work
In this work we performed pattern matching on ordinary images using the Hadoop framework; in the future we plan to perform the same operations on biometric images. In this work we also clustered all the images in one big effort, i.e., we performed batch (offline) clustering. In a biometrics system, however, new biometrics arrive during the enrollment phase and need to be clustered too, and it would be quite expensive to keep re-running the batch clustering for each new biometric. Hence, in the future we plan to implement online clustering: for each new biometric we will use canopy clustering to assign it to the cluster whose centroid is closest. If the new biometric is not associated with any of the old clusters, it will form a new canopy.
At present the pattern matching system is implemented on Hadoop in pseudo-distributed mode; in the future we plan to implement this project in fully distributed mode.
REFERENCES
[1] Jason Trost and Lalit Kapoor, "Hadoop for Large-scale Biometric Databases," Biometric-Hadoop Summit, 2010.
[2] Jon Zuanich, "Tackling Large Scale Data in Government."
[3] Shelly and N. S. Raghava, "Iris Recognition on Hadoop: A Biometrics System Implementation on Cloud Computing," Proceedings of IEEE CCIS, 2011.
[4] Apache Hadoop web site [online]. Available: http://hadoop.apache.org/
[5] Apache Mahout web site [online]. Available: http://mahout.apache.org/
[6] Chuck Lam, Hadoop in Action. Manning Publications Co., 2011.
[7] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman, Mahout in Action. Manning Publications Co., 2012.
[8] Eric Sammer, Hadoop Operations. O'Reilly Media, Inc., 2012.
[9] Anil K. Jain, Arun Ross, and Salil Prabhakar, "An Introduction to Biometric Recognition," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 1, January 2004.
[10] Madhavi Vaidya and Swati Sherekar, "Study of Biometric Data Processing by Hadoop," MPGI National Multi Conference, April 2012.
[11] Wikipedia [online]. Available: http://en.wikipedia.org/wiki/Biometrics
[12] Yahoo! Hadoop Tutorial [online]. Available: https://developer.yahoo.com/hadoop/tutorial/module1.html
[13] Shivaraman Janakiraman, "Fault Tolerance in Hadoop for Work Migration."
[14] U.S. Department of Homeland Security, Budget in Brief: Fiscal Year 2014. Retrieved from: http://www.dhs.gov/sites/default/files/publications/MGMT/FY%202014%20BIB%20-%20FINAL%20-508%20Formatted%20(4).pdf
[15] Rakesh Rathi and Sandhya Lohiya, "Big Data and Hadoop," International Journal of Advanced Research in Computer Science & Technology (IJARCST), 2014.
[16] Amit Mhatre, Srinivas Palla, Sharat Chikkerur, and Venu Govindaraju, "Efficient Search and Retrieval in Biometric Databases."
[17] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang, "Parallel Spectral Clustering in Distributed Systems."
[18] Kathleen Ericson and Shrideep Pallickara, "On the Performance of High Dimensional Data Clustering and Classification Algorithms."
[19] Dilipsinh Bheda, Mahasweta Joshi, and Vikram Agrawal, "A Study on Features Extraction Techniques for Image Mosaicing," International Journal of Innovative Research in Computer and Communication Engineering, Vol. 2, Issue 3, March 2014.
[20] Xiaoling Lu and Bing Zheng, "Distributed Computing and Hadoop in Statistics."
[21] Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas, "The Hadoop Distributed File System."