Using Hadoop as a platform for quick and efficient pattern matching of large scale biometric databases

Sneha Mallampati

Problem Report submitted to the College of Engineering and Mineral Resources at West Virginia University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Dr. Vinod Kulathumani, Ph.D., Chair
Dr. Elaine M. Eschen, Ph.D.
Dr. Roy S. Nutter, Ph.D.

Lane Department of Computer Science and Electrical Engineering
Morgantown, West Virginia
2014

Keywords: Hadoop, MapReduce, HDFS, Biometric Systems, Fuzzy Matching

ABSTRACT

Using Hadoop as a platform for quick and efficient pattern matching of large scale biometric databases

Sneha Mallampati

Biometric systems have evolved into a reliable mechanism for accurately determining an individual's identity in applications such as access control, personnel screening and criminal identification. Several biometric modalities are in use, including iris recognition, facial recognition, fingerprints and keystroke dynamics.

Over time, the volume of biometric data being stored and processed for identification purposes keeps growing. Despite this scale, such systems must maintain high accuracy and fast information retrieval. Traditional parallel architectures and database systems are largely inadequate for efficient biometric pattern matching over such huge datasets. In this report, Hadoop is used as an alternative architecture for implementing biometric systems.

Hadoop is a popular open source framework known for its massive cluster-based storage, and it provides a good platform on which to build a new distributed fuzzy-matching system. Hadoop uses key/value pairs as its basic unit for working with unstructured data. It implements the MapReduce framework to process vast amounts of data in parallel across a cluster in a reliable, fault-tolerant manner. The data distribution and replication features of the Hadoop Distributed File System (HDFS), the distributed storage system used by Hadoop, provide reliable storage with enough parallelism to support fast queries running on multiple machines. This report describes how a biometric system can be implemented on Hadoop. To this end, we performed a large scale pattern matching technique called fuzzy matching to identify similar images on Hadoop. Due to lack of access to large scale biometric databases, ordinary images were used to implement the fuzzy matching technique, but the same procedure can be applied to biometric databases.

ACKNOWLEDGEMENTS

I would like to thank Dr. Vinod Kulathumani for his patience and constant support throughout my research. He helped me greatly with his guidance and many valuable suggestions.

I would also like to acknowledge Dr. Elaine Eschen and Dr. Roy Nutter for serving on my committee.

I would like to thank all my friends and family members for their motivation and continuous support throughout my project.

Finally, I would like to thank the most important people in my life, my parents and my beloved brother, for their encouragement, love and trust.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
1. INTRODUCTION
2. HADOOP OVERVIEW
   2.1 Apache Hadoop
   2.2 Hadoop Components
      2.2.1 Hadoop MapReduce
      2.2.2 Hadoop Distributed File System (HDFS)
   2.3 The Building Blocks of Hadoop
      2.3.1 NameNode
      2.3.2 DataNode
      2.3.3 JobTracker
      2.3.4 TaskTracker
      2.3.5 Secondary NameNode
   2.4 Hadoop Distributed File System Client
      2.4.1 Reading a File
      2.4.2 Writing to a File
   2.5 Replication Management
   2.6 Different Operational Modes of Hadoop
      2.6.1 Local (Standalone) Mode
      2.6.2 Pseudo-Distributed Mode
      2.6.3 Fully Distributed Mode
   2.7 Web-based Cluster UI
   2.8 Advantages and Disadvantages of Hadoop
      2.8.1 Advantages
      2.8.2 Disadvantages
3. BIOMETRIC SYSTEMS
   3.1 Modes of Biometric System
      3.1.1 Verification Mode
      3.1.2 Identification Mode
   3.2 Biometrics System Design
4. EFFICIENT PATTERN MATCHING OF LARGE SCALE BIOMETRICS DATA
   4.1 Fuzzy Matching
   4.2 Hadoop and Biometrics System
   4.3 Development Method
      4.3.1 Clustering on Hadoop
      4.3.2 Low Latency Fuzzy Matching
5. HADOOP IMPLEMENTATION
   5.1 Development Method
   5.2 Hadoop Cluster Setup
   5.3 Creating Vectors from Images
   5.4 Running Clustering on Hadoop
      5.4.1 Apache Mahout and Apache Maven
   5.5 Fuzzy Matching
6. CONCLUSION AND FUTURE WORK
   6.1 Conclusion
   6.2 Future Work
REFERENCES

LIST OF FIGURES

Figure 2-1: MapReduce Framework
Figure 2-2: HDFS Architecture
Figure 2-3: NameNode/DataNode interaction in HDFS
Figure 2-4: JobTracker and TaskTracker interaction
Figure 2-5: Topology of a Hadoop cluster
Figure 2-6: A Client Reading Data from HDFS
Figure 2-7: A Client Writing Data to HDFS
Figure 2-8: A snapshot of the HDFS web interface
Figure 2-9: A snapshot of the MapReduce web interface (JobTracker)
Figure 2-10: A snapshot of the MapReduce web interface (TaskTracker)
Figure 3-1: Enrollment, Verification and Identification stages of a Biometric System
Figure 4-1: Fuzzy Matching of fingerprint images lifted from a crime scene and from the law enforcement database
Figure 4-2: Bulk clustering and real time classification
Figure 4-3: Client Communication with the Master server while performing Low latency fuzzy matching
Figure 4-4: Client Communication with the Data server while performing Low latency fuzzy matching

1. INTRODUCTION

Biometric systems have evolved into a reliable mechanism for accurately determining an individual's identity in applications such as criminal identification, personnel screening and access control (replacing passwords and tokens), owing to their reliability and uniqueness. Several biometric modalities are in use, such as fingerprint, face recognition, palm print, iris and retina recognition (related to the shape of the body), and typing rhythm, voice and other traits related to a person's pattern of behavior.

The United States Department of Homeland Security checks a person's biometrics against its entire database to determine whether that person is using a fraudulent identity. Currently the US-VISIT program maintains 148 million identities, and around 188,000 identities are enrolled or verified daily [14]. Likewise, the Unique Identification Authority of India (UIDAI) plans to maintain a database of over 1 billion Indian residents containing biometric (face, fingerprint and iris) images along with other data [11]. Such massive biometric systems require sophisticated networks of computers to identify individuals accurately, reliably and quickly.

In traditional High-Performance Computing (HPC) systems, data is stored on large shared centralized storage systems such as a Storage Area Network (SAN) or Network Attached Storage (NAS). When a job is executed, data is fetched from the central storage system and the results are stored back on it after processing. This can cause collisions when many workers try to fetch the same data at the same time, and with large data sets it quickly leads to bandwidth contention.

Typical distributed systems also use MPI (Message Passing Interface) based architectures for parallel computing. They may perform well on a small number of machines, but their performance does not scale linearly as more machines are added [12].

Hadoop, in contrast, is designed to provide a flat scalability curve. A program written in a distributed framework other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines, which may involve rewriting the program several times [13].

Traditional databases index records (structured data) in alphabetical or numerical order for efficient retrieval, but biometric data is unstructured and has no natural sorting order [16]. So traditional parallel architectures and database systems are largely inadequate for efficient biometric pattern matching over such huge datasets (often referred to as Big Data).

Problem Statement: In order to handle such massive data sets and perform searches in a reasonable amount of time, we explored a cloud based system and software framework called Apache Hadoop. The main objective of this report is to explain how the Hadoop framework can be leveraged to solve this problem.

Hadoop is an open-source cloud computing environment designed for distributed processing of massive amounts of data (in the terabyte range). It is a cluster-based data storage system, so vast amounts of data can be processed in parallel on large clusters using the MapReduce principle. A MapReduce job splits the input data into independent chunks so that they can be processed by map tasks in a completely parallel manner. The framework sorts the output of the map tasks, which is then fed as input to the reduce tasks.

Hadoop also uses a distributed file system called the Hadoop Distributed File System (HDFS), based on the Google File System, to divide files among several nodes so that each node's processor works on its own storage. HDFS is highly fault-tolerant and provides high throughput for applications with large data sets compared to other distributed file systems. In Hadoop, computation is performed on the machine where the block is stored, i.e., the computation logic is moved across the network rather than the data block.

Hadoop is also designed to have a very flat scalability curve. The underlying platform manages the data and hardware resources and provides performance growth proportionate to the number of machines available. A Hadoop program that runs well on a few nodes is also scalable to many nodes with only minor changes in code [13].

In Hadoop, data can originate in any form (semi-structured or unstructured), but it is eventually transformed into key/value pairs for the processing functions to work on [6].

Thus we expect Hadoop to be a good platform for large-scale biometric search, i.e., for implementing a highly scalable, low-latency method of fuzzy matching (returning the most similar items when there is no exact match); the challenge is to see how the MapReduce framework can be leveraged for biometric databases.

The ability to run MapReduce programs over the entire biometric database helps in clustering it into several bins using the canopy and k-means clustering algorithms from the Apache Mahout project. Each bin then contains biometrics that are statistically similar, along with a 'mean biometric' that represents the average of the biometrics contained in that bin. During querying, only the bins closest to the query biometric are searched. This allows us to avoid searching a large portion of the database and to search only the bins that contain the items most similar to the query biometric [2].

The goal of this report is to give a brief overview of Hadoop, of how to perform efficient pattern matching of large scale biometric data, and of how to implement this low latency pattern matching (fuzzy matching) technique on Hadoop. Due to lack of access to large scale biometric databases, ordinary images were used in this project to implement the fuzzy matching technique, but the same procedure can be applied to biometric databases.

2. HADOOP OVERVIEW

2.1 Apache Hadoop

Apache Hadoop is an open source software framework for the distributed processing of large amounts of data across clusters of computers using a simple programming paradigm, the MapReduce programming model. The MapReduce framework provides automatic parallelization, fault tolerance, and the scale to process hundreds of terabytes of data in a single job over thousands of machines. Hadoop also uses a distributed file system, the Hadoop Distributed File System (HDFS), to support large-scale, data-intensive distributed processing applications. Hadoop runs the processing logic provided by the user on the machine where the data resides rather than moving the data across the network; Hadoop is designed on the premise that the programs (code) are smaller than the data and are easier to move around [6].

2.2 Hadoop Components

2.2.1 Hadoop MapReduce

Hadoop MapReduce is an implementation of the MapReduce programming paradigm introduced by Google in 2004, which is used for processing large amounts of data in parallel on clusters of computers (nodes). The term MapReduce refers to two independent and distinct kinds of tasks: "map" tasks and "reduce" tasks. A MapReduce job splits the input data into a number of independent chunks (key/value pairs), which are processed by different map tasks in a completely parallel manner. The framework sorts the outputs of the map tasks (sets of independent key/value pairs), which are then given as input to the reduce tasks. Each reducer reduces the set of values that share a common key to a smaller set of values. The outputs of all reduce tasks are combined to produce the complete output of the MapReduce job [4]. The main advantage of the MapReduce programming model is its distributed processing of the map and reduce operations: since each task is completely independent of the others, all map tasks can be performed in parallel, reducing the total computation time.
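
As an illustration of the map and reduce structure just described, the classic word-count example in Hadoop's Java MapReduce API is sketched below; the clustering jobs used later in this report are built on the same map/reduce pattern, only with feature vectors instead of words. This is a minimal sketch assuming the Hadoop 1.x org.apache.hadoop.mapreduce API, with input and output paths supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map task: emit (word, 1) for every word in this task's input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce task: sum the counts for each word (all values sharing a key).
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }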

2.2.2 Hadoop Distributed File System (HDFS)

HDFS is a file system designed to support large scale distributed data processing under the MapReduce framework. HDFS provides high throughput, streaming reads and writes of very large files; it can store a dataset of almost 100 TB as a single file, which would be an overwhelming task for other file systems [20]. Hadoop takes its input from HDFS and also stores its output on HDFS.

HDFS is similar to traditional file systems (NTFS, EXT3, HFS Plus, etc.) in many ways: files are stored as data blocks, and metadata keeps track of the mapping between filenames and data block locations, the directory tree structure, permissions, and so on. However, while those file systems can span multiple disks, they cannot span multiple computers as HDFS does. Also, traditional file systems are implemented as kernel modules, i.e., they are tightly linked to their operating system's kernel; HDFS, in contrast, runs in user space, so it can be used on any operating system that supports Java.

Another major difference between HDFS and traditional file systems is its block size: the HDFS block size is 64 MB by default, and users can configure it to 128 MB, 256 MB or larger as the application requires, whereas other file systems use a 4 KB or 8 KB block size. A larger block size means data can be written in larger contiguous chunks on disk. This results in better performance for applications that read and write data in large sequential operations, by minimizing drive seeks [8].

Figure 2-1: MapReduce Framework

HDFS replicates data blocks across multiple machines in the cluster to make the system fault-tolerant without depending on any specialized data protection subsystem. By default, each data block is replicated three times. Data storage in HDFS is write-once, read-many: once written, a file cannot be modified, which helps maintain consistency between the replicas.

2.3 The Building Blocks of Hadoop

Hadoop implements distributed storage and distributed computation using a set of daemons that run on different servers in the Hadoop network. Hadoop employs a master/slave architecture for both distributed storage (HDFS) and distributed computation (Hadoop MapReduce).

The three daemons that make up an HDFS cluster are the NameNode, the Secondary NameNode and the DataNode; the two major daemons in Hadoop MapReduce are the JobTracker and the TaskTracker.

Figure 2-2: HDFS Architecture

These daemons have distinct roles (either master or slave); some of them exist on only one server and some of them exist across multiple servers.

2.3.1 NameNode

The NameNode is the master of HDFS and is responsible for maintaining the file system metadata. A Hadoop cluster contains a single NameNode but multiple DataNodes. The NameNode keeps track of how files are broken into data blocks and knows which DataNode stores which blocks. All DataNodes report their status to the NameNode through regular heartbeats, so the NameNode has a complete view of the DataNodes in the cluster and their current health. After a certain period without heartbeats, a DataNode is assumed to be dead.

The file system metadata is stored entirely in the main memory of the NameNode machine for fast lookup and retrieval [6, 8].

2.3.2 DataNode

The DataNode is the slave of HDFS and is responsible for storing and retrieving block data. A Hadoop cluster may contain one to many DataNodes. In HDFS a file is split into blocks; when a client wants to read or write a file, it first contacts the NameNode, which tells it which DataNode holds each block. The client then communicates directly with those DataNode servers and asks them to carry out the read or write operations. Based on instructions from the NameNode, DataNodes perform block creation, deletion and replication. Data blocks are replicated on different DataNodes to maintain redundancy; this ensures that if any DataNode crashes or becomes unavailable over the network, the file can still be accessed from the other DataNodes that hold its blocks. Each DataNode constantly informs the NameNode of the blocks it is currently storing [6, 8].

2.3.3 JobTracker

The JobTracker is the master process of the MapReduce cluster and acts as a liaison between the client application and Hadoop. There is only one JobTracker per MapReduce cluster. Once a user submits a job, the JobTracker determines the execution plan: how many TaskTrackers and subordinate tasks should be created, which tasks are assigned to which TaskTrackers, and whether all tasks are running. As with the DataNode and NameNode, each TaskTracker frequently reports its execution status and completed tasks to the JobTracker. If a task fails, the JobTracker automatically reassigns the failed task to a different node. The JobTracker communicates with each TaskTracker and with the client using remote procedure calls (RPC) [6, 8].

2.3.4 TaskTracker

The TaskTracker is the slave process of the MapReduce cluster. It executes the tasks assigned to it by the JobTracker locally and reports its status back to the JobTracker periodically through heartbeats. If the JobTracker does not receive a heartbeat within a specified amount of time, it assumes that the TaskTracker is dead and reassigns its tasks to another node in the cluster. Although each node contains only a single TaskTracker, each TaskTracker can execute many map and reduce tasks in parallel [6, 8].

Figure 2-3: NameNode/DataNode interaction in HDFS

2.3.5 Secondary NameNode

As the name indicates, the Secondary NameNode is an assistant daemon for monitoring the HDFS cluster. Each cluster contains a single Secondary NameNode. Since the NameNode is a single point of failure for a Hadoop cluster, the Secondary NameNode is used to store the filesystem metadata log so that it can be preserved. It communicates regularly with the NameNode to take snapshots of the HDFS metadata [6].

Figure 2-4: JobTracker and TaskTracker interaction

Usually the NameNode and JobTracker run on the same machine, which acts as the master node, whereas a DataNode and TaskTracker run together on each machine acting as a slave node.

2.4 Hadoop Distributed File System Client

2.4.1 Reading a File

To read a file in HDFS, the client first contacts the NameNode, indicating the name of the file it would like to read. The client's identity is validated and the ownership and permissions of the file are checked. If the file exists and the client has access to it, the NameNode returns the list of DataNodes holding copies of its blocks. The client then connects directly to the nearest DataNode and reads the data block it needs. This process repeats until all the data blocks have been read, after which the client closes the file stream [8].

Figure 2-5: Topology of a Hadoop cluster

2.4.2 Writing to a file

To write a file to HDFS, the client first creates an empty file without any blocks of data; to do this the client must have write permission. The NameNode then creates the metadata entry for the new file and allocates a set of DataNodes to which the replicas of the first block should be written. A replication pipeline is formed: the client makes a direct connection to the first DataNode in the list, which connects to the second DataNode, and so on. Each DataNode in the pipeline acknowledges the data block it has successfully written. When the first block is complete, the client requests new DataNodes to store the replicas of the next block; thus a new pipeline is created, and this process is repeated until the client closes the data stream, indicating that it has finished sending data. If a DataNode in a pipeline fails, the failed DataNode is removed from the pipeline and new DataNodes are chosen to write the remaining data blocks [8].

Figure 2-6: A Client Reading Data from HDFS
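
The read and write paths described above are exposed to client applications through Hadoop's FileSystem API. The following minimal sketch (the path is illustrative; it assumes the cluster configuration files are on the classpath) writes a small file to HDFS and reads it back.

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientExample {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath, so the
        // client talks to whatever NameNode the cluster configuration names.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the NameNode to create the file, then streams
        // the data to a pipeline of DataNodes chosen by the NameNode.
        Path out = new Path("/user/demo/sample.txt");   // illustrative path
        OutputStream os = fs.create(out);
        os.write("hello hdfs".getBytes("UTF-8"));
        os.close();

        // Read: the NameNode returns the block locations and the client reads
        // the blocks directly from the nearest DataNodes.
        InputStream is = fs.open(out);
        IOUtils.copyBytes(is, System.out, 4096, false);
        IOUtils.closeStream(is);
      }
    }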

2.5 Replication Management

The NameNode is responsible for replication management, which is one of the main factors determining HDFS reliability, availability and performance. The NameNode detects whether a block is over-replicated or under-replicated. If a block is over-replicated, the NameNode chooses a replica to delete. When adding or deleting a replica, the NameNode's first priority is to keep each replica on a unique rack, so that an entire-rack failure cannot cause data loss. The next priority when deleting is to remove the replica from the DataNode with the least available disk space.

Figure 2-7: A Client Writing Data to HDFS

If a block becomes under-replicated, it is placed in a replication priority queue; the block with the lowest replication count is given the highest priority. The same priorities are considered when placing a new replica. If a data block has only one replica, a new rack is chosen for the new replica. If the block has two replicas on the same rack, the third replica is placed on a different rack; otherwise the third replica is placed on the same rack as an existing replica but on a different DataNode.

The NameNode must also ensure that all replicas of a data block are not located on a single rack. If the NameNode detects such a block, it treats it as under-replicated, replicates the data block to a different rack, and deletes the old replica [21].

2.6 Different Operational Modes of Hadoop

The majority of Hadoop's settings are contained in three XML configuration files: core-site.xml, hdfs-site.xml, and mapred-site.xml. By adjusting these configuration files we can make Hadoop run in different modes, which are described below.

2.6.1 Local (Standalone) Mode

This is the default mode for Hadoop, in which all three XML configuration files are empty. Hadoop runs completely on the local machine as a single Java process. In this mode Hadoop does not use HDFS, because there is no need to communicate with other nodes, and it does not launch any Hadoop daemons. Its main use is therefore developing and debugging the MapReduce application logic without the additional complexity of interacting with the daemons [6].

2.6.2 Pseudo-Distributed Mode

Pseudo-distributed mode runs Hadoop on a single node with all daemons running as separate Java processes. Simple XML files are provided to configure the single machine in this mode: core-site.xml and mapred-site.xml specify the hostname and port of the NameNode and the JobTracker respectively, and hdfs-site.xml specifies the default replication factor for HDFS, which is one because we are running on a single machine. All daemons run on one machine and communicate with each other as if they were distributed over a cluster [6].
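
For reference, a typical Hadoop 1.x pseudo-distributed configuration looks like the sketch below; the property names and port numbers are the common defaults and may differ between Hadoop versions.

    <!-- core-site.xml: where the NameNode runs -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: where the JobTracker runs -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: a single machine can hold only one replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>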

2.6.3 Fully Distributed Mode

Fully distributed mode runs Hadoop on clusters ranging from a few nodes to thousands of nodes. It is the mode in which we get the actual benefits of distributed computing and distributed storage. Each cluster has a master node hosting the NameNode and JobTracker, a backup server hosting the Secondary NameNode daemon, and slave nodes running both the DataNode and TaskTracker daemons. The default replication factor for HDFS is 3, and this value can be increased or decreased in hdfs-site.xml depending on the application [6].

2.7 Web-based Cluster UI

Hadoop provides web interfaces that help monitor the health of Hadoop clusters. These browser interfaces let users find information much faster than digging through logs and directories.

The NameNode hosts a general report on port 50070 containing the status of each DataNode in the cluster. The storage available on each individual node can also be checked through this web interface.

Figure 2-8: A snapshot of HDFS web interface

Hadoop also provides a similar status overview of MapReduce jobs, hosted at port 50030 of the JobTracker. This report contains the status of ongoing MapReduce tasks as well as reports on completed jobs.

Figure 2-9: A snapshot of the MapReduce web interface (JobTracker)

Similarly the TaskTracker hosts a report about running and non-running tasks at port 50060.

Figure 2-10: A snapshot of the MapReduce web interface (TaskTracker)

2.8 Advantages and Disadvantages of Hadoop

2.8.1 Advantages

- Hadoop is a platform that provides both distributed storage and distributed computation.
- All map and reduce tasks are independent, so a failed task can be restarted without affecting other tasks; independent tasks also allow the job to execute quickly in parallel.
- The major advantage of Hadoop over other distributed systems is its flat scalability curve. Distributed frameworks such as MPI (Message Passing Interface) perform well on a small number of machines, but their performance does not scale linearly as the number of machines increases [13]. Hadoop, however, is a highly scalable platform whose performance grows almost in direct proportion to the number of machines available, with little rework [13].
- Hadoop offers cost-effective storage for terabytes of data using HDFS.

- HDFS can store data reliably, even in the presence of failures.
- HDFS uses large block sizes (64 MB, 128 MB, etc.) and gives better performance when manipulating large files.
- HDFS can store and process different types of data, both structured and unstructured, unlike traditional relational database systems which handle only structured data.
- HDFS replicates files a user-specified number of times (three by default), which makes it tolerant to hardware and software failures; the replicated data is also distributed among several nodes in the cluster, which helps in parallel processing.
- Another major advantage of Hadoop is that the computation logic is moved to the data, rather than moving the data close to the computation logic.

2.8.2 Disadvantages

- The master nodes of HDFS and MapReduce, the NameNode and the JobTracker, are single points of failure, so there is a risk of data loss.
- Joining multiple datasets (structured, semi-structured and unstructured) is tricky and slow.
- HDFS is not efficient at handling large numbers of small files.
- Hadoop follows a Write Once Read Many (WORM) model and hence is not suitable for applications whose data needs to be modified over time.
- Hadoop is a batch processing system designed for high throughput at the expense of high latency, and hence is not suitable for real-time systems that require extremely low latency.

3. BIOMETRIC SYSTEMS

A biometric system operates by capturing biometric data from an individual, performing feature extraction on the captured data, and then comparing this feature set with the template set in the database. Biometric modalities fall into two main categories: physical (or passive) and behavioral (or active). Physical modalities include fingerprint, iris recognition, face recognition, DNA, palm print, retina and hand geometry (related to the physical traits of the person). These are highly stable, change very little over time, and hence rarely require re-enrollment. Behavioral modalities include typing rhythm, voice and gait (related to the person's pattern of behavior). These change easily, and hence the accuracy ranges for behavioral modalities are much wider than for physical modalities [11].

3.1 Modes of Biometric System

Biometric systems can be operated in two different ways: verification mode or identification mode [9].

3.1.1 Verification Mode

In this mode the system checks the user's identity by performing a one-to-one comparison of the captured biometric with that user's own biometric template(s) stored in the system database, to confirm that the user matches the claimed identity. A user who wishes to be verified must provide a primary identifier such as a personal identification number (PIN), a user name or an ID card [9].

3.1.2 Identification Mode

In this mode the system has no primary identifier for the individual, but recognizes the person by comparing the captured biometric template against the entire database for a match. Such systems are much slower because of the overhead of performing a one-to-many comparison to identify an individual [9].

Figure 3-1: Enrollment, Verification and Identification stages of a Biometric System

3.2 Biometrics System Design

All biometric identity systems are designed around the following four main modules [9]:

- Sensor module (data collection), which captures an individual's biometric data from biometric sensor hardware.
- Feature extraction module, in which feature extraction methods and algorithms are applied to the acquired biometric data to extract a set of salient or discriminatory features.
- Matcher module, in which the query biometric template is compared against the stored biometric templates to generate a matching score. That score is used to make decisions: in verification systems, if the similarity score exceeds a given threshold the system decides that the two templates belong to the same person, whereas in identification systems the templates with the highest similarity scores are selected to determine whether anyone in the stored biometric database was identified.
- System database module, which stores the biometric templates of the users enrolled in the biometric system. After feature extraction, the biometric data is organized into a characteristic vector, which is small and can be processed reliably, rather than being stored in its original format (a digital image).

4. EFFICIENT PATTERN MATCHING OF LARGE SCALE BIOMETRICS DATA

This report considers a biometric system operated in identification mode, i.e., the query biometric image is compared against the entire database to find the most similar items. In response to growing biometric databases there is a need for a highly scalable, low latency method of performing fuzzy matching, i.e., returning the most similar items when there is no exact match [2].

4.1 Fuzzy Matching

Fuzzy matching is an operation that compares two high-dimensional items and determines whether they are similar to each other [2]. Fuzzy matching of biometric images involves extracting feature vectors from the images and then applying a distance measure to the resulting feature vectors to produce a similarity score.

Figure 4-1: Fuzzy Matching of fingerprint images lifted from a crime scene and from the law enforcement database
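
A minimal sketch of this comparison step is shown below: two feature vectors are compared with a Euclidean distance measure, and the pair is treated as a match when the distance falls below a user-chosen threshold. The vector values and the threshold here are illustrative only.

    public class FuzzyMatch {

      // Euclidean distance between two feature vectors of equal length.
      static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum);
      }

      public static void main(String[] args) {
        double[] query  = {0.12, 0.80, 0.43, 0.27};  // feature vector of the query image
        double[] stored = {0.10, 0.77, 0.49, 0.30};  // feature vector from the database
        double threshold = 0.5;                      // similarity threshold set by the user

        double score = euclideanDistance(query, stored);
        System.out.printf("distance = %.3f, match = %b%n", score, score < threshold);
      }
    }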

4.2 Hadoop and Biometrics System

Hadoop provides a good platform for performing low latency fuzzy matching on biometric databases [1]:

- The Mahout and MapReduce frameworks are used for indexing and clustering similar biometric images to reduce the overall search space (Apache Mahout, which runs on top of Hadoop, provides the clustering algorithms).
- HDFS, with its data distribution and replication features, is used as a reliable file store for the cluster points generated by Apache Mahout/MapReduce. These features provide enough parallelism to support fast queries running on multiple machines.

4.3 Development Method

The dataset used here for fuzzy matching consists of 400 ordinary images, a database of around 500 MB. The fuzzy matching process matches one query image against the entire dataset. This section describes the development process, which divides fuzzy matching on Hadoop into two steps:

1. A clustering process to reduce the total search space for each query.
2. Performing low latency fuzzy matching in parallel across the Hadoop instance to retrieve the most similar images.

4.3.1 Clustering on Hadoop

Clustering is a machine learning technique that groups data items which are similar to each other. Clustering was used here to reduce the total search space for each query. We used two clustering algorithms, canopy clustering and k-means clustering, from the Apache Mahout project. These clustering algorithms assign each biometric to a bin; the 'mean biometric' of a bin represents the average of the biometrics in that bin.

4.3.1.1 Canopy Clustering

Canopy clustering is a fast, approximate clustering technique that can create clusters quickly, within a single pass over the data. The algorithm uses a fast distance measure (EuclideanDistanceMeasure, TanimotoDistanceMeasure, etc.) and two distance thresholds T1 and T2, with T1 > T2. It begins with the set of points and an empty list of canopies, removes one data point at random and adds it to the canopy list as a center. It then iterates through the rest of the points in the data set one by one, and for each data point calculates the distance to all the centers in the list. If the distance is less than T1, the point is added to that canopy. If the distance is less than T2, the point is also removed from the data point list, which prevents it from forming a new canopy. This process is repeated until the data set list is empty [7].

The main weakness of canopy clustering is that it may not give accurate and precise clusters (because it generates clusters within a single pass over the data). Its advantage over k-means is that it generates a suitable number of clusters without requiring the number of clusters, k, to be specified.

Canopy clustering in action:

- Map phase: all points within distance T2 of a center are removed from the data set and prevented from forming new canopies. Points within distance T1 are kept in the same canopy but remain available to other canopies. The Mapper does all of this in a single pass [7].
- Reduce phase: the Reducer computes the average of the centroids and merges close canopies [7].
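
The following single-machine sketch illustrates the canopy procedure described above, using a Euclidean distance measure and illustrative thresholds; Mahout's implementation distributes the same idea across map and reduce tasks.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class CanopyClustering {

      static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
        return Math.sqrt(sum);
      }

      // Each canopy is a list whose first element is the canopy center. Requires t1 > t2.
      static List<List<double[]>> canopy(List<double[]> points, double t1, double t2) {
        List<List<double[]>> canopies = new ArrayList<List<double[]>>();
        List<double[]> remaining = new ArrayList<double[]>(points);
        while (!remaining.isEmpty()) {
          double[] center = remaining.remove(0);      // pick a point, make it a new center
          List<double[]> canopy = new ArrayList<double[]>();
          canopy.add(center);
          Iterator<double[]> it = remaining.iterator();
          while (it.hasNext()) {
            double[] p = it.next();
            double d = distance(p, center);
            if (d < t1) {
              canopy.add(p);   // within T1: belongs to this canopy (may join others too)
            }
            if (d < t2) {
              it.remove();     // within T2: removed, cannot start a new canopy
            }
          }
          canopies.add(canopy);
        }
        return canopies;
      }
    }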

4.3.1.2 K-means Clustering

Given n data points, the k-means algorithm requires the user to supply the number of clusters, k, and k initial centroid points as input parameters. The algorithm starts by assigning each data point to the cluster it is closest to, using a distance measure. A new center is then calculated for each cluster by averaging all the data points assigned to it. This process is repeated until the maximum iteration count is reached or until the centroids converge to fixed points from which they no longer move [7].

K-means clustering in action:

- Map phase: the Mapper assigns each point to the cluster nearest to it [7].
- Reduce phase: the Reducer calculates the average of all data points assigned to a cluster, thus determining the new centroid [7].

After each iteration, the output of the reduce phase is fed back into the same loop until the centroids converge to fixed points.

The main disadvantage of k-means clustering is that the user has to specify the value of k, the number of clusters, as well as the initial centroids. Given those initial conditions, k-means will try to move the centers to their optimal positions.
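
A compact single-machine sketch of the k-means loop described above follows; in the MapReduce version the assignment step corresponds to the map phase and the centroid update to the reduce phase.

    public class KMeans {

      static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
      }

      // Iteratively refines the given initial centroids in place.
      static void run(double[][] points, double[][] centroids, int maxIterations, double epsilon) {
        int k = centroids.length, dim = centroids[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
          double[][] sums = new double[k][dim];
          int[] counts = new int[k];

          // Assignment step: find the nearest centroid for each point.
          for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
              if (distance(p, centroids[c]) < distance(p, centroids[best])) best = c;
            }
            counts[best]++;
            for (int d = 0; d < dim; d++) sums[best][d] += p[d];
          }

          // Update step: move each centroid to the mean of its assigned points.
          double moved = 0.0;
          for (int c = 0; c < k; c++) {
            if (counts[c] == 0) continue;            // keep empty clusters where they are
            double[] next = new double[dim];
            for (int d = 0; d < dim; d++) next[d] = sums[c][d] / counts[c];
            moved += distance(centroids[c], next);
            centroids[c] = next;
          }
          if (moved < epsilon) break;                // centroids have converged
        }
      }
    }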

4.3.1.3 Finding the perfect k using Canopy Clustering

For most real-world clustering problems, the number of clusters is not known in advance. Good initial centroid estimates are also important because they greatly affect the run time of k-means: good estimates help the algorithm converge in fewer passes. To address both problems we used canopy and k-means clustering together, i.e., the canopy centroids generated by canopy clustering are given as initial centroids to the k-means algorithm [7].

4.3.1.4 K-means Clustering using MPI (Message Passing Interface)

The following section describes how k-means clustering is implemented on typical distributed systems that use MPI (Message Passing Interface), and how this differs from clustering on Hadoop. An MPI program is loaded into the local memory of each machine, and every process has a unique ID. MPI allows the same program to be executed on multiple data; when needed, a process can communicate and synchronize with others by calling MPI library routines [17].

First, the master machine chooses a set of initial cluster centers and broadcasts them to all machines. Each machine assigns its points to the nearest centers by computing a distance measure between its points and the centers. Local sums for all clusters are then calculated without any inter-machine communication. The master machine obtains the sums of all points in each cluster to calculate the new centers and then broadcasts them to all machines (most of the communication occurs here, as a reduction operation in MPI). Since the local sums on each node are k vectors of length k, the communication cost per k-means iteration is on the order of k^2. So for k-means, each iteration transfers O(k^2) values after O(k^2 * (n/p)) operations to calculate the distances between n/p points and k cluster centers, where p is the number of machines. MPI is thus more complicated because of its various communication functions; instead of using disk I/O, a function sends and receives data to and from a node's memory [17].

In the Hadoop implementation, a Mapper is responsible for loading the current set of clusters from disk at each iteration. Once the clusters are in memory, the Mapper reads through the list of points assigned to it one at a time and identifies which cluster each point belongs to. Once a point has been assigned to a cluster, it is written out to a file. The overall cost of the Hadoop map operation is kR + nkRW, where k is the number of clusters, n is the number of points a given mapper is responsible for clustering, and R and W are the read and write times to disk [18].

The Hadoop Reducer first reads in the current set of clusters from file and then reads in the output from all the Mappers. As the Reducer reads in the data, it writes out a modified cluster once it has finished processing all points assigned to that cluster. The overall cost of the Hadoop reduce operation is kR + mnRW, where m is the number of mappers in the system [18].

The mapper runtime can thus be considered an O(nk) operation and the reducer runtime an O(n) operation; the majority of the computation is spent in the mappers, while the role of the reducers in the overall computation is relatively small [18].
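
As a sketch of the mapper role just described, a Hadoop Mapper for the assignment step might look like the following. The class is illustrative: it assumes the input vectors arrive as comma-separated text lines and that the current centroids are loaded in setup(), for example from a file in HDFS or the distributed cache.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KMeansAssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

      private double[][] centroids;   // current cluster centers, loaded in setup()

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Illustrative placeholder: in a real job the centroids would be read from HDFS here.
        centroids = new double[][] { {0.0, 0.0}, {1.0, 1.0} };
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double[] point = new double[parts.length];
        for (int i = 0; i < parts.length; i++) point[i] = Double.parseDouble(parts[i].trim());

        // Assign the point to the nearest centroid and emit (clusterId, point).
        int best = 0;
        double bestDist = distance(point, centroids[0]);
        for (int c = 1; c < centroids.length; c++) {
          double d = distance(point, centroids[c]);
          if (d < bestDist) { bestDist = d; best = c; }
        }
        context.write(new IntWritable(best), value);
      }

      private static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
      }
    }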

Figure 4-2: Bulk clustering and real time classification

4.3.2 Low Latency Fuzzy Matching

This step performs low-latency fuzzy matching of the query biometric against the biometrics stored in the clusters, i.e., it returns the most similar items along with a similarity score. The process involves two stages:

1. In the first stage we find the cluster(s) that the given query biometric is closest to by comparing the distance between the query biometric and each cluster centroid against a threshold value set by the user. If the distance is less than the threshold, the cluster is considered closest to the query biometric and its cluster ID is saved for the next stage, so that only the clusters closest to the query biometric are searched.


The client then looks up the locations of all blocks/chunks of those closest clusters by querying the master server (see the sketch after Figure 4-4). These blocks are hosted by several HDFS DataNodes, so a single cluster can be searched by several machines at once.

Figure 4-3: Client Communication with Master server while performing Low latency fuzzy matching

2. In this stage the client submits the query biometric and the IDs of the closest clusters to the data servers running alongside the HDFS DataNodes. The data servers calculate the distance between the query biometric and the cluster points and compare it against the threshold value set by the user. Points whose distance is less than the threshold are considered the most similar items and are returned as the output.


Figure 4-4: Client Communication with Data server while performing Low latency fuzzy matching
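As a sketch of how a client can discover which DataNodes host the blocks of a chosen cluster's file, the snippet below uses the standard HDFS client API; the file path is only a placeholder for wherever that cluster's points are actually stored.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterBlockLocator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: the HDFS file holding the points of the closest cluster.
        Path clusterFile = new Path("/user/sneha/bioimg-kmeans/clusteredPoints/part-m-00000");

        FileStatus status = fs.getFileStatus(clusterFile);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block can be searched on any of the DataNodes that hold a replica of it.
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}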


5. HADOOP IMPLEMENTATION

5.1 Development Method

Due to the lack of access to large-scale biometric databases, ordinary images were used in this project to implement the fuzzy matching technique; the same procedure can be applied to biometric databases. The work involved in developing large-scale pattern matching of images on Hadoop is organized as follows:

1. Setting up a Hadoop cluster in pseudo-distributed mode.

2. Creating vectors from images.

3. Implementing clustering algorithms from the Apache Mahout project to assign each image vector to a cluster.

4. Performing the actual fuzzy matching process, i.e. the query image is compared against the closest cluster(s) to return the most similar items.

5.2 Hadoop Cluster Setup

We ran this experiment with Hadoop on Ubuntu Linux in a single-node setup, i.e., in pseudo-distributed mode, with all daemons (NameNode, DataNode, JobTracker, TaskTracker) running on that single machine. Since Hadoop runs on only one node, the replication factor is set to 1.

5.3 Creating Vectors from Images

As Hadoop works on the record format, the images are initially converted into vectors and are

stored in HDFS as text files. One of the best ways to convert images to vectors is using Feature

Extraction/ Feature selection techniques like PCA (Principal Component Analysis), SIFT (Scale-

Invariant Feature Transform), Harris Corner Detection etc. where we can get high quality

discriminant features of the images [19].
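As a minimal sketch of this step, the snippet below writes one image's feature vector (assumed to have already been produced by a feature extractor such as PCA or SIFT) as a line of comma-separated values in an HDFS text file; the path, image ID, and feature values are placeholders.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FeatureVectorWriter {
    public static void main(String[] args) throws Exception {
        // Placeholder output of the feature-extraction step.
        String imageId = "img-0001";
        double[] features = {0.12, 0.87, 0.45, 0.03};

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/sneha/bioimg-vectors/" + imageId + ".txt");

        // Write the vector as "imageId,f1,f2,..." so it can later be turned into Mahout vectors.
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fs.create(out, true)));
        try {
            StringBuilder line = new StringBuilder(imageId);
            for (double f : features) line.append(',').append(f);
            writer.write(line.toString());
            writer.newLine();
        } finally {
            writer.close();
        }
        fs.close();
    }
}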

5.4 Running Clustering on Hadoop

5.4.1 Apache Mahout and Apache Maven

Mahout is an open-source machine learning library whose algorithms are implemented as MapReduce jobs that can execute on a Hadoop cluster. Among the clustering algorithms it provides are canopy clustering and k-means clustering.


Along with Mahout we also installed Apache Maven, a command-line tool used to build and manage projects, resolve dependencies, and compile and execute code. By adding the Hadoop and Mahout dependencies to the project's Maven configuration file (pom.xml), we can easily compile and execute our Java programs that include classes from both Hadoop and Mahout.

5.4.1.1 Preparing Vectors for use by Mahout

To cluster objects, they must first be converted into vectors (in Mahout, vectors are implemented as three different classes: DenseVector, RandomAccessSparseVector, and SequentialAccessSparseVector). For k-means clustering, which repeatedly calculates vector magnitudes and distances, SequentialAccessSparseVector is a better choice than RandomAccessSparseVector and DenseVector, which suit algorithms involving random insertions and updates of vector values.

Now we need to convert the text documents containing feature vectors into vectors of type SequentialAccessSparseVector. For that we use two important tools:

1. The first is the SequenceFilesFromDirectory tool, which converts the text documents into SequenceFiles. SequenceFile is a format from the Hadoop library consisting of a series of key/value pairs.

Mahout command: The following command converts the text documents containing image vectors into SequenceFiles:

$ bin/mahout seqdirectory -c UTF-8 -i /user/sneha/bioimg-vectors -o /user/sneha/bioimg-sequence

Where, <-c> represents <character encoding of the input files>
<-i> represents <input text documents containing image vectors>
<-o> represents <output folder containing the SequenceFile format of the text documents>

SequenceFile Output: The SequenceFile output, which is in the form of key/value pairs, can be inspected using the Mahout utility SeqDumper by running the following command:

$ bin/mahout seqdumper -i /user/sneha/bioimg-sequence/part-m-00000 | more


2. The second is the SparseVectorsFromSequenceFiles tool, which converts the documents in SequenceFile format into sparse vectors.

Mahout command: The following command converts the SequenceFile data into sparse vectors:

$ bin/mahout seq2sparse -i /user/sneha/bioimg-sequence -o /user/sneha/bioimg-seqvectors

Where, <-i> represents <input SequenceFile-format directory>
<-o> represents <output sparse vector directory>

Sparse Vectors Output: SeqDumper can also be used to inspect the sparse vectors by running the following command:

$ bin/mahout seqdumper -i /user/sneha/bioimg-seqvectors/tfidf-vectors/part-r-00000 | more
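The seqdirectory/seq2sparse pipeline above is primarily designed for text documents (it produces TF-IDF vectors). For purely numeric image feature vectors, an alternative (which we did not use here) is to write Text/VectorWritable SequenceFiles directly, a format Mahout's clustering drivers accept as input. The sketch below assumes that layout; the output path and the sample vectors are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class VectorsToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/sneha/bioimg-seqvectors/part-vectors-00000");

        // SequenceFile of <Text imageId, VectorWritable featureVector> pairs.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
            double[][] features = {{0.12, 0.87, 0.45}, {0.33, 0.10, 0.91}};   // placeholder vectors
            for (int i = 0; i < features.length; i++) {
                NamedVector vec = new NamedVector(new DenseVector(features[i]), "img-" + i);
                writer.append(new Text(vec.getName()), new VectorWritable(vec));
            }
        } finally {
            writer.close();
        }
    }
}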

5.4.1.2 Canopy Clustering

Running Canopy Clustering

The canopy clustering algorithm can be run by a command-line invocation of CanopyDriver.main, a built-in class in Mahout:

$ bin/mahout canopy -i /user/sneha/bioimg-seqvectors/tfidf-vectors/ -o /user/sneha/bioimg-canopy-centroids -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 100 -t2 110

Where, <-i> represents <input vectors directory>
<-o> represents <output directory containing centroids>
<-dm> represents <distance measure>
<-t1> represents <T1 threshold value>
<-t2> represents <T2 threshold value>


5.4.1.3 K-means Clustering

Running k-means clustering:

The k-means clustering algorithm can be run by a command-line invocation of KMeansDriver.main, a built-in class in Mahout:

$ bin/mahout kmeans -i /user/sneha/bioimg-seqvectors/tfidf-vectors/ -o /user/sneha/bioimg-kmeans -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -c /user/sneha/bioimg-canopy-centroids/clusters-0-final/ -cd 0.1 -ow -x 20 -cl

Where, <-i> represents <input vectors directory>
<-o> represents <output directory containing the cluster points>
<-dm> represents <distance measure>
<-c> represents <input clusters directory (here, the output directory of canopy clustering)>
<-cd> represents <convergence delta>
<-ow> represents <overwrite the output directory if it already exists>
<-x> represents <maximum number of iterations>
<-cl> represents <run clustering after the iterations so that the assignment of each point to its cluster is written to a clusteredPoints directory inside the specified output directory>

5.4.1.4 Cluster Output

The output of the canopy and k-means clustering steps is a set of clusters, each containing biometrics that are statistically similar to one another. In Mahout the cluster output can be inspected using the ClusterDumper tool, executed as follows:

$ bin/mahout clusterdump -dt sequencefile -d /user/sneha/bioimg-seqvectors/dictionary.file-0 -i /user/sneha/bioimg-kmeans/clusters-20-final/ -o bioimg-kmeans


The output of ClusterDumper is a block of information for each cluster, including its top-weighted terms and its centroid vector (the average of all the points in that cluster). The dimensions of the vector are translated into the corresponding words of the dictionary, and this output is stored in the text file bioimg-kmeans.
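When the cluster output needs to be consumed by our own code, as in the fuzzy matching step below, a programmatic equivalent of SeqDumper can be used. The sketch below simply iterates over a clustering output SequenceFile and prints its key/value pairs; the part-file path is illustrative, and the key/value classes are created reflectively from the file header because they vary between Mahout versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ClusterOutputReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path: one part file of the clusteredPoints output of k-means.
        Path part = new Path("/user/sneha/bioimg-kmeans/clusteredPoints/part-m-00000");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        try {
            // Instantiate key/value objects of whatever classes the file was written with.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}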

5.5 Fuzzy Matching

The final step in our experiment is to perform fuzzy matching of the query image against the images in the clusters, i.e., to return the most similar items with a similarity score. This process involves two stages:

1. In the first stage we find the cluster(s) that the given query image is closest to by comparing the distance between the query image and each cluster centroid against a threshold value set by the user.

/* Euclidean distance between the vector generated from the query biometric
   and the centroid of cluster i */
double d = measure1.distance(clusters.get(i).getCenter(), value.get());
if (d <= threshold) {
    clussearch[c] = i;   // store the IDs of the clusters closest to the query biometric
    c++;
}

If the distance is less than the threshold value, the cluster is considered closest to the query image and its cluster ID is saved for the next stage, so that only the clusters closest to the query image are searched.

2. In this stage we calculate the distance between the query image and the points (stored images) of the clusters saved in the previous stage, and again compare it against the threshold value set by the user. Points whose distance is less than the threshold are considered the most similar items and are returned as the output.

// Distance between a stored cluster point and the query image's vector
double d = measure1.distance(value1.getVector(), value2.get());
if (d <= threshold) {
    System.out.println("distancepoint: " + d + " clusterId: " + key1.get());
    System.out.println("clusterpoint: " + value1.getVector());
}
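Putting the two fragments together, the following self-contained sketch shows the control flow of both stages over small in-memory data, using Mahout's EuclideanDistanceMeasure and DenseVector. In the actual implementation the centroids and cluster points are read from the k-means output in HDFS; all vectors and the threshold below are purely illustrative.

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class TwoStageFuzzyMatch {
    public static void main(String[] args) {
        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        double threshold = 1.0;

        Vector query = new DenseVector(new double[]{0.2, 0.3});
        Vector[] centroids = {
                new DenseVector(new double[]{0.25, 0.35}),
                new DenseVector(new double[]{5.0, 5.0})};
        Vector[][] clusterPoints = {
                {new DenseVector(new double[]{0.21, 0.29}), new DenseVector(new double[]{0.9, 0.9})},
                {new DenseVector(new double[]{5.1, 4.9})}};

        // Stage 1: keep only the clusters whose centroid is within the threshold of the query.
        List<Integer> closestClusters = new ArrayList<Integer>();
        for (int i = 0; i < centroids.length; i++) {
            if (measure.distance(centroids[i], query) <= threshold) {
                closestClusters.add(i);
            }
        }

        // Stage 2: compare the query only against the points of those closest clusters.
        for (int clusterId : closestClusters) {
            for (Vector point : clusterPoints[clusterId]) {
                double d = measure.distance(point, query);
                if (d <= threshold) {
                    System.out.println("match in cluster " + clusterId
                            + ", distance " + d + ", point " + point);
                }
            }
        }
    }
}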


6. CONCLUSION AND FUTURE WORK

6.1 Conclusion

Hadoop is a versatile tool, best known for its distributed storage and distributed computation over massive amounts of data. Hadoop was chosen as the framework for processing biometric data for several reasons: it can handle large datasets of unstructured data; it fits the write-once-read-many model (biometric data, once collected, are generally used for read operations); and it provides high scalability (biometric data collection is increasing day by day, continually growing the data size, and Hadoop can handle the additional data by adding more nodes to the cluster). Hadoop also moves code to the data rather than transferring large datasets to the computation logic, and another primary reason to use it is its high degree of fault tolerance.

We also developed a distributed fuzzy matching technique that returns the most similar items. Clustering is used to reduce the total search space for each query: we used the canopy and k-means clustering algorithms from Apache Mahout to assign each image to a bin. When performing a query, we first find the clusters that the given image is closest to, and then perform fuzzy matching against the points in those closest clusters to return the most similar items.

Along with its many advantages, the Hadoop framework also has some limitations, mainly its centralized namespace storage, which results in a single point of failure. It is also not suitable for applications that use a large number of small files. While Hadoop works well for research and non-critical applications, it is not suitable for time-critical real-world applications because each job incurs an initial setup time (at least 30 seconds).

6.2 Future Work

In this work we performed pattern matching on ordinary images using the Hadoop framework; in the future we plan to perform the same operations on biometric images. We also clustered all the images in one large effort, i.e., batch or offline clustering. In a biometric system, however, new biometrics arrive during the enrollment phase and need to be clustered as well, and it would be quite expensive to rerun batch clustering again and again for each new biometric. Hence, in the future we plan to implement online clustering, i.e.,


for each new biometric we will use canopy clustering to assign it to the cluster whose centroid is closest, and if the new biometric is not associated with any of the existing clusters, it will form a new canopy.

At present the pattern matching system is implemented on Hadoop in pseudo-distributed mode; in the future we plan to implement this project in a fully distributed mode.


REFERENCES

[1] Jason Trost, Lalit Kapoor “Hadoop for Large-scale Biometric Databases, Biometric-Hadoop

Summit 2010”.

[2] Jon Zuanich, “Tackling Large Scale Data in Government”.

[3] Shelly and N.S. Raghava, “Iris Recognition on Hadoop: A Biometrics System

implementation on Cloud Computing, proceedings of IEEE CCIS2011”.

[4] Apache Hadoop web site [online]. Available: http://hadoop.apache.org/

[5] Apache Mahout web site [online]. Available: http://mahout.apache.org/

[6] Chuck Lam. (2011). Hadoop In Action. Stanford: Manning Publication Co.

[7] Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman. (2012). Mahout In Action. New

York: Manning Publication Co.

[8] Eric Sammer. (2012). Hadoop Operations. O'Reilly Media, Inc.

[9] Anil K. Jain, Arun Ross, and Salil Prabhakar, “An Introduction to Biometric Recognition,

IEEE Transactions on circuits and systems for video technology, Vol. 14, No. 1, January 2004”.

[10] Madhavi Vaidya, Swati Sherekar, “Study of Biometric Data Processing by Hadoop, MPGI

National Multi Conference, April 2012”.

[11] Wikipedia web site [online]. Available: http://en.wikipedia.org/wiki/Biometrics

[12] https://developer.yahoo.com/hadoop/tutorial/module1.html

[13] Shivaraman Janakiraman, “Fault Tolerance in Hadoop for Work Migration”.

[14] U. S. Department of Homeland Security. (2014). Budget in Brief: Fiscal Year – 2014.

Retrieved from:

http://www.dhs.gov/sites/default/files/publications/MGMT/FY%202014%20BIB%20-

%20FINAL%20-508%20Formatted%20(4).pdf

[15] Dr. Rakesh Rathi, Sandhya Lohiya, “Big Data and Hadoop, International Journal of

Advanced Research in Computer Science & Technology (IJARCST 2014)”.


[16] Amit Mhatre, Srinivas Palla, Sharat Chikkerur and Venu Govindaraju, “Efficient Search and

retrieval in Biometric Databases”.

[17] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, Edward Y. Chang, “Parallel

Spectral Clustering in Distributed Systems”.

[18] Kathleen Ericson, Shrideep Pallickara, “On the Performance of High Dimensional Data

Clustering and Classification Algorithms”.

[19] Dilipsinh Bheda, Asst. Prof. Mahasweta Joshi, Asst. Prof. Vikram Agrawal, “A Study on

Features Extraction Techniques for Image Mosaicing, International Journal of Innovative

Research in Computer and Communication Engineering, Vol. 2, Issue 3, March 2014”.

[20] Xiaoling Lu, Bing Zheng, “Distributed Computing and Hadoop in Statistics”.

[21] Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas,

“The Hadoop Distributed File System”.

