Date post: 28-Mar-2015
OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY

A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering

Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.
Applied Software Engineering Research Group
Oak Ridge National Laboratory

Outline

Introduction to dynamic information streams and the issues

Bio-inspired clustering

MSF clustering model based on bird flock collective behavior

TFIDF not practical for dynamic data

MSF document clustering algorithm

Multi-agent document clustering implementation

Future work and conclusion

Text Challenge

Problem: How to effectively reduce the size of a large, streaming set of documents

“Give me the 10 documents that I need to read, out of the 1000 I received today?”

Characteristics:

A steady flow of simple documents

Need to rapidly organize the documents into subsets

Select representative documents from the subsets

Approach

Use standard IR techniques to convert text to vectors

Use unsupervised learning/text clustering to organize the documents

Look for improvements in term weighting approaches

Standard Information Retrieval

Document 1: The Army needs sensor technology to help find improvised explosive devices

Document 2: ORNL has developed sensor technology for homeland defense

Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices

Terms

Document 1: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device

Document 2: ORNL, develop, sensor, technology, homeland, defense

Document 3: Mitre, won, contract, develop, homeland, defense, sensor, explosive, devices

Vector Space Model

Term List     Doc 1   Doc 2   Doc 3
Army            1       0       0
Sensor          1       1       1
Technology      1       1       0
Help            1       0       0
Find            1       0       0
Improvise       1       0       0
Explosive       1       0       1
Device          1       0       1
ORNL            0       1       0
develop         0       1       1
homeland        0       1       1
Defense         0       1       1
Mitre           0       0       1
won             0       0       1
contract        0       0       1
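The conversion from term lists to the binary term-document matrix can be sketched in a few lines of Python. This is only an illustration: the per-document term lists are copied from the slide (lowercased and reduced to singular stems by hand), the stop-word removal and stemming steps are assumed to have happened already, and the function name `build_term_document_matrix` is hypothetical.

```python
# Term lists per document, as shown on the slide after stop-word removal and stemming.
DOC_TERMS = {
    "Doc 1": ["army", "sensor", "technology", "help", "find", "improvise", "explosive", "device"],
    "Doc 2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "Doc 3": ["mitre", "won", "contract", "develop", "homeland", "defense", "sensor", "explosive", "device"],
}


def build_term_document_matrix(doc_terms):
    """Return (vocabulary, matrix); matrix[term] holds one 0/1 flag per document."""
    vocabulary = []
    for terms in doc_terms.values():
        for term in terms:
            if term not in vocabulary:  # keep first-seen order
                vocabulary.append(term)
    matrix = {
        term: [1 if term in terms else 0 for terms in doc_terms.values()]
        for term in vocabulary
    }
    return vocabulary, matrix


vocabulary, matrix = build_term_document_matrix(DOC_TERMS)
for term in vocabulary:
    print(f"{term:<12}", *matrix[term])
```

Each row printed corresponds to one row of the term list table, with one column per document.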

Standard Textual Clustering

Vector Space Model: the 15-term by 3-document matrix (Doc 1, Doc 2, Doc 3), weighted with TFIDF

$$W_{ij} = \log_2(f_{ij} + 1) \times \log_2\left(\frac{N}{n}\right)$$

Dissimilarity Matrix (Documents to Documents)

         Doc 1   Doc 2   Doc 3
Doc 1    100%    17%     21%
Doc 2            100%    36%
Doc 3                    100%

Cluster Analysis (D1, D2, D3): most similar documents, Euclidean distance

Time Complexity: $O(n^2 \log n)$
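A minimal sketch of this pipeline, assuming the weighting formula reads W = log2(f + 1) * log2(N / n), where f is a term's count in a document, N is the number of documents, and n is the number of documents containing the term. The matrix rows are the 0/1 counts from the slide; the function names are hypothetical, and the sketch makes no attempt to reproduce the percentages shown in the slide's matrix, whose exact normalization is not given.

```python
import math

TERMS = ["army", "sensor", "technology", "help", "find", "improvise", "explosive",
         "device", "ornl", "develop", "homeland", "defense", "mitre", "won", "contract"]
# Rows follow TERMS; columns are Doc 1, Doc 2, Doc 3.
FREQ = [
    [1, 0, 0], [1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 0],
    [1, 0, 0], [1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1],
    [0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1],
]


def tfidf_weights(freq):
    """Apply W = log2(f + 1) * log2(N / n) to every cell of the matrix."""
    n_docs = len(freq[0])
    weights = []
    for row in freq:
        doc_freq = sum(1 for f in row if f > 0)  # n: documents containing the term
        idf = math.log2(n_docs / doc_freq)
        weights.append([math.log2(f + 1) * idf for f in row])
    return weights


def euclidean_distance_matrix(weights):
    """Pairwise Euclidean distance between document columns."""
    n_docs = len(weights[0])
    columns = [[row[j] for row in weights] for j in range(n_docs)]
    return [[math.dist(a, b) for b in columns] for a in columns]


weights = tfidf_weights(FREQ)
distances = euclidean_distance_matrix(weights)
for row in distances:
    print(["%.3f" % d for d in row])
```

Note that N and n are properties of the whole document set, so every weight depends on every other document. A term that occurs in all documents, such as "sensor" here, receives weight zero.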

Issues (1)

Analysts are currently overwhelmed by the amount of information generated by streams every day.

Research in cluster analysis mainly focuses on how to quickly and accurately cluster static data collections.

Research on clustering the dynamic information stream is limited.

Solution: Bio-inspired Clustering

New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and swarms of bees, can solve problems in dynamic environments.

These algorithms are characterized by the interaction of a large number of agents that follow the same rules.

The bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.

Data Clustering by Ant Clustering Algorithm

[Figure: panels 1, 2, 3]

Deneubourg proposed the first clustering solutions inspired by ant colonies in 1991.

Agent (ant) action rule: agents move randomly in the grid. Agents only recognize objects immediately in front of them. An agent picks up or drops an item based on a pickup probability and a drop probability.

The movement of data objects has to be implemented through the movements of a small number of ant agents, which slows down the clustering speed.
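The pickup and drop probabilities are not spelled out on the slide. The sketch below assumes the form commonly attributed to the basic Deneubourg model, where f is the fraction of similar items the ant perceives nearby: pick = (k1 / (k1 + f))^2 and drop = (f / (k2 + f))^2. The constants k1 and k2 and all function names are illustrative assumptions.

```python
import random


def pick_probability(local_density, k1=0.1):
    """High when few similar items are nearby (assumed basic-model form)."""
    return (k1 / (k1 + local_density)) ** 2


def drop_probability(local_density, k2=0.3):
    """High when many similar items are nearby (assumed basic-model form)."""
    return (local_density / (k2 + local_density)) ** 2


def ant_decision(carrying, local_density, rng=random):
    """Return True if the ant changes state: drops when laden, picks up when unladen."""
    if carrying:
        probability = drop_probability(local_density)
    else:
        probability = pick_probability(local_density)
    return rng.random() < probability


for density in (0.0, 0.1, 0.5, 0.9):
    print(density, round(pick_probability(density), 3), round(drop_probability(density), 3))
```

Because every item has to wait for an ant to carry it, the number of moves per iteration is bounded by the number of ants, which is the speed limitation the slide points out.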

A New Clustering Algorithm Based on Bird Flock Collective Behavior

Trivial behavior

Emergent behavior = flocking

Flocking Model

The flocking model, one of the first bio-inspired computational models of collective behavior, was first proposed by Craig Reynolds in 1987.

Alignment: steer towards the average heading of the local flock mates

$$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow v_{ar} = \frac{1}{n} \sum_{x}^{n} v_x$$

Separation: steer to avoid crowding flock mates

$$d(P_x, P_b) \le d_2 \Rightarrow v_{sr} = \sum_{x}^{n} \frac{v_x + v_b}{d(P_x, P_b)}$$

Cohesion: steer towards the average position of local flock mates

$$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow v_{cr} = \sum_{x}^{n} (P_x - P_b)$$
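The three rules can be sketched as follows. The sketch assumes this reading of the slide's symbols: P_b and v_b are the position and velocity of the boid being updated, P_x and v_x those of a neighbor x, d(P_x, P_b) is their distance, d_1 > d_2 are two fixed radii, and n is the number of neighbors in range. Alignment is taken as the mean of v_x and cohesion as the sum of (P_x - P_b) over neighbors with d_2 <= d <= d_1; separation is the sum of (v_x + v_b) / d over neighbors with d <= d_2. The `Boid` class and function name are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Boid:
    position: tuple  # (x, y)
    velocity: tuple  # (vx, vy)


def flocking_velocities(boid, others, d1, d2):
    """Return (v_ar, v_sr, v_cr) for `boid`, transcribing the three rules literally."""
    align_sum = [0.0, 0.0]
    cohesion = [0.0, 0.0]
    separation = [0.0, 0.0]
    n_align = 0
    for other in others:
        dist = math.dist(boid.position, other.position)
        if d2 <= dist <= d1:  # alignment and cohesion range
            n_align += 1
            for k in range(2):
                align_sum[k] += other.velocity[k]
                cohesion[k] += other.position[k] - boid.position[k]
        if 0.0 < dist <= d2:  # separation range; skip coincident boids
            for k in range(2):
                separation[k] += (other.velocity[k] + boid.velocity[k]) / dist
    alignment = [s / n_align for s in align_sum] if n_align else [0.0, 0.0]
    return tuple(alignment), tuple(separation), tuple(cohesion)


me = Boid((0.0, 0.0), (1.0, 0.0))
flock = [Boid((3.0, 0.0), (0.0, 2.0)), Boid((0.0, 4.0), (2.0, 0.0))]
v_ar, v_sr, v_cr = flocking_velocities(me, flock, d1=5.0, d2=1.0)
print(v_ar, v_sr, v_cr)
```

How the three component velocities are weighted and combined into the boid's next velocity is not stated on the slide and is left out of the sketch.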

Flocking Demo

Multiple Species Flocking (MSF) Model

Feature similarity rule: Steer away from other birds that have dissimilar features and stay close to those birds that have similar features.
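One possible way to express the feature similarity rule is sketched below. It is not the formula used by the authors, which the slide does not give. The sketch assumes cosine similarity between feature vectors and a hypothetical `threshold`: neighbors more similar than the threshold attract, less similar neighbors repel, and each contribution acts along the unit vector toward the neighbor.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def feature_similarity_steering(position, features, neighbors, threshold=0.5):
    """neighbors: iterable of (position, features) pairs. Returns a 2D steering vector."""
    steer = [0.0, 0.0]
    for other_position, other_features in neighbors:
        dist = math.dist(position, other_position)
        if dist == 0.0:
            continue  # no direction to steer along
        pull = cosine_similarity(features, other_features) - threshold
        for k in range(2):
            steer[k] += pull * (other_position[k] - position[k]) / dist
    return tuple(steer)


toward = feature_similarity_steering((0.0, 0.0), [1.0, 0.0], [((2.0, 0.0), [1.0, 0.0])])
away = feature_similarity_steering((0.0, 0.0), [1.0, 0.0], [((2.0, 0.0), [0.0, 1.0])])
print(toward, away)
```

With this rule added to alignment, separation, and cohesion, boids with the same features end up sharing a flock while boids with different features drift apart, which is the behavior the multiple species model relies on.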

Issues (2)

TFIDF not practical for dynamic data

Document set must be known before VSM can be calculated

Every added or removed document from the set requires recalculation of the entire VSM

Requires sequential processing

Not good for a distributed agent approach

Inverse Corpus Frequency

$$W_{ij} = \log_2(f_{ij} + 1) \times \log_2\left(\frac{C + 1}{c + 1}\right)$$

[Figure: Unique Term Count (0 to 250,000) versus Number of Documents (K) (5 to 905)]

• Look at the forest, not the trees

We analyzed nearly 1 million documents from 6 major research corpora

We found 229,023 unique terms (A large dictionary contains around 70,000 terms)

We use this term frequency distribution as our “global” term frequency
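A minimal sketch of the weighting, assuming it reads W = log2(f + 1) * log2((C + 1) / (c + 1)), with f the term's count in the document, C the number of documents in the reference corpus, and c the number of corpus documents containing the term. The corpus statistics below are invented for illustration, and the function names are hypothetical.

```python
import math


def tficf_weight(term_count, corpus_size, corpus_doc_freq):
    """W = log2(f + 1) * log2((C + 1) / (c + 1)); needs no other incoming document."""
    return math.log2(term_count + 1) * math.log2((corpus_size + 1) / (corpus_doc_freq + 1))


def document_vector(term_counts, corpus_size, corpus_doc_freqs):
    """Weight one document from its own counts plus the fixed global table."""
    return {
        term: tficf_weight(count, corpus_size, corpus_doc_freqs.get(term, 0))
        for term, count in term_counts.items()
    }


# Hypothetical global statistics: corpus size and per-term document frequencies.
CORPUS_SIZE = 1000
CORPUS_DOC_FREQ = {"sensor": 120, "explosive": 15, "the": 999}

vector = document_vector({"sensor": 2, "explosive": 1, "the": 7}, CORPUS_SIZE, CORPUS_DOC_FREQ)
for term, weight in vector.items():
    print(f"{term:<10} {weight:.3f}")
```

Because the corpus table is fixed ahead of time, a document's vector does not change when other documents arrive or leave the stream, so each vector can be computed independently.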

Reed, Jiao, et al., “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams,” The Fifth International Conference on Machine Learning and Applications (2006), to appear

Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland

Why this matters

We can now generate an accurate vector directly from a text document

That vector can be generated wherever the document resides

We can now use agents to create vectors from documents over a broad range of computers

Multiple Species Flocking (MSF) Document Clustering

Each document is projected as a bird in a 2D virtual space.

Birds that have similar document vector features (analogous to a bird’s species and colony in nature) automatically group together and become a bird flock.

Other birds that have different document vector features will stay away from this flock.

MSF Document Clustering Demo

The Document collection Dataset

      Category/Topic                      Number of articles
 1    Airline Safety                      10
 2    China and Spy Plane and Captives    4
 3    Hoof and Mouth Disease              9
 4    Amphetamine                         10
 5    Iran Nuclear                        16
 6    N. Korea and Nuclear Capability     5
 7    Mortgage Rates                      8
 8    Ocean and Pollution                 10
 9    Saddam Hussein and WMD              10
10    Storm Irene                         22
11    Volcano                             8

Performance Results of MSF, K-means and Ant Clustering Algorithm

The clustering results of the K-means, Ant clustering, and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations

Dataset                    Algorithm   Average cluster number   Average F-measure value
Synthetic Dataset          MSF         4                        0.9997
                           K-means     (4)***                   0.9879
                           Ant         4                        0.9823
Real Document Collection   MSF         9.105                    0.7913
                           K-means     (11)***                  0.5632
                           Ant         1                        0.1623

* Four data types, each including 200 two-dimensional (x, y) data objects; x and y follow a normal distribution.
** 112 news article dataset, 11 categories.
*** The K-means algorithm has prior knowledge of the cluster number.

Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9, pp. 505-515, August 2006, ISSN: 1318-7621
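The slide does not say how the F-measure was computed. The sketch below assumes the definition commonly used for clustering evaluation: for every true category, take the best F value over all clusters (harmonic mean of precision and recall), then average those best values weighted by category size. The function name is hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Size-weighted average, over true categories, of the best per-cluster F value."""
    total = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    score = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_cls / total) * best
    return score


perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
merged = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
print(perfect, merged)
```

Under this definition, an algorithm that lumps everything into a single cluster still gets full recall for each category but low precision, which pulls its score down.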

MSF Clustering Algorithm for Information Stream

The MSF clustering algorithm achieves better document clustering performance than the K-means and Ant clustering algorithms (average F-measure of 0.7913 versus 0.5632 and 0.1623 on the 112-article collection).

This algorithm can continually refine the clustering result and quickly react to changes in individual data items. This characteristic makes the algorithm suitable for clustering dynamically changing document information, such as text information streams.

Multi-Agent Document Clustering Implementation

JADE platform (http://jade.tilab.com/)

Linux cluster machine: one main node and three client nodes, connected by a Gigabit Ethernet switch. Each node contains a single 2.4 GHz Intel Pentium IV processor and 512 MB of memory.

Document datasets are derived from TREC collections. TREC: Text REtrieval Conference (http://trec.nist.gov/)

Current and Future Work

Switched agent platform from JADE to our light agent platform (ORMAC).

Built a control agent for automatically generating and deploying flock agents on all available nodes of a 135-node cluster.

Built agents to monitor news updates on several popular Internet news websites, collect the news, and feed it into the system in real time.

Building a better GUI

Conclusion

The heuristic search mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.

The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.

The agent architecture provides an analysis approach that can run on cluster computers.

Thank you!

The architectures: the central model and the distributed model

[Figure: the distributed model. Labels: Head Node, JADE main container, JADE system agents, Node1, Node2, Node3, JADE container, location proxy agents, boid agents]

[Figure: the single processor model. Labels: Head Node, JADE main container, JADE system agents, Node1, JADE container, location proxy agent, boid agents]

