OAK RIDGE NATIONAL LABORATORY, U. S. DEPARTMENT OF ENERGY
A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering
Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.
Applied Software Engineering Research Group
Oak Ridge National Laboratory
Outline
Introduction of Dynamic Information Stream and the issues
Bio-inspired Clustering
MSF Clustering Model Based on Bird Flock Collective Behavior
TFIDF not practical for dynamic data
MSF Document Clustering Algorithm
Multi-Agent Document Clustering Implementation
Future works and Conclusion
Text Challenge
Problem
  How to effectively reduce the size of a large, streaming set of documents
  “Give me the 10 documents that I need to read, out of the 1000 I received today?”
Characteristics
  A steady flow of simple documents
  Need to rapidly organize the documents into subsets
  Select representative documents from the subsets
Approach
Use standard IR techniques to convert text to vectors
Use unsupervised learning/text clustering to organize the documents
Look for improvements in term weighting approaches
Standard Information Retrieval
Vector Space Model

Documents
  Document 1: The Army needs sensor technology to help find improvised explosive devices
  Document 2: ORNL has developed sensor technology for homeland defense
  Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices

Terms
  Document 1: Army, sensor, technology, help, find, improvise, explosive, device
  Document 2: ORNL, develop, sensor, technology, homeland, defense
  Document 3: Mitre, won, contract, develop, homeland, defense, sensor, explosive, device

Term List     Doc 1   Doc 2   Doc 3
Army            1       0       0
Sensor          1       1       1
Technology      1       1       0
Help            1       0       0
Find            1       0       0
Improvise       1       0       0
Explosive       1       0       1
Device          1       0       1
ORNL            0       1       0
develop         0       1       1
homeland        0       1       1
Defense         0       1       1
Mitre           0       0       1
won             0       0       1
contract        0       0       1
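The term-by-document table on this slide can be rebuilt from the three per-document term lists. The following is a minimal sketch, assuming stop-word removal and stemming have already been applied (the term lists are copied from the slide); the function names `build_term_list` and `build_matrix` are illustrative and not part of the original work.

```python
# Pre-processed term lists copied from the slide. Stop-word removal and
# stemming are assumed to have happened already.
docs = {
    "Doc 1": ["army", "sensor", "technology", "help", "find",
              "improvise", "explosive", "device"],
    "Doc 2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "Doc 3": ["mitre", "won", "contract", "develop", "homeland", "defense",
              "sensor", "explosive", "device"],
}

def build_term_list(docs):
    """Collect the unique terms in order of first appearance."""
    terms = []
    for words in docs.values():
        for word in words:
            if word not in terms:
                terms.append(word)
    return terms

def build_matrix(docs, terms):
    """Map each term to a binary row: 1 if the document contains the term."""
    return {t: [1 if t in words else 0 for words in docs.values()]
            for t in terms}

terms = build_term_list(docs)
matrix = build_matrix(docs, terms)

# Print one row per term, one column per document.
for t in terms:
    print(f"{t:<12}", *matrix[t])
```

Each column of the printed table is the binary vector for one document; the rows come out in the same order as the slide's term list.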
Standard Textual Clustering
Vector Space Model (TFIDF)

Term List     Doc 1   Doc 2   Doc 3
Army            1       0       0
Sensor          1       1       1
Technology      1       1       0
Help            1       0       0
Find            1       0       0
Improvise       1       0       0
Explosive       1       0       1
Device          1       0       1
ORNL            0       1       0
develop         0       1       1
homeland        0       1       1
Defense         0       1       1
Mitre           0       0       1
won             0       0       1
contract        0       0       1

TFIDF term weight:
$W_{ij} = \log_2(f_{ij} + 1) \times \log_2\left(\frac{N}{n}\right)$

Dissimilarity Matrix (Euclidean distance), Documents to Documents

          Doc 1   Doc 2   Doc 3
Doc 1     100%    17%     21%
Doc 2             100%    36%
Doc 3                     100%

Cluster Analysis: most similar documents (D1, D2, D3)

Time Complexity: O(n^2 log n)
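The pipeline on this slide (TFIDF weighting followed by a document-to-document Euclidean distance matrix) can be sketched as follows. This is an illustration under stated assumptions: N is read as the total number of documents and n as the number of documents containing the term, which is the usual TFIDF convention but is not spelled out on the slide. The sketch does not try to reproduce the percentages shown in the slide's matrix, since the slide does not say how they were normalized.

```python
import math

def tfidf_weight(f_ij, N, n):
    """W_ij = log2(f_ij + 1) * log2(N / n), with N and n as assumed above."""
    return math.log2(f_ij + 1) * math.log2(N / n)

def tfidf_vectors(counts):
    """counts: one row per term, one column per document (term frequencies).

    Every term is assumed to occur in at least one document, so n > 0.
    Returns one weight vector per document.
    """
    N = len(counts[0])
    vectors = [[] for _ in range(N)]
    for row in counts:
        n = sum(1 for f in row if f > 0)  # documents containing this term
        for j, f in enumerate(row):
            vectors[j].append(tfidf_weight(f, N, n))
    return vectors

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Term-by-document counts from the slide (rows: Army ... contract).
counts = [
    [1, 0, 0], [1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 0],
    [1, 0, 0], [1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1],
    [0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1],
]

vectors = tfidf_vectors(counts)
dist = [[euclidean(u, v) for v in vectors] for u in vectors]
for row in dist:
    print(" ".join(f"{d:6.3f}" for d in row))
```

Note that a term occurring in every document (here "Sensor") gets weight zero, and that adding or removing a document changes N and n and so changes every weight.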
Issues (1)
Analysts are currently overwhelmed by the volume of information streams generated every day.
Research in cluster analysis mainly focuses on how to quickly and accurately cluster static data collections.
Research on clustering the dynamic information stream is limited.
Solution: Bio-inspired Clustering
New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and bee swarms, can solve problems in dynamic environments.
These algorithms are characterized by the interaction of a large number of agents that follow the same rules.
The bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.
Data Clustering by Ant Clustering Algorithm
Deneubourg proposed the first clustering solutions inspired by ant colonies in 1991.
Agent (ant) action rules: agents move randomly on the grid. Agents only recognize objects immediately in front of them. An agent picks up or drops an item based on a pick-up probability and a drop probability.
The movement of data objects has to be implemented through the movements of a small number of ant agents, which slows down the clustering.
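The slide names a pick-up probability and a drop probability without giving their form. The sketch below uses the form commonly attributed to Deneubourg et al., in which both depend on f, the fraction of nearby cells holding similar items. Treat it as an assumption rather than the exact rule used in this work; the constants `k1` and `k2` are illustrative values only.

```python
def pick_up_probability(f, k1=0.1):
    """Probability that an unloaded ant picks up an item.

    f is the perceived fraction of similar items nearby (0..1).
    High when the item is isolated (f near 0), low inside a cluster.
    """
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.3):
    """Probability that a loaded ant drops its item.

    Low in empty regions (f near 0), high next to similar items.
    """
    return (f / (k2 + f)) ** 2

# Compare an isolated item, a sparse neighbourhood, and a dense cluster.
for f in (0.0, 0.2, 0.8):
    print(f, round(pick_up_probability(f), 3), round(drop_probability(f), 3))
```

Because an item only moves while an ant is carrying it, the number of ants bounds how many items can be relocated per step, which is the speed limitation noted on the slide.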
A New Clustering Algorithm Based on Bird Flock Collective Behavior
Trivial Behavior Emergent behavior = flocking
Flocking Model
Flocking model, one of the first bio-inspired computational collective behavior models, was first proposed by Craig Reynolds in 1987.

Alignment: steer towards the average heading of the local flock mates
$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow v_{ar} = \frac{1}{n} \sum_{x}^{n} v_x$

Separation: steer to avoid crowding flock mates
$d(P_x, P_b) \le d_2 \Rightarrow v_{sr} = \sum_{x}^{n} \frac{v_x + v_b}{d(P_x, P_b)}$

Cohesion: steer towards the average position of local flock mates
$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow v_{cr} = \sum_{x}^{n} (P_x - P_b)$
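The three rules can be sketched in code for a single boid b and its neighbours x. This is a minimal illustration, assuming d_1 is the sensing range, d_2 is the minimum comfortable distance, positions and velocities are 2D tuples, and neighbours at distance zero are ignored to avoid division by zero. The dictionary layout and the name `flocking_velocities` are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def flocking_velocities(b, others, d1, d2):
    """Return (v_ar, v_sr, v_cr) for boid b given the other boids.

    Alignment and cohesion use neighbours with d2 <= d <= d1;
    separation uses neighbours with 0 < d <= d2.
    """
    v_ar, v_sr, v_cr = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    n_near = 0
    for x in others:
        d = distance(x["pos"], b["pos"])
        if d2 <= d <= d1:
            n_near += 1
            for k in (0, 1):
                v_ar[k] += x["vel"][k]                  # summed, averaged below
                v_cr[k] += x["pos"][k] - b["pos"][k]    # pull toward neighbour
        if 0 < d <= d2:
            for k in (0, 1):
                v_sr[k] += (x["vel"][k] + b["vel"][k]) / d
    if n_near:
        v_ar = [c / n_near for c in v_ar]
    return tuple(v_ar), tuple(v_sr), tuple(v_cr)

b = {"pos": (0.0, 0.0), "vel": (1.0, 0.0)}
others = [
    {"pos": (3.0, 0.0), "vel": (0.0, 1.0)},
    {"pos": (0.0, 4.0), "vel": (0.0, 1.0)},
    {"pos": (0.5, 0.0), "vel": (1.0, 0.0)},
]
v_ar, v_sr, v_cr = flocking_velocities(b, others, d1=5.0, d2=1.0)
print(v_ar, v_sr, v_cr)
```

How the three components are weighted and combined into the boid's next velocity is not specified on the slide, so the sketch stops at the individual terms.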
Flocking Demo
Multiple Species Flocking (MSF) Model
Feature similarity rule: Steer away from other birds that have dissimilar features and stay close to those birds that have similar features.
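One way to picture the feature similarity rule is as an extra steering term whose sign depends on how similar two birds' feature vectors are. The sketch below is an illustration only, not the formula used in the MSF model: it assumes cosine similarity as the feature measure, a hypothetical `threshold` separating "similar" from "dissimilar", and a simple linear attraction or repulsion.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_steer(b_pos, b_feat, x_pos, x_feat, threshold=0.5):
    """Steering contribution of neighbour x on bird b.

    Similar neighbours attract (vector points toward x);
    dissimilar neighbours repel (vector points away from x).
    """
    s = cosine_similarity(b_feat, x_feat)
    dx, dy = x_pos[0] - b_pos[0], x_pos[1] - b_pos[1]
    if s >= threshold:
        return (s * dx, s * dy)
    return (-(1.0 - s) * dx, -(1.0 - s) * dy)

# A neighbour to the right of b: same features attract, different features repel.
print(similarity_steer((0, 0), [1, 0], (2, 0), [1, 0]))
print(similarity_steer((0, 0), [1, 0], (2, 0), [0, 1]))
```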
Issues (2)
The document set must be known before the VSM can be calculated
Every document added to or removed from the set requires recalculation of the entire VSM
TFIDF is not practical for dynamic data
  Requires sequential processing
  Not good for a distributed agent approach
Inverse Corpus Frequency
$W_{ij} = \log_2(f_{ij} + 1) \times \log_2\left(\frac{C + 1}{c + 1}\right)$

[Figure: Unique Term Count (0 to 250,000) versus Number of Documents (K) (5 to 905)]
• Look at the forest, not the trees
We analyzed nearly 1 million documents from 6 major research corpora
We found 229,023 unique terms (A large dictionary contains around 70,000 terms)
We use this term frequency distribution as our “global” term frequency
Reed, Jiao, et al., “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams,” The Fifth International Conference on Machine Learning and Applications (2006), to appear
Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland
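Because the inverse corpus frequency depends only on a fixed reference corpus, a document's vector can be computed from that document alone. The sketch below illustrates this, assuming C is the number of documents in the reference corpus and c is the number of those documents containing the term. The corpus statistics in `corpus_doc_freq` are made-up values for illustration, not figures from the study.

```python
import math

def tficf_weight(f_ij, C, c):
    """W_ij = log2(f_ij + 1) * log2((C + 1) / (c + 1))."""
    return math.log2(f_ij + 1) * math.log2((C + 1) / (c + 1))

def tficf_vector(doc_terms, corpus_doc_freq, C):
    """Weight one document's terms using only precomputed corpus statistics.

    doc_terms: list of (already pre-processed) terms in the document.
    corpus_doc_freq: term -> number of corpus documents containing it.
    Terms unseen in the corpus get c = 0, which the +1 smoothing handles.
    """
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1
    return {term: tficf_weight(f, C, corpus_doc_freq.get(term, 0))
            for term, f in counts.items()}

# Hypothetical corpus statistics (illustrative numbers only).
C = 1023
corpus_doc_freq = {"sensor": 127, "technology": 511, "defense": 63}

vector = tficf_vector(["ornl", "develop", "sensor", "technology",
                       "homeland", "defense"], corpus_doc_freq, C)
for term, weight in vector.items():
    print(f"{term:<12}{weight:.3f}")
```

No other document in the stream is consulted, so each agent can vectorize its own document independently of the others.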
Why this matters
We can now generate an accurate vector directly from a text document
That vector can be generated wherever the document resides
We can now use agents to create vectors from documents over a broad range of computers
Multiple Species Flocking (MSF) Document Clustering
Each document is projected as a bird in a 2D virtual space.
Birds that have similar document vector features (analogous to birds of the same species and colony in nature) automatically group together and become a flock.
Other birds that have different document vector features will stay away from this flock.
MSF Document Clustering Demo
The Document Collection Dataset

No.   Category/Topic                      Number of articles
1     Airline Safety                      10
2     China and Spy Plane and Captives    4
3     Hoof and Mouth Disease              9
4     Amphetamine                         10
5     Iran Nuclear                        16
6     N. Korea and Nuclear Capability     5
7     Mortgage Rates                      8
8     Ocean and Pollution                 10
9     Saddam Hussein and WMD              10
10    Storm Irene                         22
11    Volcano                             8
Performance Results of MSF, K-means and Ant Clustering Algorithm
The clustering results of the K-means, Ant clustering and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations

Dataset                     Algorithm   Average cluster number   Average F-measure value
Synthetic Dataset           MSF         4                        0.9997
                            K-means     (4)***                   0.9879
                            Ant         4                        0.9823
Real Document Collection    MSF         9.105                    0.7913
                            K-means     (11)***                  0.5632
                            Ant         1                        0.1623

* Four data types, each including 200 two-dimensional (x, y) data objects. x and y follow a normal distribution.
** 112 news article dataset, 11 categories
*** The k-means algorithm has pre-knowledge of the cluster number.

Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9, pp. 505-515, August 2006, ISSN: 1383-7621
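The slide reports F-measure values without stating which variant was used. A common definition for clustering matches each true category to its best-scoring cluster and weights the result by category size; the sketch below implements that definition as an assumption. The function name and the toy labels are illustrative.

```python
from collections import Counter

def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure of a clustering.

    For class i and cluster j: precision = n_ij / n_j, recall = n_ij / n_i,
    F(i, j) = 2PR / (P + R). The overall score is sum_i (n_i / n) * max_j F(i, j).
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

truth = ["a", "a", "b", "b"]
print(clustering_f_measure(truth, [1, 1, 2, 2]))  # every class recovered exactly
print(clustering_f_measure(truth, [1, 1, 1, 1]))  # everything lumped into one cluster
```

Under this definition, lumping all items into a single cluster keeps recall perfect but lowers precision, so the score falls as the number of merged categories grows.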
MSF Clustering Algorithm for Information Stream
The MSF clustering algorithm can achieve better performance in document clustering than the K-means and the Ant clustering algorithm.
This algorithm can continually refine the clustering result and quickly react to changes in individual data items. This characteristic makes the algorithm suitable for clustering dynamically changing document information, such as a text information stream.
Multi-Agent Document Clustering Implementation
JADE platform (http://jade.tilab.com/)
Linux cluster machine
  One main node and three client nodes, connected through a Gigabit Ethernet switch.
  Each node contains a single 2.4 GHz Intel Pentium IV processor and 512 MB of memory.
Document datasets are derived from TREC collections.
  TREC: Text REtrieval Conference (http://trec.nist.gov/)
Current and Future Works
Switched agent platform from JADE to our light agent platform (ORMAC).
Built a control agent that automatically generates and deploys flock agents on all available nodes of a 135-node cluster.
Built agents that monitor news updates on several popular Internet news websites, collect the news, and feed it into the system in real time.
Building a better GUI
Conclusion
The heuristic search mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.
The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.
The agent architecture provides an analysis approach that can run on cluster computers.
Thank you!
The architectures: the central model and the distributed model
[Figure: the distributed model. Labels: Head Node, Node1, Node2, Node3, JADE main container, JADE container, JADE system agents, location proxy agents, boid agents]
[Figure: the single processor model. Labels: Head Node, Node1, …, JADE main container, JADE container, JADE system agents, location proxy agent, boid agents]