Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University
MMS: Goal
Motivation: monitoring email traffic, news, communiqués, faxes, voice intercepts (with speech recognition)
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events
MMS: Overall Objectives
• Synergistic improvements in:
- Performance in terms of space, time, effectiveness, and/or "insight"
- Understanding the tradeoffs among these types of improvements
- Compression for efficient resource use
- Representation that aids fitting models
- Efficient matching of text to text and model to text
- Learning models from data and prior knowledge
- Reduction in the need for large amounts of training data or labor-intensive input
- Fusion of complementary filtering approaches
MMS: Approaches
• Emphasis on "supervised" filtering:
- Given example documents, textbook descriptions, etc., find documents on the same topic in the incoming stream or in past data
• Less emphasis on "unsupervised" event identification:
- Detect emergent characteristics, anomalous patterns, etc., in the incoming stream of text or in historical statistics on the stream
MMS Approaches: Supervised Filtering
• Batch filtering: all training texts are processed before any texts of active interest to the user
• Adaptive filtering: the user trains the system during use
- The value of examples for both information and training must be considered
MMS Approaches: Dealing with Massive Data
• Creating summary statistics on massive data streams
- Detect outliers, heavy hitters (most frequent items), etc. (a background sketch follows this slide)
- Allow us to return to the past without keeping the raw data
• Reducing the need for labeled training examples in supervised classification
- Bayesian priors from domain knowledge
- Tuning on unlabeled data
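For background on the heavy-hitters problem (the project's own sketch-based tools appear later in this deck), here is a minimal sketch of the classic Misra-Gries frequent-items algorithm, which in one pass, using only k-1 counters, returns a superset of all items occurring more than n/k times in a stream of length n:

# Background sketch (not the project's method): Misra-Gries frequent items.
# One pass, k-1 counters instead of one counter per distinct item; the
# returned counts are underestimates, and every item with true frequency
# greater than n/k is guaranteed to survive in the counter set.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "a", "d"], k=3))  # {'a': 3, 'd': 1}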
Accomplishments Phase II (Jan '04 – Sep '04): Bayesian Logistic Regression
• Using sparseness-favoring priors, our methods have produced outstanding accuracy and fast predictions with no ad hoc feature selection (illustrative sketch below)
• State-of-the-art text classification effectiveness
- Recently: highest score on the TREC 2004 triage task
• Public release of our Bayesian Binary Regression (BBR) software (500 downloads)
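BBR itself is not reproduced here. As an illustrative sketch, MAP estimation under a sparseness-favoring Laplace prior is equivalent to L1-penalized logistic regression, which drives most coefficients exactly to zero, hence no separate ad hoc feature selection step. The example below uses scikit-learn on toy data; the documents, labels, and C value are assumptions for illustration:

# Sketch: L1-penalized logistic regression = MAP under a Laplace prior.
# Illustrative only; this uses scikit-learn, not the BBR package itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["oil prices rise", "election results announced", "oil pipeline deal"]
labels = [1, 0, 1]  # 1 = on-topic, 0 = off-topic (toy data)

X = TfidfVectorizer().fit_transform(docs)
# penalty="l1" corresponds to the Laplace prior; C controls its scale.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)
print(clf.coef_)  # sparse: most weights are exactly zero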
Accomplishments Phase II (Jan '04 – Sep '04): Bayesian Logistic Regression (cont'd)
• The ability to use domain knowledge to set prior distributions led to large improvements in effectiveness when little training data is available
• New online algorithms: online updating of Bayesian models as new data become available (illustrative sketch below)
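The project's online algorithms are not reproduced here. As a minimal illustration of online updating, the sketch below takes one stochastic-gradient step per arriving document on the L2-regularized logistic loss (the MAP objective under a Gaussian prior); the learning rate and regularization strength are assumed values:

# Minimal illustration of online model updating (not the project's algorithm):
# one gradient step per arriving document on regularized logistic loss.
import math

def online_step(w, x, y, lr=0.1, lam=0.01):
    """w: {feature: weight}, x: {feature: value}, y: +1/-1 label."""
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    g = -y / (1.0 + math.exp(margin))  # derivative of log(1 + exp(-y*score))
    for f, v in x.items():
        w[f] = w.get(f, 0.0) - lr * (g * v + lam * w.get(f, 0.0))
    return w

w = {}
w = online_step(w, {"oil": 1.0, "price": 1.0}, +1)   # new relevant document
w = online_step(w, {"election": 1.0}, -1)            # new irrelevant document
print(w)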
Accomplishments Phase II (Jan '04 – Sep '04): Streaming Algorithms
• New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams
• Rapid methods for finding changing trends, outliers and deviants, rare events, and heavy hitters
• Initial results using summarized data to search for meaningful answers to queries about the past
• Initial work on textual and structural patterns in informal communication networks
Accomplishments Phase II (Jan '04 – Sep '04): Nearest-Neighbor Classification, Fast Implementation
• Continued development of heuristics for approximate neighbor finding with an in-memory inverted index (sketch below)
• Our results have reduced memory by 90% and time by 90 to 99% with minimal impact on effectiveness
• Packaged and delivered kNN software
• Developing algorithms for speeding up a slow but potentially highly effective "local learning" approach
- Based on training a separate logistic regression on the neighbors of each test document
- Slow, but with many avenues to large speedups
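A minimal sketch of the inverted-file idea behind fast text kNN (the project's specific heuristics and memory optimizations are not reproduced): only documents sharing at least one term with the query are ever touched, because scoring walks the postings lists of the query's terms. Dot-product similarity on sparse vectors is an assumption for illustration:

# Sketch of kNN scoring with an in-memory inverted index (illustrative).
from collections import defaultdict
import heapq

def build_index(docs):
    """docs: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, wt in vec.items():
            index[term].append((doc_id, wt))
    return index

def knn(index, query_vec, k):
    scores = defaultdict(float)
    for term, qwt in query_vec.items():       # only postings for query terms
        for doc_id, dwt in index.get(term, []):
            scores[doc_id] += qwt * dwt
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

docs = {1: {"oil": 0.8, "price": 0.5}, 2: {"election": 0.9}, 3: {"oil": 0.7}}
print(knn(build_index(docs), {"oil": 1.0}, k=2))  # [(1, 0.8), (3, 0.7)]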
Accomplishments Phase II (Jan '04 – Sep '04): Adaptive Filtering
• Models to aid in learning: when to act greedily ("exploit": submit documents we believe are relevant) and when to take risks ("explore": submit documents that may be irrelevant); see the sketch below
• Seek approximate solutions to the intractable optimal exploration/exploitation tradeoff
• Experiments show slight improvements in filtering effectiveness compared to the greedy (exploit-only) approach
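A minimal epsilon-greedy sketch of the explore/exploit decision (an illustrative approximation, not the project's model; the threshold and epsilon values are assumptions):

# Epsilon-greedy explore/exploit sketch for adaptive filtering (illustrative).
import random

def decide(p_relevant, threshold=0.5, epsilon=0.1):
    if p_relevant >= threshold:
        return True                     # exploit: likely relevant, submit it
    return random.random() < epsilon    # explore: occasionally risk a miss
                                        # to gain a training judgment

random.seed(0)
print([decide(p) for p in (0.9, 0.3, 0.3, 0.7)])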
Some MMS Work in Depth: Bayesian Priors from Domain Knowledge
• Bayesian methods assume prior beliefs about parameters before the data are seen
- Project Phase I: generic, vague priors
- Project Phase II: reference materials or intuitions about words may help predict class; use these to set priors (material that is very unlike the training examples)
• Goal: reduce the need for training examples
- Replace thousands of randomly sampled examples with a few, possibly biased, examples
Knowledge-Driven Priors: Issues
• Reference texts contain some non-topical words
- Use words that discriminate among topics: inverse document frequency (IDF) weighting within the reference collection (sketch below)
• Small training sets increase problems with thresholding and text representation
- Use unlabeled data to aid thresholding and to learn IDF weights
- Use a separate prior for the intercept term of the model
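A minimal sketch of the general idea (BBR's actual prior construction is not reproduced; the scoring rule and scale below are illustrative assumptions): weight each word of a topic's reference text by IDF computed within the reference collection, so that words common to every reference entry contribute nothing, and use the results as informative prior means:

# Illustrative sketch: reference text -> per-word prior means, with IDF
# computed within the reference collection so non-topical words (common
# across all entries) get little or no weight. prior_scale is an assumption.
import math
from collections import Counter

def knowledge_priors(topic_ref, all_refs, prior_scale=1.0):
    n = len(all_refs)
    df = Counter()
    for ref in all_refs:
        df.update(set(ref.split()))
    tf = Counter(topic_ref.split())
    # Words appearing in every reference entry (df == n) are dropped entirely.
    return {w: prior_scale * tf[w] * math.log(n / df[w]) for w in tf if df[w] < n}

refs = ["france paris wine europe", "japan tokyo asia", "france europe union"]
print(knowledge_priors(refs[0], refs))
# "paris"/"wine" (df=1) get strong prior means; "france"/"europe" (df=2) weaker.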
Knowledge-Driven Priors: Results
• Topics: 27 Reuters region categories
• Knowledge: CIA World Factbook (WFB) entries
• Examples: 10 per topic
• Baseline results (F1 measure): WFB 0.234, no WFB 0.052
• With better small training sets and improved algorithms: WFB 0.591, no WFB 0.395
Knowledge-Driven Priors: Summary
• Reference materials of a text type very different from the documents to be classified can aid supervised filtering
- In combination with tuning on unlabeled data, this technique can provide immediate practical benefits
• Current methods are crude and ad hoc; substantial improvements should be possible
Some MMS Work in Depth: Streaming Analysis
• Problem: monitor fast, massive text streams and support both online tracking and historic analysis of events
• Multidimensional data: source, destination, time sent or received, metadata (reply, language), text labels (words, phrases), links
• Goal: use highly compact summaries that are computed at stream speed and support accurate analyses
Streaming Analysis Tool: CM Sketch
• Theoretical: we have developed the CM Sketch, which uses O((1/ε) log(1/δ)) space to approximate a data distribution with error at most ε and probability of success at least 1 − δ
- All other previously known sampling or sketch methods use space at least Ω(1/ε²)
- The CM Sketch is an order of magnitude better
• Practical: a few tens of KBs give an accurate summary of large data; we create summaries that allow historic queries to find (code sketch below):
- Heavy hitters (most frequent items)
- Quantiles of a distribution (median, percentiles, etc.)
- Items with large changes
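A minimal Count-Min Sketch in the spirit described above, with width and depth set from the standard bounds w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, so point estimates have additive error at most εN with probability at least 1 − δ. The salted-hash construction is an illustrative assumption, not the project's C library:

# Minimal Count-Min Sketch (illustrative; not the delivered C library).
import hashlib
import math

class CountMinSketch:
    def __init__(self, eps=0.01, delta=0.01):
        self.w = math.ceil(math.e / eps)          # columns per row
        self.d = math.ceil(math.log(1 / delta))   # number of hash rows
        self.table = [[0] * self.w for _ in range(self.d)]
        self.n = 0                                # total count seen

    def _bucket(self, item, row):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def update(self, item, count=1):
        self.n += count
        for row in range(self.d):
            self.table[row][self._bucket(item, row)] += count

    def query(self, item):
        # Overestimates only: true count <= estimate <= true count + eps*n (w.h.p.)
        return min(self.table[row][self._bucket(item, row)] for row in range(self.d))

# Usage: stream words in, then issue point queries, e.g. to screen heavy hitters.
cms = CountMinSketch()
for word in ["oil", "oil", "election", "oil"]:
    cms.update(word)
print(cms.query("oil"))  # >= 3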
Streaming Analysis: Using Web Logs
• Web logs (blogs), regularly updated online journals, provide informal, opinionated, candid data that is more like email than the web at large
• We have begun to collect blogs automatically, stripping formatting, tags, ads, etc., and feeding the resulting "bag of words" into streaming algorithms for analysis and archiving (tens to 100 GB scale)
• 3000:1 compression using CM Sketch methods
• Allows accurate analysis of popular words, newly emergent words, etc., including multilingual occurrences
Deliverables: Phase I
• Classic method: Rocchio
• Classic method: Centroid
• kNN with IFH (inverted file heuristic)
• Sparse Bayesian (Bayesian with Laplace priors)
• Combinatorial PCA
• Homotopic linking of widely varying Rocchio methods
• aiSVM
• Fusion
Deliverables: Phase II
• Revised and extended version of the kNN code, including scripts for running local-learning experiments
• Substantially extended version of BBR, including use of domain knowledge to set priors
• CM Sketch (C library for count-min sketching)
• Code that uses CM Sketch to find heavy hitters, quantiles, and large changes in streams
MMS: Future Directions (Bayesian)
• Expand the types of domain knowledge that can be used
- For instance, making use of the taxonomies available in many subject areas
• Improve self-tuning of the BBR software
- Make it more effective for novice users
- Surprisingly subtle questions: cross-validation, calibration, scaling (e.g., when there are multiple features)
• Incorporate previous work on online Bayesian methods into BBR
MMS: Future Directions (Streaming)
• Systematically explore summarization methods such as sampling, bitmaps, and sketches
- Develop warehousing techniques for large-scale sketch-based historical analyses
• The massiveness of the data means even linear algorithms are too inefficient; seek sublinear methods
• Develop sketch-based methods for link analysis in temporally changing multigraphs
- From and To addresses in email, links between blogs, etc.
• Add a modeling component to the sketch-based analysis: exploit knowledge of the distribution of the data
MMS: Future Directions (kNN)
• kNN with a small training sample for each of a massive number of topics; perhaps only 5 to 10 known relevant/irrelevant documents per topic
- Since small samples have little overlap, extend the kNN approach to deal with partially labeled datasets
• Bayesian kNN
- Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data)
- More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)
MMS: Future Directions (Greedy Round Robin Feature Selection)
• Phase I work explored a greedy heuristic for choosing a subset of the original terms as features
- It did extremely well on the TREC 2002 "topic intersection" tasks
• We will develop a Greedy Round Robin (GRR) method (sketch below)
- Applies when features fall into two or more "conceptually distinct" sets (e.g., metadata such as source/destination, or the genre or medium of the message)
- Each list of features is consulted in turn
- Plan experimental analysis of GRR using simulation
- Plan theoretical analysis of GRR
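A minimal sketch of a round-robin greedy selection loop consistent with the description above (an illustration, not the planned GRR algorithm; score is an assumed black-box evaluator, e.g. cross-validated F1 of a classifier trained on the candidate subset):

# Sketch of Greedy Round Robin feature selection (illustrative): the feature
# lists ("conceptually distinct" sets) are consulted in turn, and each turn
# greedily takes the best remaining feature from the current list. Whether
# to require strict improvement before accepting is a design choice.
def greedy_round_robin(feature_lists, score, budget):
    selected = []
    lists = [list(fl) for fl in feature_lists]
    turn = 0
    while len(selected) < budget and any(lists):
        pool = lists[turn % len(lists)]
        turn += 1
        if not pool:
            continue
        cand = max(pool, key=lambda f: score(selected + [f]))
        pool.remove(cand)
        selected.append(cand)
    return selected

# Toy demo with a hypothetical coverage-based score.
target = {"oil", "from:analyst"}
score = lambda feats: len(target & set(feats))
print(greedy_round_robin([["oil", "price"], ["from:analyst", "lang:en"]],
                         score, budget=2))  # ['oil', 'from:analyst']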
MMS: Future Directions (Adaptive Filtering)
• Experiment with new adaptive thresholding methods (synergy with the Bayesian thresholding work); a sketch follows below
- The scoring threshold is adjusted upward if too many irrelevant documents are being judged, downward if too few relevant documents are being found
• Aim for an algorithm with state-of-the-art effectiveness and provable theoretical properties
• Compare the rate of convergence of various algorithms on real data
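A minimal sketch of such a thresholding rule (the target precision, step size, and precision-based trigger are illustrative assumptions, not the planned method):

# Adaptive thresholding sketch (illustrative): tighten the score threshold
# when observed precision is low (too many irrelevant docs submitted),
# loosen it otherwise so the filter keeps surfacing relevant documents.
def adapt_threshold(threshold, n_submitted, n_relevant,
                    target_precision=0.6, step=0.02, lo=0.05, hi=0.95):
    precision = n_relevant / max(n_submitted, 1)
    threshold += step if precision < target_precision else -step
    return min(hi, max(lo, threshold))

t = 0.5
t = adapt_threshold(t, n_submitted=10, n_relevant=3)   # precision 0.30 -> raise
t = adapt_threshold(t, n_submitted=12, n_relevant=9)   # precision 0.75 -> lower
print(round(t, 2))  # 0.5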
MMS PROJECT TEAM
Paul Kantor, Rutgers Communication, Information & Library Studies
Dave Lewis, Consultant
Michael Littman, Rutgers CS
David Madigan, Rutgers Statistics
S. Muthukrishnan, Rutgers CS
Rafail Ostrovsky, Telcordia/UCLA
Fred Roberts, Rutgers DIMACS/Math
Martin Strauss, AT&T Labs/U. Michigan
Wen-Hua Ju, Avaya Labs (collaborator)
Andrei Anghelescu, graduate student
Suhrid Balakrishnan, graduate student
Aynur Dayanik, graduate student
Dmitry Fradkin, graduate student
Peng Song, graduate student
Graham Cormode, postdoc
Alex Genkin, software developer
Vladimir Menkov, software developer