Department of Computer Science, University of California, Irvine
KDD Program Review, November 18th 2003
Entity-Based Data Mining from Spatio-Temporal Events and Text Sources
Presentation at KD-D Program Review, Nov 18-19 2003
Padhraic Smyth, Sharad Mehrotra
Information and Computer Science, University of California, Irvine
{smyth, sharad}@ics.uci.edu
www.datalab.uci.edu

Project Participants
• Principal Investigators
– Padhraic Smyth: Data mining
– Sharad Mehrotra: Databases
• Collaborators
– Mark Steyvers: Text and Author Modeling
• Postdoctoral Researchers
– Michal Rosen-Zvi, Dmitri Kalashnikov
• Staff Programmer
– Amnon Meyers: Information Extraction
• Students
– PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid
– Undergraduates: Yan-Biao Boey, Momo Alhazzazi
• Acknowledgements
– Steve Lawrence for CiteSeer data

Problem of Interest
• Intelligence analysis today
– Massive volumes/streams of data
   • Text (newswire, reports, etc.)
   • Web data
   • Transactions/events
• Central problems
– Need flexible tools to support an analyst’s exploration of the data
– Automatically focus an analyst’s attention on interesting parts of the data space
– Need new theories/methods/tools

Entities and Events
• Entities = individuals, groups, communities, organizations, etc.
• Events = contacts, collaborations, meetings, products, etc.
• Working hypothesis
– A large component of intelligence work is centered on entities and events
   • Extracting entity information from text streams and transaction data
   • Predicting entity behavior
   • Detecting groups of related entities
• Our broad goal
– Develop next-generation data management, exploration, and analysis tools for entity-event data

[Figure: graph in which nodes = entities = biotech-related organizations and edges = events = collaborations]
[Figure: red indicates nodes selected by the data analyst as important]
[Figure: the algorithm determines that blue nodes are important relative to the red nodes (Oxford and Cambridge)]

Research Issues
– Information extraction
– Data management tools
– Visualization techniques
– Interactive ad hoc querying and mining
– Statistical modeling of graph data
– Query languages for graphs
– Scalability to large graphs
– ...

Focus of Our Research
[Diagram: Text Sources, Information Extraction, Entity-Event Databases, Statistical Modeling and Data Mining, Visualization, Query Languages, User Modeling]

Major Themes in Our Work
• Focus on data in the form of graphs
– Nodes = entities, edges = events
– Nodes and edges have attributes (e.g., temporal)
– Year 1: entities = computer science researchers
– Year 1: limited spatio-temporal aspects
• Integration and coupling of
– Statistical modeling and data mining
– Visualization
– Query languages and data management
• Scalability
– Methods should scale to millions of nodes and edges
• User Interaction
– Conditional “query-driven” analysis and mining
– Contrast with offline global modeling

Accomplishments
• Infrastructure and Data Sets
– Created testbed data sets, e.g., 100k entities, 400k events
– Developed suite of text information extraction tools
– Developed and released a general public-domain Java API for graph data analysis and visualization
• Statistical Modeling and Data Mining
– Developed new statistical technique for modeling entities based on authored text
– Developed new class of scalable algorithms for interactive graph-based data mining
• Graph-based Querying
– Developed framework for general graph-based query language
– New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs
• Software Tools
– Netsight: Java-based graph visualization and analysis tool
– Browser tool for exploring author-topic models
– Interactive query refinement system
– Prototype system for graph-based query language for interacting with heterogeneous graph data

Publications in Year 1
• Data Mining on Graphs
– S. White and P. Smyth, Algorithms for Discovering Relative Importance in Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003.
– J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003.
– P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, June 2003.
• Statistical Author-Topic Models
– T. Griffiths and M. Steyvers (in press), Finding Scientific Topics, Proceedings of the National Academy of Sciences.
– M. Steyvers, M. Rosen-Zvi, T. Griffiths, and P. Smyth, Author Attribution with LDA, NIPS Workshop on Syntax, Semantics, and Statistics, December 2003.
• Data Management and Graph Querying
– Y. Ma, S. Mehrotra, and D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004.
– Y. Ma, D. Seid, and S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003.
– D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT 2004.
– D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation).
– L. Jin, C. Li, and S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26-28 March 2003, Kyoto, Japan.

Data Sets
Dataset                 Documents       Entities      Extracted Abstracts       Words
CiteSeer                363K            100K          163K                      12M
NSF Abstracts           129K            199K          129K                      10M+
US Comp Science Depts   294 web sites   14K faculty   67K extracted citations   -

Information Extraction
Extractor             Field         Dataset                Upper Bound   Number Extracted   Estimated Accuracy
CSNames               Author        CiteSeer               503K          467K               90-100%
CSWord                Abstract      CiteSeer               363K          163K               65-75%
Publication Crawler   Publication   US CS Dept Web Sites   -             67K                -

Author Database Schema
[Figure: entity-relationship diagram. Labels appearing in the diagram include Author, Paper, Source (ISA: Conference, Journal, Newsletters), Field, Sponsor, Fund, Publish, Publisher, Institute, WebInfo, write, has, Refer (From/To), Parent (From/To), Read, CoRead, and Also Read, with attributes such as AID, PID, SID, FID, IID, PubID, SpID, FstNm, MidNm, LstNm, Position, Description, Title, Keyword, Abs, Full Text Index, Text Index, URL, Date, Type, Location, Name, Vol, Iss, ISSN, Pub_name, PageNoFrm, and PageNoTo]
Note: “individual-centric”, not “document-centric”

[Figure: “9/11 Network” graph]

From graphs to Markov chains
• Importance = recursive function of nodes pointing at you
• Markov approach
– Notion of a “token” circulating around in Markov fashion
– Important actors see the token more often
– Importance = stationary probability of each node
– PageRank: surfer randomly following links on the Web
[Figure: a weighted graph on nodes A, B, C, D (edge weights 3, 4, 2, 2) shown next to the corresponding Markov chain, with transition probabilities 1.0, 0.33, 0.6, 0.33, 0.77, 0.5, 0.4, 0.5]
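
The conversion from a weighted graph to a Markov chain can be sketched as follows. This is a minimal illustration, assuming that each node's outgoing edge weights are normalized to sum to one and that importance is estimated by power iteration; the graph `WEIGHTS` and the function names are hypothetical and are not taken from the slides.

```python
def transition_matrix(weights):
    """Row-normalize edge weights into transition probabilities."""
    chain = {}
    for u, nbrs in weights.items():
        total = sum(nbrs.values())
        chain[u] = {v: w / total for v, w in nbrs.items()}
    return chain


def stationary(chain, iters=200):
    """Estimate the stationary distribution by repeated Markov steps."""
    nodes = list(chain)
    pi = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            for v, p in chain[u].items():
                nxt[v] += pi[u] * p  # the "token" moves from u to v
        pi = nxt
    return pi


# Hypothetical weighted, directed graph (not the one drawn on the slide).
WEIGHTS = {
    "A": {"B": 3, "C": 1},
    "B": {"A": 2, "C": 2},
    "C": {"A": 4},
}
P = transition_matrix(WEIGHTS)
pi = stationary(P)
```

In this toy chain every node has at least one outgoing edge and the chain is irreducible and aperiodic, so the iteration converges; a node with no outgoing edges would need separate handling.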

[Figure: example graph with nodes A, B, C, D, E, F, G]
Relative importance of node V to A: trade off distance from A against the structural importance of V.
Add backlinks to A with some probability (e.g., 0.3).

Algorithms for Relative Importance
(S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted)
• PageRank with Priors (PRankP)
– Random walks that start from A and return to A periodically
– Relative importance = stationary probability
– Iterative algorithm (e.g., Haveliwala, 2002)
• HITS with priors
– Formulate HITS as Markov chain, same idea
• K-Step Markov
– Use the transient probability distribution starting from A
– Faster than stationary probability methods
• Weighted Paths
– Heuristic approximation to K-step Markov: even faster
• All algorithms scale linearly in number of edges
– Different constant factors
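
Two of the algorithms listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the walk returns to the prior (root) distribution with a fixed back probability (0.3, as in the backlink example), and the K-step score is taken as the average of the first K transient distributions. The graph `PATH`, the prior `PRIOR`, and all function names are hypothetical.

```python
def step(chain, dist):
    """One Markov step: push probability mass along outgoing edges."""
    nxt = {u: 0.0 for u in chain}
    for u, mass in dist.items():
        for v, p in chain[u].items():
            nxt[v] += mass * p
    return nxt


def pagerank_with_priors(chain, prior, back_prob=0.3, iters=100):
    """Random walk that jumps back to the prior with probability back_prob."""
    rank = dict(prior)
    for _ in range(iters):
        walked = step(chain, rank)
        rank = {u: (1 - back_prob) * walked[u] + back_prob * prior[u]
                for u in chain}
    return rank


def k_step_markov(chain, prior, k=6):
    """Average of the transient distributions after 1..k steps from the prior."""
    dist = dict(prior)
    total = {u: 0.0 for u in chain}
    for _ in range(k):
        dist = step(chain, dist)
        for u in chain:
            total[u] += dist[u]
    return {u: total[u] / k for u in chain}


# Hypothetical undirected path A - B - C - D with uniform transitions.
PATH = {
    "A": {"B": 1.0},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"B": 0.5, "D": 0.5},
    "D": {"C": 1.0},
}
PRIOR = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}  # root set = {A}

ranks = pagerank_with_priors(PATH, PRIOR)
ksm = k_step_markov(PATH, PRIOR)
```

Each iteration touches every edge once, which matches the linear scaling in the number of edges noted above; the K-step variant stops after a fixed K passes instead of iterating to convergence.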

Computation Times for Ranking Algorithms (in seconds)
Data        Nodes   Edges   WeightedPaths   KStepMarkov (K=6)   PRankWithPriors   HITSWithPriors
Terrorist   63      308     0.01            0.28                1.17              0.57
Biotech     3k      13k     0.02            0.39                3.45              3.64
Author1     30k     88k     0.05            1.11                10.80             11.30
Author2     30k     88k     0.04            1.55                17.06             17.99
PRankP and HITS converged in 20-30 iterations

Weighted versus Unweighted Graphs
Rank   PRankP on Unweighted Graph   PRankP on Weighted Graph
1      Thrun                        Thrun
2      Fisher                       McCallum
3      Kononenko                    Nigam
4      Dzeroski                     Freitag
5      Freitag                      Blum
6      Bratko                       Slattery
7      Cheng                        Joachims
8      McDermott                    Fox

Visualization and Analysis Software

JUNG: Java Universal Network/Graph Framework
http://jung.sourceforge.net
16,000 page visits and 800 downloads since August

Demo of Netsight software

Entity Models from Text Data

[Figure: authors A1, A2, ..., Ak linked to words w1, w2, w3, ..., wN]
Can we model authors, given documents?
(More generally, build statistical profiles of entities given sparse observed data.)

[Figure: authors A1, A2, ..., Ak linked to hidden topics T1, T2, which are linked to words w1, w2, w3, ..., wN]
Model = Author-Topic distributions + Topic-Word distributions
Parameters learned via Bayesian learning

“Topic Model”:
– A document can be generated from multiple topics
– Hofmann (SIGIR ’99); Blei, Jordan, Ng (JMLR, 2003)
[Figure: hidden topics T1, T2 linked to words w1, w2, w3, ..., wN]

[Figure: authors A1, A2, ..., Ak linked to hidden topics T1, T2, which are linked to words w1, w2, w3, ..., wN]
Model = Author-Topic distributions + Topic-Word distributions
Note: documents can be composed of multiple topics
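
The two sets of distributions can be read as a generative recipe for a document. The sketch below assumes the usual author-topic formulation, in which each word is attributed to one of the document's authors chosen uniformly at random; the toy tables `AUTHOR_TOPIC` and `TOPIC_WORD` and the function name are hypothetical.

```python
import random

# Hypothetical author-topic distributions: P(topic | author).
AUTHOR_TOPIC = {
    "A1": {"T1": 0.9, "T2": 0.1},
    "A2": {"T1": 0.2, "T2": 0.8},
}
# Hypothetical topic-word distributions: P(word | topic).
TOPIC_WORD = {
    "T1": {"bayesian": 0.5, "inference": 0.5},
    "T2": {"query": 0.6, "index": 0.4},
}


def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Draw (author, topic, word) triples for one document."""
    triples = []
    for _ in range(n_words):
        a = rng.choice(authors)  # assumed: uniform over the document's authors
        topics, t_probs = zip(*author_topic[a].items())
        z = rng.choices(topics, weights=t_probs)[0]
        vocab, w_probs = zip(*topic_word[z].items())
        w = rng.choices(vocab, weights=w_probs)[0]
        triples.append((a, z, w))
    return triples


doc = generate_document(["A1", "A2"], AUTHOR_TOPIC, TOPIC_WORD, 50,
                        random.Random(0))
```

Because a topic is drawn separately for every word, a single document mixes several topics, which is the point of the note above. Learning reverses this process: only the words and author lists are observed, and the two tables are inferred.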

Author Modeling Data Sets
Source     Documents   Unique Authors   Unique Words   Total Word Count
CiteSeer   163,389     85,465           30,799         11.7 million
CORA       13,643      11,427           11,101         1.2 million
NIPS       1,740       2,037            13,649         2.3 million
Topic Models from CiteSeer
WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief……
AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern….
WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback….
AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst,….
WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation..
AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R_Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella,….
WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets….
AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H Liu, L. Holder, H. Toivonen…

Author-Topic Models from CiteSeer
• Author = A McCallum:
– Topic 1: classification, training, generalization, decision, data, ...
– Topic 2: learning, machine, examples, reinforcement, inductive, ...
– Topic 3: retrieval, text, document, information, content, ...
• Author = H Garcia-Molina:
– Topic 1: query, index, data, join, processing, aggregate, ...
– Topic 2: transaction, concurrency, copy, permission, distributed, ...
– Topic 3: source, separation, paper, heterogeneous, merging, ...
• Author = P Cohen:
– Topic 1: agent, multi, coordination, autonomous, intelligent, ...
– Topic 2: planning, action, goal, world, execution, situation, ...
– Topic 3: human, interaction, people, cognitive, social, natural, ...

Author-Topic Browser
• Interesting scalability issues
– CiteSeer model exceeds 1 Gbyte
– Real-time query answering demands Gibbs sampling (not well suited to SQL!)
• Solution
– Coupling of Gibbs sampling and relational DB (it works!)
[Diagram: Original Text + Statistical Model, Java Query GUI, MySQL DB, SQL Interface, Bayesian Sampling]

Demo of Author-Topic Browser
• Note
– Real-time querying on CiteSeer authors/documents
   • 85,000 authors
   • 163,000 documents
   • 30,000 unique words
   • 300 topics
– Can query on authors, topics, words, documents
– Topic distribution given documents/words requires sampling to estimate
   • Gibbs sampling is fast enough to answer queries in real time

Applications of Author-Topic Models
• “Expert Finder”
– “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
– “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
• Prediction
– Given a document and some subset of known authors for the paper (k=0,1,2…), predict the other authors
– Predict how many papers in different topics will appear next year
• Change Detection/Monitoring
– Which authors are on the leading edge of new topics?
– Characterize the “topic trajectory” of this author over time

[Figure: Document and Word Distribution by Year in the UCI CiteSeer Data. x-axis: Year (1986 to 2002); left y-axis: Number of Documents (0 to 2 x 10^4); right y-axis: Number of Words (0 to 14 x 10^5)]

Rise in Web, Mobile, Java
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (0 to 0.012). Topics plotted:
 7: web, user, world, wide, users
 80: mobile, wireless, devices, mobility, ad
 76: java, remote, interface, platform, implementation
 275: multicast, multimedia, media, delivery, applications]

Rise of Machine Learning
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 to 8 x 10^-3). Topics plotted:
 114: regression, variance, estimator, estimators, bias
 153: classification, training, classifier, classifiers, generalization
 205: data, mining, attributes, discovery, association]

Bayes lives on...
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1.5 to 5.5 x 10^-3). Topics plotted:
 189: statistical, prediction, correlation, predict, statistics
 209: probabilistic, bayesian, probability, carlo, monte
 276: random, distribution, probability, markov, distributions]

Decline in Languages, OS, ...
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (2 to 11 x 10^-3). Topics plotted:
 60: programming, language, concurrent, languages, implementation
 139: system, operating, file, systems, kernel
 283: collection, memory, persistent, garbage, stack
 268: memory, cache, shared, access, performance]

Decline in CS Theory, ...
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (2 to 14 x 10^-3). Topics plotted:
 111: proof, theorem, proofs, proving, prover
 156: polynomial, complexity, np, complete, hard
 226: language, languages, semantics, syntax, constructs
 235: logic, semantics, reasoning, logical, logics
 252: computation, computing, complexity, compute, computations]

Trends in Database Research
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 to 9 x 10^-3). Topics plotted:
 205: data, mining, attributes, discovery, association
 261: transaction, transactions, concurrency, copy, copies
 198: server, client, servers, clients, caching
 82: library, access, digital, libraries, core]

Trends in NLP and IR
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (2 to 8 x 10^-3). Topics plotted:
 280 (NLP): language, semantic, natural, linguistic, grammar
 289 (IR): retrieval, text, documents, information, document]

Security Research Reborn...
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 to 9 x 10^-3). Topics plotted:
 120: security, secure, access, key, authentication
 240: key, attack, encryption, hash, keys]

(Not so) Hot Topics
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 to 7 x 10^-3). Topics plotted:
 23 (Neural Networks): neural, networks, network, training, learning
 35 (Wavelets): wavelet, operator, operators, basis, coefficients
 242 (GAs): genetic, evolutionary, evolution, population, ga]

Decline in use of Greek Letters
[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1.5 to 5 x 10^-3). Topic plotted:
 157: gamma, delta, ff, omega, oe]

Graph-based Query Refinement and Query Languages

Heterogeneous Event-Entity Querying
• Problem
– Most existing graph/link mining approaches assume single node types (e.g., people, documents) and restricted link types (e.g., collaboration, HTML links)
• Solution
– Single framework that enables analysts to mine heterogeneous event-entity data

Supporting Exploratory Event-Entity Graph Analysis
Example tasks
• Influence/dependence analysis
• Prediction of links between entity type 1 and entity type 2, given their relation to entity 3
• Compute strength of relationship between a given pair of individuals or groups with varying edge and node types
Our approach, given the overall schema and graph data:
• Subschema selection
• Subgraph selection (data filtering)
• Decoration of data graph nodes and edges
• Structural grouping and aggregation
– May also involve aggregation of decoration values
• Progressive/interactive refinement

The GrAQ System (built using JUNG library)

Status of Work
• Achievements
– Query language for interactive graph analysis
– Aggregation operators for graph data analysis
– Similarity predicates and ranking for analysis involving imprecise matching
– Integration of concept hierarchies in graph data analysis
– System development over a commercial ORDBMS
• Future Work
– Model and language extensions to support spatio-temporal graph analysis
– Efficient support for graph analysis queries
   • Graph indexing strategies
   • Query processing and optimization
– Integration of feedback-based query refinement in graph analysis queries

Interactive Querying and Refinement
• Relevance-based retrieval
– Queries approximately capture user’s information need
– Ranked retrieval based on relevance of object to query
• Query Refinement
– Customization based on user’s subjectivity, information need, and preferences
• Existing Search Technologies
– Database systems: do not support relevance-based retrieval (only exact search)
– IR systems: support a (limited) form of similarity retrieval but are limited to textual data

Q: Start with 3 universities and my research interests, retrieve important information about authors.
SELECT author FROM db WHERE (Inst='Stanford' OR Inst='UCI' OR Inst='UCLA') AND Sim_area('AI', 'Database')
[Figure: interactive refinement loop. The initial query is set up and passed to the refinement engine; results are retrieved either as a ranked list or as nodes in a graph; user feedback on the results goes back to the refinement engine. The graph connects entities (authors: Jeff Ullman, Hector, R. Agrawal, Michael Pazzani, Padhraic Smyth, Richard Korf) through the relation Write to events (papers), with institutions (Stanford, UCI, UCLA, IBM) and areas (AI, Database, IR, Data Mining)]

Similarity Queries in SQL are Complex!

Evaluation of Query Refinement
• Tested on multiple real data sets
• Average precision on 400 queries over 4 refinements
• The new methods outperform existing methods
– Substantially fewer iterations required

Other work in progress
• Edge prediction in graphs
– Given a graph with attributes on nodes and edges
– Assume some edges are missing (or remove them)
– Predict the probability of edge(i,j)
   • E.g., what is the likelihood that A and B have interacted given everything else we know, or that they will interact within the next 6 months?
– Note: “runtime” querying, avoid O(N^2) complexity
• Data cleaning
– Multiple names for a single entity
– Multiple entities mapped to the same name, e.g., J_Wang
   • How many unique P_Smyths are there?
– Use heterogeneous data sources and probabilistic models to iteratively produce “consistent” data
   • E.g., combine CiteSeer, Web information, topic models, institution, etc.

Conclusions

Summary of Accomplishments
• Infrastructure
– Developed entity-event testbed data sets and IE tools
– Released JUNG API for graph data analysis and visualization
• Graph Data Analysis/Querying Research
– Novel author-topic models
– New class of “relative importance” algorithms
– Efficient similarity query refinement system
– New general framework for graph schemas
• Software
– Netsight
– Topic-Author Browser
– Interactive query refinement system
– Prototype graph-based DB language system

What’s ready for the KD-D TestBed?
• Netsight
– Built on JUNG API
– Can handle any standard network data set
– Supports both visualization and analysis
   • Relative importance algorithms
   • Relative betweenness algorithms
   • Graph layout and browsing
   • Graph filtering
– Easily extensible
– Integrated database support is planned in Year 2
• Other software is also in principle available
– Author-topic applications
   • E.g., find experts in South Florida in virus research
– GrAQ tool for graph DB interface

Proposed Year 2 Work
• Basic research: extend theory and algorithms
– Temporal and spatial semantics
– Missing/noisy network data
– Multiple edge types (multiple edges on same entities)
– Scalability: graphs with millions of edges
– Interaction: tools that support exploration and querying
• Integration and Coupling of
– Statistical topic models, querying, graph visualization, and databases
• Software Tools and Applications for the KDD testbed
– Netsight as an analysis tool
– Application of author-topic type model (e.g., “expert finder”)
– Entity monitoring application (monitor data sources over time with focused Web crawling)
• Data Sets/Types (TBD)
– KDD-provided testbed data sets
– Digital libraries: more CiteSeer, possibly Patent DB, MEDLINE
– Less structured text sources such as email streams

BACKUP SLIDES

[Figure: Perplexities for true author and any random author. x-axis: Number of Topics (5, 10, 20, 50, 100, 200, 400, 800); y-axis: perplexity (2000 to 5500). Legend: percentiles in distribution (0th, 1st, 2nd, 5th, 10th); A = true author; A = any author]

• Accuracy of author prediction as a function of number of topics
[Figure: x-axis: Number of Topics (5, 10, 20, 50, 100, 200, 400, 800); y-axis: % of documents for which correct author was picked (0 to 40)]

Heterogeneous Event-Entity Graph Analysis and Query Language
Analysis of link/graph data involves:
• Subschema selection
– Selecting node and edge types of interest from the graph schema
• Subgraph selection
– Identifying relevant members of a group based on (possibly imprecise) matching of edge/node attributes or involvement in a given pattern of relationship
• Decoration
– E.g., computation of pair-wise association measures between individual entities (conditioned on a context or third entity type)
• Structural Grouping and Aggregation
– Node/edge grouping
– Combination of decorations (or other attribute values) for groups of entities at various levels
• Progressive Refinement
– Carrying out the above operations in a progressive and interactive manner. In particular, the user should be able to ask queries based on results of previous queries.

P(author and topic given a word)
$$P(A_i, Z_i \mid \{W\}, \{Z\} \setminus Z_i, \{A\} \setminus A_i) \propto \frac{(C_{WZ} + \beta)\,(C_{AZ} + \alpha)}{\sum_{W'} C_{W'Z} + V\beta}$$
Here $V$ is the number of unique words, and $\alpha$ and $\beta$ are the smoothing parameters.
$C_{WZ}$ counts the number of times the same word, W (in the same or other documents), is assigned to topic Z.
$C_{AZ}$ counts the number of times the same author, A (in the same or other documents), is assigned to topic Z.
Keeping these counts speeds up the algorithm!
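
As a rough illustration of how the stored counts are used, the sketch below evaluates the conditional for one word from the two count tables and normalizes it over all (author, topic) pairs. It follows the formula as it appears on the slide; the smoothing symbols `alpha` and `beta`, the toy counts, and the function name are assumptions made for the example.

```python
def author_topic_conditional(word, doc_authors, c_wz, c_az,
                             vocab_size, alpha, beta):
    """Normalized P(author, topic | word, all other assignments).

    c_wz[z][w] and c_az[a][z] are counts that exclude the word
    currently being resampled.
    """
    scores = {}
    for z in c_wz:
        topic_total = sum(c_wz[z].values())  # could be cached per topic
        word_term = (c_wz[z].get(word, 0) + beta) / (topic_total + vocab_size * beta)
        for a in doc_authors:
            scores[(a, z)] = word_term * (c_az[a].get(z, 0) + alpha)
    norm = sum(scores.values())
    return {pair: s / norm for pair, s in scores.items()}


# Hypothetical counts for two topics and two authors.
C_WZ = {"T1": {"bayesian": 8, "query": 1}, "T2": {"query": 9, "index": 5}}
C_AZ = {"A1": {"T1": 7, "T2": 1}, "A2": {"T1": 2, "T2": 13}}

probs = author_topic_conditional("query", ["A1", "A2"], C_WZ, C_AZ,
                                 vocab_size=3, alpha=0.5, beta=0.01)
best = max(probs, key=probs.get)
```

Only table lookups and additions are needed per (author, topic) pair, which is why maintaining the counts incrementally keeps each sampling step cheap.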

Sampling over a query document
Preprocessing: assign to each word in the query document an author and a topic.
K iterations (typically K=10):
• For each word out of the N query words
– Derive the probability P(A, Z) conditioned on the current assignments of query words and the database words
– Assign a new author, A, and topic, Z, according to P(A, Z)
The probability for a topic is the averaged ratio of words assigned to the topic per total words:
$$P(Z) = \frac{1}{KN} \sum_{t=1}^{K} C^{t}_{Z}$$
where $C^{t}_{Z}$ is the number of words assigned to topic Z in iteration t.
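
The query-time procedure above can be sketched as a small Gibbs loop. This is a minimal sketch, not the system's implementation: it assumes the database counts stay fixed while only the query words are resampled, reuses the slide's unnormalized conditional, and takes the candidate authors as given. All names, counts, and default parameters are hypothetical.

```python
import random


def sample_query_topics(query_words, authors, c_wz, c_az, vocab_size,
                        alpha=0.5, beta=0.01, k_iters=10, seed=0):
    """Estimate P(Z) for a query document by averaging over K sweeps."""
    rng = random.Random(seed)
    topics = list(c_wz)
    pairs = [(a, z) for a in authors for z in topics]
    # Preprocessing: random initial (author, topic) for each query word.
    assign = [rng.choice(pairs) for _ in query_words]
    q_wz = {z: {} for z in topics}                        # query word-topic counts
    q_az = {a: {z: 0 for z in topics} for a in authors}   # query author-topic counts
    for w, (a, z) in zip(query_words, assign):
        q_wz[z][w] = q_wz[z].get(w, 0) + 1
        q_az[a][z] += 1
    topic_hits = {z: 0 for z in topics}
    for _ in range(k_iters):
        for i, w in enumerate(query_words):
            a_old, z_old = assign[i]
            q_wz[z_old][w] -= 1                           # remove current assignment
            q_az[a_old][z_old] -= 1
            weights = []
            for a, z in pairs:
                total = sum(c_wz[z].values()) + sum(q_wz[z].values())
                cw = c_wz[z].get(w, 0) + q_wz[z].get(w, 0)
                ca = c_az.get(a, {}).get(z, 0) + q_az[a][z]
                weights.append((cw + beta) * (ca + alpha) / (total + vocab_size * beta))
            a_new, z_new = rng.choices(pairs, weights=weights)[0]
            assign[i] = (a_new, z_new)
            q_wz[z_new][w] = q_wz[z_new].get(w, 0) + 1
            q_az[a_new][z_new] += 1
        for _, z in assign:                               # C^t_Z for this sweep
            topic_hits[z] += 1
    n = len(query_words)
    return {z: topic_hits[z] / (k_iters * n) for z in topics}


# Hypothetical database counts.
DB_WZ = {"T1": {"bayesian": 80, "query": 1}, "T2": {"query": 90, "index": 50}}
DB_AZ = {"A1": {"T1": 70, "T2": 10}, "A2": {"T1": 20, "T2": 130}}

p_z = sample_query_topics(["query", "index", "query"], ["A1", "A2"],
                          DB_WZ, DB_AZ, vocab_size=3)
```

The per-topic totals are recomputed inside the loop for brevity; caching them, as the count-keeping remark on the previous slide suggests, would remove that cost.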