9/20/2011
Mining and Querying Multimedia Data
Fan Guo
Sep 19, 2011
Committee Members:
Christos Faloutsos, Chair
Eric P. Xing
William W. Cohen
Ambuj K. Singh, University of California at Santa Barbara
What is this talk about?
2
What is this talk about?
3
Going Multimedia
4
Beyond Text and Images
5
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
6
Mining Multimedia Data (1)
• Labeling Satellite Imagery
8
[Figure panels: Input, Output]
Mining Multimedia Data (2)
• Network Traffic Log Analysis
9
Mining Multimedia Data (3)
• Web Knowledge Base
10
Mining Multimedia Data
• Data-driven problem solving over multiple modes at a non-trivial scale.
11
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
12
Querying Multimedia Data (1)
• A querying system provides an interface for retrieving the records that best match users’ information needs.
13
Querying Multimedia Data (1)
• Here is another example:
14
https://www.facebook.com/pages/browser.php
Querying Multimedia Data (1)
• May be transformed into a graph search problem
15
Querying Multimedia Data (2)
• Calibrate ranking from user feedback
16
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
18
Data
• Large-Scale Heterogeneous Networks
19
[Figure: example network linking IP-source and IP-destination nodes through ports. Addresses shown: 198.129.1.2, 131.243.2.10, 131.243.2.5, 128.3.10.40, 128.3.1.50. Ports shown: 80 (HTTP), 80 (HTTP), 993 (IMAP).]
Goal
• How can we automatically detect and visualize patterns within a local community of nodes?
20
Preliminary
• Tensor for high-order data representation
▫ 3 data modes: source IP, destination IP, port #
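As a minimal sketch of the representation above, the snippet below builds a sparse 3-mode count tensor from flow records. The flow records, the `build_index` helper, and the choice of a `Counter` keyed by index triples are all assumptions made for illustration; the talk does not specify a storage format.

```python
from collections import Counter

# Hypothetical flow records: (source IP, destination IP, port number).
flows = [
    ("10.0.0.1", "10.0.0.9", 80),
    ("10.0.0.2", "10.0.0.9", 80),
    ("10.0.0.1", "10.0.0.9", 80),
    ("10.0.0.3", "10.0.0.7", 993),
]

def build_index(values):
    """Map each distinct value to a dense integer index, in first-seen order."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

src_index = build_index(f[0] for f in flows)
dst_index = build_index(f[1] for f in flows)
port_index = build_index(f[2] for f in flows)

# Sparse 3-mode tensor: (i, j, k) -> number of flows seen for that triple.
tensor = Counter(
    (src_index[s], dst_index[d], port_index[p]) for s, d, p in flows
)

# One dimension per data mode: sources x destinations x ports.
shape = (len(src_index), len(dst_index), len(port_index))
```

Only non-zero cells are stored, which matters because most (source, destination, port) combinations never occur in real traffic.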
21
Approach
22
Data Decomposition
• The canonical polyadic (CP) decomposition factors a tensor into a sum of rank-1 tensors
23
Data Decomposition
• For 2-mode data (a matrix), a special case is the singular value decomposition (SVD)
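To make the "sum of rank-1 tensors" concrete, here is a small sketch that rebuilds a 3-mode tensor from CP factors, using X[i][j][k] = sum over r of w_r * a_r[i] * b_r[j] * c_r[k]. With only two modes the same form reads sum over r of sigma_r * u_r[i] * v_r[j], which is the shape of an SVD. The function name `cp_reconstruct` and the toy factors are made up for illustration; this reconstructs from given factors and does not fit them.

```python
def cp_reconstruct(weights, A, B, C):
    """Rebuild a 3-mode tensor from CP factors.

    weights[r] scales the r-th rank-1 term; A[r], B[r], C[r] are its
    factor vectors along modes 1, 2 and 3.
    """
    I, J, K = len(A[0]), len(B[0]), len(C[0])
    X = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for w, a, b, c in zip(weights, A, B, C):
        # Add one rank-1 term: the scaled outer product of a, b and c.
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    X[i][j][k] += w * a[i] * b[j] * c[k]
    return X

# Toy rank-2 example with made-up factors.
weights = [2.0, 1.0]
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 1.0], [0.0, 1.0]]
C = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
X = cp_reconstruct(weights, A, B, C)
```

Each factor vector assigns one score per source IP, destination IP, or port, which is what a per-mode histogram can then be drawn from.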
24
Attribute Plot
25
How to compute?
Spike Detection
• Iteratively search for spikes in the histogram plot along each data mode.
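The slides do not state the exact spike criterion, so the following is only a sketch under an assumed rule: bin the scores of one data mode into fixed-width bins and report any bin that holds at least `min_count` entries. The names `find_spikes`, `bin_width`, and `min_count`, and the sample scores, are hypothetical.

```python
def find_spikes(scores, bin_width=0.01, min_count=3):
    """Return {bin number: member indices} for crowded histogram bins.

    A bin is reported as a spike when at least min_count scores fall
    into it (an assumed threshold rule, not the one from the talk).
    """
    bins = {}
    for idx, score in enumerate(scores):
        bin_number = round(score / bin_width)
        bins.setdefault(bin_number, []).append(idx)
    return {b: members for b, members in bins.items() if len(members) >= min_count}

# Hypothetical factor scores along one mode (e.g., one score per source IP).
scores = [0.30, 0.301, 0.299, 0.302, 0.05, 0.71]
spikes = find_spikes(scores)
```

The member indices of a spike identify the entities (for example, source IPs) whose records would be examined in the next step.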
26
Substructure Discovery
• Focus on part of the data within the spike
• Categorize into a few subgraph patterns
27
Pattern 1: Generalized Star (1)
28
IP-src’s sending packets to the same IP-dst & the same port
Typical client/server system
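To illustrate what this pattern looks like in raw records, the sketch below groups flows by (destination IP, port) and keeps the groups reached by many distinct sources. This direct group-by is a stand-in for illustration only; in the talk the pattern is found inside spikes of the decomposed tensor. The name `star_candidates`, the `min_sources` threshold, and the sample flows are assumptions.

```python
from collections import defaultdict

def star_candidates(flows, min_sources=3):
    """Group flows by (destination IP, port); keep groups with many distinct sources."""
    sources = defaultdict(set)
    for src, dst, port in flows:
        sources[(dst, port)].add(src)
    return {
        key: sorted(srcs)
        for key, srcs in sources.items()
        if len(srcs) >= min_sources
    }

# Hypothetical flows: three clients hit the same server on port 80.
flows = [
    ("10.0.0.1", "10.0.0.9", 80),
    ("10.0.0.2", "10.0.0.9", 80),
    ("10.0.0.3", "10.0.0.9", 80),
    ("10.0.0.1", "10.0.0.7", 993),
]
stars = star_candidates(flows)
```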
Pattern 1: Generalized Star (1)
29
A ‘bar’ in a carefully reordered tensor
Pattern 1: Generalized Star (2)
30
Extending along “Port-Number”
Pattern 1: Generalized Star (2)
31
Port scanning or P2P
Port numbers used in packets from the same IP-src to the same IP-dst
Pattern 2: Generalized Bipartite-Core (1)
32
A ‘plane’ in a carefully reordered tensor
Pattern 2: Generalized Bipartite-Core (1)
33
IP-src’s sending packets to the same IP-dst’s & the same port
Clients talking to a shared server pool
Pattern 2: Generalized Bipartite-Core (2)
34
A ‘plane’ in a carefully reordered tensor
Pattern 2: Generalized Bipartite-Core (2)
35
IP-src’s sending packets over multiple ports to one IP-dst
A multi-purpose Windows server
M1: MultiAspectForensics
• Automatically detects novel patterns in heterogeneous networks
36
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
37
QMAS: Mining Satellite Imagery (1)
• Low-labor labeling
38
[Figure panels: Input, Output]
QMAS: Mining Satellite Imagery (2)
• Low-labor labeling
• Identification of Representatives
39
QMAS: Mining Satellite Imagery (2)
• Low-labor labeling
• Identification of Representatives and Outliers
40
QMAS: Mining Satellite Imagery (3)
• Low-labor labeling
• Identification of Representatives and Outliers
• Linear in time & space
42
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
43
Web Search
44
User Clicks as Quality Feedback
45
[Figure axis: # of total clicks]
Motivation
• Leverage the signal from click data to improve search ranking.
46
Click Through Rate (CTR)
• CTR = # of Clicks / # of Impressions
47
Position Bias
48
Relevance of Web Document
• Relevance = CTR @ Position 1 = (# Clicks @ Position 1) / (# Impressions @ Position 1)
49
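As a worked example of the two definitions above, the sketch below computes CTR per position for one query-document pair and reads off relevance as the CTR at position 1. The function name `ctr_by_position` and the sample log are hypothetical.

```python
from collections import defaultdict

def ctr_by_position(impressions):
    """impressions: iterable of (position, clicked) pairs; returns {position: CTR}."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        if was_clicked:
            clicked[position] += 1
    # CTR = # of clicks / # of impressions, computed separately per position.
    return {p: clicked[p] / shown[p] for p in shown}

# Hypothetical log for one query-document pair: (position shown, clicked?).
log = [(1, True), (1, False), (1, True), (1, True), (3, False), (3, True)]
ctr = ctr_by_position(log)

# Relevance as defined on the slide: CTR when the document is at position 1.
relevance = ctr[1]
```

In this toy log the document is clicked in 3 of 4 impressions at position 1 and 1 of 2 at position 3, so the same document shows a different CTR at each position; a click model is needed when few or no position-1 impressions exist.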
Problem Definition
• Estimate the relevance of web documents given clicks and their positions.
50
Design Goals / Constraints
• Scalable: single-pass, easy to parallelize.
• Incremental: real-time updates possible.
• Accurate: consistent with past and future observations.
51
Approach
52
User Behavior Model
53
Last Clicked Position
54
Empirical Results
• Click data after pre-processing
▫ 110K distinct queries, 8.8M query sessions.
• Training time: <6 mins
• Online update:
▫ Bump impression and click counters
▫ No data retention required
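The online update above can be sketched as a pair of counters per (query, document) that are bumped once per session and never need the raw session again. Which impressions count is an assumption here: the sketch counts only documents ranked at or before the last clicked position, and in a session without clicks it counts every shown document. That is one plausible reading of the "Last Clicked Position" slide, not a statement of the thesis's exact estimator. The class name `ClickCounters` and its methods are hypothetical.

```python
class ClickCounters:
    """Per (query, document) counters updated one session at a time."""

    def __init__(self):
        self.impressions = {}
        self.clicks = {}

    def update(self, query, session):
        """session: list of (doc_id, clicked) in ranked order for one query session."""
        clicked_positions = [i for i, (_, c) in enumerate(session) if c]
        # Assumption: with no click, treat the whole list as examined.
        last = clicked_positions[-1] if clicked_positions else len(session) - 1
        for doc_id, clicked in session[: last + 1]:
            key = (query, doc_id)
            self.impressions[key] = self.impressions.get(key, 0) + 1
            if clicked:
                self.clicks[key] = self.clicks.get(key, 0) + 1

    def relevance(self, query, doc_id):
        """Clicks divided by counted impressions, or None if never counted."""
        key = (query, doc_id)
        shown = self.impressions.get(key, 0)
        return self.clicks.get(key, 0) / shown if shown else None

counters = ClickCounters()
counters.update("q", [("a", False), ("b", True), ("c", False)])
counters.update("q", [("a", True), ("b", False), ("c", False)])
```

Because each update touches only counters, sessions can be processed in a single pass and discarded, and counters from different machines can be summed.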
55
Empirical Results
• Higher log-likelihood indicates better quality.
56
27% accuracy in prediction
2% improvement over ICM, the baseline model
Empirical Results
• Position-bias visualized
57
[Figure legend: Ground Truth, DCM]
Scaling to Terabytes
• 265 TB of data, 1.15B document relevance results, wall-clock running time of about 3 hours
58
Q1: Click Models
• A statistical, position-bias-aware approach to leveraging click data for better ranking.
• The models are incremental, more accurate than the baseline, and scale to nearly petabyte-sized data.
59
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
60
Q2: C-DEM
• A flexible query interface for 3-mode data: images, genes, annotation terms.
61
Q2: C-DEM
62
[Figure: the three data modes: Images, Terms, Genes]
Q2: C-DEM
• Solution: random walk with restart on graphs.
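A minimal sketch of random walk with restart, computed by power iteration on a small graph whose nodes stand for images, genes, and annotation terms. The graph, the node names, the restart probability of 0.15, and the fixed iteration count are all assumptions for illustration; the talk does not give these details.

```python
def random_walk_with_restart(adjacency, seed, restart=0.15, iterations=100):
    """adjacency: {node: [neighbors]}; returns {node: proximity to seed}."""
    nodes = list(adjacency)
    scores = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        # With probability `restart` the walker jumps back to the seed.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                # Dangling node: return its mass to the seed.
                nxt[seed] += (1.0 - restart) * scores[n]
                continue
            # Otherwise the walker moves to a uniformly chosen neighbor.
            share = (1.0 - restart) * scores[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        scores = nxt
    return scores

# Hypothetical graph linking images, genes and annotation terms.
graph = {
    "image1": ["geneA", "term_x"],
    "image2": ["geneA"],
    "geneA": ["image1", "image2"],
    "term_x": ["image1"],
    "term_y": ["geneB"],
    "geneB": ["term_y"],
}
scores = random_walk_with_restart(graph, seed="image1")
```

Ranking the nodes of one type by score answers a query such as "which images are closest to image1"; nodes unreachable from the seed, like `term_y` here, keep a score of zero.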
63
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
64
Q3: BEFH (1)
• Bayesian exponential family harmonium
• Deriving topical representations for multimedia corpora (e.g., video snapshots and captions)
65
[Figure panels: Input, Model]
Q3: BEFH (2)
• Bayesian exponential family harmonium
• Deriving topical representations for multimedia corpora (e.g., video snapshots and captions)
66
[Figures: Validation – Synthetic Data; Validation – TRECVID Data. Each plot is annotated with the direction of better quality.]
Thesis Outline
Mining
M1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
67
Conclusion
• Data-driven research under the theme of pattern mining and similarity querying.
• An array of practical tasks addressed:
▫ Internet traffic surveillance (M1)
▫ Satellite image analysis (M2)
▫ Web search (Q1)
▫ …
72
Thank You!
• http://www.cs.cmu.edu/~fanguo/dissertation/
73