Institute for Web Science & Technologies – WeST
Challenging Retrieval Scenarios: Social Media and Linked Open Data
Dr. Thomas Gottron
Thomas Gottron, Lugano, 23.4.2012, Challenging Retrieval Scenarios
Outline
The ROBUST project: background, use cases
Retrieval on microblogs: particularities of Twitter, interestingness, LiveTweet
Search on the LOD cloud: querying LOD as an IR task, schema extraction, SchemEX
Online Communities
Business Communities
Information ecosystems: employees; business partners and customers; general public
Valuable asset
Opportunities and risks
High Level Objectives
Risk Management
• Risk modelling
• Detection
• Automatic reaction
Community Analysis
• Contents
• Single users
• Entire communities
Community Forecasting
• Policies
• Prediction
• Decision support
Large Scale Processing
• Big Data
• Realtime
• Parallel processing
Scenario 1
Social Media - Microblogs
IBM Connections
[Figure: @janedoe posts the tweet "My dear @johndoe had troubles to wake up this #morning", which is shown to the author's followers]
Retrieval on Twitter: First Steps
10 million tweets, indexed by a retrieval engine. Query: beer
Rank User Tweet
1 LoriAG beer
2 Crushdwinebar beer!!
3 Skippertaylor BEER
4 BigMacScola Beer
5 VANiamore beer.......
6 CindyMcManis To beer or not to beer on Beer Summit ?
7 silverlakewine beer beer beer beer beer beer beer. Simple 3pm
8 eldoradobar http://ping.fm/p/Bnra7 - In!!! BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER,
9 tonx Lompoc. beer beer beer beer beer beer beer beer beer beer. http://twitpic.com/l68ld
10 punkeyfunky Beer beer beer beer beer beer beer beer beer beer beer beer beer. Er, guess what I'm looking forward to?
What is going wrong?
Particularities of Twitter
Twitter is different
Maximum length: 140 characters
[Figure: histogram of tweet lengths; x-axis: characters (1 to 140), y-axis: number of tweets (0 to 500,000)]
Twitter is different
140 characters = few words
[Figure: number of tweets (log scale, 1 to 100,000,000) by maximum term frequency in a tweet (0 to 47)]
85% of tweets contain each word only once
$w(t_j, d_i) = \mathbf{tf}(t_j, d_i) \cdot \log\left(\frac{N}{\mathrm{df}(t_j)}\right)$
The tf factor is a binary value!
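The weighting above can be sketched in a few lines of Python. This is only an illustration: the function name `tf_idf_weights` and the toy document frequencies are assumptions, not part of the talk.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, doc_freq, num_docs):
    """Return {term: tf * log(N / df)} for one tokenised document."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(num_docs / doc_freq[t]) for t in tf}

# Hypothetical collection statistics: 1000 tweets, "beer" in 10 of them, "pubs" in 5.
doc_freq = {"beer": 10, "pubs": 5}

# Each word occurs once, so tf is 1 and the weight reduces to the idf part.
weights = tf_idf_weights(["pubs", "beer"], doc_freq, 1000)
```

When tf is 0 or 1 for nearly every term, the ranking is driven by idf and by whatever length normalisation the engine applies.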
Length normalisation
Why are some documents longer? (classic explanation)
Verbosity hypothesis: long documents repeat themselves; short documents are preferred because they are more concise.
Scope hypothesis: long documents address more topics; short documents are preferred because they are more focussed.
Intuition: neither hypothesis is valid for Twitter.
Verbosity hypothesis and Twitter?
Are long tweets more verbose? Consider the length of tweets and the number of repeated words.
Correlation (Spearman's rank)
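A minimal sketch of such a check, assuming whitespace tokenisation and character length as the length measure (both are assumptions; the slide does not say how length or repetition were measured). Spearman's rank correlation is computed here as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho; undefined (division by zero) if one input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def repeated_words(tweet):
    """Number of tokens that repeat an earlier token in the same tweet."""
    tokens = tweet.lower().split()
    return len(tokens) - len(set(tokens))

# Hypothetical mini sample; a real analysis would use the full tweet collection.
tweets = ["beer", "beer beer beer", "Pubs brewing their own beer"]
rho = spearman([len(t) for t in tweets], [repeated_words(t) for t in tweets])
```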
Scope hypothesis and Twitter?
Are long tweets broader in scope?
LDA: 100 topics
Observations:
• 8.5% of tweets have no strong topic
• Of the remaining tweets, 77.1% are dominated by one topic
• Of the remaining tweets, 99.6% are dominated by two topics
Length normalisation on tweets
Not necessary! … Negative impact?
YES: short tweets are preferred!
Long tweets are considered to be too wide in scope.
Beer!
Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV
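The effect on the two example tweets can be illustrated with a toy scoring function. This is a simplified sketch, not the engine used in the talk: it counts query-term matches and, if requested, divides by the square root of the token count as a stand-in for length normalisation.

```python
import math
import re

def score(query_terms, tweet, normalise):
    """Count query-term matches; optionally divide by sqrt(number of tokens)."""
    tokens = re.findall(r"\w+", tweet.lower())
    matches = sum(1 for t in tokens if t in query_terms)
    if not normalise:
        return float(matches)
    return matches / math.sqrt(len(tokens))

short = "Beer!"
long_ = "Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV"

# With normalisation the one-word tweet wins; without it both tweets tie.
normalised = (score({"beer"}, short, True), score({"beer"}, long_, True))
plain = (score({"beer"}, short, False), score({"beer"}, long_, False))
```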
Interestingness
Interesting Content
Concept of "relevance" in IR: a document is about a topic
Additionally for Twitter:
• Timeliness
• Current trend
• Informative
Interestingness: a tweet is about a topic AND is interesting!
Question: how to determine what is interesting?
Retweets
[Figure: @janedoe's tweet "My dear @johndoe had troubles to wake up this #morning" is retweeted by a follower as "RT @janedoe: My dear @johndoe had troubles to wake up this #morning"]
Retweets
A retweet indicates quality: "of interest for others"
Depends on:
• Content
• Context (time, followers)
Idea: Learn to predict retweets!
Likelihood of retweet as metric for Interestingness
Retweets: Prediction model
Aim: Prediction of a probability
Logistic regression
Model parameters learned on training data
Dataset Users Tweets Retweets
Choudhury 118,506 9,998,756 7.89%
Choudhury (extended) 277,666 29,000,000 8.64%
Petrovic 4,050,944 21,477,484 8.46%
Logistic Regression: Weights
Feature Dimension Weight
Constant (intercept) -5.45
Message feature Direct message -147.89
Message feature Username 146.82
Message feature Hashtag 42.27
Message feature URL 249.09
Sentiment Valence -26.88
Sentiment Arousal 33.97
Sentiment Dominance 19.56
Emoticons Positive -21.8
Emoticons Negative 9.94
Exclamation Positive 13.66
Exclamation Negative 8.72
Punctuation ! -16.85
Punctuation ? 23.67
Terms Odds 19.79
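A logistic regression model turns such weights into a retweet probability by passing the weighted feature sum through the logistic function. The sketch below uses three weights from the table; the feature names and, in particular, the feature values are hypothetical, since the slide does not state how the features are scaled.

```python
import math

def retweet_probability(features, weights, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum of w_i * x_i)))."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Weights taken from the table above; the dictionary keys are illustrative names.
WEIGHTS = {"url": 249.09, "hashtag": 42.27, "direct_message": -147.89}
INTERCEPT = -5.45

# Hypothetical, small feature values (the real feature scaling is not given).
with_url = retweet_probability({"url": 0.02, "hashtag": 0.01}, WEIGHTS, INTERCEPT)
baseline = retweet_probability({}, WEIGHTS, INTERCEPT)
```

The sign of a weight shows the direction of the effect: a URL or hashtag raises the predicted probability, a direct message lowers it.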
Logistic Regression: Topic Weights
Topic Weight
social media market post site web tool traffic network 27.54
follow thank twitter welcome hello check nice cool people 16.08
credit money market business rate economy home 15.25
christmas shop tree xmas present today wrap finish 2.87
home work hour long wait airport week flight head -14.43
twitter update facebook account page set squidoo check -14.43
cold snow warm today degree weather winter morning -26.56
night sleep work morning time bed feel tired home -75.19
Re-Ranking using Interestingness
Take the top-k relevant tweets, then re-rank them based on interestingness
Rank Username Tweet
1 BeeracrossTX UK beer mag declares "the end of beer writing." @StanHieronymus says not so in the US. http://bit.ly/424HRQ #beer
2 narmmusic beer summit @bspward @jhinderaker no one had billy beer? heehee #narm - beer summit @bspward @jhinde http://tinyurl.com/n29oxj
3 beeriety Go green and turn those empty beer bottles into recycled beer glasses! | http://bit.ly/2src7F #beer #recycle (via: @td333)
4 hblackmon Great Divide beer dinner @ Porter Beer Bar on 8/19 - $45 for 3 courses + beer pairings. http://trunc.it/172wt
5 nycraftbeer Interesting Concept-Beer Petitions.com launches&hopes 2help craft beer drinkers enjoy beer they want @their fave pubs. http://bit.ly/11gJQN
6 carichardson Beer Cheddar Soup: Dish number two in my famed beer dinner series is Beer Cheddar Soup. I hadn’t had too.. http://bit.ly/1diDdF
7 BeerBrewing New York City Beer Events - Beer Tasting - New York Beer Festivals - New York Craft Beer http://is.gd/39kXj #beer
8 delphiforums Love beer? Our member is trying to build up a new beer drinker's forum. Grab a #beer and join us: http://tr.im/pD1n
9 Jamie_Mason #Baltimore Beer Week continues w/ a beer brkfst, beer pioneers luncheon, drink & donate event, beer tastings & more. http://ping.fm/VyTwg
10 carichardson Seattle and Beer: I went to Seattle last weekend. It was my friend’s stag - he likes beer - we drank beer.. http://tinyurl.com/cpb4n9
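The two-stage procedure can be sketched as follows. The function name `rerank`, the toy result list and the stand-in interestingness function are assumptions for illustration; in the talk, interestingness is the predicted retweet likelihood.

```python
def rerank(results, interestingness, k):
    """Keep the k most relevant (tweet, relevance) pairs, then order them by interestingness."""
    top_k = sorted(results, key=lambda r: r[1], reverse=True)[:k]
    return sorted(top_k, key=lambda r: interestingness(r[0]), reverse=True)

# Hypothetical retrieval output: (tweet text, relevance score).
results = [
    ("beer", 0.9),
    ("BEER", 0.8),
    ("Great beer dinner http://example.org", 0.7),
    ("unrelated tweet", 0.1),
]

def has_url(tweet):
    """Crude stand-in for a learned interestingness score."""
    return 1.0 if "http" in tweet else 0.0

reranked = rerank(results, has_url, k=3)
```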
Application
LiveTweet
Data: Twitter streaming API (sample of 1% of all tweets)
Architecture:
• Time slices over tweets
• Analytical component with REST API
• Web frontend for end users
LiveTweet
http://livetweet.west.uni-koblenz.de/
LiveTweet: What comes next?
Retrieval
• Incorporate with other retrieval metrics
• Include interestingness in a learning to rank approach
• Social graph
System extension
• Personalisation
• Public API
• Work with IBM data
Scenario 2
Linked Open Data
Information needs requiring semantic structure
Examples:
• Male persons who have a public profile document
• Computing science papers authored by social scientists
• American actors who are also politicians and are married to a model
Specific databases may be available:
• Person search engines
• Bibliographic databases
• Movie databases
How to integrate?
Linked Data
[Figure: data sources A to E, each containing Things connected by typed links within and across sources]
Semantic Web technology to
1. Provide structured data on the web
2. Link data across data sources
Entities are identified via URIs
[Figure: RDF graph with the following statements]
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
Prefixes:
pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
Description of a link between two data sources
One statement = one triple
Subject Predicate Object
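As a small illustration, a triple can be modelled as a three-field record and prefixed names expanded to full URIs. The `Triple` type and the `expand` helper are hypothetical names; the namespace URIs are the ones that appear in this talk.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Prefix table; pd and dbpedia as given on the slide, foaf as in the N-Quads example.
PREFIXES = {
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
    "dbpedia": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(name):
    """Expand a prefixed name such as pd:cygri; leave unknown prefixes untouched."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local if prefix in PREFIXES else name

statement = Triple(expand("pd:cygri"), expand("foaf:based_near"), expand("dbpedia:Berlin"))
```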
Resolving URIs
[Figure: resolving dbpedia:Berlin yields further statements that extend the graph around pd:cygri]
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
dbpedia:Berlin dp:population 3,405,259
dbpedia:Berlin skos:subject dp:Cities_in_Germany
The LOD Cloud
Querying linked data
SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
Querying linked data – an IR task?
[Figure: two pipelines starting from an information need. Keyword query → documents → information: here happens IR magic. SPARQL query → data sources → entities: here we need magic.]
Querying linked data – using an index
SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
A Schema for LOD
Idea
Schema index:
• Define families of graph patterns
• Assign entities to graph patterns
• Map graph patterns to context / source
Construction:
• Stream-based for scalability
• Little loss of accuracy
NOTE: the index is defined over entities, but it stores the contexts (sources)
Input Data
N-Quads: <subject> <predicate> <object> <context> .
Example:
<http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> .
[Figure: entity w3p:#me of type foaf:Person, stated in context http://dig.csail.mit.edu/2008/...]
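A minimal sketch of reading such a line, assuming all four terms are IRIs in angle brackets as in the example above. The helper `parse_quad` is a hypothetical name; literals and blank nodes, which full N-Quads also allows, are deliberately not handled.

```python
import re

# Four angle-bracketed IRIs followed by the terminating dot.
QUAD = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

def parse_quad(line):
    """Return (subject, predicate, object, context) or None if the line does not fit."""
    match = QUAD.match(line)
    return match.groups() if match else None

line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
quad = parse_quad(line)
```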
Layer 1: RDF Classes
[Figure: class C1 (foaf:Person) is linked to data sources DS 1, DS 2 and DS 3; example entity timbl:card#i of type foaf:Person, with contexts http://www.w3.org/People/Berners-Lee/card and http://dig.csail.mit.edu/2008/...]
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}
All entities of a particular type
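The first layer can be pictured as a map from each RDF class to the data sources (contexts) that contain instances of it. The sketch below is an assumption about one possible in-memory form, not the SchemEX implementation; `build_class_index` and the example.org URIs are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def build_class_index(quads):
    """Map each class URI to the set of contexts holding rdf:type statements for it."""
    index = defaultdict(set)
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return index

PERSON = "http://xmlns.com/foaf/0.1/Person"
quads = [
    ("http://example.org/a", RDF_TYPE, PERSON, "http://example.org/ds1"),
    ("http://example.org/b", RDF_TYPE, PERSON, "http://example.org/ds2"),
    ("http://example.org/a", "http://xmlns.com/foaf/0.1/name", "A", "http://example.org/ds1"),
]
class_index = build_class_index(quads)
```

A lookup such as `class_index[PERSON]` then answers which sources to query for all entities of that type.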
Layer 2: Type Clusters
[Figure: type cluster TC1 groups classes C1 (foaf:Person) and C2 (pim:Male) and is linked to data sources DS 1, DS 2 and DS 3; example entity timbl:card#i from http://www.w3.org/People/Berners-Lee/card falls into type cluster tc4711 (foaf:Person, pim:Male)]
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}
All entities belonging to the same set of types
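A type cluster can be keyed by the exact set of types an entity carries. The following sketch, with the hypothetical helper `build_type_clusters` and example.org URIs, collects each entity's types first and then maps every distinct type set to the contexts in which such entities were described.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def build_type_clusters(quads):
    """Map frozenset(types of an entity) to the contexts of its rdf:type statements."""
    types = defaultdict(set)
    contexts = defaultdict(set)
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
            contexts[subject].add(context)
    clusters = defaultdict(set)
    for subject, type_set in types.items():
        clusters[frozenset(type_set)] |= contexts[subject]
    return clusters

PERSON, MALE = "foaf:Person", "pim:Male"
quads = [
    ("ex:a", RDF_TYPE, PERSON, "ds1"),
    ("ex:a", RDF_TYPE, MALE, "ds1"),
    ("ex:b", RDF_TYPE, PERSON, "ds2"),
]
type_clusters = build_type_clusters(quads)
```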
Layer 3: Equivalence Classes
[Figure: equivalence class EQC1 with type clusters TC1 and TC2, classes C1, C2 and C3, and data sources DS 1, DS 2 and DS 3]
Two entities are equivalent iff:
• They are in the same TC
• They have the same properties
• The property targets are in the same TC
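The three conditions above can be sketched as a grouping key: an entity's own type cluster plus the set of (property, type cluster of target) pairs. The function `equivalence_classes` and the toy data are illustrative assumptions, not the SchemEX code.

```python
from collections import defaultdict

def equivalence_classes(types, links):
    """types: {entity: set of classes}; links: iterable of (subject, property, object)."""
    def type_cluster(entity):
        return frozenset(types.get(entity, ()))

    # For every subject, collect (property, type cluster of the target) pairs.
    signature = defaultdict(set)
    for subject, prop, target in links:
        signature[subject].add((prop, type_cluster(target)))

    classes = defaultdict(set)
    for entity in set(types) | set(signature):
        key = (type_cluster(entity), frozenset(signature.get(entity, ())))
        classes[key].add(entity)
    return classes

types = {
    "ex:a": {"foaf:Person", "pim:Male"},
    "ex:b": {"foaf:Person", "pim:Male"},
    "ex:c": {"foaf:Person", "pim:Male"},
    "ex:docA": {"foaf:PersonalProfileDocument"},
    "ex:docB": {"foaf:PersonalProfileDocument"},
}
links = [("ex:a", "foaf:maker", "ex:docA"), ("ex:b", "foaf:maker", "ex:docB")]
eq_classes = equivalence_classes(types, links)
```

Here `ex:a` and `ex:b` end up in the same class, while `ex:c` has the same types but no `foaf:maker` link and therefore forms its own class.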
Layer 3: Equivalence Classes
[Figure: timbl:card#i (types foaf:Person and pim:Male, type cluster tc4711) belongs to equivalence class eqc0815; its foaf:maker link points to timbl:card (type foaf:PPD, type cluster tc1234), recorded as eqc0815-maker-tc1234; source http://www.w3.org/People/Berners-Lee/card]
SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
Schema Index Overview
3 Layers – 3 different graph patterns
Schema Computation
Building the Index from a Stream
Stream of n-quads (coming from a LD crawler)
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: FiFo cache over the quad stream, with cached instances mapped to classes C1, C2 and C3]
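One way to picture stream-based construction is a bounded first-in-first-out cache of instances: type information is accumulated while an instance is cached and written to the index when the instance is evicted. The sketch below is an assumption about the general idea, restricted to type clusters; `stream_type_clusters` and `cache_size` are hypothetical names, and `cache_size` must be at least 1.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def stream_type_clusters(quads, cache_size):
    """Build a type-cluster index from a quad stream using a FIFO instance cache."""
    cache = OrderedDict()      # subject -> (types, contexts), oldest first
    index = defaultdict(set)   # frozenset(types) -> contexts

    def flush(subject, entry):
        types, contexts = entry
        index[frozenset(types)] |= contexts

    for subject, predicate, obj, context in quads:
        if predicate != RDF_TYPE:
            continue
        if subject not in cache:
            if len(cache) >= cache_size:
                flush(*cache.popitem(last=False))  # evict the oldest instance
            cache[subject] = (set(), set())
        cache[subject][0].add(obj)
        cache[subject][1].add(context)

    for item in cache.items():  # write out whatever is still cached
        flush(*item)
    return index

stream = [
    ("ex:x", RDF_TYPE, "foaf:Person", "ds1"),
    ("ex:y", RDF_TYPE, "foaf:Person", "ds2"),
    ("ex:x", RDF_TYPE, "pim:Male", "ds1"),
]
large = stream_type_clusters(stream, cache_size=2)
small = stream_type_clusters(stream, cache_size=1)
```

With the small cache, `ex:x` is evicted before its second type arrives, so its type set is split across two index entries. This is the kind of accuracy loss that a bounded cache trades for scalability.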
Does it work well?
Comparison of the stream-based schema vs. a gold-standard schema on an 11 M triple data set
Does it scale?
Semantic Web Challenge, Billion Triples Track:
• Provision of a large-scale RDF data set
• Crawled from LOD
Task:
• Do something "useful"
• Do it (web-)scalable
• Do it with at least 1 billion triples
Presentation at ISWC
BTC results
1st billion 2nd billion full BTC
# triples 1 billion 1 billion 2.17 billion
# instances 187.7 M 222.6 M 450.0 M
# data sources 13.5 M 9.5 M 24.1 M
# type clusters 208.5 k 248.5 k 448.6 k
# equivalence classes 0.97 M 1.14 M 2.12 M
# triples index 29.1 M 24.8 M 54.7 M
Compression ratio 2.91% 2.48% 2.52%
# triples/sec. 40.5 k 45.6 k 39.5 k
1st place BTC‘11
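The compression ratios in the table are the index size divided by the number of input triples. A quick check with the rounded figures from the table (the variable names are illustrative):

```python
# (triples in the index, triples in the input), using the rounded table values.
rows = {
    "1st billion": (29.1e6, 1e9),
    "2nd billion": (24.8e6, 1e9),
    "full BTC": (54.7e6, 2.17e9),
}

# Compression ratio in percent, rounded to two decimals as in the table.
ratios = {name: round(100 * index / total, 2) for name, (index, total) in rows.items()}
```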
SchemEX: What comes next?
Hierarchy of semantic information:
• Type clusters
• Equivalence clusters
• Related types
Optimization:
• Smarter caching
• Performance – Hadoop
• Error correction
Conclusion
Take away message
The Web is evolving in interesting directions:
• Social networks, user generated content
• Semantic data
Challenges for IR:
• Different settings
• Different tasks
• Question basic assumptions
Thank you!
Contact: WeST – Institute for Web Science and Technologies, Universität Koblenz-Landau
Relevant Publications
1. A. Che Alhadi, S. Staab, and T. Gottron. Exploring user purpose writing single tweets. In WebSci ’11: Proceedings of the 3rd International Conference on Web Science, 2011.
2. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed. Livetweet: Microblog retrieval based on interestingness. In TREC’11: Proceedings of the Text Retrieval Conference 2011, 2011.
3. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed. Livetweet: Monitoring and predicting interesting microblog posts. In ECIR’12: Proceedings of the 34th European Conference on Information Retrieval, 2012. In preparation.
4. T. Gottron and N. Lipka. A comparison of language identification approaches on short, query-style texts. In ECIR ’10: Proceedings of the 32nd European Conference on Information Retrieval, pp. 611–614, Mar. 2010.
5. M. Konrath, T. Gottron, and A. Scherp. Schemex – web-scale indexed schema extraction of linked open data. In Semantic Web Challenge, Submission to the Billion Triple Track, 2011.
6. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Bad news travel fast: A content-based analysis of interestingness on twitter. In WebSci ’11: Proceedings of the 3rd International Conference on Web Science, 2011.
7. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Searching microblogs: Coping with sparsity and document quality. In CIKM’11: Proceedings of 20th ACM Conference on Information and Knowledge Management, 2011.
Attic
Use Cases
Business partners (extranet): SAP Community Network (SCN)
• Communities: customers, partners, suppliers, developers
• Business value: products support, services, find business partners
• Volume: 6,000 posts/day, 1,700,000 subscribers, 16GB log/day
Employees (intranet): Lotus Connections
• Communities: employees, working groups, interest groups, projects
• Business value: task relevant information, collaboration, innovation
• Volume: 4,000 posts/day, 386,000 employees, 1.5GB content/day
Public domain (internet): MeaningMine
• Communities: social media, news, web fora, public communities
• Business value: topics, opinions, service for partners
• Volume: 1,400,000 posts/day, 708,000 web sources, 45GB content/day
Twitter is different
Followers form a social graph: PageRank applicable?!
BUT:
• Following is not (only) motivated by content
• No statement about tweets!
Information seeking behaviour on Twitter
Web
• 2-4 query terms
• Broader terms
• Intentions: navigation, information, resources
• Get to know a topic
Twitter
• 1-2 query terms
• Specific terms
• Intentions: timely information, trends, people
• Follow a topic
TREC
Microblog Track 2011
• 12,000,000 tweets
• 2 weeks
• 49 "topics" (queries)
• Task: filtering
Constraints
• No external knowledge!
• English tweets only
• Temporal order of topic & tweets
• Official extension of "relevance" to "interestingness" (!!!)
WeST @ TREC Microblog Track
Basics:
• Lucene
• No length normalisation
• Interestingness
4 configurations:
• WESTfilter: retrieval via Lucene, filtering out non-interesting tweets
• WESTfilext: like WESTfilter, but with sentiments
• WESTrelint: like WESTfilter, but re-ranking according to interestingness
• WESTrlext: like WESTrelint, but with sentiments
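The difference between the filtering and the re-ranking configurations can be sketched as two post-processing steps on a ranked result list. The function names, the threshold parameter and the toy scores are assumptions; the slides do not specify how the filter decision was made.

```python
def filter_uninteresting(ranked, interestingness, threshold):
    """Filtering: keep the retrieval order, drop tweets below the threshold."""
    return [tweet for tweet in ranked if interestingness(tweet) >= threshold]

def rerank_by_interestingness(ranked, interestingness):
    """Re-ranking: keep all tweets, order them by interestingness instead."""
    return sorted(ranked, key=interestingness, reverse=True)

# Hypothetical ranked list (most relevant first) and interestingness scores.
ranked = ["t1", "t2", "t3"]
scores = {"t1": 0.2, "t2": 0.9, "t3": 0.6}

filtered = filter_uninteresting(ranked, scores.get, threshold=0.5)
reranked = rerank_by_interestingness(ranked, scores.get)
```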
Results
Filtering is significantly better than re-ranking
Sentiments are a disadvantage (not significant)
[Figure: score (0 to 0.4) per metric (P5, P10, P15, P20, P30, R-prec, bpref, MAP, nDCG) for WESTfilter, WESTfilext, WESTrelint and WESTrlext]
Results
Effective especially for shorter queries
[Figure: MAP (0 to 0.3) by query length in words (1 to 7) for WESTfilext, WESTfilter, WESTrelint and WESTrlext]
Schema representation using VoiD