Post on 21-Jan-2018
Extracting Emerging Knowledge from Social Media
Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Acero Salazar
marco.brambilla@polimi.it @marcobrambi
WWW 2017, Perth, Australia
Humans aim at formalizing knowledge
Ontology is the philosophical study of the nature of being, becoming, existence or reality, and of the basic categories of being and their relations.
Formalizing new knowledge is hard
Only high frequency emerges
The long tail challenge
"There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."
Shakespeare (Hamlet Act 1, scene 5)
The Answer to the Great Question... Of Life, the Universe and Everything
[Figure: Data → Information → Knowledge → Wisdom. Axes: context independence, understanding. Transitions: understanding relations, understanding patterns, understanding principles.]
Our focus: The Evolving Knowledge
[Figure: two overlapping sets, "known" and "social", with regions a, b, c, ¬c, d. Region labels: factoid, potentially emerging, potentially decaying, actual and solid.]
Heaven and Earth
How to peer into the world through an effective window?
TWO INGREDIENTS
Social media – the data
Domain experts – the context
Can we use social media to discover and codify emerging knowledge?
Overview
Famous Emerging
…
Knowledge Enrichment Setting
[Figure: instances on one side, types on the other, linked by <<instanceof>> edges. Instances: high-frequency entities (HF Entity1 to HF Entity5), low-frequency entities (LF Entity1 to LF Entity4), and unknown entities marked "??". Types: Type1, Type11, Type111, Type2, plus unknown types marked "??". Legend: seed entity, seed type, type of interest, expert inputs.]
Enrichment problems
Finding attribute values (Property1, Property2)
Relations HF - LF entities
Relations LF - LF entities
Typing of LF entities
Extraction of new LF entities
Emerging Knowledge Harvesting
Input (1): Domain Specific Types
Types selected by the expert
Relevant for the domain
Input (2): Seeds (emerging entities)
Known and selected by the domain expert
Belonging to an expert type
Thoroughly Described
Objectives
(1) Discover candidate unknown emerging entities
(2) Determine the relevance of the candidate
(3) Determine the type of the candidate
Step (1): Social Media Sourcing
Collect content produced by the seeds
Step (2): Candidate Extraction
Potentially any entity extracted from the social streams of the seeds
Resulting in huge sets of candidates
Our hypothesis: take only social network users (accounts) as candidates
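A minimal sketch of this extraction step, assuming Twitter-style `@handle` mentions and an in-memory dictionary of posts per seed. The function name `extract_candidates`, the regular expression, and the input layout are illustrative assumptions, not the authors' implementation.

```python
import re

# Assumption: handles are Twitter-style, i.e. "@" followed by letters,
# digits or underscores, and not preceded by a word character
# (so e-mail addresses such as a@b.com are skipped).
HANDLE_RE = re.compile(r"(?<!\w)@(\w+)")


def extract_candidates(posts_by_seed):
    """Return {candidate_handle: set of seeds that mentioned it}.

    posts_by_seed is a hypothetical layout: {seed_handle: [post text, ...]}.
    """
    seeds = {seed.lower() for seed in posts_by_seed}
    mentioned_by = {}
    for seed, posts in posts_by_seed.items():
        for post in posts:
            for handle in HANDLE_RE.findall(post):
                handle = handle.lower()
                if handle in seeds:
                    continue  # seeds are already known, so not candidates
                mentioned_by.setdefault(handle, set()).add(seed.lower())
    return mentioned_by


posts = {
    "seed_a": ["Reading tonight with @new_writer and @seed_b"],
    "seed_b": ["@new_writer great #book", "mail me at a@b.com"],
}
candidates = extract_candidates(posts)
```

Restricting candidates to account handles keeps the candidate set small compared with extracting every entity from the streams, which is the motivation stated on the slide.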
Step (3): Candidate Pruning
Initial pruning of candidates based on
TF-DF (*) := df * ttf / (N - df + 1)
Where:
df = number of seeds with which a candidate co-occurs;
ttf = total number of times a candidate occurs in the analyzed content;
N = number of seeds.
Ranking + threshold
(*) A variant of TF-IDF that does not discount document frequency, because here frequent appearance is desirable (we are not looking for information entropy!)
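The TF-DF score can be sketched as follows. The formula is the one on the slide; the input layout (a list of candidate mentions per seed), the function names, and the use of `>=` for the threshold are assumptions made for illustration.

```python
from collections import Counter


def tf_df_scores(mentions_by_seed):
    """Compute TF-DF = df * ttf / (N - df + 1) for every candidate.

    mentions_by_seed is a hypothetical layout:
    {seed: [candidate, candidate, ...]}, with repeats allowed.
    """
    n = len(mentions_by_seed)  # N: number of seeds
    df = Counter()             # seeds a candidate co-occurs with
    ttf = Counter()            # total occurrences of a candidate
    for mentions in mentions_by_seed.values():
        ttf.update(mentions)
        df.update(set(mentions))  # count each seed at most once
    return {c: df[c] * ttf[c] / (n - df[c] + 1) for c in ttf}


def prune(scores, threshold):
    """Rank candidates by score (descending) and keep those at or above threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(c, s) for c, s in ranked if s >= threshold]


mentions = {"s1": ["x", "x", "y"], "s2": ["x"], "s3": ["z"]}
scores = tf_df_scores(mentions)
kept = prune(scores, threshold=1.0)
```

In this toy input, candidate `x` is mentioned by two of the three seeds, three times in total, so its score is 2 * 3 / (3 - 2 + 1) = 3.0. The denominator shrinks as df grows, so candidates shared by many seeds are rewarded rather than discounted.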
Step (4): Candidate Description
Repeat social media sourcing for candidates
A potentially good candidate is one that behaves similarly to one or more of the seeds
Our hypothesis: a good candidate talks about the same things as the seeds
Step (5): Candidate Ranking
Seed centroid
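One way to read "seed centroid" is to represent each account as a feature-count vector, average the seed vectors, and rank candidates by their distance to that average. The sketch below assumes bag-of-features vectors and cosine distance; the slides mention three distance functions without naming them, so cosine is an illustrative choice, and all function names are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Component-wise mean of sparse count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {key: total[key] / len(vectors) for key in total}


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(seed_features, candidate_features):
    """Sort candidates by increasing distance from the seed centroid."""
    center = centroid([Counter(f) for f in seed_features.values()])
    scored = [
        (name, cosine_distance(Counter(features), center))
        for name, features in candidate_features.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


seeds = {"s1": ["#poetry", "#books"], "s2": ["#poetry", "@festival"]}
cands = {"c1": ["#poetry", "#books"], "c2": ["#football"]}
ranking = rank_candidates(seeds, cands)
```

Here `c1` shares features with the seeds and ranks first, while `c2` shares none and sits at the maximum distance of 1.0.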
Step (6): Feature selection
Purely syntactic:
only user handles (accounts)
handles and hashtags
Semantic:
based on entity extraction / DBpedia
based on deep learning on images / Clarifai
Step (6): Semantic Feature selection for text
9 basic strategies
Generating 18 combinations of T + E strategies
990 semantic strategies evaluated:
18 alternative feature vectors
11 different weighting values for aggregations
5 levels of recall for entity extraction
(+ 3 different distance functions analyzed)
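The three counts multiply to 18 x 11 x 5 = 990, which is consistent with a full grid over feature vectors, weighting values, and recall levels. The sketch below enumerates such a grid; the placeholder names, the 0.0 to 1.0 weight spacing, and the recall level numbering are assumptions, since the slides give only the counts.

```python
from itertools import product

# Placeholders only: the slides give the counts, not the actual values.
feature_vectors = [f"fv{i}" for i in range(1, 19)]  # 18 feature vectors
weights = [i / 10 for i in range(11)]               # 11 values (assumed 0.0..1.0)
recall_levels = list(range(1, 6))                   # 5 recall levels

# One strategy per (feature vector, weight, recall level) combination.
strategies = list(product(feature_vectors, weights, recall_levels))
print(len(strategies))
```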
Experiments
Fashion Brands
Writers
Exhibitions
Emerging Australian Writers – 22 seeds
http://www.emergingwritersfestival.org.au/ in June in Melbourne
Emerging Australian Writers
[Plot: weighting parameter vs. entity extraction recall]
Emerging Australian Writers
Precision @ K for two strategies
EHE—AST CHE—AST
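Precision @ K is the fraction of the top K ranked candidates that are actually relevant. A minimal sketch, assuming a ranked list of candidate names and an expert-validated set of relevant ones (both hypothetical inputs):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked candidates that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for candidate in ranked[:k] if candidate in relevant)
    return hits / k


ranked = ["a", "b", "c", "d"]  # best candidate first
relevant = {"a", "c"}          # e.g. confirmed by a domain expert
p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_4 = precision_at_k(ranked, relevant, 4)
```

Dividing by k even when fewer than k candidates are ranked is a convention chosen here; the slides do not state which variant was used.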
Cross-scenario
39 strategies always outperform the syntactic one
Writers
Expo
Fashion
Conclusions
Extraction of relevant emerging entities
Top, Fast and Reliable are the important
Off-the-shelf or as-a-service tools
Repeatability in time (years!)
Recursion (candidates to seeds)
Multi-source data collection
Multiple types
Emerging relations
Emerging types
Challenges ahead
You can try it yourself!
http://datascience.deib.polimi.it/social-knowledge
THANKS! QUESTIONS?
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it
http://home.deib.polimi.it/marcobrambi