Unsupervised Learning of Social Networks
from a Multiple-Source News Corpus
Hristo Tanev
European Commission, Joint Research Centre
[email protected]
Introduction
• Social networks provide an intuitive picture of inferred relationships between entities such as people and organizations.
• Social network analysis uses social networks to identify underlying groups, communication patterns, and other information.
• Manual construction of a social network is a very laborious task. Algorithms for automatic detection of relations can save time and human effort.
Introduction
• We present an unsupervised methodology for automatic learning of social networks.
• We use a multiple-source, syntactically parsed news corpus.
• To overcome the efficiency problems that emerge from using syntactic information on real-world data, we put forward an efficient graph matching algorithm.
Related work
• Learning social networks from Friend-Of-A-Friend links (Mika 2005) or statistical co-occurrences
• Disadvantage: cannot detect the type of the relation
Related work
• Support Vector Machines (SVM) provide more accurate means for relation extraction (Zelenko et al. 2003)
• Disadvantages:
  • require a sufficient amount of annotated data
  • each pair of named entities should be evaluated separately, which slows down the relation extraction
Related work
• (Romano et al. 2006) propose a generic unsupervised method for learning of syntactic patterns for relation extraction
• Disadvantages:
  • they use the Web as a training corpus, which makes the learning very slow
  • they match each pattern against each sentence, which is not efficient when matching many templates against a big corpus
Unsupervised learning of social networks
• Our algorithm is unsupervised: it accepts as input one, two, or some other small number of two-slot seed syntactic templates which express a certain semantic relation.
• The algorithm uses news clusters to learn new syntactic patterns expressing the same semantic relation.
• Once the patterns are learned, we apply a novel efficient methodology for pattern matching to extract related person names from the text.
• Extracted relations are aggregated into a social network.
EMM news clusters
• European Media Monitor (EMM) downloads news from different sources around the clock.
• Every day, 4000-5000 English-language news articles are downloaded.
• The news articles are grouped into topic clusters.
Parsing the corpus
• The training and the test corpus consist of English-language news articles from 200 sources.
• Articles are parsed with a full dependency parser, MiniPar.
[Figure: dependency tree for a sentence with the verb "meet"; nodes "Bush", "Blair" and "March"; arc labels subj, obj and in]
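A dependency parse like the one in the figure can be held as a list of labelled arcs. The sketch below is only an illustration: the sentence "Bush met Blair in March", the direction of the subj and obj arcs, and the `Arc` and `dependents` names are assumptions, not MiniPar's actual output format.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    head: str       # governing word (lemma)
    label: str      # dependency label, e.g. "subj", "obj", "in"
    dependent: str  # governed word

# Assumed parse of the hypothetical sentence "Bush met Blair in March".
tree = [
    Arc("meet", "subj", "Bush"),
    Arc("meet", "obj", "Blair"),
    Arc("meet", "in", "March"),
]

def dependents(tree, head, label):
    """Return all words attached to `head` by an arc with the given label."""
    return [arc.dependent for arc in tree if arc.head == head and arc.label == label]

print(dependents(tree, "meet", "subj"))  # ['Bush']
print(dependents(tree, "meet", "obj"))   # ['Blair']
```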
Learning patterns
• Manually provide a very small number of seed syntactic templates which express the main relation. For example, for the relation “X supports Y” we use the syntactic patterns:
X subj support obj Y
X subj praise obj Y
Learning patterns
• Match these templates against the news clusters in the corpus. Each pair of person names which fill the slots X and Y is called an anchor pair.
• From “Bush praised the Prime Minister Hamid Karzai”, the algorithm will extract the anchor pair (X:Bush; Y:Hamid Karzai)
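The seed-matching step can be sketched as follows. This is a minimal illustration under stated assumptions: a parsed sentence is simplified to a flat list of (head, label, dependent) arcs, the title "Prime Minister" is dropped from the parse, and `match_template` is a hypothetical helper rather than the matcher used in this work.

```python
def match_template(arcs, verb):
    """Match the two-slot template  X <-subj- verb -obj-> Y  against one sentence."""
    subjects = [dep for head, label, dep in arcs if head == verb and label == "subj"]
    objects = [dep for head, label, dep in arcs if head == verb and label == "obj"]
    return [(x, y) for x in subjects for y in objects]

# Simplified, assumed parse of "Bush praised the Prime Minister Hamid Karzai".
sentence = [("praise", "subj", "Bush"), ("praise", "obj", "Hamid Karzai")]

seed_verbs = ["support", "praise"]  # the two seed templates from the slide
anchor_pairs = [pair for verb in seed_verbs for pair in match_template(sentence, verb)]
print(anchor_pairs)  # [('Bush', 'Hamid Karzai')]
```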
Learning patterns
• Normalize the anchor pairs using the information in the EMM database.
• After this step, the example anchor pair will become (X:George W. Bush; Y:Hamid Karzai).
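Name normalization can be pictured as a lookup from name variants to a canonical form. In the sketch below the `ALIASES` table is a hypothetical stand-in for the EMM database, whose actual contents and interface are not described in these slides.

```python
# Hypothetical alias table standing in for the EMM database lookup.
ALIASES = {
    "Bush": "George W. Bush",
    "President Bush": "George W. Bush",
    "Karzai": "Hamid Karzai",
}

def normalize(name):
    """Map a name variant to its canonical form; unknown names pass through."""
    return ALIASES.get(name, name)

def normalize_pair(pair):
    """Normalize both slots of an anchor pair."""
    x, y = pair
    return (normalize(x), normalize(y))

print(normalize_pair(("Bush", "Hamid Karzai")))  # ('George W. Bush', 'Hamid Karzai')
```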
Learning patterns
• For each extracted anchor pair, search the same cluster for all sentences in which both names of the anchor pair occur.
• The assumption is that the same relation holds between the same pair of names throughout the news cluster, since all articles in it share the same topic.
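Collecting the sentences that mention both names of an anchor pair can be sketched like this. Plain substring matching on raw text is a simplifying assumption (the real system works on parsed, normalized text), and the example cluster is invented for illustration.

```python
def sentences_with_pair(cluster, pair):
    """Return the sentences of one cluster that mention both names of the pair."""
    x, y = pair
    return [sentence for sentence in cluster if x in sentence and y in sentence]

# Hypothetical sentences from a single topic cluster.
cluster = [
    "Bush praised the Prime Minister Hamid Karzai.",
    "Bush expressed his backing for Hamid Karzai on Tuesday.",
    "Hamid Karzai arrived in Washington.",
]

for sentence in sentences_with_pair(cluster, ("Bush", "Hamid Karzai")):
    print(sentence)
```

The second sentence is the kind of candidate from which a new pattern for the same relation could be learned.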
Learning patterns
• From all the sentences in which at least one anchor pair appears, learn syntactic patterns using our pattern-learning algorithm, which is similar to the General Structure Learning (GSL) algorithm described in (Szpektor et al. 2006).
• Example: X←subj-agree-with→Y
• Each pattern is scored by the number of different anchor pairs which support it.
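The scoring rule (number of different anchor pairs per pattern) can be sketched as below. The observations are invented for illustration, and the pattern strings are just labels; the pattern-learning step itself is not reproduced here.

```python
from collections import defaultdict

def score_patterns(observations):
    """Score each pattern by the number of distinct anchor pairs supporting it.

    observations: iterable of (pattern, anchor_pair) tuples.
    """
    support = defaultdict(set)
    for pattern, pair in observations:
        support[pattern].add(pair)  # repeated pairs are counted once
    return {pattern: len(pairs) for pattern, pairs in support.items()}

# Hypothetical observations; the repeated first pair counts only once.
observations = [
    ("X subj agree with Y", ("George W. Bush", "Hamid Karzai")),
    ("X subj agree with Y", ("George W. Bush", "Hamid Karzai")),
    ("X subj agree with Y", ("Tony Blair", "George W. Bush")),
    ("X subj say Y", ("Tony Blair", "George W. Bush")),
]
print(score_patterns(observations))
# {'X subj agree with Y': 2, 'X subj say Y': 1}
```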
Learning patterns
• Pattern selection and filtering
  • Filter out all templates which appear for fewer than 2 anchor pairs.
  • Remove generic patterns like “X say Y”, “X have Y”, “X is Y”, etc. using a predefined template list.
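The two filters can be sketched as a single pass over the scored patterns. The `GENERIC` set contains only the three examples named above (the full predefined list is not given), and the scores are hypothetical.

```python
# Only the examples named on the slide; the full predefined list is not given.
GENERIC = {"X say Y", "X have Y", "X is Y"}

def select_patterns(scores, min_pairs=2, generic=GENERIC):
    """Keep patterns supported by at least `min_pairs` anchor pairs
    that are not on the list of generic patterns."""
    return {
        pattern: score
        for pattern, score in scores.items()
        if score >= min_pairs and pattern not in generic
    }

# Hypothetical scores (pattern -> number of distinct anchor pairs).
scores = {"X agree with Y": 3, "X say Y": 12, "X back Y": 1}
print(select_patterns(scores))  # {'X agree with Y': 3}
```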
Syntactic Network model
• “Prodi met President Bush in September”
• “Berlusconi met President Chirac”
Syntactic Network model
Adding syntactic templates
Efficiency
• The worst-case time complexity of building SyntNet is O(|w| log |w|), where |w| is the number of words in the parsed corpus.
• The worst-case time complexity of the syntactic matching algorithm is bounded by O((|s|+|t|) log MaxArcO), where |s| is the number of sentences in the corpus, |t| is the number of templates, and MaxArcO is the maximum number of occurrences of a SyntNet arc, i.e. the size of the maximal index set of a SyntNet arc.
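The slides do not spell out the SyntNet data structure, so the following is only a guess at the general idea suggested by "index set of a SyntNet arc": each arc type keeps the set of sentences it occurs in, and a template is matched by intersecting the index sets of its arcs instead of scanning every sentence. The function names and the flat arc representation are assumptions, the parses drop the word "President", sentence 3 is invented, and the sketch makes no attempt to reproduce the stated complexity bounds.

```python
from collections import defaultdict

def build_arc_index(corpus):
    """corpus: {sentence_id: [(head, label, dependent), ...]}.

    Returns {(head, label): {sentence_id: dependent}}. Keeping one dependent
    per sentence and arc type is a simplification.
    """
    index = defaultdict(dict)
    for sentence_id, arcs in corpus.items():
        for head, label, dependent in arcs:
            index[(head, label)][sentence_id] = dependent
    return index

def match(index, verb):
    """Match  X <-subj- verb -obj-> Y  by intersecting two index sets."""
    subj = index.get((verb, "subj"), {})
    obj = index.get((verb, "obj"), {})
    shared = sorted(subj.keys() & obj.keys())  # sentences containing both arcs
    return [(subj[sid], obj[sid]) for sid in shared]

corpus = {
    1: [("meet", "subj", "Prodi"), ("meet", "obj", "Bush"), ("meet", "in", "September")],
    2: [("meet", "subj", "Berlusconi"), ("meet", "obj", "Chirac")],
    3: [("praise", "subj", "Bush")],
}
index = build_arc_index(corpus)
print(match(index, "meet"))    # [('Prodi', 'Bush'), ('Berlusconi', 'Chirac')]
print(match(index, "praise"))  # []
```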
Evaluation schema
• To evaluate our algorithm, we learned syntactic patterns for “meeting” and “support” relationships between people.
• We evaluate how well the algorithm captures relationships between the top 33 VIPs from our database.
• We do not evaluate how well it captures individual relation mentions.
• If a specific relation (e.g. “meeting”) holds between a pair of people X and Y, it is sufficient that the algorithm finds at least one mention of this relation between X and Y.
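Evaluating at the level of related pairs rather than mentions amounts to collapsing all mentions of a pair into one set element before scoring. The sketch below illustrates this with invented data; the function name and the placeholder person labels are assumptions.

```python
def pair_level_scores(predicted_mentions, gold_pairs):
    """Precision, recall and F1 over sets of related (X, Y) pairs.

    Several extracted mentions of the same pair count as a single prediction.
    """
    predicted, gold = set(predicted_mentions), set(gold_pairs)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical data: ("A", "B") is extracted twice but counted once.
mentions = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "D")]
gold = {("A", "B"), ("B", "D"), ("C", "D")}
precision, recall, f1 = pair_level_scores(mentions, gold)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```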
Experiments
• For paraphrase learning, we used a training corpus of 98'000 English-language news articles clustered in 22'000 EMM topic clusters, published in the period 01/May/2006 to 03/Oct/2006.
• For testing the method, we used 125'000 English-language news articles published in the period 03/Oct/2006 to 31/Oct/2006.
• Reading the test corpus and the templates into memory and building SyntNet+ took 9 min and 3 sec. Matching the 101 syntactic templates against the test corpus of about 1'080'000 parsed sentences took only 45 seconds.
• We normalized extracted names using the EMM database.
Relationship extraction evaluation on the top 33 VIPs from the EMM DB
          Precision  Recall  F1
meeting   0.61       0.56    0.58
support   0.57       0.10    0.17
overall   0.60       0.32    0.42
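As a quick consistency check, the F1 column can be recomputed from the reported precision and recall, assuming F1 is the usual harmonic mean 2PR/(P+R):

```python
# (precision, recall) as reported in the table above.
reported = {
    "meeting": (0.61, 0.56),
    "support": (0.57, 0.10),
    "overall": (0.60, 0.32),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for relation, (p, r) in reported.items():
    print(relation, round(f1(p, r), 2))
# meeting 0.58
# support 0.17
# overall 0.42
```

The recomputed values agree with the reported F1 figures.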
Using the social network view
• We ran the PageRank algorithm on the automatically extracted “meeting” network and found the top 5 ranked people.
• We compared this ranking with a simple frequency-based ranking of people.
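The two ranking schemes can be contrasted on a toy network. The sketch below is a plain power-iteration PageRank, not necessarily the variant used in the experiments; treating the "meeting" network as undirected and unweighted, the damping factor 0.85, and the placeholder people A to E are all assumptions.

```python
from collections import Counter, defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as (u, v) pairs."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    nodes = list(neighbours)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / len(nodes)
            + damping * sum(rank[m] / len(neighbours[m]) for m in neighbours[node])
            for node in nodes
        }
    return rank

# Hypothetical extracted "meeting" mentions; A and B meet repeatedly.
mentions = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]

# Frequency ranking: how often each person appears in a mention.
frequency = Counter(person for pair in mentions for person in pair)

# PageRank ranking: computed on the network of distinct pairs.
rank = pagerank(set(mentions))

print("top by frequency:", frequency.most_common(1)[0][0])
print("top by PageRank:", max(rank, key=rank.get))
```

In this toy example B is mentioned most often, while C, who is connected to more distinct people, comes out on top under PageRank, which illustrates how the two rankings can diverge.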
Comparing two people ranking schemas
PageRank     Frequency
C. Rice      G.W. Bush
G.W. Bush    T. Blair
V. Putin     C. Rice
E. Olmert    N. al-Maliki
T. Blair     S. Hussein
Conclusions and future work
• We presented an unsupervised method for learning social networks from news clusters.
• We presented a very efficient syntactic pattern matching algorithm.
• Automatically learned social networks can be used for some analyst tasks.
• In future work, we will try to consider more types of relations.
• We are also considering learning and using more abstract patterns.
THANK YOU!