Semantic Entity extraction from Sports Tweets

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Extracting Semantic Entities and Events from Sports Tweets

Smitashree Choudhury, John G. Breslin

Digital Enterprise Research Institute, NUI Galway

DIATA 2011, 14-15 September, Dusseldorf, Germany


Background

• Twitter has established itself as a real time communication platform.

• Events (natural disasters or scheduled events) are live tweeted.

– Communicate facts of events

– Express opinions about the event

• Most of the scheduled public events are video recorded and shared on web for wider consumption.

• Event tweets can be leveraged for localised annotation of event videos.


Objective

• To detect named entities and interesting micro-events within the main event.

• Domain of data: sport.

• The results are intended for temporally localised annotation of sports video.


Present Video Annotation Scenario

• Document level global tagging.

• A single user perspective (only the creator tags the video).

• User generated tags are sometimes personal and not so relevant to the semantic content of the video.

• No way to say which tag is associated to which segment of the video (localisation).

• Captions are manually created.

• Automatic Speech recognition (ASR) is error prone.


How Tweets Fit in …

• Hundreds of users tweet about the event (free annotation).

• Multiple user perspectives of a single event.

• Each tweet comes with a time stamp.

• Latency of tweets (between the actual event happening and the message) is very short (almost real time).


Challenges in Tweets

• Informal communication:

– Alternative ways of expressing words: “pls” for “please”, “forgt” for “forgot”.

• Lack of standard linguistic rules and grammar:

– Due to the space constraint, language rules are ignored when possible.

– Information extraction is challenging.

• The use of slang words, acronyms, compound hashtags (#globalwarming), community and event specific tags (#diata11).

• Spam tweets.


Data Collection

• Seed query list with the Twitter API:

– #cricket, #cwc2011, ICC cricket world cup, #cwc11, ENG vs IRE

• Datasets:

– Dataset (Feature): used to extract entity and event specific features (20,000 tweets).

– Dataset (Ground truth): manually annotated for players' names and the presence of interesting events, and used for testing (2,000 tweets).

– Dataset (Independent): used for testing (1,500 tweets).

• Background knowledge:

– Game website: player names, game time, playing countries.

– Wikipedia: rules and concepts of the game (“crease”, “field”, “wicket”, “boundary”, “six”, “four”, “mid wicket”, “run out”).


Noise Filtering

• Messages with only hashtags.

• Similar content, different user names (excluding RT): considered to be a case of multiple accounts.

• Similar content, same user name (within a small time range): considered to be duplicate tweets.

• Similar content, same user name, at multiple times: considered to be spam tweets.
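The filtering rules above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the record fields (`user`, `text`, `ts`), exact matching after normalisation as the notion of "similar content", and the 60-second value for "a small time range" are all assumptions.

```python
import re
from collections import defaultdict

SMALL_WINDOW_SECONDS = 60  # assumed value for "a small time range"

def normalise(text):
    """Lower-case and collapse whitespace so near-identical texts compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_hashtag_only(text):
    """True when every token in the message is a hashtag."""
    tokens = text.split()
    return bool(tokens) and all(t.startswith("#") for t in tokens)

def is_retweet(text):
    """Retweets are excluded from the similar-content rules."""
    return text.lstrip().upper().startswith("RT ")

def label_noise(tweets):
    """Return one label per tweet:
    'ok', 'hashtag_only', 'multiple_accounts', 'duplicate' or 'spam'.

    tweets: list of dicts with keys 'user', 'text', 'ts' (seconds).
    The earliest tweet with a given content keeps the label 'ok'.
    """
    labels = ["hashtag_only" if is_hashtag_only(t["text"]) else "ok" for t in tweets]

    by_content = defaultdict(list)
    for i, t in enumerate(tweets):
        if labels[i] == "ok" and not is_retweet(t["text"]):
            by_content[normalise(t["text"])].append(i)

    for indices in by_content.values():
        first_seen = {}  # user -> timestamp of that user's first post of this content
        for i in sorted(indices, key=lambda k: tweets[k]["ts"]):
            user, ts = tweets[i]["user"], tweets[i]["ts"]
            if user in first_seen:
                # same content, same user: duplicate if close in time, else spam
                near = ts - first_seen[user] <= SMALL_WINDOW_SECONDS
                labels[i] = "duplicate" if near else "spam"
            else:
                if first_seen:
                    # same content already posted under a different user name
                    labels[i] = "multiple_accounts"
                first_seen[user] = ts
    return labels
```

Tweets labelled anything other than `ok` would be dropped before feature extraction.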


Ground Truth Annotation

• Annotators were asked to label tweets (“yes”, “no” or “other”) for the presence of:

– any interesting events.

– players.

• Annotators were given the list of players of both teams and the list of events for reference.

• Inter-annotator agreement:

– 2 of 3 annotators had to agree for a label to be accepted.

– 3 annotators agreed on the label in 86% of cases.

– 2 annotators agreed in 94% of cases.
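The two-of-three rule can be expressed as a simple majority vote. The sketch below is an assumption about how such a rule might be coded. The function names are hypothetical and it does not reproduce the reported agreement figures.

```python
from collections import Counter

def majority_label(labels, min_agree=2):
    """Return the label given by at least `min_agree` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None

def full_agreement_rate(annotations):
    """Fraction of items on which all annotators gave the same label."""
    unanimous = sum(1 for labels in annotations if len(set(labels)) == 1)
    return unanimous / len(annotations)
```

A tweet whose three annotations all differ gets no label and would be left out of the ground truth.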


Features for Player

• Full name

– “Tim Bresnan looks not so happy” will be classified as a player-positive tweet.

– Only 30% of tweets contain a full name.

• Name variations

– First name (Kevin).

– Last name (Peterson).

– Player’s initials (KP).

– First name initial + last name (KPeterson).

• Twitter handles and hashtags.

• Nicknames (difficult to predict).

• The final name-related feature set looks like:

– (“firstname_only”, “lastname_only”, “initials”, “initial_plus_lastname”, “twitter_handle” and “player_hashtag”).

• Adding these variations increased recall but also added noise (irrelevant names were matched).
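One way to derive the name variations and turn them into binary features is sketched below. It is a minimal illustration under assumptions: the function names are hypothetical, matching is case-insensitive on whole tokens, and the handle and hashtag must be supplied by hand because they cannot be derived from the name.

```python
import re

def name_variations(full_name, twitter_handle=None, player_hashtag=None):
    """Build the variation strings for one player from a 'First Last' name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        "full_name": full_name,
        "firstname_only": first,
        "lastname_only": last,
        "initials": "".join(p[0] for p in parts).upper(),
        "initial_plus_lastname": first[0] + last,
        "twitter_handle": twitter_handle,   # e.g. "@someplayer" (hypothetical)
        "player_hashtag": player_hashtag,   # e.g. "#someplayer" (hypothetical)
    }

def name_features(tweet, variations):
    """Return 1/0 per variation depending on whether it occurs as a whole token.

    Note that a full-name match also sets the first-name and last-name features.
    """
    features = {}
    for key, value in variations.items():
        if not value:
            features[key] = 0
            continue
        # (?<!\w) and (?!\w) act as word boundaries that also work for @ and #
        pattern = r"(?<!\w)" + re.escape(value) + r"(?!\w)"
        features[key] = int(re.search(pattern, tweet, flags=re.IGNORECASE) is not None)
    return features
```

Short variations such as initials match many unrelated tokens, which is consistent with the noise reported above.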


Features for Player (domain concept in entity context)

• Once a player is detected from the tweet, we scan the tweet with a contextual window of 4 terms to detect the presence or absence of a game related concept.

#Cricket : Kevin O'Brien playing some glorious shots..!! :)

@slbry - Mooney smokes another over mid-wicket. Four !! :) #cwc2011

First SIX of tournament for Afridi!!! #cwc2011

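[Figure: feature vector for the example tweet “Mooney smokes another over mid-wicket. Four !!”. Columns: Full Name, FN, LN, Initials, Initials+LN, Context word, Twitter handler, Player hashtag, Label, Event Label. Callouts: “Mooney” (John Mooney), “over mid-wicket”, “Four”. Row values as extracted: 0 0 1 0 0 1 1 1]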
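The context-window check can be sketched as below. The concept list is taken from the Wikipedia examples given under Data Collection. Everything else is an assumption: the tokeniser, treating hyphenated words as one term, and looking 4 terms to each side of the player mention.

```python
import re

# Concepts quoted in the slides; a real list would be longer.
GAME_CONCEPTS = {"crease", "field", "wicket", "boundary",
                 "six", "four", "mid wicket", "run out"}
WINDOW = 4  # contextual window size in terms

def tokenize(text):
    """Lower-case word tokens; apostrophes and hyphens inside a word are kept."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def context_concepts(tweet, player_token, window=WINDOW):
    """Return the game concepts found within `window` terms of the player mention."""
    tokens = tokenize(tweet)
    target = player_token.lower()
    found = set()
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        span = tokens[max(0, pos - window): pos + window + 1]
        # single terms (hyphen treated as a space) plus adjacent pairs for two-word concepts
        candidates = [t.replace("-", " ") for t in span]
        candidates += [" ".join(pair) for pair in zip(span, span[1:])]
        found.update(c for c in candidates if c in GAME_CONCEPTS)
    return found
```

The binary "Context word" feature would then be 1 when the returned set is non-empty.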


Micro-Event Detection

• Micro-events are the highlight points within the main event, e.g.:

– Goal in a soccer match.

– Wicket in a cricket match.

• Expected to generate excitement from multiple users.

• Measured through burst detection (Kleinberg, 2002).

• Linguistic feature based event classification.


Features for Micro-Event

• A list of possible events during a match was collected from the Wikipedia “Rules of the game” page.

• We focused on four categories of events:

– “Fall of a wicket” (a player getting out).

– Scoring a “SIX”.

– Scoring a “FOUR”.

– Others.

• An event is represented with related terms and its lexical variations.

#sixer from #kevinobrien for #ireland against #england #cricket

Kevin O'Brien OUT ! Ireland 317/7 (48.1 ov) #ENGvsIRE #cricket #wc11

Crap O'Brien goes ARGH!!!


Tweet Volume

• The presence of event-related terms in a single tweet does not by itself indicate that the event occurred.

• An interesting event has to be validated with additional conditions:

– Tweet volume > mean + 1 sd.

– Number of unique Twitter users during the event.

[Figure: Tweet vs RTs. Tweet and retweet volumes per minute.]
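The volume condition can be checked per minute as sketched below. The slides give no threshold for the number of unique users, so `min_unique_users` is an assumed parameter, and the use of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev

def bursty_minutes(volume_per_minute, users_per_minute, min_unique_users=10):
    """Return indices of minutes whose tweet volume exceeds mean + 1 sd
    and that have at least `min_unique_users` distinct users."""
    threshold = mean(volume_per_minute) + stdev(volume_per_minute)
    return [
        minute
        for minute, (volume, users) in enumerate(zip(volume_per_minute, users_per_minute))
        if volume > threshold and users >= min_unique_users
    ]
```

A minute with high volume from only a few accounts is rejected, which guards against a single user posting repeatedly.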


Result (Player detection)

• Recall and precision.

• Player named entity recognition against the ground truth.

• Low recall for positive classification: 70%.

• Negative classification: 90%.

• Highly ranked users may help.


Result (Event Detection)

• Using only linguistic features gives low recall (70%).

• Combining features increases recall to 86%.

• Performance for the “no” labels is always better than for the “yes” labels (dominance of negative samples).


Result: Test on Dataset (independent)

• Tested on Dataset (Independent).

• Not part of the original training.

• Used an NLP parser for proper noun detection.

• Event detection was better than player name detection (Black & Grey).


Result: Feature weight

• Experimental results showed that the context feature combined with name variations has the strongest discriminative power among all features.


Tweet to Timeline Localisation (not part of this paper)

• Segmented the tweet timeline into temporal intervals of 2 minutes.

• Detected entities and events can be aligned to a video timeline.

• A video can be segmented based on semantic proximity of events and entities.
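The 2-minute segmentation can be sketched as a simple binning step. The sketch assumes that tweet timestamps and the video start time are on the same clock and expressed in seconds. The function name and the `(timestamp, label)` input format are hypothetical.

```python
from collections import defaultdict

INTERVAL_SECONDS = 120  # 2-minute temporal interval

def bin_annotations(detections, start_ts, interval=INTERVAL_SECONDS):
    """Group (timestamp, label) detections into fixed intervals from start_ts.

    Returns {interval_index: sorted labels}; interval k covers the video
    offsets [k * interval, (k + 1) * interval) in seconds.
    """
    bins = defaultdict(set)
    for ts, label in detections:
        if ts < start_ts:
            continue  # tweet predates the start of the recording
        bins[int((ts - start_ts) // interval)].add(label)
    return {index: sorted(labels) for index, labels in sorted(bins.items())}
```

Each interval index then maps to a video segment that can carry the detected players and events as annotations.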


Observations and Future Work

• Named entity detection proved to be difficult in tweets.

• Events can be detected with light computation.

• Temporally localised deep annotation of an event video is possible with the help of user tweets.

Future Work

• Cross-domain comparison (conferences, entertainment, etc.).

• Contributions of user authority.

