Automatic Extraction of Soccer Game Event Data from Twitter

Post on 05-Dec-2014

description

Presentation at DeRiVE 2012. Paper at: http://ceur-ws.org/Vol-902/paper_3.pdf Slides by Guido van Oorschot

transcript

Automatic extraction of soccer game event data from Twitter

Guido van Oorschot, Marieke van Erp and Chris Dijkshoorn

Monday, November 12, 12

Soccer data

Theory

1. A fair body of research exists on automated sports highlight extraction

2. Twitter data can offer interesting insights into real-world phenomena

Automated highlight detection

Let’s Use Twitter data!

3 Tasks

1. Detecting events: In which minutes did events occur?

2. Classifying events: Is the event a goal, card or substitution?

3. Assigning events to teams: Is the event for the home team or the away team?

5 types of events

- Goal

- Own Goal

- Red Card

- Yellow Card

- Substitution

Methodology

1. Gathering the data

2. Exploring and cleaning the data

3. Classifying interesting data points

Gathering data

- Collect all tweets with game hashtags

#ajafey #nacgro #psvutr

- Collect official data for each match

Goals, cards, substitutions

Our data

6 months

61 games

661 events

10,643 tweets

Three Experiments

1. Detecting events

2. Classifying events

3. Assigning events to teams

1. Detecting events

1. Experimental Setup

- Goal: detect peaks in the tweets-per-minute signal to extract events

- Setup: Test three peak detection methods:

1. LocMaxNoBaseLineCorr

2. IntThresNoBaseLineCorr

3. IntThresWithBaseLineCorr
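The slides name the three peak detection methods but do not define them. The sketch below is one possible reading, not the authors' implementation: local maxima on the raw signal, an intensity threshold on the raw signal, and the same threshold after subtracting a baseline. The function names, the trailing moving-average baseline, the `window` and `threshold` parameters, and the tweet counts are all assumptions made for illustration.

```python
from statistics import mean


def local_maxima(counts):
    """Minutes whose tweet count is strictly higher than both neighbours."""
    return [i for i in range(1, len(counts) - 1)
            if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]]


def above_threshold(signal, threshold):
    """Minutes whose signal value exceeds a fixed intensity threshold."""
    return [i for i, value in enumerate(signal) if value > threshold]


def baseline_corrected(counts, window=3):
    """Subtract a trailing moving average (an assumed baseline estimate)."""
    corrected = []
    for i, count in enumerate(counts):
        past = counts[max(0, i - window):i]
        baseline = mean(past) if past else count
        corrected.append(count - baseline)
    return corrected


# Made-up tweets-per-minute counts for twelve minutes of one game.
counts = [3, 4, 3, 5, 40, 22, 6, 5, 4, 30, 8, 5]

print(local_maxima(counts))                              # [1, 4, 9]
print(above_threshold(counts, 20))                       # [4, 5, 9]
print(above_threshold(baseline_corrected(counts), 20))   # [4, 9]
```

On this toy signal, the local-maximum variant also flags the small bump at minute 1, and the plain threshold flags minute 5, which is only the tail of the burst at minute 4. Effects like these are one plausible reason peak selection is hard to make precise.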

1. Results

1. Findings

- Goals and red cards are detected better than yellow cards and substitutions

- None of the three peak selection methods works well.

- Highlights can be extracted, but not precisely enough

2. Classifying Events

- Goal: Classify minutes into event classes

minute  “goal”  “1”  “red”  “card”  “boring”  class
34      0       2    0      1       20        nothing
35      23      34   0      0       0         goal
12      1       2    0      0       5         nothing
13      1       0    22     11      0         red card
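A minute-by-word count matrix like the one above could be built roughly as follows. This is a minimal sketch: the function name, the lowercase whitespace tokenizer, and the example tweets are assumptions, not details taken from the slides.

```python
from collections import Counter


def minute_word_counts(tweets, vocabulary):
    """Map each minute to a list of word counts, one per vocabulary word.

    tweets: iterable of (minute, text) pairs.
    """
    per_minute = {}
    for minute, text in tweets:
        # Naive tokenization: lowercase and split on whitespace (assumption).
        per_minute.setdefault(minute, Counter()).update(text.lower().split())
    return {minute: [counter[word] for word in vocabulary]
            for minute, counter in per_minute.items()}


# Made-up tweets for two minutes of one game.
tweets = [
    (35, "GOAL 1 0"),
    (35, "goal what a goal"),
    (34, "boring game"),
]
vocabulary = ["goal", "1", "boring"]

rows = minute_word_counts(tweets, vocabulary)
print(rows[35])  # [3, 1, 0]
print(rows[34])  # [0, 0, 1]
```

Each row would then be paired with a class label (goal, red card, nothing, and so on) taken from the official match data.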

Issues

Problem: Huge, sparse matrix

1. Reduce features: choose words/features smartly

2. Reduce instances: choose minutes smartly

2. Experimental Setup

- 3 Instance selection settings

1. AllMinutes

2. PeakMinutes

3. EventMinutes

2. Experimental Setup

- 7 Feature selection settings

1. AllMoreThanOnce

2. Top500TotalFreq

3. Top10MinuteFreq

4. Top500TotalTfIdf

5. Top10MinuteTfIdf

6. Top50Infogain

7. Top50GainRatio
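The last two settings rank words by information gain and gain ratio. The sketch below shows how these two scores could be computed for a single word, under the simplifying assumption that each word is treated as a binary feature (present or absent in a minute). The slides do not say how the features were discretized, and the function names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a non-empty list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total)
                for n in Counter(values).values())


def info_gain_and_ratio(present, labels):
    """Information gain and gain ratio of one binary word feature.

    present: list of bools (does the word occur in this minute?).
    labels:  event class of each minute.
    """
    total = len(labels)
    groups = {True: [], False: []}
    for is_present, label in zip(present, labels):
        groups[bool(is_present)].append(label)
    # Expected class entropy after splitting on the word.
    remainder = sum(len(group) / total * entropy(group)
                    for group in groups.values() if group)
    gain = entropy(labels) - remainder
    # Gain ratio normalizes the gain by the entropy of the split itself.
    split_info = entropy([bool(p) for p in present])
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


labels = ["goal", "goal", "nothing", "nothing"]
informative = [True, True, False, False]    # word only occurs in goal minutes
uninformative = [True, False, True, False]  # word occurs equally in both

gain_a, ratio_a = info_gain_and_ratio(informative, labels)
gain_b, ratio_b = info_gain_and_ratio(uninformative, labels)
```

Here the first word separates the classes perfectly and scores 1 on both measures, while the second scores 0. A Top50 setting would keep the 50 highest-scoring words.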

2. Experimental Setup

- 6 types of classifiers

1. C4.5

2. RandomForest

3. NaiveBayes

4. NaiveBayesMultinomial

5. libSVM

6. IB1

2. Results

2. Discussion

- Top50GainRatio best feature selection

- libSVM best classifier

- EventMinutes results:

Class         F-measure
OVERALL       0.822
Goal          0.841
Own goal      0.000
Red card      0.848
Yellow card   0.785
Substitution  0.839
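For reference, the per-class F-measure is the harmonic mean of precision and recall. The sketch below computes it from true and predicted labels; the function name and the labels are made up, and the slides do not state how the OVERALL figure was aggregated, so no aggregation is shown.

```python
def f_measure(true_labels, predicted_labels, target):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    if tp == 0:
        # No correct prediction for this class: report 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for four event minutes.
truth = ["goal", "goal", "red card", "own goal"]
predicted = ["goal", "red card", "red card", "goal"]

goal_f = f_measure(truth, predicted, "goal")
own_goal_f = f_measure(truth, predicted, "own goal")
red_card_f = f_measure(truth, predicted, "red card")
```

In this toy example the own goal is predicted as a regular goal, so its F-measure is 0. A score of 0.000 for a class is what this formula yields whenever no instance of that class is predicted correctly.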

3. Experimental Setup

- Goal: Assign events to teams

- Based on the ratio between tweets from fans of the home team and fans of the away team

- But first: extract fans
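The slides do not spell out the decision rule behind this ratio. One hypothetical rule, sketched below, compares the home fans' share of tweets in the event minute with their share over the whole match and picks the side whose share went up. The function name, the rule itself, and the numbers are assumptions for illustration only.

```python
def assign_team(home_event, away_event, home_match, away_match):
    """Assign one event to 'home' or 'away' from fan tweet counts.

    home_event / away_event: fan tweets in the event minute.
    home_match / away_match: fan tweets over the whole match.
    Returns None when there is no fan activity to compare.
    """
    event_total = home_event + away_event
    match_total = home_match + away_match
    if event_total == 0 or match_total == 0:
        return None
    event_share = home_event / event_total
    match_share = home_match / match_total
    # Assumed rule: the side whose share rises above its match-wide
    # level is taken to be the side the event belongs to.
    return "home" if event_share > match_share else "away"


print(assign_team(30, 10, 400, 400))  # home
print(assign_team(5, 20, 300, 100))   # away
print(assign_team(0, 0, 10, 10))      # None
```

Comparing against the match-wide share rather than a fixed 50% is a design choice meant to compensate for one team simply having more fans on Twitter. Whether the authors corrected for this is not stated.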

3. Extracting fans

- Hypothesis:

People who tweet about the same team every week are probably fans of that team
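This hypothesis could be turned into a simple filter, sketched below. It assumes that every game hashtag consists of two three-letter team codes (as in #ajafey) and that a user must have tweeted about at least `min_matches` different games. Both the hashtag format and the threshold are assumptions, and the user data is made up.

```python
from collections import Counter


def extract_fans(user_hashtags, min_matches=3):
    """Map users to the single team that appears in all their games.

    user_hashtags: {user: [game hashtag of each tweet]}.
    min_matches:   hypothetical minimum number of distinct games.
    """
    fans = {}
    for user, tags in user_hashtags.items():
        matches = set(tags)
        if len(matches) < min_matches:
            continue
        team_counts = Counter()
        for tag in matches:
            code = tag.lstrip("#")
            # Assumed format: home code followed by away code.
            team_counts.update({code[:3], code[3:6]})
        candidates = [team for team, n in team_counts.items()
                      if n == len(matches)]
        if len(candidates) == 1:
            fans[user] = candidates[0]
    return fans


users = {
    "u1": ["#ajafey", "#ajapsv", "#nacaja"],  # always a game involving aja
    "u2": ["#ajafey", "#nacgro", "#psvutr"],  # different teams every time
    "u3": ["#ajafey"],                        # too few games to decide
}

print(extract_fans(users))  # {'u1': 'aja'}
```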

3. Extracting fans

- Extracted 38,527 fans from 146,326 users (26%)

- This method of extracting fans works well:

Right team  Not clear  Wrong team
88%         10%        2%

3. Results

- Assigning events to teams performs at or above the baseline for every class:

Class         Baseline  Performance
OVERALL       52%       58%
Goal          58%       69%
Red card      50%       62%
Yellow card   63%       63%
Substitution  52%       57%

Conclusion

1. Detecting events => difficult

2. Classifying events => good results

3. Assigning events to teams => promising results

Future Work

- Use sentiment in tweets (for detecting events and assigning events to teams)

- Player detection

- Other sports

Questions?