Automatic Extraction of Soccer Game Event Data from Twitter

Automatic extraction of soccer game event data from Twitter

Guido van Oorschot, Marieke van Erp and Chris Dijkshoorn

Monday, November 12, 12

Soccer data

Theory

1. Fair body of research on automated sports highlight extraction

2. Twitter data can offer interesting insights into real-world phenomena

Automated highlight detection

Let’s Use Twitter data!

1. Detecting events: In which minutes did events occur?

2. Classifying events: Is the event a goal, card or substitution?

3. Assigning events to teams: Is the event for the home team or the away team?

3 Tasks

5 types of events

- Goal

- Own Goal

- Red Card

- Yellow Card

- Substitution

Methodology

1. Gathering the data

2. Exploring and cleaning the data

3. Classifying interesting data points

Gathering data

- Collect all tweets with game hashtags

#ajafey #nacgro #psvutr

- Collect official data for each match

Goals, cards, substitutions

Our data

6 months, 61 games

661 events, 10,643 tweets

1. Detecting events

2. Classifying events

3. Assigning events to teams

Three Experiments

1. Detecting events

1. Experimental Setup

- Goal: detect peaks in the tweets-per-minute signal to extract events

- Setup: Test three peak detection methods:

1. LocMaxNoBaseLineCorr

2. IntThresNoBaseLineCorr

3. IntThresWithBaseLineCorr
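The slides name the three peak detection methods but do not define them. The following is a minimal sketch of what such methods could look like, assuming "LocMax" means a minute that exceeds both neighbours, "IntThres" means a count above mean plus k standard deviations, and baseline correction means subtracting a rolling median. The function names, the window size and k are illustrative assumptions, not the authors' settings.

```python
from statistics import mean, median, stdev

def local_maxima(counts):
    """Minutes whose tweet count is strictly higher than both neighbours."""
    return [i for i in range(1, len(counts) - 1)
            if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]]

def intensity_threshold(counts, k=1.0):
    """Minutes whose count exceeds mean + k * standard deviation (k is assumed)."""
    cutoff = mean(counts) + k * stdev(counts)
    return [i for i, c in enumerate(counts) if c > cutoff]

def correct_baseline(counts, window=5):
    """Subtract a rolling-median baseline (assumed form of baseline correction)."""
    half = window // 2
    corrected = []
    for i, c in enumerate(counts):
        neighbourhood = counts[max(0, i - half): i + half + 1]
        corrected.append(max(0, c - median(neighbourhood)))
    return corrected

# Made-up tweets-per-minute signal with two bursts (minutes 3 and 8).
counts = [5, 6, 5, 40, 12, 6, 5, 7, 30, 8, 6, 5]
print(local_maxima(counts))                           # also flags small bumps
print(intensity_threshold(counts))                    # only the large bursts
print(intensity_threshold(correct_baseline(counts)))  # same, after correction
```

On this toy signal the local-maximum rule also flags a minor bump, which illustrates why a threshold on intensity may be needed on top of it.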

1. Results

1. Findings

- Goals and red cards are detected better than yellow cards and substitutions

- None of the three peak selection methods works well.

- Highlights can be extracted, but not precisely enough

1. Detecting events

Three Experiments

2. Classifying Events

minute  “goal”  “1”  “red”  “card”  “boring”  class
34      0       2    0      1       20        nothing
35      23      34   0      0       0         goal
12      1       2    0      0       5         nothing
13      1       0    22     11      0         red card

- Goal: Classify minutes into event classes
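The minute-by-word table above can be built by counting tokens per game minute. This is a rough sketch, assuming tweets arrive as (minute, text) pairs and that lowercasing plus whitespace splitting is an acceptable tokenizer; both are simplifications, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

def minute_word_counts(tweets):
    """tweets: iterable of (minute, text). Returns {minute: Counter of tokens}."""
    per_minute = defaultdict(Counter)
    for minute, text in tweets:
        per_minute[minute].update(text.lower().split())
    return per_minute

def to_rows(per_minute, vocabulary):
    """One feature row per minute: the count of each vocabulary word."""
    return {minute: [counts[word] for word in vocabulary]
            for minute, counts in sorted(per_minute.items())}

# Made-up tweets for two minutes of one game.
tweets = [(35, "GOAL 1 0"), (35, "goal goal"), (34, "boring game")]
rows = to_rows(minute_word_counts(tweets), ["goal", "1", "boring"])
print(rows)
```

Each row would then be paired with the event class from the official match data for that minute.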

Issues

Problem: Huge, sparse matrix

1. Reduce features: choose words/features smartly

2. Reduce instances: choose minutes smartly

- 3 instance selection settings

1. AllMinutes

2. PeakMinutes

3. EventMinutes

- 7 feature selection settings

1. AllMoreThanOnce

2. Top500TotalFreq

3. Top10MinuteFreq

4. Top500TotalTfIdf

5. Top10MinuteTfIdf

6. Top50Infogain

7. Top50GainRatio

- 6 types of classifiers

1. C4.5

2. RandomForest

3. NaiveBayes

4. NaiveBayesMultinomial

5. libSVM

6. IB1
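To illustrate one of the feature selection settings, here is a small sketch of ranking words by information gain and keeping the top k. It assumes each word is reduced to present/absent per minute, which is a simplification chosen for brevity; the slides do not say how the feature values were discretized, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain from splitting minutes on whether the word occurs (count > 0)."""
    present = [lab for v, lab in zip(feature_values, labels) if v > 0]
    absent = [lab for v, lab in zip(feature_values, labels) if v == 0]
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def top_k_features(columns, labels, k):
    """columns: {word: [count per minute]}. Returns the k highest-gain words."""
    ranked = sorted(columns,
                    key=lambda w: information_gain(columns[w], labels),
                    reverse=True)
    return ranked[:k]

# The four example minutes from the classification table.
labels = ["nothing", "goal", "nothing", "red card"]
columns = {
    "goal": [0, 23, 1, 1],
    "1": [2, 34, 2, 0],
    "red": [0, 0, 0, 22],
    "card": [1, 0, 0, 11],
    "boring": [20, 0, 5, 0],
}
print(top_k_features(columns, labels, 1))
```

With only four minutes the ranking is not meaningful; the point is the mechanics of scoring each word against the class column.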

2. Results

2. Discussion

- Top50GainRatio best feature selection

- libSVM best classifier

- EventMinutes results:

Class         F-measure
OVERALL       0.822
Goal          0.841
Own goal      0.000
Red card      0.848
Yellow card   0.785
Substitution  0.839
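For reference, a per-class F-measure is the harmonic mean of precision and recall for that class. The sketch below assumes the standard F1 definition and uses made-up labels; it also shows how a class that is never predicted correctly ends up at 0.000, as own goals do in the table.

```python
def f_measure(actual, predicted, label):
    """F1 score for one class; 0.0 when the class is never predicted correctly."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up gold labels and classifier output for four event minutes.
actual = ["goal", "goal", "red card", "own goal"]
predicted = ["goal", "red card", "red card", "goal"]
for label in ("goal", "red card", "own goal"):
    print(label, round(f_measure(actual, predicted, label), 3))
```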

1. Detecting events

Three Experiments

- Goal: Assign events to teams

- Based on the ratio between tweets from fans of the home team and fans of the away team

- But first: extract fans
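The slides do not spell out how the ratio is turned into a decision. One possible rule, sketched here purely as an assumption, compares the home-fan share of tweets in the event minute with the home-fan share over the whole game and picks the side that is over-represented. The function name and the rule are hypothetical, and the fan lists are taken as given.

```python
def assign_event(home_event, away_event, home_total, away_total):
    """Return 'home', 'away' or 'unknown' for one event minute.

    home_event / away_event: tweets from home / away fans in the event minute.
    home_total / away_total: tweets from home / away fans over the whole game.
    Assumed rule: the side whose share rises above its game-wide share wins.
    """
    event_total = home_event + away_event
    game_total = home_total + away_total
    if event_total == 0 or game_total == 0:
        return "unknown"
    event_share = home_event / event_total
    game_share = home_total / game_total
    return "home" if event_share >= game_share else "away"

print(assign_event(30, 5, 400, 300))   # home fans dominate this minute
print(assign_event(10, 20, 400, 300))  # away fans dominate this minute
```

Normalizing by the game-wide share is a design choice meant to keep a club with many more fans from winning every minute by volume alone.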

3. Extracting fans

- Hypothesis:

People who tweet for the same team each week are probably fans of that team
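The hypothesis can be sketched as follows. This assumes game hashtags consist of a three-letter home code followed by a three-letter away code (as in #ajafey), and that a user counts as a fan when one team appears in every game they tweeted about, over at least `min_games` games. The threshold and helper names are illustrative assumptions, not the authors' criteria.

```python
from collections import Counter, defaultdict

def teams_in_hashtag(hashtag):
    """Assumes '#' + 3-letter home code + 3-letter away code, e.g. '#ajafey'."""
    code = hashtag.lstrip("#").lower()
    return code[:3], code[3:6]

def extract_fans(user_hashtags, min_games=3):
    """user_hashtags: iterable of (user, game hashtag), one pair per tweet.

    Returns {user: team} for users whose games all share exactly one team.
    """
    games = defaultdict(set)
    for user, tag in user_hashtags:
        games[user].add(tag.lower())
    fans = {}
    for user, tags in games.items():
        if len(tags) < min_games:
            continue
        team_counts = Counter(t for tag in tags for t in teams_in_hashtag(tag))
        ranked = team_counts.most_common(2)
        team, n = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # The team must appear in every game, with no other team tied.
        if n == len(tags) and runner_up < n:
            fans[user] = team
    return fans

pairs = [("ann", "#ajafey"), ("ann", "#nacaja"), ("ann", "#ajapsv"),
         ("bob", "#ajafey"), ("bob", "#psvutr"), ("bob", "#nacgro"),
         ("cem", "#ajafey")]
print(extract_fans(pairs))
```

Here only the user who follows the same club across three different games is labeled a fan; the other two are left unassigned.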

3. Extracting fans

- Extracted 38,527 fans from 146,326 users (26%)

- This method of extracting fans works well:

Right team  Not clear  Wrong team
88%         10%        2%

3. Results

- Performance of assigning events to teams above baseline performance:

Class         Baseline  Performance
OVERALL       52%       58%
Goal          58%       69%
Red card      50%       62%
Yellow card   63%       63%
Substitution  52%       57%

1. Detecting events => difficult

2. Classifying events => good results

3. Assigning events to teams => promising results

Conclusion

Future Work

- Use sentiment in tweets (for detecting events and assigning events to teams)

- Player detection

- Other sports

Questions?
