Post on 23-Aug-2020

Forensic Investigation of Smartphones Using Lexicon-Based Mood Analysis and Text-Mining Methods

Panagiotis Andriotis, Atsuhiro Takasu, Theo Tryfonas

Overview

Introduction

Micro-blog Sentiment Analysis

Twitter feeds vs. Short Message Service (SMS)

Mood Score Calculation (Lexicon-based Approach)

Mood Score Evaluation and Optimization for SMS

Sentiment Timeline View (developing a forensic tool)

Conclusions and Future Work

Introduction and Problem Specification

Affiliations

Is Mood Important?

It seems so: a look at popular apps and websites shows that we regularly share our emotions and feelings.

Twitter Sentiment Analysis in the Internet (1)

Twitter Sentiment Analysis in the Internet (2)

Questions to be answered

Can we apply N.L.P. methods to perform mood analysis on SMS?

Is a Twitter feed (tweet) similar to an SMS?

Is stemming important?

Can we optimize the algorithm (focus on SMS and their characteristics)?

Can we depict the extracted result during a forensic analysis?

Data Collection and Experimental Setup

Defining Emotions: Positive vs. Negative

Emotion Classification

Positive: Joy, Happiness, Intimacy, Familiarity, Friendship, Love

Negative: Anger, Malevolence, Enmity, Fear, Disgust, Sadness

The data we used

We collected 6566 tweets from the public accounts of famous people and celebrities (TWT). We also used already classified tweets (SENT140) and an SMS dataset. We utilized three different lexicons (AFINN, WordNet-Affect, NRC).

Lexicons and Algorithm

Lexicons Characteristics

AFINN (contains positive and negative words with their valence)

WordNet-Affect (WRDNT: contains synsets)

NRC word-emotion lexicon (contains numerous hashtags, words, valence)

A Bag-of-words Approach

Let $L_p = \{l_{p_i}\}$ be the set of our positive textual markers and $L_n = \{l_{n_j}\}$ be the set of our negative textual markers. $C$ is the set of all individual tweets or SMS, $C = \{t_k\}$.

If a positive marker $l_{p_i}$ appears in a tweet or SMS $t_k$ in the corpus, we set $l_{p_i}(t_k) = 1$; otherwise $l_{p_i}(t_k) = 0$. We perform the same calculation for the negative markers $l_{n_j}$.

The sentiment score $s(t_k)$ equals the number of positive markers found in a tweet or SMS minus the number of negative markers found in it:

$$s(t_k) = \sum_i l_{p_i}(t_k) - \sum_j l_{n_j}(t_k)$$
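The score defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the marker sets, the tokenizer and the function names are assumptions made for the example, and a real run would load the markers from a lexicon such as AFINN.

```python
import re

# Hypothetical toy marker sets; a real run would load them from a lexicon.
POSITIVE_MARKERS = {"happy", "love", "great"}
NEGATIVE_MARKERS = {"sad", "hate", "awful"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def mood_score(text, positive=POSITIVE_MARKERS, negative=NEGATIVE_MARKERS):
    """Positive markers present minus negative markers present.

    Each marker is a 0/1 indicator, so a repeated word counts only once.
    """
    words = set(tokenize(text))
    return len(words & positive) - len(words & negative)


corpus = ["I love this, so happy!", "I hate Mondays", "See you at noon"]
scores = [mood_score(message) for message in corpus]
```

A score above zero marks a message as positive, below zero as negative, and zero as neutral.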

Datasets

TWT: twitter feeds

PoSENT140 from SENT140

NegSENT140 from SENT140

SMS dataset (sanitized)

Positive SMS (manually classified)

Negative SMS (manually classified)

Experiments and Results

Is Stemming Useful? (1)

Using the AFINN lexicon:

We calculated the mood scores of each tweet in the TWT corpus and the SENT140 corpus (no stemming).

We performed the same experiments on the same datasets using stemming (Porter’s stemming algorithm).

Finally, the same tests were done on the SMS dataset (using stemming).
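Why stemming can raise the number of lexicon hits is easy to show on a toy case. The sketch below deliberately uses a crude suffix stripper as a stand-in; it is NOT Porter's algorithm, and the lexicon, the sample sentence and the function names are assumptions made for illustration.

```python
import re

SUFFIXES = ("ing", "ed", "ly", "es", "s")  # tried in this order


def toy_stem(word):
    """Strip one common suffix. A crude stand-in, not Porter's algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_hits(text, lexicon, stem=False):
    """Count word tokens that match a lexicon entry, optionally after stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    if stem:
        # Stem both sides so inflected forms meet on a common root.
        words = [toy_stem(w) for w in words]
        lexicon = {toy_stem(w) for w in lexicon}
    return sum(1 for w in words if w in lexicon)


lexicon = {"cheer", "laugh", "hurt"}  # hypothetical markers
text = "Cheering fans laughed all night and nothing hurts"

hits_plain = count_hits(text, lexicon)
hits_stemmed = count_hits(text, lexicon, stem=True)
```

Without stemming none of the inflected forms matches the lexicon; with stemming "cheering", "laughed" and "hurts" all reduce to lexicon roots.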

Is Stemming Useful? (2)

The following table suggests that our results were better when we used stemming (tweets dataset).

Distribution of textual markers within the datasets without using stemming and with stemming.

Is Stemming Useful? (3)

The following table suggests that our results were better when we used stemming (SMS dataset).

Distribution of textual markers within the SMS datasets without using stemming and with stemming.

Visualizing the results

Distribution of lexicon words found in tweets without stemming (left), and in stemmed tweets (blue) and SMS (red) (right).

Blue: TWT, Red: NegSENT140, Green: PoSENT140

Evaluating the classification ability

We used the TWT dataset, stemming, and the three lexicons to decide which lexicon we should utilize:

1. AFINN

2. WRDNT

3. NRC

AFINN: results already discussed.

WRDNT contained more formal vocabulary. (Neutral s(tk) for 68.5% of tweets.)

NRC consists of a plethora of words, abbreviations and ‘internet slang’. More than 20 markers could be found in a single tweet.

We decided to use AFINN.
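One way to compare lexicons along these lines is to measure how many messages each one leaves neutral. The sketch below assumes hypothetical miniature lexicons and a made-up four-message corpus; it only illustrates the measurement, not the figures reported on the slides.

```python
import re


def mood_score(text, positive, negative):
    """Positive markers present minus negative markers present."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & positive) - len(words & negative)


def neutral_rate(corpus, positive, negative):
    """Fraction of messages whose score is exactly zero."""
    if not corpus:
        return 0.0
    neutral = sum(1 for t in corpus if mood_score(t, positive, negative) == 0)
    return neutral / len(corpus)


# Hypothetical lexicons: one informal, one with more formal vocabulary.
lexicons = {
    "informal": ({"lol", "yay", "happy"}, {"ugh", "sad"}),
    "formal": ({"joyful", "elated"}, {"melancholy", "sorrowful"}),
}
corpus = ["lol so happy", "ugh monday", "feeling joyful", "see you at noon"]

rates = {
    name: neutral_rate(corpus, positive, negative)
    for name, (positive, negative) in lexicons.items()
}
```

A lexicon whose vocabulary rarely occurs in short informal messages leaves most of them neutral, which makes it a weaker choice for tweets and SMS.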

Optimizing the hit rates for SMS

Developing a forensic tool to demonstrate the use of Mood Analysis for SMS and Instant Messengers

Data Extraction from Android Devices

Open Source tools.

Android SDK

USB cable -> Developer Options -> USB Debugging -> On

From Platform Tools -> ADB

Get a root shell

Mount (to see file system info)

Use dd on the data partition and pull image to the computer

SMS on: /data/com.android.providers.telephony/databases/mmssms.db
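Once mmssms.db has been copied out of the acquired image, the messages can be read with Python's standard sqlite3 module. This is a minimal sketch that assumes the commonly seen `sms` table with the columns `_id`, `address`, `date` (milliseconds since the epoch), `type` (1 = received, 2 = sent) and `body`; the layout can differ between Android versions and should be checked against the actual database. The function name `read_sms` is an assumption for the example.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def read_sms(db_path):
    """Return (id, address, UTC datetime, direction, body) tuples, oldest first."""
    # Open read-only so the working copy of the evidence is never modified.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT _id, address, date, type, body FROM sms ORDER BY date"
        ).fetchall()
    finally:
        conn.close()

    directions = {1: "received", 2: "sent"}  # assumed type codes
    return [
        (
            row_id,
            address,
            datetime.fromtimestamp(date_ms / 1000, tz=timezone.utc),
            directions.get(msg_type, "other"),
            body,
        )
        for row_id, address, date_ms, msg_type, body in rows
    ]


# Usage (path is a placeholder for the extracted copy):
# messages = read_sms("mmssms.db")
```

Working on a copy and opening it read-only keeps the original image untouched.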

The Design Concept

Using the Apache Lucene library for text pre-processing, indexing and searching.

The MySQL database schema

We keep the extracted keywords of each SMS in separate cells, and from the stemmed words we calculate the mood score using AFINN, emoticons and valence.
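A valence-weighted score that also counts emoticons might look like the sketch below. The valence and emoticon tables are illustrative assumptions (AFINN assigns integer valences between -5 and +5, but the values here are not copied from it), stemming is omitted for brevity, and every occurrence of a word contributes its valence.

```python
import re

# Illustrative values only, not taken from AFINN.
VALENCE = {"happy": 3, "love": 3, "good": 2, "bad": -2, "sad": -2, "hate": -3}
EMOTICONS = {":)": 2, ":-)": 2, ":D": 3, ":(": -2, ":-(": -2}


def weighted_mood_score(text, valence=VALENCE, emoticons=EMOTICONS):
    """Sum the valence of every known word and every known emoticon."""
    score = 0
    # Emoticons are matched on whitespace-separated tokens, before
    # punctuation is discarded by the word tokenizer below.
    for token in text.split():
        score += emoticons.get(token, 0)
    for word in re.findall(r"[a-z]+", text.lower()):
        score += valence.get(word, 0)
    return score


example = weighted_mood_score("I love you :)")
```

With these toy tables, an emoticon alone can tip an otherwise neutral SMS, which matters for very short messages.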

Sentiment Timeline View (1)

Extracted from the list of all messages in the SQLite database.

Sentiment Timeline View (2)

Extracted from messages exchanged with one entity (left) or from messages sent by the person under investigation (right).
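The data behind such a timeline can be produced by summing message scores per day. The sketch below assumes each message has already been reduced to a (timestamp in milliseconds, mood score) pair; the function names and the threshold are hypothetical, and days are computed in UTC for simplicity.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_mood(messages):
    """Sum scores per UTC day; returns a date-sorted list of (date, total)."""
    totals = defaultdict(int)
    for timestamp_ms, score in messages:
        day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).date()
        totals[day] += score
    return sorted(totals.items())


def flag_days(timeline, threshold):
    """Return the days whose total score is at or below the threshold."""
    return [day for day, total in timeline if total <= threshold]


messages = [(0, 2), (3_600_000, -1), (86_400_000, -4)]
timeline = daily_mood(messages)
regions_of_interest = flag_days(timeline, threshold=-3)
```

Filtering the input to one contact, or to sent messages only, gives the per-entity and per-sender views.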

Searching the index (some advantages)

• Faster search (here we looked for the word ‘happy’).

• Friendlier output providing detailed information (id from original SQLite db, date, etc.).

• Indexing and the specific methodology can be applied to all data in the phone with text format, e.g. emails.
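The speed advantage comes from the inverted index, which maps each term to the messages containing it. The sketch below is a plain-Python stand-in for that idea, not Apache Lucene and not the tool's code; the message ids, texts and function names are assumptions for illustration.

```python
import re
from collections import defaultdict


def build_index(messages):
    """Map each lowercase term to the set of message ids that contain it."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(msg_id)
    return index


def search(index, messages, term):
    """Return (id, text) pairs for messages containing the term, sorted by id."""
    ids = sorted(index.get(term.lower(), set()))
    return [(msg_id, messages[msg_id]) for msg_id in ids]


messages = {1: "So happy today", 2: "Running late", 3: "Happy birthday!"}
index = build_index(messages)
results = search(index, messages, "happy")
```

A lookup touches only the entry for the query term instead of scanning every message, and the stored ids link each hit back to its original database row.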

Conclusions and Future Work

Outcome

Conclusions

It is possible to extract feelings from SMS using techniques applied to Twitter or micro-blogs.

Lexicon, emoticons and word valence are important to the final outcome (s(tk)).

Timeline Sentiment View can stress regions of interest.

We can merge N.L.P. with Forensics to automate specific tasks.

Future Work

Investigate the efficiency of Support Vector Machines against our simplistic bag-of-words approach.

(Naïve Bayesian classification can exceed 75% accuracy and SVMs may produce better results.)

Apply the concepts of Mood Analysis and Text Mining on the whole text material in a smartphone.

Acknowledgement

This work has been supported by the European Union’s Prevention of and Fight against Crime Programme “Illegal Use of Internet” - ISEC 2010 Action Grants, grant ref. HOME/2010/ISEC/AG/INT-002 and the Systems Centre of the University of Bristol.

Thank you!