Text Categorization Moshe Koppel Lecture 4: Author Profiling

Date post: 30-Dec-2015
Page 1: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Text Categorization
Moshe Koppel

Lecture 4: Author Profiling
With Shlomo Argamon, Jonathan Schler, James Pennebaker, Kfir Zigdon and others

Page 2: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Profiling

In real life:
1. We don’t have a closed set of candidate authors
2. We don’t have writing samples from each of them

We can still try to say something about the author:
• Gender
• Age group
• Linguistic background
• …

Page 3: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Which is Male/Female?

• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device. Nor does he specify what kind of effects might be achieved by a reformulation or explain how it achieves those effects. In this paper I follow Sperber and Wilson's (1986) suggestion that rhetorical devices like metaphor, irony and repetition are particular means of achieving relevance. As I have suggested, the corrections that are made in unplanned discourse are also made in the pursuit of optimal relevance. However, these are made because the speaker recognises that the original formulation did not achieve optimal relevance .

• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis. The text chosen for analysis is a short story by Ernest Hemingway comprising only 11 sentences. A jumbled version of this story is presented to students who are asked to assemble a cohesive and well formed version of the story. Their re-constructions are then compared with the original Hemingway version.

Page 4: Text Categorization Moshe Koppel Lecture 4: Author Profiling

British National Corpus

• 920 documents labelled for
  – author gender
  – document genre

• Used 566, controlled for genre

Category                      # of docs
Fiction / Female                    132
Fiction / Male                      132
Non-fiction / Female                151
Non-fiction / Male                  151

Non-fiction genre             # of docs
Arts (Non-academic)                  16
Arts (Academic)                      24
Belief & Thought                     24
Biography                            54
Commerce                             10
Leisure                              16
Science                              26
Soc. Sci. (Non-ac.)                  52
Soc. Sci. (Ac.)                      38
World Affairs                        42

Page 5: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Experiment

Features: 400+ function words (FW); 600+ part-of-speech (POS) n-grams

Learner: exponential gradient / linear SVM

Test: 10-fold cross-validation
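The feature extraction and evaluation split above can be sketched roughly as follows. This is a minimal illustration, not the lecture's code: the ten-word `FUNCTION_WORDS` list is a made-up stand-in for the 400+ list (which the slides do not give), the helper names are hypothetical, and the learner itself is left out.

```python
import re
from collections import Counter

# Illustrative subset only; the actual 400+ function-word list is not in the slides.
FUNCTION_WORDS = ["a", "the", "as", "she", "for", "with", "not", "of", "and", "in"]

def fw_features(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocab]

def k_fold_indices(n_docs, k=10):
    """Yield (train_idx, test_idx) pairs; every document is tested exactly once."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

vec = fw_features("The aim of the article is not clear")
splits = list(k_fold_indices(566, k=10))
```

The split here is neither shuffled nor stratified by gender or genre; a real experiment would probably want both.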

Page 6: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Results per Feature Set

[Bar chart: accuracy (%), axis 50 to 85, for All docs, Fiction and Non-Fiction, using the FW, POS and FW+POS feature sets]

• Handle fiction and non-fiction separately

• Use full feature set

Page 7: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Results per Genre

Testing on Genre                 # of docs   Train on All   Train on Fiction
Fiction                                264           74.5               79.5
  Fiction / Female                     132           74.8               81.7
  Fiction / Male                       132           74.2               77.3

Testing on Genre                 # of docs   Train on All   Train on Non-fiction
Non-fiction                            302           79.7               82.6
  Non-fiction / Female                 151           79.2               83.3
  Non-fiction / Male                   151           80.2               81.9
  Arts (Non-academic)                   16           76.0               76.3
  Arts (Academic)                       24           75.6               77.5
  Belief & Thought                      24           85.0               85.0
  Biography                             54           87.0               90.0
  Commerce                              10           60.0               84.0
  Leisure                               16           85.7               81.3
  Science                               26           74.2               78.5
  Social Science (Non-academic)         52           77.5               83.0
  Social Science (Academic)             38           82.9               78.4
  World Affairs                         42           79.2               82.9

Page 8: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Learning-Based Feature Reduction

• Apply learning algorithm

• Eliminate features with low weights

• Learn again
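One round of this loop might look like the sketch below. It assumes a plain perceptron as a stand-in learner (the experiments in this lecture used exponential gradient and linear SVM, not this), and the function names and toy data are hypothetical.

```python
def train_perceptron(X, y, epochs=20):
    """Plain perceptron; a stand-in for the lecture's learners. Labels are +1/-1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * score <= 0:  # misclassified: move weights towards the example
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
    return w

def reduce_features(X, y, keep):
    """Train, keep the `keep` features with the largest |weight|, then train again."""
    w = train_perceptron(X, y)
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    kept = sorted(ranked[:keep])
    X_small = [[xi[j] for j in kept] for xi in X]
    return kept, train_perceptron(X_small, y)

# Toy data: feature 0 decides the label; feature 1 is constant, feature 2 is noise.
X = [[1.0, 0.5, 0.1], [0.9, 0.5, -0.1], [-1.0, 0.5, 0.1], [-0.8, 0.5, -0.1]]
y = [1, 1, -1, -1]
kept, w = reduce_features(X, y, keep=1)
```

Calling `reduce_features` repeatedly with a shrinking `keep` (for example 128, 64, 32, ...) gives a sequence of progressively smaller models to compare.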

Page 9: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Results: Feature Reduction

[Line chart, Fiction: accuracy (0.6 to 0.9) vs. number of features (all, 128, 64, 32, 16, 8) for the FW, POS and FW+POS feature sets]

Page 10: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Results: Feature Reduction

[Line chart, Non-fiction: accuracy (0.6 to 0.9) vs. number of features (all, 128, 64, 32, 16, 8) for the FW+POS, POS and FW feature sets]

Page 11: Text Categorization Moshe Koppel Lecture 4: Author Profiling

What are the Distinguishing Features?

• Fiction
  – Male: a, the, as
  – Female: she, for, with, not

• Non-Fiction
  – Male: that, one, of, PRP, AT0
  – Female: she, for, with, and, in, PNP

Page 12: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Feature Frequencies (μ ± stderr)

           Fiction                     Non-fiction
Feature    Male          Female        Male          Female
PNP        732 ± 14      809 ± 15      291 ± 12      331 ± 17
he         145 ± 4.7     135 ± 4.7     47.5 ± 3.5    48.1 ± 4.3
she        67 ± 4.3      139 ± 6.9     8.73 ± 1.7    21.5 ± 2.3
AT0        735 ± 9.5     626 ± 8.7     884 ± 9.1     822 ± 12
DT0        160 ± 2.9     153 ± 2.0     220 ± 4.0     204 ± 4.6
the        520 ± 8.6     418 ± 7.5     611 ± 8.4     614 ± 12
XX0        84 ± 2.4      98 ± 2.2      54 ± 1.5      55 ± 2.3
PRP        623 ± 6.0     615 ± 5.7     767 ± 5.9     763 ± 7.0
PRF        170 ± 4.2     158 ± 3.7     355 ± 7.2     324 ± 7.9
for        55.7 ± 1.1    61.3 ± 1.0    77.9 ± 1.6    90.7 ± 1.4
with       58.6 ± 1.1    66.5 ± 1.0    56.9 ± 1.1    67.8 ± 1.4
and        234 ± 4.9     249 ± 5.5     242 ± 3.9     287 ± 5.2
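For reference, a "μ ± stderr" entry of this kind can be computed from per-document feature frequencies as sketched below. The input values are hypothetical, and the normalisation unit (for example occurrences per 10,000 words) is an assumption, since the slide does not state it.

```python
import math
import statistics

def mean_and_stderr(values):
    """Sample mean and standard error of the mean (sample stdev / sqrt(n))."""
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr

# Hypothetical per-document frequencies of one feature in one author group.
mean, stderr = mean_and_stderr([60.0, 70.0, 65.0, 75.0, 80.0])
```

Two group means that differ by several standard errors, as with "she" in the table, point to a difference that is unlikely to be sampling noise.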

Page 13: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Summary: Male vs. Female Style

Males use more (informational features):
• Determiners
• Adjectives
• of modifiers (e.g. pot of gold)

Females use more (involvedness features):
• Pronouns
• for and with
• Negation
• Present tense

Page 14: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Which is Male/Female?

• My aim in this article is to show that given a relevance theoretic approach to utterance interpretation, it is possible to develop a better understanding of what some of these so-called apposition markers indicate. It will be argued that the decision to put something in other words is essentially a decision about style, a point which is, perhaps, anticipated by Burton-Roberts when he describes loose apposition as a rhetorical device. However, he does not justify this suggestion by giving the criteria for classifying a mode of expression as a rhetorical device. Nor does he specify what kind of effects might be achieved by a reformulation or explain how it achieves those effects. In this paper I follow Sperber and Wilson's (1986) suggestion that rhetorical devices like metaphor, irony and repetition are particular means of achieving relevance. As I have suggested, the corrections that are made in unplanned discourse are also made in the pursuit of optimal relevance. However, these are made because the speaker recognises that the original formulation did not achieve optimal relevance .

• The main aim of this article is to propose an exercise in stylistic analysis which can be employed in the teaching of English language. It details the design and results of a workshop activity on narrative carried out with undergraduates in a university department of English. The methods proposed are intended to enable students to obtain insights into aspects of cohesion and narrative structure: insights, it is suggested, which are not as readily obtainable through more traditional techniques of stylistic analysis. The text chosen for analysis is a short story by Ernest Hemingway comprising only 11 sentences. A jumbled version of this story is presented to students who are asked to assemble a cohesive and well formed version of the story. Their re-constructions are then compared with the original Hemingway version.

Page 19: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Blog Corpus

• 85,000 blogs

• blogger-provided profiles (gender, age, occupation, astrological sign)

• harvested August 2004

• non-text ignored (formatting, quoting)

Page 20: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Example 1

Yesterday we had our second jazz competition. Thank God we weren't competing. We were sooo bad. Like, I was so ashamed, I didn't even want to talk to anyone after. I felt so rotton, and I wanted to cry, but...it's ok.

Page 21: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Example 2

My gracious boss had agreed to let me have one week off of "work." He did finally give me my report back after eight freakin' days! Now I only have the rest of this week and then one full week after my vacation to finish this damned thing.

Page 22: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Example 3

So about a month or two ago, I met Katy N. at a party in New York. Katy's friend, Kevin M., whom she met while living in Barcelona last year, lives in Miami and is working on getting a TV series produced. Kevin is friends with a guy named Charlie P.

Page 23: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Blog Corpus

age        female     male    Total
unknown     12287    12259    24546
13-17        6949     4120    11069
18-22        7393     7690    15083
23-27        4043     6062    10105
28-32        1686     3057     4743
33-37         860     1827     2687
38-42         374      819     1193
43-48         263      584      847
>48           314      906     1220
Total       34169    37324    71493

Final balanced corpus:
• 19,320 total blogs
  – 8240 in “10s”
  – 8086 in “20s”
  – 2994 in “30s”
• 681,288 total posts
• 141,106,859 total words

Page 24: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Experimental Setup

Feature sets:
• Content: words (filtered by information gain on train set)
• Style: parts-of-speech, function words, blog slang

Learning algorithms:
• Real-valued balanced winnow (RBW)
• Bayesian Multinomial Regression (BMR)

Evaluation: 10-fold cross-validation
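Filtering content words by information gain can be sketched as below. This assumes the common formulation in which a word is a binary feature (it occurs in a document or it does not); the slides do not say which variant was used, and the toy documents are invented.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in document'."""
    classes = sorted(set(labels))

    def class_counts(idx):
        return [sum(1 for i in idx if labels[i] == c) for c in classes]

    everything = range(len(docs))
    with_w = [i for i in everything if word in docs[i]]
    without = [i for i in everything if word not in docs[i]]
    gain = entropy(class_counts(everything))
    for part in (with_w, without):
        if part:  # weight each side by its share of the documents
            gain -= len(part) / len(docs) * entropy(class_counts(part))
    return gain

# Toy corpus: each document is a set of words; labels are age groups.
docs = [{"homework", "bored"}, {"homework", "mum"}, {"tax", "marriage"}, {"tax", "bored"}]
labels = ["10s", "10s", "30s", "30s"]
```

Here "homework" separates the two classes perfectly (gain of 1 bit) while "bored" carries no information; computing the gain on the training folds only keeps the test fold out of feature selection.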

Page 25: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Age: Classification

                    RBW      BMR
Style & Content     75.0%    77.4%
Function Words      67.7%    69.4%
Content Words       75.9%    76.2%

Page 26: Text Categorization Moshe Koppel Lecture 4: Author Profiling

The lifecycle of the common blogger...

feature       10s     20s     30s
bored         3.84    1.11    0.47
boring        3.69    1.02    0.63
awesome       2.92    1.28    0.57
mad           2.16    0.8     0.53
homework      1.37    0.18    0.15
mum           1.25    0.41    0.23
maths         1.05    0.03    0.02
dumb          0.89    0.45    0.22
sis           0.74    0.26    0.1
crappy        0.46    0.28    0.11

Page 27: Text Categorization Moshe Koppel Lecture 4: Author Profiling

The lifecycle of the common blogger...

feature       10s     20s     30s
college       1.51    1.92    1.31
bar           0.45    1.53    1.11
apartment     0.18    1.23    0.55
beer          0.32    1.15    0.7
student       0.65    0.98    0.61
drunk         0.77    0.88    0.41
album         0.64    0.84    0.56
dating        0.31    0.52    0.37
semester      0.22    0.44    0.18
someday       0.35    0.4     0.28

Page 28: Text Categorization Moshe Koppel Lecture 4: Author Profiling

The lifecycle of the common blogger...

feature       10s     20s     30s
son           0.51    0.92    2.37
local         0.38    1.18    1.85
marriage      0.27    0.83    1.41
development   0.16    0.5     0.82
tax           0.14    0.38    0.72
campaign      0.14    0.38    0.7
provide       0.15    0.54    0.69
democratic    0.13    0.29    0.59
systems       0.12    0.36    0.55
workers       0.1     0.35    0.46

Page 29: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Gender: Classification

RBW BMR
Style & Content 80.0%
Style Words 77.0%
Content Words 73.0%

Page 30: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Men are from Mars...Women are from Venus...

LIWC category    male         female
job              68.1±0.6     56.5±0.5
money            43.6±0.4     37.1±0.4
sports           31.2±0.4     20.4±0.2
tv               21.1±0.3     15.9±0.2
sex              32.4±0.4     43.2±0.5
family           27.5±0.3     40.6±0.4
eating           23.9±0.3     30.4±0.3
friends          20.5±0.2     25.9±0.3
sleep            18.4±0.2     23.5±0.2
pos-emotions     248.2±1.9    265.1±2
neg-emotions     159.5±1.3    178±1.4

Page 31: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Relating Age & Gender

• Let's examine the connection between age and gender a little more generally...

• Consider the most distinctive words for both Age and Gender:
  – Intersection of the 1000 words with highest Age information gain and the 1000 words with highest Gender information gain
  – Total of 316 words
  – Consider log(30s/10s) vs. log(male/female)
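The two log ratios can be computed per word as in the sketch below. The age frequencies for the three words are taken from the age tables earlier in this lecture; the male/female frequencies are made-up placeholders, and the use of the natural logarithm is an assumption (the slides do not give the base).

```python
import math

# (10s, 30s) frequencies for three words, from the age tables in this lecture.
age_freq = {"homework": (1.37, 0.15), "tax": (0.14, 0.72), "son": (0.51, 2.37)}
# (male, female) frequencies: hypothetical values, for illustration only.
gender_freq = {"homework": (0.4, 0.6), "tax": (0.5, 0.3), "son": (0.9, 1.2)}

def log_ratio(a, b):
    """Natural log of a/b; positive when a is the larger frequency."""
    return math.log(a / b)

# word -> (x, y) = (log(male/female), log(30s/10s)), one point per shared word.
points = {
    w: (log_ratio(*gender_freq[w]), log_ratio(age_freq[w][1], age_freq[w][0]))
    for w in age_freq.keys() & gender_freq.keys()
}
```

Each word becomes one point: words favoured by older bloggers sit above zero on the second coordinate, and words favoured by male bloggers sit right of zero on the first.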

Page 32: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Relating Age & Gender

[Scatter plot: log(30s/10s) on the y-axis (-8 to 8) vs. log(male/female) on the x-axis (-2 to 2)]

Page 33: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Relating Age & Gender

[Scatter plot: log(30s/10s) on the y-axis (-8 to 8) vs. log(male/female) on the x-axis (-2 to 2), with the point for “husband” labelled]

Page 34: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Native Language

Given English text, can we determine the author’s native language?

Page 35: Text Categorization Moshe Koppel Lecture 4: Author Profiling

• In the second part of this outhor’s novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.

• There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.

• There is one more kind of films irritating many television viewers - "soap" serials. «Santa Barbara» has even won "Oskar" prize.

Try it yourself. These were written by Russian, French and Spanish speakers, though not in that order. Can you tell which is which?

Page 36: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Possible Clues

Patterns of native language are typically reflected in how other languages are spoken (Lado61, Corder81):

• Word selection

• Syntax

• Spelling

Page 37: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Measurable Features for Automated Native Language Detection

• Frequency of function words

• Frequency of letter sequences (adapted from Peng+ 04)

• Idiosyncrasies

We will gather idiosyncrasies data automatically.
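Of the measurable features above, letter-sequence frequencies are the simplest to compute. A rough sketch, assuming letter bigrams counted within words (the slides do not specify the sequence length or how the sequences were chosen); the function name is hypothetical.

```python
from collections import Counter

def letter_ngram_freqs(text, n=2, top=None):
    """Relative frequencies of letter n-grams, counted within words."""
    grams = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        grams.update(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(grams.values()) or 1  # avoid division by zero on empty text
    return {g: c / total for g, c in grams.most_common(top)}

freqs = letter_ngram_freqs("cuality and consecuences", n=2)
```

In this toy input the bigram "cu" is the most frequent one, which is the kind of signal a c-q spelling confusion would leave in a longer text.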

Page 38: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Orthographic Idiosyncrasies

• Repeated letter (e.g. remmit instead of remit)

• Double letter appears once (e.g. comit instead of commit)

• One letter instead of another (e.g. firsd instead of first)

• Letter inversion (e.g. fisrt instead of first)

• Inserted letter (e.g. friegnd instead of friend)

• Missing letter (e.g. frend instead of friend)

• Conflated words (e.g. stucktogether)

Page 39: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Syntactic Idiosyncrasies

• Sentence Fragment
• Run-on Sentence
• Repeated Word
• Missing Word
• Mismatched Singular/Plural
• Mismatched Tense
• that/which confusion
• Rare POS pairs (Chodorow-Leacock 00)

Page 40: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Automatically Finding Idiosyncrasies

1. Run text through automated spell/grammar checker

2. Compare flagged word to best suggestion

3. Mark error accordingly

e.g. text = remmit, suggestion = remit: mark as “repeated letter”
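Steps 2 and 3 can be sketched as a comparison between the flagged word and the checker's best suggestion. The spell checker itself is not modelled here: the (word, suggestion) pairs are supplied by hand, and the function name and the label strings are shorthand chosen for this illustration, not the study's actual error taxonomy.

```python
def classify_error(written, suggestion):
    """Map a (flagged word, best suggestion) pair to one orthographic error type."""
    w, s = written.lower(), suggestion.lower()
    if " " in s and w == s.replace(" ", ""):
        return "conflated words"
    if len(w) == len(s) + 1:
        for i in range(len(w)):
            if w[:i] + w[i + 1:] == s:  # deleting w[i] yields the suggestion
                doubled = (i > 0 and w[i - 1] == w[i]) or (i + 1 < len(w) and w[i + 1] == w[i])
                return "repeated letter" if doubled else "inserted letter"
    if len(w) + 1 == len(s):
        for i in range(len(s)):
            if s[:i] + s[i + 1:] == w:  # deleting s[i] yields the written form
                doubled = (i > 0 and s[i - 1] == s[i]) or (i + 1 < len(s) and s[i + 1] == s[i])
                return "double letter appears once" if doubled else "missing letter"
    if len(w) == len(s):
        diff = [i for i in range(len(w)) if w[i] != s[i]]
        if len(diff) == 1:
            return "letter substitution"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and w[diff[0]] == s[diff[1]] and w[diff[1]] == s[diff[0]]):
            return "letter inversion"
    return "other"
```

For example, `classify_error("remmit", "remit")` yields "repeated letter" and `classify_error("fisrt", "first")` yields "letter inversion"; counting these labels per document gives the error-type features.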

Page 41: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Summary: Features Used

• 400 function words

• 200 letter sequences

• 185 error types

• 250 rare POS pairs

Each document is represented as a numerical vector of length 1035 (400 + 200 + 185 + 250)

Page 42: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Test Corpus

International Corpus of Learner English (Granger98)

• 11 countries
• Subjects same age, proficiency level
• Samples same genre, length
• Actually used in study: 258 docs from each of
  – France
  – Spain
  – Bulgaria
  – Czech Rep.
  – Russia

Page 43: Text Categorization Moshe Koppel Lecture 4: Author Profiling

SVM Classification Accuracy (10-fold CV)

 

[Bar chart: accuracy (%), axis 30 to 90, for the feature sets Function words + Letter n-grams, Function words, Letter n-grams, and Errors. Shaded bars: without error features; white bars: with error features. Baseline = 20%]

Page 44: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Confusion Matrix

                        Classified As
Actual       Czech   French   Bulgarian   Russian   Spanish
Czech          209        1          18        20        10
French           9      219          13        12         5
Bulgarian       14        8         211        18         7
Russian         24        8          24       194         8
Spanish         16       10          10         7       215
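Overall accuracy and per-language figures follow directly from the confusion matrix. A small sketch using the counts from this slide (rows are the actual language, columns the predicted one); the variable names are just for illustration.

```python
LANGS = ["Czech", "French", "Bulgarian", "Russian", "Spanish"]
# Rows: actual language; columns: classified as (same order), copied from the slide.
CONFUSION = [
    [209, 1, 18, 20, 10],
    [9, 219, 13, 12, 5],
    [14, 8, 211, 18, 7],
    [24, 8, 24, 194, 8],
    [16, 10, 10, 7, 215],
]

total = sum(map(sum, CONFUSION))
correct = sum(CONFUSION[i][i] for i in range(len(LANGS)))
accuracy = correct / total

# Recall: share of each language's documents that were labelled correctly.
recall = {lang: CONFUSION[i][i] / sum(CONFUSION[i]) for i, lang in enumerate(LANGS)}
# Precision: share of documents labelled as a language that really are that language.
precision = {
    lang: CONFUSION[i][i] / sum(row[i] for row in CONFUSION)
    for i, lang in enumerate(LANGS)
}
```

The diagonal sums to 1048 of 1290 documents, about 81% accuracy against the 20% baseline, and most of the errors fall among the three Slavic languages.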

Page 45: Text Categorization Moshe Koppel Lecture 4: Author Profiling

What Gives It Away?

• Russian – over, the (infrequent), number_reladverb

• French – indeed, Mr (no period), misused o (e.g. outhor)

• Spanish – c-q confusion (e.g. cuality), m-n confusion (e.g. confortable), undoubled consonant (e.g. comit)

• Bulgarian – most_ADVERB, cannot (uncontracted)

• Czech – doubled consonant (e.g. remmit)

Page 46: Text Categorization Moshe Koppel Lecture 4: Author Profiling

French: In the second part of this outhor’s novel, called Time Passes, time has passed indeed and Mrs Ramsay has died.

Spanish: There are pejudments of small groups, such as homosexuals, inmigrants, aids diseaseds, etc. But "political correctness" has have positive and negative consecuences.

Russian: There is one more kind of films irritating many television viewers - "soap" serials. «Santa Barbara» has even won "Oskar" prize.

Let’s look back at our examples. Now it’s pretty obvious.

Page 47: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Real-Life Issues

• Many candidate languages

• Very short texts

• Unpredictable English proficiency

Page 48: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Personality

• Pennebaker data:
  – Students wrote essays
  – Same students took personality assessment tests

• Experiment: Given text, determine if author is
  – Open
  – Conscientious
  – Neurotic
  – Extroverted
  – Agreeable

Page 49: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Accuracy Results

–Open 66%

–Conscientious 65%

–Neurotic 63%

–Extroverted 62%

–Agreeable 60%

Page 50: Text Categorization Moshe Koppel Lecture 4: Author Profiling

Key Features

• Openness
  – consciousness, strange, thoughts, maybe, you
  – hope, feel, home, friends, football, team

• Conscientiousness
  – school, always, high, grades
  – damn, bad, hate, you, more

