4 Text Mining and Open-Ended Questions in Sample Surveys Ludovic Lebart CNRS


Text Mining and Open-ended Questions

in Sample Surveys

Ludovic Lebart Centre National de la Recherche Scientifique

Telecom-ParisTech, Paris, France

AMAI - 2009 - September 8th, 2009 (Mexico D.F.)


Summary / Outline

1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions









The « data Mining » approach...

✔ Older techniques are easier to use

✔ Older techniques are improved

✔ New techniques are conceived

✔ New fields of application

✔ New products: software

✔ Need for a selection of methods, of simple and clear strategy for data processing

1- Principles of Data Mining and Text mining: A reminder

Reminder (Fayyad et al.):

Data Mining (KDD) is the non-trivial process of identifying patterns in huge data sets, these patterns being supposed to be valid, novel, potentially useful, and ultimately understandable.

Survey data processing and data mining

These huge data sets may be unstructured and non-representative.

The main goal is to automatically extract from the ore (raw data) the genuine diamond of truth… (Benzécri 1973)


Survey data: homogeneity of content and of coding… different from the usual inputs of Data Mining programs.

Despite the fact that we may deal with several observational levels (households, individuals, trajectories or biographical data, areas or regions…), there is a consistency and a unity of content in a survey data set - together with general hypotheses formulated beforehand - that are not present in the usual data mining input data.

In this context, a lot of meta-information is generally available (demographic, economic, sociological, epidemiological, etc.) that provides a framework for the interpretation phase.

A survey (whatever its complexity) is a costly set of measurements that follows a specific decision.



✔ Initial paradigm:

  - Extracting statistical units from texts

  - Complementing lexicometry with a multivariate approach

  - Applying visualization tools to lexical tables

✔ Evolution and diversification of techniques and approaches

“Text Mining” and Multivariate exploratory statistical analysis of texts



The fields of Text Mining

Web, Press

Scientific papers, abstracts; Information Retrieval

Open-ended questions, free responses

Qualitative interviews, Discourses, Reports

Complaints










◆ To shorten interview time: Open-ended questions are less costly in terms of interview time, and generate less fatigue and tension than voluminous lists of items.

Open questions: Why?

◆ To gather spontaneous information: Marketing surveys contain many questions of this type, e.g. "What do you recall (or: what do you like) about this ad?"

◆ To probe the response to a closed-ended question: This is the additional follow-up question "Why?". Explanations concerning a response already given have to be provided spontaneously.

◆ To get information relating to non-comparable variables. Example: environmental activism, dietary habits…

2- Open-ended Questions: Why? How?

Open questions: Why?

DRAWBACKS: Cost, Complexity, Specificity

ADVANTAGES: Speed, Freedom, Specificity


A classical experiment, quoted by Schuman and Presser (1981), stresses the difficulty of comparing the two types of questioning.

When asked "what is the most important problem facing this country [USA] at present", 16% of Americans mention crime and violence (grouped free responses), whereas the same item offered in a closed question is chosen by 35% of respondents.

The explanation given by the authors is the following: lack of security is often considered a local, not a national, problem, so that the item crime and violence is not often mentioned spontaneously.

Closing the question indicates that this response is a relevant or possible response, resulting in a higher response percentage.

Comparison between open and closed questions


In some particular contexts, the absence of a response item can play a positive role. It can establish a climate of confidence and communication, and lead to better results when certain subjects are brought up.

This is what is indicated by the work of Sudman and Bradburn (1974) concerning questions having to do with "threats", and of Bradburn et al. (1979) concerning questions about alcohol and sexuality.

In international studies, it is important to know whether people interviewed in different countries understand the closed questions in the same way (case of the follow-up “Why?”).

As a matter of fact, it is also legitimate to raise this same issue with respect to regional and generational differences.

Heuristic value of open-ended questions


In some other particular contexts, the cultural gap between those who have conceived the questionnaire and the interviewees is hidden by the purely numerical coding of the closed questions.

In a national survey about the attitudes of economically impaired people towards the minimum wage system in France, a classical open question was asked at the end of the interview:

“Would you like to add something about some topics that could be missing in this questionnaire, about the minimum wage system?”

One answer (among many others in the same vein) was: “We eat potatoes and eggs, despite my diabetes and my cholesterol, because they are cheap.”

Another: “Thank you for coming. It proves that you are thinking of me”.

Some respondents are far removed from the intended topic of the survey, “Attitude towards an institution”.

Heuristic value of open-ended questions (continuation)



Empirical Post-Coding of free responses

(Drawbacks of this type of processing)

Coder bias: Coder bias is added to interviewer bias, since the coder makes decisions and formulates interpretations, introducing a «personal touch ».

Alteration of form: Information is destroyed in its form and often weakened in its content: quality of expression, level of vocabulary, and general interview tonality are lost.

Weakening of content: (case of responses that are composed, complex, vague and diversified).

Infrequent responses are eliminated a priori.



Example 1: Open Question « Life » (international sample surveys)

The following open-ended question was asked :

"What is the single most important thing in life for you?" It was followed by the probe: "What other things are very important to you?".

This question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, the Netherlands, United Kingdom, USA) in the late 1980s (Hayashi et al., 1992).

Our illustrative example is limited to the British sample (Sample size: 1043).


Gender  Educ.  Age  Responses
1       1      4    happiness in people around me, contented family, would make me happy
1       2      2    my own time, not dictated by other people
1       2      2    freedom of choice as to what I do in my leisure time
1       3      2    I suppose work
1       2      1    firm, my work, which is my dad's firm
2       1      6    just the memory of my last husband
2       2      6    well-being of my handicapped son
1       1      5    my wife, she gave me courage to carry on even in the bad times
2       2      3    my sons, my kids are very important to me, being on my own, I am responsible for their education
1       3      3    job, being a teacher I love my job, for the well-being of the children

Example 1: Open Question « Life »: Examples of responses


Following a viewing of a television commercial on breakfast cereals (copy-test), several open questions were asked. One of them, which we shall use as our example, is :

What was the main idea of this commercial?

In addition a number of closed questions were also asked (socio-demographic characteristics of respondents, purchase intent toward product seen). Purchase intent being an important issue, this question plays a major role in the discussions that follow.

Two examples of responses to that open question.

1 - That it has complex carbohydrates in it, it has energy releaser and it tastes good... It showed people eating grape nuts.

2 - It gives you energy in the morning, nothing else.

Example 2: Open Questions / Copy-Test


A survey in three cities (Tokyo, New York, Paris) about dietary habits.

The common open-ended questions were:

"What dishes do you like and eat often?" (with a probe: "Any other dishes you like and eat often?") and "What would be an ideal meal?"

Akuto H.(Ed.) (1992). International Comparison of Dietary Cultures, Nihon Keizai Shimbun, Tokyo.

Akuto H., Lebart L. (1992). Le Repas Idéal. Analyse de Réponses Libres en Anglais, Français, Japonais. Les Cahiers de l'Analyse des Données, vol XVII, n°3, Dunod, Paris

Example 3: An international survey (Tokyo Gas Company)


Four responses (New York) to "What dishes do you like and eat often?" and "What would be an ideal meal?" (the two answers are separated by ++++):

---- 1 SPAGHETTI, CHINESE ++++ CAESAR SALAD, LOBSTER TAILS, BAKED POTATO, CHOCOLATE MOUSSE

---- 2 SEAFOOD, GREEN SALAD, CHINESE FOOD ++++ CHAMPAGNE, CAVIAR, GREEN SALAD, GRILLED SEAFOOD

---- 3 CHINESE FOOD ++++ CHINESE FOOD, FRENCH FOOD, VEAL, BREAD

---- 4 PASTA ++++ BEARNAISE BEEF, CHINESE FOOD, ITALIAN FOOD, PASTA

Example 3: An international survey (continuation)


5 appellations: Bierzo, Cigales, Ribera del Duero, Rueda, Toro

Example 4: Evaluating wines through scores and comments

Guia de Catas de Castilla y León (2005): 522 wines from Castilla y León, produced by 207 wineries

Two tasting notes from the corpus, kept in the original Spanish:

---- Score = 80, Valdelosfriales-2003: Joven típico, con notas de tempranillo y balsámicos; en boca amable y frutoso.

---- Score = 91, Tares P3-2001 premium: Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. Concentrado, tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones de tierra húmeda y pólvora en el largo final.


Example of two texts










Statistical units derived from texts

Characters

Words, lemmas, n-grams

Segments or quasi-segments

Sentences or responses

Texts

CORPUS

3- From texts to numerical data


Ambiguity of frequencies: statistical frequency versus « linguistic frequency »

Closed questions

Texts

Open questions

(statistical frequency)

( linguistic frequency)

Sample surveys



In this example we focus on a partitioning of the sample into nine categories, obtained by cross-tabulating age (three categories) with educational level (three categories).

The counts for the first phase of numeric coding are as follows: Out of 1043 responses, there are 13 669 occurrences, with 1 413 distinct words. When the words appearing at least 16 times are selected, there remain 10 357 occurrences of these words, with 135 distinct words (types).
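The counting step described above can be sketched as follows. This is a minimal illustration, not the software used for the survey: the tokenizer (lower-casing, letters and apostrophes only) and the names `tokenize` and `count_words` are assumptions, and the five responses are taken from the examples shown earlier, with a threshold of 2 instead of 16.

```python
import re
from collections import Counter

def tokenize(response):
    """Lower-case a response and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", response.lower())

def count_words(responses, min_freq):
    """Count occurrences and distinct words, before and after a frequency threshold."""
    counts = Counter(tok for r in responses for tok in tokenize(r))
    kept = {w: n for w, n in counts.items() if n >= min_freq}
    summary = {
        "occurrences": sum(counts.values()),     # total number of tokens
        "distinct_words": len(counts),           # number of types
        "kept_occurrences": sum(kept.values()),  # tokens of the retained words
        "kept_words": len(kept),                 # retained types
    }
    return summary, kept

responses = [
    "happiness in people around me, contented family, would make me happy",
    "my own time, not dictated by other people",
    "freedom of choice as to what I do in my leisure time",
    "I suppose work",
    "firm, my work, which is my dad's firm",
]
summary, kept = count_words(responses, min_freq=2)
print(summary)
```

Note that lower-casing merges "I" with "i", whereas the word list of the survey keeps "I" capitalized; a real analysis would fix such conventions explicitly.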

Example 1: Question « Life » - continuation

The same questionnaire also had a number of closed-end questions (among them, the socio-demographic characteristics of the respondents, which play a major role).


Words Appearing at Least Sixteen Times (Alphabetic Order) in the 1043 responses to the open question

Word       Frequency   Word           Frequency   Word     Frequency
I          248         go             19          of       312
I'm        22          going          26          on       59
a          298         good           303         other    33
able       55          grandchildren  30          others   17
about      31          happiness      227         our      29
after      26          happy          137         out      34
all        86          have           99          own      16
and        504         having         70          peace    77
anything   19          health         609         people   61
are        65          healthy        45          really   28

Example 1: Selected statistical units


Gender  Educ.  Age  Tagged responses
1       1      4    happiness/NN in/IN people/NNS around/IN me/PRP ,/, contented/VBN family/NN ,/, would/MD make/VB me/PRP happy/JJ
1       2      2    my/PRP$ own/JJ time/NN ,/, not/RB dictated/VBN by/IN other/JJ people/NNS
1       2      2    freedom/NN of/IN choice/NN as/IN to/TO what/WP I/PRP do/VB in/IN my/PRP$ leisure/NN time/NN
1       3      2    I/PRP suppose/VBP work/NN
1       2      1    firm/NN ,/, my/PRP$ work/NN ,/, which/WDT is/VBZ my/PRP$ dad's/NNS firm/NN
2       1      6    just/RB the/DT memory/NN of/IN my/PRP$ last/JJ husband/NN
2       2      6    wellbeing/NN of/IN my/PRP$ handicapped/JJ son/NN
1       1      5    my/PRP$ wife/NN ,/, she/PRP gave/VBD me/PRP courage/NN to/TO carry/VB on/IN even/RB in/IN the/DT bad/JJ times/NNS

Example of morpho-syntactic information


- First partition: three age categories
    - less than 30 years [noted -30]
    - between 30 years and 55 years [-55]
    - over 55 years [+55]

- Second partition: three educational levels
    - No degree or Low [noted L]
    - Medium [M]
    - High level [H]

Example of a lexical contingency table


Partial listing of lexical table cross-tabulating 135 words of frequency greater than or equal to 16 with 9 age-education categories

           L-30  L-55  L+55  M-30  M-55  M+55  H-30  H-55  H+55
I             2    46    92    30    25    19    11    21     2
I'm           2     5     9     3     2     1     0     0     0
a            10    56    66    54    44    19    20    22     7
able          1     9    16     9     7     4     4     5     0
about         0     3    13     7     1     2     4     1     0
after         1     8    11     3     1     2     0     0     0
all           1    24    19     8    18     6     3     5     2
and           8    89   148    86    73    30    25    32    13
anything      0     4     9     1     3     0     1     1     0

Example 1: A lexical contingency table
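A lexical contingency table of this kind can be built by counting, for each retained word, its occurrences within each category of respondents. The sketch below assumes that each response has already been tokenized and labelled with its age-education category; the function name `lexical_table` and the three toy records are hypothetical and do not reproduce the survey data.

```python
from collections import Counter, defaultdict

def lexical_table(records, vocabulary):
    """Cross-tabulate words (rows) with respondent categories (columns).

    records: iterable of (category, list_of_tokens) pairs.
    vocabulary: set of retained words (e.g. those above a frequency threshold).
    """
    table = defaultdict(Counter)
    for category, tokens in records:
        for tok in tokens:
            if tok in vocabulary:
                table[tok][category] += 1
    return table

# Hypothetical toy records: (category label, tokens of one response)
records = [
    ("L+55", ["my", "wife", "and", "my", "health"]),
    ("H-30", ["friends", "and", "a", "good", "job"]),
    ("L+55", ["health", "and", "family"]),
]
vocabulary = {"and", "health", "my"}
table = lexical_table(records, vocabulary)

categories = ["L+55", "H-30"]
for word in sorted(table):
    print(word, [table[word][c] for c in categories])
```

Each row of the printed output is the lexical profile of one word across the categories, which is the input expected by the methods presented later.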


Example 2: "What is the main idea in this commercial". Words appearing more than 9 times (100 responses)

Number  Word           Frequency   Number  Word         Frequency
 1      I              14          25      in           27
 2      a              59          26      is           37
 3      about          15          27      it           133
 4      all            21          28      it's         28
 5      and            42          29      long         14
 6      are            25          30      morning      9
 7      been           12          31      nothing      25
 8      carbohydrate   14          32      nutritional  9
 9      carbohydrates  33          33      nutritious   12
10      cereal         34          34      nuts         25
11      complex        25          35      of           25
12      crunchy        9           36      people       28
13      eaten          10          37      showed       11
14      eating         19          38      taste        11
15      energy         33          39      that         80
16      for            57          40      that's       13
17      give           9           41      the          82
18      gives          11          42      they         50
19      good           52          43      to           32
20      grape          25          44      was          19
21      has            30          45      with         11
22      have           27          46      years        11
23      healthy        23          47      you          81
24      how            9


HEAD WORD      SEGM  FREQ  LENGTH  TEXT of SEGMENT
------------------------------------------------------
a                 1     8       3  a long time
are               2     6       4  are good for you
carbohydrates     3     5       3  carbohydrates in it
complex           4    15       2  complex carbohydrates
for               5    37       2  for you
give              6     7       3  give you energy
gives             7    11       2  gives you
                  8     9       3  gives you energy
good              9    24       2  good for
                 10    22       3  good for you
grape            11    25       2  grape nuts
have             12     6       3  have been eating
healthy          13     6       3  healthy for you
is               14     9       4  is good for you
it               15    26       2  it has
                 16    19       2  it is
                 17    14       2  it was
                 18     8       3  it gives you
                 19     8       3  it has a
                 20     6       3  it has complex
                 21     5       3  it is good
                 22     6       4  it gives you energy
people


Example 2: "What is the main idea in this commercial" Examples of repeated segments
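Repeated segments such as those listed above are sequences of consecutive words that occur several times in the corpus. A minimal sketch of their extraction is given below; it simply counts all word sequences of length 2 to `max_len` inside each response, which is an assumed simplification (the original software may apply further rules, for instance about punctuation). The function name and the toy responses are hypothetical.

```python
from collections import Counter

def repeated_segments(responses, max_len=4, min_freq=2):
    """Count word sequences of length 2..max_len; keep those seen at least min_freq times.

    responses: list of token lists. Segments never span two responses.
    """
    counts = Counter()
    for tokens in responses:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {seg: f for seg, f in counts.items() if f >= min_freq}

# Hypothetical toy corpus in the spirit of the copy-test responses
responses = [
    "it gives you energy in the morning".split(),
    "grape nuts cereal gives you energy".split(),
    "it is good for you".split(),
    "it has complex carbohydrates and it is good for you".split(),
]
segments = repeated_segments(responses)
for seg, freq in sorted(segments.items()):
    print(freq, len(seg), " ".join(seg))
```

The printed columns (frequency, length, text) mirror the FREQ, LENGTH and TEXT columns of the listing.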

Example 3: An international survey (Tokyo Gas Company)

The common open-ended question: "What dishes do you like and eat often?" (with a probe: "Any other dishes you like and eat often?").

- Sub-sample 1 (city of Tokyo) : 1008 individuals. The global corpus of open responses contains 6219 occurrences of 832 distinct words. 139 words appear at least 7 times, leading to 4975 occurrences.

- Sub-sample 2 (city of New York): 634 individuals. The global corpus contains 6511 occurrences of 638 distinct words. The processing takes into account the 83 words appearing at least 12 times.

- Sub-sample 3 (city of Paris): 1000 individuals. The global corpus contains 11108 occurrences of 1229 distinct words. The processing takes into account the 112 words appearing at least 18 times, leading to 7806 occurrences.

- The three sets of respondents are broken down into six categories (three categories of age, combined with gender).


Example 3: An international survey (Tokyo Gas Company). City of New York

Words (frequency order)

Num.  Used word     Freq.
12    CHICKEN       254
73    STEAK         101
49    PASTA          95
22    FISH           87
60    SALAD          85
 1    AND            85
23    FOOD           82
52    PIZZA          62
79    VEGETABLES     57
 4    BEEF           56
71    SPAGHETTI      55
13    CHINESE        54
80    WITH           48
59    ROAST          47
58    RICE           45
67    SHRIMP         45
43    MACARONI       42
56    POTATOES       39
35    HAMBURGERS     36
75    TUNA           35
26    FRIED          33
77    VEAL           33
38    ITALIAN        31
 2    BAKED          29
48    PARMESAN       29
55    POTATO         27
46    MEATBALLS      25
 3    BEANS          24
45    MEAT           24
76    TURKEY         24
14    CHOPS          23
34    HAMBURGER      22

Rank  Word     Freq.
 1    de       891
 2    y        806
 3    en       694
 4    boca     433
 5    con      356
 6    fruta    334
 7    un       308
 8    nariz    246
 9    muy      237
10    la       211
10    notas    211
12    que      168
13    taninos  167
14    el       159
15    una      152
16    madera   140
17    bien     116
18    toque    108

Lemmatization:
- Singular and plural forms merged
- Masculine and feminine forms merged
- Verb forms reduced to the infinitive
- …

Articles, prepositions … are removed.

Words used at least 8 times are kept.

Remaining:
- 250 words
- 443 wines

           Word 1  Word 2  ...  Word 250
Wine 1       0       1     ...     2
Wine 2       1       0     ...     1
Wine 3       0       0     ...     1
. . .
Wine 443     1       2     ...     0




Example 4: Evaluating wines through scores and comments (continuation)

[Figure: bar chart of the frequencies (axis from 0 to 400) of the most frequent retained words: acidez (acidity), potente (powerful), suave (mild), ligero (light), ser (to be), cereza (cherry), algo (some/something), fino (fine), medio (medium), jugoso (juicy), agradable (pleasant), elegante (elegant), todavía (still), vino (wine), balsámico (balsamic), maduro (ripened), final (end), bien (well), toque (hint), negro (black), rojo (red), buen (good), madera (wood), taninos (tannins), nota (note), nariz (nose), muy (very), fruta (fruit), boca (mouth).]









✔ Applying visualization tools to lexical tables

  ● Principal axes analyses of lexical tables
  ● Classification (clustering) of words and texts

✔ Selecting characteristic units and responses (or: sentences)

  ● Characteristic units (words, segments, lemmas)
  ● Selecting « Modal responses »


Briefly, the principles of the methods used to perform these data reductions can be summarized as follows:

Principal axes methods, largely based upon linear algebra, produce graphical representations in which the geometric proximities among row-points and among column-points reflect statistical associations among rows and among columns. Correspondence analysis belongs to this family of methods.

Clustering or classification methods create groupings of rows or of columns into clusters (or into families of hierarchical clusters). They include SOM (Self-Organizing Maps, or Kohonen maps).

These two families of methods can be used on the same data matrix and they complement one another very effectively.

Selection of characteristic units and responses (or: sentences):
● Characteristic units (words, segments, lemmas)
● Selecting « Modal responses »



Visualization through principal coordinates, « a breakthrough in 1904 ».

Charles Spearman (1904) – “General intelligence, objectively determined and measured”. Amer. Journal of Psychology, 15, p 201-293.

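$$x_{ij} = a_j f_i + \varepsilon_{ij}$$

where
- $x_{ij}$ is the value of variable j for individual i (known),
- $a_j$ is the coefficient of variable j (unknown),
- $f_i$ is the general factor for individual i (unknown),
- $\varepsilon_{ij}$ is the residual (hopefully small).

The left-hand side is known; the right-hand side is unknown.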


$$x_{ij} = a_j f_i + b_j g_i + \dots + \varepsilon_{ij}$$

Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych., 9, p 345-366.

Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press, Chicago.


$$X = \lambda_1 v_1 u_1' + \dots + \lambda_\alpha v_\alpha u_\alpha' + \dots + \lambda_p v_p u_p'$$

Eckart C., Young G. (1936) - The approximation of one matrix by another of lower rank. Psychometrika, 1, p 211-218.

Eckart C., Young G. (1939) - A principal axis transformation for non-hermitian matrices. Bull. Amer. Math. Soc., 45, p 118-121.

Singular Value Decomposition is a theorem, not a model

A precursor: Pearson K. (1901) - On lines and planes of closest fit to systems of points in space. Phil. Mag. 2, n°11, p 559-572.
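The decomposition above, and the approximation of a matrix by another of lower rank, can be illustrated with a short NumPy sketch. The random matrix of grey levels, the function name `low_rank` and the variable names are assumptions made for the illustration; here `s` holds the singular values of X, i.e. the weights of the successive rank-one terms.

```python
import numpy as np

def low_rank(X, q):
    """Sum of the first q rank-one terms of the singular value decomposition of X."""
    V, s, Ut = np.linalg.svd(X, full_matrices=False)
    return (V[:, :q] * s[:q]) @ Ut[:q, :]

# Hypothetical table of grey levels (not the image used in the slides)
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(20, 32)).astype(float)

s = np.linalg.svd(X, compute_uv=False)          # singular values, decreasing
ranks = (2, 4, 8, 20)
errors = [np.linalg.norm(X - low_rank(X, q)) for q in ranks]
for q, err in zip(ranks, errors):
    print(f"rank {q:2d}: reconstruction error {err:10.3f}")
```

The reconstruction error decreases as more terms are kept, and it vanishes (up to rounding) when all terms are used, which is why the decomposition is a theorem rather than a model.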



95 88 88 87 95 88 95 95 95 106 95 78 65 71 78 77 77 etc. 143 144 151 151 153 170 183 181 162 140 116 128 133 144 159 166 170 153 151 162 166 162 151 126 117 128 143 147 175 181 170 166 132 116 143 144 133 130 143 153 159 175 192 201 188 162 135 116 101 106 118 123 112 116 130 143 147 162 183 166 135 123 120 116 116 129 140 159 133 151 162 166 170 188 166 128 116 132 140 126 143 151 144 155 176 160 168 166 159 135 101 93 98 120 128 126 147 154 158 176 181 181 154 155 153 144 126 106 118 133 136 153 159 153 162 162 154 143 128 159 153 147 159 150 154 155 153 158 170 159 147 130 136 140 150 150 151 144 147 176 188 170 166 183 170 166 153 130 132 154 162 120 135 155 181 183 162 144 147 147 144 126 120 123 129 130 112 101 135 150 166 147 129 123 133 144 133 117 109 118 132 112 109 120 136 120 136 136 130 136 147 147 140 136 144 140 132 129 151 153 140 128 153 147 130 133 140 124 136 152 166 147 144 151 159 140 123 130 123 109 112 126 120 143 145 162 153 155 175 154 144 136 130 120 112 123 123 144 144 159 155 155 162 166 158 147 140 147 126 123 132 135 136 144 147 136 143 162 175 136 110 112 135 120 118 126 151 150 130 129 133 147 133 151 143 106 85 93 128 136 140 140 144 143 126 117 116 129 124 ……………………………..etc.

Image “Cheetah” (Data Compression, Mark Nelson) and table (200 x 320) containing levels of grey.


Trace before diagonalization: 0.15930
Trace after diagonalization:  0.15930

No.  Eigenvalue  Percent  Cumulative
 1     0.045     28.549    28.549   **************************************************
 2     0.028     17.695    46.243   ******************************
 3     0.019     12.205    58.448   *********************
 4     0.012      7.306    65.754   ************
 5     0.007      4.674    70.428   ********
 6     0.006      3.516    73.944   ******
 7     0.005      2.944    76.888   *****
 8     0.003      2.179    79.067   ***
 9     0.003      1.869    80.936   ***
10     0.002      1.531    82.467   **
11     0.002      1.371    83.838   **
12     0.002      1.106    84.944   *
13     0.002      1.066    86.010   *
14     0.002      0.956    86.965   *
15     0.001      0.791    87.756   *
16     0.001      0.758    88.514   *
17     0.001      0.690    89.204   *
18     0.001      0.567    89.771
19     0.001      0.554    90.325
20     0.001      0.477    90.801
21     0.001      0.422    91.223
22     0.001      0.406    91.629
23     0.001      0.384    92.013
24     0.001      0.339    92.352

Eigenvalues of the Correspondence Analysis of the previous table



Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes



A pedagogical example: Description of « Textual Graphs »


**** Ain
Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie

**** Aisne
Aisne Ardennes Marne Nord Oise Seine_Marne Somme

**** Allier
Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone

**** Alpes_Prov
Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse

**** Alpes_Hautes
Alpes_Hautes Alpes_Prov Drome Isere Savoie

**** Alpes_Marit
Alpes_Marit Alpes_Prov Var

**** Ardeche
Ardeche Drome Gard Loire Hte_Loire Lozere

**** Ardennes
Ardennes Aisne Marne Meuse

……………………….

Each area “answers” the fictitious “open question”: Which are your neighbouring areas?
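To make the pedagogical example concrete, the sketch below turns three of these “responses” into a binary table (areas in rows, words in columns), the kind of table to which correspondence analysis can then be applied. The function name `binary_table` is hypothetical and only three of the areas listed above are used.

```python
def binary_table(texts):
    """Build a 0/1 table: one row per text, one column per distinct word."""
    words = sorted({w for t in texts.values() for w in t.split()})
    rows = {
        name: [1 if w in set(t.split()) else 0 for w in words]
        for name, t in texts.items()
    }
    return words, rows

# Three "responses" copied from the listing above
texts = {
    "Alpes_Prov": "Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse",
    "Alpes_Hautes": "Alpes_Hautes Alpes_Prov Drome Isere Savoie",
    "Alpes_Marit": "Alpes_Marit Alpes_Prov Var",
}
words, rows = binary_table(texts)
print(words)
for name, row in rows.items():
    print(f"{name:13s}", row)
```

Two areas that share many neighbours have similar rows, which is the information a principal axes method exploits to place them close to each other.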


The idea: When a pattern exists within a text, some techniques may detect it and exhibit it.

This map is blindly produced from the previous texts.



Correspondence analysis can be presented in several different ways.

• It is difficult to trace the method's history accurately (see, e.g., Hill, 1974; Benzécri, 1976; Nishisato, 1980; Gifi, 1990).

• The underlying theory probably dates back to Fisher (1936), Guttman (1941), and Hayashi (1956).

Example: CORRESPONDENCE ANALYSIS of a simple lexical table



• Correspondence analysis and principal components analysis are used under different circumstances:

• Principal components analysis is used for tables consisting of continuous measurements.

• Correspondence analysis is best applied to contingency tables (cross-tabulations) frequently encountered when analyzing textual data.

• By extension, it also provides a satisfactory description of data tables with binary coding.



• Cross-tabulations or contingency tables are among the most common data structures used for analyzing qualitative data.

• By looking at two partitions of a population or sample simultaneously, a cross-tabulation enables us to study how the data vary across response categories, a necessary step for the interpretation of results.



• We will use as a leading example a small contingency table.

• However, this kind of exploratory method is chiefly useful when we are dealing with very large data tables (a pedagogical paradox).

• In the following table, the 14 rows are words used in responses to an open-ended question given by 2000 respondents.

• The 5 columns are the educational levels of the respondents.


Words          No      Elem.  Trade  High   College  Total
               degree  Sch.   Sch.   Sch.

Money            51      64     32     29      17      193
Future           53      90     78     75      22      318
Unemployment     71     111     50     40      11      283
Decision          1       7      5      5       4       22
Difficult         7      11      4      3       2       27
Economic          7      13     12     11      11       54
Selfishness      21      37     14     26       9      107
Occupation       12      35     19      6       7       79
Finances         10       7      7      3       1       28
War               4       7      7      6       2       26
Housing           8      22      7     10       5       52
Fear             25      45     38     38      13      159
Health           18      27     20     19       9       93
Work             35      61     29     14      12      151

Total           323     537    322    285     125     1592

A contingency table crossing words and education level


Words          No      Elem.  Trade  High   College  Total
               degree  Sch.   Sch.   Sch.

Money          26.4    33.2   16.6   15.0     8.8    100.0
Future         16.7    28.3   24.5   23.6     6.9    100.0
Unemployment   25.1    39.2   17.7   14.1     3.9    100.0
Decision        4.5    31.8   22.7   22.7    18.2    100.0
Difficult      25.9    40.7   14.8   11.1     7.4    100.0
Economic       13.0    24.1   22.2   20.4    20.4    100.0
Selfishness    19.6    34.6   13.1   24.3     8.4    100.0
Occupation     15.2    44.3   24.1    7.6     8.9    100.0
Finances       35.7    25.0   25.0   10.7     3.6    100.0
War            15.4    26.9   26.9   23.1     7.7    100.0
Housing        15.4    42.3   13.5   19.2     9.6    100.0
Fear           15.7    28.3   23.9   23.9     8.2    100.0
Health         19.4    29.0   21.5   20.4     9.7    100.0
Work           23.2    40.4   19.2    9.3     7.9    100.0

Total          20.3    33.7   20.2   17.9     7.9    100.0


Row-profiles of the same table

Words          No      Elem.  Trade  High   College  Total
               degree  Sch.   Sch.   Sch.

Money          15.8    11.9    9.9   10.2    13.6     12.1
Future         16.4    16.8   24.2   26.3    17.6     20.0
Unemployment   22.0    20.7   15.5   14.0     8.8     17.8
Decision         .3     1.3    1.6    1.8     3.2      1.4
Difficult       2.2     2.0    1.2    1.1     1.6      1.7
Economic        2.2     2.4    3.7    3.9     8.8      3.4
Selfishness     6.5     6.9    4.3    9.1     7.2      6.7
Occupation      3.7     6.5    5.9    2.1     5.6      5.0
Finances        3.1     1.3    2.2    1.1      .8      1.8
War             1.2     1.3    2.2    2.1     1.6      1.6
Housing         2.5     4.1    2.2    3.5     4.0      3.3
Fear            7.7     8.4   11.8   13.3    10.4     10.0
Health          5.6     5.0    6.2    6.7     7.2      5.8
Work           10.8    11.4    9.0    4.9     9.6      9.5

Total         100.0   100.0  100.0  100.0   100.0    100.0


Column-profiles of the same table
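Row-profiles and column-profiles are obtained by dividing each count by its row total or by its column total. A short NumPy sketch, using the counts of the contingency table crossing words and education level (the variable names are illustrative):

```python
import numpy as np

words = ["Money", "Future", "Unemployment", "Decision", "Difficult", "Economic",
         "Selfishness", "Occupation", "Finances", "War", "Housing", "Fear",
         "Health", "Work"]
levels = ["No degree", "Elem. Sch.", "Trade Sch.", "High Sch.", "College"]
N = np.array([
    [51, 64, 32, 29, 17], [53, 90, 78, 75, 22], [71, 111, 50, 40, 11],
    [1, 7, 5, 5, 4], [7, 11, 4, 3, 2], [7, 13, 12, 11, 11],
    [21, 37, 14, 26, 9], [12, 35, 19, 6, 7], [10, 7, 7, 3, 1],
    [4, 7, 7, 6, 2], [8, 22, 7, 10, 5], [25, 45, 38, 38, 13],
    [18, 27, 20, 19, 9], [35, 61, 29, 14, 12],
], dtype=float)

# Row-profiles: each row divided by its total (percentages summing to 100 per row)
row_profiles = 100 * N / N.sum(axis=1, keepdims=True)
# Column-profiles: each column divided by its total (percentages summing to 100 per column)
col_profiles = 100 * N / N.sum(axis=0, keepdims=True)

print("Row-profile of", words[0], np.round(row_profiles[0], 1))
print("Column-profile of", levels[0], np.round(col_profiles[:, 0], 1))
```

Comparing two row-profiles (or two column-profiles) rather than raw counts removes the effect of the very different frequencies of the words and sizes of the categories.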

[Figure: the general term fij of the contingency table F = (n, p), with rows i = 1, …, n and columns j = 1, …, p, gives rise to two clouds of points: the n row-profiles are n points in R^p, and the p column-profiles are p points in R^n. Symmetry of the two spaces: rows and columns.]


[Figure: principal plane of the correspondence analysis of the table crossing words and educational levels. Axis 1 (57%), Axis 2 (21%). Column points: NO DEGREE, ELEM, TRADE, HIGH, COLLEGE. Row points: Decision, Economic, Future, War, Fear, Health, Selfishness, Unemployment, Money, Difficult, Work, Housing, Finances, Occupation.]
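A map of this kind can be computed from the contingency table crossing words and education level with a singular value decomposition. The sketch below is one common way of computing correspondence analysis and is not necessarily the algorithm of the software used for the slides; function and variable names are assumptions.

```python
import numpy as np

N = np.array([
    [51, 64, 32, 29, 17], [53, 90, 78, 75, 22], [71, 111, 50, 40, 11],
    [1, 7, 5, 5, 4], [7, 11, 4, 3, 2], [7, 13, 12, 11, 11],
    [21, 37, 14, 26, 9], [12, 35, 19, 6, 7], [10, 7, 7, 3, 1],
    [4, 7, 7, 6, 2], [8, 22, 7, 10, 5], [25, 45, 38, 38, 13],
    [18, 27, 20, 19, 9], [35, 61, 29, 14, 12],
], dtype=float)

def correspondence_analysis(N):
    """Return row coordinates, column coordinates and the inertia of each axis."""
    P = N / N.sum()                      # relative frequencies
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals with respect to the independence model r c'
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]
    return row_coords, col_coords, s ** 2

row_coords, col_coords, inertia = correspondence_analysis(N)
percent = 100 * inertia / inertia.sum()
for axis in range(2):
    print(f"Axis {axis + 1}: {percent[axis]:.1f}% of the total inertia")
```

The printed percentages can be compared with those displayed on the axes of the map; the first two columns of `row_coords` and `col_coords` are the coordinates of the words and of the educational levels on the principal plane.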


Characteristic elements (words, lemmas, segments)

The corpus contains several parts (categories of respondents).

Notations:
kij : sub-frequency of word i in part j of the corpus;
ki. : frequency of word i in the whole corpus;
k.j : frequency (size) of part j;
k.. : size of the corpus (or, simply, k).

We are interested in the statistical significance of the sub-frequency kij.

Is word i abnormally frequent in part j? Is it abnormally rare?

Comparing the relative frequency of word i in part j with its relative frequency in the entire corpus leads to a classical statistical test using either the hypergeometric distribution or its normal approximation.


The 4 parameters for computing characteristic elements

[Figure: table with WORDS in rows and TEXT PARTS in columns, showing the four quantities below.]

kij : frequency of word in text part
ki. : frequency of word in corpus
k.j : size of text part
k.. : size of corpus



The next two slides show the principal plane produced by a correspondence analysis of the previous lexical contingency table (section 3).

Proximity between 2 category-points (columns) means similarity of lexical profiles of the 2 categories.

Proximity between 2 word-points (rows) means similarity of lexical profiles of these words.


Example 1: Open Question « Life » (International sample surveys)


[Figure: Example 1 (« Life » question). Correspondence analysis, words and categories. The principal plane shows the nine education-age categories (E1-AGE1, E1-AGE2, E1-AGE3, E2-AGE1, E2-AGE2, E2-AGE3, E3-AGE1, E3-AGE2, E3-AGE3) together with the words: security, mind, kids, peace, leisure, freedom, standard, house, time, contentment, children, welfare, church, son, general, family, happiness, employment, world, daughter, dog, wife, are, from, very, should, me, help, if, them, for, music, work, education, love, satisf., job, future, friends, money, think, comfortably, have, keep, going, anything, would, day, more, and, be, not, well, I, to, you, comfortable, much, car, things, out, go, can.]

[Figure: Example 1 (« Life » question). Location of segments. The same principal plane, with the nine education-age categories (E1-AGE1 to E3-AGE3), shows the repeated segments: peace of mind; a good standard of living; welfare of my family; happiness, good health; law and order; a good job; friends and family; having enough money to live; can't think of anything else; peace in the world; a nice home; I don't know.]

Example 1 (« Life » question): Characteristic words

     Words      %W     %glob   Fr.W   Fr.glob   TestValue

H-30 = -30 * high (Young, High education)
 1   friends    2.87   1.11     17      116       3.44
 2   do         1.35    .45      8       47       2.60
 3   want       1.01    .30      6       31       2.44
 4   being      2.19   1.11     13      116       2.18
 5   job        2.53   1.36     15      142       2.16
 6   having     1.52    .67      9       70       2.11
 7   things      .84    .27      5       28       2.06
 ----------------
 2   wife        .00    .65      0       68      -2.10
 1   health     2.70   5.85     16      609      -3.59

H+55 = +55 * high (Older, High education)
 1   mind       2.55    .45      5       47       2.91
 2   welfare    1.53    .21      3       22       2.42
 3   peace      2.55    .74      5       77       2.17


Category 7: Less than 30 years, high level of education

1.33 - 1  friends, friends, my home life

1.12 - 2  being content having enough money to do what you want to do, within reason, having good friends, having a fulfilling job to do, having some idea of what you want to do and the freedom to choose, protection of the environment

1.05 - 3  to have good friends around having a good job, living in a good area, having lots of freedom to do the things you want to do

 .93 - 4  good living education, good job, money

Category 9: Over 55 years, high level of education

 .97 - 1  togetherness, peace of mind, good health, religion, no

 .64 - 2  not to die, peace of mind, don't like people living envious of each other

 .63 - 3  peace of mind good health, happiness, enough money to keep a standard of living

 .38 - 4  welfare of my family work, satisfaction, good health, travel

Example 1 (« Life » question) : Modal Responses
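Modal responses are actual responses chosen because they are typical of a category. A minimal sketch of one possible selection rule, assuming each response is scored by the mean test-value of its words (this criterion, the function name modal_responses and the choice of giving a score of 0 to words absent from the table are assumptions; the slide does not state how its scores were computed). The word scores are the TestValue figures of the characteristic-words table for the category H-30.

```python
def modal_responses(responses, word_scores, top=2):
    """Rank responses by the mean test-value of their words (highest first)."""
    ranked = []
    for text in responses:
        words = text.lower().replace(",", " ").split()
        if not words:
            continue
        # Words without a listed test-value contribute 0 (an assumption).
        mean_score = sum(word_scores.get(w, 0.0) for w in words) / len(words)
        ranked.append((mean_score, text))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:top]


# Test-values of the young, highly educated category (from the slide).
scores_h30 = {
    "friends": 3.44, "do": 2.60, "want": 2.44, "being": 2.18, "job": 2.16,
    "having": 2.11, "things": 2.06, "wife": -2.10, "health": -3.59,
}
candidates = [
    "friends, friends, my home life",
    "good health",
    "good living education, good job, money",
]
best = modal_responses(candidates, scores_h30, top=2)
for score, text in best:
    print(f"{score:5.2f}  {text}")
```

Short responses made of very characteristic words rise to the top, which is why a modal response works as a kind of automatic summary of the category.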


Example 1: Similar survey in Japan (Same open question, same categories of respondents)


Example 1: Similar survey in Japan: visualization of the characteristic words for 2 categories




Examples of 2 characteristic responses for 4 categories

TEXT 1  PWNB = Probably would not buy
  1  to tell you about how long people have eaten them. the complex carbohydrate that are in this cereal. the people who eat this cereal and the product. that's all.
  2  it's supposed to be healthy, it has good carbohydrates in it.

TEXT 2  Hesi = Hesitates
  1  it gives you energy in the morning. nothing else.
  2  grape nuts cereal gives you energy it has complex carbohydrates. they showed the man eating it with strawberries and bananas.

TEXT 3  PWB = Probably would buy
  1  it's nutritious for you. nothing else.
  2  that is good for you, that's all it said to me

TEXT 4  DWB = Definitely would buy
  1  they are bigger nuggets. low in carbohydrates, that's all.
  2  it has nutty flavor, it is nutritious



Example 3: International survey (Tokyo Gas Company). A survey in three cities (Tokyo, New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often?"

New York: First principal plane. Table crossing words and age x gender categories


New York: First principal plane. Example of confidence areas for categories (Bootstrap)

Example 3: International survey (continuation). Question: "What dishes do you like and eat often?"


New York: First principal plane. Example of confidence areas for words (Bootstrap)
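The confidence areas rest on bootstrap resampling of the respondents. A minimal one-dimensional sketch of the idea, assuming a percentile bootstrap on the share of responses that mention a given word (the answers, the function name bootstrap_interval and its parameters are hypothetical). In the planar version shown on the slides, each replicate would instead be projected onto the principal axes and the cloud of replicated points would delimit the confidence area.

```python
import random


def bootstrap_interval(responses, word, n_replicates=1000, level=0.90, seed=0):
    """Percentile bootstrap interval for the share of responses containing `word`."""
    rng = random.Random(seed)
    n = len(responses)
    shares = []
    for _ in range(n_replicates):
        # Draw n respondents with replacement from the observed sample.
        sample = [responses[rng.randrange(n)] for _ in range(n)]
        hits = sum(1 for text in sample if word in text.lower().split())
        shares.append(hits / n)
    shares.sort()
    lo_index = int((1 - level) / 2 * n_replicates)
    hi_index = int((1 + level) / 2 * n_replicates) - 1
    return shares[lo_index], shares[hi_index]


# Hypothetical answers to "What dishes do you like and eat often?"
answers = [
    "pizza pasta", "steak salad", "pizza", "sushi",
    "pasta salad", "pizza burger", "chicken rice", "pizza salad",
]
low, high = bootstrap_interval(answers, "pizza")
print(low, high)
```

A word or category whose replicated positions stay tightly grouped can be interpreted with more confidence than one whose replicates scatter widely.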



New York: First principal plane. Example of Kohonen Map (Self Organizing map).



[Figure: First principal plane, wines and marks. Axis 1: 3.52%; Axis 2: 1.75%. Marks and comments are the active elements. The plane shows the marks (78 to 97), annotated with a quality axis ("Eje de calidad"), the wine types (Tinto joven, Tinto roble, Tinto crianza, Tinto reserva, Gran Reserva) and individual wines with their vintage, e.g. Vega Sicilia 'Único' (94), Viña Sastre Pesus (01), Termanthia (02), Numanthia (02), Mesoneros de Castilla (03), Torondos (02).]

Example 4: Comments about 522 Spanish wines.


[Figure: words of the tasting comments placed along the scale of marks (81 to 90), between « lowest marks » and « highest marks ». Average mark: 85.16. Among the words plotted: corto, herbáceo, rústico, lineal, ligero, frutal, agradable, clásico, complejo, largo, mineral, intenso, voluptuoso, magnífico.]

Example 4: Comments about 522 Spanish wines (continuation)


[Figure: first principal plane (Axis 1, Axis 2), titled "Variables suplementarias" (supplementary variables). The plane shows the price ranges (from 0-4,9€ to 100-300€), the wine types (Tinto joven, Tinto roble, Tinto crianza, Tinto reserva, Gran Reserva), the marks (78 to 97) and individual wines such as Vega Sicilia 'Único' (94), Astrales (02), Punta Esencia (01), Valdecuadrón (02).]

Example 4: Comments about 522 Spanish wines (continuation)


---- Wine 212 (mark=85) Legaris-2001
Tuestes, gominolas y buenos balsámicos marcan la intensidad media frutal de este crianza. En boca aparece muy lineal, con consistencia media; el retrogusto frutal todavía tapado por una madera algo rústica.

---- Wine 30 (mark=91) Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. Concentrado, tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones de tierra húmeda y pólvora en el largo final.

---- Wine 314 (mark=97) Vega Sicilia 'Único'-1994
Hay que realizar un ejercicio de disciplina gustativa de primer rango para describir este gran vino. El bouquet es fresco, bien armado de fruta roja que se ve potenciada por tintes de chocolates, tabacos, notas de sotobosque y una madera que se manifiesta pero que resulta difícil de localizar y menos de concretar. Tenemos el caso raro de un tinto que sale ileso del paso del tiempo sin lucir su armadura, que es la barrica. En boca joven, aunque ya tiene su cuerpo vigoroso y enérgico bastante ensamblado, con la excepción de algunos taninos saltamontes que quedan para domesticar. Largo y vibrante final que mezcla madurez con una notable finura fresca.

Example 4: Comments about 522 Spanish wines (continuation)








6) About textual data in general

Processing Strategy

A priori Grouping (Lexical contingency table)

Juxtaposition of Lexical contingency tables

Instrumental Partition

Direct Analysis of the sparse Lexical table
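The first strategy, a priori grouping, starts from a lexical contingency table that cross-tabulates words with categories of respondents. A minimal sketch of how such a table could be built, assuming a crude tokenization on spaces and commas (the function name lexical_table, the min_count threshold and the sample responses are hypothetical):

```python
from collections import Counter, defaultdict


def lexical_table(responses, categories, min_count=1):
    """Cross-tabulate words (rows) by respondent category (columns)."""
    table = defaultdict(Counter)
    for text, category in zip(responses, categories):
        for word in text.lower().replace(",", " ").split():
            table[word][category] += 1
    # Keep only words whose total frequency reaches the threshold.
    return {w: dict(c) for w, c in table.items() if sum(c.values()) >= min_count}


# Hypothetical responses, each paired with the category of its respondent.
responses = ["good health", "friends, good job", "peace of mind, good health"]
categories = ["H+55", "H-30", "H+55"]
table = lexical_table(responses, categories)
for word in sorted(table):
    print(word, table[word])
```

Grouping the responses by category turns a very sparse responses-by-words table into a much denser table that can then be visualized; the direct analysis listed last works on the sparse table itself.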



Importance of Meta-data

Textual data

Grammar / Syntax

Meta-data

linguistics

Semantic networks

External corpora

Other a priori structures (sociolinguistics, chronology, etc.)

The four phases of a linguistic analysis

Morphology

Syntax

Semantics

Pragmatics

A big flower

A bag flower

A bug flower

A bog flower

(A bxg flower)

The spoon speaks (The speaks)

A man thinks (A stone thinks)

A challenge to A.I.

Homography, Polysemy, Synonymy

Homographs: BORE
  - To bear (past tense)
  - A tedious person
  - To bore

Polysemous words:
  - DUTY: task, tax
  - DRUG: medicine, addictive product

Semantic content of a lexical profile

Distributional linguistics (Z. Harris)

X is sometimes purring
X mews
X has whiskers
X likes milk
X likes chasing mice

At the end, the point « X » will be superimposed on the point « CAT ».
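The distributional argument can be sketched numerically: a word is described by the contexts in which it occurs, and two words with the same profile of contexts end up at the same point. A minimal illustration, assuming cosine similarity between context profiles (the similarity measure, the counts and the DOG profile are hypothetical; the slides position the points with a principal axes map instead):

```python
from math import sqrt


def cosine(p, q):
    """Cosine similarity between two sparse context profiles (dicts)."""
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in set(p) | set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Context profiles: how often each word appears in each context (hypothetical counts).
profiles = {
    "X": {"purrs": 1, "mews": 1, "has whiskers": 1, "likes milk": 1, "chases mice": 1},
    "CAT": {"purrs": 1, "mews": 1, "has whiskers": 1, "likes milk": 1, "chases mice": 1},
    "DOG": {"barks": 1, "chases cats": 1, "has whiskers": 1},
}
nearest = max(
    (name for name in profiles if name != "X"),
    key=lambda name: cosine(profiles["X"], profiles[name]),
)
print(nearest)
```

As more contexts of X are observed, its profile converges to that of CAT, which is the sense in which the two points become superimposed.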



Semantic similarity is not a transitive relationship

Example of semantic chains:

(1) calm–wisdom–discretion–wariness–fear–panic,

(2) fact–feature–aspect–appearance–illusion.
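The lack of transitivity is easy to reproduce with any thresholded similarity. A toy sketch, assuming three words of chain (1) are represented by two-dimensional feature vectors and that two words count as similar when their cosine is at least 0.7 (the vectors and the threshold are purely hypothetical):

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def similar(u, v, threshold=0.7):
    return cosine(u, v) >= threshold


# Hypothetical coordinates: each step along the chain shifts the meaning a little.
calm, wariness, panic = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
print(similar(calm, wariness), similar(wariness, panic), similar(calm, panic))
```

Each link of the chain is a small step, yet the two ends share nothing, so similarity links cannot be chained blindly to form groups of words.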



New additional variables

Nouns (proper, common)

Verbs (auxiliary, modal, usual…)

Adjective

Pronoun

Determiner

Adverb

Preposition

Conjunction



new variables, new metrics




7) Conclusions

For each open-ended question,

and for each partition of the sample of respondents, we obtain, without any preliminary coding or other intervention:

• A visualization of proximities between words and categories.

• Characteristic elements or words for each category.

• Modal responses for each category (a kind of automatic summary).

[Remember also that the open question “Why” following a closed question provides an indispensable assessment of the real understanding of the question].

As a conclusion...


All these processing steps are carried out under the supervision of robust assessment procedures:

- Non-parametric statistical tests,
- Bootstrap validation.

We are not dealing here with a novel, sophisticated model.

It is rather a painstaking effort to stick to the real concerns of the respondent, i.e. the customer, the user, the client.

With the rapid development of online surveys, the spreading of e-mails and blogs, the presented set of tools is expected to be a noteworthy component in a new methodology of customer knowledge.

As a conclusion... (continuation)

- Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo.
- Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys. Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press).
- Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006). Détermination d'une note globale, synthèse d'une évaluation numérique et d'appréciations libres. Application aux études de marché (in French). Actes des JADT-2006. http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm
- Bécue-Bertaut M., Álvarez-Esteban R., Pagès J. (2008). Rating of products through scores and free-text assertions. Comparing and combining both. Food Quality and Preference, 19/1, 122-134.
- Belson W.A., Duncan J.A. (1962). A Comparison of the check-list and the open response questioning system. Applied Statistics, 2, 120-132.
- Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
- Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge.
- Bradburn N., Sudman S., and associates (1979). Improving Interview Method and Questionnaire Design. Jossey Bass, San Francisco.
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis. J. of the Amer. Soc. for Information Science, 41 (6), 391-407.
- Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand Colin, Paris.
- Hayashi C., Suzuki T., Sasaki M. (1992). Data Analysis for Social Comparative Research: International Perspective. North-Holland, Amsterdam.
- Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data. COMPSTAT, Physica Verlag, 67-76.
- Lebart L., Salem A., Bécue M. (2000). Análisis estadístico de textos. Editorial Milenio, Lleida.
- Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht.
- Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley, N.Y.
- Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern., 61, 241-254.
- Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes. Cahiers de l'Analyse des Données, 489-500.
- Sasaki M., Suzuki T. (1989). New directions in the study of general social attitudes: trends and cross-national perspectives. Behaviormetrika, 26, 9-30.
- Schuman H., Presser S. (1981). Questions and Answers in Attitude Surveys. Academic Press, New York.
- Sudman S., Bradburn N. (1974). Response Effects in Surveys. Aldine, Chicago.

7) Conclusions – Short Bibliography

Survey data and software (DtmVic) can be downloaded from

www.dtm-vic.com

Merci

Thank You

Gracias

Grazie

Obrigado
