
Florian Thomas, Christian Reß
Map/Reduce Algorithms on Hadoop
July 6, 2009

tf/idf computation

tf/idf computation

• What is tf/idf?
• Different implementations
• Map/Reduce structure
• Implementation details
• Analysis
• Conclusion

What is tf/idf?

term frequency:             tf i,j = freq i,j / Σk freq k,j

inverse document frequency: idf i = log( |D| / |{d | t i ∈ d}| )

tf/idf:                     (tf-idf) i,j = tf i,j ⋅ idf i
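A quick worked example (with assumed numbers, not taken from the slides): if a word occurs 5 times in a document of 15 words, tf = 5/15 ≈ 0.33; if the corpus contains |D| = 100 documents and the word occurs in 10 of them, idf = log(100/10), i.e. with the natural logarithm idf ≈ 2.30 and tf-idf ≈ 0.33 ⋅ 2.30 ≈ 0.77.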

Implementations – 2 Phases

[Diagram: Phase 1 determines the document count #; Phase 2 computes, per document, the word counts (Word1: 5, Word2: 8, Word3: 2), their sum (Sum: 15), the term frequencies tf Word,Document, the document frequencies n i, the idf Word values, and finally tf-idf Word,Document.]

tf i,j = freq i,j / Σk freq k,j
idf i = log( |D| / |{d | t i ∈ d}| )
(tf-idf) i,j = tf i,j ⋅ idf i

Implementations – 4 Phases

[Diagram: the same computation split across four phases, with the same intermediate results (document count #, per-document word counts and Sum, tf Word,Document, n i, idf Word, tf-idf Word,Document) and the same formulas.]

Implementations – 5 Phases

[Diagram: the computation split across five phases, detailed on the following slides, with the same intermediate results: per-document word counts (Word1: 5, Word2: 8, Word3: 2), Sum: 15, tf Word,Document, n i, idf Word, tf-idf Word,Document.]

Map/Reduce structure – 5 Phases

Phase 1: count documents

Map:    (DocID, Text) → ("N", 1)
Reduce: ("N", (1, 1, ...)) → ("N", #Documents)

This yields |D| for idf i = log( |D| / |{d | t i ∈ d}| ).
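A minimal Java sketch of how such a Phase 1 job could look with the org.apache.hadoop.mapreduce API (not the talk's actual code; class names are illustrative, and the input key is simply treated as the document ID):

// Phase 1: count documents
// Sketch only: each call to map() is assumed to receive one document.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DocumentCount {

  // Map: (DocID, Text) -> ("N", 1)
  public static class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text N = new Text("N");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable docId, Text text, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(N, ONE);                              // one record per document
    }
  }

  // Reduce: ("N", (1, 1, ...)) -> ("N", #Documents)
  public static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      long n = 0;
      for (LongWritable one : ones) n += one.get();
      ctx.write(key, new LongWritable(n));            // |D|, the total number of documents
    }
  }
}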

Phase 2: count words in each document

Map:    (DocID, Text) → ((Word, DocID), 1) for every word occurrence
Reduce: ((Word, DocID), (1, 1, ...)) → ((Word, DocID), freq), plus (DocID, Sum) per document

tf i,j = freq i,j / Σk freq k,j
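A sketch of Phase 2 along the same lines. It assumes plain Text keys of the form "Word \t DocID" instead of the talk's Tuple types, and that the mapper also emits the per-document word count under an invented "!SUM" marker key; both choices are illustrative, not taken from the original implementation:

// Phase 2: count words in each document
// Sketch only: Text keys "word \t docId" stand in for the talk's Tuple types.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountPerDocument {

  // Map: (DocID, Text) -> ((Word, DocID), 1) per token, plus one ("!SUM", DocID) record
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable docId, Text text, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(text.toString());
      long words = 0;
      while (tok.hasMoreTokens()) {
        ctx.write(new Text(tok.nextToken() + "\t" + docId), ONE);
        words++;
      }
      ctx.write(new Text("!SUM\t" + docId), new LongWritable(words));  // Sum for this document
    }
  }

  // Reduce: ((Word, DocID), (1, 1, ...)) -> ((Word, DocID), freq); the !SUM key yields (DocID, Sum)
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(key, new LongWritable(sum));
    }
  }
}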

Phase 3: compute tf

Map:    ((Word, DocID), freq) → (DocID, (Word, freq));  (DocID, Sum) passes through
Reduce: (DocID, (Sum, (Word, freq), (Word, freq), ...)) → ((Word, DocID), tf)

tf i,j = freq i,j / Σk freq k,j
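A sketch of the Phase 3 reducer (the map side, which only re-keys the records by DocID, is omitted). It assumes the record layout of the previous sketch and that the document's "!SUM" record arrives as the first value of each group, which is what a secondary sort as described later would guarantee; otherwise the (Word, freq) values would have to be buffered:

// Phase 3: compute tf
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TfReducer extends Reducer<Text, Text, Text, DoubleWritable> {

  // Reduce: (DocID, (Sum, (Word, freq), (Word, freq), ...)) -> ((Word, DocID), tf)
  @Override
  protected void reduce(Text docId, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    long sum = 1;
    for (Text v : values) {
      String[] parts = v.toString().split("\t");
      if (parts[0].equals("!SUM")) {
        sum = Long.parseLong(parts[1]);               // the document's total word count
      } else {
        long freq = Long.parseLong(parts[1]);         // a (Word, freq) record
        ctx.write(new Text(parts[0] + "\t" + docId),
                  new DoubleWritable((double) freq / sum));   // tf i,j = freq i,j / Σk freq k,j
      }
    }
  }
}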

Phase 4: determine #Documents per word, compute idf

Map:    ((Word, DocID), tf) → (Word, (DocID, tf))
Reduce: (Word, ((DocID, tf), (DocID, tf), ...)) → (Word, (DocID, tf)), ... and (Word, idf)

idf i = log( |D| / |{d | t i ∈ d}| )
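A sketch of the Phase 4 reducer. It assumes |D| from Phase 1 is handed in through the job configuration under the illustrative key "total.documents" and that records are plain Text as in the previous sketches:

// Phase 4: determine in how many documents each word occurs, compute idf
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IdfReducer extends Reducer<Text, Text, Text, Text> {
  private long totalDocuments;

  @Override
  protected void setup(Context ctx) {
    totalDocuments = ctx.getConfiguration().getLong("total.documents", 1);  // |D| from Phase 1
  }

  // Reduce: (Word, ((DocID, tf), ...)) -> the unchanged (DocID, tf) records plus (Word, idf)
  @Override
  protected void reduce(Text word, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    long n = 0;
    for (Text v : values) {
      ctx.write(word, v);                             // pass the (DocID, tf) record through
      n++;                                            // n i = number of documents containing the word
    }
    double idf = Math.log((double) totalDocuments / n);
    ctx.write(word, new Text("IDF\t" + idf));         // idf i = log(|D| / n i)
  }
}

Written this way, the idf record ends up behind the (DocID, tf) records in the output, which is exactly the situation the secondary sort on the later slides addresses.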

Phase 5: compute tf-idf

Map:    (Word, (DocID, tf)), ... and (Word, idf)
Reduce: (Word, (idf, (DocID, tf), (DocID, tf), ...)) → ((Word, DocID), tf-idf)

(tf-idf) i,j = tf i,j ⋅ idf i
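A sketch of the Phase 5 reducer, again assuming the plain-Text record layout of the previous sketches and that the word's "IDF" record is the first value of the group (see the secondary-sort discussion on the following slides):

// Phase 5: compute tf-idf
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TfIdfReducer extends Reducer<Text, Text, Text, DoubleWritable> {

  // Reduce: (Word, (idf, (DocID, tf), (DocID, tf), ...)) -> ((Word, DocID), tf-idf)
  @Override
  protected void reduce(Text word, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    double idf = 0;
    boolean first = true;
    for (Text v : values) {
      String[] parts = v.toString().split("\t");
      if (first) {
        idf = Double.parseDouble(parts[parts.length - 1]);   // the "IDF" record comes first
        first = false;
      } else {
        double tf = Double.parseDouble(parts[1]);            // a (DocID, tf) record
        ctx.write(new Text(word + "\t" + parts[0]),
                  new DoubleWritable(tf * idf));             // (tf-idf) i,j = tf i,j * idf i
      }
    }
  }
}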

Motivation: Phase 5

The Phase 5 reducer has to multiply every (DocID, tf) value of a word with that word's idf, so it needs the idf before it processes the (DocID, tf) values.

Implementation – Secondary Sort

Normally, however, the reduce input looks like this: Word, ((DocID, tf), (DocID, tf), ..., idf, …)

Hadoop – normal behavior

Implementation – Secondary Sort

Word          (DocID, tf)
mental        (1, 0.4)
mental        (3, 0.9)
mental        (5, 0.3)
mental        (0, 0.7)
mental        (2, 0.6)
mental        1.8
retardation   (8, 0.5)
retardation   (6, 0.1)
retardation   (9, 0.4)
retardation   (4, 0.8)
retardation   5.1

→ mental, ((1, 0.4), (3, 0.9), (5, 0.3), (0, 0.7), 1.8, (2, 0.6))
→ retardation, ((8, 0.5), 5.1, (6, 0.1), (9, 0.4), (4, 0.8))

Hadoop – desired behavior

Implementation – Secondary Sort

• Append an index value to the key
• Adjust the grouping

Word          Index  (DocID, tf)
mental        1      (1, 0.4)
mental        1      (3, 0.9)
mental        1      (5, 0.3)
mental        1      (0, 0.7)
mental        1      (2, 0.6)
mental        0      1.8
retardation   1      (8, 0.5)
retardation   1      (6, 0.1)
retardation   1      (9, 0.4)
retardation   1      (4, 0.8)
retardation   0      5.1

→ mental, (1.8, (1, 0.4), (3, 0.9), (5, 0.3), (0, 0.7), (2, 0.6))
→ retardation, (5.1, (8, 0.5), (6, 0.1), (9, 0.4), (4, 0.8))
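A sketch of how this secondary sort could be wired up with the Hadoop 0.20 API. It uses plain Text keys of the form "word \t index" for illustration, with 0 marking the idf record and 1 marking the (DocID, tf) records; the talk's implementation uses its own Tuple types and a raw-byte Comparator instead:

// Secondary sort: append an index to the key, partition and group by the word only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySort {

  private static String wordPart(String key) {        // strip the appended index
    int tab = key.indexOf('\t');
    return tab < 0 ? key : key.substring(0, tab);
  }

  // Partition by the word only, so "mental\t0" and "mental\t1" go to the same reducer.
  public static class WordPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      return (wordPart(key.toString()).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group by the word only, so the idf record (index 0) and all (DocID, tf) records
  // (index 1) end up in one reduce() call, with the idf record sorted to the front.
  public static class WordGroupingComparator extends WritableComparator {
    public WordGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return wordPart(a.toString()).compareTo(wordPart(b.toString()));
    }
  }

  public static Job configure(Configuration conf) throws Exception {
    Job job = new Job(conf, "tf-idf phase 5");         // Hadoop 0.20 style job setup
    job.setPartitionerClass(WordPartitioner.class);
    job.setGroupingComparatorClass(WordGroupingComparator.class);
    // The default Text sort order already places "word\t0" before "word\t1".
    return job;
  }
}

Because both the partitioner and the grouping comparator ignore the index, all records of a word still reach the same reduce() call; only the order inside the group changes.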

Implementation – Custom data types

Tuple<F,S>
  F first
  S second
  getFirst(): F
  getSecond(): S
  readFields(DataInput)
  write(DataOutput)
  compareTo(Tuple): int
  toString(): String

Comparator
  compare(byte[], int, int, byte[], int, int): int

Concrete types: TupleTextLong, TupleLongDouble
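A minimal sketch of such a tuple type, specialized to (Text, Long) in the spirit of TupleTextLong; the generic Tuple<F,S> and the raw-byte Comparator are omitted, and the class name is illustrative:

// A (Text, Long) tuple usable as a Map/Reduce key.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class TextLongTuple implements WritableComparable<TextLongTuple> {
  private String first = "";
  private long second;

  public String getFirst() { return first; }
  public long getSecond()  { return second; }
  public void set(String f, long s) { first = f; second = s; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(first);
    out.writeLong(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readUTF();
    second = in.readLong();
  }

  @Override
  public int compareTo(TextLongTuple other) {          // sort by first, then by second
    int c = first.compareTo(other.first);
    return c != 0 ? c : Long.compare(second, other.second);
  }

  @Override
  public String toString() { return first + "\t" + second; }

  @Override
  public int hashCode() { return 31 * first.hashCode() + (int) second; }  // used by HashPartitioner

  @Override
  public boolean equals(Object o) {
    return o instanceof TextLongTuple && compareTo((TextLongTuple) o) == 0;
  }
}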

Memory usage – 2 Phases

• Phase 1
  Map: constant
  Reduce: constant – 1 long
• Phase 2
  Map: 1 document, one float per word
  Reduce: | {word ∈ corpus} | * (Word, Long, Double)
• Upper-bound estimate:
  | {word ∈ corpus} | = | corpus |
  1,000,000 * (10 + 8 + 8 bytes) ≈ 25 MiB
  3 GiB RAM ≈ 3.2 billion documents

Memory usage – 4 Phases

• Phase 1
  Map: constant
  Reduce: constant – 1 long
• Phase 2
  Map: 1 document, 1 int
  Reduce: constant, 1 int
• Phase 3
  Map: 1 tuple (String, Long, Double)
  Reduce: { (Word, tf) } of one document
• Phase 4
  Map: 1 tuple (String, Long, Double)
  Reduce: | {word ∈ corpus} | * (String, Double)

Memory usage – 5 Phases

• Phase 1
  Map: constant
  Reduce: constant – 1 long
• Phase 2
  Map: 1 document, 1 int
  Reduce: constant, 1 int
• Phase 3
  Map: 1 tuple (String, Long, Long)
  Reduce: 1 tuple (String, Long, Long)
• Phase 4 & Phase 5
  Map: 1 tuple (String, Long, Double)
  Reduce: 1 tuple (String, Long, Double)

Upper bound – 5 Phases

• bounded by Phase 2
• 1 word => at most INT_MAX occurrences in the corpus
• 1 document => at most LONG_MAX words
• at most LONG_MAX documents
• Estimate: 10^19 * 10^19 * 10 bytes = 10^39 bytes
  => effectively unlimited (about 10^80 atoms in the universe…)

Runtime

[runtime measurement charts]

Conclusion

• both variants are usable on larger corpora, at most O(N)
• a comparison on large corpora is still needed (Wikipedia!)

Sources

• Hadoop API Documentation: http://hadoop.apache.org/core/docs/r0.20.0/api/
• Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/
• Wikipedia: http://en.wikipedia.org/wiki/Tf-idf
• Introduction to Information Retrieval: http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html