Date posted: 23-Feb-2018
Category: Documents
Uploaded by: jati-hamengku-gati
7/24/2019 referensi TF-IDF
Prasad L08VSM-tfd 1
Vector Space Model: TF-IDF
Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Recap: last lecture
  Collection and vocabulary statistics: Heaps' and Zipf's laws
  Dictionary compression for Boolean indexes
    Dictionary string, blocks, front coding
  Postings compression
    Gap encoding using prefix-unique codes
    Variable-byte and gamma codes
This lecture: Sections 6.2-6.4.3
  Scoring documents
  Term frequency
  Collection statistics
  Weighting schemes
  Vector space scoring
Ranked retrieval
  Thus far, our queries have all been Boolean.
    Documents either match or don't.
  Good for expert users with a precise understanding of their needs and the collection (e.g., library search).
  Also good for applications: applications can easily consume 1000s of results.
  Not good for the ma...
Problem with Boolean search: feast or famine
  Boolean queries often result in either too few (=0) or too many (1000s) results.
    Query 1: "standard user dlink 650" → 200,000 hits
    Query 2: "standard user dlink 650 no card found": 0 hits
  It takes skill to come up with a query that produces a manageable number of hits.
  With a ranked list of documents, it does not matter how large the retrieved set is.
Scoring as the basis of ranked retrieval
  We wish to return, in order, the documents most likely to be useful to the searcher.
  How can we rank-order the documents in the collection with respect to a query?
  Assign a score, say in [0, 1], to each document.
  This score measures how well document and query "match".
Query-document matching scores
  We need a way of assigning a score to a query/document pair.
  Let's start with a one-term query.
  If the query term does not occur in the document: the score should be 0.
  The more frequent the query term in the document, the higher the score (should be).
  We will look at a number of alternatives for this.
Take 1: Jaccard coefficient
  Recall: the Jaccard coefficient is a commonly used measure of the overlap of two sets A and B.
Jaccard coefficient: Scoring example
  What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
    Query: ides of march
    Document 1: caesar died in march
    Document 2: the long march
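The scoring example can be checked with a few lines of Python. This is a minimal sketch that assumes the usual definition jaccard(A, B) = |A ∩ B| / |A ∪ B| and a hypothetical whitespace tokenizer (`tokens`); neither the function names nor the tokenizer come from the slides.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def tokens(text: str) -> set:
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return set(text.lower().split())


query = tokens("ides of march")
doc1 = tokens("caesar died in march")
doc2 = tokens("the long march")

print(jaccard(query, doc1))  # 1 shared term out of 6 distinct terms
print(jaccard(query, doc2))  # 1 shared term out of 5 distinct terms
```

Under these assumptions Document 2 scores higher than Document 1 only because it is shorter, which is the kind of behaviour the next slide criticises.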
Issues with Jaccard for scoring
  It doesn't consider term frequency (how many times a term occurs in a document).
  It doesn't consider document/collection frequency (rare terms in a collection are more informative than frequent terms).
  We need a more sophisticated way of normalizing for length.
  Later in this lecture, we'll use $|A \cap B| / \sqrt{|A \cup B|}$ instead of $|A \cap B| / |A \cup B|$ (Jaccard) for length normalization.
Recall (Lecture 1): Binary term-document incidence matrix
  Each document is represented by a binary vector $\in \{0,1\}^{|V|}$
Term-document count matrices
  Consider the number of occurrences of a term in a document:
    Each document is a count vector in $\mathbb{N}^{|V|}$: a column of the count matrix
Bag of words model
  Vector representation doesn't consider the ordering of words in a document.
  "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
  This is called the bag of words model.
  In a sense, this is a step back: the positional index was able to distinguish these two documents.
  We will look at "recovering" positional information later in this course.
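A short sketch makes the loss of word order concrete. It assumes a hypothetical whitespace tokenizer and uses `collections.Counter` as the count vector; both are illustrative choices, not part of the slides.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return Counter(text.lower().split())


d1 = bag_of_words("John is quicker than Mary")
d2 = bag_of_words("Mary is quicker than John")

print(d1 == d2)  # the two count vectors are identical
```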
Term frequency tf
  The term frequency $\mathrm{tf}_{t,d}$ of term t in document d is defined as the number of times that t occurs in d.
  We want to use tf when computing query-document match scores. But how?
  Raw term frequency is not what we want:
    A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term.
    But not 10 times more relevant.
  Relevance does not increase proportionally with term frequency.
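Log-frequency weighting
  The log frequency weight of term t in d is
  $$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$
  0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
  Score for a document-query pair: sum over terms t in both q and d:
  $$\mathrm{score} = \sum_{t \in q \cap d} \left(1 + \log_{10} \mathrm{tf}_{t,d}\right)$$
  The score is 0 if none of the query terms is present in the document.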
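The weight and the overlap score can be sketched as follows. The function names and the whitespace tokenization are assumptions made for illustration; only the formulas come from the slide.

```python
import math
from collections import Counter


def log_tf_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def overlap_score(query: str, doc: str) -> float:
    """Sum of log-frequency weights over terms occurring in both query and document."""
    doc_tf = Counter(doc.lower().split())
    query_terms = set(query.lower().split())
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if doc_tf[t] > 0)


# Reproduce the mapping listed on the slide (rounded to one decimal).
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```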
Document frequency
  Rare terms are more informative than frequent terms.
    Recall stop words.
  Consider a term in the query that is rare in the collection (e.g., arachnocentric).
  A document containing this term is very likely to be relevant to the query arachnocentric.
  → We want a higher weight for rare terms like arachnocentric.
Document frequency, continued
  Consider a query term that is frequent in the collection (e.g., high, increase, line).
  A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
  → For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
  We will use document frequency (df) to capture this in the score.
  df is the number of documents that contain the term.
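idf weight
  $\mathrm{df}_t$ is the document frequency of t: the number of documents that contain t.
    $\mathrm{df}_t$ is an inverse measure of the informativeness of t: the lower it is, the more informative the term.
  We define the idf (inverse document frequency) of t by
  $$\mathrm{idf}_t = \log_{10}\left(N / \mathrm{df}_t\right)$$
  where N is the number of documents in the collection.
    We use $\log(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to "dampen" the effect of idf.
  It will turn out that the base of the log is immaterial.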
idf example, suppose N = 1 million
  term        df_t        idf_t
  calpurnia   1           6
  animal      100         4
  sunday      1,000       3
  fly         10,000      2
  under       100,000     1
  the         1,000,000   0
  There is one idf value for each term t in a collection.
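The table can be reproduced directly from the definition idf_t = log10(N / df_t). This is a minimal sketch; the helper name `idf` and the dictionary layout are illustrative assumptions.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency: log10(N / df). Assumes 0 < df <= N."""
    return math.log10(n_docs / df)


N = 1_000_000
doc_freq = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

for term, df in doc_freq.items():
    print(f"{term:10s} {df:>9,d} {idf(df, N):.0f}")
```

A term that occurs in every document gets idf 0, so it contributes nothing to a tf-idf score.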
Collection vs. Document frequency
  The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
  Which word is a better search term (and should get a higher weight)?
  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760
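The difference between the two statistics is easy to see in code. The sketch below uses a made-up three-document collection and a hypothetical whitespace tokenizer; the numbers it produces are unrelated to the table above.

```python
from collections import Counter


def collection_and_document_frequency(docs):
    """Return (cf, df): total occurrences per term, and number of documents containing it."""
    cf, df = Counter(), Counter()
    for doc in docs:
        terms = doc.lower().split()  # hypothetical tokenizer
        cf.update(terms)             # every occurrence counts
        df.update(set(terms))        # each document counts at most once per term
    return cf, df


toy_docs = ["try try again", "insurance try", "car insurance"]  # illustrative data only
cf, df = collection_and_document_frequency(toy_docs)
print(cf["try"], df["try"])              # occurrences vs. documents for "try"
print(cf["insurance"], df["insurance"])  # occurrences vs. documents for "insurance"
```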
tf-idf weighting
  The tf-idf weight of a term is the product of its tf weight and its idf weight.
  $$w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \times \log_{10}\left(N / \mathrm{df}_t\right)$$
  Best known weighting scheme in information retrieval
    Note: the "-" in tf-idf is a hyphen, not a minus sign!
    Alternative names: tf.idf, tf x idf
  Increases with the number of occurrences within a document
  Increases with the rarity of the term in the collection
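As a sketch, the tf-idf weight is a one-line product of the two factors. The function name, the argument names, and the collection size used in the example are assumptions for illustration.

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


N = 1_000_000  # assumed collection size, as in the idf example
print(tf_idf(tf=10, df=1_000, n_docs=N))      # tf weight 2 times idf 3
print(tf_idf(tf=10, df=1_000_000, n_docs=N))  # term occurs in every document, so idf is 0
```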
Binary → count → weight matrix
  Each document is now represented by a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$
Documents as vectors
  So we have a |V|-dimensional vector space.
  Terms are axes of the space.
  Documents are points or vectors in this space.
  Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine.
  These are very sparse vectors: most entries are zero.
Queries as vectors
  Key idea 1: Do the same for queries: represent them as vectors in the space.
  Key idea 2: Rank documents according to their proximity to the query in this space.
    proximity = similarity of vectors
    proximity ≈ inverse of distance
  Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.
  Instead: rank more relevant documents higher than less relevant documents.
Formalizing vector space proximity
  First cut: distance between two points (= distance between the end points of the two vectors)
  Euclidean distance?
  Euclidean distance is a bad idea . . .
  . . . because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea
  The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance
  Thought experiment: take a document d and append it to itself. Call this document d′.
  "Semantically" d and d′ have the same content.
  The Euclidean distance between the two documents can be quite large.
  The angle between the two documents is 0, corresponding to maximal similarity.
  Key idea: Rank documents according to angle with query.
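The thought experiment can be tried numerically. The sketch below assumes raw term counts as vector components, so appending a document to itself exactly doubles every component; the example vector and the helper names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def angle_degrees(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.degrees(math.acos(cosine))


d = [3.0, 0.0, 4.0]           # hypothetical term counts for one document
d_prime = [2 * c for c in d]  # the document appended to itself: every count doubles

print(euclidean(d, d_prime))      # equals the length of d, here 5.0
print(angle_degrees(d, d_prime))  # 0.0
```

The distance grows with document length while the angle stays at zero, which is why the angle is the better proximity measure here.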
From angles to cosines
  The following two notions are equivalent:
    Rank documents in increasing order of the angle between query and document
    Rank documents in decreasing order of cosine(query, document)
  Cosine is a monotonically decreasing function on the interval [0°, 180°]
Length normalization
  A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:
  $$\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$$
  Dividing a vector by its L2 norm makes it a unit (length) vector.
  Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
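A minimal sketch of L2 normalization, applied to a hypothetical count vector and to the same vector doubled (a document appended to itself, assuming raw counts as weights):

```python
import math


def l2_normalize(x):
    """Divide each component by the L2 norm; the zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm == 0:
        return list(x)
    return [c / norm for c in x]


doc = [3.0, 0.0, 4.0]          # hypothetical term counts
doc_doubled = [6.0, 0.0, 8.0]  # the same document appended to itself

print(l2_normalize(doc))          # [0.6, 0.0, 0.8]
print(l2_normalize(doc_doubled))  # [0.6, 0.0, 0.8]
```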
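cosine(query, document)
  $$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$
  The first form is the dot product divided by the vector lengths; the second is the dot product of the unit vectors.
  $q_i$ is the tf-idf weight of term i in the query
  $d_i$ is the tf-idf weight of term i in the document
  $\cos(\vec{q}, \vec{d})$ is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.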
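Because document vectors are sparse, a dictionary keyed by term is a natural representation. The following sketch computes the cosine over such dictionaries; the representation, the function name, and the convention of returning 0.0 for an empty vector are assumptions, not something the slides prescribe.

```python
import math


def cosine(q: dict, d: dict) -> float:
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)


q_vec = {"car": 1.0, "insurance": 1.0}  # illustrative weights
d_vec = {"car": 1.0, "auto": 1.0}
print(cosine(q_vec, d_vec))
```

Only terms present in both vectors contribute to the dot product, so the cost is proportional to the size of the shorter vector plus the two norm computations.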
Cosine similarity amongst 3 documents
  term        SaS   PaP   WH
  affection   115   58    20
3 documents example contd.
  Log frequency weighting
  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
Computing cosine scores
tf-idf weighting has many variants
  Columns headed 'n' are acronyms for weight schemes.
  Why is the base of the log in idf immaterial?
Weighting may differ in queries vs documents
  Many search engines allow for different weightings for queries vs documents.
  To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table.
  Example: ltn.lnc means:
    Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
    Document: logarithmic tf, no idf and cosine normalization
  Is this a bad idea?
tf-idf example: ltn.lnc
  Term      | Query                           | Document                     | Prod
            | tf-raw tf-wt df     idf  wt     | tf-raw tf-wt wt   n'lized    |
  auto      | 0      0     5000   2.3  0      | 1      1     1    0.52       | 0
  best      | 1      1     50000  1.3  1.3    | 0      0     0    0          | 0
  car       | 1      1     10000  2.0  2.0    | 1      1     1    0.52       | 1.04
  insurance | 1      1     1000   3.0  3.0    | 2      1.3   1.3  0.68       | 2.04
  Document length = sqrt(1^2 + 0^2 + 1^2 + 1.3^2) ≈ 1.92
  Document: car insurance auto insurance
  Query: best car insurance
  Exercise: what is N, the number of docs?
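The ltn.lnc computation for this query and document can be sketched as below. The collection size N = 1,000,000 is an assumption (it reproduces the idf values 2.3, 1.3, 2.0 and 3.0 for the listed document frequencies), and all variable names are illustrative.

```python
import math

N = 1_000_000  # assumed collection size; it reproduces the idf values listed for these terms
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}


def log_tf(tf: int) -> float:
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


# Query side (ltn): logarithmic tf, idf, no normalization.
q = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# Document side (lnc): logarithmic tf, no idf, cosine normalization.
d_raw = {t: log_tf(tf) for t, tf in doc_tf.items()}
d_len = math.sqrt(sum(w * w for w in d_raw.values()))
d = {t: w / d_len for t, w in d_raw.items()}

# Final score: sum of query weight times normalized document weight over shared terms.
score = sum(w * d.get(t, 0.0) for t, w in q.items())
print(round(d_len, 2), round(score, 2))
```

Because the code keeps full precision until the final rounding, its score can differ in the last digit from a hand calculation that rounds every intermediate value to two decimals.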
Summary: vector space ranking
  Represent the query as a weighted tf-idf vector
  Represent each document as a weighted tf-idf vector
  Compute the cosine similarity score for the query vector and each document vector
  Rank documents with respect to the query by score
  Return the top K (e.g., K = 10) to the user
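The five steps can be put together in a small end-to-end sketch. It uses the same tf-idf weighting for queries and documents, a hypothetical whitespace tokenizer, and a brute-force scan over all documents; a real engine would use an inverted index instead, and every name below is an illustrative assumption.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()  # hypothetical whitespace tokenizer


def tf_idf_vector(terms, df, n_docs):
    """Sparse vector {term: (1 + log10 tf) * log10(N / df)}; terms unseen in the collection are skipped."""
    return {
        t: (1.0 + math.log10(tf)) * math.log10(n_docs / df[t])
        for t, tf in Counter(terms).items()
        if df.get(t, 0) > 0
    }


def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0


def rank(query, docs, k=10):
    """Return the top-k (score, document index) pairs, best first."""
    tokenized = [tokenize(doc) for doc in docs]
    df = Counter(t for terms in tokenized for t in set(terms))
    n_docs = len(docs)
    doc_vecs = [tf_idf_vector(terms, df, n_docs) for terms in tokenized]
    q_vec = tf_idf_vector(tokenize(query), df, n_docs)
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest score first, ties by index
    return scored[:k]


docs = [
    "car insurance auto insurance",
    "the long march",
    "caesar died in march",
]
print(rank("best car insurance", docs, k=2))
```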