
Machine Learning for Text and Web Mining

Description:
Introduction to machine learning for text and web mining.


Transcript
Machine Learning Methods for Text / Web Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]

This material is available at http://scai.snu.ac.kr/~btzhang/


Overview

- Introduction
  - Web Information Retrieval
  - Machine Learning (ML)
  - ML Methods for Text/Web Data Mining
- Text/Web Data Analysis
  - Text Mining Using Helmholtz Machines
  - Web Mining Using Bayesian Networks
- Summary
  - Current and Future Work


Web Information Retrieval

[Diagram: text data passes through preprocessing and indexing, then feeds three subsystems. A text classification system labels documents. An information filtering system, driven by a user profile and a question/answer interaction with feedback, produces filtered data. A template-filling information extraction system fills DB templates (e.g., location, date) and stores DB records in a database.]


Machine Learning

- Supervised Learning
  - Estimate an unknown mapping from known input-output pairs.
  - Learn f_w from a training set D = {(x, y)} such that

      f_w(x) = y = f(x)

  - Classification: y is discrete.
  - Regression: y is continuous.
- Unsupervised Learning
  - Only input values are provided.
  - Learn f_w from D = {(x)} such that

      f_w(x) = x

  - Density estimation.
  - Compression, clustering.
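A toy sketch of the two settings (my own illustration, not from the slides): a supervised least-squares fit of f_w from (x, y) pairs, next to an unsupervised density estimate from inputs alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised: D = {(x, y)}; learn w so that f_w(x) = w*x approximates the unknown f.
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # unknown mapping plus noise
w = (x @ y) / (x @ x)                           # closed-form least-squares slope, close to 3.0

# Unsupervised: D = {x}; e.g. density estimation by fitting a mean and variance.
mu, sigma = x.mean(), x.std()
print(w, mu, sigma)
```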


Machine Learning Methods

- Neural Networks
  - Multilayer Perceptrons (MLPs)
  - Self-Organizing Maps (SOMs)
  - Support Vector Machines (SVMs)
- Probabilistic Models
  - Bayesian Networks (BNs)
  - Helmholtz Machines (HMs)
  - Latent Variable Models (LVMs)
- Other Machine Learning Methods
  - Evolutionary Algorithms (EAs)
  - Reinforcement Learning (RL)
  - Boosting Algorithms
  - Decision Trees (DTs)


ML for Text/Web Data Mining

- Bayesian Networks for Text Classification
- Helmholtz Machines for Text Clustering/Categorization
- Latent Variable Models for Topic Word Extraction
- Boosted Learning for the TREC Filtering Task
- Evolutionary Learning for Web Document Retrieval
- Reinforcement Learning for Web Filtering Agents
- Bayesian Networks for Web Customer Data Mining


Preprocessing for Text Learning

An example document:

  From: [email protected]
  Newsgroups: comp.graphics
  Subject: Need specs on Apple QT

  I need to the specs, or at least a very verbose interpretation of the specs, for QuickTime. Technical articles from magazines and references to books would be nice, too.

  I also need the specs in a format usable on a Unix or MS-Dos system. I can't do much with the QuickTime stuff they have on..

is reduced to a vector of word counts:

  specs: 3, quicktime: 2, unix: 1, references: 1, computer: 0, hockey: 0, space: 0, graphics: 0, clinton: 0, car: 0, baseball: 0, ...


Text Mining: Data Sets

- Usenet Newsgroup Data
  - 20 categories
  - 1,000 documents for each category
  - 20,000 documents in total
- TDT2 Corpus
  - Topic Detection and Tracking (TDT), run by NIST
  - 6,169 documents were used in the experiments


Text Mining: Helmholtz Machine Architecture

[Diagram: a two-layer network; latent nodes h_1, ..., h_m are connected to input nodes d_1, ..., d_n by recognition weights (inputs to latent nodes) and generative weights (latent nodes to inputs).] [Chang and Zhang, 2000]

- Input nodes
  - Binary values.
  - Represent the existence or absence of words in documents.
- Latent nodes
  - Binary values.
  - Extract the underlying causal structure in the document set.
  - Capture correlations of the words in documents.

With recognition weights w_{ij}, a latent node fires with probability

  P(h_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_{j=1}^{n} w_{ij} d_j)}

and with generative weights w_{ij}, a word node is generated with probability

  P(d_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_{j=1}^{m} w_{ij} h_j)}
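A minimal numeric sketch of these two conditional distributions (my own illustration with made-up toy weights, not the authors' code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def recognition_probs(d, W_rec, b_rec):
    """P(h_i = 1 | d): latent-unit activations from a binary document vector d."""
    return sigmoid(b_rec + W_rec @ d)

def generative_probs(h, W_gen, b_gen):
    """P(d_i = 1 | h): word probabilities generated from latent causes h."""
    return sigmoid(b_gen + W_gen @ h)

# Toy dimensions: n = 6 words, m = 2 latent nodes; weights are random stand-ins.
rng = np.random.default_rng(0)
W_rec, b_rec = rng.normal(size=(2, 6)), np.zeros(2)
W_gen, b_gen = rng.normal(size=(6, 2)), np.zeros(6)
d = np.array([1, 0, 1, 0, 0, 1])                 # word presence/absence
h = (recognition_probs(d, W_rec, b_rec) > 0.5).astype(int)
print(generative_probs(h, W_gen, b_gen))         # reconstruction probabilities
```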


Text Mining: Learning Helmholtz Machines

- Introduce a recognition network for estimation of the generative network.
- Wake-Sleep Algorithm
  - Train the recognition and generative models alternately.
  - Update the weights iteratively by a simple local delta rule.

The data log-likelihood is bounded from below using the recognition distribution Q:

  \log P(D) = \sum_{t=1}^{T} \log P(d^{(t)})
            = \sum_{t=1}^{T} \log \sum_{h} Q(h \mid d^{(t)}) \frac{P(h, d^{(t)})}{Q(h \mid d^{(t)})}
            \geq \sum_{t=1}^{T} \sum_{h} Q(h \mid d^{(t)}) \log \frac{P(h, d^{(t)})}{Q(h \mid d^{(t)})}

and each weight is updated by the local delta rule

  w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \Delta w_{ij}, \qquad \Delta w_{ij} = \varepsilon \, s_j (s_i - p_i)
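A hedged sketch of one wake-sleep sweep implementing this delta rule; the variable names, the omission of biases, the uniform sleep-phase prior over h, and the learning rate are my simplifications, not details from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def wake_sleep_step(d, W_rec, W_gen, lr=0.05, rng=np.random.default_rng(0)):
    """One wake-sleep update on a binary document vector d (biases omitted)."""
    # Wake phase: sample h from the recognition net, then nudge the
    # *generative* weights toward reconstructing d: dw = lr * s_j * (s_i - p_i).
    h = (rng.random(W_rec.shape[0]) < sigmoid(W_rec @ d)).astype(float)
    p_d = sigmoid(W_gen @ h)
    W_gen += lr * np.outer(d - p_d, h)

    # Sleep phase: dream (h', d') from the generative side, then nudge the
    # *recognition* weights to invert the dream with the same local rule.
    h_s = (rng.random(W_gen.shape[1]) < 0.5).astype(float)   # simplistic prior
    d_s = (rng.random(W_gen.shape[0]) < sigmoid(W_gen @ h_s)).astype(float)
    p_h = sigmoid(W_rec @ d_s)
    W_rec += lr * np.outer(h_s - p_h, d_s)
    return W_rec, W_gen
```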


Text Mining: Methods

- Text Categorization
  - Train a Helmholtz machine for each category: N machines in total for N categories.
  - Once the N machines have been estimated, a test document d is classified by estimating its likelihood under each machine:

      c^* = \arg\max_{c \in C} \log P(d \mid \theta_c)

- Topic Words Extraction
  - Train a single Helmholtz machine on the entire document set.
  - After training, examine the weights of the connections from a latent node to the input nodes, that is, to words.
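The categorization rule above is a one-line argmax in code; the `log_likelihood` method on each trained machine is a hypothetical interface, since the slides do not specify one.

```python
def classify(d, machines):
    """Return c* = argmax_c log P(d | theta_c) over a dict {category: model}."""
    return max(machines, key=lambda c: machines[c].log_likelihood(d))
```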


Text Mining: Categorization Results

- Usenet Newsgroup Data: 20 categories, 1,000 documents for each category, 20,000 documents in total.

[Results chart not preserved in the transcript.]


Text Mining: Topic Words Extraction Results

- TDT2 Corpus: 6,169 documents.

  Cluster 1: tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
  Cluster 2: warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds, mordechai, separatist, governor
  Cluster 3: olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze, skating, lillehammer, downhill
  Cluster 4: netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza, jerusalem, eu, paris, israel
  Cluster 5: india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata, bharatiya, islamabad, bihari
  Cluster 6: suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto, unrest, megawati
  Cluster 7: imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand, inflation, investors, fund, banks, baht
  Cluster 8: pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output, vatican, freedom


Web Mining: Customer Analysis

- KDD-2000 Web Mining Competition
  - Data: 465 features over 1,700 customers.
    - Features include friend promotion rate, date visited, weight of items, price of house, discount rate, ...
    - Data was collected from Jan. 30 to March 30, 2000.
    - The friend promotion started on Feb. 29 with a TV advertisement.
  - Aim: description of heavy/low spenders.


Web Mining: Feature Selection

- Features selected in various ways [Yang & Zhang, 2000]:

  Discriminant Model:
    V240 (Friend), V229 (Order-Average), V304 (OrderShippingAmtMin.), V368 (Weight Average), V43 (Home Market Value), V377 (NumAcountTemplateViews), plus V11 (WhichDoYouWearMostFrequent), V13 (SendEmail), V17 (USState), V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date)

  Decision Tree:
    V13 (SendEmail), V234 (OrderItemQuantitySum%HavingDiscountRange(5.10)), V237 (OrderItemQuantitySum%HavingDiscountRange(10.)), V240 (Friend), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin), V324 (NumLegwearProductViews), V368 (Weight Average), V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)

  Decision Tree + Factor Analysis:
    V368 (Weight Average), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum), and factors
    F1 = 0.94*V324 + 0.868*V374 + 0.898*V412
    F2 = 0.829*V234 + 0.857*V240
    F3 = -0.795*V237 + 0.778*V304


Web Mining: Bayesian Nets

- Bayesian network
  - DAG (Directed Acyclic Graph).
  - Expresses dependence relations between variables.
  - Can use prior knowledge on the data (parameters).
  - Examples of conjugate priors: Dirichlet for multinomial data, Normal-Wishart for normal data.

[Diagram: a five-node example with nodes A, B, C, D, E.] Its joint distribution factorizes as

  P(A, B, C, D, E) = P(A) P(B \mid A) P(C \mid B) P(D \mid A, B) P(E \mid B, C, D)
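A minimal sketch of how this factorization turns conditional probability tables into a joint probability over binary nodes; the CPT numbers below are invented for illustration, not taken from the data.

```python
p_A = 0.3                                           # P(A=1)
p_B_A = {0: 0.9, 1: 0.4}                            # P(B=1 | A=a)
p_C_B = {0: 0.2, 1: 0.8}                            # P(C=1 | B=b)
p_D_AB = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}   # P(D=1 | A,B)
p_E_BCD = {(b, c, d): 0.5 for b in (0, 1) for c in (0, 1) for d in (0, 1)}

def bern(p, x):
    """P(X = x) for a Bernoulli variable with P(X=1) = p."""
    return p if x == 1 else 1.0 - p

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)."""
    return (bern(p_A, a) * bern(p_B_A[a], b) * bern(p_C_B[b], c)
            * bern(p_D_AB[(a, b)], d) * bern(p_E_BCD[(b, c, d)], e))

# Sanity check: the factorized joint sums to 1 over all assignments.
assert abs(sum(joint(a, b, c, d, e)
               for a in (0, 1) for b in (0, 1) for c in (0, 1)
               for d in (0, 1) for e in (0, 1)) - 1.0) < 1e-9
```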


Web Mining: Results

- A Bayesian net learned from the KDD web data.
- V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
- V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.


Summary

- We study machine learning methods, such as:
  - Probabilistic neural networks
  - Evolutionary algorithms
  - Reinforcement learning
- Application areas include:
  - Text mining
  - Web mining
  - Bioinformatics (not addressed in this talk)
- Recent work focuses on probabilistic graphical models for web/text/bio data mining, including:
  - Bayesian networks
  - Helmholtz machines
  - Latent variable models


Bayesian Networks: Architecture

- A Bayesian network represents the probabilistic relationships between the variables:

    P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{pa}_i)

  where pa_i is the set of parent nodes of X_i.

[Diagram: a four-node example with nodes B, G, M, L.] For this network,

  P(L, B, G, M) = P(L) P(B \mid L) P(G \mid L, B) P(M \mid L, B, G)
                = P(L) P(B) P(G \mid B) P(M \mid B, L)


Bayesian Networks: Applications in IR
A Simple BN for Text Classification

- The network structure represents the naïve Bayes assumption.
- All nodes are binary.
- [Hwang & Zhang, 2000]

[Diagram: a class node C with term nodes t_1, t_2, ..., t_8754 as its children; C is the document class and t_i is the i-th term.]


Bayesian Networks: Experimental Results

- Dataset
  - The acq dataset from Reuters-21578.
  - 8,754 terms were selected by TFIDF.
  - Training data: 8,762 documents.
  - Test data: 3,009 documents.
- Parametric Learning
  - Dirichlet prior assumptions for the network parameter distributions:

      p(\theta_{ij} \mid S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ij r_i})

  - Parameter distributions are updated with the training data:

      p(\theta_{ij} \mid D, S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ij r_i} + N_{ij r_i})
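In effect, the conjugate update just adds the observed data counts to the prior hyperparameters. A tiny sketch with invented numbers:

```python
import numpy as np

alpha = np.array([1.0, 1.0])        # prior Dir(alpha_ij1, alpha_ij2) for a binary node
counts = np.array([37.0, 13.0])     # N_ijk: training counts for each value of X_i

posterior = alpha + counts          # Dir(alpha_ij1 + N_ij1, alpha_ij2 + N_ij2)
theta_mean = posterior / posterior.sum()   # posterior-mean parameter estimate
print(theta_mean)                   # -> [0.731, 0.269]
```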


Bayesian Networks: Experimental Results

For training data (accuracy: 94.28%):

                      Precision (%)   Recall (%)
  Positive examples   75.98           96.83
  Negative examples   99.32           93.76

For test data (accuracy: 96.51%):

                      Precision (%)   Recall (%)
  Positive examples   89.17           95.16
  Negative examples   98.67           96.88


Latent Variable Models: Architecture

Latent variable model for topic words extraction and document clustering [Shin & Zhang, 2000].

- Maximize the log-likelihood

    L = \sum_{n=1}^{N} \sum_{m=1}^{M} n(d_n, w_m) \log P(d_n, w_m)
      = \sum_{n=1}^{N} \sum_{m=1}^{M} n(d_n, w_m) \log \sum_{k=1}^{K} P(z_k) P(w_m \mid z_k) P(d_n \mid z_k)

- Update P(z_k), P(w_m | z_k), and P(d_n | z_k) with the EM algorithm.
- The latent variables z_k serve both document clustering (via P(d_n | z_k)) and topic-words extraction (via P(w_m | z_k)).


Latent Variable Models: Learning

- EM (Expectation-Maximization) Algorithm
  - An algorithm to maximize the pre-defined log-likelihood.
- Iterate the E-step and the M-step:

E-step:

  P(z_k \mid d_n, w_m) = \frac{P(z_k) P(d_n \mid z_k) P(w_m \mid z_k)}{\sum_{k'=1}^{K} P(z_{k'}) P(d_n \mid z_{k'}) P(w_m \mid z_{k'})}

M-step:

  P(w_m \mid z_k) = \frac{\sum_{n=1}^{N} n(d_n, w_m) P(z_k \mid d_n, w_m)}{\sum_{m'=1}^{M} \sum_{n=1}^{N} n(d_n, w_{m'}) P(z_k \mid d_n, w_{m'})}

  P(d_n \mid z_k) = \frac{\sum_{m=1}^{M} n(d_n, w_m) P(z_k \mid d_n, w_m)}{\sum_{m=1}^{M} \sum_{n'=1}^{N} n(d_{n'}, w_m) P(z_k \mid d_{n'}, w_m)}

  P(z_k) = \frac{1}{R} \sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m) P(z_k \mid d_n, w_m), \qquad R \equiv \sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m)
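A compact NumPy sketch of these E- and M-steps, written directly from the formulas above rather than from the authors' code; `n` is the N-by-M matrix of counts n(d_n, w_m), and the random initialization and iteration count are my choices.

```python
import numpy as np

def lvm_em(n, K, iters=50, seed=0):
    """EM for P(d, w) = sum_k P(z_k) P(w|z_k) P(d|z_k) on a count matrix n."""
    rng = np.random.default_rng(seed)
    N, M = n.shape
    Pz = np.full(K, 1.0 / K)                                        # P(z_k)
    Pw_z = rng.random((K, M)); Pw_z /= Pw_z.sum(1, keepdims=True)   # P(w_m | z_k)
    Pd_z = rng.random((K, N)); Pd_z /= Pd_z.sum(1, keepdims=True)   # P(d_n | z_k)
    for _ in range(iters):
        # E-step: P(z_k | d_n, w_m) proportional to P(z_k) P(d_n|z_k) P(w_m|z_k).
        post = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]  # (K, N, M)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate from the expected counts n(d_n, w_m) * posterior.
        nk = n[None, :, :] * post                                   # (K, N, M)
        Pw_z = nk.sum(axis=1); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pd_z = nk.sum(axis=2); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
        Pz = nk.sum(axis=(1, 2)) / n.sum()                          # (1/R) * sum
    return Pz, Pw_z, Pd_z
```

After training, the rows of `Pw_z` give topic words (largest P(w_m | z_k)) and the rows of `Pd_z` give the cluster assignments used in the results below.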


Latent Variable Models: Applications in IR (Experimental Results)

- Topic words extraction and document clustering with a subset of TREC-8 data.
- TREC-8 ad hoc task data:
  - Documents: DTDS, FR94, FT, FBIS, LATIMES
  - Topics: 401-450 (401, 434, 439, 450 used)
    - 401: Foreign Minorities, Germany
    - 434: Estonia, Economy
    - 439: Inventions, Scientific discovery
    - 450: King Hussein, Peace


Latent Variable Models: Applications in IR (Experimental Results)

Cluster labels (each document assigned to the z_k with maximum P(d_i | z_k)):

  Topic (#Docs)   z1    z3    z4    z2    Precision   Recall
  401 (300)       20    0     1     279   0.902       0.930
  434 (347)       79    10    238   20    0.996       0.686
  439 (219)       9     203   0     7     0.953       0.927
  450 (293)       290   0     0     3     0.729       0.990

Extracted topic words (top 35 words with highest P(w_j | z_k)):

  Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east, washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...
  Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern, asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west, east, member, attack, ...
  Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment, electr, process, product, power, energi, countrol, japan, pollution, structur, chemic, plant, ...
  Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade, million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan, invest, fund, product, ...


Boosting: Algorithms

- A general method of converting rough rules into a highly accurate prediction rule.
- Learning procedure (sketched in code below):
  - Examine the training set.
  - Derive a rough rule (weak learner).
  - Re-weight the examples in the training set, concentrating on the hard cases for the previous rules.
  - Repeat T times.

[Diagram: learners are trained in sequence on importance-weighted training documents, producing hypotheses h_1, h_2, h_3, h_4 that are combined into a final rule f(h_1, h_2, h_3, h_4).]
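An AdaBoost-style sketch of this loop; the weak-learner trainer `weak_fit(X, y, w)` returning a callable ±1 predictor is an assumed interface, and the exponential reweighting is my choice since the slide does not fix a particular scheme.

```python
import numpy as np

def boost(X, y, weak_fit, T=10):
    """y in {-1, +1}. Returns f(x) = sign(sum_t alpha_t * h_t(x))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # importance weights over examples
    hs, alphas = [], []
    for _ in range(T):
        h = weak_fit(X, y, w)                   # derive a rough rule on weighted data
        err = np.clip(np.sum(w * (h(X) != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak hypothesis
        w *= np.exp(-alpha * y * h(X))          # concentrate on the hard cases
        w /= w.sum()
        hs.append(h); alphas.append(alpha)
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alphas, hs)))
```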


Boosting: Applied to Text Filtering

- Naïve Bayes
  - A traditional algorithm for text filtering.
  - Assumes independence among terms:

    c^* = \arg\max_{c_j \in \{\text{relevant},\, \text{irrelevant}\}} P(c_j) P(d_i \mid c_j)
        = \arg\max_{c_j} P(c_j) \prod_{k=1}^{n} P(w_k \mid c_j)
        = \arg\max_{c_j} P(c_j) \, P(w_1 = \text{"our"} \mid c_j) \, P(w_2 = \text{"approach"} \mid c_j) \cdots P(w_n = \text{"trouble"} \mid c_j)

- Boosting naïve Bayes
  - Uses naïve Bayes classifiers as weak learners [Kim & Zhang, SIGIR-2000].
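A log-space sketch of this decision rule; the smoothed term estimates in `vocab_probs` and the fallback probability for unseen words are assumptions, since the slide does not specify the estimator.

```python
import math

def nb_classify(words, priors, vocab_probs, default=1e-6):
    """argmax_c log P(c) + sum_k log P(w_k | c) over classes in `priors`."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(vocab_probs[c].get(w, default)) for w in words)
    return max(priors, key=score)

# e.g. nb_classify(["our", "approach", "trouble"], priors, vocab_probs)
# with classes "relevant" / "irrelevant".
```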


Boosting: Applied to Text Filtering (Experimental Results)

- TREC (Text Retrieval Conference), sponsored by NIST.
- TREC-7 filtering datasets
  - Training documents: AP articles (1988), 237 MB, 79,919 documents.
  - Test documents: AP articles (1989-1990), 471 MB, 162,999 documents.
  - No. of topics: 50.
- TREC-8 filtering datasets
  - Training documents: Financial Times (1991-1992), 167 MB, 64,139 documents.
  - Test documents: Financial Times (1993-1994), 382 MB, 140,651 documents.
  - No. of topics: 50.

[The slide also shows an example document.]


Boosting: Applied to Text Filtering (Experimental Results)

Compared with state-of-the-art text filtering systems:

TREC-7:
                        Boosting   ATT     NTT     PIRC
  Averaged Scaled F1    0.474      0.461   0.452   0.500
  Averaged Scaled F3    0.505      0.467   0.460   0.509

TREC-8:
                        Boosting   PLT1    PLT2    PIRC
  Averaged Scaled LF1   0.717      0.712   0.713   0.714

                        Boosting   CL      PIRC    Mer
  Averaged Scaled LF2   0.734      0.722   0.721   0.720


Evolutionary Learning: Applications in IR (Web-Document Retrieval)

- [Kim & Zhang, 2000]

[Diagram: a population of chromosomes, each a weight vector (w_1, w_2, w_3, ..., w_n) over link information and HTML tag information; retrieval results determine each chromosome's fitness, which drives the evolution.]


Evolutionary Learning: Applications in IR (Tag Weighting)

- Crossover: from chromosomes X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), an offspring chromosome Z is formed gene by gene:

    z_i = (x_i + y_i) / 2  with probability P_c

- Mutation: each value of a chromosome X is changed with probability P_m.
- Truncation selection.
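A sketch of these operators on real-valued weight vectors; treating P_c and P_m as per-gene probabilities and using Gaussian perturbation for mutation are my reading of the slide, not stated details.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(x, y, Pc=0.7):
    """z_i = (x_i + y_i) / 2 with probability Pc, else copied from x."""
    mask = rng.random(x.shape) < Pc
    return np.where(mask, (x + y) / 2.0, x)

def mutate(x, Pm=0.01, scale=0.1):
    """Change each gene with probability Pm (Gaussian noise is my choice)."""
    mask = rng.random(x.shape) < Pm
    return np.where(mask, x + rng.normal(scale=scale, size=x.shape), x)

def truncation_select(pop, fitness, frac=0.5):
    """Keep only the top `frac` of the population by fitness."""
    k = max(1, int(len(pop) * frac))
    order = np.argsort(fitness)[::-1]
    return [pop[i] for i in order[:k]]
```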


Evolutionary Learning: Applications in IR (Experimental Results)

- Datasets
  - TREC-8 Web Track data (WT2g): 2 GB, 247,491 web documents.
  - No. of training topics: 10; No. of test topics: 10.
- Results

[Results chart not preserved in the transcript.]

Reinforcement Learning: Basic Concept

[Diagram: the agent-environment loop. (1) The agent observes state s_t and reward r_t; (2) it takes action a_t; (3) the environment returns reward r_{t+1}; (4) and the next state s_{t+1}.]
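A minimal sketch of this interaction loop; the `env` object with `reset()` and `step(action) -> (next_state, reward)` and the learning callback are hypothetical interfaces used only for illustration.

```python
def run_episode(env, policy, update, steps=100):
    """Run one agent-environment episode of the loop in the diagram."""
    state = env.reset()
    for _ in range(steps):
        action = policy(state)                      # 2. act from the current state
        next_state, reward = env.step(action)       # 3-4. observe reward and next state
        update(state, action, reward, next_state)   # learn from the transition
        state = next_state
```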


Reinforcement Learning: Applications in IR (Information Filtering) [Seo & Zhang, 2000]

[Diagram: the filtering agent WAIR interacts with the user. (1) State_i: the current user profile. (2) Action_i: modify the profile, then retrieve documents and calculate similarity to filter them. (3) Reward_{i+1}: the user's relevance feedback on the filtered documents. (4) State_{i+1}: the updated profile.]

Reinforcement Learning: Experimental Results (Explicit Feedback)

[Performance chart (y-axis in %) not preserved in the transcript.]

Reinforcement Learning: Experimental Results (Implicit Feedback)

[Performance chart (y-axis in %) not preserved in the transcript.]

