Toward Automatic Speech Act Discovery. email newsgroups forums blogs.

Date post: 17-Dec-2015
Upload: marilynn-molly-bruce
Page 1: Toward Automatic Speech Act Discovery. email newsgroups forums blogs.

Toward Automatic Speech Act Discovery

• email
• newsgroups
• forums
• blogs

Data Set

• 20 Usenet newsgroups
• The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews”, though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Preprocessing

>> I just wonder if this will also cause a divergence between commercial
>> and non-commercial software (ie. you will only get free software using
>> Athena or OpenLook widget sets, and only get commercial software using
>> the Motif widget sets).
>>
> I can't see why. If just about every workstation will come with Motif
> by default and you can buy it for under $100 for the "free" UNIX
> platforms, I can't see this causing major problems.

Let me add another of my concerns: Yes, I can buy a port of Motif for "cheap",
but I cannot get the source for "cheap", hence I am limited to using whatever X
libraries the Motif port was compiled against (at least with older versions of
Motif. I have been told that Motif 1.2 can be used with any X, but I have not
seen it myself).

• Section the text into “levels” (quote depth, i.e. the number of leading > markers)
• Level < previous level = reply to previous message
• Level > previous level = new message
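The level rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the slides do not show the actual implementation, the function names `quote_level`, `section_by_level`, and `pair_messages` are hypothetical, and the level is taken to be the count of leading `>` characters.

```python
import re

def quote_level(line):
    """Count leading '>' markers, ignoring spaces between them."""
    prefix = re.match(r"[>\s]*", line).group(0)
    return prefix.count(">")

def section_by_level(text):
    """Group consecutive lines with the same quote level into (level, text) sections."""
    sections = []
    for line in text.splitlines():
        level = quote_level(line)
        body = re.sub(r"^[>\s]*", "", line).strip()
        if not body:
            continue  # skip blank or marker-only lines
        if sections and sections[-1][0] == level:
            sections[-1][1].append(body)
        else:
            sections.append((level, [body]))
    return [(level, " ".join(lines)) for level, lines in sections]

def pair_messages(sections):
    """Pair adjacent sections as (message, response) when the level drops."""
    pairs = []
    for (prev_level, prev_text), (level, text) in zip(sections, sections[1:]):
        if level < prev_level:  # lower level = reply to the previous message
            pairs.append((prev_text, text))
    return pairs
```

On the post shown above this would yield three sections (levels 2, 1, 0) and two message/response pairs.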

Also:

• Remove headers

Xref: cantaloupe.srv.cs.cmu.edu comp.windows.x:66928 comp.windows.x.apps:2487
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!howland.reston.ans.net!gatech!asuvax!chnews!tmcconne
From: [email protected] (Tom McConnell~)
Newsgroups: comp.windows.x,comp.windows.x.apps
Subject: Re: Motif vs. [Athena, etc.]
Date: 16 Apr 1993 20:14:04 GMT
Organization: Intel Corporation
Lines: 44
Sender: tmcconne@sedona (Tom McConnell~)
Distribution: world
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]> <[email protected]>
NNTP-Posting-Host: thunder.intel.com
Originator: tmcconne@sedona
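A minimal sketch of header removal, assuming each post follows the usual news/mail layout in which a blank line separates the header block from the body. The helper name `strip_headers` is hypothetical and not taken from the slides.

```python
def strip_headers(raw_post):
    """Return the body of a post: everything after the first blank line."""
    normalized = raw_post.replace("\r\n", "\n")
    head, sep, body = normalized.partition("\n\n")
    # Assumption: with no blank line at all, treat the whole text as body
    return body if sep else normalized
```

Splitting at the first blank line keeps the sketch independent of which header fields happen to be present.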

Also:

• Remove signatures

Cheers,

Tom McConnell
--
Tom McConnell          | Internet: [email protected]
Intel, Corp. C3-91     | Phone: (602)-554-8229
5000 W. Chandler Blvd. | The opinions expressed are my own. No one in
Chandler, AZ 85226     | their right mind would claim them.

• Look for ---*
• Doesn't always match

• First paragraph only
• Might miss important content
• Sometimes grabs greetings (e.g. “Hi,\n”)
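Both signature heuristics above can be sketched as follows. The sketch reads `---*` as a regular expression (two or more dashes) and assumes the delimiter sits alone on its own line; the function names are hypothetical.

```python
import re

# "---*" read as a regex: two or more dashes, alone on a line
SIG_DELIMITER = re.compile(r"^--+[ \t]*$", re.MULTILINE)

def strip_signature(body):
    """Cut the body at the first delimiter line; leave it alone if none is found."""
    match = SIG_DELIMITER.search(body)
    return body[:match.start()].rstrip() if match else body.rstrip()

def first_paragraph(body):
    """Keep only the first non-empty paragraph of the body."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return paragraphs[0] if paragraphs else ""
```

Calling `first_paragraph("Hi,\n\nI have a question.")` returns just `"Hi,"`, which is the greeting problem noted above.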

Preprocessing

• Bi- and tri-grams
• Tag start of sentence with ^
• Force “not” to join with adjacent n-grams

• e.g. “There is not a way to do that” becomes: ^there_is_not not_a_way a_way way_to to_do do_that
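One rule that reproduces the example above is sketched here. The slides do not spell out the exact joining rule, so this is an assumption: emit bigrams, extend a bigram to a trigram when it is immediately followed by, or starts with, “not”, and drop the bigram that merely ends in “not”. The name `negation_ngrams` is hypothetical.

```python
def negation_ngrams(sentence):
    """Bigrams with a '^' sentence-start tag, where 'not' is glued to its neighbours."""
    tokens = sentence.lower().split()
    if not tokens:
        return []
    tokens[0] = "^" + tokens[0]  # tag the start of the sentence
    grams = []
    for i in range(len(tokens) - 1):
        gram = tokens[i:i + 2]
        if gram[1] == "not" and i > 0:
            continue  # already covered: the previous bigram absorbed "not"
        has_next = i + 2 < len(tokens)
        if has_next and (tokens[i + 2] == "not" or gram[0] == "not"):
            gram = tokens[i:i + 3]  # join "not" with the adjacent bigram
        grams.append("_".join(gram))
    return grams

print(negation_ngrams("There is not a way to do that"))
# ['^there_is_not', 'not_a_way', 'a_way', 'way_to', 'to_do', 'do_that']
```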

Text Modeling and Topic Discovery

• Assume words and/or documents belong to some class/topic

• Assume words are conditionally independent given the class/topic

• P(w|z)

Naïve Bayes

• Each document belongs to one class
• P(d|z) = \prod_{w \in d} P(w|z)

Naïve Bayes - Inference

• Expectation-Maximization
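As a rough illustration of EM for an unlabeled Naïve Bayes model (a mixture of multinomials), the sketch below alternates between re-estimating P(z) and P(w|z) from soft assignments and recomputing P(z|d). The smoothing constant, the random initialization, and the function name `em_naive_bayes` are assumptions made for the example, not details from the slides.

```python
import math
import random
from collections import Counter

def em_naive_bayes(docs, num_classes, iterations=50, smoothing=0.1, seed=0):
    """Fit P(z) and P(w|z) by EM. docs is a list of token lists.

    Returns (prior, word_probs, resp) where resp[d][z] approximates P(z|d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    # Random soft assignments to break the symmetry between classes
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(num_classes)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iterations):
        # M-step: smoothed estimates of P(z) and P(w|z)
        denom = len(docs) + num_classes * smoothing
        prior = [(sum(r[z] for r in resp) + smoothing) / denom
                 for z in range(num_classes)]
        word_probs = []
        for z in range(num_classes):
            weighted = {w: smoothing for w in vocab}
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    weighted[w] += r[z] * n
            total = sum(weighted.values())
            word_probs.append({w: v / total for w, v in weighted.items()})
        # E-step: P(z|d) is proportional to P(z) * prod_w P(w|z), in log space
        resp = []
        for c in counts:
            logs = [math.log(prior[z])
                    + sum(n * math.log(word_probs[z][w]) for w, n in c.items())
                    for z in range(num_classes)]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp.append([x / total for x in weights])
    return prior, word_probs, resp
```

Working in log space and subtracting the maximum before exponentiating avoids underflow on longer documents. EM only finds a local optimum, so results can depend on the seed.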

Latent Semantic Indexing / Latent Dirichlet Allocation

• Each document contains multiple topics
• P(d) = \prod_{w \in d} \sum_z P(w|z) P(z|d)
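To make the multiple-topic likelihood concrete, the sketch below scores a document by mixing the per-topic word distributions with the document's own topic weights, word by word. The dictionaries and the name `doc_log_likelihood` are hypothetical toy inputs, not parameters from the talk.

```python
import math

def doc_log_likelihood(doc, topic_given_doc, word_given_topic):
    """log P(d) = sum over words of log( sum_z P(w|z) * P(z|d) )."""
    total = 0.0
    for w in doc:
        p_w = sum(p_z * word_given_topic[z].get(w, 0.0)
                  for z, p_z in topic_given_doc.items())
        if p_w == 0.0:
            return float("-inf")  # word impossible under every topic
        total += math.log(p_w)
    return total

# Toy parameters (assumed for illustration only)
word_given_topic = {"question": {"how": 0.5, "why": 0.5}, "thanks": {"thanks": 1.0}}
topic_given_doc = {"question": 0.5, "thanks": 0.5}
ll = doc_log_likelihood(["how", "thanks"], topic_given_doc, word_given_topic)
```

Unlike the single-class model, each word here can be explained by a different topic, which is why the sum over z sits inside the product over words.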

Model for Conversational Text

• Message m
• Response r
• P(m,r|z) = P(m|z) P(r|z)
• P(r|m) \propto \sum_z P(z) P(m|z) P(r|z)
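A small sketch of how the message/response model could be used once its parameters are known. It assumes separate word distributions for messages and responses under each class z, word independence given z, and a small floor probability for unseen words; all names and the toy numbers are hypothetical.

```python
def _likelihood(tokens, probs, floor=1e-6):
    """P(tokens|z) as a product of word probabilities (floor for unseen words)."""
    p = 1.0
    for w in tokens:
        p *= probs.get(w, floor)
    return p

def joint_scores(message, response, prior, msg_probs, resp_probs):
    """P(z) * P(m|z) * P(r|z) for every class z."""
    return {z: prior[z]
               * _likelihood(message, msg_probs[z])
               * _likelihood(response, resp_probs[z])
            for z in prior}

def posterior_over_classes(message, response, prior, msg_probs, resp_probs):
    """P(z|m,r), obtained by normalizing the joint scores."""
    scores = joint_scores(message, response, prior, msg_probs, resp_probs)
    total = sum(scores.values())
    return {z: s / total for z, s in scores.items()}

def response_score(message, response, prior, msg_probs, resp_probs):
    """Unnormalized P(r|m): the joint scores summed over z."""
    return sum(joint_scores(message, response, prior, msg_probs, resp_probs).values())

# Toy parameters (assumed for illustration only)
prior = {"qa": 0.5, "thanks": 0.5}
msg_probs = {"qa": {"how": 0.9, "great": 0.1}, "thanks": {"how": 0.1, "great": 0.9}}
resp_probs = {"qa": {"try": 0.9, "welcome": 0.1}, "thanks": {"try": 0.1, "welcome": 0.9}}
post = posterior_over_classes(["how"], ["try"], prior, msg_probs, resp_probs)
```

With these toy numbers the pair (“how”, “try”) is assigned almost entirely to the `qa` class, and “try” scores higher than “welcome” as a response to “how”.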

Example

Classification Performance

• Labeled ~100 messages with speech acts

– M/R model: 40-60%
– Single-message NB: 20-30%

• Need more labels

