
Noisy Text Analytics: An Exercise in Futility? Lopresti, January 2007. AND Workshop on Analytics for Noisy Unstructured Text Data.

Noisy Text Analytics: An Exercise in Futility?

Workshop on Analytics for Noisy Unstructured Text Data
Panel Session, January 8, 2007

Sreeram Balakrishnan, IBM Research
Hwee Tou Ng, National Univ. of Singapore
Rohini Srihari, Janya Inc.
Daniel Lopresti (mod), Lehigh Univ.



Panel Session

Each panelist will make a brief presentation. Please think of questions to discuss. What's your opinion?

AND workshop has attracted researchers working on noisy text analytics from a variety of perspectives.

Some reports of success, other promising first steps.

Still, there are many remaining unsolved issues.

Major hurdles include inherent complexities of human language, wide range of sources for noise.

Panel session: is the challenge never-ending and pointless? Or are there reasons to be hopeful?



What's this?

Mwentxeth International JJomtA Conf r n | | t fl la! | lnt II: encie| 9 | | | | Ll 9 | | 6 | 9 | |^~R= | | | | | j | | | |^~R- | ||^~R. || |^~R| ||IJCAI-2007 'Wurksho | on Analytics far Noisy Unstructured Text Data sig?Hyderabad, India - January 8, 2007Noisy unstructured text data is found in informal settings such as online chat, SMS. emeils. IHome message boards. newsgroups, blogs, wilds and web pages Also, text produced by processing Newsspontaneaus; speech, printed text. handwritten text contains processing noise. Text produced || under such circumstances is typically highly noisy containing spelling errors, abbreviations` W tch this S ace for |P d. non-standard words. fates starts, repetitions. missing punctuetions, missimg case information, a I t t P || pause filling words such as ^um` and huh." Such text can be seen in ierge amounts in contact a as news' 3|:Can for pagers centers` online chat rooms, OCRed text documents, SMS gorpug etcffhe theme 'of the IJCAI . 01/08 Worksho200`? Conference as "A! and its benetits to socaety." In keepmg with thss theme, this workshop | Proceedings online`tm ortantlgates proposes to look at text enemies of highly noasy text that Is produced tn such everyday|applications in society. more |peogle On 07 Jan O | The goal of the workshop is to focus on the problems encountered in anaiyzing such || noisy documents coming from various sources, The nature of the text warrants moving |Mendanae beyond traditional text anamics techniques` We hope that the workshop will allow researchers announched- | to present current research and development in addressing this challenge. We aiso betieve that | | by |

Contact as a result ofthis workshop mere Ma be snaring of real Me noisy data sets and Wm resux: in their Gerald Detwngd

becoming aveilebfe to a wider research community. gotential dagasets | | Decisions 1

t A emalled to authorsp |



Figure: AND homepage, printed and scanned, with two zoomed-in views (Zoom 1, Zoom 2).



Not fair?

Binarized

| liiil i El| LJ Fm Eltr'lq liIi1il7J r'E lid tEii=Y=iI lid lE1ilE i ElFFm | El El IE lgg | [Iil lijil IE r'lij El _ Fm |El lgg rilijil Lm pil El _El pil liiil r*mtlE Fl | liiil LJ El El riil | | | Fl , rjil Fi | lid 11LJ Fm lid | F El Lm liii Fl liii i Fliji LJ FFm eEl1ilE Fm li;i | El i El 1Fm liiil FIY eEltlE Fm lid IE rilij |liiil r'lij El I flE I El | eEl1ilE ripil la LJ El | fi I I i Fl s;] |liiil r'lij El El Lm liji Fl IE El " LJ Fliji | | | El I liiil Fm? I i Fm | liii Fm lE1i r'lijil liiil FFI El . liijiil Iii| mijim | | liijii liiil rFmfE r'E Fm | | i El | | | I IE Fm lid i tEpil r'liiil riil liiil El | El tliiil I liiil liiil I-lii lat tEii=Y=iI IE Fm IEIE pil pil I i liii lE1ii liiil Fm El i Fm El liiil liji i | i FFm liiil F|| | | tl1e \!\i=::=FI-<S:l1a':l= is:r143liS>i Cl4gl4:LlrI1er1tS: 4:C>l11iI'lg f~F[Iil |lijil Fm lid tr'la lid i ti lijil Fm IE I tEii=Y=iI IE Fm IE I'3.-ti liii Etliiil pil r'E El | Fit liii LJ r' r'E Fit r'E El | | | Fm IE Fm lidIE El IE r'E El LJ It liiilf 1iFm i El |liiil r' I-lii El Fm liiil riil 1iFl | rbil | liii liiil FFI i FI s;] lE|lE i I IE bil I | tliiil IE -l_.--._l-i lid | | r'

OCR Result

Zoom



What's going on here?

Document image analysis is still a research topic.

Complex layouts (e.g., multicolumn) are hard.

Noisy inputs problematic for character recognition.

Segmentation (character / word) often error-prone.

Recognition sensitive to skew, font, resolution.

Handwriting is even more difficult.

By exploiting redundancy, some analytics tasks (e.g., IR, text categorization) still feasible.

Other tasks likely to remain unsolved for a long time.
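One way to see why redundancy helps: even when individual characters are corrupted, many short character sequences survive, so a noisy passage still resembles its clean source far more than it resembles unrelated text. The sketch below illustrates this with character trigram overlap (Jaccard similarity). The function names and the choice of trigrams are assumptions made for illustration, not something taken from the slides; the clean and noisy strings are excerpts from the Moby-Dick example used later in this deck.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams, with whitespace collapsed."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


clean = "I thought I would sail about a little and see the watery part of the world."
noisy = "I thoug_t I would sail _boUt a _;tt1e and see _e watery p_ or the world."
unrelated = "Faculty and graduate students are spread over three wooded campuses."

sim_noisy = jaccard(char_ngrams(clean), char_ngrams(noisy))
sim_unrelated = jaccard(char_ngrams(clean), char_ngrams(unrelated))

# The noisy copy should score well above the unrelated sentence.
print(f"clean vs noisy:     {sim_noisy:.2f}")
print(f"clean vs unrelated: {sim_unrelated:.2f}")
```

Tasks such as retrieval or categorization only need this kind of coarse resemblance, which is plausibly why they tolerate OCR noise better than tasks that depend on every token being right.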



The big question

Will noisy text like this ever be tractable?

Note: it is very easy to say “Just make the OCR better,” but this has proven to be a very hard problem and is likely to take a very long time.






Text Processing Stages: Functions

Processing Stage: Intended Function

Optical character recognition: Transcribe input bitmap into encoded text (hopefully accurately).

Sentence boundary detection: Break input into sentence-sized units, one per text line.

Tokenization: Break each sentence into word (or word-like) tokens delimited by white space.

Part-of-speech tagging: Takes tokenized text and attaches label to each token indicating its part-of-speech.

“Performance Evaluation for Text Processing of Noisy Inputs,” Daniel Lopresti, Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), March 2005, Santa Fe, NM, pp. 759-763.
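The three text stages after OCR can be sketched as a small pipeline. This is a toy illustration only: the regular expressions and the lookup lexicon below are hypothetical stand-ins, not the tools used in the study, and a real tagger is statistical rather than a dictionary lookup. The tags in the lexicon are copied from the clean-input tagging output shown later in the deck.

```python
import re

# Hypothetical toy lexicon; tags copied from the clean-input slide output.
TOY_LEXICON = {
    "chapter": "NNP", "1": "CD", "loomings": "NNS",
    "call": "VB", "me": "PRP", "ishmael": "NNP", ".": ".",
}


def split_sentences(text):
    """Break text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Put spaces around punctuation so tokens are whitespace-delimited."""
    spaced = re.sub(r"(--|[.,;:!?])", r" \1 ", sentence)
    return " ".join(spaced.split())


def tag(tokenized_sentence):
    """Attach a tag to each token as token_TAG; unknown tokens default to NN."""
    return " ".join(
        f"{tok}_{TOY_LEXICON.get(tok.lower(), 'NN')}"
        for tok in tokenized_sentence.split()
    )


text = "CHAPTER 1 Loomings. Call me Ishmael."
sentences = split_sentences(text)            # one sentence per list entry
tokenized = [tokenize(s) for s in sentences]  # whitespace-delimited tokens
tagged = [tag(t) for t in tokenized]          # token_TAG pairs

for line in tagged:
    print(line)
```

Each stage consumes the previous stage's output, which is the property that makes the pipeline sensitive to upstream errors.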



Text Processing Stages: Problems

Processing Stage: Potential Problem(s)

Optical character recognition: Current OCR is “brittle”; errors made early on propagate to later stages.

Sentence boundary detection: Missing or spurious sentence boundaries due to OCR errors on punctuation.

Tokenization: Missing or spurious tokens due to OCR errors on whitespace and punctuation.

Part-of-speech tagging: Bad PoS tags due to failed tokenization or OCR errors that alter orthographies.
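To make the propagation concrete, the sketch below runs a naive regex sentence splitter and tokenizer (hypothetical stand-ins, not the tools used in the study) over the first lines of the Moby-Dick example, once on the clean text and once on the light-photocopy OCR output shown in this deck. Spurious punctuation in the OCR output changes both the sentence count and the token count.

```python
import re


def split_sentences(text):
    """Naive boundary detection: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: isolate punctuation, then split on whitespace."""
    return re.sub(r"(--|[.,;:!?])", r" \1 ", sentence).split()


def stage_counts(text):
    """Return (number of sentences, total number of tokens) for the text."""
    sentences = split_sentences(text)
    return len(sentences), sum(len(tokenize(s)) for s in sentences)


clean = "CHAPTER 1 Loomings. Call me Ishmael."
noisy = "' cH__' R l ' . _omings. , call me IshMael."

clean_counts = stage_counts(clean)
noisy_counts = stage_counts(noisy)

print("clean (sentences, tokens):", clean_counts)
print("noisy (sentences, tokens):", noisy_counts)
```

The extra period and comma introduced by OCR yield an extra sentence and several extra tokens, all of which a downstream tagger would then have to label.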




Problems 1

Sentence boundary detection results for clean input:

CHAPTER 1 Loomings.

Call me Ishmael.

Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

Results for noisy input (light photocopy):

' cH__' R l ' .

_omings.

, call me IshMael.

soMe ye_s ago--never mind how long , p,ec;sely__hav;ng _;€tle or no _oney in my purse, and nothing p_;,u__ to ;,terest Me on shore, I thoug_t I would sail _boUt a _;tt1e and see _e watery p_ or the world.

Note: 3 sentences (clean) vs. 4 sentences (noisy).




Problems 2

Tokenization results for clean input:

CHAPTER 1 Loomings .

Call me Ishmael .

Some years ago -- never mind how long precisely -- having little or no money in my purse , and nothing particular to interest me on shore , I thought I would sail about a little and see the watery part of the world .

Results for noisy input (light photocopy):

' cH__ ' R l ' .

_omings .

, call me IshMael .

soMe ye_s ago -- never mind how long , p , ec ; sely__hav ; ng _ ; €tle or no _oney in my purse , and nothing p_ ; , u__ to ; , terest Me on shore , I thoug_t I would sail _boUt a _ ; tt1e and see _e watery p_ or the world .




Problems 3

Part-of-speech tagging results for clean input:

CHAPTER_NNP 1_CD Loomings_NNS ._.

Call_VB me_PRP Ishmael_NNP ._.

Some_DT years_NNS ago_RB --_: never_RB mind_VB how_WRB long_JJ precisely_RB --_: having_VBG little_JJ or_CC no_DT money_NN in_IN my_PRP$ purse_NN ,_, and_CC nothing_NN particular_JJ to_TO interest_VB me_PRP on_IN shore_NN ,_, I_PRP thought_VBD I_PRP would_MD sail_VB about_IN a_DT little_JJ and_CC see_VB the_DT watery_JJ part_NN of_IN the_DT world_NN ._.

Results for noisy input (light photocopy):

'_POS cH___NNS '_POS R_`` l_NNS '_'' ._.

_omings_NNS ._.

,_, call_VBP me_PRP IshMael_NNP ._.

soMe_JJ ye_s_NNS ago_RB --_: never_RB mind_VB how_WRB long_JJ ,_, p_NNP ,_, ec_NNP ;_: sely__hav_NNP ;_: ng_NNP __NNP ;_: €tle_NNP or_CC no_DT _oney_NN in_IN my_PRP$ purse_NN ,_, and_CC nothing_NN p__NN ;_: ,_, u___JJ to_TO ;_: ,_, terest_NN Me_NN on_IN shore_NN ,_, I_PRP thoug_t_VBP I_PRP would_MD sail_VB _boUt_VBN a_DT __NN ;_: tt1e_JJ and_CC see_VBP _e_JJ watery_NN p__, or_CC the_DT world_NN ._.




Test Conditions


Optical character recognition: Open Source gocr package. http://jocr.sourceforge.net/index.html (Joerg Schulenburg et al.)

Sentence boundary detection: MXTERMINATOR. “A Maximum Entropy Approach to Identifying Sentence Boundaries,” J. C. Reynar and A. Ratnaparkhi, Proc. 5th Conf. on Applied Natural Language Processing, 1997.

Tokenization: Penn Treebank tokenizer. http://www.cis.upenn.edu/~treebank/tokenizer.sed (Robert MacIntyre)

Part-of-speech tagging: MXPOST. “A Maximum Entropy Part-Of-Speech Tagger,” A. Ratnaparkhi, Proc. Empirical Methods in Natural Language Processing Conference, 1996.

Corpus: 10 pages of Project Gutenberg Moby-Dick. http://www.gutenberg.net (Michael Hart et al.)




Average OCR Performance

Figure: bar charts of Precision, Recall, and Overall (scale 0 to 1.2) for Clean, Light, Dark, and Fax inputs, in three panels: All Symbols, Punctuation, Whitespace.

Notes: Baseline high on clean inputs, deteriorates rapidly on noisy inputs. Punctuation especially badly impacted: many false alarms.
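For readers who want to reproduce numbers of this kind, a rough sketch of symbol-level precision and recall follows. It assumes precision is the fraction of OCR output symbols that match the ground truth and recall is the fraction of ground-truth symbols recovered, with matches found by Python's `difflib.SequenceMatcher`. That matching scheme is a simplification chosen for illustration; the slides do not say how matches were counted, and the "Overall" measure is not defined here, so it is not computed.

```python
from difflib import SequenceMatcher


def symbol_precision_recall(truth, ocr_output):
    """Approximate symbol-level precision and recall from matching blocks."""
    matcher = SequenceMatcher(None, truth, ocr_output, autojunk=False)
    # Sum the lengths of all matching runs (the final dummy block has size 0).
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(ocr_output) if ocr_output else 0.0
    recall = matched / len(truth) if truth else 0.0
    return precision, recall


truth = "Call me Ishmael."
ocr_output = ", call me IshMael."  # light-photocopy OCR output from the slides

precision, recall = symbol_precision_recall(truth, ocr_output)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Spurious symbols in the output (false alarms) lower precision without affecting recall, which is consistent with the note about punctuation above.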




Sample Alignment

Applying the hierarchical string matching paradigm, we can recover the correct correspondence between noisy output and original input.

A straightforward example found by the algorithm (figure; callouts mark substitution errors and a token-level segmentation error).
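The idea of recovering a correspondence between noisy output and original input can be illustrated with a flat, single-level token alignment. The sketch below uses `difflib.SequenceMatcher` over token lists; this is a deliberately simpler technique than the hierarchical string matching named on the slide, and the helper name `align_tokens` is hypothetical.

```python
from difflib import SequenceMatcher


def align_tokens(truth, noisy):
    """Return (operation, truth_tokens, noisy_tokens) triples for two token lists."""
    matcher = SequenceMatcher(None, truth, noisy, autojunk=False)
    return [
        (op, truth[i1:i2], noisy[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


truth = "I thought I would sail about a little".split()
noisy = "I thoug_t I would sail _boUt a _;tt1e".split()

alignment = align_tokens(truth, noisy)
# Keep only the spans where the noisy tokens replace the original ones.
substitutions = [(t, n) for op, t, n in alignment if op == "replace"]

for original, corrupted in substitutions:
    print(original, "->", corrupted)
```

A flat alignment like this handles one-for-one substitutions; cases where OCR splits or merges tokens are where a hierarchical (token level, then character level) approach becomes useful.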




Text Processing Performance


Figure: bar charts of Precision, Recall, and Overall (scale 0 to 1.2), in three panels: Sentence Boundaries, Tokenization, PoS Tagging.

Notes: Clean input processed at > 95%; many false alarms in noisy inputs. Performance degrades with each successive stage.




Lehigh University

Key facts about Lehigh:

A research university founded in 1865.

Four colleges: Engineering, Arts & Sciences, Business, Education.

Faculty = 441 full-time.

Graduate students = 2,064.

Undergraduates = 4,577.

Three campuses spread over 1,600 acres (mountain side, wooded).

Located in northeastern U.S. (about 1.5 hours from New York and Philadelphia, 3 hours from Washington, DC).

Engineering College ranked in top 20% of Ph.D.-granting schools in U.S.

University ranked in top 15% of U.S. national universities.

Packard Lab: Home of Computer Science & Engineering



Lehigh University

Map: Lehigh University, with New York (120 km) and Philadelphia (80 km).
