Noisy Text Analytics: An Exercise in Futility? (Lopresti, January 2007)
AND Workshop on Analytics for Noisy Unstructured Text Data
Slide 1
Noisy Text Analytics: An Exercise in Futility?
Workshop on Analytics for Noisy Unstructured Text Data, Panel Session, January 8, 2007
Sreeram Balakrishnan, IBM Research
Hwee Tou Ng, National Univ. of Singapore
Rohini Srihari, Janya Inc.
Daniel Lopresti (mod), Lehigh Univ.
Slide 2
Panel Session
Each panelist will make a brief presentation. Please think of questions to discuss. What's your opinion?
AND workshop has attracted researchers working on noisy text analytics from a variety of perspectives.
Some reports of success, other promising first steps.
Still, there are many remaining unsolved issues.
Major hurdles include inherent complexities of human language, wide range of sources for noise.
Panel session: is the challenge never-ending and pointless? Or are there reasons to be hopeful?
Slide 3
What's this?
Mwentxeth International JJomtA Conf r n | | t fl la! | lnt II: encie| 9 | | | | Ll 9 | | 6 | 9 | |^~R= | | | | | j | | | |^~R- | ||^~R. || |^~R| ||IJCAI-2007 'Wurksho | on Analytics far Noisy Unstructured Text Data sig?Hyderabad, India - January 8, 2007Noisy unstructured text data is found in informal settings such as online chat, SMS. emeils. IHome message boards. newsgroups, blogs, wilds and web pages Also, text produced by processing Newsspontaneaus; speech, printed text. handwritten text contains processing noise. Text produced || under such circumstances is typically highly noisy containing spelling errors, abbreviations` W tch this S ace for |P d. non-standard words. fates starts, repetitions. missing punctuetions, missimg case information, a I t t P || pause filling words such as ^um` and huh." Such text can be seen in ierge amounts in contact a as news' 3|:Can for pagers centers` online chat rooms, OCRed text documents, SMS gorpug etcffhe theme 'of the IJCAI . 01/08 Worksho200`? Conference as "A! and its benetits to socaety." In keepmg with thss theme, this workshop | Proceedings online`tm ortantlgates proposes to look at text enemies of highly noasy text that Is produced tn such everyday|applications in society. more |peogle On 07 Jan O | The goal of the workshop is to focus on the problems encountered in anaiyzing such || noisy documents coming from various sources, The nature of the text warrants moving |Mendanae beyond traditional text anamics techniques` We hope that the workshop will allow researchers announched- | to present current research and development in addressing this challenge. We aiso betieve that | | by |
Contact as a result ofthis workshop mere Ma be snaring of real Me noisy data sets and Wm resux: in their Gerald Detwngd
becoming aveilebfe to a wider research community. gotential dagasets | | Decisions 1
t A emalled to authorsp |
Slide 4
AND Homepage, Printed & Scanned
[Figure: scanned page image with two magnified regions, Zoom 1 and Zoom 2.]
Slide 5
Not fair?
Binarized
| liiil i El| LJ Fm Eltr'lq liIi1il7J r'E lid tEii=Y=iI lid lE1ilE i ElFFm | El El IE lgg | [Iil lijil IE r'lij El _ Fm |El lgg rilijil Lm pil El _El pil liiil r*mtlE Fl | liiil LJ El El riil | | | Fl , rjil Fi | lid 11LJ Fm lid | F El Lm liii Fl liii i Fliji LJ FFm eEl1ilE Fm li;i | El i El 1Fm liiil FIY eEltlE Fm lid IE rilij |liiil r'lij El I flE I El | eEl1ilE ripil la LJ El | fi I I i Fl s;] |liiil r'lij El El Lm liji Fl IE El " LJ Fliji | | | El I liiil Fm? I i Fm | liii Fm lE1i r'lijil liiil FFI El . liijiil Iii| mijim | | liijii liiil rFmfE r'E Fm | | i El | | | I IE Fm lid i tEpil r'liiil riil liiil El | El tliiil I liiil liiil I-lii lat tEii=Y=iI IE Fm IEIE pil pil I i liii lE1ii liiil Fm El i Fm El liiil liji i | i FFm liiil F|| | | tl1e \!\i=::=FI-<S:l1a':l= is:r143liS>i Cl4gl4:LlrI1er1tS: 4:C>l11iI'lg f~F[Iil |lijil Fm lid tr'la lid i ti lijil Fm IE I tEii=Y=iI IE Fm IE I'3.-ti liii Etliiil pil r'E El | Fit liii LJ r' r'E Fit r'E El | | | Fm IE Fm lidIE El IE r'E El LJ It liiilf 1iFm i El |liiil r' I-lii El Fm liiil riil 1iFl | rbil | liii liiil FFI i FI s;] lE|lE i I IE bil I | tliiil IE -l_.--._l-i lid | | r'
OCR Result
Zoom
Slide 6
What's going on here?
Document image analysis is still a research topic.
Complex layouts (e.g., multicolumn) are hard.
Noisy inputs problematic for character recognition.
Segmentation (character / word) often error-prone.
Recognition sensitive to skew, font, resolution.
Handwriting is even more difficult.
By exploiting redundancy, some analytics tasks (e.g., IR, text categorization) still feasible.
Other tasks likely to remain unsolved for a long time.
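The point about redundancy can be illustrated with a small sketch. It assumes a character trigram representation and Jaccard similarity, neither of which is named on the slide; they are just one common way to make matching tolerant of OCR errors. The noisy sentence is taken from the light-photocopy example later in this deck, and the unrelated sentence is made up for contrast.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams in text (whitespace collapsed)."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


clean = "I thought I would sail about a little and see the watery part of the world."
noisy = "I thoug_t I would sail _boUt a _;tt1e and see _e watery p_ or the world."
# Hypothetical unrelated sentence, used only as a point of comparison.
other = "Faculty and graduate students are spread over three campuses."

sim_noisy = jaccard(char_ngrams(clean), char_ngrams(noisy))
sim_other = jaccard(char_ngrams(clean), char_ngrams(other))

# Enough trigrams survive the OCR errors that the noisy copy still
# looks far more like the original than the unrelated sentence does.
print(sim_noisy > sim_other)
```

Tasks such as retrieval or categorization only need the aggregate overlap to stay high, which is why they can tolerate errors that would break a tagger or parser.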
Slide 7
The big question
Will noisy text like this ever be tractable?
Note: it is very easy to say “Just make the OCR better,” but this has proven to be a very hard problem and is likely to take a very long time.
Slide 8
Noisy Text Analytics: An Exercise in Futility?
Workshop on Analytics for Noisy Unstructured Text Data, Panel Session, January 8, 2007
Sreeram Balakrishnan, IBM Research
Hwee Tou Ng, National Univ. of Singapore
Rohini Srihari, Janya Inc.
Daniel Lopresti (mod), Lehigh Univ.
Slide 9
Text Processing Stages: Functions
Processing Stage | Intended Function
Optical character recognition | Transcribe input bitmap into encoded text (hopefully accurately).
Sentence boundary detection | Break input into sentence-sized units, one per text line.
Tokenization | Break each sentence into word (or word-like) tokens delimited by white space.
Part-of-speech tagging | Takes tokenized text and attaches label to each token indicating its part-of-speech.
“Performance Evaluation for Text Processing of Noisy Inputs,” Daniel Lopresti, Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), March 2005, Santa Fe, NM, pp. 759-763.
Slide 10
Text Processing Stages: Problems
Processing Stage | Potential Problem(s)
Optical character recognition | Current OCR is “brittle,” errors made early-on propagate to later stages.
Sentence boundary detection | Missing or spurious sentence boundaries due to OCR errors on punctuation.
Tokenization | Missing or spurious tokens due to OCR errors on whitespace and punctuation.
Part-of-speech tagging | Bad PoS tags due to failed tokenization or OCR errors that alter orthographies.
Slide 11
Problems 1
Sentence boundary detection results for clean input:
CHAPTER 1 Loomings.
Call me Ishmael.
Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Results for noisy input (light photocopy):
' cH__' R l ' .
_omings.
, call me IshMael.
soMe ye_s ago--never mind how long , p,ec;sely__hav;ng _;€tle or no _oney in my purse, and nothing p_;,u__ to ;,terest Me on shore, I thoug_t I would sail _boUt a _;tt1e and see _e watery p_ or the world.
Note: 3 sentences vs. 4 sentences.
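The 3-versus-4 sentence mismatch can be reproduced with a deliberately naive splitter. This is only a sketch: it assumes a regular-expression rule (break after `.`, `!` or `?` followed by whitespace) and is not the MXTERMINATOR system used for the reported results. The input strings are the clean and noisy lines from this slide joined with spaces.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


clean = ("CHAPTER 1 Loomings. Call me Ishmael. "
         "Some years ago--never mind how long precisely--having little or no "
         "money in my purse, and nothing particular to interest me on shore, "
         "I thought I would sail about a little and see the watery part of "
         "the world.")

noisy = ("' cH__' R l ' . _omings. , call me IshMael. "
         "soMe ye_s ago--never mind how long , p,ec;sely__hav;ng _;€tle or no "
         "_oney in my purse, and nothing p_;,u__ to ;,terest Me on shore, "
         "I thoug_t I would sail _boUt a _;tt1e and see _e watery p_ or the world.")

print(len(split_sentences(clean)))   # sentences found in the clean text
print(len(split_sentences(noisy)))   # sentences found in the OCR output
```

The extra sentence in the noisy case comes from the spurious period that the OCR inserted inside the chapter heading, which is exactly the kind of punctuation error that propagates to later stages.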
Slide 12
Problems 2
Tokenization results for clean input:
CHAPTER 1 Loomings .
Call me Ishmael .
Some years ago -- never mind how long precisely -- having little or no money in my purse , and nothing particular to interest me on shore , I thought I would sail about a little and see the watery part of the world .
Results for noisy input (light photocopy):
' cH__ ' R l ' .
_omings .
, call me IshMael .
soMe ye_s ago -- never mind how long , p , ec ; sely__hav ; ng _ ; €tle or no _oney in my purse , and nothing p_ ; , u__ to ; , terest Me on shore , I thoug_t I would sail _boUt a _ ; tt1e and see _e watery p_ or the world .
Slide 13
Problems 3
Part-of-speech tagging results for clean input:
CHAPTER_NNP 1_CD Loomings_NNS ._.
Call_VB me_PRP Ishmael_NNP ._.
Some_DT years_NNS ago_RB --_: never_RB mind_VB how_WRB long_JJ precisely_RB --_: having_VBG little_JJ or_CC no_DT money_NN in_IN my_PRP$ purse_NN ,_, and_CC nothing_NN particular_JJ to_TO interest_VB me_PRP on_IN shore_NN ,_, I_PRP thought_VBD I_PRP would_MD sail_VB about_IN a_DT little_JJ and_CC see_VB the_DT watery_JJ part_NN of_IN the_DT world_NN ._.
Results for noisy input (light photocopy):
'_POS cH___NNS '_POS R_`` l_NNS '_'' ._.
_omings_NNS ._.
,_, call_VBP me_PRP IshMael_NNP ._.
soMe_JJ ye_s_NNS ago_RB --_: never_RB mind_VB how_WRB long_JJ ,_, p_NNP ,_, ec_NNP ;_: sely__hav_NNP ;_: ng_NNP __NNP ;_: €tle_NNP or_CC no_DT _oney_NN in_IN my_PRP$ purse_NN ,_, and_CC nothing_NN p__NN ;_: ,_, u___JJ to_TO ;_: ,_, terest_NN Me_NN on_IN shore_NN ,_, I_PRP thoug_t_VBP I_PRP would_MD sail_VB _boUt_VBN a_DT __NN ;_: tt1e_JJ and_CC see_VBP _e_JJ watery_NN p__, or_CC the_DT world_NN ._.
Slide 14
Test Conditions
Optical character recognition | Open Source gocr package. http://jocr.sourceforge.net/index.html (Joerg Schulenburg et al.)
Sentence boundary detection | MXTERMINATOR. “A Maximum Entropy Approach to Identifying Sentence Boundaries,” J. C. Reynar and A. Ratnaparkhi, Proc. 5th Conf. on Applied Natural Language Processing, 1997.
Tokenization | Penn Treebank tokenizer. http://www.cis.upenn.edu/~treebank/tokenizer.sed (Robert MacIntyre)
Part-of-speech tagging | MXPOST. “A Maximum Entropy Part-Of-Speech Tagger,” A. Ratnaparkhi, Proc. Empirical Methods in Natural Language Processing Conference, 1996.
Corpus | 10 pages of Project Gutenberg Moby-Dick. http://www.gutenberg.net (Michael Hart et al.)
Slide 15
Average OCR Performance
[Figure: three bar charts (All Symbols, Punctuation, Whitespace), each showing Precision, Recall, and Overall for Clean, Light, Dark, and Fax inputs; vertical axis from 0 to 1.2.]
Notes: Baseline high on clean inputs, deteriorates rapidly on noisy inputs. Punctuation especially badly impacted: many false alarms.
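For readers who want to see how such precision and recall figures can be computed, here is a minimal sketch. It assumes character-level scoring and uses Python's `difflib.SequenceMatcher` as a stand-in aligner; the evaluation behind the charts may differ in both respects. The "Overall" measure is not defined on the slide, so it is left out. The function name `precision_recall` and the punctuation set are illustrative choices.

```python
from difflib import SequenceMatcher


def precision_recall(truth, ocr, symbols=None):
    """Character-level precision and recall of `ocr` against `truth`.

    A character counts as correct when the aligner pairs it with an
    identical character in the other string. If `symbols` is given, only
    characters in that set are counted (e.g. punctuation or whitespace).
    """
    def wanted(ch):
        return symbols is None or ch in symbols

    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    correct = 0
    for block in matcher.get_matching_blocks():
        segment = truth[block.a:block.a + block.size]
        correct += sum(1 for ch in segment if wanted(ch))

    n_truth = sum(1 for ch in truth if wanted(ch))
    n_ocr = sum(1 for ch in ocr if wanted(ch))
    precision = correct / n_ocr if n_ocr else 0.0
    recall = correct / n_truth if n_truth else 0.0
    return precision, recall


truth = "Call me Ishmael."
ocr = ", call me IshMael."   # light-photocopy OCR output for the same line

p_all, r_all = precision_recall(truth, ocr)
p_punct, r_punct = precision_recall(truth, ocr, symbols=set(".,;:'"))
print(p_all, r_all)
print(p_punct, r_punct)
```

On this one line, the spurious leading comma is a false alarm: it lowers punctuation precision while leaving punctuation recall untouched, the same pattern the notes describe.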
Slide 16
Sample Alignment
Applying hierarchical string matching paradigm, we can recover correct correspondence between noisy output and original input.
A straightforward example found by algorithm:
[Figure: sample alignment between original and noisy text, annotated with a token-level segmentation error and substitution errors.]
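The idea of recovering the correspondence between original and noisy text can be sketched with ordinary edit distance. This is a simplification: it aligns characters in a single flat pass with unit costs and does not reproduce the hierarchical string matching mentioned on the slide. The function name `align` and the operation labels are illustrative.

```python
def align(truth, ocr):
    """Align two strings with unit-cost edit distance.

    Returns a list of (op, truth_char, ocr_char) tuples where op is one of
    'match', 'sub', 'del' (missing from OCR) or 'ins' (spurious in OCR).
    """
    m, n = len(truth), len(ocr)
    # dist[i][j] = edit distance between truth[:i] and ocr[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitute
                             dist[i - 1][j] + 1,         # delete from truth
                             dist[i][j - 1] + 1)         # insert from OCR

    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                op = "match" if cost == 0 else "sub"
                ops.append((op, truth[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", truth[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", ocr[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# "little" as it came out of the light-photocopy OCR run.
ops = align("little", "_;tt1e")
errors = [op for op in ops if op[0] != "match"]
print(errors)
```

For this word the alignment contains three substitutions and three matches. A flat alignment like this handles substitution errors well; segmentation errors that split or merge tokens are where a hierarchical approach would be expected to help.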
Slide 17
Text Processing Performance
[Figure: three bar charts (Sentence Boundaries, Tokenization, PoS Tagging), each showing Precision, Recall, and Overall for Clean, Light, Dark, and Fax inputs; vertical axis from 0 to 1.2.]
Notes: Clean input processed at > 95%; many false alarms in noisy inputs. Performance degrades with each successive stage.
Slide 18
Lehigh University
Key facts about Lehigh:
A research university founded in 1865.
Four colleges: Engineering, Arts & Sciences, Business, Education.
Faculty = 441 full-time.
Graduate students = 2,064.
Undergraduates = 4,577.
Three campuses spread over 1,600 acres (mountain side, wooded).
Located in northeastern U.S. (about 1.5 hours from New York and Philadelphia, 3 hours from Washington, DC).
Engineering College ranked in top 20% of Ph.D.-granting schools in U.S.
University ranked in top 15% of U.S. national universities.
Packard Lab: Home of Computer Science & Engineering
Slide 19
Lehigh University
[Map: Lehigh University, 120 km from New York and 80 km from Philadelphia.]