LREC 2008, Marrakech Morocco - May 30 2008
New Resources for Document Classification, Analysis and Translation Technologies
Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain
{strassel, lf, safa, david4, maeda, brndschn}@ldc.upenn.edu
Linguistic Data Consortium
http://projects.ldc.upenn.edu/MADCAT
Presentation Outline
  MADCAT Program Overview
  Technology Challenges
  Roadmap
  Data Creation
    Phase 1 Data Profile
    Processing
    Collection
    Annotation
  Data Format
  Evaluation
  Conclusions and Future Work
MADCAT Overview
  MADCAT: Multilingual Document Classification Analysis and Translation
  A 5-year DARPA program
  MADCAT technologies will convert foreign language document images into English text, enabling English speakers to extract, assess, and respond to information in a timely manner
  Multiple input types and domains
    Hard-copy, PDF, camera-captured
    Newspapers, letters, signs, graffiti, how-to manuals, memos, postcards, forms, diaries, ledgers, etc.
Technology Challenges
  Extract relevant metadata about the document structure
  Integrate and optimize page segmentation, metadata extraction, OCR and translation technologies
  Create end-to-end system for deployment at program’s end with over 90% accuracy
    Current baseline is ~2%
  Primary evaluation metric is edit distance: HTER
    Same protocols as used in the GALE program
  Limited focus in Phase 1
    Arabic > English
    High resolution (600 dpi) images of handwritten newspaper and web text
    Topics primarily news, current events and commentary
    Manual segmentation provided
Roadmap
  [Figure: roadmap chart. Columns by phase; rows labeled Genre, Topic, Medium, Source, Data Quality.]
  Pre-MADCAT: State of the Art
  Phase 1: Add handwriting
  Phase 2-3: New data types
  Phase 4-5: New genres, topics, quality conditions
  Cell labels appearing in the chart:
    Newswire, Broadcast, Talk Shows, Weblogs
    Letters, Forms, Ledgers, Diaries, Calendars, Maps, Poems, Verdicts, Books, Training Manuals, Instructions, Personal Identification
    News, Commentary, Science, Engineering, Personal, Religious, Military, Other
    Printed, Handwritten
    Controlled, Uncontrolled
Phase 1 Data Profile
  In Phase 1, data drawn from DARPA GALE program
    New collection to acquire handwritten versions
    Genres: Formal text (newswire) and informal text (weblogs)
  Benefits
    Eliminates domain mismatch between GALE state of the art MT models and MADCAT test sets
    Allows developers to focus on primary challenge: handwriting
    Data characteristics well understood, cost and time factors are reasonably well known
    Training data costs controlled since translations exist
    Production begins immediately, training data available sooner
    Provides controlled test sets for evaluation across programs
  Subsequent phases will add new data types, genres and other challenge elements
Training and DevTest
  Training
    Minimum 2000 unique pages
      • Half formal (newswire), half informal (web text)
      • 100-250 words per page
    Minimum 100 unique scribes in training pool
    5 scribes per page
    At minimum 10,000 manuscripts (scribe-pages) in Phase 1 training set
  DevTest
    320 unique pages
      • Half formal (newswire), half informal (web text)
      • 125 words/page
    50 scribes in devtest pool
      • 25 from training, 25 previously unseen
    2 scribes per page, ~7 pages per scribe
    Total of 640 manuscripts; 80,000 words
Evaluation Data
  320 unique pages from GALE P3 Eval set
    Half formal (newswire), half informal (web text)
    125 words/page
  50 scribes in eval partition
    25 from training, 25 previously unseen
  6 scribes per page, ~40 pages per scribe
  Total of 1920 manuscripts, 240,000 words
  Subset of eval set designated for pilot evaluation in September 2008
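The manuscript and word totals quoted for the training, DevTest and evaluation sets follow from pages × scribes per page (× words per page). The sketch below recomputes them in Python as a sanity check. The `Partition` class and its field names are illustrative assumptions, not part of any LDC tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Partition:
    """Hypothetical record for one data partition from the slides."""
    name: str
    unique_pages: int
    scribes_per_page: int
    # None when the slides give only a range (training: 100-250 words/page).
    words_per_page: Optional[int] = None

    @property
    def manuscripts(self) -> int:
        # One manuscript (scribe-page) per page per scribe.
        return self.unique_pages * self.scribes_per_page

    @property
    def words(self) -> Optional[int]:
        if self.words_per_page is None:
            return None
        return self.manuscripts * self.words_per_page


# Figures copied from the slides; training uses the stated minimums.
TRAINING = Partition("training", unique_pages=2000, scribes_per_page=5)
DEVTEST = Partition("devtest", unique_pages=320, scribes_per_page=2, words_per_page=125)
EVAL = Partition("eval", unique_pages=320, scribes_per_page=6, words_per_page=125)

if __name__ == "__main__":
    for p in (TRAINING, DEVTEST, EVAL):
        print(p.name, p.manuscripts, p.words)
```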
Data Preparation
  Start with electronic text from GALE
    Whole documents collected from newswire or web
    Segmented into SUs (semantic/sentence units)
    Each segment manually translated
  Pre-processing prior to handwriting
    Tokenization to words for later stages
    Segments reordered and formatting added to create optimal pages for handwriting assignment
      • Roughly 5 words/line to avoid line wrapping
      • No more than 25 lines/page to avoid page breaks
  After handwriting, images scanned at high resolution (600 dpi, greyscale)
  Images are ground truth annotated at line, word level
  Major challenge is logical storage of many layers of information across multiple versions of the same data
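The page-formatting constraints above (roughly 5 words per line, at most 25 lines per page) can be illustrated with a small layout routine. This is only a sketch under stated assumptions: the function name `layout_pages` is hypothetical, segments are assumed to be already tokenized, a segment is kept on one page whenever it fits, and the segment reordering mentioned on the slide is not modeled.

```python
from typing import List

WORDS_PER_LINE = 5        # "roughly 5 words/line" from the slide
MAX_LINES_PER_PAGE = 25   # "no more than 25 lines/page" from the slide


def layout_pages(segments: List[List[str]],
                 words_per_line: int = WORDS_PER_LINE,
                 max_lines: int = MAX_LINES_PER_PAGE) -> List[List[str]]:
    """Pack tokenized segments into pages; each page is a list of text lines."""
    pages: List[List[str]] = []
    current: List[str] = []
    for tokens in segments:
        # Break the segment into short lines so scribes never need to wrap.
        lines = [" ".join(tokens[i:i + words_per_line])
                 for i in range(0, len(tokens), words_per_line)]
        # Start a new page if this segment would overflow the current one.
        if current and len(current) + len(lines) > max_lines:
            pages.append(current)
            current = []
        current.extend(lines)
        # Fallback: a single segment longer than a page is split across pages.
        while len(current) > max_lines:
            pages.append(current[:max_lines])
            current = current[max_lines:]
    if current:
        pages.append(current)
    return pages
```

Keeping whole segments together is a design choice made here for illustration; the slides do not say how segment boundaries were treated at page breaks.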
Collection
  New human subjects collection required to produce handwritten versions of existing data
  Pilot collection currently underway at LDC in Philadelphia
    • LDC Arabic staff and recent Iraqi immigrants in Philly
  Additional collections planned with partner sites in Lebanon, Morocco and possibly Egypt
  Regional variety necessary to capture stylistic writing differences
    • E.g. use of Indic vs. Arabic numbers
  Assignment and tracking of data and scribes controlled through centralized LDC database and assignment protocol
    Scribe partition (train only, test only, both)
    Writing conditions
    Regional variation
    Genre, topic and source balance
Writing Conditions
  Implement
    90% ballpoint pen (I)
    10% pencil (P)
  Paper
    75% unlined white paper (U)
    25% lined paper (L)
  Writing speed
    90% normal (N)
    5% fast (F)
    5% careful (C)
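One way to picture how these proportions could drive kit generation is to draw a condition for each page independently along the three dimensions. The sketch below does that in Python. It is an assumption that conditions were sampled independently rather than allocated by fixed quota, and the three-letter code (for example `IUN`) is a hypothetical concatenation of the letter codes shown above.

```python
import random

# Proportions and letter codes copied from the slide.
CONDITIONS = {
    "implement": {"I": 0.90, "P": 0.10},          # ballpoint pen, pencil
    "paper": {"U": 0.75, "L": 0.25},              # unlined, lined
    "speed": {"N": 0.90, "F": 0.05, "C": 0.05},   # normal, fast, careful
}


def draw_conditions(rng: random.Random) -> str:
    """Return a hypothetical 3-letter condition code for one page."""
    code = ""
    for dimension in ("implement", "paper", "speed"):
        letters = list(CONDITIONS[dimension])
        weights = list(CONDITIONS[dimension].values())
        # Weighted draw of one letter for this dimension.
        code += rng.choices(letters, weights=weights, k=1)[0]
    return code


if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed so a kit can be regenerated
    print([draw_conditions(rng) for _ in range(5)])
```

Random draws only approximate the target percentages on small kits; a quota-based allocation would hit them exactly and may be closer to what a tracking database would enforce.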
Collection Workflow
  [Flowchart; steps listed in apparent order]
  1) LDC selects source data
  2) LDC generates kits (documents + writing conditions)
  3) LDC delivers data kits to collection sites
  4) Sites publicize study and recruit participants
  5) Scribe visits public URL, contacts site coordinator
  6) Site coordinator schedules appointment
  7) Scribe comes in, takes writing sample test
  8) Site coordinator verifies scribe eligibility
  9) Site coordinator logs in to secure website via login page
  10) Scribe completes registration via registration page
  11) Scribe verifies info via confirmation page
  12) Site coordinator prints out subject ID and instructions for subject via assignment page
  13) Coordinator pulls kit for this subject ID
  14) Scribe leaves with kit and instructions
  15) Scribe returns completed kit to site
  16) Coordinator verifies kit completeness and arranges payment
  17) Coordinator files completed kit for scanning/delivery
  18) Site scans completed kit(s) as safeguard
  19) Site uploads image file to LDC
  20) Site ships completed paper kit(s) to LDC for archiving
  21) LDC processes completed kits for subsequent tasks
Scribe Demographics
  Scribes register in person at collection site and take writing test
    To assess literacy and ability to follow instructions
  Enter demographic info on LDC's secure server
    Name, address (for payment purposes only)
    Age, gender, level of education, occupation
    Where born, where raised
    Primary language of educational instruction
    Handedness
  After registration, scribes receive brief tutorial
    No line wrapping, no page breaks
    Copy text exactly: no omissions or insertions, no corrections to source text
Scribe Assignments
  Assignments are in the form of printed "kits"
    50 printed pages to be copied plus assignment table
      • Assignment table specifies page order and writing conditions
    Multiple scribes/kit, so conditions and order vary
  Printed pages labeled with page and kit ID
  Scribes affix label with scribe, page and kit ID to back of completed manuscript
      • To facilitate data tracking during scanning and post-processing
  Scribes supply paper and writing instrument
    To sample natural variation
  Payment per completed kit
    Exhaustive check on first assignment (completeness and accuracy)
    Spot check on remainder of assignments
Ground Truthing
  Zones created at word level only for Phase 1
    Lines can be extrapolated from annotation
    Other zone types possible in future phases
      • Structural elements (e.g. signature block)
  Explicit reading order preserved
  Locations are polygons
    Restricted to upright rectangles in the first phase
  Each zone contains a unique ID, the contents, location (coordinates)
  Status tags to accommodate scribe mistakes
    extra, missing, typo
  nextZoneID tag to indicate reading order
  In Phase 1, ground truthing primarily by partner site (Applied Media Analysis)
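The zone description above (unique ID, contents, rectangular location, status tag, and a nextZoneID link) maps naturally onto a linked record. The Python sketch below models one possible in-memory form, follows the nextZoneID chain to recover reading order, and groups consecutive word zones into lines by vertical overlap. The class, the field names other than the nextZoneID concept, and the line-grouping heuristic are all assumptions for illustration, not the actual GEDI or MADCAT schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Zone:
    """Hypothetical word-level zone; the real annotation schema may differ."""
    zone_id: str
    contents: str
    box: Tuple[int, int, int, int]       # upright rectangle: (x, y, width, height)
    status: Optional[str] = None         # "extra", "missing" or "typo"
    next_zone_id: Optional[str] = None   # mirrors the nextZoneID tag


def reading_order(zones: List[Zone], first_id: str) -> List[Zone]:
    """Follow next_zone_id links starting from first_id."""
    by_id = {z.zone_id: z for z in zones}
    ordered: List[Zone] = []
    seen = set()
    current: Optional[str] = first_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in reading order at zone {current}")
        seen.add(current)
        zone = by_id[current]
        ordered.append(zone)
        current = zone.next_zone_id
    return ordered


def _overlaps_vertically(a: Zone, b: Zone) -> bool:
    top = max(a.box[1], b.box[1])
    bottom = min(a.box[1] + a.box[3], b.box[1] + b.box[3])
    return bottom > top


def group_into_lines(ordered: List[Zone]) -> List[List[Zone]]:
    """Heuristic: consecutive zones that overlap vertically share a line."""
    lines: List[List[Zone]] = []
    for zone in ordered:
        if lines and _overlaps_vertically(lines[-1][-1], zone):
            lines[-1].append(zone)
        else:
            lines.append([zone])
    return lines
```

Storing reading order as explicit links rather than inferring it from coordinates keeps the order unambiguous for right-to-left text and for pages where a scribe wrote unevenly.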
GEDI Toolkit
  GroundTruth - Editor and Document Interface (GEDI) created by Applied Media Analysis (AMA)
Data Format
  MADCAT Unifier process takes multiple data streams and generates a single XML output file which contains all required information
  1) Text layer
    * Source text
    * Tokenization
    * SU segmentation
    * Translation
  2) Image layer
    * Zone bounding boxes
  3) Scribe demographics
  4) Document metadata
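To make the four-layer idea concrete, the sketch below merges the separate streams into one XML tree using Python's standard `xml.etree.ElementTree`. Every element and attribute name here (`document`, `text_layer`, `image_layer`, `zone`, and so on) is a hypothetical placeholder, as is the `unify` function; the actual MADCAT schema is not given on the slide. Whitespace splitting stands in for real Arabic tokenization.

```python
import xml.etree.ElementTree as ET


def unify(doc_id, segments, zones, scribe, metadata):
    """Merge text, image, scribe and metadata streams into one XML element."""
    root = ET.Element("document", {"id": doc_id})

    # 1) Text layer: source text, tokens, SU segmentation, translation.
    text = ET.SubElement(root, "text_layer")
    for seg in segments:
        seg_el = ET.SubElement(text, "segment", {"id": seg["id"]})
        ET.SubElement(seg_el, "source").text = seg["source"]
        ET.SubElement(seg_el, "translation").text = seg["translation"]
        for i, tok in enumerate(seg["source"].split()):
            ET.SubElement(seg_el, "token", {"id": f"{seg['id']}.t{i}"}).text = tok

    # 2) Image layer: one zone per word, with its bounding box.
    image = ET.SubElement(root, "image_layer")
    for zone in zones:
        ET.SubElement(image, "zone", {k: str(v) for k, v in zone.items()})

    # 3) Scribe demographics and 4) document metadata as attribute sets.
    ET.SubElement(root, "scribe", {k: str(v) for k, v in scribe.items()})
    ET.SubElement(root, "metadata", {k: str(v) for k, v in metadata.items()})
    return root


if __name__ == "__main__":
    # All values below are made-up placeholders.
    root = unify(
        "doc001",
        segments=[{"id": "s0", "source": "w1 w2", "translation": "word1 word2"}],
        zones=[{"id": "z0", "token": "s0.t0", "x": 300, "y": 10, "w": 80, "h": 30}],
        scribe={"id": "scribe42", "handedness": "right"},
        metadata={"genre": "newswire", "dpi": 600},
    )
    print(ET.tostring(root, encoding="unicode"))
```

Linking each zone to a token ID, as the placeholder `token` attribute does, is one way to keep the image layer aligned with the text layer across multiple handwritten versions of the same page.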
Evaluation
  Input: (segmented) Arabic handwritten image
  Output: segmented English text
  HTER is primary evaluation metric (edit distance)
    Manual post-editing task corrects MT output one segment at a time until it has the same meaning as the reference translation, making as few edits as possible
    NIST-developed MTPostEditor GUI
      • Editors review segment-aligned MT and gold standard translation
    No access to original Arabic text or handwritten image file
  No official separate evaluation of OCR or processing components
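As a rough illustration of the metric, HTER can be approximated as the number of word-level edits between the MT output and its human post-edited version, divided by the length of the post-edited version. The Python sketch below uses plain word-level Levenshtein distance. This is a simplification: the TER family of metrics also counts phrase shifts as single edits, which this code does not model, so it can overestimate the edit count. The function names and the choice of normalizer are assumptions, not the official scoring implementation.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt_output: str, post_edited: str) -> float:
    """Edits needed to turn MT output into its post-edited form, per word."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited segment is empty")
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    print(approx_hter("the cat sat on mat", "the cat sat on the mat"))
```

Lower is better: a score of 0 means the post-editor changed nothing, and scores can exceed 1 when the MT output needs more edits than the post-edited segment has words.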
Conclusions; Future Work
  LDC is creating a set of new linguistic resources for image processing, document classification and translation on a scale not previously available
    Phase 1: Large collection of Arabic handwritten, translated, segmented, ground truthed text
    Infrastructure for collection, annotation and data management
      • Including a unified, extensible data format
    Extended to new data types, domains, languages, annotations in future phases
  Resources will be available through LDC
Acknowledgements
This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-004. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Thank you to Audrey Le and Mark Przybocki at NIST for helping to define data and format requirements.