Document Image Databases and Retrieval LBSC 708A/CMSC 838L Philip Resnik mostly adapted from Dave Doermann partly adapted from Doug Oard and Sam Tseng

Agenda n Questions n Definitions - Document, Image, Retrieval n Document Image Analysis Page decomposition Optical character recognition n Traditional Indexing with Conversion Confusion matrix Shape codes n Doing things Without Conversion Duplicate Detection, Classification, Summarization, Abstracting Keyword spotting, etc n Recent work on Chinese document images

Goals of this Class n Expand your definition of what is a DOCUMENT n To get an appreciation of the (hard!) issues in document image indexing n To look at different ways of solving the same problems with different media n Your job: compare/contrast with other media

DOCUMENT DATABASE IMAGE

Document DOCUMENT n Basic Medium for Recording Information n Transient Space Time n Multiple Forms Hardcopy (paper, stone,..) / Electronic (cdrom, internet, ) Written/Auditory/Visual (symbolic, scenic) n Access Requirements Search Browse Read May require some technological advancements...

Sources of Document Images n The Web Some PDF files come from scanned documents Arabic news stories are often GIF images n Digital copiers Produce corporate memory as a byproduct n Digitization projects Provide improved access to hardcopy documents

Some Definitions n Modality A means of expression n Linguistic modalities Electronic text, printed, handwritten, spoken, signed n Nonlinguistic modalities Music, drawings, paintings, photographs, video n Media The means by which the expression reaches you Internet, videotape, paper, canvas

Document Images n A collection of dots called pixels Arranged in a grid and called a bitmap n Pixels often binary-valued (black, white) But greyscale or color is sometimes needed n 300 dots per inch (dpi) gives the best results But images are quite large (1 MB per page) Faxes are normally 72 dpi n Usually stored in TIFF or PDF format
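The "1 MB per page" figure can be checked with a little arithmetic. The sketch below is illustrative only: it assumes an uncompressed US Letter page (8.5 x 11 inches), which the slide does not state, and the function name `page_bytes` is made up for this example.

```python
def page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page (assumed page size)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Total bits divided by 8 gives bytes.
    return width_px * height_px * bits_per_pixel // 8


# Assumption: US Letter page, no compression.
binary_page = page_bytes(8.5, 11, 300, 1)    # black-and-white, 1 bit per pixel
color_page = page_bytes(8.5, 11, 300, 24)    # 24-bit color

print(f"binary: {binary_page / 1e6:.2f} MB, color: {color_page / 1e6:.2f} MB")
```

Under these assumptions a binary page is about 1 MB, matching the slide, while a 24-bit color scan of the same page is roughly 24 times larger.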

Database n Organized access to information n Data Integrity Support Security, redundancy control, abstraction n Fielded Representation Datatype specific manipulation User transparent organization Organized processing and access n Query Support Fields Operations between fields DATABASE

Document Database n Collections of electronic records documents email, business letters, form data, programs, drawings n Classical database access is provided for metadata n Natural Language Processing (NLP) Topic Classification Entity Extraction Abstracting Machine Translation n Information Retrieval (IR) Indexing and Retrieval of Full Text Document Similarity Filtering and Routing DOCUMENT DATABASE

Images n Pixel representation of intensity map n No explicit content, only relations n Image analysis Attempts to mimic human visual behavior Draw conclusions, hypothesize and verify IMAGE Image databases Use primitive image analysis to represent content Transform semantic queries into image features color, shape, texture spatial relations

Document Images n Scanned Pixel representation of document n Data Intensive (100-300dpi, 1-24 bpp) n NO EXPLICIT CONTENT n Document image analysis or manual annotation required takes pixels -> contents automatic means are not guaranteed Yet we want to be able to process them like text files! DOCUMENT IMAGE

Document Image Database n Collection of scanned images n Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation We want classical database access DOCUMENT DATABASE IMAGE

Information Retrieval Document Understanding Document Image Retrieval

Retrieval System Model Query Formulation Detection Delivery Selection Examination Index Docs User Indexing

Document Image Database Applications n Archival Applications Basic Indexing may be available by manual annotation n Searching Only High Quality Text Conversion n Searching Lower Quality Heterogeneous Databases ???? Access To Databases

Managing Document Image Databases n Document Image Databases are often influenced by traditional DB indexing and retrieval philosophies We are comfortable with them They work n Problem: Requires content to be accessible n Techniques: Content based retrieval (keywords, natural language) Query by structure (logical/physical) Query by Functional attributes (titles, bold, ) n Requirements: Ability to Browse, search and read

Indexing Page Images (Traditional) Optical Character Recognition Page Decomposition Scanner Document Page Image Structure Representation Character or Shape Codes Text Regions

Document Image Analysis n General Flow: Obtain Image - Digitize Preprocessing Feature Extraction Classification n General Tasks Logical and Physical Page Structure Analysis Zone Classification Language ID Zone Specific Processing Recognition Vectorization

Page Analysis n Skew correction Based on finding the primary orientation of lines n Image and text region detection Based on texture and dominant orientation n Structural classification Infer logical structure from physical layout n Text region classification Title, author, letterhead, signature block, etc.

Image Detection

Text Region Detection

Language Identification n Language-independent skew detection Accommodate horizontal and vertical writing n Script class recognition Asian scripts have blocky characters Connected scripts can't be segmented easily n Language identification Shape statistics work well for western languages Competing classifiers work for Asian languages

Optical Character Recognition n Pattern-matching approach Standard approach in commercial systems Segment individual characters Recognize using a neural network classifier n Hidden Markov model approach Experimental approach developed at BBN Segment into sub-character slices Limited lookahead to find best character choice Useful for connected scripts (e.g., Arabic)

OCR Accuracy Problems n Character segmentation errors In English, segmentation often changes m to rn n Character confusion Characters with similar shapes often confounded n OCR on copies is much worse than on originals Pixel bloom, character splitting, binding bend n Uncommon fonts can cause problems If not used to train a neural network

Measures of OCR Accuracy n Character accuracy n Word accuracy n IDF coverage n Query coverage
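The slide names the measures without defining them. As a rough illustration, character accuracy is often computed from the edit distance between the OCR output and a reference transcription. That definition is an assumption here, not something taken from the slides, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # delete from ref
                           cur[j - 1] + 1,       # insert into ref
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def character_accuracy(ref, hyp):
    """Assumed definition: 1 - edit_distance / reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# The m -> rn segmentation error costs two edits under this measure.
print(character_accuracy("modern", "rnodern"))
```

With this definition, heavily garbled output can score below zero, so some evaluations clip the value at 0.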

Improving OCR Accuracy n Image preprocessing Mathematical morphology for bloom and splitting Particularly important for degraded images n Voting between several OCR engines helps Individual systems depend on specific training data n Linguistic analysis can correct some errors Use confusion statistics, word lists, syntax, But more harmful errors might be introduced

OCR Speed n Neural networks take about a minute a page Hidden Markov models are slower n Voting can improve accuracy But at a substantial speed penalty n Easy to speed things up with several machines For example, by batch processing - using desktop computers at night

Problem: Logical Page Analysis (Reading Order) n Can be hard to guess in some cases Newspaper columns, figure captions, appendices, n Sometimes there are explicit guides Continued on page 4 (but page 4 may be big!) n Structural cues can help Column 1 might continue to column 2 n Content analysis is also useful Word co-occurrence statistics, syntax analysis

Processing Converted Text Typical Document Image Indexing n Convert hardcopy to an electronic document OCR Page Layout Analysis Graphics Recognition n Use structure to add metadata n Manually supplement with keywords Use traditional text indexing and retrieval techniques?

Information Retrieval on OCR n Primarily non-objective (no use of fielded data) n Requires robust ways of indexing n Statistical methods with large documents work best n Key Evaluations Success for high quality OCR (Croft et al 1994, Taghva 1994) Limited success for poor quality OCR (1996 TREC, UNLV) Clustering successful for > 85% accuracy (Tsuda et al, 1995)

Proposed Solutions n Improve OCR n Automatic Correction Taghva et al, 1994 n Enhance IR techniques Lopresti and Zhou, 1996 N-Grams Applications Cornell CS TR Collection (Lagoze et al, 1995) Degraded Text Simulator (Doermann and Yao, 1995)

N-Grams n Powerful, Inexpensive statistical method for characterizing populations n Approach Split up document into overlapping n-character sequences Use traditional indexing representations to perform analysis DOCUMENT -> DOC, OCU, CUM, UME, MEN, ENT n Advantages Statistically robust to small numbers of errors Rapid indexing and retrieval Works from 70%-85% OCR accuracy, where traditional IR fails
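A minimal sketch of the character n-gram split shown on the slide, plus a small demonstration of why it tolerates isolated OCR errors. The function name `char_ngrams` and the misrecognized word are illustrative assumptions.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("DOCUMENT"))

# One misrecognized character (N read as M) only disturbs the
# n-grams that overlap it; the rest still match.
clean = set(char_ngrams("DOCUMENT"))
noisy = set(char_ngrams("DOCUMEMT"))
print(len(clean & noisy), "of", len(clean), "trigrams survive")
```

A word-based index would miss "DOCUMEMT" entirely, whereas the trigram representation still shares most of its terms with the clean word.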

Matching with OCR Errors n Above 80% character accuracy, use words With linguistic correction n Between 75% and 80%, use n-grams With n somewhat shorter than usual And perhaps with character confusion statistics n Below 75%, use word-length shape codes
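The thresholds above amount to a simple decision rule. The sketch below encodes it directly. How the exact boundary values (75% and 80%) are treated is an assumption, since the slide does not say, and the function name is hypothetical.

```python
def choose_matching_strategy(char_accuracy):
    """Pick an indexing unit from estimated OCR character accuracy (0.0-1.0)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if char_accuracy > 0.80:
        return "words"        # with linguistic correction
    if char_accuracy >= 0.75:
        return "n-grams"      # n somewhat shorter than usual
    return "shape codes"      # word-length shape codes


for acc in (0.92, 0.78, 0.60):
    print(acc, "->", choose_matching_strategy(acc))
```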

Handwriting Recognition n With stroke information, can be automated Basis for input pads n Simple things can be read without strokes Postal addresses, filled-in forms n Free text requires human interpretation But repeated recognition is then possible

Conversion? n Full Conversion often required n Conversion is difficult! Noisy data Complex Layouts Non-text components Points to Ponder n Do we really need to convert? n Can we expect to fully describe documents without assumptions?

Researchers are seeing a progression from full conversion to image based approaches n Applications Indexing and Retrieval Information Extraction Duplicate Detection Clustering (Document Similarity) Summarization n Advantages Makes use of powerful image properties (Function, IVC 1998) Can be cheaper than conversion Makes use of redundancy in the language.

What The Rest of This Class is NOT about. n Physical or Electronic Document Management imaging, archiving, routing, storage, retrieval n Text Analysis or Information Retrieval n Details on Classical Document Image Analysis Preprocessing OCR Layout Analysis Rather, let's ask: Can we manage document image databases without complete and accurate conversion?

Why do without Conversion? n To accomplish a specific task n To narrow the scope for traditional processing n Some things can not be (easily) converted! Image, Graphics, Handwriting n Sometimes (often) metadata is not present! n Conversion is expensive!

Focus Methods which do NOT require complete and accurate conversion n Processing Converted Text n Manipulating Images of Text n Indexing Based on Structure n Graphics and Drawings n Related Work and Applications

Outline

Processing Images of Text n Characteristics Does not require expensive OCR/Conversion Applicable to filtering applications May be more robust to noise n Possible Disadvantages Application domain may be very limited Processing time may be an issue if indexing is otherwise required

Proper Noun Detection (DeSilva and Hull, 1994) n Problem: Filter proper nouns in images of text People, Places, Things n Advantages of the Image Domain: Saves converting all of the text Allows application of word recognition approaches Limits post-processing to a subset of words Able to use features which are not available in the text n Approach: Identify Word Features Capitalization, location, length, and syntactic categories Classify using rule-set Achieve 75-85% accuracy without conversion

Keyword Spotting Keyword Spotting Techniques: Word Shape/HMM - (Chen et al, 1995) Word Image Matching - (Trenkle and Vogt, 1993; Hull et al) Character Stroke Features - (Decurtins and Chen, 1995) Shape Coding - (Tanaka and Torii; Spitz 1995; Kia, 1996) Applications: Filing System (Spitz - SPAM, 1996) Numerous IR Processing handwritten documents Formal Evaluation: Scribble vs. OCR (DeCurtins, SDIUT 1997)

Shape Coding n Approach Use of Generic Character Descriptors Make Use of Power of Language to resolve ambiguity Map Character based on Shape features including ascenders, descenders, punctuation and character with holes

Shape Codes n Group all characters that have similar shapes {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0} {a, c, e, n, o, r, s, u, v, x, z} {b, d, h, k, } {f, t} {g, p, q, y} {i, j, l, 1} {m, w}
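To make the grouping concrete, here is a minimal sketch that maps each character to the shape class it belongs to. The class members come from the slide. The one-letter class labels ("A", "x", "b", and so on) and the function name are invented for this example.

```python
# Class members as listed on the slide; the label for each class is hypothetical.
SHAPE_CLASSES = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",  # tall characters and digits
    "x": "acenorsuvxz",                          # x-height characters
    "b": "bdhk",                                 # ascenders
    "f": "ft",
    "g": "gpqy",                                 # descenders
    "i": "ijl1",
    "m": "mw",
}
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_CLASSES.items() for ch in chars}


def shape_code(word, unknown="?"):
    """Replace every character by its shape-class label."""
    return "".join(CHAR_TO_CODE.get(ch, unknown) for ch in word)


print(shape_code("document"))
# Different words can collapse to the same code; language context
# is what resolves the ambiguity.
print(shape_code("cat"), shape_code("eat"))
```

Because the mapping is many-to-one, a shape-coded index trades precision for robustness: it only needs the coarse outline of each character to be recognized.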

Spatial Matching and Indexing n Establish correspondence Create a directed mapping from each region in the first document to the regions it overlaps in the second n Restrict horizontal mappings Refine mapping such that one->many relations occur only for vertical regions n Perform inverse mapping n Identify maximal area overlap subset n Compute similarity

Step 1 A->1,2 B->1,3 C->2 D->2 E->2 F->NULL Step 1a A->1 B->1,3 C->2 D->2 E->2 F->NULL Step 2 1->A, B 2->A,C,D,E 3->B Step 2a 1->A, B 2->A,C,D 3->B

Outline

Graphics n Maps and Drawings Lorenz and Monagan, 1995 Samet and Soffer, 1995 Amlani and Kasturi, 1988 n Graphs Koga et al, 1993 n Logos and Icons Jaisimha et al, 1996 Doermann et al, 1996 Gudivada and Raghavan, 1993 n Technical Drawings Syeda-Mahmood, 1995

Map Interpretation Samet et al n Identify Legend on the Map Image n Extract Images map labels and descriptions n Identify labels in the map images n Allow user to query based on extracted images n Bootstraps the information extraction and interpretation problems

Outline

Related to Image Indexing n Using captions to index images Shrihari, 1995 Rowe, 1995 n Identifying text in images (Advertisements, etc) Document Image Matching Indexing and Retrieval Duplicate Image Detection

Document Image Matching n Characteristics Identify Similarity Based on Image Features Use either structure or content attributes Applications to retrieval, filtering and database management n J. Hull (DAS 1994) n J. Hull, J. Cullen and M. Pearis (SPIE 1997) n L. Spitz (SPIE 1997) n D. Doermann, H. Li and O. Kia (ICDAR 1997)

Duplicate Document Image Detection n Problem Detection of Duplicate Document Images based on Image Properties n Assumptions Very large collections Dynamic database No prior human interaction Goal: To identify an appropriate index for the document

Duplicate Detection n Same content, same format For example, a xerox copy n Same content, different format For example, as a web page or on paper n Shared content, same format For example, a paper with annotations n Shared content, different format For example, including text with cut-and-paste

Approach n Use global features to restrict search Number of pages, number of lines, page moments n Extract a signature using shape codes n Convert signature to a set of n-gram keys used to index the database n Rank and verify return top N documents visual or algorithmic refinement n Advantages: Robust to noise, extracted quickly, extracted easily, efficiently stored
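The indexing and ranking steps can be sketched as an inverted index from n-gram keys to documents, with candidates ranked by the number of shared keys. This is a simplified illustration, not the authors' implementation: the global-feature filter and the verification step are omitted, and the class name, key length and signature strings are all assumptions.

```python
from collections import Counter, defaultdict


def ngram_keys(signature, n=5):
    """Set of overlapping n-gram keys drawn from a shape-code signature."""
    return {signature[i:i + n] for i in range(len(signature) - n + 1)}


class SignatureIndex:
    """Inverted index from n-gram key to the documents containing it."""

    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, doc_id, signature):
        for key in ngram_keys(signature, self.n):
            self.postings[key].add(doc_id)

    def query(self, signature, top_n=3):
        """Rank documents by how many keys they share with the query."""
        votes = Counter()
        for key in ngram_keys(signature, self.n):
            for doc_id in self.postings.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common(top_n)


index = SignatureIndex()
index.add("memo", "bxxxmxxfAxxigxxf")      # made-up signatures
index.add("letter", "Axxbxxxiffxxgxm")
# A rescanned copy of "memo" with one shape code misread (m -> x).
print(index.query("bxxxxxxfAxxigxxf"))
```

The top-ranked candidates would then go on to the visual or algorithmic verification step named on the slide.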

Cross-Language Duplicate Detection (= finding translations!)

Duplicate Reconciliation

Evaluation n The usual approach: Model-based evaluation Apply confusion statistics to an existing collection n A bit better: Print-scan evaluation Scanning is slow, but availability is no problem n Best: Scan-only evaluation No existing IR collections have printed materials

Summary n Many applications benefit from image based indexing Less discriminatory features Features may therefore be easier to compute More robust to noise Often computationally more efficient n Many classical IR techniques have application for DIR n Structure as well as content are important for indexing n Preservation of structure is essential for in-depth understanding

Page Classification n Document Page Images Scanned or Digital Page Images Typically from Multi-page Documents Page Images as Individual, Independent Images n Classification Determine the Category of Page Image n Classes, Types, or Genres Visually Similar Pages (Query by Example) Explicit/Specific Genre (Query by Type) Business Letters, Memos, Forms, Articles, Ads, Maps User-Specified (Query by Need)

Applications n Automatic Document Structure Analysis From Layout Analysis To Logical Analysis n Document Image Retrieval n Organization, Navigation, and Visualization of Document Collection n Automatic Document Routing and Cataloging

Related Work n Requires OCR Dengel et al. (DFKI): OfficeMAID Hao et al. (NJIT) n Requires (or Assumes) Type Knowledge/Models Maderlechner et al. (Siemens AG) Taylor et al. (Lockheed Martin): Part of IDUS Dengel et al. (DFKI): OfficeMAID Hao et al. (NJIT)

Issues and Challenges n Issues How do we define classes? How can we automate the learning process? n Challenges Without OCR Without Domain-Specific Structural Information Human Relevance Assessment is Subjective User-Defined Types

Goals and Approach n Goals Produce a Database of Relevance Judgements Build Classifiers Visual Similarity Decision Tree Classifier Title Page Decision Tree Classifier Evaluate Classifiers Judge Correlation between Human Assessments and our Classifiers n Approach Visual Similarity-Based Human Relevance Judgments User-Specified Types Supervised Classification

User Relevance Test n Objectives Gather Human Relevance (Similarity) Judgments n Sample Images University of Washington Document Image Database I (UW-I) 979 Technical Article Images n Query Images 12 Structurally Distinct Representatives of Layout n 7 Test Subjects Rated Database Images with Similarity Ratings from 1 (Minimally Similar) to 5 (Very Similar) n Results Number of Judgments Relevance Ratings

Test Query Images

Distribution of # of Judgements

Distribution of Relevance Rating

Assigning Class Labels n Using Data from User Relevance Test Number of Judgments Relevance Rating n Assigning Class Labels Based on Number of Judgments Based on Average Relevance Rating Based on Combination n Total Scores Confidence Measures of Class Labels Used in Preparing Training Sets

Classification: Experimental Protocol n Rank Each Class by Strength of Training Images n Segment the Database into 7 Subsets 1/7th of each Class, ~135 Images Per Subset n Train and Test using Leave-One-Out Re-sampling Method n Generate Accuracy Rates for each Subset to show Correlation between User Evaluation and Classifier n Run the General Classification Tree on the Entire Class

Computation of Total Score n Total Score = f(relevance (R), # of judgements (J)) Candidate functions: R + J, R * J, SQRT(R*R + J*J) [Figure labels: WORST, BEST, R * J]
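The slide lists three candidate ways to combine the relevance rating R with the number of judgments J. A small sketch computing all three side by side follows. The dictionary keys and the sample values are illustrative, and the slide text alone does not make clear which combination was finally adopted.

```python
import math


def total_scores(r, j):
    """Candidate total scores for relevance r and number of judgments j."""
    return {
        "sum": r + j,                             # R + J
        "product": r * j,                         # R * J
        "euclidean": math.sqrt(r * r + j * j),    # SQRT(R*R + J*J)
    }


# Hypothetical image: average rating 3 from 4 judges.
print(total_scores(3, 4))
```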

Distribution of Total Scores

Decision Tree Classifiers n Using OC1 (Oblique Decision Tree Classifier) to Build Decision Trees Oblique Decision Boundaries Linear Combination of Features n Image Features University of Washington Ground-truth Zone/Region Level Segmentation 59 Features

Experiments n 12 Page Class Problem Overall Accuracy Septile Partitions n Title Page Problem Combined Classes #4 and #9 Overall Accuracy Septile Partitions

12 Page Classification Accuracy

Example Title Pages (#4 & #9)

Title Page Classification Accuracy

Title Page Overall Accuracy n 57 Title pages, 891 non-title pages n Overall Accuracy = 906/948 = 95.57% n Title Page Accuracy = 37/57 = 64.91% n False Positives = 22 n False Negatives = 20 n Observations All without Type-Specific Information Need Functional (or Logical) Features
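The figures on this slide are mutually consistent, which can be checked by rebuilding the confusion counts from the page totals and the error counts. Precision is not reported on the slide and is derived here only as an extra illustration.

```python
title_pages = 57
non_title_pages = 891
false_negatives = 20   # title pages that were missed
false_positives = 22   # non-title pages labeled as title pages

true_positives = title_pages - false_negatives
true_negatives = non_title_pages - false_positives
total_pages = title_pages + non_title_pages

overall_accuracy = (true_positives + true_negatives) / total_pages
title_page_accuracy = true_positives / title_pages  # recall on title pages
# Derived, not on the slide: share of predicted title pages that are correct.
precision = true_positives / (true_positives + false_positives)

print(f"overall accuracy    = {overall_accuracy:.2%}")
print(f"title page accuracy = {title_page_accuracy:.2%}")
print(f"precision (derived) = {precision:.2%}")
```

The gap between the two reported numbers reflects class imbalance: with 891 non-title pages out of 948, overall accuracy is dominated by the majority class.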

Title Page Classification Errors False Positives False Negatives

Test Query Images

Observation of 12 Page Classification n Correlation Among Classes 66 possible pairs #4 and #9: 91% (Rank #1) #5 and #6: 52% (Rank #2) #1 and #2: 48% (Rank #3) #10 and #11: 40% (Rank #4) n How many is the right number? Combine Most Similar Classes

Summary and Future Work n Document Page Classification Classifies Visually Similar Pages Is based on Human Relevance Judgments Page Classification Results are Good n Future Work Automatic Feature Extraction Indexing Genre-Based Search

Agenda n Questions n Definitions - Document, Image, Retrieval n Document Image Analysis n Traditional Indexing with Conversion n Doing things Without Conversion n Recent work on IR with Chinese document images Tseng and Oard

Document Retrieval Approaches for Images of Text n Full-text search based on manually re-keying the text Prohibitively expensive at large scale n Search based on bibliographic metadata May be difficult to adequately describe the materials. n Full text based on Optical Character Recognition (OCR) Inexpensive and relatively rapid Sensitive to OCR accuracy

Key Questions for Information Retrieval n What to index? Phrase, words, character, or shape codes Unigrams or n-grams n How to weight a term in a document? Term frequency (TF) Document frequency (DF) Document length normalization (Term position) n How to assign scores to documents? Boolean, vector space, and probabilistic models

Chinese Text Retrieval Issues n Words may be any number of characters (typically 2-5) But some that contain only 1 character or more than 5 characters e.g., (cat), (UNESCO) n Longer words (over 2 characters) often have shorter sub-word units Transliteration is an exception n Written Chinese has no word separator A sentence can be segmented in different ways, all may be legal Similar to the phrase detection problem in English n Chinese character inventory is very large 13,500 characters in Big-5 code (traditional Chinese: Taiwan and Hong Kong) Over 6,000 characters in GB code (simplified Chinese: China, Singapore) About 3,000 commonly used characters in each character set

Socio-Cultural Research Center (SCRC) Collection n 800,000 newspaper clippings from 1950-1976 Scanned over 300,000 at 300 dpi n 30 China, Hong Kong, and Taiwan news agencies Mostly simplified Chinese, some traditional Chinese n Focus on diplomatic and military activities

Document Preparation n Selected 11,108 scanned document images n OCR yielded 8,438 valid docs (Presto! OCR Pro, Big-5) Avg valid document had a 69% system-reported recognition rate Computed on a sample of 1,300 documents n Second version prepared using Big-5 to GB conversion GB version used in experiments

Topic Preparation n Based on contemporaneous Chinese journal articles From 100 paper titles, 30 were selected and rewritten as Chinese topics n Made English translations for cross-language experiments Translated by native speakers of Chinese 12 Anti-Chinese Movements Activities related to the anti-Chinese movements in Indonesia Articles must deal with activities related to the anti-Chinese movement in Indonesia; case reports or articles dealing with PRC's criticism of the Anti-Chinese movement will be considered partly relevant.

Relevance Judgments n Exhaustive tri-state relevance judgments Irrelevant (=0), partially relevant (=1), fully relevant (=2) n Every topic-document pair judged by 3 assessors 2 majored in history, 1 majored in library science Averaged 4 minutes per document image (for all 30 topics) n Sum of the judgments provides a final estimate 0=not relevant, 1-5=partially relevant, 6=fully relevant Threshold as desired to reflect the intended application In our experiments, any score > 0 is treated as relevant
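With three assessors each giving 0, 1 or 2, the summed score ranges from 0 to 6. The sketch below combines the judgments and applies a threshold. It assumes that any sum strictly between 0 and 6 counts as partially relevant, and the function names are hypothetical.

```python
def final_relevance(judgments):
    """Sum three tri-state judgments (0, 1 or 2) and label the result."""
    if len(judgments) != 3 or any(j not in (0, 1, 2) for j in judgments):
        raise ValueError("expected three judgments, each 0, 1 or 2")
    total = sum(judgments)
    if total == 0:
        label = "not relevant"
    elif total == 6:
        label = "fully relevant"
    else:
        label = "partially relevant"   # assumed: any sum from 1 to 5
    return total, label


def is_relevant(judgments, threshold=0):
    """Binary relevance; the experiments treat any score > 0 as relevant."""
    return sum(judgments) > threshold


print(final_relevance([2, 1, 0]), is_relevant([2, 1, 0]))
```

Raising `threshold` gives a stricter notion of relevance for applications that need it.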

Chinese OCR Text Retrieval Strategies n Indexing method: Both 1-gram (for partial match) and 2-gram (for preserving sequence) Example: ABC will be indexed with A, B, C, AB, BC Compared to 1-gram only and 2-gram only n Weighting scheme: document terms : TF*IDF = log(1+ tf ) * log(N/df) query terms : tf * (3w-1), where w is the length of the term n Retrieval model: Vector space model compared with probabilistic model n Document length normalization: byte size for document terms, compared to cosine
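A minimal sketch of the combined 1-gram and 2-gram indexing and the two weighting formulas. Two things are assumptions: the logarithm base (natural log is used here) and reading the slide's query weight `(3w-1)` literally as 3*w - 1. Function names are illustrative.

```python
import math


def index_terms(text):
    """All 1-grams followed by all overlapping 2-grams of the text."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams


def doc_term_weight(tf, df, n_docs):
    """Document term weight: log(1 + tf) * log(N / df)."""
    return math.log(1 + tf) * math.log(n_docs / df)


def query_term_weight(tf, term_length):
    """Query term weight: tf * (3w - 1), with w the term length (as read here)."""
    return tf * (3 * term_length - 1)


print(index_terms("ABC"))
print(query_term_weight(1, 1), query_term_weight(1, 2))
```

Under this reading a 2-gram query term gets 2.5 times the weight of a 1-gram with the same frequency, which favors matches that preserve character order.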

OCR and Length Normalization n Experiments by Taghva et al showed that some sophisticated weighting schemes that are more effective for ordinary text might lead to less stable results for OCR-degraded text. n Singhal, Salton, Buckley [96] analyzed this phenomenon using Vector space model (SMART system) Word-based indexing Simulated OCR output of a TREC collection (2GB of 742,202 docs) 50 TREC queries (numbered from 151 to 200) Specifically, effects of cosine normalization and IDF are analyzed Incorrect terms like systom have large IDF and thus affect weights of other terms in the same document if cosine normalization is used. They correct this problem by using byte size normalization: (byte size)^0.375
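The effect can be seen in a toy example. Under cosine normalization a single garbled, high-IDF term inflates the vector norm and pulls down the weights of the correct terms. Dividing by the byte size raised to the power 0.375, which is how this sketch reads the slide's normalization factor, leaves them untouched. The term weights and the byte size below are made-up values for illustration.

```python
import math


def cosine_normalize(weights):
    """Divide each weight by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


def byte_size_normalize(weights, byte_size, exponent=0.375):
    """Divide each weight by byte_size ** exponent, independent of term weights."""
    factor = byte_size ** exponent
    return {term: w / factor for term, w in weights.items()}


# Hypothetical weights; "systom" is an OCR error with a large IDF.
clean = {"system": 2.0, "retrieval": 3.0}
noisy = {"system": 2.0, "retrieval": 3.0, "systom": 9.0}

print(cosine_normalize(clean)["retrieval"], cosine_normalize(noisy)["retrieval"])
print(byte_size_normalize(clean, 1000)["retrieval"],
      byte_size_normalize(noisy, 1000)["retrieval"])
```

The sketch assumes the OCR error leaves the document's byte size unchanged, which holds for a one-character substitution.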

Results Summary 1+2 gram is best ByteSize beats Cosine Long queries beat Titles Inquery does well [Figure: Mean Average Precision; panels: Long Queries, Title Queries]

Blind Relevance Feedback for Query Expansion n Reweight existing terms, add related terms: We used =5 and =1 n Blind: assume the top k documents are relevant High precision at low recall is desirable We used k=5 n Longer n-grams typically improve precision Lead to a very large set of index terms for Chinese n Iterate between merging and pruning Merge common 2-grams to get 3-grams Prune low-frequency 3-grams Repeat, merging common 3-grams to get 4-grams, O(n m) time, where n=length of input string, m=longest repeated pattern

A Pruned n-gram Example (in English)
 #    n-grams without pruning                   n-grams after pruning
 1.   clusters based on : 3                     clusters based : 3
 2.   document clustering : 3                   document clustering : 3
 3.   of Web : 3                                Web : 3
 4.   on the : 3                                (empty)
 5.   search engines : 3                        search engines : 3
 6.   STC is : 2                                STC : 2
 7.   Web document clustering : 2               Web document clustering : 2
 8.   Web search engines : 2                    Web search engines : 2
 9.   clustering methods in this domain : 2     clustering methods in this domain : 2
 10.  requirements of : 2                       requirements : 2
 11.  returned by : 2                           returned : 2
Source text: Web Document Clustering: A Feasibility Demonstration Users of Web search engines are often forced to sift through the long ordered list of document returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.

Blind Relevance Feedback Results

Conclusions of Study n The SCRC test collection is useful But more than 30 topics may be needed for statistical significance n Indexing 1-grams and 2-grams together works well If 2-grams are given greater weight in the query n Byte size normalization outperforms cosine normalization But Inquery does better than either on short queries n OCR errors adversely affect blind relevance feedback A clean comparable collection would probably work better Pruning seems to help Considerable parameter tuning is needed ( , , and k)

Summary n Further broadening definition of document n Approaches based on conversion to text n Approaches that don't involve (full) conversion n Recent work on Chinese document images
