Document Image Databases and Retrieval LBSC 708A/CMSC 838L Philip Resnik mostly adapted from Dave Doermann partly adapted from Doug Oard and Sam Tseng
Transcript
Slide 1
Document Image Databases and Retrieval LBSC 708A/CMSC 838L
Philip Resnik mostly adapted from Dave Doermann partly adapted from
Doug Oard and Sam Tseng
Slide 2
Agenda
- Questions
- Definitions - Document, Image, Retrieval
- Document Image Analysis
  - Page decomposition
  - Optical character recognition
- Traditional Indexing with Conversion
  - Confusion matrix
  - Shape codes
- Doing things Without Conversion
  - Duplicate Detection, Classification, Summarization, Abstracting
  - Keyword spotting, etc.
- Recent work on Chinese document images
Slide 3
Goals of this Class n Expand your definition of what is a
DOCUMENT n To get an appreciation of the (hard!) issues in document
image indexing n To look at different ways of solving the same
problems with different media n Your job: compare/contrast with
other media
Slide 4
DOCUMENT DATABASE IMAGE
Slide 5
Document DOCUMENT n Basic Medium for Recording Information n
Transient Space Time n Multiple Forms Hardcopy (paper, stone,..) /
Electronic (cdrom, internet, ) Written/Auditory/Visual (symbolic,
scenic) n Access Requirements Search Browse Read May require some
technological advancements...
Slide 6
Sources of Document Images n The Web Some PDF files come from
scanned documents Arabic news stories are often GIF images n
Digital copiers Produce corporate memory as a byproduct n
Digitization projects Provide improved access to hardcopy
documents
Slide 7
Some Definitions
- Modality: a means of expression
- Linguistic modalities: electronic text, printed, handwritten, spoken, signed
- Nonlinguistic modalities: music, drawings, paintings, photographs, video
- Media: the means by which the expression reaches you (Internet, videotape, paper, canvas, ...)
Slide 8
Document Images
- A collection of dots called pixels
  - Arranged in a grid and called a bitmap
- Pixels are often binary-valued (black, white)
  - But greyscale or color is sometimes needed
- 300 dots per inch (dpi) gives the best results
  - But images are quite large (1 MB per page)
  - Faxes are normally 72 dpi
- Usually stored in TIFF or PDF format
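The "1 MB per page" figure can be checked with a little arithmetic. The sketch below assumes an uncompressed bitmap of a US Letter page (8.5 x 11 inches); the page size and the helper name `bitmap_bytes` are illustrative assumptions, not something the slide specifies.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: US Letter, 8.5 x 11 inches, scanned at 300 dpi.
letter_binary = bitmap_bytes(8.5, 11, 300)                  # 1 bit per pixel
letter_grey = bitmap_bytes(8.5, 11, 300, bits_per_pixel=8)  # 8-bit greyscale

print(f"binary:    {letter_binary / 1e6:.2f} MB")
print(f"greyscale: {letter_grey / 1e6:.2f} MB")
```

Under these assumptions a binary page comes to roughly 1 MB, and an 8-bit greyscale page is eight times larger, which is why binary images are the common default.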
Slide 9
Database n Organized access to information n Data Integrity
Support Security, redundancy control, abstraction n Fielded
Representation Datatype specific manipulation User transparent
organization Organized processing and access n Query Support Fields
Operations between fields DATABASE
Slide 10
Document Database n Collections of electronic records documents
email, business letters, form data, programs, drawings n Classical
database access is provided for metadata n Natural Language
Processing (NLP) Topic Classification Entity Extraction Abstracting
Machine Translation n Information Retrieval (IR) Indexing and
Retrieval of Full Text Document Similarity Filtering and Routing
DOCUMENT DATABASE
Slide 11
Images n Pixel representation of intensity map n No explicit
content, only relations n Image analysis Attempts to mimic human
visual behavior Draw conclusions, hypothesize and verify IMAGE
Image databases Use primitive image analysis to represent content
Transform semantic queries into image features color, shape,
texture spatial relations
Slide 12
Document Images n Scanned Pixel representation of document n
Data Intensive (100-300dpi, 1-24 bpp) n NO EXPLICIT CONTENT n
Document image analysis or manual annotation required takes pixels
-> contents automatic means are not guaranteed Yet we want to be
able to process them like text files! DOCUMENT IMAGE
Slide 13
Document Image Database n Collection of scanned images n Need
to be available for indexing and retrieval, abstracting, routing,
editing, dissemination, interpretation We want classical database
access DOCUMENT DATABASE IMAGE
Slide 14
Information Retrieval Document Understanding Document Image
Retrieval
Slide 15
Retrieval System Model Query Formulation Detection Delivery
Selection Examination Index Docs User Indexing
Slide 16
Document Image Database Applications n Archival Applications
Basic Indexing may be available by manual annotation n Searching
Only High Quality Text Conversion n Searching Lower Quality
Heterogeneous Databases ???? Access To Databases
Slide 17
Managing Document Image Databases n Document Image Databases
are often influenced by traditional DB indexing and retrieval
philosophies We are comfortable with them They work n Problem:
Requires content to be accessible n Techniques: Content based
retrieval (keywords, natural language) Query by structure
(logical/physical) Query by Functional attributes (titles, bold, )
n Requirements: Ability to Browse, search and read
Slide 18
Indexing Page Images (Traditional) Optical Character
Recognition Page Decomposition Scanner Document Page Image
Structure Representation Character or Shape Codes Text Regions
Slide 19
Document Image Analysis
- General Flow: Obtain Image (Digitize), Preprocessing, Feature Extraction, Classification
- General Tasks
  - Logical and Physical Page Structure Analysis
  - Zone Classification
  - Language ID
  - Zone Specific Processing: Recognition, Vectorization
Slide 20
Page Analysis n Skew correction Based on finding the primary
orientation of lines n Image and text region detection Based on
texture and dominant orientation n Structural classification Infer
logical structure from physical layout n Text region classification
Title, author, letterhead, signature block, etc.
Slide 21
Image Detection
Slide 22
Text Region Detection
Slide 23
Language Identification
- Language-independent skew detection
  - Accommodate horizontal and vertical writing
- Script class recognition
  - Asian scripts have blocky characters
  - Connected scripts can't be segmented easily
- Language identification
  - Shape statistics work well for western languages
  - Competing classifiers work for Asian languages
Slide 24
Optical Character Recognition n Pattern-matching approach
Standard approach in commercial systems Segment individual
characters Recognize using a neural network classifier n Hidden
Markov model approach Experimental approach developed at BBN
Segment into sub-character slices Limited lookahead to find best
character choice Useful for connected scripts (e.g., Arabic)
Slide 25
OCR Accuracy Problems
- Character segmentation errors
  - In English, segmentation often changes "m" to "rn"
- Character confusion
  - Characters with similar shapes are often confounded
- OCR on copies is much worse than on originals
  - Pixel bloom, character splitting, binding bend
- Uncommon fonts can cause problems
  - If not used to train a neural network
Slide 26
Measures of OCR Accuracy n Character accuracy n Word accuracy n
IDF coverage n Query coverage
Slide 27
Improving OCR Accuracy
- Image preprocessing
  - Mathematical morphology for bloom and splitting
  - Particularly important for degraded images
- Voting between several OCR engines helps
  - Individual systems depend on specific training data
- Linguistic analysis can correct some errors
  - Use confusion statistics, word lists, syntax, ...
  - But more harmful errors might be introduced
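Voting between OCR engines can be illustrated with a simple majority vote. This is a minimal sketch that assumes each engine's output has already been tokenized into the same number of aligned words; real systems need an alignment step first, and the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(engine_outputs):
    """Word-level majority vote over aligned outputs from several OCR engines.

    Assumes every engine produced the same number of tokens. Ties go to the
    engine listed first, because Counter keeps first-seen order for equal counts.
    """
    return [Counter(tokens).most_common(1)[0][0]
            for tokens in zip(*engine_outputs)]


outputs = [
    ["modern", "systom"],   # engine 1 misreads "system"
    ["modem", "system"],    # engine 2 merges "rn" into "m"
    ["modern", "system"],   # engine 3 reads both correctly
]
print(vote(outputs))
```

Each engine makes a different mistake here, so the vote recovers both words; when engines share training data and make the same mistake, voting cannot help.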
Slide 28
OCR Speed n Neural networks take about a minute a page Hidden
Markov models are slower n Voting can improve accuracy But at a
substantial speed penalty n Easy to speed things up with several
machines For example, by batch processing - using desktop computers
at night
Slide 29
Problem: Logical Page Analysis (Reading Order) n Can be hard to
guess in some cases Newspaper columns, figure captions, appendices,
n Sometimes there are explicit guides Continued on page 4 (but page
4 may be big!) n Structural cues can help Column 1 might continue
to column 2 n Content analysis is also useful Word co-occurrence
statistics, syntax analysis
Slide 30
Processing Converted Text Typical Document Image Indexing n
Convert hardcopy to an electronic document OCR Page Layout Analysis
Graphics Recognition n Use structure to add metadata n Manually
supplement with keywords Use traditional text indexing and
retrieval techniques?
Slide 31
Information Retrieval on OCR n Primarily non-objective (no use
of fielded data) n Requires robust ways of indexing n Statistical
methods with large documents work best n Key Evaluations Success
for high quality OCR (Croft et al 1994, Taghva 1994) Limited
success for poor quality OCR (1996 TREC, UNLV) Clustering
successful for > 85% accuracy (Tsuda et al, 1995)
Slide 32
Proposed Solutions
- Improve OCR
- Automatic Correction
  - Taghva et al, 1994
- Enhance IR techniques
  - Lopresti and Zhou, 1996
  - N-Grams
Applications
- Cornell CS TR Collection (Lagoze et al, 1995)
- Degraded Text Simulator (Doermann and Yao, 1995)
Slide 33
N-Grams
- Powerful, inexpensive statistical method for characterizing populations
- Approach
  - Split up the document into overlapping n-character sequences
  - Use traditional indexing representations to perform analysis
  - DOCUMENT -> DOC, OCU, CUM, UME, MEN, ENT
- Advantages
  - Statistically robust to small numbers of errors
  - Rapid indexing and retrieval
  - Works from 70%-85% OCR accuracy, where traditional IR fails
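The DOCUMENT example on this slide can be reproduced in a few lines. The sketch below also shows why n-grams tolerate OCR errors: a single misread character damages only the n-grams that span it. The function names and the use of Jaccard overlap as the similarity measure are illustrative assumptions.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(a, b):
    """Set overlap between two n-gram lists (one possible similarity measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


clean = char_ngrams("DOCUMENT")
noisy = char_ngrams("DOCUMEMT")   # simulated OCR error: N read as M

print(clean)
print(jaccard(clean, noisy))
```

A word-based index would treat "DOCUMEMT" as a completely different term, while the trigram representation still shares four of its six trigrams with the clean word.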
Slide 34
Matching with OCR Errors
- Above 80% character accuracy, use words
  - With linguistic correction
- Between 75% and 80%, use n-grams
  - With n somewhat shorter than usual
  - And perhaps with character confusion statistics
- Below 75%, use word-length shape codes
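The thresholds on this slide amount to a small decision rule. The sketch below encodes them directly; how the exact boundary values (75% and 80%) are assigned is an assumption, since the slide does not say, and the function name is hypothetical.

```python
def choose_index_terms(char_accuracy):
    """Pick an indexing unit from estimated OCR character accuracy (0.0 to 1.0).

    Boundary handling is an assumption: exactly 0.80 and 0.75 fall in the
    n-gram band here.
    """
    if char_accuracy > 0.80:
        return "words"          # with linguistic correction
    if char_accuracy >= 0.75:
        return "n-grams"        # with n somewhat shorter than usual
    return "shape codes"        # word-length shape codes


for accuracy in (0.92, 0.78, 0.60):
    print(accuracy, "->", choose_index_terms(accuracy))
```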
Slide 35
Handwriting Recognition n With stroke information, can be
automated Basis for input pads n Simple things can be read without
strokes Postal addresses, filled-in forms n Free text requires
human interpretation But repeated recognition is then possible
Slide 36
Conversion? n Full Conversion often required n Conversion is
difficult! Noisy data Complex Layouts Non-text components Points to
Ponder n Do we really need to convert? n Can we expect to fully
describe documents without assumptions?
Slide 37
Researchers are seeing a progression from full conversion to image-based approaches
- Applications
  - Indexing and Retrieval
  - Information Extraction
  - Duplicate Detection
  - Clustering (Document Similarity)
  - Summarization
- Advantages
  - Makes use of powerful image properties (Function, IVC 1998)
  - Can be cheaper than conversion
  - Makes use of redundancy in the language
Slide 38
What The Rest of This Class is NOT about
- Physical or Electronic Document Management
  - Imaging, archiving, routing, storage, retrieval
- Text Analysis or Information Retrieval
- Details on Classical Document Image Analysis
  - Preprocessing, OCR, Layout Analysis
Rather, let's ask: Can we manage document image databases without complete and accurate conversion?
Slide 39
Why do without Conversion? n To accomplish a specific task n To
narrow the scope for traditional processing n Some things can not
be (easily) converted! Image, Graphics, Handwriting n Sometimes
(often) metadata is not present! n Conversion is expensive!
Slide 40
Focus Methods which do NOT require complete and accurate
conversion n Processing Converted Text n Manipulating Images of
Text n Indexing Based on Structure n Graphics and Drawings n
Related Work and Applications
Slide 41
Outline
Slide 42
Processing Images of Text n Characteristics Does not require
expensive OCR/Conversion Applicable to filtering applications May
be more robust to noise n Possible Disadvantages Application domain
may be very limited Processing time may be an issue if indexing is
otherwise required
Slide 43
Proper Noun Detection (DeSilva and Hull, 1994)
- Problem: Filter proper nouns in images of text
  - People, Places, Things
- Advantages of the Image Domain:
  - Saves converting all of the text
  - Allows application of word recognition approaches
  - Limits post-processing to a subset of words
  - Able to use features which are not available in the text
- Approach:
  - Identify Word Features: capitalization, location, length, and syntactic categories
  - Classify using rule-set
  - Achieve 75-85% accuracy without conversion
Slide 44
Keyword Spotting
Techniques:
- Word Shape/HMM (Chen et al, 1995)
- Word Image Matching (Trenkle and Vogt, 1993; Hull et al)
- Character Stroke Features (DeCurtins and Chen, 1995)
- Shape Coding (Tanaka and Torii; Spitz 1995; Kia, 1996)
Applications:
- Filing System (Spitz - SPAM, 1996)
- Numerous IR
- Processing handwritten documents
Formal Evaluation:
- Scribble vs. OCR (DeCurtins, SDIUT 1997)
Slide 45
Shape Coding n Approach Use of Generic Character Descriptors
Make Use of Power of Language to resolve ambiguity Map Character
based on Shape features including ascenders, descenders,
punctuation and character with holes
Slide 46
Shape Codes n Group all characters that have similar shapes {A,
B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W,
X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0} {a, c, e, n, o, r, s, u, v, x,
z} {b, d, h, k, } {f, t} {g, p, q, y} {i, j, l, 1} {m, w}
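A shape-code mapping along the lines of the groups on the Shape Codes slide might look like the sketch below. The character groups are taken from the slide; the single-letter class labels ("A", "x", "b", ...) and the function name are hypothetical choices made for illustration.

```python
import string

# Character groups as listed on the slide; the class labels are assumptions.
GROUPS = {
    "A": string.ascii_uppercase + "234567890",  # capitals and tall digits
    "x": "acenorsuvxz",                         # x-height characters
    "b": "bdhk",                                # ascenders
    "f": "ft",
    "g": "gpqy",                                # descenders
    "i": "ijl1",
    "m": "mw",                                  # wide characters
}
CHAR_TO_CODE = {ch: code for code, chars in GROUPS.items() for ch in chars}


def shape_code(word):
    """Map each character to its shape class; unknown characters pass through."""
    return "".join(CHAR_TO_CODE.get(ch, ch) for ch in word)


print(shape_code("Document"))
print(shape_code("cat"), shape_code("oat"))
```

Different words can collapse to the same code, as "cat" and "oat" do here, which is why the approach leans on the redundancy of the language to resolve ambiguity.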
Spatial Matching and Indexing
- Establish correspondence
  - Create a directed mapping from each region in the first document to the region it overlaps in the second
- Restrict horizontal mappings
  - Refine the mapping such that one->many relations occur only for vertical regions
- Perform inverse mapping
- Identify maximal area overlap subset
- Compute similarity
Graphics
- Maps and Drawings
  - Lorenz and Monagan, 1995
  - Samet and Soffer, 1995
  - Amlani and Kasturi, 1988
- Graphs
  - Koga et al, 1993
- Logos and Icons
  - Jaisimha et al, 1996
  - Doermann et al, 1996
  - Gudivada and Raghavan, 1993
- Technical Drawings
  - Syeda-Mahmood, 1995
Slide 65
Map Interpretation Samet et al n Identify Legend on the Map
Image n Extract Images map labels and descriptions n Identify
labels in the map images n Allow user to query based on extracted
images n Bootstraps the information extraction and interpretation
problems
Slide 66
Outline
Slide 67
Related to Image Indexing
- Using captions to index images
  - Shrihari, 1995
  - Rowe, 1995
- Identifying text in images (advertisements, etc.)
Document Image Matching
- Indexing and Retrieval
- Duplicate Image Detection
Slide 68
Document Image Matching n Characteristics Identify Similarity
Based on Image Features Use either structure or content attributes
Applications to retrieval, filtering and database management n J.
Hull (DAS 1994) n J. Hull, J. Cullen and M. Pearis (SPIE 1997) n L.
Spitz (SPIE 1997) n D. Doermann, H. Li and O. Kia (ICDAR 1997)
Slide 69
Duplicate Document Image Detection n Problem Detection of
Duplicate Document Images based on Image Properties n Assumptions
Very large collections Dynamic database No prior human interaction
Goal: To identify an appropriate index for the document
Slide 70
Duplicate Detection n Same content, same format For example, a
xerox copy n Same content, different format For example, as a web
page or on paper n Shared content, same format For example, a paper
with annotations n Shared content, different format For example,
including text with cut-and-paste
Slide 71
Approach
- Use global features to restrict search
  - Number of pages, number of lines, page moments
- Extract a signature using shape codes
- Convert the signature to a set of n-gram keys used to index the database
- Rank and verify
  - Return top N documents
  - Visual or algorithmic refinement
- Advantages: robust to noise, extracted quickly, extracted easily, efficiently stored
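The signature-and-keys idea can be sketched with an inverted index over n-gram keys. This is a minimal illustration, not the system described on the slide: the signatures are made-up shape-code strings, the key length of 5 is an arbitrary assumption, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_keys(signature, n=5):
    """Set of overlapping n-gram keys drawn from a shape-code signature."""
    return {signature[i:i + n] for i in range(len(signature) - n + 1)}


def build_index(signatures, n=5):
    """Inverted index: n-gram key -> ids of documents containing that key."""
    index = defaultdict(set)
    for doc_id, signature in signatures.items():
        for key in ngram_keys(signature, n):
            index[key].add(doc_id)
    return index


def rank_candidates(query_signature, index, n=5, top_n=3):
    """Rank documents by the number of keys they share with the query."""
    votes = Counter()
    for key in ngram_keys(query_signature, n):
        for doc_id in index.get(key, ()):
            votes[doc_id] += 1
    return votes.most_common(top_n)


# Made-up shape-code signatures for two stored documents.
database = {
    "doc_a": "AxxxmxxfAmxgxxfxbxxx",
    "doc_b": "xfxgxbxAxxmfxxgbxxiA",
}
index = build_index(database)

# A rescanned copy of doc_a with one shape code misread (x -> i).
query = "AxxxmxxfAmigxxfxbxxx"
print(rank_candidates(query, index))
```

One misread code only destroys the keys that overlap it, so the noisy copy still shares most of its keys with the original. The ranked list would then go to the verification step.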
Evaluation n The usual approach: Model-based evaluation Apply
confusion statistics to an existing collection n A bit better:
Print-scan evaluation Scanning is slow, but availability is no
problem n Best: Scan-only evaluation No existing IR collections
have printed materials
Slide 79
Summary n Many applications benefit from image based indexing
Less discriminatory features Features may therefore be easier to
compute More robust to noise Often computationally more efficient n
Many classical IR techniques have application for DIR n Structure
as well as content are important for indexing n Preservation of
structure is essential for in-depth understanding
Slide 80
Page Classification
- Document Page Images
  - Scanned or Digital Page Images
  - Typically from Multi-page Documents
  - Page Images as Individual, Independent Images
- Classification
  - Determine the Category of Page Image
- Classes, Types, or Genres
  - Visually Similar Pages (Query by Example)
  - Explicit/Specific Genre (Query by Type): Business Letters, Memos, Forms, Articles, Ads, Maps
  - User-Specified (Query by Need)
Slide 81
Applications n Automatic Document Structure Analysis From
Layout Analysis To Logical Analysis n Document Image Retrieval n
Organization, Navigation, and Visualization of Document Collection
n Automatic Document Routing and Cataloging
Slide 82
Related Work n Requires OCR Dengel et al. (DFKI): OfficeMAID
Hao et al. (NJIT) n Requires (or Assumes) Type Knowledge/Models
Maderlechner et al. (Siemens AG) Taylor et al. (Lockheed Martin):
Part of IDUS Dengel et al. (DFKI): OfficeMAID Hao et al.
(NJIT)
Slide 83
Issues and Challenges n Issues How do we define classes? How
can we automate the learning process? n Challenges Without OCR
Without Domain-Specific Structural Information Human Relevance
Assessment is Subjective User-Defined Types
Slide 84
Goals and Approach
- Goals
  - Produce a Database of Relevance Judgements
  - Build Classifiers
    - Visual Similarity Decision Tree Classifier
    - Title Page Decision Tree Classifier
  - Evaluate Classifiers
    - Judge Correlation between Human Assessments and our Classifiers
- Approach
  - Visual Similarity-Based Human Relevance Judgments
  - User-Specified Types
  - Supervised Classification
Slide 85
User Relevance Test n Objectives Gather Human Relevance
(Similarity) Judgments n Sample Images University of Washington
Document Image Database I (UW-I) 979 Technical Article Images n
Query Images 12 Structurally Distinct Representatives of Layout n 7
Test Subjects Rated Database Images with Similarity Ratings from
1(Minimally Similar) to 5 (Very Similar) n Results Number of
Judgments Relevance Ratings
Slide 86
Test Query Images
Slide 87
Distribution of # of Judgements
Slide 88
Distribution of Relevance Rating
Slide 89
Assigning Class Labels n Using Data from User Relevance Test
Number of Judgments Relevance Rating n Assigning Class Labels Based
on Number of Judgments Based on Average Relevance Rating Based on
Combination n Total Scores Confidence Measures of Class Labels Used
in Preparing Training Sets
Slide 90
Classification: Experimental Protocol n Rank Each Class by
Strength of Training Images n Segment the Database into 7 Subsets
1/7th of each Class, ~135 Images Per Subset n Train and Test using
Leave-One-Out Re-sampling Method n Generate Accuracy Rates for each
Subset to show Correlation between User Evaluation and Classifier n
Run the General Classification Tree on the Entire Class
Slide 91
Computation of Total Score
- Total Score = f(relevance (R), # of judgements (J))
- Candidate functions: R + J, R * J, SQRT(R*R + J*J)
- [Figure: scale running from WORST to BEST, labeled R * J]
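The three candidate combinations of relevance rating and judgment count named on this slide are easy to compare side by side. The sketch below only evaluates them; the sample values are invented, and the transcript does not make clear how the study scaled R and J before combining them.

```python
import math


def total_scores(relevance, judgments):
    """Evaluate the slide's three candidate score functions for one image.

    `relevance` is an average rating R and `judgments` a count J; any
    rescaling the study applied before combining them is not modeled here.
    """
    return {
        "R + J": relevance + judgments,
        "R * J": relevance * judgments,
        "sqrt(R^2 + J^2)": math.hypot(relevance, judgments),
    }


# Invented sample values, for illustration only.
for name, score in total_scores(3, 4).items():
    print(f"{name}: {score}")
```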
Slide 92
Distribution of Total Scores
Slide 93
Slide 94
Decision Tree Classifiers n Using OC1 (Oblique Decision Tree
Classifier) to Build Decision Trees Oblique Decision Boundaries
Linear Combination of Features n Image Features University of
Washington Ground-truth Zone/Region Level Segmentation 59
Features
Slide 95
Experiments n 12 Page Class Problem Overall Accuracy Septile
Partitions n Title Page Problem Combined Classes #4 and #9 Overall
Accuracy Septile Partitions
Slide 96
12 Page Classification Accuracy
Slide 97
Slide 98
Example Title Pages (#4 & #9)
Slide 99
Title Page Classification Accuracy
Slide 100
Title Page Overall Accuracy
- 57 title pages, 891 non-title pages
- Overall Accuracy = 906/948 = 95.57%
- Title Page Accuracy = 37/57 = 64.91%
- False Positives = 22
- False Negatives = 20
- Observations
  - All without Type-Specific Information
  - Need Functional (or Logical) Features
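The figures on this slide are mutually consistent, which can be confirmed by rebuilding the confusion matrix from the counts given. In the sketch below the precision value is derived from those counts and does not appear on the slide; the slide's "Title Page Accuracy" corresponds to what is usually called recall.

```python
# Counts stated on the slide.
title_pages = 57
non_title_pages = 891
false_positives = 22
false_negatives = 20

# Derived confusion-matrix cells.
true_positives = title_pages - false_negatives       # 37
true_negatives = non_title_pages - false_positives   # 869
total = title_pages + non_title_pages                 # 948

overall_accuracy = (true_positives + true_negatives) / total
title_page_recall = true_positives / title_pages      # the slide's "Title Page Accuracy"
title_page_precision = true_positives / (true_positives + false_positives)  # not on the slide

print(f"overall accuracy: {overall_accuracy:.2%}")
print(f"title-page recall: {title_page_recall:.2%}")
print(f"title-page precision: {title_page_precision:.2%}")
```

The high overall accuracy is driven mostly by the large non-title class, while over a third of the title pages are missed, which matches the slide's observation that functional or logical features are still needed.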
Slide 101
Title Page Classification Errors False Positives False
Negatives
Slide 102
12 Page Classification Accuracy
Slide 103
Test Query Images
Slide 104
Observation of 12 Page Classification n Correlation Among
Classes 66 possible pairs #4 and #9: 91% (Rank #1) #5 and #6: 52%
(Rank #2) #1 and #2: 48% (Rank #3) #10 and #11: 40% (Rank #4) n How
many is the right number? Combine Most Similar Classes
Slide 105
8 Page Classification Accuracy
Slide 106
Slide 107
Summary and Future Work n Document Page Classification
Classifies Visually Similar Pages Is based on Human Relevance
Judgments Page Classification Results are Good n Future Work
Automatic Feature Extraction Indexing Genre-Based Search
Slide 108
Agenda n Questions n Definitions - Document, Image, Retrieval n
Document Image Analysis n Traditional Indexing with Conversion n
Doing things Without Conversion n Recent work on IR with Chinese
document images Tseng and Oard
Slide 109
Document Retrieval Approaches for Images of Text
- Full-text search based on manually re-keying the text
  - Prohibitively expensive at large scale
- Search based on bibliographic metadata
  - May be difficult to adequately describe the materials
- Full text based on Optical Character Recognition (OCR)
  - Inexpensive and relatively rapid
  - Sensitive to OCR accuracy
Slide 110
Key Questions for Information Retrieval n What to index?
Phrase, words, character, or shape codes Unigrams or n-grams n How
to weight a term in a document? Term frequency (TF) Document
frequency (DF) Document length normalization (Term position) n How
to assign scores to documents? Boolean, vector space, and
probabilistic models
Slide 111
Chinese Text Retrieval Issues
- Words may be any number of characters (typically 2-5)
  - But some contain only 1 character or more than 5 characters, e.g., (cat), (UNESCO)
- Longer words (over 2 characters) often have shorter sub-word units
  - Transliteration is an exception
- Written Chinese has no word separator
  - A sentence can be segmented in different ways, all of which may be legal
  - Similar to the phrase detection problem in English
- Chinese character inventory is very large
  - 13,500 characters in Big-5 code (traditional Chinese: Taiwan and Hong Kong)
  - Over 6,000 characters in GB code (simplified Chinese: China, Singapore)
  - About 3,000 commonly used characters in each character set
Slide 112
Socio-Cultural Research Center (SCRC) Collection n 800,000
newspaper clippings from 1950-1976 Scanned over 300,000 at 300 dpi
n 30 China, Hong Kong, and Taiwan news agencies Mostly simplified
Chinese, some traditional Chinese n Focus on diplomatic and
military activities
Slide 113
Document Preparation
- Selected 11,108 scanned document images
- OCR yielded 8,438 valid docs (Presto! OCR Pro, Big-5)
  - Avg valid document had a 69% system-reported recognition rate
  - Computed on a sample of 1,300 documents
- Second version prepared using Big-5 to GB conversion
  - GB version used in experiments
Slide 114
Topic Preparation n Based on contemporaneous Chinese journal
articles From 100 paper titles, 30 were selected and rewritten as
Chinese topics n Made English translations for cross-language
experiments Translated by native speakers of Chinese 12
Anti-Chinese Movements Activities related to the anti-Chinese
movements in Indonesia Articles must deal with activities related
to the anti-Chinese movement in Indonesia; case reports or articles
dealing with PRC's criticism of the Anti-Chinese movement will be
considered partly relevant.
Slide 115
Relevance Judgments
- Exhaustive tri-state relevance judgments
  - Irrelevant (=0), partially relevant (=1), fully relevant (=2)
- Every topic-document pair judged by 3 assessors
  - 2 majored in history, 1 majored in library science
  - Averaged 4 minutes per document image (for all 30 topics)
- Sum of the judgments provides a final estimate
  - 0 = not relevant, 1-5 = partially relevant, 6 = fully relevant
  - Threshold as desired to reflect the intended application
  - In our experiments, any score > 0 is treated as relevant
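The aggregation of assessor judgments can be written out as a short function. The sketch assumes three assessors each giving 0, 1, or 2, so the sum ranges from 0 to 6, and it treats every sum strictly between 0 and 6 as partially relevant (an assumption about the middle band). The function name and return shape are illustrative.

```python
def final_relevance(judgments, threshold=0):
    """Combine three tri-state judgments (0, 1, or 2 each) into a final estimate.

    Returns (sum, label, is_relevant). The default threshold mirrors the
    slide's "any score > 0 is treated as relevant".
    """
    if len(judgments) != 3 or any(j not in (0, 1, 2) for j in judgments):
        raise ValueError("expected three judgments, each 0, 1, or 2")
    total = sum(judgments)
    if total == 0:
        label = "not relevant"
    elif total == 6:
        label = "fully relevant"
    else:
        label = "partially relevant"   # assumed to cover sums 1 through 5
    return total, label, total > threshold


print(final_relevance([0, 0, 0]))
print(final_relevance([2, 1, 0]))
print(final_relevance([2, 2, 2]))
```

Raising `threshold` gives a stricter notion of relevance for applications where a single partial vote should not count.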
Slide 116
Chinese OCR Text Retrieval Strategies
- Indexing method: both 1-gram (for partial match) and 2-gram (for preserving sequence)
  - Example: ABC will be indexed with A, B, C, AB, BC
  - Compared to 1-gram only and 2-gram only
- Weighting scheme:
  - Document terms: TF*IDF = log(1 + tf) * log(N/df)
  - Query terms: tf * (3w-1), where w is the length of the term
- Retrieval model: vector space model compared with probabilistic model
- Document length normalization: byte size for document terms, compared to cosine
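The indexing and weighting scheme on this slide can be sketched as follows. Two points are assumptions rather than facts from the transcript: the logarithm base (natural log is used here) and the literal reading of the query weight as tf times (3 * w - 1). The function names are hypothetical.

```python
import math


def index_terms(text):
    """Combined 1-gram and 2-gram index terms for a string of characters."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams


def document_weight(tf, df, n_docs):
    """TF*IDF = log(1 + tf) * log(N / df); natural log is an assumption."""
    return math.log(1 + tf) * math.log(n_docs / df)


def query_weight(tf, term):
    """tf * (3w - 1), reading the slide's expression literally (an assumption)."""
    w = len(term)
    return tf * (3 * w - 1)


print(index_terms("ABC"))
print(query_weight(1, "A"), query_weight(1, "AB"))
```

Under this reading a 2-gram in the query counts 2.5 times as much as a 1-gram with the same frequency, so sequence-preserving matches dominate while 1-grams still allow partial matches when OCR has damaged one character of a pair.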
Slide 117
OCR and Length Normalization
- Experiments by Taghva et al showed that some sophisticated weighting schemes shown to be more effective for ordinary text might lead to more unstable results for OCR-degraded text.
- Singhal, Salton, Buckley [96] analyzed this phenomenon
  - Vector space model (SMART system)
  - Word-based indexing
  - Simulated OCR output of a TREC collection (2 GB, 742,202 docs)
  - 50 TREC queries (numbered from 151 to 200)
  - Specifically, the effects of cosine normalization and IDF are analyzed
  - Incorrect terms like "systom" have large IDF and thus affect the weights of other terms in the same document if cosine normalization is used
  - They correct this problem by using byte size normalization: (byte size)^0.375
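The effect described on this slide can be seen in a toy comparison. The sketch reads the slide's 0.375 as an exponent applied to the document's byte size (an assumption where the transcript has lost the superscript); the term weights and the 256-byte document size are invented for illustration.

```python
import math


def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {term: w / norm for term, w in weights.items()}


def byte_size_normalize(weights, doc_bytes, exponent=0.375):
    """Divide each term weight by (byte size) ** exponent."""
    norm = doc_bytes ** exponent
    return {term: w / norm for term, w in weights.items()}


# Invented weights: the OCR error "systom" is rare, so its IDF-driven weight is large.
clean = {"system": 2.0, "design": 2.0}
noisy = {"system": 2.0, "design": 2.0, "systom": 9.0}

print(cosine_normalize(clean)["system"], cosine_normalize(noisy)["system"])
print(byte_size_normalize(clean, 256)["system"], byte_size_normalize(noisy, 256)["system"])
```

Under cosine normalization the garbled term drags down the weight of every correct term in the document. Byte-size normalization depends only on document length, so the correct terms keep their weights.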
Slide 118
Results Summary
[Figure: Mean Average Precision for Long Queries and Title Queries]
- 1+2 gram is best
- ByteSize beats Cosine
- Long queries beat Titles
- Inquery does well
Slide 119
Blind Relevance Feedback for Query Expansion n Reweight
existing terms, add related terms: We used =5 and =1 n Blind:
assume the top k documents are relevant High precision at low
recall is desirable We used k=5 n Longer n-grams typically improve
precision Lead to a very large set of index terms for Chinese n
Iterate between merging and pruning Merge common 2-grams to get
3-grams Prune low-frequency 3-grams Repeat, merging common 3-grams
to get 4-grams, O(n m) time, where n=length of input string,
m=longest repeated pattern
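Blind relevance feedback of the kind described here is often implemented as a Rocchio-style update, sketched below. This is not necessarily the formula used in the study: the two weighting symbols are missing from the transcript, so the parameter names `orig_weight` and `feedback_weight`, and the assignment of the values 5 and 1 to them in the example, are assumptions.

```python
from collections import Counter


def blind_feedback(query, ranked_docs, k, orig_weight, feedback_weight):
    """Expand a query using the top-k retrieved documents, assumed relevant.

    `query` and each document are dicts mapping term -> weight. Existing query
    terms are reweighted and terms from the top-k documents are added, using
    the mean document weight of each term.
    """
    expanded = Counter({term: orig_weight * w for term, w in query.items()})
    top = ranked_docs[:k]
    for doc in top:
        for term, w in doc.items():
            expanded[term] += feedback_weight * w / len(top)
    return dict(expanded)


query = {"a": 1.0}
ranked = [{"a": 2.0, "b": 4.0}, {"b": 2.0}, {"c": 9.0}]

# Illustrative parameter values; which symbol takes which value is a guess.
print(blind_feedback(query, ranked, k=2, orig_weight=5, feedback_weight=1))
```

Because the top k documents are only assumed to be relevant, OCR-garbled terms in them can enter the expanded query, which is one reason high precision at low recall matters for this technique.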
Slide 120
A Pruned n-gram Example (in English)
n-grams without pruning
1. clusters based on : 3
2. document clustering : 3
3. of Web : 3
4. on the : 3
5. search engines : 3
6. STC is : 2
7. Web document clustering : 2
8. Web search engines : 2
9. clustering methods in this domain : 2
10. requirements of : 2
11. returned by : 2
n-grams after pruning
1. clusters based : 3
2. document clustering : 3
3. Web : 3
4.
5. search engines : 3
6. STC : 2
7. Web document clustering : 2
8. Web search engines : 2
9. clustering methods in this domain : 2
10. requirements : 2
11. returned : 2
Web Document Clustering: A Feasibility Demonstration
Users of Web search engines
are often forced to sift through the long ordered list of document
returned by the engines. The IR community has explored document
clustering as an alternative method of organizing retrieval
results, but clustering has yet to be deployed on the major search
engines. The paper articulates the unique requirements of Web
document clustering and reports on the first evaluation of
clustering methods in this domain. A key requirement is that the
methods create their clusters based on the short snippets returned
by Web search engines. Surprisingly, we find that clusters based on
snippets are almost as good as clusters created using the full text
of Web documents. To satisfy the stringent requirements of the Web
domain, we introduce an incremental, linear time (in the document
collection size) algorithm called Suffix Tree Clustering (STC),
which creates clusters based on phrases shared between documents.
We show that STC is faster than standard clustering methods in this
domain, and argue that Web document clustering via STC is both
feasible and potentially beneficial.
Slide 121
Blind Relevance Feedback Results
Slide 122
Conclusions of Study n The SCRC test collection is useful But
more than 30 topics may be needed for statistical significance n
Indexing 1-grams and 2-grams together works well If 2-grams are
given greater weight in the query n Byte size normalization
outperforms cosine normalization But Inquery does better than
either on short queries n OCR errors adversely affect blind
relevance feedback A clean comparable collection would probably
work better Pruning seems to help Considerable parameter tuning is
needed ( , , and k)
Slide 123
Summary
- Further broadening the definition of document
- Approaches based on conversion to text
- Approaches that don't involve (full) conversion
- Recent work on Chinese document images