INDEX

Abstracts, automatic generation of, 247 Activating conditions, 169-170 Activation network, spreading. See spreading

activation network Affixation

cycle, 49f English, 48t

AIR/SCALIR (user interface), 344 Algorithms

clustering, 352-353 Document Sampler, 253 Document Surrogator, 248-250 merging ranked lists, 123 phrase decomposition, 128 stemming, 26 summarization, 252 supervised vs. unsupervised clusters,

362-363 Ambiguity, 273. See also Disambiguation

syntactic, 101 Amplifiers, 151 Analysis, linguistically-motivated, 1-2 Annotations, 200-201 Answer extraction, 315, 316-317, 324 Answer hypotheses, 316-317

combining evidence for, 328-329 Antonymous adjectives, 87, 88 Arclist, 40 Artificial Intelligence (AI), 289 Associationism, classical, 82-85

laws of, 83 term association and, 85-88, 86f

Association for Computing Machinery (ACM) categories, 334, 336

Augmented relevancy signatures, 167, 168, 175-177

Automatic suggestion of key terms, 225-247 AutoSlog, manual extraction pattern review,

171 AutoSlog-TS, 168, 177-179

automatic dictionary, 193-194 broad category text extraction, 182-185

extraction patterns, 179f flowchart, 178f phrases, 174n subset text extraction, 185-187

Base sample, 253, 256-257 Bayes' Theorem, 306f BBN's stochastic POST tagger, 124 BEAD, graphical document icon, 362 Belew, R., 230, 234 Bely, N., 10 Biber, Douglas, 151 Bibliographic records

meta-data, 336 online systems, 334

Bicknell, E., 76 Borgman, C.L., 262 Bottleneck, knowledge-engineering, 171, 177 Breidt, E., 226 Brill's rule based tagger, 124, 208 ByChart Parser, 208-209

C4.5 classification tool, 161-163 multi-part conditions, 162-163

"Campaign" news topic experiment, 237-239 CANCERLIT, 338, 340, 341f, 349

categories and subheadings, 341f, 357f clustering, 355f reclustering, 358f, 360f, 361f

Candidate terms, 5 Case frames, information extraction, 169-170 Case Law Collection, name frequencies,

267-268 Cat-a-Cone interface, 344, 347f, 348

query specification, 351 Categories, 333-370

definition, 337 document organization and, 340 multiple, 335 presentation of, 335 prevalent words in, 296t primitive, 351 pros/cons, 369-370 taxonomies of, 334

Categorization cluster identification and, 354 criteria, 348 definition, 337 routing, 140-141

Category hierarchies, 335, 340 clustering and, 364-366, 365f clustering results as, 367-368

intersections in, 342 Charles, G., 106 Choueka, Y., 225, 226 Church, K., 85, 88, 225, 226 CIRCUS, 169-170 CLARIT research team, 17, 19, 22 Classification

document meaning, 290 information retrieval, 294 multinomial distribution application,

295-297 routing vs. retrieval, 292-293 word selection, 296-297

Classification and routing, document, 289-309

intuitive model, 291-292 multinomial distribution application,

295-304 sample query, 293t

Clause complexity, approximation of, 150 Closed-class questions, 314-315

MURAX testing, 318 Trivial Pursuit, 319f

Clustering algorithms, 352-354, 364-366 definition, 337 document organization and, 352-363 vs. DynaCat, 367 examples, 354-359, 355f, 358-361 fuzzy, 361 iterative reclustering, 356-359, 357f, 360f,

367 pros/cons, 369 retrieval characteristics, 359-361

Clusters, 333-369 definition, 337 graphical display of, 362-363 unsupervised vs. category-oriented,

363-367 COBWEB, 366 COLLAGE, 273-286

bi-gram generation, 285-286 BRS/SEARCH(t), 282 document ranking, 281-282 language processing, 277 lexical resources, 283 query algebra, 273-274, 277-278, 281 query generation, 281 standard source lookup, 284-285 system data flow, 275-276, 276f topic parsing, 279-281 topic structuring, 278-279 transfer lexicons, 283-284

TREC topic analysis, 274-275 Wordnet, 283

Collections, 201 Collins, A., 81-82 Collocatedness, phrasal terms, 217, 220-221 Collocation research, 225-226 Complex description, 8 Complex term, 8 Compound terms research, 13-14. See also

Multi-word terms full text, 17-21 names, 129-130

Concepts, fixed, 86 Cone Tree, 345 Conflation, 5

multi-word term, 31-33 term variants, 25, 26-28

Conjunction, 273, 275, 286 Conjunctions, extraction patterns and, 244 Constrained domains, 312 Constrained Grammar, 33, 101 Content-based information management,

selected, 21-22 Content word, 3, 26 Contractions, 151 Controlled indexing, 29-38

architecture for, 35-38, 36f dependency relations, 34f feature structure description, 37t multi-word terms, 34-35

Controlled vocabulary, 7, 8, 26, 248 Co-occurrence, 86

evidence, 328 term patterns, 137

Coordinations, 34 Corbin, D., 47 Co-reference expansion, 286 Cowie, Jim, 273-287 Cranfield experiments, 11 CREOLE (Collection of REusable Objects

for Language Engineering), 201, 202-203, 210

Croft, W.B., 12, 13, 21 CYC (knowledge base), 104-105

Databases, full text, 1, 16 LMI retrieval, 16-21

DDC operational system, 12 Decision tree learning techniques, 148 Declaration, annotation type, 200-201 Decomposition, linguistically overt, 8 Deese, J., 87, 88 Demand-based analysis, 312 Derivational analysis, 8-9

preference rules, English, 47-48 preference rules, French, 48-49 transducer rules for, 51t

Derivational morphology, 28, 57-58 Dictionaries

automatic vs. hand-crafted, 193-194 semantic-feature tags, 179

Dillon, M., 12 Disambiguation, 32-33. See also Ambiguity

linguistic knowledge, 44-45 word classes and statistical learning, 45-46 word sense, 107, 108

Discourse interpreter, 209 Discourse model, IE, 206-207 Discourse structure, 137 Discriminant analysis, 162 Document definition, 337 Document matching, primary, 315, 316 Document Sampler, 248, 253-257

algorithm, 253 experiment, 253-254

Document Surrogator, 248-252 algorithm, 248-250 experiment, 250-251

Domain expert, 227, 249, 251, 254 Domain topics

constrained, 31 predefined, 216

Dozier, Christopher C., 261-272 DynaCat system, 348-349, 350f, 351

Categorizer, 348 vs. clustering, 367 Organizer, 349

ECRAN, 211, 212-213 English, version of, 274, 283-284 Equivalent hypotheses, 324-325

scoring, 325-326 "European Politics & Business" experiment,

231-236 Excite, 149 Expectations

precision and recall, 275 query, 316, 317, 318t verifying limited, 326-328

Explicit analysis, 3 Extraction-based text categorization, 167-195,

169-177 Extraction patterns, 167-171

automatically generating, 177-179 top 25, 179f types, 171f

FACILE, 211

Factor analysis, 162 Fagan, Joel L., 12-13, 101 Fang Lin, 113-145 FASTR, 27, 38, 52-70

evaluation, 66-70 forming metarules, 60-66 metarules metagrammar, 53-54 morpho-syntactic variants metagrammar,

56-60, 65-66 syntactic variants metagrammar, 54-56, 63 terms grammar, 52-53

Feature-augmented signatures, 179-180 Features, definition of, 337 File time, 2 Filtering

automatic, 165 clustering and, 360-361 phrasal terms, 242-244 queries, 350-351

Finite-state morphological processor, 26 Focus sample, 237, 239

construction, 245-246 Document Sampler, 253, 254 Document Surrogator, 248-249 term distribution across, 242-244

Foreign country tags stream, 132-133 Fragments stream, 131 Frost, D., 226 Fuhr, N., 76 Full text (databases), 1, 16

LMI retrieval, 16-21 Full-text expansion, 137-138 Functional styles, 148

Gaizauskas, Robert, 197-214 Galaxy of News, graphical document icon,

362 GATE (General Architecture for Text

Engineering), 197-213 design, 201-205 elements of, 202f IE systems flexibility, 199, 200 LaSIE, 204f, 205-210

Gazetteer lookup, 208 GDM (GATE Document Manager), 201, 202 Genres, document, 158 Genuardi, M.T., 21 Genus term, 78 GE/NYU research team, 17, 19 GGI (GATE Graphical Interface), 201

operations, 204-205 Gierl, C., 226 GLIMPSE, 334 Graphical concept space, 344

Gray, A.S., 12 Grefenstette, G., 80 Grolier's Encyclopedia, 314, 317, 318 Guentzer, U., 76 Guthrie, Joe, 289-310 Guthrie, Louise, 289-310

Hahn, U., 14 Hanks, P., 85, 88, 225, 226 Harman, Donna, 29, 76 Hayes, P.J., 21 Head-modifier pairs, 117-118, 121-122

relations, 79-80, 81t spreading activation from, 88-89, 89f, 90f stream, 123-128

Hearst, Marti A., 333-373 Hedges, 151 Hindle, D., 79, 226 Hogg, T., 96 Homographs, dialectical, 284 Huberman, B., 96 Human indexers, 7, 15, 338. See also Manual

judgment Human memory models, 81-85

associationism, classical, 82-85 Collins and Loftus, 83f spreading activation network, 81-82,

88-91

Indexing, simple vs. rich, 70t Indexing language, 7

designs, 8 Index organizations, 323 Index terms, 2, 6-8

content-bearing key, 3 NLP for extraction, 114

Inflectional morphology, 28 InfoCrystal (user interface), 343 Information extraction (IE), 167-169,

169-171 European research, 211-212 full-parse vs. finite state pattern matching,

198-199 limitations, 212-213

Information management, content-based, 99-100

Information retrieval (IR) automatic filtering, 165 categories and clusters for, 333-370 classification vs. routing, 292-293 evaluating NLP in, 113-145 Hierarchical Concept graphs (HCGs),

105-106

knowledge acquisition bottleneck, 311-312

knowledge bases (KBs), 104-105 machine-readable dictionaries (MRDs),

104 NLP experiments, 99-110 retrospective, 293, 295, 303-304 stream architecture, 121-123 stylistics, 147-165 syntactic ambiguities, 101 synsets (WordNet), 105-106 Term Suggestion Toolkit, 247-248 term weighting, 100-101 tree-structured analytics (TSAs), 101-103 WORDNET, 105-109

INQUERY, 102, 115, 285 Interrogative words, 317, 318t Inverted index, 120

Jacquemin, Christian, 25-74 Jarvelin, K., 76 Jin Wang, 113-145 Jones, P., 85 Justeson, J.S., 32, 87-88, 226, 244 Karlgren, Jussi, 147-166 Karlsson, F., 101 Katz, S.M., 32, 87-88, 226, 244 Klingbiel, P.H., 21 Knowledge bases (KBs), 104-105

manual attention requirements, 248 Knowledge representation, 206 Kristensen, J., 76 Krovetz, Robert, 21, 29, 104 Kupiec, Julian M., 311-332

Language Analysis Systems, Inc., 262 Language theory, 199 LaSIE information extraction system,

197-213 features, 208 GATE application, 205-210 GGI display, 204f meaning representation, 206 modules, 208-209 performance, 209-210

Lattice structure organization, 343, 345 Leistensnider, James, 289-310 Length normalization, 160 Lesk, M., 79 Lewis, D.D., 12, 13, 14, 21 Lewis, P., 79 LexiCadCam, 211 Lexical chaining, 109 Lexical rules, 57

Lexicon-based word normalization, 124-125 LEXIS-NEXIS, 227, 250

data, 217, 220, 224 phrase lists, dictionaries and thesauri, 219

LEXTER, 67 Library of Congress Subject Headings, 7,

334, 336 Linguistically-motivated indexing (LMI), 1,

2-9, 120 basic concepts, 2-6 complex descriptions and terms, 6-9 full text, 16-21

Linguistics, computational, 311-312 Linguistic String Grammar, 126 List Inclusion, 327 LMI. See Linguistically-motivated indexing Locality stream, 131 Loftus, E., 81-82 Longman Dictionary of Contemporary

English, 284 Loose coupling, 203 Lorenzen, Jeffrey, 167-196 Lovin's stemmer, 26, 29-30

Machine learning, unsupervised, 366 Machine-readable dictionaries (MRDs), 104 Machine translation, 99 Mann Whitney U rank sum test, 153 Manual judgment, 251, 253-254, 256 Mauldin, M., 14 "Medical Malpractice" legal topic

experiment, 239-242 score comparison, 255t score mapping and distribution, 256t

Medical text, 336-337 MEDLARS, 11 MEDLINE, 339 Merging, 122

precision distribution, 135 ranked lists algorithm testing, 123 score calculation, 133-134 stream coefficients, 136

MeSH (Medical Subject Headings), 334 CANCERLIT categories, 340, 341f, 342 Cat-a-Cone interface, 345-348 DynaCat system, 348-350 hierarchy, 339f

Meta-data, external, 336-337 contentful meta-data and, 336 definition, 337-338

Metagrammar morpho-syntactic variants, 56-60 syntactic variants, 54-56, 56t, 57t, 63

Metarules

metagrammar, 53-54 morpho-syntactic variants, 60t, 61t tree structure, finite-state automaton,

feature structures, 53f Metathesaurus, UMLS, 338-340, 348 MIKROKOSMOS, 210-211 Miller, George, 106 Mohri, M., 27 Morphological analysis, 29-35, 38-51

derivational analysis, 47-49 Dynamic Approach, 39 inflectional analysis, 41-43

part of speech disambiguation, 43-46 Morphological stemming, 121, 124-125

language type and, 30-31 Morpho-syntactic analysis, 33 Morpho-syntactic variants, 27-29, 65-66, 69

hetero-categorical, 67t iso-categorical, 66t metagrammar, 56-60

Morph (stemmer), 208 MUC-4, 182-185 MUC-6

Named Entity Task, 262, 264 results requirements, 209

MUCs (Message Understanding Conferences), 22, 169-171

Multi-faceted topics, 275 Multinomial distribution application,

295-304, 306f Boolean Expression, 302f expected probabilities, 305t results, 305t results probability, 306t routing using, 306f, 306-309, 307t, 308t,

309t, 310t top 10 documents, 301t use of, 304-306

Multiple categories, 335, 340 Multiple clusters, 361-362 Multi-word terms, 26. See also Compound

words research conflation, 31-33 controlled indexing, 34-35 suggestion, 230-231

MURAX, 311-332 answer extraction, 324-331 closed-class questions, 318 description, 314 output from, 315f secondary queries, 329-331 system architecture, 319-324 system overview, 320f

NameFinder (Carnegie Group), 263 Name matcher LaSIE module, 209 Name matching, 261-270, 272

natural language queries, 269-270 problems and issues, 262-264 TREC topics, 274

Name recognition, 263 case law documents, 264-265, 268t, 272 COLLAGE, 278 retrieval performance, 267, 267t

Name searching, 263, 267t, 269, 317 NameTag (IsoQuest), 263

name recognition rules, 266 NASA Kennedy Space Center, 318 National Library of Medicine, 338 Natural Language Processing (NLP), 1-22,

113-143 categories and clusters, 333-369 deep or shallow, 311-312 early experiments, 10-16 information extraction (IE), 167-169 information retrieval, 99, 118-121,

142-143 linguistically-motivated indexing, 2-9 other roles for, 21-22 term variant extraction, 25-71 text processing uses, 289-309 TREC Program, 17-21

NLI (non-linguistic indexing), 4 vs. LMI, 5-6, 17

Noun phrases, 224, 244 AutoSlog-TS rules, 177 closed-class questions, 315 long, 127-128 MURAX analysis, 319-321, 327 organizing answer text by, 318 simple stream, 128-129 TREC topics recognition, 279

Nymble (BBN), 266-267

Organization answer text, 318 index, 323 meta-data, 336-337 retrieval results, 334-336 search results, 313-314 stream, 123

Overgeneration, 58-59 advantages of, 50-51

Paradigmatic classifications, 8 Paradigmatic relations, 86, 87 Paragraph picking, 140 Parsing. See also Tagged Text Parser

deeper parse trees, 161 local transformations metarules, 27 partial, 58

Part-of-speech tagging, 26, 124 COLLAGE, 279

linguistic knowledge, 44-45 MURAX, 326 primary document matching, 316, 319-321 skip-and-fit, 126-127 word classes and statistical learning, 45-46

PATR-II, 27 Pattern-matching, 32-33, 198-199

AutoSlog, 171 extraction, 167-171, 177-179 MURAX, 326f, 326-328 prepositions, 172, 174-175, 224, 244 term association, 77

Pedersen, J.O., 21 Penn Treebank tagset, 124 Pereira, F., 27, 39 Perez-Carballo, Jose, 113-145 Permutations, 34 Phrasal terms, 215-259

automatic suggestion, 225-247, 236-242 base sample, 228t collocatedness, 217, 220-221 component word types, 220t focused sample and base sample, 227 off-the-shelf, 216 manually determined, 227, 231-236 statistical, 11 study, off-the-shelf vs. selected, 217-225 syntactic, 11 term evaluation, 244-246 term filtering, 242-244 topic groups experiment, 251t TREC topic analysis, 274 variations generation research, 226

Phrase categories and proximity, 221t Phrase decomposition algorithm, 128 Phrase expansion, 281 Phrase extraction, 117-118, 121

syntactic and statistical, 119 two-phase, 128

Phrase indexing, 31-32 Phrase normalization, 121-122 Porter's stemmer, 29-30 Postcoordinate indexing, 9

vs. precoordinate indexing, 2 Precedence rules

English, 47-48 French, 48-49

Precision, 256-257 average, 143t

change, query by query, 164f Cornell/Sabir SMART, 141t degradation due to overgeneration, 69 distribution estimates, selected streams,

135t expected level of, 275 extraction-based text categorization,

182-184 improvement, NLIR vs. SMART baselines,

141t stems-only retrieval, 1345 word-augmented relevancy signatures, 191

Preclassified texts, 177, 179 Precoordinate indexing

vs. postcoordinate indexing, 2 Prefix. See Affixation Prepositions, extraction patterns and, 172,

174-175, 224, 244 Primary matches, 316

scoring, 323-324 Primary query construction, 321-322 PRISE, 122, 133-134 Procedural answers, 318 Pronoun counts, 151

personal, 152, 160 Proper name extraction, 122. See also Name

matching Proximity, fixed or nondirectional, 217-219 Pruning criteria, 139

Queries, 2 ad-hoc, 17, 349, 361-362 broadening of, 321-322 characteristics of, 317-318 closed-class, 314-315 development of, 115 exploiting, 313-314 filtering and routing, 350-352 interrogative words and expectations, 318t long vs. short, 114 narrowing of, 322 primary, 319-322 reformulation of, 322 relevance judgments on, 158-159 secondary, 329-331, 330f

Query algebra, 273-274, 277-278 Query expansion experiments, 136-140

ad-hoc runs, 140-141 automatic, 139-140 manual guidelines, 138-139 purpose, 136-138 routing runs, 141-142

Rada, R., 76

Ranking clusters, 362 COLLAGE document, 281-282 probability retrieval method, 294 relevance, 222 vector space, 294, 361

Recall, 256-257

degradation due to undergeneration, 69 expected level of, 275

Relevance characteristics of, 160-161 multi-faceted TREC topics, 275 ranking, 222 stylistics and, 158-161

Relevance feedback, 137-138 Smart routing and, 142

Relevance rate extraction pattern, 177 signature, 172

Relevancy signatures, 167, 168, 172-175 broad category extractions, 182-185 subset extractions, 188-197

Relevant words broad category extractions, 182-185 subset extractions, 185-187 subsubset extractions, 188-191

Request, search. See Query Research, constraints on, 199 Resnik, P., 106 Retrieval results, organizing, 334-336 Retrospective retrieval, 293, 295, 303-304 Rich indexing, 70t Riley, M., 27, 39 Riloff, Ellen, 167-196 Role fillers, 175, 180

relevant, 181f word-augmented relevancy signatures,

187, 190-191 Routing, document, 289-309

Boolean test, 301-303 classification vs. retrieval, 292-293 document frequency measure, 300, 300f,

301f information retrieval, 294 multinomial distribution application,

297-303, 307t, 308t, 309t, 310t performance, 299-300 TREC-5 evaluation, 303 word selection, 298-299 zero word counts, 299

RUBRIC system, 351 Ruge, Gerda, 75-98

Sager, Naomi, 126

Salton, Gerard, 10-11, 167 Sanderson, Mark, 104 Scatter/Gather clustering approach, 352-363

iterative reclustering, 356-358, 357f, 360f Schütze, H., 21 Search engines, WWW, 149, 334 Search Lead, 250 Search results, organizing, 313-314 Search Software America, 262 Search time, 2 Secondary queries, 329-331 Semantic feature hierarchy, 179 Sentence analyzer, conceptual, 169-170,

172-173 augmented relevancy signatures, 176

Sentence length, 150 Sentence splitter, 208 Siegfried, S.L., 262 Signature, 172

lack of role fillers, 175 multiple vs. single analysis, 173-174

Silvester, J.P., 21 Similarity

measure, 89-90, 92-95, 95t semantic, 106, 107-108, 109, 113

Simple indexing, 70t Single word terms, suggesting, 228-229 Skip-and-fit recovery, 126 Slot triples, 179-180

relevant, 175-177 relevant for terrorism, 176f

Smadja, F., 225 SMART, 14, 102, 115, 122, 124, 132, 136,

140-141 Smeaton, Alan F., 99-111 Spanish verb inflections, 40t

partial transducer for jugar, 41f Spanish verb stems, 40t Sparck Jones, Karen, 1-24 SPARKLE, 211 Spearman's rho, 102, 153, 154t Spelling variations, 283-284 Spreading activation network, 81-82, 84f,

88-91 from head/modifier relations, 88-89 vs. semantically similar words, 89-90,

92-95, 95t synonyms and modifiers, 90-91 valuation of, 95-96

Sproat, R., 27, 39 Standard Industrial Classification Manual

(SIC), 284-285 Standing queries, 350-352 Statistical phrases, 11

Statistics corpus analysis, 78-79, 114, 312 human associationism, 84-8h multi-word terms suggestion, 230-231,

235t non-parametric multivariate, 153, 162 lexical, 150t, 161t single word terms suggestion, 228-229,

232t text-based, 149-150 two-word terms suggestion, 229-230 variable correlation, 153-154 weighting schemes for compound terms,

120-121 Steier, A., 230, 234 Stemming, 6

algorithm, 26 morphological, 30-31 techniques of, 29-30

Stems stream, 130 Stopwords elimination, 121 Straszheim, Troy, 154-158 Stream architecture, 115-116, 121-123

advanced, 123-133 foreign country tags, 132-133 fragments, 131 head+modifier pairs, 123-128 locality, 131 merging and weighting, 133-136 performance, 132t proper name extraction, 129-130 simple noun phrases, 128-129 stems, 130 unstemmed-word, 131

Stream organization, 122f, 123 Strube, G., 85-87 Strzalkowski, Tomek, 80, 113-145 Stylistics, 147-166

information retrieval experiments, 147-165

items, 148 item variables, 149-152 precision and, 161-163 relevance and, 158-161 visualizing variation in, 154-158

Subject codes, Library of Congress, 334 Substitutions, 34 Suffix. See Affixation Summarization algorithm, 252 Synapsies, 34-35 Syntactic ambiguities, 101 Syntactic analysis, 4

local, 25 Syntactic complexity, 150

Syntactic parsing, 117-118 Syntactic phrases, 11 Syntactic variants, 27-29, 69

metagrammar, 54-56, 63 Syntagmatic classifications, 8 Syntagmatic relations, 86, 87 SYNTOL-type indexing, 10

Table Lens (user interface), 344 Tagged text parser (TTP), 125-127 Tagger. See Part-of-speech tagging Tait, J.I., 12, 14 Templatability, boundaries of, 212 Term association, 75-96

associationism, classical, 85-88 experiments, 92-95 first or second order, 78 lexicon analysis, 78 linguistically-based corpus analysis, 79-81 pseudo-classification, 77 statistic corpus analysis, 78-79 text patterns, 77 user observation, 78

Termbanks, 284 Term evaluation, 244-246 Term filtering, 242-244 Termino (grammar), 32 Terminology, changes in, 283-284 Term Suggestion Toolkit, 247-248, 253 Term Variance, 25-71, 76

conflation to linguistic variants analysis, 26-28

controlled indexing and variant extraction, 29-35

controlled indexing architecture, 35-38 FASTR for term variant extraction, 52-70 morphological analysis, 38-51

Term weighting, 100-101 compound terms, 120-121 SMART and, 132t statistical, 2 stream architecture, 133-136

Text flavor of a, 290-291, 295-296 meaning of a, 290

Text-based representation, 116 Text-based statistics, 149-150, 312 Text categorization, extraction-based,

167-195 attack category, 185-187, 186f bombing category, 188f, 188-191 experimental results, 182-194 fully automatic, 181-182 kidnapping category, 191f, 191-193

terrorism category, 182-185, 185f Text labels, 343 Text simplification, 31-32

Text Tiles, 150 relevant document scores, 161

Tfidf measure, 296f

ThemeScapes, graphical document icon, 362 Thesaurus, 7, 26, 75-77

relevance-feedback, 137 syntactic analysis, 10-11

Thompson, Paul, 261-272 Tight coupling, 203 TIPSTER program, 22, 197-198, 200-201 Tokenizer, 208-209 Topics

predefined and manually determined, 231t predefined domain, 216 relevance of, 293 routing, 292-293 TREC natural language analysis, 273-286

TREC-5, 137-138 classification-based routing system, 303 retrieved text characteristics, 160-161 simulated routing mode, 140-141 stylistic variation and relevance, 159-160

TREC (Text REtrieval Conference) Program, 114

document classification and routing, 294-592

linguistically-motivated indexing, 17-21 natural language topic analysis, 273-286

TREE, 211 Tree-structured analytics (TSAs), 101-103 Trigger words, 169-170, 174 Trivial Pursuit, 318, 319f TTP (tagged text parser), 126-127 Turing, Alan, 289 Turtle, H.R., 12, 13 Two-word terms, 229-230, 233t, 234t Tzoukermann, Evelyne, 25-74

UMass research team, 17 UNIMEM, 366 Unstemmed-word stream, 131 Unsupervised clusters, 335, 363-366 User interface

information access, 369-370 search results, 343-349

Validation, 317-318 van Rijsbergen, C.J., 368 Variation, visualizing, 154-158

genres, 158 nonrandom intentional, 156

principal components analysis, 157-158 random, 156

Vector space retrieval method, 294 Verb forms, 172, 175, 316

passive, 189 relations between question and answer,

317, 327-328 Verification evidence, 328-329 VIBE (user interface), 343

Wall Street Journal, 148-150, 152, 159-160 type-token ratio, 157f

West FED collection, 264-265, 267 Wettler, M., 82-85 Wilcoxon's rank sum test, 153 Wilks, Yorick, 197-214 Word-augmented relevancy signatures, 167,

168, 179-182 broad category extractions, 182-185 subset extractions, 185-187 subsubset extractions, 188, 190-191

Word-based representation, 116-117 Word-based statistics, 149 Word intervals, 216 Wordnet, 104-105, 283

criticism of, 109-110 information retrieval, 105-109 MURAX proposal, 331 TREC topic parsing, 280-281

World Wide Web Excite (search engine), 149 search engines, 334, 336

Wrapper functions, 202-203

Xtract, 225

Yahoo!, 343

Zhou, Joe, 215-259

Text, Speech and Language Technology

1. H. Bunt and M. Tomita (eds.): Recent Advances in Parsing Technology. 1996 ISBN 0-7923-4152-X

2. S. Young and G. Bloothooft (eds.): Corpus-Based Methods in Language and Speech Processing. 1997 ISBN 0-7923-4463-4

3. T. Dutoit: An Introduction to Text-to-Speech Synthesis. 1997 ISBN 0-7923-4498-7

4. L. Lebart, A. Salem and L. Berry: Exploring Textual Data. 1998 ISBN 0-7923-4840-0

5. J. Carson-Berndsen: Time Map Phonology. 1998 ISBN 0-7923-4883-4

6. P. Saint-Dizier (ed.): Predicative Forms in Natural Language and in Lexical Knowledge Bases. 1999 ISBN 0-7923-5499-0

7. T. Strzalkowski (ed.): Natural Language Information Retrieval. 1999 ISBN 0-7923-5685-3

KLUWER ACADEMIC PUBLISHERS - DORDRECHT / BOSTON / LONDON