INDEX
Abstracts, automatic generation of, 247 Activating conditions, 169-170 Activation network, spreading. See spreading
activation network Affixation
cycle, 49f English, 48t
AIR/SCALIR (user interface), 344 Algorithms
clustering, 352-353 Document Sampler, 253 Document Surrogator, 248-250 merging ranked lists, 123 phrase decomposition, 128 stemming, 26 summarization, 252 supervised vs. unsupervised clusters,
362-363 Ambiguity, 273. See also Disambiguation
syntactic, 101 Amplifiers, 151 Analysis, linguistically-motivated, 1-2 Annotations, 200-201 Answer extraction, 315, 316-317, 324 Answer hypotheses, 316-317
combining evidence for, 328-329 Antonymous adjectives, 87, 88 Arclist, 40 Artificial Intelligence (AI), 289 Associationism, classical, 82-85
laws of, 83 term association and, 85-88, 86f
Association for Computing Machinery (ACM) categories, 334, 336
Augmented relevancy signatures, 167, 168, 175-177
Automatic suggestion of key terms, 225-247 AutoSlog, manual extraction pattern review,
171 AutoSlog-TS, 168, 177-179
automatic dictionary, 193-194 broad category text extraction, 182-185
extraction patterns, 179f flowchart, 178f phrases, 174n subset text extraction, 185-187
Base sample, 253, 256-257 Bayes' Theorem, 306f BBN's stochastic POST tagger, 124 BEAD, graphical document icon, 362 Belew, R., 230, 234 Bely, N., 10 Biber, Douglas, 151 Bibliographic records
meta-data, 336 online systems, 334
Bicknell, E., 76 Borgman, C.L., 262 Bottleneck, knowledge-engineering, 171, 177 Breidt, E., 226 Brill's rule based tagger, 124, 208 ByChart Parser, 208-209
C4.5 classification tool, 161-163 multi-part conditions, 162-163
"Campaign" news topic experiment, 237-239 CANCERLIT, 338, 340, 341f, 349
categories and subheadings, 341f, 357f clustering, 355f reclustering, 358f, 360f, 361f
Candidate terms, 5 Case frames, information extraction, 169-170 Case Law Collection, name frequencies,
267-268 Cat-a-Cone interface, 344, 347f, 348
query specification, 351 Categories, 333-370
definition, 337 document organization and, 340 multiple, 335 presentation of, 335 prevalent words in, 296t primitive, 351 pros/cons, 369-370 taxonomies of, 334
Categorization cluster identification and, 354 criteria, 348 definition, 337 routing, 140-141
Category hierarchies, 335, 340 clustering and, 364-366, 365f clustering results as, 367-368
intersections in, 342 Charles, G., 106 Choueka, Y., 225, 226 Church, K., 85, 88, 225, 226 CIRCUS, 169-170 CLARIT research team, 17, 19, 22 Classification
document meaning, 290 information retrieval, 294 multinomial distribution application,
295-297 routing vs. retrieval, 292-293 word selection, 296-297
Classification and routing, document, 289-309
intuitive model, 291-292 multinomial distribution application,
295-304 sample query, 293t
Clause complexity, approximation of, 150 Closed-class questions, 314-315
MURAX testing, 318 Trivial Pursuit, 319f
Clustering algorithms, 352-354, 364-366 definition, 337 document organization and, 352-363 vs. DynaCat, 367 examples, 354-359, 355f, 358-361 fuzzy, 361 iterative reclustering, 356-359, 357f, 360f,
367 pros/cons, 369 retrieval characteristics, 359-361
Clusters, 333-369 definition, 337 graphical display of, 362-363 unsupervised vs. category-oriented,
363-367 COBWEB, 366 COLLAGE, 273-286
bi-gram generation, 285-286 BRS/SEARCH(t), 282 document ranking, 281-282 language processing, 277 lexical resources, 283 query algebra, 273-274, 277-278, 281 query generation, 281 standard source lookup, 284-285 system data flow, 275-276, 276f topic parsing, 279-281 topic structuring, 278-279 transfer lexicons, 283-284
TREC topic analysis, 274-275 Wordnet, 283
Collections, 201 Collins, A., 81-82 Collocatedness, phrasal terms, 217, 220-221 Collocation research, 225-226 Complex description, 8 Complex term, 8 Compound terms research, 13-14. See also
Multi-word terms full text, 17-21 names, 129-130
Concepts, fixed, 86 Cone Tree, 345 Conflation, 5
multi-word term, 31-33 term variants, 25, 26-28
Conjunction, 273, 275, 286 Conjunctions, extraction patterns and, 244 Constrained domains, 312 Constraint Grammar, 33, 101 Content-based information management,
selected, 21-22 Content word, 3, 26 Contractions, 151 Controlled indexing, 29-38
architecture for, 35-38, 36f dependency relations, 34f feature structure description, 37t multi-word terms, 34-35
Controlled vocabulary, 7, 8, 26, 248 Co-occurrence, 86
evidence, 328 term patterns, 137
Coordinations, 34 Corbin, D., 47 Co-reference expansion, 286 Cowie, Jim, 273-287 Cranfield experiments, 11 CREOLE (Collection of REusable Objects
for Language Engineering), 201, 202-203, 210
Croft, W.B., 12, 13, 21 CYC (knowledge base), 104-105
Databases, full text, 1, 16 LMI retrieval, 16-21
DDC operational system, 12 Decision tree learning techniques, 148 Declaration, annotation type, 200-201 Decomposition, linguistically overt, 8 Deese, J., 87, 88 Demand-based analysis, 312 Derivational analysis, 8-9
preference rules, English, 47-48 preference rules, French, 48-49 transducer rules for, 51t
Derivational morphology, 28, 57-58 Dictionaries
automatic vs. hand-crafted, 193-194 semantic-feature tags, 179
Dillon, M., 12 Disambiguation, 32-33. See also Ambiguity
linguistic knowledge, 44-45 word classes and statistical learning, 45-46 word sense, 107, 108
Discourse interpreter, 209 Discourse model, IE, 206-207 Discourse structure, 137 Discriminant analysis, 162 Document definition, 337 Document matching, primary, 315, 316 Document Sampler, 248, 253-257
algorithm, 253 experiment, 253-254
Document Surrogator, 248-252 algorithm, 248-250 experiment, 250-251
Domain expert, 227, 249, 251, 254 Domain topics
constrained, 31 predefined, 216
Dozier, Christopher C., 261-272 DynaCat system, 348-349, 350f, 351
Categorizer, 348 vs. clustering, 367 Organizer, 349
ECRAN, 211, 212-213 English, version of, 274, 283-284 Equivalent hypotheses, 324-325
scoring, 325-326 "European Politics & Business" experiment,
231-236 Excite, 149 Expectations
precision and recall, 275 query, 316, 317, 318t verifying limited, 326-328
Explicit analysis, 3 Extraction-based text categorization, 167-195,
169-177 Extraction patterns, 167-171
automatically generating, 177-179 top 25, 179f types, 171f
FACILE, 211
Factor analysis, 162 Fagan, Joel L., 12-13, 101 Fang Lin, 113-145 FASTR, 27, 38, 52-70
evaluation, 66-70 forming metarules, 60-66 metarules metagrammar, 53-54 morpho-syntactic variants metagrammar,
56-60, 65-66 syntactic variants metagrammar, 54-56, 63 terms grammar, 52-53
Feature-augmented signatures, 179-180 Features, definition of, 337 File time, 2 Filtering
automatic, 165 clustering and, 360-361 phrasal terms, 242-244 queries, 350-351
Finite-state morphological processor, 26 Focus sample, 237, 239
construction, 245-246 Document Sampler, 253, 254 Document Surrogator, 248-249 term distribution across, 242-244
Foreign country tags stream, 132-133 Fragments stream, 131 Frost, D., 226 Fuhr, N., 76 Full text (databases), 1, 16
LMI retrieval, 16-21 Full-text expansion, 137-138 Functional styles, 148
Gaizauskas, Robert, 197-214 Galaxy of News, graphical document icon,
362 GATE (General Architecture for Text
Engineering), 197-213 design, 201-205 elements of, 202f IE systems flexibility, 199, 200 LaSIE, 204f, 205-210
Gazetteer lookup, 208 GDM (GATE Document Manager), 201, 202 Genres, document, 158 Genuardi, M.T., 21 Genus term, 78 GE/NYU research team, 17, 19 GGI (GATE Graphical Interface), 201
operations, 204-205 Gierl, C., 226 GLIMPSE, 334 Graphical concept space, 344
Gray, A.S., 12 Grefenstette, G., 80 Grolier's Encyclopedia, 314, 317, 318 Guentzer, U., 76 Guthrie, Joe, 289-310 Guthrie, Louise, 289-310
Hahn, U., 14 Hanks, P., 85, 88, 225, 226 Harman, Donna, 29, 76 Hayes, P.J., 21 Head-modifier pairs, 117-118, 121-122
relations, 79-80, 81t spreading activation from, 88-89, 89f, 90f stream, 123-128
Hearst, Marti A., 333-373 Hedges, 151 Hindle, D., 79, 226 Hogg, T., 96 Homographs, dialectical, 284 Huberman, B., 96 Human indexers, 7, 15, 338. See also Manual
judgment Human memory models, 81-85
associationism, classical, 82-85 Collins and Loftus, 83f spreading activation network, 81-82,
88-91
Indexing, simple vs. rich, 70t Indexing language, 7
designs, 8 Index organizations, 323 Index terms, 2, 6-8
content-bearing key, 3 NLP for extraction, 114
Inflectional morphology, 28 InfoCrystal (user interface), 343 Information extraction (IE), 167-169,
169-171 European research, 211-212 full-parse vs. finite state pattern matching,
198-199 limitations, 212-213
Information management, content-based, 99-100
Information retrieval (IR) automatic filtering, 165 categories and clusters for, 333-370 classification vs. routing, 292-293 evaluating NLP in, 113-145 Hierarchical Concept graphs (HCGs),
105-106
knowledge acquisition bottleneck, 311-312
knowledge bases (KBs), 104-105 machine-readable dictionaries (MRDs),
104 NLP experiments, 99-110 retrospective, 293, 295, 303-304 stream architecture, 121-123 stylistics, 147-165 syntactic ambiguities, 101 synsets (Wordnet), 105-106 Term Suggestion Toolkit, 247-248 term weighting, 100-101 tree-structured analytics (TSAs), 101-103 WORDNET, 105-109
INQUERY, 102, 115, 285 Interrogative words, 317, 318t Inverted index, 120
Jacquemin, Christian, 25-74 Jarvelin, K., 76 Jin Wang, 113-145 Jones, P., 85 Justeson, J.S., 32, 87-88, 226, 244 Karlgren, Jussi, 147-166 Karlsson, F., 101 Katz, S.M., 32, 87-88, 226, 244 Klingbiel, P.H., 21 Knowledge bases (KBs), 104-105
manual attention requirements, 248 Knowledge representation, 206 Kristensen, J., 76 Krovetz, Robert, 21, 29, 104 Kupiec, Julian M., 311-332
Language Analysis Systems, Inc., 262 Language theory, 199 LaSIE information extraction system,
197-213 features, 208 GATE application, 205-210 GGI display, 204f meaning representation, 206 modules, 208-209 performance, 209-210
Lattice structure organization, 343, 345 Leistensnider, James, 289-310 Length normalization, 160 Lesk, M., 79 Lewis, D.D., 12, 13, 14, 21 Lewis, P., 79 LexiCadCam, 211 Lexical chaining, 109 Lexical rules, 57
Lexicon-based word normalization, 124-125 LEXIS-NEXIS, 227, 250
data, 217, 220, 224 phrase lists, dictionaries and thesauri, 219
LEXTER, 67 Library of Congress Subject Headings, 7,
334, 336 Linguistically-motivated indexing (LMI), 1,
2-9, 120 basic concepts, 2-6 complex descriptions and terms, 6-9 full text, 16-21
Linguistics, computational, 311-312 Linguistic String Grammar, 126 List Inclusion, 327 LMI. See Linguistically-motivated indexing Locality stream, 131 Loftus, E., 81-82 Longman Dictionary of Contemporary
English, 284 Loose coupling, 203 Lorenzen, Jeffrey, 167-196 Lovins stemmer, 26, 29-30
Machine learning, unsupervised, 366 Machine-readable dictionaries (MRDs), 104 Machine translation, 99 Mann Whitney U rank sum test, 153 Manual judgment, 251, 253-254, 256 Mauldin, M., 14 "Medical Malpractice" legal topic
experiment, 239-242 score comparison, 255t score mapping and distribution, 256t
Medical text, 336-337 MEDLARS, 11 MEDLINE, 339 Merging, 122
precision distribution, 135 ranked lists algorithm testing, 123 score calculation, 133-134 stream coefficients, 136
MeSH (Medical Subject Headings), 334 CANCERLIT categories, 340, 341f, 342 Cat-a-Cone interface, 345-348 DynaCat system, 348-350 hierarchy, 339f
Meta-data, external, 336-337 contentful meta-data and, 336 definition, 337-338
Metagrammar morpho-syntactic variants, 56-60 syntactic variants, 54-56, 56t, 57t, 63
Metarules
metagrammar, 53-54 morpho-syntactic variants, 60t, 61t tree structure, finite-state automaton,
feature structures, 53f Metathesaurus, UMLS, 338-340, 348 MIKROKOSMOS, 210-211 Miller, George, 106 Mohri, M., 27 Morphological analysis, 29-35, 38-51
derivational analysis, 47-49 Dynamic Approach, 39 inflectional analysis, 41-43
part of speech disambiguation, 43-46 Morphological stemming, 121, 124-125
language type and, 30-31 Morpho-syntactic analysis, 33 Morpho-syntactic variants, 27-29, 65-66, 69
hetero-categorical, 67t iso-categorical, 66t metagrammar, 56-60
Morph (stemmer), 208 MUC-4, 182-185 MUC-6
Named Entity Task, 262, 264 results requirements, 209
MUCs (Message Understanding Conferences), 22, 169-171
Multi-faceted topics, 275 Multinomial distribution application,
295-304, 306f Boolean Expression, 302f expected probabilities, 305t results, 305t results probability, 306t routing using, 306f, 306-309, 307t, 308t,
309t, 310t top 10 documents, 301t use of, 304-306
Multiple categories, 335, 340 Multiple clusters, 361-362 Multi-word terms, 26. See also Compound
terms research conflation, 31-33 controlled indexing, 34-35 suggestion, 230-231
MURAX, 311-332 answer extraction, 324-331 closed-class questions, 318 description, 314 output from, 315f secondary queries, 329-331 system architecture, 319-324 system overview, 320f
NameFinder (Carnegie Group), 263 Name matcher LaSIE module, 209 Name matching, 261-270, 272
natural language queries, 269-270 problems and issues, 262-264 TREC topics, 274
Name recognition, 263 case law documents, 264-265, 268t, 272 COLLAGE, 278 retrieval performance, 267, 267t
Name searching, 263, 267t, 269, 317 NameTag (IsoQuest), 263
name recognition rules, 266 NASA Kennedy Space Center, 318 National Library of Medicine, 338 Natural Language Processing (NLP), 1-22,
113-143 categories and clusters, 333-369 deep or shallow, 311-312 early experiments, 10-16 information extraction (IE), 167-169 information retrieval, 99, 118-121,
142-143 linguistically-motivated indexing, 2-9 other roles for, 21-22 term variant extraction, 25-71 text processing uses, 289-309 TREC Program, 17-21
NLI (non-linguistic indexing), 4 vs. LMI, 5-6, 17
Noun phrases, 224, 244 AutoSlog-TS rules, 177 closed-class questions, 315 long, 127-128 MURAX analysis, 319-321, 327 organizing answer text by, 318 simple stream, 128-129 TREC topics recognition, 279
Nymble (BBN), 266-267
Organization answer text, 318 index, 323 meta-data, 336-337 retrieval results, 334-336 search results, 313-314 stream, 123
Overgeneration, 58-59 advantages of, 50-51
Paradigmatic classifications, 8 Paradigmatic relations, 86, 87 Paragraph picking, 140 Parsing. See also Tagged Text Parser
deeper parse trees, 161 local transformations metarules, 27 partial, 58
Part-of-speech tagging, 26, 124 COLLAGE, 279
linguistic knowledge, 44-45 MURAX, 326 primary document matching, 316, 319-321 skip-and-fit, 126-127 word classes and statistical learning, 45-46
PATR-II, 27 Pattern-matching, 32-33, 198-199
AutoSlog, 171 extraction, 167-171, 177-179 MURAX, 326f, 326-328 prepositions, 172, 174-175, 224, 244 term association, 77
Pedersen, J.O., 21 Penn Treebank tagset, 124 Pereira, F., 27, 39 Perez-Carballo, Jose, 113-145 Permutations, 34 Phrasal terms, 215-259
automatic suggestion, 225-247, 236-242 base sample, 228t collocatedness, 217, 220-221 component word types, 220t focused sample and base sample, 227 off-the-shelf, 216 manually determined, 227, 231-236 statistical, 11 study, off-the-shelf vs. selected, 217-225 syntactic, 11 term evaluation, 244-246 term filtering, 242-244 topic groups experiment, 251t TREC topic analysis, 274 variations generation research, 226
Phrase categories and proximity, 221t Phrase decomposition algorithm, 128 Phrase expansion, 281 Phrase extraction, 117-118, 121
syntactic and statistical, 119 two-phase, 128
Phrase indexing, 31-32 Phrase normalization, 121-122 Porter's stemmer, 29-30 Postcoordinate indexing, 9
vs. precoordinate indexing, 2 Precedence rules
English, 47-48 French, 48-49
Precision, 256-257 average, 143t
change, query by query, 164f Cornell/Sabir SMART, 141t degradation due to overgeneration, 69 distribution estimates, selected streams,
135t expected level of, 275 extraction-based text categorization,
182-184 improvement, NLRI vs. SMART baselines,
141t stems-only retrieval, 134t word-augmented relevancy signatures, 191
Preclassified texts, 177, 179 Precoordinate indexing
vs. postcoordinate indexing, 2 Prefix. See Affixation Prepositions, extraction patterns and, 172,
174-175, 224, 244 Primary matches, 316
scoring, 323-324 Primary query construction, 321-322 PRISE, 122, 133-134 Procedural answers, 318 Pronoun counts, 151
personal, 152, 160 Proper name extraction, 122. See also Name
matching Proximity, fixed or nondirectional, 217-219 Pruning criteria, 139
Queries, 2 ad-hoc, 17, 349, 361-362 broadening of, 321-322 characteristics of, 317-318 closed-class, 314-315 development of, 115 exploiting, 313-314 filtering and routing, 350-352 interrogative words and expectations, 318t long vs. short, 114 narrowing of, 322 primary, 319-322 reformulation of, 322 relevance judgments on, 158-159 secondary, 329-331, 330f
Query algebra, 273-274, 277-278 Query expansion experiments, 136-140
ad-hoc runs, 140-141 automatic, 139-140 manual guidelines, 138-139 purpose, 136-138 routing runs, 141-142
Rada, R., 76
Ranking clusters, 362 COLLAGE document, 281-282 probability retrieval method, 294 relevance, 222 vector space, 294,361
Recall, 256-257
degradation due to undergeneration, 69 expected level of, 275
Relevance characteristics of, 160-161 multi-faceted TREC topics, 275 ranking, 222 stylistics and, 158-161
Relevance feedback, 137-138 Smart routing and, 142
Relevance rate extraction pattern, 177 signature, 172
Relevancy signatures, 167, 168, 172-175 broad category extractions, 182-185 subset extractions, 188-197
Relevant words broad category extractions, 182-185 subset extractions, 185-187 subsubset extractions, 188-191
Request, search. See Query Research, constraints on, 199 Resnik, P., 106 Retrieval results, organizing, 334-336 Retrospective retrieval, 293, 295, 303-304 Rich indexing, 70t Riley, M., 27, 39 Riloff, Ellen, 167-196 Role fillers, 175, 180
relevant, 181f word-augmented relevancy signatures,
187, 190-191 Routing, document, 289-309
Boolean test, 301-303 classification vs. retrieval, 292-293 document frequency measure, 300, 300f,
301f information retrieval, 294 multinomial distribution application,
297-303, 307t, 308t, 309t, 310t performance, 299-300 TREC-5 evaluation, 303 word selection, 298-299 zero word counts, 299
RUBRIC system, 351 Ruge, Gerda, 75-98
Sager, Naomi, 126
Salton, Gerard, 10-11, 167 Sanderson, Mark, 104 Scatter/Gather clustering approach, 352-363
iterative reclustering, 356-358, 357f, 360f Schütze, H., 21 Search engines, WWW, 149, 334 Search Lead, 250 Search results, organizing, 313-314 Search Software America, 262 Search time, 2 Secondary queries, 329-331 Semantic feature hierarchy, 179 Sentence analyzer, conceptual, 169-170,
172-173 augmented relevancy signatures, 176
Sentence length, 150 Sentence splitter, 208 Siegfried, S.L., 262 Signature, 172
lack of role fillers, 175 multiple vs. single analysis, 173-174
Silvester, J.P., 21 Similarity
measure, 89-90, 92-95, 95t semantic, 106, 107-108, 109, 113
Simple indexing, 70t Single word terms, suggesting, 228-229 Skip-and-fit recovery, 126 Slot triples, 179-180
relevant, 175-177 relevant for terrorism, 176f
Smadja, F., 225 SMART 14, 102, 115, 122, 124, 132, 136,
140---141 Smeaton, Alan F., 99-111 Spanish verb inflections, 40t
partial transducer for jugar, 41f Spanish verb stems, 40t Sparck Jones, Karen, 1-24 SPARKLE, 211 Spearman's rho, 102, 153, 154t Spelling variations, 283-284 Spreading activation network, 81-82, 84f,
88-91 from head/modifier relations, 88-89 vs. semantically similar words, 89-90,
92-95, 95t synonyms and modifiers, 90-91 valuation of, 95-96
Sproat, R., 27, 39 Standard Industrial Classification Manual
(SIC), 284-285 Standing queries, 350-352 Statistical phrases, 11
Statistics corpus analysis, 78-79, 114, 312 human associationism, 84-85 multi-word terms suggestion, 230-231,
235t non-parametric multivariate, 153, 162 lexical, 150t, 161t single word terms suggestion, 228-229,
232t text-based, 149-150 two-word terms suggestion, 229-230 variable correlation, 153-154 weighting schemes for compound terms,
120-121 Steier, A., 230, 234 Stemming, 6
algorithm, 26 morphological, 30-31 techniques of, 29-30
Stems stream, 130 Stopwords elimination, 121 Straszheim, Troy, 154-158 Stream architecture, 115-116, 121-123
advanced, 123-133 foreign country tags, 132-133 fragments, 131 head+modifier pairs, 123-128 locality, 131 merging and weighting, 133-136 performance, 132t proper name extraction, 129-130 simple noun phrases, 128-129 stems, 130 unstemmed-word, 131
Stream organization, 122f, 123 Strube, G., 85-87 Strzalkowski, Tomek, 80, 113-145 Stylistics, 147-166
information retrieval experiments, 147-165
items, 148 item variables, 149-152 precision and, 161-163 relevance and, 158-161 visualizing variation in, 154-158
Subject codes, Library of Congress, 334 Substitutions, 34 Suffix. See Affixation Summarization algorithm, 252 Synapsies, 34-35 Syntactic ambiguities, 101 Syntactic analysis, 4
local, 25 Syntactic complexity, 150
Syntactic parsing, 117-118 Syntactic phrases, 11 Syntactic variants, 27-29, 69
metagrammar, 54-56, 63 Syntagmatic classifications, 8 Syntagmatic relations, 86, 87 SYNTOL-type indexing, 10
Table Lens (user interface), 344 Tagged text parser (TTP), 125-127 Tagger. See Part-of-speech tagging Tait, J.I., 12, 14 Templatability, boundaries of, 212 Term association, 75-96
associationism, classical, 85-88 experiments, 92-95 first or second order, 78 lexicon analysis, 78 linguistically-based corpus analysis, 79-81 pseudo-classification, 77 statistic corpus analysis, 78-79 text patterns, 77 user observation, 78
Termbanks, 284 Term evaluation, 244-246 Term filtering, 242-244 Termino (grammar), 32 Terminology, changes in, 283-284 Term Suggestion Toolkit, 247-248, 253 Term variants, 25-71, 76
conflation to linguistic variants analysis, 26-28
controlled indexing and variant extraction, 29-35
controlled indexing architecture, 35-38 FASTR for term variant extraction, 52-70 morphological analysis, 38-51
Term weighting, 100-101 compound terms, 120-121 SMART and, 132t statistical, 2 stream architecture, 133-136
Text flavor of a, 290-291, 295-296 meaning of a, 290
Text-based representation, 116 Text-based statistics, 149-150, 312 Text categorization, extraction-based,
167-195 attack category, 185-187, 186f bombing category, 188f, 188-191 experimental results, 182-194 fully automatic, 181-182 kidnapping category, 191f, 191-193
terrorism category, 182-185, 185f Text labels, 343 Text simplification, 31-32
Text tiles, 150 relevant document scores, 161
Tfidf measure, 296f
ThemeScapes, graphical document icon, 362 Thesaurus, 7, 26, 75-77
relevance-feedback, 137 syntactic analysis, 10-11
Thompson, Paul, 261-272 Tight coupling, 203 TIPSTER program, 22, 197-198, 200-201 Tokenizer, 208-209 Topics
predefined and manually determined, 231t predefined domain, 216 relevance of, 293 routing, 292-293 TREC natural language analysis, 273-286
TREC-5, 137-138 classification-based routing system, 303 retrieved text characteristics, 160-161 simulated routing mode, 140-141 stylistic variation and relevance, 159-160
TREC (Text REtrieval Conference) Program, 114
document classification and routing, 294-295
linguistically-motivated indexing, 17-21 natural language topic analysis, 273-286
TREE, 211 Tree-structured analytics (TSAs), 101-103 Trigger words, 169-170, 174 Trivial Pursuit, 318, 319f TTP (tagged text parser), 126-127 Turing, Alan, 289 Turtle, H.R., 12, 13 Two-word terms, 229-230, 233t, 234t Tzoukermann, Evelyne, 25-74
UMass research team, 17 UNIMEM, 366 Unstemmed-word stream, 131 Unsupervised clusters, 335, 363-366 User interface
information access, 369-370 search results, 343-349
Validation, 317-318 van Rijsbergen, C.J., 368 Variation, visualizing, 154-158
genres, 158 nonrandom intentional, 156
principal components analysis, 157-158 random, 156
Vector space retrieval method, 294 Verb forms, 172, 175, 316
passive, 189 relations between question and answer,
317, 327-328 Verification evidence, 328-329 VIBE (user interface), 343
Wall Street Journal, 148-150, 152, 159-160 type-token ratio, 157f
West FED collection, 264-265, 267 Wettler, M., 82-85 Wilcoxon's rank sum test, 153 Wilks, Yorick, 197-214 Word-augmented relevancy signatures, 167,
168, 179-182 broad category extractions, 182-185 subset extractions, 185-187 subsubset extractions, 188, 190-191
Word-based representation, 116-117 Word-based statistics, 149 Word intervals, 216 Wordnet, 104-105, 283
criticism of, 109-110 information retrieval, 105-109 MURAX proposal, 331 TREC topic parsing, 280-281
World Wide Web Excite (search engine), 149 search engines, 334, 336
Wrapper functions, 202-203
Xtract, 225
Yahoo!, 343
Zhou, Joe, 215-259
Text, Speech and Language Technology
1. H. Bunt and M. Tomita (eds.): Recent Advances in Parsing Technology. 1996 ISBN 0-7923-4152-X
2. S. Young and G. Bloothooft (eds.): Corpus-Based Methods in Language and Speech Processing. 1997 ISBN 0-7923-4463-4
3. T. Dutoit: An Introduction to Text-to-Speech Synthesis. 1997 ISBN 0-7923-4498-7
4. L. Lebart, A. Salem and L. Berry: Exploring Textual Data. 1998 ISBN 0-7923-4840-0
5. J. Carson-Berndsen: Time Map Phonology. 1998 ISBN 0-7923-4883-4
6. P. Saint-Dizier (ed.): Predicative Forms in Natural Language and in Lexical Knowledge Bases. 1999 ISBN 0-7923-5499-0
7. T. Strzalkowski (ed.): Natural Language Information Retrieval. 1999 ISBN 0-7923-5685-3
KLUWER ACADEMIC PUBLISHERS - DORDRECHT / BOSTON / LONDON