The London School of Economics Computer-Based Bibliography of Statistical Literature

Author(s): Susan Jones
Source: Journal of the Royal Statistical Society. Series A (General), Vol. 137, No. 2 (1974), pp. 219-226
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2344549



J. R. Statist. Soc. A (1974), 137, Part 2, p. 219

The London School of Economics Computer-Based Bibliography of Statistical Literature

By SUSAN JONES

Computer Services Unit, London School of Economics

SUMMARY

The following paper gives an account of a computer-based information retrieval system on statistical literature which has been developed at L.S.E. over the past 4 years and which is now available to all Fellows of the Royal Statistical Society. It attempts to enumerate the decisions which were made, and which continue to be made as the work progresses, and to show how these decisions affect the kind of service which can be offered to users of the system.

Keywords: COMPUTER-BASED INFORMATION RETRIEVAL; STATISTICAL LITERATURE; L.S.E. SYSTEM

1. BRIEF HISTORY OF THE PROJECT

In 1967 Dr R. Churchhouse, then of the Atlas Computer Laboratory, implemented a retrieval system for computer science journals, using the Atlas machine. His system was based on one set up in 1965 by Professor M. Kessler at M.I.T. In 1968 the Statistics Department at L.S.E. decided to do something similar with a number of statistical journals, and work began on preparing the data. In 1970 the first programs for dealing with these data were written for the University of London CDC 6600 machine. Since then the programs have been progressively improved, the data base has been enlarged by the periodic addition of new material, and the system has been made available to all institutions possessing a suitable terminal connected to the University of London computer system. Recently, the opportunity to use the system was extended to all Fellows of the Royal Statistical Society for an experimental period of 2 years.

Meanwhile, contact with the Atlas Computer Laboratory has been maintained; files of data are exchanged periodically so that researchers at the Laboratory are able to use the statistical bibliography and members of London University have access to the Computer Science system. A similar arrangement is made with University College Cardiff, where Dr Churchhouse is now Professor of Computing. In addition, a system has recently been set up on a CDC 6400 at the University of Western Australia, using the data and programs written for the London machine.

2. THE DATA BASE

The first decision to be made at L.S.E. concerned the statistical journals to be covered. In itself the method is quite general, capable of being used with any number of journals from any academic discipline, but because the recording of cross-references between papers is such an important aspect of the system, it is best used with a reasonably small and self-contained body of literature. The selection of the journals to be covered was made by members of the L.S.E. Statistics Department; seventeen were chosen and these are listed in Appendix A. It was decided to start from the year 1959, so that the system could take over where the Kendall and Doig bibliography left off.

Once choices of this kind have been made, it is not very easy to change them. Although it would now be possible to add details of papers published before 1959, or in other journals, information about references to these additional papers from articles already in the data base could not be obtained without re-reading the material already scanned, and every year this becomes less feasible. In fact researchers have quite frequently wanted to use the citation system in order to follow up a line of work from a pre-1959 starting point, but have been unable to do so. With hindsight, one can see that it would have been useful originally to have coded references outside the specified time-span.

3. DATA PREPARATION

For each paper, the following information is recorded.

(i) Journal
(iii) Page number
(iv) Volume number
(v) Name(s) of author(s)
(vi) Title
(vii) Keywords (if provided by the author)
(viii) Details of references to other papers within the specified journals and time-span. These are coded in such a way that they can be linked with the paper referred to and used to produce citation lists.

The identifying items at the head of the list are coded into a unique 12-digit reference number.
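The linking of coded references to produce citation lists can be pictured as inverting a mapping from each paper to the papers it refers to. The following Python sketch is illustrative only: the record layout, the field names and the sample records (including the reference numbers and author names) are invented for the example and are not the format used by the L.S.E. programs, which were written in Fortran.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Paper:
    ref_no: str                      # hypothetical 12-digit reference number
    authors: list[str]
    title: str
    references: list[str] = field(default_factory=list)  # ref_nos of papers referred to


def build_citation_lists(papers):
    """Invert 'paper -> papers it refers to' into 'paper -> papers citing it'."""
    known = {p.ref_no for p in papers}
    cited_by = defaultdict(list)
    for p in papers:
        for target in p.references:
            # Only references to papers inside the data base can be linked.
            if target in known:
                cited_by[target].append(p.ref_no)
    return dict(cited_by)


# Invented sample records, for illustration only.
papers = [
    Paper("000000000001", ["SMITH A"], "A NOTE ON RATIO ESTIMATION"),
    Paper("000000000002", ["BROWN B"], "ESTIMATING RATIOS", ["000000000001"]),
    Paper("000000000003", ["GREEN C"], "OPTIMAL STOPPING",
          ["000000000001", "999999999999"]),  # second reference lies outside the file
]
citations = build_citation_lists(papers)
print(citations["000000000001"])
```

The reference to a paper outside the file is silently dropped, which mirrors the limitation noted in Section 2: citations can only be traced between papers that are both in the data base.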

The information is written onto coding sheets and then punched on cards for computer input. Coding is a fairly mechanical task, but requires accuracy and concentration. It is normally done by students as a temporary vacation job; approximately thirty different individuals have done the work over the past 3 years. Inevitably some errors occur; references given by the authors of papers are occasionally inaccurate, and further inaccuracies are introduced by the process of coding and punching. The effect of these errors is not, in general, to give the user incorrect information but to deprive him of information which fits his specification; if a keyword is misspelled or a reference wrongly coded an article relevant to the enquiry may not be picked up. Ideally it would be quite feasible to detect and correct almost all errors but in a situation where resources are limited, it seems more valuable to use the available labour to keep the information up to date. However, the file now contains several hundred misspellings and incorrect references, so that some amendment is obviously overdue. Some remedial work on this will be undertaken in the next few months.

New material is added to the file two or three times a year, usually during vacations when other demands for data preparation and computer time are lowest. It is economical in computer time to add large batches of material at infrequent intervals, but the effect is that the data base is always at least a year out of date. At the end of 1973, most of the 1972 material was accessible, but none from the current year. The cost of coding, punching and adding to the file details of each new issue of each journal as soon as it appeared would be high, and could not be justified unless there was a considerable increase in demand.

Some choices have been made by default, in the absence of information about users' requirements. Perhaps a system covering fewer journals with 100 per cent accuracy and kept completely up to date would be more valuable to researchers. It is part of the purpose of writing this paper to stimulate a feedback from actual or potential users of the system so that the value of the work done in the past few years can be assessed.

4. PROGRAMS

The system uses a suite of eight programs, all written in Fortran for the CDC 6600. In addition to the retrieval program proper, there are programs to check new data and add them to the existing file, to extract authors and keywords and set up indexes, and to produce a summary of the years and journals covered by the current version of the file.

The first version of the program written in 1970 used a fairly orthodox Fortran, and worked by scanning the whole file for every retrieval request, making comparisons character by character. This method proved slow and inefficient; used with a file of the present size it would cause each retrieval to take around 300 seconds of computer time. A decision was made to use non-standard Fortran techniques which were dependent on the structure of the particular machine for which the programs were written.

For example, the 6600 has a word size of 60 bits, so that the 12-digit numbers characterizing each article, as well as keywords and authors' names (which are truncated if necessary to 10 characters), may be held in a single computer word making the process of comparison very efficient. Even more significant, the file used for retrieval is kept for the duration of each enquiry on disc in random-access form, with associated indexes, so that only the information actually elicited by the enquiry needs to be read into the machine's central memory. Machine-dependent reading and writing routines are used to access the particular areas of disc required.
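The advantage of holding a key in a single machine word can be illustrated with a small Python sketch. It assumes a 6-bit character code, so that ten characters fill a 60-bit word; the particular code table below is an assumption made for the example and is not claimed to be the character set of the CDC 6600, and the function name `pack` is hypothetical.

```python
WORD_BITS = 60
CHAR_BITS = 6
CHARS_PER_WORD = WORD_BITS // CHAR_BITS  # 10 characters fit in one word

# Illustrative 6-bit code (an assumption, not the machine's own character set):
# space = 0, A-Z = 1-26, 0-9 = 27-36.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def pack(text):
    """Truncate or pad text to 10 characters and pack it into one integer."""
    text = text.upper()[:CHARS_PER_WORD].ljust(CHARS_PER_WORD)
    word = 0
    for ch in text:
        word = (word << CHAR_BITS) | CODE[ch]
    return word


a = pack("factorization")   # truncated to FACTORIZAT
b = pack("FACTORIZAT")
print(a == b)               # a single integer comparison decides the match
print(a < 2 ** WORD_BITS)   # the packed key fits in 60 bits
print(999_999_999_999 < 2 ** WORD_BITS)  # so does any 12-digit reference number
```

With a key packed in this way, testing whether two names or keywords agree takes one integer comparison rather than up to ten character comparisons.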

This choice of computing methods has two implications. Firstly, setting up the file for retrieval after updating it, producing the indexes, etc., is costly in computer time, but each individual retrieval run thereafter is very quick, taking on average 2 or 3 seconds, so the more frequently the system is used, the more justifiable, in a sense, is the high initial cost. Secondly, the programs could not easily be transferred to a machine with a different architecture. They cannot, for instance, be used at the Atlas Laboratory or University College Cardiff, where ICL computers are installed. Implementors of the system at these installations have found that they also needed to use machine-dependent techniques in order to run their programs efficiently.
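The first implication can be put as a rough break-even condition. The symbols are introduced here for illustration and do not appear in the original account: S is the one-off cost, in seconds of computer time, of setting up the indexed file after an update (a figure not stated in the text), and n is the number of retrievals made before the next update. Taking about 300 seconds per retrieval for a full scan and about 3 seconds per retrieval with the indexes, and ignoring any other costs, the indexed method uses less computer time when

```latex
S + 3n < 300n \quad\Longleftrightarrow\quad n > \frac{S}{297}.
```

For a purely hypothetical set-up cost of one hour (S = 3600), the condition gives n > 12.1, so the initial cost would be recovered from the thirteenth retrieval onwards.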

5. USING THE SYSTEM

The magnetic tape holding the data and retrieval program is kept permanently at the University of London Computer Centre, so that any member of the University with access to the CDC 6600 may use the system without charge. Every effort has been made to enable users to make their requests with the minimum of formality; in fact only three job-control commands are required, followed by the card(s) specifying the enquiry. Users of the R.S.S. service merely state their requirements by letter, and the appropriate cards are punched and processed by the L.S.E. Computer Unit at the L.S.E. terminal.

The enquiry cards read by the program contain:

(a) A number indicating whether author retrieval, keyword retrieval or a citation list for a known article is required.
(b) The author's name, keyword(s) or article reference number.
(c) A number indicating whether references and citations for articles retrieved by author or keyword are to be printed.

Output from the program is in the form of standard computer print-out listing details of all papers satisfying the enquiry. At present the usual turn-around time for a retrieval job is about 3 or 4 hours, but the necessity for sending enquiries and output through the post introduces several days' delay on the R.S.S. service. Whether delays of hours or days seriously affect the usefulness of the system is difficult to determine. A researcher is at least saved the trouble of scanning the journals himself, although he has to wait for the information which he needs.

There are of course other possible ways of making the data available to researchers, which might entail less delay. The most obvious is the method adopted by most compilers of computerized bibliographies, to use the computer basically as a sorting and printing machine, and produce a permuted title index arranged by keywords which is then published in book form. In many ways this is more convenient for consultation, but it is rather inflexible for an expanding data base; it is necessary to produce a new and larger version of the book periodically or to print a series of supplements. This has particular disadvantages with regard to the citations data; it would be feasible to link back to previous volumes but not to link forward.

There is a possible solution to the difficulty in the shape of microfilm output, which is now available from the London University machine. In principle, a microfilm reprint of the whole data base could be produced periodically, in the form of a permuted title index with coded references. This would not be too bulky to handle, and could be read in libraries and other institutions with the appropriate equipment.

Another, perhaps minor, objection to the permuted title index is that it does require the researcher to do some scanning and it is not as easy to read as the output from a retrieval program. It could not, for instance, have full details of references and citations arranged with each article; such details could at best be printed in coded form, and the researcher would have to look up each one separately. Furthermore, many decisions would have to be made which can be avoided with the present method; for instance, what exactly constitutes a keyword? At present, this is implicitly decided by users; if no one uses a particular word for retrieval, then nothing is ever printed about it, but it is always there in the system ready to be found if necessary. It might, however, be argued that some clear policy on keywords is desirable in a well-run system. There is some discussion of this point in the section on keywords below.

Going to the other extreme, another way of giving access to the data base is through teletype terminals linked directly to the computer, in other words, by "interactive" as against "batch" use of the machine. Such a system was implemented very successfully at Chilton and some work was done on it in the early stages of the project at L.S.E. It is quite feasible to arrange for the program to read from and write to a teletype, and the idea of enabling users to type in their requests and receive their answers immediately is very attractive. The principal objection to this method is its cost. In order to achieve a significant improvement in turn-around time over batch use of the system, it would be necessary to hold the data base perpetually on a disc on-line to the computer. (The wait for operator action to load a magnetic tape is the major delaying factor when the job is run normally.) Disc space is limited, and the occupation of a large area of it by the statistics bibliography would not be justified on present usage, which is no more than two or three accesses per week.




This is an illustration of a general fact about information retrieval systems; they are notoriously expensive to set up and they need a large number of potential users to make them worth while. This is because any single individual will need to consult the data quite rarely, perhaps only when embarking on a piece of research or checking references for a paper. Even a large institution like the University of London does not generate enough usage to make interactive retrieval an economic proposition. Moreover while the R.S.S. service has increased the number of potential users considerably, the technical problems of giving all such users access to the London machine via teletypes are of course increased correspondingly.

6. KEYWORDS

Retrieval by author's name is a comparatively straightforward matter; provided the names are punched in a consistent format which the users know, nothing much can go wrong. In practice, experience in London and at Chilton has shown that keyword retrieval requests greatly outnumber author requests and present more problems, so it is perhaps worth going into some detail about the way keywords are dealt with in the system.

Titles of papers are punched exactly as they appear in the journal, if possible. Formulae cause some difficulty, since only upper case roman letters, arabic numerals and a few punctuation symbols are available for coding. Greek letters are spelt out, e.g. SIGMA, LAMBDA, and formulae are put into standard Fortran notation, e.g. x² becomes x**2, Aᵢⱼ becomes A(I,J).

All "words", i.e. sequences of alphabetic characters between spaces, appearing in titles or specified by the author as keywords are included in the keyword index, except those less than four characters long. Omitting short words is a crude but simple method of eliminating articles, conjunctions and prepositions which occur very frequently but will never be needed for retrieval. It is not an ideal solution, since it loses some items which have a useful semantic content, and leaves behind a number of useless words with four or more characters. Any hopes of automatically deleting "grammatical" words on the basis of a very high frequency were disappointed quite early, when it was found that the word "distribution" occurred over 700 times in the first 5000 titles. Although too general to be useful alone, this word, like other commonly occurring items (see Appendix B), will obviously form part of key phrases which are wanted.
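The indexing rule just described can be sketched in a few lines of Python. This is an illustration rather than the Fortran actually used: the function names and sample titles are invented, and treating every non-letter (hyphens, digits, punctuation) as a word separator is an assumption made for the example.

```python
import re
from collections import defaultdict

MIN_LENGTH = 4      # words shorter than this are left out of the index
KEY_LENGTH = 10     # index entries are truncated to 10 characters


def index_words(text):
    """Return the index entries generated by one title or keyword list."""
    # Assumption: any character that is not a letter separates two words.
    tokens = re.findall(r"[A-Z]+", text.upper())
    return {t[:KEY_LENGTH] for t in tokens if len(t) >= MIN_LENGTH}


def build_keyword_index(titles):
    """Map each index entry to the set of papers in which it occurs."""
    index = defaultdict(set)
    for ref_no, title in titles.items():
        for word in index_words(title):
            index[word].add(ref_no)
    return index


# Invented sample titles, keyed by a stand-in for the reference number.
titles = {
    1: "A NOTE ON THE ESTIMATION OF RATIOS",
    2: "FACTORIZATION OF A RANDOM WALK",
}
index = build_keyword_index(titles)
print(sorted(index))
```

The first sample title shows both sides of the length rule: THE and OF are dropped, but NOTE, which is unlikely ever to be wanted for retrieval, survives because it has four letters.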

The only limit to the number of words in a retrieval key is the physical length of the 80-column card on which the request is punched. In practice users tend to ask mainly for two- and three-word combinations; some recent examples are: half-normal plot, controlled variability, free boundary, optimal stopping, Wiener process, exponential regression.

The program prints details of articles in which all the words specified co-occur in the title or author's keyword list; they need not be adjacent to one another or in the order given. Words may be truncated so as to find all grammatical variants of the same item. Thus "estimat ratio" would find titles containing "ratio estimation", "estimating ratios" and "estimation of ratios". Normally the program works by matching sequences of characters which are not necessarily full words in the title, so truncation can also be used to avoid problems with variant spellings such as generalise/generalize. However, there are times when the automatic output of variants is not convenient; a user searching for "factor analysis" may not appreciate receiving all co-occurrences of "analysis" with "factors", "factorial(s)", "factorization" and "factory", so the program allows an "exact match" to be requested, indicated by terminating the input word with a colon instead of a space.

Researchers can make the best use of the service if they have a printed list of all authors and keywords represented in the system. These lists, which are available to anyone with a serious interest in using the system, correspond to the internal author and keyword indexes used by the retrieval program. They show the exact format in which a name or word appears (truncated to 10 characters), whether any variant spellings have been used, and the number of papers which will be retrieved by it. They cannot, of course, show whether particular combinations of words occur, although the frequencies of the individual words may give some indication. Normally the lists are printed in alphabetical order, but the words can also be arranged by rank. A list of the most common words, with their frequencies, is given at the end of this article in Appendix B.
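The matching behaviour described in this section (co-occurrence of all requested words in any order, truncated words matching their variants, and a colon requesting an exact match) can be sketched as follows. The sketch is hypothetical: the function names are invented, and it assumes that a truncated request word matches any index word beginning with the same letters, which is one reading of the description and not a statement of how the Fortran program works internally.

```python
import re

KEY_LENGTH = 10  # index words are held truncated to 10 characters


def parse_request(request):
    """Split a request into (stem, exact) pairs; a trailing colon means exact."""
    terms = []
    for stem, colon in re.findall(r"([A-Z]+)(:?)", request.upper()):
        terms.append((stem[:KEY_LENGTH], colon == ":"))
    return terms


def matches(paper_words, terms):
    """True if every requested term is matched by some word, in any order."""
    def hit(stem, exact):
        if exact:
            return stem in paper_words
        # Assumed rule: a truncated request matches any word it begins.
        return any(word.startswith(stem) for word in paper_words)

    return all(hit(stem, exact) for stem, exact in terms)


loose = parse_request("estimat ratio")
strict = parse_request("factor:analysis:")

print(matches({"ESTIMATING", "RATIOS"}, loose))
print(matches({"FACTORIAL", "ANALYSIS"}, parse_request("factor analysis")))
print(matches({"FACTORIAL", "ANALYSIS"}, strict))
print(matches({"FACTOR", "ANALYSIS"}, strict))
```

Under these assumptions the unqualified request "factor analysis" also retrieves a paper on factorial analysis, while the colon form retrieves only papers indexed under FACTOR itself.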

There are now over 5,700 words in the index, including misspellings, variants, foreign words (French and German) and grammatical items. Probably the useful vocabulary comprises fewer than 4,500 words. (The foreign titles present yet another problem; ideally they should be translated before being put into the file, which is the policy adopted at Chilton; at present unless they appear among the references and citations they are unlikely to be found in a keyword search.) The data base is growing rapidly, and the question of central memory storage space for indexes will soon become critical. For this reason alone, it will be desirable to tidy up the keyword list by eliminating errors, inconsistencies and words which will never be needed for retrieval. A keyword search already goes through four levels of indirection to find the required papers; in time it may be necessary to restructure the program and increase the number of levels in order to keep the store requirements down.

The growing requirement on authors of papers to provide keyword lists describing the topics they cover should gradually improve the value of the system to users, since titles alone do not always give a good indication of the scope of a paper. The present data base already contains a great deal of information about keywords and it would be easy to extract from it statistics about their co-occurrences. These might provide a basis for a comprehensive list of commonly used key phrases, eventually leading to a more sophisticated indexing and retrieval system for statistical literature.
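As an indication of how co-occurrence statistics might be extracted, the following Python sketch counts, for every pair of keywords, the number of papers in which both appear. It is a minimal illustration: the function name and the sample keyword sets are invented, and no such program is described in the text.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_sets):
    """Count how many papers contain each unordered pair of keywords."""
    counts = Counter()
    for words in keyword_sets:
        # Sorting makes (A, B) and (B, A) count as the same pair.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# Invented sample data: one set of index words per paper.
papers = [
    {"OPTIMAL", "STOPPING", "RULES"},
    {"OPTIMAL", "STOPPING", "PROBLEMS"},
    {"OPTIMAL", "DESIGNS"},
]
counts = cooccurrence_counts(papers)
print(counts.most_common(1))
```

Pairs with high counts, such as OPTIMAL with STOPPING in the sample data, would be candidates for a list of commonly used key phrases.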

7. THE FUTURE

The data base now contains details of about 9,500 articles, and the system is sufficiently stable to cope with a steady increase in information, without major alterations, for some years. It has already proved itself useful to a number of people but cannot be said yet to be cost-effective; a charge to users which reflected the true cost of setting up and maintaining the system would be enormous, on present usage. As its coverage increases, it should become a more and more attractive alternative to searching the literature manually, particularly if it can be kept reasonably accurate and up-to-date. The experimental period during which a service is provided to members of the R.S.S. will last 2 years. The response during this period should show whether there is enough demand for a service of this kind to keep it in operation.

Details of the procedure to be followed by Fellows wishing to make use of the system are given in Appendix C. Unfortunately, owing to restrictions on the use of University of London computing equipment, the service cannot at present be extended to non-Fellows.




APPENDIX A

JOURNAL LIST

Annals of the Institute of Statistical Mathematics
Annals of Mathematical Statistics
    (Replaced since 1973 by: Annals of Probability; Annals of Statistics)
Australian Journal of Statistics
Berkeley Symposium on Probability and Mathematical Statistics
Biometrika
Biometrics
Bulletin of the International Statistical Institute
Journal of the American Statistical Association
Journal of Applied Probability
Journal of the Royal Statistical Society (Series A, B and C)
Metrika
Review of the International Statistical Institute
Sankhya
Technometrics
Theory of Probability and its Applications

APPENDIX B

LIST OF FREQUENT WORDS IN RANK ORDER

Distribution   1,341    Problem          275    Exponential      167
With             891    Application      265    Mean             165
Some             708    Variables        264    Maximum          164
Estimation       597    Markov           263    Likelihood       161
Analysis         512    Stochastic       256    Testing          156
Processes        475    Model            251    Confidence       156
Random           461    Experiment       246    Estimates        154
Probability      442    Sequential       244    Theorems         152
Tests            439    Functions        227    Series           152
Note             395    Time             224    Certain          150
Normal           382    Statistical      221    Samples          149
Sample           380    Limit            217    Class            148
Test             370    Approximation    214    Between          148
Linear           368    Problems         210    Characterize     147
Regression       338    Data             208    Properties       146
Statistics       336    Parameters       204    Function         142
From             331    Theorem          201    Poisson          140
Sampling         323    Models           191    Estimating       139
Population       315    Comparison       190    Parameter        135
Designs          306    Independent      188    Design           135
Asymptotic       297    Stationary       184    Methods          134
Theory           284    Order            183    Correlation      133
Variance         282    Method           181    Ratio            132
Multivariate     281    Estimators       173    Tables           129
Process          276    Finite           172    Multiple         128

When             127    Convergence      113    Power            105
Optimal          126    Coefficient      112    Number           104
Incomplete       125    Type             110    Case             104
Generalize       124    Sums             109    Balanced         104
Procedures       120    General          109    Squares          102
Binomial         119    Rank             108    Information      101
Based            118    Size             107    Moments          101
Observation      114    Gaussian         106    Bivariate        101
Block            114    Results          105    Components       100

APPENDIX C

PROCEDURE FOR USING THE SERVICE

Anyone wishing to make an enquiry has three different ways of specifying what he wants. He may ask for all articles by a certain author, all articles with a particular word (or group of words) in the title or list of keywords, or he may request details of all citations of a specified article by other papers in the system. This last facility is useful in exploratory work since it enables the investigator to follow up a line of work from a particular starting point. If he knows of one article on a topic of interest to him, he is able to obtain a list of other papers referring to this article and thus discover other papers on the same topic.

Fellows wishing to make use of the service should write to the Secretary of the Society specifying the information required. A charge of £2 per enquiry will be made, to offset computer time and administrative costs. An enquiry may consist of requests for several authors, keywords or citation lists, but should not be so general as to be likely to produce excessive output. Because of the mechanical nature of the system, Fellows are asked to check carefully details of references and the spelling of authors' names.

The reply will be sent in the form of a standard computer print-out listing details of papers satisfying the enquiry. With any article which is retrieved, details of references to and citations by other articles within the system can also be printed out if this is specifically requested. Fellows requesting combinations of keywords are asked to state whether, if no article with the desired combination occurs, they wish to have a print-out for the keywords taken separately.

A complete list of authors and keywords at present in the system will be held in the Society's offices for consultation by Fellows.


