
    GREENSTONE DIGITAL LIBRARY

    FROM PAPER TO COLLECTION

    Dr Michel Loots, Dan Camarzan and Ian H. Witten

    Human Info NGO, Belgium; Simple Words, Romania

    University of Waikato, New Zealand

    Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source software, available from http://greenstone.org under the terms of the GNU General Public License.

    We want to ensure that this software works well for you. Please report any problems to [email protected]

    Greenstone gsdl-2.50 March 2004


    About this manual

    This document explains how to create CD-ROM collections from paper documents. It describes in full detail the procedures and economics involved in the scanning and optical character recognition (OCR) processes, so that you end up with text in the right format to apply the Greenstone software. It also describes how to create and edit the material associated with a collection.

    We have tried to be as plain as possible in our explanation. Reference to any trademark or company product is purely for illustrative purposes, and does not imply that we endorse or favor this product over any other.

    Companion documents

    The complete set of Greenstone documents includes five volumes:

    - Greenstone Digital Library Installer's Guide
    - Greenstone Digital Library User's Guide
    - Greenstone Digital Library Developer's Guide
    - Greenstone Digital Library: From Paper to Collection (this document)
    - Greenstone Digital Library: Using the Organizer

    Copyright

    Copyright 2002 2003 2004 2005 2006 2007 by the New Zealand Digital Library Project at the University of Waikato, New Zealand.

    Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled GNU Free Documentation License.


    Acknowledgements

    The scanning operation and other know-how relating to the creation of collaborative non-profit collections have been developed by Dr Michel Loots, MD, of Human Info NGO and HumanityCD, Dan Camarzan of Simple Words, and their team of collaborators in Brasov, Romania.

    The Greenstone software is a collaborative effort between many people. Rodger McNab and Stefan Boddie are the principal architects and implementors. Contributions have been made by David Bainbridge, George Buchanan, Hong Chen, Michael Dewsnip, Katherine Don, Elke Duncker, Carl Gutwin, Geoff Holmes, Dana McKay, John McPherson, Craig Nevill-Manning, Dynal Patel, Gordon Paynter, Bernhard Pfahringer, Todd Reed, Bill Rogers, John Thompson, and Stuart Yeates. Other members of the New Zealand Digital Library project provided advice and inspiration in the design of the system: Mark Apperley, Sally Jo Cunningham, Matt Jones, Steve Jones, Te Taka Keegan, Michel Loots, Malika Mahoui, Gary Marsden, Dave Nichols and Lloyd Smith. We would also like to acknowledge all those who have contributed to the GNU-licensed packages included in this distribution: MG, GDBM, PDFTOHTML, PERL, WGET, WVWARE and XLHTML.


    Contents

    1 Introduction
    2 Scanners and scanning
        2.1 Scanners
            Low-cost flat-bed scanner
            Low-end scanner with sheet feeder
            Color scanners
            Professional duplex scanners
            Scanning programs
        2.2 Preparing the documents
        2.3 The scanning process
            Quality control
            Filename conventions
        2.4 Productivity and resources
            Scanning costs
    3 OCR: Optical Character Recognition
        3.1 The OCR process
            Quality control
            Tables
            Images
            Specialized material
        3.2 Productivity and resources
            Intensive OCR
            Achievable productivity
        3.3 Alternatives to OCR
            Manual retyping
            Image files
        3.4 Combining scanning and OCR
    4 Three examples: 1000 to 100,000 pages
        4.1 Typical small collection: 500 to 1000 pages
        4.2 All publications from an organization: 5000 pages
        4.3 A small library: 100,000 pages
    5 Creating an electronic collection
        5.1 Methods of collection building
        5.2 Getting started in seven steps and 15 minutes
    GNU Free Documentation License


    1 Introduction

    One goal of the Greenstone Digital Library software is to empower organizations such as universities, United Nations agencies, non-governmental organizations, non-profit organizations and governments to create varied collections of information that can be delivered online or on CD-ROM.

    Typical steps that have to be implemented are:

    1. Selecting the documents to be included
    2. Securing copyright permissions to use these documents in the digital library
    3. Scanning and OCR of hard-copy documents that are not available in digital form, to obtain an accurate digital version
    4. Converting all documents to a format (integrating text and images) that can be imported into Greenstone (preferably HTML or Microsoft Word, but others are also covered at varying levels of precision by a plugin; see the Greenstone Users Manual)
    5. Tagging the chapters, paragraphs and images of the digital documents
    6. Organising the collection into an optimally structured digital library
    7. Building the digital library using the Greenstone software
    8. Printing and distributing the collection on CD-ROM and/or distributing it over the Internet

    In order to create a digital collection, the publications must be available in digital format. If books, newsletters or other documents are only available on paper, they will need to be scanned and processed into machine-readable form (step 3). Usually this is done using optical character recognition (OCR), but sometimes by manual retyping. This process is covered in Chapters 2-4 of this manual.

    Step 5 enables the different parts of a document to be independently selected and displayed by readers in the final library, while step 6 involves assigning attributes to the documents such as subject categories, keywords and bibliographic data for ordering and searching the library. These steps are covered in Chapter 5 of this manual.

    This manual introduces many issues that affect the editorial process of creating a collection from paper. Before reading on, you should consider these questions:

    - What is the goal of your collection?
    - What is your target group? How big is it: local, regional, or global?
    - How many documents are you making available? How many pages? How much graphics content?
    - Does the material split into parts that will be consulted by a limited audience and parts that need to be disseminated widely?
    - Are the documents already available electronically? If so, in which formats? (Note incidentally that PDF files are not automatically equivalent to digital full-text form, as they often contain only page images.)
    - What is the copyright status of the documents? Who owns the copyright?
    - Are there other organizations with the same target audience? Are you willing to collaborate with other groups?
    - What budget is available for the whole project?
    - What human resources are available (in person-months) for co-ordination, editing, scanning and programming?
    - How many computers are available for this project?
    - How many CD-ROMs do you want to distribute? Will they be free, or for sale?


    [1] All sums of money mentioned in this document are in US dollars, and were current in 2001.

    2 Scanners and scanning

    The first step in converting paper documents into a digital library collection is to obtain images of all pages of all publications in digital format. The next stage is optical character recognition (OCR), and clean, high-quality images are essential for successful OCR. The digitization process requires a scanner capable of working at a resolution of 300 dpi (dots per inch). Most scanning can be done in black-and-white, but if color illustrations are included they must be scanned with a color scanner. In most cases the covers of the book contain colors and will have to be scanned as a color photographic image.

    2.1 Scanners

    Scanners are available in all price ranges, and all shapes and sizes. They range from $100 for flat-bed scanners to upwards of $50,000 for large industrial scanners from manufacturers such as Bell & Howell.[1] There are many websites that offer a wide range of scanners for sale. To locate them, just search for scanners in search engines like Google, Altavista, or Yahoo.

    The output format of a scanned page is a computer file that is usually stored in TIFF or Bitmap format. TIFF with Group IV (CCITT Group 4) compression is the best format to use. An average page scanned and converted to this format occupies only 50 KB, compared to perhaps 2 MB for the equivalent page in uncompressed Bitmap form.

    Low-cost flat-bed scanner

    Low-cost flat-bed units are the cheapest and most widely available type of scanner. There are many brands: HP, Agfa, Acer, etc. Prices range from $100 to $300. Both black-and-white and color images can be scanned. The low price allows each computer to have its own scanner.

    Disadvantages of these scanners include the medium quality of the result, the slow rate of scanning, unreliability in warm environments, and relatively frequent breakdown. Pages must be scanned manually, one by one. Each page must be positioned carefully on the scanning plate to ensure that it is aligned correctly. Productivity of these scanners is low. Despite manufacturers' claims that each page can be scanned in less than a minute, the fact is that rates exceeding twelve pages per hour are rarely achieved. The scanning process monopolizes the computer on which the work is being performed.

    Consequently these scanners are useful only for small jobs with limited numbers of pages: no more than 200 to 400 pages a month on a regular basis, or one-time jobs of up to 1000 or 2000 pages.

    Low-end scanner with sheet feeder

    Low-end scanners with sheet feeders typically cost between $500 and $1200. Ten to fifty pages can be inserted, scanned and processed at once: thus the operator does not have to attend constantly to the machine. This raises capacity to between 150 and 200 pages per day. These scanners are more robust, and have a longer lifespan before repair, usually in the range 30,000 to 50,000 pages.

    A disadvantage is that only one side of the page is scanned at a time: the stack of pages must be reversed and rescanned in order to obtain an image of both sides. This often causes trouble, because sheet feeders are never entirely reliable and pages sometimes jam.

    These scanners are useful for up to 1500 to 3000 pages a month.

    Color scanners

    Any scanning operation invariably involves some color images, so a color scanner will always be required. Generally speaking, less than 5% of any publication contains color images, plus the cover. Thus a low-cost flat-bed scanner as described above suffices. It is advisable to select one capable of scanning up to 600 dpi resolution.

    Professional duplex scanners

    Professional scanners are reliable, heavy-duty machines capable of processing a large volume of pages, typically from 2000 pages to 10,000 pages per day. They have an automatic sheet-feeder tray system that processes batches of about 50 to 200 pages. The best and fastest are duplex machines that scan both sides of the page at once.

    Professional duplex scanners require a powerful computer with a hard disk of at least 10 to 20 GB. Prices range from $5000 to $50,000. For example, the Canon DR-6020 duplex scanner costs $5000 and works with double-sided documents. It has a capacity of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell & Howell and Fujitsu scanners range from $10,000 to $50,000 and have a lifespan of many millions of pages.

    Micro-fiche scanners cost from $15,000 for a semi-manual unit to $80,000 for one that operates fully automatically.

    Scanning programs

    Every scanner comes with its own software, which means that the program must be installed on the computer that manages the scanner. Some have a computer card that needs to be installed in your computer to speed up the scanning operation.

    2.2 Preparing the documents


    Before being scanned, documents must be properly prepared. Dusty documents must be cleaned, humid documents dried, clips removed, pages unfolded.

    The spine of each book should be removed by cutting it off, straight and precisely. Books provided by libraries must often be rebound, and if so you should be particularly careful when removing spines in order to facilitate smooth rebinding.

    If there are just a few documents, cutting can be done manually with a ruler and cutters. Be careful with your hands! For more documents, special manual cutting machines are available.

    For high volumes (more than 20 documents) we recommend asking a printer or copy-shop if you can use their professional cutting machine. Do not forget to remove metal clips which could damage the cutting blades.

    2.3 The scanning process

    Using software provided with the scanner, a digital image of each paper page is scanned and transformed into a Bitmap or TIFF image. These images should be stored on hard disk with standard filenames. The OCR process starts once some or all of a batch of documents have been scanned. It can be undertaken by the person who operates the scanner, or by someone else.

    Typically a scanning resolution of 300 dpi is needed, although sometimes 200 dpi is acceptable.

    Quality control

    The final goal of scanning is either to OCR the pages to obtain perfect word processor or HTML versions of the publications, or to produce enhanced image files such as PDF image files. In either case the quality of the image is very important. If quality is sub-standard, image files will not look good and will consume more memory. Image quality seriously affects the OCR process: with sub-standard quality, productivity deteriorates by up to 40%. OCR typically represents more than 90% of the total cost, so scanning quality can have a very substantial effect on the final cost.

    The quality of the TIFF file can be enhanced by adjusting the scanning process to each type of paper, using settings provided by the scanner software. Relatively transparent kinds of paper require a lighter setting; the contrast must be adjusted depending on the quality of printing, and so on.

    First divide the material into batches with similar paper and print qualities. Perform OCR tests on a sample from the first batch to determine the optimal settings. Then scan all material in this batch before proceeding to the next one.

    Filename conventions

    Give each book or document a job number or unique code, which will become the name of the folder that contains all TIFF images in the document. Depending on the computer system (DOS, Windows, UNIX, Linux, etc.), from 8 to 128 characters can be used in a filename. We recommend restricting this unique document identifier to 8 to 16 characters. The first five characters might identify the document, the following letter might contain a language code, and the remaining characters might identify the particular page. For example, the identifier u7548e12.tif might identify the TIFF image of page 12 of a book written in English with code u7548e.

    Allocate one directory on the hard disk for scanning jobs, say scanjobs. Then make a subdirectory for each job. Within this make a subdirectory for each publication, say u7548e for the above document. Store all the TIFF images of the publication, including color images, in this folder.
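    The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the function names, the job folder name job01 and the strict five-plus-one character layout are assumptions made for the example, not part of Greenstone or of any scanner software.

```python
import re
from pathlib import Path
from typing import Tuple

# Assumed layout, following the example u7548e12.tif:
# 5-character document code + 1-letter language code + page number.
PAGE_NAME = re.compile(
    r"^(?P<doc>[A-Za-z0-9]{5})(?P<lang>[A-Za-z])(?P<page>\d+)\.tif$"
)


def page_filename(doc_code: str, lang: str, page: int) -> str:
    """Build the TIFF filename for one scanned page."""
    if len(doc_code) != 5 or len(lang) != 1 or page < 1:
        raise ValueError("need a 5-character code, a 1-letter language, page >= 1")
    return f"{doc_code}{lang}{page}.tif"


def page_path(root: str, job: str, doc_code: str, lang: str, page: int) -> Path:
    """Place the page under root/job/publication/, e.g. scanjobs/job01/u7548e/."""
    publication = f"{doc_code}{lang}"
    return Path(root) / job / publication / page_filename(doc_code, lang, page)


def parse_page_filename(name: str) -> Tuple[str, str, int]:
    """Split a filename back into (document code, language code, page number)."""
    match = PAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a page filename: {name!r}")
    return match["doc"], match["lang"], int(match["page"])


if __name__ == "__main__":
    print(page_path("scanjobs", "job01", "u7548", "e", 12))
    print(parse_page_filename("u7548e12.tif"))
```

    Generating names from one function, rather than typing them by hand, keeps every operator on the same convention and lets a later script check a folder for missing pages.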

    2.4 Productivity and resources

    You should not underestimate the magnitude of the scanning operation, and particularly the OCR process that follows. It is best to consider scanning and OCR as completely separate activities. The optimal choice from an economic and practical point of view should be made individually for each one.

    Some points to consider are the investment in scanners and computers that is necessary; the availability of appropriate space and human resources; training the workforce; salary costs; the initial and total number of pages to be scanned; deadlines; and whether documents can be outsourced to third parties.

    Scanning costs

    An important decision is whether to invest in scanning equipment and perform all scanning oneself, or outsource it to a scanning company. The main considerations are:

    pressure of time for the scanning job; total number of pages; salary costs of those who perform the scanning.

    The people who perform the scanning must be highly motivated, technically skilled, and quality-oriented.

    The typical cost of scanning by a professional company is $0.06 per page. To this must be added the cost of shipment, which can be up to $0.03 per page for transport from developing countries to developed countries, and $0.015 per page for transport within countries.

    Table 1 estimates the cost of doing it yourself, using various scanner types. Note that all figures are approximate. They are provided as rough guidelines based on the authors' experience. The first three columns concern labor costs. The first is the capacity in pages/month, assuming full-time work. The second is the resources required in person-hours per page, obtained by dividing the number of working hours per month (assumed to be 180) by the pages/month capacity in the first column.


    Table 1 Scanning cost

    Scanner type                    Capacity       Hours/page   Cost/page    Scanner      Scanner     Outsourced pages
                                    (pages/month)  (180-hour    (assuming    acquisition  lifespan    for scanner cost
                                                   month)       $4/hour)                  (pages)     (at $0.06 each)
    Flat-bed scanner                2,500          0.072        $0.288       $300         7,000       5,000
    Scanner with sheet-feeder       8,000          0.0225       $0.09        $800         30,000      13,000
    Professional: low-end duplex    40,000         0.0045       $0.018       $6,000       600,000     100,000
    Professional: high-end duplex   150,000        0.0012       $0.0048      $50,000      8,000,000   833,000

    To determine the price per page, multiply the total hourly salary costs in your situation by the second column of Table 1. As an example, the third column gives the price of in-house scanning at a salary rate of $4/hour, not including investment costs.

    These calculations assume that the scanner is used for a sufficient volume to justify the investment. The final three columns of Table 1 give more information about the cost of the scanner itself. The first of these shows the acquisition cost of the scanner, and the next gives its expected lifetime. The last shows the number of pages that could be scanned commercially, at a cost of $0.06/page, for the price of the scanner alone.
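    The derived columns of Table 1 follow from simple arithmetic, which the short Python sketch below reproduces so that you can substitute your own wage and working hours. The capacities and prices are the figures from Table 1; the function and variable names are invented for this illustration.

```python
WORKING_HOURS_PER_MONTH = 180   # assumption stated in Table 1
HOURLY_WAGE = 4.00              # example salary rate used in Table 1 (US$)
OUTSOURCED_PRICE = 0.06         # typical commercial scanning price per page (US$)

# scanner type -> (capacity in pages/month, acquisition price in US$)
SCANNERS = {
    "Flat-bed scanner": (2_500, 300),
    "Scanner with sheet-feeder": (8_000, 800),
    "Professional: low-end duplex": (40_000, 6_000),
    "Professional: high-end duplex": (150_000, 50_000),
}


def hours_per_page(capacity: float, hours: float = WORKING_HOURS_PER_MONTH) -> float:
    """Person-hours needed per page: working hours divided by monthly capacity."""
    return hours / capacity


def labor_cost_per_page(capacity: float, wage: float = HOURLY_WAGE) -> float:
    """Labor cost per page, not including the investment in the scanner."""
    return hours_per_page(capacity) * wage


def outsourced_pages_for_price(price: float, per_page: float = OUTSOURCED_PRICE) -> float:
    """Pages a scanning company would scan for the price of the scanner alone."""
    return price / per_page


if __name__ == "__main__":
    for name, (capacity, price) in SCANNERS.items():
        print(
            f"{name:30s} {hours_per_page(capacity):.4f} h/page  "
            f"${labor_cost_per_page(capacity):.4f}/page  "
            f"{outsourced_pages_for_price(price):,.0f} outsourced pages"
        )
```

    Table 1 rounds the last column, so the computed values (for example 13,333 pages for the sheet-feeder scanner) are slightly more precise than the printed ones.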

    Of course, many other factors affect the choice of scanner: availability of funds, need to minimize dependence on others, desire to build local capacity, obligations to libraries to scan books locally and not transport them, and so on.

    The above figures give some idea of the volume of pages needed to justify different levels of investment. Rarely will an institute or organization need to scan 800,000 pages. At such levels more complex issues arise, such as maintenance and the possibility of recouping costs by offering scanning services to others, that we will not discuss here.

    It is tempting to regard the development of scanning capacity as a commercial venture, particularly in developing countries. But one should always bear in mind that scanning is not a repetitive business. Once documents have been scanned, clients never place new orders for the same documents, no matter how good the relationship with the scanning company. From a commercial point of view, intensive marketing efforts are needed. We do not advise NGOs or other non-profit organizations to venture into this realm without thorough initial trials and a carefully-considered business plan.

    In conclusion, if 10,000 to 50,000 pages are to be scanned, one should consider outsourcing the job. A low-end professional scanner costing about $6000 can only be justified if more than 100,000 pages have to be scanned. You might consider banding together with a few other institutions, perhaps NGOs or libraries, to purchase such a scanner.
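    As a rough cross-check of this conclusion, the sketch below compares the total cost of in-house and outsourced scanning for a given number of pages. Unlike the last column of Table 1, it also counts in-house labor, so its break-even volume is somewhat higher than the table's. It ignores maintenance, overhead and scanner lifespan, and all names are invented for this illustration.

```python
from typing import Optional


def in_house_total(pages: int, scanner_price: float, labor_per_page: float) -> float:
    """Scanner purchase plus labor for the whole job (no maintenance or overhead)."""
    return scanner_price + pages * labor_per_page


def outsourced_total(pages: int, price_per_page: float = 0.06,
                     shipping_per_page: float = 0.0) -> float:
    """Commercial scanning price plus optional shipping, per page."""
    return pages * (price_per_page + shipping_per_page)


def break_even_pages(scanner_price: float, labor_per_page: float,
                     price_per_page: float = 0.06,
                     shipping_per_page: float = 0.0) -> Optional[float]:
    """Volume at which buying a scanner costs the same as outsourcing.

    Returns None when in-house labor alone already costs more per page
    than outsourcing, so that buying never pays off on cost grounds.
    """
    saving_per_page = price_per_page + shipping_per_page - labor_per_page
    if saving_per_page <= 0:
        return None
    return scanner_price / saving_per_page


if __name__ == "__main__":
    # Low-end duplex scanner from Table 1: $6,000, $0.018 labor per page.
    print(break_even_pages(6_000, 0.018))
    # Same scanner when outsourcing also incurs $0.03 per page shipping.
    print(break_even_pages(6_000, 0.018, shipping_per_page=0.03))
    # Flat-bed scanner: labor at $0.288 per page exceeds the outsourced price.
    print(break_even_pages(300, 0.288))
```

    With the Table 1 figures, the low-end duplex scanner breaks even at roughly 140,000 pages when labor is included and no shipping is charged, which is consistent with the advice that it is only justified above 100,000 pages.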


    [2] Recall that all sums of money are expressed in 2001 US dollars.

    3 OCR: Optical Character Recognition

    An optical character recognition or OCR system transforms a scanned image into text. The input is a digitized image in TIFF or Bitmap format, preferably a clean, high-quality image. The output is a word-processor or web file, typically in RTF, Word, or HTML format.

    The following steps are involved in converting paper documents to computer form:

    - scanning;
    - page layout analysis;
    - recognition;
    - scanning images and tables.

    Following these, you must perform quality checks on the resulting files, and save them in the appropriate format.

    There are many good OCR programs on the market, with prices ranging from $100 to $400.[2] Examples, among many others, are:

    - Read-Iris (http://www.readiris.com/)
    - Omnipage (http://www.omnipage.com/)
    - Fine-Reader (http://www.finereader.com/)

    All information, including lists of local distributors, can be found on the manufacturers' websites. Among these, in the authors' experience the most user-friendly are Fine-Reader and Omnipage. Fine-Reader is cheapest, costing about $100. It offers a great deal of flexibility, and the widest range of different language options.

    A choice must be made between undertaking the scanning and OCR in-house or outsourcing it to a commercial organization. To do it in-house requires a scanner, OCR software program, OCR skill development, and a quality-conscious, highly motivated workforce.

    3.1 The OCR process

    The OCR process differs from one OCR program to another, and each one requires a considerable amount of learning. The program's manual will explain this process in detail. Four points deserve particular attention: quality control, tables, images, and specialized material such as formulas, foreign characters etc.


    Quality control

    We cannot place enough emphasis on quality control. Quality checks are best performed by native speakers, or people with an excellent command of the language being checked. The best people are at the university or high-school level. We should also note that young people tend to sustain higher concentration than older people for this kind of work.

    Normally there are four quality checks.

    The first is performed at the same time as OCR. Every OCR program has a built-in spell-checker that highlights every suspect letter. At the same time the image of the word appears too, making it easy to check and correct the error.

    The second is a general check of the text once the OCR process is finished. Common errors are to miss a page, a paragraph, chapter titles, and so on. A general overview is necessary to check if pages are missing. It is essential to check titles, chapter headings, paragraphs, and tables.

    The third is a spelling check using Microsoft Word. This program has a dictionary that is often more sophisticated than the one embedded in OCR programs. By importing the book into Word and performing a spelling check there, more errors can be found and corrected. Be sure to add to the spell-checker any particularly difficult or error-prone words, or scientific and technical terms common in that type of publication.

    Finally, the completed document should be checked by an independent person who samples the complete book and checks for errors, problems with tables and images, tagging, and the general look of the resulting text. Only after this final check can a book be considered ready for digital dissemination.

    Tables

    OCR programs do not cope well with tables. Moreover, tables are hard to check. They contain many digits, sometimes with points and commas, and entries are easily misplaced into the wrong row or column. They require concentrated effort, dedicated work, intensive proof-reading, careful checking, and good quality control. They can be handled in three basically different ways.

    First, tables can be treated as images. This involves scanning them as black-and-white images and placing them in this form at the appropriate point in the document. This is the easiest solution. There are no errors, and the only time taken is that involved in creating the image. However, this solution consumes more memory than others. Also, the resolution is not always sufficient when large tables are displayed on a computer screen. If you make the complete table fit, the resolution is too small. If you make the table over-wide, the user must scroll to see all columns and rows, and cannot get an overview of the contents.

    Second, tables can be recreated manually by making a table with the same number of rows and columns and filling the entries by typing them in, character by character.

    Third, the table can be OCR'd. This saves time compared to the manual process, but has a potential for more errors. Columns sometimes get merged, and commas and points are not recognized.

    Images

    Publications contain three different general types of image:

    black and white line art; black and white photographs; color photographs.

    Black and white line art should be scanned in line art mode and saved as GIF or PNG files. Black and white photographs should be scanned in greyscale mode and saved as GIF or JPEG files. Color photographs should be scanned in color mode and saved as JPEG files. Generally speaking, medium-quality JPEG provides adequate resolution.

    For most collections, images consume the bulk of the space required on a hard-disk or CD-ROM. This makes it important to optimize each image for clarity and visibility, while minimizing its size. To save space you might drop some or all of the images if they are not relevant to the text.

    Images should be scanned separately, one by one. We recommend giving the image files a name that consists of the first five or six characters used to denote the document followed by the number of the page on which the image was found. An alternative, assuming each document is in its own directory, is to simply use the letter p followed by the page containing the image. If there are several images on a single page, append an additional letter a, b, c to the filename. For example, if a JPEG image appeared on page 36 of the publication u7548e discussed earlier, it would be placed in a file named u7548e36.jpg or p36.jpg.
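    The image naming rule can be sketched as follows. The function name and its parameters are assumptions made for this illustration; only the resulting filenames follow the convention described above.

```python
import string
from typing import Optional


def image_filename(page: int, ext: str = "jpg",
                   doc_prefix: Optional[str] = None,
                   index: Optional[int] = None) -> str:
    """Name an image after the page it appears on.

    doc_prefix: document identifier such as "u7548e"; when omitted, the
                short form "p<page>" is used (one directory per document).
    index:      0 for the first image on the page, 1 for the second, and so
                on; mapped to the letters a, b, c. Leave as None when the
                page holds a single image.
    """
    stem = f"{doc_prefix}{page}" if doc_prefix else f"p{page}"
    if index is None:
        suffix = ""
    elif 0 <= index < len(string.ascii_lowercase):
        suffix = string.ascii_lowercase[index]
    else:
        raise ValueError("this sketch supports at most 26 images per page")
    return f"{stem}{suffix}.{ext}"


if __name__ == "__main__":
    print(image_filename(36, doc_prefix="u7548e"))
    print(image_filename(36))
    print(image_filename(36, index=1))
```

    Zero-based indexes are used here only because they map directly onto the alphabet; the first image on a page then gets the letter a.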

    Once the images have been scanned, you can put batch-processing programs to work to resize or enhance all the images at once.

    Specialized material

    Many documents contain specialized material such as special characters, formulas, and difficult pages. Special characters generally relate to different languages and diacritical marks. The language option for the OCR program should be set for the specific language being read. Formulas will have to be recreated manually. Sometimes this is not possible in the OCR program, but only in a word processor like Microsoft Word. Difficult pages that contain complex material or are damaged so that a clear image cannot be obtained might have to be retyped manually.

    3.2 Productivity and resources

    As mentioned earlier, you should not underestimate the difficulty of OCR. Although the economic and practical options for OCR should be considered separately from scanning, similar points arise: the necessary investment in computers; the availability of human resources and management skills; training the workforce; salary costs; the total number of pages to be processed; and whether documents can be outsourced to third parties.

    In this section we share our experience of OCR operations in Belgium, Romania and India. All case studies, calculations and figures assume average situations, documents of standard difficulty (including tables and images) such as are found in most archives or libraries, very high-quality results, and a medium- to long-term operation.

    Intensive OCR

    OCR is difficult. It demands great concentration and much skill. Before attaining peak productivity level and quality, a learning period of about six weeks is needed.

    Typically, best results and productivity are achieved during the first hours of each day. After three hours of OCR work, productivity declines very rapidly, perhaps to 50% of the initial level. After six hours most people become very tired.

    The same kind of evolution occurs over the initial weeks. In the first few weeks everyone achieves fairly high productivity, but after that up to two-thirds of people become bored and frustrated. These people either quit or perform poorly in terms of quality and productivity. Even those who pass the first three to five critical weeks and become part of the regular work team often leave in search of a better position after 6 to 12 months.

    The remarks made in Section 3.1 about personnel apply particularly to intensive OCR. Quality checks are best undertaken by native speakers or people with a good command of the language being checked. Young people generally sustain higher concentration than older people for OCR work. As a rule of thumb, people aged between 18 and 23 years tend to be better suited than those over 25.

    Finally, OCR can be a boring job, which makes motivation and sustained commitment to quality exceptionally important.

    These facts about OCR lead to the following guidelines:

    - Young people between 18 and 25 are best suited for this job.
    - Because the first hours are always the most productive, the work should either be organized on a part-time basis or only the most motivated and concentrated people should be selected for full-time work.
    - Two-thirds of people tend to quit or get bored after about three to five weeks. This translates into poorer quality and low productivity in the last weeks.
    - A regular supply of work is needed to justify the necessary training, to maintain concentration, and to keep spirits high.

    Achievable productivity


    Table 2 OCR productivity

                                    Working hours/day   Pages/day   Pages/month
    Initial training (6 weeks)      3                   6           120
    Optimal productivity level      3                   9           150 to 200
                                    7                   28          500 to 600

    Table 2 gives typical OCR productivity figures. Documents come in all sizes andqualities, and these figures assume that the mix of documents contains an averagenumber of images or tablessay one image and one table of five rows by five columnsevery 8 pages. They also assume that the page images are of medium to highqualitynote that, as discussed above, this depends on the quality of scanningandthat the OCR workers have a good command of the language.

    Table 2 gives separate figures for people undergoing training and for those who have reached their optimal productivity level. If a member of the administrative staff were to allocate three hours a day to OCR, they could achieve 180 to 200 pages of OCR per month. For full-time staff with proper training, high concentration and dedication to quality, 500 to 600 pages a month can be achieved.

    However, the rates that are achieved on difficult pages of low quality, with many columns or many tables, are far lower: perhaps 300 to 400 pages per month for full-time work.

    Assume that the salary cost for dedicated and motivated full-time OCR workers is $400 per month, and the overhead (including management costs, computers, office space, utilities, etc.) comes to another $300 to $400 per person per month. Then the cost of OCR comes to about $1.2 to $1.6 per page. Taking into account the training period, total volume, time-span, and layoff costs should the operation close down for lack of work, these figures rise to $1.5 to $2.5 per page.
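
    The per-page figure above follows from dividing the monthly cost of one worker by that worker's monthly output. The following is a minimal sketch of that arithmetic, using only the salary, overhead and productivity figures quoted in the text; the function name cost_per_page is ours and purely illustrative.

```python
def cost_per_page(salary, overhead, pages_per_month):
    """Monthly cost of one OCR worker divided by that worker's monthly output."""
    return (salary + overhead) / pages_per_month


# Figures quoted in the text: $400 salary, $300 to $400 overhead,
# and 500 to 600 pages per month for a trained full-time worker.
best_case = cost_per_page(400, 300, 600)   # lowest overhead, highest output
worst_case = cost_per_page(400, 400, 500)  # highest overhead, lowest output

print(f"${best_case:.2f} to ${worst_case:.2f} per page")
```

    This prints $1.17 to $1.60 per page, which rounds to the range quoted above. The higher $1.5 to $2.5 range depends on training, volume and layoff costs that are not itemized in the text, so it is not modeled here.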

    The cost of in-house OCR should be weighed against the cost of outsourcing the work to a professional OCR company. These typically charge from $1.5 to $4 per page, including images and tables. Human Info NGO/Simple Words has such a unit in Romania, and charges humanitarian non-profit organizations a special price that ranges from $1.2 to $2 per page. Please contact us at [email protected] for further information and advice.

    3.3 Alternatives to OCR

    There are two alternatives to OCR that we discuss here.

    Manual retyping

    One, which eliminates most scanning as well, is to retype the documents manually, using a word processor. This still requires the images and front cover to be scanned, but the remaining pages need not be scanned, so one can dispense with both powerful scanners and OCR software.

    The people who do this work do not have to understand the text. They must be accurate typists and re-key exactly what they see. Retyping does introduce errors, and double-keying is often used to find and correct these. This method involves two people who independently re-key the same document, after which both digital versions are compared word for word using a special software program by an operator who has the original document in front of them. The assumption is that if the same word has been typed independently twice in the same way, it is correct. However, this is not always true, and for extremely high precision, triple-keying is performed.
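
    The word-for-word comparison used in double-keying can be illustrated with a short sketch. This is not the special software program mentioned above, only an assumed stand-in built on Python's standard difflib module; the function name keying_discrepancies and the sample sentences are hypothetical.

```python
import difflib


def keying_discrepancies(first_keying, second_keying):
    """Return the word-level differences between two independent keyings.

    Each result is a pair (words_in_first, words_in_second) covering one
    span where the two versions disagree. Spans that match are taken to
    be correct, as in the double-keying assumption described in the text.
    """
    a = first_keying.split()
    b = second_keying.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    discrepancies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append((a[i1:i2], b[j1:j2]))
    return discrepancies


first = "Greenstone is a suite of software for building digital library collections"
second = "Greenstone is a suite of sofware for building digital libary collections"

# The operator resolves each reported pair against the paper original.
for words_a, words_b in keying_discrepancies(first, second):
    print(words_a, "vs", words_b)
```

    Only the spans that differ are shown to the operator, which is why two typists making the same mistake in the same place would go unnoticed, and why triple-keying is used when very high precision is needed.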

    The advantage of rekeying is that cost is saved because an OCR program is not needed, and so the computers can be older, lower-range, or second-hand models, whereas powerful computers are needed for OCR. Also, the work can be performed by people with a lower level of skill. The disadvantages are that a training period of at least two months is needed, and that single keying usually produces too many errors, so double or triple keying is needed.

    The cost depends entirely on salary level. Typically, re-keyers in developing countries are paid on the order of $150/month. Their productivity could be twenty to thirty pages per day, corresponding to 400 pages per month, images included. With double-keying, this makes the total salary costs around $300 per month, plus overheads.

    Image files

    A very low cost alternative to OCR is simply to use a PDF image version of the document pages. The cost is only a fraction of OCR's: about $0.1 per page.

    Once scanning has been completed and TIFF files are available, an automatic converter (usually Adobe Acrobat or Adobe Photoshop) converts all TIFF files of book pages into PDF files.

    The downside is that these files are not searchable. Also, they are quite large: usually 50 Kb per page, plus or minus 20% depending on the quality of the original TIFF file.
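
    To get a feel for what the per-page figure means for a whole collection, the size range can be computed directly. This is a rough sketch based only on the 50 Kb and 20% figures above; the function name is hypothetical, and it assumes 1 Mb = 1000 Kb.

```python
def pdf_image_size_mb(pages, kb_per_page=50, tolerance=0.20):
    """Return the (low, high) size in megabytes of an image-PDF collection.

    Assumes 1 Mb = 1000 Kb; kb_per_page and tolerance default to the
    figures quoted in the text.
    """
    low = pages * kb_per_page * (1 - tolerance) / 1000
    high = pages * kb_per_page * (1 + tolerance) / 1000
    return low, high


low, high = pdf_image_size_mb(1000)
print(f"1000 pages: roughly {low:.0f} to {high:.0f} Mb")
```

    For a 1000-page collection this gives roughly 40 to 60 Mb, far more than a floppy disk can hold.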

    PDF image files are slow (sometimes, in developing countries, impossible or prohibitively expensive) to download. They rarely fit on a floppy disk, and do not support text manipulation functions such as cut-and-paste.

    The PDF image file method should only be used if no OCR budget is available, and for documents that are likely to be used by a small number of people who have high-speed low-cost Internet access.

    3.4 Combining scanning and OCR

    If a scanner is connected directly to the computer that runs the OCR software, most OCR programs can scan a page and perform OCR immediately. Page-by-page scanning and OCR is a reasonable strategy for low volumes, but will prove time-consuming for bigger and more continuous jobs.

    For up to 100 to 150 pages per month, this solution may suffice. For higher volumes it is faster and more efficient to scan the document first, then perform OCR on all the pages as a separate step.

    4 Three examples: 1000 to 100,000 pages

    4.1 Typical small collection: 500 to 1000 pages

    Most NGOs have 500 to 1000 pages to scan. This volume can be OCRed in-house if motivated volunteers are available.

    Scanning

    The first step is to scan the publications to generate a high-quality TIFF file of each page, and a separate line-art, grey-scale or color bitmap image for each illustration. Assuming that 1000 pages have to be scanned, this might represent a part-time job of about one month, just for scanning. The TIFF files would consume 60 to 80 Mb of hard-disk space, and a good policy is to create a CD-R containing these files. A low-cost flatbed scanner of $100 to $300 will be sufficient for the job. Scanning can be done after working hours or during the weekends by a volunteer in the office or at home.

    OCR

    The second step is OCR by another volunteer, or team of volunteers, skilled in language and correction. The TIFF files can either be shared between computers, or one computer can be used for the entire job. Typically, it will take five or six months of part-time labor (e.g. 20 hours a week) to convert 1000 pages into perfect Word or HTML documents.
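
    The duration estimate can be reproduced from the part-time productivity level in Table 2 (150 to 200 pages per month). The sketch below assumes that this rate, which Table 2 gives for three working hours a day, is a fair stand-in for the part-time schedule described here; the function name is hypothetical.

```python
def months_of_part_time_ocr(pages, pages_per_month=(150, 200)):
    """Return the (shortest, longest) duration in months for one part-time worker.

    pages_per_month is the (slowest, fastest) monthly output; the default is
    the part-time "optimal productivity level" row of Table 2.
    """
    slowest, fastest = pages_per_month
    return pages / fastest, pages / slowest


# The three collection sizes used as examples in this chapter.
for pages in (1000, 5000, 100_000):
    shortest, longest = months_of_part_time_ocr(pages)
    print(f"{pages} pages: about {shortest:.0f} to {longest:.0f} months")
```

    The results fall in the same range as the labor estimates quoted for each example in this chapter, and show how quickly volunteer-based OCR becomes impractical as the page count grows.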

    Outsourcing

    An alternative is to outsource the scanning and OCR process. It would probably cost $1500 to $2000 to convert everything into perfect Word and HTML files.

    4.2 All publications from an organization: 5000 pages

    Many larger organizations have archives of around 5000 pages of current or out-of-print books, journals, newsletters, grey literature, etc.

    Scanning

    This is too much for a flat-bed scanner. Scanning should either be outsourced (approximately $400 for 5000 pages) or a sheet-feeder scanner purchased (approximately $900). Alternatively, a more expensive scanner could be bought together with a few other institutions or NGOs ($6000 costs divided by the number of participants). All 5000 pages in TIFF format will take about 300 to 400 Mb of hard-disk space. Again, a good policy is to create a CD-R containing these files.

    OCR

    The second step is OCR by another volunteer, or team of volunteers, skilled in OCR and correction. Again, several computers might be used, or one computer for the whole job. It would take 25 to 30 months of half-time labor (assuming 20 hours a week) to convert 5000 pages into perfect Word or HTML. In practice this is too long and too computer-intensive to manage on a volunteer basis. One would have to pay volunteers, monitor them for performance and quality, provide adequate space, etc., in order to have the job finished within reasonable time at a high level of quality.

    Alternatively one could create image PDF files, which would take 300 to 400 Mb of space and would be harder to download over the Internet.

    Outsourcing

    An alternative is to outsource the scanning and OCR processes. It would probably cost $7500 to $10,000 to convert everything into perfect Word and HTML files.

    4.3 A small library: 100,000 pages

    Larger organizations, universities, governments, and specialized libraries might have a whole library to digitize, say 100,000 pages. The first issue to consider is the copyright status of the publications. If they are not in the public domain, explicit permission to digitize them must be obtained from the copyright holders. You should also check whether the files are already available digitally.

    Scanning

    The volume is too high for a sheet-feed scanner. Scanning should either be outsourced ($8000 for 100,000 pages), or a more expensive scanner purchased together with a few other institutions or NGOs ($6000 shared between the participants). 100,000 pages in TIFF format will take 6 to 8 Gb of hard-disk space. The best plan is to create a set of CD-R copies containing these files.
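
    The number of discs in such a set is easy to estimate. The sketch below assumes a CD-R capacity of 700 Mb and 1 Gb = 1000 Mb; neither figure comes from this manual, so adjust them to the media actually used. The function name is hypothetical.

```python
import math


def cdr_discs_needed(total_mb, disc_capacity_mb=700):
    """Return the number of CD-R discs needed to hold total_mb megabytes.

    disc_capacity_mb is an assumed capacity, not a figure from the manual.
    """
    return math.ceil(total_mb / disc_capacity_mb)


# 6 to 8 Gb of TIFF files, taking 1 Gb = 1000 Mb.
print(cdr_discs_needed(6000), "to", cdr_discs_needed(8000), "discs")
```

    Under these assumptions the TIFF archive of a 100,000-page library needs about 9 to 12 discs per copy.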

    OCR

    The second step is OCR (or creation of PDF files for less widely used documents). It would take 500 to 700 months of half-time labor to convert 100,000 pages into perfect Word or HTML. This is impossible to realize with volunteers, and the job must be done on a professional basis.

    To save cost, some of the less-frequently-used pages (say 80%, or 80,000 pages) could be transformed into PDF, and the other 20,000 pages into Word and HTML. The PDFs would take 4 to 6 Gb space and be harder to download on the Internet, but would cost only $0.2 per page to create by a professional organization (total of $16,000). If 80,000 PDF files were created from TIFF files by volunteers using PDF conversion programs like Adobe Acrobat, 10 to 20 months of part-time work would be necessary on a powerful computer.

    Outsourcing

    An alternative is to outsource the work. If the 80% PDF and 20% HTML mix were maintained, the PDF would cost around $16,000 and the HTML $30,000 to $40,000, a total budget of around $50,000. If everything were OCRed, it would cost $150,000 to $200,000 to convert the entire collection into perfect Word and HTML files.
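
    The budget for a mixed PDF and OCR job can be worked out for any split between the two. The sketch below uses the $0.2 per page PDF rate quoted above and an OCR rate of $1.5 to $2 per page, which is the rate implied by the totals in this chapter; the function name and parameter names are hypothetical.

```python
def outsourcing_budget(pages, pdf_share, pdf_rate=0.2, ocr_rate=(1.5, 2.0)):
    """Return the (low, high) cost in dollars of outsourcing a mixed job.

    pdf_share is the fraction of pages delivered as image PDF; the rest
    are converted by OCR at between ocr_rate[0] and ocr_rate[1] per page.
    """
    pdf_pages = round(pages * pdf_share)
    ocr_pages = pages - pdf_pages
    pdf_cost = pdf_pages * pdf_rate
    low = pdf_cost + ocr_pages * ocr_rate[0]
    high = pdf_cost + ocr_pages * ocr_rate[1]
    return low, high


for share in (0.8, 0.0):
    low, high = outsourcing_budget(100_000, share)
    print(f"{share:.0%} PDF: ${low:,.0f} to ${high:,.0f}")
```

    With 80% of the pages as PDF the total comes to $46,000 to $56,000, in line with the budget of around $50,000 given above; with no PDF at all it is $150,000 to $200,000.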

    5 Creating an electronic collection

    Three important aspects should be kept in mind when deciding to create digital collections. First, the collection must be organized. The more content there is, the greater the need for indexes and powerful search systems. For collections of 3000 to 5000 pages or more, indexes and search systems are essential. Second, the needs of end-users must prevail. The target groups that will use the collection should be identified, and a process of regular consultation set up. Third, the available budget will determine how much can be done.

    5.1 Methods of collection building

    There are many examples of excellent CD-ROMs that are created on the web-page model. HTML, PDF or Word documents are added and linked using hyperlinks. Navigation is made simple and attractive by the use of hyperlinks, frames, keywords, indexes and so on. Such systems work well up to a few thousand pages, but from 3000 to 5000 pages onwards it is important to have a well-structured collection and a powerful search facility. This is where the Greenstone software can help.

    The Greenstone Digital Library software creates a structured digital library including a very powerful search and retrieval engine. Up to 150,000 pages can be indexed on a single CD-ROM. Every CD-ROM can become an Internet server. Greenstone is open-source software, and is freely available under the GNU license.

    The companion manuals describe how to build Greenstone collections. There are essentially three different ways of building collections:

    The librarian interface
    The Collector
    Building from the command line.

    The first method is the librarian interface, described in the Greenstone Digital Library User's Guide (Chapter 3, Making Greenstone Collections). This is a comprehensive interactive facility for collection-building. With it, you can collect sets of documents, import or assign metadata, and build them into a Greenstone collection. The second method is the Collector subsystem, described in Chapter 4 of the User's Guide. This is an older facility that provides an alternative way of building collections of web pages or other documents. It guides you through a sequence of interactive web pages that request the information needed. However, it does not provide any way of adding metadata to the documents, and, because it is a web interface, it is not really suitable for collections that take more than a few minutes to build. The third method is to run the programs for collection-building directly from the command line; this is described in the Greenstone Digital Library Developer's Guide (Chapter 1). This gives more flexibility in running programs individually and saving intermediate results, which may be desirable for collections that take many hours to build. You will also need to read Chapter 2 of the Developer's Guide in order to harness the full power of Greenstone to build advanced collections.

    There is a fourth method for creating and editing the material associated with a collection, a program called the Collection Organizer. However, its functionality has been superseded by the librarian interface mentioned above. It is described in a legacy document entitled Using the Organizer.

    5.2 Getting started in seven steps and 15 minutes

    The best way of getting the look and feel of the librarian interface is to actually create a small test library. If you have 15 minutes, please follow these steps and you will understand this program much better.

    Before getting started, first install Greenstone (see the Greenstone Installer's Guide), which includes the Demo collection in DLS format and its source files. Note that if you wish to be able to add to your collection any of the 140 documents in the DLS collection (instead of just the 11 of these documents in the Greenstone Demo collection), you should install DLS as one of the sample Greenstone libraries. The Demo and DLS collections will be installed in C:\Program Files\gsdl\collect, in subdirectories demo and dls respectively. If you previously installed Greenstone without DLS and wish to install it, then you may re-insert your Greenstone CD-ROM and add this collection. It is not necessary to uninstall Greenstone first.

    We suggest that you print the instructions below and follow them step by step:

    1. Launch the librarian interface under Windows by selecting Greenstone Digital Library from the Programs section of the Start menu and choosing Librarian Interface. If you are using Unix, instead type

        cd ~/gsdl
        cd gli
        ./gli.sh

    where ~/gsdl is the directory containing your Greenstone system.

    2. Select New from the File menu in the horizontal menu bar at the top of the window. Give it a title, for example My First Collection, and fill out your email address and a brief description of the collection. In the Base this collection on menu, choose greenstone demo or Development Library Subset (the effect is the same because these two collections have the same structure).

    3. Add some documents from the Demo collection (or the DLS collection if it is installed) to your new collection. To do this, double-click the Greenstone Collections folder in the left-hand panel, then double-click the collection you desire. The documents in it are displayed underneath. Select one of these, drag it, and drop it into the right-hand panel. This panel represents the collection you are building. Choose several documents and drag them into it one by one, or using multiple selection in the standard way.

    4. Add some of your own documents that are not in the Demo or DLS collections. Close the Greenstone Collections folder in the left-hand panel and double-click the Local Filespace folder. Navigate to a directory that contains some documents (e.g. small Word or HTML files). Drag a few of these into the right-hand panel to include them in your collection.

    5. Add metadata to the documents in your collection. So far you have been operating under the Gather panel, indicated by the Gather tab underneath the horizontal menu bar at the top of the window. Click the Enrich tab beside it. The documents in your collection now appear in the left-hand panel: click one and examine the metadata associated with it in the Element Value table at the top right. Use the panel underneath to change individual values by selecting the desired Element and either choosing an existing value from the list or typing a new value into the box near the bottom. Add Title, Organization, and Keyword metadata to each of your own documents that you put in the collection. After you type each value you need to click Append to add that value to the metadata.

    6. Click the Create tab to leave the Enrich mode and create your new collection. Click the Build Collection button at the bottom. While the computer is building the collection you will receive some feedback on what it is doing.

    7. When it has finished, click the Preview tab to view the collection from within the librarian interface. Check the titles a-z, organisations and how to lists to ensure that your documents have been included in the collection. You will also find when you visit your Greenstone home page that the collection has been installed as one of the regular collections.

    GNU Free Documentation License

    Version 1.2, November 2002

    Copyright (C) 2000,2001,2002 Free Software Foundation, Inc. 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

    Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

    0. PREAMBLE

    The purpose of this License is to make a manual, textbook, or other functional and useful document "free" in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.

    This License is a kind of "copyleft", which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.

    We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.

    1. APPLICABILITY AND DEFINITIONS

    This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The "Document", below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as "you". You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.

    A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.

    A "Secondary Section" is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.

    The "Invariant Sections" are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.

    The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.

    A "Transparent" copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not "Transparent" is called "Opaque".

    Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.

    The "Title Page" means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, "Title Page" means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.

    A section "Entitled XYZ" means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as "Acknowledgements", "Dedications", "Endorsements", or "History".) To "Preserve the Title" of such a section when you modify the Document means that it remains a section "Entitled XYZ" according to this definition.

    The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.

    2. VERBATIM COPYING

    You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.

    You may also lend copies, under the same conditions stated above, and you may publicly display copies.

    3. COPYING IN QUANTITY

    If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.

    If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.

    If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.

    It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.

    4. MODIFICATIONS

    You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:

    A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.

    B. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement.

    C. State on the Title page the name of the publisher of the Modified Version, as the publisher.

    D. Preserve all the copyright notices of the Document.

    E. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.

    F. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.

    G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice.

    H. Include an unaltered copy of this License.

    I. Preserve the section Entitled "History", Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled "History" in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence.

    J. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the "History" section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.

    K. For any section Entitled "Acknowledgements" or "Dedications", Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.

    L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles.

    M. Delete any section Entitled "Endorsements". Such a section may not be included in the Modified Version.

    N. Do not retitle any existing section to be Entitled "Endorsements" or to conflict in title with any Invariant Section.

    O. Preserve any Warranty Disclaimers.

    If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.

    You may add a section Entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by various parties--for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.

    You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.

    The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.

    5. COMBINING DOCUMENTS

    You may combine the Document with other documents released under this License,under the terms defined in section 4 above for modified versions, provided that youinclude in the combination all of the Invariant Sections of all of the original documents,unmodified, and list them all as Invariant Sections of your combined work in its licensenotice, and that you preserve all their Warranty Disclaimers.

    The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.

    In the combination, you must combine any sections Entitled "History" in the various original documents, forming one section Entitled "History"; likewise combine any sections Entitled "Acknowledgements", and any sections Entitled "Dedications". You must delete all sections Entitled "Endorsements."

    6. COLLECTIONS OF DOCUMENTS

    You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.

    You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.

    7. AGGREGATION WITH INDEPENDENT WORKS

    A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an "aggregate" if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.

    If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.

    8. TRANSLATION

    Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.

    If a section in the Document is Entitled "Acknowledgements", "Dedications", or "History", the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.

    9. TERMINATION

    You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

    10. FUTURE REVISIONS OF THIS LICENSE

    The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.

    Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License "or any later version" applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.

    How to use this License for your documents

    To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:

    Copyright (c) YEAR YOUR NAME. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

    If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the "with...Texts." line with this:

    with the Invariant Sections being LIST THEIR TITLES, with the Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST.

    If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.

    If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.
