Cleaning plain text books with Text::Perfide::BookCleaner

Date post: 18-Dec-2014
Upload: andrefsantos
Description:
Slides from a presentation about Text::Perfide::BookCleaner given at PtPW2011. T::P::BC is a Perl module created to clean books in plain text format, making them suitable for further automatic text processing activities.
Cleaning plain text books with Text::Perfide::BookCleaner, André Santos [email protected], September 23, 2011
Transcript
Page 1: Cleaning plain text books with Text::Perfide::BookCleaner

Cleaning plain text books with Text::Perfide::BookCleaner

André Santos [email protected]

September 23, 2011

Page 2: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Per-Fide

1 Introduction
    Per-Fide
    Text alignment
    Books

2 Text::Perfide::BookCleaner

3 Conclusions, wish list and future work

Andre Santos [email protected] Cleaning plain text books with Text::Perfide::BookCleaner

Page 4: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Per-Fide

Project Per-Fide

Joint venture between the Computer Science Department and the School of Humanities of the University of Minho

Portuguese in parallel with six languages: Español, Russian, Français, Italiano, Deutsch, English

Build parallel corpora that will establish a relation between Portuguese and the other 6 languages

Page 7: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Per-Fide

[Parallel] Corpora

Corpora: collection of natural language texts

Parallel corpora: collection of natural language bitexts

Bitext: pair formed by a text in a given language and its translation in another language, frequently aligned.

Alignment: mapping between the sentences/paragraphs/words of one text and the other.

Page 8: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Per-Fide

Project Per-Fide

Original texts in the seven languages and their translations

Two main genres: contemporary fiction and non-fiction

non-fiction: judicial, journalistic, religious, technical, ...

fiction: contemporary novels and short stories

per-fide.di.uminho.pt

Page 13: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Text alignment

Text alignment

Manual or automatic

Paragraph/sentence/word level

Automatic alignment tools/algorithms generally fall into three categories:

length based: “when two sentences correspond, the words in them also correspond”

lexical/dictionary based: relies on lexical information or dictionaries to perform the alignment

partial similarity (cognates) based: relies on occurrences of tokens graphically or otherwise identical (cognates)
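
As a rough illustration of the third category, the sketch below scores a sentence pair by the tokens the two sentences share verbatim. It is written in Python purely for illustration; the function names, and the choice of numbers and capitalised words as cognate candidates, are assumptions and are not taken from any aligner mentioned in the talk.

```python
import re

def cognate_tokens(sentence):
    """Tokens likely to survive translation unchanged: numbers and capitalised words."""
    return set(re.findall(r"\d+|[A-Z][a-z]+", sentence))

def cognate_score(source, target):
    """Jaccard overlap between the cognate candidates of two sentences."""
    a, b = cognate_tokens(source), cognate_tokens(target)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# "1815" and "Digne" are shared; "Em" and "En" are not.
score = cognate_score("Em 1815, era bispo de Digne",
                      "En 1815, il était évêque de Digne")
```

A real aligner would combine such a score with sentence lengths or a dictionary; this only shows the idea of anchoring on identical tokens.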

Page 19: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Text alignment

Text alignment – Example

Table: Extract of sentence-level alignment performed using Portuguese and Russian subtitles from the movie Tron.

Page 20: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Books

Books

Obtained directly from publishers or, if in public domain, from Project Gutenberg and similar projects

Large variety of formats: PDF, MS Word, HTML, ebook formats, ...

If not already in plain text, they need to be converted before the alignment

This is where all the trouble starts!

Page 24: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Books

Book alignment problems

pagination – page numbers, headers, footers, . . .

previous text formatting – sub/superscript, bold, italics, . . .

sections

paragraphs

translineations and transpaginations

footnotes

text encoding

. . .

Page 25: Cleaning plain text books with Text::Perfide::BookCleaner

Introduction Books

Book alignment problems – Example

(. . . )

gaiement. Sur le devant s<92>’ouvrait la porte

d<92>’entrée, donnant accès dans la salle commune.

Une légère véranda, qui en proté-

<96>- 86 <96>-

^L geait la partie antérieure contre l<92>’action

des rayons solaires, reposait sur de sveltes bambous.

Le tout était peint d<92>’une fraîche

(. . . )

La Jangada, Jules Verne

Page 28: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

First approach

RegExp + Find & Replace

Page 30: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

First approach

Well-intentioned but:

Too naïve

Big mess

A more sophisticated approach was needed!

Page 31: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Architecture

Build a pipeline; each step handles a specific set of problems.

1 pages

2 sections

3 paragraphs

4 footnotes

5 chars

6 . . .

7 commit
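
A minimal sketch of such a pipeline, in Python for illustration only (the module itself is written in Perl, and none of the names below are its API). Each step is assumed to be a function from text to text; the two placeholder steps and the `_pb_` mark are hypothetical.

```python
def run_pipeline(text, steps):
    """Apply each cleaning step in order; every step takes and returns text."""
    applied = []
    for name, step in steps:
        text = step(text)
        applied.append(name)
    return text, applied

# Placeholder steps (assumptions): real steps would be far more careful.
def pages(text):
    return text.replace("\f", "_pb_")      # mark page breaks (form feed, ^L)

def chars(text):
    return text.replace("\x92", "'")       # repair a mis-decoded apostrophe

steps = [("pages", pages), ("chars", chars)]
cleaned, applied = run_pipeline("l\x92action\fdes rayons", steps)
```

Keeping every step behind the same text-in, text-out interface is what lets steps be added, reordered or skipped independently.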

Page 34: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Architecture

whenever possible, use ontologies and DSLs

they help to organize things

they make it possible to abstract from the code and discuss details at a higher level (even with people from other areas)

Page 35: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Pages

Goal: identify and remove from the text the elements related to book pagination:

page numbers

headers

footers

page breaks

These elements often degrade the performance of the aligner.

Page 36: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Pages – Example

est vrai qu’il fallait être assez chanceux pour

rencontrer le nabab, et assez audacieux pour

s’emparer de sa personne.

Page 3

^L La maison à vapeur Jules Verne

Le faquir, - évidemment le seul entre tous

que ne surexcitât pas l’espoir de gagner la

prime, - filait au milieu des groupes, s’arrêtant

La Maison à Vapeur, Jules Verne

Page 37: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Pages – Algorithm

1 identify page breaks (e.g., ^L)

2 nearby lines are candidates for headers and footers

3 count the occurrences of each normalized candidate

4 headers and footers are extracted from candidates which occur more than a threshold value

5 replace everything with a custom mark

6 move all the necessary information to a standoff file
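
The first five steps can be sketched as follows, in Python for illustration only. Several things here are assumptions rather than the module's behaviour: candidates are just the first and last non-blank line of each page, normalization only collapses digits, and the numbering of the `_pbN_` marks is invented to resemble the `_pb2_` mark shown in the example. The standoff file of step 6 is left out.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse digits so "Page 3" and "Page 4" count as the same candidate."""
    return re.sub(r"\d+", "#", line.strip())

def find_headers_footers(text, threshold=2):
    """Return normalized first/last page lines seen on more than `threshold` pages."""
    counts = Counter()
    for page in text.split("\f"):            # step 1: ^L is the form feed character
        lines = [l for l in page.splitlines() if l.strip()]
        if lines:                            # step 2: lines next to the break
            counts.update({normalize(lines[0]), normalize(lines[-1])})
    return {c for c, n in counts.items() if n > threshold}   # steps 3-4

def mark_pages(text, threshold=2):
    """Drop detected headers/footers and replace each page break with _pbN_."""
    repeated = find_headers_footers(text, threshold)
    bodies = []
    for page in text.split("\f"):
        kept = [l for l in page.splitlines() if normalize(l) not in repeated]
        bodies.append("\n".join(kept))
    result = bodies[0]
    for n, body in enumerate(bodies[1:], start=1):            # step 5
        result += f" _pb{n}_\n{body}"
    return result

sample = "\f".join(
    f"La maison à vapeur Jules Verne\n{body}\nPage {n}"
    for n, body in enumerate(["Il fallait être chanceux.", "Le faquir filait.", "Fin."], 1)
)
```

Counting normalized candidates is what separates a running header from an ordinary first line of a page: the header repeats, the body text does not.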

Page 39: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Pages – Example

est vrai qu’il fallait être assez chanceux pour

rencontrer le nabab, et assez audacieux pour

s’emparer de sa personne. _pb2_

Le faquir, - évidemment le seul entre tous

que ne surexcitât pas l’espoir de gagner la

prime, - filait au milieu des groupes, s’arrêtant

La Maison à Vapeur, Jules Verne

Page 40: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Sections

Goal: identify and normalize the divisions between the several sections of a book (parts, chapters, acts, scenes, epilogue, afterword, ...)

An ontology was created, containing types of divisions and subdivisions, in several languages.

Page 41: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Sections – Ontology

Example

cap

PT capítulo, cap, capitulo

FR chapitre, chap

EN chapter, chap

NT sec

PT fim

FR fin

EN the_end

BT _alone

This ontology is used to automatically generate part of the code.
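
One way such generation could work is sketched below in Python, for illustration only. The dictionary is a hypothetical in-memory form of the `cap` entry above, and `build_section_regex` is an invented name; how the module actually stores the ontology and what code it generates are not shown in the slides.

```python
import re

# Hypothetical in-memory form of the "cap" ontology entry.
ONTOLOGY = {
    "cap": {
        "PT": ["capítulo", "cap", "capitulo"],
        "FR": ["chapitre", "chap"],
        "EN": ["chapter", "chap"],
    },
}

def build_section_regex(ontology):
    """Build one case-insensitive regex with a named group per ontology concept."""
    groups = []
    for concept, by_lang in ontology.items():
        terms = {t for words in by_lang.values() for t in words}
        # Longest terms first, so "chapter" is tried before "chap".
        ordered = sorted(terms, key=lambda t: (-len(t), t))
        groups.append("(?P<%s>%s)" % (concept, "|".join(re.escape(t) for t in ordered)))
    return re.compile(r"^\s*(?:%s)\b" % "|".join(groups), re.IGNORECASE)

SECTION_RE = build_section_regex(ONTOLOGY)
```

With this layout, supporting a new language or a new kind of division means editing the ontology, not the matching code.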

Page 42: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Sections – Example

PRIMEIRA PARTE

FANTINE

^L LIVRO PRIMEIRO

UM JUSTO

O abade Myriel

Em 1815, era bispo de Digne, o reverendo Carlos

Francisco Bemvindo Myriel, o qual contava setenta e

Os Miseráveis, Vitor Hugo

Page 43: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Sections – Algorithm

1 Search for potential section divisions:

lines with keywords – capítulo, chapter, Chap., Appendix, Table des Matières, . . .

pages or lines containing only numbers

roman numbering

. . .

2 Insert a custom mark immediately before the section identified
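
A minimal Python sketch of these two steps, for illustration only. The keyword list is a small hand-written sample, the three patterns are assumptions, and the plain `_sec_` mark is a simplification of the richer marks shown in the example (such as `_sec+O:PARTE=PRIMEIRA_`).

```python
import re

# Assumed patterns for the three kinds of evidence listed above.
KEYWORD = re.compile(r"^\s*(?:chapter|chap\.?|capítulo|appendix)(?!\w)", re.IGNORECASE)
NUMBER_ONLY = re.compile(r"^\s*\d+\s*$")
ROMAN_ONLY = re.compile(r"^\s*[IVXLCDM]+\.?\s*$")

def mark_sections(text, mark="_sec_"):
    """Insert `mark` on its own line immediately before each candidate section line."""
    out = []
    for line in text.splitlines():
        if any(p.match(line) for p in (KEYWORD, NUMBER_ONLY, ROMAN_ONLY)):
            out.append(mark)
        out.append(line)
    return "\n".join(out)

marked = mark_sections("Chap. 3\nIt was late.\nIV\nI saw him.")
```

Patterns this loose produce false positives (a line consisting only of the word "I" would be marked), which is one reason a report for manual review is useful.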

Page 45: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Sections – Example

_sec+O:PARTE=PRIMEIRA_

FANTINE

_sec+O:LIVRO=PRIMEIRO_

UM JUSTO

O abade Myriel

Em 1815, era bispo de Digne, o reverendo Carlos

Francisco Bemvindo Myriel, o qual contava setenta e

Os Miseráveis, Vitor Hugo

Page 46: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Sections

Identifying the different parts within a bitext:

makes it possible to subsequently compare the two versions and remove parts which can only be found in one of them

makes it possible to perform a structural alignment (see Text::Perfide::BookSync)

Page 47: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Paragraphs

Goal: identify and normalize paragraph notation, direct speech, etc.

Page 48: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Paragraphs – Example

L’hôtesse prit la défense de son curé:

- D’ailleurs, il en plierait quatre comme vous sur

son genou. Il a, l’année dernière, aidé nos gens à

rentrer la paille; il en portait jusqu’à six bottes

à la fois, tant il est fort!

- Bravo! dit le pharmacien. Envoyez donc vos filles

en confesse à des gaillards d’un tempérament pareil!

Moi, si j’étais le gouvernement, je voudrais qu’on

saignât les prêtres une fois par mois.

Page 49: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Paragraphs – Example

L’hôtesse prit la défense de son curé:

"D’ailleurs, il en plierait quatre comme vous sur

son genou. Il a, l’année dernière, aidé nos gens à

rentrer la paille; il en portait jusqu’à six bottes

à la fois, tant il est fort!"

"Bravo!" dit le pharmacien. "Envoyez donc vos filles

en confesse à des gaillards d’un tempérament pareil!

Moi, si j’étais le gouvernement, je voudrais qu’on

saignât les prêtres une fois par mois."

Page 50: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Paragraphs – Algorithm

paragraph identification is performed by calculating metrics based on the number of blank lines and indentation

identification and normalization of direct speech:

punctuation, paragraph, dash

text in quotes
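
The paragraph metrics could look something like the Python sketch below, given for illustration only. The two counts and the decision rule (whichever signal is more frequent wins) are assumptions; the slides do not state which metrics or thresholds the module uses, and direct speech is not handled here.

```python
def paragraph_metrics(text):
    """Count blank lines and indented non-blank lines."""
    lines = text.splitlines()
    non_blank = [l for l in lines if l.strip()]
    return {
        "blank": len(lines) - len(non_blank),
        "indented": sum(1 for l in non_blank if l[0] in " \t"),
    }

def guess_paragraph_notation(text):
    """Guess whether paragraphs are separated by blank lines or by indentation."""
    m = paragraph_metrics(text)
    if m["blank"] == 0 and m["indented"] == 0:
        return "unknown"
    return "blank_line" if m["blank"] >= m["indented"] else "indent"
```

Deciding the notation once per book, rather than per line, reflects that a single converted file tends to use one convention throughout.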

Page 51: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Footnotes

Goal: identify and remove footnote call marks and footnote expansions

Page 52: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Footnotes – Example

On fit un inventaire de son argent comptant, et on

le mena dans le château que fit construire le roi

Charles V, fils de Jean II, auprès de la rue

Saint-Antoine, à la porte des Tournelles[1].

[1] La Bastille, qui fut prise par le peuple de

Paris, le 14 juillet 1789, puis démolie. B.

^L Quel était en chemin l’étonnement de l’Ingénu!

je vous le laisse à penser. Il crut d’abord

que c’était un rêve.

Oeuvres de Voltaire, Voltaire

Page 53: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Footnotes – Algorithm

1 Search for footnote expansions (lines beginning with <<1>>, [2], ^3, . . . )

2 Replace with custom mark

3 Only footnote call marks left

4 Search again for the same patterns in the middle of the text

5 Replace with custom mark
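
The two passes can be sketched in Python as below, for illustration only. The sketch handles just the `[n]` notation, assumes an expansion runs until a blank line or a page break, and numbers the `_fneN_` and `_fnrN_` marks sequentially; the mark names mirror the slides' example, but the numbering scheme is an assumption.

```python
import itertools
import re

def mark_footnotes(text):
    """Replace "[n]" footnote expansions, then the remaining "[n]" call marks."""
    out, expansions, in_note = [], 0, False
    for line in text.split("\n"):
        if re.match(r"\[\d+\]", line):                 # step 1: an expansion starts here
            expansions += 1
            out.append(f"_fne{expansions}_")           # step 2
            in_note = True
        elif in_note and line.strip() and not line.startswith("\f"):
            continue                                   # still inside the expansion
        else:
            in_note = False
            out.append(line)
    calls = itertools.count(1)                         # steps 3-5: only call marks remain
    return re.sub(r"\[\d+\]", lambda m: f"_fnr{next(calls)}_", "\n".join(out))

page = ("la porte des Tournelles[1].\n"
        "[1] La Bastille, qui fut prise par le peuple de\n"
        "Paris, le 14 juillet 1789, puis démolie. B.\n"
        "\fQuel était en chemin")
marked = mark_footnotes(page)
```

Removing the expansions first is what makes the second search safe: once they are gone, any remaining match of the same pattern has to be a call mark.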

Page 55: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Footnotes – Algorithm

On fit un inventaire de son argent comptant, et on

le mena dans le château que fit construire le roi

Charles V, fils de Jean II, auprès de la rue

Saint-Antoine, à la porte des Tournelles_fnr29_.

_fne8_

^L Quel était en chemin l’étonnement de l’Ingénu!

je vous le laisse à penser. Il crut d’abord

que c’était un rêve.

Oeuvres de Voltaire, Voltaire

Page 56: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Words and characters

translineations

text encoding

. . .
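
For translineations (words hyphenated across a line break), a naive repair is sketched below in Python, for illustration only. The function name and the single regular expression are assumptions; the slides do not describe how the module handles this case.

```python
import re

def join_translineations(text):
    """Rejoin a word split as "proté-" + newline + "geait", moving the break after it."""
    return re.sub(r"(\w+)-\n(\w+) ?", r"\1\2\n", text)

fixed = join_translineations("qui en proté-\ngeait la partie")
```

This rule also merges genuine compounds that happen to break at their hyphen (for example "Saint-" followed by "Antoine"), so a real implementation would need a dictionary or frequency check.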

Page 57: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Report

Previous steps produce a report

Summarizes what was found, what was assumed and what was done

Main goal is to allow a diagnosis of what the program did, so that what is wrong can be amended manually

Page 58: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Report

livros/_FR_15.pdf.txt:

footers=[’( Page) = 241’]

headers=[

"(La maison \x{e0} vapeur Jules Verne) = 241"]

ctrL=1;

pagnum_ctrL=241;

sectionsO=2;

sectionsN=30;

word_tr=58;

words=118036;

Page 59: Cleaning plain text books with Text::Perfide::BookCleaner

Text::Perfide::BookCleaner

Commit

Final and irreversible step which removes all the custom marks added by the previous steps

Outputs a cleaned copy of the document

This is the last stage before the alignment (or any other further processing)
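
A minimal Python sketch of a commit step, for illustration only. The pattern covers just the mark shapes that appear in this talk's examples (`_pb2_`, `_fnr29_`, `_fne8_`, `_sec+O:PARTE=PRIMEIRA_`); the module's full set of marks is not listed in the slides, so the regular expression is an assumption.

```python
import re

# Marks seen in the examples, with an optional space before them.
MARK = re.compile(r" ?_(?:pb|fnr|fne)\d+_| ?_sec[^_\s]*_")

def commit(text):
    """Return a copy of `text` with the custom marks removed."""
    return MARK.sub("", text)

cleaned = commit("s’emparer de sa personne. _pb2_\nLe faquir")
```

Because the marks are the only record of what was removed, a reversible workflow would have to keep the marked file (or the standoff file) alongside the cleaned copy.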

Page 62: Cleaning plain text books with Text::Perfide::BookCleaner

Conclusions, wish list and future work

Conclusions and wish list

There is no de facto standard format for plain text books (documents?)

Documents are highly heterogeneous (provenance, type and quantity, notation formats, . . . )

Hurrah to regular expressions!

20/80 rule applies

Page 63: Cleaning plain text books with Text::Perfide::BookCleaner

Conclusions, wish list and future work

Conclusions and wish list

Ontologies and DSLs lead to a better structure

Common pattern:

search text

calculate metrics

perform action accordingly

Report generated at the end should present a smart summary of what was found and done

Page 64: Cleaning plain text books with Text::Perfide::BookCleaner

Conclusions, wish list and future work

Related ongoing work

Text::Perfide::BookPairs: find repeated books and pairs of books (same book in different languages) within a collection

Text::Perfide::BookSync: uses the section delimitation made by T::P::BC to make a structural alignment

Text::Perfide::CorporaFlow: uses a DSL to guide the corpora preparation workflow (to be done)

Text::Perfide::SciPaperCleaner: cleaner for scientific papers (to be done)

Page 68: Cleaning plain text books with Text::Perfide::BookCleaner

Conclusions, wish list and future work

Future work

Standoff annotation – no changes in the original file until commit

Export to ebook formats – .fb2, .epub, . . .

. . .

Page 69: Cleaning plain text books with Text::Perfide::BookCleaner

Conclusions, wish list and future work

CPAN

Is it on CPAN yet?

No, but it will be really, really soon!

Missing

More and better documentation

More and better tests

Page 71: Cleaning plain text books with Text::Perfide::BookCleaner

Conclusions, wish list and future work

Questions

o/

André Santos [email protected]
