Ontology learning from text

Post on 21-May-2015

Ontology Learning From Text?

Robert Stevens

BioHealth Informatics Group

School of Computer Science

University of Manchester

Robert.stevens@manchester.ac.uk

Introduction

• Can we use ontology learning to build ontologies?

• Not text-mining research, but ontology research

• What is ontology learning from text?

• The questions we posed

• The experiment we performed

• The results we obtained

• The conclusions we made

Ontology learning

• Text2Onto: http://ontoware.org/projects/text2onto/

• “The erythrocytes are the blood cells that carry oxygen to other cells in the body”

• “Lymphocytes, leukocytes, monocytes, phagocytes and granulocytes are all kinds of white blood cell”

• “These experiments show that the individual hemopoietic stem cell is a multipotent cell and can give rise to the complete range of blood cell types, both myeloid and lymphoid, as well as new stem cells like itself.”

Ontology Learning

[Figure: example learnt ontology. Classes shown: Blood Cell, Erythrocyte, White Blood Cell, Monocyte, Leukocyte, Lymphocyte, Phagocyte, Granulocyte, Multipotent Stem Cell, Hemopoietic Stem Cell. Relation shown: arise from]

Text to Ontology “Workflow”

Corpus

Tokenising / Sentence splitting

Part-Of-Speech (POS) tagging

Lemmatizing / Stemming

JAPE transducer annotates corpus

Text2Onto algorithms for extracting modelling primitives

Text2Onto meta-ontology

Promotion to OWL ontology

Extracting Patterns from Text

Sentence:

“CFU-S is a blood stem cell”

Part-Of-Speech (POS) tagging:

CFU-S[NNP] is[VBN] a[DT] blood[NN] stem[NN] cell[NN]

Key: NN = noun; DT = determiner; NNP = proper noun; VBN = verb past participle.

Pseudo JAPE rule:

Any series of nouns (A) followed by the string “ is a ” followed by a series of nouns (B)

Ontological assertions:

A and B are concepts, A is a subclass of B
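
The pseudo JAPE rule can be sketched in a few lines of Python. This is not JAPE and not Text2Onto code; it is a minimal illustration that assumes the corpus is already tagged in the word[TAG] notation used on this slide. The function names (parse_tagged, noun_run, is_a_assertions) are made up for the example.

```python
import re

# Tagged sentence copied from the slide, in word[TAG] notation.
TAGGED = "CFU-S[NNP] is[VBN] a[DT] blood[NN] stem[NN] cell[NN]"

TOKEN = re.compile(r"(\S+)\[([A-Z]+)\]")
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # assumption: any noun tag counts as a noun


def parse_tagged(text):
    """Turn 'word[TAG] ...' into a list of (word, tag) pairs."""
    return TOKEN.findall(text)


def noun_run(tokens, start, step):
    """Collect consecutive nouns from index `start`, moving by `step` (+1 or -1)."""
    words = []
    i = start
    while 0 <= i < len(tokens) and tokens[i][1] in NOUN_TAGS:
        words.append(tokens[i][0])
        i += step
    # A leftward scan collects words in reverse order, so restore reading order.
    return words if step > 0 else words[::-1]


def is_a_assertions(tokens):
    """Return (A, B) pairs for each 'A is a B' match, read as: A is a subclass of B."""
    found = []
    for i in range(1, len(tokens) - 2):
        if tokens[i][0].lower() == "is" and tokens[i + 1][0].lower() == "a":
            a = noun_run(tokens, i - 1, -1)
            b = noun_run(tokens, i + 2, +1)
            if a and b:
                found.append((" ".join(a), " ".join(b)))
    return found


pairs = is_a_assertions(parse_tagged(TAGGED))
```

For the slide's sentence, pairs holds one assertion: A is "CFU-S" and B is "blood stem cell".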

Text2Onto meta-ontology

Some Text2Onto Instances

• Instance: Astrocyte_c

– TypeOf: Concept that

– Fact: confidence VALUE 1.0

• Instance: AstrocyteNerveCell

– TypeOf: Subclass that

– Fact: domain VALUE NerveCell and

– Fact: range VALUE Astrocyte and

– Fact: confidence VALUE 1.0

The Questions We Asked

• Can we press the button and get a good ontology?

• If not, can we get something useful?

• Can we do it without having to write too many rules?

• Does the end-point act as a donor or recipient ontology?

Strategy

• Collect corpus

• Manually mark up text for cells: definitive list of terms

• Process corpus through T2O

• Analyse output of T2O for recall and precision of terms and hierarchy

• Iteration of previous two steps with variants in rules

• Evaluation against CTO gold standard
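
Recall and precision of extracted terms against a manually built list can be computed as below. This is a minimal sketch using exact string matching; the term sets are toy data invented for the example, not figures from the experiment.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted terms against a gold (manually marked-up) list."""
    extracted, gold = set(extracted), set(gold)
    true_positives = extracted & gold
    # Precision: how much of what was extracted is correct.
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    # Recall: how much of the gold list was found.
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall


# Toy data for illustration only.
gold_terms = {"erythrocyte", "leukocyte", "monocyte", "foam cell"}
extracted_terms = {"erythrocyte", "monocyte", "contains cell"}

p, r = precision_recall(extracted_terms, gold_terms)
```

Here two of the three extracted terms are in the gold list (precision 2/3) and two of the four gold terms were found (recall 1/2).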

The Experimental Conditions

• Default T2O

• T2O plus cell specific JAPE rules and all algorithms

• Only cell specific JAPE rules, EntropyExtraction Algorithm and some “hierarchy spotting” based on term composition

• Same as 3, but with VerticalRelationsConceptClassification to include our simple JAPE rules

• Same as 4, but with WordConceptClassification for additional hierarchy
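
One possible reading of “hierarchy spotting” based on term composition is that a multi-word term is placed under the longest shorter term it ends with, so ‘immature blood cell’ goes under ‘blood cell’. The sketch below assumes that reading; it is not the rule set used in the experiment, and compositional_parents is a made-up name.

```python
def compositional_parents(terms):
    """Map each term to the longest proper suffix (by whole words) that is also a term."""
    known = {t.lower() for t in terms}
    parents = {}
    for term in sorted(known):
        words = term.split()
        # Drop leading modifiers one at a time:
        # 'immature blood cell' -> 'blood cell' -> 'cell'
        for i in range(1, len(words)):
            candidate = " ".join(words[i:])
            if candidate in known:
                parents[term] = candidate
                break
    return parents


terms = ["cell", "blood cell", "immature blood cell", "stem cell", "blood stem cell"]
hierarchy = compositional_parents(terms)
```

With these five terms, ‘blood stem cell’ is placed under ‘stem cell’, and both ‘blood cell’ and ‘stem cell’ are placed under ‘cell’.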

Rules for Extracting Cell Types

• Words ending in ‘cyte’, ‘blast’, ‘cell’, ‘glia’, ‘glium’, ‘cell type’, ‘cell line’ and ‘cell lineage’ (together with their plurals)

• Zero or more adjectives followed by zero or more nouns or proper nouns followed by a ‘cell word’ (together with plural), e.g. ‘renshaw cell’, ‘Muller cell’, ‘immature blood cell’, etc.

• Any stem cell term is a stem cell

• Any term ending with ‘progenitor cell’ is a Progenitor Cell.

• Any term ending with ‘precursor cell’ is a Precursor Cell.

• Any term ending in ‘blast’ is a Blast Cell.

• Any term ending with ‘cyte’ or ‘cell’ is a Differentiated Cell.
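
The suffix rules above can be sketched as an ordered list of checks. Two things here are assumptions rather than part of the slide: the rules are tried most specific first, and “any stem cell term” is read as “any term ending in ‘stem cell’”. The plural handling is deliberately crude and is not a real lemmatiser.

```python
def singular(term):
    """Rough plural stripping, e.g. 'cells' -> 'cell'. An assumption for this sketch."""
    t = term.lower().strip()
    return t[:-1] if t.endswith("s") and not t.endswith("ss") else t


# Ordered from most to least specific; the first matching rule wins.
PARENT_RULES = [
    (lambda t: t.endswith("stem cell"), "Stem Cell"),
    (lambda t: t.endswith("progenitor cell"), "Progenitor Cell"),
    (lambda t: t.endswith("precursor cell"), "Precursor Cell"),
    (lambda t: t.endswith("blast"), "Blast Cell"),
    (lambda t: t.endswith(("cyte", "cell")), "Differentiated Cell"),
]


def classify(term):
    """Return the parent class suggested by the suffix rules, or None if no rule applies."""
    t = singular(term)
    for matches, parent in PARENT_RULES:
        if matches(t):
            return parent
    return None


examples = ["hemopoietic stem cell", "Erythrocytes", "osteoblast", "astroglia"]
parents = {term: classify(term) for term in examples}
```

Terms such as ‘astroglia’ are recognised as cell terms by the first rule on the slide but get no parent here, because the slide gives no hierarchy rule for ‘glia’.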

Evaluation Strategy

• Extraction performance

• Ontology evaluation

• Domain coverage

• Expert evaluation

Term Recognition

• 1,277 terms in our definitive list

• 16,384 terms from whole corpus; 625 relevant

• Increase to 17,851 and 916

• All 118 CTO terms in corpus recalled

• Corpus has anatomical bias

• Simple rules exploit regularity of language

• Many false positives from adjective noun rule

Cell Terms

• Morphology: Stellate cell; columnar cell

• Ploidy: Tetraploid cell; multiploid cell

• Maturity

• Potentiality: Totipotent stem cell; multipotent cell

• Lineage: Mesoderm cell

• Species origin: Animal cell; human cell

• Anatomical location

• Developmental stage: Mitotic cell; S-phase cell

Common errors

Manually extracted from corpus | Automatically extracted from corpus | Comments

- | +t - cell | Symbols not handled very well

- | contains cell | False-positive cell type

- | Foam cell | New cell type extracted

leukocyte | leucocyte | Spelling errors in corpus

naïve cell | nave cell | Character encoding problem

Spermatogonia | - | No rule to extract
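
The encoding and spelling errors listed here can be reproduced and partly avoided with a small normalisation step. The sketch below shows how dropping non-ASCII characters turns ‘naïve’ into ‘nave’, and how decomposing accented characters first keeps the base letter. The SPELLING_VARIANTS lookup is hypothetical; the same normalisation would have to be applied to the manual list for the terms to match.

```python
import unicodedata


def drop_non_ascii(text):
    """Discard every non-ASCII character; this loses the whole accented letter."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def fold_to_ascii(text):
    """Decompose accented characters first, so only the accent mark is discarded."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical lookup of spelling variants to one preferred form.
SPELLING_VARIANTS = {"leucocyte": "leukocyte"}


def normalise(term):
    """Lower-case, fold accents, then map known spelling variants."""
    folded = fold_to_ascii(term).lower()
    return SPELLING_VARIANTS.get(folded, folded)


lossy = drop_non_ascii("na\u00efve cell")   # the accented letter disappears
folded = normalise("na\u00efve cell")       # the base letter survives
variant = normalise("Leucocyte")
```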

Term Recall and Precision

Default learnt ontology

Final learnt ontology

Still not perfect!

Ontology evaluation

Learnt Ontology under CTO

Discussion

• Exploiting poor performance to focus learning

• Exploiting regularity of language

• Never really going to find CTO domain general layer

• Terms highly compositional and conflate axes

• Ask the question “is it useful?” not “is it good?”

• Is CTO a good standard?

• The extracted hierarchy was not bad from a cell biology and ontological point of view

Nascent Methodology

• Form corpus that includes, but is not limited to, scope of target ontology

• Extract terms from corpus

• Filter and massage list of terms to find those of ontological interest

• Use ontology learning to see what happens

• Inspect and augment rules to recognise and incorporate into hierarchy

• Iterate

• Use as donor ontology to transfer useful bits to recipient ontology

Conclusions

• No;

• Yes;

• Yes;

• Donor

Acknowledgements

• Simon Jupp has done the work

• Jaclyn Bibby MSc Project prototype

• Johanna Volker for help with Text2Onto

• David Shotton for knowledge about cell biology