Date posted: 22-Nov-2014
Methods of Knowledge Extraction
Deepti Aggarwal
SIEL|SERL, IIIT-Hyderabad, India
Agenda
Introduction to the Web as a knowledge repository
Automated extraction techniques (input sources, extracted structures, input pre-processing, extraction methods, output generation)
Issues with automated extraction
What is knowledge?
Familiarity with someone or something, gained through experience
Includes facts, information, descriptions, skills
Types of Knowledge
Explicit Knowledge
Always present explicitly in records
Objective facts having a definite answer
E.g., Hyderabad is the capital of A.P.
Implicit Knowledge
Not present explicitly for analysis
Cultural beliefs with subjective judgments
E.g., Hyderabad is the best city in India to live in.
How has knowledge been represented over time?
From public library to global library
How is knowledge represented on the Web?
Millions of documents, blogs, forums, and social networks scattered across the Web
Diverse topics and formats, from diverse people in diverse languages, with different points of view
Benefits of knowledge extraction over the Web
Question answering systems
Search engines
Validating knowledge
Tracking a particular piece of information
Predicting markets, polls, etc.
Community advertisements
Explicit knowledge
Implicit knowledge
Problems with knowledge acquisition over web
Abundance of data
Relevance of information
Personalized retrieval
Possible approaches
Manual filtering
Automated techniques
Combination of both
Automated Extraction
Working of automated extraction systems
[Figure: Input sources -> Extraction system (defining output structures, input pre-processing, extraction methods, output processing) -> Database of all facts and relations]
Input sources: Types
Web documents
News articles
Blogs
Social network activities (user profiles, posts, comments)
Sentence-level parsing is required.
Defining the structures of output: Named entities and their relations
Output structures
Named entities
Named entity relations
1. Named Entity: Definition
It is an atomic element in a body of text.
Types: person, organization, location etc.
Different named entities, when linked together, form a relation.
1. Named Entity: An example
Sachin Tendulkar was born in Bombay.
Sachin Tendulkar: NE of type ‘Person’; Bombay: NE of type ‘Location’
2. Named Entity Relationship: Structure
Subject - Relation - Object
Subject: NE of any type
Relation: verb, adjective, or adverb
Object: NE of any type
2. Named Entity Relationship: An Example
Sachin Tendulkar was born in Bombay
Subject: Sachin Tendulkar; Relation: was born in; Object: Bombay
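The Subject - Relation - Object structure above can be held in a small record type. The following is a minimal sketch; the class and field names (`Entity`, `Relation`, `ne_type`) are illustrative assumptions, not something defined in the slides.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A named entity: its surface text and its type."""
    text: str      # e.g. "Sachin Tendulkar"
    ne_type: str   # e.g. "PERSON", "LOCATION"


class Relation(NamedTuple):
    """A Subject - Relation - Object triple linking two named entities."""
    subject: Entity
    relation: str
    obj: Entity


# The example sentence from the slide, stored as one triple.
born_in = Relation(
    subject=Entity("Sachin Tendulkar", "PERSON"),
    relation="was born in",
    obj=Entity("Bombay", "LOCATION"),
)
print(born_in.subject.text, "|", born_in.relation, "|", born_in.obj.text)
```

A database of facts, as described in the deck, could then be modelled as a collection of such triples.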
Co-referencing
Sachin was born in Bombay. He is a ...
Sachin Tendulkar …. Mr. Tendulkar … Master Blaster ...
Input pre-processing
Libraries
NLP libraries:
Split text into sentences with a sentence tokenizer, and each sentence into tokens (words, digits) with a word tokenizer
Recognize language constructs (nouns, verbs, pronouns) with a part-of-speech tagger
Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
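The tokenizing and tagging steps above can be sketched with the standard library alone. This is a toy illustration, not how a real NLP library works: the lexicon `TOY_LEXICON` and the capitalisation fallback are assumptions made only to reproduce the tagged example, whereas real taggers are trained on annotated corpora.

```python
import re

# Toy lexicon covering only the example sentence (an assumption for
# illustration; a real tagger learns tags from annotated text).
TOY_LEXICON = {"was": "VBD", "born": "VBN", "in": "IN"}


def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace.

    Naive: abbreviations such as 'Mr.' would be split incorrectly.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word/number tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def toy_pos_tag(tokens):
    """Tag tokens by lexicon lookup, with crude fallbacks."""
    tagged = []
    for token in tokens:
        if token.lower() in TOY_LEXICON:
            tag = TOY_LEXICON[token.lower()]
        elif token.isdigit():
            tag = "CD"       # cardinal number
        elif not token[0].isalnum():
            tag = token      # simplification: punctuation is tagged as itself
        elif token[0].isupper():
            tag = "NNP"      # crude guess: capitalised word is a proper noun
        else:
            tag = "NN"       # default: common noun
        tagged.append((token, tag))
    return tagged


tagged = toy_pos_tag(tokenize("Sachin Tendulkar was born in Bombay"))
print(" ".join(f"{token}/{tag}" for token, tag in tagged))
```

The fallbacks are deliberately simplistic; they only show the shape of the input and output of each step.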
NLP libraries (contd.):
Link the individual constituents of a sentence into a parse tree with a parser
Identify the types of named entities with a named entity recognizer
Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
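To make the named entity recognition step concrete, here is a minimal sketch that uses a lookup list (a gazetteer) with longest-match-first search. The `GAZETTEER` contents and the function name are hypothetical; production recognizers are usually statistical models rather than plain lookups.

```python
# Hypothetical gazetteer: hand-made list of known names and their types.
GAZETTEER = {
    ("sachin", "tendulkar"): "PERSON",
    ("bombay",): "LOCATION",
    ("hyderabad",): "LOCATION",
}
MAX_SPAN = max(len(key) for key in GAZETTEER)


def recognize_entities(tokens):
    """Return (text, type, start, end) tuples, preferring the longest match."""
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + span])
            if key in GAZETTEER:
                match = (" ".join(tokens[i:i + span]), GAZETTEER[key], i, i + span)
                break
        if match:
            entities.append(match)
            i = match[3]   # continue after the matched span
        else:
            i += 1
    return entities


tokens = "Sachin Tendulkar was born in Bombay".split()
for text, ne_type, _, _ in recognize_entities(tokens):
    print(f"{text}/{ne_type}")
```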
NLP libraries (contd.):
Identify all co-references and replace them with the actual entity using a co-reference resolution tool
Identify the specific meaning of a word using Word Sense Disambiguation
External vocabularies: MindNet, DBpedia, WordNet
E.g., contextual meaning of ‘crane’: noun - bird; verb - lift/move
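Co-reference replacement can be illustrated with a deliberately naive sketch: aliases are mapped to a canonical name through a table, and a pronoun is replaced by the most recently mentioned entity. The `ALIASES` table and the "most recent entity" heuristic are assumptions for illustration; real co-reference tools use far richer evidence.

```python
import re

# Hypothetical alias table mapping mentions to a canonical entity name.
ALIASES = {
    "Mr. Tendulkar": "Sachin Tendulkar",
    "Master Blaster": "Sachin Tendulkar",
    "Sachin": "Sachin Tendulkar",
}
PRONOUNS = ("He", "She")


def resolve_coreferences(text):
    """Replace aliases and pronouns with the canonical entity name."""
    # Longest mentions first, so "Sachin Tendulkar" wins over "Sachin".
    mentions = sorted(set(ALIASES) | set(ALIASES.values()), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(m) for m in mentions + list(PRONOUNS)) + r")\b"
    )
    last_entity = None

    def replace(match):
        nonlocal last_entity
        mention = match.group(1)
        if mention in PRONOUNS:
            # Naive heuristic: a pronoun refers to the most recent entity.
            return last_entity if last_entity else mention
        last_entity = ALIASES.get(mention, mention)
        return last_entity

    return pattern.sub(replace, text)


print(resolve_coreferences("Sachin was born in Bombay. He plays cricket."))
```

The heuristic breaks as soon as two candidate entities precede the pronoun, which is exactly the ambiguity shown by the “Tom called his father” example in the issues part of the deck.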
Extraction methods
Extracting relationships among NEs: Standard process
1. Identify named entities within a sentence.
2. Find the verb or adjective that connects the identified named entities.
3. Connect them together to form a relation.
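The three steps above can be sketched as follows, assuming the sentence has already been POS-tagged and its named entities located. The input shapes (`(token, tag)` pairs and `(start, end, type)` spans) and the function name are illustrative assumptions.

```python
def extract_relations(tagged_tokens, entities):
    """Link each pair of neighbouring entities through the words between them.

    tagged_tokens: list of (token, pos_tag) pairs.
    entities: list of (start, end, ne_type) token spans, sorted by start.
    """
    relations = []
    for (s1, e1, _), (s2, e2, _) in zip(entities, entities[1:]):
        between = tagged_tokens[e1:s2]
        # Step 2: require a verb (VB*) or adjective (JJ*) connecting the entities.
        if not any(tag.startswith(("VB", "JJ")) for _, tag in between):
            continue
        # Step 3: connect subject, connecting words, and object.
        subject = " ".join(token for token, _ in tagged_tokens[s1:e1])
        relation = " ".join(token for token, _ in between)
        obj = " ".join(token for token, _ in tagged_tokens[s2:e2])
        relations.append((subject, relation, obj))
    return relations


tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
entities = [(0, 2, "PERSON"), (5, 6, "LOCATION")]  # step 1, assumed done
print(extract_relations(tagged, entities))
```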
Extracting relationships among NEs: Required process
1. Identify part-of-speech constructs: noun, verb, adjective etc.
2. Determine co-references, acronyms and abbreviations.
3. Connect them together to form a relationship.
Extraction Methods
Natural Language Processing: rule based.
Based on sentence structure
E.g., for English language, a rule can be “noun-verb-noun”
Machine Learning: supervised and unsupervised learning.
Features are detected from the training data
E.g., to extract instances of medical diseases, the system is trained on the symptoms of each given disease.
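The rule-based example above (“noun-verb-noun”) can be sketched as a pattern over POS tags. This is one possible reading of such a rule, with an optional preposition after the verb added as an assumption so that “was born in” is captured; the helper names are hypothetical.

```python
def coarse(tag):
    """Collapse Penn Treebank style tags into coarse classes."""
    if tag.startswith("NN"):
        return "N"   # noun
    if tag.startswith("VB"):
        return "V"   # verb
    if tag == "IN":
        return "P"   # preposition
    return "O"       # anything else


def chunk(tagged_tokens):
    """Group consecutive tokens that share a coarse class."""
    chunks = []
    for token, tag in tagged_tokens:
        label = coarse(tag)
        if chunks and chunks[-1][0] == label:
            chunks[-1][1].append(token)
        else:
            chunks.append((label, [token]))
    return chunks


def noun_verb_noun(tagged_tokens):
    """Return (noun, verb phrase, noun) triples matching the N-V-(P)-N rule."""
    chunks = chunk(tagged_tokens)
    triples = []
    for i in range(len(chunks) - 2):
        if chunks[i][0] != "N" or chunks[i + 1][0] != "V":
            continue
        verb_words = list(chunks[i + 1][1])
        j = i + 2
        # Optionally absorb a preposition directly after the verb.
        if chunks[j][0] == "P" and j + 1 < len(chunks):
            verb_words += chunks[j][1]
            j += 1
        if chunks[j][0] == "N":
            triples.append(
                (" ".join(chunks[i][1]), " ".join(verb_words), " ".join(chunks[j][1]))
            )
    return triples


tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
print(noun_verb_noun(tagged))
```

A single fixed rule like this misses sentences with determiners or clauses between the nouns, which is one reason the deck also lists machine learning and hybrid methods.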
Extraction Methods (contd.)
Other methods: vocabulary-based systems, context-based clustering.
E.g., a mapping file of countries and their nationalities helps determine a person's nationality when their birthplace is known.
Hybrid:
NLP libraries pre-process the input data; a machine learning approach then extracts the relations, using an external vocabulary such as WordNet.
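The mapping-file idea above can be sketched with two lookup tables. Both tables hold hypothetical sample data, and the place-to-country table is an added assumption, since a birthplace is often a city rather than a country.

```python
# Hypothetical mapping data; a real system would load these from files.
COUNTRY_TO_NATIONALITY = {"India": "Indian", "Australia": "Australian"}
PLACE_TO_COUNTRY = {"Bombay": "India", "Hyderabad": "India", "Melbourne": "Australia"}


def infer_nationality(birth_place):
    """Guess a nationality from a birthplace; return None when unknown.

    Heuristic only: birthplace does not always determine nationality.
    """
    # Treat the birthplace as a country name if it is not a known place.
    country = PLACE_TO_COUNTRY.get(birth_place, birth_place)
    return COUNTRY_TO_NATIONALITY.get(country)


print(infer_nationality("Bombay"))
```

Returning `None` for unknown places keeps gaps in the vocabulary visible instead of guessing.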
Output generation
Types of output systems
1. Identify all mentions of named entities and their relations.
E.g., from a given corpus, extract all named entity relations.
2. Identify the missing relations of a database.
E.g., given a database, extract the missing attributes of given entities from the corpus.
3. Link various entities within a database.
E.g., given a database, link two entities together with some relation extracted from the corpus.
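The second output type, filling in missing attributes, can be sketched as below. The database layout (a dict of entity records), the relation-to-attribute mapping, and the function name are all assumptions made for illustration.

```python
def fill_missing(database, triples, relation_to_attribute):
    """Return a copy of the database with empty attributes filled from triples.

    database: {entity_name: {attribute: value or None}}
    triples: iterable of (subject, relation, object) extracted from a corpus.
    relation_to_attribute: maps a relation phrase to a database attribute.
    """
    updated = {name: dict(attrs) for name, attrs in database.items()}
    for subject, relation, obj in triples:
        attribute = relation_to_attribute.get(relation)
        if attribute is None or subject not in updated:
            continue
        # Only fill gaps; never overwrite a value that is already present.
        if updated[subject].get(attribute) is None:
            updated[subject][attribute] = obj
    return updated


database = {"Sachin Tendulkar": {"birth_place": None, "profession": "cricketer"}}
triples = [("Sachin Tendulkar", "was born in", "Bombay")]
filled = fill_missing(database, triples, {"was born in": "birth_place"})
print(filled["Sachin Tendulkar"]["birth_place"])
```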
Issues with automated extraction
Accuracy, running time, dependency
Issue 1: Challenges of language structure
Co-reference resolution
Ambiguous, complex sentences
Abbreviations
Acronyms
See an example…
“Tom called his father last night. They talked for an hour. He said he would be home the next day.”
Who is ‘He’ referring to? Tom or his father?
“You see sir, I can talk English, I can walk English, I can laugh English, I can run English, because English is such a funny language.”
Amitabh in Namak Halal
Issue 2: Accuracy
Named entity detection: 90%; relationship extraction: 50-70%.
Noise is introduced at each step.
E.g., disambiguating the word ‘crane’ with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction.
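One simplified way to see how noise accumulates across steps is to multiply per-step accuracies. This assumes the steps err independently, which real pipelines do not guarantee, and the 0.8 figure for the rule step below is a made-up value for illustration only.

```python
import math


def pipeline_accuracy(step_accuracies):
    """Chance that every step is correct, assuming independent errors."""
    return math.prod(step_accuracies)


# A relation needs both of its entities detected correctly (90% each).
both_entities = pipeline_accuracy([0.9, 0.9])
# Adding a hypothetical 80%-accurate rule step lowers the result further.
with_rule_step = pipeline_accuracy([0.9, 0.9, 0.8])
print(round(both_entities, 3), round(with_rule_step, 3))
```

Under these assumptions, every extra pre-processing step that can fail lowers the ceiling for the relation extraction that follows it.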
Issue 3: Efficiency
Feature detection steps are expensive.
They can require days of computation.
Issue 4: Dependency
On external vocabulary sources, like Wikipedia, WordNet, MindNet etc.
Maintaining and updating vocabulary sources is manual: costly and requires expertise.
Their limited size produces context-based noise.
Domain-dependent: medical domain
Corpus-dependent: Wikipedia, news corpus
Relation-specific: Date and Place-of-event
Issue 5: Problem with Implicit knowledge extraction
Community knowledge is learned and shared.
No one can be an expert.
Cultural competence and perception of workers are fed into a system as variables.
Cultural Consensus Theory provides models for including such variables in the system.
Can we do better?
Can we seek human intelligence to improve the accuracy of automated techniques?
References
[1] I. Tuomi. Data is more than knowledge: implications of the reversed knowledge hierarchy for knowledge management and organizational memory. J. Manage. Inf. Syst., 16(3):103–117, Dec. 1999.
[2] S. Sekine. Named Entity: History and Future. 2004.
[3] S. Sarawagi. Information extraction. Found. Trends Databases, 1(3):261–377, Mar. 2008.
[4] S. C. Weller. Cultural consensus theory: Applications and frequently asked questions. Field Methods, 19(4):339–368, 2007.
References (contd.)
[5] Z. Syed, E. Viegas, and S. Parastatidis. Automatic discovery of semantic relations using MindNet. LREC, 2010.
[6] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.
[7] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., pages 40–48, 2006.
[8] E. Greengrass. Information retrieval: A survey, 2000.
Thank you
Questions?