Converting Information to Topic Maps with Wandora
Tutorial, Aki Kivelä & Olli Lyytinen, 29.9.2010
Tutorial Outline
● Introduction to Wandora
● Information extractors
● Detailed look at selected extractors
● Hands-on: using the information extractors
Preparing Hands-on
● Download Wandora
  ● http://www.wandora.org/wandora/download/wandora.zip
  ● http://www.wandora.org/wandora/download/other/tmra10/wandora_workshop_tmra2010.zip
● Unzip the package
● Use the scripts in the bin folder to start the application
Introduction to Wandora
● General purpose Topic Maps editor
● Desktop application
● Java 6, Swing
● FOSS with GNU GPL 3.0
● Developed since 2001 at Grip Studios Interactive
● Used in several real-life projects
● Download, documentation and forum at www.wandora.org
● ~300 downloads per month
Introduction to Wandora
● Topic map editor
● Topic, occurrence and association editors
Introduction to Wandora
● Graph visualizer
More at http://www.wandora.org/wandora/wiki/index.php?title=Graph_topic_panel
Introduction to Wandora
● Huge set of information importers, extractors, and generators
Extractors are discussed in more detail later on.
Introduction to Wandora
● Topic map analyzers
+ Topic map diameter
+ Clustering coefficient of a topic map
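To make the two analyzer metrics concrete, here is a minimal Python sketch. It assumes a topic map can be reduced to an undirected graph in which topics are nodes and associations are edges; that reduction, the function names and the sample graph are illustrative assumptions, not Wandora's implementation.

```python
from collections import deque


def diameter(graph):
    """Longest shortest-path length between any two mutually reachable nodes."""
    longest = 0
    for start in graph:
        # Breadth-first search gives shortest path lengths from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        longest = max(longest, max(dist.values()))
    return longest


def clustering_coefficient(graph):
    """Average local clustering coefficient over all nodes."""
    if not graph:
        return 0.0
    total = 0.0
    for neighbours in graph.values():
        nbs = list(neighbours)
        k = len(nbs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count the links that exist between pairs of neighbours.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbs[j] in graph[nbs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


# Hypothetical toy graph: topics a, b, c form a triangle, d hangs off c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(diameter(graph), clustering_coefficient(graph))
```

A small diameter together with a high clustering coefficient would suggest a tightly interlinked topic map.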
Introduction to Wandora
● Exporters
Introduction to Wandora
● Embedded HTTP server
http://www.wandora.org/wandora/wiki/index.php?title=Embedded_HTTP_server
+ Drupal bridge
+ Joomla bridge
Information Extractors
● Extractors are not as strict as importers in preserving information: they may modify the imported information heavily and may import only a small fraction of the original information.
● Wandora has more than 50 different information extractors in 13 categories.
● There are both file format extractors and web service extractors.
● Topic maps generated with extractors may have limited use rights. Always consult the license of the extraction source.
Information Extractors
● Incremental extractions
  ● Each extraction builds on the topic map generated during the previous extraction.
● An information mashup is a topic map generated using different information sources and different extractors.
  ● For example: compose a topic map that interleaves information from Flickr (some photos have geo coordinates) and Geonames (more information about places).
● Limitation: Mashups are static!
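The idea of an incremental extraction feeding a mashup can be sketched in a few lines of Python. This is a toy model under stated assumptions: a topic map is represented as a dictionary keyed by subject identifier, and two topics merge when their subject identifiers match. The data structure, the function name `merge_topic_maps` and all identifiers and values below are hypothetical and do not reflect Wandora's internal API.

```python
def merge_topic_maps(base, extracted):
    """Merge two toy topic maps; topics sharing a subject identifier become one."""
    # Copy the base map so the previous extraction result is left untouched.
    merged = {
        si: {"names": set(t["names"]), "occurrences": dict(t["occurrences"])}
        for si, t in base.items()
    }
    for si, topic in extracted.items():
        target = merged.setdefault(si, {"names": set(), "occurrences": {}})
        target["names"] |= set(topic["names"])
        target["occurrences"].update(topic["occurrences"])
    return merged


# Hypothetical result of a Flickr extraction (subject identifier is made up).
flickr_map = {
    "http://example.org/place/helsinki": {
        "names": {"Helsinki"},
        "occurrences": {"photo": "http://example.org/photo/1"},
    }
}
# Hypothetical result of a later Geonames extraction for the same place.
geonames_map = {
    "http://example.org/place/helsinki": {
        "names": {"Helsinki", "Helsingfors"},
        "occurrences": {"population": "example value"},
    }
}

mashup = merge_topic_maps(flickr_map, geonames_map)
print(mashup)
```

The merged topic carries the photo from one source and the place details from the other, which is what makes the result a mashup. It is also a snapshot: nothing updates it when the sources change.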
Patterns in Information Extraction
● Building a topic map with information extractors is essentially a design process where...
  ● The user consciously triggers information extractors depending on her current topic map, her vision of the goal, and the available set of information extractors and resources.
● An Information-Extraction-Pattern is a recipe describing the extractions and topic map operations required to achieve a desired goal.
● Information-Extraction-Patterns are part of best practices in information design and architecture.
Extractor Categories in Wandora
● Subjects
● Search engines
● Feeds
● Classification
● Language
● Wiki
● Bibliographical
● Social
● Media
● Simple files
● HTML structures
● Microformats
● RDF schemas
● Other
Next: Looking at selected extractors
Subject Extractors
DBpedia Extractor
● DBpedia is a huge knowledge base distilled out of Wikipedia
● The extractor is used to enrich selected subjects
● The extractor takes a list of terms → Builds DBpedia URLs → Reads RDF resources → Converts RDF to topic maps
● The generated topic map is structurally like RDF
● Requires no service token or authentication
● dbpedia.org
http://www.wandora.org/wandora/wiki/index.php?title=DBpedia_extractor
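The pipeline on this slide (terms → URLs → RDF → topic map) can be sketched as follows. This is a minimal illustration, not the extractor's actual code: the URL pattern `http://dbpedia.org/data/<Term>.rdf` is an assumption about which DBpedia URL is used, the network fetch is left out, and the RDF/XML sample and function names are hypothetical. Each RDF triple is simply turned into one association-like tuple, which is why the result is "structurally like RDF".

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def dbpedia_data_url(term):
    """Build a DBpedia RDF URL for a term (assumed URL pattern)."""
    return "http://dbpedia.org/data/" + quote(term.strip().replace(" ", "_")) + ".rdf"


def rdf_to_associations(rdf_xml):
    """Turn each rdf:Description property into a (subject, predicate, object) tuple."""
    triples = []
    root = ET.fromstring(rdf_xml)
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree tags look like "{namespace}local"; join them into one URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            obj = prop.get("{%s}resource" % RDF_NS) or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


# Hypothetical RDF/XML fragment standing in for a fetched DBpedia resource.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dbo="http://dbpedia.org/ontology/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Helsinki">
    <dbo:country rdf:resource="http://dbpedia.org/resource/Finland"/>
  </rdf:Description>
</rdf:RDF>"""

print(dbpedia_data_url("Topic Maps"))
print(rdf_to_associations(SAMPLE_RDF))
```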
Subject Extractors
Subj3ct Record Extractor
● Subj3ct is a web service of Networked Planet
● The extractor is used to resolve and bridge subjects
● 5 subextractors: by identifiers, by resources, by URIs, search, and URLs
● The extractor takes input → Builds Subj3ct web service URLs → Reads XML feeds → Converts XML feed to topic maps
● Requires no service token or authentication
● subj3ct.com
http://www.wandora.org/wandora/wiki/index.php?title=Subj3ct_record_extractor
Subject Extractors
OpenCYC Extractor
● THE Knowledge base
● The extractor is used to describe given subjects
● Subextractors: Specializations, Generalizations, Siblings, Comments, Denotations, Classes and Instances
● Extractors use the OpenCYC web service API
● Extractors take terms as input → Build web service URLs → Read XML feeds → Convert the XML to topic maps
● Requires no service token or authentication
● opencyc.org
http://www.wandora.org/wandora/wiki/index.php?title=OpenCyc_extractor
Search Engine Extractor
Bing Extractor
● Bing is Microsoft's search engine
● The extractor uses the Bing web service API and constructs a topic map out of the query and the search result
● The extractor can "search" both web and images
● The extractor takes a search query and an API key as input → Builds web service URLs → Reads XML feeds → Converts XML to topic map
● Idea: Think of the search result as a fingerprint of the subject addressed by the query
● The extractor requires a Bing service API key
● bing.com
http://www.wandora.org/wandora/wiki/index.php?title=Bing_extractor
Feed Extractors
RSS and Atom Extractors
● Extractors essentially build a topic map where feed items are associated with the feed
● Extractors take a feed URL, file or raw data as input → Read feed XML → Convert feed XML to Topic Maps
● Interpretation of the resulting topic map depends on feed content
● Requires no service token or authentication
  ● If the addressed feed is open
http://www.wandora.org/wandora/wiki/index.php?title=RSS_2.0_Extractor
http://www.wandora.org/wandora/wiki/index.php?title=Atom_extractor
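The core of a feed extraction, associating each feed item with its feed, can be sketched in Python as follows. This is a minimal illustration for RSS 2.0 only, reading raw data rather than a URL; the function name, the tuple layout and the sample feed are assumptions made for the example, not the Wandora extractor's own structures.

```python
import xml.etree.ElementTree as ET


def rss_to_associations(rss_xml):
    """Return one (feed title, item title, item link) tuple per RSS 2.0 item."""
    channel = ET.fromstring(rss_xml).find("channel")
    feed_title = channel.findtext("title", default="")
    return [
        (
            feed_title,
            item.findtext("title", default=""),
            item.findtext("link", default=""),
        )
        for item in channel.findall("item")
    ]


# Hypothetical feed used as raw input data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.org/1</link></item>
    <item><title>Second post</title><link>http://example.org/2</link></item>
  </channel>
</rss>"""

for association in rss_to_associations(SAMPLE_RSS):
    print(association)
```

Each tuple stands for one association between a feed topic and an item topic; what those items mean depends entirely on the feed content.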
Classification Extractors
OpenCalais Classifier
● Calais is an entity extraction service
● The extractor is used to classify text fragments and uses the OpenCalais web service API
● The extractor takes a text as input → Sends the text to the OpenCalais web service → Receives an XML feed containing extracted terms → Converts the XML to Topic Maps
● Results in a topic map where extracted entities are associated with the topic representing the text fragment
● Requires an application key (included in Wandora)
● opencalais.com
http://www.wandora.org/wandora/wiki/index.php?title=OpenCalais_classifier
Classification Extractors
AlchemyAPI Extractors
● AlchemyAPI is a web service with several extractors that pull information out of text
● 4 extractors: Entity, Keyword, Category and Language
● The extractors are used to classify free text
● Takes plain text as input → Sends the text to AlchemyAPI → Receives XML data → Converts the data to a topic map
● Results in a topic map where extracted entities are associated with the topic representing the text fragment
● Requires a personal API key
● alchemyapi.com
http://www.wandora.org/wandora/wiki/index.php?title=AlchemyAPI_extractors
Language Extractors
Big Huge Thesaurus Extractor
● Thesaurus containing word relations
● Takes a word as input → Sends the word to the BHT web service → Receives an XML result → Parses a topic map out of the XML
● The extractor results in a topic map where a given word topic is associated with other word topics
● Requires a personal API token
● words.bighugelabs.com
http://www.wandora.org/wandora/wiki/index.php?title=Big_Huge_Thesaurus_API_extractor
Language Extractors
Stands4 word describer
● Thesaurus, Acronym expander
● The extractor is used to describe words using synonyms, antonyms and part-of-speech relations
● Takes a word as input → Sends the word to the Stands4 web service → Receives an XML result → Parses a topic map out of the XML
● The resulting topic map associates the input word with an equivalent concept, and the concept with all synonym words.
● Requires a personal API token
● abbreviations.com
http://www.wandora.org/wandora/wiki/index.php?title=Stands4_word_describer
Wiki Extractors
Wikipedia extractor
● Reads a page from Wikipedia and constructs a topic for the page.
● Takes a wiki term as input → Builds a Wikipedia page URL → Reads the page source in XML format → Transforms the XML to a topic map.
● Source text is transformed to an occurrence.
  ● No wiki markup cleaning!
● The extractor can be used to attach Wikipedia page source to given subjects.
● Requires no API key or authentication.
http://www.wandora.org/wandora/wiki/index.php?title=Wikipedia_extractor
Bibliographical Extractors
Bibtex Extractor
● Bibtex is a file format used to describe bibliographical data, books for example.
● The extractor transforms any Bibtex file, URL resource or raw data to topic map structures.
● Uses an internal Bibtex parser
http://www.wandora.org/wandora/wiki/index.php?title=Bibtex_extractor
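As a rough illustration of what transforming Bibtex data to topic map structures involves, the sketch below parses raw Bibtex text and emits one association-like tuple per field. It is a deliberately tiny stand-in for a real parser: it assumes one field per line, brace-delimited values without nested braces, and a closing brace on its own line. The regular expressions, the function name and the sample entry are hypothetical and unrelated to Wandora's internal parser.

```python
import re

# "@type{key," followed by the field lines, up to a closing brace on its own line.
ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.S)
# "name = {value}" with an optional trailing comma, one field per line.
FIELD = re.compile(r"(\w+)\s*=\s*\{(.*?)\}\s*,?\s*$", re.M)


def bibtex_to_associations(text):
    """Return (entry key, field name, value) tuples for each Bibtex entry."""
    associations = []
    for entry_type, key, body in ENTRY.findall(text):
        associations.append((key, "type", entry_type.lower()))
        for name, value in FIELD.findall(body):
            associations.append((key, name.lower(), value))
    return associations


# Made-up entry used as raw input data.
SAMPLE_BIBTEX = """@book{example2010,
  author = {Example Author},
  title = {An Example Book},
  year = {2010}
}
"""

for association in bibtex_to_associations(SAMPLE_BIBTEX):
    print(association)
```

In a topic map, the entry key would become a topic and each tuple an association or occurrence attached to it.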
Bibliographical Extractors
MarcXML Extractor
● MarcXML is an XML variant used to describe bibliographical data
● The extractor takes a MarcXML file, a URL resource or raw data and transforms it to a topic map, i.e. topics and associations.
● Wandora also has a batch extractor for MarcXML.
http://www.wandora.org/wandora/wiki/index.php?title=MARCXML_extractor
Social Extractors
Facebook Extractor
● Facebook is a social media service used by over 590 million people.
● Facebook's Open Graph API provides data in JSON.
● The extractor takes a graph node → reads the corresponding JSON feed → converts the feed to a topic map.
● The extractor can be used to blueprint a social graph: friends, feeds, likes etc.
● Preservation of social events and graph.
● Incremental topic map building.
● Requires a valid Facebook user account. The user account limits possible extractions.
http://www.wandora.org/wandora/wiki/index.php?title=Facebook_Graph_extractor
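The step from a JSON graph node to a topic can be sketched in Python as below. This is a simplified model under stated assumptions: scalar fields become occurrences, nested objects that carry their own `id` become associations to other nodes, and the subject identifier prefix is assumed. The JSON sample, field names and function name are hypothetical; no request is made to Facebook.

```python
import json

# Assumed prefix for building subject identifiers from node ids.
GRAPH_PREFIX = "http://graph.facebook.com/"


def graph_node_to_topic(json_text):
    """Convert one JSON graph node into a toy topic structure."""
    node = json.loads(json_text)
    topic = {
        "subject_identifier": GRAPH_PREFIX + node["id"],
        "occurrences": {},
        "associations": [],
    }
    for key, value in node.items():
        if key == "id":
            continue
        if isinstance(value, dict) and "id" in value:
            # A nested node: link to it instead of copying its data.
            topic["associations"].append((key, GRAPH_PREFIX + value["id"]))
        else:
            topic["occurrences"][key] = value
    return topic


# Made-up node data for illustration.
SAMPLE_NODE = '{"id": "1", "name": "Example User", "hometown": {"id": "2", "name": "Example Town"}}'

print(graph_node_to_topic(SAMPLE_NODE))
```

Extracting the associated node next and merging on subject identifiers is one way to picture the incremental topic map building mentioned above.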
Media Extractors
Flickr Extractor
● Flickr is a photograph sharing service.
● Has a service API.
● Wandora has profile, photo and group extractors.
● The extractors transform Flickr data to topic maps. The extracted topic map contains topics representing photos, with information about them as other associated topics.
● Requires a Flickr user account.
● Idea: Use Flickr as an image storage for a topic maps based web service
http://www.wandora.org/wandora/wiki/index.php?title=Flickr_extractors
Media Extractors
YouTube Extractor
● YouTube is a video sharing service with a web service API.
● Several extractors: Extract predefined video feed, Extract using context, Search, Extract user, Extract exact feed URL.
● The extracted topic map contains video topics associated with other topics representing additional information (author, genre, keywords, thumbnails etc.).
● Incremental extractions.
● Requires a YouTube user account.
http://www.wandora.org/wandora/wiki/index.php?title=YouTube_extractor
Media Extractors
Last.fm Extractor
● Last.fm is a social music service. It keeps track of what you listen to and contains general information related to music artists, records and music tracks.
● 8 different extractors: overall top tags, top albums with a tag, top artists with a tag, album info, top tracks of an artist, top tags of an artist, similar artists.
● The extractors use the Last.fm web service API and convert the XML feeds to a topic map.
● Requires no API key or authentication.
● An excellent source for music related topic maps.
● Incremental extractions.
http://www.wandora.org/wandora/wiki/index.php?title=Last.fm_extractors
Miscellaneous Extractors
Geonames Extractor
● Geonames is a geographical database covering all countries and over eight million place names
● Geonames provides a web service API.
● Wandora features a family of 11 different Geonames extractors: Neighbours, Siblings, Children, Hierarchy, Nearby, Country info, Cities, Search, Weather, Wikipedia search, Wikipedia b-box.
● The extractors build a web service URL → Read XML feed → Transform the XML feed to topic maps
● Incremental extractions.
● Requires no API key or authentication.
http://www.wandora.org/wandora/wiki/index.php?title=Geonames_extractors
Miscellaneous Extractors
Simple Email Extractor
● The extractor converts email files and repositories to topic maps.
● Supported formats: DBX and MBOX
  ● Thunderbird and Outlook
● Limited support for attachments
● Preservation of emails
http://www.wandora.org/wandora/wiki/index.php?title=Email_extractor
Miscellaneous Extractors
GEDCOM Extractor
● GEDCOM is a file format for genealogical information, i.e. individual and family relations such as child-of, married-to, birth, death, etc.
● The extractor transforms a GEDCOM file, a URL resource or raw data to a topic map.
http://www.wandora.org/wandora/wiki/index.php?title=GEDCOM_extractor
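The relations named on this slide (child-of, married-to) can be read out of GEDCOM raw data with a short Python sketch. It handles only a small subset of the format: level-0 `INDI` and `FAM` records and the level-1 tags `NAME`, `HUSB`, `WIFE` and `CHIL`. The function name, the tuple layout and the sample family are illustrative assumptions, not the extractor's implementation.

```python
def gedcom_to_associations(text):
    """Return ("married-to" | "child-of", person, person) tuples from GEDCOM text."""
    names, families = {}, {}
    current = None  # (xref id, record type) of the level-0 record being read
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            # Records of interest look like "0 @I1@ INDI": id first, type second.
            current = (tag, value) if tag.startswith("@") else None
            if current and value == "FAM":
                families[tag] = {"HUSB": None, "WIFE": None, "CHIL": []}
        elif level == "1" and current:
            xref, record_type = current
            if record_type == "INDI" and tag == "NAME":
                names[xref] = value.replace("/", "").strip()
            elif record_type == "FAM" and tag in ("HUSB", "WIFE"):
                families[xref][tag] = value
            elif record_type == "FAM" and tag == "CHIL":
                families[xref]["CHIL"].append(value)
    associations = []
    for family in families.values():
        husband, wife = family["HUSB"], family["WIFE"]
        if husband and wife:
            associations.append(
                ("married-to", names.get(husband, husband), names.get(wife, wife))
            )
        for child in family["CHIL"]:
            for parent in (husband, wife):
                if parent:
                    associations.append(
                        ("child-of", names.get(child, child), names.get(parent, parent))
                    )
    return associations


# Made-up family used as raw input data.
SAMPLE_GEDCOM = """0 HEAD
0 @I1@ INDI
1 NAME Anna /Example/
0 @I2@ INDI
1 NAME Ben /Example/
0 @I3@ INDI
1 NAME Cara /Example/
0 @F1@ FAM
1 HUSB @I2@
1 WIFE @I1@
1 CHIL @I3@
0 TRLR"""

for association in gedcom_to_associations(SAMPLE_GEDCOM):
    print(association)
```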
Simple File Extractors
● Directory Structure Extractor
● Simple Text Document Extractor
● JPG Image Extractor
● Simple PDF Extractor
HTML Extractors
● Link extractor
● Property table extractor
● Association table extractor
● Instance list extractor
● Superclass – Subclass list extractor
● Definition list extractor
Usage of www.wandora.org, Aug 21, 2010 – Sep 20, 2010
Summary
● Wandora is an open source Topic Maps editor application with GNU GPL license.
● Wandora contains a huge set of information extractors.
● Information extractors enable rapid topic map construction.
● An information mashup is a topic map built using several different information extractors and information sources.
● Information-Extraction-Patterns are part of best practices in information design and architecture.
Thank You
For more information, visit www.wandora.org