Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance David P....

Post on 13-Jan-2016


Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance

David P. Shorthouse, Université de Montréal / Canadensys

Dmitry Mozzherin, Marine Biological Laboratory / Global Names

@dpsSpiders, @dimus

Biota of Canada

http://biologicalsurvey.ca

We want to find and then organize data from printed materials, but search is exasperatingly limited

15,000 OCR articles & their scanned images (9GB)

Key Players

Global Names

http://gnrd.globalnames.org

http://resolver.globalnames.org

Named Entity Extraction: people, companies, organizations, cities, geographic features

elasticsearch

http://canent.shorthouse.net

https://github.com/dshorthouse/article_semanticizer
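The stitching idea behind these players can be sketched as follows. This is a minimal illustration, not the project's code: the function `semanticize`, the field names in the returned document, and both stub extractors are hypothetical. In a real deployment the two callables would wrap HTTP calls to the name-finding and entity-extraction services, whose request formats are not shown in these slides.

```python
from typing import Callable, Dict, List

# Type aliases for the two pluggable services (hypothetical interfaces).
NameFinder = Callable[[str], List[str]]
EntityFinder = Callable[[str], Dict[str, List[str]]]


def semanticize(text: str, find_names: NameFinder, find_entities: EntityFinder) -> dict:
    """Combine the output of both services into one document ready for indexing."""
    return {
        "full_text": text,
        # De-duplicate and sort so repeated mentions collapse to one entry.
        "scientific_names": sorted(set(find_names(text))),
        "entities": find_entities(text),
    }


# Stub extractors standing in for the remote services (assumptions for the demo).
def fake_find_names(text: str) -> List[str]:
    known = ["Pardosa moesta", "Pardosa fuscula"]
    return [name for name in known if name in text]


def fake_find_entities(text: str) -> Dict[str, List[str]]:
    cities = [city for city in ["Ottawa", "Montreal"] if city in text]
    return {"cities": cities}


doc = semanticize(
    "Pardosa moesta was collected near Ottawa.",
    fake_find_names,
    fake_find_entities,
)
```

Passing the services in as callables keeps the combining step testable offline and lets either service be swapped without touching the indexing code.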

Search Characteristics

• Tokenizers: path hierarchy

• Filters: edge Ngram, pattern replace (abbreviated genera), stemmer (English), elisions (French)

• Analyzers: lowercase, ascii folding, autocomplete

• Full text

• Thanks to: Christian Gendreau (Canadensys)
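The characteristics listed above could be expressed as index analysis settings along these lines. This is a sketch under assumptions: every key name (`path_tokenizer`, `autocomplete_edge`, `genus_abbreviation`, and so on), the gram sizes, the elision articles, and the regular expression are illustrative and not taken from the project. The filter type for edge n-grams is also spelled differently in older elasticsearch releases. Two small pure-Python helpers mimic what the edge n-gram and genus-abbreviation filters would emit, so the behaviour can be checked without a running cluster.

```python
import re
from typing import List

# Hypothetical analysis settings; names and parameter values are assumptions.
INDEX_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
        },
        "filter": {
            "autocomplete_edge": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20},
            # Rewrites "Pardosa moesta" to "P. moesta" (abbreviated genus).
            "genus_abbreviation": {
                "type": "pattern_replace",
                "pattern": "^([A-Z])[a-z]+ ([a-z]+)$",
                "replacement": "$1. $2",
            },
            "english_stemmer": {"type": "stemmer", "language": "english"},
            "french_elision": {"type": "elision", "articles": ["l", "d", "qu"]},
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "autocomplete_edge"],
            },
            # The keyword tokenizer keeps the whole binomial as one token,
            # so the pattern above can see genus and epithet together.
            "abbreviated_name": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["genus_abbreviation", "lowercase", "asciifolding"],
            },
            "full_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["french_elision", "lowercase", "asciifolding", "english_stemmer"],
            },
        },
    }
}


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Return the prefixes an edge n-gram filter would produce for one token."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def abbreviate_genus(name: str) -> str:
    """Python equivalent of the hypothetical genus_abbreviation pattern."""
    return re.sub(r"^([A-Z])[a-z]+ ([a-z]+)$", r"\1. \2", name)
```

With these assumptions, a query for "P. moesta" can match an article that spells out "Pardosa moesta", and typing "pard" is enough for the autocomplete field to surface it.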

Possible Next Steps

• Generalize the design to best support content types (e.g. specimen labels)

• Better recognition of other entities, text blocks

• Scientific name plugin for elasticsearch (hackathon?)

• Share with Journal Map and Mining Biodiversity

• Engage scientific societies, journals