Date posted: 13-Jan-2016
Article Semanticizer – Stitching Data Mining Services Into a Standalone
Search Appliance
David P. Shorthouse, Université de Montréal / Canadensys
Dmitry Mozzherin, Marine Biological Laboratory / Global Names
@dpsSpiders, @dimus
Biota of Canada
http://biologicalsurvey.ca
We want to find & then organize data from printed materials, but search is
exasperatingly limited
15,000 OCR'd articles & their scanned images (9 GB)
Key Players
Global Names
http://gnrd.globalnames.org
http://resolver.globalnames.org
Named Entity Extraction: people, companies, organizations, cities, geographic features
elasticsearch
http://canent.shorthouse.net
https://github.com/dshorthouse/article_semanticizer
Search Characteristics
• Tokenizers: path hierarchy
• Filters: edge n-gram, pattern replace (abbreviated genera), stemmer (English), elisions (French)
• Analyzers: lowercase, ASCII folding, autocomplete
• Full text
• Thanks to: Christian Gendreau (Canadensys)
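To make the list above concrete, here is a minimal sketch of how such pieces might be combined in elasticsearch index settings, written as a Python dictionary. It is not the configuration used by article_semanticizer (see the GitHub repository for that). Every custom name (path_tokenizer, autocomplete_filter, genus_abbreviation, english_stemmer, french_elision, and the four analyzer names), the n-gram sizes, the regular expression, and the list of French articles are assumptions made for illustration.

```python
import json

# Hypothetical analysis settings; all custom names and parameter values
# below are illustrative assumptions, not the project's real configuration.
ANALYSIS_SETTINGS = {
    "analysis": {
        "tokenizer": {
            # Splits "a/b/c" into "a", "a/b", "a/b/c".
            "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
        },
        "filter": {
            # Emits leading fragments of each token for search-as-you-type.
            "autocomplete_filter": {
                "type": "edge_ngram", "min_gram": 2, "max_gram": 20,
            },
            # Rewrites a whole-field binomial such as "Panthera leo" to "P. leo".
            "genus_abbreviation": {
                "type": "pattern_replace",
                "pattern": "^([A-Z])[a-z]+ ([a-z]+)$",
                "replacement": "$1. $2",
            },
            "english_stemmer": {"type": "stemmer", "language": "english"},
            # Strips elided French articles, e.g. "l'avion" becomes "avion".
            "french_elision": {
                "type": "elision", "articles": ["l", "d", "qu", "n", "s", "j"],
            },
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom", "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "autocomplete_filter"],
            },
            "full_text": {
                "type": "custom", "tokenizer": "standard",
                "filter": ["lowercase", "french_elision", "asciifolding",
                           "english_stemmer"],
            },
            "abbreviated_name": {
                "type": "custom", "tokenizer": "keyword",
                "filter": ["genus_abbreviation", "lowercase"],
            },
            "folder_path": {"type": "custom", "tokenizer": "path_tokenizer"},
        },
    }
}

# Built-in tokenizer and filter names that need no definition of their own.
BUILT_IN = {"standard", "keyword", "lowercase", "asciifolding"}


def undefined_references(settings):
    """Return (analyzer, name) pairs that point at undefined components."""
    analysis = settings["analysis"]
    known_tokenizers = BUILT_IN | set(analysis.get("tokenizer", {}))
    known_filters = BUILT_IN | set(analysis.get("filter", {}))
    missing = []
    for name, analyzer in analysis["analyzer"].items():
        if analyzer["tokenizer"] not in known_tokenizers:
            missing.append((name, analyzer["tokenizer"]))
        for filter_name in analyzer.get("filter", []):
            if filter_name not in known_filters:
                missing.append((name, filter_name))
    return missing


if __name__ == "__main__":
    # The JSON printed here could serve as the settings body of a new index.
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
```

The small `undefined_references` helper only checks that each analyzer refers to a tokenizer and filters that are either defined in the dictionary or built in; it does not contact an elasticsearch server.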
Possible Next Steps
• Generalize the design to best support content types (e.g., specimen labels)
• Better recognition of other entities, text blocks
• Scientific name plugin for elasticsearch (hackathon?)
• Share with Journal Map and Mining Biodiversity
• Engage scientific societies, journals