Date post: 13-Jan-2016
Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance David P. Shorthouse Université de Montréal / Canadensys Dmitry Mozzherin Marine Biological Laboratory / Global Names @dpsSpiders, @dimus
Transcript
Page 1:

Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance

David P. Shorthouse, Université de Montréal / Canadensys

Dmitry Mozzherin, Marine Biological Laboratory / Global Names

@dpsSpiders, @dimus

Page 2:

Biota of Canada

http://biologicalsurvey.ca

Page 3:

We want to find & then organize data from printed materials, but search is exasperatingly limited

Page 4:

15,000 OCR'd articles & their scanned images (9 GB)

Page 5:

Key Players

Page 6:

Global Names

http://gnrd.globalnames.org
http://resolver.globalnames.org

Page 7:

Page 8:

Named Entity Extraction: people, companies, organizations, cities, geographic features

Page 9:

elasticsearch

Page 10:

http://canent.shorthouse.net

Page 11:

https://github.com/dshorthouse/article_semanticizer

Page 12:

Search Characteristics

• Tokenizers: path hierarchy
• Filters: edge n-gram, pattern replace (abbreviated genera), stemmer (English), elision (French)
• Analyzers: lowercase, ASCII folding, autocomplete
• Full text
• Thanks to: Christian Gendreau (Canadensys)
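The search characteristics listed above map onto elasticsearch analysis settings. What follows is a minimal sketch, not the project's actual configuration: every custom name (path_tokens, autocomplete_grams, abbreviate_genus, and so on), the gram sizes, the regular expression for abbreviating genera, and the way filters are grouped into analyzers are assumptions made for illustration. Only the built-in type names (path_hierarchy, edge_ngram, pattern_replace, stemmer, elision, lowercase, asciifolding) come from elasticsearch itself. The small edge_ngrams helper is plain Python that mimics what an edge n-gram filter emits for a single token.

```python
import json


def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes an edge n-gram filter would emit for one token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical index settings. All custom names and parameter values below
# are illustrative assumptions, not taken from the slides or the repository.
settings = {
    "analysis": {
        "tokenizer": {
            # Splits "a/b/c" into "a", "a/b", "a/b/c".
            "path_tokens": {"type": "path_hierarchy", "delimiter": "/"},
        },
        "filter": {
            "autocomplete_grams": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 10,
            },
            # Assumed pattern: turn "Pardosa moesta" into "P. moesta".
            "abbreviate_genus": {
                "type": "pattern_replace",
                "pattern": "^([A-Z])[a-z]+ ",
                "replacement": "$1. ",
            },
            "english_stemmer": {"type": "stemmer", "language": "english"},
            "french_elision": {
                "type": "elision",
                "articles": ["l", "d", "qu", "j", "m", "t", "s", "n", "c"],
            },
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "autocomplete_grams"],
            },
            "full_text": {
                "type": "custom",
                "tokenizer": "standard",
                # Lowercase first so the lowercase article list matches "L'".
                "filter": [
                    "lowercase",
                    "french_elision",
                    "asciifolding",
                    "english_stemmer",
                ],
            },
            "abbreviated_name": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["abbreviate_genus", "lowercase"],
            },
            "path": {"type": "custom", "tokenizer": "path_tokens"},
        },
    }
}

if __name__ == "__main__":
    # The JSON printed here is the body one would send when creating an index.
    print(json.dumps({"settings": settings}, indent=2))
    print(edge_ngrams("pardosa", 2, 5))
```

Indexing the prefixes of each token is what lets a partially typed name match during autocomplete, while the stemmer and elision filters are kept in a separate full-text analyzer so they do not distort the prefixes.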

Page 13:

Possible Next Steps

• Generalize the design to best support content types (e.g., specimen labels)
• Better recognition of other entities, text blocks
• Scientific name plugin for elasticsearch (hackathon?)
• Share with Journal Map and Mining Biodiversity
• Engage scientific societies, journals

