Post on 19-Dec-2015
Search Bootstrapping
How / Where to get started
Crawling
• Start with Nutch
  – http://nutch.apache.org/
• Index directly to SOLR
  – http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
• Create a seed list from DMOZ rdf
  – http://www.dmoz.org/rdf.html
  – http://wiki.apache.org/nutch/NutchTutorial
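The seed-list step can be sketched in a few lines of Python. This is a rough illustration, not the tool the Nutch tutorial uses: it assumes the DMOZ content dump lists each external link as an `<ExternalPage about="...">` element on a single line, and the function names, file paths, and `sample_rate` parameter are all made up for this example.

```python
import random
import re
from html import unescape

# Assumption: each external link in the dump looks like
#   <ExternalPage about="http://example.com/">
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def extract_seed_urls(lines, sample_rate=1.0, rng=None):
    """Yield unique http(s) URLs found in DMOZ-style RDF lines.

    Each unique URL is kept with probability sample_rate, which gives a
    smaller seed list for a first test crawl.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    seen = set()  # holds every URL in memory; fine for a sketch
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if not match:
            continue
        # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
        url = unescape(match.group(1)).strip()
        if not url.startswith(("http://", "https://")) or url in seen:
            continue
        seen.add(url)
        if rng.random() < sample_rate:
            yield url


def write_seed_list(rdf_path, seed_path, sample_rate=1.0):
    """Write one URL per line to seed_path and return how many were written."""
    count = 0
    with open(rdf_path, encoding="utf-8", errors="replace") as src, \
         open(seed_path, "w", encoding="utf-8") as dst:
        for url in extract_seed_urls(src, sample_rate):
            dst.write(url + "\n")
            count += 1
    return count
```

A call such as `write_seed_list("content.rdf.u8", "urls/seed.txt", sample_rate=0.001)` (both paths hypothetical) would produce a plain-text file with one URL per line, which is the form a crawler seed directory typically expects.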
Understanding Content
• Entity Extraction
  – LingPipe http://alias-i.com/lingpipe/
  – OpenNLP http://incubator.apache.org/opennlp/
• Entity Identification / Taxonomies
  – Freebase http://www.freebase.com/
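To make the difference between extraction and identification concrete, here is a toy Python sketch. It does not use LingPipe, OpenNLP, or Freebase: it swaps in a plain dictionary lookup (a gazetteer) with longest-match-first scanning, and the entity IDs and types in `GAZETTEER` are invented for illustration only.

```python
import re

# Hypothetical gazetteer: lowercase surface form -> (entity id, entity type).
# A real system would load these from a taxonomy rather than hard-code them.
GAZETTEER = {
    "apache nutch": ("topic:apache_nutch", "Software"),
    "nutch": ("topic:apache_nutch", "Software"),
    "apache solr": ("topic:apache_solr", "Software"),
    "solr": ("topic:apache_solr", "Software"),
}


def find_entities(text, gazetteer=GAZETTEER, max_words=3):
    """Return (mention, entity_id, entity_type) tuples found in text.

    Scans left to right and, at each word, tries the longest phrase first
    so that "Apache Nutch" wins over the shorter "Nutch".
    """
    tokens = [(m.group(0), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tok[0].lower() for tok in tokens[i:i + n])
            if phrase in gazetteer:
                entity_id, entity_type = gazetteer[phrase]
                start, end = tokens[i][1], tokens[i + n - 1][2]
                found.append((text[start:end], entity_id, entity_type))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starting here matched
    return found
```

Different surface forms ("Nutch", "Apache Nutch") resolve to the same ID, which is the part a taxonomy contributes. A statistical extractor would additionally find mentions that are not in any dictionary.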
Some Additional Links
• Basic Web Page Parser
  – https://github.com/pjaol/Webcrawler
• Example of OpenNLP usage
  – https://github.com/pjaol/entity_extractor