
Post on 19-Dec-2015


Search Bootstrapping

How / Where to get started

Crawling

• Start with Nutch
  – http://nutch.apache.org/

• Index directly to Solr
  – http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

• Create a seed list from the DMOZ RDF dump
  – http://www.dmoz.org/rdf.html
  – http://wiki.apache.org/nutch/NutchTutorial
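The seed-list step can be sketched roughly as below. This is a minimal illustration, not the tool described in the Nutch tutorial. It assumes the DMOZ content dump marks each listed site with a line of the form `<ExternalPage about="URL">`. The function names, the sampling rate, and the file paths are hypothetical.

```python
import html
import random
import re

# Assumption: each listed site appears as <ExternalPage about="URL"> in the dump.
EXTERNAL_PAGE = re.compile(r'<ExternalPage about="([^"]+)"')


def extract_seed_urls(lines, sample_rate=1.0, seed=0):
    """Return URLs found in the given RDF lines, keeping roughly sample_rate of them."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same subset
    urls = []
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match and rng.random() < sample_rate:
            # Attribute values are XML-escaped (for example &amp;), so unescape them.
            urls.append(html.unescape(match.group(1)))
    return urls


def write_seed_file(rdf_path, seed_path, sample_rate=0.001):
    """Write one URL per line, the plain-text layout a crawler seed list typically uses."""
    with open(rdf_path, encoding="utf-8", errors="replace") as src, \
            open(seed_path, "w", encoding="utf-8") as dst:
        for url in extract_seed_urls(src, sample_rate):
            dst.write(url + "\n")
```

Scanning line by line with a regular expression avoids loading the whole dump into memory and tolerates markup that a strict XML parser might reject. A call such as `write_seed_file("content.rdf.u8", "urls/seed.txt")` (both paths are placeholders) would produce a file that can be handed to the crawler as its starting set.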

Understanding Content

• Entity Extraction
  – LingPipe http://alias-i.com/lingpipe/
  – OpenNLP http://incubator.apache.org/opennlp/

• Entity Identification / Taxonomies
  – Freebase http://www.freebase.com/
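To make the idea of entity identification concrete, here is a deliberately simple dictionary (gazetteer) lookup. It is a stand-in for illustration only and does not use the LingPipe or OpenNLP APIs, which rely on trained statistical models. The gazetteer entries and type labels are made up; in practice they might be derived from a taxonomy source.

```python
import re

# Hypothetical mini-gazetteer: lowercased token sequences mapped to an entity type.
GAZETTEER = {
    ("apache", "nutch"): "SOFTWARE",
    ("new", "york"): "LOCATION",
    ("new", "york", "times"): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(text):
    """Return (surface form, type) pairs, preferring the longest match at each position."""
    tokens = re.findall(r"\w+", text)
    lowered = [token.lower() for token in tokens]
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so "New York Times" beats "New York".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found
```

A lookup like this only finds names it already knows. The extraction libraries listed above are the route to take when unseen names must be detected from context.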

Some Additional Links

• Basic Web Page Parser
  – https://github.com/pjaol/Webcrawler

• Example of OpenNLP usage
  – https://github.com/pjaol/entity_extractor
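For a feel of what a basic web page parser does, the sketch below pulls the title and outgoing links from an HTML string using only the Python standard library. It is not taken from the linked repositories. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the page title and absolute link targets while parsing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_page(html_text, base_url):
    """Return (title, list of absolute link URLs) for one HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip(), parser.links
```

The extracted links are what a crawler would queue for later fetching, and the title is one of the fields that would typically be sent to the index.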