Date post: 19-Dec-2015
Search Bootstrapping How / Where to get started
Transcript
Page 1: Search Bootstrapping How / Where to get started. Crawling Start with Nutch –  Index directly to SOLR – .

Search Bootstrapping

How / Where to get started

Page 2:

Crawling

• Start with Nutch
  – http://nutch.apache.org/

• Index directly to Solr
  – http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

• Create a seed list from the DMOZ RDF dump
  – http://www.dmoz.org/rdf.html
  – http://wiki.apache.org/nutch/NutchTutorial
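
The seed-list step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from the Nutch tutorial: it assumes the DMOZ content dump lists each site as an `<ExternalPage about="...">` element, and the function names (`extract_urls`, `sample_seed_list`, `write_seed_file`) and the sampling rate are hypothetical choices made for this example.

```python
import html
import random
import re

# Assumption: each listed site appears as <ExternalPage about="http://...">.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def extract_urls(lines):
    """Yield every ExternalPage URL found in an iterable of text lines."""
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            # Attribute values are XML-escaped, so turn &amp; back into &.
            yield html.unescape(match.group(1))


def sample_seed_list(lines, keep_one_in=5000, seed=0):
    """Return a random subset of URLs, keeping roughly one in `keep_one_in`."""
    rng = random.Random(seed)
    return [url for url in extract_urls(lines) if rng.randrange(keep_one_in) == 0]


def write_seed_file(urls, path):
    """Write one URL per line, the plain-text layout assumed for a seed list."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

Reading the dump line by line keeps memory use flat, which matters because the full file is large. Sampling keeps the first crawl small enough to finish in reasonable time.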

Page 3:

Understanding Content

• Entity Extraction
  – LingPipe http://alias-i.com/lingpipe/
  – OpenNLP http://incubator.apache.org/opennlp/

• Entity Identification / Taxonomies
  – Freebase http://www.freebase.com/
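
To make the identification step concrete, here is a toy dictionary lookup in Python. It is a deliberately simple stand-in and does not use LingPipe, OpenNLP, or Freebase: the `TAXONOMY` table, its ids, and the `identify_entities` function are all invented for illustration. A real pipeline would use a trained extractor to find mentions and a full taxonomy to resolve them.

```python
import re

# Hypothetical taxonomy: lowercase surface form -> (entity id, type).
# The ids are invented for this sketch and are not real Freebase ids.
TAXONOMY = {
    "apache nutch": ("ent:1", "software"),
    "solr": ("ent:2", "software"),
    "new york": ("ent:3", "location"),
}


def identify_entities(text, taxonomy=TAXONOMY):
    """Return (surface text, entity id, type) for each taxonomy match.

    At every token position the longest matching phrase wins.
    """
    max_words = max((len(name.split()) for name in taxonomy), default=0)
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            entry = taxonomy.get(phrase.lower())
            if entry:
                found.append((phrase, entry[0], entry[1]))
                i += size
                break
        else:
            # No phrase starting here is in the taxonomy; move on one token.
            i += 1
    return found
```

Mapping mentions to stable ids rather than raw strings is what lets a search index treat different spellings of the same thing as one entity.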

Page 4:

Some Additional Links

• Basic Web Page Parser
  – https://github.com/pjaol/Webcrawler

• Example of OpenNLP usage
  – https://github.com/pjaol/entity_extractor
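
For a sense of what a basic page parser does, the sketch below uses only the Python standard library. It is not the code from the linked repositories. The `PageParser` class and `parse_page` function are hypothetical names, and the choice to collect the title, visible text, and absolute links is an assumption about what a crawler would want from each page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect the title, visible text, and absolute links of one HTML page."""

    SKIPPED = {"script", "style"}  # content that is never visible text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self.title = ""
        self._stack = []  # open <title>, <script>, and <style> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        if tag in self.SKIPPED or tag == "title":
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current in self.SKIPPED:
            return
        if current == "title":
            self.title += data.strip()
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_page(html_text, base_url):
    """Return (title, visible text, absolute links) for one page."""
    parser = PageParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.title, " ".join(parser.text_parts), parser.links
```

The links feed the next round of crawling, while the title and text are what would be handed to the indexer and to entity extraction.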

