Date posted: 10-May-2015
Category: Technology
Uploaded by: grant-ingersoll
OpenSearchLab and Lucene
Grant Ingersoll
Chief Scientist @LucidWorks
Member, Committer at Apache Software Foundation
Co-Founder, Apache Mahout
Hats
I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects.
I don’t officially represent the ASF or even Lucene/Solr/Mahout.
Topics
• Openness
• What are some OpenSearchLab (OSL) needs?
• The Lucene Ecosystem
• Lucene for Research?
• A Sample Architecture
Putting the Open in OpenSearchLab
• Open Development >> Open Source
• Open community
• Open corpora
• Open evaluations
• Open Research
• w/o being onerous
http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
OSL Needs?
Community
• Openness Model
• Contributions:
– Who?
– Where?
– How?
• Ownership/Legal:
– Code
– Contributions
– Infrastructure
• Privacy
• …
Code
• Architecture
– Flexible
– Scalable
• Experiment Mgmt
• Content Acquisition
• Analysis
• Indexing
• Querying
• Downstream Tools
– Faceting, highlighting, auto-suggest, spellchecking, etc.
• Records Mgmt
• Testing
• …
Infrastructure
• Hardware
• Cloud or hosted?
• Network/Bandwidth
• Production/Staging/Dev
• $$$$
• Release Management
• Devops
• …
What’s this have to do with Lucene?
Code
Committers
Contributors
ASF
Users
“An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment interacting as a system.”
– Wikipedia
The ASF and ASL
• ASF == Apache Software Foundation
– Volunteer-based, but many are paid to work on open source by their employer
– Community Over Code
  • Consensus-driven development
– Meritocracy
  • “Those who do, make the decisions”
– 100+ Top Level Projects
– Infrastructure to support projects
– “The Apache Way”
• ASL == Apache Software License (v2)
ASL ≠ ASF
Lucene Community
• In a nutshell: Large, Active Community
• 30+ committers, many, many more contributors
• (Tens of?) Thousands of Practitioners
• Thousands of production instances
– Twitter, Apple, IBM Watson, LinkedIn, Netflix, Commercial Search Engines, …
– “… they frequently turn to real-time search: our system serves over two billion queries a day, with an average query latency of 50 ms. Usually, tweets are searchable within 10 seconds after creation.” -- EarlyBird, Busch et al.
The Code Ecosystem
Lucene Core
Solr
Hadoop
Mahout
OpenNLP
Nutch
Tika
Lucene Core
• Flagship Java library for building search applications
– Indexing, Searching, Language Analysis
• Powers apps large and small the world over
• More in Apache Lucene 4 talk later
• Fast, small footprint
• Lots of useful related modules
– Highlighting, Joins, Spatial, etc.
• http://lucene.apache.org/core
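As a rough illustration of the indexing, searching, and language analysis bullets above, here is a toy inverted index. It is a sketch only and is not Lucene's API: `analyze`, `TinyIndex`, `add`, and `search` are hypothetical names, and the analyzer simply lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing: run analysis, then record the doc id under each term.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Searching: AND together the posting lists of all query terms.
        terms = analyze(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.add(1, "Lucene is a search library")
index.add(2, "Solr is a search server built on Lucene")
matches = index.search("Lucene search")
```

The same analyzer runs at index time and query time, which is why the mixed-case query still matches.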
Solr
• Search server built using Lucene and HTTP
• Faceting, highlighting, most Lucene features, easy admin
• Highly Extensible
• Scalable (query volume and index size)
• Lucene Best Practices
• http://lucene.apache.org/solr
Hadoop
• Originally built for Nutch to solve large scale crawling problems
• Distributed File System and Computation Model
– HDFS and MapReduce, YARN coming
• Common Use Cases: storage, log analysis, ETL
• http://hadoop.apache.org
Nutch
• Web-scale crawler and search built on Lucene/Solr and Hadoop
• Link analysis (aka PageRank)
• Plugin framework
• Parsers for common document formats (PDF, Word, HTML, etc.)
• http://nutch.apache.org
Mahout
• Scalable machine learning
– Utilize Hadoop where appropriate
• Primary Focus: “The 3 C’s”
– Clustering, classification, collaborative filtering
• Others
– Frequent pattern mining, topic extraction, statistically interesting phrases
• http://mahout.apache.org
Tika
• Toolkit for detecting and extracting content from MIME types
• Support for many common file formats
– Office, PDF, HTML, etc.
• Intuitive API (think SAX parser)
• Wraps best of breed open source extractors
• Plug in your own
• http://tika.apache.org
OpenNLP
• Supports common NLP tasks
– NER, POS tagging, Chunking, Parsing, CoRef resolution
• MaxEnt and Perceptron based
– Working to make the machine learning pluggable
• Some Multilingual support
• New life at the ASF
• Related: cTAKES, Stanbol
Other Useful Tools
• Apache ZooKeeper – Distrib. Coordination
• Apache Pig – Hadoop scripting w/o Java
• Apache HBase/Accumulo/Cassandra – BigTable/Dynamo
• Avro and Protobufs – Serialization frameworks
• Netty: Server framework – easy to add protocols and to scale
• Stanbol – Semantic Content Management using Solr, OpenNLP, others
• UIMA – Unstructured Info Management
LUCENE CAN HAS RESEARCH?
• Dispelling a few misconceptions:
– No such thing as Lucene OOTB
– Lucene ≠ Solr
• Researchers are welcome!
– Large audience and many domains
– http://wiki.apache.org/lucene-java/HowToContribute
– Battle-tested code
– Speed v. Quality tradeoffs
http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
Research/Contribution Areas
• Work with the community to do evaluations
• Scoring
– BM25, LM, IM, DFR, others already implemented
– Easy to add your own
• Codecs
– Extensible compression/storage
– Many already implemented approaches and more coming
– SimpleText FTW!
• Others:
– Faceting, auto-suggest, spell-checking, highlighting, expansion and more
– Different domains: machine generated data, mobile,
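For readers new to the scoring models listed above, a minimal sketch of textbook Okapi BM25 follows. It is not Lucene code: `bm25_scores` is a hypothetical helper, documents are assumed to be pre-tokenized lists with at least one non-empty document, and the `k1`/`b` values and the "1 +" idf variant (which keeps idf non-negative) are choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the query with Okapi BM25.

    Assumes docs is a non-empty list of token lists (hypothetical input shape).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average docs are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["lucene", "search", "library"],
    ["hadoop", "mapreduce"],
    ["lucene", "lucene", "codec"],
]
scores = bm25_scores(["lucene"], docs)
```

Here the third document outranks the first because it has the same length but a higher term frequency, and the second scores zero.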
Abstract OSL Architecture
Lucene Ecosystem Implementation
Takeaways
• Open Development >> Open Source >> Shared Source
– Corollary: You never know where good ideas are coming from
• ASF is a proven model for collaboration
• Lucene ecosystem: extensive, production ready
• Lucene 4 is viable for IR algorithms and data structure research
• OSL (IMO) needs a services-based, pluggable architecture
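One possible reading of "pluggable" is sketched below: named stages registered behind a common interface, so an experiment can swap one component without touching the rest. This is an assumption about what such an architecture could look like, not a design from the talk. `Pipeline`, `register`, `run`, and the two stages are hypothetical names, and in-process functions stand in for what would be separate services.

```python
from typing import Callable, Dict, List

# A stage takes a document (a plain dict) and returns an updated document.
Stage = Callable[[dict], dict]


class Pipeline:
    """Hypothetical registry of named, swappable processing stages."""

    def __init__(self) -> None:
        self._stages: Dict[str, Stage] = {}

    def register(self, name: str, stage: Stage) -> None:
        # Re-registering a name replaces the stage, which is the "plug" point.
        self._stages[name] = stage

    def run(self, names: List[str], doc: dict) -> dict:
        # Apply the named stages in order, passing the document along.
        for name in names:
            doc = self._stages[name](doc)
        return doc


def lowercase(doc: dict) -> dict:
    return {**doc, "text": doc["text"].lower()}


def whitespace_tokenize(doc: dict) -> dict:
    return {**doc, "tokens": doc["text"].split()}


pipeline = Pipeline()
pipeline.register("lowercase", lowercase)
pipeline.register("tokenize", whitespace_tokenize)
result = pipeline.run(["lowercase", "tokenize"], {"text": "Open Search Lab"})
```

Swapping in a different tokenizer is a single `register("tokenize", ...)` call, and the rest of the pipeline stays unchanged.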
Resources
• Getting Started
– {Lucene|Mahout|Hadoop} In Action
– Taming Text
• [email protected]
• @gsingers
• http://www.lucidworks.com