© 2015 Continuum Analytics- Confidential & Proprietary
Memex: Mining the Deep Web
Katrina Riehl, PhD, Sr. Data Scientist, Continuum Analytics
November 9, 2015
What is MEMEX?
• Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases.
• DARPA launched the Memex program in September 2014.
• Memex seeks to develop software that advances online search capabilities.
• Creation of a new domain-specific indexing and search paradigm
  • content discovery
  • information extraction
  • information retrieval
  • user collaboration
• Extension of current search capabilities
  • deep web
  • dark web
  • nontraditional (e.g. multimedia) content
Memex Search Domains
• Human/Labor Trafficking
• Child Exploitation
• Weapons
• Illicit Pharmaceuticals
• Material Research Science
• Autonomous Systems Research
• Financial Fraud
• Counterfeit Electronics
LARGE SCALE DATA ANALYTICS
An Overview of the Ecosystem
Analytics Pipeline
• Web Crawlers & Scrapers
• Entity Extractors
• Indexers
• Visual Analytics
• Search Applications
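The middle stages of this pipeline (entity extraction, indexing, search) can be illustrated with a minimal sketch. It uses only the Python standard library and is not taken from any Memex tool: the sample pages, the regular expressions, and the function names `extract_entities`, `build_index`, and `search` are all hypothetical, and the `PAGES` dictionary stands in for real crawler output.

```python
import re
from collections import defaultdict

# Hypothetical sample pages standing in for crawler output (URL -> page text).
PAGES = {
    "http://example.com/a": "Call 555-0100 or email sales@example.com today.",
    "http://example.com/b": "Contact sales@example.com or support@example.com, phone 555-0199.",
}

# Deliberately simple patterns; real extractors are far more robust.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


def extract_entities(text):
    """Return the set of (entity_type, value) pairs found in the text."""
    entities = {("email", match) for match in EMAIL_RE.findall(text)}
    entities |= {("phone", match) for match in PHONE_RE.findall(text)}
    return entities


def build_index(pages):
    """Build an inverted index: entity -> set of URLs where it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for entity in extract_entities(text):
            index[entity].add(url)
    return index


def search(index, entity_type, value):
    """Return the sorted list of URLs that mention the given entity."""
    return sorted(index.get((entity_type, value), set()))


INDEX = build_index(PAGES)
```

Calling `search(INDEX, "email", "sales@example.com")` returns both sample URLs, since the same address appears on each page. Linking pages through shared entities in this way is the kind of lookup a domain-specific index is meant to support.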
Memex Explorer
• Pluggable Framework for Crawling & Data Discovery
• Django Web Application
• Elasticsearch Index
• Bokeh Visualizations for Crawling Stats
• Kibana Dashboards for Initial Data Exploration
• Apache Nutch Crawler
• NYU ACHE Crawler
• NYU Domain Discovery Tool
• Collaborations
  • CMU Time Anomaly Detection Tool
  • Sotera Datawake Plug-in
[Figure: layered architecture diagram]
• Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
• Abstract expressions: selection, filter, group by, join, column wise
• Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy
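The separation between abstract expressions and computational backends can be illustrated with a small sketch: one query description (a filter followed by a group-by sum) is evaluated by two interchangeable backends. This is an assumption-laden illustration using only the Python standard library, not the API of any tool named in the slides. The `QUERY` dictionary format, the sample `ROWS`, and the functions `run_python` and `run_sql` are hypothetical, with SQLite standing in for a SQL backend.

```python
import sqlite3
from collections import defaultdict

# Hypothetical crawl records: (domain, status, pages_fetched).
COLUMNS = ("domain", "status", "pages_fetched")
ROWS = [
    ("example.com", 200, 12),
    ("example.org", 404, 0),
    ("example.com", 200, 8),
    ("example.net", 200, 5),
]

# One abstract query: filter on a column, then group by a key and sum a value.
QUERY = {"where": ("status", 200), "group_by": "domain", "sum": "pages_fetched"}


def run_python(query, rows):
    """Backend 1: evaluate the query by streaming over Python tuples."""
    col = {name: i for i, name in enumerate(COLUMNS)}
    where_col, where_val = query["where"]
    totals = defaultdict(int)
    for row in rows:
        if row[col[where_col]] == where_val:
            totals[row[col[query["group_by"]]]] += row[col[query["sum"]]]
    return dict(totals)


def run_sql(query, rows):
    """Backend 2: compile the same query to SQL and run it in SQLite."""
    where_col, where_val = query["where"]
    # Column names are interpolated into the SQL text, so whitelist them first.
    for name in (where_col, query["group_by"], query["sum"]):
        if name not in COLUMNS:
            raise ValueError(f"unknown column: {name}")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawl (domain TEXT, status INTEGER, pages_fetched INTEGER)"
    )
    conn.executemany("INSERT INTO crawl VALUES (?, ?, ?)", rows)
    sql = (
        f"SELECT {query['group_by']}, SUM({query['sum']}) FROM crawl "
        f"WHERE {where_col} = ? GROUP BY {query['group_by']}"
    )
    result = dict(conn.execute(sql, (where_val,)).fetchall())
    conn.close()
    return result
```

Both backends return the same mapping of domain to total pages fetched for this data. The query is written once, and only the execution strategy changes with the backend.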