MOVIE QUOTES SEARCH ENGINE
Students: Meytal Bialik, Zvi Cahana
Supervisors: Hayim Makabee, Oren Somekh
Technion – Israel Institute Of Technology
Computer Science Department
19.6.12 MQSE 3
Industrial Project – Final Presentation
Introduction
The Movie Quotes Search Engine project builds a search engine that lets a user search for terms appearing in movie dialogue.
The project consists of two main components:
A web application used as a user interface to the search engine.
A crawling engine used to maintain a searchable index and a content database.
Introduction
Goals
Methodology
System Diagram
Achievements
Testing
Screenshots
Conclusions
Goals
Relevant search results
Modern UI design
Rich search options
Video play option
Browser-agnostic website
Large-scale movie database
Incremental, priority-based crawling
Methodology
IMDb & OpenSubtitles.org dump files
SRT subtitle files
OpenSubtitles.org XML-RPC API
SQLite database
Apache Lucene
Java Servlets / JSP
HTML5 / CSS / JavaScript
System Diagram
Achievements
Crawling
  Command-line tool
  Dump files parsing
  OpenSubtitles.org API based
  Subtitles downloading & indexing
  Cover art downloading
  Multithreaded pipelined execution
  Priority based
  Index recovery
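As a rough illustration of incremental, priority-based crawling, the sketch below orders movies with a heap and skips those already indexed. It is written in Python for brevity and is not the project's code. Using popularity as the priority is an assumption, and the `crawl_order` helper and the IMDb-style ids are hypothetical.

```python
import heapq

def crawl_order(movies, already_indexed):
    """Yield movie ids from highest to lowest priority, skipping indexed ones."""
    # heapq is a min-heap, so the priority is negated to pop the largest first.
    heap = [(-priority, movie_id)
            for movie_id, priority in movies.items()
            if movie_id not in already_indexed]
    heapq.heapify(heap)
    while heap:
        _, movie_id = heapq.heappop(heap)
        yield movie_id

# Hypothetical ids mapped to an assumed popularity score.
movies = {"tt0000001": 10, "tt0000002": 500, "tt0000003": 75}
order = list(crawl_order(movies, already_indexed={"tt0000003"}))
print(order)
```

Skipping ids that are already indexed is what makes a re-run incremental: only new or missing movies are fetched.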
Achievements
Storage
  SQLite-based database
  Movies metadata (popularity, rating, IMDb link...)
  Cover art
  ~20000 subtitles downloaded & indexed
  Local videos repository
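A minimal sketch of how such metadata could be kept in SQLite, using Python's standard `sqlite3` module. The table name, column names, and sample row are all hypothetical. The slides list only the kinds of data stored, not the schema.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE movies (
        imdb_id    TEXT PRIMARY KEY,   -- hypothetical key
        title      TEXT NOT NULL,
        year       INTEGER,
        popularity INTEGER,
        rating     REAL,
        imdb_link  TEXT,
        cover_art  BLOB                -- image bytes, NULL until downloaded
    )
""")
conn.execute(
    "INSERT INTO movies VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("tt0000001", "Example Movie", 1999, 1200, 7.5,
     "https://www.imdb.com/title/tt0000001/", None),
)
row = conn.execute(
    "SELECT title, rating FROM movies WHERE imdb_id = ?", ("tt0000001",)
).fetchone()
print(row)
```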
Achievements
Indexing
  SRT files parsing & validating
  SRT files filtering
    Translator comments
    Hearing impaired comments
    Format tags
  Partitioning into overlapping search units
  Indexing using Lucene core
    Stemming
    Stop words removal
    Actual indexing of the search units
  ~250ms per average SRT file
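The parsing, filtering, and partitioning steps above can be sketched as follows, in Python rather than the project's Java. The regular expressions, the window `size` and `step`, and all function names are assumptions. Translator comments are not handled here, since detecting them needs heuristics the slides do not describe. The parenthesis rule may also drop legitimate dialogue.

```python
import re

TAG_RE = re.compile(r"<[^>]+>|\{[^}]+\}")    # format tags such as <i> or {\an8}
HI_RE = re.compile(r"\[[^\]]*\]|\([^)]*\)")  # hearing-impaired notes: [door slams]

def parse_srt(text):
    """Return (start, end, text) tuples; malformed blocks are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # validation: need index, timing line, and text
        start, end = (part.strip() for part in lines[1].split("-->", 1))
        cues.append((start, end, " ".join(lines[2:])))
    return cues

def clean(text):
    """Strip format tags and hearing-impaired notes, then collapse whitespace."""
    text = HI_RE.sub("", TAG_RE.sub("", text))
    return " ".join(text.split())

def partition(cues, size=3, step=2):
    """Group cleaned cues into windows of `size` cues advancing by `step`."""
    cleaned = [(s, e, clean(t)) for s, e, t in cues]
    cleaned = [c for c in cleaned if c[2]]  # drop cues left empty by filtering
    units = []
    for i in range(0, len(cleaned), step):
        window = cleaned[i:i + size]
        units.append((window[0][0], window[-1][1],
                      " ".join(c[2] for c in window)))
        if i + size >= len(cleaned):
            break  # the last window already reached the end
    return units

SAMPLE = """1
00:00:01,000 --> 00:00:02,000
<i>Hello there.</i>

2
00:00:03,000 --> 00:00:04,000
[door slams]

3
00:00:05,000 --> 00:00:06,000
Who is it?

4
00:00:07,000 --> 00:00:08,000
It's me.

5
00:00:09,000 --> 00:00:10,000
Come in.
"""

units = partition(parse_srt(SAMPLE))
for unit in units:
    print(unit)
```

Because `step` is smaller than `size`, consecutive units share a cue, so a quote that straddles two subtitle lines still falls entirely inside at least one unit.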
Achievements
Searching
  Searching using Lucene core
    Query parsing
    Search operators support
    Stemming
    Stop words removal
    Relevant buckets retrieval & ranking
  Aggregating buckets to movies
  Merging of overlapping buckets
  Highlighting search words using Lucene core
  Buckets trimming to most relevant text
  Configurable weighted movie ranking
    Lucene rank
    Popularity
    Rating
    Year
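A minimal sketch of the aggregation, merging, and weighted ranking steps, in Python rather than the project's Java. The slides name the four ranking signals but not how they are combined. The linear combination, the weight values, the assumption that each signal is normalised to [0, 1], and all names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical weights; the slides only say that the ranking is configurable.
WEIGHTS = {"lucene": 0.6, "popularity": 0.2, "rating": 0.15, "year": 0.05}

def merge_overlapping(buckets):
    """Merge (start, end, score) buckets whose ranges overlap, keeping the best score."""
    merged = []
    for start, end, score in sorted(buckets):
        if merged and start <= merged[-1][1]:
            last = merged[-1]
            merged[-1] = (last[0], max(last[1], end), max(last[2], score))
        else:
            merged.append((start, end, score))
    return merged

def rank_movies(hits, metadata, weights=WEIGHTS):
    """hits: (movie_id, start, end, score); metadata signals assumed in [0, 1]."""
    per_movie = defaultdict(list)
    for movie_id, start, end, score in hits:
        per_movie[movie_id].append((start, end, score))  # aggregate per movie
    ranked = []
    for movie_id, buckets in per_movie.items():
        merged = merge_overlapping(buckets)
        # The movie's text score is taken as its best bucket score.
        features = dict(metadata[movie_id], lucene=max(s for _, _, s in merged))
        total = sum(weights[name] * features[name] for name in weights)
        ranked.append((total, movie_id, merged))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

hits = [("A", 0, 3, 0.9), ("A", 2, 5, 0.7), ("B", 10, 12, 0.95)]
metadata = {
    "A": {"popularity": 0.9, "rating": 0.8, "year": 0.5},
    "B": {"popularity": 0.1, "rating": 0.4, "year": 0.5},
}
ranked = rank_movies(hits, metadata)
for total, movie_id, merged in ranked:
    print(movie_id, round(total, 3), merged)
```

In this example movie B has the best text match, but movie A ranks first because its popularity and rating outweigh the small difference in text score.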
Achievements
Web Application
  JSP/HTML5/CSS/JavaScript based
  Full support for IE9
  Modern UI design
  Search results snippets
  Multiple hits per movie
  Paging
  Video play option
    Per result snippet
    Relevant scene
    Captions
Testing
A testing platform compares the quality of search results across different system configurations.
In each test, the search engine is queried with a famous quote.
A test passes if the relevant movie is found in the top-K results.
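The pass criterion above can be sketched as a small top-K harness. This is an illustration in Python, not the project's testing platform. The `pass_rate` function, the stand-in `fake_search`, and the placeholder quotes and titles are all hypothetical.

```python
def pass_rate(cases, search, k):
    """Fraction of (quote, expected_movie) cases found in the top-k results."""
    passed = sum(1 for quote, expected in cases
                 if expected in search(quote)[:k])
    return passed / len(cases)

# Stand-in for the real engine: maps a query to a ranked list of movie titles.
FAKE_INDEX = {
    "quote one": ["Movie A", "Movie X"],
    "quote two": ["Movie X", "Movie Y", "Movie B"],
    "quote three": ["Movie X"],
}

def fake_search(quote):
    return FAKE_INDEX.get(quote, [])

cases = [
    ("quote one", "Movie A"),
    ("quote two", "Movie B"),
    ("quote three", "Movie C"),
    ("quote four", "Movie D"),
]

rate_at_2 = pass_rate(cases, fake_search, k=2)
rate_at_3 = pass_rate(cases, fake_search, k=3)
print(rate_at_2, rate_at_3)
```

Running the same cases against different configurations and comparing the resulting rates is one way to judge which configuration gives better results.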
Testing
We tested the system with a set of ~100 famous movie quotes. With a biased system configuration and K=9, we achieved a ~90% pass rate.
Screenshots
Conclusions
Lucene is a powerful search platform
Optimal search results are difficult to define
Subtitle files from public sources should be further validated
HTML5 video support is still limited & browser dependent
Source control systems make life easier