MOVIE QUOTES SEARCH ENGINE

Students: Meytal Bialik, Zvi Cahana

Supervisors: Hayim Makabee, Oren Somekh

Technion – Israel Institute Of Technology

Computer Science Department

19.6.12 MQSE 3

Industrial Project – Final Presentation

Introduction

The Movie Quotes Search Engine project focuses on the creation of a search engine allowing a user to search for terms that appear in the dialogues of a movie.

The project consists of two main components:

A web application used as a user interface to the search engine.

A crawling engine used to maintain a searchable index and a content database.

Introduction

Goals

Methodology

System Diagram

Achievements

Testing

Screenshots

Conclusions

Goals

Relevant search results

Modern UI design

Rich search options

Video play option

Browser agnostic website

Large-scale movies database

Incremental, priority-based crawling

Methodology

IMDb & OpenSubtitles.org dump files

SRT subtitle files

OpenSubtitles.org XML-RPC API

SQLite database

Apache Lucene

Java Servlets / JSP

HTML5 / CSS / JavaScript

System Diagram

Achievements

Crawling

Command-line tool

Dump files parsing

OpenSubtitles.org API based

Subtitles downloading & indexing

Cover art downloading

Multithreaded pipelined execution

Priority based

Index recovery
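Priority-based crawling can be pictured as a queue from which workers always take the most important movie next. The following is a minimal sketch under stated assumptions: the `CrawlTask` class, its `popularity` field, and the choice of popularity as the priority signal are hypothetical, since the slides do not say how priority was computed or how the pipeline stages were wired together.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch: class names and the popularity-based priority are assumptions.
class CrawlQueueSketch {
    static final class CrawlTask {
        final String movieId;
        final int popularity; // assumed priority signal; higher is crawled first

        CrawlTask(String movieId, int popularity) {
            this.movieId = movieId;
            this.popularity = popularity;
        }
    }

    // Returns the movie ids in the order a single worker would take them off the queue.
    static List<String> crawlOrder(List<CrawlTask> tasks) {
        PriorityBlockingQueue<CrawlTask> queue = new PriorityBlockingQueue<>(
                11, Comparator.comparingInt((CrawlTask t) -> t.popularity).reversed());
        queue.addAll(tasks);

        List<String> order = new ArrayList<>();
        CrawlTask next;
        while ((next = queue.poll()) != null) {
            order.add(next.movieId);
        }
        return order;
    }

    public static void main(String[] args) {
        List<CrawlTask> tasks = new ArrayList<>();
        tasks.add(new CrawlTask("movie-a", 5));
        tasks.add(new CrawlTask("movie-b", 90));
        tasks.add(new CrawlTask("movie-c", 40));
        System.out.println(crawlOrder(tasks));
    }
}
```

`PriorityBlockingQueue` is thread-safe, so in a multithreaded setup several download workers could share one such queue; the sketch drains it from a single thread only to keep the ordering easy to follow.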

Achievements

Storage

SQLite-based database

Movies metadata (popularity, rating, IMDb link...)

Cover art

~20000 subtitles downloaded & indexed

Local videos repository

Achievements

Indexing

SRT files parsing & validating

SRT files filtering: translator comments, hearing impaired comments, format tags

Partitioning into overlapping search units

Indexing using Lucene core: stemming, stop words removal, actual indexing of the search units

~250 ms per average SRT file
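The filtering and partitioning steps can be sketched in plain Java before any text reaches Lucene. This is an illustration under stated assumptions, not the project's code: the regular expressions, the `clean` and `partition` names, and the window `size` and `step` parameters are all hypothetical, and translator comments are not handled here because the slides do not describe how they were detected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: patterns, names, and window parameters are assumptions.
class SearchUnitSketch {
    // Strips format tags such as <i>...</i> or {\an8}, and bracketed
    // hearing-impaired annotations such as [door slams] or (laughs).
    static String clean(String cueText) {
        return cueText
                .replaceAll("<[^>]+>", "")
                .replaceAll("\\{[^}]*\\}", "")
                .replaceAll("\\[[^\\]]*\\]", "")
                .replaceAll("\\([^)]*\\)", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Groups consecutive cues into windows of `size` cues that advance by `step`.
    // Choosing step < size makes neighbouring units overlap, so a quote that
    // spans a window boundary still appears whole in at least one unit.
    static List<String> partition(List<String> cues, int size, int step) {
        if (size <= 0 || step <= 0) {
            throw new IllegalArgumentException("size and step must be positive");
        }
        List<String> units = new ArrayList<>();
        for (int start = 0; start < cues.size(); start += step) {
            int end = Math.min(start + size, cues.size());
            units.add(String.join(" ", cues.subList(start, end)));
            if (end == cues.size()) {
                break; // the last window already reaches the final cue
            }
        }
        return units;
    }

    public static void main(String[] args) {
        System.out.println(clean("<i>[sighs] Hello there.</i>"));
        List<String> cues = new ArrayList<>();
        for (String c : new String[] {"a", "b", "c", "d", "e"}) {
            cues.add(c);
        }
        System.out.println(partition(cues, 3, 2));
    }
}
```

Each resulting unit would then be handed to the indexer as one searchable document, with stemming and stop-word removal left to the Lucene analyzer.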

Achievements

Searching

Searching using Lucene core: query parsing, search operators support, stemming, stop words removal, relevant buckets retrieval & ranking

Aggregating buckets to movies

Merging of overlapping buckets

Highlighting search words using Lucene core

Buckets trimming to most relevant text

Configurable weighted movie ranking: Lucene rank, popularity, rating, year
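Two of the post-processing steps above lend themselves to a short illustration: merging overlapping buckets and combining the four ranking signals. The sketch below is an assumption-laden example rather than the project's implementation: buckets are modelled as inclusive `[firstCue, lastCue]` index ranges, the signals are assumed to be normalized to the range 0 to 1, and a plain weighted sum stands in for whatever formula the configurable ranking actually used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the bucket representation and the weighted-sum formula are assumptions.
class RankingSketch {
    // Merges hit buckets given as inclusive [firstCue, lastCue] ranges.
    // Ranges that share at least one cue are combined into a single range.
    static List<int[]> mergeOverlapping(List<int[]> buckets) {
        List<int[]> sorted = new ArrayList<>(buckets);
        sorted.sort(Comparator.comparingInt((int[] b) -> b[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] b : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && b[0] <= last[1]) {
                last[1] = Math.max(last[1], b[1]); // extend the previous range
            } else {
                merged.add(new int[] {b[0], b[1]}); // copy, so the input is not mutated
            }
        }
        return merged;
    }

    // Weighted sum of the four signals; weights are ordered as
    // {luceneRank, popularity, rating, year}.
    static double movieScore(double luceneRank, double popularity,
                             double rating, double year, double[] weights) {
        return weights[0] * luceneRank
                + weights[1] * popularity
                + weights[2] * rating
                + weights[3] * year;
    }

    public static void main(String[] args) {
        List<int[]> buckets = new ArrayList<>();
        buckets.add(new int[] {3, 7});
        buckets.add(new int[] {0, 4});
        buckets.add(new int[] {10, 12});
        for (int[] range : mergeOverlapping(buckets)) {
            System.out.println(range[0] + "-" + range[1]);
        }
        System.out.println(movieScore(1.0, 0.5, 0.5, 0.0, new double[] {0.5, 0.25, 0.25, 0.0}));
    }
}
```

Keeping the weights in a configurable array makes it straightforward to compare configurations, for example one that favours text relevance against one biased toward popular titles.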

Achievements

Web Application

JSP/HTML5/CSS/JavaScript based

Full support for IE9

Modern UI design

Search results snippets

Multiple hits per movie

Paging

Video play option: per result snippet, relevant scene, captions

Testing

A testing platform enables comparing the “quality” of search results across different system configurations.

In each test, the search engine is queried with a famous quote.

A test passes if the relevant movie is found in the top-K results.
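The pass criterion can be written down as a small evaluation routine. This is a sketch under stated assumptions: the `passes` and `passRate` names are hypothetical, and the search engine is represented by a generic function from a quote to a ranked list of movie titles, since the slides do not show the platform's actual interface.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: method names and the search-function signature are assumptions.
class TopKEvalSketch {
    // A quote passes if its expected movie is among the first k ranked results.
    static boolean passes(List<String> rankedMovies, String expectedMovie, int k) {
        int limit = Math.min(k, rankedMovies.size());
        return rankedMovies.subList(0, limit).contains(expectedMovie);
    }

    // Fraction of quotes whose expected movie appears in the top k results.
    static double passRate(Map<String, String> quoteToMovie,
                           Function<String, List<String>> search, int k) {
        if (quoteToMovie.isEmpty()) {
            return 0.0;
        }
        int passed = 0;
        for (Map.Entry<String, String> entry : quoteToMovie.entrySet()) {
            if (passes(search.apply(entry.getKey()), entry.getValue(), k)) {
                passed++;
            }
        }
        return (double) passed / quoteToMovie.size();
    }

    public static void main(String[] args) {
        Map<String, String> expected = new HashMap<>();
        expected.put("quote one", "Movie A");
        expected.put("quote two", "Movie B");
        // Stub search engine that returns the same ranking for every query.
        Function<String, List<String>> stub =
                q -> Arrays.asList("Movie A", "Movie C", "Movie D");
        System.out.println(passRate(expected, stub, 2));
    }
}
```

Running the same quote set against several configurations and comparing the resulting pass rates gives a simple, repeatable way to judge which configuration ranks the expected movies higher.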

Testing

We tested the system with a set of ~100 famous movie quotes. With a biased system configuration and K=9, we achieved a ~90% pass rate.

Screenshots

Screenshots

Conclusions

Lucene is a powerful search platform

Optimal search results are difficult to define

Subtitles files from public sources should be further validated

HTML5 video support is still limited & browser dependent

Source control systems make life easier
