
Television News Search and Analysis with Lucene/Solr

Description:
A presentation given at the Lucene Revolution 2012 conference to introduce the UCLA Communication Studies Archive. Video: http://youtu.be/YnI7ftPcgJ4

Summary: The UCLA Communication Studies Archive hosts a collection of over 100,000 hours of digital television news, updated daily. Its search engine provides closed captioning search and online streaming of videos. The search engine allows researchers and students in various fields to study television news, images and language usage in ways that were not possible before. In this presentation, we will show the setup of our Lucene/Solr-powered search engine, as well as how it is being used. We will discuss our work on custom result formats, such as linking search result text to the video at particular timestamps, counting occurrences of words, phrases or patterns, grouping the result by fields such as month or show, and creating interactive charts. We will also discuss our work on extending Lucene’s proximity searches and creating custom query types, such as segment-enclosed (two or more words, phrases or patterns occurring within a story-based text segment), time-enclosed (two or more words, phrases or patterns occurring within a certain time), and multi-word regular expression queries. Future goals will also be discussed, such as supporting multiple languages, multiple sources (speech-to-text alongside closed-captioning text), searching user-contributed and generated metadata (programs that identify story segments, objects in video, etc.), and syntactic tags (such as parts of speech).

Television News Search and Analysis with Lucene/Solr

Kai Chan <[email protected]>

Social Sciences Computing

University of California, Los Angeles

Lucene Revolution, May 10, 2012


Communication Studies Archive: Background (1)

• Continuation of analog recording of TV news

– Thousands of tapes since Watergate/1970s

– Hard to look for a particular news program or topic


Communication Studies Archive: Background (2)

• Digital recording since 2005

• Capture news programs on computers

– Video: can be streamed over the Web

– Closed captioning (“subtitle text”): indexed and searchable

– Image snapshots

– Search engine and analysis tools


Communication Studies Archive: Background (3)

• Also download transcripts and web-streamed news programs

• 100 news programs and 600,000 words added each day


Communication Studies Archive: Background (4)

• January 2005 to present

– 28 networks

– 1,600 shows

– 130,000 hours

– 160,000 news programs

– 50,000,000 images

– 880,000,000 words


Why This is Important (1)

• Researchers

– Large and unique collection of communication

– Many modalities

• Speech, facial expression, body gesture, etc.

– Different conditions/settings

– Different networks and communities

– Allows study of TV news + communication in general in ways impossible before


Why This is Important (2)

• Non-researchers

– TV news about presentation and persuasion

• Which happen in daily life also

– TV main source of news for many/most

– Greatly affects the public’s decisions

– Learn about what we watch


Application in Research

• Communication Studies

– Amount of coverage for events over time

• Linguistics

– Speech and language patterns

• Computer Science

– Object identification

– Identify news anchors, public figures

– Story segmentation


Application in Teaching (1)

• Chicano Studies: Representations of Latinos on the Television News

– May 1, 2007 immigration march

– MacArthur Park, Los Angeles, CA

– 2 days (May 1 & 2, 2007)

– Framing, stereotyping, metaphor, silencing

– Reports with screenshots and links to news stories


Application in Teaching (2)

• Communication Studies: Presidential Communication

– 2008 presidential primary

– 6 weeks (Dec 2007 to Feb 2008)

– Coverage of sound bites

• Amount of time given to candidate/party

• Types of response (positive, neutral, negative)

– Students created their own political ad.


Work flow (1): Capture/conversion machines

• 2 groups, 2 machines per group

– Keep the best recording

– 6 TV tuners per machine

• Capture video and CC to separate files in real time

– MPEG-TS (~7 GB/hr)

– Timestamp every 2-3 seconds

• Generate image snapshots

• Convert videos

– MP4/H.264 (VGA, ~240 MB/hr)


Work flow (2): Storage/static file servers

• Control server

– Download TV schedules

– Download web-streamed news programs

– Collect and check recordings

– Push files to the appropriate servers

• Video streaming server

• Backup storage server

• Image server


Work flow (3): Search server

• Lucene index updated daily

– Main text field tokenized

– Separate fields for date, network, show, etc.

– Binary fields for segment and time data

• Hosts search engine


The search process


Custom query type: Segment-enclosed query (1)

• Problem 1: search for “X near Z”

• Lucene: search for “X within Y words of Z”

– How to pick Y?

– Hard to pick a fixed number


Custom query type: Segment-enclosed query (2)

• Problem 2: the matched search words might not all come from the same story

– E.g. “Obama AND visit AND Afghanistan”

– Might match a news program about Obama’s visit to Canada + violence in Afghanistan


Custom query type: Segment-enclosed query (3)

• A news program can contain several stories

– E.g. Local, national, world, weather, sports


Custom query type: Segment-enclosed query (4)


Custom query type: Segment-enclosed query (5)

• One solution: search for “X and Z within same story segment”

– Possible with Lucene + story segment info

• Bonus: enables searching/filtering for a particular story type

– E.g. Politics


Custom query type: Segment-enclosed query (6)

• How to mark segments

– Automated

• Computer Science researchers are working on automated methods

• Word frequency

• Scene change

• Black frame and silence

– Manual segmentation

• Watch the video

• Decide where a story starts and ends

• Mark positions in semi-automated system


Custom query type: Segment-enclosed query (7)


Custom query type: Segment-enclosed query (8)

• Idea

– Get spans from SpanNearQuery

– Filter and keep those fully within segments

• In production: segment info in stored fields

– As a list of <start position, end position>

– Simple to implement

– Reasonably fast searching

• Alternative: store segment info as terms

– Possible to find segments by themselves

– Appears to run much faster
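
The filtering step in the idea above can be sketched outside Lucene. This is a minimal illustration, not the archive's production code: it assumes spans and segments are both (start, end) pairs of token positions with an exclusive end, and that segments are sorted and non-overlapping. The function name `spans_within_segments` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]      # (start, end) token positions, end exclusive (assumed)
Segment = Tuple[int, int]   # (start, end) token positions, end exclusive (assumed)


def spans_within_segments(spans: List[Span], segments: List[Segment]) -> List[Span]:
    """Keep only the spans that lie fully inside a single story segment.

    Assumes `segments` is sorted by start position and does not overlap.
    """
    starts = [start for start, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Locate the last segment that begins at or before the span start.
        i = bisect_right(starts, span_start) - 1
        # The span survives only if that same segment also contains its end.
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept


# Two stories: tokens 0-49 and tokens 50-119.
segments = [(0, 50), (50, 120)]
# The middle span straddles the story boundary and is dropped.
spans = [(10, 14), (48, 53), (60, 61)]
print(spans_within_segments(spans, segments))
```

A binary search keeps the per-span cost logarithmic in the number of segments, which matters little for one program but adds up across a full result set.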


Custom query type: Time-enclosed query
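
By the talk's summary, a time-enclosed query matches two or more words, phrases or patterns occurring within a certain time. One way this could work, sketched under assumptions: captions carry a timestamp every few seconds, each timestamp is recorded as a (token position, seconds) mark, and a span is kept when the time between its first and last token is within the limit. The names `spans_within_seconds` and `marks` are hypothetical, and this is not necessarily how the archive implements it.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]      # (start, end) token positions, end exclusive (assumed)
Mark = Tuple[int, float]    # (token position, seconds from program start) (assumed)


def spans_within_seconds(spans: List[Span], marks: List[Mark],
                         max_seconds: float) -> List[Span]:
    """Keep spans whose first and last tokens are at most `max_seconds` apart.

    Assumes `marks` is sorted by token position.
    """
    positions = [pos for pos, _ in marks]

    def time_at(position: int) -> float:
        # Use the most recent timestamp at or before this token position.
        i = bisect_right(positions, position) - 1
        return marks[max(i, 0)][1]

    return [(start, end) for start, end in spans
            if time_at(end - 1) - time_at(start) <= max_seconds]


# One timestamp roughly every 2.5 seconds.
marks = [(0, 0.0), (8, 2.5), (15, 5.0), (23, 7.5)]
print(spans_within_seconds([(2, 10), (1, 24)], marks, max_seconds=3.0))
```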


Custom query type: Multi-term regular expression (1)

• “here is _ _ _ with the (news|story|details|report)”

• Apply RegEx to a phrase or sentence

– Not just individual words

• Lucene core has regular expression query support

– Good starting point

– Not a complete solution for us


Custom query type: Multi-term regular expression (2)

• Problems

– Some analyzers do not work with RegEx

– Lucene’s RegEx query classes only apply RegEx to individual terms

• Want to match a pattern against a phrase/sentence

• Want placeholders for whole words (not just characters)

– Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index

• very slow

• takes lots of memory


Custom query type: Multi-term regular expression (3)

• What we did

– Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)

• E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)

– Leading and trailing placeholders

• E.g. “_ _ is the _ _ _”

• Preserve for correctness

• Store word count for each document

• Expand each span on both sides

• Bounds checking
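
The handling of leading and trailing placeholders can be illustrated with a small sketch. It does not build Lucene queries: as a stand-in, it scans an in-memory token list, matches the core of the pattern (each term treated as a regular expression, "_" as a whole-word placeholder), then expands each match on both sides and checks the bounds against the document's word count. The function names are hypothetical.

```python
import re
from typing import List, Tuple

PLACEHOLDER = "_"   # matches any single whole word


def split_pattern(pattern: str) -> Tuple[int, List[str], int]:
    """Return (leading placeholder count, core terms, trailing placeholder count)."""
    terms = pattern.split()
    lead = 0
    while lead < len(terms) and terms[lead] == PLACEHOLDER:
        lead += 1
    trail = 0
    while trail < len(terms) - lead and terms[-1 - trail] == PLACEHOLDER:
        trail += 1
    return lead, terms[lead:len(terms) - trail], trail


def find_matches(pattern: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token positions (end exclusive) of every match."""
    lead, core, trail = split_pattern(pattern)
    if not core:
        raise ValueError("pattern needs at least one non-placeholder term")
    matches = []
    for start in range(len(tokens) - len(core) + 1):
        window = tokens[start:start + len(core)]
        if all(term == PLACEHOLDER or re.fullmatch(term, word)
               for term, word in zip(core, window)):
            # Expand the core span on both sides to cover the outer placeholders.
            begin, end = start - lead, start + len(core) + trail
            # Bounds checking: the expanded span must stay inside the document.
            if begin >= 0 and end <= len(tokens):
                matches.append((begin, end))
    return matches


tokens = "so this is the end my friend".split()
print(find_matches("_ _ is the _ _ _", tokens))    # span covers the whole text
print(find_matches("_ _ is the _ _ _", tokens[1:]))  # too few leading words
```

Matching only the core first avoids the "match every term" cost that the slides warn about for an unrestricted ".*" term; the outer placeholders are then satisfied by arithmetic on positions alone.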


Custom query type: Multi-term regular expression (4)

• Regular expression libraries differ in

– Syntax (e.g. Perl 5-compatible)

– Capabilities (e.g. back-references)

– Speed

• Memory usage

– Proportional to number of terms matched

– Increasing available memory might help


Custom result format: Occurrence count
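
The talk's summary mentions counting occurrences of words, phrases or patterns and grouping the result by fields such as month or show. A minimal sketch of counting by month, assuming each program is available as an (ISO date, caption text) pair; the function name and the sample data are hypothetical, and the archive's own implementation runs inside the search engine rather than over raw text.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_by_month(programs: Iterable[Tuple[str, str]], pattern: str) -> Dict[str, int]:
    """Count pattern occurrences per month.

    `programs` yields (date, text) pairs with dates formatted as YYYY-MM-DD.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    counts: Counter = Counter()
    for date, text in programs:
        month = date[:7]                       # "YYYY-MM"
        counts[month] += len(regex.findall(text))
    return dict(counts)


# Made-up caption snippets for illustration only.
programs = [
    ("2012-04-30", "Obama visits. Obama speaks."),
    ("2012-05-01", "Weather report."),
    ("2012-05-09", "obama in Kabul"),
]
print(count_by_month(programs, r"\bobama\b"))
```

Grouping by show instead of month would only change the key taken from each record.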


Future work: Job queue (1)

• Research front moving towards analysis of whole database

– Want full search result set

– Queries are intensive and take a long time

• Solution will be beyond increasing timeout

– Users might close their browsers

– We might restart the search back-end


Future work: Job queue (2)

• Features

– Query runs in background

– Notification when finished/failed

– Restart queries with recoverable errors

– Check and cancel jobs

– Downloadable result

– Schedule recurring queries

– Manage job priority and quota
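
A few of these features (background execution, notification, restarting on recoverable errors, cancelling) can be sketched with Python's standard library. This is only an illustration of the idea, not a design from the talk: `RecoverableError`, `run_with_retries`, `notify` and `submit` are hypothetical names, and a real job queue would also need persistence so that jobs survive a restart of the search back-end.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


class RecoverableError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retries(query: Callable[[], List[str]], max_attempts: int = 3) -> List[str]:
    """Run the query, restarting it after recoverable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except RecoverableError:
            if attempt == max_attempts:
                raise
    return []  # only reached when max_attempts < 1


def notify(job: Future) -> None:
    """Report the outcome; a real system might send e-mail instead."""
    if job.cancelled():
        print("job cancelled")
    elif job.exception() is not None:
        print("job failed:", job.exception())
    else:
        print("job finished with", len(job.result()), "results")


executor = ThreadPoolExecutor(max_workers=1)


def submit(query: Callable[[], List[str]]) -> Future:
    """Queue a query to run in the background and notify when it ends."""
    job = executor.submit(run_with_retries, query)
    job.add_done_callback(notify)
    return job


job = submit(lambda: ["program-1", "program-2"])
print(job.result(timeout=5))   # the caller can also poll job.done() or call job.cancel()
```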


Future work: Multiple sources and languages (1)

• Multilingual news programs

– E.g. some have English + Spanish CC

• Multiple text and timestamp sources

– E.g. CNN transcript available from website

– Applying speech-to-text to videos

– Manual correction of text and timestamps

• Multiple markets

– E.g. Capture TV programs in Denmark and Norway


Future work: Multiple sources and languages (2)

• Need language detection

– Libraries exist

• Search for specific channel

– Search by language more useful

– But no fixed channel -> language mapping

• What will proximity search and occurrence counting mean when there are multiple channels/languages?


Future work: Metadata

• Types of metadata

– Segment boundary, type and topic

– Headline and description (from transcripts)

– Website links

– Syntactic tags (e.g. part of speech)

– Generated annotation (e.g. object identification)

– User annotation (e.g. scene description)

– Screen text

• Eventually: want them to be searchable


Thank you for coming!

• Any questions?

• My e-mail: [email protected]

• Slides available: http://ucla.in/IDJq2u
