Television News Search and Analysis with Lucene/Solr

Kai Chan <[email protected]>

Social Sciences Computing

University of California, Los Angeles

Lucene Revolution, May 10, 2012

Communication Studies Archive: Background (1)

• Continuation of analog recording of TV news

– Thousands of tapes since Watergate/1970s

– Hard to look for a particular news program or topic

• Digital recording since 2005

• Capture news programs on computers

– Video: can be streamed over the Web

– Closed captioning (“subtitle text”): indexed and searchable

– Image snapshots

– Search engine and analysis tools

• Also download transcripts and web-streamed news programs

• 100 news programs and 600,000 words added each day

• January 2005 to present

– 28 networks

– 1,600 shows

– 130,000 hours

– 160,000 news programs

– 50,000,000 images

– 880,000,000 words

Why This is Important (1)

• Researchers

– Large and unique collection of communication

– Many modalities

• Speech, facial expression, body gesture, etc.

– Different conditions/settings

– Different networks and communities

– Allows study of TV news + communication in general in ways impossible before

Why This is Important (2)

• Non-researchers

– TV news is about presentation and persuasion

• Which also happen in daily life

– TV main source of news for many/most

– Greatly affects the public’s decisions

– Learn about what we watch

Application in Research

• Communication Studies

– Amount of coverage for events over time

• Linguistics

– Speech and language patterns

• Computer Science

– Object identification

– Identify news anchors, public figures

– Story segmentation

Application in Teaching (1)

• Chicano Studies: Representations of Latinos on the Television News

– May 1, 2007 immigration march

– MacArthur Park, Los Angeles, CA

– 2 days (May 1 & 2, 2007)

– Framing, stereotyping, metaphor, silencing

– Reports with screenshots and links to news stories

Application in Teaching (2)

• Communication Studies: Presidential Communication

– 2008 presidential primary

– 6 weeks (Dec 2007 to Feb 2008)

– Coverage of sound bites

• Amount of time given to candidate/party

• Types of response (positive, neutral, negative)

– Students created their own political ad.

Work flow (1): Capture/conversion machines

• 2 groups, 2 machines per group

– Keep the best recording

– 6 TV tuners per machine

• Capture video and CC to separate files in real-time

– MPEG-TS (~7 GB/hr)

– Timestamp every 2-3 seconds

• Generate image snapshots

• Convert videos

– MP4/H.264 (VGA, ~240 MB/hr)
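
The two bitrates above imply a rough compression ratio and storage footprint. The sketch below is a back-of-the-envelope calculation using only figures from the slides (~7 GB/hr, ~240 MB/hr, 130,000 hours). Decimal units, and the idea that every recorded hour is kept as MP4, are assumptions rather than statements from the talk.

```python
# Rough storage arithmetic from the slide figures.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB).
CAPTURE_MB_PER_HOUR = 7 * 1000   # MPEG-TS capture, ~7 GB/hr
STREAMING_MB_PER_HOUR = 240      # MP4/H.264 at VGA, ~240 MB/hr


def compression_ratio():
    """How many times smaller the converted video is than the capture."""
    return CAPTURE_MB_PER_HOUR / STREAMING_MB_PER_HOUR


def archive_size_tb(hours, mb_per_hour):
    """Total size in TB for a number of recorded hours at a given bitrate."""
    return hours * mb_per_hour / 1_000_000


if __name__ == "__main__":
    print(round(compression_ratio(), 1))                      # about 29x
    print(archive_size_tb(130_000, STREAMING_MB_PER_HOUR))    # TB as MP4
    print(archive_size_tb(130_000, CAPTURE_MB_PER_HOUR))      # TB as MPEG-TS
```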

Work flow (2): Storage/static file servers

• Control server

– Download TV schedules

– Download web-streamed news programs

– Collect and check recordings

– Push files to their destination servers

• Video streaming server

• Backup storage server

• Image server

Work flow (3): Search server

• Lucene index updated daily

– Main text field tokenized

– Separate fields for date, network, show, etc.

– Binary fields for segment and time data

• Hosts search engine
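
The slides do not say how the binary segment and time fields are laid out. As one possible illustration, a list of <start position, end position> pairs could be packed into fixed-width integers before being stored. The function names and the big-endian 32-bit layout below are assumptions made for this sketch.

```python
import struct


def pack_segments(segments):
    """Pack [(start, end), ...] token positions into bytes.

    Assumed layout: consecutive big-endian unsigned 32-bit integers,
    start then end for each segment.
    """
    flat = [pos for seg in segments for pos in seg]
    return struct.pack(">%dI" % len(flat), *flat)


def unpack_segments(blob):
    """Inverse of pack_segments: bytes back to a list of (start, end)."""
    count = len(blob) // 4
    flat = struct.unpack(">%dI" % count, blob)
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    segments = [(0, 120), (120, 480), (480, 900)]
    blob = pack_segments(segments)
    print(len(blob), unpack_segments(blob) == segments)
```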

The search process

Custom query type: Segment-enclosed query (1)

• Problem 1: search for “X near Z”

• Lucene: search for “X within Y words of Z”

– How to pick Y?

– Hard to pick a fixed number
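
To see why a fixed Y is hard to choose, the following toy check treats a document as a list of words and asks whether X occurs within Y positions of Z. It is a simplification of Lucene's slop semantics, written only to illustrate the problem; the function name and sample sentence are made up for this sketch.

```python
def within(tokens, x, z, y):
    """True if some occurrence of x is at most y positions away from some z."""
    xs = [i for i, t in enumerate(tokens) if t == x]
    zs = [i for i, t in enumerate(tokens) if t == z]
    return any(abs(i - j) <= y for i in xs for j in zs)


if __name__ == "__main__":
    text = ("obama said the visit to kabul was planned "
            "long before the trip to afghanistan")
    tokens = text.split()
    # The same sentence matches or fails depending on the chosen distance.
    print(within(tokens, "obama", "afghanistan", 5))
    print(within(tokens, "obama", "afghanistan", 15))
```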

• Problem 2: all matched search words might not be talking about same story

– E.g. “Obama AND visit AND Afghanistan”

– Might match a news program about Obama’s visit to Canada + violence in Afghanistan

• A news program can contain several stories

– E.g. Local, national, world, weather, sports

• One solution: search for “X and Z within same story segment”

– Possible with Lucene + story segment info

• Bonus: enables searching/filtering for a particular story type

– E.g. Politics

• How to mark segments

– Automated

• Computer Science researchers working on these methods

• Word frequency

• Scene change

• Black frame and silence

– Manual segmentation

• Watch the video

• Decide where a story starts and ends

• Mark positions in semi-automated system

• Idea

– Get spans from SpanNearQuery

– Filter and keep those fully within segments

• In production: segment info in stored fields

– As a list of <start position, end position>

– Simple to implement

– Reasonably fast searching

• Alternative: store segment info as terms

– Possible to find segments by themselves

– Appears to run much faster
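
The filtering step can be sketched independently of Lucene. The example below assumes that spans and segments are both given as (start, end) token positions with an exclusive end, and that segments are sorted and do not overlap. It is a minimal model of the idea, not the production code; the function name is hypothetical.

```python
from bisect import bisect_right


def spans_within_segments(spans, segments):
    """Keep only the spans that lie entirely inside a single segment.

    spans:    iterable of (start, end) positions, end exclusive
    segments: sorted, non-overlapping list of (start, end), end exclusive
    """
    starts = [seg_start for seg_start, _ in segments]
    kept = []
    for span_start, span_end in spans:
        # Index of the last segment that starts at or before the span.
        i = bisect_right(starts, span_start) - 1
        if i >= 0 and span_end <= segments[i][1]:
            kept.append((span_start, span_end))
    return kept


if __name__ == "__main__":
    segments = [(0, 100), (100, 250), (250, 400)]
    spans = [(10, 20), (95, 105), (240, 250), (390, 401)]
    # (95, 105) crosses a story boundary and (390, 401) runs past the end.
    print(spans_within_segments(spans, segments))
```

Because the segment starts are sorted, each span needs only a binary search, so the filter adds little cost on top of collecting the spans themselves.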

Custom query type: Time-enclosed query

Custom query type: Multi-term regular expression (1)

• “here is _ _ _ with the (news|story|details|report)”

• Apply RegEx to a phrase or sentence

– Not just individual words

• Lucene core has regular expression query support

– Good starting point

– Not a complete solution for us

• Problems

– Some analyzers do not work with RegEx

– Lucene’s RegEx query classes only apply RegEx to individual terms

• Want to match a pattern against a phrase/sentence

• Want placeholders for whole words (not just characters)

– Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index

• very slow

• takes lots of memory

• What we did

– Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)

• E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)

– Leading and trailing placeholders

• E.g. “_ _ is the _ _ _”

• Preserve for correctness

• Store word count for each document

• Expand each span on both sides

• Bounds checking
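
The matching semantics described above can be modelled in a few lines of plain Python. This sketch does not build Lucene queries; it only illustrates how whole-word placeholders, per-term regular expressions, span expansion and bounds checking fit together. The `_` placeholder comes from the slides, while the function names and the whitespace-based pattern syntax are assumptions.

```python
import re


def parse(pattern):
    """Split a pattern into (leading placeholders, core terms, trailing placeholders)."""
    tokens = pattern.split()
    leading = 0
    while leading < len(tokens) and tokens[leading] == "_":
        leading += 1
    trailing = 0
    while trailing < len(tokens) - leading and tokens[-1 - trailing] == "_":
        trailing += 1
    core = tokens[leading:len(tokens) - trailing]
    if not core:
        raise ValueError("pattern needs at least one non-placeholder term")
    return leading, core, trailing


def find_matches(pattern, words):
    """Return (start, end) word positions, end exclusive, of every match."""
    leading, core, trailing = parse(pattern)
    # "_" inside the core matches any single word; other terms are regexes.
    regexes = [None if term == "_" else re.compile(term) for term in core]
    matches = []
    for start in range(len(words) - len(core) + 1):
        window = words[start:start + len(core)]
        if all(r is None or r.fullmatch(w) for r, w in zip(regexes, window)):
            # Expand the span on both sides for leading/trailing placeholders.
            lo = start - leading
            hi = start + len(core) + trailing
            # Bounds checking: the expanded span must stay inside the document,
            # which is why the word count of each document is needed.
            if lo >= 0 and hi <= len(words):
                matches.append((lo, hi))
    return matches


if __name__ == "__main__":
    words = "and here is our correspondent john with the story tonight".split()
    print(find_matches("here is _ _ _ with the (news|story|details|report)", words))
```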

• Regular expression libraries differ in

– Syntax (e.g. Perl 5-compatible)

– Capabilities (e.g. back-references)

– Speed

• Memory usage

– Proportional to number of terms matched

– Increasing available memory might help

Custom result format: Occurrence count

Future work: Job queue (1)

• Research front moving towards analysis of whole database

– Want full search result set

– Queries are intensive and take a long time

• Solution must go beyond increasing the timeout

– Users might close their browsers

– We might restart the search back-end

Future work: Job queue (2)

• Features

– Query runs in background

– Notification when finished/failed

– Restart queries with recoverable errors

– Check and cancel jobs

– Downloadable result

– Schedule recurring queries

– Manage job priority and quota
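
Several of these features (priority, status checks, cancellation, restarting after recoverable errors) can be illustrated with a small in-memory queue. The job queue is future work in the talk, so everything below, including the class names and the `RecoverableError` marker, is a hypothetical sketch. Jobs are run synchronously through `run_next()` to keep the behaviour easy to follow; a background worker would call it in a loop.

```python
import heapq
import itertools


class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. back-end restart)."""


class JobQueue:
    def __init__(self, max_retries=2):
        self._heap = []                  # entries: (priority, job_id, func, attempts)
        self._ids = itertools.count()
        self.status = {}                 # job_id -> queued | done | failed | cancelled
        self.results = {}                # job_id -> return value of finished jobs
        self.max_retries = max_retries

    def submit(self, func, priority=10):
        """Queue a query; a lower priority number runs first."""
        job_id = next(self._ids)
        heapq.heappush(self._heap, (priority, job_id, func, 0))
        self.status[job_id] = "queued"
        return job_id

    def cancel(self, job_id):
        """Cancel a job that has not finished yet."""
        if self.status.get(job_id) == "queued":
            self.status[job_id] = "cancelled"

    def run_next(self):
        """Run the highest-priority queued job; return its id, or None if idle."""
        while self._heap:
            priority, job_id, func, attempts = heapq.heappop(self._heap)
            if self.status[job_id] == "cancelled":
                continue                 # skip cancelled jobs
            try:
                self.results[job_id] = func()
                self.status[job_id] = "done"
            except RecoverableError:
                if attempts < self.max_retries:
                    # Put the job back so a later call retries it.
                    heapq.heappush(self._heap, (priority, job_id, func, attempts + 1))
                else:
                    self.status[job_id] = "failed"
            except Exception:
                self.status[job_id] = "failed"
            return job_id
        return None
```

Notification, downloadable results, recurring schedules and quotas are left out; they would build on the same status table.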

Future work: Multiple sources and languages (1)

• Multilingual news programs

– E.g. some have English + Spanish CC

• Multiple text and timestamp sources

– E.g. CNN transcript available from website

– Applying speech-to-text to videos

– Manual correction of text and timestamps

• Multiple markets

– E.g. Capture TV programs in Denmark and Norway

Future work: Multiple sources and languages (2)

• Need language detection

– Libraries exist

• Search for specific channel

– Search by language more useful

– But no fixed channel -> language mapping

• What will proximity search and occurrence counting mean when there are multiple channels/languages?
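
As a rough idea of what per-channel language detection involves, the toy below guesses between English and Spanish by counting common function words. It is only an illustration: the word lists and function name are made up for this sketch, and an existing detection library would be far more robust on real closed captioning.

```python
# Tiny hand-picked stopword lists; an assumption for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "para"},
}


def guess_language(text):
    """Return the language code with the most stopword hits, or None if no hits."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("the president said that the visit is planned"))
    print(guess_language("el presidente dijo que la visita es para los estados"))
```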

Future work: Metadata

• Types of metadata

– Segment boundary, type and topic

– Headline and description (from transcripts)

– Website links

– Syntactic tags (e.g. part of speech)

– Generated annotation (e.g. object identification)

– User annotation (e.g. scene description)

– Screen text

• Eventually: want them to be searchable

Thank you for coming!

• Any questions?

• My e-mail: [email protected]

• Slides available: http://ucla.in/IDJq2u
