Television News Search and Analysis with Lucene/Solr
Kai Chan <[email protected]>
Social Sciences Computing
University of California, Los Angeles
Lucene Revolution, May 10, 2012
Communication Studies Archive: Background (1)
• Continuation of analog recording of TV news
– Thousands of tapes since Watergate/1970s
– Hard to look for a particular news program or topic
1
Communication Studies Archive: Background (2)
• Digital recording since 2005
• Capture news programs on computers
– Video: can be streamed over the Web
– Closed captioning (“subtitle text”): indexed and searchable
– Image snapshots
– Search engine and analysis tools
2
Communication Studies Archive: Background (3)
• Also download transcripts and web-streamed news programs
• 100 news programs and 600,000 words added each day
3
Communication Studies Archive: Background (4)
• January 2005 to present
– 28 networks
– 1,600 shows
– 130,000 hours
– 160,000 news programs
– 50,000,000 images
– 880,000,000 words
4
Why This is Important (1)
• Researchers
– Large and unique collection of communication
– Many modalities
• Speech, facial expression, body gesture, etc.
– Different conditions/settings
– Different networks and communities
– Allows study of TV news + communication in general in ways impossible before
5
Why This is Important (2)
• Non-researchers
– TV news is about presentation and persuasion
• Which also happen in daily life
– TV is the main source of news for many/most people
– Greatly affects the public’s decisions
– Learn about what we watch
6
Application in Research
• Communication Studies
– Amount of coverage for events over time
• Linguistics
– Speech and language patterns
• Computer Science
– Object identification
– Identify news anchors, public figures
– Story segmentation
14
Application in Teaching (1)
• Chicano Studies: Representations of Latinos on the Television News
– May 1, 2007 immigration march
– MacArthur Park, Los Angeles, CA
– 2 days (May 1 & 2, 2007)
– Framing, stereotyping, metaphor, silencing
– Reports with screenshots and links to news stories
15
Application in Teaching (2)
• Communication Studies: Presidential Communication
– 2008 presidential primary
– 6 weeks (Dec 2007 to Feb 2008)
– Coverage of sound bites
• Amount of time given to candidate/party
• Types of response (positive, neutral, negative)
– Students created their own political ad.
16
Work flow (1): Capture/conversion machines
• 2 groups, 2 machines per group
– Keep the best recording
– 6 TV tuners per machine
• Capture video and CC to separate files in real-time
– MPEG-TS (~7 GB/hr)
– Timestamp every 2-3 seconds
• Generate image snapshots
• Convert videos
– MP4/H.264 (VGA, ~240 MB/hr)
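The two bitrates quoted above can be put side by side with a little arithmetic. The sketch below uses only figures from the slides (about 7 GB/hr raw, about 240 MB/hr converted, 130,000 archive hours) and assumes decimal units, which the slides do not specify. Applying the per-hour rates to the whole archive is a back-of-envelope estimate, not a reported measurement.

```python
# Back-of-envelope storage arithmetic using only figures quoted on the slides.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB).
MPEG_TS_GB_PER_HOUR = 7.0   # raw capture, approximate
MP4_MB_PER_HOUR = 240.0     # converted H.264, approximate
ARCHIVE_HOURS = 130_000     # "130,000 hours" from the Background slide


def size_ratio(raw_gb_per_hour, converted_mb_per_hour):
    """How many times smaller the converted video is than the raw capture."""
    return raw_gb_per_hour * 1000.0 / converted_mb_per_hour


def total_tb(mb_per_hour, hours):
    """Total size in terabytes for a given hourly size and duration."""
    return mb_per_hour * hours / 1_000_000.0


ratio = size_ratio(MPEG_TS_GB_PER_HOUR, MP4_MB_PER_HOUR)
mp4_tb = total_tb(MP4_MB_PER_HOUR, ARCHIVE_HOURS)
raw_tb = total_tb(MPEG_TS_GB_PER_HOUR * 1000.0, ARCHIVE_HOURS)

print(f"conversion shrinks video roughly {ratio:.0f}x")
print(f"all archive hours as MP4: roughly {mp4_tb:.1f} TB")
print(f"same hours kept as MPEG-TS: roughly {raw_tb:.0f} TB")
```

Under these assumptions the converted copies are small enough to serve from a streaming server, while keeping every hour in the raw format would need roughly thirty times the space.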
17
Work flow (2): Storage/static file servers
• Control server
– Download TV schedules
– Download web-streamed news programs
– Collect and check recordings
– Push files to their destination servers
• Video streaming server
• Backup storage server
• Image server
18
Work flow (3): Search server
• Lucene index updated daily
– Main text field tokenized
– Separate fields for date, network, show, etc.
– Binary fields for segment and time data
• Hosts search engine
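To make the field layout above concrete, here is a minimal sketch of one indexed news program and of how segment boundaries might be packed into a binary field. It is plain Python rather than Lucene code. The class name, field names, and byte layout (a count followed by big-endian 32-bit pairs) are all assumptions for illustration; the slides do not describe the actual encoding.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProgramDoc:
    """One news program as it might be indexed (hypothetical field names)."""
    text: str      # main closed-caption text, tokenized at index time
    date: str      # e.g. "2007-05-01"
    network: str
    show: str
    # (start, end) token positions of each story segment, end exclusive
    segments: List[Tuple[int, int]] = field(default_factory=list)


def pack_segments(segments):
    """Pack (start, end) pairs into bytes: a count, then big-endian uint32 pairs."""
    blob = struct.pack(">I", len(segments))
    for start, end in segments:
        blob += struct.pack(">II", start, end)
    return blob


def unpack_segments(blob):
    """Inverse of pack_segments: recover the list of (start, end) pairs."""
    (count,) = struct.unpack_from(">I", blob, 0)
    return [struct.unpack_from(">II", blob, 4 + 8 * i) for i in range(count)]


doc = ProgramDoc(text="...", date="2007-05-01", network="(network)",
                 show="(show)", segments=[(0, 120), (120, 480)])
blob = pack_segments(doc.segments)
print(len(blob), unpack_segments(blob))
```

A fixed-width layout like this keeps decoding trivial at search time, at the cost of 8 bytes per segment.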
19
The search process
20
Custom query type: Segment-enclosed query (1)
• Problem 1: search for “X near Z”
• Lucene: search for “X within Y words of Z”
– How to pick Y?
– Hard to pick a fixed number
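The difficulty of choosing Y can be seen in a toy example. The sketch below is plain Python over a list of tokens, not Lucene code, and it measures distance simply as the difference between token positions (Lucene's own slop definition differs in detail). The caption text is made up for illustration.

```python
def positions(tokens, word):
    """Token positions at which `word` occurs."""
    return [i for i, t in enumerate(tokens) if t == word]


def within(tokens, x, z, y):
    """Pairs (pos_x, pos_z) whose positions differ by at most y, in either order."""
    return [(i, j)
            for i in positions(tokens, x)
            for j in positions(tokens, z)
            if abs(i - j) <= y]


# Two unrelated stories in one program (made-up caption text).
tokens = ("obama arrived in ottawa today "
          "in other news violence continued in afghanistan").split()

small_window = within(tokens, "obama", "afghanistan", 5)
large_window = within(tokens, "obama", "afghanistan", 15)
print(small_window)   # nothing found with a small window
print(large_window)   # a match that spans two different stories
```

A small window misses long stories, while a large one starts pairing words from unrelated stories, so no single fixed value suits every program.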
21
Custom query type: Segment-enclosed query (2)
• Problem 2: all matched search words might not be talking about same story
– E.g. “Obama AND visit AND Afghanistan”
– Might match a news program about Obama’s visit to Canada + violence in Afghanistan
22
Custom query type: Segment-enclosed query (3)
• A news program can contain several stories
– E.g. Local, national, world, weather, sports
23
Custom query type: Segment-enclosed query (4)
24
Custom query type: Segment-enclosed query (5)
• One solution: search for “X and Z within same story segment”
– Possible with Lucene + story segment info
• Bonus: enables searching/filtering for a particular story type
– E.g. Politics
25
Custom query type: Segment-enclosed query (6)
• How to mark segments
– Automated
• Computer Science researchers working on them
• Word frequency
• Scene change
• Black frame and silence
– Manual segmentation
• Watch the video
• Decide where a story starts and ends
• Mark positions in semi-automated system
26
Custom query type: Segment-enclosed query (7)
27
Custom query type: Segment-enclosed query (8)
• Idea
– Get spans from SpanNearQuery
– Filter and keep those fully within segments
• In production: segment info in stored fields
– As a list of <start position, end position>
– Simple to implement
– Reasonably fast searching
• Alternative: store segment info as terms
– Possible to find segments by themselves
– Appears to run much faster
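The "get spans, then filter" idea can be sketched without Lucene. In the Python below, `candidate_spans` is a brute-force stand-in for the spans a SpanNearQuery would return (it is not Lucene's API), and `enclosed` keeps only spans that lie fully inside one segment. Both function names, the end-exclusive span convention, and the sample text are assumptions for illustration.

```python
from itertools import product


def candidate_spans(tokens, words):
    """All (start, end) spans, end exclusive, covering one occurrence of every word."""
    pos_lists = [[i for i, t in enumerate(tokens) if t == w] for w in words]
    return sorted({(min(c), max(c) + 1) for c in product(*pos_lists)})


def enclosed(spans, segments):
    """Keep only the spans that fall entirely within a single segment."""
    return [(s, e) for s, e in spans
            if any(seg_s <= s and e <= seg_e for seg_s, seg_e in segments)]


# One program with three stories (made-up text); segments are token ranges.
tokens = ("obama will visit canada next week "
          "violence continues in afghanistan "
          "obama plans visit to afghanistan").split()
segments = [(0, 6), (6, 10), (10, 15)]

spans = candidate_spans(tokens, ["obama", "visit", "afghanistan"])
hits = enclosed(spans, segments)
print(len(spans), hits)
```

Of the candidate spans, only the one inside the third story survives; the spans that stitch together the Canada visit and the Afghanistan story are dropped.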
28
Custom query type: Time-enclosed query
29
Custom query type: Multi-term regular expression (1)
• “here is _ _ _ with the (news|story|details|report)”
• Apply RegEx to a phrase or sentence
– Not just individual words
• Lucene core has regular expression query support
– Good starting point
– Not a complete solution for us
30
Custom query type: Multi-term regular expression (2)
• Problems
– Some analyzers do not work with RegEx
– Lucene’s RegEx query classes only apply RegEx to individual terms
• Want to match a pattern against a phrase/sentence
• Want placeholders for whole words (not just characters)
– Term(fieldName, “.*”) matches all terms, and all documents, and all positions in the index
• very slow
• takes lots of memory
31
Custom query type: Multi-term regular expression (3)
• What we did
– Parse and translate multi-term RegEx into Lucene built-in queries (SpanNearQuery, RegexQuery)
• E.g. “here is _ _ _ with the” = “here is” followed by “with the” (with exactly 3 terms in between)
– Leading and trailing placeholders
• E.g. “_ _ is the _ _ _”
• Preserve for correctness
• Store word count for each document
• Expand each span on both sides
• Bounds checking
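The steps above can be illustrated with a small Python sketch that works on a token list instead of a Lucene index. It assumes a simplified pattern syntax: elements are separated by spaces, `_` stands for exactly one word, and every other element is a regular expression applied to a single term. The function names are hypothetical, and the production system translates patterns into Lucene queries rather than scanning tokens like this.

```python
import re


def parse(pattern):
    """Split a pattern into (leading placeholders, core elements, trailing placeholders)."""
    parts = pattern.split()
    lead = 0
    while lead < len(parts) and parts[lead] == "_":
        lead += 1
    trail = 0
    while trail < len(parts) - lead and parts[len(parts) - 1 - trail] == "_":
        trail += 1
    return lead, parts[lead:len(parts) - trail], trail


def match_core(tokens, core):
    """Start positions where each core element matches its token (`_` matches any word)."""
    return [start for start in range(len(tokens) - len(core) + 1)
            if all(el == "_" or re.fullmatch(el, tokens[start + k])
                   for k, el in enumerate(core))]


def search(tokens, pattern):
    """Matching (start, end) spans, end exclusive, including placeholder words."""
    lead, core, trail = parse(pattern)
    if not core:
        return []                          # all-placeholder patterns are out of scope here
    word_count = len(tokens)               # plays the role of the stored word count
    spans = []
    for start in match_core(tokens, core):
        s, e = start - lead, start + len(core) + trail   # expand on both sides
        if s >= 0 and e <= word_count:                   # bounds checking
            spans.append((s, e))
    return spans


tokens = "here is our correspondent john with the story".split()
print(search(tokens, "here is _ _ _ with the (news|story|details|report)"))
print(search("this is the top story tonight".split(), "_ _ is the _ _ _"))
print(search("and this is the top story tonight".split(), "_ _ is the _ _ _"))
```

The second search finds "is the" but rejects it, because only one word precedes it where the pattern demands two. This is why the placeholders are preserved and the expanded span is checked against the document's word count.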
32
Custom query type: Multi-term regular expression (4)
• Regular expression libraries differ in
– Syntax (e.g. Perl 5-compatible)
– Capabilities (e.g. back-references)
– Speed
• Memory usage
– Proportional to number of terms matched
– Increasing available memory might help
33
Custom result format: Occurrence count
34
Future work: Job queue (1)
• Research front moving towards analysis of whole database
– Want full search result set
– Queries are intensive and take a long time
• Solution must go beyond increasing the timeout
– Users might close their browsers
– We might restart the search back-end
35
Future work: Job queue (2)
• Features
– Query runs in background
– Notification when finished/failed
– Restart queries with recoverable errors
– Check and cancel jobs
– Downloadable result
– Schedule recurring queries
– Manage job priority and quota
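A few of the features listed above (priority, cancellation, and retrying recoverable errors) can be sketched in a small in-memory queue. Everything here is hypothetical: the class names, the status strings, and the `RecoverableError` type are assumptions, and the queue runs jobs synchronously for clarity. A real job queue would run queries in background workers and would also need notification, quotas, and recurring schedules, which this sketch omits.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Any, Callable, Optional


class RecoverableError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a back-end restart)."""


@dataclass
class Job:
    job_id: int
    run: Callable[[], Any]
    priority: int = 0            # lower number runs first
    max_attempts: int = 3
    status: str = "queued"       # queued, done, failed, cancelled
    attempts: int = 0
    result: Any = None
    error: Optional[str] = None


class JobQueue:
    def __init__(self):
        self._heap = []                  # (priority, job_id) pairs
        self._ids = itertools.count(1)
        self.jobs = {}                   # job_id -> Job, for status checks

    def submit(self, run, priority=0, max_attempts=3):
        job = Job(next(self._ids), run, priority, max_attempts)
        self.jobs[job.job_id] = job
        heapq.heappush(self._heap, (priority, job.job_id))
        return job.job_id

    def cancel(self, job_id):
        """Cancel a job that has not finished; returns True if it was cancelled."""
        job = self.jobs[job_id]
        if job.status == "queued":
            job.status = "cancelled"
            return True
        return False

    def run_next(self):
        """Run the next queued job and return it, or None if nothing is runnable."""
        while self._heap:
            _, job_id = heapq.heappop(self._heap)
            job = self.jobs[job_id]
            if job.status != "queued":
                continue                 # skip jobs cancelled while waiting
            job.attempts += 1
            try:
                job.result = job.run()
                job.status = "done"
            except RecoverableError as exc:
                job.error = str(exc)
                if job.attempts < job.max_attempts:
                    heapq.heappush(self._heap, (job.priority, job_id))  # retry later
                else:
                    job.status = "failed"
            except Exception as exc:
                job.error = str(exc)
                job.status = "failed"
            return job
        return None
```

Because results and status live on the `Job` record rather than in the HTTP request, a user could close the browser and check back later, which is the point of moving beyond a longer timeout.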
36
Future work: Multiple sources and languages (1)
• Multilingual news programs
– E.g. some have English + Spanish CC
• Multiple text and timestamp sources
– E.g. CNN transcript available from website
– Applying speech-to-text to videos
– Manual correction of text and timestamps
• Multiple markets
– E.g. Capture TV programs in Denmark and Norway
37
Future work: Multiple sources and languages (2)
• Need language detection
– Libraries exist
• Search for specific channel
– Search by language more useful
– But no fixed channel -> language mapping
• What will proximity search and occurrence counting mean when there are multiple channels/languages?
38
Future work: Metadata
• Types of metadata
– Segment boundary, type and topic
– Headline and description (from transcripts)
– Website links
– Syntactic tags (e.g. part of speech)
– Generated annotation (e.g. object identification)
– User annotation (e.g. scene description)
– Screen text
• Eventually: want them to be searchable
39
Thank you for coming!
• Any questions?
• My e-mail: [email protected]
• Slides available: http://ucla.in/IDJq2u
40