Media Fragment Indexing Using Social Media
Yunjia Li (1), Raphael Troncy (2), Mike Wald (1) and Gary Wills (1)
(1) School of Electronics and Computer Science, University of Southampton, UK
(2) EURECOM, Sophia Antipolis, France
Agenda
• Media Fragments
• Media Fragment Indexing Framework
• Survey on Media Fragment URI Implementations on Video Sharing Platforms
• Indexing Media Fragments Using Twitter
• Conclusions and Future Work
Media Fragment
• Denotes a part of the content inside a multimedia resource
• Dimensions defined in the W3C Media Fragments URI 1.0 specification
– Temporal dimension
http://example.org/test.mp4#t=3,7
– Spatial dimension (a rectangular area)
http://example.org/test.mp4#xywh=120,240,180,240
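A minimal sketch of how the two dimensions above could be read from a URI. The function name `parse_media_fragment` is hypothetical, and only plain seconds and plain pixel values are handled; the specification allows further forms that are not covered here.

```python
from urllib.parse import urlparse, parse_qsl


def parse_media_fragment(uri):
    """Return the temporal and spatial dimensions found in the URI fragment."""
    fragment = urlparse(uri).fragment
    result = {}
    for name, value in parse_qsl(fragment, keep_blank_values=True):
        if name == "t":
            # "t=3,7" means start at 3s and end at 7s; either side may be omitted.
            start, _, end = value.partition(",")
            result["t"] = (float(start) if start else 0.0,
                           float(end) if end else None)
        elif name == "xywh":
            # "xywh=x,y,w,h" describes a rectangle in pixels.
            x, y, w, h = (int(v) for v in value.split(","))
            result["xywh"] = {"x": x, "y": y, "w": w, "h": h}
    return result
```

For example, `parse_media_fragment("http://example.org/test.mp4#t=3,7")` yields a start of 3 seconds and an end of 7 seconds.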
Current Situation
• Uploading, sharing and tagging multimedia is easy
• Searching for a complete multimedia resource on major search engines is easy
• But searching multimedia resources at a fine-grained level on major search engines is difficult
– Availability of annotations: only a limited number of annotations are linked to media fragments
– SEO problem:
• The landing page is not search-engine-friendly
• Everything is on the same page, and the notion of a media fragment is not explicitly embedded in the HTML
Media Fragment Indexing Framework
Google’s Ajax Content Crawler
• The crawler is designed to index Ajax content
• It replaces the token “#!” in URLs with “?_escaped_fragment_=”
*Diagram from https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
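The URL rewriting performed by the crawler can be sketched as follows. This is an illustration only: `to_crawler_url` is a hypothetical helper, and the escaping shown here (keeping "=" and "," readable and percent-encoding other reserved characters) is an assumption rather than a full reproduction of Google's rules.

```python
from urllib.parse import quote


def to_crawler_url(pretty_url):
    """Rewrite a "#!" URL into the "_escaped_fragment_" form requested by the crawler."""
    base, marker, fragment = pretty_url.partition("#!")
    if not marker:
        return pretty_url  # no hash-bang token: nothing to rewrite
    # Append to an existing query string if there is one.
    separator = "&" if "?" in base else "?"
    # Assumption: "=" and "," stay unescaped, so t=3,7 survives as-is.
    return f"{base}{separator}_escaped_fragment_={quote(fragment, safe='=,')}"
```

With the media fragment placed after “#!”, a pretty URL ending in `replay/1#!t=3,7` becomes `replay/1?_escaped_fragment_=t=3,7`.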
Key Ideas
• The fragment information must be included in the URL
– Syntax: W3C Media Fragments URI 1.0 specification
• Prepare two sets of pages for every media fragment
– original landing page for end-users
– a snapshot page for SEO
• Landing page keeps the original user interaction
– Highlight media fragments on opening
• SEO page
– ONLY includes annotations of the media fragment
– Embed rich snippet
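A rough sketch of what a snapshot page body could look like. The helper `render_snapshot` is hypothetical, and the choice of schema.org VideoObject properties (`contentUrl`, `description`) is an assumption for illustration, not the markup actually used by the framework.

```python
from html import escape


def render_snapshot(video_url, fragment, annotations):
    """Build Microdata markup that contains only the annotations of one media fragment."""
    fragment_url = f"{video_url}#{fragment}"
    items = "\n".join(
        f'  <p itemprop="description">{escape(text)}</p>' for text in annotations
    )
    return (
        '<div itemscope itemtype="http://schema.org/VideoObject">\n'
        f'  <link itemprop="contentUrl" href="{escape(fragment_url, quote=True)}">\n'
        f"{items}\n"
        "</div>"
    )
```

Calling `render_snapshot("http://example.org/test.mp4", "t=3,7", ["Terrace Theater"])` produces markup that mentions only the annotation attached to that fragment.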
The Solution
[Diagram: message flow between the crawler, the server and the user, in the following steps]
1: Submit the pretty URL replay/1#!t=3,7 to the crawler
2: The crawler asks the server for replay/1?_escaped_fragment_=t=3,7
3: The server redirects the request to the snapshot page it generates, Snapshot/1?_escaped_fragment_=t=3,7. The snapshot page contains only the annotations and Microdata for “#t=3,7” (here, “Terrace Theater”)
4: The snapshot page is returned to the crawler under the URL replay/1#!t=3,7
5: A user searches for the keyword “Terrace Theater”
6: Google includes replay/1#!t=3,7 in the search results
7: The user clicks the link and asks for the document at replay/1#!t=3,7
8: The server returns the landing page, which contains both “Terrace Theater” and “Linked Data”
9: The landing page highlights the media fragment by playing from 3s to 7s
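The server-side decision in steps 2, 3 and 8 can be sketched as below. Everything here is hypothetical: the in-memory `ANNOTATIONS` store, the fragment `t=10,20` assigned to “Linked Data”, and the function `handle_request`. The redirect itself is not modelled; the sketch only shows which annotations each kind of page would carry.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical annotation store: video id -> {fragment: [annotation texts]}.
# The fragment for "Linked Data" is an assumed value for illustration.
ANNOTATIONS = {
    "1": {"t=3,7": ["Terrace Theater"], "t=10,20": ["Linked Data"]},
}


def handle_request(url):
    """Decide between the snapshot page (crawler) and the landing page (user)."""
    parts = urlsplit(url)
    video_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    fragments = ANNOTATIONS.get(video_id, {})
    query = parse_qs(parts.query)
    if "_escaped_fragment_" in query:
        # Crawler request: serve only the annotations of the requested fragment.
        fragment = query["_escaped_fragment_"][0]
        return {"page": "snapshot", "fragment": fragment,
                "annotations": fragments.get(fragment, [])}
    # User request: the "#!..." part never reaches the server, so the landing
    # page carries every annotation and highlights the fragment client-side.
    everything = [text for texts in fragments.values() for text in texts]
    return {"page": "landing", "fragment": None, "annotations": everything}
```

A request for `replay/1?_escaped_fragment_=t=3,7` returns only “Terrace Theater”, while a plain request for `replay/1` returns both annotations.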
Discussion
• The Media Fragment Indexing Framework solves the SEO problem of media fragments
• The scalability of this method largely depends on whether a large number of annotations are linked to media fragments
• Where can media fragment annotations be found?
– Timed text, transcripts, speech recognition
– Manual annotations on each video sharing platform
– Social media (Twitter)
Survey on Media Fragment URI Implementation
Media Fragments and Social Media
• The deep-linking function (e.g. sharing a video from a certain time point)
• A Media Fragment URI can be embedded in a Tweet
• The text of the Tweet is an annotation of the URI
• Get annotations by filtering Tweets that contain MF URIs
Filter Tweets by Media Fragment URIs
• Problem:
– Any URL in a Tweet is potentially an MF URI
– Too many false positives
http://example.org/1234#t=23
http://example.org/1234?t=23
http://example.org/1234?track=23
– Any of these could be an MF URI; each needs to be identified manually
• Workaround:
– Identify platforms that (partially) implement MF URIs
– Only collect Tweets containing URLs from those domains
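The workaround amounts to a domain whitelist. A minimal sketch, assuming a hypothetical `ALLOWED_DOMAINS` set (an illustrative subset of host names, not the list used in the study) and a hypothetical helper `is_candidate`:

```python
from urllib.parse import urlparse

# Illustrative subset of host names; the real list comes from the survey.
ALLOWED_DOMAINS = {"youtube.com", "vimeo.com", "dailymotion.com"}


def is_candidate(url):
    """Keep a URL only if its host is a whitelisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

Matching on the parsed host name, rather than on a substring of the URL, avoids accepting look-alike hosts or whitelisted names that only appear in the path.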
Survey Methodology
• Find a list of video sharing platforms
– http://en.wikipedia.org/wiki/List_of_video_hosting_services
– 59 websites are targeted in the survey
– Some of them have access restrictions
• Go through each website manually to check whether it provides a deep-linking function, such as:
– A social sharing button that links from a certain time point
– A deep-linking option in the right-click menu
Survey Results (1)
• 9 websites partially implement MF URIs
– 56.com, Dailymotion, Hulu, Vbox7, Viddler, Vimeo, Tudou, Youku and YouTube
• They use different syntaxes to encode the temporal dimension
– Most of them use the URI query, except YouTube and Vimeo
– Parameter names: “start”, “t”, “st”, etc.
– Only Hulu implements the end time
• Only YouTube partially implements the spatial dimension
– This is an external function implemented by Clickberry
https://clickberry.tv/video/6dafe30e-dcb8-44b8-8190-32be8249a297
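Because the platforms differ in syntax, a parser has to look in both the URI query and the URI fragment, under several parameter names. The sketch below is an assumption-laden illustration: `extract_start` and `to_seconds` are hypothetical helpers, no mapping from parameter name to platform is implied, and accepting an "XhYmZs" time form in addition to plain seconds is an assumption.

```python
import re
from urllib.parse import urlparse, parse_qs

START_PARAMS = ("t", "start", "st")  # parameter names reported in the survey


def to_seconds(value):
    """Accept plain seconds ("23") or an h/m/s form ("1m30s"); return None otherwise."""
    if value.isdigit():
        return int(value)
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", value)
    if not match or not any(match.groups()):
        return None
    h, m, s = (int(g) if g else 0 for g in match.groups())
    return h * 3600 + m * 60 + s


def extract_start(url):
    """Return the start time in seconds, or None if no known parameter is present."""
    parts = urlparse(url)
    for source in (parts.query, parts.fragment):
        params = parse_qs(source)
        for name in START_PARAMS:
            if name in params:
                seconds = to_seconds(params[name][0])
                if seconds is not None:
                    return seconds
    return None
```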
Survey Results (2)
• Only 9 websites partially implement MF URIs, however:
– Those websites cover most videos shared on the web
– eBizMBA report: http://www.ebizmba.com/articles/video-websites
• Select filter keywords based on the survey results:
– Twitter is banned in China, so 56.com, Tudou and Youku are ignored
– Hulu has access restrictions outside the U.S., so it is ignored as well
• Filter keywords: “YouTube”, “Dailymotion”, “Vbox7”, “Vimeo” and “Viddler”
Indexing Media Fragments Using Twitter
Twitter Media Fragment Indexer
• Collect Tweets filtered by the keywords
• Extract the MF URIs from the Tweets and parse the media fragment information
• Use the Media Fragment Indexing Framework to publish the Tweets as media fragment annotations
• Embed rich snippets in the snapshot pages
• Create a sitemap for Google to crawl the snapshot pages
• When a user searches Google for keywords from a Tweet, the resulting link leads to the video at the corresponding start time
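The sitemap step could look roughly like this. The helper `build_sitemap` is hypothetical and the example URLs are placeholders; only the sitemap namespace is taken from the public sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per distinct URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):  # drop duplicate URLs
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

For instance, `build_sitemap(["http://example.org/replay/1#!t=3,7", "http://example.org/replay/1#!t=3,7"])` lists that URL once.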
The Detailed Workflow
Indexing Results (1)
• Monitored the Twitter stream non-stop for 50 hours
• Filter phrase: “youtube, dailymotion, vimeo, vbox7, viddler”
• 5,779,858 Tweets examined, of which 5,269,742 contain URLs
• 32,754 Tweets contain MF URIs, 32,796 MF URIs in total
• Media Fragment URIs shared on each website:

Website        No. of MF URIs    %
YouTube        32,666            99.604
Dailymotion    101               0.308
Vbox7          0                 0
Viddler        0                 0
Vimeo          29                0.088
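The percentages in the table follow directly from the per-site counts; a short check, using only the numbers reported above:

```python
# Per-site MF URI counts as reported in the table.
counts = {"YouTube": 32666, "Dailymotion": 101, "Vbox7": 0, "Viddler": 0, "Vimeo": 29}

total = sum(counts.values())
# Share of each site in percent, rounded to three decimals as in the table.
shares = {site: round(100 * n / total, 3) for site, n in counts.items()}
```

The counts sum to the 32,796 MF URIs reported in total.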
Indexing Results (2)
• 13,088 distinct videos were found
• 17,854 distinct MF URIs for the sitemap
– Many Tweets share the same video, but different fragments
– Many retweets
– Some videos are not available in the UK
• 17,479 URLs (97.9%) in the sitemap have been indexed by Google
• Only 775 URLs are indexed as VideoObject, even though rich snippets are embedded in all snapshot pages
Demo
• Search for “Chris Eppstein”
• As a result, the landing page opens and the video starts playing from the time indicated in the Tweet containing the keywords “Chris Eppstein”
Conclusions and Future Work
Conclusions
• Introduced the Media Fragment Indexing Framework
• Proposed using social media to acquire more annotations for media fragments
• Surveyed MF URI implementations on major video sharing platforms
• Twitter Media Fragment Indexer
– Monitors the Tweet stream and automatically creates media fragment annotations
– Indexes media fragments in Google
– YouTube is the most important domain for sharing media fragments on Twitter
Future Work
• How valid are Tweets as media fragment annotations?
– much noisy and unrelated text
– many retweets
• Experiment on a larger scale (billions of Tweets and continuous monitoring)
• Extend the methodology to other media fragment annotations, such as timed text
• Extract named entities from Tweets and further link media fragments to the Linked Data Cloud
Questions?