
Mia Imc07 Sumitted

Date post: 04-Jun-2018
Upload: mario-lamenha
  • 8/13/2019 Mia Imc07 Sumitted


    I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System

    Meeyoung Cha

    Haewoon Kwak

    Pablo Rodriguez

    Yongyeol Ahn

    Sue Moon

    Telefonica Research Lab, Barcelona, Spain

    KAIST, Daejeon, Korea

    ABSTRACT

    User Generated Content (UGC) is re-shaping the way people watch video and TV, with millions of video producers and consumers. In particular, UGC sites are creating new viewing patterns and social interactions, empowering users to be more creative, and developing new business opportunities. To better understand the impact of such UGC systems, we have analyzed YouTube, the world's largest UGC VoD system. Based on a large amount of collected data, we provide an in-depth study of YouTube and other similar UGC systems. In particular, we study the popularity life-cycle of videos, the intrinsic statistical properties of requests and their relationship with video age, and the level of content aliasing or of illegal content in the system. We also provide insights on the potential for more efficient UGC VoD systems (e.g., utilizing P2P systems or making better use of caching). Finally, we discuss the opportunities to leverage the latent demand for niche videos that are not reached today due to information filtering effects or other system scarcity distortions. Overall, we believe that the results presented in this paper are crucial in understanding UGC systems and their inefficiencies, which can have tremendous commercial and technical consequences.

    Keywords

    User-generated contents, VoD, P2P, Caching, Popularity analysis, Content aliasing

    1. INTRODUCTION

    Video content in standard Video-on-Demand (VoD) systems has historically been created and supplied by a limited number of media producers, such as licensed broadcasters and production companies. Content popularity was somewhat controllable through professional marketing campaigns. The advent of user-generated content (UGC) has re-shaped the online video market enormously. Nowadays, hundreds of millions of Internet users are self-publishing consumers. The content length is shortened by two orders of magnitude and so is the production rate. Wired magazine refers to this small-sized content pop culture as "bite-size bits for high-speed munching" [31].

    Footnote: The data traces used in this paper will be shared for wider community use in due time at http://an.kaist.ac.kr/YouTube Trace 2007.html .

    The scale, dynamics, and decentralization of UGC videos make the old mode of content popularity prediction impractical. UGC popularity is more ephemeral and behaves much less predictably. As opposed to the early days of TV, when everyone watched the same program at the same time, such strong reinforcement of popularity (or unpopularity) is much more diluted in UGC. Constant waves of new videos and the convenience of the Web are quickly personalizing the viewing experience, leading to great variability in user behavior and attention span. At the same time, the corresponding lack of editorial control in UGC is creating problems of copyright infringement, which seriously threaten the future viability of such systems.

    Understanding the popularity characteristics is important because it can bring forward the latent demand created by bottlenecks that distort the popularity distribution. It also greatly affects the strategies for marketing, targeted advertising, recommendation, and search engines.

    To understand the nature and the impact of UGC systems, in this paper we analyze YouTube, the world's largest UGC VoD system. The main contribution of this paper is an extensive trace-driven analysis of UGC video popularity distributions. We have collected a large amount of data from YouTube and another UGC system, Daum. Our analysis reveals interesting properties regarding the distribution of requests across videos, the evolution of viewers' focus, and the shifts in popularity. Such analysis is pivotal in understanding some of the most pressing questions regarding UGC opportunities. Our analysis also reveals key results regarding the level of piracy and the level of content duplication in such systems, which could have major implications for the deployment of future UGC services.

    The highlights of our work can be summarized as follows:

    1. We compare some prominent UGC systems with other standard VoD systems such as Netflix and Lovefilm. We highlight the main differences between the two kinds of systems and point out interesting properties regarding content production, consumption patterns, and user participation.

    2. By analyzing the popularity distributions from various categories of UGC services and by tracking their evolution over time, we show that the popularity distribution of UGC exhibits power-law with truncated tails. We discuss several filtering mechanisms that create a truncated power-law distribution. Based on this, we estimate the potential benefits arising from leveraging the latent demand that is hidden due to the filtering effects.

    3. The increasing amount of traffic generated by UGC is a pressing issue for both ISPs and content providers due to the exploding mass of videos. We provide insights into more efficient UGC VoD systems by making better use of caching and utilizing a peer-to-peer (P2P) technique in UGC distribution.

    4. Content aliasing and illegal uploads are critical problems of today's UGC systems, since they can hamper the efficiency of UGC systems and cause costly lawsuits, respectively. We measure the prevalence of content duplication and illegal uploads in UGC, and their impact on various system characteristics.

    The rest of the paper is organized as follows. Section 2 describes our trace methodology and the key characteristics of UGC. In Section 3, we analyze the popularity distribution of UGC and the forces that shape it. Section 4 investigates how the popularity of videos evolves over time. Section 5 considers the potential for server workload and bandwidth savings via caching and P2P techniques. Section 6 focuses on the level of content duplication and illegal uploads in UGC. Finally, we present related work in Section 7 and conclude in Section 8.

    2. METHODOLOGY AND PROPERTIES

    This section introduces our data collection process and the general properties of the measured UGC videos.

    2.1 Data Collection

    Our dataset consists of meta-information about user-generated videos from the YouTube and Daum UGC services. YouTube, the largest UGC site world-wide, serves 100 million distinct videos and 65,000 uploads daily [6]. Daum UCC, the most popular UGC service in Korea, is well-known for its high-quality videos (streaming at up to 800 kb/s) and serves 2 million visitors and 35 million views weekly [1].

    We crawled the YouTube and Daum sites and collected meta-information about videos by visiting their indexed pages that link all videos belonging to a category. Due to the massive scale of YouTube, we limited our data collection to two of the categories: "Entertainment" and "Science & Technology" (now called "Howto & DIY"). Throughout this paper, we simply refer to them as Ent and Sci. For Daum, we have collected video information from all the categories. Each video record contains fixed information, such as the uploader, the upload time, and the length, and time-varying information, such as views, ratings, stars, and links. Views and ratings indicate the number of times the video has been played or evaluated by users. Stars indicate the average score from rating, and links indicate the list of external web pages hyper-linking the video. Our trace includes multiple snapshots of video information taken daily across 6 days for Sci. These multiple snapshots give insights into the actual request patterns and the popularity evolution of UGC videos. Table 1 summarizes our datasets with basic statistics.

    Our trace does not contain information about individual user requests. However, our analysis focuses on video popularity evolution, aggregated request distribution, and other statistics that do not require detailed knowledge of such individual users' behavior.

    2.2 UGC versus Non-UGC

    To begin with, we present several distinctive features of our UGC video trace. To put things in perspective, we highlight the key differences and similarities between UGC and non-UGC (or professionally generated content). For comparison purposes, we use data from three representative non-UGC services. Netflix, a popular online video rental store, makes customer ratings of its 17,770 videos publicly available at [4], and we include this data in our comparison. We additionally crawled the web site of Lovefilm [3], Europe's largest online DVD rental store, and Yahoo! Movies [5] for meta-information about their movie collections. Our Lovefilm dataset contains the video length and the director. Our Yahoo dataset contains the daily top ten US Box Office Chart from 2004 to March 2007, and the corresponding theater gross. Table 2 summarizes the non-UGC dataset.

    Table 2: Summary of non-UGC traces

    Trace     # Videos  Period       Description
    Netflix   17,770    Oct 2006     Customer ratings
    Lovefilm  39,447    Jan 2007     Length and director
    Yahoo     361       2004 - 2007  Theater gross income

    2.2.1 Content Production Patterns

    Table 1: Video trace summary and statistics.

    Name     Category  # Videos   Tot. views     Tot. length  Data collection period
    YouTube  Ent       1,687,506  3,708,600,000  15.2 years   Dec 28, 2006 (crawled once)
    YouTube  Sci       252,255    539,868,316    1.8 years    Jan 14 - 19, 07 (daily), Feb 14, 07, Mar 15, 07 (once)
    Daum     All       196,037    207,555,622    1.0 year     Mar 1, 2007 (crawled once)

    One key characteristic of UGC is the fast content production rate. As we have reported in [28], the scale of production for UGC differs strikingly from that of non-UGC. IMDb, the largest online movie database, carries 963,309 titles of movies and TV episodes produced since 1888, up until now [2]. In contrast, YouTube enjoys 65,000 daily new uploads, which means that it takes only 15 days in YouTube to produce the same number of videos as in IMDb.
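The 15-day figure follows from a single division. A quick check, using only the two numbers quoted in the text (the function name is an illustrative choice, not from the paper):

```python
# Figures quoted in the text: IMDb catalogue size [2] and YouTube daily uploads [6].
IMDB_TITLES = 963_309
YOUTUBE_UPLOADS_PER_DAY = 65_000


def days_to_match(catalogue_size: int, uploads_per_day: int) -> float:
    """Days of uploads needed to reach a catalogue of the given size."""
    return catalogue_size / uploads_per_day


days = days_to_match(IMDB_TITLES, YOUTUBE_UPLOADS_PER_DAY)
print(f"{days:.1f} days")  # a little under 15 days, which the text rounds to 15
```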

    UGC requires less production effort than non-UGC. Accordingly, the number of distinct publishers is massive for UGC. The average number of posts per publisher, however, is similar for UGC and non-UGC (e.g., 90% of film directors publish fewer than 10 movies). Interestingly, there exist extremely heavy publishers in UGC, who post over 1,000 videos over a few years. In contrast, the largest number of movies produced by a single director scales only up to a hundred movies over half a century.

    Next, the length of UGC videos varies across categories. The Daum CF category shows the shortest median length of 30 seconds, while Daum Music Video shows the longest median length of 203 seconds. Compared with non-UGC, the UGC video length is shorter by two orders of magnitude. The median movie length in Lovefilm is 94 minutes.

    2.2.2 User Participation

    The video popularity and ratings (i.e., the number of viewers who evaluated the video) show a strong linear relationship for both UGC and non-UGC, with a correlation coefficient of 0.8 for YouTube and 0.87 for Yahoo. This is an interesting observation, because it indicates that users are not biased towards rating popular videos more than unpopular ones.

    Despite the Web 2.0 features that encourage user participation, the level of active user participation is very low in YouTube. While 54% of all videos are rated, the aggregate ratings account for only 0.22% of the total views. Comments, a more active form of participation, account for a mere 0.16% of total views. While we are not able to verify this from VoD traces, other Web 2.0 sites also report similar trends of relatively low user involvement [11].

    2.2.3 How Content Is Found?

    We now examine the pages that link to YouTube videos. Based on the Sci trace, 47% of all videos have incoming links from external sites. The aggregate views of these linked videos account for 90% of the total views, indicating that popular videos are more likely to be linked. Nevertheless, the total clicks derived from these links account for only 3% of the total views, indicating that views coming from external links are not very significant. We have identified that the top five web sites linking to videos in YouTube Sci are myspace.com, blogspot.com, orkut.com, Qooqle.jp, and friendster.com; four of them are social networking sites, and one is a video recommendation site.

    3. IS UGC TRAFFIC POWER-LAW?

    Analyzing the exact form of a probability distribution not only helps us understand the underlying mechanism, but also helps us answer important design questions in UGC services. This is true in multiple other areas. For instance, the study of the scale-free nature of Web requests has brought insights into improving search engines and advertising. Similarly, understanding the distribution of book sales in an online store helps online retailers estimate their lost opportunities due to poor item categorization or description, or naive recommendation engines [10,20,34].

    The power-law has been increasingly used to model various statistics appearing in computer science and its applications. A distinguishing feature of a power-law is a straight line in the log-log plot of views versus frequency. However, there are some distributions (e.g., log-normal) that show an almost straight-line waist across a few orders of magnitude in a log-log plot. Therefore it is a non-trivial task to determine whether a certain distribution is power-law or log-normal, unless the plot shows a clear straight line across several orders of magnitude [16,18,30,32,35]. The shape of a distribution implies the underlying mechanism that generates it. Normally, the power-law distribution arises from the rich-get-richer principle, while the log-normal distribution arises from the law of proportionate effect (see footnote 1).

    In the real world, the shape of the natural distribution can be affected for various reasons (e.g., bottlenecks in the system). In fact, many distributions whose underlying mechanism is power-law fail to show clear power-law patterns, especially at both ends of the distribution: the most popular and the least popular items. In the case of movies in cinemas [9], the distortion may come from the lack of enough movie theaters, where niche content is not seen as much as it should be. This is a distribution bottleneck, and bringing such content online removes it.

    However, this is not the only bottleneck that modifies the shape of a distribution. For example, Netflix data (Figure 1) does not show a power-law pattern for the non-popular videos. This is an information bottleneck, and relates to the fact that users cannot easily discover niche content, or that content is not properly categorized or ranked (see footnote 2). The latent demand for products that cannot be reached because of inefficiencies in the system can have tremendous commercial and technical consequences [10]. No wonder Netflix recently launched the $1 million Netflix Prize to improve its recommendation engine [4].

    Footnote 1: The log-normal distribution is very similar to the normal distribution; the difference is that it arises from a multiplicative process, not an additive one.

    [Figure 1: log-log plot for Netflix. x-axis: Video rank; y-axis: Ratings.]

    Figure 1: Empirical plot of ranks against ratings, with a synthetic power-law plot for ranks [1-100].

    In this section, we study the statistical properties of YouTube video popularity. We first examine how skewed requests are across videos. Then we delve deeper into the actual statistical properties of the system, focusing on how user requests are distributed across popular and non-popular content, and discuss the potential factors that shape such distributions. To provide different points of comparison, we will use traces from both UGC and non-UGC services.

    Footnote 2: Note that we plot customer ratings rather than views since this was the only data available [4]. However, we have observed from other VoD and UGC sites that ratings and views are related by a linear relationship (see Section 2.2.2). Thus the general distribution presented in this plot should not differ greatly when plotting rank against views.

    3.1 Pareto Principle

    The Pareto Principle (or 80-20 rule) is widely used to describe the skewness in a distribution. Such skewness tells us how niche-centric the service is and is useful in re-adjusting the design principle of the system. To test the Pareto Principle, we count the number of views for the r-th least popular videos and show it in Figure 2. The horizontal axis represents the videos sorted from the most popular to the least popular, with video ranks normalized between 0 and 100. The graph shows that the 10% most popular videos account for nearly 80% of views, while the remaining 90% of videos account for less than 20% of views. This result is quite surprising, since in other online systems the 90% least popular files contribute a much larger portion of the total number of views. For instance, analysis of a large VoD system in China, PowerInfo, shows that the 90% least popular VoD files account for 40% of all requests [36]. One would expect that broader availability of videos enhances the diversity of users' requests and results in a wider spread of requests across files. However, counter-intuitively, the requests on YouTube seem to be highly skewed towards popular files.
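The skewness test described above amounts to sorting videos by views and measuring the share held by the top fraction. A minimal sketch follows; the function name `top_share` and the synthetic 1/rank view counts are assumptions for illustration and are not the paper's trace or tooling.

```python
import numpy as np


def top_share(views, top_fraction=0.10):
    """Fraction of all views captured by the most popular `top_fraction` of videos."""
    v = np.sort(np.asarray(views, dtype=float))[::-1]  # most popular first
    k = max(1, int(round(top_fraction * len(v))))      # number of videos in the top group
    return v[:k].sum() / v.sum()


# Synthetic stand-in for a trace (assumption): the rank-r video receives views ~ 1/r.
ranks = np.arange(1, 100_001, dtype=float)
synthetic_views = 1.0 / ranks

share = top_share(synthetic_views, 0.10)
print(f"top 10% of videos hold {share:.1%} of views")
```

On a real trace, `views` would be the per-video view counts from one snapshot; the synthetic input is only there to make the sketch self-contained.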

    A nice immediate implication of this skewed distribution is that caching can be made very efficient, since storing only a small set of objects can produce high hit ratios. That is, by storing only the 10% of videos that are popular in the long term, a cache can serve 80% of requests. We revisit this issue in Section 5.1, where we hypothesize a global cache for YouTube and assess its performance. Another implication is that YouTube is not so niche-centric and serves mostly popular content. It is disputable whether this phenomenon is a signature characteristic of UGC, as opposed to commercial videos, or a consequence of YouTube's video categorization or recommendation. We expect that a better recommendation engine would mitigate the strong dominance of the popular content and shift users' requests toward non-popular content. High skewness in popularity is also confirmed by the Daum data, as shown in Figure 2.

    [Figure 2: x-axis: Normalized video ranks (100 to 0); y-axis: Fraction of aggregate views (0 to 1). Series: Daum, YouTube Sci.]

    Figure 2: Skewness of user interests across videos

    3.2 Statistical Properties

    We now analyze the intrinsic statistical properties of UGC video popularity. Here we will use two different representations of the popularity distribution. Each representation serves to analyze the behavior of a different type of video. In particular, we use a plot of views against the complementary cumulative number of views (i.e., frequency) and a plot of video ranks against views. The first representation focuses on the most popular videos and has been widely used by many researchers to determine whether a given distribution exhibits certain statistical properties (e.g., power-law). The second representation shows the behavior of unpopular videos and has recently been used by Anderson [10] to understand the so-called "Long Tail" potential of non-popular content. These two plots are in fact transposed versions of one another and represent the same quantity [34].
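To make the relation between the two representations concrete, here is a small sketch that builds both curves from the same list of view counts. It assumes the complementary cumulative plot counts the number of videos with at least x views; the function names and the toy data are illustrative only.

```python
import numpy as np


def rank_view_curve(views):
    """(rank, views) pairs, with rank 1 assigned to the most popular video."""
    v = np.sort(np.asarray(views))[::-1]
    return np.arange(1, len(v) + 1), v


def ccdf_curve(views):
    """(x, count) pairs: number of videos with at least x views, per distinct x."""
    v = np.sort(np.asarray(views))
    xs = np.unique(v)  # distinct view counts, ascending
    # Videos with >= x views are those at or after the first position holding x.
    counts = len(v) - np.searchsorted(v, xs, side="left")
    return xs, counts


toy_views = [50, 20, 20, 5, 1, 1, 1]
ranks, by_rank = rank_view_curve(toy_views)
xs, counts = ccdf_curve(toy_views)
print(list(zip(ranks.tolist(), by_rank.tolist())))
print(list(zip(xs.tolist(), counts.tolist())))
```

In this toy data the count for x = 20 is 3, which is also the lowest rank position occupied by a video with 20 views: swapping the axes of one curve gives the other.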

    3.2.1 Popular Content Analysis

    [Figure 3: three log-log panels. x-axis: Views; y-axis: Number of videos with ≥ x views. (a) YouTube empirical plot: YouTube Ent (V=1.6M) and YouTube Sci (V=250K), with fitted exponents 2.3 and 2.5. (b) Daum empirical plot: Daum Travel (V=8574) and Daum Food (V=1385), with fitted exponents 1.5 and 1.9. (c) Curve fitting for YouTube Sci: Power, Power + exp cutoff, Lognormal, Exponential.]

    Figure 3: Video popularity distribution of YouTube and Daum follows power-law in the waist, with varying exponent from 1.5 to 2.5. YouTube Sci and Daum Food exhibit sharp decay in the tail of hot content.

    Figures 3(a) and (b) display the popularity distribution of videos of four representative categories from YouTube and Daum. All of them exhibit power-law behavior (a straight line in a log-log plot) across more than two orders of magnitude. The fitted power-law exponents are also shown in the figure. However, the YouTube Sci and Daum Travel categories show a sharp decay for the most popular content. To examine the truncation in detail, Figure 3(c) shows the plot of Sci with the best-fit curves of power-law, log-normal, exponential, and power-law with an exponential cutoff. A log-normally distributed quantity is one whose logarithm is normally distributed. Power-law with an exponential cutoff has an exponential decay term $e^{-\lambda x}$ that overwhelms the power-law behavior at large values of $x$. For $x < 1/\lambda$, it is almost identical to a normal power-law, and for $x > 1/\lambda$, to a normal exponential decay.
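For reference, the four candidate forms compared in the curve fitting can be written in a common notation. The symbols $\alpha$, $\lambda$, $\mu$, $\sigma$ and the normalizing constant $C$ are notation assumed here for illustration; the text above does not fix them, and the fitted values are not reproduced.

```latex
\begin{align*}
\text{power-law:} \quad
  & p(x) = C\,x^{-\alpha} \\
\text{power-law with exponential cutoff:} \quad
  & p(x) = C\,x^{-\alpha}\,e^{-\lambda x} \\
\text{log-normal:} \quad
  & p(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
           \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right) \\
\text{exponential:} \quad
  & p(x) = \lambda\,e^{-\lambda x}
\end{align*}
```

Taking logarithms of the second form gives $\log p(x) = \log C - \alpha \log x - \lambda x$. On a log-log plot the first two terms form a straight line of slope $-\alpha$, and the term $-\lambda x$ bends the curve downward once $x$ becomes comparable to $1/\lambda$, which is the sharp decay seen for the most popular videos.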

    Our fitting result suggests that truncation at the tail follows power-law with an exponential cutoff. Daum Travel shows a similar result. Video popularity also seems category-dependent. Popularity distributions of other Daum categories (not shown here) showed high variability: some do not follow a power-law distribution, while others follow a power-law distribution but with varying exponents. Nonetheless, all of them showed a power-law waist, with most of them having a truncated tail that is best fit by power-law with an exponential cutoff.

    While there is significant difficulty in determining whether a certain distribution is power-law or not, we next consider the case where the innate shape of the popularity distribution is power-law and the exponential cutoff arises from the limitation on the number of videos and from users' behavior. There are several mechanisms that generate power-law distributions, but the simplest and most convincing one is the Yule process (also rephrased as preferential attachment or the rich-get-richer principle) [12,27,37]. In UGC, this process can be translated as follows: if k users have already watched a video, then the rate at which other users watch the video is proportional to k. We will now investigate why a power-law distribution can have a sharp decay for the most popular content.
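The rich-get-richer rule can be illustrated with a small simulation. The sketch below is an assumption-laden toy and not the authors' model: it uses a growing population in the spirit of the Yule/Simon process, where `new_video_prob` (a hypothetical parameter) is the chance that a step uploads a new video rather than adding a view to an existing one.

```python
import random
from collections import Counter


def simulate_preferential_views(num_steps=100_000, new_video_prob=0.1, seed=0):
    """Return {video_id: views} for a toy rich-get-richer process.

    Each step either adds a new video with one view (probability new_video_prob)
    or adds one view to an existing video chosen with probability proportional
    to its current view count k.
    """
    rng = random.Random(seed)
    view_log = [0]   # one entry per view; the value is the id of the viewed video
    num_videos = 1
    for _ in range(num_steps):
        if rng.random() < new_video_prob:
            view_log.append(num_videos)            # a new video enters with one view
            num_videos += 1
        else:
            # A uniform pick from the log selects a video in proportion to its views.
            view_log.append(rng.choice(view_log))
    return Counter(view_log)


views = simulate_preferential_views()
ranked = sorted(views.values(), reverse=True)
print("videos:", len(ranked), "top 5 view counts:", ranked[:5])
```

A handful of early videos end up with view counts orders of magnitude above the median, the heavy-tailed outcome that the preferential attachment argument predicts. No truncation appears here because nothing limits repeated views.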

    Power-law with a truncated tail appears frequently in the degree distributions of various real-world networks such as the WWW, protein networks, e-mail networks, actor networks, and scientific collaboration networks [19,33]. Several models have been suggested to explain the cause of this truncation. We will review two models and investigate whether they are applicable to our case.

    First, Amaral et al. [8] suggested that the aging effect can yield truncation. Consider a network of actors, where every actor will stop acting in time. This means that even a very highly connected vertex will eventually stop receiving new links. However, the aging effect does not apply to our case, because videos across all ages show a truncated tail. In fact, as we will see later in the paper, our daily trace shows that 80% of the videos requested on a given day are older than 1 month, contradicting the hypothesis of an aging effect in our case.

    Second, Mossa et al. [33] suggested a network model to explain the degree distribution of the WWW. Along with preferential attachment, the model adopts the concept of information filtering, which means that a user cannot consider all the information but receives information from only a fraction or a fixed number of existing pages. Due to this information filtering process, preferential attachment is hindered and the exponential cutoff appears. Information filtering is surely present in both UGC and standard VoD services. However, highly popular videos are prominently featured within these VoD services to attract more viewers, and thus it is unlikely that information filtering causes truncation in our case.

    [Figure 4: two log-log panels. y-axis: Number of videos with ≥ x views. (a) Varying the requests per user (R) and the number of users (U); x-axis: Views for V=1000 videos; series: HitOnce with U=10000 R=10, U=2000 R=50, U=10000 R=50, U=2000 R=10, and power-law. (b) Varying the number of videos (V); x-axis: Views for V=100 videos; series: HitOnce with U=10000 R=10, U=5000 R=20, U=2000 R=50, and power-law.]

    Figure 4: Numerical simulation on the impact of fetch-at-most-once on the tail distribution

    A study by Gummadi et al. [23] gives us some hints on the truncated tail. In their study of file popularity in P2P downloads, they attribute the distortion to the "fetch-at-most-once" behavior of users. That is, unlike in WWW traffic, where a single user fetches a popular page (e.g., CNN) many times, P2P users fetch each object at most once. Given a fixed number of users, U, the number of videos, V, and the average number of requests per user, R, the authors simulate P2P downloads with two types of user populations: Power and HitOnce. Both user groups make requests based on the same initial Zipf file popularity. However, the Power group may request videos multiple times, and the HitOnce group at most once. A HitOnce user will make multiple draws until a new item is requested. The resulting popularity graph for HitOnce users appears truncated, compared to a straight line for Power users [23].

    UGC also has fetch-at-most-once-like behavior: since video content does not change (i.e., it is immutable), viewers are not likely to watch the same video multiple times, as they do for mutable web objects. Expanding the work in [23], we suggest that other system characteristics such as R and V, in combination with fetch-at-most-once, have a major impact on forming the truncated tail. To verify this numerically, we repeat the simulation described above, but with varying parameters for U, R, and V. In our setting, the Zipf parameter is set to 1.0 for the initial video popularity.
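A minimal sketch of such a simulation, under stated assumptions: every user issues exactly R requests (the text speaks of an average), and a HitOnce user's redraw-until-new behavior is modeled as weighted sampling without replacement, which has the same distribution. The function and variable names are illustrative, and the original simulator of [23] is not reproduced.

```python
import numpy as np


def simulate_views(num_users, num_videos, requests_per_user, hit_once,
                   zipf_s=1.0, seed=0):
    """Per-video view counts for Power (hit_once=False) or HitOnce users."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, num_videos + 1, dtype=float)
    popularity = ranks ** -zipf_s          # Zipf popularity, rank 1 most popular
    popularity /= popularity.sum()
    views = np.zeros(num_videos, dtype=np.int64)
    for _ in range(num_users):
        # HitOnce: no repeats within a user (requires requests_per_user <= num_videos).
        picks = rng.choice(num_videos, size=requests_per_user,
                           replace=not hit_once, p=popularity)
        np.add.at(views, picks, 1)         # also counts repeated picks correctly
    return views


U, V, R = 2000, 1000, 50                   # one of the settings shown in Figure 4(a)
power = simulate_views(U, V, R, hit_once=False)
once = simulate_views(U, V, R, hit_once=True)
print("most viewed video: Power =", power.max(), " HitOnce =", once.max())
```

Because a HitOnce user can contribute at most one view per video, no video can exceed U views in that population, which is where the truncation of the most popular items comes from. Raising R or lowering V pushes more videos against that ceiling.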

    Figure 4 shows the resulting video popularity in a plot of views against the cumulative number of videos. We make several observations from Figure 4(a). First, compared with Power, HitOnce results in a truncated tail, as expected. Interestingly, the truncated tail gets amplified as the number of requests per user, R, increases. If R is small, then the fetch-at-most-once effect does not take place. With increased R, the fetch-at-most-once effect starts playing a bigger role, since there is a higher chance that a particular user is drawn towards the same popular file multiple times. Second, adding more users to the system, U, increases views per video (shifting the plot along the x-axis). However, the overall shape of the graph does not change, indicating that U has little impact on the tail truncation. Finally, when increasing both R and U (from U = 2000, R = 10 to U = 10000, R = 50), the tail shape changes in a similar way as when only R increases. Note that larger R and U values represent the case where new users are added to the system and old users make more and more requests (thus R increases). This intuitively captures what happens in real UGC systems. In fact, our trace also shows similar trends. Figure 5 shows the popularity distribution of Sci over short and long-term windows. A long-term window represents large R and U values. The plot of popularity during one day (i.e., small R) exhibits a clear power-law decay, while for longer terms, the distribution exhibits a truncated tail as in Figure 4(a).

    [Figure 5: log-log plot for YouTube Sci videos. x-axis: Views; y-axis: Number of videos with ≥ x views. Series: One day, One week, All time.]

    Figure 5: Tail shape over different time-windows

    Another factor that can greatly impact the shape of a distribution is the number of videos, V. Figure 4(b) shows the results of the same simulation, repeated for a smaller number of videos (V = 100). If V is small, the fetch-at-most-once effect becomes amplified, since there is only a small number of videos to choose from. This results in a highly truncated tail, as shown in Figure 4(b) for the case of U = 2000, R = 50. We can also verify this empirically from our plots of YouTube and Daum data. Let us revisit the plots in Figures 3(a) and (b). We observe that the tail cutoff is much more pronounced for categories with a smaller number of videos, such as Sci in the case of YouTube and Food in the case of Daum.

    So far, we have focused on the popularity distribution of popular content and shown, via numerical simulations and empirical validation, that the tail truncation is affected by both the average number of requests per user and the number of videos in a category. Next, we move on to the non-popular portion of the distribution.

    3.3 Studying The Long Tail

    Anderson, in his book [10], asserted that there exist huge opportunities in the unlimited number of non-popular items, which he calls "the economics of the Long Tail". Here we investigate the Long Tail opportunities in UGC services. In particular, the following questions are of interest: what is the underlying distribution of non-popular items, what shapes the distribution in one way or another, and how much benefit the Long Tail can bring for UGC services.

    Let us look into the distribution of the non-popular content. We use a plot of video ranks against views, where unpopular videos are put at the tail. This representation, suggested by Zipf, has been used to observe Zipf's law. Figure 6(a) shows such an empirical plot of Sci videos on a log-log scale. The figure shows a Zipf-like waist (a straight line in a log-log plot) with a truncated tail. When we perform a goodness-of-fit test with several distributions, the truncated tail fits best with Zipf with an exponential cutoff, as clearly shown in the figure. Log-normal is the second best fit.

    However, as stressed before, it is hard to decide whether it is a Zipf distribution modulated by a removable bottleneck or just a natural log-normal distribution. Identifying the true nature of the distribution is hugely important because it can affect strategies for marketing, targeted advertising, recommendation, and search engines. In the following, we list the potential causes for the truncated tail and discuss how they apply to our scenario:

    Natural shape of UGC is truncated: User-generated content, by definition, varies widely in its quality. One may argue that the natural shape of the popularity distribution of UGC is truncated, since a significant fraction of videos in UGC are produced for small audiences (e.g., family members), as opposed to professionally generated content, which is produced for much wider audiences. For most of the UGC categories we examined, goodness-of-fit suggests Zipf with an exponential cutoff as the best fit, rather than a log-normal. However, it is unlikely that such a distribution captures the natural user behavior. Zipf (like any power law) is scale-free in nature, while the exponential is a distribution that is scaled or limited in size. Therefore the two will rarely appear coherently and naturally as a single mechanism. Rather, a more likely scenario is that the underlying mechanism is Zipf, but a bottleneck in the system truncates the tail.

    Sampled publishing (pre-filters): The plot of Netflix in Figure 1 shows a sharp decay in the tail. This can be explained by sampling bias. Even though Netflix provides an enormous online catalog of DVDs world-wide, its videos are a set of movies sampled from all the movies ever made; only a small portion of movies world-wide are made into DVD titles. In UGC services, publishers post videos sampled from the video pool in their possession. However, they may only upload those that they consider most interesting. The following process explains how such pre-filtering affects the shape of a distribution: Consider a complete list of $N$ videos, whose popularity distribution follows Zipf. Then let us remove $h$ videos from the set, such that the probability of a video being removed is proportional to the inverse order of its rank. The remaining $N - h$ videos will have a truncated tail.

    Information filtering (post-filters): Search or recommendation engines typically return a small number of hits, compared with the total number of items that are indexed as relevant [15, 33]. The impact of these post-filters has been extensively analyzed by Fortunato et al. [20], who show that post-filters yield truncated distributions. If we assume that UGC too is truncated in the non-popular items due to post-filters, then older videos should have more pronounced truncation than younger ones (as older videos have been exposed longer to the filtering effect). Indeed, we are able to observe this in our trace. Figure 6(b) shows the popularity distributions of Sci videos of different ages. Videos aged 1 day are clearly less truncated in the tail than older ones. The graphs for older videos show popular items gaining more views, the slope at the waist becoming steeper, and the tail becoming more truncated. This reinforces the case for post-filtering, where top videos are more likely to be favored in the way they are presented to the users, and this impact gets amplified as time passes, since non-popular videos will rarely be brought to users' attention.
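The pre-filtering process described in the list above can be sketched as a small simulation. This is only an illustration under stated assumptions: the helper names `zipf_views` and `prefilter`, the Zipf exponent of 1, and the reading that a video's removal weight grows linearly with its rank (so unpopular videos are dropped more often) are all choices made here, not details given in the text.

```python
import random


def zipf_views(n, top_views=1_000_000.0, alpha=1.0):
    """Ideal Zipf popularity: views of the video at rank r are top_views / r**alpha."""
    return [top_views / rank ** alpha for rank in range(1, n + 1)]


def prefilter(views, h, seed=0):
    """Remove h videos, picking victims with weight equal to their rank.

    Assumption: 'inverse order of ranks' is read as 'the less popular,
    the more likely to be withheld'. Weighted sampling without replacement
    uses the Efraimidis-Spirakis key u**(1/weight); the h largest keys lose.
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / rank), rank)
             for rank in range(1, len(views) + 1)]
    removed = {rank for _, rank in sorted(keyed, reverse=True)[:h]}
    # Survivors keep their relative order, so the list is already re-ranked.
    return [v for rank, v in enumerate(views, start=1) if rank not in removed]


full = zipf_views(10_000)
kept = prefilter(full, h=3_000)
# At every surviving rank the view count is at or below the ideal Zipf value,
# and the gap widens toward the tail, which is the truncation effect.
```

Plotting `kept` against its new ranks on log-log axes should show the waist following the Zipf line while the tail bends downward.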

    The above discussion reflects an important observation, since it suggests that the truncated Long Tail represents a latent demand that could potentially be brought forward with adequate recommendation and search engines, better tagging techniques, and easier video posting. If Zipf is the natural shape and the truncated tail is due to some removable bottleneck (e.g., filtering), then in a system with no bottleneck the videos in the truncated region would gain their deserved views, offering users better chances to discover rare niche videos and offering the company potential business opportunities. We estimate the benefit from removing the bottleneck of the system. The estimate is defined as the ratio of the aggregated additional views to the existing total views. Table 3 shows the measured benefits for the four UGC video categories. We also present the number of videos that may benefit. YouTube Ent and Sci show great opportunities in the Long Tail economics (42-45% potential improvement), due to the large number of videos that can benefit. In Daum Travel and Food, in contrast, the benefit is smaller because only a small number of videos benefit. When the number of videos is small, the inefficiencies of the system (due to filtering effects) are smaller, since information can be found more easily.

    Figure 6: Ranks versus views plot for YouTube Sci videos. (a) YouTube tail fitting of non-popular videos (log-log axes: video ranks vs. views; curves: YouTube Sci, Zipf, Zipf + exp cutoff, Lognormal, Exponential). (b) Popularity distribution of videos with varying ages (log-log axes: video rank vs. views; curves: 1 day old, 3 days old, 1 week old, 3 months old).

    Table 3: Potential gain from the Long Tail in terms of additional views and the number of beneficiary videos

            Ent           Sci           Travel      Food
    Gain    45% (1.2M)    42% (240K)    4% (5K)     14% (400)
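The gain reported in Table 3 is defined as the ratio of aggregated additional views to existing total views. A minimal sketch of such an estimate follows. It assumes that the "deserved" views come from extrapolating a least-squares Zipf line fitted on the waist of the rank-views curve; the fitting procedure, the function names, and the choice of waist range are assumptions made for illustration and are not specified in the text.

```python
import math


def fit_zipf(ranks, views):
    """Least-squares fit of log(views) = a - alpha * log(rank). Returns (a, alpha)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in views]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, -slope


def long_tail_gain(views, waist):
    """views: view counts sorted by rank (rank 1 first).
    waist: (lo, hi) 1-indexed inclusive rank range used for the fit (assumed input).
    Returns (additional views / existing total views, number of beneficiary videos).
    """
    lo, hi = waist
    a, alpha = fit_zipf(range(lo, hi + 1), views[lo - 1:hi])
    extra, beneficiaries = 0.0, 0
    for rank, observed in enumerate(views, start=1):
        predicted = math.exp(a) * rank ** (-alpha)
        if rank > hi and predicted > observed:   # only the truncated tail can gain
            extra += predicted - observed
            beneficiaries += 1
    return extra / sum(views), beneficiaries
```

The second return value corresponds to the parenthesized counts in Table 3, under the same assumptions.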

    4. POPULARITY EVOLUTION OF UGC

    As opposed to standard VoD systems, where the content popularity fluctuation is rather predictable (via strategic marketing campaigns of movies), UGC video popularity can be ephemeral and behave much more unpredictably. Similarly, as opposed to the early days of TV, when most people watched the same program at the same time, such temporal correlation is much more diluted in UGC. Videos come and go all the time, and the viewing patterns also fluctuate based on how people get directed to such content, through RSS feeds, web reviews, blogs, e-mail, or other recommendation networks. To better understand this temporal focus, in this section we analyze the UGC video popularity evolution over time. Our analysis is conducted from two different angles. We first analyze whether requests concentrate on young or old videos. Then we investigate how fast or slow popularity changes for videos of different ages, and further test whether the future popularity of a video can be predicted. For the analysis, we use the daily trace of YouTube Sci videos.

    4.1 Popularity Distribution Versus Age

    To examine the age distribution of requested videos, we first group videos by age (binned every five days) and count the total volume of requests for each age group. Figure 7 displays the maximum, median, and average requests per age group. We only consider videos that are requested at least once during the trace period. The vertical axis is in log scale. For very young videos (e.g., newer than 1 month), we observe a slight increase in the average requests, which indicates that viewers are mildly more interested in new videos than in the rest. However, this trend is not very pronounced when we examine the plot of maximum requests. Some old videos also receive significant requests. In fact, our trace shows that 80% of the videos requested on a given day are older than 1 month, and this traffic accounts for 72% of total requests. The plot becomes noisy for age groups older than 1 year, due to the small number of videos. In summary, if we exclude the very new videos, users' preference seems relatively insensitive to videos' age.

    Figure 7: Distribution of request volume across videos' age, based on Sci daily trace (x-axis: videos grouped by age in days, from 0 to 1.5 years; y-axis: daily requests per age group, log scale; curves: Max, Median, Average; annotation: 20% new videos receive 28% of requests).
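The age-binned statistics behind Figure 7 can be sketched as follows. The input layout (a list of `(age_in_days, daily_requests)` pairs) and the function name are assumptions for illustration; the five-day bins and the rule of ignoring videos with no requests follow the description in Section 4.1.

```python
from collections import defaultdict
from statistics import mean, median


def requests_by_age(videos, bin_days=5):
    """Group videos into age bins and summarize daily requests per bin.

    videos: iterable of (age_in_days, daily_requests) pairs (assumed layout).
    Returns {bin_start_day: (max, median, mean)}; videos that received
    no request are skipped.
    """
    groups = defaultdict(list)
    for age, requests in videos:
        if requests > 0:
            groups[(age // bin_days) * bin_days].append(requests)
    return {start: (max(reqs), median(reqs), mean(reqs))
            for start, reqs in sorted(groups.items())}


sample = [(0, 10), (3, 30), (4, 0), (7, 5)]
summary = requests_by_age(sample)
# The two videos aged 0 and 3 days share the first bin; the video with
# zero requests is ignored; the 7-day-old video falls in the second bin.
```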

    While users' interest is insensitive to video age on a gross scale, most of the top requests on a given day seem to target recent videos in Figure 7. To verify this further, we look into the age distribution of the top twenty most requested videos. Figure 8 shows the result for time windows of a day, a week, a month, and all time. For each plot, we used two snapshots, taken the corresponding period apart, and ranked videos based on the increase in their views. For the plot of all time, we assume the initial views of videos are zero. Over a one-day period, roughly 50% of the top twenty videos are recent. However, as the time window increases, the median age shifts towards older videos. This suggests ephemeral popularity of young videos. To better understand its effect, in the next section we discuss the video popularity evolution over time.

    Figure 8: Age distribution of top 20 videos (x-axis: age of top 20 videos in days, 0 to 450; y-axis: CDF; curves: Daily record, Weekly record, Monthly record, All time record).
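The ranking used for Figure 8 compares two snapshots and orders videos by the increase in their views. A minimal sketch, assuming each snapshot is a dictionary from video id to cumulative views (the data layout and function names are hypothetical):

```python
from statistics import median


def top_gainers(before, after, k=20):
    """Return the k video ids with the largest increase in views.

    before, after: {video_id: cumulative_views} snapshots (assumed layout).
    Videos absent from `before` start at zero, which also covers the
    'all time' window where every video starts at zero.
    """
    gains = {vid: after[vid] - before.get(vid, 0) for vid in after}
    return sorted(gains, key=gains.get, reverse=True)[:k]


def median_age(video_ids, age_days):
    """Median age, in days, of the given videos."""
    return median(age_days[vid] for vid in video_ids)


before = {"a": 100, "b": 50}
after = {"a": 110, "b": 90, "c": 70}
ages = {"a": 400, "b": 30, "c": 1}
top2 = top_gainers(before, after, k=2)   # "c" gained 70 views, "b" gained 40
```

Repeating this for snapshot pairs taken a day, a week, and a month apart gives the age samples whose CDFs are compared in the figure.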

    4.2 Temporal Focus

    We now continue our discussion on video popularity and investigate how the popularity of individual UGC videos evolves over time, how fast or slow it changes, and whether the future popularity of a video can be predicted.

    4.2.1 Probability of Videos Being Watched Over Time

    When a video is posted, it has zero views; gradually videos gain views over time. To capture this trend in UGC videos, in Figure 9 we show the percentage of videos aged X days having V views. We provide several viewpoints by considering a range of V values from 0 to 10,000. The graph shows that after a day, 90% of videos are watched at least once, and 40% are watched over 10 times. After a longer period of time, more videos gain views, as expected. One noticeable trend in the graph is the consistent dips at certain times (e.g., 1 day, 1 month, 1 year). These points seem to coincide with the time classification made by YouTube in their video categorization. From this plot, we can see that the slope of the graph seems to decay as time passes. Noting the log scale of the horizontal axis, this indicates that the probability of a given video being requested decreases sharply over time. In fact, if we consider the case of V = 10, the probability that a given file gets more than 10 requests over the duration of the first 24 hours, 6 days, 3 weeks, and 11 months is 0.43, 0.18, 0.17, and 0.14, respectively. This indicates that videos are more likely to get most of their requests soon after they are posted. Conversely, if a video does not get enough requests during its first days, it is unlikely to get many requests in the future. Based on these observations, we next test whether it is possible to predict a video's future popularity.

    4.2.2 Predicting Near-Future Popularity

    The ability to predict future popularity is immensely useful in many ways, because service providers may pre-populate these videos within multiple proxies or caches, and content owners may use this fast feedback to better manage their contents (e.g., production companies releasing trailers to predict popularity). We now explore the possibility of using early view records to predict near-future popularity. We compare the first few days' video views with those after some period of time (i.e., 5, 7, and 90 days). Table 4 shows the correlation coefficient of views for combinations of snapshots. We also present the number of videos used for sampling. Our results show that the second-day record gives a more accurate estimate than the first-day record, in fact at a relatively high accuracy (correlation coefficient above 0.8). This is due to the time it takes for videos to become known and start ramping up the popularity curve. Using the third-day record improves the prediction accuracy, yet only marginally. The result also shows a high correlation with the second-day record even for more distant future popularity (e.g., three months afterwards).

    Figure 9: Probability of videos being watched over time, based on YouTube Sci trace (x-axis: videos with X age, from 1 day to 2 years; y-axis: videos with V views (%); curves: V=0, V=5, V=10, V=100, V=1000, V=10000).

    Table 4: Correlation coefficient of video views in two snapshots (number of videos analyzed)

    Age ($x_0$)    $x_0$+5 days old    7 days old       90 days old
    1 day old      0.5885 (7221)       0.8776 (3394)    0.5561 (11884)
    2 days old     0.9665 (5185)       0.8793 (3394)    0.8425 (11215)
    3 days old     0.9367 (3394)       0.9367 (3394)    0.8525 (9816)
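Each entry of Table 4 pairs the views of the same videos in an early and a later snapshot and reports a correlation coefficient together with the number of videos used. A minimal sketch of that computation, assuming the Pearson coefficient and a dictionary-per-snapshot layout (both are assumptions here; the text only says "correlation coefficient"):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def early_vs_later(views_early, views_later):
    """Correlate views of videos present in both snapshots.

    views_early, views_later: {video_id: views} (assumed layout).
    Returns (correlation coefficient, number of videos analyzed).
    """
    common = sorted(set(views_early) & set(views_later))
    r = pearson([views_early[v] for v in common],
                [views_later[v] for v in common])
    return r, len(common)
```

A coefficient close to 1 means that the early ordering and magnitude of views carry over to the later snapshot, which is the sense in which the second-day record is called a good predictor.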

    4.2.3 Popularity Shifts

    Now we examine how easy or hard it is for new and old videos to become very popular as a function of their age. To observe this, we first look at how the video rank changes against the video age. In Figure 10(a), we use two snapshots from our 6-day trace, taken at day zero and day 5, and consider only those videos that appear in both snapshots. We group videos by their age (binned in units of ten days) and plot the change in ranks (i.e., $\Delta$rank) over age. For each age group, we plot the maximum, top 99th percentile, average, and minimum change in rank. The vertical axis ranges from -4059 to 235132, which indicates that some videos dropped in rank by 4059 during the trace period, while some jumped up 235132 ranks.

    Figure 10: Changes in ranking and popularity. (a) Popularity distribution based on $\Delta$rank (x-axis: video age binned every 10 days, 0 to 600 days; y-axis: $\Delta$rank over one week; curves: Maximum, 99 percentile, Average, Minimum). (b) Popularity distribution based on $\Delta$rank $\times$ $\Delta$views (x-axis: video age binned every 10 days, 0 to 600 days; y-axis: $\Delta$rank $\times$ $\Delta$req over one week, log scale; curves: Maximum, 99 percentile, Average).

    We make several observations from Figure 10(a). First, young videos can change many rank positions very fast, while old videos have a much smaller rank fluctuation, indicating a more stable ranking classification for old videos. Still, some of the old videos also increased their ranks dramatically. This could indicate that old videos are able to ramp up the popularity ladder and become popular after a long time, e.g., due to Long Tail effects and good recommendation engines. However, it is hard to conclude this from Figure 10(a), since a few requests may also result in major rank changes. We will revisit this issue at the end of this section.

    The gap between the maximum and the top 99th percentile lines reflects that only a few young videos (e.g., less than 1%) make large rank changes, indicating that only a very small percentage of the young videos make it to the top popular list while the rest have much smaller ranking changes. We also see a consistent minimum $\Delta$rank line at nearly -4000 across all age groups. A detailed look reveals that those videos did not receive any request during the trace period; their ranking was pushed back as other videos got at least one request. This shows that unpopular videos that do not receive any request will die in the ranking chart at a rate of 2000 positions per day.

    As discussed before, when it comes to identifying major shifts in the popularity distribution, considering the actual change in views or ranks alone is not enough. Videos can get many requests but make a minor rank change, and vice versa; a large rank change could be due to very few requests (e.g., from zero to five requests). To identify videos that both made dramatic rank changes and received a large number of requests, we propose using the product ($\Delta$rank $\times$ $\Delta$views), as in Figure 10(b). The vertical axis is in log scale. Now we observe more drastic popularity shifts for young videos; hardly any old video received a significant number of requests to make a major upward shift in the popularity distribution. In short, a revival-of-the-dead effect, where old videos are suddenly brought up to the top of the chart, does not seem to happen strongly in our trace.
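The combined metric can be sketched as follows. The snapshot layout, the function names, and the tie-breaking rule are assumptions for illustration; a positive rank change here means the video moved up the chart.

```python
def rank_by_views(views):
    """Map video_id -> rank (1 = most viewed); ties broken by id for determinism."""
    ordered = sorted(views, key=lambda vid: (-views[vid], vid))
    return {vid: position for position, vid in enumerate(ordered, start=1)}


def popularity_shift(before, after):
    """Compute (delta_rank, delta_views, product) per video.

    before, after: {video_id: cumulative_views} (assumed layout).
    Only videos present in both snapshots are ranked and compared.
    """
    common = before.keys() & after.keys()
    r0 = rank_by_views({vid: before[vid] for vid in common})
    r1 = rank_by_views({vid: after[vid] for vid in common})
    shifts = {}
    for vid in common:
        d_rank = r0[vid] - r1[vid]            # positive: climbed the chart
        d_views = after[vid] - before[vid]
        shifts[vid] = (d_rank, d_views, d_rank * d_views)
    return shifts


before = {"a": 100, "b": 50, "c": 0}
after = {"a": 110, "b": 60, "c": 5000}
shifts = popularity_shift(before, after)
# "c" climbs two positions with 5000 new views, so its product dominates;
# "a" and "b" each slip one position despite gaining 10 views.
```

The product suppresses videos that jump many positions on a handful of requests, because their change in views is small.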

    5. EFFICIENT UGC SYSTEM DESIGN

    With the increasing popularity of UGC, YouTube alone is estimated to carry an astonishing 60% of all videos online, serving 100 million distinct videos daily [6]. This corresponds, in our estimation, to a massive 50-200 Gb/s of server load as well as access bandwidth in a traditional server-client model. Accordingly, network operators are reporting a rise in overall Web traffic and, as a second aspect, a rise in HTTP video streaming [7]. In this section, we provide insights on the potential for more efficient UGC system designs in terms of caching and Peer-to-Peer (P2P) techniques.

    5.1 Better Use of Caching

    Caching stores redundant copies of a file near the end user and has been proven to be extremely effective in many Web applications. Several factors affect the caching efficiency: the cache size, the number of users and videos, the correlation of requests, the shifts in popularity, and so on. In this section, we hypothesize a global cache system for YouTube and assess its efficiency using our 6-day daily trace. Our interest is in investigating the cache performance under massive new uploads and dynamic popularity evolution. We consider the following three simple caching schemes:

    1. A static finite cache, where at day zero the cache is filled with long-term popular videos. The cache content is not altered during the trace period.

    2. A dynamic infinite cache, where at day zero the cache is populated with all videos ever requested in the Sci category, and thereafter stores any other videos requested during the trace period.

    3. A hybrid finite cache, which works like the static cache, but where there is an extra cache portion that stores the daily most popular videos.

    We populate the static cache with the long-term popular videos that account for 90% of total traffic, according to the Pareto Principle analysis. This corresponds to 16% of videos in Sci. The dynamic infinite cache simply stores all the videos ever requested. In the hybrid finite cache, the cache is first populated with the top 16% of Sci videos, and the cache also allocates a small extra space to store the daily top 10,000 videos. We perform a trace-driven simulation to assess the cache performance in terms of the required cache size and the cache miss ratio. To do this, we replay the 6-day trace under our three cache scenarios and calculate the average hit and miss ratios over multiple days. We simply use the number of videos cached as the cache size, because the video length and the encoding rate do not vary much across files. Table 5 summarizes the cache performance. The results indicate that about 40% of the videos are requested new each day. However, the volume of requests for such videos is very small; they account for only about 20% of the requests. In fact, we see that a simple static cache that stores the top long-term popular files uses 84% less space than a dynamic infinite cache solution, at the cost of only 23% extra missed volume. We should also mention that, by adapting to changes in daily requests, a hybrid cache improves the cache efficiency by about 10% compared to the static cache.

    Table 5: Synthetic cache efficiency

    Type       Size       # Missed videos     Missed volume
    Static     41,456     115,002 (48.8%)     5,093,832 (26.7%)
    Dynamic    256,647    4,683 (1.9%)        648,376 (3.4%)
    Hybrid     51,456     94,893 (40.3%)      3,271,649 (17.1%)
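A trace-driven replay of the three schemes can be sketched in a few lines. This is a simplified illustration, not the simulator behind Table 5. It assumes the trace is a list of per-day dictionaries mapping video id to request count, that the hybrid cache's extra portion holds the previous day's top-k videos, and that all of a video's requests on a day count as missed if the video is absent from the cache at the start of that day (the trace has only daily granularity). All names are hypothetical.

```python
def miss_stats(day, cache):
    """Return (missed videos, missed request volume) for one day of requests."""
    missed = {vid: n for vid, n in day.items() if vid not in cache}
    return len(missed), sum(missed.values())


def replay(days, static_set, known_set, top_k):
    """Replay daily request counts against the three caching schemes.

    days: list of {video_id: requests} per day (assumed layout).
    static_set: long-term popular videos loaded at day zero.
    known_set: all videos ever requested before the trace starts.
    top_k: size of the hybrid cache's extra portion.
    """
    totals = {"static": [0, 0], "dynamic": [0, 0], "hybrid": [0, 0]}
    dynamic = set(known_set)
    extra = set()                                  # hybrid's daily-popular portion
    for day in days:
        caches = {"static": static_set,
                  "dynamic": dynamic,
                  "hybrid": static_set | extra}
        for name, cache in caches.items():
            videos, volume = miss_stats(day, cache)
            totals[name][0] += videos
            totals[name][1] += volume
        dynamic.update(day)                        # keep everything requested today
        # Assumption: tomorrow's extra portion is today's top-k by requests.
        extra = set(sorted(day, key=day.get, reverse=True)[:top_k])
    return {name: tuple(t) for name, t in totals.items()}


days = [{"a": 10, "c": 7, "b": 5}, {"a": 8, "c": 4, "d": 2}]
result = replay(days, static_set={"a"}, known_set={"a", "b"}, top_k=2)
```

In this toy trace the hybrid cache misses less volume than the static one because video "c" enters its extra portion after the first day, which mirrors the direction of the improvement reported in Table 5.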

    5.2 Potential for P2P

    In this section we explore the potential benefits of a P2P technique in UGC distribution. In a P2P system, users (or peers) share videos they watched for a certain period of time. A new user may fetch videos from other peers who are concurrently online and have the content of interest, rather than fetching from the server. Inherently, a P2P system is effective only when there are enough online peers sharing content; this is called a torrent. Here we investigate the potential benefits a P2P technique can bring to YouTube. However, unlike the existing network environments where P2P has shown great efficiency [21], the massive scale of videos, the small-sized content, and the ephemeral popularity make it unclear whether P2P will be as effective in UGC. Therefore we first assess the feasibility of P2P in UGC by examining how many files can benefit from a P2P approach. We then perform a trace-driven analysis to measure how much server workload can be saved using P2P, compared to the traditional server-client model.

    We commence by estimating the inter-arrival times of requests. Our trace provides a granularity of requests of up to a day, and it shows that the daily requests of an individual video vary across the monitored period. We exploit this daily granularity and assume that the inter-arrival times of requests within a single day are exponentially distributed. Within a day, the inter-arrival time of requests has a mean of $1/\lambda$, where $\lambda$ is the intensity of requests (i.e., the number of requests made that day). This inter-arrival time will shortly be used to calculate the number of concurrent users online. Figure 11 shows the CDF of the average inter-arrival times per video. We observe that over a quarter of videos are requested more frequently than every 10 minutes.

    Figure 11: Inter-arrival times of requests (x-axis: inter-arrival time, from 1 sec to 5 days; y-axis: CDF; curve: YouTube Sci).

    Next we calculate the number of concurrent users per video (i.e., the torrent size). The torrent size depends on the duration and the frequency of users in the P2P system. The system time of a user is important because P2P sharing may happen only when the user is online. We consider the following four cases of P2P system time: 1) the length of the video the user is watching, 2) the duration of time the user spends on YouTube, 3) one hour, and 4) one day. In the first case, users share videos only while they are watching them. In the second case, users share videos while they are using YouTube. According to Nielsen/NetRatings [13], the average time spent by a user at YouTube is currently 28 minutes. We hence assume users may share videos for 28 minutes in our second case. In the last two cases, we consider users sharing videos even when they are no longer in the system. We mention that this may become a reality in the future (e.g., users equipped with always-on set-top boxes that run P2P).

    Then, for a given P2P system time of a user, $t$, and the inter-arrival time of requests, $1/\lambda$, the expected number of concurrent users is $t\lambda$. Note that this value can be less than 1, indicating that there are times within that day with no users watching the video. We consider the P2P approach only when $t\lambda$ is greater than one (i.e., more than one user watched a video). When $t\lambda \le 1$, we simply apply the traditional server-client model. Figure 12(a) shows the CDF of the average concurrent users over the monitoring period per video. We observe that in most cases the expected number of concurrent users, $t\lambda$, is less than 1, indicating that only a few videos will benefit from P2P. However, when users share videos for a longer period of time (e.g., 1 day), P2P may assist 60% of videos with at least 10 concurrent users all the time.
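The quantities above reduce to simple arithmetic once the intensity is expressed per second. A minimal sketch, with hypothetical function names and the assumption that a day's requests are spread over 86,400 seconds:

```python
SECONDS_PER_DAY = 86_400


def mean_interarrival(requests_per_day):
    """Mean inter-arrival time 1/lambda in seconds, lambda in requests per second."""
    return SECONDS_PER_DAY / requests_per_day


def expected_concurrent_users(requests_per_day, system_time_s):
    """Expected torrent size t * lambda for a P2P system time of t seconds."""
    lam = requests_per_day / SECONDS_PER_DAY
    return system_time_s * lam


def use_p2p(requests_per_day, system_time_s):
    """P2P is considered only when more than one concurrent user is expected."""
    return expected_concurrent_users(requests_per_day, system_time_s) > 1


# A video requested 144 times a day sees one request every 10 minutes on
# average; with a 28-minute system time that is about 2.8 concurrent users.
print(mean_interarrival(144), expected_concurrent_users(144, 28 * 60))
```

With the same 28-minute system time, a video requested 10 times a day has an expected torrent size well below one and would fall back to the server-client model.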

    While the number of files that can benefit from P2P comes out relatively small, this does not necessarily mean P2P is inefficient for UGC. As we have seen in the previous sections, UGC requests are highly skewed and temporal. Therefore, we investigate the benefits of P2P by comparing the estimated server workload between the traditional client-server and P2P-assisted distribution approaches. In a client-server model, each request is directly served by the server. In the P2P-assisted model, peers participate in streaming only when there are concurrent users. As a measure of server workload, we use the total length of the streamed content. Figure 12(b) compares the server workload based on trace-driven analysis. Our results show that the potential of P2P is actually very large. The server workload is reduced by 41% even when users share videos only while they are watching. When users share videos for one day, the server workload is reduced by a tremendous 98.7% compared to a client-server approach.

    Figure 12: Potentials of a P2P system. (a) The number of concurrent online users (x-axis: expected number of concurrent users per video, log scale; y-axis: CDF; curves: Video length, 28 minutes, 1 hour, 1 day). (b) Server workload savings against server-client model (x-axis: Video length, 28 minutes, 1 hour, 1 day; y-axis: server streaming time in seconds; bars: Server-Client at 100%, P2P-assisted at 59%, 41%, 26%, and 1.3%).

    6. ALIASING AND ILLEGAL UPLOADS

    Content aliasing and illegal uploads are critical problems of today's UGC systems, since they can hamper the efficiency of UGC systems as well as cause costly lawsuits. In this section, we study the prevalence of content duplication and illegal uploads in UGC, and their impact on various system characteristics.

    6.1 Content Aliasing

    Traditional VoD services offer differently encoded versions of the same video, typically to support diverse downward streaming bandwidths. In UGC, there often exist multiple identical or very similar copies for a single popular event. We call this group of videos aliases, and this new phenomenon content aliasing. Multiple copies of video for a single event dilute the popularity of the corresponding event, as the number of views is distributed over multiple copies. This has a direct impact on the design of recommendation and ranking systems, as it is no longer straightforward to track the popularity of an event from a single view count, nor to present users with unique videos instead of numerous identical copies.

    To estimate the prevalence of aliases, we have conducted the following experiment. We first sample 216 videos from the top 10,000 videos of the YouTube Ent category. Then we ask our 51 volunteers to view a few videos and read the title and description. After viewing some from our sample set, volunteers search YouTube using keywords of their choice and flag any video they deem to pertain to the same event as an alias (footnote 3). Our volunteers have identified 1,224 aliases for 184 videos out of the original 216. Most videos have 1 to 4 aliases, while the maximum number of aliases is 89. Out of all videos that pertain to the same event, we call the video with the earliest upload time the original.

    Figure 13: Sum of all views of the original and aliases versus views of original videos (log-log axes, both from 10^3 to 10^8; x-axis: views of originals; y-axis: sum of original and aliases views; reference line: y=x).

    Figure 13 shows the sum of views from all aliases and the original video against the number of views of the original video. For a few videos, the sum of views from aliases is more than two orders of magnitude larger than the views of the original. This clearly demonstrates the popularity dilution effect of content aliasing. Undiluted and augmented by the views of its aliases, the original video could have been ranked much higher.
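The dilution shown in Figure 13 can be quantified per event with a few lines. The data layout (a list of `(upload_time, views)` pairs per event) and the function name are assumptions for illustration; the definition of the original as the earliest upload follows the text.

```python
def dilution(event_videos):
    """Compare the original's views with the event's combined views.

    event_videos: list of (upload_time, views) for all videos of one event
    (assumed layout). The original is the earliest upload and is assumed
    to have at least one view.
    Returns (original_views, total_views, total_views / original_views).
    """
    _, original_views = min(event_videos, key=lambda video: video[0])
    total_views = sum(views for _, views in event_videos)
    return original_views, total_views, total_views / original_views


# Hypothetical event: the earliest upload has 1,000 views while two later
# aliases have 200,000 and 50,000.
print(dilution([(5, 200_000), (1, 1_000), (9, 50_000)]))
```

A ratio above 100 corresponds to the points that sit more than two orders of magnitude above the y=x line.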

    Figure 14: Number of aliases against age differences (x-axis: age differences, 0 to 700 days, binned by 5; y-axis: count, 0 to 120).

    Next, we analyze the time intervals between aliases. We plot the age differences between the original video and its aliases in Figure 14 (the bin size is 5 days). A large number of aliases are uploaded on the same day as the original video or within a week. To examine how the number of views has changed, we plot the views of aliases, normalized against those of the original, against the age difference in Figure 15. One conspicuous point represents an alias that showed up more than 200 days later than the original and received almost 1000 times more views. This particular video was originally listed in the Music category, and later posted in the Comedy category with many more views. We find it rather surprising to see so many aliases still appear 100 or more days after the original video. Digging deeper into those aliases that have 10 times or more views than the original and were uploaded 100 or more days later, we find that they belong to different categories from the original and have been cross-posted over multiple categories. These aliases could be a potential reason for the flattened popularity tail. We leave further investigation of this delayed popularity for future work.

    Figure 15: Normalized views against age differences (x-axis: age difference, 0 to 700 days; y-axis: normalized views, log scale from 10^-6 to 10^3).

    3 We have created a webpage http://beta.kaist.ac.kr for volunteers to view the video along with the description, and then search for content aliases in YouTube.

    Those aliases that turn up 100 days later with far fewer views are likely to serve personal archiving purposes. The Pearson correlation coefficient of the plot in Figure 15 is 0.004. It signifies little correlation, that is, no decrease in the number of views over time. With a good number of aliases that appear more than 100 days later yet gather more views, we discern no clear trend in the aliases and their views over time.

    Finally, we check for the existence of heavy alias uploaders. Suspecting a strong motivation for online popularity, we wondered whether they post aliases of already popular videos. Our data, however, shows that over 80% of all aliases are by one-time uploaders, and the maximum number of aliases by one uploader is 15.

    6.2 Illegal Uploads

    UGCs derived from copyrighted contents raise a serious legal dilemma for UGC service providers. In a sense, aliases can be considered to a great extent as a form of video spam. A recent study from Vidmeter [25] suggests that nearly 10 percent of videos in YouTube are uploaded without the permission of the content owner. Vidmeter's report covers only the top ranked UGCs. We augment Vidmeter's work by looking not only at the top ranked videos, but at all videos in Ent.

    We get the list of all videos at two different times and compare the two lists. The discrepancy represents the deleted videos. When we follow the links to the deleted videos, YouTube offers a notice about the reason behind the deletion. Possible reasons are: removed by users, terms of use violation, copyright claim, and restricted access. From the first set of videos (1,687,506), the number of deleted videos is 6,843 (0.4%). Only about 5% of deleted videos have violated copyright law, which is a far smaller number than Vidmeter's.
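The comparison of the two video lists amounts to a set difference, followed by a tally of the deletion notices. A minimal sketch, assuming each list is a collection of video ids and the notices are available as a dictionary (both the layout and the names are hypothetical):

```python
from collections import Counter


def deleted_videos(first_list, second_list):
    """Video ids present in the first crawl but missing from the second."""
    return set(first_list) - set(second_list)


def deletion_summary(first_list, second_list, reasons):
    """Return (share of deleted videos, count of deletions per stated reason).

    reasons: {video_id: notice text} collected by following the dead links
    (assumed input); videos without a notice are tallied as 'unknown'.
    """
    deleted = deleted_videos(first_list, second_list)
    share = len(deleted) / len(set(first_list))
    by_reason = Counter(reasons.get(vid, "unknown") for vid in deleted)
    return share, by_reason


# The share quoted in the text follows from the reported counts:
print(round(6_843 / 1_687_506 * 100, 1))   # percentage of deleted videos
```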

    7. RELATED WORK

    We have already incorporated many of the references that closely relate to our work in the previous sections of the paper. As this work covers a broad spectrum of topics, from popularity analysis to web caching and P2P streaming, we next briefly summarize related work.

    Large-scale video-on-demand (VoD) streaming services have become popular in recent days, while UGC services have grown explosively. Among the numerous UGC sites, YouTube, MSN, Google Video, and Yahoo! Video are the notable ones. Due to the relatively short history of UGC, little work has been done on the characteristics of UGC or on comparisons to traditional VoD systems. One of the first related works on video popularity is that of Griwodz et al., using video rental records [22]. Recently, Yu et al. [36] presented an in-depth analysis of access patterns and user behaviors in a centralized VoD system. Newman [34] carried out a comprehensive study of power-law distributions. He examined several examples of power laws: web hits, copies of books sold, telephone calls, etc. Also, a paper by Alderson et al. develops an interesting and rich theory for scale-free networks [29].

    The idea of P2P streaming has been extensively explored in recent works in the context of patch updates, VoD, etc. [14, 21, 24, 26]. Most existing work on P2P VoD systems [17, 24] has concentrated on protocol design under various topological constraints and on the analysis of simulation results. Our study considers the potential for P2P delivery in large-scale UGC systems, which have unique characteristics in terms of user consumption patterns and video popularity distribution.

    8. CONCLUSIONS

    In this paper we have presented an extensive data-driven analysis of the popularity distribution, popularity evolution, and content duplication of user-generated video contents. To the best of our knowledge, this work is the first major stab at understanding the explosive growth of UGC and its implications on underlying in-
