From QoS to QoE: A Tutorial on Video Quality Assessment

Yanjiao Chen, Student Member, IEEE, Kaishun Wu, Member, IEEE, and Qian Zhang, Fellow, IEEE

Abstract—Quality of Experience (QoE) is the perceptual Quality of Service (QoS) from the users' perspective. For video services, the relationship between QoE and QoS (such as coding parameters and network statistics) is complicated because users' perceptual video quality is subjective and varies across environments. Traditionally, QoE is obtained from subjective tests, in which human viewers evaluate the quality of test videos in a laboratory environment. To avoid the high cost and offline nature of such tests, objective quality models have been developed to predict QoE from objective QoS parameters, but this remains an indirect way to estimate QoE. With the rising popularity of video streaming over the Internet, data-driven QoE analysis models have newly emerged thanks to the availability of large-scale data. In this article, we give a comprehensive survey of the evolution of video quality assessment methods, analyzing their characteristics, advantages, and drawbacks. We also introduce QoE-based video applications, and finally identify future research directions for QoE.

Index Terms—Quality of Experience, Subjective Test, Objective Quality Model, Data-driven Analysis

I. INTRODUCTION

With the exponential growth of video-based services, it becomes ever more important for video service providers to cater to the quality expectations of end users. It is estimated that the sum of all forms of video (TV, video-on-demand (VoD), Internet, and P2P) will account for 80% ∼ 90% of global consumer traffic by 2017 [1]. Video streaming over the Internet, especially through mobile networks, is becoming more and more popular. Worldwide, Internet video traffic will be 69% of all consumer Internet traffic by 2017 [1], and mobile video traffic will be over one third of mobile data traffic by the end of 2018 [2].

In early works, researchers tried to increase user perceptual video quality by appropriately selecting QoS parameters (such as video compression optimization [3]–[5] and network bandwidth allocation [6]–[8]). In [5], the authors study the relationship between the peak signal-to-noise ratio and the quantization parameter, and propose a linear rate-quantization model to optimize quantization parameter calculation. In [8], the authors present a dynamic network resource allocation scheme for high-quality variable bitrate video transmission, based on the prediction of future traffic patterns. While monitoring and controlling the QoS parameters of the video transmission system is important for achieving high video quality, it is more crucial to evaluate video quality

Yanjiao Chen and Qian Zhang are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. Kaishun Wu is with the College of Computer Science and Software Engineering, Shenzhen University, and the HKUST Fok Ying Tung Research Institute.

from the users' perspective, which is known as Quality of Experience (QoE), or user-level QoS. QoE-based video quality assessment is difficult because user experience is subjective and hard to quantify and measure. Moreover, the advent of new video compression standards, the development of video transmission systems, and the advancement of consumer video technologies all call for a new and better understanding of user QoE. Video quality assessment has gone through four stages, as shown in Fig. 1. Table I gives a comparison of these video quality assessment methods.

• Stage I (QoS Monitoring): objective QoS parameters, e.g., fastest bit rate, shortest delay.
• Stage II (Subjective Test): subjective QoE, e.g., MOS, DMOS.
• Stage III (Objective Quality Model): an objective Video Quality Metric (VQM) that correlates well with MOS.
• Stage IV (Data-driven Analysis): measurable QoE metrics, e.g., viewing time, probability of return.

Fig. 1: Video quality assessment evolution

QoS monitoring for video traffic includes two parts: QoS provisioning from the network and QoS provisioning from the video application. QoS support from the network, especially wireless or mobile networks, is essential for video delivery over the Internet. Three major approaches are congestion control, error control, and power control. The challenges facing network QoS support include unreliable channels, bandwidth constraints, and heterogeneous access technologies. QoS support from the video application includes advanced video encoding schemes, error concealment, and adaptive video streaming protocols. A survey of video QoS provisioning in mobile networks is given in [9], mostly from the network point of view. Error-concealment schemes are investigated in [10]. [11] and [12] consider both network and application QoS support. In this tutorial, we mainly focus on Stages II ∼ IV of video quality assessment. In the main text, we will not discuss Stage I; interested readers can refer to the above surveys for more information.

Subjective test directly measures user QoE by soliciting


TABLE I: Comparison of Video Quality Assessment Methods

Method                  | Direct measure of QoE | Objective or subjective | Real-time | Wide application | Cost
QoS monitoring          | No                    | Objective               | Yes       | Wide             | Not sure
Subjective test         | Yes                   | Subjective              | No        | Limited          | High
Objective quality model | No                    | Objective               | Yes/No    | Limited          | Low
Data-driven analysis    | Yes                   | Objective               | Yes       | Wide             | Not sure

users' evaluation scores in the laboratory environment. Users are shown a series of test video sequences, including both original and processed ones, and are then required to give scores on the video quality. Detailed plans for conducting subjective tests have been made by the Video Quality Experts Group (VQEG) [13]. Though viewed as a relatively accurate way of measuring user QoE, the subjective test suffers from three major drawbacks. First, it has a high cost in terms of time, money, and manual effort. Second, it is conducted in the laboratory environment, with limited test video types, test conditions, and viewer demography; therefore, the results may not be applicable to video quality assessment in the wild. Third, the subjective test cannot be used for real-time QoE evaluation.

To avoid the high cost of subjective tests, objective quality models have been developed. The major purpose is to identify the objective QoS parameters that contribute to user perceptual quality, and to map these parameters to user QoE. Subjective test results are often used as the ground truth to validate the performance of the objective quality models. Most objective quality models are based on how the Human Visual System (HVS) receives and processes the information in the video signals. One commonly used method is to quantify the difference between the original video and the distorted video, then weigh the errors according to the spatial and temporal features of the video. However, the need to access the original video hinders online QoE monitoring. To develop QoE prediction models that do not depend on original videos, network statistics (such as packet loss) and spatiotemporal features extracted or estimated from the distorted video are leveraged. Though some objective quality models can realize real-time QoE prediction (e.g., [14]–[27]), this is still an indirect way to predict QoE. Most objective quality models rely on subjective test results to train model parameters; therefore, these models cannot be widely applied due to the limitations of the subjective test.

Data-driven video quality analysis has emerged as a promising way of solving the problems faced by the previous methods. Video streaming over the Internet has made large-scale data available for analyzing user QoE. How to effectively leverage these valuable data is both challenging and promising. There are two ongoing trends in data-driven video quality assessment. The first trend is from user quality of "experience" to user quality of "engagement". Instead of user opinion scores, which can only be obtained from subjective tests, QoE metrics that can be easily quantified and measured without much human interference are being explored, for example, the viewing time, the number of watched videos, and the probability of return. The second trend is from small-scale lab experiments (e.g., the VQEG FRTV-I subjective test involved 287 viewers [28], and the LIVE database involved 31 viewers [29]) to large-scale data mining (e.g., [30] contains 40 million video viewing sessions). Sophisticated models with high computational complexity may work well on small-scale data, but are very likely to be outperformed by simple models in large-scale online QoE evaluation. Developing light-weight, efficient, and reliable QoE prediction models based on big data is the future direction.

There have been several surveys on video quality assessment [31]–[34], mostly focusing on objective quality models. This survey differs from all previous surveys in that it provides a comprehensive overview of the evolution of QoE-based video quality assessment methods. As far as we know, we are the first to include the data-driven QoE analysis models, which have newly emerged and attracted research interest.

The rest of the paper is organized as follows. In Section II, we provide the background of video quality assessment and identify factors that may influence user QoE. In Section III, we give a detailed description of the subjective test. In Section IV, we classify existing objective quality models and introduce representative ones in each class. In Section V, we present the new research progress on data-driven QoE analysis models. In Section VI, applications of video QoE models are reviewed. Future research directions on QoE are discussed in Section VII. We finally summarize our work in Section VIII.

II. BACKGROUND

In this section, we give a brief introduction to the video transmission system, focusing on the factors that may have an influence on user experience by causing video distortions or affecting the viewing environment. In the subjective test, these factors are often considered as test conditions; in the objective quality models, these factors are often used as input for computing the final objective metrics; in the data-driven analysis, these factors are often collected in the data set for QoE prediction.

The video transmission path from the server side to the client side includes the encoder, the transmission network, the decoder, and the display, as shown in Fig. 2. Each of these four components may introduce distortions or impairments that will affect the viewers' perception of the video quality. The resulting distorted


[Figure: the video transmission path runs from the original video through coding & compression (encoder distortions), the transmission network (network distortions), and the decoder to the end user (external factors), producing the distorted video.]

Fig. 2: Video transmission path

videos usually exhibit the following typical visual distortions [35]:

• Blocking effect. The blocking effect refers to the discontinuity at the boundaries of two adjacent blocks. The reason for the blocking effect is that video coding is block-based, that is, individual blocks are coded separately, resulting in different types and levels of coding errors.

• Blurring. Blurring refers to the loss of spatial information or edge sharpness, especially for roughly textured areas or around scene object edges.

• Edginess. Edginess refers to distortions that occur at the edges of an image. The differences between the edge characteristics of the original video and those of the distorted video are often given special attention.

• Motion jerkiness. Motion jerkiness refers to the time-discrete interruption of an originally continuous, smooth scene. This often happens due to delay variance (also known as "jitter"), which will be explained in Section II-B.

The visual impact of the above distortions depends not only on the absolute quantization error, but also on the spatiotemporal features of the video sequence, at both the local level and the global level. The threshold above which a distortion is perceivable is often referred to as the Just Noticeable Difference (JND) [36], [37]. In the JND model, the following characteristics of the Human Visual System (HVS) are most commonly considered [38], [39]:

• Low-level characteristics:
  – Frequency-dependent sensitivity. The HVS has different sensitivity to motion, shape, depth, color, contrast, and lumination. Therefore, different errors will receive different sensitivity from the HVS. The HVS sensitivity decreases as the spatial or temporal frequency increases [40]. Many models use low-pass or band-pass filters to simulate this feature [36], [41]–[44].
  – Masking effect. Under masking conditions, the perception of the visual target is weakened by a masking stimulus in temporal or spatial proximity. A review of the research on visual masking can be found in [45].
• Mid- to higher-level characteristics include attention, eye movement, and different degrees of unpleasantness towards different distortions. For example, looking at an image, the HVS first perceives the global structure and then observes the detailed specifics. This coarse-to-fine-grained process is known as global precedence, an important feature of the HVS [46].

TABLE II: Comparison of Video Compression Formats

Standard | Lossy/Lossless | Major Applications
MPEG-2 | Lossy | DVD, HDTV, Blu-ray Disc
MPEG-4 Part 2 | Lossy | DVD, HDTV, electronic surveillance systems
H.264/MPEG-4 AVC | Lossy | Low/high resolution video, broadcast, DVD, RTP/IP packet networks
MPEG-C Part 3 | N/A | Auxiliary video data format, e.g., stereoscopic video
MPEG-H Part 2 / High Efficiency Video Coding (HEVC) | N/A | Ultra HD video
MPEG-DASH | N/A | Multimedia over the Internet
Multiview Video Coding (MVC) | N/A | Stereoscopic video, free viewpoint television, multiview 3D television
Scalable Video Coding (SVC) | N/A | Video storage, streaming, broadcast, conferencing, surveillance
VP8 | Lossy | Real-time applications like videoconferencing
Dirac | Lossy/Lossless | High-quality video compression for Ultra HDTV and beyond

Interested readers can refer to [47] for a detailed description of the artifacts of video compression and the mechanism of the HVS.

A. Coding and Compression

To transmit rich video content through a capacity-limited network, the original video information needs to be reduced by compression. Compression methods may be lossy or lossless: a lossless compression method can restore the original video, while a lossy method may lead to video quality degradation. Video compression formats define the way to represent the video and audio as a file or a stream. A video codec, a device or software, encodes (compresses) or decodes (decompresses) a digital video based on the video compression format. The encoded video is often combined with an audio stream (encoded based on an audio compression format) to fit in a multimedia container format¹ such as FLV, 3GP, MP4, and WebM. Table II gives a comparison of commonly used video compression formats.

Video compression formats, such as MPEG or H.26x, significantly influence the video quality, because they decide how a video is coded. The following coding-related factors are often taken into consideration for QoE evaluation.

• Bitrate. Bitrate is the rate at which the codec outputs data. Constant bitrate (CBR) or variable bitrate (VBR) may be used. CBR is simple to implement, but it may not allocate enough data for the more complex parts of the video. VBR fixes this problem by flexibly assigning different bitrates according to the complexity of the video segments, but it takes more time to encode. Moreover, the instantaneous bitrate of VBR may exceed the network capacity. Efficient compression formats can use lower bitrates to encode video at similar quality. It has also been shown that a high bitrate does not always lead to high QoE (e.g., frequent bitrate switching annoys viewers [30], [48]). Therefore, bitrate alone is not a reliable measure of video quality.

¹A container format can contain different types of video and audio compression. The container format may also include subtitles, chapter information, and metadata.



• Frame rate. The frame rate is the number of frames per second. The human visual system (HVS) can analyze 10 to 12 images per second [49]. The frame rate threshold, beyond which the HVS perceives no interruption, depends on both the content (e.g., motion) and the display (e.g., lighting). Given a fixed encoding bitrate subject to bandwidth limitations, a higher frame rate means fewer bits per frame, and therefore higher coding and compression distortions. It has been shown that the effect of the frame rate on QoE depends on the temporal and spatial characteristics of the video content [50].

• Temporal and spatial features of the video. Videos with different temporal and spatial features will have different degrees of perceptual quality. For example, videos with low temporal complexity, where the frames are very similar to each other, may suffer less from jitter or packet loss, as the viewers may not notice the delayed or missing frames. However, videos with high temporal complexity, where frames are quite different from each other, may be sensitive to jitter or packet loss, because much information will get lost. A classification of video content based on temporal (e.g., movement) and spatial (e.g., edges, blurriness, brightness) features is given in [51].
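Spatial and temporal complexity are commonly quantified by the SI and TI indicators of ITU-T Rec. P.910. The sketch below is a minimal Python version of these two indicators; it assumes decoded grayscale frames are available as NumPy arrays, and the function names are our own, not taken from [51].

```python
import numpy as np
from scipy import ndimage

def spatial_information(frames):
    """SI per ITU-T P.910: max over time of the std of the Sobel-filtered frame."""
    si = []
    for f in frames:
        gx = ndimage.sobel(f.astype(float), axis=0)
        gy = ndimage.sobel(f.astype(float), axis=1)
        si.append(np.hypot(gx, gy).std())
    return max(si)

def temporal_information(frames):
    """TI per ITU-T P.910: max over time of the std of successive frame differences."""
    ti = [(frames[n].astype(float) - frames[n - 1].astype(float)).std()
          for n in range(1, len(frames))]
    return max(ti)

# Example: 30 random 144x176 "frames" stand in for a decoded clip.
frames = [np.random.randint(0, 256, (144, 176), dtype=np.uint8) for _ in range(30)]
print("SI =", spatial_information(frames), "TI =", temporal_information(frames))
```

A clip with high TI and low SI would be flagged as motion-dominated, the case the text describes as most sensitive to jitter and packet loss.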

B. Transmission Network

Common transmission networks that are considered in the

video QoE research include the television broadcasting network and the Internet. For the television broadcasting network, video quality assessment is usually conducted for different display resolutions, such as standard-definition television (SDTV), enhanced-definition television (EDTV), high-definition television (HDTV), and ultra-high-definition television (UHDTV). For video over the Internet, special attention has been paid to IP networks and wireless networks, the latter including cellular networks (or mobile networks), wireless local area networks (WLAN), sensor networks, and vehicular networks. The video may be delivered by client-server video distribution or P2P video sharing.

The transmission network condition will greatly affect the video quality. Fig. 3 gives a brief illustration of the end-to-end video transmission between the server and the client. There are three major factors that lead to video quality degradation.

• Packet loss, which is due to unreliable transmission.
• Delay, which depends on the network capacity.
• Jitter, also called delay variance, which refers to irregular delays.

If there is only transmission delay (no packet loss or jitter), the video can be played smoothly with the help of a buffer.

[Figure: sender-to-receiver timing diagrams for (a) delay, (b) delay + packet loss, and (c) delay + jitter.]

Fig. 3: Video transmission

With packet loss, the most recent frame may freeze, then jump to the next non-consecutive frame that arrives. Packet loss can be compensated by retransmission at the cost of increased delay and jitter; retransmission is thus a tradeoff between decreased packet loss and increased delay and jitter. With jitter, the most recent frame may freeze until the belated frame arrives. Jitter can be mitigated through buffering, where the receiver plays the frames in the buffer more steadily. Choosing the optimal buffer size is a tradeoff between decreased jitter and increased delay. Some research has found that jitter has nearly the same effect on QoE as packet loss [52].
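To make the buffering tradeoff concrete, the sketch below (our own illustration, not taken from [52]) simulates a fixed playout delay: frames sent at a constant interval arrive with random network jitter, and any frame that misses its scheduled playout time would cause a freeze. Increasing the playout delay reduces freezes at the cost of added latency.

```python
import random

def simulate_playout(n_frames=500, interval=0.040, jitter=0.030, playout_delay=0.060):
    """Count late (frozen) frames for a given playout (buffer) delay.

    Frames are sent every `interval` seconds; network delay is uniform in
    [0, jitter]. Frame n is scheduled at first_arrival + playout_delay + n*interval.
    """
    random.seed(1)
    arrivals = [n * interval + random.uniform(0.0, jitter) for n in range(n_frames)]
    start = arrivals[0] + playout_delay
    return sum(1 for n, a in enumerate(arrivals) if a > start + n * interval)

for d in (0.010, 0.030, 0.060):
    print(f"playout delay {d*1000:.0f} ms -> {simulate_playout(playout_delay=d)} late frames")
```

With a playout delay larger than the worst-case jitter, no frame arrives late; shrinking the buffer trades freezes for lower end-to-end latency.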

C. External Factors

Apart from distortions, there are other factors that affect QoE. These external factors, some of which may not have a direct impact on the video quality, influence users' experience by affecting the viewing environment. The following are some typical external factors:

• Video service type: whether the video is live streaming video or video-on-demand (VoD). In [30], [53], it is assumed that viewers may have different quality expectations of VoD and live streaming video. By separating the two types of videos, the QoE prediction can be improved.

• Viewer demography. Characteristics of the viewers such as age, gender, occupation, nationality, or even education background and economic factors all have some impact on their perceived quality.

• Viewer geography. Studies show that people from different countries have different patience when faced with delays in a service [54].

• Video length. It has been verified that viewer behaviors differ between long videos (e.g., more than 10 minutes) and short videos (e.g., less than 10 minutes).


For example, viewers are likely to be more tolerant of distortions when watching long videos than short videos.

• Video popularity. Viewers tend to be more tolerant of bad QoS for popular videos. However, there is also an interesting finding that more popular videos have shorter viewing sessions [55]. A possible explanation is that popular videos may have been viewed from other sources, and viewers quit because they do not want to watch repeated sessions.

• Device. The devices on which viewers can watch video include TVs, desktop computers, laptops, tablets, smartphones, etc. In particular, the fast-growing popularity of smartphones and tablets has drawn attention to studies of viewer experience on these devices. Viewers may have different expectations when they watch video on different devices. The device also determines the screen size; typical screen sizes include QCIF, CIF, VGA, SDTV, and HDTV [56].

• Time of the day & day of the week. User experience may differ between peak hours and idle hours. It is estimated that viewers may have a better viewing experience in the evening and on weekends, when they are more relaxed and are expected to watch videos for a longer time.

• Connectivity. The major concern is usually the last-mile connection, for example, fiber, cable, DSL, 3G/4G, etc.

Before we discuss each stage of video quality assessment, we first give a brief summary of the related works in Fig. 4.

[Figure: a summary of existing works along four dimensions. QoE metrics: subjective (MOS, DMOS, ...) and objective (viewing time, probability of return, ...). QoS metrics: coding parameters (bitrate, spatial and temporal features) and network parameters (packet loss, delay, jitter); external factors: demography (gender, age, ...), video type (live, VoD), time (peak hour or not), device (smartphone, TV, desktop). QoS-QoE relationship: correlation, regression (linear, non-linear), causality. Validation data set: subjective (user scores) and objective (large-scale data).]

Fig. 4: A summary of existing perceptual video quality assessment works

III. SUBJECTIVE TEST

The subjective test directly measures QoE by asking human assessors to give scores for the quality of the video sequences under test. Subjective test results are often used as the ground truth for validating the performance of the objective quality models in Section IV. In this section, we first describe the conventional procedures for conducting subjective tests in the laboratory context. Then, we give special instructions on the requirements of subjective tests for 3D videos. Finally,

[Figure: the subjective test flow. Test preparation: test environment & equipment setup; source video selection and processing; assessor recruiting. Execution: DSIS, DSCQS, SS, SC. Data processing: data completeness, outlier detection, score consistency. Result presentation: test configuration, information of assessors, test results.]

Fig. 5: Flow of the subjective test

we introduce subjective test crowdsourcing through Internet crowdsourcing platforms.

The flow of the subjective test is shown in Fig. 5.

A. Test Preparation

Test preparation includes checking the test environment, setting up equipment, selecting source videos, processing source videos, and recruiting assessors [57].

1) Test Environment: The subjective test can be conducted in two kinds of environments: the laboratory environment and the home environment, yet nearly all subjective tests are conducted in the laboratory environment. Table III shows the requirements for both environments specified by the International Telecommunication Union (ITU) Recommendation ITU-R BT.500-11 [57].

While the laboratory environment is easier to control, the home environment is closer to the users' real viewing experience. The screen size affects the preferred viewing distance (PVD), at which the viewers have the optimal viewing experience. Therefore, in the test, the viewing distance should be adjusted to satisfy the PVD determined by the screen size. It is suggested that the maximum and minimum resolutions of the monitor be reported, especially for the consumer TV sets used in the home environment.

2) Source Video Selection: As discussed before, the video content will influence user viewing experience. When selecting the source materials, the following factors have to be taken into consideration.

• Color.
• Luminance:
  – High luminance.
  – Low luminance.
• Motion and spatial features:
  – Still images or video sequences.
  – Moving directions of the objects.
• Source origin, e.g., film, news, sports.
• Other factors, e.g., avoiding culturally or gender offensive materials.

3) Source Video Processing: The experimenters have to

choose the Hypothetical Reference Circuits (HRC), such as the encoding bitrate and packet loss rate, to process the source videos. First, the encoder encodes the video with a certain


TABLE III: Test Environment Requirement [57]

Requirement | Laboratory | Home
Ratio of inactive screen luminance to peak luminance | ≤ 0.02 | ≤ 0.02
Ratio of background luminance to picture's peak luminance | ≈ 0.15 | N/A
Peak luminance | N/A | 200 cd/m²
Ratio of screen-only black level luminance to peak white luminance | ≈ 0.01 | N/A
Display brightness and contrast | PLUGE [58], [59] | PLUGE
Background chromaticity | D65 | N/A
Environmental illuminance on the screen | N/A | 200 lux
Other room illumination | Low | N/A
Maximum observation angle relative to the normal | 30° | 30°
Screen size | N/A | Meet rules of PVD
Monitor | N/A | No digital processing; meet resolution requirement

video compression format, during which the encoder's distortions are applied. Secondly, the video goes through the (often simulated) transmission network, during which the network's distortions are applied. Finally, the processed video is obtained after decoding. If more than one distortion factor is considered (let $F_1, F_2, \ldots, F_k$ denote the factors, where $F_i$ has $n_i$ levels $f_{i,1}, f_{i,2}, \ldots, f_{i,n_i}$), a "reasonable" range for each distortion factor should be determined, and the maximum and minimum values specified. There are two ways to process the videos (the sketch after this list illustrates both):

• Each processed video represents a level of one factor, while the other factors are fixed at a chosen level. For instance, for factor $F_i$, we have processed videos $\{(f_{1,0}, \ldots, f_{i,j}, \ldots, f_{k,0})\}_{j=1,\ldots,n_i}$, in which $f_{1,0}, \ldots, f_{k,0}$ are reference levels.
• All combinations of the factor levels are considered; that is, we have processed videos $\{(f_{1,j_1}, \ldots, f_{k,j_k})\}_{j_i=1,\ldots,n_i}$.
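A minimal sketch of the two designs, with hypothetical factor levels for bitrate and packet loss (the specific values are ours, for illustration only):

```python
from itertools import product

# Hypothetical distortion factors and levels (illustrative only).
factors = {
    "bitrate_kbps": [200, 500, 1000, 2000],   # F1, n1 = 4 levels
    "packet_loss":  [0.0, 0.01, 0.05],        # F2, n2 = 3 levels
}
reference = {"bitrate_kbps": 1000, "packet_loss": 0.0}  # chosen reference levels

# Design 1: vary one factor at a time, others fixed at the reference level.
one_at_a_time = [
    {**reference, name: level}
    for name, levels in factors.items()
    for level in levels
]

# Design 2: full factorial, all combinations of factor levels.
full_factorial = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(one_at_a_time), "videos (one-at-a-time) vs",
      len(full_factorial), "videos (full factorial)")
```

The full factorial design grows multiplicatively with the number of levels, which is why the one-factor-at-a-time design is often preferred when test session time is limited.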

After the video processing, the processed videos need to be normalized to eliminate "deterministic" differences from the source videos. The normalization includes temporal frame shift, horizontal and vertical spatial image shift, and chroma and luma scaling and alignment. The amount of normalization is estimated from the source and processed videos, and is applied uniformly to all the video sequences. The accuracy of the alignment can be verified by MSE.
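For instance, the luma gain and offset part of this normalization can be estimated with a least-squares fit between processed and source frames and checked with MSE; the sketch below is our own illustration and assumes the frames are already temporally and spatially aligned.

```python
import numpy as np

def estimate_gain_offset(source, processed):
    """Least-squares luma gain/offset so that gain*processed + offset ≈ source."""
    gain, offset = np.polyfit(processed.ravel(), source.ravel(), deg=1)
    return gain, offset

rng = np.random.default_rng(0)
src = rng.uniform(16, 235, (64, 64))
proc = 0.9 * src + 10 + rng.normal(0, 1, src.shape)   # gain 0.9, offset 10, plus noise
g, o = estimate_gain_offset(src, proc)
corrected = g * proc + o
print(f"gain={g:.3f} offset={o:.2f} residual MSE={np.mean((corrected - src) ** 2):.3f}")
```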

4) Assessor Recruitment: It is required that at least 15 non-expert assessors be recruited for the tests. The assessors should be tested for visual acuity, color vision, and familiarity with the language used in the test. Since the demography of the assessors may influence the final evaluation results, their personal information should be collected as broadly as possible, such as age, gender, occupation, education, etc. Before the test sessions start, the assessors should be given

[Figure: the assessor sees the original video, a mid-grey screen, then the processed video, followed by a mid-grey grading phase.]

Fig. 6: DSIS video/image presentation sequence option I

[Figure: the original/mid-grey/processed sequence is shown twice; the grading phase starts with the second showing.]

Fig. 7: DSIS video/image presentation sequence option II

instructions on:

• The flow of the test, e.g., training subsessions and test subsessions;
• The presentation of each trial, e.g., double stimulus or single stimulus;
• The possible quality impairments, e.g., color, brightness, depth, motion, and "snow";
• The evaluation scale, e.g., continuous or categorical.

B. Test Execution

Test execution includes conducting the subjective tests and collecting the test results (e.g., user scores) [57]. Each test session should last less than 30 minutes, consisting of three subsessions:

• The training subsession is used to give instructions to the assessors about the sequence and timing of the test.

• The stabilizing subsession is used as a "warm-up" for the assessors to stabilize the following assessment. The assessments in this subsession will not be included in the results for further analysis.

• The main test subsession is the formal test phase, the results of which will be used for further analysis.

The order of the video presentation should be randomized, covering all the possible impairment conditions under study. In the main test subsession, several test methods can be applied:

1) Double-Stimulus Impairment Scale (DSIS) method (the EBU method): In DSIS, the assessors are first presented the source video, then the processed video. The assessors only grade the processed video, based on their knowledge or impression of the source video. For the assessment of a certain video, the presentation sequence has two options, as shown in Fig. 6 and Fig. 7. In Fig. 6, the source video and the processed video are presented to the assessor only once, and the assessor can grade the video as soon as the processed video is shown. In Fig. 7, the source video and the processed video are presented to the assessor twice, and the assessor can grade from the start of the second showing of the source video. The scale for DSIS is discrete grades from 1 to 5, as shown in Table IV, indicating how the assessors evaluate the impairment of the processed video. It has been found


that the DSIS results are more stable for small impairments than for large impairments.

TABLE IV: DSIS Scale

Grade | Meaning
1 | Imperceptible
2 | Perceptible, but not annoying
3 | Slightly annoying
4 | Annoying
5 | Very annoying

2) Double-Stimulus Continuous Quality-Scale (DSCQS) method: In DSCQS, the assessors are also presented with both the source video and the processed video. Let "PS" and "SP" denote the orders "first processed video, then source video" and "first source video, then processed video", respectively. The same video under different test conditions should not be presented consecutively. The number of consecutive "PS" presentation orders should be no more than a threshold, and the same holds for the "SP" presentation order. In addition, the number of events in which two video sequences are presented consecutively should be no more than a threshold. Compared with DSIS, DSCQS differs in the following aspects:

• For the same video, both the source version and the processed version are presented to the assessors, but the assessors do not know which one is the source version.
• The assessors are asked to grade both versions of the same video. The scale for DSCQS grading differs (as shown in Fig. 8) in two aspects:
  – It has continuous grade bars;
  – It has two bars for the same video.
• For DSCQS grades, it is not the absolute value, but the difference between the two values for the same video, that matters.

[Figure: two continuous grade bars, one per sequence, labeled from Excellent through Good, Fair, and Poor to Bad.]

Fig. 8: DSCQS scale

3) Single-Stimulus (SS) method: In SS, only the processed videos are presented to the assessors. The presentation can take two forms:

• Each processed video is shown once to the assessors (as shown in Fig. 9). The order of presenting the processed videos is random.
• Each processed video is shown three times in three sessions (as shown in Fig. 10). The order of presenting the processed videos in each session should be different. Only the results of the last two sessions are counted in the final results; the first session serves to stabilize the assessors' grading.

[Figure: each processed video is shown once, separated by mid-grey screens, with a grading phase after each video.]

Fig. 9: SS video/image presentation sequence option I

[Figure: the processed videos are shown in three sessions (Session I: Video 11, Video 12, ...; Session II: Video 21, ...; Session III: Video 31, ...), each video separated by mid-grey screens and followed by a grading phase.]

Fig. 10: SS video/image presentation sequence option II

The grading scheme for SS can have three different forms:

• Categorical grading. The assessors categorize the videos into pre-defined categories. The categories can be given numerically (e.g., "1", "2", ..., "10") or verbally (e.g., "Excellent", "Good", "Fair", "Poor", "Bad").

• Numerical grading. The assessors give marks, for example, 1 ∼ 100.

• Performance-based grading. While the above two methods solicit assessors' grading directly, the video quality can also be indirectly inferred by asking assessors to give video-related information.

Compared with the Double-Stimulus (DS) method, the Single-Stimulus method has the following advantages:

• For DS, if the source and processed videos are presented simultaneously on split screens, the assessors' attention may be distracted [56].
• For DS, if the source and processed videos are presented consecutively, more time is required for each pair of video sequences. Since one session should not exceed 30 minutes, the number of video sequence pairs tested in one session has to be reduced. Therefore, multiple sessions may be conducted, leading to the problem of how best to combine the results from different sessions.

4) Stimulus-Comparison (SC) method: In SC, two (processed) videos are presented to the assessors, and the assessors grade the relationship between the two videos. The grading scheme for SC also has three different forms:

• Categorical grading. The assessors categorize the relationship between the two videos into pre-defined categories. The categories can be given numerically (e.g., the second video is "-3, much worse", "-2, slightly worse", ..., "3, much better") or verbally (e.g., "Same", "Different").
• Numerical grading. The assessors give (continuous) grades, for example, 1 ∼ 100, to the degree of difference between the two videos.


• Performance-based grading. Assessors are asked to identify whether one video has more or less of a certain feature than the other video.

C. Data Processing

Data processing includes checking data completeness and screening outliers and inconsistent assessors. To start with, the assessors' grades can be processed into two user score metrics:

• The Mean Opinion Score (MOS) is for single-stimulus tests. It is calculated as the average of the grades for a processed video. MOS is often used to validate the performance of no-reference objective quality models, which will be introduced in Section IV.

• The Difference Mean Opinion Score (DMOS) is for double-stimulus tests. It is calculated as the average of the arithmetic differences between the grades given to the processed video and the grades given to the source video. DMOS is often used to validate full-reference and reduced-reference objective quality models, which will be introduced in Section IV.
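As a small illustration of the two metrics, assuming raw grades are kept per (video, assessor) in plain dictionaries (the data layout is our own; [57] does not prescribe one):

```python
import statistics

# Hypothetical grades: scores[video_id][assessor_id] on a 1-5 scale.
processed = {"v1": {"a1": 4, "a2": 3, "a3": 4}}
source    = {"v1": {"a1": 5, "a2": 5, "a3": 4}}   # only needed for DMOS

def mos(scores, video):
    """Mean Opinion Score: average grade of the processed video (single stimulus)."""
    return statistics.mean(scores[video].values())

def dmos(source_scores, processed_scores, video):
    """Difference MOS: average per-assessor difference, source minus processed."""
    diffs = [source_scores[video][a] - processed_scores[video][a]
             for a in processed_scores[video]]
    return statistics.mean(diffs)

print("MOS(v1)  =", mos(processed, "v1"))            # 3.67
print("DMOS(v1) =", dmos(source, processed, "v1"))   # 1.0
```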

Then, the results should be screened as follows.

• Check the completeness of the data: whether an assessor gives a score to every video, and whether an assessor grades both the source and the processed video in double-stimulus tests.
• Remove assessors with extreme scores (outliers).
• Remove assessors with unstable scores.

Checking the data completeness is easy. We now introduce how to screen the outliers and inconsistent assessors in more detail. The basic assumption is that the data collected from the subjective test follow a certain distribution within the scoring range (e.g., 1 ∼ 5, or 1 ∼ 100), with variations due to differences in assessors, video contents, and so on. Let OS be the individual opinion score, i be the assessor index (a total of I assessors), j be the test condition index (a total of J test conditions), and k be the video sequence index (a total of K video sequences). First, we define some key parameters:

• Mean Score. The mean score for the jth test condition and kth video sequence is
$$MOS_{jk} = \frac{1}{I}\sum_i OS_{ijk} \qquad (1)$$

• Standard Deviation. The standard deviation of $MOS_{jk}$ is
$$S_{jk} = \sqrt{\frac{\sum_i (MOS_{jk} - OS_{ijk})^2}{I-1}} \qquad (2)$$

• 95% Confidence Interval. The 95% confidence interval of $MOS_{jk}$ is
$$[MOS_{jk} - \delta_{jk},\ MOS_{jk} + \delta_{jk}] \qquad (3)$$
in which $\delta_{jk} = 1.96\, S_{jk}/\sqrt{I}$.

• Kurtosis Coefficient. The kurtosis coefficient $\beta_{2,jk}$, used to verify whether the data distribution of the jth test condition and kth video sequence is normal, is calculated as
$$\beta_{2,jk} = \frac{I \sum_i (MOS_{jk} - OS_{ijk})^4}{\left[\sum_i (MOS_{jk} - OS_{ijk})^2\right]^2} \qquad (4)$$
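Formulas (1)–(4) vectorize naturally; a minimal NumPy sketch, assuming scores are stored in an array indexed as scores[i, j, k] (assessor, condition, sequence):

```python
import numpy as np

def screening_statistics(scores):
    """Return MOS_jk, S_jk, the 95% CI half-width delta_jk, and kurtosis beta2_jk
    for a scores array of shape (I, J, K)."""
    I = scores.shape[0]
    mos = scores.mean(axis=0)                                   # eq. (1)
    s = np.sqrt(((mos - scores) ** 2).sum(axis=0) / (I - 1))    # eq. (2)
    delta = 1.96 * s / np.sqrt(I)                               # eq. (3)
    dev = mos - scores
    beta2 = I * (dev ** 4).sum(axis=0) / ((dev ** 2).sum(axis=0) ** 2)  # eq. (4)
    return mos, s, delta, beta2

scores = np.random.default_rng(0).integers(1, 6, size=(20, 3, 4)).astype(float)
mos, s, delta, beta2 = screening_statistics(scores)
print(mos.shape, "cells with normal-looking distribution:",
      ((beta2 >= 2) & (beta2 <= 4)).sum())
```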

1) Data Screening for DS: The data screening for DS mainly screens outliers, using Algorithm 1. The detailed explanation is as follows:

• Step 2: Verify whether the data distribution of the jth test condition and kth video sequence is normal. If $\beta_{2,jk} \in [2, 4]$, the distribution is regarded as normal; otherwise, it is not.
• Steps 3 ∼ 16: Compare the individual user score $OS_{ijk}$ with the two reference values $MOS_{jk} + 2S_{jk}$ and $MOS_{jk} - 2S_{jk}$ for a normal distribution, or $MOS_{jk} + \sqrt{20}S_{jk}$ and $MOS_{jk} - \sqrt{20}S_{jk}$ for a non-normal distribution. Individual user scores outside the range $[MOS_{jk} - 2S_{jk},\ MOS_{jk} + 2S_{jk}]$ or $[MOS_{jk} - \sqrt{20}S_{jk},\ MOS_{jk} + \sqrt{20}S_{jk}]$ are counted in $High_i$ and $Low_i$.
• Steps 18 ∼ 21: Decide whether to remove assessor i based on $High_i$ and $Low_i$.

Algorithm 1 Data Screening for DS

1:  for all i, j, k do
2:    if β_{2,jk} ∈ [2, 4] then
3:      if OS_ijk ≥ MOS_jk + 2·S_jk then
4:        High_i++;
5:      end if
6:      if OS_ijk ≤ MOS_jk − 2·S_jk then
7:        Low_i++;
8:      end if
9:    else
10:     if OS_ijk ≥ MOS_jk + √20·S_jk then
11:       High_i++;
12:     end if
13:     if OS_ijk ≤ MOS_jk − √20·S_jk then
14:       Low_i++;
15:     end if
16:   end if
17: end for
18: Ratio_1 = (High_i + Low_i)/(J·K);
19: Ratio_2 = |High_i − Low_i|/(High_i + Low_i);
20: if Ratio_1 > 0.05 && Ratio_2 < 0.3 then
21:   Remove assessor i;
22: end if
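A direct Python transcription of Algorithm 1, recomputing the statistics above (a sketch under the same scores[i, j, k] layout assumption):

```python
import numpy as np

def screen_outliers_ds(scores):
    """Return the list of assessor indices rejected by Algorithm 1."""
    I, J, K = scores.shape
    mos = scores.mean(axis=0)
    s = np.sqrt(((mos - scores) ** 2).sum(axis=0) / (I - 1))
    dev = mos - scores
    beta2 = I * (dev ** 4).sum(axis=0) / ((dev ** 2).sum(axis=0) ** 2)
    normal = (beta2 >= 2) & (beta2 <= 4)
    # 2*S_jk where the cell looks normal, sqrt(20)*S_jk otherwise (steps 2-16).
    thresh = np.where(normal, 2.0, np.sqrt(20.0)) * s
    rejected = []
    for i in range(I):
        high = int((scores[i] >= mos + thresh).sum())
        low = int((scores[i] <= mos - thresh).sum())
        ratio1 = (high + low) / (J * K)                                  # step 18
        ratio2 = abs(high - low) / (high + low) if high + low else 0.0  # step 19
        if ratio1 > 0.05 and ratio2 < 0.3:                               # steps 20-21
            rejected.append(i)
    return rejected
```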

2) Data Screening for SS: The data screening for SS is two-fold: screening the outliers who deviate from the average behavior, and screening the assessors whose behavior is inconsistent. The difference between the screening processes for DS and SS is that for DS we test each (condition, sequence) configuration, while for SS we test each (condition, sequence, time window) configuration. Let m be the time window index (a total of M time windows).

• Screen outliers: use Algorithm 1, but replace $OS_{ijk}$ with $OS_{ijkm}$, and modify the kurtosis coefficient $\beta_{2,jkm}$ and the standard deviation $S_{jkm}$ correspondingly.


Further change Step 18 to $Ratio_1 = High_i/(JKM)$, Step 19 to $Ratio_2 = Low_i/(JKM)$, and the condition for removing assessor i in Step 20 to $Ratio_1 > 0.2$ or $Ratio_2 > 0.2$.

• Screen inconsistent assessors. The variable under test is
$$\widehat{OS}_{ijkm} = OS_{ijkm} - MOS_{ijk} + MOS_{jk} \qquad (5)$$
in which
$$MOS_{jk} = \frac{\sum_i \sum_m OS_{ijkm}}{I \times M}, \qquad MOS_{ijk} = \frac{\sum_m OS_{ijkm}}{M} \qquad (6)$$
The corresponding kurtosis coefficient is
$$\beta_{2,jkm} = \frac{I \sum_i \widehat{OS}_{ijkm}^4}{\left(\sum_i \widehat{OS}_{ijkm}^2\right)^2} \qquad (7)$$
The screening process is: use Algorithm 1, but replace $OS_{ijk}$ with $\widehat{OS}_{ijkm}$ and $\beta_{2,jk}$ with $\beta_{2,jkm}$, and modify the standard deviation $S_{jkm}$ correspondingly. Further change Step 18 to $Ratio_1 = (High_i + Low_i)/(JKM)$, keep Step 19 as $Ratio_2 = |High_i - Low_i|/(High_i + Low_i)$, and change the condition for removing assessor i in Step 20 to $Ratio_1 > 0.1$ or $Ratio_2 < 0.3$.

D. Results Presentation

The final results should include the following:

• Test configuration;
• Test video sequence information;
• Types of video source;
• Types of display monitors;
• Number and demographic information of assessors;
• Reference systems used;
• The grand mean score for the experiment;
• The mean and 95% confidence interval of the statistical distribution of the assessment grades.

A common data format is desirable for inter-lab data exchange, because large-scale subjective tests are usually carried out by different laboratories in different countries, possibly with assessors speaking different languages.

E. Subjective Test for 3D Videos

In [60], the ITU gives guidance for subjective tests of stereoscopic television pictures. Apart from the assessment factors for conventional monoscopic television pictures, there are additional factors to be considered for stereoscopic television pictures.

• Depth resolution and depth motion. Depth resolution is the spatial resolution in the depth direction; depth motion is movement along the depth direction.
• The puppet theater effect refers to a distortion in the reproduced 3D image in which objects appear unnaturally large or small.
• The cardboard effect refers to a distortion in the reproduced 3D image in which objects appear unnaturally thin.

In [61], the authors argue that the subjective test specified by the ITU may not simulate the home environment where the actual viewing happens. In the standard ITU subjective test, short video sequences are often used, whose content may not interest the viewers. Therefore, in [61], the authors propose to use long video sequences, with the test method shown in Fig. 11. The same long video is continuously played with alternating processed and original segments, and assessors grade the video quality during the periods when the original (unprocessed) segments are being played.

[Figure: a long video is played continuously with alternating processed segments (distortions A, B, C, ...) and original segments; grading phases occur while the original segments play.]

Fig. 11: Proposed 3D video evaluation method in [61]

F. Subjective Test Crowdsourcing

Conventionally, the subjective test is conducted in a lab or several cooperating labs, which is labor-intensive, time-consuming, and expensive. A more cost-effective alternative is to conduct the subjective test through Internet crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk) [62].

One problem with the crowdsourced subjective test is detecting outliers, because the online assessors perform the evaluation tasks without supervision. For example, if the test lasts a long time and the assessors get impatient, they may input random evaluations. In [63], the authors propose to verify the consistency of the ratings based on the transitivity property, that is, if an assessor prefers A to B and B to C, then he should prefer A to C. But the method cannot work when the data is incomplete. To solve this problem, the authors in [64] propose an outlier detection algorithm based on Hodge decomposition theory, first proposed in [65], to check data consistency in incomplete and imbalanced data. In [66], paired comparison is proposed as a simpler rating scheme to replace MOS. In a paired comparison, the assessors are given a pair of images or videos, and they only have to decide which one has better quality. A cheat detection mechanism based on the transitivity property is given to check and screen inconsistent assessments.
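As a toy version of such a transitivity-based check (the data structure is our own, not the exact mechanism of [63] or [66]): represent one assessor's paired-comparison answers as directed (winner, loser) edges and count cyclic triangles, which indicate intransitive, possibly random, ratings.

```python
from itertools import combinations

def transitivity_violations(prefs):
    """Count intransitive triples in a set of pairwise preferences.

    `prefs` is a set of (winner, loser) tuples for one assessor; a triple
    (A, B, C) with A>B, B>C but C>A violates transitivity.
    """
    items = {x for pair in prefs for x in pair}
    violations = 0
    for a, b, c in combinations(sorted(items), 3):
        # Check both cyclic orientations of the triangle.
        if {(a, b), (b, c), (c, a)} <= prefs or {(b, a), (c, b), (a, c)} <= prefs:
            violations += 1
    return violations

honest = {("A", "B"), ("B", "C"), ("A", "C")}
random_clicks = {("A", "B"), ("B", "C"), ("C", "A")}
print(transitivity_violations(honest), transitivity_violations(random_clicks))  # 0 1
```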

G. Discussion

Although the subjective test directly measures QoE by asking assessors for their evaluations, it suffers from some significant drawbacks:

• High cost. The subjective test consumes time, money, and manpower.

• Limited assessors. Usually, no more than 100 assessors are involved in a subjective test due to its high cost. These assessors can only represent the demographic features of a very small fraction of the entire viewer population.

• Controlled environment. The subjective test is often conducted in the laboratory environment, which is not the


usual place where common viewers watch video. The results may not be an accurate reflection of viewers' true viewing experience in the wild, where other factors, such as delay, may also have an influence on QoE.

• Limited distortion types. The lab-processed distortion types are representative but cannot account for all the parameters that have an impact on QoE. Some of the conditions are hard to test in the laboratory environment, such as transmission-network-induced delay and jitter, or external factors such as the different locations where viewers watch the video.

• Distortion factor correlation. One problem with video processing is that many of the distortion factors are correlated in reality, and some combinations of factors would not occur in a real environment. For example, if bitrate and frame rate are chosen as distortion factors, it is unlikely that the combination (high bitrate, low frame rate) would occur in a real environment.

• Hard to account for frames of different importance in a video. A video can be regarded as an aggregation of images (or frames), whose quality can be assessed by both double-stimulus and single-stimulus subjective tests. However, the quality of the video does not simply equal the sum of the quality of all its images. For example, some frames in a video are less visually important than others. Moreover, in video compression, certain frames (e.g., I-frames) contain more information than others (e.g., P-frames and B-frames).

• Not applicable for online QoE estimation. The subjective test cannot be used for real-time QoE monitoring or prediction; thus, it cannot provide instrumental guidance for real-time system adaptation.

IV. OBJECTIVE QUALITY MODEL

To give relatively reliable QoE predictions while avoiding the necessity of subjective tests, researchers have developed objective quality models. Objective quality models compute a metric as a function of QoS parameters and external factors. The output metric should correlate well with the subjective test results, which serve as the ground-truth QoE. In this section, we first introduce representative objective quality models. Then, we describe the process of validating the performance of objective quality models. Finally, we introduce projects and international standards for objective quality models.

In previous survey papers on objective quality models, there are three major classification methods:

• The "psychophysical approach" and the "engineering approach" [47]. The two approaches are also termed vision-based models and signal-driven models in some articles. The psychophysical approach is mainly based on characterizing the mechanisms of the HVS, such as the masking effect, contrast sensitivity, and adaptation to color and illumination. The engineering approach is based on extracting and analyzing certain distortion patterns or features of the video, such as statistical features, structural similarity (SSIM), and compression artifacts (e.g., blockiness, edginess).

• Reference-based classification method [47]. Based on whether a reference to the original video is needed, objective quality models are classified as Full Reference (FR), Reduced Reference (RR), and No Reference (NR) models.
  – Full Reference (FR) Model. Full access to the source video is required.
  – Reduced Reference (RR) Model. Partial information about the source video is required.
  – No Reference (NR) Model. No access to the source video is needed.

The full-reference and reduced-reference models need to refer to the original video for quality comparison and assessment, making them less suitable for online QoE estimation. They are "intrusive" models in the sense that they insert an additional load into the network or service [67]. The no-reference model is non-intrusive, adding no load to the network or service, and is thus more suitable for online QoE evaluation and system adaptation. When choosing a no-reference model or metric for online QoE evaluation, real-time performance and speed are also deciding factors.

• Input data-based classification method [68]. Based on the type of the input data, there are five categories of models:
  – Media-layer models, whose input is the media signal.
  – Parametric packet-layer models, whose input is the packet header information.
  – Parametric planning models, whose input is quality design parameters.
  – Bitstream-layer models, whose input is packet header and payload information.
  – Hybrid models, which combine any of the other models.

The first two classification methods are the most commonly adopted, and are often used to complement each other. In general, the psychophysical approach usually belongs to FR, while RR and NR are mostly based on the engineering approach. Many survey papers mention both classification methods, but usually follow one of them. For example, [32], [69] mainly follow the psychophysical/engineering classification; [31], [70]–[72] mainly adopt the reference-based classification; and [47] adopts a combination of the two. The third classification method is proposed in [68] and referenced in [31]. In [73], objective models are classified as pixel-based models (e.g., PSNR and MSE), vision-based single-channel models, vision-based multi-channel models, and specialized models, yet this is not a commonly adopted classification method.

The main purpose of this tutorial is to introduce the evolution of video quality assessment methods as a whole, and in particular to point out potential future directions. We adopt the existing classification methods for the objective quality models. Fig. 12 gives a summary of the objective quality models that we mainly focus on. We use FR/RR/NR as the first-tier classification, the psychophysical/engineering approach as the second-tier classification, and other more specific criteria as the third-tier classification. It should be noted that some classifications are non-exclusive.


[Fig. 12: An overview of objective quality models. FR models: pixel-based (MSE, PSNR); psychophysical approach [36], [41], [42], [46], [79]–[81]; engineering approach based on video artefacts [44], [92]–[94], [98], NSS [95]–[97] and SSIM [33], [43], [76], [87]–[91], [106]. RR models: engineering approach based on PLV [113]–[119], NSS [122]–[128] and others [135]–[138]; psychophysical approach. NR models: engineering approach (PLV, NSS [14], [15]); streamed-video models: bitstream-layer [16]–[18], packet-layer [19], [20] and hybrid [21]–[27].]

For example, structural similarity (SSIM) is an engineering approach, but many variations of SSIM also incorporate psychophysical features in their design. In this case, we still classify these variations as the engineering approach, as their major basis is SSIM. We believe that as the research on objective quality models advances, there will be a need for an evolution of the classification methods, but this is not the focus of this tutorial paper.

A. Full Reference Model

In this section, we mainly introduce three kinds of full reference models: simple pixel-based models, the psychophysical approach and the engineering approach. Within the engineering approach, we further introduce models based on video artefacts, natural scene statistics (NSS) and structural similarity (SSIM).

1) Pixel based Models: The two most basic objective quality models are Mean Squared Error (MSE) and Peak-Signal-to-Noise Ratio (PSNR), which are simple to compute and therefore usually serve as the benchmark for evaluating more advanced models.

• MSE. MSE can be calculated as

MSE = \frac{1}{N} \sum_i (y_i - x_i)^2    (8)

in which x_i is the ith sample of the original signal, and y_i is the ith sample of the distorted signal.

• PSNR. PSNR is defined as

PSNR = 10 \log_{10} \frac{MAX}{MSE}    (9)

in which MAX is the maximum signal energy [74].

The advantage of pixel based models is simplicity. However, neither model considers the features of the HVS or the viewing conditions, and both are poorly correlated with subjective results [75]–[77].
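As a baseline for the models that follow, the two metrics reduce to a few lines of NumPy. This is a minimal sketch, assuming 8-bit frames so that the maximum signal energy MAX is taken as 255^2; the function names are ours, not from any cited implementation.

import numpy as np

def mse(x, y):
    """Mean squared error between original x and distorted y, eq (8)."""
    return float(np.mean((y.astype(np.float64) - x.astype(np.float64)) ** 2))

def psnr(x, y, max_energy=255.0 ** 2):
    """PSNR in dB, eq (9); MAX is taken as 255^2 for 8-bit frames."""
    m = mse(x, y)
    return float('inf') if m == 0 else float(10.0 * np.log10(max_energy / m))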

2) Psychophysical Approach: Objective quality models of the psychophysical approach are based on the features of the HVS which are related to visual perception, for instance, contrast sensitivity, frequency selectivity, spatial and temporal features, masking effects, and color perception [72]. Table V gives a summary of the psychophysical approach models. Note that the performance factors given in the table are highly dependent on the database used for evaluation and on the model parameters; the values provided in the table only serve as a reference.

TABLE V: Psychophysical Approach Models
CC: Correlation Coefficient; RMS: Root Mean Squared Error; OR: Outlier Ratio; SRCC: Spearman Rank-order Correlation Coefficient

Model | Year | Basis | Validation Database | CC | RMS | OR | SRCC
MPQM [36] | 1996 | Human spatio-temporal vision, contrast sensitivity and masking | 3 video sequences with MPEG-2 and H.263 | N/A | N/A | N/A | N/A
DVQ [41], [42] | 2000 | Human spatial-temporal contrast sensitivity | Data set in [78] | N/A | 14.61 | N/A | N/A
PVQM [79] | 2002 | Luminance “edginess”, color error and temporal decorrelation | Medium to high quality database with digital codec distortions and analog PAL, VHS and Betacam distortions | 0.934 | N/A | N/A | N/A
VSNR [46] | 2007 | Contrast sensitivity, visual masking and global precedence | LIVE database | 0.889 | 7.390 | N/A | 0.889
MOSp [80] | 2009 | Edginess and masking effect | Video sequences with H.264 at different bitrates | 0.947 | N/A | 0.402 | N/A
AFViQ [81] | 2013 | Contrast sensitivity, foveated vision, visual attention | LIVE database, VQEG HDTV Phase I database [82] | 0.83 | 8.74 | N/A | N/A

• Moving Picture Quality Metric (MPQM)
MPQM is based on two features of human perception: contrast sensitivity and the masking effect [36]. Contrast sensitivity means that a signal is visible only if its contrast is higher than a detection threshold, which is a function of spatial frequency. The inverse of the detection threshold is defined as the contrast sensitivity, which is usually characterized by the Contrast Sensitivity Function (CSF). The contrast sensitivity function proposed by Mannos and Sakrison is [83]

A(f) = 2.6(0.0192 + 0.114f) \exp[-(0.114f)^{1.1}] = \frac{1}{D_0}    (10)

in which f is the spatial frequency and D_0 is the detection threshold of the distortion without the masking effect.
One of the characteristics of the HVS is contrast masking: the visibility of a signal is highly affected by its background. The detection threshold of the foreground signal is a function of the contrast of the background. The distortion can be viewed as the foreground signal on the background original image. The foreground distortion may be highly visible, or partly/completely masked by the background original image. Let D denote the detection threshold of the distortion with the masking effect, and C_b denote the contrast of the background. The masking effect model gives the following function of D depending on D_0 and C_b:

D = \begin{cases} D_0, & C_b < D_0 \\ D_0 (C_b/D_0)^{\eta}, & C_b \ge D_0 \end{cases}    (11)

in which η is a constant parameter.

[Fig. 13: Flow of MPQM: the original video and the error signal pass through a filter bank; per-channel detection thresholds with masking yield the JND, which is pooled into the final metric.]

Fig. 13 shows the flow of calculating the MPQM metric. The thick lines represent multi-channel output or input. Firstly, the original video and the error signal (the difference between the original and distorted videos) go through the filter bank, which decomposes them into multiple channels according to orientation, spatial frequency and temporal frequency. Secondly, the detection threshold under the masking effect is calculated according to (10) and (11), for each channel. Thirdly, the error signal is divided by the detection threshold to get the Just Noticeable Difference (JND), which is pooled over all channels by Minkowski summation (with exponent β) to get the final metric as follows:

MPQM = \left( \frac{1}{N} \sum_{f=1}^{N} \left( \frac{1}{N_x N_y N_t} \sum_{x,y,t} |e(x,y,t,f)| \right)^{\beta} \right)^{1/\beta}    (12)

in which e(x, y, t, f) is the computed JND at position (x, y), time t and channel f.
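To make the pooling step concrete, here is a small Python/NumPy sketch of (10)-(12); the channel decomposition is assumed to have been performed elsewhere, and the values of η and β below are illustrative placeholders rather than the calibrated constants of [36].

import numpy as np

def detection_threshold(f, cb, eta=0.7):
    """Masked detection threshold, eqs (10)-(11); eta is illustrative."""
    d0 = 1.0 / (2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1))
    return np.where(cb < d0, d0, d0 * (cb / d0) ** eta)

def mpqm_pool(errors, contrasts, freqs, beta=4.0):
    """Minkowski pooling of JND-normalised channel errors, eq (12).
    errors/contrasts: lists of per-channel (x, y, t) arrays; freqs: the
    spatial frequency associated with each channel."""
    per_channel = []
    for e, cb, f in zip(errors, contrasts, freqs):
        jnd = np.abs(e) / detection_threshold(f, cb)  # JND conversion
        per_channel.append(np.mean(jnd) ** beta)
    return float(np.mean(per_channel) ** (1.0 / beta))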

• Digital Video Quality (DVQ)
DVQ calculates the visual difference between the original and distorted video sequences using the Discrete Cosine Transform (DCT). It incorporates spatial and temporal filtering, spatial frequency channels and contrast masking [41], [42]. The flow of calculating DVQ is illustrated in Fig. 14. Pre-processing includes sampling, cropping, and color transformations to restrict the later processing to the Region of Interest (RoI). Then, a blocked DCT is performed on the processed video sequence. Local contrast is obtained by dividing the DCT coefficients by the DC coefficients. Temporal filtering and JND conversion implement the temporal and spatial features of the CSF respectively. After the contrast masking process, the results are pooled by Minkowski summation as in (12).

[Fig. 14: Flow of DVQ: pre-processing, blocked DCT, local contrast, temporal filtering, JND conversion, contrast masking and error pooling.]

• Perceptual Video Quality Measure (PVQM)
PVQM calculates the following three indicators:

– Edginess indicator E. The HVS is sensitive to edges and local luminance changes. The local edginess can be approximated by the local gradient of the luminance signal. The difference between the edginess of the distorted video and that of the original video can be viewed as sharpness loss (if the edginess of the distorted video is smaller) or distortion (if the edginess of the distorted video is higher). The introduced edginess difference is more obvious in areas with less edginess than in areas with much edginess. The edginess indicator is the local edginess of the distorted video minus the local edginess of the original video, divided by the local edginess of the original video.

– Temporal indicator T. While the edginess indicator is a purely spatial indicator mostly for still images, the temporal indicator characterizes the motion of the video sequence. A fast-moving sequence decreases visual sensitivity to details. The temporal indicator quantifies the temporal variability of the video sequence by calculating the correlation of the current frame (t) and the previous frame (t − 1).

– Chrominance indicator C. Color errors in areas with saturated colors are less perceptible to the HVS. The chrominance indicator calculates the color saturation of the original and distorted videos.

Two cognitive models are further applied for pooling the above indicators from both spatial and temporal aspects:

– Spatial pooling. Errors on the edge are less disturbing than those in the central area; therefore, the edginess indicator and the chrominance indicator are given heavier weights in the center of the image and lighter weights at the bottom and top.

– Spatio-temporal pooling. The HVS penalizes severe errors more heavily. Therefore, large local spatial and temporal errors are given heavier weights.

The final PVQM is the linear combination of the three indicators after aggregation:

PVQM = 3.95E + 0.74C - 0.78T - 0.4    (13)

• Visual Signal-to-Noise Ratio (VSNR)

[Fig. 15: Flow of VSNR: M-level DWT of the source image S and the distorted image D; Stage I tests near-threshold distortion (is E = D − S above threshold?), and Stage II computes perceived contrast and global precedence for suprathreshold distortion.]

VSNR determines near-threshold and suprathreshold distortions in two stages, as shown in Fig. 15. For pre-processing, the original image S and distorted image D are decomposed by an M-level DWT to obtain two sets of 3M + 1 subbands. Then, the assessment goes through two stages. In the first stage, near-threshold distortion is considered. Low-level HVS properties are used to determine whether the distortion is beyond the threshold: if not, the image is assessed to have perfect visual fidelity, thus VSNR = ∞; otherwise, the image is put through the second stage. In the second stage, suprathreshold distortion is considered. Both low-level and mid-level HVS properties are used to compute the final VSNR value.

– Stage I: Near-threshold Distortion.

Whether an observer can detect the distortion depends on the spatial frequency of the image, which depends on the viewing conditions: the resolution of the display r and the viewing distance d. The M-tuple frequency vector f = [f_1, ..., f_m, ..., f_M] can be computed as

f_m = 2^{-m} r d \tan\frac{\pi}{180}    (14)

To decide whether the distortions are visually perceptible, the contrast detection threshold for a particular frequency f is calculated as follows:

T_m = \frac{C(S_f)}{a_0 f^{a_2 \ln f + a_1}}    (15)

in which C(·) is the root-mean-square (RMS) contrast function [84], and a_0, a_1, a_2 are parameters that can be obtained from experiment. If, for every subband f_m, the distortion contrast is less than the threshold T_m, VSNR = ∞ is assigned and the assessment process terminates. If, for a particular f_m, the distortion contrast C(E_m) exceeds the threshold T_m, Stage II is processed for further assessment.

– Stage II: Suprathreshold Distortion.
The assessment of suprathreshold distortion is based on Global Precedence, a mid-level HVS property (see Section II). The principle of Global Precedence is that the HVS processes the image in a coarse-to-fine-grained manner: from the global structure to the local details [85]. It is found in [86] that “structural distortion” that affects the global precedence is most perceptible, while additive white noise, which is uncorrelated with the image, is least perceptible. The global precedence-based VSNR is computed as

VSNR = 20 \log_{10} \frac{C(S)}{\alpha C(E) + (1-\alpha) GP/\sqrt{2}}    (16)

in which α ∈ [0, 1] adjusts the relative importance; C(S) and C(E) are the sums of C(S_m) and C(E_m), respectively; GP is the global precedence disruption given as follows:

GP = \sqrt{\sum_m [C^*(E_m) - C(E_m)]^2}    (17)

in which C^*(E_m) is the global-precedence preserving contrast.

• MOSp
MOSp is a simple and easy-to-compute metric, which is based on the linear relationship between MSE and subjective results:

MOSp = 1 - k \cdot MSE    (18)

in which k is the slope of the linear regression and the key element of MOSp. Due to the masking effect, distortions in highly detailed regions are less visible than those in low detailed regions. Therefore, k is calculated as follows:

k = 0.03585 \exp(-0.02439 \cdot EdgeStrength)    (19)

in which EdgeStrength is used to quantify the detail within a region.

TABLE VI: Engineering Approach Models
CC: Correlation Coefficient; RMS: Root Mean Squared Error; OR: Outlier Ratio; SRCC: Spearman Rank-order Correlation Coefficient

Model | Year | Basis | Validation Database | CC | RMS | OR | SRCC
SSIM [33], [43], [76], [87]–[91] | 2002 | Structural similarity | VQEG Phase I FR-TV | 0.821 | N/A | 0.644 | 0.833
LVQM [92]–[94] | 2004 | Blocking, content richness, masking | 90 test video sequences | 0.897 | N/A | N/A | 0.902
VIF [95], [96] | 2005 | Natural scene statistics | Subjective tests | 0.950 | 5.025 | 0.013 | 0.950
Video VIF [97] | 2005 | Natural scene statistics and video motion | VQEG Phase I FR-TV | 0.891 | N/A | N/A | 0.865
KVQM [98] | 2008 | Edginess, blockiness and blur | 140 video sequences with H.263 and H.264/AVC | N/A | 11.5 | N/A | N/A
MOVIE [44] | 2010 | Spatial, temporal & spatio-temporal distortion | VQEG Phase I FR-TV | 0.821 | N/A | 0.644 | 0.833
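Returning to the metric itself, MOSp is short enough to sketch directly; since the text above does not fix a particular edge detector, EdgeStrength is approximated below by the mean Sobel gradient magnitude of the original frame, which is our assumption.

import numpy as np
from scipy import ndimage

def mosp(original, distorted):
    """MOSp, eqs (18)-(19); EdgeStrength is approximated by the mean
    Sobel gradient magnitude of the original frame (an assumption)."""
    x = original.astype(np.float64)
    gx, gy = ndimage.sobel(x, axis=0), ndimage.sobel(x, axis=1)
    edge_strength = float(np.mean(np.hypot(gx, gy)))
    k = 0.03585 * np.exp(-0.02439 * edge_strength)    # eq (19)
    mse = float(np.mean((distorted.astype(np.float64) - x) ** 2))
    return 1.0 - k * mse                              # eq (18)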

• Attention-driven Foveated Video Quality metric (AFViQ)
AFViQ models the contrast sensitivity of the HVS based on the mechanisms of vision foveation and visual attention. Vision foveation refers to the fact that the HVS perceives a different amount of detail, or resolution, across the area of view, with the highest resolution at the point of fixation. The point of fixation is projected onto the center of the eye's retina, i.e., the fovea [99]. Different from existing quality metrics based on static foveated vision [99]–[101], AFViQ simulates dynamic foveation by predicting video fixation based on eye movement. Given the traditional critical frequency f_c (beyond which a contrast change is imperceptible to the HVS) from existing work [102], the adjusted critical frequency f'_c for a moving object is:

f'_c = f_c \frac{v_c}{|\cos\theta \cdot v_r| + v_c}    (20)

in which v_c = 2 deg/sec is the corner velocity, v_r is the difference between the velocity of the moving object and the eye movement, and θ is the retinal velocity direction. Moreover, the HVS pays different attention to different objects. The critical frequency of different parts of the video can be adjusted by the attention map [103]:

f''_c = f'_c [\rho + (1 - \rho) AM]    (21)

in which AM is the attention map and ρ ∈ [0, 1] is a control parameter. Then the contrast sensitivity for a given spatial frequency sf is:

CS(sf) = \begin{cases} f''_c, & sf \le \bar{f} \\ 0, & sf > \bar{f} \end{cases}    (22)

in which \bar{f} = \min(f''_c, r/2), and r is the effective display visual resolution [104]. The predicted perceived quality at the frame level is:

Q_{frame} = SD \cdot TD    (23)

in which SD is the spatial distortion index and TD is the temporal distortion index. Both SD and TD are functions of CS(sf). The video sequence is then partitioned into segments based on saccade duration, since the HVS has no visual detectability during saccadic eye movement. The quality metric Q_segment is derived by short-term spatial-temporal pooling. Finally, the overall quality metric for the entire video, Q_video, is derived by long-term spatial-temporal pooling.
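The critical-frequency adjustment at the heart of AFViQ, eqs (20)-(22), can be sketched in a few lines; scalar velocities in deg/sec and an attention map value in [0, 1] are assumed, and ρ is an illustrative choice.

import numpy as np

def adjusted_critical_frequency(fc, v_obj, v_eye, theta, am, rho=0.5, vc=2.0):
    """Eqs (20)-(21): foveation- and attention-adjusted critical frequency.
    fc: static critical frequency; v_obj, v_eye: object and eye velocities
    (deg/sec); theta: retinal velocity direction (radians); am: attention
    map value in [0, 1]; rho is an illustrative control parameter."""
    vr = v_obj - v_eye                                   # retinal velocity
    fc1 = fc * vc / (np.abs(np.cos(theta) * vr) + vc)    # eq (20)
    return fc1 * (rho + (1.0 - rho) * am)                # eq (21)

def contrast_sensitivity(sf, fc2, r):
    """Eq (22): sensitivity equals fc'' up to f_bar = min(fc'', r/2)."""
    f_bar = min(fc2, r / 2.0)
    return fc2 if sf <= f_bar else 0.0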

3) Engineering Approach: In this section, we first introduce engineering approach models which are based on modeling one or more video artefacts such as blockiness, edginess and blur; then we present a well-known NSS-based model; finally, we focus on an important branch of engineering approach models based on structural similarity. Table VI gives a summary of the engineering approach models.

Video Artefacts based Models:
• Low-bitrate Video Quality Model (LVQM)
Noting that pixel-wise error measurements (e.g., MSE, PSNR), used for TV types of video, are unsuitable for videos encoded at low bitrates, LVQM is proposed and evaluated on QCIF and CIF videos encoded by MPEG-4 with bitrates ranging from 24 kbps to 384 kbps and frame rates ranging from 7.5 Hz to 30 Hz. LVQM incorporates three aspects:

– Distortion-invisibility D. Subject to luminance masking, spatial-textural masking, and temporal masking, distortions below the detection threshold are deemed invisible. Distortions greater than the detection threshold are incorporated into D.

– Block fidelity BF. At low bitrates, lossy block-based video compression introduces distortions at block boundaries. Block fidelity computes the difference between the distorted video and the original video at block boundaries.

– Content richness fidelity RF. The HVS favors lively and colorful images. RF compares the content richness of the distorted video and the original video in terms of luminance occurrences.

The final quality rating is:

LVQM = \frac{\sum_t D(t) \cdot BF(t) \cdot RF(t)}{N_t}    (24)

• KVQM


The KVQM metric is the linear combination of three factors:

– F_edge quantifies the edge features, with the help of an edge detection algorithm and an edge boundary detection algorithm.
– F_block quantifies the distortions at the block boundary, with the help of a block boundary detection algorithm.
– F_blur quantifies the blur distortion of the image, by calculating the differences of the average gradients of the distorted and original images.

The flow of computing the KVQM is shown in Fig. 16. The edge detection algorithm extracts edge pixels, and the edge boundary detection algorithm extracts pixels adjacent to the edge pixels, both from the original image, since the edges of the distorted image may suffer from blur or other degradation. The block boundary detection algorithm detects blockiness at block boundaries in the distorted image. The gradient feature is the difference between the average gradients of the original image and the distorted image; it quantifies the blur factor. KVQM is then calculated as the weighted sum of the three factors:

[Fig. 16: Flow of KVQM: edge detection and edge boundary detection on the original video, block boundary detection on the distorted video, and the gradient feature, yielding the edge, block and blur factors combined into KVQM.]

KVQM = w_1 F_{edge} + w_2 F_{block} + w_3 F_{blur} + offset    (25)

in which w_1, w_2 and w_3 are the weights for each factor, and offset is the residual of the regression.

• MOtion-based Video Integrity Evaluation (MOVIE)
The MOVIE index assesses video distortions not only separately in the space domain and time domain, but also in the space-time domain, characterizing the motion quality along the motion trajectories. Fig. 17 shows how to calculate the MOVIE index. The original video and distorted video signals first go through a Gabor filter bank to model the linear filtering function of the HVS. Let i = (x, y, t) denote a spatio-temporal location; R(i, k) denote the Gabor filtered original video signal, and D(i, k) denote the Gabor filtered distorted video signal, in which k = 1, 2, ..., K is the index of the Gabor filters. The decomposed signals are then used to estimate motion and compute the spatial and temporal MOVIE indexes.

[Fig. 17: Flow of MOVIE: Gabor filtering of both videos, motion estimation, spatial and temporal MOVIE calculation, and error pooling.]

– Spatial MOVIE Index

The local spatial MOVIE index is computed for a reference location i_0, with N sample signals within a window centered at i_0:

Q_S(i_0) = 1 - \frac{P \, E_S(i_0)/K + E_{DC}(i_0)}{P + 1}    (26)

in which E_S is the error index of the Gabor subbands and E_DC is the error index of the Gaussian subband; P is the scale of the Gabor filters and K is the number of Gabor filters. E_S(i_0) is calculated as

E_S(i_0) = \frac{1}{2N} \sum_k \sum_n \left[ \frac{R(i_n, k) - D(i_n, k)}{C_1 + E(i_0, k)} \right]^2    (27)

in which E(i_0, k) measures the local energy. E_DC(i_0) is calculated in a similar manner.

– Motion estimation
Motion information is extracted from the original video based on the Fleet and Jepson algorithm [105], and is used for the temporal MOVIE calculation.

– Temporal MOVIE Index
The idea of the temporal MOVIE index is to compute a weighted sum of the Gabor filtered signals: if the distorted video has the same motion (speed and direction) as the original video, the weight is strongly positive, and vice versa.

Q_T = 1 - \frac{1}{N} \sum_n (v_n^r - v_n^d)^2    (28)

in which v_n^r is the response of the original video to a mechanism that is tuned to its own motion, and v_n^d is the response of the distorted video to a mechanism that is tuned to the motion of the original video.

– Error pooling
The frame-level spatial and temporal MOVIE indexes are

F_S = \frac{\delta_{Q_S}}{\mu_{Q_S}}, \quad F_T = \frac{\delta_{Q_T}}{\mu_{Q_T}}    (29)

in which δ is the standard deviation and µ is the mean.

The final MOVIE index is

MOVIE = \frac{1}{M} \sum_m F_S(t_m) \cdot \sqrt{\frac{1}{M} \sum_m F_T(t_m)}    (30)


in which M is the number of frames.

TABLE VII: Structural Similarity based Models
CC: Correlation Coefficient; RMS: Root Mean Squared Error; OR: Outlier Ratio; SRCC: Spearman Rank-order Correlation Coefficient

Model | Year | Basis | Validation Database | CC | RMS | OR | SRCC
SSIM [76], [33] | 2002 | Luminance, contrast, structure | VQEG Phase I FR-TV | 0.967 | 5.06 | 0.041 | 0.963
Multi-scale SSIM [43] | 2003 | Luminance, contrast, structure, image details at different resolutions (multi-scale) | LIVE JPEG/JPEG2000 | 0.969 | 4.91 | 1.16 | 0.966
Video SSIM [87] | 2004 | SSIM at the local region level, the frame level, and the sequence level | VQEG Phase I FR-TV | 0.849 | N/A | 0.578 | 0.812
Spatial Weighted SSIM [89], [90] | 2004 | Minkowski SSIM, local quality-weighted SSIM, information content-weighted SSIM | LIVE JPEG/JPEG2000 | N/A | N/A | N/A | +0.01302
Wang et al. [88] | 2005 | Structural and non-structural distortions (luminance, contrast, gamma distortion, horizontal and vertical translation) | N/A | N/A | N/A | N/A | N/A
Speed Weighted SSIM [106] | 2007 | Luminance, contrast, structure, video motion | VQEG Phase I FR-TV | N/A | N/A | N/A | 0.8621
PF/FP-SSIM [91] | 2009 | Visual fixation-based weighting SSIM and quality-based weighting SSIM | LIVE JPEG/JPEG2000 | 0.9664 (SS), 0.9554 (MS) | 5.9383 (SS), 6.8245 (MS) | N/A | 0.9402 (SS), 0.9469 (MS)

NSS based Models:
Images and videos are natural scenes, whose statistical information differs from that of random signals; compression artefacts, however, result in unnaturalness. Natural Scene Statistics models [107], [108], combined with distortion models, can better quantify the statistical difference between the original and the distorted videos. Here, we introduce VIF, a widely cited NSS-based model.

• Video Visual Information Fidelity (VIF)
Video VIF evaluates visual fidelity by comparing the information that can be extracted by the brain from the original video and from the distorted video [97], as shown in Fig. 18. In the upper path in Fig. 18, the original video first passes through the distortion channel and then through the HVS, resulting in the distorted video. In the lower path in Fig. 18, the original video directly passes through the HVS, resulting in the reference video. The quality of the video can be represented by the amount of information that the brain can extract from the video. Let S represent the original video, D the distorted video, and R the reference video.

[Fig. 18: Flow of Video VIF: the original video passes through the distortion channel and the HVS to give the distorted video, and through the HVS alone to give the reference video.]

R = S + N, \quad D = aS + B + N'    (31)

in which N and N' are the visual noises from the HVS channel, which can be approximated as additive white Gaussian noise. The response of the distortion channel is aS + B, in which a = {a_i, i ∈ I} is a deterministic scalar gain (I represents all the spatiotemporal blocks) and B is a stationary additive zero-mean Gaussian noise. This simple model has proved effective in modeling the noise (by B) and blur (by a) effects in the distortion channel. For one channel, the information that can be extracted from the reference and distorted videos is as follows:

I_R = \frac{1}{2} \sum_{i \in I} \log_2 \left( 1 + \frac{s_i^2}{\delta_n^2} \right), \quad
I_D = \frac{1}{2} \sum_{i \in I} \log_2 \left( 1 + \frac{a_i^2 s_i^2}{\delta_b^2 + \delta_n^2} \right)    (32)

in which a_i is the distortion gain of the ith spatiotemporal block, s_i is the ith original spatiotemporal block, and δ_b and δ_n are the variances of the distortion noise B and the HVS noise N respectively. The video VIF is defined as the ratio between the information that can be extracted from the distorted video and that from the reference video over all channels:

VIF = \frac{\sum_{\text{all channels}} I_D}{\sum_{\text{all channels}} I_R}    (33)
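Under the block model of (31)-(33), the VIF ratio can be sketched as follows; the per-subband block magnitudes s_i, gains a_i and noise variances are assumed to have been estimated from a wavelet decomposition beforehand (that estimation step is not shown).

import numpy as np

def channel_information(s_blocks, a, var_b, var_n):
    """Reference and distorted information for one subband, eq (32)."""
    s2 = np.asarray(s_blocks, dtype=np.float64) ** 2
    a2 = np.asarray(a, dtype=np.float64) ** 2
    i_r = 0.5 * np.sum(np.log2(1.0 + s2 / var_n))                 # I_R
    i_d = 0.5 * np.sum(np.log2(1.0 + a2 * s2 / (var_b + var_n)))  # I_D
    return i_d, i_r

def video_vif(channels):
    """Eq (33): ratio of summed distorted/reference information.
    channels: iterable of (s_blocks, a, var_b, var_n) tuples, one per subband."""
    pairs = [channel_information(*ch) for ch in channels]
    return sum(d for d, _ in pairs) / sum(r for _, r in pairs)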

Structural Similarity based Models:
The objective of the structural similarity based models is to measure the similarity (fidelity) between the original video and the distorted video, based on the knowledge of the transmitter, the channel and the receiver [109]. Table VII shows examples of widely-used structural similarity based models.

• Structural SIMilarity (SSIM)
SSIM was first proposed in [76], then developed in [33], on the basis that the HVS is highly developed to capture the “structure” of an image. Therefore, SSIM measures the “difference of structure” between the original image and the distorted image, by taking into consideration the following three factors: luminance, contrast and structure. The luminance and contrast are mostly affected by the illumination of the environment, while the structure is the intrinsic feature of the object. Let x = {x_i, i ∈ I} and y = {y_i, i ∈ I} denote the original and the distorted signals, where I is the set of spatiotemporal blocks.

– Luminance is represented by the mean of the signal: µ_x = \frac{1}{|I|} \sum_i x_i, µ_y = \frac{1}{|I|} \sum_i y_i. The luminance index is

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}    (34)

in which C_1 is included to avoid a near-zero denominator.

– Contrast is represented by the standard deviation of the signal: δ_x = \sqrt{\sum_i (x_i - \mu_x)^2 / (|I| - 1)}, δ_y = \sqrt{\sum_i (y_i - \mu_y)^2 / (|I| - 1)}. Therefore, the contrast index is

c(x, y) = \frac{2\delta_x \delta_y + C_2}{\delta_x^2 + \delta_y^2 + C_2}    (35)

in which C_2 is included to avoid a near-zero denominator.

– Structure. The index to quantify the structural similarity is

s(x, y) = \frac{\delta_{xy} + C_3}{\delta_x \delta_y + C_3}    (36)

in which δ_{xy} = \sum_i (x_i - \mu_x)(y_i - \mu_y) / (|I| - 1), and C_3 is included to avoid a near-zero denominator. When SSIM was first proposed in [76], the parameters C_1, C_2 and C_3 were excluded, but they were soon added, because with C_1 = C_2 = C_3 = 0 the results become unstable when µ_x^2 + µ_y^2 or δ_x^2 + δ_y^2 is close to zero. The SSIM index is then calculated as

SSIM(x, y) = [l(x, y)]^{\alpha} [c(x, y)]^{\beta} [s(x, y)]^{\gamma}    (37)

in which α, β and γ are constant parameters. The SSIM index has the following ideal properties:

– Symmetric: SSIM(x, y) = SSIM(y, x).
– Bounded: SSIM(x, y) ≤ 1.
– Unique maximum: SSIM(x, y) is maximal only when x = y.

SSIM is calculated locally as in (37) for an 8 × 8 square window, which moves pixel-by-pixel to cover the whole image, resulting in an SSIM map. To avoid “blocking”, the calculation of the mean and standard deviation is weighted by a circular-symmetric Gaussian weighting function w = {w_1, w_2, ..., w_I}:

µ_x = \sum_i w_i x_i, \quad
δ_x = \sqrt{\sum_i w_i (x_i - \mu_x)^2}, \quad
δ_{xy} = \sum_i w_i (x_i - \mu_x)(y_i - \mu_y)    (38)

The SSIM index for the whole image is

SSIM(X, Y) = \frac{1}{N} \sum_j SSIM(x_j, y_j)    (39)

in which N is the number of windows, and x_j, y_j are the signals at the jth window.
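For concreteness, the sketch below computes (34)-(39) with α = β = γ = 1 and C_3 = C_2/2, which collapses the product in (37) to the familiar two-term form; to stay short it uses non-overlapping 8 x 8 windows and omits the Gaussian weighting of (38).

import numpy as np

def ssim_window(x, y, c1=6.5025, c2=58.5225):
    """SSIM of one window, eqs (34)-(37) with alpha = beta = gamma = 1 and
    C3 = C2/2; c1, c2 are the usual (0.01*255)^2 and (0.03*255)^2."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cxy = np.cov(x, y, ddof=1)[0, 1]
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_image(x, y, win=8):
    """Mean SSIM over non-overlapping win x win windows, a cheaper
    stand-in for the pixel-by-pixel sliding window of eq (39)."""
    scores = [ssim_window(x[i:i + win, j:j + win], y[i:i + win, j:j + win])
              for i in range(0, x.shape[0] - win + 1, win)
              for j in range(0, x.shape[1] - win + 1, win)]
    return float(np.mean(scores))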

• Multi-scale SSIM
A viewer's perceptibility of image details relies on the viewing conditions, such as the sampling density of the image, the distance between the viewer and the image, and the perceptual ability of the viewer's HVS. Choosing the right scale at which to evaluate the perceptual quality is therefore difficult. The single-scale SSIM is, accordingly, extended to multi-scale SSIM [43], summing up the influence of each scale with different weights to account for their relative importance. Assume there are K intended scales. The original and distorted images are repeatedly processed by a low-pass filter, which downsamples the image by a factor of 2; the number of repetitions is K. At the jth scale, the contrast index c_j(x, y) and structure index s_j(x, y) are computed, while the luminance index l_K(x, y) is computed only at the last iteration. Multi-scale SSIM is then calculated as:

SSIM(x, y) = [l_K(x, y)]^{\alpha_K} \prod_{j=1}^{K} [c_j(x, y)]^{\beta_j} [s_j(x, y)]^{\gamma_j}    (40)

in which α_j, β_j, γ_j can be adjusted for the different importance of each scale. In fact, the challenge of the method lies in determining the values of α_j, β_j, γ_j, j ∈ [1, K] and the number of scales K. One way is to refer to the contrast sensitivity function (CSF) of the HVS [110]; another way is to calibrate the values via subjective tests.
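Building on the ssim_image sketch above, multi-scale SSIM in the spirit of (40) can be sketched as below; the exponents are the five weights reported in [43], a 2 x 2 box filter stands in for the low-pass stage, and the per-scale contrast/structure terms are approximated by the full SSIM of each scale, all of which are simplifying assumptions.

import numpy as np

def ms_ssim(x, y, k=5, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """Multi-scale SSIM sketch, eq (40); reuses ssim_image() defined above."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    score = 1.0
    for j in range(k):
        score *= ssim_image(x, y) ** weights[j]
        if j < k - 1:
            h, w = (x.shape[0] // 2) * 2, (x.shape[1] // 2) * 2  # crop to even
            x, y = x[:h, :w], y[:h, :w]
            # 2x2 box filter followed by downsampling by 2
            x = (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0
            y = (y[0::2, 0::2] + y[1::2, 0::2] + y[0::2, 1::2] + y[1::2, 1::2]) / 4.0
    return float(score)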

• Video SSIM
The SSIM for images is extended to an SSIM for video sequences in [87]. The procedure for calculating the video SSIM is shown in Fig. 19.

[Fig. 19: Flow of Video SSIM: local level SSIM_ij, weighted by local luminance (w_ij), gives the frame level SSIM_i, which, weighted by frame motion (W_i), gives the sequence level SSIM.]

– Local Level SSIM is calculated for randomly sampled 8 × 8 windows in each frame, according to (37). The selection of windows is unlike that in image SSIM calculation, which exhausts all possible windows by moving pixel-by-pixel over the entire image. In video SSIM calculation, the number of sampled windows per frame should consider both computational complexity and evaluation accuracy. Local SSIM values for the Y, Cb and Cr color components are calculated and then combined as (for the jth window of the ith frame):

SSIM_{ij} = W^Y SSIM^Y_{ij} + W^{Cb} SSIM^{Cb}_{ij} + W^{Cr} SSIM^{Cr}_{ij}    (41)

in which W^Y, W^{Cb} and W^{Cr} are the weights for the Y, Cb and Cr color components.

– Frame Level SSIM is calculated as the weighted sum of the local level SSIM. The weight given to each local level SSIM is based on its luminance: high weights are given to high-luminance regions, as they are more likely to attract fixation. The frame level SSIM for the ith frame is:

SSIM_i = \frac{\sum_j w_{ij} SSIM_{ij}}{\sum_j w_{ij}}    (42)

in which the value of w_{ij} is determined as

w_{ij} = \begin{cases} 0, & \mu_x \le 40 \\ (\mu_x - 40)/10, & 40 < \mu_x \le 50 \\ 1, & \mu_x > 50 \end{cases}    (43)

in which µ_x is the mean of the Y components.
– Sequence Level SSIM is calculated as the weighted sum of the frame level SSIM. The weight given to each frame level SSIM is based on its motion with respect to the next frame. Low weights are given to large-motion frames, as experiments show that SSIM performs less stably on large-motion frames. A motion-related parameter M_i is defined as M_i = \sum_j m_{ij} / (16 N_i), in which m_{ij} is the motion vector of the jth window and N_i is the number of sampled windows in the ith frame. The sequence level SSIM is:

SSIM = \frac{\sum_i W_i SSIM_i}{\sum_i W_i}    (44)

in which the value of W_i is determined as

W_i = \begin{cases} \sum_j w_{ij}, & M_i \le 0.8 \\ (3 - 2.5 M_i) \sum_j w_{ij}, & 0.8 < M_i \le 1.2 \\ 0, & M_i > 1.2 \end{cases}    (45)

• Spatial Weighted SSIM
Instead of giving equal weight to each local level SSIM as in (39), three spatial weighting methods are proposed in [90] (a pooling sketch follows this list).

– Minkowski weighting gives high weights to windows with large distortions, since the HVS is more sensitive to poor quality. The Minkowski weighted SSIM is:

SSIM_{Minkowski} = \frac{1}{N} \sum_j SSIM_j^p    (46)

in which p is the Minkowski power.
– Local quality weighting also gives high weights to the windows with large distortions or poor quality, but through a function of the local quality index, which is more flexible than the Minkowski weighting. The local quality weighted SSIM is:

SSIM_{Quality} = \frac{\sum_j f(SSIM_j) SSIM_j}{\sum_j f(SSIM_j)}    (47)

in which f(·) is a (monotonic) function based on the local SSIM_j.

– Information content weighting also gives high weights to the windows with large distortions or poor quality, but through a weighting function computed from the image signals themselves. The information content weighted SSIM is:

SSIM_{Information} = \frac{\sum_j g(x_j, y_j) SSIM_j}{\sum_j g(x_j, y_j)}    (48)

in which g(x_j, y_j) is a function of the signal of the original image x_j and the signal of the distorted image y_j. In [89], the weighting function g(x_j, y_j) characterizes the local energy

g(x, y) = \delta_x^2 + \delta_y^2 + C    (49)

in which C is included to account for near-zero δ_x^2 + δ_y^2. In [90], the weighting function g(x_j, y_j) is defined based on the received information

g(x, y) = \log\left[ \left(1 + \frac{\delta_x^2}{C}\right) \left(1 + \frac{\delta_y^2}{C}\right) \right]    (50)
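Given an SSIM map with one value per window, the three weightings of (46)-(48) amount to different pooling rules; in this sketch the monotonic function f and the constant c are illustrative choices, not the calibrated ones of [89], [90].

import numpy as np

def minkowski_pool(ssim_map, p=4.0):
    """Eq (46): the Minkowski power p emphasises low-quality windows."""
    return float(np.mean(np.asarray(ssim_map) ** p))

def quality_weighted_pool(ssim_map):
    """Eq (47) with an assumed monotonic weighting f(s) = 1 - s."""
    s = np.asarray(ssim_map, dtype=np.float64)
    w = 1.0 - s
    return float(np.sum(w * s) / np.sum(w))

def information_weighted_pool(ssim_map, var_x, var_y, c=1e-3):
    """Eq (48) with the weighting of (50): per-window variances of the
    original (var_x) and distorted (var_y) signals set the weights."""
    s = np.asarray(ssim_map, dtype=np.float64)
    g = np.log((1.0 + np.asarray(var_x) / c) * (1.0 + np.asarray(var_y) / c))
    return float(np.sum(g * s) / np.sum(g))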

• Speed Weighted SSIM
Different from a set of still images, a video sequence contains motion information, which is used to adjust the SSIM in [106]. The basis of the speed weighting adjustment is the Bayesian human visual speed perception model [111], as shown in Fig. 20. The original video passes through the noisy HVS channel to obtain a noisy internal estimate of the motion, which is then combined with the prior probability distribution of the speed of motion to obtain the final estimated speed. Two kinds of speed are considered: v_g, the background speed, and v_m, the absolute speed minus v_g; v_m can be viewed as the motion of the moving object. The perception of speed includes the following two aspects:

[Fig. 20: Bayesian human visual speed perception model: the original speed v passes through the noisy HVS channel; the noisy internal estimate (likelihood) is combined with the prior distribution to give the estimated speed v*.]

– Information Content. High-speed motion acts as a surprisal for human vision, and is likely to attract more attention. The prior probability distribution of v_m is assumed to be τ/v_m^α (τ and α are two positive constants). The information content is computed as the self-information of v_m:

I = \alpha \log_e v_m - \log_e \tau    (51)

The information content increases with the speed of the object, which accords with intuition.

– Perception Uncertainty. The perception uncertainty is determined by the noise in the HVS channel. As shown in Fig. 20, given the true speed (approximated by v_g), the likelihood of the internal noise e follows a log-normal distribution. The perception uncertainty is computed as the entropy of this likelihood function:

U = \log_e v_g + \beta    (52)

in which β is a constant. The perception uncertainty increases with the background speed, meaning that the HVS channel cannot accurately process the video information if the background motion is too fast. β decreases with video contrast, meaning that high-contrast videos yield less uncertainty through the HVS channel.

Information content contributes to the importance of a visual stimulus, while perception uncertainty reduces its importance. Hence, the speed-related weight is represented as w = I − U, and the speed weighted SSIM is calculated as

SSIM_{speed} = \frac{\sum_x \sum_y \sum_t w(x, y, t) SSIM(x, y, t)}{\sum_x \sum_y \sum_t w(x, y, t)}    (53)

in which SSIM(x, y, t) is the SSIM index of the spatiotemporal region (x, y, t).

• PF/FP-SSIM
PF/FP-SSIM is a combination of visual fixation weighted SSIM (F-SSIM) and quality weighted SSIM (P-SSIM) [91], as shown in Fig. 21. The weight for a local SSIM is determined by its visual importance.

[Fig. 21: Flow of PF/FP-SSIM: local SSIM values are weighted by visual fixation (w_f) and by quality ranking (w_p), applied in either order.]

– Visual fixation weighted SSIM (F-SSIM). The areas which attract the most human attention, and upon which the eyes fix, are more important. For each image, ten fixation points are chosen according to the Gaze-Attentive Fixation Finding Engine (GAFFE) algorithm [112], and then the fixation areas are determined by a 2-D Gaussian function. The pixels within the fixation areas are given weight w_f > 1, while other pixels are given weight w_f = 1. The F-SSIM of the jth window is obtained by:

F\text{-}SSIM_j = \frac{\sum_{x \in J} \sum_{y \in J} SSIM(x, y) w_f(x, y)}{\sum_{x \in J} \sum_{y \in J} w_f(x, y)}    (54)

For multi-scale SSIM, the number of fixation points and the size of the fixation areas reduce with the scale level.

– Quality weighted SSIM (P-SSIM). Areas with “poor” quality capture attention more easily than areas with “good” quality. Therefore, the “poor” quality areas hurt the perceptual quality more than the “good” quality areas improve it. The windows are ranked according to their quality in ascending order; weight w_p > 1 is assigned to the lowest p% of items, and weight w_p = 1 to the others. In [91], p = 6 yields good results. For multi-scale SSIM, only the second-scale image is given the weight w_p.

– PF/FP-SSIM. PF-SSIM is obtained by first applying the quality weighting (P-SSIM) and then the visual fixation weighting; FP-SSIM is obtained by first applying the visual fixation weighting (F-SSIM) and then the quality weighting. F-SSIM and P-SSIM can also be computed separately.

Unfortunately, experiments show that only the P-SSIM gives significant improvements over the non-weighted SSIM [91].

B. Reduced Reference Model

We mainly introduce two kinds of reduced reference models: one is based on packet loss visibility (PLV), and the other is based on natural scene statistics.

1) Packet Loss Visibility Based Model: Packet loss visibility based models indirectly measure the loss of video quality by measuring the visibility of the packet loss. The major problem is to classify which kinds of packet loss are visible and which are invisible. Therefore, different classification techniques and different packet types have been explored to improve the classification accuracy. Table VIII gives a summary of the packet loss visibility based RR models. Packet loss visibility based models usually proceed as follows. Firstly, subjective tests are conducted, in which assessors are asked whether they see artifacts in the displayed video. Then, classification algorithms (known as classifiers) are applied to classify packet losses into visible or invisible classes, or regression models are applied to predict the probability of packet loss visibility, using the subjective test results as the ground truth and objective quality metrics as features.

In [113] and [114], the location of the packet loss and the content of the video are considered as the major factors that influence the visibility of the packet loss. The following objective quality metrics are specified to characterize the location of the packet loss and the content of the video:

• Content-independent factors
– Temporal duration, that is, the number of frames affected by the packet loss. If the packet loss occurs in a B-frame, the influence will last only a single frame; however, if the packet loss occurs in an I-frame, the influence will last until the next I-frame.


TABLE VIII: Packet Loss Visibility based Models

Ref | Year | Codec | Objective quality metrics | Analysis tool
[113] | 2004 | MPEG-2 | Content-independent and content-dependent factors | CART
[114] | 2004 | MPEG-2 | Same as [113] | GLM
[115] | 2006 | MPEG-2 | More factors than [113] | GLM and CART
[116] | 2006 | H.264 | [113] + multiple packet loss factors | GLM
[117] | 2007 | MPEG-2 and H.264 | [113] + SSIM-based factors | GLM
[118] | 2007 | MPEG-2 and H.264 | [117] + camera motion and proximity to a scene change | Patient Rule Induction Method (PRIM)
[119] | 2010 | MPEG-2 and H.264 | [118] + Group-of-Picture structures | GLM

– Initial spatial extent, that is, the number of slices lost. Due to a single packet loss, the decoder may have to abandon one slice, two slices or the entire frame.

– Vertical position, that is, the index of the topmost slice affected by the packet loss. In a frame, from the top to the bottom, slices are indexed from 0 to 29. The location of the affected slice is considered since different regions of the picture capture different degrees of viewer attention.

Content-independent factors do not rely on video content, and can be extracted from the distorted videos.

• Content-dependent factors:
– Variance of motion and residual energy. These factors characterize the motion information of the video, which may mask the error and influence the visibility of the packet loss.
– Initial Mean Square Error is the mean squared error per pixel between the decoded videos with and without packet loss, considering only the pixels in lost slices.

Content-dependent factors can be estimated with the help of reduced information about the original videos from the encoder.

In [113], tree-structured data analysis based on the Classification And Regression Tree (CART) [120] is used to classify the visibility of the packet loss. However, tree-structured data analysis struggles to distinguish packet loss visibility near the threshold from that far from the threshold. Therefore, in [114], a Generalized Linear Model (GLM) [121] is used to predict the probability that the packet loss is visible to the viewer. Also in [114], two NR models are developed, in which the content-dependent factors are estimated from the distorted video. In [115], both CART and GLM are adopted and their performances are compared; more objective quality metrics are considered, including the type of the frame in which the packet loss occurs and the magnitude and angle of the motion. [116] extends [115] in two ways: H.264 is considered instead of MPEG-2, and multiple packet losses are considered instead of isolated ones. Multiple packet loss is considered because packet loss is usually bursty, and multiple packet losses may correlate with each other. More specifically, in [116], dual packet loss is considered, characterized by the spatial and temporal separation of the two packet losses. In [117], SSIM is adapted for RR and NR models to predict the visibility of packet loss (SSIM is originally an FR model). In [118], scene-level factors, specifically camera motion and proximity to a scene cut, are considered, and the Patient Rule Induction Method (PRIM) is used to decide the visibility of a packet loss. It is found that global camera motion increases the packet loss visibility compared with a still camera, and that packet loss near a scene cut is less visible. In [119], different Group-of-Picture (GoP) structures (e.g., IBBP) are considered for prediction, and the model is applied to packet prioritization, allowing the router to decide which packets to drop when the network is congested.
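In the spirit of the GLM-based studies above, a visibility predictor can be sketched with off-the-shelf logistic regression (a GLM with a logistic link); the feature vectors and labels below are placeholders standing in for real subjective-test data, not values from the cited papers.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each packet loss event is described by features such as [temporal
# duration, initial spatial extent, vertical position, motion variance,
# residual energy, initial MSE]; the label records whether assessors
# reported a visible artifact. Values here are illustrative only.
X = np.array([[1.0, 1, 25, 0.2, 0.1, 3.0],
              [15.0, 2, 5, 1.5, 0.9, 12.0],
              [4.0, 1, 12, 0.6, 0.3, 6.0]])
y = np.array([0, 1, 0])          # 0 = invisible, 1 = visible

glm = LogisticRegression().fit(X, y)
p_visible = glm.predict_proba(X)[:, 1]   # predicted visibility probability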

One of the problems of the PLV based models is that quality degradation is simply classified as visible or invisible, without further quantification of how severe the quality degradation is. PLV based models may therefore be used for preliminary quality evaluation.

2) NSS based Model: The NSS based models assume that real-world images and videos are natural scenes, whose statistical features will be disrupted by distortions. The comparison of the statistics of the original image and the distorted image can be used to quantify the quality degradation. The survey paper [133] offers a good introduction to NSS based RR and NR models. Table IX gives a summary of the NSS based RR models. In this section, we introduce WNISM, recognized as the standard NSS based RR model, proposed in [122].

Let p(x) and q(x) be the probability density functions of the wavelet coefficients in the same subband of the original image and the distorted image respectively. According to the law of large numbers, the difference of the log-likelihoods between p(x) and q(x) asymptotically approaches the Kullback-Leibler distance (KLD) [134] between p(x) and q(x), denoted by d(p||q):

d(p\|q) = \int p(x) \log \frac{p(x)}{q(x)} dx    (55)

While q(x) can be easily extracted from the distorted image at the receiver, p(x) should be extracted from the original image, and transmitting p(x) as an RR feature is costly. Fortunately, it is found that p(x) can be approximated by a 2-parameter generalized Gaussian density model (GGD) as:

p_m(x) = \frac{\beta}{2\alpha \Gamma(1/\beta)} e^{-(|x|/\alpha)^{\beta}}    (56)

where Γ(·) is the Gamma function. Also, the KLD between p_m(x) and p(x) is computed as

d(p_m\|p) = \int p_m(x) \log \frac{p_m(x)}{p(x)} dx    (57)

For each subband, based on the RR feature {α, β, d(p_m\|p)}, the KLD between p(x) and q(x) can be approximated as

d(p\|q) = d(p_m\|q) - d(p_m\|p)    (58)


in which d(p_m\|q) can be calculated at the receiver side as

d(p_m\|q) = \int p_m(x) \log \frac{p_m(x)}{q(x)} dx    (59)

TABLE IX: NSS based RR Models
CC: Correlation Coefficient; RMS: Root Mean Squared Error; OR: Outlier Ratio; SRCC: Spearman Rank-order Correlation Coefficient

Model | Year | Basis | Validation Database | CC | RMS | OR | SRCC
WNISM [122], [123] | 2005 | KLD, wavelet-domain | LIVE database | 0.8226 | N/A | 0.2311 | 0.8437
RRIQA [124] | 2009 | KLD, DNT-domain | LIVE database, Cornell-A57 [46] | 0.9173 | 0.1069 | N/A | 0.9287
βW-SCM [125] | 2010 | Steerable pyramid decomposition, Weibull distribution | LIVE database | 0.8353 | N/A | N/A | 0.8391
RRED [126] | 2011 | Scaled entropy, wavelet-domain | LIVE database | N/A | N/A | N/A | 0.9580
GSM model [127] | 2012 | KLD, Tetrolet-domain | Cornell-A57 | 0.70 | N/A | N/A | 0.74
RR-SSIM [128] | 2012 | SSIM, KLD, DNT-domain | LIVE database, Cornell-A57, IVC database [129], Toyama-MICT database [130], TID2008 [131], CSIQ database [132] | 0.9194 | N/A | N/A | 0.9129

Finally, the distortions in all subbands are aggregated, and the overall distortion metric is obtained as:

D = \log_2 \left( 1 + \frac{1}{D_0} \sum_{k=1}^{K} |d_k(p_k\|q_k)| \right)    (60)

in which D_0 is a constant parameter; p_k and q_k are the probability density functions of the kth subband of the original image and the distorted image respectively; and d_k is the KLD estimate between p_k and q_k.
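A sketch of the receiver-side computation follows: scipy's gennorm (whose density matches the GGD of (56)) is fitted to the reference-subband coefficients as p_m, d(p_m||q) of (59) is evaluated numerically against a histogram estimate of q, and the subband KLDs are pooled as in (60). In a full RR deployment the transmitted d(p_m||p) would be subtracted per (58); the binning and D_0 below are assumed values.

import numpy as np
from scipy.stats import gennorm

def subband_kld(coeffs_ref, coeffs_dist, bins=101):
    """Fit a GGD p_m to reference coefficients (eq (56)) and evaluate
    d(p_m || q) of eq (59) against a histogram estimate of q."""
    beta, loc, alpha = gennorm.fit(coeffs_ref, floc=0)
    q_hist, edges = np.histogram(coeffs_dist, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    pm = gennorm.pdf(centers, beta, loc=loc, scale=alpha)
    eps = 1e-12                                   # guard against log(0)
    return float(np.sum(pm * np.log((pm + eps) / (q_hist + eps))) * width)

def wnism_distortion(subband_klds, d0=0.1):
    """Overall distortion, eq (60); d0 is an assumed constant."""
    return float(np.log2(1.0 + np.sum(np.abs(subband_klds)) / d0))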

In [123], the authors introduced the concept of the quality-aware image, in which the RR information is encoded as invisible hidden messages; after decoding, these hidden messages can help compute the quality metric. In [124], it is noted that linear image decompositions, such as the wavelet transformation, cannot reduce the statistical dependence between neuronal responses. Therefore, the divisive normalization transform (DNT), a nonlinear decomposition, is leveraged as the image representation. Instead of using the KLD, in [126] the quality metric is computed as the average difference between the scaled entropies of the wavelet coefficients of the original image and the distorted image. In [127], the Tetrolet transform of both the original image and the distorted image is used to better characterize local geometric structures; subbands are modeled by a Gaussian Scale Mixture (GSM) to account for the statistical dependencies between tetrolet coefficients. In [125], the coefficients with maximum amplitude, instead of all coefficients, are used to obtain the RR metric by fitting them with a Weibull distribution. In [128], an SSIM-like metric largely based on [133], [124] and structural similarity is developed.

Apart from the above mentioned PLV based and NSS based RR models, there are some other models worth noting. In [135], the blockiness and blurriness features are detected by harmonic amplitude analysis, and the local harmonic strength values constitute the RR information for quality estimation. In [136], [137], the RR models are based on the HVS characteristics, more specifically the contrast sensitivity function (CSF); the images are decomposed by the contourlet transform in [136], and by the grouplet transform in [137]. The quality criterion C4 in [138] first models the HVS characteristics with respect to color perception, the CSF, psychophysical subband decomposition and masking effect modeling, and then extracts the structural similarity between the original image and the distorted image to obtain the final RR metric.

C. No Reference Model

No reference models can meet the demand of real-time QoE monitoring. However, they are hard to develop since there is no access to the original video. Therefore, much effort has been put into mapping network statistics (e.g., packet loss rate, bandwidth), which can be obtained from simple measurement, and application-specific factors (e.g., encoding bitrate, packetization scheme) to the quality estimate. In this section, we introduce NR models according to Fig. 12. Note that the classification mostly depends on the major techniques or theoretical basis of each model, and may not be exclusive. In particular, the PLV based and NSS based NR models are extensions of their RR counterparts, and the bitstream-layer, packet-layer, and hybrid models are based on the accessible information of streamed videos. We focus here on the bitstream-layer, packet-layer, and hybrid models for streamed videos, since the PLV and NSS based models have already been explained in the previous section. Table X gives a summary of the NR models.

1) Bitstream-layer Models: A survey of bitstream-based models is given in [139]. We now introduce several typical bitstream-layer models.

• QANV-PA
Apart from coding factors and video motion information, QANV-PA further considers the temporal information and the influence of packet loss.

– Frame quality.
The QP parameter and the spatial and temporal complexity of the nth frame are included in the frame quality:

Q_n = f(q_n) + (b_3 - f(q_n)) \left( \left( \frac{\delta_{S,n}}{a_1} \right)^{b_1} + \left( \frac{\delta_{T,n}}{a_2} \right)^{b_2} \right)    (61)

in which f(q_n) is a linear function of the QP parameter q_n, δ_{S,n} and δ_{T,n} are the spatial and temporal complexity, respectively, and a_1, a_2, b_1, b_2, b_3 are constant parameters.

TABLE X: No Reference Models
CC: Correlation Coefficient; RMS: Root Mean Squared Error; OR: Outlier Ratio; SRCC: Spearman Rank-order Correlation Coefficient

NSS based models:
BLIINDS [14] | 2012 | Natural scene statistics, DCT-domain | LIVE database | SRCC 0.821
BLIINDS-II [15] | 2012 | Natural scene statistics, DCT-domain | LIVE database, TID2008 | CC 0.9232, SRCC 0.9202

Bitstream-layer models:
QANV-PA [16] | 2010 | Quantization, packet loss and error propagation, temporal features of HVS | Standard test sequences | CC 0.912, RMS 0.014, OR 0.021, SRCC 0.913
C-VQA [17] | 2012 | Quantization, motion, bit allocation | LIVE database | CC 0.7927, SRCC 0.7720
NR-blocky [18] | 2012 | Blockiness of a base layer, coding parameters of its enhancement layers | 4 sequences | CC 0.8719

Packet-layer models:
Vq [19] | 2008 | Encoding bitrate, packet loss rate | 8 video sequences | CC 0.968, RMS 0.287
CAPL [20] | 2012 | Number of bits, frame type, temporal complexity, positions of lost packets | Standard test sequences | CC 0.941, RMS 0.358, OR 0.000, SRCC 0.935

Hybrid models:
rPSNR [21] | 2008 | Codec, loss recovery technique, encoding bitrate, packetization | Video clips | N/A
SVC quality model [22] | 2008 | Spatial and temporal information, network-bandwidth-determined SNR | 10 video sequences | N/A
Motion-based model [23] | 2008 | Bitrate, frame rate, content motion information | 2 sets of video sequences | SRCC 0.8190
Vq [24] | 2009 | Encoding bitrate, packet loss rate, spatial and temporal features | 8 video sequences | CC 0.96, RMS 0.37, OR 0.36
APM [25], [26] | 2011 | Startup delay, rebuffering, user-viewing activity | Flash videos | N/A
UMTS quality model [27] | 2012 | Content type, sender bitrate, block error rate, mean burst length | LIVE database | CC 0.912, RMS 0.014, OR 0.021, SRCC 0.913

– Packet loss influence
The degradation due to packet loss is characterized by the parameter p_n, which depends on the number of frames affected by the packet loss and on the temporal complexity δ_{T,n}. Then, the quality metric becomes:

Q'_n = Q_n - p_n    (62)

– Temporal pooling
The quality factors of the frames are integrated by temporal pooling:

QANV\text{-}PA = \frac{\sum_{n \in D} Q''_n T_n}{\sum_{n \in D} T_n}    (63)

in which D is the set of successfully decoded frames, T_n is the duration of the nth frame, and Q''_n is the contribution of the quality of the nth frame to the entire video:

Q''_n = Q'_n (a_4 + b_4 \delta'_{T,n} + c_4 \delta'_{T,n} \log(T_n))    (64)

in which δ'_{T,n} = δ_{T,n} / \max(δ_T) is the normalized temporal complexity.

• C-VQA
Three factors, the quantization parameter factor, the motion factor and the bit allocation factor, are calculated and then combined to form C-VQA.

– Quantization parameter (QP) factor.
The quantization process causes loss of temporal and spatial information: the higher the QP, the more severe the quality degradation. The QP factor is computed as:

F_Q = (aC_n + b) c^q    (65)

in which a, b, c are constants, q is the average QP over n consecutive frames, and C_n is the feature parameter of the n frames, including width, height and so on.

– Motion factor.
The motion factor accounts for global motion consistency and local motion consistency. Global motion consistency M_g is calculated based on the variance of the horizontal and vertical motion vectors of moving objects (as opposed to the stationary background). Local motion consistency M_l is calculated based on the absolute difference of the motion factors between successive macroblocks. The motion factor is the combination of the two:

F_m = M_g + M_l    (66)

– Bit allocation factor.
Bitrate control is applied to streamed video because the bitstream is restricted by limited bandwidth. The effectiveness of the bitrate control scheme is characterized by the factor C_r, and the bit allocation factor is calculated as follows:

F_B = V_B \times C_r    (67)

in which V_B is the variance of the bit consumption of the macroblocks.

Finally, C-VQA is a weighted sum of the QP factor, the motion factor, and the bit allocation factor:

C\text{-}VQA = \theta(\alpha F_Q + \beta F_M + \gamma F_B + \eta)    (68)


in which F_Q, F_M, F_B are the average values over N frames.

2) Packet-layer Models: The packet-layer models use only the information in the packet header for quality estimation, without depending on information from the payload. For packets whose payload is encrypted, packet-layer models are more applicable.

• Vq

Vq is a simple packet-layer model, which estimates the quality as affected by the packet loss rate. Firstly, the video quality when there is no packet loss is estimated:

V_q|_{PL=0} = 1 + I_c    (69)

in which I_c is a function of the bitrate B_r:

I_c = a_1 - \frac{a_1}{1 + (B_r/a_2)^{a_3}}    (70)

in which a_1, a_2, a_3 are constant parameters. When the packet loss rate PL is non-zero, the video quality is fitted by an exponential function:

V_q = 1 + I_c \exp\left( -\frac{PL}{a_4} \right)    (71)

in which PL is the packet loss rate and a_4 is a constant (a short sketch of this model is given after the CAPL bullet below).
• CAPL

CAPL is developed based on the bitstream-layer model QANV-PA. However, due to the lack of payload information, the frame quality Q_n and the temporal complexity δ_{T,n} are computed differently:

Q_n = 1 + a_1 \left( 1 - \left( \frac{R_n}{a_2 \delta_{T,n} + b_2} \right)^{-b_1} \right)    (72)

in which a_1, a_2, b_1, b_2 are constant parameters and R_n is the average number of bits allocated to a frame in a Group of Pictures (GoP). For a packet-layer model, the motion vectors used to compute the temporal complexity are not available. Therefore, the temporal complexity is estimated as follows:

\delta_{T,n} = |R_{P,n}/R_{I,n} - a_3|    (73)

in which R_{P,n} and R_{I,n} are the average bit allocations for the P frames and I frames in a GoP respectively, and a_3 is a constant. After calculating Q_n, the packet loss influence and temporal pooling processes are similar to those of QANV-PA.
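The Vq model of (69)-(71) is compact enough to sketch directly; a_1 through a_4 are fitted constants in the original model, so the defaults below are placeholders for illustration only.

import numpy as np

def vq(bitrate, loss_rate, a1=4.0, a2=500.0, a3=1.2, a4=0.05):
    """Packet-layer Vq, eqs (69)-(71); the constants are placeholders."""
    ic = a1 - a1 / (1.0 + (bitrate / a2) ** a3)   # coding quality, eq (70)
    return 1.0 + ic * np.exp(-loss_rate / a4)     # loss penalty, eq (71)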

3) Hybrid Models:
• rPSNR
rPSNR is a light-weight no reference model, focusing on the relationship between packet loss and QoE, while also considering the video codec, loss recovery technique, encoding bitrate, packetization, and content characteristics. Video distortion (denoted by D) is measured through the Mean Square Error (MSE), which is derived as a function of packet loss as follows:

D = P_e f(n) L D_1, \quad PSNR = 10 \log_{10} \frac{255^2}{D}    (74)

TABLE XI: User-viewing Activities

Activity: Viewer experience indication
Pause: Negative; needs more buffering time
Resume: Unclear; continues playing after a pause
Reload the page: Negative; the quality is bad, and it may help to allow the player to choose another server from the CDN
Switch to a lower video quality: Negative; a lower-quality video may play more smoothly
Switch to a higher video quality: Positive; a higher-quality video may be supported by the current speed
Full screen: Positive
Return to normal size: Negative
Minimize the video: Unclear; maybe the viewer just runs the video in the background
Forward: Unclear; maybe the video is not interesting
Backward: Unclear; maybe replaying the buffered part can achieve smoother playback
Frequent/infrequent mouse movement: Unclear; maybe frequent mouse movement indicates annoying QoS

in which Pe is the probability of a packet loss event in the video stream; f(n) is the average number of slices affected by a loss event; L is the number of packets used for transmitting one frame; and D1 is the total average distortion caused by losing a single slice. f(n) differs across codecs. For example, in MPEG-2, once a packet loss is detected in a frame, the entire frame is discarded and replaced by the previously-decoded frame. In H.264, more sophisticated error concealment is used: all slices are decoded, and the slices affected by packet loss are recovered using the corresponding slices in the previous frame and the motion information from other slices in the same frame. The estimation of D1 depends on the error propagation resulting from the loss of one slice, due to coding dependencies between frames. Pe and f(n) are network-dependent, and can be easily obtained from network statistics. L can be easily determined from the application configuration. However, D1 depends on individual video characteristics and may not be efficiently estimated when considering real-time quality monitoring of a large number of video streams. To tackle this problem, we can compare the quality of the video transmitted over a path with that transmitted over a reference path. A reference path is a transmission path whose QoE is known beforehand. Usually, we can select the path which generates the targeted QoE as the reference path, so that we know how much better or worse the actual path performs. Relative PSNR (rPSNR) is the difference between the monitored network path and the reference path:

rPSNR = PSNR − PSNR0 (75)

The resulting rPSNR is independent of D1 and is therefore easy to compute.
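The appeal of rPSNR is that the video-dependent term D1 cancels when two paths are compared. A small Python sketch of Eqs. (74)-(75) as reconstructed above, with a placeholder D1, illustrates this:

```python
import math

def psnr_from_loss(p_e, f_n, packets_per_frame, d1):
    """PSNR from packet-loss statistics (Eq. 74, as reconstructed above).

    p_e: probability of a loss event; f_n: average number of slices
    affected per event; d1: average distortion from losing one slice.
    """
    d = p_e * f_n * packets_per_frame * d1
    return 10.0 * math.log10(255.0 ** 2 / d)

# rPSNR against a reference path (Eq. 75). Both terms share the same D1,
# so any positive placeholder (here 1.0) cancels in the difference.
d1 = 1.0
rpsnr = psnr_from_loss(0.01, 2.0, 5, d1) - psnr_from_loss(0.02, 2.0, 5, d1)
print(rpsnr)  # ~3 dB advantage for the less lossy path
```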

• Application Performance Metrics (APM)
APM characterizes the impact of rebuffering events on the QoE of HTTP video streaming services. Unlike traditional UDP-based video streaming, HTTP-over-TCP video streaming does not suffer from frame loss. First,



Fig. 22: Timeline of the objective quality models. (The figure arranges the models along a 1996-2013 timeline: FR signal-driven models such as SSIM, MS-SSIM, LVQM, Video-SSIM, VIF, KVQM, PF/FP-SSIM and MOVIE; FR vision-based models such as MPQM, DVQ, PVQM, VSNR, MOSp and AFViQ; RR models such as PLV for MPEG-2 and/or H.264 and NSS wavelet-, DCT- and Tetrolet-domain models; and NR models, including the bitstream-layer QANV-PA, C-VQA and NR-blocky, the packet-layer Vq and CAPL, NSS DCT-domain BLIINDS, and hybrid models.)

network QoS metrics, such as the round-trip time (RTT), packet loss rate, and bitrate (determined by bandwidth), are used to estimate the three APM metrics: startup delay, rebuffering time and rebuffering frequency. Then, the APM metrics are fed into the prediction model to get the estimated MOS value. Linear regression is performed on the APM values and the MOS values obtained from subjective tests to get the QoE prediction model. The regression results show that the rebuffering frequency has the most significant influence on the QoE.
In [26], the above APM model is refined by incorporating the influence of user-viewing activities and resorting to logistic regression. It is observed that video impairment can trigger the viewer interactive activities listed in Table XI. Two major user-activity metrics, the number of pause events and the number of screen-size-reducing events, are put into the logistic regression model along with the three APM metrics. The results show an improved explanatory power of the regression model.
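As a rough illustration of this regression step, the sketch below fits a linear model from the three APM metrics to MOS with scikit-learn; the training arrays are made-up placeholders standing in for the subjective test data used in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-condition data: [startup delay (s), rebuffering time (s),
# rebuffering frequency] -> MOS label from a subjective test.
apm = np.array([[1.0, 0.0, 0], [2.5, 4.0, 1], [3.0, 9.0, 3], [5.0, 20.0, 6]])
mos = np.array([4.5, 3.8, 2.9, 1.7])

model = LinearRegression().fit(apm, mos)
# Coefficient magnitudes hint at which APM metric dominates the QoE;
# the study reports rebuffering frequency as the most influential.
print(dict(zip(["startup", "rebuf_time", "rebuf_freq"], model.coef_)))
print(model.predict([[2.0, 5.0, 2]]))  # MOS estimate for a new condition
```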

• UMTS Quality Metric
Video transmission over wireless networks, more specifically the Universal Mobile Telecommunication System (UMTS), is considered in [27], taking into account the distortions caused by the transmission network. Subjective tests are first conducted for different combinations of sender bitrate (SBR), block error rate (BLER), mean burst length (MBL) and content type (CT). SBR reflects the distortion from the encoder; both BLER and MBL reflect the distortions from the transmission network; CT is the content type in terms of temporal and spatial features, identified by the cluster analysis tool in [51]. Nonlinear regression on the subjective test results yields the following function:

MOS = [α + β ln(SBR) + CT × (γ + δ ln(SBR))] / [1 + (η × BLER + σ × BLER^2) × MBL] (76)

in which α, β, γ, δ, η and σ are regression parameters.
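Fitting Eq. (76) is a standard nonlinear least-squares problem. The sketch below uses scipy's curve_fit on made-up (SBR, BLER, MBL, CT) observations; real parameter values would come from the subjective tests described above.

```python
import numpy as np
from scipy.optimize import curve_fit

def umts_mos(X, alpha, beta, gamma, delta, eta, sigma):
    # Eq. (76): MOS as a function of SBR, BLER, MBL and content type CT
    sbr, bler, mbl, ct = X
    num = alpha + beta * np.log(sbr) + ct * (gamma + delta * np.log(sbr))
    den = 1.0 + (eta * bler + sigma * bler ** 2) * mbl
    return num / den

# Placeholder observations (SBR kb/s, BLER, MBL, CT class) and MOS labels
X = np.array([[128, 0.01, 1.0, 0], [384, 0.05, 2.0, 1], [768, 0.10, 1.5, 2],
              [256, 0.02, 1.0, 1], [512, 0.20, 3.0, 0], [192, 0.08, 2.5, 2],
              [640, 0.01, 1.0, 1], [320, 0.15, 2.0, 0]]).T
mos = np.array([3.1, 3.4, 3.2, 3.6, 2.1, 2.5, 4.0, 2.4])
params, _ = curve_fit(umts_mos, X, mos, p0=np.ones(6), maxfev=20000)
print(dict(zip(["alpha", "beta", "gamma", "delta", "eta", "sigma"], params)))
```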

In Fig. 22, we show the timeline of all the major objective quality models introduced in this section. We can see several trends in the evolution of objective quality models.

• From FR models to NR models. As the need for real-time QoE monitoring and prediction becomes increasingly urgent, more and more NR models are being proposed. At the same time, FR models continue to be developed thanks to a better understanding of the HVS and other related areas.

• From image to streamed video. Previously, many models were first designed for image quality assessment and then extended to video quality assessment. The development of video streaming services motivates research on streamed-video quality assessment that depends on information extracted from the packet header or payload.

D. Performance Validation

The output of the objective quality model should be well correlated with the subjective results, which are regarded as the ground truth for user QoE. The Video Quality Experts Group (VQEG) gives a test plan [13] for validating objective quality models. The relationship between the output of the objective quality model and the results of the subjective test is usually estimated by a nonlinear regression function. It does not matter what form of nonlinear function is used, as long as it is monotonic, applicable to a wide range of video content, and has a minimum of free parameters. Multiple forms of nonlinear functions are tried to find the best-fitting one. Let VQR denote the output of the objective quality model; MOSp denote the MOS value predicted by the regression function; and MOSnorm denote the normalized output of the subjective test:

MOSnorm = (MOS − MOSmin) / (MOSmax − MOSmin) (77)

The following are some of the most commonly used nonlinear regression functions, fitted to the data [VQR, MOSnorm].

• Simplistic logistic function.

MOSp = 1 / (1 + exp(C0(VQR − C1))) (78)

For ease of analysis, the above function can be transformed into the linear form loge(1/MOSp − 1) = C0(VQR − C1).

• Four-parameter cubic polynomial function

MOSp = C0 + C1×VQR + C2×VQR^2 + C3×VQR^3 (79)



• “Inverse” four-parameter cubic polynomial function

VQR = C0 + C1×MOSp + C2×MOSp^2 + C3×MOSp^3 (80)

• The 5-parameter logistic curve

DMOSp(VQR) = A0 + (A1 − A0) / (1 + A4 × (VQR + A5)/A3) (81)

Apart from MOS, a similar analysis can be performed on individual opinion scores (OS) and difference opinion scores (DOS). The performance of the objective quality model is evaluated from three aspects: prediction accuracy, monotonicity and consistency.

• Prediction Accuracy is represented by the Pearson linear correlation coefficient and the mean-square error (MSE). The Pearson linear correlation coefficient between two variables X and Y is:

ρX,Y = E[(X − E(X))(Y − E(Y))] / √([E(X^2) − (E(X))^2][E(Y^2) − (E(Y))^2]) (82)

The Pearson linear correlation coefficient quantifies the correlation between two variables. It takes values in [−1, 1], where −1 means total negative correlation, 0 means no correlation, and 1 means total positive correlation. The mean-square error (MSE) is:

MSE = (1/N) Σi (MOSp − MOS)^2 (83)

• Prediction Monotonicity is represented by the Spearman rank order correlation coefficient, which characterizes how well one variable can be represented as a monotonic function of the other. One merit of the Spearman rank order correlation coefficient is that no knowledge of the relationship (e.g., linear, logistic) between the two variables is required (it is nonparametric). Assume that we have N raw samples (X, Y). The Spearman rank order correlation coefficient is calculated as follows:

– Sort X and give rank number xi to the ith sample; e.g., if in the 1st sample the value of variable X is the 4th largest, then x1 = 4;

– Sort Y and give rank number yi to the ith sample; e.g., if in the 1st sample the value of variable Y is the 5th largest, then y1 = 5;

– The Spearman rank order correlation coefficient ρ is

ρ = 1 − 6 Σi (xi − yi)^2 / (N(N^2 − 1)) (84)

The Spearman rank order correlation coefficient takes values in [−1, 1], where −1 means X can be represented as a monotonically decreasing function of Y, and 1 means X can be represented as a monotonically increasing function of Y.

• Prediction Consistency is represented by the outlier ratio.

TABLE XIII: Differences between the LIVE database and the VQEG Phase I FR-TV database

Attribute: LIVE database / VQEG FRTV-I database
Codec: H.264 and MPEG-2 / H.263 and MPEG-2
Video format: 50Hz / 50Hz, 60Hz
Number of source video sequences: 10 / 20
Encoding bitrate: 500 kb/s, 1 Mb/s, 1.5 Mb/s, 2 Mb/s / 768 kb/s, 1.5 Mb/s, 2 Mb/s, 3 Mb/s, 4.5 Mb/s, 6 Mb/s, 8 Mb/s, 12 Mb/s, 19 Mb/s, 50 Mb/s
Packet loss rate: 0.5%, 2%, 5%, 17% / N/A
Subjective test method: SSCQS / DSCQS
Subjective test score: DMOS / DMOS

Outlier ratio = (number of outliers) / N (85)

in which N is the total number of samples, and an outlier is a point for which |MOS − MOSp| > 2 × (Standard Error of MOS).
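The three aspects above can be computed in a few lines. The sketch below uses scipy and, as a simplification, estimates the MOS standard error from the whole sample rather than per test condition as in the VQEG plan:

```python
import numpy as np
from scipy import stats

def validate(mos, mos_p):
    """Accuracy, monotonicity and consistency metrics (Eqs. 82-85)."""
    pearson, _ = stats.pearsonr(mos_p, mos)    # prediction accuracy
    spearman, _ = stats.spearmanr(mos_p, mos)  # prediction monotonicity
    mse = np.mean((mos_p - mos) ** 2)
    # Consistency: outliers deviate by more than twice the MOS standard
    # error (simplified here to a single sample-wide standard error)
    se = np.std(mos, ddof=1) / np.sqrt(len(mos))
    outlier_ratio = np.mean(np.abs(mos - mos_p) > 2 * se)
    return pearson, spearman, mse, outlier_ratio

mos = np.array([1.2, 2.0, 2.9, 3.5, 4.1, 4.6])    # subjective scores
mos_p = np.array([1.5, 1.9, 3.1, 3.3, 4.3, 4.4])  # model predictions
print(validate(mos, mos_p))
```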

Furthermore, wide applicability and computational complexity are two other aspects by which to evaluate an objective quality model. It is ideal for the objective quality model to give relatively good predictions for a wide range of video content. However, there is no metric to evaluate the wide applicability of a model. Therefore, it is desirable to cover as many types of video content and test conditions as possible in the subjective test. It is recommended that at least 20 different video sequences be included.

E. Objective Quality Model Projects and Standards

1) VQEG Projects: The Video Quality Experts Group (VQEG), established in 1997 with experts from ITU-T and ITU-R study groups, carried out a series of projects to validate objective quality models. Their work has led to the inclusion of recommended objective quality models in International Telecommunication Union (ITU) standards for standard-definition television and for multimedia applications [140]. A subjective test plan is given for laboratories to carry out subjective tests, and the resulting database is used for validating the objective quality models' prediction power. An objective test plan is given to evaluate the submitted objective quality models with specified statistical techniques and evaluation metrics. The final report of each test summarizes the testing results and provides a detailed description of the subjective evaluation procedure, the proposed objective quality models, the evaluation criteria, and some discussion and comments. The subjective test sequences and corresponding scores are made accessible for researchers to validate their objective models. The validation test projects that have been completed by the VQEG are summarized in Table XII.

2) LIVE project: The Laboratory for Image and Video Engineering (LIVE) at the University of Texas at Austin, led by Prof. Alan C. Bovik, established the LIVE Video Quality Database, due to two deficits of the existing VQEG Phase I FR-TV database [29]:



TABLE XII: VQEG Completed Validation Tests

Project / Completion Date / Application Scope / Model type / Resulting ITU Recommendations
FRTV-I / June 2000 / SDTV / FR and NR / None
FRTV-II / August 25, 2003 / SDTV / FR and NR / ITU-T Rec. J.144 (2004); ITU-R Rec. BT.1683 (2004)
MM-I / September 12, 2008 / Multimedia / FR and RR / ITU-T Rec. J.247 (2008); ITU-T Rec. J.246 (2008); ITU-R Rec. BT.1866 (2010); ITU-R Rec. BT.1867 (2010)
RRNR-TV / June 22, 2009 / SDTV / RR and NR / ITU-T Rec. J.249 (2010); ITU-T Rec. J.340 (2010)
HDTV-I / June 30, 2010 / HDTV / FR, RR and NR / ITU-T Rec. J.341 (2011); ITU-T Rec. J.342 (2011)

• The VQEG database uses old-generation codecs such as H.263 and MPEG-2, while the more advanced H.264/MPEG-4 Part 10 codec may exhibit different distortion patterns.

• The VQEG database subjective test results are skewed towards high user scores (ideally, user scores would be uniformly distributed), suggesting that the processed video sequences have poor perceptual separation.

The LIVE database is publicly accessible with the aim to “enable researchers to evaluate the performance of quality assessment algorithms and contribute towards attaining the ultimate goal of objective quality assessment research - matching human perception” [141]. Table XIII summarizes the differences between the VQEG Phase I FR-TV database and the LIVE database. In the LIVE database, H.264 advanced video coding is used, and wireless network distortion is simulated. There are ten source sequences provided by Boeing, with a diversity of motions, objects and people. The encoded source sequences are regarded as the original versions, as it is claimed that the H.264 compression is visually lossless (with average PSNR greater than 45 dB). Each sequence is processed with a combination of 4 bitrates (500 kb/s, 1 Mb/s, 1.5 Mb/s, 2 Mb/s) and 4 packet loss rates (0.5%, 2%, 5%, 17%), resulting in 160 processed sequences. The subjective test results show that the DMOS values have good perceptual separation (i.e., the values are nearly uniformly distributed). Single stimulus continuous quality-scale (SSCQS) based on [57] is used as the subjective test method. To counteract individual bias, the original video sequences are inserted into the testing sequence; (score for the processed video) − (score for the original video) is then regarded as an unbiased score. The use of a continuous quality scale also breaks the limitation of the categorical quality scale used in the VQEG database. However, only a 60Hz refresh rate is considered (the VQEG Phase I FR-TV test includes both 50Hz and 60Hz).

F. Discussion

Objective quality models have a wide range of applications, including equipment testing (e.g., codec evaluation), in-service network monitoring, and client-based quality measurement. However, in [142], the author points out seven challenges facing the current objective quality models. Interested readers can refer to the original paper for more details.3

3 Though [142] limits the discussion to image quality assessment, the main points are still applicable to video quality assessment.

• Insufficient knowledge of the HVS and natural images. Most objective quality models only employ low-level HVS properties. Though VSNR leverages a mid-level HVS property (global precedence), the modeling of higher-level HVS properties is far from complete. Another problem is that visual neurons respond differently to simple, controlled stimuli and to natural images. This may affect masking results, in particular the contrast threshold. However, there is a lack of ground-truth data on local contrast detection thresholds for natural images.

• Compound and suprathreshold distortions. Compound distortions refer to distortions that stimulate more than one channel of the HVS multichannel system; suprathreshold distortions refer to distortions that are obviously visible. Existing near-threshold distortion analysis focuses on the visual detectability of the distortion. However, it is found that the visual detectability of distortions may not accord with viewers' perception of suprathreshold distortions [143], [144]. Therefore, models suitable for near-threshold distortions may not extend to account for suprathreshold distortions.

• Interaction of the distortion and the image. There are two different assumptions about the relationship between the distortion and the image. One is that the distorted image is a single stimulus (“overlay distortion”); the other is that the distorted image is a combination of two separate stimuli: the distortion added to the image (“additive distortion”). It is important to distinguish these two types of distortions.

• Interaction between distortions. One type of distortion may mask another type of distortion, which is known as cross-masking. Quantifying the interaction between distortions and their effect on image quality is needed.

• Geometric changes. It is argued that current objective quality models are bad at dealing with geometric changes. For example, a slight rotation of the objects has little impact on perceptual quality but will result in a lower quality estimate by the objective quality models.

• Evaluation of enhanced images. Image enhancement, such as noise reduction, color correction and white balancing, may in turn make the original image seem inferior. One way to evaluate an enhanced image is to treat the original image as “distorted” and the enhanced image as “original”, and then apply existing objective quality models.



Fig. 23: A typical video watching session. (States: startup, playing, rebuffering; events: viewer request, buffer filled up, buffer depleted, viewer quit, video complete.)

The feasibility of such a method still needs to be verified.
• Efficiency. Efficiency concerns include running time and memory requirements.
Apart from the above challenges, we also have the following comments on the objective quality models.
• Full reference models are impractical for real-time QoE prediction and monitoring, because of their complexity and the need to access the original video. Reduced reference models, though not needing access to the original video, require extra resources (e.g., a side channel) to transmit the extracted information of the original video. Psychophysical models based on the mechanisms of the HVS, though correlating well with subjective MOS scores, often have high complexity. Engineering models usually have lower complexity and can be calibrated using subjective test results.

• All existing objective quality models compare their predicted QoE with MOS scores to evaluate their performance. The MOS scores are obtained from subjective tests, which are limited in the test video types, the number of human assessors, and the test conditions. Therefore, an objective quality model whose predicted QoE is close to one set of MOS scores from a particular subjective test may not perform equally well against another set of MOS scores obtained from a different subjective test.

V. DATA-DRIVEN QOE ANALYSIS

The dramatic development of video distribution over the Internet makes massive data available for analysis, and triggers a new research interest in data-driven QoE assessment. Commercial broadcast television corporations (e.g., FOX, NBC) and on-demand streaming video service providers (e.g., Netflix, Hulu) now provide millions of videos online. Improving user QoE is crucial to the service providers and network operators, since small changes in viewer behavior can lead to whopping changes in monetization opportunities due to the huge viewer base on the Internet.

To begin with, we give a detailed description of a typical video viewing session, based on which we introduce the QoE and QoS metrics considered by current data-driven QoE-related works. Define a viewer as a specific identifiable user who watches videos through the service of a provider; define a view as the event that a viewer watches a specific video; define a visit as the event that a viewer continually watches a series of videos from a specific website. Two visits are separated by a period of inactivity exceeding a time threshold. Fig. 23 shows a typical video watching session.

A viewer initiates a video request, and the video player establishes a connection to the server. A certain amount of data has to be downloaded into the buffer before the video starts playing (startup state). During playing, the video player fetches data from the buffer while downloading more data from the server (playing state). If the rate of consuming the data exceeds the rate of downloading (e.g., due to a poor connection), the buffer will be exhausted. In this case, the video player has to pause to fill its buffer to a certain level before starting to play again (rebuffer state). The viewer therefore experiences interruptions during this period. During the video session, viewers may also take interactive actions such as pausing, fast-forwarding, rewinding, changing the resolution or changing the screen size. A view may end in one of four manners.

• Abandoned view. The viewer voluntarily quits during the startup state, and does not watch any of the video.

• Aborted view. The viewer watches a certain part of the video, but voluntarily quits during the playing state or rebuffer state before the video completes.

• Failed view. The requested video involuntarily ends due to a failure of the server, the connection or the video content.

• Complete view. The view ends when the video is completely watched.

Except for the complete view, the three other cases may be the result of user dissatisfaction, which may be caused by poor video quality, the user's lack of interest in the video content, or an external interruption (e.g., a mobile user on a train reaches the destination). The following metrics are often used to represent user QoE by quantifying user engagement with the video service:

• View-level metrics, which regard the engagement of each video viewing session.

– Viewing time per view: the actual time that a user watches a video. Usually, the ratio of the viewing time to the total duration of the video is used as an indicator of user engagement.

– Abandoned view ratio: the percentage of views that are voluntarily abandoned by the viewers during the startup state.

• Viewer-level metrics, which regard the engagement of each viewer.

– Number of views: the number of video clips a user watches within a certain time period on a certain website.

– Viewing time per visit: the total time a user spends watching videos during a visit to a certain website.

– Return rate: the percentage of viewers who visit the same website again within a specified time period. The return rate indicates the likelihood that a user will visit the video website in the future.

– Video rating. Many video websites enable users to rate the video. For example, YouTube uses a scale of 0-5 “stars”; Youku and Tudou have a “Thumb-up” or “Thumb-down” choice.

These measurable QoE metrics are also directly related to the service providers' business objectives. For example,



TABLE XIV: Measurement Study on User Behavior Research

Scenario / User behavior / QoS factors / External factors
[145] Online VoD service / Viewing ratio distribution, number of views per viewer / Rate of buffering / Video length, video popularity, time of the day
[146] YouTube / Number of unique users and requests, viewing time, viewer rating / HTTP request methods, video bitrate, file size / Content type, video release and update time, viewer geography, temporal features
[147] P2P IPTV system / Number of peers, peer arrival and departure rate / TCP connection, video traffic characteristics / TV channel popularity, peer geography, temporal features
[148] Live VoD system / Number of daily accesses, viewing time ratio / N/A / Video length, video popularity, temporal features
[149] YouTube / Resolution switch, video download ratio, viewing time ratio / Flow size, startup delay / Video length, size, bitrate and format, device type
[150] Online VoD service / Viewing ratio distribution, seeks, number of views per viewer / Rate of buffering / Content type, video popularity index
[151] Mobile video service / Subjective rating / Video quality, bandwidth, startup delay, rate of buffering, RTT / N/A

for an advertisement-supported video service, if the viewing time is longer, more ads can be played to the viewers; for a subscription-supported video service, better QoE can reduce the viewer churn rate.

While still considering the influential factors described in Section II, current data-driven QoE research is more focused on the following QoS metrics.

• Startup delay, also called join time. As shown in Fig. 23, the join time is the time between the user's video request and the moment the video actually begins playing, during which the buffer is being loaded.

• Rebuffering. The encoded video stream is temporarily put into a buffer to be played back later. As shown in Fig. 23, when the buffer is depleted, the player pauses to rebuffer. There are two ways to quantify rebuffering events.

– Rebuffering time ratio: the ratio of the total time spent rebuffering to the total viewing time.

– Number of rebuffering events. If rebuffering happens frequently but each rebuffering is very short, the rebuffering time ratio is low, yet such intermittent playback may still annoy the viewer. The number of rebuffering events characterizes the frequency of rebuffering.

• Average bitrate at which the video is rendered on the screen for the viewer. The rendered bitrate depends on the video encoding bitrate, the network connectivity and the bitrate-switching heuristics employed by the media player.

In the rest of this section, we first introduce the earlier work of video measurement studies on user behavior, as summarized in Table XIV, and then introduce three recent directions of data-driven QoE analysis, as summarized in Table XV.

A. Measurement Study on User Behavior in Video Service

Large-scale measurement studies have long been carried out to study general user behavior in various video services, including online VoD services [145], [150], live VoD [148], P2P IPTV systems [147], YouTube traffic [146], [149], [153] and mobile video services [151]. A survey of user behavior in P2P video systems was recently given by [154]. In this section, we first identify the general user behaviors revealed by these measurement studies, and then introduce a decision-theoretic approach to modeling user behavior.

1) General User Behavior: We discuss the following user behaviors that have been examined by many measurement studies.

• Early quitter phenomenon / video browsing. It is found that most video sessions are terminated before completion [55], [148], [150]. More specifically, many viewers quit the video session within the first short period of time. One explanation for this early-quitter phenomenon is that a viewer browses several videos before dedicating himself to a specific one that interests him. The video browsing behavior is intensively studied in [150]. It is found that viewers often use seeks (jumping to a new part) to browse a video, and that viewers are more likely to browse popular videos first due to recommendation. Another problem caused by early quitting is that the downloaded video data exceeds the watched video data, resulting in data waste, which is found to be more severe for players on mobile devices than on computers [149].

• Temporal user access pattern. It has been confirmed in many papers that user access has a clear and consistent daily or weekly pattern [55], [145], [146], [148]. The diurnal viewing pattern is also found in P2P video systems [147].

• Video quality metrics. Three video quality metrics, i.e., startup delay, rebuffer events, and encoding bitrate, are most commonly characterized by their cumulative distribution functions [145], [147], [149]. In particular, the impact of rebuffering time is studied in [151] through a subjective-test-like experiment. Each assessor watches preassigned videos with different bandwidth, quality and rebuffering time combinations in a mobile context, and then answers questionnaires to express their experience. Finally, the relationship between the rebuffering time and viewers' acceptance of the video quality is fitted by a logistic regression model.

• Video popularity. It is found that video popularity can be approximated by the Pareto Principle, or 80-20 rule. That is to say, a few top videos account for most of



TABLE XV: Data-driven Video QoE Research

Work / QoE metrics / QoS metrics / External factors / Method / Data collection (size; source; method; duration)
[48] / Viewing time, number of views, total viewing time / Average bitrate, startup delay, buffering time ratio, rate of buffering, rendered quality / Video length (short or long), VoD or live / Linear regression / 300 million views per week; Conviva; client-side instrumentation library; Fall 2010
[30], [53] / Viewing time / Average bitrate, startup delay, buffering time ratio, rate of buffering / Video type, time of day, device, connectivity / Decision tree / 40 million views; Conviva; client-side instrumentation library; over 3 months
[152] / Viewing time, prob. of return, abandonment rate / Average bitrate, startup delay, rebuffer delay, number of failed views / N/A / QED / 23 million views; worldwide; Akamai's client-side media analytics plug-in; 10 days

the viewer accesses [55], [145], [148], which is usually compared with a Zipf-like distribution. It is found that the popular video list changes quite frequently [148]. As the time since a video's release increases, its popularity often drops. However, if a remake version later appears or a certain related event happens, the video may see a surge in popularity [148].

• Flash crowd phenomenon. Normally, the user arrival distribution is found to follow a Poisson distribution [55]. A flash crowd refers to a burst of video accesses or requests within a short period of time. It is usually triggered by special national or international events, for example, popular events in the Olympic Games [148] or the Chinese spring festival gala show [147]. The flash crowd phenomenon imposes great pressure on the network due to the huge amount of video traffic. One solution is to push related videos to multiple edge servers during such events.

2) Decision Theoretic User Behavior Model: In [155], a theoretic model based on a decision network, an extension of the Bayesian network [156], is proposed to characterize user behavior. There are four types of nodes in the decision network.

• Chance nodes, the bottom nodes. Chance nodes represent all random variables in the system, including all possible QoS parameters and the external factors introduced in Section II.

• Query nodes, the parents of chance nodes. Query nodes determine the current state, covering four contexts: network context, service context, environment context and user behavior.

• Utility nodes, associated with each of the four types of query nodes, including network utility, service utility, environment utility and user behavior utility. Utility nodes specify the utility function in each context.

• Decision nodes, the top nodes. Decision nodes choose the optimal option according to a predefined target, e.g., maximum QoE.

Firstly, the chance nodes are fed with evidence variables. After the values of the evidence variables are determined, the posterior probability distribution of the query nodes can be

Fig. 24: Linear regression based QoS-QoE model. (Steps: QoE-QoS Kendall correlation; information gain analysis; linear regression.)

calculated. Then, the utility nodes figure out the utility of the different options. Finally, the decision nodes choose the option that maximizes the QoE. The Bayesian network or decision network can be applied to estimate user departure time [156] or perceptual quality [155]. Further development and verification of such models are expected.

Measurement studies can only give a general understanding of user behavior in video services under different conditions. In order to monitor, predict and even control user QoE, we need more in-depth analysis.

B. Data-driven QoE Analysis

1) Correlation and Linear Regression based Analysis: In [48], a framework is built for identifying QoS metrics that have significant impact on user QoE for different video types, and for quantifying such influence by linear regression. The QoS metrics include startup delay, rebuffering and bitrate; the QoE metrics include the viewing time ratio, number of views and total viewing time. The data is collected at the client side via affiliated video websites, covering five influential content providers. Videos are classified as Long VoD, Short VoD and Live videos. The flow of the analysis is shown in Fig. 24.

• QoE-QoS Kendall Correlation
The correlation between each QoS and QoE metric is calculated to determine the magnitude and the direction of the influence of each QoS metric. The paper chooses the Kendall correlation coefficient, a non-parametric rank correlation measure, to quantify the similarity between two random variables. Unlike the Pearson correlation



coefficient, which measures the linear dependence of two random variables, the Kendall correlation coefficient makes no assumption about the relationship between the two variables. A high absolute correlation value is regarded as an indicator of a significant impact of the QoS metric on the QoE metric. The Kendall correlation coefficient can be calculated as follows. Let (x1, y1), (x2, y2), ..., (xn, yn) denote the joint observations of two random variables X and Y. The pair (xi, yi), (xj, yj) is concordant if xi > xj, yi > yj or xi < xj, yi < yj; otherwise, the pair is discordant. The case xi = xj, yi = yj can be treated as either concordant or discordant. The Kendall correlation can be calculated as:

τ = (Nconcordant − Ndiscordant) / (n(n − 1)/2) (86)

The number of possible pairs of observations is n(n − 1)/2, so τ ∈ [−1, 1]. If the orderings of X and Y agree perfectly, τ = 1; if the orderings of X and Y disagree perfectly, τ = −1; if X and Y are independent, |τ| ≈ 0. (A sketch of both this coefficient and the information gain below is given after this list.)

• Information Gain Analysis
The Kendall correlation coefficient cannot reveal a non-monotonic relationship between the QoS and QoE metrics. Information gain helps to obtain a more in-depth understanding of the QoS-QoE relationship by quantifying how the knowledge of a certain QoS metric decreases the uncertainty of the QoE metric. Let X denote the QoE metric, and Y denote the QoS metric. The information gain for X, given Y, is [I(X) − I(X|Y)]/I(Y), in which I(·) is the entropy, a characterization of how much information is known about the random variable. The information gain can be calculated not only for an isolated QoS metric, but also for QoS metric combinations. A high information gain is regarded as an indicator of a significant impact of the QoS metric on the QoE metric.

• Linear Regression
Linear-regression-based curve fitting is applied to the QoS-QoE pairs that are visually confirmed to have a quasi-linear relationship. Observation of the QoS-QoE curves makes it obvious that the relationship is not linear over the entire range. Therefore, linear regression is only applicable to a certain range of data.
The above analysis framework is applied to Long VoD, Short VoD and Live videos. There are two key findings. First, certain QoS metrics have a high influence on one type of video, but a low influence on other types; in other words, the influence of QoS metrics is content-dependent. Second, certain QoS metrics have low absolute correlation coefficient values but high information gain. The possible reason is that the QoS-QoE relationship may be non-monotonic. Therefore, correlation analysis alone is not enough to decide the importance of QoS metrics.
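Both screening steps are easy to reproduce. The sketch below computes the Kendall coefficient with scipy and an (unnormalized) information gain over discretized metrics, on made-up QoS-QoE observations; the paper's normalization by I(Y) can be added by dividing by the entropy of the QoS bins.

```python
import numpy as np
from scipy import stats

# Placeholder joint observations: a QoS metric (e.g., buffering ratio)
# against a QoE metric (e.g., viewing time ratio) over a set of views
qos = np.array([0.00, 0.01, 0.05, 0.10, 0.20, 0.40])
qoe = np.array([0.95, 0.90, 0.70, 0.40, 0.42, 0.15])

tau, _ = stats.kendalltau(qos, qoe)  # rank correlation (Eq. 86)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(qoe_bins, qos_bins):
    """I(X) - I(X|Y): reduction in QoE uncertainty from knowing the QoS."""
    h_given = sum((qos_bins == v).mean() * entropy(qoe_bins[qos_bins == v])
                  for v in np.unique(qos_bins))
    return entropy(qoe_bins) - h_given

# Discretize both metrics into bins before computing the gain
gain = information_gain(np.digitize(qoe, [0.3, 0.6]), np.digitize(qos, [0.05]))
print(tau, gain)
```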

Though a simple way to characterize the QoS-QoE relationship, the correlation and linear regression based analysis fails to deal with the following problems.

• Non-monotonic relationship between the QoS and QoE.

Fig. 25: Decision-tree based QoE prediction model. (Steps: data collection and pruning; QoS-only decision tree; external factor identification; decision tree refinement; QoE-aware CDN & bitrate selection.)

• Interdependence between QoS parameters. Linear regression requires that the QoS parameters be independent, which may not hold; e.g., it is shown that bitrate and startup delay are correlated [30], [53].

• External factor handling. There is a lack of analysis of external factors and their influence on user QoE.

2) Decision Tree Based QoE Prediction Model: To overcome the drawbacks of the linear regression and correlation analysis, in [30], [53], a decision-tree based QoE prediction model is developed based on 40 million video views collected on the video website conviva.com. The viewing time ratio is chosen as the QoE metric; startup delay, buffering events and average bitrate are chosen as the QoS metrics; the external factors considered are video type (live or VoD), connectivity and so on. The analysis framework is shown in Fig. 25.

• Data Collection and Pruning
Not only are the QoE and QoS metrics recorded; viewer-specific parameters (e.g., video type, device type and time stamp) are also collected as external factors. The early quitters, who watch the video for a very brief time, are eliminated from the data set to improve prediction accuracy.

• QoS-only Decision Tree Building
The decision tree is a non-parametric model, which does not presume a particular QoS-QoE relationship (and can therefore deal with non-monotonicity), and does not require the QoS metrics to be independent. In addition, it is simple yet expressive enough to characterize the QoS-QoE relationship and give relatively accurate predictions (see the sketch after this list). First, each parameter is discretized, because a decision tree can only deal with discrete values. Then, the data set is separated into 10 groups and the model is trained 10 times; each time, 9 groups are used for training and the remaining group for validation.

• External Factor Identification
External factors have an impact on three aspects: the QoE metrics, the QoS metrics and the QoS-QoE relationship. The impact on the QoS and QoE metrics is identified by the information gain, and the impact on the QoS-QoE relationship is identified by differences in the decision tree



structure and the QoE-QoS curve. If an external factor has a high information gain for a certain QoS or QoE metric, or makes the tree structure and/or the QoE-QoS curve different, it is identified as an important external factor.

• Decision Tree Refinement
After figuring out the important external factors, there are two ways to refine the QoS-only decision tree:

– Add them as inputs to build the decision tree. This is simple, but mixing the QoS metrics with external factors gives confusing guidance.

– Split the data according to different external factors (or combinations, like VoD-TV). This leads to a forest of decision trees. The curse of dimensionality may arise when the data is sparse.

It is shown that splitting the data often gives better results than adding the external factor as an input.

• QoE-aware CDN & Bitrate Selection
A brute-force method is used to find the optimal Content Delivery Network (CDN) and bitrate combination by feeding the (CDN, bitrate) pair, along with other QoS metrics and external factors, into the QoE prediction model. The (CDN, bitrate) pair that yields the highest predicted QoE is optimal.
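A minimal version of the QoS-only tree can be put together with scikit-learn. The sketch below trains on synthetic, discretized QoS features and a synthetic engagement label, with 10-fold cross-validation standing in for the paper's train/validate split; the Conviva data itself is not public.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
# Synthetic discretized QoS features: bitrate class, startup-delay class,
# buffering-ratio class (stand-ins for the measured metrics)
X = rng.integers(0, 4, size=(n, 3))
# Synthetic label: viewing-time-ratio bucket (0=low, 1=medium, 2=high),
# made to rise with bitrate and fall with buffering, plus noise
y = np.clip(X[:, 0] - X[:, 2] + rng.integers(-1, 2, size=n), 0, 2)

tree = DecisionTreeClassifier(max_depth=4)
print(cross_val_score(tree, X, y, cv=10).mean())  # 10-fold validation
```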

Though overcoming the drawbacks of linear regression, the above decision-tree based analysis framework still suffers from the following major problems:

• The final QoE prediction is a range rather than a value. Therefore, it cannot meet the need for fine-grained QoE prediction.

• The decision tree can only deal with discrete values. The way the parameters are discretized may influence the performance of the model.

3) QED based QoS-QoE Causality Analysis: To verify the existence of a causal relationship between QoS and QoE, a QED-based model is built to identify the QoS metrics that have a significant causal effect on the QoE metrics, thus providing guidance to service providers on which QoS metrics should be optimized [152]. A correlational relationship does not imply a causal relationship, and may thus lead to incorrect conclusions. For example, one cannot conclude that a high bitrate alone will result in longer viewing time, unless all the other factors (e.g., video popularity, buffering time) are accounted for. The authors only consider VoD videos, with a dataset of 23 million views from 6.7 million unique viewers, using cable, fiber, mobile and DSL as the major connections. The QoE metrics under analysis are the abandonment rate, viewing time and return rate; the QoS metrics are failures, startup delay, average bitrate and rebuffer delay.

To verify that a QoS metric X has a causal influence on a QoE metric Y, the ideal method is a controlled test, in which two viewers with exactly the same attributes, differing only in X, are compared in terms of the resulting Y. Such a controlled test is infeasible for the video distribution service, but Quasi-Experimental Designs (QED) [157] can be leveraged to reveal the causal relationship from observational data. The flow of the QED-based QoS-QoE causal relationship analytical framework is shown in Fig. 26.

Fig. 26: QED-based QoS-QoE causal relationship analytical framework. (Steps: establish the null hypothesis H0; match treated and untreated views/viewers; calculate scores for matched pairs; sum the scores of all matched pairs; significance test.)

• Establish the Null Hypothesis
A null hypothesis usually takes the form “The QoS metric X has no impact on the QoE metric Y”. The null hypothesis will be rejected if there is a causal relationship between the QoS metric and the QoE metric.

• Match Treated and Untreated Views/Viewers
A view/viewer is treated if it undergoes a certain “bad” QoS condition, e.g., a rebuffering time ratio of more than α%. A view/viewer is untreated if it undergoes the corresponding normal QoS condition, e.g., a rebuffering time ratio of less than α%. For a certain QoS metric, all the treated views/viewers form the treated set T, and all the untreated views/viewers form the untreated set U. Then, for each t ∈ T, uniformly and randomly pick a u ∈ U that is “identical” to t in every other aspect. (t, u) is a matched pair, and all matched pairs form the match set M.

• Calculate Scores for Matched Pairs
For a matched pair (t, u) in M, if the QoE values conform to the hypothesis (e.g., t has a lower QoE value than u), the pair (t, u) is assigned a score of 1; otherwise, it is assigned a score of −1. Other ways of assigning score values are also possible [152] (see the sketch after this list).

• Sum Up Scores
The normalized sum of the scores over all matched pairs is

Sum of scores = Σ(u,t)∈M Score(u, t) / |M| (87)

• Significance Test
A “p-value” based on a sign test is calculated, which indicates the probability that the data conforms to the null hypothesis. If the “p-value” is small, the null hypothesis can be rejected with high confidence, corroborating the assumption that the QoS metric has a causal influence on the QoE metric.
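The matching and scoring steps can be sketched as follows; the matching key (the tuple of confounders treated as “every other aspect”) and the scoring rule are simplified placeholders for the richer design in [152].

```python
import random

def qed_score(treated, untreated, match_key, qoe):
    """Match each treated view to a random identical untreated view and
    compute the normalized score sum (Eq. 87).

    treated/untreated: lists of view ids; match_key: view id -> tuple of
    confounders (device, geography, ...); qoe: view id -> QoE value.
    """
    by_key = {}
    for u in untreated:                  # index untreated views by key
        by_key.setdefault(match_key[u], []).append(u)
    scores = []
    for t in treated:
        candidates = by_key.get(match_key[t])
        if not candidates:
            continue                     # no identical untreated view
        u = random.choice(candidates)    # uniform random matching
        # +1 if the pair conforms to the hypothesis (treated is worse)
        scores.append(1 if qoe[t] < qoe[u] else -1)
    return sum(scores) / len(scores) if scores else 0.0
```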

Though it verifies the causal relationship between QoS and QoE, the above framework does not quantify the QoS-QoE relationship. Hence, it cannot be used for QoE prediction, or for providing instrumental guidance on how to achieve QoE-based video service optimization.



C. Discussion

After discussing the advantages and disadvantages of the existing models, we now identify the requirements of an ideal data-driven QoE analysis model:

• Requirements for the QoE Metrics.
– Measurable. Since the raw data is collected in the wild rather than in a controlled laboratory environment, the QoE metrics for large-scale data-driven analysis should be easy to measure and monitor in real-time. This is also true for the QoS metrics and the external factors.
– Informative. The selected QoE metrics should be a good indication of user experience or engagement. It may be necessary to verify the correlation between the measurable QoE metrics (such as viewing time ratio) and real subjective user QoE.
– Business fitting. Ideally, the QoE metrics should be closely linked to the service providers' business objectives, e.g., contributing to the monetization of the advertisement-supported or subscription-supported video service.

• Requirements for the QoS-QoE Model.
– Reliable. The model should give reliable QoE predictions, given the QoS parameters and external factors. Models that assume independence among the QoS variables may not be accurate; e.g., it is found that bitrate and buffering are correlated [30].
– Expressive. The model should be expressive enough to capture the complex and non-monotonic relationship between QoS and QoE. Regression models that preassign a certain relationship (linear, logistic, etc.) may be problematic.
– Real-time. For the model to be able to conduct real-time QoE prediction, monitoring and even control, the computational complexity and storage requirements have to be acceptable.
– Scalable. As the network and user experience evolve with time, the model should be able to readily take in new variables and still give relatively accurate results.

VI. APPLICATIONS OF VIDEO QOE ANALYSIS MODELS

In this section, we introduce existing works that leverage video quality assessment models for video transmission optimization or network control.

A. Cross-layer Video Transmission Optimization

QoE metrics evaluate video quality from the users' perspective, and can thus provide a guideline for MAC/PHY-level optimization. This is especially important for delivering video over wireless networks, which are constrained by limited bandwidth and unstable channel quality. There are two major concerns in cross-layer video transmission optimization:

• Reliable QoE prediction model. Given the input QoS parameters and external factors, the QoE prediction model should give reliable results, so that corresponding adaptation actions can be taken to improve user QoE.

The process should be performed online to give real-time feedback.

• Cross-layer timescale difference. At the application level, video source adaptation operates at the timescale of one frame or one Group of Pictures (GoP), which is much longer than link adaptation at the PHY level. Furthermore, the channel condition varies much faster than the video signal. Therefore, application-level video source adaptation may use aggregated PHY-level information, while PHY-level link adaptation uses relatively coarse application-level information.

Cross-layer video transmission optimization is studied in [158]–[160], using PSNR as the QoE metric. In [161], the authors propose a classification-based multi-dimensional video adaptation using subjective test results, which is not practical for online network management. In [162], the authors propose an APP/MAC/PHY cross-layer video transmission optimization architecture. An online QoS-QoE mapping is developed to estimate the lower bound of the QoE value based on the packet error rate. The QoS-QoE mapping is then leveraged at the PHY level to perform unequal error protection to maximize the QoE. At the APP level, the source rate is adapted based on the channel condition and buffer state. In [163], the authors use the slice loss visibility (SLV) model [115] to estimate the visual importance of video slices (a frame is divided into multiple slices, each of which consists of multiple macroblocks). The most important slices are allocated to the most reliable subbands of the OFDM channels.

B. QoE-aware Congestion Control

Congestion control in the conventional TCP protocol, when applied to video traffic, may lead to long delays for the following reasons:

• According to the TCP protocol, a lost packet will be retransmitted until it is successfully received, resulting in long delays and therefore poor QoE.

• The Additive Increase Multiplicative Decrease (AIMD) algorithm leads to fluctuating throughput over time, which increases the delay, leading to user dissatisfaction.

• The congestion control is QoS-based, while video is more user-centric and QoE-based.

In order to design a video-friendly congestion control mechanism for the TCP protocol, Media-TCP is proposed in [164], which optimizes the congestion window size to maximize the long-term expected QoE. The distortion impact and delay deadline of each packet are considered, in order to provide differentiated services for different packet classes. Media-TCP is shown to improve the PSNR over conventional TCP congestion control approaches. While Media-TCP is still QoS-based, a MOS-based congestion control for multimedia transmission is proposed in [165]. The MOS value is estimated in real time by the Microsoft Lync system, based on quantitative measurements such as packet loss, bit errors, packet delay and jitter. The QoE-aware congestion window adaptation is then formulated as a Partially Observable Markov Decision Process (POMDP), and is solved by an online learning algorithm. Another way to mitigate the delay problem



in video transmission, without modifying the TCP protocol, is to use a video-friendly application protocol such as Dynamic Adaptive Streaming over HTTP (DASH).

C. Video Transmission over Wireless Networks

Special attention has been paid to video transmission over wireless networks for two reasons. First, the channel condition in a wireless network is ever changing due to noise, interference, multipath and the mobility of user devices. Second and more importantly, with the growing popularity of smartphones and tablets, mobile video traffic is expected to be dominant in the near future. There are two mainstream wireless networks: licensed cellular networks and unlicensed wireless local area networks (WLANs). While the cellular system has centralized management, the WLAN, mostly based on the IEEE 802.11 standards, operates in a distributed way, sharing the same spectrum with many other networks or systems without centralized interference management. Thus, video transmission over WLANs is more challenging and attracts more research interest.

1) Interference Management: Rather than to average video quality, viewers are found to be sensitive to small regions of poor quality in the recent past (hysteresis effects) [166], [167]. Rapid changes of channel condition and network throughput lead to variation in video quality, which contributes to poor QoE. Different from existing interference management schemes, which often aim at reducing the interference power, in [168] the authors propose an interference shaping scheme that spreads the received interference power over time to “smooth” the burstiness of the interference. Though real-time video traffic is prioritized over best-effort traffic, it is shown that the QoE improvement for video users (quantified by the MS-SSIM index) only leads to a negligible decrease in QoE for best-effort users (quantified by Weber-Fechner Law (WFL)-based web QoE modeling [169], [170]).

2) Admission Control: Admission control, or access control, in IEEE 802.11 WLANs is generally contention based. To cater to different traffic types (real-time and non-real-time), it is proposed to prioritize video traffic, or to split the contention time between real-time and non-real-time traffic [171]. In [172], the authors use Pseudo-Subjective Quality Assessment (PSQA) as the QoE metric, and propose a QoE-aware real-time admission control mechanism to manage the network access of multiple users. In [173], the authors consider the reverse problem, where a user has multiple networks to choose from. Given the information provided by the access points (APs), the user estimates the overall QoE (represented by PSQA [174]) of each AP's existing users and chooses the AP with the lower load.

3) Resource Allocation: Resource allocation concerns how to allocate frequency, transmission time, or bandwidth to multiple users when centralized scheduling is possible. In [175], a channel allocation scheme is proposed for cognitive radio (CR) networks. The CR base station allocates available channels to secondary users based on their QoE expectations. In [176], [177], the system adapts video configurations through transcoding to meet resource constraints, aiming to achieve the best possible quality (PSNR).

4) Multicast Rate Selection: In [178], the authors design a video multicast mechanism for multirate WLANs. The hierarchical video coders of H.264 are combined with multicast data rate selection: users with poor channel conditions (low data rate) receive only the Base Layer of the encoded video, while users with good channel conditions (high data rate) receive both the Base Layer and the Enhancement Layers. The mechanism is extended for compatibility with the IEEE 802.11 standards in [179].

D. QoE-aware Video Streaming

HTTP-based video streaming protocols have been developed to cater to video traffic. The representative protocols are the HTTP Live Streaming (HLS) protocol and Dynamic Adaptive Streaming over HTTP (DASH), also known as MPEG-DASH. A video is divided into chunks of the same time duration, and each chunk is available at multiple quality levels (with different encoding bitrates). During the video session, the player can switch between video streams of the same video content but different bitrates. For instance, if the buffer is nearly empty, the player can select a low bitrate to quickly fill up the buffer and avoid interruption. Given the choice of different-quality video streams, the remaining issue is the streaming strategy, which specifies how to choose the “right” quality for each video chunk in order to maximize QoE, subject to network conditions and buffer size. The intuition for achieving better QoE is to get higher quality and less frequent quality switches, and to avoid video “freezing” (rebuffering). Single-user adaptive video streaming is considered in [180], [181]. In [180], the wireless channel prediction information is assumed to be available to the video streaming application, which schedules the video chunks and chooses their quality at each time slot. The problem is formulated as an optimization problem to maximize quality and minimize rebuffering time. In [181], the number of quality switches is added to the utility function, and a Markov Decision Process (MDP) is used to solve the optimization problem. Three MDP approaches are proposed, based on online or offline network bandwidth statistics. Multi-user adaptive video streaming is considered in [182], [183]. Different from the single-user scenario, the multi-user scenario has to consider not only efficiency but also fairness among multiple users.
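To make the trade-offs concrete, here is a toy chunk-quality selector in the spirit of buffer/throughput-based strategies; the thresholds and the selection rule are illustrative, not any of the algorithms from [180]–[183].

```python
def next_level(levels_kbps, buffer_s, throughput_kbps,
               low_s=5.0, high_s=20.0, current=0):
    """Pick the quality level of the next chunk.

    Illustrative rule: protect continuity when the buffer is nearly
    empty, climb to the highest sustainable level when it is full,
    and otherwise hold the current level to avoid quality switches.
    """
    sustainable = max(i for i, rate in enumerate(levels_kbps)
                      if rate <= throughput_kbps or i == 0)
    if buffer_s < low_s:        # rebuffering risk: lowest level
        return 0
    if buffer_s > high_s:       # comfortable buffer: as high as safe
        return sustainable
    return min(current, sustainable)

print(next_level([350, 600, 1200, 2400], buffer_s=22, throughput_kbps=1500))
```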

E. Media Player Buffer Design

The design of the media player buffer is of great importance, since rebuffering events have a major influence on user QoE. The buffer size affects the startup delay and the rebuffering time. If the buffer size is large, the startup delay will be longer, because more data has to be downloaded before the player starts playing; on the other hand, fewer rebuffering events may happen during the playing state, and vice versa. In addition, it is found in [149] that most of the downloaded data in the buffer is wasted, because many users quit before the video completes. This results in a huge waste of bandwidth for both the Internet Service Providers (ISPs) and the Content Delivery Network (CDN) operators. Predicting the fraction of a video that will be watched by the viewer would be a great help in avoiding the transfer of excessive data.



VII. FUTURE DIRECTION

In this section, we present future directions of QoE-oriented video quality assessment.

A. Development of Data-driven QoE Research

Data-driven QoE analysis is still in its infancy, and there is great room for development.

• New metric selection. New metrics for representing QoE, QoS and external factors may emerge as the network and user expectations change with time. The selected QoE metrics should be good indicators of user experience or engagement, and easy to track and monitor in real-time. Other aspects of user QoE are also interesting; for example, interactivity can be reflected by user behaviors such as pause, fast-forward and rewind. With abundant QoS metrics and external factors, it should be verified which QoS metrics and external factors have a significant impact on user QoE.

• In-depth user expectation understanding. Just as most objective quality models are designed based on the HVS, theories on user expectations of Internet video service may be further advanced, for example, the user's patience for waiting for a video to start or restart, or the user's viewing habits at different times of day or days of the week.

• Analysis tool development. Many advanced analysis tools can be leveraged to give more accurate QoE prediction. For example, deep learning algorithms can help extract the important QoS and external factors that contribute to user QoE, and better regression models can characterize the complex QoS-QoE relationship (a minimal regression sketch follows this list).

• Early-quitter phenomenon analysis. A large number of viewers will first "skim" a few videos before committing to watching a specific one, or simply quit the website. Early-quitters may exhibit different behaviors from other viewers; e.g., their QoE may be more sensitive to the video content (e.g., popularity), but less sensitive to some QoS metrics (due to small QoS changes within a very short time). Other interesting observations also deserve further investigation.

• Database establishment. As consumer data is often hard to access and time-consuming to collect, a database available to the research community would greatly boost QoE-related research. So far, there is no well-established database comparable to the VQEG database and the LIVE database.
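As a minimal illustration of the analysis-tool direction above, the sketch below fits an ordinary least-squares model mapping a few common QoS metrics to an engagement-style QoE proxy. The synthetic data, feature set, and linear form are all assumptions chosen for brevity, not a validated QoE model.

```python
# Minimal sketch: fit a linear QoS -> QoE (engagement) model and inspect
# which metrics matter. Data are synthetic; the linear form is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
bitrate_mbps = rng.uniform(0.5, 6.0, n)
rebuf_ratio  = rng.uniform(0.0, 0.2, n)   # fraction of session spent rebuffering
join_time_s  = rng.uniform(0.5, 10.0, n)
# Hypothetical ground truth: engagement (minutes watched) rises with bitrate,
# drops sharply with rebuffering, mildly with join time, plus noise.
minutes = (20 + 2.0 * bitrate_mbps - 60.0 * rebuf_ratio
           - 0.3 * join_time_s + rng.normal(0, 2, n))

X = np.column_stack([np.ones(n), bitrate_mbps, rebuf_ratio, join_time_s])
coef, *_ = np.linalg.lstsq(X, minutes, rcond=None)
for name, c in zip(["intercept", "bitrate", "rebuf_ratio", "join_time"], coef):
    print(f"{name:>11}: {c:+.2f}")
# On standardized features, coefficient magnitudes hint at which QoS metric
# most strongly drives the QoE proxy; richer models (trees, deep nets) can
# replace the linear fit when the QoS-QoE relationship is nonlinear.
```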

B. QoE-based Video Transmission Optimization

Most previous video transmission optimization is QoS-oriented. As the goal changes from QoS-oriented to QoE-oriented, the optimization problem may be quite different. Though many existing video QoE-related applications have been discussed in Section VI, there is still more to be explored. The following are some potential research directions.

• QoE-aware multi-user video traffic scheduling. This is especially needed in scenarios where multiple users share a bottleneck link. Since different users have different QoE expectations, scheduling can be performed based on user QoE sensitivity (a minimal allocation sketch follows this list). In this way, higher aggregate user QoE may be achieved with limited network resources.

• QoE-aware video streaming. Built on the existing adaptive video streaming protocols (e.g., DASH and HLS), a sophisticated streaming strategy (finding the optimal quality for each video chunk) still needs further exploration. Future solutions must strike a balance between video quality, rebuffering time and quality switch frequency, while relying on relatively accurate channel capacity estimation. In the multi-user case, fairness is also a concern.

• QoE-aware network management. Once QoE degradation is detected, first and foremost, the causes should be identified (possibly through a QoE prediction model). If the cause is network-related, ISP and CDN operators may take corresponding actions. If the cause is due to external factors, there is no need for ISP and CDN operators to waste their resources, for instance by increasing bandwidth or changing edge servers. All management decisions should be based on a comprehensive understanding of the QoS-QoE relationship.

• QoE-aware traffic prioritization. Video traffic often has larger packet sizes than other traffic, and user patience for video service delay is often lower than for other services. Traffic prioritization based on how user QoE is defined for different services will be a matter of concern for future research.
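To make the multi-user scheduling idea concrete (as flagged in the first item of this list), here is a minimal sketch that splits a bottleneck link among users so as to maximize a sum of weighted logarithmic QoE utilities, a natural hedged choice given the logarithmic QoS-QoE relationships reported in [170]. The weights and link capacity are illustrative assumptions.

```python
# Sketch: allocate a bottleneck link of capacity C among users to maximize
# sum_i w_i * log(b_i), where w_i encodes user i's QoE sensitivity.
# With only the sum constraint sum_i b_i = C, the optimum has the closed form
#   b_i = (w_i / sum_j w_j) * C   (from the Lagrangian stationarity condition).
import math

def qoe_weighted_allocation(weights, capacity_mbps):
    total = sum(weights)
    return [w / total * capacity_mbps for w in weights]

weights = [1.0, 2.0, 4.0]   # e.g., phone, laptop, large-screen TV (illustrative)
alloc = qoe_weighted_allocation(weights, capacity_mbps=20.0)
utility = sum(w * math.log(b) for w, b in zip(weights, alloc))
for i, b in enumerate(alloc):
    print(f"user {i}: {b:.1f} Mbps")
print(f"aggregate weighted log-utility: {utility:.2f}")
```

The concave (diminishing-returns) utility is what lets QoE-sensitive users receive more bandwidth without starving the others, which is the fairness-efficiency balance the bullet above calls for.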

C. QoE Evaluation in Emerging Technologies

1) 3D Video: There has been a huge amount of research on the perceptual quality of 2D video, while work on 3D video QoE is rather limited. The evaluation of 3D video QoE is challenging because additional factors, such as depth perception, comfort level and naturalness, have to be considered. There are two mainstream coding schemes for 3D video: Scalable Video Coding (SVC) and Multi-view Video Coding (MVC), see Table II. SVC is simulcast coding, where views are independently encoded with different SNR, temporal or spatial scalability. MVC exploits inter-view correlations, and sequential views are dependently encoded. Apart from these coding methods, 3D video can also leverage asymmetric coding, which encodes the right and left views at different PSNR, spatial resolution or frame rate, reducing the overall bitrate and the bandwidth required for transmission. The performance of symmetric and asymmetric coding is compared in [184]–[186] via subjective tests, based on which efficient asymmetric video encoding approaches are proposed. The influence of packet losses on the QoE of 3D video is studied in [187] using subjective tests. The relationship between the DMOS results and the PSNR is characterized by a symmetrical logistic function. A future direction for 3D video QoE evaluation may be to study the combination of scalable stereo coding, multi-view video coding and asymmetric coding.
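For reference, a symmetric logistic mapping of the kind mentioned above is commonly written in the following generic form, given here as an illustration rather than the exact function fitted in [187]:

\[
\mathrm{DMOS}_p(\mathrm{PSNR}) \;=\; a \;+\; \frac{b - a}{1 + \exp\bigl(-c\,(\mathrm{PSNR} - d)\bigr)},
\]

where \(a\) and \(b\) are the lower and upper asymptotes of the predicted DMOS, \(d\) is the PSNR at the curve's midpoint, and \(c\) controls the slope; all four parameters are fitted to the subjective test data.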


Fig. 27: Video delivery network. (The figure shows the content provider's origin server feeding CDN surrogate servers, which deliver video through the network operator to end-user devices: desktop, laptop, tablet and smartphone.)

2) Interactive Video: Interactive video services, or audio-visual communication services, include videotelephony, video conferencing, and online gaming. Unlike QoE metrics for conventional video services, in interactive video services, interactivity measurement is of great importance and should be incorporated in the QoE assessment. In [188], a conceptual framework is proposed to model, measure and evaluate QoE in distributed interactive multimedia environments. In particular, cognitive perceptions (such as telepresence and perceived technology acceptance) and behavioral consequences (such as performance gains and technology adoption) are incorporated in the QoE metrics. A novel test methodology for QoE evaluation in interactive video services is proposed in [189]. Conversational interactivity and perceived social presence are incorporated in the QoE metrics. Social presence is the "degree of salience of the other person in the (mediated) interaction and the consequent salience of the interpersonal relationships" [190]. An objective quality model for voice and video over IP (VVoIP) is built in [191], using network bandwidth, delay, jitter and loss to predict QoE; however, it does not consider interactivity.

3) Ultra Definition Video: Ultra-high definition television (UHDTV) is standardized in ITU-R Recommendation BT.2020 [192], aiming to provide users with an advanced viewing experience beyond high definition TV. Various works have compared the performance of two common compression methods for UHDTV: High Efficiency Video Coding (HEVC) and H.264/MPEG-4 Part 10 or AVC (Advanced Video Coding). The results show that HEVC generally outperforms AVC, achieving higher MOS scores [193], [194] and higher PSNR [195]. However, there is a lack of studies on human perception of ultra definition video and on building models that characterize the QoS-QoE relationship for ultra definition video.

4) New Transmission Network: With the rapid development of network technologies, it is desirable to evaluate the QoE of video transmission over different networks, such as mobile networks, sensor networks and vehicular networks.

• Mobile network. The popularization of the smartphone has dramatically increased mobile media traffic. Mobile video is characterized by its usage in dynamic and heterogeneous environments. According to the study of mobile TV in [196], subjective test results in real contexts (e.g., waiting in a train station, killing time in a cafe or transiting by bus) differ from those in a controlled lab. Therefore, it has been proposed to evaluate the QoE of mobile video in a Living Lab setting, where viewers watch pre-defined videos and perform evaluation tasks on mobile devices in real-life scenarios [197], [198].

• Sensor network. A Wireless Multimedia Sensor Network (WMSN) is a sensor network that is able to retrieve, process, store and fuse multimedia information from the physical world [199]. WMSNs can be applied to video surveillance, traffic control systems, environmental monitoring and so on. However, WMSNs face challenges of resource constraints, channel capacity variation, video processing complexity and network management. QoS-provisioning system design for WMSNs has been widely explored [200], [201], but there is a lack of work on the QoE evaluation of such systems.

• Vehicular network. Vehicular communications include vehicle-to-vehicle, vehicle-to-infrastructure and vehicle-to-roadside wireless communications. Video transmission over vehicular networks is studied in [202]–[204], using PSNR or packet loss rate as evaluation metrics.

D. QoE-based Internet Video Economics

The success of the advertisement-supported or subscription-supported revenue models is the major driving force behind the fast development of Internet video. Improving user QoE is essential to maintaining such revenue models. Therefore, creating a QoE-based economic analysis framework for Internet video will be of great interest.

Fig. 27 shows the general architecture of an Internet video transmission network. Video files are initially generated by the video content providers, then distributed by the Content Delivery Networks (CDN), often chosen by the content providers.


After that, the video files are transmitted via wired or wireless networks provided by the Internet Service Providers (ISP), and finally displayed on end users' devices by the media player. We can see that there are four major participants in the Internet video service ecosystem:

• Video Content Provider, e.g., YouTube, Netflix, and Comcast.

• Content Delivery Network (CDN) Operator, e.g., Akamai Technologies in the U.S. [205], ChinaCache in China, and StreamZilla in Europe. A CDN consists of large numbers of servers distributed across multiple ISPs' data centers close to the end users. The CDN transports the videos from the content provider to servers at the "edge" of the Internet, where the videos are cached and delivered to the end users with high quality.

• Internet Service Provider (ISP), e.g., AT&T, Vodafone, and China Telecom. There are two major types of ISP: fixed-line operators who provide wired network access, and mobile network operators who provide wireless network access. Typical wireless networks include cellular networks and WLAN (Wi-Fi); typical wired networks include cable, DSL and fiber.

• Media Player Designer, e.g., Adobe (Adobe Flash Player), Microsoft (Windows Media Player), and Apple (QuickTime).

The economic ties between these participants are as follows. The video content providers choose and pay the CDN operators for delivering their videos. The CDN operators have to pay the ISPs for hosting CDN servers in the ISPs' data centers. Though most media players are free of charge, they benefit their designers by complementing the designers' products or services. Improving user QoE is in the common interest of all participants, but different participants have different control parameters; for example, CDN operators can select CDN servers, and ISPs can decide on bandwidth. On the one hand, each participant can maximize its individual utility by choosing its own control strategy; on the other hand, all or some participants can cooperate with each other to maximize end-user QoE or total utility. Future research in either direction is promising.
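As a toy illustration of this last point, the sketch below compares independent (selfish) strategy choices with a jointly optimized choice in a two-player model where a CDN picks a caching level and an ISP picks a bandwidth level, both feeding a shared QoE term. The payoff numbers and the QoE function are invented purely for illustration.

```python
# Toy model: a CDN chooses caching effort, an ISP chooses bandwidth; each pays
# a private cost, and both earn revenue tied to the resulting user QoE.
# All numbers and the QoE function are illustrative assumptions.
import itertools

LEVELS = [0, 1, 2]                       # low / medium / high effort
COST = {0: 0.0, 1: 1.0, 2: 2.5}          # private cost per effort level

def qoe(cache, bw):
    # QoE improves mostly when both efforts rise together (complements).
    return 3.0 * min(cache, bw) + 0.5 * max(cache, bw)

def payoff(cache, bw):
    revenue = qoe(cache, bw)             # each side earns the shared QoE value
    return revenue - COST[cache], revenue - COST[bw]

# Selfish play: each side best-responds to the other's current level.
cache, bw = 0, 0
for _ in range(10):                      # iterate best responses to a fixed point
    cache = max(LEVELS, key=lambda c: payoff(c, bw)[0])
    bw = max(LEVELS, key=lambda b: payoff(cache, b)[1])
print("selfish outcome:", (cache, bw), "total =", sum(payoff(cache, bw)))

# Cooperation: jointly pick the pair maximizing total payoff.
best = max(itertools.product(LEVELS, LEVELS), key=lambda cb: sum(payoff(*cb)))
print("cooperative outcome:", best, "total =", sum(payoff(*best)))
```

In this toy instance the selfish fixed point is (0, 0) with zero total payoff, while joint optimization selects high effort on both sides, illustrating why cooperation among participants can raise both end-user QoE and total utility.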

VIII. CONCLUSION

Video quality assessment has evolved from system-centric QoS-oriented methods to user-centric QoE-oriented ones. With the ever-increasing user demand for video services, developing reliable models that can monitor, predict and even control QoE is of great importance to service providers and network operators. In this tutorial, we give a comprehensive review of the evolution of QoE-based video quality assessment methods: first the subjective test, then the objective quality model, and finally data-driven analysis. We give a detailed description of the state of the art of each approach. Subjective testing is a direct way of measuring QoE, but it has many limitations. Objective quality models indirectly predict QoE through objective metrics, but they rely heavily on subjective test results. With the growing popularity of video streaming over the Internet, large-scale data-driven QoE models have emerged, based on new QoE metrics and data mining techniques. We believe that this will be the research frontier, with many issues to be explored and resolved. We also identify other future research directions, such as QoE-based video transmission optimization and QoE-based Internet video economics.

ACKNOWLEDGMENT

The research was supported in part by grants from 973 project 2013CB329006, China NSFC under Grant 61173156, RGC under the contracts CERG 622613, 16212714, HKUST6/CRF/12R, and M-HKUST609/13, the grant from Huawei-HKUST joint lab, Program for New Century Excellent Talents in University (NCET-13-0908), Guangdong Natural Science Funds for Distinguished Young Scholar (No. S20120011468), the Shenzhen Science and Technology Foundation (Grant No. JCYJ20140509172719309), and New Star of Pearl River on Science and Technology of Guangzhou (No. 2012J2200081).

REFERENCES

[1] “Cisco visual networking index: Forecast and methodology, 2012–2017,” May 29, 2013.
[2] “Cisco visual networking index: Global mobile data traffic forecast update, 2013–2018,” February 5, 2014.
[3] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” Signal Processing Magazine, IEEE, vol. 15, no. 6, pp. 74–90, 1998.
[4] Z. Chen and K. N. Ngan, “Recent advances in rate control for video coding,” Signal Processing: Image Communication, vol. 22, no. 1, pp. 19–38, 2007.
[5] Y. Liu, Z. G. Li, and Y. C. Soh, “A novel rate control scheme for low delay video communication of H.264/AVC standard,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 17, no. 1, pp. 68–78, 2007.
[6] S. Chong, S.-Q. Li, and J. Ghosh, “Predictive dynamic bandwidth allocation for efficient transport of real-time VBR video over ATM,” Selected Areas in Communications, IEEE Journal on, vol. 13, no. 1, pp. 12–23, 1995.
[7] A. M. Adas, “Using adaptive linear prediction to support real-time VBR video under RCBR network service model,” Networking, IEEE/ACM Transactions on, vol. 6, no. 5, pp. 635–644, 1998.
[8] M. Wu, R. A. Joyce, H.-S. Wong, L. Guan, and S.-Y. Kung, “Dynamic resource allocation via video content and short-term traffic statistics,” Multimedia, IEEE Transactions on, vol. 3, no. 2, pp. 186–199, 2001.
[9] H. Luo and M.-L. Shyu, “Quality of service provision in mobile multimedia - a survey,” Human-centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–15, 2011.
[10] B. Wah, X. Su, and D. Lin, “A survey of error-concealment schemes for real-time audio and video transmissions over the internet,” in Multimedia Software Engineering, 2000. Proceedings. International Symposium on. IEEE, 2000, pp. 17–24.
[11] Q. Zhang, W. Zhu, and Y.-Q. Zhang, “End-to-end QoS for video delivery over wireless internet,” Proceedings of the IEEE, vol. 93, no. 1, pp. 123–134, 2005.
[12] B. Vandalore, W.-C. Feng, R. Jain, and S. Fahmy, “A survey of application layer techniques for adaptive streaming of multimedia,” Real-Time Imaging, vol. 7, no. 3, pp. 221–235, 2001.
[13] “VQEG objective video quality model test plan,” 27–29 May 1998.
[14] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” Image Processing, IEEE Transactions on, vol. 21, no. 8, pp. 3339–3352, 2012.
[15] M. A. Saad and A. C. Bovik, “Blind quality assessment of videos using a model of natural scene statistics and motion coherency,” in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on. IEEE, 2012, pp. 332–336.
[16] F. Yang, S. Wan, Q. Xie, and H. R. Wu, “No-reference quality assessment for networked video via primary analysis of bit stream,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 20, no. 11, pp. 1544–1554, 2010.


[17] X. Lin, H. Ma, L. Luo, and Y. Chen, “No-reference video quality assessment in the compressed domain,” Consumer Electronics, IEEE Transactions on, vol. 58, no. 2, pp. 505–512, 2012.
[18] S.-O. Lee and D.-G. Sim, “Hybrid bitstream-based video quality assessment method for scalable video coding,” Optical Engineering, vol. 51, no. 6, pp. 067403–1, 2012.
[19] K. Yamagishi and T. Hayashi, “Parametric packet-layer model for monitoring video quality of IPTV services,” in Communications, 2008. ICC’08. IEEE International Conference on. IEEE, 2008, pp. 110–114.
[20] F. Yang, J. Song, S. Wan, and H. R. Wu, “Content-adaptive packet-layer model for quality assessment of networked video services,” Selected Topics in Signal Processing, IEEE Journal of, vol. 6, no. 6, pp. 672–683, 2012.
[21] S. Tao, J. Apostolopoulos, and R. Guerin, “Real-time monitoring of video quality in IP networks,” IEEE/ACM Transactions on Networking (TON), vol. 16, no. 5, pp. 1052–1065, 2008.
[22] G. Zhai, J. Cai, W. Lin, X. Yang, and W. Zhang, “Three dimensional scalable video adaptation via user-end perceptual quality assessment,” Broadcasting, IEEE Transactions on, vol. 54, no. 3, pp. 719–727, 2008.
[23] M. Ries, O. Nemethova, and M. Rupp, “Video quality estimation for mobile H.264/AVC video streaming,” Journal of Communications, vol. 3, no. 1, pp. 41–50, 2008.
[24] K. Yamagishi, T. Kawano, and T. Hayashi, “Hybrid video-quality-estimation model for IPTV services,” in Global Telecommunications Conference, 2009. IEEE, 2009, pp. 1–5.
[25] R. K. Mok, E. W. Chan, and R. K. Chang, “Measuring the quality of experience of HTTP video streaming,” in Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on. IEEE, 2011, pp. 485–492.
[26] R. K. Mok, E. W. Chan, X. Luo, and R. K. Chang, “Inferring the QoE of HTTP video streaming from user-viewing activities,” in Proceedings of the first ACM SIGCOMM workshop on Measurements up the stack. ACM, 2011, pp. 31–36.
[27] A. Khan, L. Sun, and E. Ifeachor, “QoE prediction model and its application in video quality adaptation over UMTS networks,” Multimedia, IEEE Transactions on, vol. 14, no. 2, pp. 431–442, 2012.
[28] Video Quality Experts Group (VQEG), “Final report from the video quality experts group on the validation of objective models of video quality assessment,” 2000.
[29] A. K. Moorthy, K. Seshadrinathan, R. Soundararajan, and A. C. Bovik, “Wireless video quality assessment: A study of subjective scores and objective algorithms,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 20, no. 4, pp. 587–599, 2010.
[30] A. Balachandran, V. Sekar, A. Akella, S. Seshan, I. Stoica, and H. Zhang, “Developing a predictive model of quality of experience for internet video,” in Proceedings of the ACM SIGCOMM 2013. ACM, 2013, pp. 339–350.
[31] S. Chikkerur, V. Sundaram, M. Reisslein, and L. J. Karam, “Objective video quality assessment methods: A classification, review, and performance comparison,” Broadcasting, IEEE Transactions on, vol. 57, no. 2, pp. 165–182, 2011.
[32] W. Lin and C.-C. Jay Kuo, “Perceptual visual quality metrics: A survey,” Journal of Visual Communication and Image Representation, vol. 22, no. 4, pp. 297–312, 2011.
[33] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600–612, 2004.
[34] Y. Wang, “Survey of objective video quality measurements,” EMC Corporation Hopkinton, MA, vol. 1748, p. 39, 2006.
[35] M. Yuen and H. Wu, “A survey of hybrid MC/DPCM/DCT video coding distortions,” Signal Processing, vol. 70, no. 3, pp. 247–278, 1998.
[36] C. J. Van den Branden Lambrecht and O. Verscheure, “Perceptual quality measure using a spatiotemporal model of the human visual system,” in Electronic Imaging: Science & Technology. International Society for Optics and Photonics, 1996, pp. 450–461.
[37] A. Liu, W. Lin, M. Paul, C. Deng, and F. Zhang, “Just noticeable difference for images with decomposition model for separating edge and textured regions,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 20, no. 11, pp. 1648–1652, 2010.
[38] W. Osberger, A. J. Maeder, and D. McLean, “A computational model of the human visual system for image quality assessment,” in Proceedings DICTA, vol. 97, 1997, pp. 337–342.
[39] W. Osberger, N. Bergmann, and A. Maeder, “An automatic image quality assessment technique incorporating higher level perceptual factors,” in Image Processing. International Conference on. IEEE, 1998, pp. 414–418.
[40] S. Westen, R. Lagendijk, and J. Biemond, “Perceptual image quality based on a multiple channel HVS model,” in Acoustics, Speech, and Signal Processing (ICASSP). International Conference on, vol. 4. IEEE, 1995, pp. 2351–2354.
[41] F. Xiao et al., “DCT-based video quality evaluation,” Final Project for EE392J, p. 769, 2000.
[42] A. B. Watson, J. Hu, and J. F. McGowan, “Digital video quality metric based on human vision,” Journal of Electronic Imaging, vol. 10, no. 1, pp. 20–29, 2001.
[43] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Signals, Systems and Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 2. IEEE, 2003, pp. 1398–1402.
[44] K. Seshadrinathan and A. C. Bovik, “Motion tuned spatio-temporal quality assessment of natural videos,” Image Processing, IEEE Transactions on, vol. 19, no. 2, pp. 335–350, 2010.
[45] U. Ansorge, G. Francis, M. H. Herzog, and H. Ogmen, “Visual masking and the dynamics of human perception, cognition, and consciousness: a century of progress, a contemporary synthesis, and future directions,” Advances in Cognitive Psychology, vol. 3, no. 1-2, p. 1, 2007.
[46] D. M. Chandler and S. S. Hemami, “VSNR: A wavelet-based visual signal-to-noise ratio for natural images,” Image Processing, IEEE Transactions on, vol. 16, no. 9, pp. 2284–2298, 2007.
[47] H. R. Wu and K. R. Rao, Digital Video Image Quality and Perceptual Coding. CRC Press, 2005.
[48] F. Dobrian, V. Sekar, A. Awan, I. Stoica, D. A. Joseph, A. Ganjam, J. Zhan, and H. Zhang, “Understanding the impact of video quality on user engagement,” SIGCOMM Computer Communication Review, vol. 41, no. 4, p. 362, 2011.
[49] P. Read and M.-P. Meyer, Restoration of Motion Picture Film. Butterworth-Heinemann, 2000.
[50] Q. Huynh-Thu and M. Ghanbari, “Temporal aspect of perceived quality in mobile video broadcasting,” Broadcasting, IEEE Transactions on, vol. 54, no. 3, pp. 641–651, 2008.
[51] A. Khan, L. Sun, and E. Ifeachor, “Content clustering based video quality prediction model for MPEG4 video streaming over wireless networks,” in Communications. IEEE International Conference on. IEEE, 2009, pp. 1–5.
[52] M. Claypool and J. Tanner, “The effects of jitter on the perceptual quality of video,” in Proceedings of the seventh ACM international conference on Multimedia (Part 2). ACM, 1999, pp. 115–118.
[53] A. Balachandran, V. Sekar, A. Akella, S. Seshan, I. Stoica, and H. Zhang, “A quest for an internet video quality-of-experience metric,” in Proceedings of the 11th ACM Workshop on Hot Topics in Networks. ACM, 2012, pp. 97–102.
[54] H. Chen, S. Ng, and A. R. Rao, “Cultural differences in consumer impatience,” Journal of Marketing Research, pp. 291–301, 2005.
[55] H. Yu, D. Zheng, B. Y. Zhao, and W. Zheng, “Understanding user behavior in large-scale video-on-demand systems,” in ACM SIGOPS Operating Systems Review, vol. 40, no. 4. ACM, 2006, pp. 333–344.
[56] M. H. Pinson and S. Wolf, “Comparing subjective video quality testing methodologies,” in Visual Communications and Image Processing 2003. International Society for Optics and Photonics, 2003, pp. 573–582.
[57] I. T. Union, “Rec. ITU-R BT.500-11: Methodology for subjective assessment of the quality of television pictures.”
[58] ——, “Rec. ITU-R BT.814-1: Specifications and alignment procedures for setting of brightness and contrast of displays.”
[59] ——, “Rec. ITU-R BT.815-1: Specification of a signal for measurement of the contrast ratio of displays.”
[60] “ITU-R Recommendation BT.1438: Subjective assessment of stereoscopic television pictures,” 2000.
[61] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia, “Validation of a novel approach to subjective quality evaluation of conventional and 3D broadcasted video services,” in Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on. IEEE, 2012, pp. 230–235.
[62] A. Sorokin and D. Forsyth, “Utility data annotation with Amazon Mechanical Turk,” Urbana, vol. 51, no. 61, p. 820, 2008.
[63] K.-T. Chen, C.-C. Wu, Y.-C. Chang, and C.-L. Lei, “A crowdsourceable QoE evaluation framework for multimedia content,” in Proceedings of the 17th ACM international conference on Multimedia. ACM, 2009, pp. 491–500.
[64] Q. Xu, J. Xiong, Q. Huang, and Y. Yao, “Robust evaluation for quality of experience in crowdsourcing,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 43–52.


[65] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, “HodgeRank on random graphs for subjective video quality assessment,” Multimedia, IEEE Transactions on, vol. 14, no. 3, pp. 844–857, 2012.
[66] C. Wu, K. Chen, Y. Chang, and C. Lei, “Crowdsourcing multimedia QoE evaluation: A trusted framework,” 2013.
[67] R. G. Cole and J. H. Rosenbluth, “Voice over IP performance monitoring,” ACM SIGCOMM Computer Communication Review, vol. 31, no. 2, pp. 9–24, 2001.
[68] D. Hands, O. V. Barriac, and F. Telecom, “Standardization activities in the ITU for a QoE assessment of IPTV,” IEEE Communications Magazine, p. 79, 2008.
[69] S. Winkler, A. Sharma, and D. McNally, “Perceptual video quality and blockiness metrics for multimedia streaming applications,” in Proc. International Symposium on Wireless Personal Multimedia Communications, 2001, pp. 547–552.
[70] S. Olsson, M. Stroppiana, and J. Baina, “Objective methods for assessment of video quality: state of the art,” Broadcasting, IEEE Transactions on, vol. 43, no. 4, pp. 487–495, 1997.
[71] A. Punchihewa, D. G. Bailey, and R. Hodgson, “A survey of coded image and video quality assessment,” in Proceedings of Image and Vision Computing New Zealand, 2003, pp. 326–331.
[72] U. Engelke and H.-J. Zepernick, “Perceptual-based quality metrics for image and video services: A survey,” in Next Generation Internet Networks, 3rd EuroNGI Conference on. IEEE, 2007, pp. 190–197.
[73] S. Winkler, Digital Video Quality: Vision Models and Metrics. John Wiley & Sons, 2005.
[74] ITU-T Rec. J.340, “Reference algorithm for computing peak signal to noise ratio (PSNR) of a video sequence with a constant delay,” 2009.
[75] A. M. Eskicioglu and P. S. Fisher, “Image quality measures and their performance,” Communications, IEEE Transactions on, vol. 43, no. 12, pp. 2959–2965, 1995.
[76] Z. Wang and A. C. Bovik, “A universal image quality index,” Signal Processing Letters, IEEE, vol. 9, no. 3, pp. 81–84, 2002.
[77] Z. Wang, A. C. Bovik, and L. Lu, “Why is image quality assessment so difficult?” in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 4. IEEE, 2002, pp. IV–3313.
[78] A. Schertz and Institut fuer Rundfunktechnik GmbH, IRT Tektronix Investigation of Subjective and Objective Picture Quality for 2-10 Mbit-sec MPEG-2 Video. IRT, 1997.
[79] A. P. Hekstra, J. Beerends, D. Ledermann, F. De Caluwe, S. Kohler, R. Koenen, S. Rihs, M. Ehrsam, and D. Schlauss, “PVQM - a perceptual video quality measure,” Signal Processing: Image Communication, vol. 17, no. 10, pp. 781–798, 2002.
[80] A. Bhat, I. Richardson, and S. Kannangara, “A new perceptual quality metric for compressed video,” in Acoustics, Speech and Signal Processing (ICASSP). IEEE International Conference on. IEEE, 2009, pp. 933–936.
[81] J. You, T. Ebrahimi, and A. Perkis, “Attention driven foveated video quality assessment,” 2013.
[82] Video Quality Experts Group et al., “Report on the validation of video quality models for high definition video content,” 2010.
[83] J. Mannos and D. Sakrison, “The effects of a visual fidelity criterion of the encoding of images,” Information Theory, IEEE Transactions on, vol. 20, no. 4, pp. 525–536, 1974.
[84] P. J. Bex and W. Makous, “Spatial frequency, phase, and the contrast of natural images,” JOSA A, vol. 19, no. 6, pp. 1096–1106, 2002.
[85] D. Navon, “Forest before trees: The precedence of global features in visual perception,” Cognitive Psychology, vol. 9, no. 3, pp. 353–383, 1977.
[86] D. M. Chandler, K. H. Lim, and S. S. Hemami, “Effects of spatial correlations and global precedence on the visual fidelity of distorted images,” in Electronic Imaging 2006. International Society for Optics and Photonics, 2006, pp. 60570F–60570F.
[87] Z. Wang, L. Lu, and A. C. Bovik, “Video quality assessment based on structural distortion measurement,” Signal Processing: Image Communication, vol. 19, no. 2, pp. 121–132, 2004.
[88] Z. Wang and E. P. Simoncelli, “An adaptive linear system framework for image distortion analysis,” in Image Processing, 2005. ICIP 2005. IEEE International Conference on, vol. 3. IEEE, 2005, pp. III–1160.
[89] ——, “Stimulus synthesis for efficient evaluation and refinement of perceptual image quality metrics,” in Electronic Imaging 2004. International Society for Optics and Photonics, 2004, pp. 99–108.
[90] Z. Wang and X. Shang, “Spatial pooling strategies for perceptual image quality assessment,” in Image Processing, 2006 IEEE International Conference on. IEEE, 2006, pp. 2945–2948.
[91] A. K. Moorthy and A. C. Bovik, “Visual importance pooling for image quality assessment,” Selected Topics in Signal Processing, IEEE Journal of, vol. 3, no. 2, pp. 193–201, 2009.
[92] E. Ong, X. Yang, W. Lin, Z. Lu, and S. Yao, “Video quality metric for low bitrate compressed videos,” in Image Processing, 2004. ICIP’04. 2004 International Conference on, vol. 5. IEEE, 2004, pp. 3531–3534.
[93] E. Ong, W. Lin, Z. Lu, and S. Yao, “Colour perceptual video quality metric,” in Image Processing (ICIP). IEEE International Conference on, vol. 3. IEEE, 2005, pp. III–1172.
[94] P. Ndjiki-Nya, M. Barrado, and T. Wiegand, “Efficient full-reference assessment of image and video quality,” in Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 2. IEEE, 2007, pp. II–125.
[95] H. R. Sheikh, A. C. Bovik, and G. De Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” Image Processing, IEEE Transactions on, vol. 14, no. 12, pp. 2117–2128, 2005.
[96] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” Image Processing, IEEE Transactions on, vol. 15, no. 2, pp. 430–444, 2006.
[97] ——, “A visual information fidelity approach to video quality assessment,” in The First International Workshop on Video Processing and Quality Metrics for Consumer Electronics, 2005, pp. 23–25.
[98] S.-O. Lee and D.-G. Sim, “New full-reference visual quality assessment based on human visual perception,” in Consumer Electronics (ICCE). Digest of Technical Papers. International Conference on. IEEE, 2008, pp. 1–2.
[99] S. Lee, M. S. Pattichis, and A. C. Bovik, “Foveated video quality assessment,” Multimedia, IEEE Transactions on, vol. 4, no. 1, pp. 129–132, 2002.
[100] Z. Wang, A. C. Bovik, L. Lu, and J. L. Kouloheris, “Foveated wavelet image quality index,” in International Symposium on Optical Science and Technology. International Society for Optics and Photonics, 2001, pp. 42–52.
[101] S. Rimac-Drlje, M. Vranjes, and D. Zagar, “Foveated mean squared error - a novel video quality metric,” Multimedia Tools and Applications, vol. 49, no. 3, pp. 425–445, 2010.
[102] W. S. Geisler and J. S. Perry, “A real-time foveated multiresolution system for low-bandwidth video communication,” in Proc. SPIE, 1998, pp. 294–305.
[103] J. You, A. Perkis, M. M. Hannuksela, and M. Gabbouj, “Perceptual quality assessment based on visual attention analysis,” in Proceedings of the 17th ACM international conference on Multimedia. ACM, 2009, pp. 561–564.
[104] A. B. Watson, G. Y. Yang, J. A. Solomon, and J. Villasenor, “Visibility of wavelet quantization noise,” Image Processing, IEEE Transactions on, vol. 6, no. 8, pp. 1164–1175, 1997.
[105] D. J. Fleet and A. D. Jepson, “Computation of component image velocity from local phase information,” International Journal of Computer Vision, vol. 5, no. 1, pp. 77–104, 1990.
[106] Z. Wang and Q. Li, “Video quality assessment using a statistical model of human visual speed perception,” JOSA A, vol. 24, no. 12, pp. B61–B69, 2007.
[107] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu, “On advances in statistical modeling of natural images,” Journal of Mathematical Imaging and Vision, vol. 18, no. 1, pp. 17–33, 2003.
[108] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
[109] Z. Wang and A. C. Bovik, “Mean squared error: love it or leave it? a new look at signal fidelity measures,” Signal Processing Magazine, IEEE, vol. 26, no. 1, pp. 98–117, 2009.
[110] B. A. Wandell, Foundations of Vision. Sinauer Associates, 1995.
[111] A. A. Stocker and E. P. Simoncelli, “Noise characteristics and prior expectations in human visual speed perception,” Nature Neuroscience, vol. 9, no. 4, pp. 578–585, 2006.
[112] U. Rajashekar, I. van der Linde, A. C. Bovik, and L. K. Cormack, “GAFFE: A gaze-attentive fixation finding engine,” Image Processing, IEEE Transactions on, vol. 17, no. 4, pp. 564–573, 2008.
[113] A. R. Reibman, S. Kanumuri, V. Vaishampayan, and P. C. Cosman, “Visibility of individual packet losses in MPEG-2 video,” in Image Processing (ICIP). International Conference on, vol. 1. IEEE, 2004, pp. 171–174.
[114] S. Kanumuri, P. Cosman, and A. R. Reibman, “A generalized linear model for MPEG-2 packet-loss visibility,” in Proceedings of 14th International Packet Video Workshop (PV04), 2004.


[115] S. Kanumuri, P. C. Cosman, A. R. Reibman, and V. A. Vaishampayan, “Modeling packet-loss visibility in MPEG-2 video,” Multimedia, IEEE Transactions on, vol. 8, no. 2, pp. 341–355, 2006.
[116] S. Kanumuri, S. G. Subramanian, P. C. Cosman, and A. R. Reibman, “Predicting H.264 packet loss visibility using a generalized linear model,” in Image Processing, 2006 IEEE International Conference on. IEEE, 2006, pp. 2245–2248.
[117] A. R. Reibman and D. Poole, “Characterizing packet-loss impairments in compressed video,” in Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 5. IEEE, 2007, pp. V–77.
[118] ——, “Predicting packet-loss visibility using scene characteristics,” in Packet Video 2007. IEEE, 2007, pp. 308–317.
[119] T.-L. Lin, S. Kanumuri, Y. Zhi, D. Poole, P. C. Cosman, and A. R. Reibman, “A versatile model for packet loss visibility and its application to packet prioritization,” Image Processing, IEEE Transactions on, vol. 19, no. 3, pp. 722–735, 2010.
[120] L. Breiman, Classification and Regression Trees. CRC Press, 1993.
[121] P. MacCullagh and J. A. Nelder, Generalized Linear Models. CRC Press, 1989, vol. 37.
[122] Z. Wang and E. P. Simoncelli, “Reduced-reference image quality assessment using a wavelet-domain natural image statistic model,” in Electronic Imaging 2005. International Society for Optics and Photonics, 2005, pp. 149–159.
[123] Z. Wang, G. Wu, H. R. Sheikh, E. P. Simoncelli, E.-H. Yang, and A. C. Bovik, “Quality-aware images,” Image Processing, IEEE Transactions on, vol. 15, no. 6, pp. 1680–1689, 2006.
[124] Q. Li and Z. Wang, “Reduced-reference image quality assessment using divisive normalization-based image representation,” Selected Topics in Signal Processing, IEEE Journal of, vol. 3, no. 2, pp. 202–211, 2009.
[125] W. Xue and X. Mou, “Reduced reference image quality assessment based on Weibull statistics,” in Quality of Multimedia Experience (QoMEX), 2010 Second International Workshop on. IEEE, 2010, pp. 1–6.
[126] R. Soundararajan and A. C. Bovik, “RRED indices: reduced reference entropic differencing framework for image quality assessment,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 1149–1152.
[127] A. A. Abdelouahad, M. El Hassouni, H. Cherifi, and D. Aboutajdine, “Image quality assessment measure based on natural image statistics in the tetrolet domain,” in Image and Signal Processing. Springer, 2012, pp. 451–458.
[128] A. Rehman and Z. Wang, “Reduced-reference image quality assessment by structural similarity estimation,” Image Processing, IEEE Transactions on, vol. 21, no. 8, pp. 3378–3389, 2012.
[129] P. Le Callet, F. Autrusseau et al., “Subjective quality assessment IRCCyN/IVC database,” 2005.
[130] Y. Horita, K. Shibata, Y. Kawayoke, and Z. P. Sazzad, “MICT image quality evaluation database,” 2011.
[131] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti, “TID2008 - a database for evaluation of full-reference visual quality assessment metrics,” Advances of Modern Radioelectronics, vol. 10, no. 4, pp. 30–45, 2009.
[132] E. C. Larson and D. Chandler, “Categorical image quality (CSIQ) database,” Online, http://vision.okstate.edu/csiq, 2010.
[133] Z. Wang and A. C. Bovik, “Reduced- and no-reference image quality assessment,” Signal Processing Magazine, IEEE, vol. 28, no. 6, pp. 29–40, 2011.
[134] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[135] I. P. Gunawan and M. Ghanbari, “Reduced-reference picture quality estimation by using local harmonic amplitude information,” in London Communications Symposium, vol. 2003, 2003.
[136] D. Tao, X. Li, W. Lu, and X. Gao, “Reduced-reference IQA in contourlet domain,” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 39, no. 6, pp. 1623–1627, 2009.
[137] A. Maalouf, M.-C. Larabi, and C. Fernandez-Maloigne, “A grouplet-based reduced reference image quality assessment,” in Quality of Multimedia Experience, 2009. QoMEx 2009. International Workshop on. IEEE, 2009, pp. 59–63.
[138] M. Carnec, P. Le Callet, and D. Barba, “Objective quality assessment of color images based on a generic perceptual reduced reference,” Signal Processing: Image Communication, vol. 23, no. 4, pp. 239–256, 2008.
[139] F. Yang and S. Wan, “Bitstream-based quality assessment for networked video: a review,” Communications Magazine, IEEE, vol. 50, no. 11, pp. 203–209, 2012.
[140] K. Brunnstrom, D. Hands, F. Speranza, and A. Webster, “VQEG validation and ITU standardization of objective perceptual video quality metrics [standards in a nutshell],” Signal Processing Magazine, IEEE, vol. 26, no. 3, pp. 96–101, 2009.
[141] A. K. Moorthy, K. Seshadrinathan, R. Soundararajan, and A. C. Bovik, “LIVE wireless video quality assessment database,” http://live.ece.utexas.edu/research/quality/live_wireless_video.html, 2009.
[142] D. M. Chandler, “Seven challenges in image quality assessment: past, present, and future research,” ISRN Signal Processing, vol. 2013, 2013.
[143] T. N. Pappas, T. A. Michel, and R. O. Hinds, “Supra-threshold perceptual image coding,” in Image Processing, 1996. Proceedings., International Conference on, vol. 1. IEEE, 1996, pp. 237–240.
[144] D. M. Chandler and S. S. Hemami, “Dynamic contrast-based quantization for lossy wavelet image compression,” Image Processing, IEEE Transactions on, vol. 14, no. 4, pp. 397–410, 2005.
[145] M. Vilas, X. G. Paneda, R. Garcia, D. Melendi, and V. G. Garcia, “User behavior analysis of a video-on-demand service with a wide variety of subjects and lengths,” in Software Engineering and Advanced Applications, 2005. 31st EUROMICRO Conference on. IEEE, 2005, pp. 330–337.
[146] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, “Youtube traffic characterization: a view from the edge,” in Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. ACM, 2007, pp. 15–28.
[147] X. Hei, C. Liang, J. Liang, Y. Liu, and K. W. Ross, “A measurement study of a large-scale P2P IPTV system,” Multimedia, IEEE Transactions on, vol. 9, no. 8, pp. 1672–1687, 2007.
[148] H. Yin, X. Liu, F. Qiu, N. Xia, C. Lin, H. Zhang, V. Sekar, and G. Min, “Inside the bird’s nest: measurements of large-scale live VoD from the 2008 Olympics,” in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, 2009, pp. 442–455.
[149] A. Finamore, M. Mellia, M. M. Munafo, R. Torres, and S. G. Rao, “YouTube everywhere: Impact of device and infrastructure synergies on user experience,” in Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference. ACM, 2011, pp. 345–360.
[150] L. Chen, Y. Zhou, and D. M. Chiu, “Video browsing - a study of user behavior in online VoD services,” in Computer Communications and Networks (ICCCN), 2013 22nd International Conference on. IEEE, 2013, pp. 1–7.
[151] T. De Pessemier, K. De Moor, W. Joseph, L. De Marez, and L. Martens, “Quantifying the influence of rebuffering interruptions on the user’s quality of experience during mobile video watching,” Broadcasting, IEEE Transactions on, vol. 59, no. 1, pp. 47–61, 2013.
[152] S. S. Krishnan and R. K. Sitaraman, “Video stream quality impacts viewer behavior: inferring causality using quasi-experimental designs,” in Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, 2012, pp. 211–224.
[153] A. Balachandran, V. Sekar, A. Akella, and S. Seshan, “Analyzing the potential benefits of CDN augmentation strategies for internet video workloads,” 2013.
[154] I. Ullah, G. Doyen, G. Bonnet, and D. Gaiti, “A survey and synthesis of user behavior measurements in P2P streaming systems,” Communications Surveys & Tutorials, IEEE, vol. 14, no. 3, pp. 734–749, 2012.
[155] A. U. Mian, Z. Hu, and H. Tian, “A decision theoretic approach for in-service QoE estimation and prediction of P2P live video streaming systems based on user behavior modeling and context awareness.”
[156] I. Ullah, G. Doyen, G. Bonnet, and D. Gaiti, “User behavior anticipation in P2P live video streaming systems through a Bayesian network,” in Integrated Network Management (IM), 2011 IFIP/IEEE International Symposium on. IEEE, 2011, pp. 337–344.
[157] W. R. Shadish, T. D. Cook, and D. T. Campbell, “Experimental and quasi-experimental designs for generalized causal inference,” 2002.
[158] D. Wang, P. C. Cosman, and L. B. Milstein, “Cross layer resource allocation design for uplink video OFDMA wireless systems,” in Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE. IEEE, 2011, pp. 1–6.
[159] Y. P. Fallah, H. Mansour, S. Khan, P. Nasiopoulos, and H. M. Alnuweiri, “A link adaptation scheme for efficient transmission of H.264 scalable video over multirate WLANs,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 18, no. 7, pp. 875–887, 2008.
[160] Y. Zhang, W. Gao, Y. Lu, Q. Huang, and D. Zhao, “Joint source-channel rate-distortion optimization for H.264 video coding over error-prone networks,” Multimedia, IEEE Transactions on, vol. 9, no. 3, pp. 445–454, 2007.


[161] Y. Wang, M. van der Schaar, S.-F. Chang, and A. C. Loui, “Classification-based multidimensional adaptation prediction for scalable video coding using subjective quality evaluation,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 15, no. 10, pp. 1270–1279, 2005.
[162] A. A. Khalek, C. Caramanis, and R. Heath, “A cross-layer design for perceptual optimization of H.264/SVC with unequal error protection,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 7, pp. 1157–1171, 2012.
[163] L. Toni, P. C. Cosman, and L. B. Milstein, “Channel coding optimization based on slice visibility for transmission of compressed video over OFDM channels,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 7, pp. 1172–1183, 2012.
[164] H.-P. Shiang and M. van der Schaar, “Media-TCP: A quality-centric TCP-friendly congestion control for multimedia transmission,” arXiv preprint arXiv:0910.4186, 2009.
[165] O. Habachi, Y. Hu, M. van der Schaar, Y. Hayel, and F. Wu, “MOS-based congestion control for conversational services in wireless environments,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 7, pp. 1225–1236, 2012.
[166] M. A. Masry and S. S. Hemami, “A metric for continuous quality evaluation of compressed video with severe distortions,” Signal Processing: Image Communication, vol. 19, no. 2, pp. 133–146, 2004.
[167] K. Seshadrinathan and A. C. Bovik, “Temporal hysteresis model of time varying subjective video quality,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 1153–1156.
[168] S. Singh, J. G. Andrews, and G. de Veciana, “Interference shaping for improved quality of experience for real-time video streaming,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 7, pp. 1259–1269, 2012.
[169] E. Ibarrola, F. Liberal, I. Taboada, and R. Ortega, “Web QoE evaluation in multi-agent networks: Validation of ITU-T G.1030,” in Autonomic and Autonomous Systems, 2009. ICAS’09. Fifth International Conference on. IEEE, 2009, pp. 289–294.
[170] P. Reichl, S. Egger, R. Schatz, and A. D’Alconzo, “The logarithmic nature of QoE and the role of the Weber-Fechner law in QoE assessment,” in Communications (ICC), 2010 IEEE International Conference on. IEEE, 2010, pp. 1–5.
[171] S.-T. Sheu and T.-F. Sheu, “A bandwidth allocation/sharing/extension protocol for multimedia over IEEE 802.11 ad hoc wireless LANs,” Selected Areas in Communications, IEEE Journal on, vol. 19, no. 10, pp. 2065–2080, 2001.
[172] K. Piamrat, A. Ksentini, C. Viho, and J.-M. Bonnin, “QoE-aware admission control for multimedia applications in IEEE 802.11 wireless networks,” in Vehicular Technology Conference, 2008. VTC 2008-Fall. IEEE 68th. IEEE, 2008, pp. 1–5.
[173] ——, “QoE-based network selection for multimedia users in IEEE 802.11 wireless networks,” in Local Computer Networks, 2008. LCN 2008. 33rd IEEE Conference on. IEEE, 2008, pp. 388–394.
[174] G. Rubino, M. Varela, and J.-M. Bonnin, “Controlling multimedia QoS in the future home network using the PSQA metric,” The Computer Journal, vol. 49, no. 2, pp. 137–155, 2006.
[175] T. Jiang, H. Wang, and A. V. Vasilakos, “QoE-driven channel allocation schemes for multimedia transmission of priority-based secondary users over cognitive radio networks,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 7, pp. 1215–1224, 2012.
[176] J.-G. Kim, Y. Wang, and S.-F. Chang, “Content-adaptive utility-based video adaptation,” in Multimedia and Expo (ICME). International Conference on, vol. 3. IEEE, 2003, pp. III–281.
[177] Y. Wang, J.-G. Kim, and S.-F. Chang, “Content-based utility function prediction for real-time MPEG-4 video transcoding,” in Image Processing (ICIP). International Conference on, vol. 1. IEEE, 2003, pp. I–189.
[178] J. Villalon, P. Cuenca, L. Orozco-Barbosa, Y. Seok, and T. Turletti, “Cross-layer architecture for adaptive video multicast streaming over multirate wireless LANs,” Selected Areas in Communications, IEEE Journal on, vol. 25, no. 4, pp. 699–711, 2007.
[179] M. A. Santos, J. Villalon, and L. Orozco-Barbosa, “A novel QoE-aware multicast mechanism for video communications over IEEE 802.11 WLANs,” Selected Areas in Communications, IEEE Journal on, vol. 30, no. 7, pp. 1205–1214, 2012.
[180] M. Draxler and H. Karl, “Cross-layer scheduling for multi-quality video streaming in cellular wireless networks,” in Wireless Communications and Mobile Computing Conference (IWCMC), 2013 9th International. IEEE, 2013, pp. 1181–1186.
[181] A. Bokani, M. Hassan, and S. Kanhere, “HTTP-based adaptive streaming for mobile clients using Markov decision process,” in Packet Video Workshop (PV), 2013 20th International. IEEE, 2013, pp. 1–8.
[182] J. Jiang, V. Sekar, and H. Zhang, “Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE,” in Proceedings of the 8th international conference on Emerging networking experiments and technologies. ACM, 2012, pp. 97–108.
[183] J. Chen, R. Mahindra, M. A. Khojastepour, S. Rangarajan, and M. Chiang, “A scheduling framework for adaptive video delivery over cellular networks,” in Proceedings of the 19th annual international conference on Mobile computing & networking. ACM, 2013, pp. 389–400.
[184] G. Saygili, C. G. Gurler, and A. M. Tekalp, “Evaluation of asymmetric stereo video coding and rate scaling for adaptive 3D video streaming,” Broadcasting, IEEE Transactions on, vol. 57, no. 2, pp. 593–601, 2011.
[185] S. S. Savas, C. G. Gurler, and A. M. Tekalp, “Evaluation of adaptation methods for multi-view video,” in Image Processing (ICIP), 2012 19th IEEE International Conference on. IEEE, 2012, pp. 2273–2276.
[186] C. Hewage, H. Appuhami, M. Martini, R. Smith, I. Jourdan, and T. Rockall, “Quality evaluation of asymmetric compression for 3D surgery video,” in e-Health Networking, Applications & Services (Healthcom), 2013 IEEE 15th International Conference on, Oct 2013, pp. 680–684.
[187] C. T. Hewage, M. G. Martini, M. Brandas, and D. De Silva, “A study on the perceived quality of 3D video subject to packet losses,” in Communications Workshops (ICC), 2013 IEEE International Conference on. IEEE, 2013, pp. 662–666.
[188] W. Wu, A. Arefin, R. Rivas, K. Nahrstedt, R. Sheppard, and Z. Yang, “Quality of experience in distributed interactive multimedia environments: toward a theoretical framework,” in Proceedings of the 17th ACM international conference on Multimedia. ACM, 2009, pp. 481–490.
[189] S. Egger, M. Ries, and P. Reichl, “Quality-of-experience beyond MOS: experiences with a holistic user test methodology for interactive video services,” in 21st ITC Specialist Seminar on Multimedia Applications - Traffic, Performance and QoE, 2010, pp. 13–18.
[190] J. Short, E. Williams, and B. Christie, “The social psychology of telecommunications,” 1976.
[191] P. Calyam, E. Ekici, C.-G. Lee, M. Haffner, and N. Howes, “A “gap-model” based framework for online VVoIP QoE measurement,” Communications and Networks, Journal of, vol. 9, no. 4, pp. 446–456, 2007.
[192] “Recommendation ITU-R BT.2020: Parameter values for ultra-high definition television systems for production and international programme exchange,” 2012.
[193] P. Hanhart, M. Rerabek, F. De Simone, and T. Ebrahimi, “Subjective quality evaluation of the upcoming HEVC video compression standard,” in SPIE Optical Engineering + Applications. International Society for Optics and Photonics, 2012, pp. 84990V–84990V.
[194] M. Horowitz, F. Kossentini, N. Mahdi, S. Xu, H. Guermazi, H. Tmar, B. Li, G. J. Sullivan, and J. Xu, “Informal subjective quality comparison of video compression performance of the HEVC and H.264/MPEG-4 AVC standards for low-delay applications,” in SPIE Optical Engineering + Applications. International Society for Optics and Photonics, 2012, pp. 84990W–84990W.
[195] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, “HEVC: the new gold standard for video compression: how does HEVC compare with H.264/AVC?” Consumer Electronics Magazine, IEEE, vol. 1, no. 3, pp. 36–46, 2012.
[196] S. Jumisko-Pyykko and M. M. Hannuksela, “Does context matter in quality evaluation of mobile television?” in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. ACM, 2008, pp. 63–72.
[197] K. De Moor, I. Ketyko, W. Joseph, T. Deryckere, L. De Marez, L. Martens, and G. Verleye, “Proposed framework for evaluating quality of experience in a mobile, testbed-oriented living lab setting,” Mobile Networks and Applications, vol. 15, no. 3, pp. 378–391, 2010.
[198] T. De Pessemier, K. De Moor, W. Joseph, L. De Marez, and L. Martens, “Quantifying subjective quality evaluations for mobile video watching in a semi-living lab context,” Broadcasting, IEEE Transactions on, vol. 58, no. 4, pp. 580–589, 2012.
[199] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, “A survey on wireless multimedia sensor networks,” Computer Networks, vol. 51, no. 4, pp. 921–960, 2007.
[200] S. Ehsan and B. Hamdaoui, “A survey on energy-efficient routing techniques with QoS assurances for wireless multimedia sensor networks,” Communications Surveys & Tutorials, IEEE, vol. 14, no. 2, pp. 265–278, 2012.


[201] S. Pudlewski, A. Prasanna, and T. Melodia, “Compressed-sensing-enabled video streaming for wireless multimedia sensor networks,” Mobile Computing, IEEE Transactions on, vol. 11, no. 6, pp. 1060–1072, 2012.
[202] P. Bucciol, E. Masala, N. Kawaguchi, K. Takeda, and J. De Martin, “Performance evaluation of H.264 video streaming over inter-vehicular 802.11 ad hoc networks,” in Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE 16th International Symposium on, vol. 3. IEEE, 2005, pp. 1936–1940.
[203] F. Xie, K. A. Hua, W. Wang, and Y. H. Ho, “Performance study of live video streaming over highway vehicular ad hoc networks,” in Vehicular Technology Conference (VTC), IEEE 66th. IEEE, 2007, pp. 2121–2125.
[204] I. Rozas-Ramallal, T. M. Fernandez-Carames, A. Dapena, and P. A. Cuenca-Castillo, “Improving performance of H.264/AVC transmissions over vehicular networks,” in Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on. IEEE, 2013, pp. 1324–1327.
[205] E. Nygren, R. K. Sitaraman, and J. Sun, “The Akamai network: a platform for high-performance internet applications,” ACM SIGOPS Operating Systems Review, vol. 44, no. 3, pp. 2–19, 2010.

YANJIAO CHEN received her B.E. degree in electronic engineering from Tsinghua University in 2010. She is currently a Ph.D. candidate at the Hong Kong University of Science and Technology. Her research interests include spectrum management for femtocell networks and network economics.

KAISHUN WU is currently a research assistant professor in the Fok Ying Tung Graduate School at the Hong Kong University of Science and Technology (HKUST). He received the Ph.D. degree in computer science and engineering from HKUST in 2011. His research interests include wireless communication, mobile computing, wireless sensor networks and data center networks.

QIAN ZHANG joined the Hong Kong University of Science and Technology in September 2005, where she is a full Professor in the Department of Computer Science and Engineering. Before that, she was with Microsoft Research Asia, Beijing, from July 1999, where she was the research manager of the Wireless and Networking Group. Dr. Zhang has published about 300 refereed papers in international leading journals and key conferences in the areas of wireless/Internet multimedia networking, wireless communications and networking, wireless sensor networks, and overlay networking. She is a Fellow of the IEEE for “contribution to the mobility and spectrum management of wireless networks and mobile communications”. Dr. Zhang has received the MIT TR100 (MIT Technology Review) world’s top young innovator award. She also received the Best Asia Pacific (AP) Young Researcher Award elected by the IEEE Communication Society in 2004. Her current research is on cognitive and cooperative networks, dynamic spectrum access and management, as well as wireless sensor networks. Dr. Zhang received the B.S., M.S., and Ph.D. degrees from Wuhan University, China, in 1994, 1996, and 1999, respectively, all in computer science.
