
A No-Reference Video Streaming QoE Estimator based on Physical Layer 4G Radio Measurements

D. Moura (1,4) and M. Sousa (1,2,4)
(1) Instituto Superior Técnico
(2) CELFINET, Consultoria em Telecomunicações, Lda., Lisbon, Portugal
[email protected], marco.sousa@celfinet.com

P. Vieira (3,4)
(3) Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal
[email protected]

A. Rodrigues (1,4) and M. P. Queluz (1,4)
(4) Instituto de Telecomunicações, Lisbon, Portugal
[ar, paula.queluz]@lx.it.pt

Abstract—With the increase in consumption of multimedia content through mobile devices (e.g., smartphones), it is crucial to find new ways of optimizing current and future wireless networks and to continuously give users a better Quality of Experience (QoE) when accessing that content. To achieve this goal, it is necessary to provide Mobile Network Operators (MNOs) with real-time monitoring of QoE for multimedia services (e.g., video streaming, web browsing), enabling fast network optimization and effective resource management. This paper proposes a new QoE prediction model for video streaming services over 4G networks, using layer 1 (i.e., Physical Layer) key performance indicators (KPIs). The model estimates the service Mean Opinion Score (MOS) based on a Machine Learning (ML) algorithm, using real MNO drive test (DT) data where both application layer and layer 1 metrics are available. From the several considered ML algorithms, Gradient Tree Boosting (GTB) showed the best performance, achieving a Pearson correlation of 78.9%, a Spearman correlation of 66.8% and a Mean Squared Error (MSE) of 0.114 on a test set with 901 examples. Finally, the proposed model was tested with new DT data together with the network's configuration. With the use case results, QoE predictions were analyzed according to the context in which the session was established, the radio transmission environment and radio channel quality indicators.

Index Terms—Mobile Wireless Networks, LTE, Video Streaming, Quality of Experience, Machine Learning.

I. INTRODUCTION

One of the most prominent issues that current and next-generation wireless network operators will face is assuring high Quality of Experience (QoE) to an increasing number of users, in services that generate a high network load, such as video streaming [1]. In this context, much work has been done on assessing Quality of Service (QoS) through the measurement of specific network key performance indicators (KPIs) and drive tests (DTs) (e.g., [2] [3] [4]). However, these metrics cannot directly assess the user's QoE. Most of the proposed video streaming QoE models (e.g., [5] [6]) rely on application layer metrics that are not measurable in real time by the Mobile Network Operators (MNOs), being only available at the client side; thus, these models are inadequate to predict the QoE from the MNO's perspective. Hence, MNOs have limited capability to obtain a proper gauge of the network's QoE status at any given time, undermining near real-time monitoring and optimization.

Some work has been developed regarding the estimation of QoE for other services, namely voice and web browsing [7], using real network data. Concerning video streaming QoE estimation, although it has been an active research area, to the best of the authors' knowledge most of the works target analytical and simulation approaches (e.g., [8] [9]), lacking contributions with real MNO data. In this work, a QoE model for predicting the quality that a user experiences in a video streaming session over a 4G network, based on objective radio channel QoS metrics, is proposed. To this end, Machine Learning (ML) techniques are used with the goal of building a mathematical model between network QoS metrics and a QoE metric, the Mean Opinion Score (MOS), based on data obtained from several dedicated DTs in a live 4G network.

The paper is organized as follows: in Section II the used data is analyzed; Section III presents the QoE model development process and results; Section IV presents a real scenario where the model is applied. Finally, conclusions are drawn in Section V.

II. DATA ANALYSIS

The development of the proposed QoE model was supported by data provided from dedicated DT campaigns, where video streaming services were established and monitored under the following conditions:

• The video streaming service was tested by playing a YouTube video, from a set of three different videos, two of them lasting 45 s and one lasting 47 s.

• Several application layer metrics were measured during the YouTube video session (e.g., number of video freezes, average video resolution), along with layer 1 metrics (e.g., Channel Quality Indicator (CQI), Reference Signal Received Quality (RSRQ), Scheduled Throughput).

• The 4G networks of three MNOs were tested.

• Two distinct mobile devices, Samsung Galaxy S8 (SM-G950F) and Sony XZ (F8331), were used.


• The DT campaigns were conducted in different locations (cities, highways, suburban and others) and mobility settings (no mobility or at different speeds).

From the DT campaigns, the sessions with missing KPIs or with incomplete data due to failure in initiating the video stream were removed. This resulted in a dataset for the model development with 4510 video streaming sessions, of which 80% (3608 sessions) were used as training data and the remaining sessions (901 sessions) were used for testing the proposed model's performance. For each session, 60 network KPIs were retained together with the available application layer metrics.

A. QoE Reference Model

In adaptive video streaming, the server has different representations of the same video, each one with a different bitrate (and subjective quality), and each representation is divided into segments, which can be independently decoded by the client. During the streaming session, the client may switch between the different video representations, in order to cope with channel throughput fluctuations. The main application layer metrics that condition the QoE along the streaming session are [10]: initial playout delay (due to initial buffering at the client side), average quality of the transmitted video, frequency and amplitude of the video quality switches, video freeze frequency (due to an empty buffer) and average video freeze duration. The objective model proposed in [5] incorporates all of these metrics and was used in the present work to obtain the MOS for each streaming session (since subjective assessments, obtained with real users, were not available); it is formally described by:

MOS_est = 5.67 · (q̄ / q_max) − 6.72 · (σ_q / q_max) − 4.95 · F + 0.17    (1)

where q̄ represents the average video quality level (requested by the client) and σ_q its standard deviation; both are normalized with respect to the highest available quality level, q_max, for that video. Parameter F models the influence of freezes and is given by:

F = (7/8) · max(ln(φ)/6 + 1, 0) + (1/8) · (min(ψ, 15)/15)    (2)

where φ and ψ represent the freeze frequency and the average freeze duration, respectively. In this work, since YouTube generates the different representations for the same video by changing their spatial resolution, the q̄ and q_max parameters are given by the average transmitted video spatial resolution and by the maximum video spatial resolution available for that video, respectively, which are directly available in the dataset. Both φ and ψ were obtained through simple calculations using some of the parameters provided in the dataset, as follows:

φ = Interruptions [#] / Duration [s]    (3)

ψ = Interruptions [s] / Interruptions [#]    (4)

The σ_q parameter could not be obtained, since the data regarding the changes in segment quality required for this calculation was not available in the dataset. σ_q was set to zero, assuming that, due to the small size of the smartphone screens on which these videos were/are going to be watched, the changes in resolution do not greatly impact the users' QoE. After computing the parameters of (1) for each session, the corresponding MOS values were obtained. Since (1) results in MOS values in the range [0, 5.84], these were linearly rescaled to the usual range, [1, 5]. The rescaled MOS values are shown in Figure 1.
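As an illustrative sketch, the reference model of (1)-(4), including the final rescaling from [0, 5.84] to [1, 5], can be written as follows; the argument names are hypothetical stand-ins for the dataset's session parameters, and σ_q is set to zero as described above:

```python
import math

def estimate_mos(avg_resolution, max_resolution,
                 n_interruptions, interruption_time_s, duration_s):
    """Sketch of the reference QoE model, eqs. (1)-(4).
    Argument names are assumptions, not the dataset's field names."""
    q = avg_resolution / max_resolution  # normalized average quality level
    sigma_q = 0.0                        # quality-switch term, set to zero here

    # Freeze frequency (3) and average freeze duration (4)
    phi = n_interruptions / duration_s
    psi = interruption_time_s / n_interruptions if n_interruptions else 0.0

    # Freeze impact F, eq. (2); with no freezes F is zero
    if phi > 0:
        F = (7 / 8) * max(math.log(phi) / 6 + 1, 0) \
            + (1 / 8) * (min(psi, 15) / 15)
    else:
        F = 0.0

    mos = 5.67 * q - 6.72 * sigma_q - 4.95 * F + 0.17  # eq. (1)
    # Linearly rescale from [0, 5.84] to the usual [1, 5] MOS range
    return 1 + 4 * mos / 5.84

# A session at full resolution with no freezes maps to the top of the scale.
print(estimate_mos(1080, 1080, 0, 0.0, 45))
```

A session with freezes or a lower average resolution yields proportionally lower MOS values under this sketch.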

Fig. 1: MOS distribution (all sessions).

From Figure 1, it can be seen that the dataset is extremely unbalanced, with the MOS values in the interval ]4, 5] in a clear majority. This may make it difficult for the ML process to accurately predict the less represented, lower MOS values, which is undesirable, since these correspond to the sessions in which the user experiences critical levels of video streaming quality and to which the MNO needs to be alerted.

B. Dimensionality Reduction

Since 60 network KPIs per streaming session were available

for model development, a feature selection process was carried out with the purpose of reducing the ML training process complexity and overfitting. In the first stage, to eliminate features providing similar information, the Pearson Correlation Coefficient (PCC) between all KPIs was computed; the resulting correlation matrix is presented in Figure 2.

To eliminate some of the most correlated features, a threshold of 0.8 was applied to the PCC values, reducing the number of KPI features to 37.

In the second correlation stage, the PCC and the Spearman Correlation Coefficient (SCC) between the remaining features and the respective MOS values were computed. After obtaining the Cumulative Distribution Function (CDF) of the PCC and SCC values, it was verified that the 50th percentiles were, respectively, 0.143 and 0.089. Given these low correlation values, the KPIs included in the lower 50th percentile of both CDF distributions were discarded, resulting in 22 KPI features to be used in the model development.
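The two-stage filtering described above can be sketched as follows, assuming the KPIs are held in a pandas DataFrame (the data layout is an assumption; the 0.8 threshold and the median cutoffs are the ones used in the text):

```python
import numpy as np
import pandas as pd

def select_features(X: pd.DataFrame, mos: pd.Series, pcc_thresh: float = 0.8):
    """Two-stage KPI filtering sketch.
    Stage 1: drop one KPI of each pair with pairwise |PCC| > pcc_thresh.
    Stage 2: drop KPIs below the median correlation with the MOS on
    both the PCC and SCC distributions."""
    # Stage 1: pairwise Pearson correlation; inspect the upper triangle only
    corr = X.corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > pcc_thresh).any()]
    X = X.drop(columns=drop)

    # Stage 2: correlation of each remaining KPI with the MOS
    pcc = X.apply(lambda col: col.corr(mos, method="pearson")).abs()
    scc = X.apply(lambda col: col.corr(mos, method="spearman")).abs()
    keep = (pcc > pcc.median()) | (scc > scc.median())  # keep if above median on either
    return X.loc[:, keep]
```

On the paper's data this procedure reduced 60 KPIs to 37 and then to 22; on other data the counts will of course differ.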


Fig. 2: Heatmap of the Pearson correlation matrix between all 60 network KPIs.

III. QOE MODEL DEVELOPMENT

The proposed model aims to predict the MOS values of video streaming sessions from the appropriate network KPIs - available in real time to the MNOs - and was obtained using a learning algorithm over DT data provided by MNOs. According to [11], ensemble methods - where multiple learning algorithms are combined (e.g., decision trees) - may offer significant improvements in both robustness to skewed distributions and in predictive power when used with unbalanced training sets, which is the case in this work.

For this reason, several ensemble algorithms were considered during model development, namely: Gradient Tree Boosting (GTB), Extra Trees (ET), Random Forest (RF) and Adaptive Boosting (AB). Additionally, the Support Vector Regression (SVR) algorithm was used as a comparison between ensemble-based methods and a commonly used regression algorithm.

A. Performance Metrics

During model development, the following performance metrics were used:

• Correlations - Spearman and Pearson correlations were used to assess the relationship between predicted and ground truth MOS values;

• 10-Fold Cross Validation (CV) using Mean Squared Error (MSE) - Allows tuning the ML model hyperparameters by splitting the training data into ten subsets, where in each iteration nine subsets are used for training and the remaining one is used for validation. The best hyperparameter configuration is the one that results in the lowest MSE between predicted and ground truth MOS values;

• Stratified Error - By splitting the dataset into four MOS strata (i.e., [1, 2], ]2, 3], ]3, 4], ]4, 5]), the MSE can be computed per stratum, allowing a higher granularity in the error analysis. In fact, due to the MOS distribution being skewed towards the higher values, if the MSE is measured on the whole MOS range it will be mainly influenced by the error in the higher MOS values, preventing a proper assessment of the model performance in the lower MOS range;

• Mean Absolute Scaled Error (MASE) - The MASE compares the Mean Absolute Error (MAE) of the model's predictions with the predictions of a naive model [12], which, for this dataset, can be a model that simply outputs the median of the MOS values;

• R-Squared - Represents the proportion of variance of the prediction results that is explained by the independent variables (features) of the model. It provides an indication of goodness of fit and, therefore, a measure of how well unseen samples are likely to be predicted by the model.
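As a sketch, the Stratified Error and the MASE (with the median-output naive model) can be computed as follows; the interval labels mirror the strata above:

```python
import numpy as np

def stratified_mse(y_true, y_pred):
    """Per-stratum MSE over the MOS intervals [1,2], ]2,3], ]3,4], ]4,5]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    strata = {"[1, 2]": (y_true >= 1) & (y_true <= 2),
              "]2, 3]": (y_true > 2) & (y_true <= 3),
              "]3, 4]": (y_true > 3) & (y_true <= 4),
              "]4, 5]": (y_true > 4) & (y_true <= 5)}
    return {name: float(np.mean((y_true[m] - y_pred[m]) ** 2))
            for name, m in strata.items() if m.any()}

def mase(y_true, y_pred):
    """MAE of the model scaled by the MAE of a naive model that always
    predicts the median MOS (the baseline suggested in the text)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    naive_mae = np.mean(np.abs(y_true - np.median(y_true)))
    return float(np.mean(np.abs(y_true - y_pred)) / naive_mae)
```

A MASE below 1 then indicates the model outperforms the naive median predictor.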

B. Data Balancing and ML Algorithms Analysis

To address the concerns posed at the end of Section II-A regarding the possible learning impairments when the ML algorithms receive a severely unbalanced dataset as input, the initial training set was modified so that all MOS strata were more evenly represented. To test the hypothesis that the prediction accuracy for the more critical MOS values - those belonging to the strata [1, 2] and ]2, 3] - could increase, 90% of the sessions belonging to the interval ]4, 5] were removed, at random, from the training set; this achieves a more balanced training set while still capturing the tendency, from the original distribution, of the highest MOS interval being the most represented. Ten models were then created using the unbalanced and balanced training sets and the five aforementioned ML algorithms; the models' prediction performance was analyzed using the metrics listed in Section III-A, and the results are presented in Table I. In this table, the darker colored entries identify, for each performance metric and algorithm, the training set with the best performance, balanced or unbalanced. From the results' analysis, the following can be stated:

TABLE I: Balanced (B) and Unbalanced (UB) training set performance comparison.

• For all ML algorithms, the majority of the performance metrics indicates that the unbalanced training set produces the best results; however, from the Stratified Error metric, it is visible that there is a decrease in the MSE from the unbalanced to the balanced training set in the intervals ]2, 3] and ]3, 4], and only a marginal increase in the interval [1, 2] for the ET and AB algorithms. Additionally, an increase in MSE is verified from the unbalanced to the balanced training set in the ]4, 5] stratum, which is not significant, since it is more important to accurately predict the MOS in the lower strata. The decrease in MSE seen in the intervals ]2, 3] and ]3, 4] for all algorithms, and in [1, 2] for the GTB, RF and SVR algorithms, shows that when a ML algorithm uses a balanced training set, it can produce more accurate predictions for less represented values than when using an unbalanced training set;

• All considered ML algorithms could be used, with varying degrees of performance and complexity, to obtain acceptable MOS predictions;

• Regarding the ensemble learning algorithms, both the GTB and RF have similar performances when using the balanced training set in terms of the Stratified Error, and are able to outperform the ET and AB algorithms in this particular metric for the two lowest MOS intervals. Furthermore, the GTB algorithm outperforms the RF in six of the performance metrics, being only marginally worse in one of them;

• The SVR produces the best predictions for most of the lower Stratified Error intervals, but performs significantly worse than the other algorithms in all other metrics.

The correct prediction of the lowest MOS interval is the most important, since these values correspond to the most critical QoE that a user has when watching a video; such poor experience, when accurately predicted, can alert an MNO, allowing improvements to be made to its network, ultimately increasing customer satisfaction and decreasing service churn rates. Thus, the balanced training set was used to train the final model.
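The balancing step can be sketched as follows, assuming the training sessions live in a pandas DataFrame with a hypothetical mos column:

```python
import numpy as np
import pandas as pd

def downsample_top_stratum(train: pd.DataFrame, mos_col: str = "mos",
                           frac_removed: float = 0.9, seed: int = 42):
    """Randomly remove frac_removed (90% in the text) of the training
    sessions whose MOS lies in ]4, 5]; other strata are left untouched.
    The column name is an assumption, not the paper's."""
    rng = np.random.default_rng(seed)
    top = train.index[(train[mos_col] > 4) & (train[mos_col] <= 5)]
    drop = rng.choice(top, size=int(len(top) * frac_removed), replace=False)
    return train.drop(index=drop)
```

Keeping 10% of the top stratum preserves its status as the most represented interval while letting the lower strata carry more weight during training.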

Table I shows that the best performing ML algorithm in most of the metrics is the ET algorithm. However, concerning the Stratified Error, and as can be seen in Figure 3, the ET algorithm underperforms the others in the two lowest intervals of MOS values, which is not desirable. As such, the algorithm that follows is the GTB, which has the second-best performance in most metrics and outperforms the ET algorithm in the three lowest MOS intervals for the Stratified Error. Together with its low complexity, this makes it a candidate model, as it balances prediction accuracy for lower and higher MOS values. Therefore, the GTB algorithm was selected for the QoE model development, as it is expected to produce accurate MOS predictions.

Fig. 3: Barplot of Stratified Error results (all algorithms).
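As a sketch of this model selection step, the following tunes a scikit-learn GradientBoostingRegressor by 10-fold CV on the MSE; the synthetic data and the hyperparameter grid are illustrative, not the paper's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic regression data (stand-in for the KPI features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 10-fold CV grid search; the configuration with the lowest MSE wins
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=10,
)
search.fit(X, y)
gtb = search.best_estimator_
```

search.best_params_ then holds the hyperparameter configuration with the lowest cross-validated MSE, mirroring the 10-Fold CV metric of Section III-A.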

C. Proposed QoE Model

After training with the GTB algorithm, the "importance" of each feature was obtained, according to [13]; it indicates the feature's relevance for the model learning: the higher the feature importance, the greater the model's dependence on that feature. To determine how many non-relevant features could be removed while maintaining the model performance, the features were ranked according to their importance, and several models were trained with an increasing number of features, starting with the most important ones (e.g., the 2nd model was trained with the two most important features, while the 10th model was trained with the 10 most important features). The metrics used to analyze each model were the MSE, the 10-Fold CV and the Stratified Error; additionally, the Adjusted R-Squared was used. When the Adjusted R-Squared stops improving, it can be assumed that adding the remaining features is not necessary and that they can be discarded, decreasing the model complexity. The results of the model performance with an increasing number of used features are presented in Figure 4.
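The incremental feature-selection procedure can be sketched as follows: rank the features by GTB importance, then retrain with the top-k features for increasing k and record the cross-validated MSE (the data layout is assumed; in the paper the Stratified Error and Adjusted R-Squared were also tracked):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def rank_and_prune(X, y, cv=10):
    """Rank features by GTB importance, then retrain with the top-k
    features for k = 1..n, returning the ranking and the CV MSE per k
    (the curve of Figure 4 can be reproduced from these scores)."""
    base = GradientBoostingRegressor(random_state=0).fit(X, y)
    order = np.argsort(base.feature_importances_)[::-1]  # most important first
    scores = []
    for k in range(1, X.shape[1] + 1):
        mse = -cross_val_score(
            GradientBoostingRegressor(random_state=0),
            X[:, order[:k]], y, cv=cv,
            scoring="neg_mean_squared_error").mean()
        scores.append(mse)
    return order, scores
```

The cutoff is then the smallest k after which the scores stop decreasing significantly (11 features in the paper's case).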

From this figure, it can be observed that, after the addition of 11 features, most error metrics do not decrease significantly. Therefore, the number of features was set at 11, since the addition of more (and less important) features does not increase the performance of the model, but does increase its complexity and the possibility of overfitting. The 11 selected features were separated into three groups:

• Transmission Rate (80.9% importance): Maximum Aggregate Scheduled Throughput, Aggregate Average Scheduled Throughput, Scheduled Throughput;

• Channel Quality (11.5% importance): Uplink (UL) Acknowledgments (ACKs), RSRQ, Downlink (DL) Num Transport Blocks (TBs), DL Avg Block Error Rate (BLER) and DL StDev CQI;

• Modulation and Coding (5.8% importance): DL Modulation (MOD) Quadrature Phase Shift Keying (QPSK), DL MOD 64-Quadrature Amplitude Modulation (64QAM), DL MOD 16QAM.

Fig. 4: Model performance with a cumulative number of used features.

The results obtained with the reduced-features model on the test set are shown in Table II.

TABLE II: QoE model performance using the GTB algorithm and the reduced set of features.

From Table II, the following comments can be drawn:

• The value of the MSE for the entire test set (i.e., 0.114) is close to the stratified MSE for the interval ]4, 5] of the test set (i.e., 0.103), which is expected since this is the interval where most of the ground truth MOS values belong - even after training set balancing - meaning that the errors for the remaining intervals of the Stratified Error do not significantly influence this value. This shows that the test set MSE does not properly gauge the prediction accuracy for the lower MOS values.

• From the Stratified Error metric, it can be noted that the proposed model predicts the lowest intervals of the MOS scale with low error.

• When comparing the 10-Fold CV metric with the MSE for this test set, it can be asserted that the performance verified for this particular test set is similar to that of the entire dataset using CV. This suggests that the model performance is maintained over several test sets and is likely to be maintained when the model is used on new data.

IV. USE CASE

After obtaining a new, unseen dataset with the coordinates of these sessions, a geographical representation of the MOS predictions was obtained, from which it was possible to identify areas with low MOS predictions. These areas were scrutinized by analyzing the radio context in which the sessions took place. In this dataset, the information regarding which cell the User Equipment (UE) was connected to, for the duration of the session, was also available. Together with the network configuration information, a detailed analysis was conducted. This way, it was possible to conjecture about why the MOS degradation was registered, taking into account the 11 considered radio KPIs and the network configuration, leading to the identification of, for example, poor coverage, interference issues or low capacity.

Fig. 5: Example area where a prediction of poor MOS is presented.

In Figure 5, an example of a low MOS area is shown, with the path of the vehicle in which the sessions were conducted and their corresponding MOS predictions. Several video sessions present a good (4) or excellent (5) MOS value. However, one session presents a MOS of 3, which could indicate an area with poor radio channel conditions. Additionally, the sites to which the UE connected are Site A and Site B (shown with blue markers in Figure 5). From the dataset, it was possible to verify that the UE also connected to another two eNodeBs, positioned further away, throughout the streaming session. It was also verified that the RSRQ was low for that particular session, probably because, during the video session, the received power from neighboring cells degraded the radio channel conditions, decreasing the signal to interference plus noise ratio. This resulted in the UE switching to a more robust modulation scheme, in this case QPSK, as was verified in the DT data. A consequence of using a more robust modulation scheme is a decrease in throughput, which has a direct impact on the QoE, since adaptive video streaming is very sensitive to throughput changes. This throughput variation leads to a degradation in adaptive streaming quality, which is correctly pointed out by the proposed QoE prediction model.

V. CONCLUSIONS

This paper presents a novel QoE prediction model for video streaming, based on real 4G data, that estimates the MOS given the required layer 1 KPIs. From the developed work, it was found that the ML regression process, and consequently the models' predictions, benefit from balancing the distribution of the training data. To substantiate the benefits of training set balancing, different performance metrics were needed, namely the Stratified Error, which measures the MSE for specific MOS intervals. Additionally, five different ML algorithms were used and their performances compared, with the GTB algorithm achieving the best results. The model estimates QoE with a Pearson correlation of 78.9%, a Spearman correlation of 66.8% and an MSE of 0.114. Another conclusion is that the features related to the transmission rate of the streaming session are the most important in predicting QoE for the GTB algorithm; these features assumed a combined feature importance of approximately 81%. Overall, the predictions made using layer 1 metrics provide accurate results.

ACKNOWLEDGMENT

The authors would like to thank FCT for the support through project UID/EEA/50008/2013. Moreover, the authors acknowledge project MESMOQoE (no. 023110 - 16/SI/2016), supported by the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).

REFERENCES

[1] Ericsson, "Ericsson Mobility Report," Ericsson, Tech. Rep., June 2019.

[2] M. Sousa, A. Martins, and P. Vieira, "Self-Diagnosing Low Coverage and High Interference in 3G/4G Radio Access Networks Based on Automatic RF Measurement Extraction," in Proceedings of the 13th International Joint Conference on e-Business and Telecommunications, ser. ICETE 2016, Portugal, 2016, pp. 31–39.

[3] A. Rufini, A. Neri, F. Flaviano, and M. Baldi, "Evaluation of the impact of mobility on typical KPIs used for the assessment of QoS in mobile networks: An analysis based on drive-test measurements," in 2014 16th International Telecommunications Network Strategy and Planning Symposium (Networks), Sep. 2014, pp. 1–5.

[4] R. Santos, M. Sousa, P. Vieira, M. P. Queluz, and A. Rodrigues, "An Unsupervised Learning Approach for Performance and Configuration Optimization of 4G Networks," in 2019 IEEE Wireless Communications and Networking Conference (WCNC), April 2019, pp. 1–6.

[5] J. De Vriendt, D. De Vleeschauwer, and D. C. Robinson, "QoE model for video delivered over an LTE network using HTTP adaptive streaming," Bell Labs Technical Journal, vol. 18, no. 4, pp. 45–62, March 2014.

[6] Y. Liu, S. Dey, F. Ulupinar, M. Luby, and Y. Mao, "Deriving and validating user experience model for DASH video streaming," IEEE Transactions on Broadcasting, vol. 61, no. 4, pp. 651–665, Dec 2015.

[7] V. Pedras, M. Sousa, P. Vieira, M. P. Queluz, and A. Rodrigues, "A no-reference user centric QoE model for voice and web browsing based on 3G/4G radio measurements," in 2018 IEEE Wireless Communications and Networking Conference (WCNC), April 2018, pp. 1–6.

[8] P. G. Balis and M. R. Tanhatalab, "Analytic Model for the Prediction of Cell-Edge QoE of Streaming Video Over Best-Effort Mobile Radio Bearers," in 2018 26th Telecommunications Forum (TELFOR), Nov 2018, pp. 1–4.

[9] Z. Cheng, L. Ding, W. Huang, F. Yang, and L. Qian, "A unified QoE prediction framework for HEVC encoded video streaming over wireless networks," in 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), June 2017, pp. 1–6.

[10] Y. Liu, S. Dey, F. Ulupinar, M. Luby, and Y. Mao, "Deriving and validating user experience model for DASH video streaming," IEEE Transactions on Broadcasting, vol. 61, no. 4, pp. 651–665, Dec 2015.

[11] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, pp. 221–232, 2016.

[12] R. Hyndman and A. Koehler, "Another look at measures of forecast accuracy," International Journal of Forecasting, vol. 22, pp. 679–688, 2006.

[13] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, "API design for machine learning software: experiences from the scikit-learn project," in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.

