
An Enhanced Stock Prediction Model using Sentiment Analysis Based on

Multiple Regression

Saravanan.Ramalingam1* and Sujatha.Putholla2

1Research Scholar, Department of Computer Science and Engineering, Pondicherry University, Pondicherry, India
2Assistant Professor, Department of Computer Science and Engineering, Pondicherry University, Pondicherry, India

ABSTRACT

Big data analytics is the process of extracting data and gaining knowledge from large data sets, helping data scientists gain valuable insights. Predictive analytics is one of the key areas of big data analytics; it uses historical and current data sets to predict the values of future data sets. Predictive analytics uses statistical methods to generate predictions, as well as methods for assessing the predictive power of algorithms. Sentiment analysis is another type of big data analytics: it computes, identifies and categorizes public opinion expressed in text to find the sentiment of the public on a particular topic. The stock market is a place where people trade shares and stocks with the basic motive of financial gain. Predicting the stock market is challenging because stock price movement is influenced by a large number of factors such as global events, political events, general economic conditions, and traders’ expectations across the globe. Hence, to predict stock price movement, we use predictive analytics to arrive at a predicted index value. To capture the influence of public factors, we apply sentiment analysis to data collected from social media and form a sentiment index. Then, to improve the accuracy of our prediction, we combine the two indices into an overall stock index that predicts stock price movements.

KEYWORDS: Big Data, Big Data Analytics, Sentiment Analysis, Stock Prediction, Predictive Analytics.

*Corresponding author

Saravanan.Ramalingam

Research Scholar,

Department of Computer Science and Engineering,

Pondicherry University, Pondicherry, India

Email: [email protected], Mob.No - 9894447191

JASC: Journal of Applied Science and Computations

Volume VI, Issue IV, April/2019

ISSN NO: 1076-5131

Page No:1458

INTRODUCTION

Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity. Big Data has the potential to help companies improve operations and make faster, more intelligent decisions. This data, when captured, formatted, manipulated, stored, and analyzed, can help a company gain useful insight to increase revenues, win or retain customers, and improve operations.

Big data has increased the demand for information management specialists, so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.

Big Data poses various challenges in day-to-day practice. Some of these challenges are listed below.

Understanding and Utilizing Big Data

New, Complex, and Continuously Emerging Technologies

Cloud Based Solutions

Privacy, Security, and Regulatory Considerations

Archiving and Disposal of Big Data

The Need for IT, Data Analyst, and Management Resources

BIG DATA ANALYTICS

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown

correlations, market trends, customer preferences and other useful business information. The primary goal

of big data analytics is to help companies make more informed business decisions by enabling data

scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction

data, as well as other forms of data that may be untapped by conventional business intelligence programs.

Many analytic techniques, such as regression analysis, simulation, and machine learning, have

been available for many years. Big data analytics allows data scientists and various other users to evaluate

large volumes of transaction data and other data sources that traditional business systems would be unable

to tackle. Traditional systems may fall short because they're unable to analyze as many data sources.



Sophisticated software programs are used for big data analytics, but the unstructured data used in big data

analytics may not be well suited to conventional data warehouses. Big data's high processing requirements

may also make traditional data warehousing a poor fit.

Descriptive Analytics

Descriptive analytics, such as reporting/OLAP, dashboards/scorecards, and data visualization,

have been widely used for some time, and are the core applications of traditional BI.

Predictive Analytics

Predictive analytics suggests what will occur in the future. Its methods and algorithms include regression analysis, machine learning, and neural networks.

Prescriptive Analytics

Prescriptive analytics refers to the process of analyzing an abstraction of the exact data related to a particular field in order to enhance the classification result.

Big data analytics helps organizations harness their data and use it to identify new opportunities.

That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier

customers. Some of the major benefits of big data analytics are:

Reducing the Cost

Faster and Better Decision Making

Need for New Products and Services

Sentiment analysis is another form of big data analytics. It is the computational study of opinions, sentiments, evaluations, attitudes, appraisals, affects, views, emotions, subjectivity, etc., expressed in text. The text may include reviews, feedback, comments, discussions, news, status updates and tweets. Sentiment analysis can be done at different levels: document level, sentence level or aspect/feature level.

Document Level Classification

In this process, sentiment is extracted from the entire review, and a whole opinion is classified

based on the overall sentiment of the opinion holder. The goal is to classify a review as positive, negative,

or neutral. Document level classification works best when the document is written by a single person and

expresses an opinion/sentiment on a single entity.



Sentence Level Classification

This process usually involves two steps:

1. Subjectivity classification of a sentence into one of two classes: objective and subjective.

2. Sentiment classification of subjective sentences into two classes: positive and negative.
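The two steps can be illustrated with a minimal Python sketch. The word lists are tiny hypothetical stand-ins for a real subjectivity and polarity lexicon, and the rule-based classifier is only an assumption used for illustration, not a method taken from the cited literature.

```python
# Hypothetical toy word lists; a real system would use a full lexicon
# or a trained classifier instead.
POSITIVE_WORDS = {"good", "great", "gain", "strong"}
NEGATIVE_WORDS = {"bad", "poor", "loss", "weak"}


def classify_sentence(sentence):
    """Two-step sentence-level classification.

    Step 1: a sentence containing no opinion words is treated as objective.
    Step 2: a subjective sentence is labelled positive or negative by
            comparing its counts of positive and negative words
            (ties are labelled positive in this sketch).
    """
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if pos + neg == 0:
        return "objective"
    return "positive" if pos >= neg else "negative"
```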

OBJECTIVE

In the existing system, only two sentiment factors, positive and negative, are considered. To overcome this drawback, the proposed model uses the NRC emotion lexicon, which scores the given input text on 10 different factors. Another objective is to perform enhanced stock prediction. The existing system considered only price movements, that is, whether stock prices move up or down. Here, the stock prices themselves are predicted using the multiple regression technique.

RELATED WORKS

The stock market is a place where people buy and sell shares and stocks with the basic motive of financial gain. Investing in the stock market may seem an easy task, but that is rarely the case, since investing in a particular stock carries a high risk. Stock prediction is the technique used to identify an increase or decrease in the price of a particular stock. Interest among analysts in predicting stock prices has grown rapidly, as prediction helps to reduce risk and to improve financial returns considerably. The following section reviews papers related to stock market prediction.

Thien Hai Nguyen et al. [1] developed a model to predict stock price movement using the sentiments of specific topics. A new feature called “topic-sentiment” was incorporated for better stock market prediction. Historical data were extracted from Yahoo Finance for 18 companies over a period of about one year. The text was split into sentences, and Stanford CoreNLP was then used for POS tagging and lemmatization of each word in each sentence. For each transaction date, the sentiment value of each topic was calculated, and the importance of each topic was also considered for prediction.

Bollen et al. [2] examined whether public sentiment expressed in daily tweets can predict the stock market. Two tools, OpinionFinder and GPOMS, were used to measure variation in the public mood



from tweets, and the results were correlated with the Dow Jones Industrial Average (DJIA). Fagner Andrade et al. [3] predicted the closing price of a stock (PETR4) using an artificial neural network model. Prediction involved three stages: data set collection, cleaning and data normalization, and prediction using an MLP feed-forward network model. Both these techniques are correlated to find the accuracy in stock prediction.

Girija V Attigeri et al. [5] predicted stock performance by applying concepts from machine learning and fundamental analysis. Data were gathered and prepared for sentiment analysis. After analysis, the sentiments were aggregated and visualized as a graph, and a machine learning model to predict new data was developed using linear regression. Rishabh Soni et al. [7] proposed a hybrid approach that uses unsupervised learning to cluster tweets and then applies supervised learning methods. Feature extraction was performed after obtaining the data set, and clusters of tweets were formed. Various decision-tree-based algorithms such as Random Forest were implemented and their performance was evaluated.

Bhakti G. Deshmukh et al. [9] investigated whether Twitter sentiments and commodity prices help in predicting actual stock prices for the top 50 companies listed on the NIFTY index at the NSE, India. They used NLP, sentiment analysis and ML techniques for prediction. Tweets were collected, processed for sentiment analysis, and correlated with stock prices for prediction. Kibum Kim et al. [11] predicted stock prices in the bio industry using opinion mining and machine learning. Their stock price prediction system consists of a data collector, vocabulary analyzer, sentiment dictionary, sentiment analyzer, and stock price predictor. Based on previously stored data, the stock price predictor predicts stock prices from new information using a machine learning engine.

Jigar Patel et al. [8] predicted stock movement using the Trend Deterministic Data Preparation technique, which exploits the inherent opinion of each technical indicator about stock price movement. Technical indicators had previously been used directly for prediction, whereas this study first extracts trend-related information from each technical indicator and then uses it for prediction, resulting in a significant improvement in accuracy. Anthony R. Calingo et al. [12] investigated the effect of public sentiment or mood, drawn from a large collection of Twitter data, on the movement of a stock market index. The data were collected from Twitter based on geolocation, and the stock price of the local market was predicted. The Granger causality test was used to find whether public sentiment affected the movement of the stock index and its degree of influence in that market.



Alexander Porshnev et al. [13] used a dictionary-based approach to sentiment analysis that distinguishes eight basic emotions in users' tweets. The model applied an SVM to the collected data sets to find the dependency of stock prices on the sentiments.

From this study of sentiment analysis and prediction algorithms for stock movement prediction, it can be seen that limiting the number of sentiment factors extracted from textual data decreases the accuracy of stock prediction, and that using a larger number of sentiments yields more accurate and precise predicted values. Further, by combining historical price prediction techniques with sentiment-based price prediction techniques, better prediction accuracy can be achieved.

PROPOSED SYSTEM

The aim is to design and implement a novel stock prediction technique that uses sentiment analysis of real-time social media information together with historical data for various stocks, and that predicts the price itself rather than only the direction of stock movement, using machine learning techniques. In the existing system, only two scores, positive and negative, were obtained via sentiment analysis. Its prediction results were also less accurate than those of other models, and it performed no real-time data retrieval. These drawbacks were taken into account and an enhanced model was designed and implemented.

STOCK PREDICTION USING MULTIPLE REGRESSION

In this model, stock prediction is done using the multiple linear regression technique together with sentiment analysis based on the NRC emotion lexicon. First, tweets are collected using the Twitter API for the particular dates to be considered for sentiment analysis. The collected data are cleansed using techniques such as tokenizing, POS tagging and stemming. The factors extracted with the help of the NRC emotion lexicon are the two sentiments, positive and negative, and eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise and trust. After sentiment analysis, the close prices for the various dates and the emotion values obtained from sentiment analysis are tabulated. Finally, the performance factors are calculated. The architecture of the system is given in Figure 1.



Figure.1 Stock Prediction Flow Diagram

The workflow of the system is listed below.

1. The tweets are gathered and saved in a text file.

2. The file is then subjected to text analysis, where stop words and punctuation are removed and the file is converted from UTF-8 to ASCII format for NRC sentiment analysis.

3. The sentiment scores and close prices are fed as input to two regression operations:

a. First, including only the two sentiment factors (positive and negative sentiment scores).

b. Next, including the two sentiment factors and the eight emotion factors.

4. Finally, the performance factors are evaluated after obtaining the prediction values.
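Steps 1 and 2 of the workflow, followed by lexicon-based scoring, might look roughly like the Python sketch below. The paper's own implementation was done in R. The miniature lexicon, the stop-word list and the function names here are hypothetical and stand in for the full NRC emotion lexicon.

```python
import string
from collections import Counter

# Hypothetical miniature lexicon: word -> categories associated with it.
# The real NRC emotion lexicon is far larger; these entries are illustrative.
TOY_LEXICON = {
    "profit": {"positive", "joy", "anticipation"},
    "growth": {"positive", "anticipation", "trust"},
    "crash": {"negative", "fear", "sadness"},
    "fraud": {"negative", "anger", "disgust"},
}
# Hypothetical stop-word list, kept short for the example.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "after"}


def clean(tweet):
    """Lower-case the tweet, strip punctuation and drop stop words."""
    no_punct = tweet.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.lower().split() if w not in STOP_WORDS]


def score_tweets(tweets):
    """Count, per category, the lexicon word occurrences found in the tweets."""
    totals = Counter()
    for tweet in tweets:
        for word in clean(tweet):
            totals.update(TOY_LEXICON.get(word, ()))
    return totals
```

Calling `score_tweets` on one day's tweets gives one row of sentiment and emotion counts, which can then be paired with that day's close price for the regression step.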

EXPERIMENTATION

In the proposed model, real-time data sets were collected from Twitter using the Twitter API in RStudio, and the historical data set for IBM stock was collected from Yahoo Finance. Tweets relating to IBM were fetched for processing over a period of about 25 days, with 5000 English-language tweets obtained each day. The historical data for IBM, comprising the Open, High, Low, Close and Adjusted Close prices for each transaction date, were fetched for the same interval, and the values were plotted as a time series to view the close prices clearly. After sentiment analysis, multiple regression was carried out with close price as the dependent variable and the emotion values as independent variables.
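Written out, a multiple linear regression of this kind can be expressed as below, where $y_t$ is the close price on day $t$, $x_{t,j}$ is the score of the $j$-th sentiment or emotion factor on that day, $p$ is the number of factors, $n$ is the number of days, and $\varepsilon_t$ is the error term. These symbols are introduced here for illustration and do not appear in the original text.

```latex
y_t = \beta_0 + \sum_{j=1}^{p} \beta_j \, x_{t,j} + \varepsilon_t,
\qquad t = 1, \dots, n
```

The coefficients $\beta_0, \dots, \beta_p$ are usually estimated by ordinary least squares, that is, by minimizing the sum of squared differences between the observed and fitted close prices.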



SENTIMENT ANALYSIS EXPERIMENTATION

In this part, an application was first created on Twitter. The application provides four keys, namely the API key, API secret key, consumer token and consumer token secret, which are referred to as the credentials and are shown in Figure 2.

Figure.2 Credential Details

The tweets related to IBM stock were retrieved using the keyword IBM and stored in a text file of about 3.1 MB. This UTF-8 text file was converted to ASCII format and then subjected to NRC sentiment analysis.
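The encoding conversion can be sketched in Python as follows. The paper does not state how characters without an ASCII equivalent were handled; dropping them, as done here, is an assumption, and the function name is hypothetical.

```python
def to_ascii(text):
    """Drop every character that has no ASCII representation.

    Assumption: non-ASCII characters (emoji, accented letters, typographic
    dashes) are simply discarded rather than transliterated.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Discarding characters can leave doubled spaces behind, so a whitespace clean-up step may be worth adding before tokenizing.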

FINANCIAL DATA EXPERIMENTATION

In the financial data experimentation, the historical prices were fetched from the Yahoo Finance website. In RStudio, the URL for fetching the close price values is specified and then run. The URL contains the date interval, from the starting date to the ending date, for which the financial data are to be fetched. The IBM close prices fetched from Yahoo Finance are shown in Figure 3.

Figure.3 Sample IBM Close Price Values
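Once the historical prices have been downloaded, extracting the close price per date is straightforward. The Python sketch below assumes the data are already available as CSV text with the usual Date, Open, High, Low, Close, Adj Close and Volume columns. The sample rows are made-up values, not the paper's data, and the function name is hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the prices are invented for illustration only.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2019-01-02,100.0,102.0,99.0,101.5,101.5,1000
2019-01-03,101.5,103.0,100.5,102.25,102.25,1200
"""


def read_close_prices(csv_text):
    """Return a {date: close price} mapping parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Date"]: float(row["Close"]) for row in reader}
```

Keying the prices by date makes it easy to join them with the daily sentiment scores before regression.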



PREDICTION EXPERIMENTATION

In the prediction experimentation, the inputs are the close price values and the scores obtained from the NRC emotion lexicon for the various emotion and sentiment factors. The input was read from a CSV file in which the close price and all emotion and sentiment factors are stored. Once the input was read, prediction was done by regressing the close price on the factors obtained from the NRC emotion lexicon. Two types of prediction were carried out.

First, the close price was regressed on the sentiment factors alone (positive score and negative score), and the prediction results were obtained. Next, the close price was regressed on all ten factors: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise and trust. The prediction results were then obtained with both confidence and prediction intervals.
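A least-squares fit of this kind can be sketched in plain Python as shown below. The paper's experiments were run in R, so this is an illustration rather than the authors' code. The solver uses the normal equations, the data are toy values, and all names are hypothetical.

```python
def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows    : list of predictor lists, one per day
    targets : list of close prices, one per day
    Returns [intercept, coef_1, ..., coef_p].
    Assumes X'X is non-singular (no perfectly collinear predictors).
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Fitted close price for one day's predictor row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Toy data (hypothetical): [positive score, negative score] per day.
toy_rows = [[10, 5], [20, 3], [15, 8], [30, 1], [25, 6]]
toy_close = [100 + 0.5 * p - 0.2 * n for p, n in toy_rows]
two_factor_model = fit_ols(toy_rows, toy_close)
```

The same `fit_ols` call covers the ten-factor case by passing rows with ten scores each. With only 25 daily observations, a ten-factor model leaves few degrees of freedom, which is worth keeping in mind when comparing the two fits.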

Performance Factors

Performance factors are the metrics whose evaluation drives the results and conclusions of any work. The factors used in this work are given below.

Opinion Values

The opinion value aggregates all the opinion words, which are separated into two scores: the positive score (Ps) and the negative score (Ns).

Opinion Values (Oj) = (Ps-Ns)/(Ps+Ns) …….... (1)

where,

Ps is Positive Scores and

Ns is Negative Scores
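Equation (1) translates directly into code. The Python sketch below uses a hypothetical function name and illustrative scores rather than values from the experiments.

```python
def opinion_value(positive_score, negative_score):
    """Opinion value from equation (1): (Ps - Ns) / (Ps + Ns)."""
    total = positive_score + negative_score
    if total == 0:
        raise ValueError("opinion value is undefined when Ps + Ns is zero")
    return (positive_score - negative_score) / total


# Illustrative scores only: Ps = 80, Ns = 20.
example = opinion_value(80, 20)  # 0.6
```

For non-negative scores the value lies between -1 (all negative) and +1 (all positive), with 0 indicating a balance of positive and negative words.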

MAPE

Mean Absolute Percentage Error (MAPE) is the average of the absolute percentage errors; it measures the size of the error in percentage terms. It is given in (2).

$$\mathrm{MAPE} = \frac{1}{n} \sum_{k=1}^{n} \left| \frac{F_k - A_k}{A_k} \right| \qquad (2)$$



where $A_k$ is the actual value, $F_k$ is the forecast value, and $n$ is the number of observations.
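A short Python sketch of the MAPE calculation follows. The function name is hypothetical, and the result is returned as a fraction, which is an assumption about the convention used for the reported value.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction.

    Multiply by 100 to express it as a percentage. Every actual value
    must be non-zero, otherwise the ratio is undefined.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs((f - a) / a) for a, f in zip(actual, forecast)) / len(actual)
```

For example, actual prices of 100 and 200 with forecasts of 110 and 190 give absolute percentage errors of 0.10 and 0.05, so the MAPE is about 0.075.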

RESULT ANALYSIS

The results obtained from sentiment analysis and from prediction with two factors and ten factors are discussed in this section.

SENTIMENT ANALYSIS RESULTS

The tweets are given as input in the form of a text file. The text is converted from UTF-8 to ASCII format, and NRC sentiment analysis was then done for 5000 tweets on each of 25 different days. Sample values from the data extracted and processed in R are tabulated in Table 1. A sample output plot is shown in Figure 4.

Table 1. Tabulation of emotions analyzed from Twitter for one day

Emotion         Score
Anger              19
Anticipation       50
Disgust            29
Fear                9
Joy                11
Negative           51
Positive        15069
Sadness            10
Surprise            8
Trust           10056



Figure.4 Sentiment Analysis

PREDICTION RESULTS

Prediction is done for both the sentiment factors and the emotion factors. First, the close price is regressed on the positive and negative scores obtained from NRC sentiment analysis, and various graphs are plotted and saved. The predicted values when considering positive and negative scores alone, with both confidence and prediction intervals, are shown in Figure 5.

Figure.5 Linear Regression Output

Then, multiple regression is performed on all ten factors: the sentiment factors positive and negative, together with the emotion factors anger, anticipation, disgust, fear, joy, sadness, surprise and trust. When the number of independent factors increases, the prediction result decreases. The multiple regression prediction result is shown in Figure 6.



Figure.6 Multiple Regression Output

PLOTS OF REGRESSION TECHNIQUES

Four different diagnostic plots were obtained for each regression, along with a separate plot of the residuals for the given input. Figure 7(a) shows the residuals for two factors and Figure 7(b) shows the residuals for all factors. There are 25 residuals in both cases. The residuals are the part of the dependent variable that the model could not explain, and they are the best available estimate of the error term of the regression model.

Figure 7(a) Residuals for Two Factors Figure 7(b) Residuals for Ten Factors

Next is the plot of residuals against fitted values. The plot for two factors is shown in Figure 8(a) and the plot for all factors in Figure 8(b). The regression line denotes the best prediction of the dependent variable from the independent variables.



Figure 8(a) Residuals vs Fitted for Two Factors Figure 8(b) Residuals vs Fitted for Ten Factors

The next plot is the normal Q-Q plot, which helps us assess whether a set of data plausibly came from some theoretical distribution such as a normal or exponential distribution. The normal Q-Q plot for two factors is shown in Figure 9(a) and the normal Q-Q plot for all factors in Figure 9(b).

Figure 9(a) Normal Q-Q Plot for Two Factors Figure 9(b) Normal Q-Q Plot for Ten Factors
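The points of a normal Q-Q plot can be computed as in the Python sketch below, which pairs each sorted residual with a standard normal quantile. The plotting positions $(i - 0.5)/n$ are one common convention and an assumption here; statistical packages may use slightly different positions, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each sorted residual with a standard normal theoretical quantile.

    Returns a list of (theoretical quantile, sample value) tuples,
    using plotting positions (i - 0.5) / n for i = 1..n.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))
```

If the residuals are roughly normal, these points fall close to a straight line. Systematic curvature at the ends suggests heavy or light tails.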

In the scale-location plot, the square root of the absolute standardized residuals is plotted against the fitted values. The prediction curve depicts the best matching value of fitted values. The scale-location plot for two factors is shown in Figure 10(a) and the scale-location plot for all factors in Figure 10(b).



Figure 10(a) Scale-Location Plot for Two Factors Figure 10(b) Scale-Location Plot for Ten Factors

The next plot shows residuals against leverage. Normally, four cases can be read from the plot: points that are fine, high residual with low leverage, high residual with high leverage, and low residual with high leverage. The residuals vs leverage plot for two factors is shown in Figure 11(a) and the plot for all factors in Figure 11(b).

Figure 11(a) Residuals vs Leverage for Two Factors Figure 11(b) Residuals vs Leverage for Ten Factors

The performance factors, opinion value and MAPE, were evaluated for the proposed model.

The opinion value obtained from the sentiment scores is 0.9903486.

The MAPE value obtained from the prediction results is 0.1082715.

Sentiment analysis was done by considering a greater number of factors than the existing system. The NRC emotion lexicon plays a major role in classifying sentiments and emotions.



Financial prices were obtained from Yahoo Finance and processed to produce a time series representation of the close prices for the given interval. The obtained data and the sentiment factors are stored in a separate CSV file, which is given as input to the multiple regression technique. Prediction was done in two ways: by regressing the close price on the positive and negative scores alone, and by regressing the close price on all the emotion and sentiment scores. The performance factors, opinion value and MAPE, were computed. The opinion value depends on the sentiment scores, and MAPE is an accuracy measure. Finally, the fit, lower and upper values were obtained as the prediction results for both cases.
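For reference, the standard textbook forms of the two intervals are given below; the fit, lower and upper values correspond to the centre and end points of such an interval. Here $\mathbf{x}_0$ is the predictor vector (with a leading 1) for the day being predicted, $X$ is the $n \times (p+1)$ design matrix, $s$ is the residual standard error, and $t_{\alpha/2,\,n-p-1}$ is the Student-t critical value. This notation is introduced here and is not taken from the original text.

```latex
\begin{aligned}
\text{confidence interval:}\quad
  & \hat{y}_0 \pm t_{\alpha/2,\,n-p-1}\; s
    \sqrt{\mathbf{x}_0^{\top} (X^{\top} X)^{-1} \mathbf{x}_0} \\
\text{prediction interval:}\quad
  & \hat{y}_0 \pm t_{\alpha/2,\,n-p-1}\; s
    \sqrt{1 + \mathbf{x}_0^{\top} (X^{\top} X)^{-1} \mathbf{x}_0}
\end{aligned}
```

The prediction interval is always the wider of the two, because it also accounts for the error term of a single new observation.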

CONCLUSION

Various big data analytics techniques were studied in order to choose a suitable predictive technique for the proposed model. Multiple regression analysis appears suitable because it can include the various factors that affect stock prices. In this work, sentiment analysis was performed with the NRC emotion lexicon on tweets obtained from Twitter in real time using the Twitter Stream API. Financial data for IBM, in the form of close prices, were obtained from Yahoo Finance and plotted with the help of a time series package. The sentiment scores and emotion factor scores were related to the historical stock price data to predict the stock price. The prediction technique used in this work was multiple linear regression. Accuracy performance factors were evaluated for the prediction results, and the opinion value was computed from the sentiment scores obtained from NRC sentiment analysis. In this work, tweets and financial data for only one company (IBM) were considered. In future work, tweets and historical data covering more than six months can be fetched and fed as input to the multiple regression technique. Tweets and historical prices for more than one company can also be considered for more accurate prediction.

REFERENCES

1. Thien Hai Nguyen, Kiyoaki Shirai, Julien Velcin, “Sentiment analysis on social media for stock

movement prediction”, Expert Systems with Applications, 2015.



2. Bollen, J., Mao, H., & Zeng, X., “Twitter mood predicts the stock market”, Journal of Computational Science, Vol 2, Issue 1, 2011.

3. Fagner Andrade, Luis Enrique, Christiane Naire, Azevedo Reis, “The Use of Artificial Neural

Networks in the Analysis and Prediction of Stock Prices”, Systems, Man, and Cybernetics (SMC),

IEEE International Conference, 2013.

4. Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, “Understanding Big Data Analytics for Enterprise

Class Hadoop and Streaming Data”, McGraw-Hill companies, 2012.

5. Girija V Attigeri, Manohara Pai M M, Radhika M Pai, Aparna Nayak, “Stock Market Prediction:

A Big Data Approach”, TENCON – IEEE conference, 2015.

6. Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao and Athanasios V. Vasilakos, “Big data analytics: A survey”, Journal of Big Data, Springer International Publishing, 2015.

7. Rishabh Soni, K. James Mathai, “Improved Twitter Sentiment Prediction through ‘Cluster-then-

Predict Model’ ”, International Journal of Computer Science and Network, Vol 4, Issue 4, 2015.

8. Jigar Patel, Sahil Shah, Priyank Thakkar , K Kotecha, “Predicting stock and stock price index

movement using Trend Deterministic Data Preparation and machine learning techniques”, Expert

Systems with Applications, 2015.

9. Bhakti G. Deshmukh, Premkumar S. Jain, Dr. M. S. Patwardhan, Viraj Kulkarni, “Spin-offs in

Indian Stock Market owing to Twitter Sentiments, Commodity Prices and Analyst

Recommendations”, All India Council for Technical Education, 2016.

10. K.K. Suresh Kumar, N.M. Elango, “Performance Analysis of Stock Price prediction using

Artificial Neural Network”, Global Journal of Computer Science and Technology, Vol 12, Issue

1, 2012.

11. Kibum Kim, Seungmin Yang, Dongyoung Kim, Jeawon Park, Jaehyun Choi, “A Stock Prediction

System Based on News and Twitter”, International Journal of Software Engineering and Its

Applications, Vol 10, Issue 6, 2016.

12. Anthony R Calingo, Ariel M Sison, Bartolome T Tanguilig, “Prediction Model of the Stock Market Index Using Twitter Sentiment Analysis”, International Journal of Information Technology and Computer Science, Volume 8, Issue 2, 2016.



13. Alexander Porshnev, Ilya Redkin, Alexey Shevchenko, “Improving Prediction of Stock Market Indices by Analyzing the Psychological States of Twitter Users”, Financial Economics, Volume 22, 2013.

14. Borko Furht, Flavio Villanustre, “Big Data Technologies and Applications”, Comparison between

the Frameworks/Platforms of the Big Data, page number - 31, 32.

15. Walaa Medhat, Ahmed Hassan, Hoda Korashy, “Sentiment analysis algorithms and applications – a survey”, Ain Shams Engineering Journal, Volume 5, Issue 4, 2014.

16. Saif M. Mohammad, Peter D. Turney, “Crowdsourcing a Word–Emotion Association Lexicon”, National Research Council Canada, Computational Intelligence, Volume 29, Issue 3, 2013.

17. Ashish Katrekar, “An Introduction to Sentiment Analysis”, AVP, Big Data Analytics, Page

numbers – 2 and 3.

18. Orlando, Fla, “Gartner Says Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support

Big Data By 2015”, 2012.

19. Leona S. Aiken, Stephen G. West, Steven C. Pitts, “Multiple Linear Regression”, Part four – Data Analysis method, 2003.

20. Reddy D. Maheswara, “Trends and Opportunities in Big Data Analytics-An Overview”, Indian

Journal of Science, Volume 23, Issue 80, 2016.


