Post on 02-Oct-2020
A Survey on Contribution of Data Mining in Social Media
Prajkta P. Chapke 1 and Dr. Anjali B. Raut 2
1 Research Scholar, Computer Science and Engineering, H.V.P.M's, C.O.E.T.,
Amravati (MS), India, prajkta.chapke@rediffmail.com
2 Professor, Computer Science and Engineering, H.V.P.M's, C.O.E.T.,
Amravati (MS), India, anjali_dahake@rediffmail.com
Abstract. Although data mining emerged long ago, it is still growing and
taking on new forms in almost every field, including computer science,
health care, finance and social media. Many researchers have worked on
data mining, sharpening it with different tools and techniques. Various
data mining algorithms have been used, each improving on earlier work.
Nowadays social media has become one of the most important ways of
expressing ideas, views and emotions regarding a topic, situation or
scenario. This paper covers the development and growth of data mining in
various fields, especially in social media, and how it has become a
booming area for researchers.
Keywords: Data mining, algorithms, social media, data mining technique.
1 Introduction
The computerization of our society has substantially enhanced our capabilities for
both generating and collecting data from diverse sources [Han and Kamber].
Every hour, a tremendous amount of data is generated in every sector. This
explosive growth of data has created a need for new techniques and automated
tools that can intelligently assist in transforming the vast amounts of data into
useful knowledge. We are surrounded by various types of data generated in
multiple fields such as finance, medicine, marketing, education and science. All
this data is in a raw format, which may not be directly useful to an individual, a
researcher or a company. In a summarised, classified and distributed form,
however, it can help many users of this information.
Today, the use of social networks is growing ceaselessly and rapidly. More
alarming is the fact that these networks have become a substantial pool for
unstructured data that belong to a host of domains, including business,
governments and health. The increasing reliance on social networks calls for data
Journal of Seybold Report ISSN NO: 1533-9211
VOLUME 15 ISSUE 9 2020 Page: 632
mining techniques that can help restructure the unstructured data and
place it within a systematic pattern [5].
P. Ristoski and H. Paulheim [6] have shown how Linked Open Data can be used at
various stages for building content-based recommender systems. The survey
shows that, while there are numerous interesting research works performed, the
full potential of the Semantic Web and Linked Open Data for data mining and
KDD is still to be unlocked.
To extract the text information of a webpage, the position of the text can be
accurately located by using multiple features of the text and the rules of
webpage design. Based on these characteristics, Chongjun Wang and Peng Wei [11]
proposed a method for extracting webpage text information based on
multi-feature fusion.
The importance of users' sentiments has been realized by the business sector in
the last decade. Since then, social media platforms and other websites have been
used to extract users' opinions about products. This is called sentiment
analysis or opinion mining. Opinion mining is identifying, extracting and
understanding the user's attitude or opinion by analysing the text. This process
usually involves natural language processing, statistical analysis and machine
learning techniques [18].
2 Techniques in Data Mining
Nowadays we all have access to more data than ever before. But making use of
structured and unstructured data to implement improvements can be extremely
challenging because of the sheer amount of information, which in turn can
reduce the benefits of all the data.
Data mining is the process of detecting patterns in data to gain insights relevant
to one's needs. It is essential for both business intelligence and data science.
There are many data mining techniques that can be used to turn raw data into
actionable insights. These involve everything from cutting-edge artificial
intelligence to the basics of data preparation, both of which are key for
maximizing the value of data investments.
1. Tracking patterns
2. Classification
3. Association
4. Outlier detection
5. Clustering
6. Regression
7. Prediction
8. Sequential patterns
9. Decision trees
10. Statistical techniques
11. Visualization
12. Neural networks
13. Machine learning and artificial intelligence
3 Algorithms in Data Mining
As mentioned in earlier sections, a voluminous amount of data is generated every
day in every sector, in different types such as audio, video, 3D, social media
posts, geospatial and complex types, and this sheer volume is one of the issues
in processing it. Such data is diverse, unstructured and fast changing, and it is
not easy to categorize or organize. To meet this challenge, a range of automatic
methods for extracting information is available; these are the data mining
algorithms.
Fig. 1. Data Mining Algorithms
3.1 C4.5 Algorithm
C4.5 is a data mining algorithm used to generate a classifier in the form of a
decision tree from a set of data that has already been classified. A classifier
here refers to a data mining tool that takes data that needs to be classified and
tries to predict the class of new data.
Every data point has its own attributes. The decision tree created by C4.5
poses a question about the value of an attribute and, depending on those values,
the new data gets classified. The training dataset is labelled with classes,
making C4.5 a supervised learning algorithm. Decision trees are generally easy to
interpret and explain, and C4.5 is relatively fast.
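As a rough illustration of how such a tree chooses which attribute to ask about, the sketch below computes the gain ratio, the split criterion commonly associated with C4.5, for categorical attributes. The toy rows, the attribute names (has_link, length) and the function names are hypothetical and are not taken from any of the surveyed papers; a full C4.5 implementation would also handle numeric attributes, missing values and pruning.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting `rows` on the categorical attribute `attr`."""
    total = len(rows)
    # Group the class labels by the value the attribute takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attr] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: is a post spam or not?
rows = [
    {"has_link": "yes", "length": "short"},
    {"has_link": "yes", "length": "long"},
    {"has_link": "no", "length": "short"},
    {"has_link": "no", "length": "long"},
]
labels = ["spam", "spam", "ham", "ham"]

# The attribute with the highest gain ratio becomes the question at the root.
best = max(rows[0], key=lambda a: gain_ratio(rows, labels, a))
print(best)
```

In this toy data the has_link attribute separates the two classes perfectly, so it is the attribute selected for the first question.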
3.2 K-means Algorithm
K-means is a clustering algorithm that works by creating k groups from a set of
objects based on the similarity between objects. It is not guaranteed that group
members will be exactly alike, but group members will be more similar to each
other than to non-group members. In standard implementations, k-means is an
unsupervised learning algorithm, as it learns the clusters on its own without
any external information.
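The grouping described above can be sketched with the usual two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its group). This is a minimal sketch: the points and the choice of starting centroids are hypothetical, and practical implementations typically use smarter initialisation and a convergence check instead of a fixed iteration count.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain Lloyd-style k-means: assign points, then recompute centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are supplied at any point, which is what makes the procedure unsupervised.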
3.3 Support Vector Machines
In terms of tasks, a support vector machine (SVM) works similarly to the C4.5
algorithm, except that SVM doesn't use any decision trees at all. SVM learns from
the dataset and defines a hyperplane to classify data into two classes. In two
dimensions, a hyperplane is simply a line. When the classes cannot be separated
in the original space, SVM can project the data to higher dimensions. Once
projected, SVM finds the best hyperplane to separate the data into the two classes.
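The effect of projecting data to a higher dimension can be seen in a small sketch. It does not train an SVM; it only shows, on hypothetical one-dimensional data, that classes which no single threshold can separate become separable after an explicit map x -> (x, x^2). The data, labels and helper names are assumptions made for illustration.

```python
# Hypothetical 1-D data: the "inner" class sits between the "outer" class.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = ["outer", "outer", "inner", "inner", "outer", "outer"]


def separable_by_threshold(values, labels):
    """True if some threshold puts each class entirely on one side."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    # Separable iff the labels, sorted by value, change class exactly once.
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes == 1


def feature_map(x):
    """Explicit projection to two dimensions: x -> (x, x^2)."""
    return (x, x * x)


projected = [feature_map(x) for x in xs]

# In the original space no single cut separates the classes.
print(separable_by_threshold(xs, ys))
# Along the new x^2 coordinate a horizontal line (for example x^2 = 2.5) does.
print(separable_by_threshold([p[1] for p in projected], ys))
```

A real SVM would additionally choose, among all separating hyperplanes, the one with the largest margin, and would usually perform the projection implicitly through a kernel function.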
3.4 Apriori Algorithm
The Apriori algorithm works by learning association rules, a data mining
technique used for learning correlations between variables in a database.
Once the association rules are learned, they are applied to a database containing
a large number of transactions. The Apriori algorithm is used for discovering
interesting patterns and mutual relationships and hence is treated as an
unsupervised learning approach. Though the algorithm is simple to apply, it can
consume a lot of memory, use a lot of disk space and take a lot of time.
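The level-by-level search for frequent itemsets, from which association rules are then derived, can be sketched as follows. The baskets and the support threshold are hypothetical, and the sketch stops at frequent itemsets rather than generating rules with confidence values.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that appears anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent itemsets into candidates one item larger, keeping only
        # those whose every subset of the current size is itself frequent.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each pass rescans every transaction, which gives a sense of why the cost in time and memory grows quickly on large databases.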
3.5 Expectation-Maximization Algorithm
Expectation-Maximization (EM) is used as a clustering algorithm, just like the
k-means algorithm, for knowledge discovery. The EM algorithm works in iterations
to maximize the likelihood of the observed data. It estimates the parameters of a
statistical model with unobserved variables, a model that is assumed to have
generated the observed data. The EM algorithm is again unsupervised learning,
since it is used without providing any labelled class information.
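A common concrete instance is fitting a mixture of Gaussians, where the unobserved variable is which component generated each point. The sketch below assumes a one-dimensional mixture with two components; the data, the starting means and the function names are hypothetical.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_two_gaussians(data, means, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = list(means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [
                w * gaussian(x, m, v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate weights, means and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                1e-6,  # floor keeps a component from collapsing to zero variance
            )
    return weights, means, variances


# Hypothetical data drawn around two centres, near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(data, means=[0.0, 6.0])
print(weights, means, variances)
```

Unlike k-means, each point here belongs to every cluster with some probability rather than to exactly one.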
3.6 PageRank Algorithm
PageRank is a link analysis algorithm that determines the relative importance of
an object linked within a network of objects. Link analysis is a type of network
analysis that explores the associations among objects. For example, it
determines the relative importance of a webpage so that it can be ranked higher
by a search engine. PageRank is treated as an unsupervised learning approach, as
it determines the relative importance just by considering the links and doesn't
require any other inputs.
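The idea that importance flows along links can be sketched with a simple power iteration. The four-page link graph is hypothetical, and the damping factor of 0.85 is a commonly used value assumed here rather than one given in this paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank equally among its outgoing links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Only the link structure is used as input, which matches the description of PageRank as needing no other information.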
3.7 AdaBoost Algorithm
AdaBoost is a boosting algorithm used to construct a classifier, a data mining
tool that takes data and predicts its class based on the inputs. A boosting
algorithm is an ensemble learning algorithm which runs multiple learning
algorithms and combines them. Boosting algorithms take a group of weak learners
and combine them to make a single strong learner. A weak learner classifies data
with low accuracy. After the user specifies the number of rounds, each successive
AdaBoost iteration re-weights the training examples and assigns a weight to the
best learner of that round. This makes AdaBoost an elegant way to auto-tune a
classifier. AdaBoost is flexible and versatile, as it can incorporate most
learning algorithms and can take on a large variety of data.
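The round-by-round combination of weak learners can be sketched with one-dimensional threshold "stumps" as the weak learners. The data set, the stump form and the function names are hypothetical choices for illustration; labels are assumed to be +1 or -1.

```python
from math import exp, log


def stump_predict(x, threshold, sign):
    """A weak learner: predicts `sign` if x > threshold, else -sign."""
    return sign if x > threshold else -sign


def adaboost(xs, ys, rounds):
    """Train AdaBoost with threshold stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, sign)
    # Candidate thresholds: one below the minimum, plus midpoints of neighbours.
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error on the current weights.
        best = None
        for t in thresholds:
            for s in (1, -1):
                err = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if stump_predict(x, t, s) != y
                )
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * log((1 - err) / err)  # weight given to this weak learner
        ensemble.append((alpha, t, s))
        # Re-weight examples: mistakes get heavier, correct ones lighter.
        weights = [
            w * exp(-alpha * y * stump_predict(x, t, s))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all weak learners in the ensemble."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score > 0 else -1


# Hypothetical data that no single stump can classify correctly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this data each individual stump misclassifies at least three points, while the weighted vote of three stumps classifies all ten training points correctly.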
3.8 kNN Algorithm
kNN is a lazy learning algorithm used for classification. A lazy learner does
little during the training process other than storing the training data. Lazy
learners start classifying only when new unlabeled data is given as input. C4.5,
SVM and AdaBoost, on the other hand, are eager learners that build the
classification model during training itself. Since kNN is given a labelled
training dataset, it is treated as a supervised learning algorithm.
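The lazy behaviour is easy to see in a short sketch: there is no training function at all, only a prediction function that searches the stored data. The points, labels and the value k = 3 are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" was just storing the data; all the work happens at query time.
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled training data in two dimensions.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (4.9, 5.2)))
```

Because every query scans the whole training set, prediction cost grows with the amount of stored data, which is the usual trade-off of lazy learners.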
3.9 Naive Bayes Algorithm
Naive Bayes is not a single algorithm, though it can be seen working efficiently
as one. It is a family of classification algorithms. The assumption shared by the
family is that every feature of the data being classified is independent of all
other features, given the class. Naive Bayes is provided with a labelled training
dataset to construct its probability tables, so it is treated as a supervised
learning algorithm.
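One member of the family, a multinomial model over word counts, can be sketched for a small sentiment task. The example documents, the add-one smoothing and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(documents, labels):
    """Build the count tables used by a multinomial Naive Bayes classifier."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
    vocabulary = {w for words in documents for w in words}
    return class_counts, word_counts, vocabulary


def classify(model, words):
    """Return the class with the highest log-probability for `words`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        # log P(class) plus the sum of log P(word | class); adding the word
        # terms independently is the "naive" independence assumption.
        score = log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing so an unseen word does not zero the product.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled reviews, already split into words.
docs = [
    ["great", "movie", "loved", "it"],
    ["wonderful", "great", "acting"],
    ["terrible", "movie", "hated", "it"],
    ["awful", "boring", "acting"],
]
labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(docs, labels)

print(classify(model, ["great", "acting"]))
print(classify(model, ["boring", "movie"]))
```

The tables built in training are just counts per class, which is why training is fast even on large text collections.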
3.10 CART Algorithm
CART stands for classification and regression trees. It is a decision tree learning
algorithm that gives either regression or classification trees as an output. In
CART, the decision tree nodes have precisely two branches. Just like C4.5,
CART is also a classifier. The regression or classification tree model is
constructed using a labelled training dataset provided by the user. Hence it is
treated as a supervised learning technique.
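The two-branch structure comes from searching for the single best binary split at each node. The sketch below shows that search for one numeric feature, using Gini impurity as the measure for classification; the values and labels are hypothetical, and a full tree would repeat the search recursively on each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_binary_split(values, labels):
    """Find the threshold on one numeric feature that minimises weighted Gini."""
    best_threshold, best_score = None, float("inf")
    points = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for low, high in zip(points, points[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical feature values with their class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
threshold, score = best_binary_split(values, labels)
print(threshold, score)
```

For regression trees the same search is typically run with a variance-based measure in place of Gini impurity.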
4 Data Mining in Social Media
S. No. | Author | Paper Title | Algorithm / Tools | Advantages / Limitations
1 | Zhang et al. | Multi-modal Sentiment Classification via Semi-supervised Learning | Semi-supervised learning algorithm | Greatly advances the state of the art of multi-modal sentiment classification by leveraging unlabelled data. Does not perform other multi-modal tasks, such as sarcasm detection and personality recognition.
2 | Li et al. | Mining Heterogeneous Influence and Indirect Trust for Recommendation | Collaborative filtering technique; social recommendation model ReHI | Proposed method performs better than state-of-the-art recommendation models, especially for cold-start users. A small number of distrust relationships have great impact on recommendation.
3 | Azwa Abdul Aziz, Andrew Starkey, Elissa Nadia Madi | Predicting Supervise Machine Learning Performances for sentiment analysis using contextual-based approaches | SML model | Finds the relationship between words and sources, which can provide a mechanism to predict SML model performance. Relationship in CA can be used to further understand the structured knowledge of the data.
4 | Monali Bordoloi, Saroj Kr. Biswas | Keyword extraction from micro-blogs using collective weight | A novel unsupervised graph-based keyword extraction method called keywords from collective weights (KCW) | The most important keywords extracted by the KCW model are certainly important for a particular topic in comparison with other existing methods. Semantic methods for keyword extraction can be explored along with the proposed model to get better results.
5 | Halgurd S. Maghdid, Member, IEEE | Web News Mining Using New Features: A Comparative Study | K-Nearest-Neighbor (k-NN), decision tree and deep-learning recurrent neural network (such as Long Short-Term Memory 'LSTM') | Uses new features such as location and time information with the web news mining techniques. Combining different classification techniques for further improvements with a lower number of web news documents, which will take less time.
6 | Chongjun Wang, Peng Wei | A novel web page text information extraction method | WFFTE (Webpage multi-feature fusion text extraction) | The method has universality and high accuracy for the text information extraction of single-text and multi-text web pages. There are many other features for webpage texts, and especially the multi-text body has its own uniqueness.
7 | Dr. Ritu Bhargava, Abhishek Kumar, Sweta Gupta | Collaborative methodologies for pattern evaluation for web personalization using Semantic Web Mining | DBpedia and Linked MDB algorithms | Improves the efficiency of classification results with comparatively more accuracy. Better future prospects in the domain of data mining, especially in semantic web mining, for better efficiency of the rank predictor algorithm.
8 | Rinkal Sardhara, Kamaljit I. Lakhataria | Impact of Different Domain Inlink,Outlink and Rechability on Relevance of Web Page Using Correlation | Pearson correlation technique | Helps to reduce mutual reinforcement effect on web page ranking. Reachability parameter can be used to select one particular link from the multiple links from the same domain.
9 | Bharti Pooja M., Prof. Tushar J. Raval | Improving Web Page Access Prediction using Web Usage Mining and Web Content Mining | Least Frequent Ones Leave strategy, First Come First Leave strategy, Older than Timeframe Leave strategy | Three session discarding strategies are used to reduce prediction time and improve prediction accuracy. Automation is needed in this process.
10 | Roberto Saia and Salvatore Carta | Introducing a Vector Space Model to Perform a Proactive Credit Scoring | Linear Dependence Based (LDB) approach | The best state-of-the-art approaches of credit scoring such as random forests, even using only past non-default cases. Does not include the default past cases in the training process.
11 | M. Alshammari et al. | Mining Semantic Knowledge Graphs to Add Explainability to Black Box Recommender Systems | A novel approach for latent factor-based black box recommendation model | Generates both accurate predictions and semantically rich explanations that justify the predictions.
12 | Nehal Mohamed Ali, Marwa Mostafa Abd El Hamid and Aliaa Youssif | SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS | Deep learning, LSTM, CNN | The proposed deep learning techniques (MLP, CNN, LSTM, and CNN_LSTM) have outperformed SVM, Naïve Bayes, RNTN.
13 | S. Kausar et al. | Sentiment Polarity Categorization Technique for Online Product Reviews | Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, Gradient Boosting and Sequence | Review-level categorization shows promising outcomes in terms of accuracy. Although automated sentiment analysis is helpful for analyzing big textual information, it still has limitations.
14 | Jundong Chen, Md Shafaeat Hossain, Huan Zhang | Analyzing the sentiment correlation between regular tweets and retweets | Sentiment analysis and machine learning | Provides instructional information for modeling information propagation in human society. A better sentiment model could be applied to handle negative sentiment well instead of simply summing up the positive and negative sentiments.
15 | Tao You, Yamin Li, Bingkun Sun, and Chenglie Du | Multi-source Data Stream Online Frequent Episode Mining | Episode mining | Widely used in the field of smart transportation and sensors.
16 | Harish Kumar, Anuradha, A.K.Solanki, Krishna Kant Singh | Progressive Machine Learning Approach with WebAstro for Web Usage Mining | Clustering technique with WEBASTRO tool | Ideal approach for web mining and for predicting users' behaviour for their next visit and automatic website modification. The use of these techniques in automated software for predicting the next visit remains to be explored.
17 | L. Jain, R. Katarya and S. Sachdeva | Recognition of opinion leaders coalitions in online social network using game theory | Game theory-based Opinion Leader Detection (GOLD) algorithm | Strengthens the power of the coalition and produces the synergetic outcome. The innovative computational intelligence techniques along with the evolutionary game theory approach to detect the promising opinion leader in social networks.
18 | Victoria Kayser and Erduana Shala | Scenario development using web mining for outlining technology futures | Web and text mining | The rapid overview with the visualizations remarkably reduces the reading effort. Other data sources should be explored.
19 | T Mustaqim et al 2020 | Twitter text mining for sentiment analysis on government's response to forest fires with vader lexicon polarity detection and k-nearest neighbor algorithm | VADER lexicon polarity detection and K-Nearest Neighbors | The sentiment analysis process can almost run automatically.
5 Conclusion
This work provides an update on the current state of social media network
analysis. In this paper, the literature is reviewed and a comparative study of
work on social networks is presented. Various data mining techniques and
algorithms, and especially the growth and development of data mining in social
media, are also covered. This work helps in studying the relevance of data mining
techniques and ideas for social network analysis, and reviews the related
literature on web mining and social networks.
References
1. Sherry Y. Chen and Xiaohui Liu, “The contribution of data mining to information
science”, Journal of Information Science, 30 (6), pp. 550–558, 2004.
2. Salvador García, Julián Luengo, Francisco Herrera, “Tutorial on practical tips of the
most influential data preprocessing algorithms in data mining”,
http://dx.doi.org/10.1016/j.knosys.2015.12.006, 2015.
3. Shuja Mirza, Dr. Sonu Mittal, Dr. Majid Zaman, “A Review of Data Mining Literature”.
International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No.
11, November 2016.
4. Mohammad Noor Injadat, Fadi Salo and Ali Bou Nassif, “Data Mining Techniques in
Social Media: A Survey”, Neurocomputing,
http://dx.doi.org/10.1016/j.neucom.2016.06.045, 2016.
5. Zhang et al., “Multi-modal Sentiment Classification via Semi-supervised Learning”,
DOI 10.1109/ACCESS.2020.2969205, IEEE Access, 2016.
6. P. Ristoski, H. Paulheim, “Semantic Web in data mining and knowledge discovery: A
comprehensive survey”, Web Semantics: Science, Services and Agents on the World Wide
Web (2016), http://dx.doi.org/10.1016/j.websem.2016.01.001, 2016.
7. Li et al., “Mining Heterogeneous Influence and Indirect Trust for Recommendation”
(December 2019), DOI 10.1109/ACCESS.2020.2968102, IEEE Access, 2017.
8. Azwa Abdul Aziz , Andrew Starkey , Elissa Nadia Madi, “Predicting Supervise Machine
Learning Performances for sentiment analysis using contextual-based approaches”, DOI
10.1109/ACCESS.2019.2958702, IEEE Access. 2017.
9. Monali Bordoloi, Saroj Kr. Biswas, “Keyword extraction from micro-blogs using
collective weight”, Social Network Analysis and Mining (2018) 8:58,
https://doi.org/10.1007/s13278-018-0536-8, 2018.
10. Halgurd S. Maghdid, Member, IEEE, “Web News Mining Using New Features: A
Comparative Study”, DOI 10.1109/ACCESS.2018.2890088, IEEE Access, 2018.
11. Chongjun Wang, Peng Wei, “A novel web page text information extraction method”,
2019 IEEE 3rd Information Technology,Networking,Electronic and Automation Control
Conference (ITNEC 2019), 2019.
12. Dr. Ritu Bhargava, Abhishek Kumar, Sweta Gupta, “Collaborative methodologies for
pattern evaluation for web personalization using Semantic Web Mining”, Second
International Conference on Smart Systems and Inventive Technology (ICSSIT 2019) IEEE
Xplore Part Number: CFP19P17-ART; ISBN:978-1-7281-2119-2, 2019.
13. Rinkal Sardhara, Kamaljit I. lakhataria, "Impact of Different Domain Inlink,Outlink and
Rechability on Relevance of Web Page Using Correlation", Proceedings of the International
Conference on Intelligent Computing and Control Systems (ICICCS 2019) IEEE Xplore
Part Number: CFP19K34-ART; ISBN: 978-1-5386-8113-8, 2019.
14. Bharti Pooja M., Prof. Tushar J. Raval, "Improving Web Page Access Prediction using
Web Usage Mining and Web Content Mining ", Proceedings of the Third International
Conference on Electronics Communication and Aerospace Technology (ICECA 2019)
IEEE Conference Record # 45616; IEEE Xplore ISBN: 978-1-7281-0167-5, 2019.
15. Roberto Saia and Salvatore Carta, "Introducing a Vector Space Model to Perform a
Proactive Credit Scoring ", Springer Nature Switzerland AG 2019 A. Fred et al. (Eds.):
IC3K 2016, CCIS 914, pp. 125–148, 2019. https://doi.org/10.1007/978-3-319-99701-8_6,
2019.
16. M. Alshammari et al.: “Mining Semantic Knowledge Graphs to Add Explainability to
Black Box Recommender Systems”, 10.1109/ACCESS.2019.2934633, VOLUME 7, 2019.
17. Nehal Mohamed Ali, Marwa Mostafa Abd El Hamid and Aliaa Youssif, "SENTIMENT
ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING
MODELS", International Journal of Data Mining & Knowledge Management Process
(IJDKP) Vol.9, No.2/3, May 2019.
18. S. Kausar et al.: “Sentiment Polarity Categorization Technique for Online Product
Reviews”, DOI 10.1109/ACCESS.2019.2963020, VOLUME 8, 2020.
19. Jundong Chen, Md Shafaeat Hossain, Huan Zhang, “Analyzing the sentiment
correlation between regular tweets and retweets”, https://doi.org/10.1007/s13278-020-
0624-4, © Springer-Verlag GmbH Austria, part of Springer Nature 2020.
20. Tao You, Yamin Li, Bingkun Sun, and Chenglie Du, "Multi-source Data Stream Online
Frequent Episode Mining", DOI 10.1109/ACCESS.2020.2997337,IEEE Access, 2020.
21. Harish Kumar, Anuradha, A.K.Solanki, Krishna Kant Singh, "Progressive Machine
Learning Approach with WebAstro for Web Usage Mining", International Conference on
Computational Intelligence and Data Science (ICCIDS 2019), 10.1016/j.procs.2020.03.351,
2020.
22. L. Jain, R. Katarya and S. Sachdeva, "Recognition of opinion leaders coalitions in
online social network using game theory ", Knowledge-Based Systems 203 (2020) 106158,
2020.
23. Victoria Kayser and Erduana Shala,"Scenario development using web mining for
outlining technology futures", Technological Forecasting & Social Change,
https://doi.org/10.1016/j.techfore.2020.120086 , 2020.
24. T Mustaqim et al 2020, "Twitter text mining for sentiment analysis on government's
response to forest fires with vader lexicon polarity detection and k-nearest neighbor
algorithm", J. Phys.: Conf. Ser. 1567 032024, 2020.