International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 9, Issue 1, Jan - Feb 2018, pp. 12–25, Article ID: IJARET_09_01_002

Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=9&IType=1

ISSN Print: 0976-6480 and ISSN Online: 0976-6499

© IAEME Publication

SOCIAL MEDIA HASHTAG CLUSTERING

USING GENETIC ALGORITHM

Nilesh Gambhava

Institute of Technology, Nirma University, Ahmedabad, India

Dr. Ketan Kotecha

Parul University, Baroda, India

ABSTRACT

Twitter is one of the most influential microblogging platforms of the social media era. Tweets, the short messages users post to interact with the social world, are an invaluable source of data for trend prediction, timeline generation, community detection, etc. Extracting useful information from tweets is challenging for two reasons: first, a tweet is short (140 characters), and second, users focus on the meaning of a tweet rather than on grammar rules or correct spelling. People put the hashtag symbol (#) before a keyword or phrase in a tweet to emphasize the importance of those words during a search. Hashtag clustering is an important technique for extracting knowledge by categorizing tweets into different clusters. It is a challenging task for three major reasons: first, the number of clusters is not known in advance; second, domain-related information is not available; and third, different hashtags are created for the same topic (#deepnet, #deeplearning, #dl, etc.). Genetic Algorithm is an adaptive heuristic search algorithm that mimics the evolutionary process of natural selection and survival of the fittest. To the best of our knowledge, this is the first attempt to cluster hashtags using Genetic Algorithm. We have evaluated our algorithm on a large set of tweets downloaded from popular Indian media Twitter accounts. The results obtained by our model are compared against clusterings produced by crowdsourcing, as no other source is available to validate the quality of the results. The results achieved by our model are superior to the crowdsourcing results. The users' validation of the generated clusters also supports the accuracy of the proposed model.

Key words: Social Media, Hashtag Clustering, Genetic Algorithm, Crowd Sourcing.

Cite this Article: Nilesh Gambhava and Dr. Ketan Kotecha, Social Media Hashtag

Clustering Using Genetic Algorithm. International Journal of Advanced Research in

Engineering and Technology, 9(1), 2018, pp 12–25.

http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=9&IType=1

1. INTRODUCTION

Twitter [1] is one of the most widely used online social networking services, where users share their opinions and communicate with others using short messages known as tweets, limited to 140 characters. Twitter has become one of the largest repositories of news, opinion, and data. Tweets posted by users all over the world represent thoughts and views on a broad variety of categories. 100 Million+ users post 500 Million+ tweets daily about what's happening in the world and express their views on thousands of different topics. These tweets are a valuable source of data for identifying trends, generating a timeline of an event, finding people with similar interests, etc.

Extracting information from tweets is extremely difficult because of their nature and structure. First, tweets are limited to 140 characters, so many users abbreviate words to shorten the message. Though the message is conveyed well to readers, the abbreviations used in tweets are hard to interpret when extracting information automatically. For example, the word Tomorrow was written as 2m, 2mar, 2mara, 2maro, 2marrow, 2mor, 2moro, 2morow, 2morr, 2morro, 2morrow, 2moz, 2mr, 2mro, 2mrrw, 2mrw, 2mw, tmmrw, tmo, tmoro, tmorrow, tmoz, tmr, tmro, tmrow, tmrrow, tmrrw, tmrw, tmrww, tmw, tomaro, tomarow, tomarro, tomarrow, tomm, tommarow, tommarrow, tommoro, tommorow, tommorrow, tommorw, tommrow, tomo, tomolo, tomoro, tomorow, tomorro, tomorrw, tomoz, tomrw, tomz in different tweets. Second, users don't follow any structure of language; they focus on the meaning of a tweet rather than on grammar rules or correct spelling. Third, 500 Million+ tweets are posted daily, in addition to retweets, likes, etc., which requires real-time analytics on a huge amount of data. Fourth, because thousands of tweets stream in live, searching for tweets is not as effective as searching with Google. At the time of writing, Twitter shows a list of tweets containing the searched keywords ordered by posting date, irrespective of the importance or relevance of a tweet to the searched keywords.

Clustering [2-5] is the process of grouping objects so that similar objects reside in one cluster and dissimilar objects reside in different clusters. Clustering can be considered the most important unsupervised learning problem because of its applicability to a large set of problems. The clustering process identifies structure in a collection of unlabeled data. Clustering algorithms can be classified into two categories: 1) those that require the number of clusters at the initial step and 2) those that do not. K-means [6-7] and Fuzzy C-means [8] are well-known examples of centroid-based clustering algorithms in which the number of clusters, k, is required at the first step. Algorithms of this type are not suited to hashtag clustering because we don't know how many clusters or groups exist for a given set of hashtags; we cannot even assume an approximate number of clusters. Hierarchical clustering algorithms [9-10] belong to the second category: the number of clusters is not required in advance, but it is required at the last step. They are also not applicable to hashtag clustering because we don't know where to cut the dendrogram or when to stop. Hashtag clustering using conventional clustering algorithms is therefore not practical.

Kwak et al. [11] have studied the topological characteristics of Twitter and shown the fact

that majority of tweets are news in nature through classifying the trending topics based on

temporal behavior. Java et al. [12] have focused on hierarchical spatio-temporal hashtag

clustering techniques. Using STREAMCUBE, events have been identified based on space and

time hierarchy by Feng et al. [13]. Song et al. [14] have improved text understanding by using

a probabilistic knowledgebase then using a Bayesian inference mechanism to conceptualize

words and short text. Sakaki et al. [15] have investigated earthquakes in Twitter and proposed

an algorithm to monitor tweets and to detect a target event using a classifier of tweets based

on features such as the keywords in a tweet, the number of words, and their context. Stilo and

Velardi [16] have proposed a temporal sense clustering algorithm using Symbolic Aggregate

ApproXimation. Muntean et al. [17] have clustered a large set of hashtags using K-means on

map reduce in order to process data in a distributed manner. Crockett et al. [18] have

reviewed 13 various unsupervised learning algorithms to analyze Twitter data streams and

identify hidden patterns in tweets where the text is highly unstructured. Tripathy et al. [19]

have proposed to use Wikipedia topic taxonomy to discover the themes from the tweets and

use the themes along with traditional word based similarity metric for clustering. The study of

Adel et al. [20] focuses on clustering tweets based on their textual content similarity using

cellular genetic algorithm (cGA). Keshavarz and Abadeh [21] have combined corpora-based and lexicon-based approaches, with lexicons generated from text. Using these lexicons, a novel genetic algorithm is proposed to solve an optimization problem and find lexicons to classify text. The same authors have also classified tweets into subjective and objective tweets [22]. They extract two meta-level features from tweets, which give the counts of objective and subjective words. The tweets are then classified using these meta-features. A genetic algorithm is used for creating subjectivity lexicons from training datasets.

Genetic Algorithm (GA) [23-25] is an adaptive heuristic search algorithm that mimics the evolutionary process of natural selection and survival of the fittest. It is a subset of the much broader branch of evolutionary computation. It represents an intelligent exploitation of a random search, using historical information to direct the search into the region of a solution. It is widely used to find optimal or near-optimal solutions to NP-hard problems such as optimization, clustering, etc. in a reasonable time, where exact methods may take much longer, in some cases years. Many researchers have explored the clustering capability of GA for various kinds of problems [26-29]. GA can be one of the best methods for solving the hashtag clustering problem because of the following features:

It does not require any derivative information.

It searches in parallel for global optima in a solution space.

It optimizes both continuous and discrete functions.

It provides a list of good solutions and not just a single solution.

It gives a solution in a finite time, and the solution gets better over time.

It is useful when the search space is very large and a large number of parameters are involved.

The clustering capability of GA is explored in this research work to cluster a set of hashtags. To the best of our knowledge, this is the first attempt at applying GA to hashtag clustering without any prior processing of tweets or use of domain-related information. We have verified our results using crowdsourcing because no other alternative is available.

A hashtag is a word or phrase preceded by a hash sign (#). It is used on social media websites, especially Twitter, to tag messages on a specific theme or domain. People put the hashtag symbol (#) before a word or phrase in their tweets to emphasize the importance of those words during a search. Twitter also uses hashtags to index keywords. Because of a hashtag, a tweet may be seen by hundreds or even millions of users who do not follow its author but have searched for that particular hashtag. Tweets without a hashtag have a very short life. Users create different hashtags for the same topic, such as #deeplearning, #dl, #deepnet, #deepneuralnet, etc. Moreover, many users use closely related hashtags such as #machinelearning, #ai, #neuralnets, #deeplearning, etc. Searching for a single hashtag may not give an optimal result, so grouping similar hashtags plays a vital role in improving Twitter hashtag search results. Grouping similar hashtags is a typical clustering problem. Hashtag clustering has various applications such as event timeline generation, community detection, finding people with similar interests, etc.



The rest of the paper is organized as follows. Section 2 describes the working principle of GA and presents our proposed model with its implementation details. Experimental results are discussed in Section 3. Section 4 concludes the paper with an insight into future work.

2. THE PROPOSED MODEL

2.1. Introduction

Genetic Algorithm (GA) is an adaptive heuristic search algorithm that mimics the

evolutionary process of natural selection and survival of the fittest. GA begins by randomly

generating a set of possible solutions (chromosomes) known as an initial population. The

chromosomes carry the parameter values that create a solution. GA iterates through several

generations by applying genetic operations on population and explores the better solution

after each generation.

The first step in each generation is to calculate the fitness value of the chromosomes based on the optimization function. Once the fitness values are calculated, three basic genetic operators are applied to the chromosomes. The first genetic operator is selection: two parent chromosomes with high fitness values are selected from the current population for the subsequent operations. The second operator is crossover: two child chromosomes are created from the two parent chromosomes by applying a crossover operator such as single point, uniform, or heuristics based. The next operator is mutation, which is used to avoid getting stuck in local optima. Mutation changes a gene value randomly with very low probability. The newly produced child chromosomes and some of the elite chromosomes are passed to the next generation's population. The algorithm stops when any one of the stopping criteria is satisfied or the maximum number of generations is reached. The basic algorithm for GA is as follows:

begin
    t = 0
    randomly initialize population P(t)
    while (t <= max_generation and termination criteria is not achieved)
        Calculate fitness of each individual in population
        Selection_Operation(P(t))
        Crossover_Operation(P(t))
        Mutation_Operation(P(t))
        t = t + 1
        Copy elite individuals of P(t-1) and children to P(t)
    end
end

Figure 1 Algorithm for Genetic algorithm

Implementation of a genetic algorithm varies from problem to problem. Fine-tuning of parameters plays a crucial role in achieving an optimal solution to the problem. The parameters have to be chosen very carefully, otherwise GA may never give the optimal result.

To the best of our knowledge, this is the first research attempt 1) to cluster hashtags without prior processing or use of domain knowledge and 2) to do so using GA. We do not have any prior research work on hashtag clustering using GA to refer to, hence in this research work we experimented with different operators and different possible parameter values to find the best set of operators and parameters.
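
The loop of Figure 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default parameter values, and the toy bit-counting problem used to exercise the loop are all assumptions made for the example.

```python
import random

def run_ga(fitness, random_individual, crossover, mutate,
           pop_size=20, max_generation=100, elite=2, seed=0):
    """Generic GA loop following Figure 1 (all names are illustrative)."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(max_generation):
        # Rank the current population by fitness, best first.
        ranked = sorted(population, key=fitness, reverse=True)
        children = []
        while len(children) < pop_size - elite:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(ranked, 2), key=fitness)
            p2 = max(rng.sample(ranked, 2), key=fitness)
            children.append(mutate(crossover(p1, p2, rng), rng))
        # Elite individuals survive unchanged alongside the children.
        population = ranked[:elite] + children
    return max(population, key=fitness)

# Toy problem (an assumption for demonstration): maximise the number of 1s.
def random_bits(rng, n=16):
    return [rng.randint(0, 1) for _ in range(n)]

def uniform_crossover(a, b, rng):
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def flip_mutation(ind, rng, p=0.05):
    return [1 - g if rng.random() < p else g for g in ind]

best = run_ga(sum, random_bits, uniform_crossover, flip_mutation)
```

Because the elite individuals are copied unchanged, the best fitness in the population cannot decrease from one generation to the next in this sketch.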

2.2. Chromosome Representation

The chromosome is defined as a string of positive integer values. Suppose there are h different hashtags; then the length of the chromosome is h, and each gene represents a hashtag. The value of a gene, also known as an allele, is the number of the cluster to which that hashtag belongs. For example, with 5 hashtags and the chromosome 12112, the first, third and fourth hashtags belong to cluster number 1, and the second and fifth hashtags belong to cluster number 2. Table 1 and Table 2 show an example with 10 hashtags, where the gene index represents the hashtag number and the gene value represents the number of the cluster to which that hashtag belongs.

Table 1 Sample list of hashtags

Index    1      2    3       4      5         6     7      8       9       10
Hashtag  #news  #AI  #delhi  #smog  #cricket  #dnn  #food  #taste  #India  #kashmir

Table 2 Randomly generated chromosome from hashtag list of Table 1

Gene index  1  2  3  4  5  6  7  8  9  10
Gene value  4  6  1  3  2  1  2  4  2  5
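
As a small illustration of this encoding, the chromosome of Table 2 can be decoded into clusters as sketched below. The helper name `decode` is hypothetical and not taken from the paper.

```python
from collections import defaultdict

def decode(chromosome, hashtags):
    """Group hashtags by the cluster number stored in each gene."""
    clusters = defaultdict(list)
    for tag, cluster_id in zip(hashtags, chromosome):
        clusters[cluster_id].append(tag)
    return dict(clusters)

# Hashtags of Table 1 and the chromosome of Table 2.
hashtags = ["#news", "#AI", "#delhi", "#smog", "#cricket",
            "#dnn", "#food", "#taste", "#India", "#kashmir"]
chromosome = [4, 6, 1, 3, 2, 1, 2, 4, 2, 5]

clusters = decode(chromosome, hashtags)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

With this chromosome, cluster 2 holds #cricket, #food and #India, while #AI, #smog and #kashmir each sit alone in their own cluster.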

2.3. Initialization of Population

The population (P) is a set of n chromosomes where each chromosome represents one valid solution. The initial population is randomly generated. Let n be the size of the population, Ch denote a chromosome and c be the length of a chromosome. To initialize the r-th chromosome, Ch_r, in the population (r = 1, 2, …, n), an integer is randomly selected from the range [1, Clust_max] for each gene of the r-th chromosome. We do not know the approximate number of clusters in advance, so we first generate chromosomes with Clust_max clusters and then halve the number of clusters after each step until all hashtags belong to a single cluster. Fig. 2 shows the algorithm for the generation of the initial population.

Input: List of Hashtags
Output: Randomly Generated Initial Population

initial_chromosome_generation()
{
    P = empty population
    chromosome_tobe_generated = min_chromosomes   // e.g. 5
    min_cluster = 1
    cluster_count = number_of_hashtags
    while (cluster_count >= min_cluster)
    {
        i = 0
        while (i < chromosome_tobe_generated)
        {
            randomly generate a chromosome with cluster_count clusters and add it to P
            i = i + 1
        }
        chromosome_tobe_generated *= 2
        cluster_count /= 2   // integer division
    }
    return P
}

Figure 2 The proposed method to generate initial random population
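
A runnable Python sketch of the procedure in Figure 2 is given below. It assumes that the cluster count is halved with integer division and that the loop runs until the single-cluster case has been generated; the function and parameter names are illustrative.

```python
import random

def initial_population(num_hashtags, min_chromosomes=5, seed=None):
    """Generate random chromosomes while halving the cluster count each step."""
    rng = random.Random(seed)
    population = []
    to_generate = min_chromosomes
    cluster_count = num_hashtags
    while cluster_count >= 1:
        for _ in range(to_generate):
            # Each gene is a cluster number in [1, cluster_count].
            population.append(
                [rng.randint(1, cluster_count) for _ in range(num_hashtags)]
            )
        to_generate *= 2      # more chromosomes as the cluster count shrinks
        cluster_count //= 2   # integer halving: 10 -> 5 -> 2 -> 1 -> 0 (stop)
    return population

population = initial_population(10, min_chromosomes=5, seed=42)
print(len(population))  # 5 + 10 + 20 + 40 = 75 chromosomes
```

Under these assumptions, 10 hashtags give cluster counts of 10, 5, 2 and 1, so the final batch consists of chromosomes that place every hashtag in cluster 1.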



The fitness function is based on the co-occurrence frequency of hashtags. Two hashtags are considered related to a topic if they appear together in at least t tweets. If two hashtags belong to the same cluster in a solution and their co-occurrence frequency is high, a high fitness value is assigned to the solution; if their co-occurrence frequency is zero or low, a low or even negative value is assigned. We have tried the two fitness functions described in Table 3.

Table 3 Fitness Functions

Sr.  Fitness Function
1    … w_ij is the co-occurrence weight of hashtag i and hashtag j if w_ij > 0, else -3 or -1 if w_ij = 0. The coefficient is derived from …
2    … if w_ij > 0, else -5 or 0
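
The idea behind a co-occurrence based fitness can be sketched as follows. This is only an assumed form, not the formulas of Table 3: every pair of hashtags placed in the same cluster contributes its co-occurrence frequency when that frequency is positive and a fixed penalty otherwise. The logarithmic variant and its maximum value are not reproduced, and the names `fitness`, `cooccurrence` and `penalty` are hypothetical.

```python
from itertools import combinations

def fitness(chromosome, cooccurrence, penalty=-5):
    """Sum co-occurrence rewards and penalties over same-cluster hashtag pairs.

    cooccurrence maps a pair of gene indices (i, j) with i < j to the number
    of tweets in which both hashtags appear (assumed data layout).
    """
    total = 0
    for i, j in combinations(range(len(chromosome)), 2):
        if chromosome[i] != chromosome[j]:
            continue  # only pairs inside the same cluster are scored
        freq = cooccurrence.get((i, j), 0)
        total += freq if freq > 0 else penalty
    return total

cooccurrence = {(0, 1): 40, (2, 3): 35}
print(fitness([1, 1, 2, 2], cooccurrence))  # 75: both related pairs grouped
print(fitness([1, 1, 1, 1], cooccurrence))  # 55: four unrelated pairs penalised
print(fitness([1, 2, 3, 4], cooccurrence))  # 0: no pair shares a cluster
```

A negative penalty discourages the trivial solution of putting all hashtags into one cluster, which is why the fitness value can become negative.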

2.4. Selection Operation

Roulette wheel selection (RWS) and stochastic universal sampling (SUS), though widely used, are not applicable to our problem because the fitness value can be negative. We have experimented with two selection operators: Tournament Selection and Linear Rank Selection (LRS).

2.4.1. Tournament Selection

Tournament selection operator (TSO) is one of the simplest selection operators. Two or more chromosomes are chosen at random and the fittest of them is selected for the crossover operation. The least fit chromosome never gets a chance to be selected. In our case, the tournament size is 2.

Figure 3 Tournament Selection
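
A minimal sketch of tournament selection with tournament size 2 is shown below. It assumes the contenders are drawn without replacement, which is what makes the least fit chromosome unable to win; the function name is illustrative.

```python
import random

def tournament_select(population, fitnesses, rng, size=2):
    """Pick `size` distinct chromosomes at random and return the fittest."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda idx: fitnesses[idx])
    return population[winner]

rng = random.Random(0)
population = ["a", "b", "c"]
fitnesses = [1, 5, -3]   # negative fitness values pose no problem here
picks = [tournament_select(population, fitnesses, rng) for _ in range(200)]
print(set(picks))        # "c", the least fit chromosome, never appears
```

Only comparisons between fitness values are used, so the operator works unchanged when fitness is negative.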

2.4.2. Linear Rank Selection

Sometimes GA converges prematurely to a local optimum because a few chromosomes have very high fitness values. LRS tries to overcome this drawback. LRS is based on the rank of individuals rather than on their fitness values. The chromosomes are ordered from best fitness value to worst. The best chromosome is assigned rank n and the worst is assigned rank 1. Each individual's probability of being selected is then assigned linearly according to its rank.
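
One common way to assign probabilities linearly by rank is to divide each rank by the sum of all ranks, n(n+1)/2. The sketch below uses that form as an assumption; it is not necessarily the exact expression used in the paper, and the function names are illustrative.

```python
import random

def rank_probabilities(fitnesses):
    """Selection probability proportional to rank (worst = 1, best = n)."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda idx: fitnesses[idx])  # worst first
    total = n * (n + 1) / 2                                   # sum of ranks
    probs = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        probs[idx] = rank / total
    return probs

def linear_rank_select(population, fitnesses, rng):
    """Draw one chromosome using rank-based probabilities."""
    weights = rank_probabilities(fitnesses)
    return rng.choices(population, weights=weights, k=1)[0]

# The outlier fitness of 400 gets probability 3/6, not a near-certain pick.
print(rank_probabilities([-10, 400, 3]))
```

Because only the ordering matters, a single chromosome with a very high fitness value cannot dominate selection, and negative fitness values are handled naturally.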



2.5. Crossover Operation

We have experimented with two widely used crossover operators: single point crossover and uniform crossover (gene to gene). Single point crossover selects a crossover point at random and exchanges the genes after that point to produce two new chromosomes. Uniform crossover is applied on a gene-by-gene basis: a coin is tossed for each gene to decide whether the first child takes the gene from the first parent or the second parent. Figs. 4(a) & 4(b) show examples of both types of crossover operation.

Figure 4(a) Single point Crossover

Figure 4(b) Uniform Crossover
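
Both operators can be sketched in a few lines of Python. In the uniform variant it is assumed that the second child receives the gene the first child did not take; the function names are illustrative.

```python
import random

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two parents after a random crossover point."""
    point = rng.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def uniform_crossover(p1, p2, rng):
    """Toss a coin per gene; the second child gets the other parent's gene."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            c1.append(g1)
            c2.append(g2)
        else:
            c1.append(g2)
            c2.append(g1)
    return c1, c2

rng = random.Random(3)
parent1 = [1, 1, 2, 2, 3]
parent2 = [4, 5, 4, 5, 4]
print(single_point_crossover(parent1, parent2, rng))
print(uniform_crossover(parent1, parent2, rng))
```

In both cases every gene position in the two children together holds exactly the two parental values for that position.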

2.6. Mutation

Mutation is a small random tweak in the chromosome that lets the search start from a new solution. It changes the value of a gene randomly with very low probability. We have used this classic technique with a novel two-step approach: we first decide, with a moderate probability, whether to apply mutation to the chromosome at all, and then apply mutation to each gene with very low probability. Fig. 5 represents the mutation operation.

Figure 5 Mutation Operation
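
The two-step mutation can be sketched as follows. The default probabilities mirror the percentages reported as working well later in the paper (25 for the chromosome, 5 for each gene), expressed as fractions; the function name and the choice of redrawing a gene uniformly from [1, max_cluster] are assumptions.

```python
import random

def mutate(chromosome, max_cluster, rng, p_chromosome=0.25, p_gene=0.05):
    """Two-step mutation: gate on the chromosome, then on each gene."""
    # Step 1: decide whether this chromosome is mutated at all.
    if rng.random() >= p_chromosome:
        return list(chromosome)
    # Step 2: reassign each gene to a random cluster with low probability.
    return [
        rng.randint(1, max_cluster) if rng.random() < p_gene else gene
        for gene in chromosome
    ]

rng = random.Random(0)
original = [4, 6, 1, 3, 2, 1, 2, 4, 2, 5]
print(mutate(original, max_cluster=6, rng=rng))
```

The chromosome-level gate means most chromosomes pass through untouched, while the selected ones receive only a few changed genes.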

3. EXPERIMENTAL RESULTS

3.1. Data Collection

We have used the Twitter REST API [30] to download tweets. Using the REST API, we can fetch the most recent 3200 tweets of a user. We have downloaded tweets from 76 popular Indian media Twitter accounts, listed in Table 4, going backward from 24 September 2017. This dataset contains 2,27,000+ tweets, 1,55,000+ hashtags, 48,500+ hashtag pairs and 24,500+ unique hashtags. We have removed hashtags with a co-occurrence frequency of less than 30 to keep the computation feasible.
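
Counting hashtag co-occurrence from tweet text can be sketched as below. Several details are assumptions: hashtags are extracted with a simple regular expression, normalised to upper case, and the frequency threshold is interpreted as dropping hashtag pairs that co-occur in fewer than 30 tweets. The function name is hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")

def cooccurrence_counts(tweets, min_count=30):
    """Count tweets in which each hashtag pair appears together."""
    counts = Counter()
    for text in tweets:
        # Distinct hashtags of one tweet, upper-cased and sorted so that
        # each pair always has the same key.
        tags = sorted({tag.upper() for tag in HASHTAG.findall(text)})
        counts.update(combinations(tags, 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}

tweets = ["#GST rollout #MissionGST"] * 30 + ["#food #travel"] * 2
print(cooccurrence_counts(tweets))  # only the pair seen in 30 tweets remains
```

The resulting dictionary plays the role of the co-occurrence weights on which the fitness function is based.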

3.2. Parameter Selection

Proper selection of GA parameters is crucial because optimal results cannot be achieved if even one of the parameters is improper. We developed an interactive and generalized experimental model for measuring the effect of specific parameters and operators on GA's performance. Table 5 gives the parameter values which we have evaluated using our model.



Table 4 Tweets downloaded from following Indian Media Twitter Accounts

@aajtak           @abpnewstv        @AmarUjalaNews    @Avinash_Mirror
@BDUTT            @BTVI             @CNBC_Awaaz       @CNBCTV18Live
@CNBCTV18News     @DainikBhaskar    @DDNewsLive       @DeccanChronicle
@DeccanHerald     @dibang           @EconomicTimes    @Eenadu_English
@FeminaIndia      @filmfare         @grihshobha       @gujratsamachar
@Haribhoomicom    @htTweets         @IBN7Media        @IndianExpress
@IndiaToday       @indiatvnews      @JagranNews       @KanchanGupta
@Live_Hindustan   @loksabhatv       @madhutrehan      @MayantiLanger_B
@mid_day          @MiniMenon        @mjakbar          @MumbaiMirror
@NavbharatTimes   @ndtv             @NDTVProfit       @NewIndianXpress
@News18Breaking   @News24           @NewsNationTV     @NewsWorldIN
@NewsX            @Nidhi            @Outlookindia     @prabhatkhabar
@PrabhuChawla     @PrannoyRoyNDTV   @punjabkesari     @rahulkanwal
@RajatSharmaLive  @ravishndtv       @readersdigest    @republic
@SachinKalbag     @sagarikaghose    @sardesairajdeep  @ShereenBhan
@ShomaChaudhury   @sportstarweb     @suchetadalal     @sudhirchaudhary
@SwetaSinghAT     @Telegraph        @thetribunechd    @THexplains
@TimesNow         @timesofindia     @totaltvguide     @vikramchandra
@WIONews          @WTOV9            @ZeeBusiness      @ZeeNews

Table 5 Parameters and Operators used in the proposed model

Parameter                  Values
Fitness Function           1) ln (Max Value 50, 25, Penalty -1, -3)  2) Weight (Penalty -5, -10)
Selection Method           1) Tournament Selection  2) Linear Rank Selection
Crossover Method           1) Single Point  2) Gene to Gene
Crossover Probability      1) 100  2) 90  3) 80  4) 70  5) 60
Mutation Probability       1) 10  2) 20  3) 25
Mutation Gene Probability  1) 5  2) 10  3) 20
Number of Generations      1000
Tweets Duration            24 Sep 2017 to backward

3.3. Results and Discussion

The initial population is created from randomly generated solutions that assign hashtags to different clusters. Chromosomes that group hashtags with higher co-occurrence frequency are assigned a higher fitness value. As generations pass, such chromosomes have a better chance of moving forward in the evolutionary iterations. Fig. 6 shows the evolution of the solution as the generations pass.

Figure 6 The evolution of solution with generation

As the algorithm converges, clusters are formed that contain the most relevant hashtags. Table 6 shows the fittest chromosome achieved after 1000 generations, which has a fitness value of 414. A total of 56 clusters have evolved from the chosen data set. Some clusters contain only one hashtag, as those hashtags are not related to any other hashtag.

Table 6 Clusters generated using the proposed method

Cluster Number  Hashtags included in specific cluster

1.   CNBCAWAAZ | HEADLINES | LIVE | MARKETCOUNTDOWN | MARKETKAPANCHNAMA | MORNINGCALL | Q1WITHAWAAZ | STOCK2020
2.   DERASACHASAUDA | HARYANA | HONEYPREET | RAMRAHIM | RAMRAHIMSINGH | RAMRAHIMVERDIC
3.   GURUGRAM | PRADYUMAN | PRADYUMANMURDERCASE | RYANINTERNATIONALSCHOOL
4.   FUELPRICES | MIDDAYMUMBAI | MIDDAYNEWS | MUMBAI | MUMBAINEWS | MUMBAIRAINS
5.   AUSTRALIA | CHINA | DOKLAM | INDIA | JAMMU | KASHMIR | PAKISTAN | UNGA
6.   BHASKARALERT | UPGOVT | UTTARPRADESH | YOGIADITYANATH
7.   INSPIRATIONALQUOTES | PKTHOUGHT | THOUGHTOFTHEDAY
8.   BEAUTY | FASHION | HAIRCARE | MAKEUP | SKINCARE
9.   APPLEEVENT | IPHONE8 | IPHONE8PLUS | IPHONEX
10.  FICTION | HINDISHORTSTORY | LITERATURE | STORY
11.  AMITSHAH | BJP | INFOSYS | VISHALSIKKA
12.  MARKETATCLOSE | MARKETS | NIFTY | SENSEX
13.  INCREDIBLEINDIA | TOURISM | TRAVEL
14.  BANGLADESH | MYANMAR | ROHINGYA
15.  BOLLYWOOD | ENTERTAINMENT | PKVIDEO
16.  KAREENAKAPOORKHAN | TAIMURALIKHAN
17.  BULLETTRAIN | PMMODI | SHINZOABE
18.  INDVSL | INDVSSL | SLVIND | TEAMINDIA
19.  DONALDTRUMP | JAPAN | NORTHKOREA
20.  AWAAZADDA | BIGDEBATE | POLL
21.  JANMAN | UPVASFIXING | जनमन
22.  BOLLYWOODPHOTOS | MIDDAYBOLLYWOOD
23.  FREEDOMATMIDNIGHT | MISSIONGST
24.  NOZOMIOKUHARA | PVSINDHU
25.  MUKESHAMBANI | RILAGM2017
26.  KANGANARANAUT | SIMRAN
27.  INDVENG | WWC17
28.  FOOD | INDIANFOOD
29.  LOVE | RELATIONSHIP
30.  JUSTIN | NDTVNEWS
31.  AIADMK | SASIKALA
32.  BIHAR | NITISHKUMAR
33.  CONGRESS | RAHULGANDHI
34.  DAWOODIBRAHIM | IQBALKASKAR
35.  ADIKHAMGUJARAT | SLVSIND
36.  CABINETREJIG | CABINETRESUFFLE
37.  SUPREMECOURT | TRIPLETALAQ
38.  FIIBROKERAGES | HOUSEVIEWS
39.  MODIINVARANASI | NARENDRAMODI
40.  BANKSEBACHAO | TWEETMORCHA
41.  NATIONALVOICE | SACHCHIBAAT
42.  INDVAUS | VAARTA
43.  NEWTON | OSCARS
44.  DIESEL | PETROL
45.  AMARUJALATV
46.  GST
47.  PANCHKULA
48.  MAHIRAKHAN
49.  SANSKRIT
50.  AUVIDEO
51.  l8IND
52.  EARTHQUAKE
53.  MEXICO
54.  RANBIRKAPOOR
55.  BOLLYWOODUPDATE
56.  SRILANKA

We have evaluated our proposed model on a recently downloaded real-world data set, as no benchmark results are available for comparison. We have used a crowdsourcing model to validate the results and check the applicability of the proposed model. A total of 100 distinct users were given the task of clustering the given set of hashtags. Table 7 shows the fitness values of the solutions based on the clusters assigned by different users. The best fitness value found by GA is 414, whereas the best fitness value found by crowdsourcing is 404. GA has thus produced a better result than the best result found by crowdsourcing. The clusters generated using GA were also validated by the users involved in our experiments. According to their common observations, hashtags are misplaced in two of the clusters (INDVAUS | VAARTA; ADIKHAMGUJARAT | SLVSIND). Also, one cluster (AMITSHAH | BJP | INFOSYS | VISHALSIKKA) needs to be divided further into sub-clusters for more focused results. In all other cases, the clusters are well formed and contain the most relevant hashtags.

Table 7 Fitness value of solutions clustered by crowdsourcing users

User     Fitness value   User      Fitness value   User      Fitness value   User      Fitness value
User 1   306             User 26   390             User 51   286             User 76   282

Table 8 shows top 10 results achieved by our proposed model. Different solutions

represent different clustering and all these results are comparatively good, hence all these

solutions can be used based on the requirement.

Table 8 Top 10 best results achieved using the proposed model

SM    Cp    CM    FF     FF Max   FF Pen   Mp   MGp   Fitness Value
TSO   80    G2G   Freq   Freq     -5       25   5     414
LRS   80    G2G   ln     50       -1       25   5     407
TSO   100   G2G   Freq   Freq     -5       25   5     403
LRS   100   G2G   Freq   Freq     -5       25   5     403
LRS   80    G2G   Freq   Freq     -5       25   5     393
LRS   90    G2G   Freq   Freq     -5       25   5     390
TSO   100   G2G   ln     50       -1       20   5     388
LRS   90    G2G   Freq   Freq     -5       25   10    385
TSO   100   G2G   Freq   Freq     -5       25   10    382
TSO   90    G2G   Freq   Freq     -5       25   5     371



Tables 9(a) & 9(b) present the average fitness values found using different GA operators and parameters. Both selection operators (TSO & LRS) can be used for this application as they produce similar results. Uniform (Gene to Gene) crossover produces substantially better results than single point crossover. Crossover probability (Cp) should be as high as 90 to 100. Mutation probability (Mp) on the chromosome should be moderate, around 25, whereas mutation gene probability (MGp) should be as low as 5.

Table 9(a) Comparison of different values for Crossover Operation

Crossover Probability   Avg. Fitness
100                     325
90                      318
80                      307
70                      292
60                      260

Crossover Method         Avg. Fitness
Uniform (Gene to Gene)   316
Single Point             240

Table 9(b) Comparison of different values for Mutation Operation

Mutation Probability   Avg. Fitness
25                     312
20                     299
15                     278
10                     264

Mutation Gene Probability   Avg. Fitness
5                           331
10                          302
15                          290
20                          278

The huge volume of tweets, which record observations of incidents happening in real life worldwide, is a valuable source of information. In practice, performing mining operations on such a large data set every time is not feasible. There is a growing need for automatic and intelligent tools and methods that generate useful outcomes. Such outcomes can be further exploited in many important real-life dimensions such as decision making, optimization, etc.

4. CONCLUSIONS

The genetic algorithm has been used widely for clustering for many years. To the best of our knowledge, this is the first effort to use the genetic algorithm for Twitter hashtag clustering. It is also the first attempt to cluster hashtags without preprocessing or using any domain knowledge. We have evaluated our proposed model on a live dataset downloaded from Twitter. Benchmark results against which we could compare our model are not available for this dataset. Hence, using crowdsourcing, we have compared the results of our model with the clustering results achieved by various users. The results achieved using our model are encouraging and better than the best result achieved using the crowdsourcing model. The two most promising highlights of our work are: first, there is no need to preprocess hashtags or use domain knowledge; second, the number of clusters is not required at the initial step, as our model automatically detects it from the given data set. This study shows that major hurdles related to hashtag clustering can be solved using the genetic algorithm. Automatic hashtag clustering using GA can support applications such as event timeline generation, community finding, trend detection, etc. However, selecting proper genetic parameters is equally important and can be crucial for good performance of the algorithm. We have developed an efficient generalized model and derived a set of parameters for the best results.

ACKNOWLEDGEMENT

We are very much thankful to the Nirma University for providing resources and other

facilities to carry out this research work.

REFERENCES

[1] Twitter. It’s what’s happening. www.twitter.com

[2] Jain, A.K., and Dubes, R.C., 1988. Algorithms for clustering data. Prentice-Hall, Inc.

[3] Jain, A.K., Murty, M.N., and Flynn, P.J., 1999. Data clustering: a review. ACM computing

surveys (CSUR), 31(3), pp.264-323.

[4] Steinbach, M., Karypis, G., and Kumar, V., 2000, August. A comparison of document

clustering techniques. In KDD workshop on text mining, 400(1), pp. 525-526.

[5] Xu, R., and Wunsch, D., 2005. Survey of clustering algorithms. IEEE Transactions on

neural networks, 16(3), pp.645-678.

[6] Hartigan, J.A., and Wong, M.A., 1979. Algorithm AS 136: A k-means clustering

algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1),

pp.100-108.

[7] Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., and Wu, A.Y.,

2002. An efficient k-means clustering algorithm: Analysis and implementation. IEEE

transactions on pattern analysis and machine intelligence, 24(7), pp.881-892.

[8] Gath, I., and Geva, A.B., 1989. Unsupervised optimal fuzzy clustering. IEEE Transactions

on pattern analysis and machine intelligence, 11(7), pp.773-780.

[9] Sneath, P.H., and Sokal, R.R., 1973. Numerical taxonomy. The principles and practice of

numerical classification.

[10] Steinbach, M., Karypis, G., and Kumar, V., 2000. A comparison of document clustering

techniques. In KDD workshop on text mining, 400(1), pp. 525-526.

[11] Kwak, H., Lee, C., Park, H., and Moon, S., 2010, April. What is Twitter, a social network

or a news media? In Proceedings of the 19th international conference on World Wide

Web, pp. 591-600. ACM.

[12] Java, A., Song, X., Finin, T., and Tseng, B., 2007, August. Why we twitter: understanding

microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-

KDD 2007 workshop on Web mining and social network analysis, pp. 56-65. ACM.

[13] Feng, W., Zhang, C., Zhang, W., Han, J., Wang, J., Aggarwal, C., and Huang, J., 2015,

April. STREAMCUBE: hierarchical spatio-temporal hashtag clustering for event

exploration over the twitter stream. In Data Engineering (ICDE), 2015 IEEE 31st

International Conference, pp. 1561-1572. IEEE.

[14] Song, Y., Wang, H., Wang, Z., Li, H., and Chen, W., 2011, July. Short text

conceptualization using a probabilistic knowledgebase. In Proceedings of the Twenty-

Second international joint conference on Artificial Intelligence, 3, pp. 2330-2336. AAAI

Press.



[15] Sakaki, T., Okazaki, M., and Matsuo, Y., 2010, April. Earthquake shakes Twitter users:

real-time event detection by social sensors. In Proceedings of the 19th international

conference on World Wide Web, pp. 851-860. ACM.

[16] Stilo, G., and Velardi, P., 2017. Hashtag sense clustering based on temporal similarity.

Computational Linguistics.

[17] Muntean, C.I., Morar, G.A., and Moldovan, D., 2012. Exploring the meaning behind

twitter hashtags through clustering. In Business Information Systems Workshops, pp. 231-

242. Springer Berlin Heidelberg.

[18] Crockett, K.A., Mclean, D., Latham, A., and Alnajran, N., 2017. Cluster Analysis of

Twitter Data: A Review of Algorithms. In Proceedings of the 9th International Conference

on Agents and Artificial Intelligence, 2, pp. 239-249. Science and Technology

Publications (SCITEPRESS)/Springer Books.

[19] Tripathy, R.M., Sharma, S., Joshi, S., Mehta, S., and Bagchi, A., 2014. Theme based

clustering of tweets. In Proceedings of the 1st IKDD Conference on Data Sciences, pp. 1-

5. ACM.

[20] Adel, A., ElFakharany, E., and Badr, A., 2014. Clustering tweets using cellular genetic algorithm. Journal of Computer Science, 10, pp. 1269-1280. doi: 10.3844/jcssp.2014.1269.1280.

[21] Keshavarz, H., and Abadeh, M.S., 2017. ALGA: Adaptive lexicon learning using genetic

algorithm for sentiment analysis of microblogs. Knowledge-Based Systems, 122, pp.1-16.

[22] Keshavarz, H., and Abadeh, M.S., 2016, March. SubLex: Generating subjectivity lexicons

using genetic algorithm for subjectivity classification of big social data. In Swarm

Intelligence and Evolutionary Computation (CSIEC), 2016 1st Conference, pp. 136-141.

IEEE.

[23] Holland, J.H., 1992. Adaptation in natural and artificial systems: an introductory analysis

with applications to biology, control, and artificial intelligence. MIT press.

[24] Goldberg, D.E., 1989. Genetic algorithms in search, optimization, and machine learning. Reading: Addison-Wesley.

[25] Mitchell, M., 1998. An introduction to genetic algorithms. MIT press.

[26] Maulik, U., and Bandyopadhyay, S., 2000. Genetic algorithm-based clustering technique.

Pattern Recognition, 33(9), pp.1455-1465.

[27] Bandyopadhyay, S., and Maulik, U., 2002. Genetic clustering for automatic evolution of

clusters and application to image classification. Pattern Recognition, 35(6), pp.1197-1208.

[28] Hruschka, E.R., Campello, R.J., and Freitas, A.A., 2009. A survey of evolutionary

algorithms for clustering. IEEE Transactions on Systems, Man, and Cybernetics, Part C

(Applications and Reviews), 39(2), pp.133-155.

[29] Rahman, M.A., and Islam, M.Z., 2014. A hybrid clustering technique combining a novel

genetic algorithm with K-Means. Knowledge-Based Systems, 71, pp.345-365.

[30] https://dev.twitter.com/twitterkit/android/access-rest-api
