
Stack Search

StackOverflow Search Clustering

By

Sonali Sharma, Shubham Goel, Priya Iyer

Project Report Report presented towards the completion of the class project for INFO 256

Applied Natural language Processing

Date: 12/16/2013

Instructor

Marti Hearst

Mentor

Aditi Muralidharan

1. Introduction

There are a number of search engines available today. When users type in a query to search for something over the internet, they are often overwhelmed with a large number of results. It often becomes difficult to browse through this list and fetch the most relevant items. We implemented an algorithm that improves user search using NLP techniques and provides a mechanism to categorize search results into relevant categories, thereby making it easy for the end user to navigate to the category of interest and look for results within that category. We decided to implement the Findex algorithm in this project. Findex is a text categorization algorithm that provides an overview of search results as categories, where categories are made up of the most frequent words and phrases in the resulting document set. The algorithm is based on the assumption that the most frequently used words/phrases in a set of documents capture major topics very well. We used StackOverflow data to implement the algorithm.

2. Project Goals

2.1 Original intent

We originally intended to work on clustering similar sentences using a monothetic clustering algorithm such as DisCover [1]. Monothetic clustering is a clustering technique wherein each cluster is formed using only one feature, and that single feature is present across all the samples, which, in our case, are documents. The DisCover algorithm is one such monothetic clustering algorithm. We also intended to use the WordSeer [2] project and the associated humanities corpora in order to implement the algorithm on it. However, upon deeper analysis we identified some hurdles to this approach, as discussed below.

2.2 Algorithm

The DisCover algorithm aims for full coverage; however, search results do not necessarily have to fall under clusters. Moreover, understanding and implementing the algorithm appeared to be complex and fell outside the scope of a class project. Also, a good measure and method of evaluation for the DisCover algorithm had not been established, so we decided to use another algorithm, Findex [3], to categorize search results. The Findex algorithm is discussed in detail in an upcoming section.

2.3 Data

The developer of WordSeer, Aditi M, had set up two text corpora to work with WordSeer. However, the text seemed more appropriate for testing purposes than for actual use.

2.4 WordSeer

Also, WordSeer had already implemented a version of the Findex algorithm for clustering search results. There wasn't much left for us to do there, and we really wanted to learn more about Findex and its implementation.

Thus, we decided to implement a search result clustering algorithm (Findex) over StackOverflow data using the StackOverflow API. We built a search interface for StackOverflow users to type in their questions and see the results in a neat manner.

2.5 Accomplishments

Data

Data processing was a much harder task than we had originally expected. Having used the ready-made database that Aditi had uploaded to the WordSeer platform, we had not expected the data cleaning and loading tasks to be so arduous. The StackOverflow API provided us with data in huge XML and HTML files. A lot of our time was consumed in extracting the data, cleaning it, parsing it, and loading it into the database.

Algorithm

We decided to implement Findex as our algorithm of choice for categorizing the search results. Details of the implementation are discussed in the upcoming section. Our original intent was to create only one layer of categories for the search result topics, but we ended up creating a second layer of sub-categories by recursively applying the Findex algorithm over each of the parent categories. From an NLP standpoint, we ended up implementing a range of concepts, from word tokenization, lemmatization, and n-gram creation to phrase frequency distribution.

Search User Interface

Initially, our vision was only to create the categories over the search results and display them on the console. However, we decided to go a step further and create a web interface as well. We realized that visualizing the categories and the sub-categories, along with the content of the questions and responses from StackOverflow, would be a good way to further drive home the point about search result categorization.

Future Work

We intend to plug the system that we developed into the StackOverflow interface to allow users to quickly browse to the questions and responses of their interest through the categorized topics. Currently, the StackOverflow user interface only has tags as a way of classification. However, the issue is that tags are user-generated and may not always be relevant to the question.

2.6 Results

The following figures represent the results of our project. Each figure shows a different query term and the categories corresponding to it. Notice how some query terms have no sub-categories while others do. This is because Findex only displays categories or sub-categories if a substantial number of questions fall under them. Detailed descriptions of these result pages are provided in the next section.

Fig. 1 Query word: “python”

Fig.2 Query word: “databases in python”

Fig.3 Query word: “Java Interview Questions”

Fig. 4 Part-of-speech tagging for query word: “memory management”

3. Data

We downloaded the Stack Exchange Creative Commons Data Dump, which has all the public data from websites like Stack Overflow, Server Fault, Stack Apps, etc., up to September 2011. The data files were in XML format, with each question and answer being an entry with the <row> tag. Because the relevant Stack Overflow data was over 4GB in size, we first built our database using the data from english.stackexchange.com, which was about 21MB in size. This allowed us to make progress in parallel on the database front and the algorithm front. To save time, we initially set everything up in an SQLite database. We processed the XML file using Python's built-in xml.dom library and stored the answers and questions in different tables. We cleaned the answer text by removing all the HTML tags and dropping the content of certain tags like <code> using BeautifulSoup, since we did not want to pass code snippets to our Findex engine. After cleaning the answer text, we merged the answer and question tables and got rid of all the extra data, such as unanswered questions and irrelevant answers. Using this temporary database, we started building our Findex engine and the user interface.

As the next step, we started building our database with the Stack Overflow data and switched from SQLite to MySQL. The xml.dom library did not work well for parsing a large data set, as it reads the entire data file into memory before processing it. We switched to xml.etree.cElementTree, which is a C-based library, and had to make some changes to the SQL statements to import the Stack Overflow data into the MySQL database. This gave us over 1 million questions with cleaned answer text. For our final demonstration and user interface, we used a subset of ten thousand questions in an SQLite database to improve the response time of the system.
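To make this pipeline concrete, the sketch below streams <row> elements from a dump file and strips HTML and <code> contents before storing the text. It is a minimal illustration, not our actual loader: the file name (posts.xml), database name (cleaned.db), and table schema are assumptions, and it uses the standard xml.etree.ElementTree module (the project used the C-accelerated cElementTree available at the time).

```python
# Sketch: streaming the posts XML dump and cleaning answer bodies.
# File, database, and column names here are illustrative assumptions.
import sqlite3
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

def clean_body(html):
    """Strip HTML tags and drop <code> contents so snippets never reach Findex."""
    soup = BeautifulSoup(html, "html.parser")
    for code in soup.find_all("code"):
        code.decompose()                  # remove code snippets entirely
    return soup.get_text(" ", strip=True)

conn = sqlite3.connect("cleaned.db")
conn.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER, post_type INTEGER, body TEXT)")

# iterparse streams <row> elements instead of loading the whole dump into memory
for _, row in ET.iterparse("posts.xml"):
    if row.tag != "row":
        continue
    conn.execute(
        "INSERT INTO posts VALUES (?, ?, ?)",
        (int(row.get("Id", 0)), int(row.get("PostTypeId", 0)), clean_body(row.get("Body", ""))),
    )
    row.clear()                           # free the element once it has been stored

conn.commit()
```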

4. Algorithms

We implemented a modified version of the Findex algorithm. The major steps of our implementation are described below.

4.1 Text Mining

This was one of the most important steps. Before doing the frequency calculation, the data had to be transformed into a particular format and stored in the SQLite database. The clean data had to be tokenized, lemmatized, and converted into n-grams (up to trigrams).

Tokenization - The answer text was tokenized using the NLTK tokenizer. Tokenization splits a string into a list of constituent words.

Stop word removal - In order to get more relevant categories after applying the frequency distribution on tokenized words, it was important to exclude stop words. Without excluding them, stop words would appear in the most frequent words list and would result in meaningless categories. We decided to use the stop word list from the linguistic tools resources of the Information Retrieval department of the University of Glasgow. This was an exhaustive list of stop words and worked well on our dataset. The list of stop words can be viewed here.

Lemmatization - After removing stop words, the next step was to lemmatize the words. Lemmatization was important: without it, simple inflections of words such as debug and debugging, list and listing, car and cars, would appear as separate categories. To prevent this, we used the WordNet lemmatizer to lemmatize the tokens.

n-grams - The next step was to create unigrams, bigrams, and trigrams. In order to formulate categories, we decided to go up to trigrams, as phrases of up to 3 words made more relevant categories than phrases consisting of more than 3 words. For every answer, we created a list of unigrams, bigrams, and trigrams and stored them in an ngrams table that we created. We stored the original unigrams, bigrams, and trigrams along with their lemmatized versions, as shown below.

Fig. 5 Table storing ngrams for tokenized answers
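A minimal sketch of this preparation step with NLTK is shown below. The stop word file name and the ngrams table schema (answer_id, n, phrase, lemma) are assumptions made for illustration and do not reflect our exact schema.

```python
# Sketch: tokenize, drop stop words, lemmatize, and build 1-3 grams for one answer.
# The stop word file and the ngrams table schema are illustrative assumptions.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

lemmatizer = WordNetLemmatizer()
STOPWORDS = set(open("glasgow_stopwords.txt").read().split())  # assumed local copy of the Glasgow list

def index_answer(conn, answer_id, text):
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    for n in (1, 2, 3):                                        # unigrams, bigrams, trigrams
        for gram, lemma_gram in zip(ngrams(tokens, n), ngrams(lemmas, n)):
            conn.execute(
                "INSERT INTO ngrams (answer_id, n, phrase, lemma) VALUES (?, ?, ?, ?)",
                (answer_id, n, " ".join(gram), " ".join(lemma_gram)),
            )
```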

4.2 User Query

The previous step was a one-time activity to load the dataset into the database. For each search, the user is asked to enter a query in the search interface.

Fig. 6 User Interface to input query term

Our program reads the query phrase entered by the user and executes the steps below:

Tokenize query terms - The search term "pointers in memory" is tokenized to ["pointers", "in", "memory"].

Lemmatize terms - The tokens are then lemmatized to ["pointer", "in", "memory"].

Stop word removal - After removing stop words we are left with ["pointer", "memory"].

Finding relevant questions - For each term in the user query, we check the ngrams table to fetch a list of phrases containing that term. Upon fetching the phrases containing the query terms, we calculate the frequency of the phrases and arrange them in descending order. A rough sketch of this query pipeline is shown below.
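The sketch below normalizes a query and looks up matching lemmatized phrases; it reuses the assumed ngrams schema from the indexing sketch above and is illustrative rather than our exact code.

```python
# Sketch: normalize the user's query and fetch candidate phrases from the ngrams table.
# Assumes the same illustrative schema as the indexing sketch above.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize_query(query, stopwords):
    tokens = [t.lower() for t in nltk.word_tokenize(query)]   # ["pointers", "in", "memory"]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]         # ["pointer", "in", "memory"]
    return [t for t in lemmas if t not in stopwords]           # ["pointer", "memory"]

def candidate_phrases(conn, terms):
    """Return every stored lemmatized phrase (with its answer id) containing a query term."""
    phrases = []
    for term in terms:
        rows = conn.execute(
            "SELECT lemma, answer_id FROM ngrams WHERE lemma LIKE ?", ("%" + term + "%",)
        )
        phrases.extend(rows.fetchall())
    return phrases
```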

4.3 Frequency distribution

After fetching all phrases containing the query terms, we calculate the frequency distribution of the phrases and filter the results by keeping the top 20 phrases. The figure below shows the phrases along with their frequencies for the query "pointers in memory". Each of these phrases is considered a separate category.

Fig. 7 Most frequent phrases

Upon getting the list of the top 20 phrases (by frequency), we also fetch the corresponding question and answer IDs. These are used later to display the search results.
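A minimal sketch of this ranking step, using a plain frequency count over the fetched phrases (the input format continues the assumed schema above):

```python
# Sketch: rank fetched phrases by frequency and keep the top 20 as categories.
from collections import Counter

def top_categories(phrases, k=20):
    """phrases: list of (lemma_phrase, answer_id) rows fetched for the query terms."""
    counts = Counter(lemma for lemma, _ in phrases)
    top = counts.most_common(k)                      # [(phrase, frequency), ...] in descending order
    ids_by_phrase = {}
    for lemma, answer_id in phrases:
        ids_by_phrase.setdefault(lemma, set()).add(answer_id)
    # keep the answer ids alongside each category so the results can be displayed later
    return [(phrase, freq, ids_by_phrase[phrase]) for phrase, freq in top]
```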

4.4 Hierarchy

In this project we went beyond the Findex algorithm and introduced a hierarchy in the categories. The unigrams in the resulting phrases were considered the top-level category, bigrams containing the unigram were considered the second-level hierarchy, and, similarly, trigrams containing the bigram were considered the third-level hierarchy. While displaying the search results, we showed all three levels (where applicable).

Below is a screenshot of the categories returned with the parent-child relationship. Parent represents the Level 1 category (unigrams), Children represent the Level 2 category (bigrams), and Grandchildren represent the Level 3 category (trigrams).

Fig. 8 Creating hierarchy of phrases
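A rough sketch of how such a unigram-bigram-trigram nesting can be assembled from the top phrases is shown below; it is purely illustrative and builds on the category list produced in the previous step.

```python
# Sketch: group top phrases into a three-level hierarchy by word count and containment.
def build_hierarchy(categories):
    """categories: list of (phrase, frequency, answer_ids) from the frequency step."""
    unigrams = [c for c in categories if len(c[0].split()) == 1]
    bigrams  = [c for c in categories if len(c[0].split()) == 2]
    trigrams = [c for c in categories if len(c[0].split()) == 3]

    tree = {}
    for uni, _, _ in unigrams:
        children = {}
        for bi, _, _ in bigrams:
            if uni in bi.split():                     # bigram contains the unigram
                grandchildren = [tri for tri, _, _ in trigrams if bi in tri]
                children[bi] = grandchildren
        tree[uni] = children                          # parent -> children -> grandchildren
    return tree
```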

4.5 Displaying results

Front end

The front end to display the results was built using the Flask framework. We connected to an SQLite3 database to fetch results, and the web page was built using Flask's Jinja2 templates. The search interface was simple, and the user could provide any type of search query. As described in the section above, the search query was parsed first, and the results were calculated on the fly and returned in hierarchical order. As shown in the figure below, the results were arranged by the frequency of the phrases. The hierarchy of the categories also facilitates search for the end user.

Fig. 9 Search results for query “pointers in memory”
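A bare-bones Flask sketch of such a search endpoint is shown below. The route, template name (search.html), database name, and helper functions are the assumed pieces from the sketches above, not our actual front-end code.

```python
# Sketch: minimal Flask front end wiring the query pipeline to a Jinja2 template.
# normalize_query, candidate_phrases, top_categories, build_hierarchy and STOPWORDS
# are the assumed helpers from the sketches above, not the project's actual code.
import sqlite3
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    conn = sqlite3.connect("cleaned.db")
    terms = normalize_query(query, STOPWORDS)
    phrases = candidate_phrases(conn, terms)
    categories = top_categories(phrases)
    # render the hierarchy plus the original query for display in the template
    return render_template("search.html", query=query, tree=build_hierarchy(categories))

if __name__ == "__main__":
    app.run(debug=True)
```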

The user can now select a particular category of interest. On selecting a category, the list of associated questions is displayed on the right pane. The answers can be viewed by clicking on a question. This interface makes it easy for users to browse through the list of questions corresponding to the selected category. In the example below, the user is only interested in the categories memory stream, pointers, memory layout, and memory leak. There are 10 questions corresponding to this selection.

Fig. 10 Selecting categories to view relevant search results

5. Further analysis

We did further analysis of the categories using part-of-speech tagging. We found some interesting results with this exercise, which could be used to classify the categories further. From the POS-tagged categories we found patterns such as Adjective-Noun, Verb-Noun, and Noun-Noun.

The Adjective-Noun and Noun-Noun combinations depicted the types of a category. We can see that transactional, variable, virtual, actual, and table are all types of memory.

Types:
[('transactional', 'JJ'), ('memory', 'NN')]
[('variable', 'JJ'), ('memory', 'NN')]
[('virtual', 'JJ'), ('memory', 'NN')]
[('actual', 'JJ'), ('memory', 'NN')]
[('table', 'JJ'), ('memory', 'NN')]

The Verb-Noun combination depicts various actions/usages of the noun in the category. The actions that can be performed on memory include writing, freeing, allocating, loading, sharing, etc. This pattern was clearly visible by identifying the verbs in the categories.

Actions:
[('writing', 'VBG'), ('memory', 'NN')]
[('freeing', 'VBG'), ('memory', 'NN')]
[('allocated', 'VBD'), ('memory', 'NN')]
[('related', 'VBD'), ('memory', 'NN')]
[('cached', 'VBD'), ('memory', 'NN')]
[('loaded', 'VBD'), ('memory', 'NN')]
[('shared', 'VBD'), ('memory', 'NN')]
[('written', 'VBN'), ('memory', 'NN'), ('tested', 'VBN')]
[('written', 'VBN'), ('memory', 'NN')]
[('string', 'VBG'), ('memory', 'NN')]

From the above analysis, we could clearly see a pattern in the categories returned. Just by looking at the parts of speech, it was easy to categorize the list further.
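A small sketch of this tagging step with NLTK is shown below; the pattern checks are an illustrative simplification of the Adjective-Noun, Noun-Noun, and Verb-Noun groupings described above.

```python
# Sketch: POS-tag category phrases and pick out type-like and action-like patterns.
import nltk

def classify_category(phrase):
    tagged = nltk.pos_tag(phrase.split())          # e.g. [('virtual', 'JJ'), ('memory', 'NN')]
    tags = [tag for _, tag in tagged]
    if tags[:2] == ["JJ", "NN"] or tags[:2] == ["NN", "NN"]:
        return "type", tagged                      # Adjective-Noun / Noun-Noun -> a kind of the noun
    if tags and tags[0].startswith("VB") and "NN" in tags[1:]:
        return "action", tagged                    # Verb-Noun -> an action performed on the noun
    return "other", tagged
```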

6. Contributions of Each Team Member

Tasks                                                            Shubham  Priya  Sonali
Initial parsing of Stack Overflow data                             100%     0%     0%
Initial loading of questions and answers into database              90%     5%     5%
Entire dataset tokenization, stop word removal, loading ngrams       0%    10%    90%
Frequency calculation                                               20%    20%    60%
Hierarchical categorization                                          0%    90%    10%
POS tagging                                                          5%    80%    15%
Front end                                                           15%     5%    80%
Documentation                                                       33%    33%    33%
Code cleanup                                                        33%    33%    33%

7. Code

We wrote the code for the Findex algorithm and the front end from scratch, along with the code extending Findex to do hierarchical categorization and part-of-speech tagging. Our code repository is available here: https://github.com/sonalisharma/nlp_sentence_clustering

8. Bibliography

[1] Kummamuru, Krishna, et al. "A hierarchical monothetic document clustering algorithm for summarization and browsing search results." Proceedings of the 13th International Conference on World Wide Web. ACM, 2004.
[2] Muralidharan, Aditi, and Marti Hearst. "WordSeer: Exploring language use in literary text." Fifth Workshop on Human-Computer Interaction and Information Retrieval. 2011.
[3] Käki, Mika, and Anne Aula. "Findex: Improving search result use through automatic filtering categories." Interacting with Computers 17.2 (2005): 187-206.

