_EECSE6893_001_2014_3_+Yelper-Final+Project+Report

Columbia University E6893 Big Data Analytics Fall 2014 Final Report

Yelp-er: Analyzing Yelp Data

Naman Jain, Naatasha Kenkre, Rhea Goel, Sanket Jain

Computer Science Department, Columbia University

[email protected], [email protected], [email protected], [email protected]

Abstract- This project analyzes data provided by Yelp, a consumer-centric, web-based platform that connects local businesses with users. To highlight how big data analytics can benefit the various stakeholders of a business model, we perform three tasks: one to benefit the users, one to aid the businesses, and one to help Yelp improve its own product. First, we generate a query-based HeatMap to better visualize search results. Next, we perform semantic analysis and topic modeling (using LDA) on user reviews to help businesses identify the latest trends. Last, we propose a gamification model for Yelp to increase user engagement.

Keywords- Yelp; HeatMap; Semantic Analysis; Topic Modeling; LDA; Gamification

I. INTRODUCTION

Retailers and consumer-packaged-goods (CPG) companies have long had access to vast amounts of transaction data: every day, companies capture information about every SKU sold to every customer at every store. In addition, companies regularly use sophisticated market-research techniques to analyze the latest market trends and usage patterns. The knowledge gained from this volume of data has become a key driver of growth for consumer-centric companies. Big data analytics supports, among other things, targeted marketing, user engagement, reduced consumer churn, portfolio strategy, and product development.

Yelp is one such consumer-centric platform that connects people with local businesses. Founded in 2004, Yelp had accumulated 67 million reviews of local businesses and about 139 million monthly unique visitors by Q3 2014. It has therefore generated massive amounts of data with tremendous potential, which was the primary motivation behind our project idea. In this project, we try to demonstrate how such data can be harnessed to benefit different stakeholders of a company. We perform three tasks: one to benefit the users, one to aid the businesses, and one to help Yelp improve its own product.

First, from the user perspective, a user may want to know where in a city the most bars or nightclubs are, which areas have the highest concentration of Chinese restaurants, or the answer to any similar cuisine- or interest-based query. Currently, Yelp only provides a list of search results, which can be sorted by distance from the current location, price, etc. There is no visualization of all the search results together. We therefore saw the need for a mechanism that helps users identify hubs for their interests in the city. We use HeatMaps for this purpose and highlight the areas on Google Maps that have the largest number of search results for a particular query. This makes the application more usable by enhancing the look and feel of the user interface.

Next, for the businesses, we perform semantic analysis and topic modeling on user reviews to identify the aspects of a business that reviewers discuss most. For instance, for a restaurant, we can identify which dishes Yelpers talk about the most, or what the overall opinion of the place is in terms of service, food, staff, etc. This can help businesses identify their strengths and weaknesses and, consequently, capitalize on insights about what attracts users.

Last but not least, we propose a gamification model for Yelp itself. Gamification is the application of typical elements of game playing (e.g., point scoring, competition with others, rules of play) to other areas of activity, typically as an online marketing technique to encourage engagement with a product or service. This user-engagement strategy has been used widely in the past few years and is generally regarded as effective. Our motivation was our prior experience with Zomato, a similar platform commonly used in some major cities of India. Zomato took restaurant reviews to a whole new level, with users competing to visit more places, earn more points, and become a ‘better’ foodie. Yelp, on the other hand, incorporates this idea only through the ‘Elite’ community, which is an exclusive, in-the-know crew. Hence, we extend the concept to all users of Yelp, aiming to drive better customer retention and lifetime value. This could also help Yelp increase its customer base, market value, brand recognition, and user loyalty.


The remainder of this report covers the literature review, the data source and its description, and the algorithms and software used for each of the three tasks.

II. RELATED WORKS

Given that Yelp is a popular consumer review platform, most related work has centered on analysis of user review content. The most widely studied topic is the correlation of user reviews with the reputation or average rating of restaurants and other businesses on Yelp, such as the optimal aggregation of consumer ratings by Dai et al. [5]. Another interesting study is [1], in which Luca examines how a restaurant's rating is correlated with its revenue by combining Yelp user review data with revenue data obtained from government sources. One important finding of the study was that a one-star increase in Yelp rating leads to a 5-9 percent increase in revenue.

Another common area of study is the credibility of reviews on this consumer website. Online reviews have become a valuable resource for decision-making. However, their usefulness brings with it the problem of deceptive opinion spam, in which business owners commit review fraud either by leaving positive reviews for themselves or negative reviews for their competitors. There has been significant research on the extent and patterns of review fraud on Yelp. For instance, in [2] Luca and Zervas highlight the extent of review fraud and suggest that a business’s decision to commit review fraud responds to competition and reputation incentives. Likewise, Mukherjee et al. [3] try to identify the algorithm behind Yelp's trade-secret fake review filter and find that behavioral features are much more effective than linguistic features.

Furthermore, there has been significant linguistic analysis as well, mostly revolving around sentiment analysis and topic modeling [8], [9]. Another area investigates which reviews are most likely to be relied upon by users, or what factors make a review more useful. For example, Tucker analyzes the power and influence of elite and non-elite members of the community, using speech code theory to explain and evaluate how users communicate by posting reviews on Yelp [4]. The study reveals that, overall, opinion leaders on Yelp (regular users who have earned elite status in the community) carried more authority with review readers than non-elite members. In addition, there has been research not based on user review analysis, such as [6], which studies why individuals use Yelp.com from a uses and gratifications perspective. The results of this survey-based study indicate that individuals overwhelmingly use Yelp.com for information-seeking purposes, followed by entertainment, convenience, interpersonal utility, and passing time.

III. SYSTEM OVERVIEW

This year, Yelp introduced an “Academic Dataset”, a deep dataset drawn from its wealth of data for research-minded academics. Unlike many standard datasets, it has real-world relevance for research projects. The Challenge Dataset includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh:

• 42,153 businesses
• 320,002 business attributes
• 31,617 check-in sets
• 252,898 users
• 955,999 edge social graph
• 403,210 tips
• 1,125,458 reviews

The question that arises is what we can learn or predict from such varied data. There are myriad options. We could guess a review's rating from its text alone. We could take all of the reviews of a business and predict when it will be busiest, or when it is open. We could predict whether a business is good for kids, has Wi-Fi, or offers parking. We could also predict what makes a review useful, funny, or cool, or figure out which business a user is likely to review next. Other tasks include predicting business categories with a clustering algorithm, predicting star ratings using sentiment analysis, or building a visualization of great local businesses. Clearly, there are innumerable ways to extract meaningful information from this data. Given our time constraints and our knowledge at this point, we chose the three tasks described above.

Following is a brief description of what the dataset provides. Yelp provides reviews of the 250 closest businesses. The dataset is a single gzip-compressed file composed of one JSON object per line. Every object contains a 'type' field, which indicates whether it is a business, a user, or a review. We therefore deal with three different kinds of objects:

• Business Objects: The business objects contain basic information about local businesses. The 'business_id' field is a unique identifier for the business and can be used with the Yelp API to fetch even more information for visualizations. The other fields are as follows:

o name : the full business name

o neighborhoods : a list of neighborhood names (might be empty)

o full_address : the localized address of the business

o city & state : the city and state, respectively, in which the business resides

o latitude & longitude : the latitude and longitude of the business location

o stars : the star rating for the business

o review_count : the number of reviews that have been written for this business

o photo_url : the URL for the picture associated with the business

• Review Objects: The review objects contain the review text, the star rating, and information on the votes that Yelp users have cast on the review. The user_id and business_id fields are of substantial use: user_id associates the review with others by the same user, and business_id associates it with other reviews of the same business. The review object has the following fields:

o user_id : the identifier of the user who wrote the review

o stars : the star rating, an integer between 1 and 5

o text : the review text

o date : the date the review was written

o votes (useful, funny and cool) : the number of votes of each kind that users have cast on the review

• User Objects: The user objects contain aggregate information about a single user across all of Yelp (including businesses and reviews not in this dataset). The fields in this object are as follows:

o type : user

o user_id : the unique user identifier, as in the review objects

o name : the first and last name of the user

o review_count : the number of reviews that have been written by this user

o average_stars : the floating-point average of the star ratings given by this user

o votes (useful, funny and cool) : the count of useful, funny and cool votes across all the reviews by this user
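The file layout described above (one JSON object per line, gzip-compressed, with a 'type' field on every object) can be read with a few lines of Python. The following is a minimal sketch, not the code used in the project; the function names group_by_type and load_dataset and the file path passed to them are illustrative assumptions.

```python
import gzip
import json
from collections import defaultdict


def group_by_type(lines):
    """Parse JSON-lines text and group the objects by their 'type' field."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        grouped[obj["type"]].append(obj)
    return grouped


def load_dataset(path):
    """Read a gzip-compressed file with one JSON object per line.

    `path` is a hypothetical location of the dataset file.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return group_by_type(handle)
```

Calling load_dataset on the dataset file would yield a dictionary keyed by 'business', 'review' and 'user', each mapping to the list of objects of that type.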

IV. ALGORITHM AND SOFTWARE PACKAGE DESCRIPTION

Since we performed three tasks, we discuss each task in detail in this section and describe the algorithm and software used.

I. Query-based HeatMap

A HeatMap is a graphical representation of data in which the individual values contained in a matrix are represented as colors. We use Google Maps as our base and highlight hubs for specific queries. The higher the density of results for a search term (e.g., Sushi) in a particular area, the more intense the color of that area. The overall process flow involved the following steps:

• Filter Restaurants: Yelp hosts many kinds of businesses, including doctors, hair stylists, and other services. For this task, we worked only with restaurants and cuisine-based queries.

• Extract location coordinates: Using the dataset given in the JSON format, we extracted the latitude and longitude for each business appearing in a query’s search results.

• Integrate Google Maps API: These coordinates were then passed to the script that generates the HeatMap using the Google Maps API.

• HeatMap for query results in chosen city: Finally, we created a web-based UI to exhibit this functionality. The user can change the query and city from drop-down menus and see the relevant areas highlighted on the map.

We used the Google Maps JavaScript API v3, provided by Google, to overlay the HeatMap on Google Maps. Using this, we can visualize the HeatMap on an interactive map of the world (https://www.google.com/maps). We also provide options to change the gradient, map view (Satellite/Terrain), radius, color, etc. To make the demonstration more interactive, we added drop-down menus to change the search query and the desired city. The HeatMap for the given search query is then generated in real time.
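The data-preparation half of this flow (filter restaurants, match the query, extract coordinates) can be sketched in Python as follows. This is an illustrative sketch rather than the project's script: it assumes each business object carries a 'categories' list, which is not among the fields listed in Section III, and the helper names heatmap_points and grid_density are hypothetical. The second helper only illustrates what "density" means by counting points per grid cell; the actual rendering is left to the map layer.

```python
import json
import math
from collections import Counter


def heatmap_points(businesses, query, city):
    """Return lat/lng points for restaurants in `city` whose categories match `query`."""
    query, city = query.lower(), city.lower()
    points = []
    for biz in businesses:
        # 'categories' is an assumed field holding labels such as "Sushi Bars".
        categories = [c.lower() for c in biz.get("categories", [])]
        is_restaurant = "restaurants" in categories
        matches_query = any(query in c for c in categories)
        in_city = biz.get("city", "").lower() == city
        if is_restaurant and matches_query and in_city:
            points.append({"lat": biz["latitude"], "lng": biz["longitude"]})
    return points


def grid_density(points, cell_degrees=0.01):
    """Count points per square grid cell of side `cell_degrees` degrees."""
    counts = Counter()
    for p in points:
        cell = (math.floor(p["lat"] / cell_degrees),
                math.floor(p["lng"] / cell_degrees))
        counts[cell] += 1
    return counts


def to_json(points):
    """Serialize the points so a front-end script can consume them."""
    return json.dumps(points)
```

The JSON string produced by to_json could then be handed to whatever front-end script draws the heat layer; cells with higher counts in grid_density correspond to the areas that would appear most intense.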


Figure 1. HeatMap for search query “Sushi” in Phoenix

A sample screenshot of a HeatMap for “Sushi” in Phoenix is shown in Figure 1.

II. Semantic Analysis and Topic Modeling

The overall process flow for this task can be described as follows:

• Extraction of reviews: We first extracted reviews for all the restaurants (again, we focus only on restaurants, but this is easily extendable to all businesses) and treated each review as a separate document.

• Topic Modeling (LDA): We then performed topic modeling using Latent Dirichlet Allocation, the most common algorithm for this purpose.

• Top k words: We then identified relevant topics and chose the top k words that represent each topic with a good relevance measure.

• Word Cloud: Finally, we visualized our results using word clouds.
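The first step of this flow, turning each restaurant review into a separate document, can be sketched as below. The stop-word list, the tokenization rule and the function names are assumptions made for illustration; a real run would rely on the stop-word handling of the topic modeling toolkit.

```python
import re

# A deliberately tiny, illustrative stop-word list (an assumption, not the one used in the project).
STOPWORDS = {"the", "and", "was", "were", "for", "with", "this", "that",
             "are", "but", "not", "have", "had", "they", "you", "very"}

TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def review_documents(reviews, restaurant_ids):
    """Return one token list per review whose business_id is a restaurant."""
    return [tokenize(r["text"]) for r in reviews
            if r["business_id"] in restaurant_ids]
```

Each returned token list plays the role of one document in the topic model.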

Topic modeling has been in use for a while now and is a valuable tool for language processing and text analysis. In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic models are thus a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. We have already highlighted the commercial value of such analysis: it can help businesses identify their strengths and weaknesses, so that they can leverage their strengths and discover their Unique Selling Point (USP).

We used Latent Dirichlet Allocation (LDA), which is perhaps the most common topic model currently in use. It is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002 that allows documents to have a mixture of topics [10]. Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations that constitute topics. LDA itself is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
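To make the idea of "each word is attributable to one of the document's topics" concrete, the following is a small collapsed Gibbs sampler for LDA written in plain Python. It is a teaching sketch under stated assumptions, not the implementation used in this project (which relied on MALLET): the hyperparameters alpha and beta, the iteration count and the function names are illustrative choices, and the code is far too slow for a corpus of a million reviews.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized `docs` with collapsed Gibbs sampling.

    Returns (vocab, topic_word_counts, doc_topic_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return vocab, topic_word, doc_topic


def top_words(vocab, topic_word, k=5):
    """Return the k highest-count words for every topic."""
    result = []
    for row in topic_word:
        order = sorted(range(len(vocab)), key=lambda i: -row[i])
        result.append([vocab[i] for i in order[:k]])
    return result
```

The doc_topic counts express each document as a mixture of topics, and top_words gives the "top k words" used to label a topic.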


Figure 2. Word Cloud for identified topic ‘Nightlife/Bar’

Many open-source software packages for topic modeling have been released. We used MALLET, a Java-based package for statistical NLP, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of LDA, Pachinko Allocation and Hierarchical LDA. It helped us identify topics, along with the strength of each topic and the relevance of each word within each topic. A sample word cloud for the topic ‘nightlife/bar’ is shown in Figure 2.
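A word cloud draws each word at a size that reflects its relevance to the topic. One simple way to do this, shown here as a hedged sketch, is to scale word weights linearly into a font-size range. The function name font_sizes, the size bounds and the example weights are assumptions for illustration and do not describe how Figure 2 was produced.

```python
def font_sizes(word_weights, min_size=10, max_size=48):
    """Map each word's weight linearly onto a font size in [min_size, max_size]."""
    if not word_weights:
        return {}
    lowest = min(word_weights.values())
    highest = max(word_weights.values())
    span = highest - lowest
    sizes = {}
    for word, weight in word_weights.items():
        # When all weights are equal, draw every word at the largest size.
        fraction = (weight - lowest) / span if span else 1.0
        sizes[word] = round(min_size + fraction * (max_size - min_size))
    return sizes
```

For example, with hypothetical weights {"bar": 30, "beer": 20, "music": 10} the words would be drawn at sizes 48, 29 and 10 respectively.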

III. Gamification

We propose a gamification model for Yelp.com based on an analysis of users’ activity on the website. For instance, review count, the number of reviews written by a user, is a good measure of his/her involvement on the website as a contributor. The website typically sees two kinds of traffic: readers and contributors. Yelp’s aim should be to encourage more and more contributors in order to increase its market value. Similarly, the number of fans and friends of a user is a good indicator of how good he/she is at networking. Identifying the relatively more social users has many potential advantages as well.

However, the number of reviews written by a user is not sufficient on its own, as we also need a mechanism to check the quality of the content provided. Here, the number of votes and compliments can be used, as these numbers suggest how useful the content provided by a Yelper is. Another important measure of trust is how long the user has been active on Yelp. A person who has been on Yelp for several years and has written only a handful of reviews is probably not as active as someone who joined only recently and has already written more than two reviews every month. A regular reader might find reviews written by the latter kind of user more reliable and up-to-date. Thus, we use all the available information to assign tags to users, such as ‘Popular’, ‘Social’, ‘Newbie’, ‘Lazybones’, ‘Super Active’, and ‘Dependable’. As discussed before, such a model encourages activity and users’ contributions to the community, thereby helping Yelp increase its customer base and brand value and, perhaps, dominate the competitive market.
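A rule-based tagger along these lines could look like the sketch below. All thresholds in DEFAULT_THRESHOLDS are hypothetical placeholders (only the "more than two reviews every month" rule comes from the text), and the input keys 'months_active', 'fans' and 'friends' are assumed to have been derived from the user data beforehand.

```python
# Hypothetical thresholds; in practice they would be tuned on the data.
DEFAULT_THRESHOLDS = {
    "newbie_months": 3,          # joined within the last 3 months
    "super_active_rate": 2.0,    # more than two reviews per month
    "lazy_rate": 0.1,            # at most one review every ten months
    "popular_fans": 10,
    "social_friends": 50,
    "dependable_votes": 2.0,     # average votes received per review
    "dependable_min_reviews": 10,
}


def assign_tags(user, thresholds=None):
    """Return the list of gamification tags earned by `user`."""
    t = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    months = max(user["months_active"], 1)
    reviews = user["review_count"]
    reviews_per_month = reviews / months
    votes_per_review = sum(user["votes"].values()) / max(reviews, 1)

    tags = []
    is_newbie = user["months_active"] <= t["newbie_months"]
    if is_newbie:
        tags.append("Newbie")
    if reviews_per_month > t["super_active_rate"]:
        tags.append("Super Active")
    elif reviews_per_month <= t["lazy_rate"] and not is_newbie:
        tags.append("Lazybones")
    if user.get("fans", 0) >= t["popular_fans"]:
        tags.append("Popular")
    if user.get("friends", 0) >= t["social_friends"]:
        tags.append("Social")
    if (votes_per_review >= t["dependable_votes"]
            and reviews >= t["dependable_min_reviews"]):
        tags.append("Dependable")
    return tags
```

Normalizing by months active and by review count keeps long-standing but inactive accounts from outranking recent, prolific contributors, which is the concern raised above.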

V. EXPERIMENT RESULTS


The tasks we performed were all unsupervised and, as far as our literature review shows, had not been applied to Yelp before. Hence, we did not have any ground truth to compare our results with, and we adopted qualitative evaluation strategies for each task.

To evaluate the usefulness of the query-based HeatMaps, we conducted an online survey with several questions comparing the current and the proposed user interface for search query results. We also asked the users to rate the usefulness of HeatMaps in this scenario on a scale of 1 to 5; 65% of users rated it 4 or 5, and the average rating was 3.5. Further, about 73% of users agreed that this model is visually more appealing than the current list view of search results. In addition, about 95% of users said that they have, at least once, wanted to know about the hubs of a particular cuisine in their city. This suggests that the addition to the Yelp UI is desired and that HeatMaps are an effective way to provide it.

Next, for topic modeling, we inspected the identified topics manually and found them relevant and coherent. We also reviewed the results with our project leader.

Last, for the gamification model, we wanted a reasonable percentage of users to fall into each category. For instance, we do not want most people to be tagged ‘Lazybones’, as that would be demotivating. On the flip side, we also do not want almost everyone to be declared ‘Super Active’, as that would diminish the value of the tag. Hence, we decided on the percentage of users that should belong to each category and adjusted our algorithm to match these values, thereby ensuring that the gamification model is neither too lenient nor too harsh.
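One way to adjust a tag so that a chosen percentage of users receives it is to derive the threshold from the ranked distribution of the underlying score. The sketch below illustrates this idea; it is an assumption about how such calibration could be done, not a description of the exact procedure we followed, and the function names are hypothetical.

```python
def threshold_for_share(values, share):
    """Return the cutoff such that about `share` of `values` are >= the cutoff."""
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    count = max(1, round(share * len(ranked)))
    return ranked[count - 1]


def tag_share(values, threshold):
    """Fraction of `values` that meet or exceed `threshold`."""
    return sum(1 for v in values if v >= threshold) / len(values)
```

For instance, if the scores are reviews per month and the target share for ‘Super Active’ is 20%, threshold_for_share returns the rate a user must reach; tag_share can then confirm the resulting share, which may be slightly higher when many users tie at the cutoff.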

VI. CONCLUSION

This project highlights the power of big data analytics by analyzing the Yelp dataset and performing three different tasks to benefit all stakeholders of the business model: the Yelpers, the business owners, and Yelp.com itself. We visualize search query results using a HeatMap, perform topic modeling on user reviews to help businesses understand the latest trends, and propose a gamification model to encourage user participation.

ACKNOWLEDGMENT

The authors would like to thank Professor Ching-Yung Lin, who supported us throughout the course of this project. We are thankful for his inspiring guidance, invaluable constructive criticism and friendly advice during the project work. We are sincerely grateful to Bhavdeep Sethi, leader of the Retail team, for sharing his candid and illuminating views on a number of issues related to the project. We express our gratitude to him and to all the other members of the teaching staff, who were always very responsive in providing necessary information, and without whose generous support this project would not have taken the shape it has.

REFERENCES

[1] Luca, Michael. "Reviews, Reputation, and Revenue: The Case of Yelp.com." Harvard Business School Working Paper, No. 12-016, September 2011.

[2] Luca, Michael and Zervas, Georgios, Fake It Till You Make It: Reputation, Competition, and Yelp Review Fraud (November 26, 2014). Harvard Business School NOM Unit Working Paper No. 14-006.

[3] Mukherjee, A.; Venkataraman, V.; Liu, B.; Glance, N. “What Yelp Fake Review Filter Might Be Doing?”, International AAAI Conference on Weblogs and Social Media, June 2013.

[4] Tiana Tucker (2011). “Online Word of Mouth: Characteristics of Yelp.com Reviews”. The Elon Journal of Undergraduate Research in Communications, Vol. 2, No. 1, 37-42.

[5] Weijia Dai, Ginger Z. Jin, Jungmin Lee, Michael Luca (2012). “Optimal Aggregation of Consumer Ratings: An Application to Yelp.com”. National Bureau of Economic Research, NBER Working Paper Series, Paper No. 18567, JEL No. D8, L15, L86.

[6] Amy Hicks, Stephen Comp, Jeannie Horovitz, Madeline Hovarter, Maya Miki and Jennifer L. Bevan. “Why people use Yelp.com: An exploration of uses and gratifications”. Computers in Human Behavior, Volume 28, Issue 6, November 2012, Pages 2274–2279.

[7] Maria R. Ebling and Ramón Cáceres. “Gaming and Augmented Reality Come to Location-Based Services”, IEEE Computer Society, Issue No. 01, January-March (2010 vol. 9), pp. 5-6.

[8] Geoffrey Levine and Gerald DeJong. “Automatic Topic Model Adaptation for Sentiment Analysis in Structured Domains” Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, November 8 – 13, 2011, 75-83.

[9] Bin Lu; Ott, M.; Cardie, C.; Tsou, B.K., "Multi-aspect Sentiment Analysis with Topic Models," Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pp. 81-88, 11 Dec. 2011.

[10] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). Lafferty, John (ed.). "Latent Dirichlet allocation". Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.
