Engineering/Operations A Tr i pAd v i s o r B l o g
Using NLP to Find “Interesting” Collections of HotelsCraig Schmidt posted October 5, 2015
There are a lot of hotels on TripAdvisor. At the moment, there are 1790 hotels listed for Paris, 1054 hotels for London, and 466 hotels in New YorkCity. We have been working on better ways to explore these hotels, and find an interesting place to stay. In this blog post, I’ll describe an upcomingfeature on TripAdvisor, that uses Natural Language Processing (NLP) to find groups of hotels that have an interesting theme. It is a semi-automated process that can be scaled across many cities. There is some interesting technology behind it.
For New York City, we might show a list of collections like this…
WORK AT TRIPADVISOR
We have collections for “Times Square Views”, “Trendy Soho”, “Catch a Show”, and “Art Deco Classic”, among others. Each of those is a group ofabout 10 hotels that fit the theme. How can we automate finding these interesting groups, even for cities we’re not familiar with?
Topic Modeling is BoringThe most textbook approach might be to build a topic model based on the review text for the hotels. I first tried a Latent Dirichlet Allocation (LDA)model. It assumes that each review is a mixture of a number of latent topics. You can represent each review as a vector with a “bag of words”representation of the review text.
I tried building an LDA, model, but got very boring results. Here are some of the first topics for New York, which a mix of a bunch of general hotelrelated topics.
Single words just don’t have the richness to represent interesting topics. What about ‘breakfast’? Was it good, bad, free, or wonderful? Most hotelreview mix up talking about the room, the location, the staff and more all at once,and these topics show that mixing. But we want to just focus ona single notable aspects of a hotel, not summarize it completely.
Phrases as raw materialSince single words aren’t very expressive, I wanted to find some good phrases in our review text. What is a phrase, exactly? In this case, we meanan n-gram of words that has a specfic semantic meaning. We tried the approach from the paper “Mining Quality Phrases from Massive TextCorpora”, by Liu et. al. at the Univeristy of Illinois and Microsoft (some interesting slides). Because they released an open source implementationtheir SegPhrase algorithm, we were able to try it out.
It is a multiple step process, but the output is that you get a large set of phrases, and a score on how likely it is to be a phrase. We dubbed thisscore the “phrasiness” metric. So what are the strongest phrases in New York?
123456789
0.018 breakfast + 0.015 nice + 0.015 coffee + 0.011 small + 0.009 staff + 0.008 location + 0.008 great + 0.008 clean + 0.032 staff + 0.020 great + 0.016 location + 0.014 stayed + 0.013 friendly + 0.012 helpful + 0.011 new + 0.010 york + 0.015 location + 0.015 great + 0.014 square + 0.013 times + 0.010 walk + 0.010 subway + 0.009 clean + 0.008 around + 0.007 0.010 new + 0.007 distrikt + 0.007 great + 0.006 york + 0.006 breakfast + 0.006 city + 0.006 nyc + 0.005 service + 0.005 0.012 staff + 0.009 yotel + 0.009 desk + 0.008 time + 0.007 front + 0.006 arrived + 0.006 made + 0.006 service + 0.005 0.023 new + 0.019 breakfast + 0.018 york + 0.013 square + 0.011 time + 0.011 empire + 0.010 state + 0.010 building 0.019 great + 0.017 staff + 0.016 bar + 0.012 breakfast + 0.012 location + 0.011 wine + 0.011 new + 0.009 york + 0.009 0.019 great + 0.019 park + 0.018 square + 0.017 location + 0.016 times + 0.016 central + 0.014 staff + 0.012 walking 0.025 air + 0.010 conditioning + 0.009 staff + 0.008 night + 0.007 service + 0.007 stayed + 0.006 smoking + 0.006 new
123456789101112131415161718
club_quarters 0.999620288 hampton_inn 0.999542304 rush_hour 0.999526829 frosted_glass 0.999506234 ritz_carlton 0.999476254 usa_today 0.999473892 jersey_boys 0.999458328 holiday_inn_express 0.999450456 art_deco 0.9994495 gordon_ramsay 0.999448261 battery_park 0.999418922 grand_central_station 0.999410719 naked_cowboy 0.999401511 yankee_stadium 0.999390047 penn_station 0.999386755 columbus_circle 0.999381838 charlie_chaplin 0.999381761 scrambled_eggs 0.999379073
The number shown is the phrasiness score. All of these are clear semantic concepts, not just a sequence of words that frequently co-occur.
This is a list of “phrases” with much lower “phrasiness” scores:
You can imagine that these phrases frequently occur in reviews, but don’t really represent a single concept.
We used a threshold of 0.5 for phrasiness, which reduced the approximately 250 thousand potential phrases for New York to around 35 thousand.While people often talk about air_conditioning in a review, and it is clearly a phrase, that’s not the kind of aspirational topic that people will want tobrowse through. Importantly, there are plenty of phrases that have more interesting emotional content, like art_decoandradio_city_music_hall. So we have the raw material we need to find collections. In fact, we may have too much raw material, as there are tensof thousands of a good phrases just for New York hotels.
word2vec Party TricksWe have a bunch of phrases, and many of them are very similar. These phrases …
192021222324252627282930
jet_lag 0.999370422 affinia_dumont 0.999364144 harry_potter 0.999357816 les_halles 0.999352377 air_conditioning 0.999346666 mamma_mia 0.999345891 hudson_river 0.999345247 pinot_noir 0.999344796 woody_allen 0.999337025 fairy_tale 0.999306646 grand_central 0.999304571 radio_city_music_hall 0.999301883
123456789101112131415
the_staff_were_great 0.116338786 located_on_the_ground_floor 0.116338612 the_rooms_are_very_small 0.116337216 requests_were_honored 0.116334616 an_older_crowd 0.11633381 booked_our_return 0.116329291 quite_rudely 0.116321934 and_pedro 0.116312719 helped_ourselves 0.11630484 after_spending_over 0.116302541 my_husband's_birthday 0.116302346 being_slightly 0.116297622 no_inconvenience 0.116296059 far_greater 0.116295743 longer_period_of_time 0.116291693
1 amazing_view_of_central_park
seem like a great basis for making a hotel collection. But they are mixed in with tens of thousands of other phrases. It would be very helpful tocluster the related phrases into groups. But what features should we use to do the clustering? We need a similarity metric that puts these phrasesclose to each other.
The word2vec algorithm from Mikolov and other Google researchers is perfect for this case. Given a corpus of text, it maps each word into anumerical vector of say 100 dimensions. The mapping is done in a way that put words that are used in a similar way in the corpus are near eachother in the vector space.
The papers on word2vec give lots of party tricks you can do by adding and subtracting the vectors. For example vector(‘Paris’) – vector(‘France’) +vector(‘Italy’) results in a vector that is very close to vector(‘Rome’), and vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to thevector(‘queen’).
I took all 9 billion words of review text at TripAdvisor, and build a word2vec model with around 1 million unique words. I used the C implementationof word2vec which can be easily run from the command line.
Here are some example words, and their nearest neighbors:
2345678910111213
amazing_views_of_central_park beautiful_view_of_central_park beautiful_views_of_central_park central_park_views fantastic_view_of_central_park fantastic_views_of_central_park great_view_of_central_park great_view_over_central_park great_views_of_central_park nice_view_of_central_park spectacular_view_of_central_park wonderful_views_of_central_park
123456789101112131415
great terrific 0.932 fantastic 0.908 fabulous 0.876 wonderful 0.861 brilliant 0.852 fab 0.849 superb 0.818 good 0.787 lovely 0.782 phenomenal 0.773noisy noisey 0.934 loud 0.880 nosiy 0.748
The numbers in the table are the cosine similarity of the top word with the near neighbors. We can see that word2vec is doing a nice job ofmapping related words to be near each other.
For our short phrases, we just added the individual word vectors to create a phrase vector, ignoring stop words. So now we have a numericalvector of features for each phrase, and we can use any clustering technique.
With word2vec, people usually use the cosine distance, where a larger value is better. However, in clustering we usually work with a distance,where smaller is better. If we normalize our word2vec vectors so they have a unit norm, then we can just minimize the Euclidean distance, as thisis equivalent to maximizing the cosine similarity (see the Properties section here for the simple derivation). So going forward, we will use theEuclidean distance of the two normalized word2vec phrase vectors as our distance metric.
Clustering phrases into tight topicsUsually, in a clustering problem we are interested in all of the clusters. Here, we have a slightly different situation. We want to group similarkeywords into very focused groups, like our “view of central park” phrases, above. A small fraction of the clusters, say 100 or so, will be the basis
1617181920212223242526272829303132333435363738394041424344
noicy 0.729 distracting 0.726 busy 0.711 crowded 0.705 annoying 0.696 disruptive 0.693 rowdy 0.685rude unfriendly 0.908 unhelpful 0.898 arrogant 0.882 impolite 0.876 unprofessional 0.872 surly 0.864 dismissive 0.849 abrupt 0.843 ignorant 0.840 hostile 0.826bed beds 0.814 mattress 0.780 sofabed 0.717 couch 0.709 matress 0.708 duvet 0.690 bedroom 0.683 futon 0.653 beds 0.652 room 0.650
for our collections. The rest of the clusters will be focused on more boring topics like air conditioning, and we’ll just ignore those.
What clustering technique will do a nice job of giving us those tight clusters? The default k-means approach for a fairly small k isn’t probably goingto work too well. It will produce broad groups, and concentrate on getting each phrase into a good cluster, even though we don’t really care aboutmost of the phrases.
I chose the agglomerative clustering routine in the python sklearn library to do the clustering. The “complete linkage” option will try the hardest tokeep each cluster pure.
Agglomerative clustering starts with each phrase in its own cluster, so we have N clusters of size 1. It then goes through and tries to combine eachpair of current clusters into a new combined one. With complete linkage, the score of the combined cluster is the maximum of the distancesbetween every pair of phrases in the combined cluster. The two clusters with the smallest maximum score after combination are actuallycombined, leaving us with N-1 clusters. This process is repeated until we have 2 clusters left.
The overall combination process forms a tree of nodes. At internal nodes, a cluster is composed of the phrases that are leaves above the node inthe tree. If we pick a threshold score, we can prune off subtrees where every cluster has a score better than the threshold. We only need to makethe agglomerative tree once, and then we can quickly experiment with different thresholds.
Let’s look at an example for our New York city hotel data. I chose 18 phrases, such that they formed some natural clusters. There are 4 keywordsabout the view, 6 about being comfortable, 2 about the wi-fi, 1 about breakfast, 3 about friendly staff, and 2 about feeling safe. With luck, thephrases should be clustered into these natural clusters at some point. Here is the clustering, with the natural clusters shown as different colors.
The number shown in internal nodes provide the order of the merge, where smaller numbers were merged first, and hence more similar. The leafnodes were given numbers 0 to 17, and so internal node 18 was the first combined cluster with comfortable_beds_and_pillowscomfy_bed_and_pillows. The next node 19 combined friendly_accommodating_staff and friendly_assistance, etc.
In general, the clustering did an excellent job on the test case. A phrase like sooooo_comfy is the least like the other phrases talking about comfort,so it goes in last among the red group.
With the right threshold, we can get natural clusters by themselves. In practice, we can get many useful clusters to use for collections.
We can also go back and find the hotels that frequently mention the phrases in a cluster. These hotels would be the member of the collectionformed by a given cluster.
The human curation post-process
Up to this point we have had a totally automated process, that use phrase detection, word2vec features, and agglomerative clustering to findclusters of phrases. The remaining step is to find the interesting ones. At this point we rely on human curation to pick out the interesting ones, andgive them a snappy name.
The manual process is fairly easy. We provide a list of phrases for a cluster, the hotels that are the most related to the cluster, even someexamples sentences from the review text for those hotels. It is fairly easy to look at that information, and determine if it would a good way toexplore the hotels of a city. It is hard to mathematically define what is interesting, but easy for a human to know when they see it. The human alsocomes up with a clever name, which is also simple given the list of phrases.
Some interesting collectionsThe “Catch a Show” collection has phrases like this:
My personal favorite when I’m in New York, the “Near The High Line” collection has:
12345678910111213141516171819202122232425
at_radio_city_music_hall b'way_shows beacon_theater beacon_theatre broadway_dance_center broadway_play broadway_plays broadway_shows broadway_shows_and_great_restaurants broadway_shows_and_restaurants comedy_shows david_letterman_show easy_walk_to_broadway_shows evening_entertainment great_shows radio_city_hall radio_city_music radio_city_music_hall radio_city_music_hall_and theater_shows theatre_shows walking_distance_to_broadway_shows walking_distance_to_broadway_theaters walking_distance_to_shows walking_distance_to_theatre
12345
chelsea_market_and_high_line chelsea_market_and_the_highline high_line high_line_park highline_park
Arnab Dutta says:November 3, 2015 at 11:22 pm
Nice post!Came across this post as I was searching for a POC for a large micro-blogging site. Just to add, did you consider adding SemanticMarkup/Annotations to the concepts/entities extracted using other information sources?
REPLY
Your email address will not be published. Required fields are marked *
Comment
Name *
Email *
Leave a Reply
The whole process provides insight into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby attractions. Allthe things someone staying in the hotel might be interested in. Different cities will result in different collections.
Once the user has some ideas, they can focus in on a few hotels, and see if they would be a great place to stay. It wouldn’t be possible without theinsights that Natural Language Processing provides.
One response to “Using NLP to Find “Interesting” Collections of Hotels”
678
highline_walk highline_walkway the_high_line_park
POST COMMENT