DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Cognitive Search Engine Optimization

JOAKIM EDLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Cognitive Search Engine Optimization

JOAKIM EDLUND

Master in Computer Science
Date: August 10, 2020
Supervisor: Johan Gustavsson
Examiner: Olov Engwall
School of Electrical Engineering and Computer Science
Host company: Pacing
Swedish title: Kognitiv sökmotoroptimering


Abstract

The use of search engines is a common way to navigate through information today. Information retrieval is the field of finding documents in large unstructured collections. Within this field there are widely researched baseline solutions to this problem, as well as more advanced techniques (often based on machine learning) that improve the relevance of results further. However, picking the right algorithm or technique when implementing a search engine is no trivial task, and deciding which performs better can be hard.

This project takes a commonly used baseline search engine implementation (Elasticsearch) and measures its relevance score using standard measurements within the field of information retrieval (precision, recall, f-measure). After establishing a baseline configuration, a query expansion algorithm (based on Word2Vec) is implemented in parallel with a recommendation algorithm (collaborative filtering) to compare against each other and the baseline configuration. Finally, a combined model using both the query expansion algorithm and collaborative filtering is used to see if they can utilize each other's strengths to make an even better setup.

Findings show that both Word2Vec and collaborative filtering improve relevance on all three measurements (precision, recall, f-measure). These improvements were also confirmed to be significant through statistical analysis. Collaborative filtering seems to perform better than Word2Vec for the topmost results, while Word2Vec improves more the larger the result set is allowed to be. The combined model showed a significant improvement on all measurements for result sets of sizes 3 and 5, but larger result sets showed less of an improvement and even worse performance.


Sammanfattning

The use of search engines is today a common way to navigate through information. The academic field of information retrieval studies methods for finding documents within large unstructured collections of documents. There are several standard solutions within the field intended to solve this problem. There are also a number of more advanced techniques, often based on machine learning, whose goal is to increase the relevance of the results further. Choosing the right algorithm is, however, not trivial, and deciding which one gives the best results can seem hard.

In this project a frequently used search engine, elasticsearch, is used in its default configuration and evaluated against commonly used metrics within information retrieval (precision, recall and f-measure). After the default configuration's results have been established, a query expansion algorithm based on Word2Vec is implemented along with a recommendation algorithm based on collaborative filtering. All three models are then compared against each other on the three metrics. Finally, a combined model of both Word2Vec and collaborative filtering is implemented to see whether the strengths of both models can be used to build an even better model.

The results show that both Word2Vec and collaborative filtering give better results on all metrics. The improvements could be verified as statistically significant. Collaborative filtering seems to perform best when only a few documents are allowed in the result set, while Word2Vec gets better the larger the result set is. The combined model showed a significant improvement for result sets of sizes 3 and 5. Larger result sets, however, showed no improvement or even a degradation compared to Word2Vec and collaborative filtering.


Contents

1 Introduction
    1.1 Problem Description
    1.2 Research Question
    1.3 Scope of the Project

2 Background
    2.1 Information Retrieval
        2.1.1 Search Engine
        2.1.2 Types of Search Engines
        2.1.3 Search Engine Software
    2.2 How Does a Search Engine Work?
        2.2.1 Relevance and Result Ranking
        2.2.2 Term Frequency and Inverse Document Frequency
    2.3 Machine Learning in Information Retrieval
    2.4 Machine Learning Algorithms
        2.4.1 Latent Dirichlet Allocation
        2.4.2 Latent Semantic Analysis
        2.4.3 Word2Vec
    2.5 Recommender Engines
        2.5.1 Collaborative Filtering
    2.6 Evaluation Metrics
        2.6.1 Precision
        2.6.2 Recall
        2.6.3 F-measure
    2.7 Improving Search Results
    2.8 Related Work
    2.9 Summary/Conclusion

3 Methods
    3.1 Research Design
    3.2 Data
        3.2.1 Data Gathering
        3.2.2 Processing
        3.2.3 Data validation
    3.3 Models
        3.3.1 Baseline Model
        3.3.2 Word2Vec
        3.3.3 Collaborative Filtering
        3.3.4 Combining Models
    3.4 Implementation
        3.4.1 Tools
        3.4.2 Tuning
    3.5 Result Gathering
    3.6 Analysis

4 Results
    4.1 Word2Vec
    4.2 Collaborative Filtering
    4.3 Model Comparison

5 Discussion
    5.1 Comparing models
    5.2 Combined Model
    5.3 Sustainability/Ethics/Societal Impact
    5.4 Future Work

6 Conclusions

Bibliography

Chapter 1

Introduction

As the Internet keeps growing larger and the world is becoming more digitized and complex, the need for digital tools to navigate and access information has grown massively. The need is not new but has grown at a faster pace in recent years. Search engines of today can serve millions of pages to hundreds of millions of search queries on a daily basis [1].

The need to search through large amounts of documents with quick responses can be considered fulfilled. The need to find relevant information, however, is harder to satisfy: partly because relevance is subjective to the user asking for information, and partly because it requires a semantic understanding of both the query and the searchable data.

Sometimes a user browsing for information might not know their own explicit needs. A recommendation engine tries to solve this problem for the user. The purpose of a recommender engine is to predict the preference a user would have for an item or document using various methods. The intention is to provide the documents or items with the user's highest predicted preference without the user explicitly asking for them.

1.1 Problem Description

Today there are companies and organizations providing tools for implementing search engines for others to use on their collections of data. These tools use well-known document indexing techniques for fast search and document ranking. This makes it quick and easy to set up a search engine for anyone willing to learn the software tools.

However, while a baseline search engine installation like this might fulfill the need for quick and responsive search, the question remains whether the results are the most relevant ones. This problem is widely researched, and several complex and advanced algorithms that aim to improve the relevance of search engine results have been studied.

In this project a baseline search engine for food items will be benchmarked against common search engine metrics. Then a machine learning approach (using Word2Vec) and a recommender engine (using collaborative filtering), both used to improve ranking, will be implemented on top of the baseline configuration to compare benchmarking scores. Finally a combined version of the two introduced algorithms will also be tested.

This project aims to find whether or not a baseline search engine can see significant improvements to relevance score by applying one or more complex machine learning techniques.

1.2 Research Question

What is the measurable impact of implementing a known machine learning algorithm on a baseline search engine's relevance score, and how does it compare to collaborative filtering? Can these methods be combined to achieve improved performance?

1.3 Scope of the Project

The aim of this project is to evaluate potential improvements to relevance score by applying machine learning techniques to a baseline search engine. There are several information retrieval software packages that implement well-known search techniques, available for installation and indexing of existing data. In this project one of the more popular packages (Elasticsearch) is picked for evaluation. The picked search engine is a crawler-based search engine, as that is the most commonly used type of search engine at the time of writing. Data for both indexing and evaluation is provided by Pacing Sweden AB, which has access to a large food items database and historical item searches with expected results for evaluation. After deciding on a machine learning algorithm for search engine improvement, it is evaluated against the baseline configuration and collaborative filtering. Determining the best machine learning algorithm is not in the scope of this project; rather, the goal is to evaluate the methodology to find and measure improvements by introducing machine learning algorithms to baseline configurations.

The database used to match the incoming claims to relevant food items is Dabas, which is publicly available and contains almost 40,000 items (https://www.dabas.com/) [2]. One item document in the database has 160 distinct properties. This database stores information about food items with varying degrees of structure. Some fields, such as nutritional information, country of origin and weight, are stored in a structured manner. For example, nutritional information is divided into several properties such as name, value and unit. This makes it easy and clear to interpret. However, there is little to no validation of what the vendors of the different food items enter into the database, and several fields are open as free text input (e.g. the list of ingredients field). The database is updated every night to make sure the system can present the latest daily information. This is very important for the client using the search engine since the data needs to be up to date.


Chapter 2

Background

2.1 Information Retrieval

To explain what a search engine is, it is best to start by looking at the broader subject called information retrieval. The term information retrieval, as an academic field of study, can be defined as:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [1]

The difference between information retrieval and traditional database lookup is the so-called unstructured nature of the system. In a traditional database system a typical search query would, for instance, be looking up a specific OrderID in a table of orders to find the information needed. The problem is that the ID of the desired document must be known beforehand. In an information retrieval system a query can cover a much broader kind of information need. For instance a query in an IR system could be phrased something like "find movies with aliens" and the system responds with a list of movies with aliens.

The term "unstructured data" refers to data which does not have a clear, semantically overt, easy-for-a-computer structure. IR systems also have the ability to perform "semistructured" searches. An example would be to query the system for movies whose title contains "star wars" and whose description contains "Obiwan".


2.1.1 Search Engine

A search engine, usually a website, is an information retrieval system that gathers information from different sources on the internet and indexes it. The search engine then allows a user to look up desired information from the gathered content via a query interface. Most commonly this is done via a text input field, and the engine then tries to interpret the input text to match the indexed information. Results are then usually displayed to the user as a list of links to the gathered content that the search engine deems relevant to the query. Examples of well-known and frequently used search engines are Google, Bing and Baidu. These search engines index the world wide web and make it searchable for anyone to use.

2.1.2 Types of Search Engines

Crawler-Based

A crawler-based search engine uses a crawling bot, or spider, to index information into the underlying database of the search engine. A crawler-based search engine usually operates in four steps.

1. Crawl for information

2. Index the documents into the search engine database

3. Calculate relevancy for all documents

4. Retrieve results for incoming search queries

This is the most common type of search engine among today's popular search engines such as Google, Bing, Baidu etc. This is also the type of search engine used in this study, and more in-depth descriptions of these steps can be found in the following sections.

Human Powered Directories

Human powered directories are, in contrast to an automated crawler, indexed manually. The site owner of a website submits a short description of the site to the search engine database. The submission is then reviewed manually by a database administrator and either added to the appropriate category or rejected. This means that the search engine ranking will only be based on the description and keywords submitted by the site owner and will not take changes to the web page content into consideration. This type of search engine is no longer used after the wild success of automated engines such as Google.

Hybrid Search Engines

Hybrid search engines utilize a mix of crawling and human powered directories. They use a crawler-based information retrieval bot to gather information but use manual methods to improve result ranking. For instance, a hybrid search engine may display a manually submitted description in the search results but base the ranking on a mixture of the submitted description and the crawled information from the actual web site itself.

Others

Apart from the previously mentioned types of search engines, there are search engines designed for searching specific types of media. For example, Google has a search engine specifically made for searching images. These types of engines usually utilize other techniques, since the search medium might require the engine to analyze, for instance, the contextual meaning of the search query.

2.1.3 Search Engine Software

There are a number of different search engine software packages provided by different organizations. The website db-engines [3] ranks search engine software by popularity and updates the list monthly. A detailed description of how they calculate this ranking is available on their website. At the time of writing, the top three ranked search engine packages were the following.

Elasticsearch

Elasticsearch is a distributed open-source search engine. It is used by many large corporations around the world to facilitate and speed up their search engine development. Elasticsearch provides tools for indexing documents and building complex search queries. The default installation of Elasticsearch builds an inverted index mapping every word in the indexed documents to all the documents the word appears in and its locations within those documents. By default it uses different techniques for different data types to optimize search responsiveness. It is a highly customizable search engine and can easily be tailored to a specific use case [4].
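
To make this concrete, below is a minimal sketch using the official Elasticsearch Python client; the index name, the document fields and the local address are assumptions made for the example, not part of this project's setup.

from elasticsearch import Elasticsearch

# Assumes a default local installation; index and field names are made up.
es = Elasticsearch("http://localhost:9200")

# Index one document; Elasticsearch infers the field types and adds
# every word to its inverted index automatically.
es.index(index="articles", id="1", document={
    "Article Description": "chicken leg, frozen, grilled",
    "Ingredients": "chicken, salt, spices",
})

# A basic full-text match query resolved against the inverted index.
response = es.search(index="articles", query={
    "match": {"Article Description": "frozen chicken"}
})
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["Article Description"])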

Splunk

Splunk is an enterprise software product that comes with many prepackaged connectors to make it easy to index and import data from various sources. Splunk is mainly operated from a web interface to create various reports with advanced visualization tools, and does most indexing automatically. To extend functionality, Splunk provides a library of apps called Splunkbase, which is a collection of configurations, knowledge objects, views and dashboards that run on the Splunk platform [5].

Solr

Solr is an open source search engine developed by The Apache Software Foundation. It is a schemaless, document-based search engine much like Elasticsearch, and their features are very similar as both projects are based on the Lucene engine [6].

2.2 How Does a Search Engine Work?

2.2.1 Relevance and Result Ranking

A search engine returns a list of results to the user sorted by their relevance to the user's query. This is called result ranking and is one of the fundamental problems that a search engine tries to solve. Classically, relevance is expressed as a numerical summary of several parameters that together define the relevance of a document to the query. The resulting list of matching documents is then sorted by this relevance number, ranking the document with the highest relevance score at the top and the document with the lowest relevance at the bottom. How these parameters are gathered and how the relevance number is calculated is up to the people that design the search engine and to what the users of the search engine define as relevant. More general search engines that go through any web page on the internet might focus on the number of words matching between the web page and the user's query. Another search engine, focused on articles in a web store, might include parameters such as an article's popularity with other customers or how recently the article was published on the store in the relevance calculation. To solve the problem of result ranking one has to consider the user's definition of relevance as well as the different techniques behind a search engine to gather and calculate the final relevance score. It is not uncommon to have to reiterate and reevaluate the search engine multiple times before being able to solve the problem [7].

2.2.2 Term Frequency and Inverse Document Frequency

The search engine studied in this project is built with Elasticsearch, which by default uses a numerical statistic called term frequency–inverse document frequency (tf-idf) to rank its search results. Tf-idf is intended to reflect the importance of each word to a single document in a collection.

Term frequency (tf) is defined as the number of times a word occurs in a document. This means that documents with a high search term frequency will rank higher. However, it also means that long documents are more likely to receive higher ranking scores than short documents. To counter that problem one can use inverse document frequency (idf), where terms that appear in many documents (e.g. "the") are weighted less than rarer terms that also match the document content. The calculation of inverse document frequency can be altered and tailored depending on the search problem. The general method of calculation derives from the following formula [1]:

idf = \log \left( \frac{|\{\text{documents in the collection}\}|}{|\{\text{documents the term appears in}\}| + 1} \right)

Tf-idf is calculated by multiplying the term frequency and the inverse document frequency:

\text{tf-idf} = tf \cdot idf
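
As an illustration, the sketch below computes tf-idf for a toy corpus directly from the formulas above; the three documents and the whitespace tokenization are made up for the example.

import math
from collections import Counter

docs = [
    "frozen chicken leg grilled",
    "frozen salmon fillet portion",
    "strawberry jam no colorants",
]
tokenized = [doc.split() for doc in docs]

def tf_idf(term, doc_index):
    # tf: number of times the term occurs in the chosen document.
    tf = Counter(tokenized[doc_index])[term]
    # idf: log of collection size over (documents containing the term + 1).
    containing = sum(1 for doc in tokenized if term in doc)
    idf = math.log(len(tokenized) / (containing + 1))
    return tf * idf

print(tf_idf("frozen", 0))  # common term: idf = log(3/3) = 0
print(tf_idf("jam", 2))     # rarer term: idf = log(3/2) > 0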


A survey from 2016 showed that tf-idf was the most frequently used weighting scheme. Tf-idf has varying performance when it comes to ranking, but because of its wide use and relative ease of use it remains popular. It is usually altered to meet the needs of specific implementations, which leads to the varying results. Other, more advanced techniques can be hard to access since some authors provide little information about their algorithms, making them hard to replicate [8].

2.3 Machine Learning in Information Retrieval

Search engines today utilize many different machine learning algorithms to improve the experienced relevance of the resulting documents. Machine learning can be used, for instance, in query analysis to correct spelling, expand queries with synonyms and disambiguate intent, or in document analysis for page classification, sentiment analysis, detecting entity relationships and more. Choosing the right algorithm might prove difficult because of the number of algorithms and the sometimes sparse documentation in the literature. There are still many open problems and questions that have yet to be solved [9]. In the field of Natural Language Processing (NLP) there are several algorithms based on machine learning that can be utilized for improved information retrieval. NLP, in the context of computer science, is about programming computers to process and understand natural language data. In information retrieval it is used to extract semantic understanding from documents or queries to improve the classification and retrieval process [10]. One machine learning field within natural language processing is semantic analysis: the process of linguistically parsing sentences and paragraphs into key concepts, verbs and proper nouns. Using statistics-backed technology, these words are then compared to the taxonomy [10]. Two examples of algorithms within this field are latent dirichlet allocation and latent semantic analysis.


2.4 Machine Learning Algorithms

There are several machine learning algorithms and approaches to choose from when it comes to information retrieval, and it is an actively developing research area. Unfortunately not all algorithms are well documented, and some are developed for very specific data sets. The algorithms described below are some commonly used algorithms found in recent trends in natural language processing that have good documentation and can be replicated [11].

2.4.1 Latent Dirichlet Allocation

Latent dirichlet allocation (LDA) assumes a document consists of many small topics and tries to categorise documents based on their relevance to certain topics. For instance LDA can classify a topic called bread_related. Words such as flour, butter or cheese can then have a high probability of belonging to that topic [12].

2.4.2 Latent Semantic Analysis

Latent semantic analysis is a natural language processing technique which analyzes the relationship between a set of documents and the terms they contain. This can be used to predict desired properties from unstructured text data. It assumes that words with similar meaning appear in similar sentences [13].

2.4.3 Word2Vec

Word2vec is a group of models used to reconstruct the linguistic context of words. The models are shallow, two-layer neural networks. They try to capture the meaning of a word by building a vector space of all words, with the distance between words representing the distance in semantic meaning. The models most often assume that words that appear in the same context share semantic meaning. Word2Vec can be divided into two different approaches (the continuous bag-of-words model and the skip-gram model) to predict the semantic relations between words. These are particularly computationally efficient (not requiring much CPU time), which is one of the reasons why they are appropriate for improving search engines, as search engines are often expected to respond quickly [14].

Continuous bag of words

Continuous bag of words (CBOW) tries to predict a target word based on the surrounding words. Consider the sentence "the quick brown fox jumps over the lazy dog". For example, if the target word is "fox" and the context window is 2 (the number of surrounding words), CBOW will store the context words "brown" and "jumps". This way the model can try to predict the target word based on the surrounding words [15].

Skip-gram

The skip-gram model, which is used primarily when having access to larger data sets, is essentially the reverse of CBOW. It differs from CBOW by trying to predict the surrounding words given the current word, instead of predicting the current word based on the surrounding words. This model is more complex to compute, since it tries to predict multiple surrounding words instead of a single target word. Increasing the window size thus increases the computational time, and a large window size can make training the model take a very long time [15].
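
A minimal training sketch of both variants, assuming the gensim library (whose parameters match the size/window/minCount/workers setup used in chapter 3, with size renamed to vector_size in gensim 4); the toy corpus is made up.

from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list entry.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "lazy", "dog", "sleeps"],
]

# sg=0 trains CBOW (predict the target word from its context);
# sg=1 trains skip-gram (predict the context from the target word).
cbow = Word2Vec(sentences, vector_size=150, window=2, min_count=1, workers=4, sg=0)
skipgram = Word2Vec(sentences, vector_size=150, window=2, min_count=1, workers=4, sg=1)

# Words that appear in similar contexts end up close in the vector space.
print(cbow.wv.most_similar("fox", topn=3))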

2.5 Recommender Engines

A recommender system provides an alternative way of finding information compared to traditional search, by recommending items the user might not have found by themselves. Since collaborative filtering has been widely used in recent years for recommendation systems [9], it is the recommendation system technique described in this chapter.

2.5.1 Collaborative Filtering

Collaborative filtering is the most widely used technique in recommendation systems because of its accuracy and simple algorithm design. In general, collaborative filtering identifies users who have similar ratings or purchases of the same items. It can then pick items from similar users and give them as recommendations [16].

K-Nearest Neighbors (KNN)

One way of implementing CF is to use the K-Nearest Neighbors learning method. KNN is a non-parametric, lazy learning method. This means that the generalization of the data is not done until the querying of a document. When a query is made to the system, the feature-similarity distance from the evaluated item to each other item in the database is calculated. Then the k nearest neighbors are gathered and regarded as the most similar item recommendations [17]. KNN is useful since it makes no assumptions about the data, and it is a relatively simple and versatile algorithm. KNN is, however, not a memory-efficient algorithm, since it stores all of the training data in memory, and it is quite computationally expensive [18]. KNN document classification, sketched below as a small runnable Python version:

import math
from collections import Counter

def knn_classify(train_data, document, k):
    # train_data: list of (feature_vector, class_label) pairs;
    # document: a feature vector of the same dimensionality.
    distances = [
        (math.dist(point, document), label)  # Euclidean distance
        for point, label in train_data
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_neighbors = distances[:k]
    # Majority vote among the k nearest neighbors decides the class.
    votes = Counter(label for _, label in nearest_neighbors)
    return votes.most_common(1)[0][0]

Algorithm 1: K-Nearest Neighbors

Non-Negative Matrix Factorization (NMF)

NMF is a matrix factorization method that constrains the matrices to be non-negative. In a topic modeling scenario, such as in this project, it would for instance create a document matrix where each column is a document and each element a tf-idf word weight. When this matrix is decomposed into two factors, one matrix has each column representing a topic and each row a word, while the other matrix has columns representing documents and rows representing topics. This way it is possible to weight each word and/or document towards certain topics and build recommendations based on this [19].
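
A small sketch of this decomposition, assuming scikit-learn; the corpus is made up, scikit-learn orients the matrix with documents as rows rather than columns, and the two components stand in for topics.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "frozen salmon fillet portion",
    "frozen chicken leg grilled",
    "strawberry jam berries bucket",
]

# Document-term matrix with tf-idf weights (documents as rows here).
X = TfidfVectorizer().fit_transform(docs)

# Decompose X ≈ W @ H with all entries non-negative:
# W maps documents to topics, H maps topics to words.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)  # shape (n_documents, n_topics)
H = nmf.components_       # shape (n_topics, n_words)
print(W.round(2))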

Singular Value Decomposition (SVD)

SVD is used to solve the problem of data sparsity that can arise when using CF. SVD is commonly used to reduce the dimensionality of user-item rating matrices. This increases the rating density and thus makes it possible to find hidden relationships in the matrix, giving more rating options for the recommendation engine to work with [16].
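
A sketch of that dimensionality reduction with NumPy; the tiny user-item matrix and the choice of two latent factors are made up for the example.

import numpy as np

# Sparse user-item ratings (0 = no rating), made up for the example.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Full SVD, then keep only the k strongest latent factors.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
dense = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction fills the zeros with predicted affinities,
# exposing hidden user-item relationships for the recommender.
print(dense.round(2))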

2.6 Evaluation Metrics

To determine whether or not the search engine performs well according to the end user, there must be a way to evaluate the results. In this section, common evaluation measures used in information retrieval are presented.

2.6.1 Precision

Precision is a way to measure the ability of the search engine to find relevant documents relative to the number of returned results. Precision is calculated by dividing the number of relevant documents in the search result by the number of results [20].

precision = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \qquad (2.1)

2.6.2 Recall

Recall works similarly to precision but is measured relative to the total number of relevant documents instead of the number of search results. Recall is calculated as the fraction of the number of relevant documents in the search results over the total number of relevant documents [20].

recall = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \qquad (2.2)


2.6.3 F-measure

A way to combine both precision and recall into a single measurable number is to use the F-measure.

F\text{-measure} = \frac{2 \cdot precision \cdot recall}{precision + recall} \qquad (2.3)

When precision and recall are close to each other, the F-measure value is close to their average; in general it is the harmonic mean of the two. The F-measure is also called the F1-measure because precision and recall are evenly weighted. To change the weighting between precision and recall, the traditional F-measure can be generalized to the Fβ-score.

F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall} \qquad (2.4)

When β > 1 recall is weighted higher, and when β < 1 precision is weighted higher. Commonly used β values are 2 and 0.5, chosen based on how the user prioritizes recall over precision and vice versa [20].
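
Computed over sets of document ids, equations 2.1–2.4 translate directly to code; a minimal sketch with made-up result sets:

def precision_recall_f(relevant, retrieved, beta=1.0):
    # hits: relevant documents that were actually retrieved.
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# 10 results returned, 3 of the 4 relevant documents among them.
relevant = {"a", "b", "c", "d"}
retrieved = {"a", "b", "c", "u", "v", "w", "x", "y", "z", "q"}
print(precision_recall_f(relevant, retrieved))            # F1
print(precision_recall_f(relevant, retrieved, beta=2.0))  # recall-weighted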

2.7 Improving Search Results

The most common ways to evaluate search engines are precision, recall and f-measure, as described in section 2.6. Characteristics of good search engine results are [21]:

• It should correspond to the user's query

• It should be properly organized according to the relevancy of the content

• It should be properly defined such that it is well understood by the user

• It should be robust

• It should provide satisfaction and quality to the user

• It should not be ambiguous

• It should be readable


One common method of improving search results is the tf-idf vector space model approach (described in section 2.2.2), but there are several more approaches to improving results.

One example is using a meta search engine. This solution takes a search query, passes it on to several already existing search engines, aggregates their respective results and reranks them. This resolves the problem of having an outdated database, assuming at least one of the search engines is up to date for the query [21].

Another example is storing search history and profiling information about the user. This usually requires the user to log in and answer forms before using the search engine. It gives the results a sense of personalization for the end user.

Clustering is also a technique to improve results. It is an unsupervised data mining technique that automatically classifies documents without any predefined information on how to do it. It is used to minimize the space where the search has to look, which improves query responsiveness and ranking. It also overcomes the problem of vocabulary differences [21].

2.8 Related Work

There have been several earlier attempts to evaluate different approaches and techniques for improving search engine result ranking. An attempt to improve PubMed's result ranking was made in 2018, where the result ranking algorithm was modified using machine learning algorithms, which showed a great improvement in user click-through rate. The previously used system was a tf-idf based ranking algorithm that was maintained and developed through manual experiments and analyses. They wanted to see if machine learning based ranking algorithms could improve the search engine. They developed a custom algorithm called Best Match, inspired by machine learning based ranking algorithms such as L2R, BM25 and LambdaMART. After deploying the new algorithm they could measure a 20% increase in user click-through rate [22].

In 2016 Roy et al. implemented automated query expansion using Word2Vec, leading to improved result ranking. This was done by analyzing the incoming query to the search engine and then reformulating it using the trained word2vec corpus they had created. The expanded queries were found to almost always outperform their baseline model significantly [23].

Another interesting related work is the attempt to combine Word2Vec and collaborative filtering to improve a recommendation engine. The results of that experiment suggest that the hybrid algorithm greatly improves the efficiency and accuracy of the recommender system for large datasets [24].

2.9 Summary/Conclusion

The algorithms examined in this study are latent dirichlet allocation, latent semantic analysis, word2vec and collaborative filtering. The choice came down to using word2vec and collaborative filtering, as they seem to be the more popular choices in recent trends [11] [9] while using quite different solutions. LDA and latent semantic analysis were not evaluated further, as they would be used in similar scenarios as word2vec and would make the scope of the project too big.

Word2Vec could potentially be useful for the project to determine whether or not certain properties are related to each other or possibly synonymous with each other. An example of what it is hoped to resolve is finding out that a certain food label also implies that the item is organic. Hypothetically, after training the model with information about tender documents and articles, it will be able to widen the possible matches of certain search words.

For this project collaborative filtering might prove useful if it can successfully find similarities between claims and thus recommend items that the automated search query cannot find, or rank certain items higher as it finds them relevant to recommend.

In this study Word2Vec and collaborative filtering will be examined to measure how much improvement can be made on a search engine optimization problem. They are both well-known algorithms used to improve an information retrieval system's relevance, but with different approaches: Word2Vec focuses on word semantics, both in the search query and the searchable data, while collaborative filtering tries to match users' previous actions to find similarities between documents. Another part of this study is trying to combine both these strategies to see if it is possible and whether it could improve the search results even more. As has been seen in earlier attempts, both Word2Vec alone and attempts to combine the algorithms have resulted in improvements. This suggests that these algorithms should show significant improvements compared to the baseline algorithm; however, this project will also benchmark the algorithms against each other. The intention is to see whether one algorithm can improve the results more than the other and whether the hybrid algorithm outperforms the individual algorithms.

In this project precision, recall and the combined f-measure metric will be used to analyze the results. These metrics are commonly used in the information retrieval field to measure the performance and accuracy of a search engine or recommendation engine. For example, in 1999 Gordon et al. compared eight different search engines using precision and recall metrics to benchmark performance, calling the method traditional within the information retrieval field. They also used statistical comparisons to determine the significance of the difference between the results, as will be done in this project [25].


Chapter 3

Methods

3.1 Research Design

This study applies an experimental design to answer how different information retrieval algorithms change the retrieved search results. In brief, two categorically different information retrieval algorithms (Word2Vec and Collaborative Filtering) and a custom hybrid variant were implemented and trained with learning data from a food items database. The models were evaluated by comparing precision, recall and f-measure to a baseline model based on common tf-idf document indexing. The resulting deviations in the chosen measurements were deemed significant or not by applying statistical analysis. The constant variables of the experiment were the learning and test data sets fed to the algorithms. The research design is shown in figure 3.1.

Figure 3.1: The experimental design schema


3.2 Data

Evaluation of the different models required proper data sets in order to run simulations and tests. For both simulations and tests, completed tender documents were used to build test and validation data sets. This was possible because completed tender documents contained the necessary information about both the requested and delivered articles. Details about data gathering and pre-processing are given in the following sections.

3.2.1 Data Gathering

The process of gathering data was done in two major steps: first, gathering all the necessary test data from previously completed tender documents; second, gathering all food item information stored in the article database. Figure 3.2 shows the general flow of the data gathering process.

Figure 3.2: The data gathering process

Incoming Procurements

A procurement is the process of obtaining goods for an organization from an external vendor. The procurement process starts with a purchaser creating a procurement order by submitting a tender document. This is done by entering a list of loose definitions of desired articles and their desired properties into a computer system (e.g. chicken leg, frozen, grilled). How this system works in detail is unknown to the principal and the tender strategist. What they see is only the resulting exported file that the purchaser sends to the tender strategist. This file mostly comes as a comma separated values (CSV) file containing a list of all the desired article descriptions and the constraints that go with them. This file structure makes it easy for a computer system to import and interpret the input data. A few example rows from a tender document can be seen in Table 3.1.

Table 3.1: Tender Position

Product Family         | Product Class            | Product Description | Properties                                                                                      | Quantity
-----------------------|--------------------------|---------------------|-------------------------------------------------------------------------------------------------|---------
Frozen Fish Unprepared | Salmon                   | Fillet              | Salmo Salar (salmon), Frozen pieces, Portion, Skin and bone free                                  | 1,397
Colonial               | Spices Lemon/Lime Pepper | Lemon               | Free from MSG, Spice mix, Plastic container                                                       | 12
Colonial               | Jam/Marmalade/Jelly      | Strawberry jam      | Berries >= 35.00%, Bucket, No colorants or flavorings                                             | 510
Pastry/Desserts        | Ice Cream                | Milk free ice cream | Fat content <= 10%, free from milk protein, Vanilla, free from soy, free from lactose, oat base   | 181

The first row in Table 3.1 shows that the purchaser is asking for 1,397 units of unprepared frozen salmon fulfilling certain properties. The problem for the search engine is to find the most appropriate articles to present to the tender strategist to offer the purchaser. Most likely there will be more than one article matching the description provided, so the engine will have to rank the articles found by each matching article's individual relevance score.

Completed Procurements

As mentioned in section 3.2.1, incoming procurements usually come in the form of a CSV file. The same goes for the completed documents. Each row in a file is structured as shown in Table 3.2.


Table 3.2: Tender Header

Pos | Prod Family | Prod Class | Prod Desc | Properties | GTIN
INT | TEXT        | TEXT       | TEXT      | TEXT       | INT

Each row in the incoming tender document was identified by its Pos (position) number. To match the completed tender document with the incoming document file, all that needed to be done was to match the position number between the files for each row. Tender documents were processed by an in-house script and stored in a document database in JSON format, to easily allow for JSON-based queries towards Elasticsearch. The resulting database, consisting of close to 10,000 positions from 14 tender documents with claims in all product categories, provided the data set for testing and evaluating the different algorithms.
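
A sketch of that matching step using Python's csv module; the file names and the exact column labels are assumptions based on Table 3.2, not the project's actual script.

import csv

def read_positions(path):
    # Key each row of a tender CSV by its Pos number.
    with open(path, newline="", encoding="utf-8") as f:
        return {int(row["Pos"]): row for row in csv.DictReader(f)}

incoming = read_positions("incoming_tender.csv")
completed = read_positions("completed_tender.csv")

# Join the files on the shared position number; the completed row
# contributes the GTIN of the article that was actually selected.
matched = [
    {**incoming[pos], "selected_gtin": completed[pos]["GTIN"]}
    for pos in incoming
    if pos in completed
]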

Articles Database

The article database is accessible over the internet through a REST API sending responses in JSON format, and consists of a little more than 35,000 items. Since Elasticsearch is JSON-based, the simplest solution was to query the database for all its information and index it directly into the local Elasticsearch engine. This was also done with an in-house script. Elasticsearch automatically indexed all the database information based on its configuration, setting up tf-idf models and interpreting data types. Without any configuration, Elasticsearch builds a default index with basic optimizations. After this was done, all that was needed to run the tests for the baseline model was to define the search query used to fetch items from the database.

Because Dabas is designed not to allow fetching the entire database in one call, it took a lot of time to fetch the entire database. To avoid going through that process multiple times, the in-house script was designed to first store all the information in MongoDB, serving as a cache, before sending it to Elasticsearch. This way, in case Elasticsearch had to be reindexed with a new configuration or for some other reason, it could be done directly from MongoDB instead of fetching the information over the internet. This saved a great amount of time, since the Elasticsearch index had to be rebuilt several times. It also made sure the test data was always the same between index rebuilds.
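
A sketch of that cache-then-reindex flow, assuming pymongo and the Elasticsearch Python client's bulk helper; the database, collection and index names are made up.

from elasticsearch import Elasticsearch, helpers
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch("http://localhost:9200")

# MongoDB keeps a full copy of the Dabas items, so a rebuild never has
# to touch the slow external REST API again.
articles = mongo["dabas_cache"]["articles"]

def actions():
    for doc in articles.find():
        doc_id = doc.pop("_id")  # Mongo's ObjectId is not JSON-serializable
        yield {"_index": "articles", "_id": str(doc_id), "_source": doc}

# Stream every cached document into the (possibly reconfigured) index.
helpers.bulk(es, actions())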

3.2.2 Processing

To utilize the gathered data for testing and machine learning, some processing had to be done. The process of creating the test data was quite straightforward. During the data gathering process, the in-house script used automated tools for identifying each column's data type (string, integer, floating point etc.) before importing it into the database. After that, a post-gathering processing step was done to simplify the testing process later in the project. An example of this processing is the properties column, a comma separated text string describing each required property of the desired food item. The extra processing had to go through all of the documents and split the string on the delimiter to create a more flexible data structure for future follow-up. An example of a document before and after processing is shown in Listings 3.1 and 3.2 below.

Listing 3.1: Tender document data before Processing

{
    "procurement": "boden",
    "pos": 7,
    "prod_area": "Colonial/Groceries",
    "prod_division": "Flour",
    "prod_commodity": "Sifted rye",
    "properties": "Free from Nuts, Peanuts, Almonds, apricot seeds and sesame seeds, Only natural sugars",
    "selected_articles": "17321575952076,7311140730102"
}


Listing 3.2: After Processing

{
    "procurement": "boden",
    "pos": 7,
    "prod_area": "colonial/groceries",
    "prod_division": "flour",
    "prod_commodity": "sifted rye",
    "properties": [
        "free from nuts",
        "peanuts",
        "almonds",
        "apricot seeds and sesame seeds",
        "only natural sugars"
    ],
    "selected_articles": [
        "17321575952076",
        "07311140730102"
    ]
}

In Listing 3.2 the properties field has been transformed into a list of properties where all strings have been lowercased for easier string comparison. The selected_articles string in Listing 3.1 has also been converted to a list, as seen in Listing 3.2, and the second GTIN (Global Trade Item Number) has had a 0 prepended to make the number conform to the 14-digit GTIN-14 format. This error was very common in these CSV files, as they were usually made in spreadsheet software that removed leading zeros from numbers before saving the file. The processing had to fix this to be able to find the selected article by its identifying GTIN, or otherwise the test validation would not work.
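
A sketch of that normalization (lowercasing, splitting the comma separated fields and zero-padding GTINs to 14 digits); the helper itself is made up but follows Listings 3.1 and 3.2.

def process_position(raw):
    # Lowercase all string fields for easier comparison.
    doc = {key: value.lower() if isinstance(value, str) else value
           for key, value in raw.items()}
    # Split the comma separated properties string into a list.
    doc["properties"] = [p.strip() for p in doc["properties"].split(",")]
    # Spreadsheets drop leading zeros, so left-pad each GTIN to 14 digits.
    doc["selected_articles"] = [
        gtin.strip().zfill(14) for gtin in doc["selected_articles"].split(",")
    ]
    return doc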

3.2.3 Data validation

The incoming data had to be validated before being used in the final implementation. This was done by making a small script used to import the CSV files into the MongoDB instance used in the project. During this process every row and column was checked to contain valid data and was processed to its proper data type before being inserted into the database. Any information that did not pass the criteria had to be ignored, since it would not be possible to process it later anyway. An example of a finalized and validated document:

Listing 3.3: Validated Tender Position Document

{
    "procurement": "boden",
    "pos": 7,
    "prod_area": "colonial/groceries",
    "prod_division": "flour",
    "prod_commodity": "sifted rye",
    "properties": [
        "free from nuts",
        "peanuts",
        "almonds",
        "apricot seeds and sesame seeds",
        "only natural sugars"
    ],
    "selected_articles": [
        "17321575952076",
        "07311140730102"
    ]
}

3.3 Models

3.3.1 Baseline Model

As mentioned in the previous section, most preparations for the baseline model were already finished by setting up Elasticsearch and storing the procurement data in a database. All that needed to be done was to define the search query. This query was modeled as a JSON document schema with the root property called query. The query property in turn contained all the query configuration needed for the query to run as intended.


Search Query

The search query is divided into two different groups of constraints, as shown in Listing 3.4. One group of sub queries must be fulfilled for an article to end up in the match results. The other group contains queries that should match, which only boost the score when matched.

The sub queries that are required for a match are the queries for product family, product class and product description. These properties are required to match since they define which category the article needs to be in. For instance, if the procurement claim asks for a frozen (product area) bird product (product division) chicken leg (product commodity), then all results are expected to match these requirements. Listing non-frozen items or non-bird products, for instance, would never be of relevance.

The optional sub queries are in place to rank the matching articles against each other and are more numerous than the required ones. One of the simpler of these queries ranks the user's wholesale company's self-branded goods higher, i.e. if the article's brand name matches the wholesale company's name it ranks higher. Among the optional queries are also the weight queries. These are left optional because the search engine has trouble identifying and matching the correct weight values for the articles' gross and net weight. Since weight information about the article is typically part of some free text description of the item, it is not certain the search query will find it or match it correctly. Thus, the queries are left optional but rank articles that actually match higher. The last set of optional queries are the ones for finding properties in articles. These are divided into two groups: one group of properties with no value and one group of properties with an associated value. All of these properties are matched by searching in a text string of concatenated properties from the different article documents. This is where the search engine struggles the most to find matches, as properties can be written in so many different forms and synonyms. Especially the properties with an associated value are the hardest to find, as there are so many ways to express this type of information in free text.


Listing 3.4: Search Query

{
    "size": 10,
    "query": {
        "bool": {
            "must": [
                {"term": {"Product Code": procurement["prod_area"]}},
                {"term": {"Product Code": procurement["prod_division"]}},
                {"term": {"Article Description": procurement["prod_commodity"]}}
            ],
            "should": [
                {"range": {"Size": {
                    "gte": position["min_weight"],
                    "lte": position["max_weight"]
                }}},
                {"term": {"manufacturer": {
                    "value": wholesale_company_name,
                    "boost": 2.0
                }}},
                {"term": {"Ingredients": position["props"][0]}},
                ...
            ]
        }
    }
}

3.3.2 Word2Vec

When the baseline model was up and running, Word2Vec could be used to try to improve it. As explained earlier, Word2Vec can be used to expand the search queries to include more words similar to the original query. First the model had to be trained before it could be used.

Training Word2Vec

To begin with, Word2Vec had to learn the domain that the search engine was querying. The training data would consist of both procurement information and article information from the article database. The training data was built by taking each row in all procurements and concatenating all of its columns into a single text string. For each row, the Dabas information for the selected article would also be gathered and appended to the end of the string. Finally, the list of all text strings could be passed to Word2Vec for training. Word2Vec needs the training data in this format to be able to run its models on it, as described in earlier chapters. The process is described in pseudo code:

trainData ← list()
forall row in procurements do
    articleInfo ← getArticleInfo(row["selectedArticle"])
    text ← ""
    forall column in row do
        text ← text + column + " "
    end
    forall column in articleInfo do
        text ← text + column + " "
    end
    trainData.append(text)
end
word2vec.train(trainData, size = 150, window = 5, minCount = 2, workers = 10)

Algorithm 2: Training the Word2Vec Model
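A minimal gensim sketch of this training step, assuming procurements yields dict-like rows and get_article_info is a hypothetical helper corresponding to getArticleInfo in the pseudo code:

from gensim.models import Word2Vec

train_data = []
for row in procurements:
    article_info = get_article_info(row["selectedArticle"])
    # Concatenate all procurement columns and the selected article's
    # Dabas fields into one string, then tokenize it; gensim expects
    # each training sentence as a list of tokens.
    text = " ".join(str(value) for value in row.values())
    text += " " + " ".join(str(value) for value in article_info.values())
    train_data.append(text.lower().split())

# Parameter names follow gensim 3.x; in gensim 4.x `size` became `vector_size`.
model = Word2Vec(train_data, size=150, window=5, min_count=2, workers=10)
model.save("word2vec.model")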

For the training, different values for the size, window, minCount and workers parameters were used (section 3.2.3). The impact of each parameter on the results is presented in the results chapter. One model was created for each parameter setup for later testing.

Rebuilding the Elastic Index

After Word2Vec was trained, it was necessary to rebuild the Elasticsearch index to incorporate the model into the search engine. This was done by iterating through each document in the article database copy in MongoDB and complementing mainly its ingredients property with the words Word2Vec deemed related to the existing ingredients text. Described as pseudo code:


model ← load("word2vec.model")
forall item in article database do
    similarWords ← model.mostSimilar(item["ingredients"], similarity >= 0.8)
    forall word in similarWords do
        item["ingredients"].append(word)
    end
    elastic.index(item)
end

Algorithm 3: Rebuilding the Elastic Index

After this process the documents in the elastic index contained much more information about the articles, giving the search query a wider range since the documents were longer. The similarity parameter filters the similarWords list to only return words that Word2Vec finds 80% similar or more to the item in this example, similarity being the calculated cosine similarity between words in the Word2Vec vector space. This parameter was determined empirically. Setting it too low ended up giving the new elastic index irrelevant new words for the items' ingredients descriptions, resulting in poor search results. Setting it too high resulted in Word2Vec not recommending any relevant words at all, or very few, barely making any impact on the search results. The same expansion was also done for the "Article Description" property of each item. This process was repeated for each training model with the different training parameter setups, creating an elastic index for each model for the future testing.
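A sketch of the rebuild step with gensim and the official elasticsearch client; the whitespace tokenization, the index name and using the article's GTIN as document id are assumptions for illustration:

from gensim.models import Word2Vec
from elasticsearch import Elasticsearch

model = Word2Vec.load("word2vec.model")
es = Elasticsearch()

# `articles` is assumed to be a cursor over the MongoDB article copy.
for item in articles:
    expanded = set()
    for token in item["ingredients"].lower().split():
        if token not in model.wv:
            continue
        # most_similar returns (word, cosine similarity) pairs.
        for word, score in model.wv.most_similar(token, topn=10):
            if score >= 0.8:  # the empirically chosen threshold
                expanded.add(word)
    item["ingredients"] += " " + " ".join(sorted(expanded))
    es.index(index="articles", id=item["GTIN"], body=item)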

Expanding the Search Query

Word2Vec was also used to expand the search query built in the baseline model. The structure was kept the same as in Listing 3.3 but the search strings were altered. To give the search queries a greater chance of finding a wider set of documents, the Word2Vec model was used to expand the search strings from the incoming tender documents. This was done the same way as in Algorithm 3, but for the "Article Description" search string and all the "Ingredients" search strings. The other search strings were not altered as they function as a filter rather than a full document search. For the search query expansion a similarity filter of 0.9 was used instead, which was decided upon by running tests.
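As a sketch, the expansion can be a small helper applied to each of these search strings before the query is built; the tokenization and the topn limit are assumptions, while the 0.9 threshold is the empirically chosen value:

def expand_search_string(text, model, threshold=0.9):
    # Append every word the Word2Vec model finds at least
    # `threshold` similar to some token in the original string.
    tokens = text.lower().split()
    extra = set()
    for token in tokens:
        if token in model.wv:
            for word, score in model.wv.most_similar(token, topn=5):
                if score >= threshold:
                    extra.add(word)
    return " ".join(tokens + sorted(extra))

The returned string then replaces the original "Article Description" and "Ingredients" strings in the query from Listing 3.4.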

3.3.3 Collaborative Filtering

Collaborative filtering clusters documents by similarity based on user feedback. This approach is different from Word2Vec and is incorporated later in the search engine process. Instead of expanding the search query, collaborative filtering was used to expand and improve the search result ranking. Tuning collaborative filtering was done by testing different algorithmic approaches. A description of the different algorithms can be found in section 2.5.1. A short description of each algorithm can be seen in Table 3.3.

Table 3.3: Collaborative Filtering Algorithms

KNNBaseline     An algorithm taking into account a baseline rating.
KNNWithMeans    An algorithm taking into account the mean ratings of each user.
KNNWithZScore   An algorithm taking into account the z-score normalization of each user.
NMF             An algorithm based on Non-negative Matrix Factorization.
SVD             An algorithm based on Singular Value Decomposition.
SVD++           The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

Training Collaborative Filtering

First the collaborative filtering algorithm had to be trained. This was done by extracting information about already completed tender documents and building a dataframe. The dataframe consisted of identifying information about the article, descriptive information about the query and finally a rating scale variable. In this particular implementation these were the article's GTIN, the search query's article commodity and how many times the article had been used in previous tender documents with the same article commodity requirement. This dataframe was then fed to the collaborative filtering algorithm to calculate the similarities.
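A minimal training sketch with pandas and surprise; the column names and the rows variable are illustrative stand-ins for the extraction described above:

import pandas as pd
from surprise import Dataset, KNNWithMeans, Reader

# Each row: (article commodity of the query, article GTIN, number of
# times the article was chosen for that commodity in completed tenders).
df = pd.DataFrame(rows, columns=["commodity", "gtin", "times_used"])

reader = Reader(rating_scale=(df["times_used"].min(), df["times_used"].max()))
# surprise interprets the columns as (user, item, rating); here the
# query's article commodity plays the role of the user.
data = Dataset.load_from_df(df[["commodity", "gtin", "times_used"]], reader)

algo = KNNWithMeans()
algo.fit(data.build_full_trainset())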


Altering the Result Rankings

After elasticsearch had processed the search query and generated a list of search results, the collaborative filtering model was used. The search result list was reranked according to what the collaborative filtering model predicted the rating scale value to be for the search query's required article commodity. This way the result ranking was altered to rely on the item similarity calculated by collaborative filtering instead of the inverse document frequency the engine originally used.
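Sketched with the surprise model from the training step, assuming each elasticsearch hit carries the article's GTIN in its source document:

def rerank(hits, algo, commodity):
    # algo.predict(user, item).est is surprise's predicted rating; the
    # hits are sorted by it instead of by their tf-idf based _score.
    return sorted(
        hits,
        key=lambda hit: algo.predict(commodity, hit["_source"]["GTIN"]).est,
        reverse=True,
    )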

3.3.4 Combining Models

Combining the models was possible since they targeted different parts of the information retrieval process. Both models were trained just as described in 3.3.2 and 3.3.3. The search query and the elastic index were expanded and rebuilt using the Word2Vec techniques. Then, after the search engine had retrieved its results, the documents were reranked according to the collaborative filtering predictions. In this combined version the search engine had a wider knowledge of the lexical meaning of the search queries and the searchable data set, and also an understanding of how popular each article was based on previous user behavior.
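Reusing the sketches from the previous sections, the combined flow could look as follows; all names are the illustrative ones introduced earlier, not the project's exact code:

# Expand the free-text fields, query the Word2Vec-expanded index as in
# the baseline, then rerank the hits with the collaborative filtering model.
commodity = procurement["prod_commodity"]
procurement = dict(procurement, prod_commodity=expand_search_string(commodity, model))
body = build_search_query(procurement, position, wholesale_company_name)
hits = es.search(index="articles", body=body)["hits"]["hits"]
ranked = rerank(hits, algo, commodity)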

3.4 Implementation

The programming language of choice in this project was Python, for its extensive library of tools and prebuilt algorithms. This made it easier and faster to run tests and tunings.

3.4.1 Tools

To fully run all the scenarios and tests for this project a number of tools were used. Table 3.4 lists all tools and software packages used to conduct the entire experiment.


Table 3.4: Tools

Name           Description                                  Use Case
MongoDB        Document database                            Used to store test and validation data
Elasticsearch  Document based search engine                 Used for the baseline search engine implementation
Python         Programming language                         Used for implementation
SciPy          Python package for scientific computing      Used for statistical analysis functions
gensim         Software framework for topic modelling       Used to implement Word2Vec models
surprise       Python package for recommendation systems    Used to implement collaborative filtering models
pandas         Python data analysis library                 Used to build dataframes

3.4.2 Tuning

To find the best performing implementation of each algorithm, some tuning had to be done. The baseline implementation of elasticsearch was not altered, to make sure the result comparison would be fair. Only the added algorithms were tuned to improve their performance.

Word2Vec

Tuning of the Word2Vec algorithm was done mainly by altering the window size, min_count and the training algorithm. The window size is the maximum allowed distance between the current and the predicted word in a sentence. The min_count parameter specifies the minimum frequency a word needs to have to be included in the training model. The two training algorithms tested were the skip-gram model and CBOW, as described in section 2.3.1; the final implementation used the CBOW model. There are more ways to alter the Word2Vec model for tuning, but these parameters showed the most impact on the result rankings. To be able to evaluate the result ranking impact of each parameter change, one pre-trained model file was created for each parameter setup. This way the same Word2Vec implementation could be rerun with different pre-trained models to compare and find the best tuned setup for the algorithm. Tuning result examples for Word2Vec are presented in section 4.1.
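A sketch of such a sweep, saving one pre-trained model per parameter setup; the value grid is an example, not the exact one used:

from itertools import product
from gensim.models import Word2Vec

window_sizes = [3, 5, 10, 15, 20, 50]
train_algorithms = [0, 1]  # gensim's `sg` flag: 0 = CBOW, 1 = skip-gram
for window, sg in product(window_sizes, train_algorithms):
    model = Word2Vec(train_data, size=150, window=window,
                     min_count=2, workers=10, sg=sg)
    model.save(f"word2vec_w{window}_sg{sg}.model")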

Collaborative Filtering

Similar to Word2Vec, the collaborative filtering algorithm can be implemented with various tuning parameters. The surprise package used in this project makes this task very easy. The package ships several ready-made collaborative filtering algorithms, so the tuning process was merely to try each one and see which performed best. All that had to be done was to alter the algorithm parameter before retraining the model. Results of this tuning process can be found in section 4.2.
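A sketch of trying each candidate from Table 3.3, reusing the trainset built in the training step:

from surprise import (KNNBaseline, KNNWithMeans, KNNWithZScore,
                      NMF, SVD, SVDpp)

candidates = {
    "KNNBaseline": KNNBaseline,
    "KNNWithMeans": KNNWithMeans,
    "KNNWithZScore": KNNWithZScore,
    "NMF": NMF,
    "SVD": SVD,
    "SVD++": SVDpp,
}
models = {}
for name, Algorithm in candidates.items():
    algorithm = Algorithm()
    algorithm.fit(trainset)  # trainset = data.build_full_trainset() as above
    models[name] = algorithm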

3.5 Result Gathering

When gathering results, both to determine optimal tuning parameters and to evaluate the final model setup, the dataset of tender documents was partitioned into 14 chunks. One partition was used for evaluation (calculating precision, recall and f-measure) and another for training. This was done for every partition for every result gathering run. The final result was then calculated as the average over all partition runs together with the deviation between them.
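A sketch of the per-run metrics and the final aggregation; per_partition_scores is an assumed list of one (precision, recall, f-measure) triple per partition run, and k corresponds to the depth levels reported in chapter 4:

import numpy as np

def precision_recall_f(retrieved, relevant, k):
    # Relevance metrics at result list depth k.
    top = list(retrieved)[:k]
    hits = len(set(top) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

runs = np.array(per_partition_scores)
mean, deviation = runs.mean(axis=0), runs.std(axis=0)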

3.6 Analysis

After gathering precision, recall and f-measure results for each model and tuning variant, it was time to analyze the results. For this the SciPy tool was used to apply statistical analysis methods to the results. When all results had been gathered, a statistical analysis using t-test and ANOVA methods was done to determine whether or not the observed averages between the models/tunings differ significantly, the null hypothesis being that the compared results have identical average values. If the p-value is small enough (less than 0.05) we can reject the null hypothesis of equal averages and assume we have observed a significant difference in results.
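A sketch of that analysis with SciPy, assuming scores maps each model or tuning variant to its list of per-partition relevance scores:

from scipy import stats

# ANOVA across all variants at once.
f_stat, p_value = stats.f_oneway(*scores.values())
if p_value < 0.05:
    print("Reject H0: at least one variant's mean differs significantly.")

# Pairwise comparison of two specific variants with an independent t-test.
t_stat, p_pair = stats.ttest_ind(scores["baseline"], scores["word2vec"])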


Chapter 4

Results

This chapter presents the gathered information that later led to the conclusions drawn in chapter 6. During analysis, statistical methods were applied to the gathered evaluation metrics for further insights to draw conclusions from. This was all done using the tools described in section 3.4.1. The intention with the gathered results was to build a solid ground of information to be able to evaluate and find out which method and tuning had the biggest impact on the evaluation metrics.

4.1 Word2Vec

The first step was gathering information about tuning the Word2Vec algorithm. The following figures (4.1, 4.2 and 4.3) show the mean values and standard deviation of each relevance metric for different choices of window size. More variables were evaluated in the project (size, window, min_count, workers), but since the process was exactly the same for all variables only window size is included in the report, to serve as an example.

Figure 4.1 shows the precision values at depths 3, 5, 10 and 25 for window sizes 3, 5, 10, 15, 20 and 50. The precision can be seen to increase with the depth size, but window size has no apparent influence on precision. The same pattern can be observed for recall, as shown in figure 4.2, and for f-measure, as shown in figure 4.3.

Observing the means in the resulting graphs of section 4.1, it looks like the window size had no apparent impact on the relevance scores. To confirm this an ANOVA test was performed.


The null hypothesis was that there is no significant difference between the mean relevance scores measured for the different window sizes passed to the Word2Vec algorithm.

The alternative hypothesis was that there is a measurable difference, beyond random causes, in mean relevance score between the different window sizes passed to the Word2Vec algorithm.

The results from running an ANOVA test on all the observed means for each depth value can be observed in table 4.1.

Figure 4.1: Word2Vec Window Size Results (precision for window sizes 3, 5, 10, 15, 20 and 50 at depths 3, 5, 10 and 25)


Figure 4.2: Word2Vec Window Size Results (recall for window sizes 3, 5, 10, 15, 20 and 50 at depths 3, 5, 10 and 25)

Figure 4.3: Word2Vec Window Size Results (f-measure for window sizes 3, 5, 10, 15, 20 and 50 at depths 3, 5, 10 and 25)


Table 4.1: ANOVA Word2Vec Optimization

Precision
Depth      3      5      10     25
P-value    0.869  0.998  0.998  0.994

Recall
Depth      3      5      10     25
P-value    0.872  0.996  0.997  0.988

F-measure
Depth      3      5      10     25
P-value    0.856  0.997  0.998  0.994

As we can see from table 4.1, all the P-values gathered from the test runs are too large (larger than 0.05). This means we cannot reject the null hypothesis, and thus we cannot say that the measured differences in mean values are caused by the change in window size.

The results suggest that the window size parameter does not impact results in a meaningful way. This led to using window size 15 for the optimized model, as it seemed to increase f-measure for medium depth levels when observing the results in figure 4.3. After following the same procedure also for the other Word2Vec parameters, the optimized model used the values shown in table 4.2.

Table 4.2: Word2Vec Optimized Parameters

size   window   min_count   workers
150    15       2           10

4.2 Collaborative Filtering

The second step was gathering information about how different implementations of the collaborative filtering algorithm influenced the measured relevance score.

Figures 4.4, 4.5 and 4.6 show the average precision, recall and f-measure for all tested algorithms at depth levels 3, 5, 10 and 25.

The KNN-based algorithms show the most consistent results. The KNNBaseline algorithm does not see a significant increase in performance over the depth levels for any of the measurement variables.


KNNWithMeans and KNNWithZScore have very similar results for precision, recall and f-measure, getting increasingly better as the depth level increases.

The NMF algorithm shows a slight increase in precision and recall score (figures 4.4 and 4.5) with an increase in depth level. Looking at f-measure (figure 4.6), a very poor result is shown for depth level 3, but it improves significantly for higher depths.

The SVD algorithms seem to perform very inconsistently. They both have a very big standard error for the precision measures, except for SVD at depth levels 10 and 25 (figure 4.4). SVD seems to perform better when it comes to recall (figure 4.5), getting better in relation to the depth level. The SVD++ algorithm shows the opposite behavior, however, getting worse. In the final figure (4.6) the results are more consistent. SVD performs better with the depth level but suddenly drops at depth level 25. The SVD++ algorithm sees a very small increase in f-measure together with the depth level, but with a high standard error.

Observing the means in the resulting graphs (4.4, 4.5 and 4.6), it looks like the choice of collaborative filtering algorithm had no apparent impact on the relevance scores, except for maybe the standard deviation. To confirm this an ANOVA test was performed.

The null hypothesis was that there is no significant difference between the mean relevance scores measured for the different chosen collaborative filtering algorithms.

The alternative hypothesis was that there is a measurable difference, beyond random causes, in mean relevance score between choosing the different collaborative filtering algorithms.

The results from running an ANOVA test on all the observed means for each depth value can be observed in table 4.3.


Figure 4.4: Collaborative Filtering Precision (KNNBaseline, KNNWithMeans, KNNWithZScore, NMF, SVD and SVD++ at depths 3, 5, 10 and 25)


Figure 4.5: Collaborative Filtering Recall (KNNBaseline, KNNWithMeans, KNNWithZScore, NMF, SVD and SVD++ at depths 3, 5, 10 and 25)


Figure 4.6: Collaborative Filtering F-measure (KNNBaseline, KNNWithMeans, KNNWithZScore, NMF, SVD and SVD++ at depths 3, 5, 10 and 25)

Table 4.3: Collaborative Filtering Optimization

Precision
Depth      3      5      10     25
P-value    0.861  0.926  0.876  0.999

Recall
Depth      3      5      10     25
P-value    0.949  0.967  0.939  0.999

F-measure
Depth      3      5      10     25
P-value    0.866  0.921  0.869  0.999

As we can see from table 4.3, all the P-values gathered from the test runs are too large (greater than 0.05). This means we cannot reject the null hypothesis, and thus we cannot say that the measured differences in mean values are caused by the change in algorithm. Further runs of the optimized version of collaborative filtering used KNNWithMeans, as it seemed to deliver the most consistent results when observing figure 4.6.

4.3 Model Comparison

After optimizing the individual models and implementing the combined model, the following results were gathered for each relevance metric.

Figures 4.7, 4.8 and 4.9 show each model's precision, recall and f-measure side by side with the other models at depth levels 3, 5, 10 and 25.

Word2Vec seems to follow the same trend for all three measurements, showing increased performance in relation to an increase in depth level. However, the standard deviation also increases, suggesting that the model's performance is not very precise. It consistently outperforms the baseline model.

Collaborative filtering shows a more consistent trend, not being affected by the depth level (except for small depth levels in precision, figure 4.7). It performs better than both the baseline model and Word2Vec at depth levels 3 and 5, but at depth levels 10 and 25 Word2Vec shows better numbers.

The combined model performs better than the other models at depth levels 3 and 5 in all three figures, with its largest advantage in precision (as seen in figure 4.7). At depth level 10, however, the model seems to match Word2Vec on all three measurements while outperforming collaborative filtering, and at depth level 25 it performs worse than Word2Vec but better than collaborative filtering. The combined model does not seem to see a significantly increased performance when increasing the depth level.

The statistical analysis used to determine whether or not the introduced machine learning algorithm and collaborative filtering caused a significant change in relevance score was performed on the data behind the graphs in section 4.3. The following results were observed:

The null hypothesis was that there is no significant difference between the mean relevance scores measured for the different models.

The alternative hypothesis was that there is a measurable difference, beyond random causes, in mean relevance score between choosing the different models.


The results from running an ANOVA test on all the observed means for each depth value can be observed in table 4.4.

Figure 4.7: Model Comparison Precision (Baseline Model, Word2Vec, Collaborative Filtering and Combined models at depths 3, 5, 10 and 25)

Figure 4.8: Model Comparison Recall (Baseline Model, Word2Vec, Collaborative Filtering and Combined models at depths 3, 5, 10 and 25)


Figure 4.9: Model Comparison F-measure (Baseline Model, Word2Vec, Collaborative Filtering and Combined models at depths 3, 5, 10 and 25)

Table 4.4: Comparing Models

Precision
Depth      3         5       10        25
P-value    9.22e-11  0.0014  3.23e-06  1.25e-05

Recall
Depth      3         5       10        25
P-value    7.12e-09  0.0017  4.93e-06  6.47e-06

F-measure
Depth      3         5       10        25
P-value    7.37e-11  0.0012  2.45e-06  1.11e-05

As can be seen in table 4.4, all P-values are small enough to confidently reject the null hypothesis and thus accept the alternative hypothesis that changing the model does impact the relevance score.

The last statistical analysis compared the optimized individual models to the combined model, to see if the combined version had a significant impact on the result score. The following results were gathered after performing an ANOVA test on the data behind the graphs in section 4.3, excluding the data for the baseline version.


The null hypothesis was that there is no significant difference between the mean relevance scores measured for the different models.

The alternative hypothesis was that there is a measurable difference, beyond random causes, in mean relevance score between choosing the different models.

The results from running an ANOVA test on all the observed means for each depth value can be observed in table 4.5.

Table 4.5: Comparing Combined Version

Precision
Depth      3         5       10     25
P-value    7.26e-05  0.0195  0.530  0.0168

Recall
Depth      3       5      10     25
P-value    0.0009  0.029  0.541  0.0135

F-measure
Depth      3         5       10     25
P-value    6.26e-05  0.0165  0.518  0.0161

The results in table 4.5 are not unanimous and differ between the depth levels. For depth levels 3 and 5 the P-value is very small, so the null hypothesis can be rejected and we can assume the change is significant. For depth level 10 the P-value is not small enough to confidently reject the null hypothesis, so we cannot assume the change in score to be significant. Depth level 25 has a small enough P-value to reject the null hypothesis, meaning the combined algorithm performs significantly worse than Word2Vec, as can be seen in the graphs in section 4.3.


Chapter 5

Discussion

The goal of this project was to determine whether or not implementing a known machine learning algorithm on top of a baseline search engine implementation significantly increases the relevance of the search results. A suggested combination of two models was also examined to see if further improvements were possible. After tuning the algorithms and gathering the needed results, it was possible to determine the significance of the relevance change for the introduced models. Earlier studies hinted that introducing machine learning techniques into existing information retrieval systems improves result ranking, for instance the case where Fiorini et al. attempted to improve PubMed's result ranking [22]. Roy et al. [23] also successfully improved result ranking using Word2Vec for query expansion. Comparing the results for Word2Vec and collaborative filtering against the baseline implementation, a significant improvement can also be observed in this project, confirming the earlier observations.

When looking at the resulting graphs, the scores may seem low. This is most likely due to the evaluation data coming from the completed procurements. The tender strategists are very picky when it comes to choosing a food item for a single claim, only allowing a maximum of around 3 items per claim. This makes it very hard for the search engine to find relevant documents. This should, however, not affect the possibility to draw conclusions from these results, as all implemented models were tested and run under the same conditions.


5.1 Comparing models

Even though both collaborative filtering and Word2Vec have been confirmed to improve result ranking, they seem to behave quite differently. Word2Vec seems to keep improving as the depth value increases when studying the graphs in section 4.3, albeit with a higher deviation from the average as the depth value increases. This seems true for all three benchmarking values.

Collaborative filtering seems to follow a different pattern, however. For recall and f-measure, CF performs really well for small depth values but does not improve at all as the depth value increases. This suggests that CF can confidently find and rank relevant documents effectively for small result sets, but giving the algorithm more documents to recommend does not help at all. For the precision measure, CF sees a slight improvement going from depth 3 to depth 5, but given the large deviation at depth 3 that might be a coincidence. A possible explanation as to why CF follows this pattern could be that the collaborative filtering implementation is designed to only rank documents it finds interesting for the user higher, while other documents get no rank boost at all. So if CF only finds 3 to 5 interesting documents, those are the documents that will get a high rank, and the other documents are assumed uninteresting. This can explain why CF sees good performance for smaller result sets but no improvement for larger sets.

This suggests that in a setting where the user expects few but accurate results, CF seems to be the better option, but in a scenario where the user wants to be able to select from a longer list of documents, Word2Vec might perform better.

5.2 Combined Model

The combined model only offers a significant improvement compared to the other algorithms for small depth values (3 and 5). For larger result lists (10 and 25) the difference cannot be confirmed as significant. Looking at all three measurements at depth level 25, it rather looks like Word2Vec might perform better without being combined with collaborative filtering. That is not confirmed, however, as this project only analysed this from the perspective of the combined model.

Comparing the combined model to Word2Vec and CF, it seems that it inherited much of its characteristics from the CF algorithm, as stated in section 5.1. It definitely performs well for small depth values (even better than CF), but in contrast to CF, precision and recall do seem to improve for larger depth values, albeit very little. This is probably a result of combining with the Word2Vec algorithm, which improves over larger depth values, with the tradeoff of a much smaller improvement but a narrower deviation from the average.

This suggests that the combined model is most likely always preferable over CF, and also over Word2Vec (at least for medium sized depth values), as it offers more consistent results.

5.3 Sustainability/Ethics/Societal Impact

This project aims to find potential improvements of information gathering tools using well known methods. Information gathering tools are used by many, in both workplaces and homes. The results in this report can potentially make it easier for people creating similar tools to make an informed decision about which method to implement and how to do it. This would hopefully lead to a better experience for the people already using similar tools for either work or home use. However, there are potential problems with relying too much on the performance of search engines. In 2008 Hinman argued that search engines not only provide access to knowledge but even play a big role in the constitution of knowledge itself [26]. Trusting a search engine completely might lead to the end user not developing a deeper knowledge within the subject. In the case of the search engine in this report, many of the users are experts with great knowledge of the items present in the database. Before they had a sophisticated search engine, a lot of work had to be done by hand finding the correct items and their properties. The search engine will most likely help them save a lot of time finding the correct items, and because of their expert knowledge they can easily determine whether the search engine's results are valid. A new user, however, might not have this previous knowledge and end up trusting the search engine more. This could lead to new users not developing the same professional level of knowledge as the more experienced users.


One aspect not examined in this report is whether the suggested algorithms introduce some type of bias towards certain types of search results. Hypothetically, the implementation might discriminate against new information added to the data source, preferring to rank older documents with more gathered user feedback higher. The semantic analysis might lead to the system preferring certain topics or words above others, which in certain contexts can potentially be harmful. It would be interesting to see further analysis of whether or not this could become a problem. There have been analyses of Google News' recommendation engine to evaluate whether or not the so called filter bubble hypothesis is valid. The paper came to the conclusion that concerns about algorithmic filter bubbles in online news might be exaggerated. However, they could see that Google News had a general bias towards certain news outlets and under-represented others [27].

5.4 Future Work

One downside of this project is that it is limited to one data source. A future improvement of this analysis could be to evaluate this experiment on other domains than food items. The project is intended to be agnostic towards the data source and to limit the affecting factors to the implementations of the algorithms only. However, this has not been evaluated or proven to be the case. Therefore, it would be interesting to see whether or not the same conclusions can be drawn for other types of data sources.

Another future work that was outside the scope of this project could be measuring execution, training and build times. An argument for using a baseline search engine could be that it is fairly easy and quick to set up and delivers fast and reasonable results. Introducing complexity as suggested in this report improves search ranking, but it was never studied whether the search engine suffered in execution speed or setup time. Training machine learning algorithms can be time costly, and more complex queries and ranking could potentially take longer to process.


Chapter 6

Conclusions

To conclude the project, we look back at the results gathered and the original research question stated in the beginning. Starting with the first question, What is the measurable impact of known machine learning algorithms on the search engine's relevance score, and how does it compare to collaborative filtering?, it was shown that compared to a standard tf-idf search engine implementation the impact is measurably better both for the machine learning algorithm and for collaborative filtering. This was shown by running optimized implementations of Word2Vec and collaborative filtering and analysing their relevance scores with statistical analysis methods. For the second part of the question, Can these methods be combined to achieve even better performance?, it was shown that for smaller depth sizes (3 and 5) a combined model had a significantly better relevance score. However, at depth level 10 there was no provable difference in performance, and at depth level 25 the performance got worse.

Today it is probably easier than ever to set up a search engine, with companies like Elastic, as mentioned in this report, providing free solutions for anyone to install and try out. Using a standard setup will likely satisfy the basic performance needs of an information retrieval system. However, as shown in this study, if you are willing and able to implement a more sophisticated semantic machine learning algorithm, a recommendation algorithm, or a combination of the two to complement the standard solution, it is possible to significantly improve the performance of the search engine.


Bibliography

[1] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. "Introduction to information retrieval". In: Natural Language Engineering 16.1 (2010), pp. 100–103.

[2] Delfi Marknadspartner AB. About Dabas. 2019. URL: https://www.dabas.com/publicdb/om.aspx (visited on 07/15/2019).

[3] solid IT. DB-Engines Reference. 2020. URL: https://db-engines.com/en/ranking/search+engine (visited on 02/15/2020).

[4] Elasticsearch NV. Elasticsearch Reference. 2020. URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html (visited on 02/15/2020).

[5] Splunk. Splunk Reference. 2020. URL: https://docs.splunk.com/Documentation/Splunk/8.0.3/Overview/AboutSplunkEnterprise (visited on 02/15/2020).

[6] The Apache Foundation. Solr Reference. 2020. URL: https://lucene.apache.org/solr/features.html#top (visited on 04/13/2020).

[7] Xavier Ochoa and Erik Duval. "Relevance ranking metrics for learning objects". In: IEEE Transactions on Learning Technologies 1.1 (2008), pp. 34–48.

[8] Joeran Beel et al. "Research-paper recommender systems: a literature survey". In: International Journal on Digital Libraries 17.4 (2016), pp. 305–338.

[9] Ivens Portugal, Paulo Alencar, and Donald Cowan. "The use of machine learning algorithms in recommender systems: A systematic review". In: Expert Systems with Applications 97 (2018), pp. 205–227.

[10] Aurangzeb Khan et al. "A review of machine learning algorithms for text-documents classification". In: Journal of Advances in Information Technology 1.1 (2010), pp. 4–20.

[11] Tom Young et al. "Recent trends in deep learning based natural language processing". In: IEEE Computational Intelligence Magazine 13.3 (2018), pp. 55–75.

[12] David M Blei, Andrew Y Ng, and Michael I Jordan. "Latent Dirichlet allocation". In: Advances in Neural Information Processing Systems. 2002, pp. 601–608.

[13] Thomas K Landauer et al. Handbook of Latent Semantic Analysis. Psychology Press, 2013.

[14] Yoav Goldberg and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method". In: arXiv preprint arXiv:1402.3722 (2014).

[15] Tomas Mikolov et al. "Efficient estimation of word representations in vector space". In: arXiv preprint arXiv:1301.3781 (2013).

[16] Maryam Khanian Najafabadi et al. "Improving the accuracy of collaborative filtering recommendations using clustering and association rules mining on implicit data". In: Computers in Human Behavior 67 (2017), pp. 113–128.

[17] Deuk Hee Park et al. "A literature review and classification of recommender systems research". In: Expert Systems with Applications 39.11 (2012), pp. 10059–10072.

[18] Chun-Xia Yin and Qin-Ke Peng. "A careful assessment of recommendation algorithms related to dimension reduction techniques". In: Knowledge-Based Systems 27 (2012), pp. 407–423.

[19] Antonio Hernando, Jesús Bobadilla, and Fernando Ortega. "A non negative matrix factorization for collaborative filtering recommender systems based on a Bayesian probabilistic model". In: Knowledge-Based Systems 97 (2016), pp. 188–202.

[20] Thorsten Joachims and Filip Radlinski. "Search engines that learn from implicit feedback". In: Computer 40.8 (2007), pp. 34–40.

[21] Hina Agrawal and Sunita Yadav. "Search Engine Results Improvement – A Review". In: 2015 IEEE International Conference on Computational Intelligence & Communication Technology. IEEE. 2015, pp. 180–185.

[22] Nicolas Fiorini et al. "Best Match: new relevance search for PubMed". In: PLoS Biology 16.8 (2018), e2005343.

[23] Dwaipayan Roy et al. "Using word embeddings for automatic query expansion". In: arXiv preprint arXiv:1606.07608 (2016).

[24] Yao Xiao and Quan Shi. "Research and implementation of hybrid recommendation algorithm based on collaborative filtering and word2vec". In: 2015 8th International Symposium on Computational Intelligence and Design (ISCID). Vol. 2. IEEE. 2015, pp. 172–175.

[25] Michael Gordon and Praveen Pathak. "Finding information on the World Wide Web: the retrieval effectiveness of search engines". In: Information Processing & Management 35.2 (1999), pp. 141–180.

[26] Lawrence M Hinman. "Searching ethics: The role of search engines in the construction and distribution of knowledge". In: Web Search. Springer, 2008, pp. 67–76.

[27] Mario Haim, Andreas Graefe, and Hans-Bernd Brosius. "Burst of the filter bubble? Effects of personalization on the diversity of Google News". In: Digital Journalism 6.3 (2018), pp. 330–343.


TRITA EECS-EX-2020:586

www.kth.se

