
Topics in Advanced Information Retrieval

Carrie Hall, 7179881 University of Manchester

Abstract. In this essay I will discuss the current nature of search engines

and potential future improvements to search with particular emphasis on

searching in the public domain.

“I have a dream for the Web [in which computers] become capable of analyzing all the data on the

Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which

should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade,

bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent

agents’ people have touted for ages will finally materialize“ - Tim Berners-Lee, 1999

Tim Berners-Lee was not the first person to envisage a world in which information was easily accessible to both humans and machines; the idea was introduced by Vannevar Bush in the 1940s with his concept of a Memex machine[1]. Then, as now, the challenge lay in how to organise an increasing amount of varied data so that it is easily retrievable. This data is growing at an enormous rate: every two minutes a new bio-medical article is published[2], 48 hours of video are uploaded to YouTube[3], and nearly 100,000 tweets are sent[4]. Combine this with the number of web searches made (Google handled 4 million searches every two minutes in 2009[5]) and it becomes clear that search is an important part of our use of the web.

The aim of all search engines is to return accurate and complete results to a user's query[6]. A simplified version of how a general-purpose search works is as follows. The search engine crawls the web fetching pages and indexes them based on their content. The keywords from a user query are then matched against these indexes, and the user is shown a relevance-ranked list of pages that should contain what they are looking for. Precision and recall can be used to measure the usefulness of results: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. There is a trade-off between precision and recall[7], which means that the aim of being both perfectly accurate and complete is impractical.
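These two measures can be stated concretely. A minimal Python sketch (the document identifiers are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A query retrieves 4 documents; 3 of them are among the 6 truly relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d9"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)  # 0.75 0.5
```

Retrieving more documents to raise recall typically drags in more irrelevant ones and lowers precision, which is the trade-off described above.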

Problems can arise in every aspect of a search; these will be split into the following broad sections in order to address them:

1. User Input – what did the user type in?

2. User Meaning – what did the user mean by what they typed?

3. Formulating results – what did the user want returned?

User Input

There is a need to handle misspellings and incorrect grammar. Choosing how to effectively

interpret the words and phrases used within the search, which includes the need to handle

multiple languages, can be even more challenging.
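One common way to handle misspellings is to map a query term to the closest word in the index vocabulary by edit distance. A minimal sketch, assuming plain Levenshtein distance (a real engine would also weight candidates by frequency and keyboard proximity):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Return the vocabulary term closest to the (possibly misspelled) query word."""
    return min(vocabulary, key=lambda v: edit_distance(word, v))

print(correct("retreival", ["retrieval", "reversal", "recital"]))  # retrieval
```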


User Meaning

Understanding the intention of the user can be difficult, particularly if the user does not provide enough information in their query or if the query is ambiguous; for example, the word ‘apple’ could refer to Apple Inc or to the fruit, depending on context.

Formulating results

Arguably the biggest potential problem lies with the number of pages that are retrieved. Even if relevant pages are found, they can easily be hidden amongst thousands of mildly relevant or irrelevant results[8].

Search engines are getting better every day, and some work has been undertaken to make them more efficient, reliable and useful.

User input can be improved by making it easier for people to find what they are looking for regardless of what they type in. Two examples are the ‘Suggest’ and ‘Google Instant’ features introduced towards the end of 2009. Google Instant in particular is interesting because it is not search-as-you-type, as it might appear; rather, it is search-before-you-type: the results for the most likely search, given what the user has typed so far, are shown[9]. Google claim this feature is useful because people type slowly but read quickly[10], but it has received negative responses from more advanced users, who claim that it is slow and causes more frustration than a regular search[11]. These features improve the user experience of search, but they do not change the fundamental nature of search insofar as the results are still a set of documents which the user must manually navigate to find the information they want.

To aid in user meaning, Google has introduced advanced features to its search, from automatic calculations, unit conversions and dictionary definitions to more complex searches using wildcards1[13]. This means that simple queries can be resolved without needing to present the user with a series of documents.

Search engines can decrease the number of irrelevant documents retrieved for a query by making it more difficult for web sites to improve their ranking by ‘faking’ the content of their site using keyword-spamming and hidden text[14, 15]. Another potential way to improve results is by searching for synonyms of the words the user originally typed; for example, “computer table” will also bring back results for “computer desk”. WordNet is an English lexical database designed for finding lexical concepts from keywords programmatically, so it could help achieve this purpose[16]. Several years ago Google acquired Oingo, a meaning-based search engine which implements synonym matching, suggesting that they wanted to integrate it into their own search engine[17]. A forum post from 2009 claimed that results for ‘vegetarian recipes’ were being shown when ‘vegan recipes’ was searched for[18], but upon recent exploration this does not seem to occur, which could show that Google’s search algorithm ‘learns’ from user query data.
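The synonym-expansion idea can be sketched in a few lines. The synonym table here is invented; a real system might populate it from WordNet:

```python
# Hypothetical synonym table; a real system might derive one from WordNet.
SYNONYMS = {"table": {"desk"}, "desk": {"table"}}

def expand_query(terms):
    """Expand a keyword query with the known synonyms of each term."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

print(sorted(expand_query(["computer", "table"])))  # ['computer', 'desk', 'table']
```

Documents matching any of the expanded terms can then be retrieved, trading a little precision for extra recall.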

1 Wildcard: a character that can be used as a substitute for any of a class of characters in a search, thereby greatly increasing the flexibility and efficiency of searches[12]. In the context of a Google search this means that a search for ‘Isaac Newton discovered *’ would return constructive results.


In addition to using synonyms, another way that search engines can expand the user query programmatically is by using stemming. The variation found in words is too great for a simple term-matching algorithm to handle[19]. Consider irregular verbs or plurals: if a user typed ‘daffodils’ they would likely want to find pages containing the term ‘daffodil’. Stemming breaks words down into their base form, which is then searched, and usually retrieves more results[19]. Many algorithms use this approach; Microsoft’s new Bing search uses a form of n-gram algorithm which breaks words down into subsequences of items such as syllables or phonemes[20], and matches documents on that basis.
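Both techniques can be illustrated with crude sketches. The suffix-stripping rules and the trigram size below are arbitrary simplifications, not Porter's algorithm or Bing's actual model:

```python
def crude_stem(word):
    """Strip a common suffix to approximate the base form (greatly simplified)."""
    for suffix in ("ations", "ation", "ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def char_ngrams(word, n=3):
    """Break a word into overlapping character n-grams for fuzzy matching."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

print(crude_stem("daffodils"))       # daffodil
print(sorted(char_ngrams("apple")))  # ['app', 'ple', 'ppl']
```

Indexing stems (or n-grams) instead of surface forms lets ‘daffodils’ and ‘daffodil’ match the same index entry.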

In other areas, such as the e-science domain, simple keyword-matching searches are not sufficient. The e-science community has driven many of the advances in the Semantic Web, as its members collect vast amounts of diverse data and often need quick and correct answers; it is in this type of environment that failure is least tolerated.

In these complex domains the use of ontologies is prevalent, which, it has been argued, proves that the Semantic Web is achievable[21]. Ontologies can be defined as “explicit formal specifications of the terms in the domain and relations among them”[22]. In essence, they break a complex domain down into a hierarchical structure with rich information about the interactions between objects/resources. They are used in the context of the Semantic Web because they are machine-readable and can be used with an inference engine to obtain information about resources in the domain[6].
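The hierarchical structure, with a minimal inference step over it, can be sketched as follows; the concepts and is-a links are invented for illustration:

```python
# A toy ontology fragment: each concept maps to its parent concept
# (a subsumption hierarchy). Concept names are invented.
IS_A = {"sea_grass": "aquatic_plant", "aquatic_plant": "plant", "plant": "organism"}

def ancestors(concept):
    """Follow is-a links upward: a minimal inference over the hierarchy."""
    out = []
    while concept in IS_A:
        concept = IS_A[concept]
        out.append(concept)
    return out

print(ancestors("sea_grass"))  # ['aquatic_plant', 'plant', 'organism']
```

An inference engine generalises this: a query for ‘plant’ can then also retrieve resources annotated with the more specific concept ‘sea_grass’.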

An example of a large ontology in use today is the National Cancer Institute’s Thesaurus and Ontology, which contains over 90,000 concepts and is still growing[23]. It contains descriptions of genes, diseases, drugs and chemicals, anatomy, organisms, and proteins, as well as the semantic relationships between them. It uses the Web Ontology Language (OWL) to describe its resources. OWL is a W3C recommendation which builds on RDF using XML and is used to express hierarchies and relationships between resources (ontologies)[24].

Noesis is a semantic search engine that uses domain ontologies specifically designed for

environmental scientists. Noesis allows the expert user to filter their search based on attributes

of the object they are searching for. An example of this would be searching for sea grass and

being able to filter by taxonomy, location or water type[25]. Results outside the ontology (e.g.

moisturisers containing sea grass) are removed from the results. Filtering would be an

important feature for users to have as it would mean being able to use the search engine to find

things such as flights, car hire and insurance, rather than needing to go to several sites.

Ontologies can be used in the context of information retrieval in three ways[26]: firstly, by domain experts to represent their domain knowledge; secondly, by web users to annotate web resources more efficiently; and thirdly, by end-users searching the web with queries based on the ontology. Web resources that are semantically related in the ontology will be retrieved, but this can mean that end-users need a basic understanding of the ontology used.

Another criticism of using ontologies is that relying on users to create metadata is difficult; there are not experts in every subject willing to create an Ontology of Everything, and doing so would take a significant amount of time[27]. Automatic annotation could make this more feasible, but it is problematic in itself because it is difficult to verify.

The Resource Description Framework (RDF) is a language used with OWL and is specifically designed for describing resources on the web, which is why it is often cited as one of the key languages in the future of the Semantic Web[24]. It standardises the way things (such as price) are described, which could lead to search engines being able to filter results more effectively. It does this by storing data in three-part statements called triples. A triple contains a subject, predicate and object, which use URIs2 to identify resources. An example of an RDF triple, with each part labelled, is:

The secret agent [subject/resource] – is [predicate/property] – Niki Devgood [object/value]
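In code, a triple store reduces to pattern matching over (subject, predicate, object) tuples. A minimal sketch with invented data, ignoring URIs for readability:

```python
# Triples stored as (subject, predicate, object) tuples; the data is invented.
TRIPLES = [
    ("TheSecretAgent", "is", "Niki Devgood"),
    ("Niki Devgood", "type", "Person"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in TRIPLES
            if subject in (None, t[0])
            and predicate in (None, t[1])
            and obj in (None, t[2])]

print(match(subject="TheSecretAgent"))
# [('TheSecretAgent', 'is', 'Niki Devgood')]
```

Query languages such as SPARQL generalise exactly this pattern-with-wildcards idea across very large triple stores.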

An interesting argument against the idea that RDF will be a big player in the Semantic Web involves looking at statistics from Google Trends. Evidence from a report in 2006 suggests that RDF is not a popular search; in fact, more people search Google for ‘Prolog’ and ‘Fortran’[27]. An updated report confirms the reduction in interest in this technology (see Figure 1). This report also indicates that there is more interest in other technologies, such as AJAX3, judging by the number of books and blog posts on the subject.

There has been recent work on a search engine that queries the web like a database[28]. The idea is that the queries are simple to write but the results are complex. It is still in the early stages and only has 10 entities (namely Person, Company, University, City, Novel, Player, Club, Song, Film and Award), but it works by telling the system which entities to look for and what the relations are between them[29]. This example query is taken from the project website and demonstrates the complex questions that can be answered using this approach; the actual query is shown in Figure 2 in the appendix:

"Find an Australian actress, an Academy Award winning film, and a Grammy Award winning song, where the actress stars the film and the song is the theme of that film”

Results returned:

Nicole Kidman, Batman Forever, Kiss from a Rose

Melanie Griffith, Titanic (1997 film), My Heart Will Go On

Mia Farrow, Midnight Cowboy, Without You

This kind of complex result would have taken several queries and a lot of time for a human to produce using a current search engine, but due to the complex way in which the query needs to be formulated it may never reach mainstream use.
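The evaluation of such an entity-relationship query amounts to a join across entity tables. A toy Python sketch over a hand-built knowledge base (the tables and their contents are illustrative, not the actual system's data):

```python
# Invented mini knowledge base; the real system resolves entities from the web.
NATIONALITY = {"Nicole Kidman": "Australian", "Mia Farrow": "American"}
STARS_IN = {"Nicole Kidman": ["Batman Forever"], "Mia Farrow": ["Rosemary's Baby"]}
THEME_OF = {"Kiss from a Rose": "Batman Forever"}

def australian_actress_film_song():
    """Join three relations: actress -> film she stars in -> that film's theme song."""
    results = []
    for actress, nationality in NATIONALITY.items():
        if nationality != "Australian":
            continue
        for film in STARS_IN.get(actress, []):
            for song, film_of_song in THEME_OF.items():
                if film_of_song == film:
                    results.append((actress, film, song))
    return results

print(australian_actress_film_song())
# [('Nicole Kidman', 'Batman Forever', 'Kiss from a Rose')]
```

The difficulty for mainstream use is not the join itself but requiring users to specify the entities and relations explicitly.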

It is important to take a moment to explore the human motivations for using search, and how

people go about doing it. The ‘Principle of least effort’[30] implies that information-seeking users will use the most convenient search method available to them, so it can be argued that this search engine would deter users with its complex, multipart interface. The Wolfram Alpha search engine uses a ‘Google-style’ query box, which may encourage more users to use it. Advanced users (such as those in the e-science domain) have a lot of experience with complex query structuring, so they might be averse to this sort of open-ended query input, as it does not map onto their mental model of a complex search engine.

2 URI: Uniform Resource Identifier, used to uniquely identify resources

3 AJAX: Asynchronous JavaScript and XML

It has been suggested that there is an overlap between knowledge representation and model-driven architecture[31], and that this could be used as a backbone to power a new Semantic Web. The goal of model-driven architecture (MDA) is to separate business applications from the technologies used for implementation. It relies heavily on metadata created by the Meta-Object Facility (MOF). MDA technologies could be used as a foundation for ontology modelling in the Semantic Web, as both MDA and Semantic Web languages share similar specifications (such as subsumption relations and relationships between classes)[31]. Furthermore, UML4 and MOF are graphical, and it may be more straightforward for experts to create ontologies using these tools than with a knowledge representation language. The role of the Semantic Web would then be to reason about these resources, and it would not be concerned with the complex task of managing the models. The Object Management Group (OMG) has created an Ontology Definition Metamodel specification which enables the modelling of ontologies using UML tools[32].

As previously mentioned, simple query results (like mathematical calculations and unit conversions) can be pulled from sources and displayed to the user. A more semantic search engine could try to do this for more complex queries, such as “winner of 2011 australian tennis open”[33], rather than simply identifying potentially useful pages by matching the query entered. Two examples of search engines attempting this are Sensebot5 and Wolfram Alpha6, although both only cover a relatively small, selected domain. They do, however, show that it is possible to have search results without a list of documents.

So far a lot of emphasis has been placed on finding textual answers to user queries, but what about semantic searching over images or video? In fact this is already appearing: nachofoto have created a semantic, time-based vertical image search engine[34], although it is only in beta and only contains information that is trending on the web.

Another way in which search engines could be more intelligent is in how they decide which results to filter when ambiguous searches are made, by asking the user a question such as ‘Are you searching for the company or the fruit?’ when presented with the term ‘apple’[35]. They could learn from large numbers of collected user responses in order to guess which result the user was looking for. The previously mentioned Wolfram Alpha interprets this search as the company Apple, but provides a very user-friendly way to change the interpretation to several others that are available.
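Learning from collected responses can be sketched very simply. The click log below is hypothetical; a real engine would also condition on the user's context:

```python
from collections import Counter

# Hypothetical log of which sense users chose after the clarifying question.
SENSE_CLICKS = Counter({("apple", "company"): 70, ("apple", "fruit"): 30})

def best_sense(term, senses):
    """Guess the most likely sense of an ambiguous term from past user choices."""
    return max(senses, key=lambda s: SENSE_CLICKS[(term, s)])

print(best_sense("apple", ["company", "fruit"]))  # company
```

Each new user answer increments the corresponding counter, so the default interpretation tracks the majority of users over time.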

One of the biggest areas of growth in recent years is the social web, which refers to web sites that are driven by user participation, such as Wikipedia, YouTube and Flickr[36]. This is a good form of knowledge sharing, especially in newer areas of interest that do not have a defined structure that could be mapped into an ontology. User participation can be used to create and update metadata, which aids in the retrieval of items of interest. Therefore, as more people contribute the system becomes more useful, and the added information may be previously unknown[36].

4 UML: Unified Modelling Language, see http://www.uml.org/

5 http://www.sensebot.net/

6 http://www.wolframalpha.com/


TipTop7 is a social search engine that provides sentiment analysis of a subject, classifying opinions as positive or negative[33, 37]. Twitter in particular is heavily used by this search engine. Twitter has been shown to influence public collective decision-making and even to predict economic changes; in particular, a recent study showed that integrating Twitter sentiment into stock market prediction increased accuracy from 73.3% to 86.7%[38]. Sentiment analysis could assist search engines when deciding how to rank documents or which information to present to a user first.
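At its simplest, sentiment classification counts hits against an opinion lexicon. A minimal sketch with a tiny, invented lexicon (real systems use far larger lexicons or trained models):

```python
# A tiny, invented opinion lexicon; real systems use much larger resources.
POSITIVE = {"great", "love", "useful"}
NEGATIVE = {"slow", "hate", "broken"}

def classify(text):
    """Label a post positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("I love this engine and find it useful"))  # positive
```

A ranking function could then boost results whose surrounding posts skew positive, or group results by opinion as TipTop does.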

Conclusion

The amount of content on the web is immense and it is growing rapidly. It can be argued that using ontologies is very effective for domain-specific tasks where the users are experts, but an ontology that covers the whole world wide web seems unlikely. Ontologies need to be decided upon, implemented and maintained, all of which are very large tasks in themselves. It has been suggested that the only way the Semantic Web can succeed is by using several ontologies from different communities[21], which, although feasible, will be demanding and time-consuming.

Motivation for searching is another factor in deciding how a search engine should function: more advanced users who rely on the search engine to perform their job are likely to influence how it evolves. It is these users who have pushed changes in areas such as e-science, which suggests that these areas will continue to evolve and adapt to the growing amount of data. When developing any sort of software system, often the most difficult thing is getting the users to use it. In this respect, it is unlikely that any new public search engine will become widespread enough to overtake Google, Microsoft and Yahoo.

Furthermore, can a computer really mimic human behaviour? Just as the content on the web is changing rapidly, language and human behaviour also evolve and change. Perhaps by the time an ontology or a natural language processor is developed that is complex enough to be used on the general web, it will be out of date with both the users and the data. If all users sent well-formed information queries to a search engine then many problems would be solved; however, users tend to type only a few words and expect a result[39]. After all, two searches with few keywords are likely to take the same amount of time as one long query, but without the advantage of being able to see immediate results.

It is these subtleties in human behaviour that may be nearly impossible for a computer system to interpret, at least in the near future, but enough advances have been made to suggest that something resembling a Semantic Web is certainly achievable in the long term.

Word count: 2951

7 http://www.feeltiptop.com/


Appendix

Figure 1

Figure 2


References

1. Bush, V., As We May Think. Atlantic Magazine, 1945.

2. Nenadic, G., Data Integration and Analysis. 2010.

3. SiteImpulse. YouTube Facts & Figures. 2010; Available from: http://www.website-monitoring.com/blog/2010/05/17/youtube-facts-and-figures-history-statistics/.

4. Hachman, M., Twitter Tops 2 Billion Tweets Per Month, in PCMag. 2010.

5. comScore. comScore Reports Global Search Market Growth of 46 Percent in 2009. 2009; Available from: http://www.comscore.com/Press_Events/Press_Releases/2010/1/Global_Search_Market_Grows_46_Percent_in_2009.

6. Movva, S., Noesis: A Semantic Search Engine and Resource Aggregator for Atmospheric Science. American Geophysical Union, 2006.

7. Buckland, M., The relationship between Recall and Precision. Journal of the American Society for Information Science, 1994. Volume 45, Issue 1: p. 12-19.

8. Antoniou, G., A Semantic Web Primer. 2004, Massachusetts: Massachusetts Institute of Technology.

9. Google, Google Instant, behind the scenes, in Google Blog. 2009.

10. Google. About Google Instant. 2009; Available from: http://www.google.co.uk/instant/.

11. Mello, J., Google Instant: Pros and Cons, in PCWorld. 2009.

12. Linux Information Project. How to Use Wildcards. 2006; Available from: http://www.linfo.org/wildcard.html.

13. Google. Search Features. 2011; Available from: http://www.google.com/help/features.html.

14. Google, Google does not use the keywords meta tag in web ranking, in Google Webmaster Blog. 2009.

15. Google. Webmaster Guidelines. 2011; Available from: http://www.google.com/support/webmasters/bin/answer.py?answer=35769.

16. Miller, G., WordNet: A Lexical Database for English. Communications of the ACM, 1995. Volume 38, Issue 11.

17. Radhakrishnan, A., Oingo Meaning Engine, Semantic Search & Google, in Search Engine Journal. 2007.

18. Google. Vegan vs Vegetarian. 2009; Available from: http://www.google.com/support/forum/p/Web+Search/thread?tid=29ae6b6d77496dd3&hl=en.

19. Hull, D., Stemming Algorithms. Journal of the American Society for Information Science, 1995.


20. Gao, J., A Comparative Study of Bing Web N-gram Language Models for Web Search and Natural Language Processing. Proceeding of the 33rd Annual ACM SIGIR Conference, 2010.

21. Staab, S., The Semantic Web Revisited. IEEE Intelligent Systems, 2006, Issue May/June.

22. Gruber, T., A translation approach to portable ontology specifications. Knowledge Acquisition, 1993.

23. Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Oberthaler, J., Parsia, B., The National Cancer Institute's Thesaurus and Ontology. 2003.

24. Altova. What is the Semantic Web? 2005.

25. Physorg, Semantic science search engine knows that there is a difference. 2009.

26. Corby, O., Querying the Semantic Web with Corese Search Engine. 2004.

27. Zambonini, D., The 7 (f)laws of the Semantic Web, in xml.com. 2006.

28. Physorg, Grant to help researchers build better search engines. 2010.

29. Li, C. Entity-Relationship Queries. 2011; Available from: http://idir.uta.edu/erq/.

30. Zipf, G., Human Behaviour and the Principle of Least Effort. 1949.

31. Frankel, D., The Model Driven Semantic Web. Proceedings of First International Workshop on the Model-Driven Semantic Web, 2004.

32. Object Management Group (OMG), Ontology Definition Metamodel. 2009.

33. Hendler, J., Web 3.0: The Dawn of Semantic Search. IEEE Computer Society, 2010. Volume 43, Issue 1: p. 77-80.

34. Zaino, J., Semantic Image Search: Next Up For a Major Search Engine? 2011.

35. Physorg, Smarter than Google?, M. Breedveld, Editor. 2010.

36. Gruber, T., Collective knowledge systems: Where the Social Web meets the Semantic Web. Journal of Web Semantics, 2007. Volume 6.

37. TipTop. TipTop Search FAQs. 2009; Available from: http://beta.tiptopbest.com/faq.html.

38. Bollen, J., Twitter mood predicts the stock market. Journal of Computational Science, 2011. Volume 1.

39. Zaino, J., Exploring Search, in semanticweb.com. 2010.

