Topics in Advanced Information Retrieval
Carrie Hall, 7179881 University of Manchester
Abstract. In this essay I will discuss the current nature of search engines
and potential future improvements to search with particular emphasis on
searching in the public domain.
“I have a dream for the Web [in which computers] become capable of analyzing all the data on the
Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which
should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade,
bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent
agents’ people have touted for ages will finally materialize“ - Tim Berners-Lee, 1999
Tim Berners-Lee was not the first person to envisage a world in which information was easily
accessible to both humans and machines; the idea was introduced by Vannevar Bush in the
1940s with his concept of the Memex machine[1]. Then, as now, the challenge lay in how to
organise an ever-increasing amount of varied data so that it remains easily retrievable. This
data is growing at an enormous rate: every two minutes a new bio-medical article is
published[2], 48 hours of video are uploaded to YouTube[3], and nearly 100,000 tweets are
sent[4]. Combine this with the number of web searches made for that data (Google handled 4
million searches every two minutes in 2009[5]) and it becomes clear that search is a central
part of our use of the web.
The aim of all search engines is to return accurate and complete results to a user’s query[6]. A
simplified version of how a general-purpose search works is as follows. The search engine
crawls the web fetching pages and indexes them based on the content of the page. The
keywords from a user query are then matched to these indexes and the user is shown a
relevance-ranked list of pages that should contain what they are looking for. Precision and recall
can be used to measure the usefulness of results: precision is the fraction of retrieved
documents that are relevant, while recall is the fraction of all relevant documents that are
retrieved. There is a trade-off between the two[7], which implies that the aim of being both
perfectly accurate and perfectly complete is impractical.
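Using set notation, the two measures can be sketched concretely; the document-id sets below are invented for illustration:

```python
# A minimal sketch of precision and recall for a single query.

def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)              # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}              # what the engine returned
relevant = {"d1", "d4", "d7"}                     # what the user actually wanted
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 3 relevant documents were found
```

Returning more documents tends to raise recall while lowering precision, which is the trade-off described above.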
Problems can arise with every aspect of a search; these will be split into the following broad
sections in order to address them:
1. User Input – what did the user type in?
2. User Meaning – what did the user mean by what they typed?
3. Formulating results – what did the user want returned?
User Input
There is a need to handle misspellings and incorrect grammar. Choosing how to effectively
interpret the words and phrases used within the search, which includes the need to handle
multiple languages, can be even more challenging.
User Meaning
Understanding the intention of the user can be difficult, particularly if the user does not
provide enough information in their query or if the query is ambiguous; for example, the word
‘apple’ could refer to Apple Inc or to the fruit depending on context.
Formulating results
Arguably the biggest potential problem lies with the number of pages that are retrieved. Even if
relevant pages are found, they can easily be hidden amongst thousands of mildly relevant or
irrelevant results[8].
Search engines are improving every day, and considerable work has been undertaken to make
them more efficient, reliable and useful.
User input can be improved by making it easier for people to find what they are looking for
regardless of what they type in. Two examples are Google’s ‘Suggest’ and ‘Instant’ features.
Google Instant in particular is interesting because it is not search-as-you-type, as it might
appear, but search-before-you-type: the results for the most likely search, given what the user
has typed so far, are shown[9]. Google claim the feature is useful because people type slowly
but read quickly[10], yet it has received negative responses from more advanced users who
find it slow and more frustrating than a regular search[11]. These features improve the user
experience of search, but they do not change its fundamental nature: the results are still a set of
documents which the user must manually navigate to find the information they want.
To aid in user meaning, Google has introduced advanced features to its search, from automatic
calculations, unit conversions and dictionary definitions to more complex searches using
wildcards1[13]. This will mean that simple queries can be resolved without needing to present
the user with a series of documents.
Search engines can decrease the number of irrelevant documents retrieved for a query by
making it more difficult for web sites to improve their ranking by ‘faking’ the content of their
site by using keyword-spamming and hidden text[14, 15]. Another potential way to improve
results is by searching for synonyms to the word the user originally typed, for example
“computer table” will also bring back results for “computer desk”. WordNet is an English lexical
database which is designed for finding lexical concepts from keywords programmatically, so
could help achieve this purpose[16]. Several years ago Google acquired Oingo, a meaning-based
search engine that implements synonym matching, which suggests they wanted to integrate
this capability into their own engine[17]. A forum post from 2009 claimed that results for
‘vegetarian recipes’ were being shown when ‘vegan recipes’ was searched for[18], but on recent
exploration this no longer seems to occur, which could indicate that Google’s search algorithm
‘learns’ from user query data.
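Query expansion with synonyms can be sketched as follows; the tiny synonym map below is an invented stand-in for a real lexical resource such as WordNet:

```python
# Sketch of query expansion by synonym substitution. A real system
# would draw synonym sets from WordNet; this map is illustrative only.

SYNONYMS = {
    "table": {"desk"},
    "desk": {"table"},
}

def expand_query(query: str) -> set:
    """Return the set of query variants produced by substituting synonyms."""
    words = query.split()
    variants = {query}                      # always keep the original query
    for i, word in enumerate(words):
        for syn in SYNONYMS.get(word, set()):
            variants.add(" ".join(words[:i] + [syn] + words[i + 1:]))
    return variants

print(sorted(expand_query("computer table")))
# the expanded set now also covers "computer desk"
```

Each variant would then be matched against the index, so documents mentioning only “computer desk” are still retrieved.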
1 Wildcard: “a character that can be used as a substitute for any of a class of characters in a
search, thereby greatly increasing the flexibility and efficiency of searches”[12]. In the context
of a Google search this means that a search for ‘Isaac Newton discovered *’ would return
constructive results.
In addition to using synonyms, another way that search engines can expand the user query
programmatically is by using stemming. The variation found in words is too great for a simple
term-matching algorithm to handle[19]. Consider irregular verbs or plurals: if a user typed
‘daffodils’ they would likely also want pages containing the term ‘daffodil’. Stemming breaks
words down into their base form, which is then searched and usually retrieves more
results[19]. There are many algorithms that take this approach; Microsoft’s Bing search uses a
form of n-gram algorithm which breaks words down into subsequences of items such as
syllables or phonemes[20] and matches documents on that basis.
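A deliberately simplified suffix-stripping stemmer illustrates the idea; production engines use more careful algorithms (such as Porter stemming), so the suffix list here is illustrative only:

```python
# A toy suffix-stripping stemmer: reduce words to a shared base form
# so that 'daffodils' and 'daffodil' index to the same term.

SUFFIXES = ("ing", "ers", "er", "s")  # checked longest-first

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        # only strip when a reasonably long base form remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("daffodils"), stem("daffodil"))  # both reduce to 'daffodil'
```

Both the indexed documents and the incoming query are stemmed, so matching happens on base forms rather than surface forms.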
In other areas, such as the e-science domain, simple keyword-matching searches are not
sufficient. The e-science community has driven many advances in the Semantic Web: they
collect vast amounts of diverse data, often need quick and correct answers, and work in an
environment in which failure is poorly tolerated.
In these complex domains the use of ontologies is prevalent, which, it has been argued, proves
that the Semantic Web is achievable[21]. Ontologies can be defined as “explicit formal specifications
of the terms in the domain and relations among them”[22]. In essence they break down a
complex domain into a hierarchical structure with rich information about the interactions
between objects/resources. They are used in the context of the Semantic Web as they are
machine-readable and are used with an inference engine to obtain information about resources
in the domain[6].
An example of a large ontology in use today is the National Cancer Institute’s Thesaurus and
Ontology, which contains over 90,000 concepts and is still growing[23]. It contains descriptions
of genes, diseases, drugs and chemicals, anatomy, organisms, and proteins as well as the
semantic relationships between them. It uses the Web Ontology Language (OWL) to describe its
resources. OWL is a W3C recommendation which builds on RDF using XML and is used to
express hierarchies and relationships between resources (ontologies)[24].
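The core of the reasoning an inference engine performs over such a hierarchy, subsumption, can be sketched in a few lines. The miniature class hierarchy below is invented and vastly simpler than the NCI Thesaurus, which is expressed in OWL and queried with a proper reasoner:

```python
# Sketch of subsumption reasoning over a toy subclass hierarchy.
# Each concept maps to its direct superclass (single inheritance only,
# a simplification; OWL permits much richer relationships).

SUBCLASS_OF = {
    "Carcinoma": "Cancer",
    "Cancer": "Disease",
    "Diabetes": "Disease",
}

def is_a(concept: str, ancestor: str) -> bool:
    """Transitively follow subclass links to test subsumption."""
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        if concept == ancestor:
            return True
    return concept == ancestor

print(is_a("Carcinoma", "Disease"))  # True: Carcinoma -> Cancer -> Disease
```

A semantic search engine can use this kind of inference to return documents about carcinomas for a query about diseases, even when the word ‘disease’ never appears in them.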
Noesis is a semantic search engine that uses domain ontologies specifically designed for
environmental scientists. Noesis allows the expert user to filter their search based on attributes
of the object they are searching for. An example of this would be searching for sea grass and
being able to filter by taxonomy, location or water type[25]. Results outside the ontology (e.g.
moisturisers containing sea grass) are removed from the results. Filtering would be an
important feature for users to have as it would mean being able to use the search engine to find
things such as flights, car hire and insurance, rather than needing to go to several sites.
Ontologies can be used in the context of information retrieval in three ways[26]. Firstly they can
be used by domain experts to represent their domain knowledge, secondly by web users to
annotate web resources more efficiently and thirdly by end-users searching the web with
queries based on the ontology. Web resources that are semantically related in the ontology will
then be retrieved, but this can mean that end-users need a basic understanding of the ontology
being used.
Another criticism of using ontologies is that relying on users to create metadata is difficult:
there are not experts in every subject willing to create an Ontology of Everything, and doing so
would take a significant amount of time[27]. Automatic annotation could help, but it is
problematic in itself because its output is difficult to verify.
The Resource Description Framework (RDF) is a language used with OWL and is specifically
designed for describing resources on the web, which is why it is often cited as one of the
key languages in the future of the Semantic Web[24]. It standardises the way things are
described (such as price), which could lead to search engines being able to more effectively filter
results. It does this by storing data as three-part statements called triples. A triple contains a
subject, a predicate and an object, which use URIs2 to identify resources. An example of an RDF triple is:
[subject/resource]    [predicate/property]    [object/value]
The secret agent      is                      Niki Devgood
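The triple model can be sketched as a list of three-part statements queried by pattern matching. Real RDF uses full URIs and libraries such as rdflib; the plain strings and example facts below are a simplification invented for illustration:

```python
# Sketch of an RDF-style triple store with wildcard pattern matching.

TRIPLES = [
    ("TheSecretAgent", "is", "NikiDevgood"),
    ("NikiDevgood", "worksFor", "MI6"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern (None acts as a wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

print(query(subject="TheSecretAgent"))  # everything known about the subject
```

Because every statement has the same three-part shape, search engines could merge and filter data from many sites with one uniform query mechanism.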
An interesting argument against the idea that RDF will be a big player in the Semantic Web
comes from Google Trends statistics. Evidence from a report in 2006 suggests that RDF is not a
popular search; in fact more people search Google for ‘Prolog’ and ‘Fortran’[27]. An updated
report confirms the declining interest in this technology (see Figure 1). The same report also
indicates, by looking at the number of books and blog posts on each subject, that there is more
interest in other technologies, such as AJAX3.
There has been recent work into a search engine that queries the web like a database[28]. The
idea is that the queries are simple to pose but the results are complex. The system is still in its
early stages and only covers 10 entity types (namely Person, Company, University, City, Novel,
Player, Club, Song, Film and Award); it works by the user telling the system which entities to
look for and what the relations are between them[29]. This example query is taken from the project website and
demonstrates the complex questions that can be answered using this approach; the actual query
is shown in Figure 2 in the appendix:
"Find an Australian actress, an Academy Award winning film, and a Grammy Award winning
song, where the actress stars the film and the song is the theme of that film”
Results returned:
Nicole Kidman, Batman Forever, Kiss from a Rose
Melanie Griffith, Titanic (1997 film), My Heart Will Go On
Mia Farrow, Midnight Cowboy, Without You
This kind of complex result would have taken several queries and a lot of time for a human to do
using a current search engine, but due to the complex way in which the query needs to be
formulated it may never reach mainstream use.
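Evaluating such a query amounts to a join over entity and relation tables. The miniature tables below are invented to mirror the first result above; the real system extracts such entities and relations from web pages at a far larger scale:

```python
# Sketch of an entity-relationship query evaluated as a three-way join.

ACTRESSES = {"Nicole Kidman": "Australia", "Mia Farrow": "USA"}
STARS_IN = {("Nicole Kidman", "Batman Forever"), ("Mia Farrow", "Rosemary's Baby")}
THEME_OF = {("Kiss from a Rose", "Batman Forever")}

def australian_actress_film_song():
    """Join: Australian actress -> film she stars in -> that film's theme song."""
    return [
        (actress, film, song)
        for actress, country in ACTRESSES.items()
        if country == "Australia"                 # restrict the Person entity
        for (a, film) in STARS_IN
        if a == actress                           # stars-in relation
        for (song, f) in THEME_OF
        if f == film                              # theme-of relation
    ]

print(australian_actress_film_song())
```

The chain of filters corresponds directly to the constraints in the natural-language query, which is what makes the result feel database-like rather than document-like.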
It is important to take a moment to explore the human motivations for using search, and how
people go about doing it. The ‘Principle of least effort’[30] implies that information-seeking
users will use the most convenient search method possible to them, thus it can be argued that
this search engine would deter users due to its complex, multipart interface. The
Wolfram Alpha search engine uses a ‘Google-style’ query box which may encourage more users
to use it. Advanced users (such as in the e-science domain) have a lot of experience with
2 URI: Uniform Resource Identifier, used to uniquely identify resources.
3 AJAX: Asynchronous JavaScript and XML.
complex query structuring, so they might be averse to using this sort of open-ended query
input as it does not match their mental model of a complex search engine.
It has been suggested that there is an overlap between knowledge representation and model-
driven architecture[31], and that this overlap could serve as a backbone for a new Semantic
Web. The goal of model-driven architecture (MDA) is to separate business applications from
technologies used for implementation. It relies heavily on metadata created by the Meta-Object
Facility (MOF). MDA technologies could be used as a foundation for ontology modelling in the
semantic web as both MDA and Semantic Web languages contain a similar specification (such as
subsumption relations and relationships between classes)[31]. Furthermore, UML4 and MOF are
graphical, and it may be more straightforward for experts to create ontologies using these tools
than it would be with a knowledge representation language. The role of the Semantic Web
would then be to reason about these resources and would not be concerned with the complex
task of managing the models. The Object Management Group (OMG) has created an Ontology
Definition Metamodel specification which enables the modelling of ontologies using UML
tools[32].
As previously mentioned, results for simple queries (such as mathematical calculations and unit
conversions) can be pulled from sources and displayed to the user. A more semantic search
engine could try to do this for more complex queries, such as “winner of 2011 australian tennis
open”[33], rather than simply identifying potentially useful pages by matching the query
entered. Two examples of search engines attempting this are Sensebot5 and Wolfram Alpha6,
although both of these search engines only cover a relatively small, selected domain. They do
however show that it is possible to have search results without a list of documents.
So far a lot of emphasis has been placed on finding textual answers to user queries, but what
about semantic searching over images or video? In fact this is already appearing: nachofoto has
created a semantic, time-based vertical image search engine[34], although it is only in beta and
only covers information that is trending on the web.
Another way in which search engines could be more intelligent is in how they decide which
results to filter when an ambiguous search is made, for example by asking the user a question
(‘Are you searching for the company or the fruit?’) when presented with the term ‘apple’[35].
They could also learn from large numbers of user responses in order to guess which result was
wanted. The previously mentioned Wolfram Alpha interprets this search as the company Apple
but provides a very user-friendly way to switch to the other available interpretations.
One of the biggest areas of growth in recent years is that of the social web. The social web refers
to websites that are driven by user participation, such as Wikipedia, YouTube and Flickr[36].
This is a good form of knowledge sharing, especially in newer areas of interest that do not have
a defined structure that could be mapped into an ontology. User participation can be used to
create and update metadata, which aids the retrieval of items of interest. As more people
contribute, the system becomes more useful, and the added information may be previously
unrecorded[36].
4 UML: Unified Modelling Language, see http://www.uml.org/
5 http://www.sensebot.net/
6 http://www.wolframalpha.com/
TipTop7 is a social search engine that performs sentiment analysis on a subject and classifies
the results into positive and negative opinions[33, 37]. Twitter in particular is heavily used by
this search engine. Twitter has been shown to influence public collective decision-making and
even to predict economic changes; one recent study showed that integrating Twitter sentiment
into stock-market prediction increased accuracy from 73.3% to 86.7%[38]. Using
sentiment analysis could assist search engines when deciding how to rank documents or which
information to present to a user first.
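A bare-bones lexicon-based sentiment classifier can be sketched as follows; real systems, including the study cited above, use far richer models, and the word lists here are illustrative only:

```python
# Toy lexicon-based sentiment classifier: count positive and negative
# words and classify by the difference. Invented word lists.

POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "hate", "terrible", "slow"}

def classify(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("I love this phone, the camera is excellent"))  # positive
print(classify("terrible battery and slow screen"))            # negative
```

A search engine could aggregate such labels over many tweets about a topic and surface the overall sentiment alongside, or as a ranking signal for, the usual document results.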
Conclusion
The amount of content on the web is immense and growing rapidly. It can be argued that
ontologies are very effective for domain-specific tasks where the users are experts, but an
ontology that covers the whole world wide web seems unlikely. Ontologies need to be decided
upon, implemented and maintained, each of which is a very large task in itself. It has been
suggested that the only way the Semantic Web can succeed is by using several ontologies from
different communities[21], which, although feasible, will be demanding and time-consuming.
Motivation for searching is another factor in deciding how a search engine should function –
more advanced users who rely on the search engine to perform their job are likely to influence
how the search engine evolves. It is these users who have pushed changes in areas such as
e-science, which suggests that these areas will continue to evolve and adapt to the growing
amount of data. When developing any sort of software system, often the most difficult thing is
getting the users to use it. In this respect, it is unlikely that any new public search engine will
become widespread enough to overtake Google, Microsoft and Yahoo.
Furthermore, can a computer really mimic human behaviour? Just as the content on the web is
changing rapidly, language and human behaviour also evolves and changes. Perhaps by the time
an ontology or a natural language processor is developed that is complex enough to be used on
the general web, it will be out of date with both its users and its data. Moreover, if all users
sent well-formed information queries to a search engine then many problems would be solved;
in practice, users tend to type only a few words and expect a result[39]. After all, two short
searches are likely to take about the same amount of time as one long query, while offering the
advantage of immediate results along the way.
It is these subtleties in human behaviour that may be near to impossible for a computer system
to interpret, at least in the near future, but enough advances have been made that suggest that
something resembling a Semantic Web is certainly achievable in the long term.
Word count: 2951
7 http://www.feeltiptop.com/
Appendix
Figure 1
Figure 2
References
1. Bush, V., As We May Think. Atlantic Magazine, 1945.
2. Nenadic, G., Data Integration and Analysis. 2010.
3. SiteImpulse. YouTube Facts & Figures. 2010; Available from: http://www.website-monitoring.com/blog/2010/05/17/youtube-facts-and-figures-history-statistics/.
4. Hachman, M., Twitter Tops 2 Billion Tweets Per Month, in PCMag. 2010.
5. comScore. comScore Reports Global Search Market Growth of 46 Percent in 2009. 2009; Available from: http://www.comscore.com/Press_Events/Press_Releases/2010/1/Global_Search_Market_Grows_46_Percent_in_2009.
6. Movva, S., Noesis: A Semantic Search Engine and Resource Aggregator for Atmospheric Science. American Geophysical Union, 2006.
7. Buckland, M., The relationship between Recall and Precision. Journal of the American Society for Information Science, 1994. Volume 45, Issue 1: p. 12-19.
8. Antoniou, G., A Semantic Web Primer. 2004, Massachusetts: Massachusetts Institute of Technology.
9. Google, Google Instant, behind the scenes, in Google Blog. 2009.
10. Google. About Google Instant. 2009; Available from: http://www.google.co.uk/instant/.
11. Mello, J., Google Instant: Pros and Cons, in PCWorld. 2009.
12. Linux Information Project. How to Use Wildcards. 2006; Available from: http://www.linfo.org/wildcard.html.
13. Google. Search Features. 2011; Available from: http://www.google.com/help/features.html.
14. Google, Google does not use the keywords meta tag in web ranking, in Google Webmaster Blog. 2009.
15. Google. Webmaster Guidelines. 2011; Available from: http://www.google.com/support/webmasters/bin/answer.py?answer=35769.
16. Miller, G., WordNet: A Lexical Database for English. Communications of the ACM, 1995. Volume 38, Issue 11.
17. Radhakrishnan, A., Oingo Meaning Engine, Semantic Search & Google, in Search Engine Journal. 2007.
18. Google. Vegan vs Vegetarian. 2009; Available from: http://www.google.com/support/forum/p/Web+Search/thread?tid=29ae6b6d77496dd3&hl=en.
19. Hull, D., Stemming Algorithms. Journal of the American Society for Information Science, 1995.
20. Gao, J., A Comparative Study of Bing Web N-gram Language Models for Web Search and Natural Language Processing. Proceeding of the 33rd Annual ACM SIGIR Conference, 2010.
21. Staab, S., The Semantic Web Revisited. IEEE Intelligent Systems, 2006, Issue May/June.
22. Gruber, T., A translation approach to portable ontology specifications. Knowledge Acquisition, 1993.
23. Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Oberthaler, J., Parsia, B., The National Cancer Institute's Thésaurus and Ontology. 2003.
24. Altova. What is the Semantic Web? 2005.
25. Physorg, Semantic science search engine knows that there is a difference. 2009.
26. Corby, O., Querying the Semantic Web with Corese Search Engine. 2004.
27. Zambonini, D., The 7 (f)laws of the Semantic Web, in xml.com. 2006.
28. Physorg, Grant to help researchers build better search engines. 2010.
29. Li, C. Entity-Relationship Queries. 2011; Available from: http://idir.uta.edu/erq/.
30. Zipf, G., Human Behaviour and the Principle of Least Effort. 1949.
31. Frankel, D., The Model Driven Semantic Web. Proceedings of First International Workshop on the Model-Driven Semantic Web, 2004.
32. Object Management Group, Ontology Definition Metamodel. 2009.
33. Hendler, J., Web 3.0: The Dawn of Semantic Search. IEEE Computer Society, 2010. Volume 43, Issue 1: p. 77-80.
34. Zaino, J., Semantic Image Search: Next Up For a Major Search Engine? 2011.
35. Physorg, Smarter than Google?, M. Breedveld, Editor. 2010.
36. Gruber, T., Collective knowledge systems: Where the Social Web meets the Semantic Web. Journal of Web Semantics, 2007. Volume 6.
37. TipTop. TipTop Search FAQs. 2009; Available from: http://beta.tiptopbest.com/faq.html.
38. Bollen, J., Twitter mood predicts the stock market. Journal of Computational Science, 2011. Volume 1.
39. Zaino, J., Exploring Search, in semanticweb.com. 2010.