How doctors search: A study of query behaviour and the impact on search results

Information Processing and Management 48 (2012) 1151–1170

Contents lists available at SciVerse ScienceDirect

journal homepage: www.elsevier.com/locate/infoproman

Marianne Lykke a,⇑, Susan Price b, Lois Delcambre c

a Aalborg University, Nyhavnsgade 14, Room 3-16, 9000 Aalborg, Denmark
b Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA
c Portland State University, Department of Computer Science, P.O. Box 751, Portland, OR 97207-0751, United States

⇑ Corresponding author. Tel.: +45 21251854. E-mail addresses: [email protected] (M. Lykke), [email protected] (S. Price), [email protected] (L. Delcambre).

Article info

Article history:
Received 22 March 2011
Received in revised form 21 December 2011
Accepted 14 February 2012
Available online 10 March 2012

Keywords:
Query behaviour
Work-place retrieval
Family doctors
Retrieval performance

Abstract

Professional, workplace searching is different from general searching, because it is typically limited to specific facets and targeted to a single answer. We have developed the semantic component (SC) model, which is a search feature that allows searchers to structure and specify the search to context-specific aspects of the main topic of the documents. We have tested the model in an interactive searching study with family doctors with the purpose to explore doctors’ querying behaviour, how they applied the means for specifying a search, and how these features contributed to the search outcome. In general, the doctors were capable of exploiting system features and search tactics during the searching. Most searchers produced well-structured queries that contained appropriate search facets. When searches failed it was not due to query structure or query length. Failures were mostly caused by the well-known vocabulary problem. The problem was exacerbated by using certain filters as Boolean filters. The best working queries were structured into 2–3 main facets out of 3–5 possible search facets, and expressed with terms reflecting the focal view of the search task. The findings at the same time support and extend previous results about query structure and exhaustivity showing the importance of selecting central search facets and express them from the perspective of search task. The SC model was applied in the highest performing queries except one. The findings suggest that the model might be a helpful feature to structure queries into central, appropriate facets, and in returning highly relevant documents.

© 2012 Elsevier Ltd. All rights reserved.
0306-4573/$ - see front matter. doi:10.1016/j.ipm.2012.02.006

1. Introduction

Information retrieval is closely related to the information environment, setting and domain of interest, set of people, professions, work roles, problems, and associated tasks (Taylor, 1991) in which the information need arises. The work task moulds the querying activity by influencing the: choice of search terms, use of system features and relevance assessments (Vakkari, 2003). Studies of workplace search demonstrate that workplace searching is different from general searching (Fagin et al., 2003). Professional workplace queries are typically targeting a single “right answer”, and tend to have a small set of useful documents. The correct answer is context-specific and often limited based on non-topical aspects such as document genres and information characteristics, communication channels and sources of documents, document status, and products related to the document (Freund, Toms, & Waterhouse, 2005).

We have developed the semantic component (SC) model which is an indexing and search feature that allows searchers to focus the search on domain-specific aspects of the main topic of the documents (Price, Delcambre, & Nielsen, 2006). The model seeks to meet the needs of domain-specific, workplace information searching and builds on findings that domain experts use knowledge about document genres and document structures when searching, reading and analysing documents (Bishop, 1999; Dillon, 1991; Freund & Toms, 2005; Hearst & Plaunt, 1993). We have evaluated the model in a realistic, interactive searching study with family practice doctors using a web-based national health portal, Sundhed.dk, for work-related information search tasks during a primary care visit (Price, Nielsen, Delcambre, Vedsted, & Steinhauer, 2009). The study was an experimental comparison study between a baseline system, providing features from the portal’s existing search system, and an experimental system with the same features plus the ability to further specify the search using the semantic component model. The study showed an improvement in document ranking of 35.5% as measured by Mean Average Precision (MAP) and 25.6% by normalized Discounted Cumulative Gain (nDCG) when semantic components were used to structure queries and to rank documents. The difference between the two systems was statistically significant for both MAP (p < 0.02) and nDCG (p < 0.01).
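For readers unfamiliar with the two ranking measures, the following is a minimal sketch of how MAP, nDCG and a relative improvement between two systems can be computed. It assumes binary relevance for average precision and a log2 rank discount for nDCG, with the ideal ranking taken from the retrieved list itself. The text above does not state which variants were used in the study, and all function names here are illustrative.

```python
import math


def average_precision(ranked_relevance):
    """Average precision for one ranked list of 0/1 relevance judgments."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision(r) for r in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalised by the DCG of the same gains in ideal (sorted) order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


def relative_improvement(baseline, experimental):
    """Percentage improvement of the experimental score over the baseline."""
    return 100.0 * (experimental - baseline) / baseline
```

A figure such as the reported 35.5% would correspond to `relative_improvement(map_baseline, map_experimental)`, where both arguments are MAP scores averaged over the same search sessions.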

In the present paper we extend our findings and study the doctors’ queries in detail to explore how they applied the available means for specifying the search, and how these search features contributed to the search outcome. Our goal is to gain insight about the doctors’ querying activities and study how they transformed the search tasks into search facets and how they formed and reformulated queries. Altogether, we want to extend our understanding of doctors’ work-related query behaviour in order to guide the design of supplemental organisation structures and search features for interaction between the searchers and the system, including the SC model. We seek to understand how they use the search features, and how the features and the querying activities perform.

We specifically address the following research questions:

1. How did the doctors formulate queries to solve the search tasks?
   a. How many queries did they formulate per search task (session)?
   b. How many search terms did they use to formulate queries?
   c. How many search keys did they use?
   d. Which vocabulary and semantic categories did they use to express the search facets?
   e. Which search features did they use to formulate search facets?
   f. How much reformulation did they do?
   g. Which reformulations did they do?

2. How did the doctors structure the search tasks into search facets?
   a. Which search facets did they include?
   b. Which search facets did they omit?

3. How did the searchers obtain a high retrieval performance?
   a. Which formulations and features resulted in high, middle, low performance queries?
   b. Which formulations and features resulted in failures?

To answer these questions, we analysed search log data and relevance data from the interactive searching study. We studied variables including queries, reformulations, search facets and search terms, search filters, SCs, search vocabulary, and search performance.

The remainder of the paper is organised as follows. Section 2 presents the background information for our study including previous studies of family doctors’ seeking and searching behaviour as well as studies about querying and retrieval performance. Section 3 provides an overview of the SC model that we developed and compares it to existing search features. Section 4 describes the research design used to examine doctors’ querying behaviour and the research questions. The study design of the retrieval test is described in Section 4.1, the data collection methods in Section 4.2, and the methods and models used to analyse the query behaviour in Section 4.3. The section includes a description of the national health portal, the existing search features, and the group of family doctors who carried out the information searching. Section 5 presents and discusses the results and their implications. Finally, Section 6 concludes the paper, and presents propositions for further research.

2. Related studies

We review related work in two subsections: the first covers research on family practitioners’ searching behaviour, and the second covers retrieval performance and querying.

2.1. Family practitioners’ searching behaviour

Doctors have significant information needs when attending patients. Based on a review of original research papers, Ketchell, Lailani, Kauff, Barak, and Timberlake (2005) estimate that one clinical question arises per patient visit. A broad range of questions arises. In an observational study of 386 US practitioners, doctors asked a total of 1101 questions. A categorization of the questions provided a taxonomy of 69 question categories. The three most common categories, comprising 24% of all questions, concerned questions about drug dosing, cause of symptom X, and management of disease or finding X (Ely et al., 1999). The study showed that family doctors spent an average of less than 2 min pursuing an answer. An earlier US primary care study reported that family physicians spend an average of 12 min per search, using on average mainly two sources, paper and colleagues (Gorman & Helfand, 1995). Timeliness is one of the main difficulties primary care physicians report when looking for electronic information. The typical primary care visit is stipulated to last 10–20 min, and family doctors need precise, “bottom line” answers, digested into easily accessible summaries (Coumou & Meijman, 2006).

A substantial percentage of questions can be answered by consulting electronic resources, which is an option that doctors increasingly use (Alper, Stevermer, White, & Ewigman, 2001; Cullen, 2002; Magrabi, Coiera, Westbrook, Gosling, & Vickland, 2005; Pedersen, 2005). Alper et al. (2001) studied the ability of electronic medical databases to provide adequate answers to clinical questions of family doctors. Twenty questions were used to evaluate a total of 14 databases by two criteria: the percentage of questions answered and the time to adequately answer the questions. The study suggests that individual databases can answer a considerable proportion of family practitioners’ clinical questions, whereas a combination of available databases can answer 75% or more. The time required to obtain answers is longer than the 2-min average time spent by doctors in the study of Ely et al. (1999), and ranges from 2 to 6 min per question.

Cullen (2002) surveyed a random sample of 294 family doctors in New Zealand to determine their use of the Internet and their access to MEDLINE. Of these, 48.6% reported that they use the Internet to look for clinical information. Information was primarily sought on rare diseases, updates on common diseases, diagnosis, and information for patients. MEDLINE was the most frequently accessed source. The respondents self-reported their search skills as basic; 94% of the respondents search with keywords in MEDLINE, 45.8% apply the feature “Limits” to filter on various types of metadata, 39.0% combine keywords with the Boolean “and”, 38.1% use the feature “Related articles” to get similar, topically related papers, and 11.0% use controlled search terms in the form of Medical Subject Headings. Many respondents recognized the need for higher-level skills to retrieve the most relevant and reliable information. Cullen (2002) concludes that lack of information retrieval skills is perhaps the most significant barrier to use of Internet-based information sources.

Bennett, Casebeer, Kristofco, and Collins (2005) faxed a questionnaire to a random sample of family practice doctors to explore the purposes of and barriers to Internet use in family practice, with the aim of comparing the information-seeking behaviour of family practice doctors to that of physicians in other specialties. The 17 respondents found the Internet to be useful and important as an information source. Their searches differed from those of their colleagues in that they had a focus on direct patient care questions. Almost half of family physicians use hand-held computers, most often for drug reference.

Andrews, Pearce, Ireson, and Love (2005) investigated the information seeking behaviour of primary care practitioners in a questionnaire study with a response rate of 51% (59 of 116). Of the respondents, 78% stated that they sought information to support patient care several times a week, and 68% sought this information while the patient waited. The main barriers to information seeking were lack of time (76%), cost (33%), information-seeking skills (25%), and format of information sources (22%). The use of online resources was moderate, and practitioners were more likely to use print and interpersonal sources.

Magrabi et al. (2005) studied 227 family doctors’ interaction with an evidence system by use of transaction logs and a survey analysis. 80% of participants rated their computer skills as good or better. The majority of the 1680 searches were conducted from consulting rooms between 9 am and 7 pm. The most frequent searches related to diagnosis (40%) and treatment (35%). The searchers used, on average, 1.5 terms per search query, with a maximum of four terms per query. Disease-specific terms were used in a large number of searches (81%), with fewer searches with “symptoms” terms (26%), “drug” terms (18%), and “other” terms (28%). Ninety-two percent (92%) of the searches were undertaken using the system’s profiles. A profile is an advanced search feature that selects information resources automatically for the user, and adds specialist keywords to the general question posed by the user.

In a Danish questionnaire survey, 70% of 278 family practitioners used the Internet daily (Pedersen, 2005). The practitioners searched for drug information, referral guidelines, waiting list information, and disease information.

While working through a total of 20 realistic information needs using a mix of textbooks, journal papers, and computer applications, Ely, Osheroff, Ebell, Chambliss, Vinson, et al. (2002) identified six major complications when attempting to answer practitioners’ questions with evidence: (1) the excessive time required to find information, (2) difficulty in finding correct search terms and modifying the original question, which is often vague and open to interpretation, (3) difficulty selecting an optimal strategy to search for information, (4) failure of a seemingly appropriate resource to cover the topic, (5) uncertainty about how to know when all relevant evidence has been found so that the search can stop, and (6) inadequate synthesis of multiple bits of evidence into a clinically useful statement. Findings 2 and 3 suggest that searchers encounter difficulties in formulating precise search queries, but the paper does not report any details about search problems. The study also was quite broad in that it included non-computerized information sources.

In a recent study Davies (2011) investigated physicians’ use of information in the United States (US) (80 responses), Canada (80 responses), and the United Kingdom (UK) (636 responses). The percentage of responses from physicians working in general practice was 42.5% in the US, 32.5% in Canada, and 31.3% in the UK. Generally, physicians in the US and Canada were more likely to use electronic resources to locate information compared to physicians in the UK. Significant numbers of UK and Canadian physicians responded that they never used electronic resources for diagnosis, 20.0% and 13.8% respectively, compared to 1.3% of US physicians. Physicians in the US were twice as likely to report using electronic resources to find treatment options for common diseases and three times as likely to report searching for information on rare diseases and syndromes. The only information use that was similar for all three countries was continuing professional development (CPD). Physicians were asked their level of awareness and understanding of specific terms used in journal papers such as relative risk, systematic review, confidence interval, and publication bias. The responses were similar with more than 70% reporting understanding or being able to explain evidence-based medicine (EBM) concepts. Limits on the search query, such as English language, human subjects only, and date of publication, were used by US physicians 57.5%, Canadian physicians 55.0%, and UK physicians 42.5% of the time. Boolean AND was used by US physicians 82.5%, Canadian physicians 77.5%, and UK physicians 67.5% of the time. Subject headings were also used mostly by US physicians, 46.3% of the time compared to 43.9% (Canada) and 32.5% (UK). “Related articles” or “Similar titles” were used by US physicians 78.8%, Canadian physicians 72.5%, and UK physicians 62.5% of the time. Time was the main barrier to accessing electronic information.

It is a limitation that the findings of Cullen (2002), Pedersen (2005), Bennett et al. (2005), Andrews et al. (2005), and Davies (2011) are based on self-reports, and those of Alper et al. (2001) and Ely et al. (2002) on questions answered by the investigators. However, taken together the studies suggest that doctors look for information about few, common topics about drug dosing, cause of symptoms, and disease management. They used few search terms, and only half or less of the doctors applied features such as metadata filters and Boolean operators. Only the use of Boolean AND had increased from 2002 to 2011. The searchers stated that they encountered difficulties in formulating precise, context-specific queries. Ely et al. (2002) reported specifically that they had uncertainty about their choice and modification of search terms, and difficulty modifying search questions to important, context-specific facets such as patient, intervention, comparison, and outcome.

The findings about doctors’ searching behaviour are in line with general findings that end users do not use very sophisticated searches (Markey, 2007a, 2007b). Searchers often enter very short queries and they seldom apply advanced search features. Transaction log analysis of thousands of queries posed by Internet users showed that one in three queries had only one term (Jansen, Spink, & Saracevic, 2000). The average query length was 2.21 terms. In a study of 87 university students’ queries submitted to Internet search engines almost 60% of the queries had only one to two terms and on average, the searchers used 2.76 terms (Lucas & Tobi, 2002). The average number of query terms was even smaller, 1.45, in a study by Stenmark (2008) of Intranet searchers.

2.2. Retrieval performance

We have not found any studies that examine retrieval performance of doctors’ querying activities. Sormunen (2002a) studied query exhaustivity and extent for 18 search tasks using a Finnish newspaper text database as the test collection. The study showed that increasing the exhaustivity of optimized best-match queries led to a significant increase in precision, especially for tasks that require high precision. Vakkari, Jones, and MacFarlane (2004) explored how the expression of search facets and relevance feedback by users were related to search success in automatic and interactive query expansion. In an experiment with 26 students using the Okapi system to search for 4 TREC topics, the results showed that search success was related both to how exhaustively search facets were covered by query terms and to the relevance feedback. A longitudinal study of 22 undergraduates showed that the students’ ability to cover all aspects of the conceptual topic of interest in their query terms is the major feature affecting search success. Also their ability to extract new query terms from retrieved items improves search results (Pennanen & Vakkari, 2003). In laboratory experiments Kekäläinen and Järvelin (1998, 2000) compared retrieval performance of strongly and weakly structured queries in combination with different query expansion types. Their results showed that the effects of query expansion depend on the structure of the query. Query expansion interacts with query structure; significant query expansion seems to require a well-structured query, but in cases with little or no query expansion, the structure of the query does not matter. In most cases the best performance was obtained by the largest query expansion that included all semantic relationships.

Wolfram (2008) compared end-user searching characteristics in four different web-based information retrieval environments (a university bibliographic databank, a university OPAC, the Excite search engine, and the HealthLink specialized search service). Transaction logs were analysed to identify searching behaviour such as query length, session length, time spent, number of page views, and number of query terms. The four environments revealed notable differences at the term, query, and session levels. On average the terms per query were 1.81 for the bibliographic databank, 3.66 for OPAC, 2.62 for the search engine, and 1.78 for the specialized search service. The searchers made, on average, 1.87 queries per session in the bibliographic databank, 2.25 in OPAC, 1.80 in the search engine, and 2.34 in the specialized service. They viewed, on average, 1.05 pages per query in the bibliographic databank, 1.13 in OPAC, and 1.75 in the search engine. No information was reported for the specialized search service.

Mu and Lu (2010) examined the influence of query complexity and different query expansion strategies on the effectiveness of genomic information retrieval. Query complexity was defined as the average number of terms in the query. The query expansion strategies were based on the Unified Medical Language System (UMLS) Metathesaurus. In a laboratory study using the 2006 TREC Genomics Track test collection, 24 queries were analysed. In general, query expansion did not improve the retrieval effectiveness measured by Mean Average Precision. Queries with fewer terms outperformed queries with a larger number of terms. The finding was statistically significant.
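Measures such as terms per query and queries per session are simple to derive from a transaction log. The sketch below is illustrative only: it assumes a log already reduced to (session id, query string) pairs and counts terms by whitespace splitting, which is not necessarily how the cited studies tokenised queries. The function name and the log format are hypothetical.

```python
from collections import defaultdict


def query_log_stats(log):
    """Compute mean terms per query and mean queries per session.

    `log` is an iterable of (session_id, query_string) pairs (assumed format).
    """
    queries_per_session = defaultdict(int)
    term_counts = []
    for session_id, query in log:
        queries_per_session[session_id] += 1
        term_counts.append(len(query.split()))  # whitespace tokenisation
    n_queries = len(term_counts)
    if n_queries == 0:
        return {"terms_per_query": 0.0, "queries_per_session": 0.0}
    return {
        "terms_per_query": sum(term_counts) / n_queries,
        "queries_per_session": n_queries / len(queries_per_session),
    }


# Toy log with two sessions; the queries are invented for illustration.
example_log = [
    ("s1", "asthma pregnancy"),
    ("s1", "asthma treatment pregnancy"),
    ("s2", "warfarin"),
]
stats = query_log_stats(example_log)
```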

The findings concerning specialized search services confirm previous findings that searching in domain-specific environments is more targeted and context-specific than searching in a more general setting. Compared to other environments, domain-specific queries are shorter and more focused. Sessions contain a larger number of queries, but are of very short duration, suggesting quick assessment of the returned results. Altogether the findings about query structure, expansion and complexity imply that search success depends on the searchers’ ability to articulate the facets of topics in query terms, structure the facets in the query, and cover the facets exhaustively. Vakkari et al. (2004) suggest that a point of departure for improving best-match systems would be providing searchers in the initial search with an option to see the facets of the topic.
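To make the notion of facet coverage concrete, the sketch below counts how many facets of a search task are expressed by at least one query term. It is a simplification under stated assumptions: each facet is represented by a hand-listed set of acceptable expressions, matching is exact and case-insensitive, and the function names and the example task are hypothetical rather than taken from the cited studies.

```python
def covered_facets(query_terms, task_facets):
    """Return the names of the task facets matched by at least one query term.

    `task_facets` maps a facet name to a set of expressions for that facet.
    """
    terms = {t.lower() for t in query_terms}
    return [
        name
        for name, expressions in task_facets.items()
        if terms & {e.lower() for e in expressions}
    ]


def exhaustivity(query_terms, task_facets):
    """Number of task facets covered by the query (a simple proxy)."""
    return len(covered_facets(query_terms, task_facets))


# Invented example task with three possible facets.
task = {
    "disease": {"asthma"},
    "patient group": {"pregnancy", "pregnant"},
    "intervention": {"medication", "treatment", "drug"},
}
query = ["Asthma", "pregnancy"]
```

Here `covered_facets(query, task)` gives the disease and patient group facets, so the query covers two of the three possible facets.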

3. The semantic component model

We developed the SC model in order to facilitate formulation of structured queries that cover the search topic precisely using domain-specific aspects, as is frequently required in workplace retrieval. The SC model divides a given collection into a set of document classes, where each class has an associated set of SCs. Documents may be classified by the type of topic that is the main focus or by the main purpose of the document. The appropriate classification depends on the nature of the document collection. In a health-related collection, topic type is a natural axis for classification. Doctors search for information about diseases, drugs and clinical treatment, and in the case of the national health portal, sundhed.dk, we have also found documents that correspond to these topics, e.g. documents about diseases (one document class), documents about medications (a second document class), and documents about treatment methods, e.g. chemotherapy (a third document class) (Price & Delcambre, 2006). Documents in a document class tend to contain information about a finite set of aspects or facets of the topic that are important in the domain and reflect a general understanding of the type of information needs expected. For example, documents about diseases often contain information about symptoms and treatment whereas documents about medications often contain information about dosage and side effects. We call these aspects semantic components of the document class, and define a semantic component instance as one or more text segments about a particular aspect or facet of the main topic of the document. The segments of text that comprise a semantic component instance can vary in length and may or may not be contiguous. Each semantic component instance consists of segments of text that may or may not correspond to structural elements in the document.

SCs are similar, but not necessarily identical, to the classical facets used within library and information science to structure a subject area. Vickery (1960) defines facets and facet analysis as the sorting of terms in a given field of knowledge into homogeneous, mutually exclusive facets, each derived from the parent universe by a single characteristic of division. A subject field has innumerable characteristics, and may be divided into a variety of facets. Aitchison, Gilchrist, and Bawden (2002) list the following “fundamental” facet categories that may be applicable in many subject fields: Entities, Attributes, Actions, Space and Time. Some SCs correspond quite naturally to facets. For example, segments that contain information about treatment in a document about diseases may be considered as an action facet. However, SCs can also contain information that might not be considered a topic facet, such as scheduling instructions for a patient in documents about surgical operations, which is a form facet. SCs can also combine two or more concepts that are different aspects of a topic, such as combining epidemiology and natural history into general information about a disease. SCs are used to specify what kind of information is given about a document’s general topic, and may specify sub-topics, type of information or target group, depending on the domain-specific collection. Since SCs are intended to facilitate retrieval, and not to describe the domain, knowing the contents of a particular document collection, or the common information needs among users of the collection, may lead to selecting SCs with varying degrees of specificity to represent document content.

The SC model can assist the searcher in three ways: (1) a searcher can search for documents about a topic that also contain a particular SC, e.g. information about treatment; (2) the searcher can search for terms within SC text segments, e.g. the term “chemotherapy” in the SC treatment; and (3) the searcher can see a list of the SCs present (with their respective lengths) in each document in a hit list. The list of SCs with their lengths provides a short synopsis of the document and this can help a searcher decide whether a particular document is likely to be useful. Consider, for example, a situation where a doctor is seeing a long-standing patient with severe asthma who is newly pregnant. The doctor needs to decide whether to continue the patient’s current asthma medications or to change to a different regime to protect the fetus. He does not need documents that describe asthma, or documents that recommend diagnostic tests for asthma. He knows his patient has asthma. He needs a document that describes the safety, or risks, of specific asthma medications in pregnant women. Best-match information retrieval systems typically return documents based on scores related to matching the words in user queries to document representations consisting of words extracted from document text (natural language indexing) or terms, usually assigned from a controlled vocabulary (controlled indexing). Using the SC model, the doctor can request documents about asthma that contain the text term “pregnancy” within the SC ‘treatment’ (with a document type of diseases). Table 1 shows the document classes and semantic components used in the study.
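The three uses of SCs described above can be illustrated with a small sketch. This is not the system used in the study: it assumes documents already carry labelled SC text segments, uses plain substring matching, and applies the SC conditions as strict filters for simplicity, whereas the study also used SCs to rank documents. The `Document` class, the function names and the toy documents are all hypothetical, and the class label "Clinical problem" is taken from Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    doc_class: str
    text: str
    # SC name -> list of text segments forming that SC instance
    components: dict = field(default_factory=dict)


def search(docs, topic_term, sc_name=None, sc_term=None, doc_class=None):
    """Filter documents by topic term, SC presence, and a term inside the SC.

    `sc_term` is only meaningful together with `sc_name`.
    """
    results = []
    for doc in docs:
        if doc_class and doc.doc_class != doc_class:
            continue
        if topic_term.lower() not in doc.text.lower():
            continue
        segments = doc.components.get(sc_name, []) if sc_name else []
        if sc_name and not segments:
            continue  # use (1): the requested SC must be present
        if sc_term and not any(sc_term.lower() in s.lower() for s in segments):
            continue  # use (2): the term must occur inside the SC segments
        results.append(doc)
    return results


def synopsis(doc):
    """Use (3): list each SC present with its length in characters."""
    return {name: sum(len(s) for s in segs) for name, segs in doc.components.items()}


docs = [
    Document("d1", "Clinical problem", "Asthma during pregnancy",
             {"Treatment": ["Inhaled steroids are considered safe in pregnancy."]}),
    Document("d2", "Clinical problem", "Asthma diagnosis",
             {"Diagnosis": ["Spirometry confirms asthma."]}),
    Document("d3", "Clinical problem", "Asthma treatment in adults",
             {"Treatment": ["Inhaled steroids are first-line therapy."]}),
]
hits = search(docs, "asthma", sc_name="Treatment", sc_term="pregnancy",
              doc_class="Clinical problem")
```

In this toy collection only `d1` satisfies the doctor's request, because it is the only asthma document whose treatment segment mentions pregnancy.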

We call the set of document classes and associated sets of semantic components that are identified for a particular document collection a semantic component schema. The semantic component schema for sundhed.dk consists of six document classes and associated semantic components. We developed the schema using random sampling and purposeful browsing to analyse the sundhed.dk document collection. We selected a sample of 72 documents by using 10 common, non-domain-specific search terms, such as the Danish words for and, or, in, and it, to generate large result sets that were not biased to particular topics. We used a pseudo-random number generating program to select documents from the result sets to analyse. We outlined the information in each document using brief descriptions of information content in natural language. We identified the primary focus of each document and classified the documents according to the semantic type, such as clinical problem or drug, of its primary focus. We then considered the kinds of information we found in the documents assigned to each class. We constructed a list of semantic components to represent the commonly occurring information types (often aspects of the main topic) by clustering similar natural language descriptions of the information types into groups and assigning meaningful labels to each group. The initial schema underwent several cycles of iterative improvement as we became more familiar with the document collection and with how physicians use the collection. We consulted physician users of the collection and sundhed.dk employees who were involved in the indexing and editorial processes.

Table 1
Sundhed.dk semantic component model.

Document class | Semantic components
Clinical problem | General information; Diagnosis; Referral; Treatment
Clinical unit | Function and specialty; Practical information; Referral; Staff and organization
Clinical method | General information; Practical information; Referral; Aftercare; Risks; Expected results
Drugs | General information; Practical information; Target group; Effect; Side effects, interactions and contraindications
Services | General information; Practical information; Referral
Notice | General information; Practical information; Qualification

1156 M. Lykke et al. / Information Processing and Management 48 (2012) 1151–1170

4. Research design

In this section we present the methods and the data of our study: Section 4.1 presents the design of the searching study; Section 4.2 presents the data collection methods; and Section 4.3 presents the methods for data analysis and data evaluation. The design of the searching study is described in detail in Price et al. (2009).

4.1. Overall study design

In the searching study we used a convenience sample (as distinguished from a random sample) of 30 family practice physicians from the Aarhus region. We sampled purposively because we did not want to study inexperienced searchers’ ability to exploit search features. We wanted to investigate doctors who already used the Internet and the sundhed.dk portal in their daily work practice, and to study how they use the available search features to solve work-related information search tasks during a primary care visit. We conducted the study in a controlled setting in order to be able to compare querying behaviour and search performance across the test searchers. The doctors searched for documents relevant to four search tasks designed to mimic daily work-related tasks in family practice. The test sessions took place at the regional centre for family practice, and the physicians were paid an amount equivalent to what they could have earned in their practice during the 2-h study. Table 2 summarizes the self-reported medical and searching experience of the 30 participants. The test searchers varied with regard to professional medical experience, but were similar with regard to experience with the sundhed.dk portal and Internet searching.

The experimental session consisted of four search sessions. Each subject used System 1 (the baseline system) for two consecutive tasks and System 2 (the experimental system with SC searching features) for two consecutive tasks. (See the next section for a more detailed description of System 1 and System 2.) We randomized the order of tasks and system use in order to avoid any effect that the order might have on the result, because the test subjects might gain new IR and subject knowledge during the search. Another intention was to neutralize any effect caused by increasing familiarity with the experiment.
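The assignment described above can be illustrated with a minimal sketch. The paper does not state how the randomization was carried out, so the independent per-participant shuffle below is an assumption (a Latin square or another counterbalancing scheme may have been used), and the function name and the seeding by participant number are hypothetical.

```python
import random

TASKS = ["A", "B", "C", "D"]
SYSTEMS = ["System 1", "System 2"]


def assign_session_plan(rng):
    """Return four (task, system) pairs for one participant.

    Assumption: task order and the system used first are drawn
    independently at random; the source only says both were randomized.
    """
    tasks = TASKS[:]
    rng.shuffle(tasks)        # random task order
    systems = SYSTEMS[:]
    rng.shuffle(systems)      # random choice of the first system
    # Tasks 1-2 run on the first system, tasks 3-4 on the other one,
    # so each system is used for two consecutive tasks.
    return [(task, systems[i // 2]) for i, task in enumerate(tasks)]


# One hypothetical plan per participant (30 participants in the study).
plans = {p: assign_session_plan(random.Random(p)) for p in range(1, 31)}
```

Each plan keeps the constraint stated in the text: two consecutive tasks on one system, followed by two consecutive tasks on the other.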

We developed the search tasks in collaboration with two family doctors using the methodology of Borlund (2003). Each task represented a typical information search task that might be encountered in the context of a patient visit. The doctors formulated the search tasks. They were instructed to formulate the tasks in everyday terminology reflecting colloquial dialogue and conversation in family practice. Our tasks were very much in line with prior work on clinical questions (Ely et al., 1999; Magrabi et al., 2005; Pedersen, 2005). Tasks A, B and D concern disease management and referral to the public health care system, and task C poses a question about drug dosage. Table 3 provides a condensed summary of the search tasks with an indication of search facets.

4.1.1. Experimental search system and test collection

The operational sundhed.dk web portal is the national Danish health portal intended for both healthcare professionals and citizens. The portal was established in 2005 and contained nearly 25,000 documents at the time of the study. The portal uses Ultraseek, a commercial search engine developed by Verity Inc., and later acquired by Autonomy Corporation (Autonomy, 2011). We were granted a temporary license for the Ultraseek 5.6 software by Ensight (now Metier), the Danish distributor for Verity/Autonomy products. Sundhed.dk gave us copies of its configuration files so that we could mimic the operational system.


Table 2
Searcher characteristics.

Searcher characteristic | Value | Std. Dev.
Experience using Internet search engines | 7.18 years | 2.82 years
Experience using sundhed.dk to find information about health care or the healthcare system (not patient data) | 2.43 years | 1.43 years
Self-described level of searching experience on a scale from 1 (not at all experienced) to 5 (very experienced) | 2.40 | 0.93
Experience as a medical professional | 21.44 years | 7.59 years

Table 3
Information search tasks.

Search task A
Topic description: Ex-smoker, cough, fatigue, shortness of breath. Two tentative diagnoses: (1) Chronic obstructive pulmonary disease (COPD) and (2) Lung cancer. First step is examination of COPD due to waiting time for lung cancer examination. Find documents that help you to decide referral procedure to follow to evaluate for COPD.
Topical facets:
– Diagnosis: Chronic obstructive pulmonary disease (COPD)
– Admin activity: Referral
– Health care activity: Evaluation
Non-topical facets:
– Information type: Referral guidelines
– Region: Århus

Search task B
Topic description: Pregnant woman (week 23), light bleeding, no pain, no other abnormal conditions. Find documents that help you to decide if and how the patient should be referred to acute examination at hospital.
Topical facets:
– Condition: Pregnancy
– Diagnosis: Spontaneous abort
– Phase: 2. Trimester
– Symptom: Bleeding
– Admin activity: Referral
Non-topical facets:
– Information type: Referral guidelines; Treatment description
– Region: Århus

Search task C
Topic description: Woman with two un-intentional miscarriages, ready to get pregnant again. Find documents that help you to decide if the patient should take folate, and at what dose.
Topical facets:
– Condition: Pregnancy after two un-intentional miscarriages
– Health care activity: Drug prescription
– Mechanism: Folate
Non-topical facets:
– Information type: Treatment description

Search task D
Topic description: Man attacked with knife. Now nervous and afraid to leave his apartment. Find documents that help you to decide if the patient can be referred for free psychological help (covered by public insurance).
Topical facets:
– Diagnosis: Anxiety
– Treatment: Psychological treatment
– Health service: Public insurance
– Admin activity: Referral
Non-topical facets:
– Information type: Referral guidelines


We created an experimental search system based on the existing sundhed.dk portal that consisted of documents, a search engine, and two different search interfaces that we labelled as System 1 and System 2. We copied all 24,712 documents owned by sundhed.dk as of July 2006 (including keyword and metadata fields) to the search system. These pages contained information about health and healthcare and also about the Danish healthcare system. For example, the portal included referral guidelines that specify how a family practitioner should refer a patient to the hospital, what tests must be done first, and which records should be sent to the hospital. The portal also included information about which services are available and whether they are subsidized by the government. Some information is written for healthcare providers and some for patients and their families, but all the documents are available to everyone on the public web portal.

The Ultraseek software system provides three main functionalities in the sundhed.dk portal: (1) it indexes all the documents, (2) it generates a search interface, implemented as a web page, and (3) it performs requested searches and generates a web page with a ranked list of links to documents that comprise the search result. Both the indexing and the search interface are customizable by setting parameters through an administrative interface and by editing the code that generates the user interface. The internal algorithms for searching the indexes and for ranking results are proprietary and cannot be viewed or modified. Documentation on the Ultraseek website describes the scoring algorithm, in very general terms, as taking into account term frequency, term location within the document, rarity of individual terms, occurrence of multiple query terms, and document quality “based on numerous factors” (Ultraseek, 2008).


Fig. 1. Search interfaces (mostly in Danish) for System 1 (left) and System 2 (right).


The System 1 interface provides a simple search box plus two search filters that are controlled by dropdown menus, one to filter documents by the region of Denmark to which the documents apply (labelled Regionalt indhold in the interface) and one to filter documents by an existing document classification (labelled Informationstype in the interface) used by sundhed.dk. The default behaviour for both filters was to include all documents (i.e., apply no filter). Queries typed into the search box use the Ultraseek query syntax, which includes wildcard expansion when an asterisk is included in a search term. The left panel of Fig. 1 shows a cropped screenshot of the System 1 interface.

System 2 has the same features as System 1 plus the ability to further specify the search using semantic components. (See Section 3 for an explanation of the SC schema that was used.) To search using System 2, the searcher typed one or more search terms into the search box labelled Search and optionally chose an item from the dropdown menus for the two filters, as for System 1. In addition, the searcher could (optionally) enter one or more search terms into one or more of the text boxes for the semantic components. The right panel of Fig. 1 shows a cropped screenshot of the System 2 interface. The text boxes were grouped by document class (the name of the class is in bold font) and labelled with the semantic component. The circle highlights a search term (the Danish equivalent of pregnan*) in a text box associated with the semantic component for treatment. System 1 was produced using the standard Ultraseek system, configured to mimic the operational sundhed.dk system. System 2 was based on System 1 but with customization of the web interface code. To search using SCs, the searcher typed one or more search terms into the search box labelled Search, e.g. astma, and specified that he wanted a document about a disease (equivalent of Klinisk problem in Fig. 1) that contains the term pregnan* (equivalent of gravi*) in the full text in the treatment SC (equivalent of Behandling). The experimental systems used a combination of full-text and controlled indexing and allowed the user to express the search facets by free natural language terms and controlled keywords from two international health-related classification systems and a locally developed Citizens Thesaurus:

• International Classification of Primary Care (ICPC), developed to classify patient data and clinical activity in the domains of family practice and primary care (International Classification of primary care, 2011).
• The International Classification of Diseases (ICD-10), developed to classify documents for hospital professionals (International Classification of Diseases (ICD), 2011).
• Citizens Thesaurus, a home-grown thesaurus developed to meet topics of interest and vocabulary of citizens.

Not all documents in the portal are indexed by controlled keywords and classification codes, but the controlled systems are well-known and widely used in indexing and searching in Danish health care. Family doctors are especially familiar with the ICPC codes and use them for indexing patient records and for searching the sundhed.dk portal.

The authority dropdown lists for the two search filters ‘information type’ and ‘region’ constitute a second type of controlled search vocabulary, and the SC model represents a third type. The six document classes in the semantic component schema represent document types. When a searcher limits a search to one of the document classes, he/she searches for one of the schema’s six document types. When the searcher limits the search to one or more semantic components, the searcher limits the search to facets represented by the semantic component model (see Table 1 for an overview of the semantic component schema used in this study).

Seven experienced indexers, who had received training with respect to semantic components and to our semantic component indexing software, indexed 371 documents using the SC model. The documents were identified by executing a variety of searches applicable to each of the four search tasks. The goal was to identify documents most likely to be retrieved at a high rank by the users, plus all documents relevant to the tasks.

We stored the semantic component data in metadata fields that we added to each indexed document. Data included the indexer-assigned document class, a list of the semantic components present in the document, the size of each semantic component instance (the number of characters in the instance), and the text in each semantic component instance. After configuring Ultraseek to index our newly-defined metadata fields, when present, we indexed all 24,712 documents with Ultraseek.

4.1.2. Retrieval and results display

We configured both System 1 and System 2 to return 100 hits, ordered by similarity score. Ultraseek provided the ability to differentially weight words that appear in different parts of the document. The same weights that were used in the operational sundhed.dk system were used for both systems. System 1 returned documents using the Ultraseek similarity algorithm based on full text indexing of the title, body, keywords, and designated metadata fields. If a value was selected for either of the two filters, Informationstype or Regionalt indhold, documents matching the query term were returned only if the document also contained the appropriate value in the metadata field for the selected filter(s). System 1 did not search metadata fields that represented semantic components.

System 2 sent the query in the main (simple) search box plus the values for the two filters, if any, to the Ultraseek search engine exactly as for System 1. The documents returned were precisely those that would have been returned by the same search submitted to System 1. Only documents returned by the simple query were eligible for inclusion in the results set. However, unlike System 1, System 2 intercepted the result list and similarity scores, and sent a second query with only the terms (including asterisks used as wildcards) that were entered into the semantic component fields as a fielded search of the selected semantic component metadata fields. When only an asterisk appeared in a semantic component search box, a document was returned only if it was returned by the initial search for documents containing terms in the simple search box and it contained an instance of the semantic component indicated by the asterisk. The similarity scores for the second search were determined solely by the similarity of the semantic component part of the query to the corresponding semantic component instances in the retrieved documents. Documents without an instance of the requested semantic component were not returned from the second search and were assigned a similarity score of zero for the second search. Document ranking was determined by a final similarity score that was a linear combination of the similarity scores from the two searches. The results of the two searches were equally weighted, thus the resulting score was the average of the scores from the two searches.
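The equal-weight combination of the two searches can be sketched as follows. This is an illustration only: Ultraseek's scoring is proprietary, so the input scores, the document IDs and the function name are hypothetical. The sketch assumes that every document from the simple query stays in the result list and receives a second-search score of zero when it has no matching semantic component instance; the asterisk-only case, which the text describes as excluding such documents, is not modelled.

```python
def combine_scores(base_hits, sc_hits):
    """Rank documents by the average of the two similarity scores.

    base_hits: dict of doc_id -> score from the simple query (plus filters)
    sc_hits:   dict of doc_id -> score from the fielded SC search
    Only documents present in base_hits are eligible for the result set.
    """
    final = {}
    for doc_id, base_score in base_hits.items():
        # No matching SC instance: the second search contributes zero.
        sc_score = sc_hits.get(doc_id, 0.0)
        final[doc_id] = (base_score + sc_score) / 2
    # Highest final score first; ties broken by doc_id for determinism.
    return sorted(final.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores on a 0-1 scale.
base = {"d1": 0.8, "d2": 0.6, "d3": 0.9}
sc = {"d1": 0.4, "d2": 1.0, "d9": 0.7}   # d9 was not returned by the simple query
ranking = combine_scores(base, sc)
```

In this toy example the document with the best simple-query score (d3) drops to the bottom of the ranking because it has no matching semantic component instance, which is the kind of re-ranking the averaging rule produces.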

The results displays for both System 1 and System 2 mimicked the operational system. Both systems displayed the title, a snippet of text showing the query term in context, the document ID, the Region (if any) for which the document was written, the document type (Informationskategori) used by the operational system, and a summary written by the document author. In addition, System 2 also displayed: (1) the document class selected by the indexer from our list of six document classes (Documenttyper) and (2) a list of semantic components appearing in the document plus an integer to indicate the size, in number of characters, of the semantic component instance. Indication of the size of the semantic component instance was shown to give the searcher an idea of the amount of information given about the selected semantic component, e.g. Treatment.

4.2. Data collection methods

We collected data about searching behaviour through search logs, questionnaires, verbal protocols, interviews, and relevance assessments.

4.2.1. Search logs

The systems’ logging facility provided detailed data on search interactions: query identity, query terms, semantic component *, semantic component term, search filters, and hits.

4.2.2. Questionnaires

Three types of questionnaires were distributed. The pre-session questionnaire established searcher skills and experience, experience with the sundhed.dk portal, educational background and work experience. The post-system questionnaire gathered information on the ease of use and user satisfaction with the two versions of the search interface. The post-system questionnaire was distributed in two parts, first when a test searcher had completed the first two search tasks using one of the test interfaces, and second when the searcher had completed the last two search tasks using the second test interface. The final post-questionnaire collected data on searcher preferences and views on future use of the search features under investigation.

4.2.3. Verbal protocols

Searchers were instructed to ‘think aloud’ as they interacted with the system in order to get some insight into their perceptions, problems, strategies and understanding of the task in hand. During the sessions, probing questions were asked by the experimenter in order to gain more insight into the searchers’ perceptions of the individual topics and search tasks. The verbal utterances were audio-recorded digitally. Statements that appeared important were written down manually during the search sessions. The written statements were later thematically analysed. On a few occasions the digital recording was consulted for clarification of the written statement.

Interviews: A final post-session semi-structured interview provided further information on the system’s interactive search features as well as the overall experimental setting. The interviews were audio-recorded digitally, and transcribed by condensing the informants’ articulated meanings to shorter formulations. A thematic analysis was carried out to provide an overview of the themes and meanings.

4.2.4. Relevance assessments

Two types of assessments were collected. We recorded relevance assessments from the test subjects by having them type into a Word document in order to calculate search system performance from a user perspective. We asked a family practitioner with expertise in research and clinical care to create a gold standard by which we judged the relevance from a system perspective. Sormunen’s (2002b) four-point scale that classifies documents as irrelevant, marginally relevant, fairly relevant, and highly relevant was used for expert, system-oriented relevance assessments as well as for user relevance assessments.

To study and report the querying activities we primarily used data from the transaction log files. Data from questionnaires, verbal protocols, and interviews were used to gain a better understanding of the querying activities emerging from the log data. We used relevance data from the expert practitioners to identify queries with high retrieval performance.

4.3. Data analysis of querying behaviour

We analysed the log data and querying behaviour. In order to explore query formulation behaviour we studied how many queries were issued to the system, how the searchers divided the search tasks into topical and non-topical search facets, how many query terms they used, which semantic category of terms they used to express the search facets, how they applied the search filters and the semantic component features (document classes and components), and how many times they modified and reformulated the queries. We used the ideal set of search facets presented in Table 3 to gain insight into the search facets used by the searchers. We studied the impact of querying activities by analysing high performance queries and query failures, using multiple regression analyses.

We analysed the queries individually and at the session level. Following Jansen’s (2006) definitions, a query is defined as a list of zero or more terms submitted to a search engine, and a session is all of the queries issued by the searcher to fulfil an information search task. A query term is a string of characters, with terms separated by a delimiter such as a space or other separator. A search key refers to any expression denoting an individual search facet (Vakkari, 2000). A search key may contain one, two or n terms. We distinguished between topical search facets that are topical aspects of the search topic, and non-topical facets that enclose or contextualise the topic, e.g. that the topic is presented in a certain form, limited to a certain time or geography, or targeted to a certain audience or work task. All four tasks were divided into a combination of topical and non-topical facets: task A into three topical and two non-topical facets, task B into five topical and two non-topical facets, task C into three topical and one non-topical facet, and task D into four topical and one non-topical facet (see Table 3).

In order to get an understanding of the user’s search vocabulary, we compared the search keys to the terms used to express search facets in the search task description. We divided the keys into: (a) keys identical to task description terms and (b) other terms. We do not know whether the searchers selected the search keys purposefully from the task description, but the categorization provided a starting point to investigate whether the searchers used synonym variations to express search facets, whether they expressed search facets by use of a hierarchically broader or narrower term, or whether they chose to express search facets from another perspective by a related search facet. Hence, we classified the keys that were not identical to task description terms into five semantic categories, adopted from Wang (1997). A synonym (ST) is a key that is interchangeable with the corresponding task description term. A broader term (BT) refers to a key which is broader in the hierarchy. A narrower term (NT) refers to a key which is narrower in the hierarchy. A related term (RT) is a key which is associated and represents a related perspective. A classification code (CC) is a key that expresses the search facet by a code. We also coded terms as task description terms (TD), or terms other than task description terms (OT). Terms that are not topical keys were also coded (Non-t).

The facet analysis was carried out by two experts in facet analysis and information retrieval. First, the experts conducted the analysis separately; next, the results were compared; and finally, the differences were resolved during a group discussion. Two members of the research team coded the queries and search keys. Differences were discussed and solved jointly.

We measured search performance of each individual query using the normalized Discounted Cumulative Gain (nDCG) metric based on expert relevance judgments (Järvelin & Kekäläinen, 2002). Because we were simulating a search setting where information needs were very specific, we assigned values of 0, 1, 10, and 100 to documents with relevance ratings of 0, 1, 2, and 3 respectively. We used a factor of 10 to separate these values because marginally relevant and partially relevant documents are generally not very useful in the setting we simulated. Similarly, because the search tasks simulated a setting where time available for searching is limited, we used a discounting parameter of base 2 to simulate a “busy” user. Based on the performance scores, we categorized the queries into four groups: high performance queries with nDCG ranging from 1.00 to 0.66, middle performance queries with nDCG ranging from 0.65 to 0.33, low performance queries with nDCG ranging from 0.32 to 0.01, and failures corresponding to zero hits. We did this to analyze whether it is possible to identify certain characteristics for, respectively, high, middle, low performance and failure queries regarding search facets, search terms and search features. We did multiple-linear regression analyses for each search task in order to correlate performance and search features. The purpose was to provide a quantitative picture and explanation of the features’ individual importance for the search result. We did not do a test of statistical significance, because data was not randomly sampled. The variables of the study are presented in Table 4.

Table 4
Variables of the study.

Variable | Definition and measurement
Queries per session | Average number of queries per session
Terms per query | Average number of terms per query
Terms per session | Average number of terms per session
Search keys per query | Average number of search keys per query
Search keys per session | Average number of search keys per session
Sessions with reformulations | Percentage of sessions with reformulations
Task description search keys (TD) | Percentage of search facets expressed by task description term
Synonym search keys (ST) | Percentage of search facets expressed by synonym term to task description term
Classification code search keys (CC) | Percentage of search facets expressed by classification code
Related search keys (RT) | Percentage of search facets expressed by related terms to task description term
Broader search keys (BT) | Percentage of search facets expressed by broader terms to task description term
Narrower search keys (NT) | Percentage of search facets expressed by narrower terms to task description term
Non-topical search keys (NTT) | Percentage of search facets expressed by non-topical terms
‘Region’ filter per query | Average number of ‘region’ filters per query
‘Information type’ filter per query | Average number of ‘information type’ filters per query
Semantic component 1 ‘term’ (SC1term) per query | Average number of first semantic component term per query
Semantic component 1 ‘*’ (SC1*) per query | Average number of first semantic component * per query
Semantic component 2 ‘term’ (SC2term) per query | Average number of second semantic component term per session
Semantic component 2 ‘*’ (SC2*) per query | Average number of second semantic component * per session
Performance score for query | Discounted Cumulative Gain (DCG) for query
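The performance measure can be illustrated with a short sketch. It uses the gain values (0, 1, 10, 100) and the base-2 discount given in the text, and follows the usual Järvelin and Kekäläinen formulation in which ranks below the log base are not discounted. Several details are assumptions rather than the study's implementation: the ideal ranking is built from the judged documents passed in as `pool_relevance`, the score is computed at the length of the returned list, and a rounded score of 0.00 with at least one hit is placed in the low band. All function names are hypothetical.

```python
import math

# Gain values from the text for relevance ratings 0, 1, 2 and 3.
GAIN = {0: 0, 1: 1, 2: 10, 3: 100}


def dcg(gains, base=2):
    """Discounted cumulative gain over a ranked list of gain values."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        # Ranks below the log base are not discounted.
        total += gain if rank < base else gain / math.log(rank, base)
    return total


def ndcg(ranked_relevance, pool_relevance, base=2):
    """nDCG of a result list against an ideal ordering of the judged pool."""
    gains = [GAIN[r] for r in ranked_relevance]
    ideal = sorted((GAIN[r] for r in pool_relevance), reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal, base)
    return dcg(gains, base) / ideal_dcg if ideal_dcg > 0 else 0.0


def performance_band(score, n_hits):
    """Map a query to the four groups used in the analysis."""
    if n_hits == 0:
        return "failure"          # failures correspond to zero hits
    score = round(score, 2)
    if score >= 0.66:
        return "high"
    if score >= 0.33:
        return "middle"
    return "low"                  # assumption: also covers 0.00 with hits


# Hypothetical query: a highly relevant hit, an irrelevant hit, a fairly relevant hit.
example = ndcg([3, 0, 2], [3, 2, 0])
band = performance_band(example, n_hits=3)
```

With the steep 0/1/10/100 gain scale, a single highly relevant document near the top dominates the score, which matches the stated aim of rewarding queries that surface the one document a busy doctor needs.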

5. Results and discussion

In this section we present and discuss the results. In Sections 5.1, 5.2 and 5.3 we discuss data about the doctors’ querying behaviour: how they structured the search tasks into search facets and query terms, how they formulated and modified queries, and how they obtained high retrieval performance. In Section 5.4 we describe how the querying activities contributed to the search results. Section 5.5 relates the findings to previous research about physicians’ searching behaviour. Table 5 provides an overview of the results of the study variables over the four search tasks and the two test systems. When we later discuss the results in detail, we additionally present data individually for the tasks and systems, because the tasks resulted in individual combinations of querying activities, and Systems 1 and 2 provided different options.

5.1. Query formulation and query terms

Each of the thirty test searchers completed four information searching tasks, resulting in a total of 120 search sessions.They used 343 queries to fulfil the tasks, 151 queries in System 1 and 191 queries in System 2. On average, the number of

Table 5Results for variables for 30 test searchers and 4 search tasks A–D.

Variables System 1 Sessions N = 60Queries N = 151

System 2 Sessions N = 60Queries N = 191

Queries per session (average) 2.5 3.2Terms per query (average) 2.0 1.5Terms per session (average) 5.1 4.9Search keys per query (average) 1.8 1.5Search keys per session (average) 4.6 4.7Sessions with reformulations (percentage) 53.3 65.0Task description search keys (TD) (percentage) 69.8 62.6Synonym search keys (ST) (percentage) 14.4 9.0Classification code search keys (CC) (percentage) 1.1 6.0Related search keys (RT) (percentage) 12.2 22.4Broader search keys (BT) (percentage) n/a n/aNarrower search keys (NT) (percentage) n/a n/aNon-topical search keys (NTT) (percentage) 2.2 n/a‘Region’ filters per query (percentage) 45.0 34.0‘Information type’ filters (percentage) 49.3 22.9Semantic component 1 ‘term’ (percentage) n/a 19.9Semantic component 1 ‘�’ (percentage) n/a 48.2Semantic component 2 ‘term’ (percentage) n/a 3.7Semantic component 2 ‘�’ (percentage) n/a 4.2

Page 12: How doctors search: A study of query behaviour and the impact on search results

1162 M. Lykke et al. / Information Processing and Management 48 (2012) 1151–1170

queries per session was 2.9, with 2.5 in System 1 and 3.2 in System 2. On average, the searchers used the most queries to solve task B (4.1), and the fewest to solve task A (1.6). For a summary of the number of queries used, see Table 6.
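The per-session averages quoted above follow from the query counts reported in Table 6. As a quick arithmetic check, the counts can be divided by the number of sessions (15 per task and system, per Table 7). This is only a sketch: the dictionary layout and the helper name `average_per_session` are illustrative and not part of the study.

```python
# Query counts per search task (A-D), copied from Table 6.
queries = {
    "System 1": {"A": 25, "B": 59, "C": 28, "D": 40},
    "System 2": {"A": 24, "B": 65, "C": 52, "D": 50},
}
SESSIONS_PER_TASK_AND_SYSTEM = 15  # 60 sessions per system over 4 tasks (Table 7)


def average_per_session(query_count, session_count):
    """Return the mean number of queries per session, rounded to one decimal."""
    return round(query_count / session_count, 1)


# Totals and averages per system (4 tasks x 15 sessions = 60 sessions each).
system_totals = {name: sum(tasks.values()) for name, tasks in queries.items()}
system_averages = {
    name: average_per_session(total, 4 * SESSIONS_PER_TASK_AND_SYSTEM)
    for name, total in system_totals.items()
}

# Totals and averages per task, pooled over both systems (30 sessions each).
task_totals = {
    task: sum(queries[name][task] for name in queries) for task in "ABCD"
}
task_averages = {
    task: average_per_session(total, 2 * SESSIONS_PER_TASK_AND_SYSTEM)
    for task, total in task_totals.items()
}

overall_total = sum(system_totals.values())
overall_average = average_per_session(overall_total, 120)
```

Under these counts the pooled figures reproduce the averages in the text, which is a useful consistency check when reading Tables 5 and 6 side by side.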

The searchers reformulated the queries in 53.3% of the sessions in System 1 with up to 7 reformulations, and in 65% of the sessions in System 2 with up to 10 reformulations; for details, see Table 7. The reformulations consisted of combinations of strategies: removal/addition/change of the information type filter, removal/addition of the region filter, removal/addition/change of SCs, substitution of SC*/SC term, substitution of facets, and variation of synonyms. There was no observable consistent pattern among searchers regarding their choice of reformulation tactics. Similarly, at the individual level the searchers did not seem to use a systematic reformulation strategy, e.g. initially trying out synonym variations, next related terms, then changing the filters. They reformulated the queries and applied search tactics without any observable order.

The searchers used a total of 599 terms to express the search facets in query terms, on average 5.1 terms per session in System 1 and 4.9 terms per session in System 2. The average number of query terms per session varied over the four search tasks, from 2.0 terms for task A to 9.0 terms for task B. For details see Table 8. The searchers used on average 2.0 query terms per query in System 1 and 1.5 in System 2; for details, see Table 9. Compared with the overall average of 1.7 terms per query, they used 0.5 fewer terms for task A and 0.5 more terms for task B. On a few occasions the searchers used advanced Ultraseek syntax in query formulation, e.g. wildcard *, Boolean +, or quoted strings (e.g., “bleeding under pregnancy”) to express a compound term.

The querying behaviour was slightly different across systems as well as across tasks. Searchers consistently entered more queries into System 2 than into System 1; for task C the difference was highest. Furthermore, they entered more queries for task B and fewer queries for task A. The number of reformulations was similar for the two systems, except for task C where the number of reformulations was higher for System 2 than for System 1. The number of query terms per session varied over the tasks, with more terms for task B and fewer for task A. They used more query terms per session in System 1 than in System 2, except for task C.

The higher level of query activity for System 2 compared to System 1 is not surprising given that System 2 had more features to try out, and given that the participants knew that we were evaluating new features based on our test design.

Table 6
Queries per session (average per session).

Queries (average per session)

            A          B           C          D          All
System 1    25 (1.7)   59 (3.9)    28 (1.9)   40 (2.7)   152 (2.5)
System 2    24 (1.6)   65 (4.3)    52 (3.5)   50 (3.3)   191 (3.2)
All         49 (1.6)   124 (4.1)   80 (2.7)   90 (3.0)   343 (2.9)

Table 7
Number of sessions with query reformulation.

Number of sessions with query reformulation (percentage)

                    A          B          C          D          All
                    S1   S2    S1   S2    S1   S2    S1   S2    S1          S2
Reformulation       5    5     13   12    5    11    9    11    32 (53.3)   39 (65.0)
No reformulation    10   10    2    3     10   4     6    4     28 (46.7)   21 (35.0)
Total               15   15    15   15    15   15    15   15    60          60

Table 8
Query terms per session (average per session).

Query terms (average per session)

            A          B           C           D           All
System 1    35 (2.3)   144 (9.6)   50 (3.3)    78 (5.2)    307 (5.1)
System 2    26 (1.7)   127 (8.5)   70 (4.7)    69 (4.6)    292 (4.9)
All         61 (2.0)   271 (9.0)   120 (4.0)   147 (4.9)   599 (5.0)

Table 9
Query terms per query (average per query).

Query terms (average per query)

            A          B           C           D           All
System 1    35 (1.4)   144 (2.4)   50 (1.8)    78 (2.0)    307 (2.0)
System 2    26 (1.1)   127 (2.0)   70 (1.3)    69 (1.4)    292 (1.5)
All         61 (1.2)   271 (2.2)   120 (1.5)   147 (1.6)   599 (1.7)


The higher level of querying activity for task B has a straightforward explanation. Task B turned out to be controversial because the searchers used different relevance criteria compared to the reference standard. Only one document was judged highly relevant in the reference standard. Five of the users who examined this document also judged it highly relevant and three judged it fairly relevant, but six judged it irrelevant and one judged it only marginally relevant. As a consequence, the searchers had difficulties finding relevant documents, and thus continued searching, even though they retrieved documents during the search that were judged as highly relevant from the point of view of the expert who created the reference standard.

The higher level of querying activity for task C in System 2 compared to System 1 occurred because the searchers used different combinations of semantic components to specify the search. Task C concerns the dosage of folate for a pregnant woman after an unintended miscarriage (search tasks are described in Table 3), and in this task it was possible to look at the search problem from two main topical perspectives, pregnancy or drug prescription. Each of these main perspectives is represented in the semantic component model by a document class: Clinical problem (pregnancy) and Drug (folate as a medication). Within each of these two document classes it was furthermore pertinent to use several of the available semantic components, e.g. the components “general information” and “treatment” for the Clinical problem class and the components “target group” and “practical information” for the Drug class. Thus, there were many alternative search combinations using the semantic component model, and the findings reflect that the searchers exploited the range of combinations.

5.2. Search filters and semantic components

The searchers used the region filter in 45% of the queries in System 1 and 34% in System 2; for details, see Table 10. The region filter was mostly used for tasks A and B, where regional procedures were an important search facet (for an overview of tasks, see Table 3).

The ‘information type’ filter was used in 49.3% of the queries in System 1 and in 29.3% in System 2; for details, see Table 11. For all tasks the searchers frequently reformulated the information type, specifically substituting the controlled authority list terms “referral guideline” and “treatment description”, which are semantically very close. They also added and removed filters.

SC*, where the searcher limited the search to documents that contain the selected SC, was used more frequently than SC term, where the searcher searched for terms within SC text segments. They used SC1* in 48.2% of the queries, and SC1 term in 19.9% of the queries, in both cases using the semantic component as the first specification (SC1). As the second specification (SC2), they used SC2* in 4.2% of the queries, and SC2 term in 3.7% of the queries; see Tables 12 and 13 for details.
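The difference between the two options can be illustrated with a toy model. This is a minimal sketch and not the study's implementation: the document structure, the function names `sc_star` and `sc_term`, and the sample texts are all hypothetical, and simple substring matching stands in for whatever matching the search engine actually performed.

```python
# Hypothetical toy collection: each document maps a semantic component
# name to the text of the segment labelled with that component.
documents = {
    "doc1": {
        "treatment": "folate supplement during pregnancy",
        "referral": "refer to hospital",
    },
    "doc2": {"general information": "folate is a B vitamin"},
    "doc3": {"treatment": "rest and observation"},
}


def sc_star(docs, component):
    """SC*: keep documents that contain the selected semantic component at all."""
    return sorted(
        doc_id for doc_id, segments in docs.items() if component in segments
    )


def sc_term(docs, component, term):
    """SC term: keep documents whose segment for that component mentions the term."""
    needle = term.lower()
    return sorted(
        doc_id
        for doc_id, segments in docs.items()
        if needle in segments.get(component, "").lower()
    )


treatment_any = sc_star(documents, "treatment")
treatment_folate = sc_term(documents, "treatment", "folate")
```

In this toy collection, `sc_star` keeps every document with a ‘treatment’ segment, while `sc_term` narrows the result to the segment that also mentions the term, so the second option is always at least as restrictive as the first.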

The searchers used semantic components for different purposes. They used semantic components to search for referral information using the ‘referral’ semantic component, as opposed to using the ‘information type’ filter “referral guideline”. In these cases they most frequently used the option of specifying that the search was to include a specific semantic component without specifying a specific term, e.g. ‘referral’/* as opposed to ‘referral’/psychologist. Similarly, they used the ‘treatment’ semantic component rather than using the ‘information type’ filter option of “treatment description”. In this case they primarily used the simple ‘treatment’/* semantic component.

Table 11
Information type filter per query (percentage).

Information type filter per query (percentage)

            A           B           C           D           All
System 1    16 (64.0)   31 (52.5)   13 (46.4)   15 (37.5)   75 (49.3)
System 2    10 (41.7)   12 (18.5)   12 (23.0)   22 (44.0)   56 (29.3)

Table 10
Region filter per query (percentage).

Region filter per query (percentage)

            A           B           C           D           All
System 1    15 (60.0)   35 (59.0)   7 (25.0)    11 (28.0)   68 (45.0)
System 2    12 (50.0)   32 (49.0)   10 (19.0)   11 (22.0)   65 (34.0)

Table 12
SC1 per query (percentage).

SC1 per query (percentage)

            A           B           C           D           All
SC1 term    2 (8.3)     12 (18.5)   18 (34.6)   6 (12.0)    38 (19.9)
SC1*        19 (79.2)   31 (47.7)   23 (44.2)   19 (38.0)   92 (48.2)


Table 13
SC2 per query (percentage).

SC2 per query (percentage)

            A          B         C         D     All
SC2 term    3 (12.5)   2 (3.1)   2 (3.8)   n/a   7 (3.7)
SC2*        3 (12.5)   3 (4.6)   2 (3.8)   n/a   8 (4.2)


They furthermore used semantic components to combine and specify topical facets of the search task; e.g. they searched for the term “bleeding” in the ‘treatment’ semantic component instead of searching for it as a query term, or they searched for the term “folate” in the ‘treatment’ semantic component to specify that they were searching for “folate” as a kind of medication. Thus, when the searchers wanted to limit the search to topical facets as opposed to non-topical facets such as ‘information type’ and ‘region’, they primarily used the option of specifying a specific term, e.g. they searched for the query terms ‘folate’ ‘pregnancy’ and the semantic component ‘treatment’/spontaneous abortion.

In sum, the searchers used the information type filter and region filter more in System 1 than in System 2. In System 2 they used the semantic component model to specify the information type instead of using the information type filter. They mostly used SC*, limiting the search to documents that contain the selected SC, instead of using SC term, which limited the search additionally to specific terms within SC text segments. On the other hand, they used SC term as opposed to SC* to specify the semantic component when they wanted to combine topical aspects.

5.3. Query structure, vocabulary and semantic categories

The searchers used 275 search keys in System 1 to express the search facets; on average they used 1.8 keys per query and 4.6 keys per session. Searchers used 281 search keys in System 2, and, on average, they used 1.5 keys per query and 4.7 keys per session. The majority of search keys were identical to task description terms: 69.8% in System 1 and 62.6% in System 2; for details, see Table 14.

5.3.1. Query structure

In general, the searchers were consistent in their choice of search facets. For task A, they searched the diagnosis facet by searching for chronic obstructive pulmonary disease (COPD), for task B they searched the condition facet by searching for Pregnancy and the symptom facet by searching for Bleeding, for task C they searched the mechanism facet by searching for folate and/or the condition facet by searching for Pregnancy, and for task D they searched the admin activity facet by searching for Referral and the treatment facet by searching for Psychology. These facets represent the main facets of the tasks. In a few cases the searchers chose to express the main facet by other facets, e.g., in search task B they used the diagnosis facet by searching for “spontaneous abortion” instead of the symptom facet by searching for “bleeding”, thereby interpreting and translating the search problem from facts (bleeding) to diagnosis (spontaneous abortion). Most searchers started the searching task using the main facets.

Other facets were used for query modification and reformulation. For task A the searchers reformulated by adding the non-topical facet information type, for task B they reformulated using the facet diagnosis (spontaneous abortion), the facet phase (2. Trimester), or other symptom facets (heart sound and movement). For task D, they reformulated using the diagnosis facet (by searching for anxiety). They primarily reformulated using the non-topical facet information type for task C.

In fewer cases they reformulated by using synonym keys (ST), consisting of true synonyms, abbreviations, classification codes, and common/scientific terms. All classification codes were taken from the International Classification of Primary Care (ICPC), developed to classify patient data and clinical activity in the domains of family practice and primary care. Neither broader keys (BT) nor narrower keys (NT) were used; the searchers chose search keys at the same level of specificity as the task description.

Table 14
Search key vocabulary (percentage).

Search key vocabulary (percentage)

       A                        B                          C                        D                        All
       S1          S2           S1           S2            S1          S2           S1          S2           S1            S2
TD     20 (69.0)   22 (92.0)    87 (74.0)    70 (58.0)     38 (76.0)   37 (52.0)    47 (61.0)   47 (64.0)    192 (69.8)    176 (62.6)
OT     9 (31.0)    2 (8.0)      33 (26.0)    49 (42.0)     11 (24.0)   32 (48.0)    30 (39.0)   22 (36.0)    83 (30.0)     105 (39.0)
All    29 (100.0)  24 (100.0)   120 (100.0)  119 (100.0)   49 (100.0)  69 (100.0)   77 (100.0)  69 (100.0)   275 (100.0)   281 (100.0)

TD refers to task description terms, and OT to terms other than task description terms.


Non-topical facets were rarely expressed using query terms. Only four queries in System 1 used query terms to express the non-topical facet information type. In System 2 the information type facet was often expressed by semantic components.

Sometimes the reformulations increased search performance; other times the first query in a session turned out to be the best performing query. Table 15 provides examples of search sessions in System 1 and System 2, respectively, for search task C. In System 1 the simplest query, consisting of the search task’s main facets, provided the best performance. Adding two information type filters did not improve performance. In System 2 the simple query with the two main facets similarly formed the basis of the search. However, the level of specification and limitation was higher. The searcher tried out different SC terms. The best performance was obtained by the fourth query, which used an SC to specify the information type.

5.3.2. Query vocabulary

In general, the searchers started the search by using task description keys (TD), and reformulated using related search keys (RT) and synonym search keys (ST). However, in a few cases they used RT keys in the first query, e.g. “psychological treatment” instead of the TD key “referral to psychologist”, or ST keys, e.g. “financial support” instead of the TD key “public insurance”. The distribution of task description search keys varied over tasks and systems, with the largest variation for search tasks A and C (see Table 16). For task A the searchers used more classification code search keys (CC) and ST keys in System 1 compared to System 2. For search task B the searchers tried out more RT keys in System 1 compared to System 2, and more CC keys in System 2 compared to System 1, where they did not use any CC keys at all. Most keys represented topical facets, 97.8% in System 1 and 100% in System 2.

5.4. Retrieval performance

We categorised queries into high, middle, and low performance queries and failure queries based on nDCG as described in Section 4.3, in order to study the impact of features and querying activities on search performance (see Table 17). For each search task we investigated which features and querying activities provided good search results, and which caused problems. In general, System 2 achieved higher mean performance for each task, and for all tasks combined. System 2 returned relevant documents at rank 1 more often than System 1. Use of semantic components was not required in the study design when using System 2, but in all high performance queries except one, semantic components were applied in searching. This demonstrates that the searchers were able to use semantic components, and that semantic components in general had a high impact on search results. In the following we will analyse in detail what characterises high and middle performance queries

Table 15
Search session for Search task C in System 1 and System 2, respectively.

ID    SN   Region   Info type               Query terms                  SC1/class/SC                SC1*/term              nDCG

System 1: Search task C
97    8                                     Pregnant folate              –                                                  0.63
98    8             Drug description        Folate pregnant              –                                                  0.19
99    8             Treatment description   Folate pregnant              –                                                  0.19

System 2: Search task C
286   25                                    Folate pregnancy             Clinical problem/treatment  Spontaneous abortion   0.40
287   25                                    Pregnancy                    Drug/general information    Folate                 0.08
288   25                                    Pregnancy folate supplement  Drug/general information    Folate                 0.35
289   25                                    Pregnancy folate             Drug/general information    Folate                 0.63
290   25   Fyn                              Pregnancy folate             Drug/general information    Folate                 0.63

Legend: ID refers to query ID and SN to searcher number.

Table 16
Semantic categories of search keys (percentage).

Number of terms in semantic categories (percentage)

         A                         B                           C                         D                         All
         S1           S2           S1            S2            S1           S2           S1           S2           S1            S2
TD       20 (69.0)    22 (91.6)    87 (72.5)     70 (58.8)     38 (77.6)    37 (53.6)    47 (61.0)    47 (68.1)    192 (69.8)    176 (62.6)
ST       4 (13.8)     1 (4.2)      19 (15.8)     11 (9.2)      1 (2.0)      1 (1.4)      16 (20.8)    12 (17.4)    40 (14.5)     25 (9.0)
CC       3 (10.3)     1 (4.2)      n/a           12 (10.0)     n/a          4 (6.0)      n/a          n/a          3 (1.1)       17 (6.0)
BT       n/a          n/a          n/a           n/a           n/a          n/a          n/a          n/a          n/a           n/a
NT       n/a          n/a          n/a           n/a           n/a          n/a          n/a          n/a          n/a           n/a
RT       n/a          n/a          14 (11.7)     26 (22.0)     10 (20.4)    27 (39.1)    10 (13.0)    10 (14.5)    34 (12.4)     63 (22.4)
Non-t    2 (6.9)      n/a          n/a           n/a           n/a          n/a          4 (5.2)      n/a          6 (2.2)       n/a
Total    29 (100.0)   24 (100.0)   120 (100.0)   119 (100.0)   49 (100.0)   69 (100.0)   77 (100.0)   69 (100.0)   275 (100.0)   281 (100.0)

Legend: TD (Task description key); ST (Synonym key); CC (Classification Code); BT (Broader key); NT (Narrower key); RT (Related key); Non-t (Non-topical key).


Table 17
The number of high, middle, low, and failure queries for Search tasks A–D, measured by nDCG (percentage).

Number of high, middle, low and failure queries (percentage)

                      Search task A               Search task B
                      System 1     System 2       System 1     System 2
High performance      1 (4.0)      8 (33.0)       10 (17.0)    9 (14.0)
Middle performance    14 (56.0)    10 (42.0)      13 (22.0)    14 (21.0)
Low performance       4 (16.0)     2 (8.0)        11 (19.0)    11 (17.0)
Failure               6 (24.0)     4 (17.0)       25 (42.0)    31 (48.0)
Total                 25 (100.0)   24 (100.0)     59 (100.0)   65 (100.0)

                      Search task C               Search task D
                      System 1     System 2       System 1     System 2
High performance      1 (4.0)      –              –            6 (12.0)
Middle performance    9 (32.0)     19 (36.0)      10 (25.0)    4 (8.0)
Low performance       10 (36.0)    18 (35.0)      12 (30.0)    7 (14.0)
Failure               8 (28.0)     15 (29.0)      18 (45.0)    33 (66.0)
Total                 28 (100.0)   52 (100.0)     40 (100.0)   50 (100.0)


compared to low performance and failure queries. We looked at the number of search facets (query structure) and the search vocabulary.

5.4.1. Query structure

High performance queries covered 2–3 of the topical facets of the search tasks, and except for task B the best queries did not include any search filters. For task A the high performance queries included two topical facets: diagnosis, expressed by the search key “chronic obstructive pulmonary disease” or its abbreviation “COPD”, and the health care facet evaluation, expressed in System 1 with a search key and in System 2 with a “diagnosis” semantic component.

High performance queries for task B included three topical facets: condition, expressed by “pregnancy” (or alternatively the related facet diagnosis, expressed by “spontaneous abortion”), the symptom facet expressed by “bleeding”, and the phase facet expressed by “2. Trimester”. The queries were specified by the non-topical ‘information type’ filter “treatment description” and the ‘region’ filter “Aarhus” in System 1, and the non-topical ‘region’ filter “Aarhus” and the semantic component ‘referral’/* in System 2.

High performance queries for task C included two topical facets: condition expressed by “pregnancy”, and mechanism expressed by “folate” in System 1. In System 2 the high performance queries for task C included the use of the semantic component ‘general information’/term “folate” or the semantic component ‘treatment’/term “pregnancy”.

High performance queries for task D consisted of two topical facets: admin activity expressed by “referral”, and the treatment facet expressed by “psychology”. In System 2 one of the two topical facets was expressed by use of the ‘referral’ semantic component SC*.

Regarding query structure, the main difference between high and middle performance queries is the use/non-use of the non-topical facets ‘information type’ filter and ‘region’ filter. For tasks A, C and D the best queries did not include any filters, whereas for task B the inclusion of the ‘region’ filter turned out to be important for high search performance.

The ‘information type’ filter caused the most problems because indexers as well as searchers applied the filter incorrectly and differently. The incorrect and diverse use seemed to be caused by vague definition and ambiguous understanding of the preferred authority terms chosen to express the information types. For instance, the information type authority terms “treatment description” and “referral guidelines” represent concepts that are semantically very close. One may consider referral guidelines as a natural part of the more comprehensive treatment description. In practice, information about referral is often included in treatment description documents that deal with all aspects of treatment, including referral. Consequently, only a few documents in the portal have been indexed by the authority term “referral guidelines”. In addition, many documents in the portal have not been assigned any authority term at all for the ‘information type’ filter, as the information type metadata field is not mandatory. Together this explains the poor performance of queries that included the information type filter. Many of the searchers appropriately specified the search using the information type authority term “referral guidelines”. However, since the relevant documents had been indexed by the term “treatment description” instead of the more precise (and correct) term “referral guidelines”, or because the relevant documents had not been indexed at all by the information type filter, the documents were omitted from the search result because the filters were combined with the rest of the search specification using a Boolean AND.
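The effect described above can be reproduced with a small toy example. This is a hedged sketch only: the four documents, the field name `info_type`, and the function `search` are hypothetical and do not reflect the portal's data or the Ultraseek engine. It assumes a filter that is combined with the query terms by a strict Boolean AND, so that documents with a different or missing authority term are dropped.

```python
# Hypothetical toy collection; "info_type" stands in for the optional
# information type metadata field (None means the field was left empty).
documents = [
    {"id": "d1", "text": "referral to psychologist for anxiety",
     "info_type": "treatment description"},
    {"id": "d2", "text": "referral to psychologist procedure",
     "info_type": None},
    {"id": "d3", "text": "referral to psychologist form",
     "info_type": "referral guidelines"},
    {"id": "d4", "text": "COPD spirometry",
     "info_type": "referral guidelines"},
]


def search(docs, query_terms, info_type=None):
    """Return ids of documents matching all terms AND the optional filter."""
    hits = []
    for doc in docs:
        matches_terms = all(term in doc["text"] for term in query_terms)
        # Strict Boolean filter: a missing or different authority term fails.
        matches_filter = info_type is None or doc["info_type"] == info_type
        if matches_terms and matches_filter:
            hits.append(doc["id"])
    return hits


without_filter = search(documents, ["referral", "psychologist"])
with_filter = search(documents, ["referral", "psychologist"],
                     info_type="referral guidelines")
```

In this toy setting the filtered query loses `d1` (indexed under the neighbouring authority term) and `d2` (not indexed at all), even though both match the query terms, which mirrors the failure mode reported for the ‘information type’ filter.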

Also for task B the Boolean operation of filters caused problems. The searchers who appropriately limited the search to documents derived from the Aarhus region got zero hits because no relevant documents about the search topic had been assigned the authority term “Aarhus”.


Boolean operation of SC*, where the searcher limited the search to documents that contain a selected SC, caused another portion of the low performance queries in System 2 for task B. In this case searchers who “wrongly” selected the “treatment” and “diagnosis” semantic components instead of the “referral” semantic component were similarly disappointed.

However, in general it worked better to limit the search using semantic components rather than filters. This corresponds to and may be explained by findings from a comparative indexing study carried out by Lykke, Price, and Delcambre (2010). The indexing study compared semantic component indexing with term indexing, and found that SC indexing appeared to be less sensitive and easier to apply than classical term indexing. The findings suggested that it is easier to index consistently and accurately with semantic components compared to term indexing, probably because the semantic component model, due to its simpler nature, is not dependent to such a degree as term indexing on the suitability of the controlled indexing vocabulary. Thus, one may argue that the semantic component model is more independent of the searchers’ and indexers’ understanding and usage of vocabulary compared to searching and indexing based on terms, and as such reduces the vocabulary problem in information retrieval.

5.4.2. Query vocabulary

The vocabulary used to express the search facets in high performance queries consisted primarily of TD keys. In general, TD keys worked best except for task B, where RT keys worked well. CC keys caused problems in relation to task A, and ST keys in relation to task D. This finding indicates the importance of expressing the search topics precisely from the viewpoint and by the everyday medical vocabulary that is used in the consultancy and in the web site articles.

Overall, the results indicated the importance of selecting the central search facets and the perspective of the search tasks. For instance, for task B only three out of five possible topical facets were present in the best queries, with two facets being interchangeable. The search task looked for information on how to refer bleeding women: “Find documents that help you to decide if and how the patient should be referred to acute examination at hospital”. Searchers who chose to express the search with a slightly different perspective, such as the diagnosis or treatment perspective of bleeding, pregnant women, had much weaker results even though the choice of perspective might not be considered a mistake or error, just slightly different.

To further explore and confirm the results from the query performance analysis, we conducted multiple linear regression analyses, one for each search task, in order to correlate search performance and selected query formulation activities. The purpose was to provide a quantitative picture and explanation of the activities’ individual importance for the search result. Only activities with sufficient data were analysed. We analysed the following querying activities: (1) use of ‘information type’ filter, (2) use of ‘region’ filter, (3) use of SC model, (4) number of search keys (between 1 and 4 keys), and (5) choice of search key vocabulary (TD or OT search keys). Table 18 shows the variables that obtained a sufficient goodness-of-fit as measured by adjusted R².
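For readers unfamiliar with the technique, the following minimal sketch shows how a multiple linear regression with adjusted R² can be computed for binary querying-activity indicators. It is not the authors' analysis: the eight data rows and performance scores are synthetic, invented purely for illustration, and the function names `solve_linear_system` and `fit_ols` are hypothetical. Ordinary least squares is solved here through the normal equations in plain Python.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_ols(predictors, outcome):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R2, adjusted R2)."""
    design = [[1.0] + [float(x) for x in row] for row in predictors]
    n, k = len(design), len(design[0])
    xtx = [[sum(design[r][i] * design[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(design[r][i] * outcome[r] for r in range(n)) for i in range(k)]
    coefficients = solve_linear_system(xtx, xty)
    fitted = [sum(c * x for c, x in zip(coefficients, row)) for row in design]
    mean_y = sum(outcome) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(outcome, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in outcome)
    r2 = 1 - ss_res / ss_tot
    p = k - 1  # number of predictors, excluding the intercept
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return coefficients, r2, adjusted_r2


# Synthetic queries: columns are (info type filter used, region filter used,
# SC model used), each coded 0/1; scores are made-up nDCG-like values.
activities = [
    [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1],
    [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1],
]
scores = [0.19, 0.10, 0.63, 0.40, 0.45, 0.35, 0.30, 0.20]

coefficients, r2, adjusted_r2 = fit_ols(activities, scores)
```

A negative coefficient for an indicator means that, in the synthetic data, queries using that activity scored lower on average; adjusted R² penalises R² for the number of predictors, which is why it is the more cautious goodness-of-fit measure when only a few queries are available per task.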

For tasks A, C and D the regression analyses support results from the performance analysis regarding search filters, SC* and number of search keys. But not all findings from the qualitative analysis are supported by the regression analyses. The analyses support the following findings:

• ‘Information type’ filters. Queries not including the ‘information type’ filter obtained the highest performance scores. This applied to tasks C and D in System 1, and task C in System 2.
• SC1*. Queries not including SC1* obtained the highest performance scores. This applied to task C in System 2.
• SC2*. Queries not including SC2* obtained the highest performance scores. This applied to task A in System 2.
• ‘Region’ filters. Queries not including the ‘region’ filter obtained the highest performance scores. This applied to task C in System 1 and tasks A and C in System 2.
• Number of search keys. Queries only using 1 search key obtained lower performance scores. This applied to task C.

Four issues concerning query performance emerged from the analysis. Accurate choice of search facets, number of search keys, choice of search filters’ authority terms, and choice of search vocabulary all seem to affect performance. Over all tasks, the main reason for low search performance was the erroneous choice of authority terms for the ‘information type’ filter; sometimes the wrong choice was made by searchers, sometimes by indexers. The choice and number of search facets seemed

Table 18
Querying activities’ impact on search performance.

System 1: tasks C, D. System 2: tasks A, C.

Querying activities   R² = 0.581   R² = 0.493   R² = 0.638   R² = 0.661
Info. type (–)        X X X
SC2* (–)              X
SC1* (–)              X
Region (–)            X X X
1 key (–)             X

* The symbols +/– indicate whether the activity was used in the searching (+) or not used in the searching (–).


also to influence search performance. High performance queries included the same combination of 2–3 out of 3–5 relevant search facets, as opposed to low performance queries that included other combinations of more or other facets, often representing the search topic from other perspectives. High performance queries expressed the search topic by TD terms in common, everyday medical terminology, as opposed to low performance queries that tried out synonym variations (ST) and ICPC classification codes (CC) describing the topic from another discourse or angle. Altogether, the results point to the well-known vocabulary problems in information retrieval caused by variability in the selection and naming of search concepts (Furnas, Landauer, Gomez, & Dumais, 1987; Iivonen, 1995).

5.5. Relation to previous research

We cannot compare directly to previous findings concerning doctors’ querying behaviour due to differences in research methodology, search tasks, and retrieval systems. However, the results are consistent with Cullen (2002) and Magrabi et al. (2005), stating that doctors use few search terms and only use search filters in half of the searches. To some degree, the findings also correspond to results by Ely et al. (2002), showing that doctors encounter difficulties in selecting an optimal search strategy. Approximately 30% of the queries were considered low performance queries or failures, and none of the test searchers seemed to apply a systematic query formulation strategy.

The searchers’ querying behaviour was in line with previous findings of work-place information searching by Fagin et al. (2003), Freund and Toms (2005), and Wolfram (2008) that workplace searchers go for quick, targeted answers with a small set of useful documents. The test searchers used little time for searching, on average 2.9 queries per session, and stopped searching soon after finding one or a few relevant documents (Price et al., 2009).

The results both support and extend previous findings about query structure and performance. The best working queries were structured into the same 2–3 topical facets (out of 3–5 facets). This corresponds to Wolfram’s (2008) findings that queries with a smaller number of terms outperform queries with a larger number of terms. However, it is in contrast to previous findings by Kekäläinen and Järvelin (1998) and Vakkari et al. (2004), which suggested that search success for precision-oriented searches depends on exhaustive coverage of search facets. The present findings indicate that a few, seemingly core search facets might be sufficient for retrieval success. Selecting the right set of key search facets seems to have more impact on search success than a high degree of exhaustivity. Vakkari et al. (2004) suggest that a point of departure for improving best-match systems would be providing users in the initial search with an option to explicate the facets of the topic. The present findings suggest that the feature should also be used to support users in identifying the central, core search facets.

The semantic component model might be such a feature. The model produced the highest performance scores over all tasks. Sometimes the searchers used the model to structure and express a search facet, sometimes to specify a query term. For instance, when they searched for “folate” in the ‘treatment’ semantic component, they specified “folate” as a kind of medication, and they structured the search to use the search facet “diagnosis”. Instead of expressing search facets by query terms, they replaced search keys by SC* in 48.2% of the queries and by SC terms in 19.9% of the queries. Similarly, the System 2 searchers applied fewer search filters and, as an alternative, used the SC model to limit the search to “referral guideline” and “treatment description”. Sometimes this limitation based on semantic components caused search failures (as with filters), but, in general, semantic components worked better for limiting the queries, apparently because they were easier to apply consistently by indexers as well as searchers.

Altogether, the findings indicate that the SC model encouraged the searchers to structure the queries and try out more search facets. Furthermore, it seems that the model was less sensitive to vocabulary misunderstandings in indexing and searching, possibly because the SC model provided appropriate semantics for the documents and the search tasks.

5.6. Limitations

We recognize that the study had limitations. The experiment was limited to a single user group searching over a single collection of documents. The number of search tasks was small, especially compared to the number of topics typical of experiments. But unlike laboratory-style evaluations, we had 30 domain experts as end-users who formulated queries and interacted with the system, resulting in 120 search sessions. The investigated searchers were in an experimental situation, so their behaviours may not reflect what they would do in real-world situations. However, the data represent the doctors’ understanding and capability of exploiting search features and tactics, irrespective of their behaviour in non-experimental situations. Another limitation was our limited ability to control the searching algorithm of the Ultraseek search engine.

6. Conclusion

Workplace searching is different from general searching because it is typically limited to specific facets and targeted to a single answer. Search success depends on searchers’ ability to articulate the facets of topics in query terms. We have developed the semantic component model, which is a search feature that allows searchers to structure and limit the search to context-specific aspects of the main topic of the documents. We have tested the model in an interactive searching study with family practice physicians. The purpose of the study was twofold. We sought to explore how doctors applied the means for specifying a search and how these features contributed to the search outcome.

M. Lykke et al. / Information Processing and Management 48 (2012) 1151–1170 1169

To summarize the findings, there were on average 2.9 queries per session, 5.0 query terms per session, and 1.7 query terms per query. There were 4.6 keys per session and 1.6 keys per query. On average the ‘region’ filter was used in 45% of the queries in System 1 and 34% in System 2. The ‘information type’ filter was used in 49% of the queries in System 1 and 29% in System 2. In System 2, SC� was used as the first semantic component in 48% of the queries and as the second component in 8% of the queries. The SC term was used as the first semantic component in 20% of the queries and as the second component in 4% of the queries. The searchers were consistent in their choice of facets. For task A they chose one main facet, and for tasks B, C and D two facets. The searchers primarily reformulated the queries using search keys representing other search facets (RT keys). In fewer cases, they reformulated using ST keys. They mostly used TD keys to express search facets: 70% of the search keys were similar to task description terms in System 1 and 63% in System 2. They did not apply search tactics and features in any particular order, but switched randomly between the tactics and features. It seems that they discovered the indexing failures and vocabulary problems in relation to the ‘information type’ filter, because throughout the searching most searchers adjusted and omitted the ‘information type’ filter in later queries. In general, the searchers were willing as well as able to reformulate queries and try out different tactics and features during searching. They actively changed the search vocabulary, filters and SCs. They applied both SC� and SC term, and in a few cases SC1 and SC2.

In general, the doctors were capable of exploiting system features and search tactics. They operated the features and changed the search vocabulary, filters and SCs throughout the searches. The doctors were precision-oriented, and their querying behaviour reflected work-place information searching: they were busy searchers with a single answer in mind. Most searchers produced well-structured queries that contained appropriate search facets and search vocabulary terms. The majority of retrieval failures were due to the discrepancy between the searchers’ and the indexers’ application of the ‘information type’ filters. The problem was exacerbated by using the filter as a Boolean filter. This also applied to the use of SC�. Using filters and the SC� to boost document ranking, instead of as a filter, might improve performance.
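The difference between applying a filter as a Boolean constraint and using it to boost ranking can be illustrated with a minimal sketch. It is not the Ultraseek implementation or the system used in the study: the `Doc` class, the text scores, the additive `boost` value and the information type labels (‘guideline’, ‘handout’) are all hypothetical and serve only to show why a relevant document indexed under an unexpected type survives boosting but not filtering.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text_score: float                     # hypothetical best-match score for the query terms
    info_types: set = field(default_factory=set)  # hypothetical indexer-assigned types


def boolean_filter(docs, wanted_type):
    """Keep only documents indexed with the wanted type, then rank by text score."""
    kept = [d for d in docs if wanted_type in d.info_types]
    return sorted(kept, key=lambda d: d.text_score, reverse=True)


def boosted_ranking(docs, wanted_type, boost=0.3):
    """Keep every document; add a fixed bonus when the type matches."""
    def score(d):
        return d.text_score + (boost if wanted_type in d.info_types else 0.0)
    return sorted(docs, key=score, reverse=True)


docs = [
    Doc("d1", 0.9, {"handout"}),    # strong text match, but indexed under another type
    Doc("d2", 0.4, {"guideline"}),
    Doc("d3", 0.2, {"guideline"}),
]

filtered_ids = [d.doc_id for d in boolean_filter(docs, "guideline")]
boosted_ids = [d.doc_id for d in boosted_ranking(docs, "guideline")]
```

In this toy collection the Boolean filter drops `d1` entirely, whereas the boosted ranking keeps it at the top while still promoting the documents whose type matches. How large the boost should be is an open design choice that this sketch does not address.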

The best performing queries were structured into the same 2–3 central facets out of 3–5 possible facets, and expressed by terms that articulated the search topic from the viewpoint and vocabulary used in the search task description. The findings support and extend previous results about query structure and exhaustivity that indicate the importance of determining the crucial search facets and the right perspective and terminology. The findings suggest that the semantic component model might be a helpful feature to structure and express queries. The findings further suggest that the semantics of the SC model may be easier to apply for indexers as well as searchers, and as such the model may be less sensitive to the well-known vocabulary problem in information retrieval. The model produced the highest performance scores over all tasks, and semantic components were applied in all high performance queries except one.

The study provides preliminary evidence of the potential usefulness of semantic components for searching in domain-specific, work-related tasks. We would like to test the semantic components model in other domains and validate its usefulness by implementing the model in an operational system. We would also like to test whether less-expert users can benefit from semantic components.

References

Aitchison, J., Gilchrist, A., & Bawden, D. (2002). Thesaurus construction and use. London: Fitzroy Dearborn.
Alper, B. S., Stevermer, J. J., White, D. S., & Ewigman, B. G. (2001). Journal of Family Practice, 50(11), 960–965.
Andrews, J. E., Pearce, K. A., Ireson, C., & Love, M. M. (2005). Information-seeking behaviours of practitioners in a primary care practice-based research network (PBRN). Journal of the Medical Library Association, 93(2), 206–212.
Autonomy, A HP company (2011). <http://www.autonomy.com/> Retrieved 19.12.11.
Bennett, N. L., Casebeer, L. L., Kristofco, R., & Collins, B. C. (2005). Family physicians information seeking behaviours: A survey comparison with other specialities. BMC Medical Informatics and Decision Making, 5(9). <http://www.biomedcentral.com/1472-6947/5/9> Retrieved 19.12.11.
Bishop, A. (1999). Document structure and digital libraries: How researchers mobilize information in journal articles. Information Processing & Management, 35, 225–279.
Borlund, P. (2003). The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 2003.
Coumou, H. C., & Meijman, F. J. (2006). How do primary care physicians seek answers to clinical questions? Journal of Medical Library Association, 94(1), 55–60.
Cullen, R. J. (2002). In search of evidence. family practitioner’s use of the Internet for clinical information. Journal of Medical Library Association, 90(4), 370–379.
Davies, K. S. (2011). Physicians and their use of information: A survey comparison between the United States, Canada, and the United Kingdom. Journal of the Medical Library Association, 99(1), 88–90.
Dillon, A. (1991). Reader’s models of text structures: The case of academic articles. International Journal of Man-Machine Studies, 35, 913–925.
Ely, J. W., Osheroff, M. H. E., Bergus, G. R., Levt, B. T., Chambliss, M. L., & Evans, E. R. (1999). Analysis of questions asked by family doctors regarding patient care. British Medical Journal, 319(August), 358–361.
Ely, J. W., Osheroff, J. A., Evbell, M. H., Chambliss, M. L., Vinston, D. C., Stevermer, J. J., et al. (2002). Obstacles to answering doctors’ questions about patient care with evidence. British Medical Journal, 324, 1–7.
Fagin, R., Kumar, R., McCurley, K. S., Novak, J., Sivakumar, D., Tomlin, J. A., et al. (2003). Searching the workplace web. In: Proceedings of the 12th international world wide web conference (WWW ’03) (pp. 366–375), Budapest, Hungary, May 20–24, 2003.
Freund, L., Toms, E., & Waterhouse, J. (2005). Modeling the information behaviour of software engineers using a work-task framework. In: Grove, A. (Ed.), ASIS&T ’05 Proceedings of the 68th annual meeting, Charlotte, NC, October 28–November 2, 2005.
Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), 964–971.
Gorman, P. N., & Helfand, M. (1995). Information seeking in primary care. Medical Decision Making, 15, 113–119.
Hearst, M., & Plaunt, C. (1993). Subtopic structuring for full length document access. In: Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 59–68). Pittsburgh, PA.
Iivonen, M. (1995). Consistency in the selection of search concepts and search facets. Information Processing & Management, 31(2), 173–190.
International Classification of Diseases (ICD) (2011). <http://www.who.int/classifications/icd/en/> Retrieved 19.12.11.
International Classification of Primary Care, Second edition (ICPC-2) (2011). <http://www.who.int/classifications/icd/adaptations/icpc2/en/index.html> Retrieved 19.12.11.
Jansen, B. J. (2006). Search log analysis: What it is, what’s been done, how to do it. Library & Information Science Research, 28, 407–432.
Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management, 36, 207–227.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.
Kekäläinen, J., & Järvelin, K. (2000). The co-effects of query structure and expansion on retrieval performance in probabilistic text retrieval. Information Retrieval, 1, 329–344.
Kekäläinen, J., & Järvelin, K. (1998). The impact of query structure and query extension on retrieval performance. In Proceedings of the SIGIR’98 (pp. 130–137). New York (NY): ACM.
Ketchell, D. S., Lailani, S. A., Kauff, D., Barak, G., & Timberlake, D. (2005). Prime answers: A practical interface for answering primary care questions. Journal of American Medical Information Association, 12, 537–545.
Lucas, W., & Tobi, H. (2002). Form and function. The impact on query formation on web search results. Journal of the American Society for Information Science and Technology, 53(2), 95–108.
Lykke, M., Price, S., & Delcambre, L. M. (2010). Using semantic components to represent and search domain-specific documents: An evaluation of indexing accuracy and consistency. Advances in Knowledge Organization, 12, 276–282.
Magrabi, F., Coiera, E. W., Westbrook, J. I., Gosling, A. S., & Vickland, V. (2005). General practitioners’ use of online evidence during consultations. International Journal of Medical Informatics, 74, 1–12.
Markey, K. (2007a). Twenty-five years of end-user searching, Part 1: Research findings. Journal of the American Society for Information Science and Technology, 58(8), 1071–1081.
Markey, K. (2007b). Twenty-five years of end-user searching, Part 2: Future research directions. Journal of the American Society for Information Science and Technology, 58(8), 1123–1130.
Mu, X., & Lu, K. (2010). Towards effective genomic information retrieval: The impact of query complexity and expansion strategies. Journal of Information Science, 36(2), 194–208.
Pedersen, S. T. (2005). Undersøgelse af IT-anvendelsen bland alment praktiserende læger. Århus: Sundhedsstaben.
Pennanen, M., & Vakkari, P. (2003). Students conceptual structure, search process, and outcome while preparing a research proposal. Journal of the American Society for Information Science and Technology, 54(8).
Price, S., Delcambre, L., & Nielsen, M. L. (2006). Using semantic components to express questions against document collections. In: Proceedings international workshop on health information and knowledge management (HIKM 2006). Arlington (VA).
Price, S. L., Nielsen, M. L., Delcambre, L. M. L., Vedsted, P., & Steinhauer, J. (2009). Using semantic components to search for domain-specific documents: An evaluation from the system perspective and the user perspective. Information Systems, 34(8), 778–806.
Sormunen, E. (2002a). A retrospective evaluation method for exact-match and best-match queries applying an interactive query performance analyser. In: Crestani, F., et al. (Eds.), Advances in information retrieval: Proceedings of the 24th European colloquium on IR research (pp. 334–352). Springer, Berlin and Heidelberg.
Sormunen, E. (2002b). Liberal relevance criteria of TREC: counting on negligible documents? In: Proceedings of ACM SIGIR ’02, Tampere, Finland, August 11–15, 2002.
Stenmark, D. (2008). Identifying clusters of user behaviour in intranet search engine log files. Journal of the American Society of Information Science and Technology, 59(14), 2232–2243.
Taylor, R. S. (1991). Information use environment. In B. Dervin & M. Voigt (Eds.). Progress in communication sciences (vol. 10). Norwood (NJ): Ablex.
Vakkari, P. (2000). Cognition and changes of search terms and tactics during task performance. A longitude study. In: Proceedings of the RIAO 2000 conference (pp. 894–907). Paris: C.I.D.
Vakkari, P. (2003). Task-based information searching. Annual Review of Information Science and Technology, 37, 413–466.
Vakkari, P., Jones, S., & MacFarlane, A. (2004). Query exhaustivity, relevance feedback and search success in automatic and interactive query expansion. Journal of Documentation, 60(2), 109–127.
Vickery, B. C. (1960). Faceted classification: A guide to construction and use of specialized schemes. London: Aslib.
Wang, P. (1997). User’s information needs at different stages of a research project: A cognitive view. In P. Vakkari, R. Savolainen, & B. Dervin (Eds.), Information seeking in context (pp. 307–318). London: Taylor Graham.
Wolfram, D. (2008). Search characteristics in different types of Web-based IR environments: Are they the same? Information Processing & Management, 44, 1279–1292.

