Post on 10-Jan-2017
Providing Consultancy &
Research in Health Economics
Do simple text mining tools
have anything to offer
Embase users?
Sunrise seminar
May 2016
Julie Glanville
Disclosures
I acknowledge funding from Elsevier which has covered my
attendance at the MOSAIC conference
I work for YHEC, a consultancy company that does contract
research for a range of public and private sector organisations
I offer training courses in advanced searching and text mining
Agenda
What are simple text mining tools?
How might TM tools help us with searching?
How can we use TM tools with Embase?
Learning more
What is text mining?
“Text mining is the process of discovering and extracting
knowledge from unstructured data. This comprises three
main activities:
– Information retrieval (IR) to gather relevant texts.
– Information extraction (IE) to identify and extract entities, facts
and relationships between them.
– Data mining to find associations among the pieces of information
extracted from many different texts.
…[TM] can help make the implicit information in your
documents more explicit…”
Source: National Centre for Text Mining (NaCTeM). http://www.nactem.ac.uk/faq.php?faq=1
TM is not a single thing
TM software comes in many forms and can do many
different things
Simple things – word frequency analysis
counting the numbers of times words appear in the text
More complex things – word co-occurrence
looking at patterns of words occurring together to identify concepts and
relationships between words
Semantic analysis – analysing text according to the meaning of
words not just their presence or absence
“89% of the group achieved smoking cessation”
“Five different smoking cessation interventions were explored”
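The simplest case above, word frequency analysis, can be sketched in a few lines of Python. This is only an illustration: the function name word_frequencies and the tokenising rule (lowercased runs of letters) are assumptions made for this example, and real TM packages apply their own rules.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count how often each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Assumption: a "word" is any run of letters; digits and symbols are dropped.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

sentences = [
    "89% of the group achieved smoking cessation",
    "Five different smoking cessation interventions were explored",
]
freq = word_frequencies(sentences)
print(freq.most_common(5))
```

Both sentences contribute equally to the counts for "smoking" and "cessation", although one reports an outcome and the other describes interventions. That difference is what semantic analysis tries to capture and simple counting cannot.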
What is behind TM software?
TM software works according to algorithms
Packages use algorithms to achieve results
Algorithms make use of features in the texts such as
frequency of terms, co-occurrence of terms, presence/absence of terms
Different packages use different algorithms
TM software may make use of dictionaries and stop
word lists
It is likely that these will differ across software packages
TM software may also make use of lists, vocabularies
and term relationships (ontologies/taxonomies)
E.g. lists of diseases, geographic areas, proteins
What can TM do?
TM can be used to extract information from texts and identify
patterns in that information
This might lead to identifying themes in the text of which we were
unaware
Particularly useful for helping to see information more clearly that
might otherwise be “concealed” within large volumes of text.
It is “objective” in its analysis of text
Many TM packages can cope with information in very different
formats and explore them in a single corpus (body of text)
Database records, papers, web pages, tweets
How easy is it to access TM software?
Many different TM software packages are available free of charge on the
internet or to download
Some do single tasks
e.g. simple word frequency analysis such as PubMed PubReMiner
Others offer bundles of tools to give different ways to explore a set of
text
e.g. Voyant
There is also free software for more sophisticated tasks within TM, such as
machine learning
e.g. GATE
To get the best out of this software will require some investment of time
to fully learn the options within the software and their implications
What are simple text mining
tools?
The selected tools we will look at today achieve the following:
Analyse the frequency of terms appearing in database records
Title
Abstract
Subject headings
Other fields
Analyse phrases within records
Analyse the collocation of terms within records
Show us the content/themes within a set of records
There are many more tools…
How can we use TM tools with
Embase records?
Many TM packages are built as interfaces to PubMed
PubMed PubReMiner
GoPubMed
Quetzal
MeSH on Demand
These are helpful for analysing records from PubMed and building
MEDLINE strategies
What services can we use to help us with exploring Embase
records?
To develop strategies
To explore the content of a set of records
Identifying search terms
Our topic for today is
How can we treat biofilms that have
formed in infected wounds
I have found a set of 902 Embase records
using a first very basic scoping search
How can frequency analysis tools help us
see the words in the records
EndNote
Voyant
Example Embase
(OvidSP) search
1. Biofilm$1.ti,ab.
2. Wound$1.ti,ab.
3. 1 and 2
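To make the scoping search concrete, here is a rough Python approximation of what the three lines ask for. It assumes that $1 means "the root plus at most one further character" and that .ti,ab. restricts matching to the title and abstract. The record layout (a dict with title and abstract keys) and the function names are invented for the example, and Ovid's own matching rules may differ in detail.

```python
import re

def limited_truncation(root, max_extra):
    """Regex for a word root followed by at most max_extra further letters."""
    return re.compile(
        rf"\b{re.escape(root)}[a-z]{{0,{max_extra}}}\b", re.IGNORECASE
    )

BIOFILM = limited_truncation("biofilm", 1)  # line 1: Biofilm$1
WOUND = limited_truncation("wound", 1)      # line 2: Wound$1

def matches_scoping_search(record):
    """Line 3 (1 and 2): both patterns must occur in title or abstract."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}"
    return bool(BIOFILM.search(text)) and bool(WOUND.search(text))

# Made-up records, for illustration only.
records = [
    {"title": "Biofilms in chronic wounds", "abstract": "A review."},
    {"title": "Wounding response in plants", "abstract": "No film data."},
]
results = [matches_scoping_search(r) for r in records]
print(results)
```

The second record is rejected twice over: "Wounding" has more than one character after the root, and "biofilm" does not appear at all.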
EndNote, 1
EndNote offers simple frequency analysis:
You can see which terms might be useful for strategy
development
All records are indexed as they are loaded into EndNote:
The Keywords field is indexed automatically to create a Term
List
Term Lists can be used to create an index
Additional Term Lists can be defined and populated
e.g. title, title/abstract
Ideal for analysing Embase records
Frequency of terms in the title, title/abstract, EMTREE
EndNote, 2
Before loading records into EndNote, decide how to treat the
information coming into the Keywords field:
You can break up subject index terms by changing the term
delimiters:
E.g. Pseudomonas infection/dt [Drug Therapy] can be parsed
differently:
Separate words - Pseudomonas Infection dt drug therapy
Two phrases - Pseudomonas Infection dt [drug therapy]
If you want to do frequency analysis of other fields or combinations of
fields, you can do this once the records are loaded
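The effect of the delimiter choice can be shown outside EndNote with a small Python sketch. It is not how EndNote works internally; split_keyword is a hypothetical helper that simply splits one index entry on whichever delimiter characters are supplied.

```python
import re

def split_keyword(entry, delimiters="/"):
    """Split one keyword entry on any of the given delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, entry) if part.strip()]

entry = "Pseudomonas infection/dt [Drug Therapy]"

# Only '/' as delimiter: heading and subheading become two phrases.
two_phrases = split_keyword(entry, "/")

# '/', space and square brackets as delimiters: every word stands alone.
separate_words = split_keyword(entry, "/ []")

print(two_phrases)
print(separate_words)
```

With "/" alone, the heading stays intact as a phrase, which is usually what is wanted when counting EMTREE terms.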
EndNote, 3
Keywords field
To set the term delimiters you can use the following sequence
(EndNote X7.5) in a new empty EndNote library:
Tools, Define term lists, Keywords
(Change) Delimiters – select the ‘/’ symbol to cut the
subheading from the EMTREE heading
Update list
OK
Load your Embase records
EndNote, 4
Create the frequency analysis of the EMTREE terms:
Tools, Subject bibliography
Keywords, OK
Select all, OK
Choose display format by selecting Layout
Terms, Subject terms only
Change number of lines between entries e.g. remove suffix
^p^p
Change display order to frequency by selecting ‘By term
count – descending’
Select OK
To print listing, select Print
To save the listing select Save
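For readers who want to reproduce this kind of listing outside EndNote, the following Python sketch builds a heading frequency list in descending order. The sample records are invented, the function name heading_frequencies is hypothetical, and the handling of subheadings (cut at "/") and focus markers (leading "*") is an assumption about how the exported index terms look.

```python
from collections import Counter

def heading_frequencies(records):
    """Count main headings across records, most frequent first."""
    counts = Counter()
    for keywords in records:
        for keyword in keywords:
            # Keep the text before the first '/' and drop a leading '*'.
            heading = keyword.split("/", 1)[0].lstrip("*").strip().lower()
            if heading:
                counts[heading] += 1
    # Sort by descending count, then alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Made-up keyword fields from three records.
sample_records = [
    ["*biofilm", "wound infection/dt [Drug Therapy]"],
    ["biofilm", "*wound healing"],
    ["Pseudomonas infection/dt [Drug Therapy]", "biofilm"],
]
listing = heading_frequencies(sample_records)
for heading, count in listing:
    print(f"{count}\t{heading}")
```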
EndNote, 5
To see a title and abstract frequency analysis, first define a Term List
Select Tools, Define Term Lists
Select Create List
Give the list a helpful name e.g. Titleab
Check the custom delimiters and make sure that a space is added so
that words will be processed individually
Select Update list
Then select the title field and link it to the Titleab term list
Then select the abstract field and link it to the Titleab term list
The term list is now ready and any subject bibliography involving those
fields will be able to use single terms
EndNote, 6
To save or print out the title and abstract frequency analysis
Tools, Subject bibliography
Select Title as well as Abstract (using the control key), OK
Choose Select all, OK
Choose display format by selecting Layout
Terms, Subject terms only
Change number of lines between entries e.g. remove suffix ^p^p
Change display order to frequency by selecting ‘By term count –
descending’
Select OK
To print listing, select Print
To save the listing select Save
EndNote pros and cons
Pros
Easy and quick to create frequency counts of the fields in Embase
records, particularly of EMTREE
This facility is built into a package you might be using to manage your
search results
Cons
Not very visual
Cannot do phrase analysis of title and abstract
Cannot do more sophisticated analyses such as word collocation
Cannot implement stopwords
Format of Embase records
Often need the Embase records in plain text format
Best to select only the fields you want to analyse so that
The files process more quickly
The output is not cluttered with unwanted words e.g. from the address
fields
To get ‘clean data’
Download just the selected fields from Embase (e.g. in OvidSP select
text file output and then selected fields e.g. ti, ab, sh)
Or, download records to EndNote and export only fields of interest from
EndNote into a file
Sometimes you may want to have one file of title/abstract fields as
well as a separate file of the EMTREE only
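If the records have already been downloaded with all fields, the same "clean data" can be produced afterwards with a short script. The sketch below assumes a RIS-style layout in which each field starts with a two-character tag followed by two spaces, a hyphen and a space, and continuation lines carry no tag. The tag names and sample lines are illustrative only; the actual export format should be checked before relying on this.

```python
import re

# Assumption: fields look like "TI  - Some title" (RIS-style tagging).
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - (.*)$")

def extract_fields(lines, wanted=("TI", "AB")):
    """Return the text of the wanted fields, joining continuation lines."""
    kept = []
    keeping = False
    for line in lines:
        match = TAG_LINE.match(line)
        if match:
            keeping = match.group(1) in wanted
            if keeping:
                kept.append(match.group(2).strip())
        elif keeping and line.strip():
            # Untagged line: continuation of the field currently being kept.
            kept[-1] += " " + line.strip()
    return kept

# Made-up record for illustration.
sample = [
    "TI  - Biofilms in wounds",
    "AD  - Department of Surgery, York",
    "AB  - Biofilms delay",
    "      healing.",
    "ER  - ",
]
clean = extract_fields(sample)
print(clean)
```

Calling extract_fields(sample, wanted=("KW",)) on a file that holds the index terms in a KW field would give the separate EMTREE-only file mentioned above, again assuming that tag is the one used in the export.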
Voyant Tools
http://voyant-tools.org/
Can upload a text file or
cut and paste the
contents of a text file into
Voyant
It provides various
views on the text
Example (title/abstract)
904 biofilms and wounds
records
Voyant pros and cons
Pros
Offers a simple terms display and a word cloud
Much more visual presentation
We can explore a set of records in a non-linear way
Can save the data visualisations for the future
Can manipulate the stopword list e.g. to remove words such as ‘title’ or
‘abstract’
Cons
May be a little slow to respond?
Voyant Tools
Running the phrase analysis over the biofilms and
wounds records helps us to identify phrases
Can choose a seed word e.g. ‘wound’ and inspect
phrases that contain it
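The idea of inspecting phrases around a seed word can be sketched as a simple n-gram count. This is not Voyant's algorithm; phrases_with_seed is a hypothetical helper that counts every run of a fixed number of consecutive words containing the seed, and the sample titles are invented.

```python
import re
from collections import Counter

def phrases_with_seed(texts, seed, length=2):
    """Count word sequences of the given length that contain the seed word."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for i in range(len(words) - length + 1):
            gram = words[i:i + length]
            if seed in gram:
                counts[" ".join(gram)] += 1
    return counts

# Made-up titles for illustration.
titles = [
    "Chronic wound biofilm management",
    "Biofilm in the chronic wound bed",
]
wound_phrases = phrases_with_seed(titles, "wound", length=2)
print(wound_phrases.most_common())
```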
TERMINE
http://www.nactem.ac.uk/software/termine/
2 MB file size
Can paste in records, document text e.g. protocol, or
parts of records
Analysis of EMTREE headings from a batch of 498
Embase records
Blue items are phrases – any frequency
Red items are phrases with higher frequency – threshold set at 4
TERMINE
Table representation of phrases by C
score weight
NOTE: additional pre-processing
(tidying) could be undertaken e.g.
taking out the * for the focused
headings
Text Analyzer
http://www.online-utility.org/text/analyzer.jsp
Paste in text e.g. set of records
Click ‘Process text’
Groups results that appear as most frequent
phrases
Choose phrase lengths
402 Embase records: title abstract
Identifying options for
proximity operator use
Text Analyzer
– The display with longer phrases can help with
deciding on proximity operators
Voyant
– The keyword in context might help with deciding on
proximity operators
– The phrase option can be set to phrases of specific
lengths
Voyant Collocates Tool
The table view shows:
– Term: this is the keyword (or keywords) being
searched
– Collocate: these are the words found in proximity of
each keyword
– Count (context): this is the frequency of the collocate
occurring in proximity to the keyword
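The three columns can be reproduced in miniature with the Python sketch below. It is an approximation, not Voyant's implementation: the context is taken to be a fixed number of words on either side of the keyword, and the function name collocates and the sample texts are assumptions for the example.

```python
import re
from collections import Counter

def collocates(texts, keyword, context=2):
    """Count words occurring within `context` words of each keyword hit."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for i, word in enumerate(words):
            if word == keyword:
                before = words[max(0, i - context):i]
                after = words[i + 1:i + 1 + context]
                counts.update(before + after)
    return counts

# Made-up sentences for illustration.
texts = [
    "biofilm formation in chronic wound infection",
    "wound dressing disrupts biofilm formation",
]
near_biofilm = collocates(texts, "biofilm", context=2)
for collocate, count in near_biofilm.most_common():
    print("biofilm", collocate, count)
```

Widening the context brings in more, and usually less frequent, collocates, which is the kind of evidence that can inform the choice of distance in a proximity operator.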
Voyant
collocates
Voyant settings
Keywords are in blue/green and collocates (words in proximity to the
keywords) are showing in orange/red.
Default: most frequent collocates are shown for the 10 most
frequent keywords in the corpus
Probably best for title/abstract terms?
Might want to add some terms to stopwords e.g. Title
If you change the context slider at the bottom more terms are
included (with lower frequency)
Hovering over collocates shows their frequency in proximity (not
their total frequency)
Exploring the terms can highlight words for consideration for
proximity searches
Identifying options for /freq
command use in OvidSP
OvidSP offers the option to implement frequency
selection in searches
– biofilm.ab. /freq=2
TM exploration using Voyant (using the
collocator option) might suggest combinations of
words on which to focus
– These could be tested with the frequency operator to
see if they do improve the precision of the search
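Before testing in OvidSP, the likely effect of a frequency threshold can be previewed on downloaded abstracts. The sketch below assumes that /freq=2 means "the term occurs at least twice in the field" and matches the untruncated word only; the function name meets_frequency and the sample abstracts are invented, and Ovid's own matching may differ.

```python
import re

def meets_frequency(text, term, minimum=2):
    """True if the exact term occurs at least `minimum` times in the text."""
    pattern = rf"\b{re.escape(term)}\b"
    occurrences = re.findall(pattern, text, flags=re.IGNORECASE)
    return len(occurrences) >= minimum

# Made-up abstracts for illustration.
abstracts = [
    "Biofilm was detected. The biofilm persisted after debridement.",
    "Biofilm was detected once. Biofilms were not cultured.",
]
flags = [meets_frequency(a, "biofilm", minimum=2) for a in abstracts]
print(flags)
```

The second abstract fails because "Biofilms" is a different word form from the untruncated term, a reminder that truncation and frequency thresholds interact.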
Identifying concepts?
TM software can help us to see themes in a batch of search results
VOSviewer
http://www.vosviewer.com/
Can be run online (needs Java) but best to download software
It creates maps of themes within documents
– Network visualisation
– Density visualisation
Possible to zoom into areas of interest
Scroll over the map and zoom in
VOSviewer
Carry out an Embase search and download results as RIS
Open VOSviewer (http://www.vosviewer.com/) and
Launch VOSviewer
Select Create, map based on text data
Select RIS option and load the RIS file
Select Next and choose the specific fields and the term score base
field
Select Next and choose binary (presence/absence) or full
(frequency) counting
Choose the minimum number of occurrences of a term (e.g. 5)
Choose Relevance score for each
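The difference between binary and full counting, and the effect of the minimum-occurrences threshold, can be seen in a small Python sketch. It works on single words for simplicity, whereas VOSviewer extracts terms with its own text processing, so this is an analogy rather than a reimplementation. The function names and sample documents are assumptions.

```python
import re
from collections import Counter

def term_occurrences(documents, binary=True):
    """Binary: count documents containing a term. Full: count every occurrence."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(set(words) if binary else words)
    return counts

def above_threshold(counts, minimum):
    """Keep only terms that occur at least `minimum` times."""
    return {term: n for term, n in counts.items() if n >= minimum}

# Made-up documents for illustration.
docs = ["biofilm biofilm wound", "biofilm dressing", "wound wound wound"]
binary_counts = term_occurrences(docs, binary=True)
full_counts = term_occurrences(docs, binary=False)
print(above_threshold(binary_counts, 2))
print(above_threshold(full_counts, 4))
```

Under full counting, a single record that repeats a word many times can push it over the threshold; binary counting avoids that.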
VOSviewer: network
visualisation
Network visualisation
The font size and the size of the circle depend on the weight of
an item
Weight is determined by total strength of all links to the item
The colour of the circle of an item is determined by the cluster to
which the item belongs
VOSviewer: Density
visualisation
Item density visualisation
Each point has a colour that depends on density of items at that
point (between red and blue)
The larger the number of items in the neighbourhood of a point and the
higher the weights of the neighbouring items, the closer the colour of
the point is to red
The smaller the number of items around a point and the lower the
weights of the neighbouring items, the closer the colour of the
point is to blue.
Identify where research is
published
EndNote
– Creating a term list of the address field – important to
use the comma as a delimiter
– Analysing words in title and abstract
VOSviewer
– Create a file with only address fields and see what
the pictures show
Explore trends within research
Voyant
– If a set of records is loaded in date order the
frequency of terms over time can be shown
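A trend of this kind can also be computed directly from records that carry a publication year. The sketch below reports, for each year, the share of all words that are the chosen term. The (year, text) record layout, the function name term_trend and the sample data are assumptions for illustration, and Voyant's own trend display is calculated over document segments rather than years.

```python
import re
from collections import Counter

def term_trend(records, term):
    """Relative frequency of `term` per year, as {year: share of all words}."""
    totals, hits = Counter(), Counter()
    for year, text in records:
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Made-up records for illustration.
records = [
    (2014, "wound care"),
    (2015, "biofilm wound"),
    (2015, "biofilm biofilm"),
]
trend = term_trend(records, "biofilm")
print(trend)
```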
IQWiG approach, 1
Hausner E, Waffenschmidt S, Kaiser T, Simon M. Routine development of objectively derived search strategies. Syst Rev 2012;1:19. DOI: 10.1186/2046-4053-1-19.
Identify a test set and split randomly
Load into EndNote
Text mining package “tm” (Text Mining Infrastructure in R) in R:
o http://tm.r-forge.r-project.org/
On the basis of information derived from the titles and abstracts of the downloaded references, terms are ranked by frequency.
Terms present in at least 20% of the references in the development set are selected for further examination.
Of these, the most frequent terms that also have a low sensitivity (2% or less) in a background population set of references are used in the strategy
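The selection rule can be sketched in Python as follows. This is not the authors' code (they work with the tm package in R). It also rests on one reading of the rule: that the 20% threshold applies to the development set, while the 2% threshold applies to a separate background (population) set of references. The function names and sample data are invented.

```python
import re

def document_share(term, token_sets):
    """Proportion of documents whose word set contains the term."""
    return sum(term in tokens for tokens in token_sets) / len(token_sets)

def overrepresented_terms(dev_docs, pop_docs, min_dev=0.20, max_pop=0.02):
    """Terms in >= min_dev of development docs and <= max_pop of population docs."""
    dev = [set(re.findall(r"[a-z]+", d.lower())) for d in dev_docs]
    pop = [set(re.findall(r"[a-z]+", d.lower())) for d in pop_docs]
    selected = []
    for term in set().union(*dev):
        dev_share = document_share(term, dev)
        if dev_share >= min_dev and document_share(term, pop) <= max_pop:
            selected.append((term, dev_share))
    # Most frequent in the development set first, then alphabetical.
    return [t for t, _ in sorted(selected, key=lambda x: (-x[1], x[0]))]

# Made-up development and population sets for illustration.
development = [
    "biofilm wound", "biofilm dressing", "wound care",
    "biofilm infection", "wound healing",
]
population = ["wound healing study"] * 50
candidates = overrepresented_terms(development, population)
print(candidates)
```

Here "wound" is frequent in the development set but is dropped because it is equally common in the background set, so it would add little precision.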
IQWiG approach, 2
[They use PubReMiner for selected MeSH]
Terms are then divided up into Condition, Intervention and Study design.
Then iterative trial and error approach to find strategy that works
Then strategy is tested against validation set
They also use AntConc software to identify phrases or adjacent terms
http://www.laurenceanthony.net/software/antconc/
Has to be downloaded
AHRQ report
Paynter RA, Bañez LL, Berliner E, Erinoff E, Lege-Matsuura J, Potter S, Uhl S. EPC Methods: An Exploration of the Use of Text-Mining Software in Systematic Reviews. Research White Paper. (Prepared by the Scientific Resource Center and the Vanderbilt and ECRI Evidence-based Practice Centers under Contract Nos. 290-2012-00004-C [SRC], 290-2012-00009-I [Vanderbilt], and 290-2012-00011-I [ECRI].) AHRQ Publication 16-EHC023-EF. Rockville, MD: Agency for Healthcare Research and Quality; April 2016.
https://effectivehealthcare.ahrq.gov/ehc/products/625/2214/text-mining-report-160419.pdf
Overview of TM tools used in searching
Evaluation of TM tools recommended in literature and by interviewees
Also summarises research in using TM for other parts of systematic review
process
Flinders University Library
http://flinders.libguides.com/text_mining
Text mining resource
Lists of various tools
What are the challenges of
using TM software?
TM has many different options – need to identify what aspect of TM
will be used at what stage of the search process and how
Lack of standardisation – no agreed single approach
Different systems use different algorithms – how might these impact
on the resulting searches we develop?
TM can help with volume processing but we still need to make
decisions based on the results, and these may still be subjective
unless we can define benchmarks or decision rules a priori
Some software is complex to learn
The process of using TM can be challenging to document
Summary
There are powerful free tools which can help us with term and
phrase identification to assist with developing Embase searches
Data visualisation such as VOSviewer might help with developing
strategies for complex topics by showing the concepts to
de-emphasise and topics on which to focus
Possibly most useful for more complex searches?
There are many different tools to explore
Many of these can be downloaded to a PC and downloaded
versions may offer more flexibility and reliability than the web-based
services
http://tinyurl.com/yhec-facebook
http://twitter.com/YHEC1
http://www.minerva-network.com/
Thank you
julie.glanville@york.ac.uk
Telephone: +44 1904 324832
Website: www.yhec.co.uk