Date posted: 22-Jan-2017
Text-mining practical
Lars Juhl Jensen
unix primer
the command line
some useful commands
cat
less
head -10
tail -10
grep 'needle'
cut -f 2
sort
sort -nr
uniq -c
redirecting output
write to file
command > filename
using pipes
command1 | command2
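A minimal sketch of the difference between redirecting and piping. The file names and their contents are made up for illustration.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Create a small example file with three lines.
printf 'needle in line one\nplain line two\nanother needle\n' > haystack.txt

# Redirect: the matching lines go to a file instead of the screen.
grep 'needle' haystack.txt > hits.txt

# Pipe: the matching lines go straight into the next command,
# which keeps only the first of them.
grep 'needle' haystack.txt | head -1
```

Nothing is printed by the redirected command; its result sits in hits.txt until you look at it with cat or less.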
putting it all together
cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
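To see what the pipeline does, it can be run on a toy input. The six-line infile below is invented for illustration; only its fourth column matters.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Toy tab-delimited input; column 4 holds the value to be counted.
printf 'a\tb\tc\tTP53\n'  >  infile
printf 'a\tb\tc\tAKT1\n'  >> infile
printf 'a\tb\tc\tTP53\n'  >> infile
printf 'a\tb\tc\tBRCA1\n' >> infile
printf 'a\tb\tc\tTP53\n'  >> infile
printf 'a\tb\tc\tAKT1\n'  >> infile

# Take column 4, bring identical values together, count each value,
# sort by count (largest first) and keep at most the top 100.
cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile

# Lists TP53 (3), then AKT1 (2), then BRCA1 (1).
cat outfile
```

The first sort is needed because uniq -c only merges identical lines that are next to each other.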
the task
disease gene finding
named entity recognition
human genes
gene prioritization
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
82,373 abstracts
89,904 abstracts
one file with each set
one line per abstract
dictionary
tab-delimited file
human genes
21,929 entities
synonyms
from many databases
orthographic variation
prefixes and suffixes
automatically generated
2,920,042 names
tagcorpus program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
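The idea behind flexible matching can be sketched with standard commands. This is not how tagcorpus is implemented; it only illustrates ignoring case, spaces and hyphens by reducing every spelling to one key. The function name normalize and the example spellings are assumptions.

```shell
# Reduce a name to a canonical key: lower-case, no spaces, no hyphens.
normalize() {
  tr '[:upper:]' '[:lower:]' | sed 's/[- ]//g'
}

# Four spellings of the same (illustrative) name.
for name in 'IL-2' 'IL 2' 'il2' 'Il-2'; do
  printf '%s\n' "$name" | normalize
done
# every variant prints the same key: il2
```

Matching on such keys finds more mentions, but it also makes short names collide with ordinary words more easily.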
what you will do
named entity recognition
find unfortunate names
create “black list”
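One possible way to find unfortunate names and apply a black list is sketched below. The column layout of matches.tsv (abstract number, matched name, gene identifier) is an assumption, not the documented tagcorpus output, and the data are made up: the word "was" stands in for an unfortunate name, and GENE_A and GENE_B are placeholder identifiers.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Hypothetical tagger output: abstract number, matched name, gene identifier.
printf '1\tAkt\tGENE_A\n1\twas\tGENE_B\n2\twas\tGENE_B\n' >  matches.tsv
printf '3\twas\tGENE_B\n3\tPKB\tGENE_A\n'                 >> matches.tsv

# The most frequently matched names are the first suspects.
cut -f 2 matches.tsv | sort | uniq -c | sort -nr | head -10

# Put the names judged to be ordinary words on the black list, one per line.
printf 'was\n' > blacklist.txt

# Keep only the matches whose name (column 2) is not on the black list.
awk -F '\t' 'NR == FNR { bad[$1] = 1; next } !($2 in bad)' \
    blacklist.txt matches.tsv > filtered.tsv

cat filtered.tsv
```

The awk filter compares the whole name, so black-listing "was" does not remove longer names that merely contain those letters.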
information extraction
co-mentioning
within abstracts
rank genes for each disease
find shared genes
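A sketch of the ranking step, assuming each disease has a two-column file with abstract number and gene identifier. The file names, the layout and the identifiers G1 to G4 are all made up. A gene is counted once per abstract, so the count is the number of abstracts for that disease that mention it.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Hypothetical input: abstract number <TAB> gene identifier.
printf '1\tG1\n1\tG1\n2\tG1\n2\tG2\n3\tG3\n' > prostate.tsv
printf '7\tG2\n8\tG2\n8\tG4\n9\tG2\n'        > schizophrenia.tsv

rank_genes() {
  # Drop repeated mentions within one abstract, then count abstracts per gene.
  sort -u "$1" | cut -f 2 | sort | uniq -c | sort -nr
}

rank_genes prostate.tsv      > prostate.rank
rank_genes schizophrenia.tsv > schizophrenia.rank

# Shared genes: identifiers that occur in both rankings.
awk '{ print $2 }' prostate.rank      | sort > prostate.genes
awk '{ print $2 }' schizophrenia.rank | sort > schizophrenia.genes
comm -12 prostate.genes schizophrenia.genes
```

With real data one would probably compare only the top of each ranking, for example by putting head -100 before the comm step.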
wrap up
Protein kinase B
PKB
Akt
AKT1
same protein
synonyms matter
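Why synonyms matter can be shown with a toy count. The dictionary layout below (identifier, then name) and the list of mentions are assumptions for illustration; the four names are the ones listed above.

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Toy dictionary: identifier <TAB> name (layout assumed for illustration).
printf 'AKT1\tProtein kinase B\nAKT1\tPKB\nAKT1\tAkt\nAKT1\tAKT1\n' > dictionary.tsv

# Names as they might appear in text, one mention per line.
printf 'PKB\nAkt\nProtein kinase B\nAKT1\nAkt\n' > mentions.txt

# Counting raw names spreads the evidence over four different strings.
sort mentions.txt | uniq -c | sort -nr

# Mapping every name to its identifier first pools all five mentions.
awk -F '\t' 'NR == FNR { id[$2] = $1; next } ($1 in id) { print id[$1] }' \
    dictionary.tsv mentions.txt | sort | uniq -c > by_gene.txt

cat by_gene.txt
```

After the mapping, the single line in by_gene.txt reports 5 mentions of AKT1.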
“black list” is crucial
text mining is useful
not black magic
Thanks for your attention