Text-mining practical

Lars Juhl Jensen

unix primer

the command line

some useful commands

cat

less

head -10

tail -10

grep 'needle'

cut -f 2

sort

sort -nr

uniq -c
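
To try the commands above, here is a minimal sketch on toy files. The file names and their contents are made up for illustration and are not the practical's data. `less` is left out because it is interactive (quit with `q`).

```shell
# Toy files, made up for illustration only.
printf 'a1\tAKT1\na2\tPKB\na3\tAKT1\n' > demo.tsv
printf '5\n12\n7\n' > numbers.txt
printf 'AKT1\nAKT1\nPKB\n' > sorted_names.txt

cat demo.tsv               # print the whole file
head -2 demo.tsv           # first two lines
tail -1 demo.tsv           # last line
grep 'PKB' demo.tsv        # only the lines containing PKB
cut -f 2 demo.tsv          # second tab-separated column
sort demo.tsv              # lines in sorted order
sort -nr numbers.txt       # numeric sort, largest first
uniq -c sorted_names.txt   # count runs of adjacent identical lines
```

Note that `uniq -c` only merges adjacent duplicates, so its input normally has to be sorted first.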

redirecting output

write to file

command > filename
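
A small sketch of redirection, assuming a made-up input file `genes.txt`. The `>` operator creates the target file, or overwrites it if it already exists.

```shell
# Hypothetical input: three names, one per line.
printf 'PKB\nAKT1\nPKB\n' > genes.txt

# Write the sorted names to a file instead of the terminal.
sort genes.txt > sorted_genes.txt

# Inspect the result.
cat sorted_genes.txt
```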

using pipes

command1 | command2
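
A pipe sends the output of one command straight into the next, with no intermediate file. A minimal sketch on a made-up two-column file:

```shell
# Hypothetical tab-delimited file: identifier, then name.
printf 'a1\tAKT1\na2\tPKB\na3\tAKT1\n' > pipes_demo.tsv

# Keep the lines mentioning AKT1, then print only their first column.
grep 'AKT1' pipes_demo.tsv | cut -f 1
```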

putting it all together

cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
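
Read left to right, this pipeline takes column 4, sorts it so that identical values are adjacent, counts each distinct value, orders the counts from largest to smallest, and saves the top 100. A sketch on a made-up `infile`; the meaning of the columns here is an assumption for illustration only.

```shell
# Hypothetical 4-column, tab-delimited input; column 4 holds a name.
printf 'd1\t10\t14\tAKT1\nd1\t30\t33\tPKB\nd2\t5\t9\tAKT1\nd3\t7\t11\tAKT1\nd3\t20\t23\tAkt\n' > infile

cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile

# Each output line is a count followed by the value it belongs to.
cat outfile
```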

the task

disease gene finding

named entity recognition

human genes

gene prioritization

what I have done

information retrieval

two diseases

prostate cancer

schizophrenia

two sets of documents

82,373 abstracts

89,904 abstracts

one file with each set

one line per abstract

dictionary

tab-delimited file

human genes

21,929 entities

synonyms

from many databases

orthographic variation

prefixes and suffixes

automatically generated

2,920,042 names

tagcorpus program

flexible matching

upper- and lower-case

spaces and hyphens

tab-delimited output
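
The idea of flexible matching can be imitated with `grep`. This is only an illustration of ignoring case and treating spaces and hyphens alike; it is not how tagcorpus is implemented, and the demo file is made up.

```shell
# Hypothetical text, one line per "abstract".
printf 'Protein kinase B is activated.\nprotein-kinase b signalling\nAn unrelated sentence.\n' > abstracts_demo.txt

# -i ignores case; [ -]? allows a space, a hyphen, or nothing between words.
grep -i -E 'protein[ -]?kinase[ -]?b' abstracts_demo.txt
```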

what you will do

named entity recognition

find unfortunate names

create “black list”
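
One possible way to approach this, sketched under assumptions: the tagger output is taken to be a tab-delimited file with the matched name in column 2, which may not match the real tagcorpus layout, and `CELL` stands in for a hypothetical unfortunate name. `awk` is not among the commands listed earlier; it is used here because it can compare a single column exactly.

```shell
# Hypothetical tagger output: abstract id, matched name, entity id.
printf 'a1\tAKT1\tG1\na1\tCELL\tG2\na2\tCELL\tG2\na3\tPKB\tG1\na3\tCELL\tG2\n' > matches.tsv

# Step 1: list the most frequently matched names and inspect them by eye.
cut -f 2 matches.tsv | sort | uniq -c | sort -nr | head -10

# Step 2: put the names judged to be unfortunate in a black list, one per line.
printf 'CELL\n' > blacklist.txt

# Step 3: drop every match whose name (column 2) is on the black list.
awk -F '\t' 'NR == FNR { bad[$1]; next } !($2 in bad)' blacklist.txt matches.tsv > matches_clean.tsv
cat matches_clean.tsv
```

The `awk` step assumes a non-empty black list; with an empty file the first-file test would also apply to the matches file.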

information extraction

co-mentioning

within abstracts

rank genes for each disease

find shared gene
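
Because each document set was retrieved for one disease, a gene mentioned in an abstract of that set co-occurs with the disease. A hedged sketch of the ranking and comparison follows; the two-column layout (abstract id, gene), the file names, and the placeholder gene names are all assumptions, and the toy data says nothing about real results.

```shell
# Hypothetical cleaned matches per disease: abstract id, gene.
printf 'p1\tGENE_A\np1\tGENE_A\np2\tGENE_A\np2\tGENE_B\np3\tGENE_B\np3\tGENE_A\np4\tGENE_C\n' > prostate_demo.tsv
printf 's1\tGENE_D\ns2\tGENE_D\ns2\tGENE_B\ns3\tGENE_B\ns3\tGENE_D\ns4\tGENE_E\n' > schizophrenia_demo.tsv

for d in prostate schizophrenia; do
  # Count each gene once per abstract, then rank by number of abstracts.
  cut -f 1,2 "${d}_demo.tsv" | sort -u | cut -f 2 | sort | uniq -c | sort -nr > "${d}_ranked.txt"
  # Keep only the names of the top 2 genes (use a larger cutoff on real data).
  head -2 "${d}_ranked.txt" | awk '{print $2}' > "${d}_top.txt"
done

# A name that appears in both top lists shows up twice after sorting.
sort prostate_top.txt schizophrenia_top.txt | uniq -d > shared.txt
cat shared.txt
```

The `sort -u` step matters: without it, a gene mentioned many times in a single abstract would be counted as many co-mentions.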

wrap up

Protein kinase B

PKB

Akt

AKT1

same protein

synonyms matter

“black list” is crucial

text mining is useful

not black magic

Thanks for your attention

Date post:	22-Jan-2017