Weekly Report: Start learning GPU

Post on 14-Jan-2016

Weekly Report: Start learning GPU

Ph.D. Student: Leo Lee
Supervisor: Dr. Xiaowen Chu
Date: Sep. 11, 2009

Outline

Protein identification and pFind

GPU and data mining

Research Plan

Protein identification and pFind

Background

Identify flow

Challenges

Could GPU be used?

Protein identification and pFind: Background

The Human Genome Project: China 1%

Same gene, different proteins

Human Plasma Proteome Project, USA

Human Disease Glycomics/Proteome Initiative (HGPI), Japan

Human Proteome Program: China in charge of the liver

Characteristics of the proteome

Protein identification and pFind: Identify flow

Mass Spectrometry Based Protein Identification

[Figure: identification workflow. Mixed proteins -> Digest -> Mixed peptides -> LC-MS/MS -> Data analysis -> Peptide sequences -> Merge -> Protein sequences]

Example protein: >ipi|IPI00243451|IPI00243451.6 MDQHQHLNKTAESASSEKKKTRRCNGFKMFLAALSFSYIAKALGGIIMKISITQIERRFD…

Example peptides after digestion: TAESASSEK, MFLAALSFSYIAK, …

[Figure: tandem MS spectrum. Scan header: 19-21-08 FT 893 MS2 9 avg #1 RT: 0.63 AV: 1 NL: 1.04E4 T: FTMS + p NSI Full ms2 893.60@30.00 [500.00-1600.00]. x-axis: m/z (600 to 1400); y-axis: Relative Abundance (0 to 100).]

Tandem MS
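The Digest step of the workflow can be sketched in a few lines of Python. This is only an illustration under a simplified assumption: cleave after every K or R, with no proline exception and no missed cleavages. The function name `tryptic_digest` is hypothetical, and the input is just the visible prefix of the example entry, since the rest of the sequence is elided on the slide.

```python
import re

# Visible prefix of the slide's example entry IPI00243451.6 (the rest is elided).
PROTEIN_PREFIX = "MDQHQHLNKTAESASSEKKKTRRCNGFKMFLAALSFSYIAKALGGIIMKISITQIERRFD"


def tryptic_digest(sequence):
    """Split a protein after every K or R.

    Simplified rule (an assumption): no proline exception, no missed cleavages.
    """
    # Each match is a run of non-K/R residues closed by one K or R,
    # or a trailing run that has no closing K/R.
    return re.findall(r"[^KR]*[KR]|[^KR]+", sequence)


peptides = tryptic_digest(PROTEIN_PREFIX)
for peptide in peptides:
    print(peptide)
```

Under this rule the two peptides shown on the slide, TAESASSEK and MFLAALSFSYIAK, both appear in the output. The last fragment is incomplete only because the input is truncated.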

Web search engine vs. protein identification search engine (SE)

[Figure: a tandem MS spectrum (m/z axis 200 to 1200) is submitted as the query ("Go pFind"); pFind scores it against candidate peptides from the sequence database, e.g. TAESA…, MFLAALS…, …FSYIAK.]

Sequence database

…KFDTGIPDGFAGFFGHYAQGGITFRH
EWTRJQIDF…

Mass window of the query

Lower bound of mass: 699.70
Upper bound of mass: 699.90

Query results

699.78 TLKHLK
699.78 WDRDL
699.82 ELDGER
...

Peptide index sorted by mass

400.15 EVDG
400.15 AAEE
400.15 PSTD
…
698.48 SVKKKK
699.78 TLKHLK
699.78 WDRDL
……

Protein sequence database

>IQPSKANMETEPDQ…
>DEAVPPPALQLQFN…
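The mass-window query on this slide amounts to a range lookup in a peptide list sorted by mass. A minimal sketch, assuming the index is held in memory as a sorted Python list and using only the entries that appear on the slide; `candidates_in_window` is a hypothetical helper, not a pFind function.

```python
from bisect import bisect_left, bisect_right

# (mass, peptide) pairs copied from the slide, already sorted by mass.
PEPTIDE_INDEX = [
    (400.15, "EVDG"),
    (400.15, "AAEE"),
    (400.15, "PSTD"),
    (698.48, "SVKKKK"),
    (699.78, "TLKHLK"),
    (699.78, "WDRDL"),
    (699.82, "ELDGER"),
]
MASSES = [mass for mass, _ in PEPTIDE_INDEX]


def candidates_in_window(lower, upper):
    """Return peptides whose mass lies in [lower, upper], via binary search."""
    first = bisect_left(MASSES, lower)   # first entry with mass >= lower
    last = bisect_right(MASSES, upper)   # one past the last entry with mass <= upper
    return [peptide for _, peptide in PEPTIDE_INDEX[first:last]]


hits = candidates_in_window(699.70, 699.90)
print(hits)
```

With the 699.70 to 699.90 window from the slide, the lookup returns the same three peptides listed as query results. Only these candidates then need to be scored against the spectrum.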

Protein identification SE

[Figure, built up over three slides: the protein database is digested into a peptide list ordered by mass, and MS spectra (m/z axis 200 to 1200) are matched against that list.]

Protein database

>IQPSKANMETEPDQ…
>DEAVPPPALQLQFN…
>RQRAILKVMNTIGGE……

Peptides after digestion

400 EVDG
400 AAEE
400 PSTD
698 SVKKKK
699 TLKHLK
699 WDRDL
……

MS <- Matching -> Peptide <- Digest <- Protein database

Protein identification and pFind: Challenges

Challenges of PISE

[Figure: the protein database (>IQPSKANMETEPDQ…, >DEAVPPPALQLQFN…, >RQRAILKVMNTIGGE…) is digested into peptides (EVDG, AAEE, PSTD, SVKKKK, TLKHLK, WDRDL, …), which are matched against the MS data.]

MS data generation speed keeps increasing

Protein databases increase exponentially

PTM leads to a huge number of peptides

E.g. phosphorylation

Amino acids S, T and Y (HPO3, 80 Da)

May or may not happen at each site

2^5 = 32 possibilities for the peptide below, which has five candidate sites (marked PO3)

EMSVPSCQYILSATNR
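The count of possibilities can be checked by enumeration. A small sketch, assuming each S, T or Y is independently either phosphorylated or not, and using the nominal 80 Da shift quoted above; the names `sites` and `modified_forms` are illustrative only.

```python
from itertools import product

PEPTIDE = "EMSVPSCQYILSATNR"
PHOSPHO_RESIDUES = "STY"
PHOSPHO_MASS_DA = 80  # nominal HPO3 mass quoted on the slide

# 0-based positions of every residue that may carry a phosphate group.
sites = [i for i, residue in enumerate(PEPTIDE) if residue in PHOSPHO_RESIDUES]


def modified_forms(site_positions):
    """Yield (modified positions, mass shift in Da) for every on/off combination."""
    for flags in product((False, True), repeat=len(site_positions)):
        chosen = [pos for pos, on in zip(site_positions, flags) if on]
        yield chosen, PHOSPHO_MASS_DA * len(chosen)


forms = list(modified_forms(sites))
print(len(sites), "sites,", len(forms), "forms")
```

Every additional candidate site doubles the number of forms, which is why PTMs inflate the peptide list so quickly.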

Identification of PTM

Protein

>IQPSKANMETEPDQ…
>DEAVPPPALQLQFN…
>RQRAILKVMNTIGGE……

Peptide

400 EVDG
400 AAEE
400 PSTD
631 EMSVPS
699 TLKHLK
699 WDRDL
……

Protein identification and pFind: Could GPU be used?

http://bioinformatics.oxfordjournals.org/cgi/content/full/25/15/1937

Protein identification on GPU

One thread per MS spectrum

One thread per score

One thread per "query": V1 Match V2

Seems worth thinking about further!
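To make the "one thread per score" option concrete, here is a sequential Python sketch of the work decomposition: each flat thread index maps to one (spectrum, candidate) pair. Everything in it is an assumption for illustration. The toy peak lists are made up, and `shared_peak_count` is a placeholder score, not pFind's scoring function.

```python
# Hypothetical toy data: each spectrum and each candidate is a list of peak m/z values.
SPECTRA = [[100.0, 200.0, 300.0], [150.0, 250.0]]
CANDIDATES = [[100.0, 300.0], [150.0, 200.0], [250.0]]
TOLERANCE = 0.5  # assumed matching tolerance in m/z units


def shared_peak_count(spectrum, candidate):
    """Placeholder score (an assumption): number of candidate peaks found in the spectrum."""
    return sum(
        any(abs(peak - observed) <= TOLERANCE for observed in spectrum)
        for peak in candidate
    )


def score_work_item(tid):
    """Work done by one 'thread': decode the flat index, then score that pair."""
    spectrum_idx, candidate_idx = divmod(tid, len(CANDIDATES))
    value = shared_peak_count(SPECTRA[spectrum_idx], CANDIDATES[candidate_idx])
    return spectrum_idx, candidate_idx, value


scores = [[0] * len(CANDIDATES) for _ in SPECTRA]
for tid in range(len(SPECTRA) * len(CANDIDATES)):  # sequential stand-in for parallel threads
    s, c, value = score_work_item(tid)
    scores[s][c] = value
print(scores)
```

The "one thread per MS spectrum" option would instead give each thread a whole row of this score matrix. That means fewer threads, each doing more work.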

GPU and data mining

Google search hits, 2009.09.11

Query                         Hits
CPU                           133,000,000
GPU                           13,800,000
GPGPU                         621,000
CUDA                          6,040,000
Genome GPU                    45,600
Proteomic GPU                 7,830
Protein GPU                   85,300
Protein identification GPU    3,450
Data mining on GPU            77,700

GPU and data mining

Characteristics of GPU

GPU vs. CPU

CUDA

Data mining on GPU

[Figure: peak GFLOPS over time, Jan 2003 to Jul 2007, y-axis 0 to 600 GFLOPS. GPUs: NV30, NV35, NV40, G70, G70-512, G71, GeForce 8800 GTX, Quadro FX 5600, Tesla C870. CPUs: 3.0 GHz Pentium 4, 3.0 GHz Core 2 Duo, 3.0 GHz Core 2 Quad.]

Source: based on slide 7 of S. Green, “GPU Physics,” SIGGRAPH 2007 GPGPU Course. http://www.gpgpu.org/s2007/slides/15-GPGPU-physics.pdf

GPU vs. CPU

Design philosophies are different.

The GPU is specialized for compute-intensive, massively data parallel computation (exactly what graphics rendering is about).

So, more transistors can be devoted to data processing rather than data caching and flow control.

The fast-growing video game industry exerts strong economic pressure for constant innovation.

[Figure: CPU and GPU block diagrams, each showing the chip area given to Control, Cache and ALUs, with DRAM attached.]

What is the GPU good at?

The GPU is good at data-parallel processing: the same computation executed on many data elements in parallel, with low control flow overhead

with high SP floating point arithmetic intensity: many calculations per memory access; currently also need a high floating point to integer ratio

High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches. Still need to avoid bandwidth saturation!
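"Many calculations per memory access" can be put into numbers as arithmetic intensity, meaning floating-point operations per byte moved. A rough sketch follows; the SAXPY example and the second, made-up kernel are illustrative choices rather than anything from the slides, and the byte counts ignore caching.

```python
FLOAT_BYTES = 4  # single precision


def arithmetic_intensity(flops_per_element, bytes_per_element):
    """Floating-point operations performed per byte of memory traffic."""
    return flops_per_element / bytes_per_element


# SAXPY, y[i] = a * x[i] + y[i]: one multiply and one add per element,
# reading x[i] and y[i] and writing y[i] (three float accesses).
saxpy = arithmetic_intensity(2, 3 * FLOAT_BYTES)

# Hypothetical kernel: 100 flops on each loaded float, one float written back.
heavy = arithmetic_intensity(100, 2 * FLOAT_BYTES)

print(f"SAXPY: {saxpy:.3f} flop/byte, hypothetical heavy kernel: {heavy:.1f} flop/byte")
```

A kernel like the second one has enough arithmetic to cover memory latency, while a SAXPY-like kernel is limited by bandwidth no matter how many ALUs are available.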

CUDA: no more shader functions

A CUDA application is an integrated CPU+GPU C program

Serial or modestly parallel C code executes on the CPU

Highly parallel SPMD kernel C code executes on the GPU

CPU Serial Code

GPU Parallel Kernel (Grid 0)

KernelA<<< nBlk, nTid >>>(args);

CPU Serial Code

GPU Parallel Kernel (Grid 1)

KernelB<<< nBlk, nTid >>>(args);
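The launch syntax above starts nBlk blocks of nTid threads, and every thread runs the same kernel on a different index. The Python sketch below only emulates that indexing sequentially to show the SPMD pattern; it is not CUDA code. `launch` and `scale_kernel` are hypothetical names.

```python
def launch(kernel, n_blocks, n_threads, *args):
    """Sequential stand-in for kernel<<< n_blocks, n_threads >>>(args)."""
    for block_idx in range(n_blocks):
        for thread_idx in range(n_threads):
            kernel(block_idx, n_threads, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, factor, out):
    """Same code for every thread; only the computed index differs (SPMD)."""
    i = block_idx * block_dim + thread_idx  # global index, as blockIdx.x * blockDim.x + threadIdx.x
    if i < len(data):  # guard: the grid may contain more threads than elements
        out[i] = data[i] * factor


data = [1.0, 2.0, 3.0, 4.0, 5.0]
out = [0.0] * len(data)
n_tid = 2
n_blk = (len(data) + n_tid - 1) // n_tid  # round up so every element is covered
launch(scale_kernel, n_blk, n_tid, data, 3.0, out)
print(out)
```

Rounding the block count up and then guarding with `i < len(data)` is the usual way to handle sizes that are not a multiple of the block size.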

CUDA

Basic

Memory

Threads

Application performance

Data mining on GPU

K-means

k-NN

Apriori

SVM

K-means on GPU

A team at the University of Virginia, led by Professor Skadron

HKUST and MSRA: GPUMiner

HP Labs
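For reference, a plain textbook k-means (Lloyd's algorithm) in Python. This is a minimal sketch and not the code of any of the groups listed above. The assignment step treats every point independently, which is the part that maps naturally onto one GPU thread per point. The toy points and starting centroids are made up.

```python
def assign(points, centroids):
    """Assignment step: each point independently picks its nearest centroid."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:  # keep an empty cluster's centroid where it was
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate the two steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical toy data: two well separated groups in 2D.
POINTS = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

The update step is a reduction over the points of each cluster, so it needs more care on a GPU than the assignment step does.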

Experiments: GPUMiner

Experiments: HPL

Data mining on GPU

The speed-up depends heavily on the implementation:

Data transfer

Memory

CPU-GPU cooperation
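A back-of-the-envelope model shows why data transfer matters so much. It is a sketch under a strong simplifying assumption: transfer and computation do not overlap. All numbers are hypothetical, not measurements, and `end_to_end_speedup` is an illustrative name.

```python
def end_to_end_speedup(cpu_seconds, kernel_speedup, transfer_seconds):
    """Overall speed-up when the GPU run also pays for CPU-GPU data transfer.

    Assumes transfer and computation do not overlap (a simplification).
    """
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return cpu_seconds / gpu_seconds


# Hypothetical numbers: a 10 s CPU job whose kernel alone runs 50x faster on the GPU.
no_transfer = end_to_end_speedup(10.0, 50.0, 0.0)
with_transfer = end_to_end_speedup(10.0, 50.0, 0.8)
print(f"kernel only: {no_transfer:.1f}x, including 0.8 s of transfer: {with_transfer:.1f}x")
```

In this made-up case, less than a second of transfer removes most of the kernel's advantage. This is one reason reported speed-ups for the same algorithm can differ so much between implementations.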

Research Plan

Keep reading related papers: GPU, data mining

Development:
Read our k-means program
Try to speed it up
Try protein identification on GPU

Time schedule

Courses: Thu. 6:30-9:30 pm, data mining

TA: Tue. 11:30 am-12:20 pm, Network security; Fri. 9:30-11:30 am, Network security

Thank you for listening