Computational Biology Laboratory Chuan Yi Tang CS Department, NTHU [email protected].

Computational Biology Laboratory

Chuan Yi Tang

CS Department, NTHU

[email protected]

Our Aims

• Develop new tools

• Create value-added databases

• Apply to systems biology

• Apply to Biomedicine

• A transcription factor is a protein that regulates the activation of transcription in the eukaryotic nucleus. Transcription factors localise to regions of promoter and enhancer sequence elements either through direct binding to DNA or through binding other DNA-bound proteins.

Coregulated genes

[Figure: Gene 1, Gene 2 and Gene 3 are regulated by the same transcription factor.]

S1  atgaccgggatactgattaatacaaggttgggtataatggagtacgataa
S2  attgagatcaatgtacggcgggtgctctcccgattggaagacaacgtggg
S3  gcaatcgggatcacaacgtagaattggatgtcaaaataatggagtggcac
S4  gtcaatcgaaaaaacggtggtgagcgcaaagtaaagggattggaccgctt

Sp1 binding site (SP 1), position count matrix:

     1  2  3  4  5  6  7  8
a    0  0  0  0  0  0  0  0
t    5  0  0  0  3  0  0  0
c    4  9  9  0  6  9  9  4
g    0  0  0  9  0  0  0  5

IUPAC code: YCCGYCCS
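A minimal Python sketch of how an IUPAC consensus can be read off a position count matrix such as the Sp1 matrix above. The function name `iupac_consensus` and the rule "a base counts as present if it is seen at least `min_count` times" are illustrative assumptions, not the method used in the slides.

```python
# Map each set of observed bases to its IUPAC nucleotide code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(counts, min_count=1):
    """Return the IUPAC string for a count matrix {base: [count per position]}.

    Assumption: a base is 'present' at a position when its count is at
    least min_count, and every position has at least one present base.
    """
    length = len(next(iter(counts.values())))
    letters = []
    for pos in range(length):
        present = frozenset(
            base.upper() for base, row in counts.items() if row[pos] >= min_count
        )
        letters.append(IUPAC[present])
    return "".join(letters)


# Count matrix copied from the slide (9 binding sites, 8 positions).
sp1_counts = {
    "a": [0, 0, 0, 0, 0, 0, 0, 0],
    "t": [5, 0, 0, 0, 3, 0, 0, 0],
    "c": [4, 9, 9, 0, 6, 9, 9, 4],
    "g": [0, 0, 0, 9, 0, 0, 0, 5],
}

print(iupac_consensus(sp1_counts))  # YCCGYCCS
```

Positions 1 and 5 contain both C and T (code Y), and position 8 contains both C and G (code S), which reproduces the consensus shown on the slide.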

• Degeneracy tends to occur at specific positions of transcription elements, e.g. the Sp1 binding site YCCGYCCS.

• When no auxiliary data (such as orthologous sequences) are used, the accuracy of most motif discovery tools is strongly influenced by the motif degeneracy and the length of the input sequences.

ARRTTYYRSA: high motif degeneracy, weak motif

AAGTTYYRCA: low motif degeneracy, strong motif

• A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. (A degenerate position is a position occupied by a character other than A, G, C or T.)

e.g. ARATTYT is a degenerate (7,2)-motif

• Degenerate motif discovery problem. Given a set of sequences S = {S1, S2, …, Sm | Si belongs to {A, G, C, T}* for all i} and three nonnegative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S.
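The definitions above can be sketched in Python as follows. This only checks a given pattern and counts the sequences it occurs in; it is not a discovery algorithm. The helper names (`is_degenerate_motif`, `occurs_in`, `support`) and the toy sequences are hypothetical.

```python
# Bases matched by each IUPAC nucleotide code.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def is_degenerate_motif(pattern, l, d):
    """True if pattern is a degenerate (l, d)-motif as defined above."""
    if len(pattern) != l or any(ch not in IUPAC_SETS for ch in pattern):
        return False
    degenerate_positions = sum(1 for ch in pattern if ch not in "ACGT")
    return degenerate_positions <= d


def occurs_in(pattern, sequence):
    """True if the IUPAC pattern matches somewhere in the DNA sequence."""
    seq = sequence.upper()
    n = len(pattern)
    return any(
        all(seq[i + j] in IUPAC_SETS[pattern[j]] for j in range(n))
        for i in range(len(seq) - n + 1)
    )


def support(pattern, sequences):
    """Number of sequences that contain at least one occurrence."""
    return sum(occurs_in(pattern, s) for s in sequences)


# Toy sequences, invented for illustration only.
toy = ["ttCCCGCCCGaa", "acTCCGTCCCtt", "aaaaaaaaaa"]
print(is_degenerate_motif("ARATTYT", 7, 2))  # True: R and Y are degenerate
print(support("YCCGYCCS", toy))              # 2
```

A pattern is a solution to the problem when `is_degenerate_motif(pattern, l, d)` holds and `support(pattern, S) >= k`.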

METHODS




e.g. l = 3, d = 1, k = 4

Wij = ATA

All possible sets of degenerate positions: {p1}, {p2}, {p3}, giving the templates _TA, A_A and AT_.

For each possible set X = {p1, …, pd} of degenerate positions, all Wpq with V(Wij, Wpq) ⊆ X are collected, that is, all words that differ from Wij only at positions in X.

_TA: ATA (S1), CTA (S2), ATA (S3), CTA (S3), TTA (S4)    K = 4

A_A: ATA (S1), ATA (S2), ATA (S3), ACA (S4), AAA (S4), ACA (S5)    K = 5

AT_: ATC (S2), ATT (S3), ATA (S3), ATA (S3), AAA (S3)    K = 2
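The collection step in this example can be sketched in Python as below. This is a simplified illustration under stated assumptions: the seed word is compared with every l-mer of every sequence, and the function names and toy sequences are hypothetical rather than taken from the slides.

```python
from itertools import combinations


def group_by_positions(sequences, seed, d):
    """For every set X of d degenerate positions, collect the words that
    differ from the seed only at positions in X.

    Returns {template: [(word, sequence_number), ...]}, where the template
    shows the degenerate positions as '_' (e.g. '_TA').
    """
    seed = seed.upper()
    l = len(seed)
    groups = {}
    for X in combinations(range(l), d):
        template = "".join("_" if p in X else seed[p] for p in range(l))
        hits = []
        for s_idx, seq in enumerate(sequences, start=1):
            seq = seq.upper()
            for start in range(len(seq) - l + 1):
                word = seq[start:start + l]
                # Keep the word if it agrees with the seed outside X.
                if all(word[p] == seed[p] for p in range(l) if p not in X):
                    hits.append((word, s_idx))
        groups[template] = hits
    return groups


def sequence_support(hits):
    """Number of distinct sequences contributing at least one word."""
    return len({s_idx for _, s_idx in hits})


# Toy sequences, invented for illustration only.
toy = ["GATAC", "CCTAG", "TTAGG", "ATCCA"]
groups = group_by_positions(toy, "ATA", d=1)
for template, hits in groups.items():
    print(template, hits, "K =", sequence_support(hits))
```

With these toy sequences the template _TA is supported by three sequences, A_A by one and AT_ by two, so only _TA would survive a threshold of k = 3.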

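Background letter probabilities are PA = 0.22, PT = 0.22, PC = 0.28 and PG = 0.28. A negative (p, q)-entry means that the letter p at position q is weakly conserved in G(Wij | X).

Lpq = log[(observed probability of p at position q in G(Wij | X)) / Pp]

Pseudo occurrence elimination

Motif scoring methods

s1 = ( Σj ( Σi Lij / pj ) ) / l,

where, judging from the worked example below, the inner sum runs over the letters i present at position j and pj is the number of those letters.

This score is used to measure the conservation and the significance of each reported motif.

Example of the sum over positions for a motif of length l = 7 with two degenerate positions:

(1.51+1.51+1.51+1.51+(0.31+0.31)/2+1.51+(0.31+0.82)/2)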
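A small Python sketch of the log-odds entries and of the position-averaged score, under two assumptions that the slides do not state explicitly: the logarithm is natural (ln(1/0.22) ≈ 1.51 agrees with the values above), and each position contributes the mean of the entries of the letters present there, with the total divided by the motif length l. The function names are hypothetical.

```python
import math

# Background letter probabilities from the slide.
BACKGROUND = {"A": 0.22, "T": 0.22, "C": 0.28, "G": 0.28}


def log_odds(observed_prob, letter):
    """L_pq = log(observed probability / background probability).

    Assumption: natural logarithm.
    """
    return math.log(observed_prob / BACKGROUND[letter])


def s1_score(position_scores):
    """Average over positions of the mean log-odds at each position.

    position_scores is a list with one inner list per motif position,
    holding the L values of the letters present at that position.
    """
    per_position = [sum(scores) / len(scores) for scores in position_scores]
    return sum(per_position) / len(per_position)


# L values copied from the worked example (7 positions, 2 degenerate).
example = [[1.51], [1.51], [1.51], [1.51], [0.31, 0.31], [1.51], [0.31, 0.82]]

print(round(log_odds(1.0, "A"), 2))   # 1.51: a fully conserved A
print(round(s1_score(example), 3))    # 1.204
```

A fully conserved A or T gives the entry 1.51, and the two degenerate positions pull the average down, which is why more degenerate motifs receive lower scores.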

The measure used for comparison is the performance coefficient |K ∩ P| / |K ∪ P|.

(Pevzner P. A. and Sze, S. H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), 269-278.)

K is the set of positions of the known motif occurrences in the input sequences.

P is the set of predicted positions.

The best performance coefficients among the top ten motifs found by these tools are compared.

Evaluation of performance on synthetic data



[Figure: three sequences S1, S2 and S3. The red words mark the set of positions of the known motif occurrences (K); the predicted occurrences give the set of predicted positions (P).]

|K ∩ P| = 21, |K ∪ P| = 35, |K ∩ P| / |K ∪ P| = 21/35 = 0.6

[Figure: performance coefficient (y-axis, 0 to 1.2) versus degree of motif degeneracy (x-axis, bins from 10%~15% to 45%~50%) for MotifSeeker, Projection, Consensus, Gibbs sampling and MEME.]

Evaluation of performance on synthetic data

[Figure: performance coefficient, average specificity and average sensitivity of MotifSeeker (y-axis, 0.1 to 1.1) versus degree of motif degeneracy (x-axis, bins from 10%~15% to 70%~75% and > 75%).]

Specificity: |K ∩ P| / |P| (lowered by false positives)

Sensitivity: |K ∩ P| / |K| (lowered by false negatives)

The best performance coefficient among the top ten motifs selected.
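The three measures can be computed directly from the position sets K and P. The Python sketch below uses hypothetical position sets chosen so that |K ∩ P| = 21 and |K ∪ P| = 35, matching the worked example; the function names are illustrative.

```python
def performance_coefficient(known, predicted):
    """|K ∩ P| / |K ∪ P| for two collections of sequence positions."""
    known, predicted = set(known), set(predicted)
    return len(known & predicted) / len(known | predicted)


def specificity(known, predicted):
    """|K ∩ P| / |P|: fraction of predicted positions that are correct."""
    known, predicted = set(known), set(predicted)
    return len(known & predicted) / len(predicted)


def sensitivity(known, predicted):
    """|K ∩ P| / |K|: fraction of known positions that were recovered."""
    known, predicted = set(known), set(predicted)
    return len(known & predicted) / len(known)


# Hypothetical position sets: 28 known and 28 predicted positions,
# overlapping in 21 positions, so the union has 35 positions.
K = range(0, 28)
P = range(7, 35)

print(performance_coefficient(K, P))  # 21 / 35 = 0.6
print(specificity(K, P))              # 21 / 28 = 0.75
print(sensitivity(K, P))              # 21 / 28 = 0.75
```

In practice the positions would be (sequence, offset) pairs rather than plain integers; sets of tuples work with the same functions.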

Evaluation of performance on tissue-specific regulatory elements

• Four liver-specific factors : HNF-1, HNF-3, HNF-4 and C/EBP.

• Each regulon consists of at least five genes. The average length of the analyzed promoter sequences is about 2.5 kbp.

Reference

• Identification of Degenerate Motifs Using Position Restricted Selection and Hybrid Ranking Combination, by C. H. Peng et al., to appear in NAR

彭千華, 鍾允昇 et al.

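Problems facing the breeding of Taiwan country chickens

• Crossbred strains

• Selected by the methods used for meat-type chickens

• Long rearing period

Result: low average egg number, high rearing cost, weak market competitiveness

Flock breeding

Breeding program

Selection

Genotype, phenotype

Using serum proteins as selection markers

• Questions:

1. How many markers?

2. Which markers?

3. Can concentration thresholds be used for selection?

4. Single-stage or multi-stage selection?

Proteome studies as a basis for flock selection

• Lezczynski et al. (1985) Relationship of plasma estradiol and progesterone levels in domestic chicken hens. Poult. Sci. 64, 545.

• Kuo et al. (2005) Proteomic analysis of hypothalamic proteins of high and low egg production strains of chicken. Theriogenology 64, 1490.

• Huang et al. (2006) Analysis of chicken serum proteome and differential protein expression during development in single-comb White Leghorn hens. Proteomics 6, 2217.

Biological pathway analysis of avian egg laying, 科學農業 (Scientific Agriculture) (2004), October

• Explore the role of serum proteins in the egg-laying process

• Relationships among the serum proteins

• Link reproduction-related molecules of mammals and birds to build an egg-laying-related biological pathway map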

Oviposition-related biopathway

[Figure: pathway diagram. Labels recovered from the diagram are listed below; the grouping follows the extraction order and is approximate.]

• Adipocyte (Leptin); Intestine

• Hypothalamus: GnRH, GHRH, GHIH, PRFs, PIFs

• Pituitary gland: GH, PRL, LH, FSH

• Ovary, Follicle, Oocyte: Estrogen, Progesterone, Inhibin, Activin, Follistatin, BMP

• WBC: IL-8, MCP-1, TNF-α, MMPs, PA, PGs, Elastase, LTB4, PAF, Histamine, ROS

• Receptors: GH-R, ER, LH-R, FSH-R, GnRH-R

• Liver: IGF-I; ApoAI, ApoB, ApoVLDL II; VTG I, VTG II, VTG III; Riboflavin BP, Retinol-BP; ZP1, α2M, Transthyretin; 25-hydroxyvitamin D3-BP

• Kidney: 1,25-dihydroxyvitamin D3, Calbindin D28

• Oviduct (ER): Ovalbumin, Ovotransferrin, Lysozyme; Ovomucoid, Ovoinhibitor, Ovostatin, Cystatin; Ovocleidin-17, Osteopontin, Calbindin D28

• Other labels: Photostimulation, Glucose, Relaxin, IGF-I, LCAT

• Egg: Egg yolk, Egg white, Egg shell

Serum protein marker

• Ovotransferrin

• Vitellogenin

• Apolipoprotein A-I

• X protein (an IGF-I like protein)

• Apo VLDL-II

(Figure labels: Exp I, Exp II)

Stage selection

• Exp I: Association of serum protein levels with egg number at two stages

24 wk (initial egg production)

35 wk (peak egg production)

• Exp II: Selection strategy at three stages

14 wk (premature stage)

24 wk

35 wk

Exp (I) Fig. 1. Egg production rate of TRFCC (n=157). (A) Total egg number of all hens, (B) hens in four groups (Group I, Group II, Group III, Group IV).

[Plots: egg production rate (%) versus week of age, 25 to 48 weeks.]

Fig. 3. Association of relative protein levels with total egg number.

[Four scatter plots: relative protein level (y-axis) versus total egg number (x-axis, 30 to 130).]

(A) Vitellogenin: 24w (r=0.23, p<0.01), 35w (r=0.53, p<0.01)

(B) Apo A-I: 24w (r=0.14), 35w (r=-0.52, p<0.01)

(C) Ovotransferrin: 24w (r=0.05), 35w (r=-0.02)

(D) X protein: 24w (r=0.103), 35w (r=0.331, p<0.01)

Table 6. Correlation between serum protein levels and total egg number in two batches of TRFCC.

               Batch A                     Batch B
               14w     24w     35w         14w     24w     35w
Apo A-I        0.02    0.07    -0.55**     0.16    0.2     -0.57**
Apo VLDL-II    0.02    0.15    0.06        -0.15   0.2     0.38**
X protein      0.04    0.05    0.24*       -0.16   0.13    0.46**
Vitellogenin   -       0.03    0.19        -       0.24*   0.65**

* p < 0.05, ** p < 0.01

Exp II. Selection strategy

• Taiwan red-feathered country chickens (TRFCC)

• Batch A (n=77), hatched in July 2003

• Batch B (n=78), hatched in September 2003

Fig. 1. Egg production rate of batch A (n=77) and batch B (n=78) of TRFCC.

[Plot: egg production rate (0 to 1) versus week of age (25 to 48) for Batch A and Batch B.]

Code-selection

[Figure labels: serum protein level, score and rank (for Batch A and for Batch B); Transformation; Regional codes; code.]

Step 1: Select the 20% of birds with the lowest egg numbers in batch B of TRFCC.

Step 2: Transform codes in batch A of birds.

Table 2. Serum protein levels in batch A of TRFCC (35 wk).

The source header gives only "score" and "rank" for each protein, but each protein has four values per bird. The third and fourth column labels below (rank and score after transformation) are inferred from the Code-selection diagram.

bird   vitellogenin                 apo A-I                      apo VLDL-II                  X-protein
no.    score rank  rank'  score'    score rank  rank'  score'    score rank  rank'  score'    score rank  rank'  score'
1      2.00  63    63.82  1.62      0.92  61    62.61  0.65      1.78  24    39.83  1.99      5.62  63    64.66  2.58
2      1.05  25    25.32  0.93      0.44  26    26.68  0.32      *     -     -      -         2.71  18    18.47  1.23
3      1.08  29    29.38  0.95      0.40  21    21.55  0.29      1.24  11    18.26  0.62      2.5   16    16.42  1.13
4      1.33  50    50.65  1.13      0.49  31    31.82  0.35      1.70  19    31.53  1.79      4.97  53    54.39  2.28
5      1.18  40    40.52  1.02      0.60  38    39.00  0.42      1.92  34    56.43  2.35      4.83  52    53.37  2.22
6      1.41  54    54.70  1.19      0.68  45    46.18  0.48      *     -     -      -         4.24  41    42.08  1.94
7      2.60  68    68.88  2.06      0.02  4     4.11   0.03      *     -     -      -         1.12  7     7.18   0.49
8      0.88  13    13.17  0.80      1.53  73    74.92  1.06      1.84  28    46.47  2.15      6.75  76    78.00  3.11
9      2.90  71    71.92  2.28      0.22  12    12.32  0.17      *     -     -      -         1.2   11    11.29  0.53
10     1.19  44    44.57  1.03      0.64  42    43.11  0.45      2.04  39    64.72  2.66      4.97  54    55.42  2.28
:      :     :     :      :         :     :     :      :         :     :     :      :         :     :     :      :
78     4.15  75    75.97  3.19      0.02  5     5.13   0.03      0.66  7     11.62  -0.86     0.96  5     5.13   0.42

* Serum samples were not available.

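Rank transformation from batch A to batch B (reconstructed from garbled formula residue and checked against the transformed ranks in Table 2, e.g. 63 × 78/77 ≈ 63.82):

R_B^t = R_A^i × (n / m)

[A second formula, relating the scores S_B^i of batch B to the transformed rank R_B^t, is not recoverable from the extraction.]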
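The transformed-rank column of Table 2 is consistent with scaling each batch A rank by the ratio of sample counts between the batches. The Python sketch below rests on that inference, not on a formula stated legibly in the source; the function name, the argument names and the sample count 47 for apo VLDL-II are assumptions chosen to match the table values.

```python
def transform_rank(rank_a, n_b, m_a):
    """Scale a rank from batch A onto the rank range of batch B.

    Assumption (inferred from Table 2): n_b is the number of birds in
    batch B and m_a the number of batch A birds with an available sample.
    """
    return rank_a * n_b / m_a


# Vitellogenin, bird 1: rank 63 among 77 samples, batch B has 78 birds.
print(round(transform_rank(63, 78, 77), 2))   # 63.82

# Apo VLDL-II, bird 1: rank 24. A sample count of 47 reproduces the
# table value, so 47 is an inferred number, not one given in the slides.
print(round(transform_rank(24, 78, 47), 2))   # 39.83
```

Scaling ranks this way lets birds from batches of different sizes be compared on a common rank scale before any score lookup.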

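Conclusions

• Changes in chicken serum protein concentrations are related to egg production and are also affected by the environment.

• After the protein concentrations of the two batches are converted into scores and ranks, the code transformation reveals regularities in the protein concentrations of both batches, which can then be used for prediction.

• Serum protein levels at 14 weeks are not correlated with egg production, yet regularities characteristic of low-producing hens can be found at this stage.

• With code-selection, 19.5% of the birds can be culled at 14 wk, and 78.8% of the culled birds belong to the lower-producing 50% of hens.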

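Acknowledgments: collaborators and researchers in the country chicken project

• Animal Technology Institute: Dr. 林志鴻, Dr. 李文權, Dr. 莊景凱

Project staff: Ms. 林寶雪, Ms. 陳欣欣, Ms. 陳惠卿, Ms. 陳玉惠, Ms. 陳宛宜, Ms. 林冬梅

• National Chung Hsing University: Prof. 李淵百, Assoc. Prof. 黃三元, Assoc. Prof. 陳志峰

• Institute of Molecular Medicine, National Tsing Hua University: Prof. 劉銀樟

• Department of Computer Science, National Tsing Hua University: Prof. Chuan Yi Tang (唐傳義), Ms. 林沿妊

• 高首企業股份有限公司: Executive Director 黃次洋, Farm Manager 黃士人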

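Research and development of blade servers for advanced scientific computing (Quanta industry-academia collaboration)

Subproject 2: Building cluster computing environments for theoretical physics and bioinformatics

National Applied Research Laboratories: President 莊哲男

National Center for High-performance Computing: Dr. 張西亞

National Center for Theoretical Sciences: Director 張圖南

Department of Computer Science, National Tsing Hua University: Prof. Chuan Yi Tang (唐傳義)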

Performance Comparison between IB (InfiniBand) and GE (Gigabit Ethernet) on Quanta Blade Server

• Each blade server contains 10 blades

• Intel EM64T Xeon (Nocona)

– 3.2 GHz with 1 MB of L2 cache

• Each blade contains 4 GB of DDR2-400 memory

• Scientific Linux release 4.2 x86_64 with kernel version 2.6.14.5

• IBG2 2.0.1 driver for IB

Quanta Blade Server

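Research and development of bioinformatics applications (1)

• Method development

– Parallel construction of evolutionary trees

– Parallel alignment of three sequences

– Parallel constrained multiple sequence alignment

– Protein secondary structure prediction

– Construction of genome sequence landmarks and alignment of SNP and EST sequences against them

Research and development of bioinformatics applications (2)

• Related service websites and databases

– Website for parallel evolutionary tree construction

– Website for protein secondary structure prediction

– Website for parallel sequence alignment

– Database for protein secondary structure prediction

– Database of protein sequence and structure information

– Website for genome sequence landmark construction

Method development (3)

• Construction of genome sequence landmarks and alignment of SNP and EST sequences

– Under construction

– Human EST: more than 6 million entries; SNP: more than 9 million entries

– (2002) The human EST alignments compared 1.75×10^9 bases in 3.73×10^6 ESTs against 2.88×10^9 bases of human DNA and took 220 CPU hours on a Linux farm of 800 MHz Pentium IIIs.

Second-year research plan (2006/11~2007/7)

• Method development

– Complete the construction of genome sequence landmarks and the alignment of SNP and EST sequences

– Test the performance of the methods and provide system-level recommendations

• Related service websites and databases (service oriented architectures)

– Complete the website for parallel sequence alignment

– Complete the website for genome sequence landmark construction

– Database of protein sequence and structure information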

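Future directions of the laboratory

• Extend existing applications and connect them with medical informatics

– National Science Council post-excellence program "Next-generation information and communication network advanced technologies and applications", subproject 6

– Ministry of Economic Affairs, Department of Industrial Technology, academic technology development program "Research and development of intelligent sensing systems, networks and applications"

– National Science Council frontier program "Medical Grid: a grid-based E-health system"

– National Science Council and Quanta industry-academia collaboration project

– Integrate the medical grid, intelligent sensing systems and bioinformatics for research on human diseases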

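Mining patient histories in a nuclear medicine image bank and its application to cancer diagnosis

Chuan Yi Tang (唐傳義)

閻紫宸 (Director, Department of Nuclear Medicine, Chang Gung Memorial Hospital)

王速貞 (FDA, USA)

Background

• The Department of Nuclear Medicine at Chang Gung has set up an image bank that collects a large amount of nuclear medicine data from cancer patients, including CT, MRI, PET and SPECT images together with the physician's diagnosis for each image.

• These images represent the tumor status of patients at different stages.

• Besides the images and diagnoses in the bank, each patient's medical record in the database can be retrieved to obtain the complete history (including personal data, treatment, pathological anatomy and biochemical data).

• We hope to extract valuable information from these histories through data mining.

What information is valuable?

• Classification of the different responses of patients to a given treatment

• Correlation between treatment response and individual patient differences

• A diagnosis support system based on personalized patient information

Nasopharyngeal Carcinoma (NPC)

• Regarded as a cancer characteristic of southern Chinese populations.

• According to statistics, the annual number of men per 100,000 who develop NPC is 7.7 in Taiwan, 0.63 in the United States and 0.27 in Japan.

• Research attributes NPC to roughly three causes: genetic factors, Epstein-Barr (EB) virus infection, and environmental factors (frequent consumption of pickled food or salted fish in childhood, air pollution in the workplace, and long-term heavy smoking).

• Common symptoms fall into roughly six categories: a neck mass; unilateral hearing impairment or a feeling of ear blockage; blood-streaked sputum or nasal discharge; unilateral nasal obstruction or increased nasal discharge; headache; and facial numbness or blurred distance vision.

• The human leukocyte antigen (HLA) types of patients also show an association with certain specific HLA types.

• NPC is treated mainly with radiotherapy; patients with late-stage (stage III or IV) or recurrent disease may also need chemotherapy and surgery. A small number of patients relapse after treatment, so regular follow-up examinations are necessary.

• Taiwan's early diagnosis rate and treatment results for NPC are now outstanding and well known worldwide.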

Genome-wide Interpretation: Informatics of Immune Responses

-The Concept of Immunometer

Ching-Tai Huang (黃景泰), M.D., Ph.D.

Division of Infectious Diseases, Department of Internal Medicine

Chang Gung Memorial Hospital, Linkou

Immune Tolerance & Immune Activation - Balance between Physiology & Pathology

[Figure: a balance between Tolerance and Activation. Labels: self antigens, tumors, infectious microorganisms, environmental antigens, transplanted organs.]

Transgenic Mouse Model - Adoptive Transfer System

Donors: HA-specific TCR transgenic mice

a) CD4+: 6.5 (I-Ed HA110-120)

b) CD8+: clone 4 (Kd HA518-526)

Pooled splenocytes & lymph node cells

Recipients: HA-expressing transgenic mice (C3-HA Low, C3-HA High) and Non-Tg

Immune Tolerance & Immune Activation - in CD4+ T Cells

Naive

Anergic/Regulatory

Activated/Memory

Immune Tolerance & Immune Activation - Dynamic genomic approach (with Affymetrix Gene Chips)

[Figure: RNA is collected from Naive, Anergic/Regulatory and Activated/Memory cells at Day 2, Day 3 and Day 4.]

Our Aims

• Finding the Immunometer: Clustering the genes with similar expression pattern and significant difference

• Finding the motifs of genes that have similar expression pattern for reconstructing the regulatory relationship of genes.

• Systems biology: Combine literature-annotated genes, gene expression data and signal transduction pathways to find the interface between signal transduction and the regulatory network.

NF-κB pathway model

[Figure: pathway diagram with the labels IKK, NF-κB, NFKB and ICAM1, followed by two plots, ICAM-1 and ICAM-1 _2, of gene expression variation (y-axis) versus time (x-axis, 1 to 6).]

Date post:	21-Dec-2015