Date post: | 21-Dec-2015 |
Our Aims
• Develop new tools
• Create value-added databases
• Apply to systems biology
• Apply to Biomedicine
• A transcription factor is a protein that regulates the activation of transcription in the eukaryotic nucleus. Transcription factors localise to regions of promoter and enhancer sequence elements either through direct binding to DNA or through binding other DNA-bound proteins.
[Figure: a transcription factor regulating a set of coregulated genes (Gene 1, Gene 2, Gene 3).]
Example input sequences:
S1 atgaccgggatactgattaatacaaggttgggtataatggagtacgataa
S2 attgagatcaatgtacggcgggtgctctcccgattggaagacaacgtggg
S3 gcaatcgggatcacaacgtagaattggatgtcaaaataatggagtggcac
S4 gtcaatcgaaaaaacggtggtgagcgcaaagtaaagggattggaccgctt
• Degeneracy tends to occur at specific positions of transcription elements, e.g. the Sp1 binding site YCCGYCCS.
• When no auxiliary data (e.g. orthologous sequences) are used, the accuracy of most motif discovery tools is strongly influenced by the motif degeneracy and the length of the input sequences.
ARRTTYYRSA: high motif degeneracy, weak motif
AAGTTYYRCA: low motif degeneracy, strong motif
• A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. (A degenerate position is a position occupied by a character other than A, G, C or T.)
e.g. ARATTYT is a degenerate (7, 2)-motif
• Degenerate motif discovery problem. Given a set of sequences S = {S1, S2, …, Sm | Si belongs to {A, G, C, T}* for all i} and three nonnegative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S.
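The definitions above can be illustrated with a minimal Python sketch. It only checks whether a given pattern is a degenerate (l, d)-motif and counts how many sequences contain an occurrence; it does not search for motifs. The function names and the `IUPAC` table are illustrative assumptions, not part of the original method.

```python
# Standard IUPAC nucleotide codes mapped to the letters they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def is_degenerate_motif(pattern, l, d):
    """True if pattern has length l, uses IUPAC codes, and has at most d degenerate positions."""
    if len(pattern) != l or any(ch not in IUPAC for ch in pattern):
        return False
    return sum(ch not in "ACGT" for ch in pattern) <= d


def occurs_in(pattern, seq):
    """True if some window of seq matches the IUPAC pattern."""
    seq = seq.upper()
    l = len(pattern)
    return any(
        all(seq[i + j] in IUPAC[pattern[j]] for j in range(l))
        for i in range(len(seq) - l + 1)
    )


def support(pattern, seqs):
    """Number of sequences in seqs that contain at least one occurrence of pattern."""
    return sum(occurs_in(pattern, s) for s in seqs)
```

Under this sketch, a pattern is a solution of the problem when `is_degenerate_motif(pattern, l, d)` holds and `support(pattern, S) >= k`.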
METHODS
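e.g. l = 3, d = 1, k = 4
Wij = ATA
All possible sets of degenerate positions: {p1}, {p2}, {p3}, giving the patterns _TA, A_A and AT_.
For each possible set X = {p1, …, pd} of degenerate positions, all Wpq with V(Wij, Wpq) ⊆ X are collected, where V(Wij, Wpq) is the set of positions at which Wij and Wpq differ.
_TA: ATA (S1), CTA (S2), ATA (S3), CTA (S3), TTA (S4); occurrences in K = 4 sequences
A_A: ATA (S1), ATA (S2), ATA (S3), ACA (S4), AAA (S4), ACA (S5); occurrences in K = 5 sequences
AT_: ATC (S2), ATT (S3), ATA (S3), ATA (S3), AAA (S3); occurrences in K = 2 sequences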
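The collection step can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: `collect` and `sequence_support` are hypothetical names, positions are 0-based (so X = (0,) corresponds to the pattern _TA), and every window of every sequence is scanned by brute force.

```python
from itertools import combinations


def collect(w, seqs, d):
    """For each set X of d positions, gather the words that differ from w only inside X.

    Returns a dict mapping X (a tuple of 0-based positions) to a list of
    (word, sequence number) pairs, with sequences numbered from 1.
    """
    l = len(w)
    groups = {}
    for X in combinations(range(l), d):
        allowed = set(X)
        hits = []
        for s_idx, seq in enumerate(seqs, start=1):
            seq = seq.upper()
            for start in range(len(seq) - l + 1):
                word = seq[start:start + l]
                # Positions where the window disagrees with the reference word w.
                diff = {j for j in range(l) if word[j] != w[j]}
                if diff <= allowed:
                    hits.append((word, s_idx))
        groups[X] = hits
    return groups


def sequence_support(hits):
    """Number of distinct sequences that contribute at least one collected word."""
    return len({s_idx for _, s_idx in hits})
```

A group is worth keeping for a given k only if `sequence_support(groups[X]) >= k`, which mirrors the K values listed for each pattern in the example.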
Lpq = log[(observed probability of p at position q in G(Wij | X)) / Pp]
Background letter probabilities are PA = 0.22, PT = 0.22, PC = 0.28 and PG = 0.28. A negative (p, q)-entry means that the letter p at position q is weakly conserved in G(Wij | X).
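A minimal Python sketch of the log-odds entries follows. The natural logarithm is an assumption, chosen because ln(1/0.22) ≈ 1.51 agrees with the values in the scoring example; assigning minus infinity to unobserved letters is also an assumption, since the slides do not say how zero counts are handled.

```python
import math
from collections import Counter

# Background letter probabilities quoted in the text.
BACKGROUND = {"A": 0.22, "T": 0.22, "C": 0.28, "G": 0.28}


def log_odds_matrix(words, background=BACKGROUND):
    """Return one dict per position q, mapping letter p to log(observed freq / background)."""
    l = len(words[0])
    matrix = []
    for q in range(l):
        counts = Counter(w[q] for w in words)
        column = {}
        for p, bg in background.items():
            freq = counts[p] / len(words)
            # Assumption: a letter never observed at this position scores -inf.
            column[p] = math.log(freq / bg) if freq > 0 else float("-inf")
        matrix.append(column)
    return matrix
```

For the words collected under _TA in the example (ATA, CTA, ATA, CTA, TTA), the fully conserved T at the second position gets a positive entry, while the T observed once at the first position gets a negative one.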
Pseudo occurrence elimination
Motif scoring methods
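s1 = ( Σj ( Σi Lij / pj ) ) / l,
where the inner sum runs over the letters i allowed at position j and pj is the number of such letters.
This score is used to measure the conservation and the significance of each reported motif.
e.g. sum over positions: (1.51+1.51+1.51+1.51+(0.31+0.31)/2+1.51+(0.31+0.82)/2)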
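The score s1 can be worked through with a short Python sketch. It assumes that pj is the number of letters allowed at position j, which is how the bracketed terms in the example read; the function name and the input layout (one list of log-odds values per position) are illustrative.

```python
def s1(per_position_scores):
    """Average over positions of the mean log-odds value of the letters allowed there."""
    l = len(per_position_scores)
    return sum(sum(col) / len(col) for col in per_position_scores) / l


# The seven positions of the example: five single-letter positions and
# two degenerate positions with two allowed letters each.
example = [[1.51], [1.51], [1.51], [1.51], [0.31, 0.31], [1.51], [0.31, 0.82]]
print(round(s1(example), 2))
```

With these values the sum in parentheses is 8.425, so dividing by l = 7 gives a score of about 1.20.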
The measure used for comparison is the performance coefficient |K ∩ P| / |K ∪ P|.
(Pevzner P. A. and Sze, S. H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), 269-278.)
K is the set of positions of the known motif occurrences in the input sequences.
P is the set of predicted positions.
The best performance coefficients among the top ten motifs found by these tools are compared.
Evaluation of performance on synthetic data
S1 atgaccgggatactgattaatacaaggttgggtataatggagtacgataa
S2 attgagatcaatgtacggcgggtgctctcccgattggaagacaacgtggg
S3 gcaatcgggatcacaacgtagaattggatgtcaaccaaagtggagtggcac
Red words in the original figure: the set of positions of the known motif occurrences (K). The other highlighted words: the set of predicted positions (P).
|K ∩ P| = 21, |K ∪ P| = 35, |K ∩ P| / |K ∪ P| = 21/35 = 0.6
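The performance coefficient can be computed at the nucleotide level with a few lines of Python. This is a sketch under the assumption that each occurrence is given as a (sequence id, start, length) triple; the occurrence coordinates below are made up so that the counts reproduce the 21 and 35 of the example.

```python
def positions(occurrences):
    """Expand (sequence id, start, length) triples into a set of (sequence id, position) pairs."""
    return {
        (sid, start + k)
        for sid, start, length in occurrences
        for k in range(length)
    }


def performance_coefficient(known, predicted):
    """|K ∩ P| / |K ∪ P| over nucleotide positions."""
    K, P = positions(known), positions(predicted)
    union = K | P
    return len(K & P) / len(union) if union else 0.0


# Hypothetical occurrences chosen to give |K ∩ P| = 21 and |K ∪ P| = 35.
known = [("S1", 10, 14), ("S2", 5, 14)]
predicted = [("S1", 13, 14), ("S2", 9, 14)]
print(performance_coefficient(known, predicted))
```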
[Figure: performance coefficient (y-axis) against degree of motif degeneracy (x-axis, bins from 10%~15% to 45%~50%) for MotifSeeker, Projection, Consensus, Gibbs sampling and MEME.]
Evaluation of performance on synthetic data
[Figure: performance coefficient, average specificity and average sensitivity of MotifSeeker against degree of motif degeneracy (x-axis, bins from 10%~15% to > 75%).]
Specificity: |K ∩ P| / |P| (lowered by false positives)
Sensitivity: |K ∩ P| / |K| (lowered by false negatives)
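Both measures follow directly from the two position sets, as the Python sketch below shows. K and P are assumed to be sets of positions; the example sets are hypothetical. Note that specificity as defined here is the fraction of predicted positions that are correct.

```python
def specificity(K, P):
    """|K ∩ P| / |P|: fraction of predicted positions that are known motif positions."""
    return len(K & P) / len(P) if P else 0.0


def sensitivity(K, P):
    """|K ∩ P| / |K|: fraction of known motif positions that were predicted."""
    return len(K & P) / len(K) if K else 0.0


# Hypothetical position sets: 30 known positions, 26 predicted, 21 shared.
K = set(range(0, 30))
P = set(range(9, 35))
print(specificity(K, P), sensitivity(K, P))
```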
Evaluation of performance on tissue-specific regulatory elements
• Four liver-specific factors : HNF-1, HNF-3, HNF-4 and C/EBP.
• Each regulon consists of at least five genes. The average length of the analyzed promoter sequences is about 2.5 kbp.
Reference
• Identification of Degenerate Motifs Using Position Restricted Selection and Hybrid Ranking Combination, by C. H. Peng et al., to appear in NAR
彭千華, 鍾允昇 et al.
Proteomics research as a screening tool for chicken flocks
• Lezczynski et al. (1985) Relationship of plasma estradiol and progesterone levels in domestic chicken hens. Poult. Sci. 64, 545.
• Kuo et al. (2005) Proteomic analysis of hypothalamic proteins of high and low egg production strains of chicken. Theriogenology 64, 1490.
• Huang et al. (2006) Analysis of chicken serum proteome and differential protein expression during development in single-comb White Leghorn hens. Proteomics 6, 2217.
[Figure: Oviposition-related biopathway.
Tissues shown: adipocyte, intestine, hypothalamus, pituitary gland, ovary (follicle, oocyte), WBC, liver, kidney, oviduct, egg (egg yolk, egg white, egg shell).
Hormones and factors: GnRH, GHRH, GHIH, PRFs, PIFs; GH, PRL, LH, FSH; leptin; estrogen, progesterone, inhibin, activin, follistatin, BMP; relaxin; IGF-I; LCAT.
WBC-related factors: IL-8, MCP-1, TNF-, MMPs, PA, PGs, elastase, LTB4, PAF, histamine, ROS.
Receptors: GnRH-R, GH-R, ER, LH-R, FSH-R.
Liver products: ApoAI, ApoB, ApoVLDL II; VTG I, VTG II, VTG III; riboflavin BP, retinol-BP; ZP1, 2M, transthyretin; 25-hydroxyvitamin D3-BP.
Kidney: 1,25-dihydroxyvitamin D3, calbindin D28.
Oviduct products: ovalbumin, ovotransferrin, lysozyme; ovomucoid, ovoinhibitor, ovostatin, cystatin; ovocleidin-17, osteopontin, calbindin D28.
External inputs: photostimulation, glucose.]
Serum protein marker
• Ovotransferrin
• Vitellogenin
• Apolipoprotein A-I
• X protein
(an IGF-I like protein)
• Apo VLDL-II
Stage selection
• Exp I: Association of serum protein levels with egg number at two stages
24 wk (initial egg production)
35 wk (peak egg production)
• Exp II: Selection strategy at three stages
14 wk (premature stage)
24 wk
35 wk
Exp I
Fig. 1. Egg production rate of TRFCC (n=157). (A) Total egg number of all hens, (B) hens in four groups (Group I, Group II, Group III, Group IV).
[Plots: egg production rate (%) against week of age, 25 to 48.]
Fig. 3. Association of relative protein levels with total egg number.
[Plots: relative protein level against total egg number, 30 to 130.]
(A) Vitellogenin: 24w (r=0.23, p<0.01); 35w (r=0.53, p<0.01)
(B) Apo A-I: 24w (r=0.14); 35w (r=-0.52, p<0.01)
(C) Ovotransferrin: 24w (r=0.05); 35w (r=-0.02)
(D) X protein: 24w (r=0.103); 35w (r=0.331, p<0.01)
Table 6. Correlation between serum protein levels and total egg number in two batches of TRFCC.
               Batch A                      Batch B
               14w    24w    35w            14w    24w    35w
Apo A-I        0.02   0.07   (-0.55)**      0.16   0.2    (-0.57)**
Apo VLDL-II    0.02   0.15   0.06           -0.15  0.2    0.38**
X protein      0.04   0.05   0.24*          -0.16  0.13   0.46**
Vitellogenin   n.a.   0.03   0.19           n.a.   0.24*  0.65**
* p < 0.05, ** p < 0.01
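A correlation of this kind can be computed as in the Python sketch below. The slides do not state which coefficient was used, so Pearson's r is an assumption here, and the numbers in the usage line are invented for illustration only; they are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented values: relative serum protein level and total egg number per hen.
levels = [1.2, 2.5, 3.1, 4.0, 4.8]
eggs = [60, 75, 90, 100, 120]
print(pearson_r(levels, eggs))
```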
Fig. 1. Egg production rate of batch A (n=77) and batch B (n=78) of TRFCC.
[Plot: egg production rate (0 to 1) against week of age, 25 to 48, for Batch A and Batch B.]
Code-selection
[Diagram: in Batch A and Batch B, each serum protein level is converted to a score and a rank; after transformation these are mapped to regional codes, which are combined into a code.]
Step 2: Transform codes in batch A of birds
Table 2. Serum protein levels in batch A of TRFCC (35 wk).
For each protein the four columns are: score, rank, transformed rank, transformed score.
bird no. | apo A-I                 | apo VLDL-II             | X-protein                | vitellogenin
1        | 2.00  63  63.82  1.62   | 0.92  61  62.61  0.65   | 1.78  24  39.83  1.99    | 5.62  63  64.66  2.58
2        | 1.05  25  25.32  0.93   | 0.44  26  26.68  0.32   | *     n.a. n.a.  n.a.    | 2.71  18  18.47  1.23
3        | 1.08  29  29.38  0.95   | 0.40  21  21.55  0.29   | 1.24  11  18.26  0.62    | 2.5   16  16.42  1.13
4        | 1.33  50  50.65  1.13   | 0.49  31  31.82  0.35   | 1.70  19  31.53  1.79    | 4.97  53  54.39  2.28
5        | 1.18  40  40.52  1.02   | 0.60  38  39.00  0.42   | 1.92  34  56.43  2.35    | 4.83  52  53.37  2.22
6        | 1.41  54  54.70  1.19   | 0.68  45  46.18  0.48   | *     n.a. n.a.  n.a.    | 4.24  41  42.08  1.94
7        | 2.60  68  68.88  2.06   | 0.02  4   4.11   0.03   | *     n.a. n.a.  n.a.    | 1.12  7   7.18   0.49
8        | 0.88  13  13.17  0.80   | 1.53  73  74.92  1.06   | 1.84  28  46.47  2.15    | 6.75  76  78.00  3.11
9        | 2.90  71  71.92  2.28   | 0.22  12  12.32  0.17   | *     n.a. n.a.  n.a.    | 1.2   11  11.29  0.53
10       | 1.19  44  44.57  1.03   | 0.64  42  43.11  0.45   | 2.04  39  64.72  2.66    | 4.97  54  55.42  2.28
:        | :                       | :                       | :                        | :
78       | 4.15  75  75.97  3.19   | 0.02  5   5.13   0.03   | 0.66  7   11.62  -0.86   | 0.96  5   5.13   0.42
* Serum samples were not available.
Rank transformation: R(tB) = R(iA) × (n / m)
[Equation: the transformed score is derived from the batch B scores S(B) and ranks R(B); the exact expression is not recoverable from the extracted slide.]
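The rank transformation in Table 2 can be reproduced with a one-line rescaling, sketched below in Python. The relationship is inferred from the table values rather than stated in the slides: a batch A rank appears to be multiplied by n / m, where n = 78 is the size of batch B and m is assumed to be the number of batch A birds with an available sample for that protein (77 for apo A-I, 47 for X-protein). The function name is hypothetical.

```python
def transform_rank(rank_a, m, n):
    """Rescale a rank among m batch A birds to the scale of n batch B birds.

    Assumption inferred from Table 2: transformed rank = rank * n / m.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    return rank_a * n / m


# Bird 1, apo A-I: rank 63 among 77 birds, mapped onto 78 birds.
print(round(transform_rank(63, 77, 78), 2))
```

The same rescaling matches the X-protein column when m = 47, for example rank 24 maps to 39.83.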
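Conclusions
• Changes in chicken serum protein levels are related not only to egg production but also to environmental influences.
• Converting the protein levels of the two batches into scores and ranks and applying the code transformation reveals regularities in protein levels shared by the two batches, which can then be used for prediction.
• Serum protein levels at 14 weeks are not correlated with egg production, yet regularities that characterise low-producing hens can already be found at this stage.
• With the code-selection method, 19.5% of the birds can be culled at 14 wk; 78.8% of the culled birds belong to the lower-producing 50% of hens.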
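Acknowledgements: collaborators and researchers of the native chicken project
• 動科所: Dr. 林志鴻, Dr. 李文權, Dr. 莊景凱
Native chicken project staff: Ms. 林寶雪, Ms. 陳欣欣, Ms. 陳惠卿, Ms. 陳玉惠, Ms. 陳宛宜, Ms. 林冬梅
• National Chung Hsing University: Prof. 李淵百, Assoc. Prof. 黃三元, Assoc. Prof. 陳志峰
• National Tsing Hua University, Institute of Molecular Medicine: Prof. 劉銀樟
• National Tsing Hua University, Institute of Computer Science: Prof. 唐傳義, Ms. 林沿妊
• 高首企業股份有限公司: Executive Director 黃次洋, Farm Manager 黃士人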
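R&D of blade servers for advanced scientific computing (Quanta industry-academia collaboration)
Subproject 2: Building a cluster computing environment for theoretical physics and bioinformatics
• National Applied Research Laboratories: President 莊哲男
• National Center for High-performance Computing: Dr. 張西亞
• National Center for Theoretical Sciences: Director 張圖南
• Department of Computer Science, National Tsing Hua University: Prof. 唐傳義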
Performance Comparison between InfiniBand (IB) and Gigabit Ethernet (GE) on Quanta Blade Server
• Each blade server contains 10 blades
• Intel EM64T Xeon (Nocona)
– 3.2 GHz with 1 MB of L2 cache
• Each blade contains 4 GB of DDR2 400
• Scientific Linux release 4.2 x86_64 with kernel version 2.6.14.5
• IBG2 2.0.1 driver for IB
Quanta Blade Server
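R&D of bioinformatics applications (1)
• Method development
– Parallel construction of evolutionary trees
– Parallel alignment of three sequences
– Parallel constrained multiple sequence alignment
– Protein secondary structure prediction
– Construction of genome sequence landmarks and alignment of SNP and EST sequences against them
R&D of bioinformatics applications (2)
• Service websites and databases
– Website for parallel evolutionary tree construction
– Website for protein secondary structure prediction
– Website for parallel sequence alignment
– Database for protein secondary structure prediction
– Database of protein sequence and structure information
– Website for genome sequence landmark construction
Method development (3)
• Construction of genome sequence landmarks and alignment of SNP and EST sequences against them
– Under construction
– Human EST: more than 6 million records; SNP: more than 9 million records
– (2002) The human EST alignments compared 1.75×10^9 bases in 3.73×10^6 ESTs against 2.88×10^9 bases of human DNA and took 220 CPU hours on a Linux farm of 800 MHz Pentium IIIs.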
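Second-year research plan (2006/11~2007/7)
• Method development
– Complete the construction of genome sequence landmarks and the alignment of SNP and EST sequences against them
– Test the performance of the methods and provide system-level recommendations
• Service websites and databases (service oriented architectures)
– Complete the parallel sequence alignment website
– Complete the genome sequence landmark website
– Database of protein sequence and structure information
Future directions of the laboratory
• Extend existing applications and connect them with medical informatics
– National Science Council (NSC) post-excellence project "Advanced Technologies and Applications for Next Generation Information and Communication Networks", sub-item 6
– Ministry of Economic Affairs, Department of Industrial Technology, academic technology development project "R&D of Intelligent Sensing Systems, Networks and Application Technologies"
– NSC frontier project "Medical Grid: a grid-based E-health system"
– NSC and Quanta industry-academia collaboration project
– Integrate the medical grid, intelligent sensing systems and bioinformatics for research on human diseases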
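Background
• The Department of Nuclear Medicine at Chang Gung has set up an image bank that holds a large amount of nuclear medicine data from cancer patients, including CT, MRI, PET and SPECT images together with the physician's diagnosis for each image.
• The image data represent the patients' tumour status at different stages.
• In addition to the images and diagnoses in the bank, each patient's medical record in the database can be retrieved to obtain the complete history (personal data, treatment, pathological anatomy, biochemical data).
• We hope to extract valuable information from these histories using data mining techniques.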
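Nasopharyngeal Carcinoma (NPC)
• Regarded as a cancer characteristic of southern Chinese populations.
• According to statistics, the annual incidence in men per 100,000 is 7.7 in Taiwan, 0.63 in the United States and 0.27 in Japan.
• Studies point to three causes: genetic factors, EB virus infection and environmental factors (frequent consumption of pickled foods or salted fish in childhood, air pollution in the workplace, and long-term heavy smoking).
• Common symptoms fall into six groups: a neck mass; unilateral hearing impairment or a feeling of blockage; blood-streaked sputum or blood in nasal discharge; unilateral nasal congestion or increased nasal discharge; headache; and facial numbness or blurred distance vision.
• The human leukocyte antigens (HLA) of patients also show an association with certain specific HLA types.
• Treatment is mainly radiotherapy; late-stage (stage III and IV) or recurrent patients may also need chemotherapy and surgery. A small number of patients relapse after treatment, so regular follow-up is necessary.
• Taiwan's early diagnosis rate and treatment outcomes for NPC are now outstanding and well known worldwide.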
Genome-wide Interpretation: Informatics of Immune Responses - The Concept of Immunometer
Ching-Tai Huang (黃景泰), M.D., Ph.D.
Division of Infectious Diseases, Department of Internal Medicine, Chang Gung Memorial Hospital, Linkou
Immune Tolerance & Immune Activation - Balance between Physiology & Pathology
[Figure: balance between Tolerance and Activation. Antigen sources shown: self antigens, tumours, infectious microorganisms, environmental antigens, transplanted organs.]
Transgenic Mouse Model - Adoptive Transfer System
• Donors: HA-specific TCR transgenic mice
a) CD4+: 6.5 (I-Ed HA110-120)
b) CD8+: clone 4 (Kd HA518-526)
• Transferred cells: pooled splenocytes & lymph node cells
• Recipients: HA-expressing transgenic mice (C3-HALow, C3-HAHigh) and Non-Tg mice
Immune Tolerance & Immune Activation - Dynamic genomic approach (with Affymetrix Gene Chips)
[Figure: RNA collected from naive cells and, on Day 2, Day 3 and Day 4, from anergic/regulatory cells and activated/memory cells.]
Our Aims
• Finding the Immunometer: clustering genes that have similar expression patterns and significant differences in expression.
• Finding the motifs of genes that have similar expression patterns, in order to reconstruct the regulatory relationships between genes.
• Systems biology: combining literature-based gene annotation, gene expression data and signal transduction pathways to find the interface between signal transduction and the regulatory network.
[Figure: NFKB and ICAM1. Two plots, ICAM-1 and ICAM-1_2, of gene expression variation against time (time points 1 to 6).]