COMPUTER DATA ANALYSIS
OF GENOME SEQUENCING
BY TECHNOLOGY ChIP-seq
AND Hi-C
Adviser: Yuri Orlov, ICG SB RAS
Author: Ekaterina Kulakova, bachelor's student
Relevance
Automated sequencing systems can decode DNA sequences up to whole genomes. Complete
genome sequencing leads to avalanche-like growth of sequence information
(megabytes to gigabytes of data).
Methods based on chromatin immunoprecipitation (ChIP-seq, ChIA-PET) provide
qualitatively new data.
This raises new problems in computational genomics, such as the analysis of spatial,
non-linear chromosome structures.
*ChIP-seq = Chromatin ImmunoPrecipitation sequencing
ChIA-PET = Chromatin Interaction Analysis by Paired-End Tag sequencing
Aim and Scientific novelty
Aim: to study chromosomal contacts in the cell nucleus using computer programs, statistical data on genes and chromosomal domains, and analysis of experimental ChIP-seq and Hi-C data.
Scientific novelty:
Integration of modern genome-wide ChIP-seq and Hi-C data, which became available only in the last two or three years.
Use of a user-defined accuracy parameter for position on the chromosome when analyzing the data.
Compilation of a list of genes located at the boundaries of topological chromosomal domains.
Hi-C and ChIA-PET methods*
*ChIP-seq = Chromatin ImmunoPrecipitation sequencing
ChIA-PET = Chromatin Interaction Analysis by Paired-End Tag sequencing
Hi-C = high-throughput chromosome conformation capture
Comprehensive Mapping of Long-Range
Interactions Reveals Folding Principles of the
Human Genome. Science, 2009
Figure: arrangement of chromosomes in the cell nucleus (reconstructed from Hi-C data).
Figure: scheme of local chromosomal domains ("tangle" contacts); labels: separate loops, "tangle" (Dixon et al., 2012).
Topological arrangement of chromosomal domains and its mapping onto the genome
Figure: scheme of the arrangement of genes on a chromosome. Genomic data shown: genes, ChIP-seq peaks, ChIA-PET contact regions.
Figure labels: genes; plot of ChIA-PET chromosomal contacts; chromosomal domain; peaks of ChIP-seq profiles.
File formats and their presentation
track name=ER_E2 description=ER_E2
chr1 557112 558114
chr1 559459 560286
chr1 998864 999397
chr1 999399 999604
chr1 1004343 1005146
chr1 1070346 1071080
chr1 1305474 1306502
chr1 1358287 1358744
chr1 1776987 1777750
chr1 1820476 1821168
chr1 1922754 1923628
chr1 2131962 2132747
chr1 2325805 2326447
chr1 2368996 2369977
chr1 3119829 3120541
chr1 3244610 3245121
…
BED file example
A single file with a genomic profile ranges in size from 100 MB to 2-3 GB.
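As an illustration of how such a file can be read, here is a minimal Python sketch (the program described in this talk is written in Java; Python is used here only for brevity). The names `Interval` and `parse_bed` are hypothetical, and the sketch assumes only the first three whitespace-separated columns (chromosome, start, end) are needed.

```python
from typing import Iterable, Iterator, NamedTuple


class Interval(NamedTuple):
    """One BED record: chromosome name and start/end coordinates."""
    chrom: str
    start: int
    end: int


def parse_bed(lines: Iterable[str]) -> Iterator[Interval]:
    """Yield intervals one at a time, skipping header and comment lines."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and BED header lines ("track", "browser", "#").
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        yield Interval(fields[0], int(fields[1]), int(fields[2]))


# The first records of the example above.
example = """track name=ER_E2 description=ER_E2
chr1 557112 558114
chr1 559459 560286
chr1 998864 999397
""".splitlines()

peaks = list(parse_bed(example))
total_length = sum(p.end - p.start for p in peaks)
```

Because `parse_bed` is a generator, it can also be fed an open file object and will process it line by line, which matters for files of several gigabytes.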
RefSeq annotation was taken from the UCSC Genome
Browser:
http://genome.ucsc.edu/cgi-bin/hgTables
Data on domains in mouse cells were
obtained in the laboratory of O.L. Serov (ICG
SB RAS) (Fib_domains, Sp_domains).
Calculation of the position of genes and
domain boundaries
A1 is the left coordinate of the gene, B1 is the right coordinate of the gene.
A2 is the left coordinate of the domain, B2 is the right coordinate of the domain.
E is the accuracy, user-defined.
If (|A1 - A2| <= E) and (B1 < A2 + (B2 - A2)/2), we assume that the gene
lies close to the left boundary of the domain. An analogous condition is used for the right
boundary.
Figure: diagram of a gene (A1, B1) and a domain (A2, B2) with the accuracy window E.
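The boundary rule can be sketched in a few lines of Python. The left-boundary test follows the condition on the slide. The right-boundary test is an assumption: it is written here as the mirror image of the left one, because the slide only says the condition is similar. The function name `classify_gene` is hypothetical.

```python
from typing import Optional


def classify_gene(a1: int, b1: int, a2: int, b2: int, e: int) -> Optional[str]:
    """Return 'left' or 'right' if gene [a1, b1] lies near a boundary of
    domain [a2, b2] within accuracy e, otherwise None."""
    midpoint = a2 + (b2 - a2) / 2
    # Left boundary: condition as stated on the slide.
    if abs(a1 - a2) <= e and b1 < midpoint:
        return "left"
    # Right boundary: ASSUMED mirror image of the left-boundary condition.
    if abs(b1 - b2) <= e and a1 > midpoint:
        return "right"
    return None


# Made-up coordinates for illustration only.
near_left = classify_gene(1000, 3000, 1200, 101200, 500)
near_right = classify_gene(99000, 101000, 1200, 101200, 500)
far_inside = classify_gene(40000, 42000, 1200, 101200, 500)
```

The midpoint test keeps a gene from being assigned to a boundary when it extends past the middle of the domain.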
Example of the location of chromosomal domains and genes on mouse chromosome 10
The linear arrangement of genes in the domain
Table of gene location types in chromosomal domains
Other – other genes
Inside – genes that lie within the domains
onBorder – genes lying on the domain
boundaries.
Analysis of the location of a gene set relative to
the domains in different cell types
The user specifies a list of genes. It is also possible to analyze all the genes in the genome
(20,000 genes).
Cell types: mouse embryonic stem cells (fibroblasts, Fib) and sperm (Sp).
Hi-C experiment, ICG SB RAS.
Cell type                        Inside domains   On border   Other
Sp (densely packed structure)    92.5%            1.4%        6.1%
Fib (open chromatin)             72.6%            3.2%        24%
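Percentages of this kind can be computed from per-gene labels as in the following sketch. The label names follow the table of location types (Inside, onBorder, Other). The function name `location_percentages` and the input labels are made up for illustration and are not the experimental data.

```python
from collections import Counter
from typing import Dict, Sequence

CATEGORIES = ("Inside", "onBorder", "Other")


def location_percentages(labels: Sequence[str]) -> Dict[str, float]:
    """Return the share of genes in each location category, in percent."""
    if not labels:
        raise ValueError("the gene list is empty")
    counts = Counter(labels)
    total = len(labels)
    return {c: round(100 * counts[c] / total, 1) for c in CATEGORIES}


# Hypothetical labels for ten genes.
labels = ["Inside"] * 7 + ["onBorder"] * 1 + ["Other"] * 2
shares = location_percentages(labels)
```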
Experimental data.
Gene Ontology categories
Genes lying on the domain boundaries
were taken for analysis.
The result was sorted by the number of
genes sharing a common biological process
category.
Online resource used: DAVID,
http://david.abcc.ncifcrf.gov/
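The sorting step can be illustrated as follows. This is only a sketch of counting genes per category and ordering the result. It does not reproduce the statistics that the online resource computes, and the gene names, category names and the function `rank_categories` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_categories(gene_to_terms: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Count genes per category and sort by count, largest first."""
    term_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_genes[term].add(gene)
    # Ties are broken alphabetically so the order is reproducible.
    return sorted(
        ((term, len(genes)) for term, genes in term_genes.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Made-up annotations for illustration only.
annotations = {
    "geneA": ["transcription", "cell cycle"],
    "geneB": ["transcription"],
    "geneC": ["transcription", "transport"],
    "geneD": ["cell cycle"],
}
ranking = rank_categories(annotations)
```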
Analysis of the co-expression of genes lying on the borders of spatial domains
Genes located on the domain boundaries were taken for analysis.
Online resource used: STRING, http://string-db.org/
The main result is graphs of gene networks with different degrees of
connectivity for the two cell types.
Fib
698 – total number of genes on the domain boundaries
88 – genes involved in connections
160 connected pairs
12% of the total number of genes
Sp
314 – total number of genes on the domain boundaries
13 – genes involved in connections
10 connected pairs
4% of the total number of genes
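Summary numbers of this kind (total genes, connected genes, connected pairs, share of connected genes) can be derived from a list of network edges as sketched below. The sketch assumes that a gene counts as "involved in a connection" when it appears in at least one edge whose two ends are both boundary genes. That definition, the function name `network_summary` and the gene identifiers are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple


def network_summary(boundary_genes: Iterable[str],
                    edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Summarize connectivity among boundary genes from an edge list."""
    genes = set(boundary_genes)
    # Keep undirected edges between two different boundary genes, once each.
    pairs = {
        frozenset((a, b))
        for a, b in edges
        if a in genes and b in genes and a != b
    }
    connected = set().union(*pairs)
    percent = 100 * len(connected) / len(genes) if genes else 0.0
    return {
        "total": len(genes),
        "connected": len(connected),
        "pairs": len(pairs),
        "percent_connected": percent,
    }


# Made-up genes and edges; "x" is not a boundary gene, so its edge is ignored.
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_edges = [("g1", "g2"), ("g2", "g1"), ("g2", "g3"), ("g4", "x")]
summary = network_summary(toy_genes, toy_edges)
```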
Conclusion
A Java program was implemented.
The program was applied to experimental data (ICG SB RAS
and databases of chromosome contacts).
The location of gene sets in chromosomal domains was analyzed (with computer simulation as a control).
Next Steps
Identify domains containing pluripotency genes in the mouse genome (Dixon
et al., 2012).
Make the developed project compatible with other programs designed at ICG SB RAS for microarray data and written in Java and C/C++.
Integrate the program with gene expression data from the BioGPS
microarray database for the human genome.
Thank you for your attention!
Publications (abstracts)
Safronova N.S., Kulakova E.V., Orlov Yu.L. (2013) Applications of text complexity measures to genome sequences analysis // Proceedings of GIW-2013, National University of Singapore, 16-18 Dec 2013. P. 42.
Medvedeva I.V., Vishnevsky O.V., Kulakova E.V., Spitsina A.M., Afonnikov D.A., Kochetov A.V., Orlov Yu.L. (2014) Genomic organization and context characteristics of genes with elevated expression in brain cells // XVI All-Russian Scientific and Technical Conference "Neuroinformatics-2014": Collection of scientific papers. Moscow: NRNU MEPhI. Part 2, pp. 32-42.
Kulakova E.V., Bryzgalov L.O., Orlov Y.L., Li G., Ruan Y. Computer analysis of chromosome contacts revealed by sequencing // BGRS\SB-2014 conference (Bioinformatics of Genome Regulation and Structure\System Biology).
Kulakova E.V., Podkolodnaya O.A., Serov O.L., Orlov Y.L. Computer data analysis of genome sequencing by technology ChIP-seq and Hi-C // BGRS\SB-2014 conference (Bioinformatics of Genome Regulation and Structure\System Biology). P-90.
Kulakova E.V. Computer analysis of genome sequencing data from ChIP-seq and Hi-C technology // MNSK-2014 conference (International Scientific Student Conference). P. 207.
Spitsina A., Kulakova E.V., Safronova N., Orlova N.G. Statistical analysis of gene expression data by rank correlation coefficients // BGRS\SB-2014 conference (Bioinformatics of Genome Regulation and Structure\System Biology). P-91.