
BioInformatics Consultation Practice 3 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account:...

Date post: 11-Jan-2016

BioInformatics Consultation Practice 3

Gábor Pauler, Ph.D.

Tax.reg.no: 63673852-3-22
Bank account: 50400113-11065546

Location: 1st Széchenyi str., 7666 Pogány, Hungary
Tel: +36-309-015-488

E-mail: [email protected]


Content of the Practice
- Fragment processing:
  - Restriction site database: WebCutter
  - Primer cleaning: SMS2 DNA Pattern
  - Vector cleaning: NCBI VecScreen
  - Fragment assembly: CAP3
- Auxiliary sequence operations: SMS2
  - GUI
  - Conversion operations
  - Sequence analysis
  - Sequence mapping
  - Random sequences
- Uploading sequences: EBI
  - Registration
  - Upload auxiliary data
  - Upload sequence
- Data Import/Export/Conversion operations: Excel, Access
  - Text file formats
  - Converting text file formats
  - HTML tables and wide text
  - Text to Excel
  - From Excel to Text, HTML, Picture Metafile, Bitmap, Access tables
- Home Assignment 3: Fragment clean and match
- References


Fragment processing: Restriction maps: WebCutter: Input
- The first task in cloning where bioinformatics is heavily involved is pre-processing:
  - Selecting restriction enzymes
  - Forecasting restriction sites when the cloned sequence is known
- To perform these tasks we need a restriction mapping tool based on a database of restriction enzymes
- We will use WebCutter for this purpose (http://rna.lundberg.gu.se/cutter2/index.html)
- At the Start Screen:
  - Sequence title: title of the analysis
  - DNA sequence box: paste the examined nucleotide sequence in FASTA format through the clipboard (max. 50000 chars)
  - Type: type of analysis
    - Linear: linear DNA
    - Circular: circular DNA (e.g. in plasmids)
    - Silent mutagenesis: also reports sites that could be introduced by mutations which do not change the encoded protein
  - Display options: results can be displayed in graphic or tabular format, ordered by nucleotide position or enzyme name
  - Enzymes: can be filtered by:
    - Least and most number of cuts
    - Length of the recognition site in bases (as length influences specificity: longer sites occur less often)
    - Enzyme name list (multiple selection with Ctrl+Click)
  - Press Analyze sequence to run


Fragment processing: Restriction maps: WebCutter: Output
- Character-based restriction map by base positions:
  - This is great for manual processing and for predicting the length of possible fragments
  - However, it is hard to process automatically when many fragment lengths have to be computed
- Tabular list of restriction sites:
  - It contains enzyme names, the number of sites, the list of site coordinates and the recognized sequence, with GCG masked nucleotide codes at uncertain (ambiguous) positions
  - It can be copied into Excel for more detailed fragment length forecasts (see later)
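The fragment length forecast mentioned above can also be scripted once the site coordinates are copied out of the tabular list. The following is a minimal sketch, not WebCutter's own method: the function name `fragment_lengths` is hypothetical, and it assumes each listed coordinate is a 1-based position after which the strand is cut.

```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Return fragment lengths for cuts at the given 1-based positions.

    Assumption of this sketch: a cut at position p falls right after base p.
    Real enzymes cut at an enzyme-specific offset inside their site.
    """
    # Keep only cuts that actually split the molecule, without duplicates.
    cuts = sorted({p for p in cut_positions if 0 < p < seq_length})
    if not cuts:
        return [seq_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # On a circle the piece spanning the origin joins last and first cut.
        return inner + [seq_length - cuts[-1] + cuts[0]]
    return [cuts[0]] + inner + [seq_length - cuts[-1]]


if __name__ == "__main__":
    print(fragment_lengths([44, 160], 180))                 # linear molecule
    print(fragment_lengths([44, 160], 180, circular=True))  # plasmid
```

For a linear molecule n cuts give n+1 fragments, while on a circular one they give n, which is why the Linear/Circular choice on the input screen matters for the forecast.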


Fragment processing: Primer cleaning: SMS2 DNA Pattern
- In post-processing the sequenced fragments, the first task is to eliminate the primer sequence, as it can confuse further analysis
- As primers are at the very beginning of fragment sequences, they are usually already eliminated in chromatogram analysis, since recognition of the initial sequence is uncertain most of the time
- But if they have not been eliminated yet, we can use SMS2 DNA Pattern (http://www.bioinformatics.org/sms2/dna_pattern.html) to do it:
- At the Start Screen:
  - Raw sequence: paste one or more nucleotide sequences in FASTA format (max. 50000 chars)
  - Search pattern: sequence of the primer. We can give alternative bases for one position in brackets: [AT]. We assume here that the sequence of the primer(s) used is known!
  - Submit button: run
- At the Output Screen:
  - It gives the coordinates of matching sequences on both strands of the DNA!

    Results for 180 residue sequence "sample sequence one" starting "ttaaggaccc"
    >match number 1 to "ctt[ca]" start=68 end=71 on the reverse strand
    ctta
    >match number 2 to "ctt[ca]" start=2 end=5 on the reverse strand
    ctta
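The pattern search described above can be approximated with regular expressions, since the bracket notation ([AT], [ca]) is valid regex syntax. This is only a sketch under assumptions: `find_pattern` is a hypothetical helper, only unambiguous bases are complemented, and reverse-strand hits are mapped back to forward-strand numbering, which is consistent with the sample output above for the first ten bases of the sample sequence.

```python
import re

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pattern(seq, pattern):
    """Return (strand, start, end, match) tuples, 1-based inclusive.

    Coordinates of reverse-strand hits are reported in forward-strand
    numbering (an assumption of this sketch).
    """
    hits = []
    for m in re.finditer(pattern, seq, flags=re.IGNORECASE):
        hits.append(("direct", m.start() + 1, m.end(), m.group()))
    n = len(seq)
    for m in re.finditer(pattern, reverse_complement(seq), flags=re.IGNORECASE):
        # Position i on the reverse complement is position n - i + 1 forward.
        hits.append(("reverse", n - m.end() + 1, n - m.start(), m.group()))
    return hits


if __name__ == "__main__":
    for hit in find_pattern("ttaaggaccc", "ctt[ca]"):
        print(hit)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping primer sites would need a lookahead-based pattern.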


Fragment processing: Vector cleaning: NCBI VecScreen: Input
- Compared to primers, cleaning vector sequences out of fragments is a more cumbersome task:
  - Vector sequences are longer
  - They can usually occur at both the beginning and the end of fragments
  - Vectors are usually used for multiple purposes and contain feature-rich sites
- So vector contamination can totally confuse any further analysis if it is left in the fragment sequence!
- We will use NCBI’s VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html):
- At the Start Screen:
  - Sequence box: paste the analyzed nucleotide sequence in FASTA format (max. 50000 chars)
  - Run VecScreen button: it matches the sequence against the vectors stored in NCBI’s UniVec database (ftp://ftp.ncbi.nih.gov/pub/UniVec/)
- At the Output Settings Screen:
  - Graphic output
  - Sequence retrieval: display the cleaned sequence
  - View report button: go to the output


Fragment processing: Vector cleaning: NCBI VecScreen: Output
- At the Output Screen:
  - At the top, we can see a graphic map overview of the matching vector parts
    - Different match strengths are coded with colored regions
  - Below it, there is a text list of matching vector sequences with:
    - Data of the vector
    - Matching statistics: ratio of identities and gaps
    - Detailed character maps of the matches


Fragment assembly: Basic definition, CAP3: Input
- There is a limitation in PCR that regular DNA polymerases work only on parts of roughly 500-1000 base pairs, and most sequencing techniques also have serious length limitations
- So, longer sequences can be assembled only from cloned fragments, which usually have 50-100 base pairs of overlap at their ends
- However, restriction sites are not distributed evenly in the genome, which may disturb overlapped assembly. That's why we use restriction maps when designing the cloning.
- Once the clone fragments are sequenced and cleaned of primer and vector sequences, we need software which Assembles (Összeszerel) the fragments: it finds ca. 100 matching base pairs between the beginning/end sequence of one fragment and the end/beginning sequence of another fragment or of its reverse complement.
- After assembly of the fragments, we will have the Contig (Kontig): the longest possible consensus sequence assembled

We use CAP3 software (http://pbil.univ-lyon1.fr/cap3.php) for fragment assembly:
- At the Start screen:
  - Sequence box: paste the fragment DNA sequences here in FASTA format, one after another
  - Submit button: run
- At the Output screen we get a menu of outputs:
  - Contigs: sequence of the longest resulting contig(s) (ideally there should be one) in FASTA format:

    >Contig1
    TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCTATCCTTATTTTTTGTTTATATTTTAGTACGAAAGGACCAAGTATTTTAAATAATTTATTTTAT


Fragment assembly: CAP3: Output
- Assembly details:
  - It gives the sequence pairs matched during contig assembly:
    - + denotes a fragment sequence as given
    - - denotes the reverse complement of a fragment sequence
  - Below them it gives the consensus sequence:

    Number of segment pairs = 2; number of pairwise comparisons = 1
    '+' means given segment; '-' means reverse complement

    Overlaps            Containments  No. of Constraints Supporting Overlap

    ******************* Contig 1 ********************
    2006-ISO-TD1-
    2006-ISO-16S+

    DETAILED DISPLAY OF CONTIGS
    ******************* Contig 1 ********************
                      .    :    .    :    .    :    .    :    .    :    .    :
    2006-ISO-TD1- TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCTATC
    2006-ISO-16S+                                GACCGGCGTGAGCCAGGTCGGTTTCTC-C
                  ____________________________________________________________
    consensus     TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCTATC
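The idea of joining fragments through their overlapping ends can be illustrated with a toy example. This is a minimal sketch and not the CAP3 algorithm: it accepts exact matches only (CAP3 tolerates mismatches and gaps, as the alignment above shows), and the helper names `merge_by_overlap` and `assemble_pair` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def merge_by_overlap(left, right, min_overlap=10):
    """Join two reads if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first; returns None if no exact
    overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


def assemble_pair(a, b, min_overlap=10):
    """Try both orders, with `b` as given and reverse complemented."""
    for candidate in (b, reverse_complement(b)):
        for x, y in ((a, candidate), (candidate, a)):
            merged = merge_by_overlap(x, y, min_overlap)
            if merged is not None:
                return merged
    return None


if __name__ == "__main__":
    first = "TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTG"
    second = reverse_complement("GACCGGCGTGAGCCAGGTCGG")
    print(assemble_pair(first, second))
```

A real assembler scores approximate overlaps and builds a consensus from many reads; the sketch only shows why cleaned ends and sufficient overlap length matter.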


Auxiliary sequence operations: SMS2: GUI
- Before uploading and further analysis of the assembled contig sequences, we may need certain transformations and format conversions, called sequence manipulation. There is an easy-to-use, comprehensive toolkit for this:

Sequence Manipulation Site (SMS2) (http://www.bioinformatics.org/sms2/index.html):

Graphic User Interface (GUI): all SMS2 tools share a very similar user interface:
- Left menu: we can choose the required operation from the hierarchically ordered list
- At the Start screens:
  - Top: we can see the explanation of the operation
  - Sequence box: we can paste the input sequence here in FASTA format (or several sequences consecutively, if the current operation requires it)
    - There is always a suitable example nucleotide/protein input sequence in the box, making it easier to try out the tools!
  - Below: the parameter settings of the current operation
  - Submit button: run
- At the Output Screens:
  - Outputs are partly graphic, partly in text format or HTML tables, depending on the operation:

    Site:             Positions:
    AatI agg|cct      none
    AatII gacgt|c     160
    Acc16I tgc|gca    none
    AccII cg|cg       44


Auxiliary sequence operations: SMS2: Format conversion operations
- Split/Combine FASTA: cuts a longer continuous FASTA sequence into rows of standard length, or concatenates several shorter FASTA sequences into one
- EMBL/GenBank to FASTA: converts an EMBL/GenBank record to a FASTA sequence
- EMBL/GenBank Feature Extractor: extracts the exons from an EMBL/GenBank record and assembles them into cDNA, based on the record's feature table
- EMBL/GenBank Trans Extractor: extracts the possible translated proteins from an EMBL/GenBank record in FASTA format (considering alternative splicing)
- Filter DNA/Protein: cleans illegal characters from a FASTA formatted DNA/Protein sequence (except N, which denotes uncertain sequencing in DNA)
- OneToThree/ThreeToOne: converts FASTA formatted protein sequences between 1-char and 3-char coding format, where * and *** respectively denote uncertain sequencing or translation
- Window Extractor DNA/Protein: extracts a window from a FASTA formatted DNA/Protein sequence, given the center position coordinate and the width of the window
- Range Extractor DNA/Protein: extracts multiple ranges from a FASTA formatted DNA/Protein sequence, given by comma separated coordinates or coordinate ranges, e.g.: 1,2,3..15,END
  - It can concatenate them into one FASTA or split them into FASTA files of equal length
- Reverse Complement: computes the reverse (5’-3’ to 3’-5’), the complement (A↔T, C↔G) or the reverse complement of a FASTA formatted DNA sequence
  - Complements of mask characters denoting uncertain sequencing are taken from the GCG code table!
- Split Codons: a coding DNA sequence given in FASTA format is understood as an uninterrupted sequence of triplets/codons, and it is split into 3 sequences by in-codon position (1,2,3), e.g.: from sequence ATGATGATG we get 3 sequences: AAA, TTT, GGG
  - It is used solely in codon position statistical analysis
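Two of the operations above are simple enough to sketch in a few lines. This is an illustration only, not SMS2's code: the function names are hypothetical, and the complement table follows the standard IUPAC ambiguity codes (R↔Y, K↔M, B↔V, D↔H, with S, W and N unchanged), which is assumed here to be what the GCG code table refers to.

```python
# Complement table including ambiguity (mask) codes, upper and lower case.
_FROM = "ACGTRYKMBDHVSWN"
_TO = "TGCAYRMKVHDBSWN"
COMPLEMENT = str.maketrans(_FROM + _FROM.lower(), _TO + _TO.lower())


def complement(seq):
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    return complement(seq)[::-1]


def split_codons(seq):
    """Split a coding sequence into three strings by in-codon position."""
    return [seq[i::3] for i in range(3)]


if __name__ == "__main__":
    print(reverse_complement("ATGR"))
    print(split_codons("ATGATGATG"))
```

The slice `seq[i::3]` takes every third base starting at offset i, which is all that splitting by codon position requires when the sequence starts in frame.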


Auxiliary sequence operations: SMS2: Sequence analysis operations
- Restriction Digest: simulates the restriction of a longer DNA sequence given in FASTA format with a restriction enzyme selected from the SMS database (or with its binding site sequence given manually):
  - It computes the list of possible fragment sequences and writes them into one text file as consecutive FASTA records for further processing
- Restriction Summary: the same as above, except that it gives not the fragments themselves but a statistical summary table of their properties
- PCR Primer Stats: for designed primer sequences given in FASTA format, it forecasts:
  - Melting temperature (important for PCR temperature programming)
  - Complementarity or partial complementarity (considerably complementary primers bind to each other instead of the cloned DNA strand, reducing PCR efficiency)
  - For linear or circular DNA
- PCR Products: simulates PCR of a DNA sequence given in FASTA format:
  - Using the selected or manually entered opening/closing primers
  - Prepares a list of the expected PCR product sequences in one text file as consecutive FASTA records
- ORF Finder: in a DNA sequence given in FASTA format, it searches for Open Reading Frames (ORF): sequence parts bordered by stop codons, on 2 DNA strands × 3 codon starting positions (1,2,3) = 6 reading frames. It is used to find possible coding parts of a DNA
  - Gives the list of ORFs in one file as consecutive FASTA records
  - Gives a summary table of their length and position
- CpG Island: in a DNA sequence given in FASTA format, it searches for CG-dimer rich "islands": they usually occur at the 5’-end of genes in vertebrates (gerincesek)
  - Gives a summary table of the length and position of the CG-islands
- Translate/Reverse translate: translates FASTA DNA to FASTA 1-char coded protein, or translates a protein back to the most likely cDNA sequence based on the selected species' Codon Usage Table: the probability of the alternative codons of each amino acid in the species
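The six-reading-frame search described for ORF Finder can be sketched as follows. This is a simplified illustration under assumptions, not SMS2's implementation: it uses the standard stop codons TAA, TAG and TGA, treats every stop-free codon run as a candidate (no start codon is required), and the names `stop_free_runs` and `six_frame_orfs` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def stop_free_runs(seq, frame, min_codons=1):
    """Return 0-based half-open (start, end) spans without stop codons.

    `frame` is the offset (0, 1 or 2) of the first codon in `seq`.
    """
    seq = seq.upper()
    runs = []
    run_start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if (pos - run_start) // 3 >= min_codons:
                runs.append((run_start, pos))
            run_start = pos + 3  # next run starts after the stop codon
        pos += 3
    if (pos - run_start) // 3 >= min_codons:
        runs.append((run_start, pos))
    return runs


def six_frame_orfs(seq, min_codons=1):
    """Map (strand, frame 1..3) to the runs found in that reading frame.

    Spans on the reverse strand refer to the reverse-complemented string.
    """
    result = {}
    for strand, text in (("direct", seq), ("reverse", reverse_complement(seq))):
        for frame in range(3):
            result[(strand, frame + 1)] = stop_free_runs(text, frame, min_codons)
    return result


if __name__ == "__main__":
    for key, runs in six_frame_orfs("ATGAAATAGCCC").items():
        print(key, runs)
```

Raising `min_codons` filters out the many short runs that occur by chance, which is usually necessary before the output resembles a list of plausible coding regions.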


Auxiliary sequence operations: SMS2: Other operations
- Sequence mapping operations:
  - Primer map: in a DNA sequence given in FASTA format, it prepares a graphic map of the binding sites of a given list of primers
    - It also gives a summary table of the site coordinates and primer names
  - Restriction map: the same as above, just for restriction enzymes
  - Translation map: in a DNA sequence given in FASTA format, it translates all 6 possible reading frames into FASTA 1-char coded protein sequences
    - The valid codon table can be selected (the default is genomic (not mitochondrial), and Standard for vertebrates)
    - It assumes that the DNA contains only coding parts; there should be no introns
- Random sequence generation operations:
  - Random DNA/cDNA/Protein: random DNA/cDNA/Protein sequences for:
    - Simulation or trying out other software, or
    - Making unprepared students really cry at the sequence analysis computer lab exam! Wohahaha, Yeah!
  - Mutate/Shuffle DNA/Protein: in a DNA sequence given in FASTA format, it creates flip/insert/shuffle mutations
  - Random DNA/Protein regions: in a DNA/Protein sequence given in FASTA format, it randomizes the regions given by comma separated coordinates or coordinate ranges, e.g.: 1,2,3..15,END
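Random test sequences like the ones described above are easy to produce locally as well. The following is a minimal sketch with hypothetical function names, assuming equal base probabilities and substitution-only mutations; SMS2's own generators may use different settings.

```python
import random


def random_dna(length, seed=None):
    """Return a random DNA string; a fixed seed makes it reproducible."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def point_mutate(seq, n_mutations, seed=None):
    """Substitute `n_mutations` distinct positions with a different base."""
    rng = random.Random(seed)
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_mutations):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


if __name__ == "__main__":
    original = random_dna(60, seed=1)
    print(original)
    print(point_mutate(original, 3, seed=2))
```

Passing a seed is useful when the same artificial sequence has to be regenerated later, for example to rerun a test of another tool.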


Uploading sequences: EBI: Registration, Upload auxiliary data 1
- After cleaning and assembling the fragments, we now have a nice sequence that we would like to share with other researchers
- For this purpose, we will use EBI’s interface (http://www.ebi.ac.uk/embl/Submission/index.html)
- At the Registration&Login Screen:
  - Register: register to the EBI database first:
    - Give your personal data and press Save
    - Then you will receive a validation e-mail at the address you gave, where you should click a link to validate your registration
  - After that you can log in by giving your e-mail and password and pressing the Log in button
- At the Function Select Screen we have to select:
  - The Submit sequences option button
  - At the Here link, we get a utility to check whether any vector contamination is left in the sequence: it uses EBI’s BLASTN nucleotide alignment tool to check for contamination in a FASTA formatted DNA sequence, based on the EMVEC vector database
- At the Sequence Type Select Screen: we can give the type of the uploaded sequence, e.g.:
  - WGS (Unannotated): whole genome shotgun
  - EST: Expressed sequence tags
  - We can upload data much faster if it is preformatted; in that case select EMBL, MIENS, etc.
- At the Valid From Date Screen: we can set whether the sequence is shown immediately or with a delay. Delayed release is important when you want to be able to prove later that you submitted first, but do not want other researchers to access the sequence until your paper is published


Uploading sequences: EBI: Upload auxiliary data and sequence
- At the Publication Reference Screen:
  - Citation type: published/unpublished journal article, etc.
  - Title, Year, Journal name
  - Authors: Initials, Surname
  - In case of multiple publications you can return to this screen and add more
- At the Auxiliary Info Selection Screen: we can select what kind of environmental info will be attached to the sequence:
  - Organism, Organelle
  - Strain, Isolate
  - Contig name
- At the following Auxiliary Info screen: we can enter the previously selected auxiliary data
- At the Validation Screen: it checks the internal logical dependencies of the data. When the Validate button is pressed, it looks up Organism and Organelle in the EBI databases. If everything is OK, then:
- At the Sequence Upload Screen: you can upload the sequence in FASTA format


Data Import/Export/Conversion: Introduction, Text file formats
- Most bioinformatic software receives its input and gives back its output in text files (FASTA, EMBL and GenBank are all text files)
- The problem is that they also output sizeable table-like results (e.g. restriction site lists) in text files or HTML tables, which we would like to transfer effectively to Spreadsheets (Táblázatkezelő) (Excel) or Databases (Adatbáziskezelő) (Access) for advanced analysis.
- By learning some simple tricks and techniques, one can avoid days of manual work that eat time from research, and solve things in 5 minutes!

Text file formats: to describe tables in text files, software uses alternative methods:

- Fixed column width tables: this is the most popular, but also the worst:
  - All columns of the table have their fixed character width
  - Data content cannot be longer than the column width. If it is shorter, the excess space is filled with Space (ASCII 32) chars
  - Viewed in a word processor (Szövegszerkesztő), the columns look nice and aligned (assuming that the text is in the fixed width Courier New font)
  - Sometimes it does not contain column names, or only in abbreviated form, as they may not fit in the same number of characters as the data content

    AA   BB   CC
    6.45 5.5  7.35
    15.6 17.8 3.2

- Column delimiter symbol-based tables: less frequent, but better:
  - Columns are separated by a given delimiter symbol (Elhatárolójel): space, comma, colon or semicolon ( _ , : ; )
  - So viewing the file in a word processor, we can see a bunch of them
  - But the columns do not look nice and aligned, as their data content can be of quite variable length
  - In exchange, the first line can contain column names of any length

    AA,BB,CC
    6.45,5.5,7.35
    15.6,17.8,3.2
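The difference between the two formats shows clearly when they are read programmatically. The sketch below parses the same small table in both layouts; the function names and the column widths (5 characters each) are assumptions chosen for this example.

```python
def parse_fixed_width(text, widths):
    """Cut every line at fixed character widths and strip the padding."""
    rows = []
    for line in text.splitlines():
        row, pos = [], 0
        for width in widths:
            row.append(line[pos:pos + width].strip())
            pos += width
        rows.append(row)
    return rows


def parse_delimited(text, delimiter=","):
    """Split every line at the delimiter symbol."""
    return [line.split(delimiter) for line in text.splitlines()]


FIXED = "AA   BB   CC\n6.45 5.5  7.35\n15.6 17.8 3.2"
DELIMITED = "AA,BB,CC\n6.45,5.5,7.35\n15.6,17.8,3.2"

if __name__ == "__main__":
    print(parse_fixed_width(FIXED, [5, 5, 5]))
    print(parse_delimited(DELIMITED))
```

The fixed-width reader needs the column widths from outside the file, while the delimited reader needs only the delimiter, which is one practical reason the delimited form is easier to exchange.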


Data Import/Export/Conversion: Text file formats 2
- There are different subspecies of delimiter symbol-based formats:
- Comma Separated Values, CSV:
  - Popular among USB-connected instruments
  - However, in German and Hungarian we use comma as the decimal separator instead of dot, so it can be confused with the column separator. Also, text data content can contain commas
  - Therefore text data can be put between Text Markers (””):

    ”AA”,”This,not delimiter!”,”CC”
    6.45,5.5,7.35
    15.6,17.8,3.2

- Space (ASCII 32) delimited format:
  - This is also a very popular format
  - One serious issue is that it is very easy to mix up with the fixed column width format, which prevents automatic processing:
    - If the columns are not aligned with spaces in all rows, it cannot be processed as fixed format
    - While the space-delimited format understands two consecutive spaces as a Null (Üres)-valued field, messing up the columns: e.g. if there are 2 spaces before 7.35, this will be the bad result:

      Input:  6.45 5.5  7.35
      Result: 6.45 | 5.5 | (empty) | 7.35

  - Such a messed-up text file can be corrected in Word: select the text with Shift+Cursor, launch the Edit|Find/Replace (Szerkesztés|Keresés/Csere) menu and replace two consecutive spaces (__) with one (_), using the Replace all (Összes cseréje) button. Repeating this a few times eliminates the space duplications
- Colon and semicolon separated formats: better than space delimited, but these characters can also appear in the stored text. This can also be solved with text markers:

    ”AA”;”This;not delimiter!”;”CC”
    6.45;5.5;7.35
    15.6;17.8;3.2

- Tab (ASCII 9) delimited format: Tab specifically denotes a column break
  - It cannot be mixed up with other characters
  - But ordinary users can get confused, as Tab is invisible, except after pressing the formatting marks button (¶) in Word
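Both fixes described above, collapsing duplicated spaces and protecting delimiters with text markers, can also be done in a script. This is a minimal sketch using Python's standard `re` and `csv` modules; the helper names are hypothetical, and straight double quotes are used as text markers, since that is what most tools expect.

```python
import csv
import io
import re


def collapse_spaces(text):
    """Replace every run of spaces with one space, line by line.

    One pass does what repeated 'Replace all' of two spaces by one does.
    """
    return "\n".join(re.sub(r" {2,}", " ", line.strip())
                     for line in text.splitlines())


def read_delimited(text, delimiter=",", quotechar='"'):
    """Parse delimited text, honouring text markers around fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quotechar=quotechar)
    return list(reader)


if __name__ == "__main__":
    messy = "AA BB CC\n6.45 5.5  7.35\n15.6 17.8 3.2"
    print(read_delimited(collapse_spaces(messy), delimiter=" "))
    quoted = '"AA","This,not delimiter!","CC"\n6.45,5.5,7.35\n'
    print(read_delimited(quoted))
```

Collapsing spaces is only safe when no field is meant to be empty and no field contains a space itself; otherwise a tab-delimited or quoted format is the better choice.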


Data Import/Export/Conversion: Converting text file formats
- A frequent task is to export table-like text outputs into Excel, Access or PowerPoint (e.g. codon usage frequencies):
- Word text to HTML table:
  - Select the text with Shift+drag
  - Table|Convert|Text to Table (Táblázat|Konvertálás|Szöveg táblázattá) menu:
    - It tries to autodetect whether the text is in fixed column width or in delimited format
    - If it misjudges (e.g. when the 2 formats are mixed), we can correct it
    - It gives the number of rows/columns to be created
- Properties of an HTML table in Word:
  - Its rows/columns/cells are fully formattable: they can be sized, colored and framed, and the Font/Style/Size/Color of the text can be set
  - Its cells can also contain pictures, while an Excel table cell cannot: there a picture can only be in the background or on an overlay
  - The width of columns can be set to Manual, Uniform, Fit to content or Fit to Window width
  - One stupid thing in HTML is that the default cell margins are huge, eating up a lot of desktop space. Reduce them to 0:
    - Select the whole table with Shift+drag
    - Table|Table Properties (Táblázat|Táblázat tulajdonságai) menu:
      - |Cells (Cella) tab|Settings (Beállítások) button:
      - |Uncheck Same as whole table (Teljes táblázattal egyezően)
      - |Set Cell Margins (Margók) to 0 cm
  - Another stupid thing in HTML is that the default row height is not 0, adding redundant space between rows. Set it to 0:
    - |Rows (Sorok) tab
    - |Define row height (Magasság megadása):
    - |At least (legalább)| 0 cm

    Text before conversion:
    AA BB CC
    6.45 5.5 7.35
    15.6 17.8 3.2

    Table after conversion:
    AA   | BB   | CC
    6.45 | 5.5  | 7.35
    15.6 | 17.8 | 3.2


Data Import/Export/Conversion: Converting HTML and wide text
- HTML table from Word to Excel/PowerPoint/HTML webpage:
  - It can simply be copied through the clipboard with Edit|Copy (Szerkesztés|Másolás) Ctrl+C and Edit|Paste (Szerkesztés|Beillesztés) Ctrl+V, keeping all the formatting
- HTML table from Word to Text: select the whole table with Shift+drag:
  - Table|Convert|Table to Text (Táblázat|Konvertálás|Táblázat szöveggé) menu: writes the table out in delimited text format|Give the delimiter character: Tab
- Text from Word to Picture Metafile:
  - The outputs of numerous bioinformatic programs are text files with lines so wide (e.g. restriction or alignment maps drawn with characters) that they cannot fit into the page body of a Word document or a PowerPoint slide, and the lines get messed up. How can we solve this?
  - We can reduce the font size, but that reduces legibility
  - Or we can switch from the fixed width font Courier New to a more compact font (e.g. Arial Narrow), but the alignment of rows will be destroyed because it is not a fixed width font
  - Therefore copy the text to the clipboard, and instead of pasting it normally with Edit|Paste (Szerkesztés|Beillesztés) Ctrl+V, paste it with the Edit|Paste special (Szerkesztés|Irányított beillesztés) menu:
    - |Select the Enhanced Metafile (Kép) format
    - The text will be pasted into Word or PowerPoint as an easy-to-resize picture
    - Additionally, using their drawing tool (View|Tools|Drawing (Nézet|Eszköztárak|Rajzoló) menu), the picture can still be edited as a set of graphic objects: we can rewrite characters or add further graphics
    - But it cannot be edited as word processor text anymore

    Example table:
    AA   BB   CC
    6.45 5.5  7.35
    15.6 17.8 3.2

    Example of wide text:
    1 cagctggggggaggtggcgaggaagatgacgtggtagttgtcgcggcagctgccaggaga
      1        10        20        30        40        50
    1 gtcgacccccctccaccgctccttctactgcaccatcaacagcgccgtcgacggtcctct


Data Import/Export/Conversion: Text to Excel
- Text file table to Excel table:
  - Select and copy the table in the text file to the clipboard, then paste it into cell (A1) of an empty Excel worksheet with the Edit|Paste special (Szerkesztés|Irányított beillesztés) menu|selecting the Plain text (Nem formázott szöveg) format:
    - This will look pretty nasty at first: Excel copies it into separate rows, but the columns are melted together as text in one cell per row
  - Select this single column (A1:A3) with Shift+drag, and make sure that the columns to the right of it are empty
  - Then use the Data|Text to Columns (Adatok|Szövegből oszlopok) menu to start the text breaking wizard:
    - First it asks whether the text data is in fixed or delimited format:
      - If delimited, give the delimiter symbol (e.g. Comma) and the text marker, and set whether consecutive delimiters are merged or create an empty field
      - If fixed, it gives a breaking screen where you can define the column delimiter arrows with Click/drag
    - Then it shows the columns created, and we can set their data type manually or leave it to be detected automatically:
      - The first problem is that Excel by default recognizes text as dates only if they conform to the international settings of Windows at Start button|Control panel|International settings|Date and numeric format (Start gomb|vezérlőpult|Területi beállítások|Dátum- és számformátum). Dates in a different format are not recognized and are left as text!
      - You can recognize incorrect detection by the alignment: text is on the left of the cell, recognized dates/numbers on the right
      - This can be solved by setting the Date (Dátum) format to conform with the data content (YMD (ÉHN), MDY (HNÉ), etc.)
      - With the Special (Irányított) button we can define the Decimal separator (Tizedesjel) and the Thousand separator (Ezres elválasztó) if they are not detected correctly
    - After pressing the Finish (Bezár) button of the wizard, the table will be placed in consecutive columns with correct data formatting
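The date order and separator settings of the wizard correspond to explicit conversion rules, which can be written down as code. This is a minimal sketch with hypothetical helper names; it assumes the caller knows the decimal separator, the thousand separator and the date order of the data, just as the wizard has to be told them when autodetection fails.

```python
from datetime import datetime


def to_number(text, decimal=".", thousands=""):
    """Convert text to float using the given separators."""
    cleaned = text.strip()
    if thousands:
        cleaned = cleaned.replace(thousands, "")
    if decimal != ".":
        cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)


def to_date(text, order="YMD", sep="."):
    """Convert text to a date; `order` is a permutation of Y, M and D."""
    codes = {"Y": "%Y", "M": "%m", "D": "%d"}
    fmt = sep.join(codes[c] for c in order)
    return datetime.strptime(text.strip(), fmt).date()


if __name__ == "__main__":
    print(to_number("6,45", decimal=","))
    print(to_number("1.234,5", decimal=",", thousands="."))
    print(to_date("2016.01.11"))
    print(to_date("01/11/2016", order="MDY", sep="/"))
```

A value that does not match the stated format raises `ValueError` here instead of silently staying text, which makes wrongly detected columns easier to notice than in a spreadsheet.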

ShiftShift

PullPull

ClickClick

ClickClick

ClickClickClickClick

ClickClick

Click

Click

PullPull

Click

Click

Clic

kC

lick

ClickClick

ClickClick

Page 25: BioInformatics Consultation Practice 3 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666.

Data Import/Export/Conversion: From Excel to Text/HTML/Presentation

Excel table to text format table:

- We can copy a selected Excel table, diagram, or both together to the clipboard with Edit|Copy (Szerkesztés|Másolás) or Ctrl+C

- Paste into Word or PowerPoint with the Edit|Paste special (Szerkesztés|Irányított beillesztés) menu in Plain text (Nem formázott szöveg) format. It puts Tab (ASCII 9) characters between columns as delimiters

- If we would like another delimiter, paste the table as HTML and convert it to text as described earlier, choosing the delimiter character

- Alternatively, you can concatenate the content of columns into continuous text in a separate column using a cell formula: =A1&","&B1&","&C1 where & is text concatenation, "" encloses a text constant, and A1 is a cell reference
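If the tab-delimited text is already on hand (for example saved from the clipboard), the delimiter can also be changed outside Office. The following is a minimal sketch with Python's standard `csv` module; the function name `change_delimiter` is made up for this example.

```python
import csv
import io


def change_delimiter(tab_text, new_delimiter=","):
    """Re-write tab-delimited text using another delimiter character."""
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter=new_delimiter, lineterminator="\n")
    writer.writerows(reader)  # fields containing the new delimiter get quoted
    return out.getvalue()


table = "AA\tBB\tCC\n6.45\t5.5\t7.35\n15.6\t17.8\t3.2\n"
print(change_delimiter(table, ";"))
```

Unlike the plain =A1&","&B1 concatenation, the `csv` writer puts quotes around any field that itself contains the new delimiter, so the columns stay separable.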

Excel table to Presentation: HTML table/Picture Metafile/Bitmap:

- Never paste it with Edit|Paste (Szerkesztés|Beillesztés) or Ctrl+V into Word or PowerPoint, because this invisibly embeds the WHOLE Excel file into the document/presentation, once for each time you paste any part of the table:

- The embedded Excel can still compute cell formulas, but most of the time we do not need that

- However, it results in a huge document/presentation file, which will frequently freeze Word and PowerPoint

- Correctly, you should paste it with Edit|Paste special (Szerkesztés|Irányított beillesztés), where you have the following options:

- HTML format:

  - Preserves color/font formatting well, and the table is fully editable (cell formulas are replaced with numbers)

  - Row/column sizes and margins get messed up, which takes a lot of work to fix!

- Picture Metafile:

  - Excellent preservation of all formatting

  - Resizes excellently

  - Cannot be edited as a table

  - Can be edited as a drawing with the Word/PPT drawing tools

  - For a simple table/graphic it consumes fewer resources than a bitmap

- Bitmap:

  - It is pasted exactly as you see it on screen

  - Resizes badly, quality deteriorates rapidly

  - Very limited editability with PaintBrush

  - For highly complex diagrams a bitmap consumes fewer resources than a metafile

Example table used in the screenshots:

AA    BB    CC
6.45  5.5   7.35
15.6  17.8  3.2


Data Import/Export/Conversion: Excel diagram to Picture Metafile in PPT

Re-formatting charts in a presentation: There are some features of charts that we cannot set in Excel but can set in a metafile. E.g. for complex 3D area charts, it would be great to create semi-transparent function surfaces partially covering each other, but this cannot be done in Excel. How to do it:

- Copy the 3D area chart through the clipboard as a metafile

- Convert the metafile into a PPT drawing with View|Toolbars|Drawing|Drawing menu|Ungroup (Nézet|Eszköztárak|Rajzoló|Rajzoló menü|Csoportbontás); repeat as long as it can be done

- Delete unnecessary chart background, axis, axis text, etc. elements

- Select all remaining elements and format them: double-clicking on the selection, set their color, border, and transparency

- Group the elements together again

- But a complicated drawing containing 1000s of elements can eat up a lot of resources and freeze the presentation

- Therefore, cut the metafile to the clipboard with Edit|Cut (Szerkesztés|Kivágás)

- Paste it as a GIF picture with Edit|Paste special|GIF (Szerkesztés|Irányított beillesztés|GIF). It keeps transparency and reduces resource consumption, but from then on it can be edited only as an image


Data Import/Export/Conversion: From Excel to Access Database Table

- As an Excel worksheet in the .xls format can hold at most 65,536 rows, it is worth putting sizeable data tables into a database before Excel freezes. In Access, the steps are the following:

- With the File|New|Empty database|{Path/Name.mdb}|Save (Fájl|Új|Üres adatbázis|{Elérési út/Név.mdb}|Mentés) menu we create a new empty *.mdb database file with the given name on the given path

- With the File|Get external data|Import|Excel + Name.xls|Import (Fájl|Külső adatok átvétele|Importál|Excel fájlok + Név.xls|Importálás) menu, the import wizard is launched (only if Access was installed with the full setup!):

- First, we select the worksheet from which to import the table: it should have a regular row/column structure, with column names in the first line and the same type of data within each column, otherwise Access cannot import it

- Next, we can see the table to import, and the wizard asks whether there are column names in the first line

- Next, it asks whether to put the data in a new database table or in an already existing one (which must have a compatible column structure to receive the data)

- Next, we can review the types of the columns

- Next, it asks to assign a primary key to the table: No

- At Finish, it asks for the name of the new table: Munka1

- After the wizard has finished, the new table can be opened with DoubleClick on the Tables|Munka1 icon

- Access can handle ca. 10M rows in a table and computes much faster than Excel

- However, it is much more difficult to program; this is done in Structured Query Language (SQL)
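To give an impression of what importing a table and querying it with SQL looks like, here is a small sketch. It does not use Access: it swaps in SQLite through Python's standard `sqlite3` module, and the helper `import_table` is hypothetical. It assumes, like the wizard, that the first line holds the column names, and additionally that every column is numeric.

```python
import csv
import io
import sqlite3


def import_table(conn, table_name, text, delimiter="\t"):
    """Create a table from delimited text whose first line holds column names."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    # Assumption for this sketch: every column holds numbers (type REAL).
    columns = ", ".join(f'"{name}" REAL' for name in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', data)
    conn.commit()
    return len(data)


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
sheet = "AA\tBB\tCC\n6.45\t5.5\t7.35\n15.6\t17.8\t3.2\n"
row_count = import_table(conn, "Munka1", sheet)
total, = conn.execute('SELECT SUM("AA") FROM "Munka1"').fetchone()
print(row_count, total)
```

The SELECT statement at the end is the SQL counterpart of a SUM formula over column AA in Excel.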


Home Assignment 3: Fragment clean and match

Clean up the following fragments, given in FASTA format, from primer and vector sequences, and try to match them using suitable software! (5pts)

- Fragment1: Fragment1.txt
- Fragment2: Fragment2.txt
- Fragment3: Fragment3.txt

Solution: 3-1HomeAssignSolution.doc
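To make the idea of "clean and match" concrete, the sketch below shows the two operations on toy data in Python. It is not the assignment solution and does not replace VecScreen or CAP3: it only handles exact matches, ignores reverse complements, and all function names here are invented for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif line and name is not None:
            records[name] += line.upper()
    return records


def trim_prefix(seq, primer):
    """Remove the primer if the sequence starts with it (exact match only)."""
    return seq[len(primer):] if primer and seq.startswith(primer) else seq


def overlap(left, right, min_len=4):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right, min_len=4):
    """Join two fragments on their overlap, or return None if there is none."""
    size = overlap(left, right, min_len)
    return left + right[size:] if size else None
```

For example, with the made-up primer GGGG, `trim_prefix("GGGGACGTACGT", "GGGG")` leaves ACGTACGT, which overlaps the made-up fragment TACGTTTT in five bases, so `merge` joins them into one contig.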


References

Cloning, fragment processing:
- Restriction site database: WebCutter: http://rna.lundberg.gu.se/cutter2/index.html
- Primer cleaning: SMS2 DNA Pattern: http://www.bioinformatics.org/sms2/index.html
- Vector cleaning: NCBI VecScreen: http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
- Fragment assembly: CAP3: http://pbil.univ-lyon1.fr/cap3.php

Auxiliary sequence operations: SMS2: http://www.bioinformatics.org/sms2/index.html

Uploading sequences: EBI: http://www.ebi.ac.uk/embl/Submission/index.html

Data Import/Export/Conversion operations in Excel/Access:
- http://www.andrewsexceltips.com/
- http://www.andypope.info/
- http://www.dicks-blog.com/

