
BioJava Core API. Java for Bioinformatics? Cross platform means develop on one platform deploy on...

Date post: 26-Dec-2015
Upload: brenda-ball
BioJava Core API
Transcript
Page 1: BioJava Core API. Java for Bioinformatics? Cross platform means develop on one platform deploy on any. Widely accepted industry standard. Lots of support.

BioJava Core API


Java for Bioinformatics?

Cross platform means develop on one platform deploy on any.

Widely accepted industry standard.

Lots of support libraries for modern technologies (XML, Web Services, JDBC).

Scales well from small to industrial strength enterprise sized programs.


Java for Bioinformatics?

Object oriented.

Rapid development due to:
    Very strict types
    Simple, clear syntax
    Exception handling and recovery
    Cross platform
    Extensive class library
    Code reuse


What is BioJava?

A collection of Java objects that represent and manipulate biological data

Not a program, rather a programming library

Open source (LGPL): open to all development, even commercial. The licence is not ‘sticky’ or ‘viral’.


What is BioJava?

Collection of objects to assist bioinformatics research

Started at EBI/Sanger in 1998 by Matthew Pocock and Thomas Down

25+ developers have contributed (5 core)


What is BioJava?

BioJava has acquired 1100+ classes, 130,000+ lines of code.

Uses CVS version control, JUnit testing and ANT builds.

It now has a fairly stable API. 76 packages!


Where is BioJava

Home Page www.biojava.org

BioJava in Anger http://www.biojava.org/docs/bj_in_anger/

Mailing Lists [email protected] [email protected]

Nightly Builds http://www.derkholm.net/autobuild/


Obtaining BioJava

Download: http://www.biojava.org/download/
    Get binaries, source and docs

biojava-live (requires cvs):
    cvs -d :pserver:[email protected]:/home/repository/biojava login
    (Password is ‘cvs’)
    cvs -d :pserver:[email protected]:/home/repository/biojava checkout biojava-live
    cvs update -Pd


Compiling biojava-live

Requires the ANT build tool http://jakarta.apache.org/ant/

The ANT tool will use build.xml to:
    Arrange source code
    Compile source
    Make jar file
    Make Java docs
    Build demos
    Build and run tests

Change to biojava-live; type ant

Unit testing requires JUnit http://junit.sourceforge.net/


Setting up BioJava

Put the following JAR files on your class path:

    biojava.jar
    bytecode-0.92.jar
    commons-cli.jar
    commons-collections-2.1.jar
    commons-dbcp-1.1.jar
    commons-pool-1.1.jar


Object Oriented Patterns and BioJava Design


BioJava Design

Uses some reasonably “advanced” concepts:
    Design by Interface
    Protected or Private constructors
    Factory classes and methods
    Flyweight/Singleton objects


Interfaces Hide Implementation

In BioJava there are several implementations of the Distribution interface.

Any can be legally returned by a method that returns a Distribution (the returning method may even return different ones depending on the situation).

Any can be legally used as an argument to a method that requires a Distribution.

All are guaranteed to contain a minimal set of common methods.
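
The points above can be sketched in plain Java without BioJava on the class path. Everything in this block is hypothetical: the Weights interface only stands in for an interface such as Distribution, and the two implementing classes and the weightsFor method are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an interface such as Distribution (not the BioJava type). */
interface Weights {
    double weightOf(char symbol);
}

/** One implementation: every symbol gets the same weight. */
final class UniformWeights implements Weights {
    private final double weight;

    UniformWeights(int alphabetSize) {
        this.weight = 1.0 / alphabetSize;
    }

    @Override
    public double weightOf(char symbol) {
        return weight;
    }
}

/** Another implementation: weights are relative frequencies of observed symbols. */
final class CountedWeights implements Weights {
    private final Map<Character, Integer> counts = new HashMap<>();
    private int total;

    CountedWeights(String observed) {
        for (char c : observed.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
            total++;
        }
    }

    @Override
    public double weightOf(char symbol) {
        return counts.getOrDefault(symbol, 0) / (double) total;
    }
}

final class InterfaceDemo {
    /** Callers only know they receive a Weights; the concrete class depends on the situation. */
    static Weights weightsFor(String observed) {
        return observed.isEmpty() ? new UniformWeights(4) : new CountedWeights(observed);
    }

    /** Accepts any implementation, because it relies only on the interface methods. */
    static String describe(Weights w) {
        return "P(a) = " + w.weightOf('a');
    }

    public static void main(String[] args) {
        System.out.println(describe(weightsFor("")));
        System.out.println(describe(weightsFor("aact")));
    }
}
```

The caller of weightsFor never names UniformWeights or CountedWeights, so either class could be replaced later without breaking client code.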


Flyweight and Singleton Objects

A Singleton is a class with only one instance and only one access point.

A Singleton will need a Private constructor and may be static (e.g. AlphabetManager).

A Flyweight object uses sharing to support large numbers of fine-grained objects efficiently.

For example in BioJava there is only ever one instance of the DNA Symbol “A”. A sequence of A’s is really just a list of pointers to that one object.
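
A minimal sketch of the flyweight idea in plain Java follows. The Base class and its forChar method are hypothetical names chosen for this illustration; they are not BioJava's Symbol or DNATools.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical flyweight class: only four instances ever exist. */
final class Base {
    static final Base A = new Base("adenine");
    static final Base C = new Base("cytosine");
    static final Base G = new Base("guanine");
    static final Base T = new Base("thymine");

    private final String name;

    // Private constructor: client code cannot create further instances.
    private Base(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    /** Static access point that maps a character to the shared instance. */
    static Base forChar(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default: throw new IllegalArgumentException("Not a DNA base: " + c);
        }
    }
}

final class FlyweightDemo {
    /** Builds a list of references to the shared Base objects. */
    static List<Base> parse(String dna) {
        List<Base> bases = new ArrayList<>(dna.length());
        for (char c : dna.toCharArray()) {
            bases.add(Base.forChar(c));
        }
        return bases;
    }

    public static void main(String[] args) {
        List<Base> seq = parse("aaaa");
        // All four elements refer to the same object, so identity comparison holds.
        System.out.println(seq.get(0) == seq.get(3));
    }
}
```

However long the parsed sequence is, it costs one reference per residue and no new Base objects.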


Factory and Static methods

Sometimes it is useful to prevent a user from directly constructing an object via a constructor:
    If the construction is complex.
    If the choice of the optimal implementation is best left to the API developer.
    If important resources are best protected from end users, e.g. Singletons/Flyweights.

Rather than instantiating the object via its constructor, a static method or Factory object is used.


Examples

Static method: FiniteAlphabet dna = DNATools.getDNA();

Static field: DistributionFactory df = DistributionFactory.DEFAULT;

Factory method: Distribution d = df.createDistribution(dna);


Two Levels of BioJava

Macro type programming
    Tools classes (SeqIOTools, DistributionTools etc).
    Static methods for common tasks.

Full programming
    Lots of customizations and ‘plug and play’ possible.
    More exposure to the sharp edges of the API.
    Less documentation.


Alphabets, Symbols and Sequences


Symbols

In BioJava the DNA residue “A” is an object.

In Bioperl “A” would be a String.

The “A” object is part of the sequence, not the sequence.

“A” from DNA is not equal to “A” from RNA or “A” from Protein.


Why not Strings?

DNA A != RNA A != Protein A

For Strings, "A".equals("A") is true whichever alphabet the letter came from.

The DNA Alphabet also contains the ambiguity symbols K, Y, W, S, R, M, B, D, H, V, N.


Why not Strings?

The Symbol object Y contains C and T; the String “Y” doesn’t contain anything.

Translation HashMaps with Strings are flawed:
    In BioJava, GGN translates to GLY.
    The String GGN maps to null.

A fully redundant String to String HashMap translation table requires 4096 keys!
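
The flaw can be reproduced in a few lines of plain Java. This is a sketch, not BioJava code: the class and method names are invented, the table is the standard genetic code written for lower-case DNA codons, and ambiguity codes are expanded with the usual IUPAC meanings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AmbiguousCodons {
    private static final String BASES = "tcag";
    // Standard genetic code, with codons ordered t, c, a, g at each position.
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** The naive table: 64 unambiguous codon Strings mapped to one-letter residues. */
    static final Map<String, Character> TABLE = new HashMap<>();
    /** IUPAC nucleotide codes mapped to the bases they stand for. */
    static final Map<Character, String> MATCHES = new HashMap<>();

    static {
        int i = 0;
        for (char b1 : BASES.toCharArray())
            for (char b2 : BASES.toCharArray())
                for (char b3 : BASES.toCharArray())
                    TABLE.put("" + b1 + b2 + b3, AMINO.charAt(i++));
        String[] codes = {"a=a", "c=c", "g=g", "t=t", "r=ag", "y=ct", "s=cg", "w=at",
                          "k=gt", "m=ac", "b=cgt", "d=agt", "h=act", "v=acg", "n=acgt"};
        for (String code : codes) {
            MATCHES.put(code.charAt(0), code.substring(2));
        }
    }

    /** Expands each position to the bases it matches and collects every resulting residue. */
    static Set<Character> translate(String codon) {
        Set<Character> residues = new TreeSet<>();
        for (char b1 : MATCHES.get(codon.charAt(0)).toCharArray())
            for (char b2 : MATCHES.get(codon.charAt(1)).toCharArray())
                for (char b3 : MATCHES.get(codon.charAt(2)).toCharArray())
                    residues.add(TABLE.get("" + b1 + b2 + b3));
        return residues;
    }

    public static void main(String[] args) {
        System.out.println(TABLE.get("ggn"));  // null: the String key is simply absent
        System.out.println(translate("ggn"));  // [G]: every expansion codes for glycine
        System.out.println(translate("agn"));  // [R, S]: arginine or serine
    }
}
```

Treating an ambiguity code as the set of bases it matches keeps the table at 64 entries, which is the behaviour a Symbol object gives for free.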


Symbols are Canonical

DNATools.a() == DNATools.a();
    There is only one instance of ‘a’.

DNATools.a().equals(DNATools.a());

ProteinTools.a() != DNATools.a();

Even on remote JVMs!

During serialization Alphabet indexing is transient and ‘reconnected’ via readResolve() methods.
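
How readResolve() keeps an object canonical can be shown with standard Java serialization alone. The CanonicalSymbol class below is hypothetical and much simpler than BioJava's own mechanism; it only illustrates the technique.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical canonical symbol; illustrates readResolve(), not BioJava's actual code. */
final class CanonicalSymbol implements Serializable {
    private static final long serialVersionUID = 1L;

    static final CanonicalSymbol A = new CanonicalSymbol("a");
    static final CanonicalSymbol C = new CanonicalSymbol("c");

    private final String token;

    private CanonicalSymbol(String token) {
        this.token = token;
    }

    static CanonicalSymbol forToken(String token) {
        switch (token) {
            case "a": return A;
            case "c": return C;
            default: throw new IllegalArgumentException("Unknown token: " + token);
        }
    }

    /** Called by Java serialization after reading; swaps the fresh copy for the canonical instance. */
    private Object readResolve() {
        return forToken(token);
    }
}

final class SerializationDemo {
    /** Writes the object to a byte array and reads it back. */
    static Object roundTrip(Object in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object copy = roundTrip(CanonicalSymbol.A);
        // Without readResolve() this would be a second, distinct object.
        System.out.println(copy == CanonicalSymbol.A);
    }
}
```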


Alphabets

A set of Symbols

Alphabets can be infinite
    DoubleAlphabet, IntegerAlphabet

Some Alphabets have a finite number of Symbols
    DNA, RNA etc

Alphabet and FiniteAlphabet interfaces


org.biojava.bio.symbol.Alphabet

boolean contains(Symbol s) Returns whether or not this Alphabet contains the symbol.

List getAlphabets() Return an ordered List of the alphabets which make up a compound alphabet.

Symbol getAmbiguity(java.util.Set syms) Get a symbol that represents the set of symbols in syms.

Symbol getGapSymbol() Get the 'gap' ambiguity symbol that is most appropriate for this alphabet

String getName() Get the name of the alphabet.

Symbol getSymbol(java.util.List rl) Get a symbol from the Alphabet which corresponds to the specified ordered list of symbols.

SymbolTokenization getTokenization(java.lang.String name) Get a SymbolTokenization by name. 

void validate(Symbol s) Throws a precanned IllegalSymbolException if the symbol is not contained within this Alphabet.


org.biojava.bio.symbol.FiniteAlphabet

In addition to the previous methods

void addSymbol(Symbol s) Adds a symbol to this Alphabet

Iterator iterator() Retrieve an Iterator over the Symbols in this Alphabet. 

void removeSymbol(Symbol s) Remove a symbol from this alphabet.

int size() The number of symbols in the alphabet.


The Default Alphabets

DNA (a, c, g, t)
RNA (a, c, g, u)
PROTEIN (all amino acids including ‘Sel’)
PROTEIN-TERM (all PROTEIN plus “*”)
STRUCTURE (PDB structure symbols)
Alphabet of all integers (infinite Alphabet)
    Can generate SubIntegerAlphabets
Alphabet of all doubles (infinite Alphabet)


Getting the common Alphabets

import org.biojava.bio.symbol.*;
import java.util.*;
import org.biojava.bio.seq.*;

public class AlphabetExample {
  public static void main(String[] args) {
    Alphabet dna, rna, prot;

    //get the DNA alphabet by name
    dna = AlphabetManager.alphabetForName("DNA");

    //get the RNA alphabet by name
    rna = AlphabetManager.alphabetForName("RNA");

    //get the Protein alphabet by name
    prot = AlphabetManager.alphabetForName("PROTEIN");
    //get the protein alphabet that includes the * termination Symbol
    prot = AlphabetManager.alphabetForName("PROTEIN-TERM");

    //get those same Alphabets from the Tools classes
    dna = DNATools.getDNA();
    rna = RNATools.getRNA();
    prot = ProteinTools.getAlphabet();
    //or the one with the * symbol
    prot = ProteinTools.getTAlphabet();
  }
}


SymbolLists are made of Symbols

org.biojava.bio.symbol.SymbolList

A sequence of Symbols from the same Alphabet.

Uses biological coordinates from 1 to length (cf. String, from 0 to length-1).


Doesn’t this waste memory?

A SymbolList is not really a List of Symbol Objects.

Rather a List of Object references.

Still a bit heavier than a char[] but not serious.

[Figure: the sequence AACGTGGGTTCCAACT drawn as references to the shared Symbol objects A, C, G and T.]


The Bigger Picture

[Figure: the sequence AACGTGGGTTCCAACT and the Symbols A, C, G and T, shown together with the AlphabetManager and its “DNA” and “Protein” Alphabets.]


The SymbolList interface

void edit(Edit edit)           Apply an edit to the SymbolList as specified by the edit object. 

Alphabet getAlphabet()           The alphabet that this SymbolList is over. 

Iterator iterator()           An Iterator over all Symbols in this SymbolList. 

int length()           The number of symbols in this SymbolList. 

String seqString()           Stringify this symbol list. 

SymbolList subList(int start, int end)           Return a new SymbolList for the symbols start to end inclusive. 

String subStr(int start, int end)           Return a region of this symbol list as a String. 

Symbol symbolAt(int index)           Return the symbol at index, counting from 1.

List toList()           Returns a List of symbols.
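
The 1-based, inclusive coordinates used by symbolAt and subStr differ from Java's own String indexing. The sketch below shows the conversion on a plain String; OneBasedSequence is a hypothetical helper written for this illustration and is not part of BioJava.

```java
/** Hypothetical helper showing 1-based, inclusive coordinates over a Java String. */
final class OneBasedSequence {
    private final String residues;

    OneBasedSequence(String residues) {
        this.residues = residues;
    }

    int length() {
        return residues.length();
    }

    /** Residue at a biological position, counting from 1. */
    char symbolAt(int index) {
        if (index < 1 || index > residues.length()) {
            throw new IndexOutOfBoundsException(
                "Position " + index + " is outside 1.." + residues.length());
        }
        return residues.charAt(index - 1);
    }

    /** Region from start to end, both 1-based and inclusive. */
    String subStr(int start, int end) {
        if (start < 1 || end > residues.length() || start > end) {
            throw new IndexOutOfBoundsException(
                "Region " + start + ".." + end + " is outside 1.." + residues.length());
        }
        // String.substring is 0-based with an exclusive end, hence start - 1 and end.
        return residues.substring(start - 1, end);
    }

    public static void main(String[] args) {
        OneBasedSequence seq = new OneBasedSequence("atgc");
        System.out.println(seq.symbolAt(1));   // first residue
        System.out.println(seq.subStr(2, 3));  // second and third residues
    }
}
```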


String to SymbolList

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class StringToSymbolList {
  public static void main(String[] args) {
    try {
      //create a DNA SymbolList from a String
      SymbolList dna = DNATools.createDNA("atcggtcggctta");

      //create a RNA SymbolList from a String
      SymbolList rna = RNATools.createRNA("auugccuacauaggc");

      //create a Protein SymbolList from a String
      SymbolList aa = ProteinTools.createProtein("AGFAVENDSA");
    }
    catch (IllegalSymbolException ex) {
      //this will happen if you use a character in one of your strings that is
      //not an accepted IUB Character for that Symbol.
      ex.printStackTrace();
    }
  }
}


SymbolList to String

import org.biojava.bio.symbol.*;

public class SymbolListToString {
  public static void main(String[] args) {
    SymbolList sl = null;

    //code here to instantiate sl

    //convert sl into a String
    String s = sl.seqString();
  }
}


The Sequence Interface

A Sequence is a SymbolList with more information.

In addition to the methods of Annotatable and SymbolList:

String getName() The name of this sequence.

String getURN() A Uniform Resource Identifier (URI) which identifies the sequence represented by this object.

Also implements FeatureHolder which allows addition of Feature Objects.


Quickly generate a Sequence

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class StringToSequence {
  public static void main(String[] args) {
    try {
      //create a DNA sequence with the name dna_1
      Sequence dna = DNATools.createDNASequence("atgctg", "dna_1");

      //create an RNA sequence with the name rna_1
      Sequence rna = RNATools.createRNASequence("augcug", "rna_1");

      //create a Protein sequence with the name prot_1
      Sequence prot = ProteinTools.createProteinSequence("AFHS", "prot_1");
    }
    catch (IllegalSymbolException ex) {
      //an exception is thrown if you use a non IUB symbol
      ex.printStackTrace();
    }
  }
}


More Complex Symbols and Alphabets


Ambiguity Symbols

Ambiguous or Fuzzy data is a fact of life, especially with sequencing.

DNA traces can contain symbols such as n, r, w, v, h, k, y etc.

In BioJava DNA symbols a, c, g, t are AtomicSymbols.

Ambiguous symbols like y are BasisSymbols.


BasisSymbols

A BasisSymbol may be represented as a list of one or more Symbols.

BasisSymbol extends Symbol.

Ambiguity Symbols are always BasisSymbols.

getSymbols() The list of symbols that this symbol is composed from.


AtomicSymbols

AtomicSymbols are not ambiguous.

They cannot be further divided into Symbols that are valid members of the parent Alphabet.

In the case of compound Alphabets they can be divided into valid Symbols from component Alphabets.


AtomicSymbols

The AtomicSymbol interface extends BasisSymbol but adds no new methods, only behaviour contracts.

AtomicSymbol instances guarantee that getMatches() returns an Alphabet containing just that Symbol and each element of the List returned by getSymbols() is also atomic.


Atomic and Basis

[Figure: the sequence AATW over the “DNA” Alphabet held by the AlphabetManager. A and T are AtomicSymbols; W is a BasisSymbol.]


Translating Ambiguity

BioJava handles translation of ambiguity very smoothly.

DNA ‘n’ = [a,c,g,t]
Transcribes to RNA ‘n’ = [a,c,g,u]
ggn translates to Gly
agn translates to [Ser, Arg]
Most protein ambiguities have no ‘token’ and are printed as ‘X’


CrossProduct Alphabets

A CrossProductAlphabet is a combination of two or more Alphabets.

Any type of CrossProductAlphabet is possible

    Dimers (DNA x DNA)
    Codon (DNA x DNA x DNA)
    Conditional ((DNA x DNA) x DNA)
    Mixed ((DNA x DNA x DNA) x PROTEIN)
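
What a cross product of alphabets contains can be worked out with ordinary lists. This is a sketch under simplifying assumptions: symbols are plain Strings, the class and method names are invented, and the small “phase” alphabet is made up to show that the component alphabets need not be the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CrossProductSketch {
    /** Returns every ordered combination that takes one symbol from each alphabet. */
    static List<List<String>> crossProduct(List<List<String>> alphabets) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<String>());
        for (List<String> alphabet : alphabets) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String symbol : alphabet) {
                    List<String> tuple = new ArrayList<>(prefix);
                    tuple.add(symbol);
                    extended.add(tuple);
                }
            }
            result = extended;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dna = Arrays.asList("a", "c", "g", "t");
        List<String> phase = Arrays.asList("0", "1", "2");  // hypothetical second alphabet
        System.out.println(crossProduct(Arrays.asList(dna, dna)).size());       // 16 dimers
        System.out.println(crossProduct(Arrays.asList(dna, dna, dna)).size());  // 64 codons
        System.out.println(crossProduct(Arrays.asList(dna, phase)).size());     // 12 mixed pairs
    }
}
```

The size of the product is the product of the component sizes, which is why a codon alphabet over the four DNA bases has 64 members.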


Finite and Compound Alphas

[Figure: the sequence [AAC][GTG]GGTTCCAACT. A, C, G and T are DNA AtomicSymbols; ACA and GTG are (DNA x DNA x DNA) AtomicSymbols; GNG is a (DNA x DNA x DNA) BasisSymbol.]


What are they good for?

Codon Symbols (DNA x DNA x DNA).

Many analysis classes such as Count and Distribution use Symbol as an argument.
    A hexamer can be an AtomicSymbol.

Phred is DNA x Integer.

1st and higher order Markov Models use CrossProductAlphabets.


How do I make a CrossProductAlphabet?

import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class CrossProduct {
  public static void main(String[] args) {

    //make a CrossProductAlphabet from a List
    List l = Collections.nCopies(3, DNATools.getDNA());
    Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

    //get the same Alphabet by name
    Alphabet codon2 =
        AlphabetManager.generateCrossProductAlphaFromName("(DNA x DNA x DNA)");

    //show that the two Alphabets are canonical
    System.out.println(codon == codon2);
  }
}


Making Triplet Views on a SymbolList

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class CodonView {
  public static void main(String[] args) {
    try {
      //make a DNA SymbolList
      SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
      System.out.println("Length of dna " + dna.length());

      //get a Codon View (window size of three)
      SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
      System.out.println("Length of codons " + codons.length());

      //get a Triplet View
      SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
      System.out.println("Length of triplets " + triplets.length());
    }
    catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
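
The difference between a windowed view and an order-N view is easiest to see on a plain String. The sketch below is not BioJava code and copies substrings rather than creating views; the decision to reject lengths that are not a multiple of the window size is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TripletViews {
    /** Non-overlapping windows: atgccc -> atg, ccc. */
    static List<String> windowed(String seq, int size) {
        if (seq.length() % size != 0) {
            throw new IllegalArgumentException(
                "Length " + seq.length() + " is not a multiple of " + size);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= seq.length(); i += size) {
            out.add(seq.substring(i, i + size));
        }
        return out;
    }

    /** Overlapping windows that advance one residue at a time: atgc -> atg, tgc. */
    static List<String> orderN(String seq, int order) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + order <= seq.length(); i++) {
            out.add(seq.substring(i, i + order));
        }
        return out;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3));       // 12 / 3 = 4 codons
        System.out.println(orderN(dna, 3).size());  // 12 - 3 + 1 = 10 overlapping triplets
    }
}
```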


Getting a Symbol for a Codon

import java.util.*;

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class MakeATG {
  public static void main(String[] args) {
    //make a CrossProductAlphabet from a List
    List l = Collections.nCopies(3, DNATools.getDNA());
    Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

    //get the codon made of atg
    List syms = new ArrayList(3);
    syms.add(DNATools.a());
    syms.add(DNATools.t());
    syms.add(DNATools.g());

    Symbol atg = null;
    try {
      atg = codon.getSymbol(syms);
    }
    catch (IllegalSymbolException ex) {
      //used Symbol from Alphabet that is not a component of codon
      ex.printStackTrace();
    }
    System.out.println("Name of atg: " + atg.getName());
  }
}


Breaking a Codon into its Parts

import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class BreakingComponents {
  public static void main(String[] args) {
    //make the 'codon' alphabet
    List l = Collections.nCopies(3, DNATools.getDNA());
    Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);

    //get the first symbol in the alphabet
    Iterator iter = ((FiniteAlphabet)alpha).iterator();
    AtomicSymbol codon = (AtomicSymbol)iter.next();
    System.out.print(codon.getName() + " is made of: ");

    //break it into a list of its components
    List symbols = codon.getSymbols();
    for (int i = 0; i < symbols.size(); i++) {
      if (i != 0)
        System.out.print(", ");
      Symbol sym = (Symbol)symbols.get(i);
      System.out.print(sym.getName());
    }
  }
}


Basic Sequence Operations


Getting a section of a SymbolList

symbolAt(int i) Returns a Symbol

subList(int min, int max) Returns a SymbolList

subStr(int min, int max) Returns the subsection tokenized to a String


Transcription

In BioJava DNA sequences and RNA sequences are from different Alphabets. To convert between them:

//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atgccgaatcgtaa");

//convert it to RNA
SymbolList rna = DNATools.toRNA(dna);

//just to prove it worked
System.out.println(rna.seqString()); //augccgaaucguaa

//biological transcription (i.e. copy and reverse strand)
rna = DNATools.transcribeToRNA(dna); //5' atgccgaatcgtaa 3'
System.out.println(rna.seqString()); //5' uuacgauucggcau 3'
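
The two conversions can be checked by hand with a String-based sketch. The class and method names below are invented and the code handles only lower-case a, c, g and t; it reproduces the two results quoted in the comments above.

```java
final class TranscriptionSketch {
    /** Rewrites the same strand in the RNA alphabet: t becomes u. */
    static String toRna(String dna) {
        return dna.replace('t', 'u');
    }

    /** Watson-Crick partner of a single lower-case DNA base. */
    static char complement(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    /** Reads the opposite strand, 5' to 3'. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    /** Treats the input as the template strand: reverse complement, then t to u. */
    static String transcribe(String template) {
        return toRna(reverseComplement(template));
    }

    public static void main(String[] args) {
        String dna = "atgccgaatcgtaa";
        System.out.println(toRna(dna));       // augccgaaucguaa
        System.out.println(transcribe(dna));  // uuacgauucggcau
    }
}
```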


Reverse Complement

import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class ReverseComplement {
  public static void main(String[] args) throws Exception {
    SymbolList forward = DNATools.createDNA("atcgctagcgatcg");

    //two step
    SymbolList reverse = SymbolListViews.reverse(forward);
    SymbolList revc1 = DNATools.complement(reverse);

    //one step
    SymbolList revc2 = DNATools.reverseComplement(forward);

    //test for equivalence
    System.out.println(revc1.equals(revc2));
  }
}


Translation

RNATools contains the “Universal” RNA to Protein TranslationTable.

Standard procedure is transcribe DNA to RNA and then translate.


Translation Example

import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class Translate {
  public static void main(String[] args) {
    try {
      //create a DNA SymbolList
      SymbolList symL = DNATools.createDNA("atggccattgaatga");

      //transcribe to RNA
      symL = DNATools.toRNA(symL);

      //translate to protein
      symL = RNATools.translate(symL);

      //prove that it worked
      System.out.println(symL.seqString());
    }
    catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}


Sequence I/O


Don’t ever write another Parser

If you can avoid it!

BioJava supports:
    Genbank, GenPept, RefSeq, EMBL, SwissProt, PDB, Fasta, ABI, LocusLink, Unigene (requires Java 1.4)
    GAME, AGAVE
    Blast, Fasta, HMMER (models and results), BlastXML, MEME, Phred
    OBDA, BioIndex, BioSQL, DAS, GFF, XFF
    Ensembl (with biojava-ensembl package)
    StAX/Tag value
    RMI and Serialization


Simple I/O

Most of BioJava’s simpler I/O operations are conveniently wrapped up behind static methods from the SeqIOTools class.

SeqIOTools can read and write:
    Fasta (protein or DNA)
    EMBL
    GenBank (flat file and XML)
    SwissProt
    GenPept
    MSF (protein or DNA)
    Fasta Alignments


SeqIOTools Reader Methods

SequenceIterator i = SeqIOTools.readGenbank(br);
SequenceIterator i = SeqIOTools.readGenpept(br);
SequenceIterator i = SeqIOTools.readSwissprot(br);
SequenceIterator i = SeqIOTools.readEmbl(br);
etc…

SequenceIterator i = (SequenceIterator) SeqIOTools.fileToBiojava("fasta", "dna", br);

Alignment a = (Alignment) SeqIOTools.fileToBiojava("MSF", "rna", br);


Features, Locations, Annotations


Features and Annotations

Sequence data often comes with added information about the various properties of the sequence (Genbank, SwissProt etc).

BioJava divides this information into global properties (Annotations) and Localized properties (Features).


Annotatable

Annotatable is a “mix-in” interface that indicates the implementing object contains an Annotation object.

It defines one method. Annotation getAnnotation();


Annotations

org.biojava.bio.Annotation

Annotations are used for global properties.
    Species, Accession Number, xrefs, date, publication.

Key-value maps.
    Key and Value are objects but almost always are Strings.

Annotation.EMPTY_ANNOTATION
    static convenience class
    good place holder, avoids null pointer exceptions
    immutable


Annotation API

Map asMap() Return a map that contains the same key/values as this Annotation. 

boolean containsProperty(java.lang.Object key) Returns whether the property is defined.

Object getProperty(java.lang.Object key) Retrieve the value of a property by key. 

Set keys() Get a set of key objects. 

void removeProperty(java.lang.Object key) Delete a property 

void setProperty(java.lang.Object key, java.lang.Object value) Set the value of a property.
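
The key-value model and the idea of an immutable empty placeholder can be imitated with the standard collections. The sketch below uses java.util.Map in place of Annotation; the class, the EMPTY constant and both helper methods are hypothetical and only mirror the role that Annotation.EMPTY_ANNOTATION plays.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class AnnotationSketch {
    /** Shared, immutable placeholder used instead of null. */
    static final Map<Object, Object> EMPTY = Collections.emptyMap();

    /** Returns the given annotation, or the shared empty one when there is none. */
    static Map<Object, Object> orEmpty(Map<Object, Object> annotation) {
        return annotation == null ? EMPTY : annotation;
    }

    /** Probes whether the map refuses modification. */
    static boolean isReadOnly(Map<Object, Object> annotation) {
        try {
            annotation.put("probe", "value");
            annotation.remove("probe");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<Object, Object> ann = new HashMap<>();
        ann.put("species", "Homo sapiens");
        System.out.println(orEmpty(ann).containsKey("species"));
        // No null check is needed before querying a missing annotation.
        System.out.println(orEmpty(null).containsKey("species"));
        System.out.println(isReadOnly(EMPTY));
    }
}
```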


FeatureHolder

FeatureHolder is another “mix-in” interface which allows the implementing object to hold Features.

Sequence implements FeatureHolder.

Features are created by FeatureHolders.

FeatureHolders can be filtered.


FeatureHolder methods

boolean containsFeature(Feature f) Check if the feature is present in this holder.

int countFeatures() Count how many features are contained.

Feature createFeature(Feature.Template ft)  Create a new Feature, and add it to this FeatureHolder.

Iterator features()  Iterate over the features in no well defined order.

FeatureHolder filter(FeatureFilter filter)  Query this set of features using a supplied FeatureFilter. 

FeatureHolder filter(FeatureFilter fc, boolean recurse)  Return a new FeatureHolder that contains all of the children of this one that passed the filter fc.

FeatureFilter getSchema() Return a schema-filter for this FeatureHolder.

void removeFeature(Feature f)  Remove a feature from this FeatureHolder.

Features are Annotatable

Features implement Annotatable.

Can hold an annotation.

Global annotations of a Feature: /note, /db_xref, etc.

Features may be nested

Features implement FeatureHolder!

Therefore Features may hold nested Features.

c.f. The AWT Menu is a MenuItem.

e.g. A gene has exons and introns.

Filtering can be recursive.

A Feature cannot hold itself (directly or indirectly).

Location API

Locations are objects that specify a minimum and maximum bound on a region of sequence.

Contains some useful methods, particularly getMin() and getMax().

Many methods have been deprecated and are now delegated to LocationTools.

LocationTools is the best place to get new instances of a Location.

PointLocation, RangeLocation, CircularLocation, CompoundLocation.

LocationTools

static boolean areEqual(Location locA, Location locB)    Return whether two locations are equal.

static boolean contains(Location locA, Location locB)     Return true iff all indices in locB are also contained by locA.

static Location flip(Location loc, int len)     Flips a location relative to a length.

static Location intersection(Location locA, Location locB)      Return the intersection of two locations.

static CircularLocation makeCircularLocation(int min, int max, int seqLength)      A simple method to generate a RangeLocation wrapped in a CircularLocation

static Location makeLocation(int min, int max)      Return a contiguous Location from min to max.

static boolean overlaps(Location locA, Location locB)      Determines whether the locations overlap or not.

static Location subtract(Location x, Location y)    Subtract one location from another.

static Location union(java.util.Collection locs)      The n-way union of a Collection of locations.

static Location union(Location locA, Location locB)      Return the union of two locations.
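The set-style operations listed above can be illustrated on plain closed intervals. This is only a sketch under stated assumptions: the Range class and the three static methods are hypothetical stand-ins written for this example, not BioJava's Location or LocationTools, and returning null for an empty intersection is a convention of the sketch only.

```java
public class RangeSketch {
    /** Hypothetical stand-in for a contiguous location; both bounds are inclusive. */
    static final class Range {
        final int min;
        final int max;

        Range(int min, int max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }

        @Override
        public String toString() {
            return "[" + min + "," + max + "]";
        }
    }

    /** True when the two ranges share at least one index. */
    static boolean overlaps(Range a, Range b) {
        return a.min <= b.max && b.min <= a.max;
    }

    /** True iff every index in b is also in a. */
    static boolean contains(Range a, Range b) {
        return a.min <= b.min && b.max <= a.max;
    }

    /** Shared region of a and b, or null when they do not overlap (sketch convention). */
    static Range intersection(Range a, Range b) {
        if (!overlaps(a, b)) {
            return null;
        }
        return new Range(Math.max(a.min, b.min), Math.min(a.max, b.max));
    }

    public static void main(String[] args) {
        Range a = new Range(3, 8);
        Range b = new Range(6, 12);
        System.out.println("overlaps: " + overlaps(a, b));
        System.out.println("contains: " + contains(a, b));
        System.out.println("intersection: " + intersection(a, b));
    }
}
```

With inclusive bounds, two ranges overlap exactly when each one starts no later than the other ends, which is why a single comparison pair is enough.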

Location Example

import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class SpecifyRange {
  public static void main(String[] args) {
    try {
      //make a RangeLocation specifying the residues 3-8
      Location loc = LocationTools.makeLocation(3,8);
      //print the location
      System.out.println("Location: "+loc.toString());

      //make a SymbolList
      SymbolList sl = RNATools.createRNA("gcagcuaggcggaaggagc");
      System.out.println("SymbolList: "+sl.seqString());

      //get the SymbolList specified by the Location
      SymbolList sym = loc.symbols(sl);
      System.out.println("Symbols specified by Location: "+sym.seqString());
    }
    catch (IllegalSymbolException ex) {
      //illegal symbol used to make sl
      ex.printStackTrace();
    }
  }
}

Filtering Features

FeatureHolders have a filter method that accepts a FeatureFilter as an argument.

Features that are accepted by the FeatureFilter are returned as a new FeatureHolder.

Filtering may be done recursively so that nested Features are subjected to the same FeatureFilter .
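The difference between a flat and a recursive filter can be sketched on a small tree. Everything here is hypothetical and written for illustration: Node stands in for a feature that can hold nested features, and java.util.function.Predicate stands in for a FeatureFilter. It is not BioJava's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RecursiveFilterSketch {
    /** Hypothetical feature node: a type label plus nested child features. */
    static final class Node {
        final String type;
        final List<Node> children = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Collects the children of holder that pass the test.
     * Nested children are visited only when recurse is true.
     */
    static List<Node> filter(Node holder, Predicate<Node> accept, boolean recurse) {
        List<Node> out = new ArrayList<>();
        for (Node child : holder.children) {
            if (accept.test(child)) {
                out.add(child);
            }
            if (recurse) {
                out.addAll(filter(child, accept, true));
            }
        }
        return out;
    }

    /** A sequence holding a gene (two exons, one intron) and a top-level repeat. */
    static Node example() {
        Node gene = new Node("gene")
                .add(new Node("exon"))
                .add(new Node("intron"))
                .add(new Node("exon"));
        return new Node("sequence").add(gene).add(new Node("repeat"));
    }

    public static void main(String[] args) {
        Node seq = example();
        Predicate<Node> isExon = n -> n.type.equals("exon");
        // The exons sit inside the gene, so a flat filter finds none.
        System.out.println(filter(seq, isExon, false).size()); // prints 0
        System.out.println(filter(seq, isExon, true).size());  // prints 2
    }
}
```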

FeatureFilters

FeatureFilter is an interface that specifies one method: boolean accept(Feature f)

There are 26 implementations of FeatureFilter in BioJava available as inner classes of the FeatureFilter interface.

Most commonly used are ByType, BySource, StrandFilter, OverlapsLocation, ContainedByLocation.

Also boolean logic filters: And, Or, Not
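How a one-method filter interface composes with boolean logic can be sketched as follows. The Feat and Filter types and the byType, overlaps, and, or, not helpers are hypothetical names invented for this example. They only mirror the shape described above and are not the BioJava inner classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterSketch {
    /** Hypothetical minimal feature: a type label and inclusive bounds. */
    static final class Feat {
        final String type;
        final int min;
        final int max;

        Feat(String type, int min, int max) {
            this.type = type;
            this.min = min;
            this.max = max;
        }
    }

    /** One-method filter, mirroring the accept(...) shape. */
    interface Filter {
        boolean accept(Feat f);
    }

    static Filter byType(String type) {
        return f -> f.type.equals(type);
    }

    static Filter overlaps(int min, int max) {
        return f -> f.min <= max && min <= f.max;
    }

    static Filter and(Filter a, Filter b) {
        return f -> a.accept(f) && b.accept(f);
    }

    static Filter or(Filter a, Filter b) {
        return f -> a.accept(f) || b.accept(f);
    }

    static Filter not(Filter a) {
        return f -> !a.accept(f);
    }

    /** Returns the features accepted by the filter, in input order. */
    static List<Feat> select(List<Feat> feats, Filter filter) {
        List<Feat> out = new ArrayList<>();
        for (Feat f : feats) {
            if (filter.accept(f)) {
                out.add(f);
            }
        }
        return out;
    }

    static List<Feat> example() {
        return Arrays.asList(
                new Feat("promoter", 1, 9),
                new Feat("exon", 10, 50),
                new Feat("intron", 51, 199),
                new Feat("exon", 200, 260));
    }

    public static void main(String[] args) {
        Filter exonsNearStart = and(byType("exon"), overlaps(1, 100));
        System.out.println(select(example(), exonsNearStart).size());      // prints 1
        System.out.println(select(example(), not(byType("exon"))).size()); // prints 2
    }
}
```

Because each combinator returns another Filter, arbitrarily deep expressions can be built from a handful of primitive filters.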

Analysis and Distributions

Distributions and Counts

The Distribution and Count interfaces are from the org.biojava.bio.dist package.

Counts are maps from AtomicSymbols to counts.

Distributions are maps from Symbols to frequencies.

Distributions

Distributions are central to analysis.

Map Symbols to frequencies.

Can be trained, or weights can be set.

Used heavily in the dp (dynamic programming) package.

HMM transitions and emissions.

Many implementations; frequently used are: SimpleDistribution, OrderNDistribution, UniformDistribution.

Distribution API

Alphabet getAlphabet() The alphabet from which this spectrum emits symbols.

Distribution getNullModel() Retrieve the null model Distribution that this Distribution recognizes.

double getWeight(Symbol s) Return the probability that Symbol s is emitted by this spectrum.

void registerWithTrainer(DistributionTrainerContext dtc) Register this distribution with a training context.

Symbol sampleSymbol() Sample a symbol from this state's probability distribution.

void setNullModel(Distribution nullDist) Set the null model Distribution that this Distribution recognizes.

void setWeight(Symbol s, double w) Set the probability or odds that Symbol s is emitted by this state.

DistributionFactory

Generally a Distribution is created using a DistributionFactory.

The DistributionFactory interface contains a static inner class called DEFAULT that implements DistributionFactory

DistributionFactory df = DistributionFactory.DEFAULT;
Distribution d = df.createDistribution(dna.getAlphabet());

Distribution Training

Distributions can be trained on observed sequences using a DistributionTrainerContext.

One or more Distributions can be registered with the DTC.

//register the Distributions with the trainer
dtc.registerDistribution(dnaDist);

DistributionTrainerContext

A DistributionTrainer is assigned to each registered Distribution by the DTC.

If unusual training behaviour is required you can register your own DistributionTrainer at the same time.

The DTC can also add pseudocounts if needed.

Ambiguities are automagically handled. Counts are split according to the null model.
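One plausible reading of "counts are split according to the null model" is sketched below: a count for an ambiguous symbol is shared among the atomic symbols it may stand for, in proportion to their null-model weights, and the totals (plus any pseudocount) are then normalised into frequencies. The class and all method names are hypothetical and the arithmetic is an assumption made for illustration. It is not taken from BioJava's source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TrainingSketch {
    private final Map<Character, Double> counts = new LinkedHashMap<>();
    private final Map<Character, Double> nullModel;

    TrainingSketch(Map<Character, Double> nullModel) {
        this.nullModel = nullModel;
        for (Character c : nullModel.keySet()) {
            counts.put(c, 0.0);
        }
    }

    /** Uniform null model over a, c, g, t (an assumption of this sketch). */
    static Map<Character, Double> uniformDna() {
        Map<Character, Double> m = new LinkedHashMap<>();
        for (char c : "acgt".toCharArray()) {
            m.put(c, 0.25);
        }
        return m;
    }

    /**
     * Adds a count for a symbol given as the atomic symbols it may match,
     * e.g. "a" for adenine, "ct" for y, "acgt" for n. The count is split
     * in proportion to the null-model weights of the matched symbols.
     */
    void addCount(String matches, double count) {
        double total = 0.0;
        for (char c : matches.toCharArray()) {
            total += nullModel.get(c);
        }
        for (char c : matches.toCharArray()) {
            counts.merge(c, count * nullModel.get(c) / total, Double::sum);
        }
    }

    /** Normalises counts (each increased by the pseudocount) into frequencies. */
    Map<Character, Double> train(double pseudocount) {
        double sum = 0.0;
        for (double v : counts.values()) {
            sum += v + pseudocount;
        }
        Map<Character, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<Character, Double> e : counts.entrySet()) {
            weights.put(e.getKey(), (e.getValue() + pseudocount) / sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        TrainingSketch t = new TrainingSketch(uniformDna());
        t.addCount("a", 1.0);
        t.addCount("a", 1.0);
        t.addCount("ct", 1.0);   // y: half a count each to c and t
        t.addCount("acgt", 1.0); // n: a quarter of a count to each base
        System.out.println(t.train(0.0));
    }
}
```

In this run the four observations give a 2.25 counts, c and t 0.75 each, and g 0.25, so a ends up with weight 2.25 / 4.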

Training Example

      //make a DNA SymbolList
      SymbolList dna = DNATools.createDNA("atcgctagcgtyagcntatsggca");

      //get a DistributionTrainerContext
      DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();

      //make the Distribution
      Distribution dnaDist =
          DistributionFactory.DEFAULT.createDistribution(dna.getAlphabet());

      //register the Distribution with the trainer
      dtc.registerDistribution(dnaDist);

      for(int j = 1; j <= dna.length(); j++){
        dtc.addCount(dnaDist, dna.symbolAt(j), 1.0);
      }

      //train the Distribution
      dtc.train();

setWeight() Example

FiniteAlphabet a = DNATools.getDNA();
Distribution d = DistributionFactory.DEFAULT.createDistribution(a);

//set the weight of each symbol
d.setWeight(DNATools.a(),0.3);
d.setWeight(DNATools.c(),0.2);
d.setWeight(DNATools.g(),0.2);
d.setWeight(DNATools.t(),0.3);

DistributionTools

DistributionTools holds static methods for creating and manipulating Distributions.

Tasks include:

Equal emission spectra?

Shannon Entropy, information, KL Distance.

Generate biased sequences.

Make a Distribution[] from an Alignment (each Distribution represents one position in an Alignment).

Average two or more Distributions.

Randomize a Distribution.

Make a Distribution from a Count.
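The entropy-style measures named above have standard textbook definitions, sketched here on plain probability arrays. The class and method names are hypothetical, and the choice of bits (log base 2) and of a uniform background for the information content are assumptions of this sketch. It does not reproduce the DistributionTools methods or their conventions.

```java
public class EntropySketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Shannon entropy in bits; zero-probability entries contribute nothing. */
    static double shannonEntropy(double[] p) {
        double h = 0.0;
        for (double x : p) {
            if (x > 0.0) {
                h -= x * log2(x);
            }
        }
        return h;
    }

    /** Information content in bits relative to a uniform background: log2(n) - H(p). */
    static double information(double[] p) {
        return log2(p.length) - shannonEntropy(p);
    }

    /** Kullback-Leibler divergence D(p||q) in bits; assumes q[i] > 0 wherever p[i] > 0. */
    static double klDivergence(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * log2(p[i] / q[i]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Weights for a, c, g, t taken from the XML Output slide of this deck.
        double[] trained = {
            0.32178516910737204, 0.04596199299395364,
            0.1405504188012911, 0.4917024190973832
        };
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        System.out.println("entropy (bits): " + shannonEntropy(trained));
        System.out.println("information (bits): " + information(trained));
        System.out.println("KL vs uniform (bits): " + klDivergence(trained, uniform));
    }
}
```

Against a uniform background the information content and the KL divergence coincide, which is a quick sanity check on both functions.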

Serialization of Distributions

Distributions are Serializable (write to and read from binary; RMI).

XMLDistributionWriter: Write any Distribution to a stream in XML format.

XMLDistributionReader (SAXParser): Read any Distribution from an XML stream.

XML Output

<?xml version="1.0" ?>

<Distribution type="Distribution">

  <alphabet name="DNA" />

  <weight sym="adenine" prob="0.32178516910737204" />

  <weight sym="cytosine" prob="0.04596199299395364" />

  <weight sym="guanine" prob="0.1405504188012911" />

  <weight sym="thymine" prob="0.4917024190973832" />

</Distribution>

What Else??

Dynamic Programming (HMMs)

Bibliography

Alignments

Blast and Fasta parsing

What Else??

BioSQL support

GUI components

Chromatograms

Molecular Biology (pI, mass, restriction enzymes)

Molecular Structure

