Date post: | 26-Dec-2015 |
Upload: | brenda-ball |
BioJava Core API
Java for Bioinformatics?
Cross platform means develop on one platform deploy on any.
Widely accepted industry standard. Lots of support libraries for modern
technologies (XML, WebServices, JDBC).
Scales well from small to industrial strength enterprise sized programs.
Object Oriented.
Rapid development due to: very strict types, simple clear syntax, exception handling and recovery, cross platform, extensive class library, code reuse.
What is BioJava?
A collection of Java objects that represent and manipulate biological data
Not a program, rather a programming library
Open source (LGPL) open for all development, even commercial. Not ‘sticky’ or ‘viral’.
Collection of objects to assist bioinformatics research
Started at EBI/Sanger in 1998 by Matthew Pocock and Thomas Down
25+ developers have contributed (5 core)
BioJava has acquired 1100+ classes, 130,000+ lines of code.
Uses CVS version control, JUnit testing and ANT builds.
It now has a fairly stable API. 76 packages!
Where is BioJava
Home Page www.biojava.org
BioJava in Anger http://www.biojava.org/docs/bj_in_anger/
Mailing Lists [email protected] [email protected]
Nightly Builds http://www.derkholm.net/autobuild/
Obtaining BioJava
Download http://www.biojava.org/download/ Get binaries, source and docs
biojava-live (requires cvs):
cvs -d :pserver:[email protected]:/home/repository/biojava login
(Password is ‘cvs’)
cvs -d :pserver:[email protected]:/home/repository/biojava checkout biojava-live
cvs update -Pd
Compiling biojava-live
Requires the ANT build tool http://jakarta.apache.org/ant/
The ANT tool will use build.xml to: arrange source code, compile source, make the jar file, make Java docs, build demos, build and run tests.
Change to biojava-live; type ant
Unit testing requires JUnit http://junit.sourceforge.net/
Setting up BioJava
Put the following JAR files on your class path:
biojava.jar
bytecode-0.92.jar
commons-cli.jar
commons-collections-2.1.jar
commons-dbcp-1.1.jar
commons-pool-1.1.jar
Object-Oriented Patterns and BioJava Design
BioJava Design
Uses some reasonably “advanced” concepts:
Design by Interface
Protected or Private constructors
Factory classes and methods
Flyweight/Singleton objects
Interfaces Hide Implementation
In BioJava there are several implementations of the Distribution interface.
Any can be legally returned by a method that returns a Distribution (the returning method may even return different ones depending on the situation).
Any can be legally used as an argument to a method that requires a Distribution.
All are guaranteed to contain a minimal set of common methods.
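The point about interchangeable implementations can be sketched in plain Java, without BioJava on the class path. The names `Weights`, `UniformWeights`, `TableWeights` and `choose` are hypothetical stand-ins invented for this illustration; they are not BioJava classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an interface such as Distribution.
interface Weights {
    double weightOf(char token);
}

// Implementation 1: every token of a fixed-size alphabet has the same weight.
class UniformWeights implements Weights {
    private final int alphabetSize;
    UniformWeights(int alphabetSize) { this.alphabetSize = alphabetSize; }
    public double weightOf(char token) { return 1.0 / alphabetSize; }
}

// Implementation 2: weights are looked up in a table.
class TableWeights implements Weights {
    private final Map<Character, Double> table = new HashMap<>();
    TableWeights put(char token, double w) { table.put(token, w); return this; }
    public double weightOf(char token) { return table.getOrDefault(token, 0.0); }
}

class InterfaceSketch {
    // The caller only sees Weights; which class comes back depends on the situation.
    static Weights choose(boolean uniform) {
        if (uniform) return new UniformWeights(4);
        return new TableWeights().put('a', 0.3).put('c', 0.2).put('g', 0.2).put('t', 0.3);
    }

    public static void main(String[] args) {
        System.out.println(choose(true).weightOf('a'));
        System.out.println(choose(false).weightOf('a'));
    }
}
```

Code written against `Weights` keeps working if a third implementation is added later, which is the property the slide attributes to the Distribution interface.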
Flyweight and Singleton Objects
A Singleton is a class with only one instance and only one access point.
A Singleton will need a Private constructor and may be static (e.g. AlphabetManager).
A Flyweight object uses sharing to support large numbers of fine-grained objects efficiently.
For example in BioJava there is only ever one instance of the DNA Symbol “A”. A sequence of A’s is really just a list of pointers to that one object.
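A minimal plain-Java sketch of the flyweight idea follows. `Base`, `forToken` and `parse` are hypothetical names chosen for the illustration and do not reproduce how BioJava implements its Symbols.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical flyweight: one shared instance per DNA residue.
final class Base {
    static final Base A = new Base("adenine");
    static final Base C = new Base("cytosine");
    static final Base G = new Base("guanine");
    static final Base T = new Base("thymine");

    private final String name;
    private Base(String name) { this.name = name; } // no construction from outside
    String getName() { return name; }

    // Access point that hands out the shared instances.
    static Base forToken(char token) {
        switch (Character.toLowerCase(token)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            default: throw new IllegalArgumentException("Not a DNA token: " + token);
        }
    }
}

class FlyweightSketch {
    // Build a "sequence" as a list of references to the shared objects.
    static List<Base> parse(String s) {
        List<Base> out = new ArrayList<>(s.length());
        for (char ch : s.toCharArray()) out.add(Base.forToken(ch));
        return out;
    }

    public static void main(String[] args) {
        List<Base> seq = parse("aaaa");
        // Every position refers to the same object.
        System.out.println(seq.get(0) == seq.get(3));
    }
}
```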
Factory and Static methods
Sometimes it is useful to prevent a user from directly constructing an object via a constructor:
If the construction is complex.
If the choice of the optimal implementation is best left to the API developer.
If important resources are best protected from end users, e.g. Singletons/Flyweights.
Rather than instantiating the object via its constructor, a static method or Factory object is used.
Examples
Static method: FiniteAlphabet dna = DNATools.getDNA();
Static field: DistributionFactory df = DistributionFactory.DEFAULT;
Factory method: Distribution d = df.createDistribution(dna);
Two Levels of BioJava
Macro type programming: Tools classes (SeqIOTools, DistributionTools etc). Static methods for common tasks.
Full programming: lots of customizations and ‘plug and play’ possible. More exposure to the sharp edges of the API. Less documentation.
Alphabets, Symbols and Sequences
Symbols
In BioJava the DNA residue “A” is an object.
In Bioperl “A” would be a String.
The “A” object is part of the sequence, not the sequence.
“A” from DNA is not equal to “A” from RNA or “A” from Protein.
Why not Strings?
DNA A != RNA A != Protein A, whereas for Strings "A".equals("A") is always true.
The DNA Alphabet also contains the ambiguity symbols K, Y, W, S, R, M, B, D, H, V, N.
Object Y contains C and T; the String “Y” doesn’t contain anything.
Translation HashMaps with Strings are flawed: in BioJava GGN translates to GLY, whereas the String GGN maps to null.
A fully redundant String to String HashMap translation table requires 4096 keys!
Symbols are Canonical
DNATools.a() == DNATools.a();
There is only one instance of ‘a’.
DNATools.a().equals(DNATools.a());
ProteinTools.a() != DNATools.a();
Even on remote JVMs! During serialization Alphabet indexing is transient and ‘reconnected’ via readResolve() methods.
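How a readResolve() method keeps an object canonical across serialization can be sketched with the standard Java serialization API alone. `CanonicalSymbol` and `roundTrip` are hypothetical names for this illustration; BioJava's own classes are not used and its actual mechanism may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical canonical symbol that stays a single instance after deserialization.
final class CanonicalSymbol implements Serializable {
    private static final long serialVersionUID = 1L;
    static final CanonicalSymbol A = new CanonicalSymbol("adenine");

    private final String name;
    private CanonicalSymbol(String name) { this.name = name; }

    // Called by the serialization machinery after reading: swap the freshly
    // read copy for the canonical instance so that == keeps working.
    private Object readResolve() {
        return "adenine".equals(name) ? A : this;
    }
}

class ReadResolveSketch {
    // Serialize to a byte array and read the object back.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(CanonicalSymbol.A) == CanonicalSymbol.A);
    }
}
```

Without the readResolve() method the round trip would produce a second, distinct object and the identity comparison would fail.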
Alphabets
A set of Symbols.
Alphabets can be infinite: DoubleAlphabet, IntegerAlphabet.
Some Alphabets have a finite number of Symbols: DNA, RNA etc.
Alphabet and FiniteAlphabet interfaces
org.biojava.bio.Alphabet
boolean contains(Symbol s) Returns whether or not this Alphabet contains the symbol.
List getAlphabets() Return an ordered List of the alphabets which make up a compound alphabet.
Symbol getAmbiguity(java.util.Set syms) Get a symbol that represents the set of symbols in syms.
Symbol getGapSymbol() Get the 'gap' ambiguity symbol that is most appropriate for this alphabet
String getName() Get the name of the alphabet.
Symbol getSymbol(java.util.List rl) Get a symbol from the Alphabet which corresponds to the specified ordered list of symbols.
SymbolTokenization getTokenization(java.lang.String name) Get a SymbolTokenization by name.
void validate(Symbol s) Throws a precanned IllegalSymbolException if the symbol is not contained within this Alphabet.
org.biojava.bio.FiniteAlphabet
In addition to the previous methods
void addSymbol(Symbol s) Adds a symbol to this Alphabet
Iterator iterator() Retrieve an Iterator over the Symbols in this Alphabet.
void removeSymbol(Symbol s) Remove a symbol from this alphabet.
int size() The number of symbols in the alphabet.
The Default Alphabets
DNA (a,c,g,t)
RNA (a,c,g,u)
PROTEIN (all amino acids including ‘Sel’)
PROTEIN-TERM (all PROTEIN plus “*”)
STRUCTURE (PDB structure symbols)
Alphabet of all integers (Infinite Alphabet); can generate SubIntegerAlphabets
Alphabet of all doubles (Infinite Alphabet)
Getting the common Alphabets
import org.biojava.bio.symbol.*;
import java.util.*;
import org.biojava.bio.seq.*;

public class AlphabetExample {
    public static void main(String[] args) {
        Alphabet dna, rna, prot;

        //get the DNA alphabet by name
        dna = AlphabetManager.alphabetForName("DNA");
        //get the RNA alphabet by name
        rna = AlphabetManager.alphabetForName("RNA");
        //get the Protein alphabet by name
        prot = AlphabetManager.alphabetForName("PROTEIN");
        //get the protein alphabet that includes the * termination Symbol
        prot = AlphabetManager.alphabetForName("PROTEIN-TERM");

        //get those same Alphabets from the Tools classes
        dna = DNATools.getDNA();
        rna = RNATools.getRNA();
        prot = ProteinTools.getAlphabet();
        //or the one with the * symbol
        prot = ProteinTools.getTAlphabet();
    }
}
SymbolLists are made of Symbols
org.biojava.bio.symbol.SymbolList
A sequence of Symbols from the same Alphabet.
Uses biological coordinates from 1 to length, cf. String from 0 to length-1.
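The difference between the two coordinate systems is easy to get wrong, so here is a small plain-Java sketch. The helper names `symbolAt` and `subStr` are borrowed from the SymbolList method names for readability, but these helpers are hypothetical and operate on an ordinary String.

```java
class CoordinateSketch {
    // Hypothetical helper: 1-based access on top of a 0-based String.
    static char symbolAt(String seq, int index) {
        if (index < 1 || index > seq.length()) {
            throw new IndexOutOfBoundsException("Index must be 1.." + seq.length());
        }
        return seq.charAt(index - 1);
    }

    // 1-based and inclusive at both ends, like a biological range.
    static String subStr(String seq, int start, int end) {
        return seq.substring(start - 1, end);
    }

    public static void main(String[] args) {
        String seq = "atgccg";
        System.out.println(symbolAt(seq, 1));   // first residue
        System.out.println(subStr(seq, 2, 4));  // residues 2 to 4 inclusive
    }
}
```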
Doesn’t this waste memory?
A SymbolList is not really a List of Symbol Objects, rather a List of Object references.
Still a bit heavier than a char[] but not serious.
[Figure: the SymbolList AACGTGGGTTCCAACT drawn with the four Symbol objects A, C, G and T.]
The Bigger Picture
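[Figure: the Symbol objects A, C, G, T and the SymbolList AACGTGGGTTCCAACT, shown together with the AlphabetManager and its “DNA” and “Protein” alphabets.]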
The SymbolList interface
void edit(Edit edit) Apply an edit to the SymbolList as specified by the edit object.
Alphabet getAlphabet() The alphabet that this SymbolList is over.
Iterator iterator() An Iterator over all Symbols in this SymbolList.
int length() The number of symbols in this SymbolList.
String seqString() Stringify this symbol list.
SymbolList subList(int start, int end) Return a new SymbolList for the symbols start to end inclusive.
String subStr(int start, int end) Return a region of this symbol list as a String.
Symbol symbolAt(int index) Return the symbol at index, counting from 1.
List toList() Returns a List of symbols.
String to SymbolList
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class StringToSymbolList {
    public static void main(String[] args) {
        try {
            //create a DNA SymbolList from a String
            SymbolList dna = DNATools.createDNA("atcggtcggctta");
            //create a RNA SymbolList from a String
            SymbolList rna = RNATools.createRNA("auugccuacauaggc");
            //create a Protein SymbolList from a String
            SymbolList aa = ProteinTools.createProtein("AGFAVENDSA");
        } catch (IllegalSymbolException ex) {
            //this will happen if you use a character in one of your strings
            //that is not an accepted IUB character for that Alphabet
            ex.printStackTrace();
        }
    }
}
SymbolList to String
import org.biojava.bio.symbol.*;

public class SymbolListToString {
    public static void main(String[] args) {
        SymbolList sl = null;
        //code here to instantiate sl

        //convert sl into a String
        String s = sl.seqString();
    }
}
The Sequence Interface
A Sequence is a SymbolList with more information.
In addition to Annotatable and SymbolList:
String getName() The name of this sequence.
String getURN() A Uniform Resource Identifier (URI) which identifies the sequence represented by this object.
Also implements FeatureHolder which allows addition of Feature Objects.
Quickly generate a Sequence
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class StringToSequence {
    public static void main(String[] args) {
        try {
            //create a DNA sequence with the name dna_1
            Sequence dna = DNATools.createDNASequence("atgctg", "dna_1");
            //create an RNA sequence with the name rna_1
            Sequence rna = RNATools.createRNASequence("augcug", "rna_1");
            //create a Protein sequence with the name prot_1
            Sequence prot = ProteinTools.createProteinSequence("AFHS", "prot_1");
        } catch (IllegalSymbolException ex) {
            //an exception is thrown if you use a non IUB symbol
            ex.printStackTrace();
        }
    }
}
More Complex Symbols and Alphabets
Ambiguity Symbols
Ambiguous or Fuzzy data is a fact of life, especially with sequencing.
DNA traces can contain symbols such as n, r, w, v, h, k, y etc.
In BioJava DNA symbols a, c, g, t are AtomicSymbols.
Ambiguous symbols like y are BasisSymbols.
BasisSymbols
A BasisSymbol may be represented as a list of one or more Symbols.
BasisSymbol extends Symbol.
Ambiguity Symbols are always BasisSymbols.
getSymbols() The list of symbols that this symbol is composed from.
AtomicSymbols
AtomicSymbols are not ambiguous. They cannot be further divided into
Symbols that are valid members of the parent Alphabet.
In the case of compound Alphabets they can be divided into valid Symbols from component Alphabets.
The AtomicSymbol interface extends BasisSymbol but adds no new methods only behaviour contracts.
AtomicSymbol instances guarantee that getMatches() returns an Alphabet containing just that Symbol and each element of the List returned by getSymbols() is also atomic.
Atomic and Basis
[Figure: the SymbolList AATW over the “DNA” alphabet from the AlphabetManager; A and T are labelled AtomicSymbols and W is labelled BasisSymbol.]
Translating Ambiguity
BioJava handles translation of ambiguity very smoothly.
DNA ‘n’ = [a,c,g,t]
Transcribes to RNA ‘n’ = [a,c,g,u]
ggn translates to Gly
agn translates to [Ser, Arg]
Most protein ambiguities have no ‘token’ and are printed as ‘X’
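The behaviour described for ggn and agn can be reproduced with a small plain-Java sketch that expands an ambiguous codon into every concrete codon it matches and collects the resulting amino acids. The class, the `EXPANSION` map and the deliberately partial `CODONS` table are hypothetical and written in DNA letters; this is not how BioJava implements translation, only an illustration of why an object that "contains" other symbols translates cleanly while a plain String key does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AmbiguityTranslationSketch {
    // A few symbols, each expanded to the bases it stands for.
    static final Map<Character, String> EXPANSION = new HashMap<>();
    // Partial codon table: only the gg* and ag* codons needed for the demo.
    static final Map<String, String> CODONS = new HashMap<>();
    static {
        EXPANSION.put('a', "a"); EXPANSION.put('c', "c");
        EXPANSION.put('g', "g"); EXPANSION.put('t', "t");
        EXPANSION.put('y', "ct"); EXPANSION.put('n', "acgt");
        for (char b : "acgt".toCharArray()) CODONS.put("gg" + b, "Gly");
        CODONS.put("aga", "Arg"); CODONS.put("agg", "Arg");
        CODONS.put("agc", "Ser"); CODONS.put("agt", "Ser");
    }

    // Expand an ambiguous codon into every concrete codon it matches.
    static List<String> expand(String codon) {
        List<String> out = new ArrayList<>();
        out.add("");
        for (char ch : codon.toCharArray()) {
            List<String> next = new ArrayList<>();
            for (String prefix : out) {
                for (char b : EXPANSION.get(ch).toCharArray()) next.add(prefix + b);
            }
            out = next;
        }
        return out;
    }

    // Translate by collecting the amino acids of all matching codons.
    // Only codons present in the partial table above are supported.
    static Set<String> translate(String codon) {
        Set<String> aminoAcids = new TreeSet<>();
        for (String c : expand(codon)) aminoAcids.add(CODONS.get(c));
        return aminoAcids;
    }

    public static void main(String[] args) {
        System.out.println(translate("ggn"));  // a single amino acid
        System.out.println(translate("agn"));  // two possible amino acids
        System.out.println(CODONS.get("ggn")); // plain String lookup finds nothing
    }
}
```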
CrossProduct Alphabets
A CrossProductAlphabet is a combination of two or more Alphabets.
Any type of CrossProductAlphabet is possible
Dimers (DNA x DNA)
Codon (DNA x DNA x DNA)
Conditional ((DNA x DNA) x DNA)
Mixed ((DNA x DNA x DNA) x PROTEIN)
Finite and Compound Alphas
[Figure: the DNA AtomicSymbols A, C, G, T and the sequence [AAC][GTG]GGTTCCAACT; ACA and GTG are labelled (DNA x DNA x DNA) AtomicSymbols and GNG is labelled a (DNA x DNA x DNA) BasisSymbol.]
What are they good for?
Codon Symbols (DNA x DNA x DNA).
Many analysis classes such as Count and Distribution use Symbol as an argument. A hexamer can be an AtomicSymbol.
Phred is DNA x Integer.
1st and higher order Markov Models use CrossProductAlphabets.
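What a cross product of alphabets contains can be sketched in plain Java as a Cartesian product of lists. The `crossProduct` method is hypothetical and only enumerates the compound symbols; it does not model BioJava's canonical Symbol objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CrossProductSketch {
    // Cartesian product of the given alphabets; each result is one compound symbol.
    static List<List<String>> crossProduct(List<List<String>> alphabets) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (List<String> alphabet : alphabets) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : result) {
                for (String symbol : alphabet) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(symbol);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dna = Arrays.asList("a", "c", "g", "t");
        System.out.println(crossProduct(Collections.nCopies(2, dna)).size()); // dimers
        System.out.println(crossProduct(Collections.nCopies(3, dna)).size()); // codons
    }
}
```

With four atomic DNA symbols this gives 16 dimers and 64 codons, each of which can then be treated as a single symbol by counting or distribution code.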
How do I make a CrossProductAlphabet?
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class CrossProduct {
    public static void main(String[] args) {
        //make a CrossProductAlphabet from a List
        List l = Collections.nCopies(3, DNATools.getDNA());
        Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

        //get the same Alphabet by name
        Alphabet codon2 =
            AlphabetManager.generateCrossProductAlphaFromName("(DNA x DNA x DNA)");

        //show that the two Alphabets are canonical
        System.out.println(codon == codon2);
    }
}
Making Triplet Views on a SymbolList
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class CodonView {
    public static void main(String[] args) {
        try {
            //make a DNA SymbolList
            SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
            System.out.println("Length of dna " + dna.length());

            //get a Codon View (window size of three)
            SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
            System.out.println("Length of codons " + codons.length());

            //get a Triplet View
            SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
            System.out.println("Length of triplets " + triplets.length());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
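The two views differ in whether the windows overlap. The plain-Java sketch below works on Strings and assumes that the windowed view uses non-overlapping windows while the order-N view slides one position at a time; the method names `windowed` and `orderN` are hypothetical helpers, not the BioJava methods.

```java
import java.util.ArrayList;
import java.util.List;

class WindowSketch {
    // Non-overlapping windows: positions 1-3, 4-6, ...
    static List<String> windowed(String seq, int size) {
        if (seq.length() % size != 0) {
            throw new IllegalArgumentException("Length is not a multiple of " + size);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= seq.length(); i += size) {
            out.add(seq.substring(i, i + size));
        }
        return out;
    }

    // Overlapping windows: positions 1-3, 2-4, 3-5, ...
    static List<String> orderN(String seq, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= seq.length(); i++) {
            out.add(seq.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3)); // 12 / 3 = 4 codons
        System.out.println(orderN(dna, 3));   // 12 - 3 + 1 = 10 triplets
    }
}
```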
Getting a Symbol for a Codon
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class MakeATG {
    public static void main(String[] args) {
        //make a CrossProductAlphabet from a List
        List l = Collections.nCopies(3, DNATools.getDNA());
        Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

        //get the codon made of atg
        List syms = new ArrayList(3);
        syms.add(DNATools.a());
        syms.add(DNATools.t());
        syms.add(DNATools.g());

        Symbol atg = null;
        try {
            atg = codon.getSymbol(syms);
        } catch (IllegalSymbolException ex) {
            //used Symbol from Alphabet that is not a component of codon
            ex.printStackTrace();
        }
        System.out.println("Name of atg: " + atg.getName());
    }
}
Breaking a Codon into its Parts
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class BreakingComponents {
    public static void main(String[] args) {
        //make the 'codon' alphabet
        List l = Collections.nCopies(3, DNATools.getDNA());
        Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);

        //get the first symbol in the alphabet
        Iterator iter = ((FiniteAlphabet)alpha).iterator();
        AtomicSymbol codon = (AtomicSymbol)iter.next();
        System.out.print(codon.getName() + " is made of: ");

        //break it into a list of its components
        List symbols = codon.getSymbols();
        for (int i = 0; i < symbols.size(); i++) {
            if (i != 0) System.out.print(", ");
            Symbol sym = (Symbol)symbols.get(i);
            System.out.print(sym.getName());
        }
    }
}
Basic Sequence Operations
Getting a section of a SymbolList
symbolAt(int i) Returns a Symbol
subList(int min, int max) Returns a SymbolList
subStr(int min, int max) Returns the subsection tokenized to a String
Transcription
In BioJava DNA sequences and RNA sequences are from different Alphabets. To convert between them:
//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atgccgaatcgtaa");

//convert it to RNA
SymbolList rna = DNATools.toRNA(dna);

//just to prove it worked
System.out.println(rna.seqString()); //augccgaaucguaa

//biological transcription (ie copy and reverse strand)
rna = DNATools.transcribeToRNA(dna);   //5’ atgccgaatcgtaa 3’
System.out.println(rna.seqString());   //5’ uuacgauucggcau 3’
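The two conversions give different results, which a plain-Java sketch on Strings makes explicit. `toRna` and `transcribe` are hypothetical helpers written for this illustration; they mimic the outputs shown in the comments above and are not the BioJava implementations.

```java
class TranscriptionSketch {
    // Same strand, same direction: only t is replaced by u.
    static String toRna(String dna) {
        return dna.replace('t', 'u');
    }

    // Reverse complement of the DNA, written in RNA letters.
    static String transcribe(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'a': sb.append('u'); break;
                case 't': sb.append('a'); break;
                case 'c': sb.append('g'); break;
                case 'g': sb.append('c'); break;
                default:
                    throw new IllegalArgumentException("Not a DNA token: " + dna.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dna = "atgccgaatcgtaa";
        System.out.println(toRna(dna));      // augccgaaucguaa
        System.out.println(transcribe(dna)); // uuacgauucggcau
    }
}
```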
Reverse Complement
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class ReverseComplement {
    public static void main(String[] args) throws Exception {
        SymbolList forward = DNATools.createDNA("atcgctagcgatcg");

        //two step
        SymbolList reverse = SymbolListViews.reverse(forward);
        SymbolList revc1 = DNATools.complement(reverse);

        //one step
        SymbolList revc2 = DNATools.reverseComplement(forward);

        //test for equivalence
        System.out.println(revc1.equals(revc2));
    }
}
Translation
RNATools contains the “Universal” RNA to Protein TranslationTable.
Standard procedure is transcribe DNA to RNA and then translate.
Translation Example
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class Translate {
    public static void main(String[] args) {
        try {
            //create a DNA SymbolList
            SymbolList symL = DNATools.createDNA("atggccattgaatga");

            //transcribe to RNA (same call as in the Transcription example)
            symL = DNATools.toRNA(symL);

            //translate to protein
            symL = RNATools.translate(symL);

            //prove that it worked
            System.out.println(symL.seqString());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Sequence I/O
Don’t ever write another Parser
If you can avoid it! BioJava supports:
Genbank, GenPept, RefSeq, EMBL, SwissProt, PDB, Fasta, ABI, LocusLink, Unigene (requires Java 1.4)
GAME, AGAVE
Blast, Fasta, HMMER (models and results), BlastXML, MEME, Phred
OBDA, BioIndex, BioSQL, DAS, GFF, XFF
Ensembl (with biojava-ensembl package)
StAX/Tag value
RMI and Serialization
Simple I/O
Most of BioJava’s simpler I/O operations are conveniently wrapped up behind static methods from the SeqIOTools class.
SeqIOTools can read and write:
Fasta (protein or DNA)
EMBL
GenBank (flat file and XML)
SwissProt
GenPept
MSF (protein or DNA)
Fasta Alignments
SeqIOTools Reader Methods
SequenceIterator i = SeqIOTools.readGenbank(br);
SequenceIterator i = SeqIOTools.readGenpept(br);
SequenceIterator i = SeqIOTools.readSwissprot(br);
SequenceIterator i = SeqIOTools.readEmbl(br);
etc…
SequenceIterator i = (SequenceIterator) SeqIOTools.fileToBiojava("fasta", "dna", br);
Alignment a = (Alignment) SeqIOTools.fileToBiojava("MSF", "rna", br);
Features, Locations, Annotations
Features and Annotations
Sequence data often comes with added information about the various properties of the sequence (Genbank, SwissProt etc).
BioJava divides this information into global properties (Annotations) and Localized properties (Features).
Annotatable
Annotatable is a “mix-in” interface that indicates the implementing object contains an Annotation object.
It defines one method. Annotation getAnnotation();
Annotations
org.biojava.bio.Annotation
Annotations are used for global properties: species, accession number, xrefs, date, publication.
Key–value maps. Key and value are objects but are almost always Strings.
Annotation.EMPTY_ANNOTATION: a static convenience object; a good placeholder that avoids null pointer exceptions; immutable.
Annotation API
Map asMap() Return a map that contains the same key/values as this Annotation.
boolean containsProperty(java.lang.Object key) Returns whether the property is defined.
Object getProperty(java.lang.Object key) Retrieve the value of a property by key.
Set keys() Get a set of key objects.
void removeProperty(java.lang.Object key) Delete a property
void setProperty(java.lang.Object key, java.lang.Object value) Set the value of a property.
FeatureHolder
FeatureHolder is another “mix-in” interface which allows the implementing object to hold Features.
Sequence implements FeatureHolder.
Features are created by FeatureHolders.
FeatureHolders can be filtered.
FeatureHolder methods
boolean containsFeature(Feature f) Check if the feature is present in this holder.
int countFeatures() Count how many features are contained.
Feature createFeature(Feature.Template ft) Create a new Feature, and add it to this FeatureHolder.
Iterator features() Iterate over the features in no well defined order.
FeatureHolder filter(FeatureFilter filter) Query this set of features using a supplied FeatureFilter.
FeatureHolder filter(FeatureFilter fc, boolean recurse) Return a new FeatureHolder that contains all of the children of this one that passed the filter fc.
FeatureFilter getSchema() Return a schema-filter for this FeatureHolder.
void removeFeature(Feature f) Remove a feature from this FeatureHolder.
Features are Annotatable
Features implement Annotatable.
Can hold an annotation.
Global annotations of a Feature: /note, /db_xref etc.
Features may be nested
Features implement FeatureHolder!
Therefore Features may hold nested Features (c.f. the AWT Menu is a MenuItem), e.g. a gene has exons and introns.
Filtering can be recursive.
A Feature cannot hold itself (directly or indirectly).
Location API
Locations are objects that specify a minimum and maximum bound on a region of sequence.
Contains some useful methods, particularly getMin() and getMax().
Many methods have been deprecated and are now delegated to LocationTools.
LocationTools is the best place to get new instances of a Location.
PointLocation, RangeLocation, CircularLocation, CompoundLocation.
LocationTools
static boolean areEqual(Location locA, Location locB) Return whether two locations are equal.
static boolean contains(Location locA, Location locB) Return true iff all indices in locB are also contained by locA.
static Location flip(Location loc, int len) Flips a location relative to a length.
static Location intersection(Location locA, Location locB) Return the intersection of two locations.
static CircularLocation makeCircularLocation(int min, int max, int seqLength) A simple method to generate a RangeLocation wrapped in a CircularLocation
static Location makeLocation(int min, int max) Return a contiguous Location from min to max.
static boolean overlaps(Location locA, Location locB) Determines whether the locations overlap or not.
static Location subtract(Location x, Location y) Subtract one location from another.
static Location union(java.util.Collection locs) The n-way union of a Collection of locations.
static Location union(Location locA, Location locB) Return the union of two locations.
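The semantics of a few of these operations can be sketched for simple contiguous ranges in plain Java. `SimpleRange` is a hypothetical class with 1-based inclusive bounds; it ignores point, circular and compound locations and is not the BioJava Location implementation.

```java
// Hypothetical contiguous location with 1-based, inclusive bounds.
final class SimpleRange {
    final int min, max;

    SimpleRange(int min, int max) {
        if (min > max) throw new IllegalArgumentException("min > max");
        this.min = min;
        this.max = max;
    }

    // True if every index of other is also in this range.
    boolean contains(SimpleRange other) { return min <= other.min && other.max <= max; }

    // True if the two ranges share at least one index.
    boolean overlaps(SimpleRange other) { return min <= other.max && other.min <= max; }

    // Shared indices, or null when there are none (a simplification:
    // a fuller API would return an empty location object instead).
    SimpleRange intersection(SimpleRange other) {
        if (!overlaps(other)) return null;
        return new SimpleRange(Math.max(min, other.min), Math.min(max, other.max));
    }

    // The part of seq covered by this range.
    String symbols(String seq) { return seq.substring(min - 1, max); }

    @Override public String toString() { return "[" + min + "," + max + "]"; }
}

class RangeSketch {
    public static void main(String[] args) {
        SimpleRange a = new SimpleRange(3, 8);
        SimpleRange b = new SimpleRange(6, 12);
        System.out.println(a.overlaps(b));
        System.out.println(a.intersection(b));
        System.out.println(a.symbols("gcagcuaggcggaaggagc"));
    }
}
```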
Location Example
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class SpecifyRange {
    public static void main(String[] args) {
        try {
            //make a RangeLocation specifying the residues 3-8
            Location loc = LocationTools.makeLocation(3, 8);
            //print the location
            System.out.println("Location: " + loc.toString());

            //make a SymbolList
            SymbolList sl = RNATools.createRNA("gcagcuaggcggaaggagc");
            System.out.println("SymbolList: " + sl.seqString());

            //get the SymbolList specified by the Location
            SymbolList sym = loc.symbols(sl);
            System.out.println("Symbols specified by Location: " + sym.seqString());
        } catch (IllegalSymbolException ex) {
            //illegal symbol used to make sl
            ex.printStackTrace();
        }
    }
}
Filtering Features
FeatureHolders have a filter method that accepts a FeatureFilter as an argument.
Features that are accepted by the FeatureFilter are returned as a new FeatureHolder.
Filtering may be done recursively so that nested Features are subjected to the same FeatureFilter .
FeatureFilters
FeatureFilter is an interface that specifies one method. boolean accept(Feature f)
There are 26 implementations of FeatureFilter in BioJava available as inner classes of the FeatureFilter interface.
Most commonly used are ByType, BySource, StrandFilter, OverlapsLocation, ContainedByLocation.
Also boolean logic filters: And, Or, Not
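A one-method filter interface combined with boolean logic and optional recursion can be sketched in plain Java as follows. `Feat`, `FeatFilter` and the helper methods are hypothetical names for this illustration and only loosely mirror the FeatureFilter design described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical nested feature: a type, an inclusive span and child features.
class Feat {
    final String type;
    final int min, max;
    final List<Feat> children = new ArrayList<>();

    Feat(String type, int min, int max) {
        this.type = type;
        this.min = min;
        this.max = max;
    }

    Feat add(Feat child) { children.add(child); return this; }
}

// One-method filter interface, in the spirit of accept(Feature f).
interface FeatFilter {
    boolean accept(Feat f);
}

class FilterSketch {
    static FeatFilter byType(String type) { return f -> f.type.equals(type); }
    static FeatFilter overlaps(int min, int max) { return f -> f.min <= max && min <= f.max; }
    static FeatFilter and(FeatFilter a, FeatFilter b) { return f -> a.accept(f) && b.accept(f); }
    static FeatFilter not(FeatFilter a) { return f -> !a.accept(f); }

    // Collect accepted features; with recurse set, nested features are tested too.
    static List<Feat> filter(List<Feat> features, FeatFilter criteria, boolean recurse) {
        List<Feat> out = new ArrayList<>();
        for (Feat f : features) {
            if (criteria.accept(f)) out.add(f);
            if (recurse) out.addAll(filter(f.children, criteria, true));
        }
        return out;
    }

    // A gene holding two exons and one intron.
    static List<Feat> demo() {
        Feat gene = new Feat("gene", 1, 900)
            .add(new Feat("exon", 1, 200))
            .add(new Feat("intron", 201, 500))
            .add(new Feat("exon", 501, 900));
        List<Feat> top = new ArrayList<>();
        top.add(gene);
        return top;
    }

    public static void main(String[] args) {
        System.out.println(filter(demo(), byType("exon"), false).size());
        System.out.println(filter(demo(), byType("exon"), true).size());
        System.out.println(filter(demo(), and(byType("exon"), overlaps(400, 600)), true).size());
    }
}
```

The non-recursive exon query finds nothing because the exons are nested inside the gene; the recursive one finds both.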
Analysis and Distributions
Distributions and Counts
The Distribution and Count interfaces are from the org.biojava.bio.dist package.
Counts are maps from AtomicSymbols to counts.
Distributions are maps from Symbols to frequencies.
Distributions
Distributions are central to analysis.
Map Symbols to frequencies.
Can be trained or weights can be set.
Used heavily in the dp (dynamic programming) package: HMM transitions and emissions.
Many implementations; frequently used are SimpleDistribution, OrderNDistribution, UniformDistribution.
Distribution API
Alphabet getAlphabet() The alphabet from which this spectrum emits symbols.
Distribution getNullModel() Retrieve the null model Distribution that this Distribution recognizes.
double getWeight(Symbol s) Return the probability that Symbol s is emitted by this spectrum.
void registerWithTrainer(DistributionTrainerContext dtc) Register this distribution with a training context.
Symbol sampleSymbol() Sample a symbol from this state's probability distribution.
void setNullModel(Distribution nullDist) Set the null model Distribution that this Distribution recognizes.
void setWeight(Symbol s, double w) Set the probability or odds that Symbol s is emitted by this state.
DistributionFactory
Generally a Distribution is created using a DistributionFactory.
The DistributionFactory interface contains a static field called DEFAULT that holds a default implementation of DistributionFactory.
DistributionFactory df = DistributionFactory.DEFAULT;
Distribution d = df.createDistribution(dna.getAlphabet());
Distribution Training
Distributions can be trained on observed sequences using a DistributionTrainerContext.
One or more Distributions can be registered with the DTC.
//register the Distributions with the trainer
dtc.registerDistribution(dnaDist);
DistributionTrainerContext
A DistributionTrainer is assigned to each registered Distribution by the DTC.
If unusual training behaviour is required you can register your own DistributionTrainer at the same time.
The dtc can also add pseudocounts if needed.
Ambiguities are automagically handled. Counts are split according to the null model.
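The combination of counting, pseudocounts and splitting of ambiguous observations can be sketched in plain Java. The sketch assumes a uniform null model, so an ambiguous symbol's count is divided evenly among the atomic symbols it stands for; `train` and `expand` are hypothetical helpers and not the DistributionTrainerContext API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TrainingSketch {
    static final String ATOMS = "acgt";

    // Hypothetical expansion of a few ambiguity codes to atomic symbols.
    static String expand(char token) {
        switch (token) {
            case 'y': return "ct";
            case 's': return "cg";
            case 'n': return "acgt";
            default:
                if (ATOMS.indexOf(token) < 0) {
                    throw new IllegalArgumentException("Unsupported token: " + token);
                }
                return String.valueOf(token);
        }
    }

    // Count symbols, splitting ambiguous ones evenly (uniform null model assumed),
    // start every atomic symbol at the pseudocount, then normalise to frequencies.
    static Map<Character, Double> train(String seq, double pseudocount) {
        Map<Character, Double> counts = new LinkedHashMap<>();
        for (char a : ATOMS.toCharArray()) counts.put(a, pseudocount);
        for (char token : seq.toCharArray()) {
            String atoms = expand(token);
            for (char a : atoms.toCharArray()) {
                counts.merge(a, 1.0 / atoms.length(), Double::sum);
            }
        }
        double total = 0.0;
        for (double c : counts.values()) total += c;
        for (Map.Entry<Character, Double> e : counts.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 'y' contributes half a count to c and half a count to t.
        System.out.println(train("aacy", 0.0));
    }
}
```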
Training Example
//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atcgctagcgtyagcntatsggca");

//get a DistributionTrainerContext
DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();

//make the Distribution
Distribution dnaDist =
    DistributionFactory.DEFAULT.createDistribution(dna.getAlphabet());

//register the Distribution with the trainer
dtc.registerDistribution(dnaDist);

for (int j = 1; j <= dna.length(); j++) {
    dtc.addCount(dnaDist, dna.symbolAt(j), 1.0);
}

//train the Distribution
dtc.train();
setWeight() Example
FiniteAlphabet a = DNATools.getDNA();
Distribution d = DistributionFactory.DEFAULT.createDistribution(a);

//set the weight of each symbol
d.setWeight(DNATools.a(), 0.3);
d.setWeight(DNATools.c(), 0.2);
d.setWeight(DNATools.g(), 0.2);
d.setWeight(DNATools.t(), 0.3);
DistributionTools
DistributionTools holds static methods for creating and manipulating Distributions.
Tasks include:
Equal emission spectra?
Shannon Entropy, information, KL Distance.
Generate biased sequences.
Make a Distribution[] from an Alignment (each Distribution represents one position in an Alignment).
Average two or more Distributions.
Randomize a Distribution.
Make a Distribution from a Count.
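Two of the measures named above have short textbook definitions that can be sketched in plain Java on arrays of probabilities. The methods `entropy` and `klDivergence` are hypothetical helpers using the standard formulas in bits; the exact variants and signatures that DistributionTools uses are not shown here.

```java
class EntropySketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Shannon entropy in bits; terms with p == 0 contribute nothing.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) h -= pi * log2(pi);
        }
        return h;
    }

    // Kullback-Leibler divergence D(p || q) in bits.
    // Assumes q[i] > 0 wherever p[i] > 0 and that both arrays have equal length.
    static double klDivergence(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) d += p[i] * log2(p[i] / q[i]);
        }
        return d;
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        double[] biased = {0.3, 0.2, 0.2, 0.3};
        System.out.println(entropy(uniform));              // maximal for four symbols
        System.out.println(entropy(biased));               // lower than the uniform case
        System.out.println(klDivergence(biased, uniform)); // greater than zero
    }
}
```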
Serialization of Distributions
Distributions are Serializable: write to and read from binary, RMI.
XMLDistributionWriter: write any Distribution to a stream in XML format.
XMLDistributionReader: SAX parser; read any Distribution from an XML stream.
XML Output
<?xml version="1.0" ?>
<Distribution type="Distribution">
<alphabet name="DNA" />
<weight sym="adenine" prob="0.32178516910737204" />
<weight sym="cytosine" prob="0.04596199299395364" />
<weight sym="guanine" prob="0.1405504188012911" />
<weight sym="thymine" prob="0.4917024190973832" />
</Distribution>
What Else??
Dynamic Programming (HMMs)
Bibliography
Alignments
Blast and Fasta parsing
BioSQL support
GUI components
Chromatograms
Molecular Biology (pI, mass, restriction enzymes)
Molecular Structure