AutoEditor
Automated base caller error correction tool
Slides courtesy of Pawel Gajer, Ph.D.
AutoEditor
Base-calling in the context of a single chromatogram is hard…
but finding base-calling “mistakes” in a multiple alignment is easy.
• Principal and secondary aims of AutoEditor
• AutoEditor as a higher level base caller
• Tiling discrepancy types
• Base caller error types
• Resolving discrepancies of the form B…B*
• Resolving discrepancies of the form *…*B
• AutoEditor statistics
A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types.
A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
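The definition above can be illustrated with a minimal Python sketch. It assumes a tiling is given as equal-length aligned read strings, with `*` for a gap and a space where a read does not cover the column; the function name `find_tiling_discrepancies` and this input layout are illustrative assumptions, not AutoEditor's actual data structures.

```python
from collections import Counter


def find_tiling_discrepancies(reads):
    """Return (column, symbol counts) for every column that is not homogeneous.

    Assumed layout: each read is a string of the same length as the tiling,
    '*' marks a gap, and ' ' marks positions the read does not cover.
    """
    if not reads:
        return []
    width = len(reads[0])
    if any(len(read) != width for read in reads):
        raise ValueError("all reads must be padded to the same length")

    discrepancies = []
    for col in range(width):
        # Count only the reads that actually cover this column.
        counts = Counter(read[col] for read in reads if read[col] != " ")
        if len(counts) > 1:  # more than one distinct symbol: not homogeneous
            discrepancies.append((col, counts))
    return discrepancies


tiling = ["GAATG", "GAATG", "GA*TG", "GAATG"]
print(find_tiling_discrepancies(tiling))
```

In this toy tiling only column 2 is reported, because one read carries a gap where the other three carry A.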
autoEditor as a higher level base caller
single read trace data -> base caller -> nucleotide sequence -> tiling of reads
tiling discrepancies + multiple read trace data -> autoEditor -> list of corrected discrepancies
Other applications:
• Clear range editing (read expansion)
• SNP detection
Clear range editing
single read quality values data -> trimming algorithm -> trimmed read
less stringently trimmed reads -> assembler -> tiling of reads -> autoEditor
SNP detection
Alignment data of genome 1 + Alignment data of genome 2 -> Combined genomes alignment data
List of putative SNPs -> autoEditor -> List of putative SNPs that pass autoEditor error screening
Tiling discrepancy types
Single deletion:
Single insertion:
Single insertion and single deletion are extreme cases of insertion/deletion discrepancies
A A A A
A A A *
A A * *
A * * *
* * * *
The above sequence of discrepancies can be represented schematically as an edge in a two vertex graph:
A *
The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex
A
T
C
G
*
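One way to read the simplex picture: a column's composition is the fraction of reads showing each of the five symbols, and five non-negative fractions that sum to 1 are the barycentric coordinates of a point in a 4-dimensional simplex with vertices A, T, C, G and *. The sketch below is only an illustration of that reading; `column_to_simplex_point` is a hypothetical helper, not part of AutoEditor.

```python
SYMBOLS = ("A", "T", "C", "G", "*")


def column_to_simplex_point(column):
    """Map a tiling column (one symbol per read) to barycentric coordinates.

    The result has one fraction per symbol in SYMBOLS and sums to 1, so it is
    a point of the simplex spanned by the five vertices.
    """
    counts = [column.count(symbol) for symbol in SYMBOLS]
    total = sum(counts)
    if total == 0:
        raise ValueError("column contains none of the symbols A, T, C, G, *")
    return tuple(count / total for count in counts)


# The insertion/deletion sequence from the earlier slide moves along the A-* edge.
for column in ("AAAA", "AAA*", "AA**", "A***", "****"):
    print(column, column_to_simplex_point(column))
```

Every column in the loop has zero weight on T, C and G, which is why that family of discrepancies collapses to the single edge joining A and *.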
support (b)
amplitude (a)
minimum difference between amplitude and local minimum (c)
Open dots on the signal curve indicate local maxima and open circles indicate local minima.
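The three signal parameters can be sketched for a sampled trace as follows. The slide does not define them precisely, so this sketch makes explicit assumptions: amplitude (a) is the signal value at a local maximum, support (b) is the distance between the local minima bracketing the peak (falling back to the ends of the trace), and (c) is the smaller of the two drops from the peak to those bracketing points. The function names are hypothetical.

```python
def local_extrema(signal):
    """Return (indices of local maxima, indices of local minima)."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] >= signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] <= signal[i + 1]:
            minima.append(i)
    return maxima, minima


def peak_parameters(signal, peak, minima):
    """Compute assumed (a), (b), (c) parameters for the peak at index `peak`."""
    # Nearest bracketing minima; fall back to the trace ends when none exist.
    left = max((m for m in minima if m < peak), default=0)
    right = min((m for m in minima if m > peak), default=len(signal) - 1)
    amplitude = signal[peak]
    return {
        "amplitude": amplitude,                       # (a)
        "support": right - left,                      # (b), assumed definition
        "min_drop": min(amplitude - signal[left],     # (c), assumed definition
                        amplitude - signal[right]),
    }


trace = [0, 2, 8, 3, 1, 5, 9, 4, 0]
maxima, minima = local_extrema(trace)
for peak in maxima:
    print(peak, peak_parameters(trace, peak, minima))
```

A re-caller could then compare these values against allowable ranges; the ranges themselves are not given in the slides.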
Re-calling individual bases
Base caller error types
• Missed signal
• Signal shift
• Unresolved peaks
Resolving a single deletion discrepancy
compute the discrepancy’s read multiplicity: mult
if mult = 0 then check for a missed signal error
if |mult| > 0 then check for a signal shift error
    if it is not a signal shift error then it is an unresolved peaks error
To resolve it, find two other reads with well-resolved peaks over the unresolved-peak bases
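The decision steps for a single deletion can be written as a small dispatcher. This is a sketch only: the trace-level checks are not described in the slides, so they are passed in as precomputed booleans, and the function name, parameter names and the "unclassified" fallback are assumptions.

```python
def classify_single_deletion(mult, missed_signal_found, signal_shift_found):
    """Classify a single deletion discrepancy from its read multiplicity.

    mult                 -- read multiplicity of the discrepancy
    missed_signal_found  -- result of a (hypothetical) missed-signal check
    signal_shift_found   -- result of a (hypothetical) signal-shift check
    """
    if mult == 0:
        # The slides only say to check for a missed signal here; what happens
        # when the check fails is not stated, so it is left unclassified.
        return "missed signal" if missed_signal_found else "unclassified"
    if signal_shift_found:
        return "signal shift"
    # Not a signal shift: by elimination, an unresolved peaks error.
    return "unresolved peaks"


print(classify_single_deletion(0, True, False))    # mult = 0 branch
print(classify_single_deletion(-2, False, False))  # |mult| > 0 branch
```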
A discrepancy’s read multiplicity is the number of bases to the right or left (negative sign) of the discrepancy position that are equal to the consensus base covering the discrepancy.
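A minimal sketch of the multiplicity computation follows. It assumes the bases are counted as a contiguous run directly next to the discrepancy position, and that the right side is examined first with the left side as a fallback; the slides do not say how a read with matching bases on both sides is handled, so that ordering is an assumption, as is the function name.

```python
def read_multiplicity(read, pos, consensus_base):
    """Signed length of the run of `consensus_base` adjacent to `pos` in `read`.

    Positive: the run lies to the right of the discrepancy position.
    Negative: the run lies to the left.
    Zero: neither neighbour equals the consensus base.
    """
    # Count the contiguous run to the right of the discrepancy.
    right = 0
    i = pos + 1
    while i < len(read) and read[i] == consensus_base:
        right += 1
        i += 1
    if right:
        return right

    # Otherwise count the contiguous run to the left and report it as negative.
    left = 0
    i = pos - 1
    while i >= 0 and read[i] == consensus_base:
        left += 1
        i -= 1
    return -left


print(read_multiplicity("GAA*G", 3, "A"))  # two A's to the left of the gap
```

For the read `GAA*G` with consensus base A at the gap, the two A's sit to the left, so the sketch returns -2.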
Resolving a single insertion discrepancy
compute the discrepancy’s read multiplicity: mult
if mult = 0 then check if the signal parameters are within allowable ranges
if |mult| > 0 then check if the insertion base is a part of |mult|+1 well-resolved signal peaks
    if not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position
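The insertion procedure can be sketched in the same spirit. The peak counts are assumed to come from some trace analysis that is not described here, and the returned labels ("keep insertion", "unresolved") are illustrative interpretations of the outcomes, not AutoEditor's own terminology.

```python
def resolve_single_insertion(mult, signal_in_range,
                             resolved_peaks_in_read,
                             resolved_peaks_in_other_reads):
    """Sketch of the single-insertion decision steps.

    mult                          -- read multiplicity of the discrepancy
    signal_in_range               -- are the signal parameters within allowable ranges?
    resolved_peaks_in_read        -- well-resolved peaks found in the inserting read
    resolved_peaks_in_other_reads -- well-resolved peak counts, one per other read,
                                     between the bases flanking the discrepancy
    """
    if mult == 0:
        return "keep insertion" if signal_in_range else "weak signal error"

    run = abs(mult)
    # The inserted base is supported if it is one of |mult|+1 well-resolved peaks.
    if resolved_peaks_in_read == run + 1:
        return "keep insertion"

    # Otherwise look for two other reads with exactly |mult| well-resolved peaks.
    supporting = sum(1 for count in resolved_peaks_in_other_reads if count == run)
    if supporting >= 2:
        return "unresolved peaks error"
    return "unresolved"


print(resolve_single_insertion(-2, True, 2, [2, 2, 3]))
```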
mult = 0, weak signal error
mult = -2, unresolved peaks error with two other reads with exactly 2 signal peaks between Gs flanking AA*
from Nov 12, 2002
Test set: the first 10 contigs of Mycoplasma arthritidis
asmbl_id  size(kb)  # corrections  # autoEdit errors  # errors in newer autoEdit
1         132       124            3                  0
2         64        78             4                  1
3         40        55             3                  0
4         53        45             2                  1
5         16        15             0                  0
6         22        29             1                  0
7         23        19             0                  0
8         51        48             1                  0
9         26        33             1                  0
10        15        15             0                  0
----------------------------------------------------------------------
Total:    442       461            15                 2
                                   ~3.25%             ~0.43%
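The two percentages are the error counts divided by the total number of corrections. The short check below recomputes them from the per-contig figures in the table; the variable and function names are illustrative.

```python
# Per-contig figures for the 10 Mycoplasma arthritidis contigs in the table.
corrections = [124, 78, 55, 45, 15, 29, 19, 48, 33, 15]
autoedit_errors = [3, 4, 3, 2, 0, 1, 0, 1, 1, 0]
newer_autoedit_errors = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]


def error_rate_percent(errors, corrections):
    """Errors as a percentage of all corrections, summed over contigs."""
    return 100.0 * sum(errors) / sum(corrections)


print(sum(corrections))                                              # 461
print(round(error_rate_percent(autoedit_errors, corrections), 2))        # 3.25
print(round(error_rate_percent(newer_autoedit_errors, corrections), 2))  # 0.43
```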
Missed-signal (MS) and signal shift (SS) correction errors
AutoEditor version 1.1
Test set: the first 10 contigs of Mycoplasma arthritidis
asmbl_id  size(in kb)  #disc  #corr  %corr
1         132          3390   3266   96%
2         64           2195   2142   98%
3         40           1344   1325   99%
4         53           1304   1242   95%
5         16           508    487    96%
6         22           777    757    97%
7         23           624    613    98%
8         51           1303   1232   95%
9         26           783    760    97%
10        15           437    423    97%
--------------------------------------------------------------------
Total:    442          12665  12065  95%
where #disc is the total number of discrepancies in the given contig
      #corr is the number of corrected discrepancies
      %corr is the percentage of corrected discrepancies
AutoEditor version 1.2 correcting all single deletion errors
Organism                         Discrep’s  Corrected  %       Contig Discrep’s  Corrected  %
Acidobacterium capsulatum        103539     93729      90.5%   99555             89977      90.4%
Neorickettsia sennetsu Miyayama  41408      37425      90.4%   38355             34579      90.2%
Bacillus anthracis Kruger B      317745     284503     89.5%   296222            264646     89.3%
Coxiella burnetii                131183     117232     89.4%   118723            105562     88.9%
Dichelobacter nodosus            83804      73547      87.8%   76766             67900      88.5%
Clostridium perfringens          71928      62822      87.3%   66546             59929      90.1%
Mycoplasma capricolum            17805      15444      86.7%   16574             14584      88.0%
Brucella suis                    129870     112359     86.5%   120799            105250     87.1%
Plasmodium vivax                 783495     655642     83.7%   734298            618268     84.2%
Pseudomonas fluorescens          234264     194771     83.1%   224049            186276     83.1%
Campylobacter jejuni             96231      79237      82.3%   88800             73940      83.3%
Fibrobacter succinogenes         243270     196150     80.6%   208790            175294     84.0%
Erwinia chrysanthemi             219370     176354     80.4%   205161            165070     80.5%
Mycobacterium smegmatis          433105     346503     80.0%   363017            309774     85.3%
Prevotella intermedia            118857     94162      79.2%   110750            87931      79.4%
Pseudomonas syringae             227887     177897     78.1%   200223            164561     82.2%
Silicibacter pomeroyi            156130     116907     74.9%   148006            112093     75.7%
Chlamydophila caviae             50137      36972      73.7%   47875             35103      73.3%
Wolbachia sp.                    70782      51163      72.3%   57357             45401      79.2%
Burkholderia mallei              139359     99711      71.6%   130158            94540      72.6%
Streptococcus agalactiae         152330     105878     69.5%   109821            92153      83.9%
Streptococcus pneumoniae         53566      36557      68.3%   43093             33432      77.6%
Myxococcus xanthus               33525      21789      65.0%   33254             21699      65.3%
Dehalococcoides ethenogenes      71587      46416      64.8%   61878             42649      68.9%
Listeria monocytogenes           229172     145274     63.4%   148177            123268     83.2%
Streptococcus mitis              157348     92377      58.7%   106172            74203      69.9%
Total                            4367697    3470821    79.5%   3854419           3198082    83.0%
AutoEditor accuracy
Organism                         Read length  Corrections  AE Errors
Listeria monocytogenes           37420828     145274       4
Wolbachia sp.                    11446011     51163        0
Burkholderia mallei              47407080     99711        28
Brucella suis                    26629877     112359       2
Streptococcus agalactiae         23485615     105878       3
Coxiella burnetii                29135115     117232       30
Campylobacter jejuni             15013845     79237        11
Chlamydophila caviae             10286694     36972        6
Dehalococcoides ethenogenes      10724521     46416        12
Neorickettsia sennetsu Miyayama  8805232      37425        0
Fibrobacter succinogenes         46463268     196150       4
Mycoplasma capricolum            9353819      15444        0
Prevotella intermedia            20084365     94162        3
Pseudomonas syringae             50369232     177897       46
Total                            346625502    1315320      149
Table 2. Comparison of AutoEditor corrections on 14 genomes to the finished sequence of those genomes.
AutoEditor accuracy