Unique HCV Data and Analysis Tools in the Virus Pathogen Resource (ViPR)
Yun Zhang
J. Craig Venter Institute, San Diego, USA
Outline
• New HCV Typing Pipeline
• Improvement of Subtype Annotations in Virus Pathogen Resource (ViPR)
• Leveraging Annotation Results for Analysis
• Future Plan
Objective for HCV Subtyping
https://talk.ictvonline.org/ictv_wikis/flaviviridae/w/sg_flavi/56/hcv-classification
• Phylogenetically-principled subtyping
• Consistently subtype all HCV genomes in public domain
• Make annotations available via ViPR
• Be consistent with the ICTV subtype classification
Phylogeny-based Subtype Classification
• Tree with leaves A, q, A, X: q is of A-type, because it is bracketed by A and A.
• Tree with leaves A, q, B, X (internal node: AB-ancestor): q is of unknown type, because it is bracketed by A and B (it could be "C", "A", or "A.x"). Naïvely, it looks like q must be of A-type, but we do not know at which point along the branch going from AB-ancestor to A the type changes from AB-ancestor-type to A-type.
• Tree with leaves A.1, q, A.2, X: q is of A-type, because it is bracketed by A.1 and A.2 (it could be "A.3", "A.1", or "A.1.x").
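The three bracketing cases can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation: it assumes that types are written as dot-separated hierarchical labels (as in "A.1"), and the function name `type_from_brackets` is hypothetical.

```python
def type_from_brackets(first, second):
    """Return the type a query inherits from its two bracketing clades.

    Hypothetical helper: labels are dot-separated hierarchies ("A", "A.1").
    The query gets the deepest level shared by both brackets, or None when
    the brackets share nothing (type unknown).
    """
    shared = []
    for a, b in zip(first.split("."), second.split(".")):
        if a != b:
            break
        shared.append(a)
    return ".".join(shared) if shared else None


print(type_from_brackets("A", "A"))      # A
print(type_from_brackets("A", "B"))      # None
print(type_from_brackets("A.1", "A.2"))  # A
```

Returning `None` for the A/B case keeps "unknown" distinct from any real type label, which mirrors the caution in the second case.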
Novel HCV Typing Pipeline
Genotyping/Subtyping Report (Beta) (SOP)
Your analysis contains 1 records
Query Identifier: AB677533
Query Length: 9471
Type                                    Consensus Assignment   Support
Matching Clades                         1b                     1.0
Matching Down-tree Bracketing Clades    1b                     1.0
Matching Up-tree Bracketing Clades      1b                     1.0
Phylogenetic Tree: View
Report: Input alignment (FASTA), Output tree (Newick), Subtype assignment (text)
HCV Typing in ViPR
ViPR Hepatitis C virus home page, 3/27/2017: https://www.viprbrc.org/brc/home.spg?decorator=flavi_hcv
Hepatitis C Virus
• Taxonomy: Group IV ((+)ssRNA); Flaviviridae; Hepacivirus; Hepatitis C virus
• Virion: 50 nm, icosahedral, enveloped
• Genome: 9.6 kilobase positive-sense, single-stranded RNA
• Proteome: single polyprotein, co- and post-translationally cleaved into 10 mature proteins
• Infection: initiates by E2 protein interacting with cell surface heparan sulfate proteoglycans
• RNA Transcript: 5' internal ribosomal entry site (IRES), no 3' polyA tail
• Transmission: infects humans and chimps via blood-to-blood contact
• Phylogeny: 6 distinct genotypes identified, each with multiple subtypes
• Epidemiology: 2-3 million infected each year worldwide, almost 200 million infected
• Clinical: causes cirrhosis, hepatocellular carcinoma, and liver failure
Search our comprehensive database for:
• Genomes
• Genes & proteins
• Sequence Feature Variant Types
• Immune epitopes
• 3D protein structures
• Host Factor Data
• Antiviral Drugs
Analyze data online:
• Sequence Alignment
• Phylogenetic Tree
• Sequence Variation (SNP)
• Metadata-driven Sequence Analysis
• Genome Annotator
• BLAST
Use your workbench to:
• Store and share data
• Combine working sets
• Integrate your data with ViPR data
• Store and share analyses
• Custom search alert
Highlights
Multiple Sequence Alignment
Compute and visualize multiple sequence alignments together with derived consensus sequence and conservation score within ViPR. Perform custom alignments using the MUSCLE algorithm. Alignments can be saved to the ViPR Workbench or downloaded in various formats.
Key Highlights:
• Align multiple virus sequences
• Visualize alignments; customize alignment display
• Save alignment to ViPR workbench
ViPR Flaviviridae Sequence Search page, 10/9/2018: https://www.viprbrc.org/brc/vipr_genome_search.spg?method=ShowCleanSearch&decorator=flavi_hcv
Search criteria on the form:
• DATA TO RETURN: Genome, Protein, Strain
• SELECT VIRUS(ES) TO INCLUDE IN SEARCH: jump to subfamily, genus, species or strain in taxonomy
• COMPLETE GENOME: Complete Genome Only
• COLLECTION YEAR: Start, End (to add a month to the search, see Advanced Search Options: Month Range)
• GEOGRAPHIC GROUPING
• COUNTRY
• HOST SELECTION
• HOST ATTRIBUTES: Host Gender (All, Male, Female)
• SAMPLE ATTRIBUTES: Sample Source
• VIRUS ATTRIBUTES: Subtype, Infection Type
Results matching your criteria: 542,434
Subtype   Strains   Complete genomes
1         7796      38
1a        28683     587
1b        28283     830
1b/2k     1         0
1c        85        19
Gene/Protein Search
Search for virus proteins/genes and related information. You can search the whole virus family or a specified genus, species, etc. You can also find your strain or genome record if you have its information, such as strain name or accession. Protein/Gene searches for Dengue virus or Hepatitis C virus can be augmented with clinical metadata criteria. Selecting the appropriate nodes in the taxonomy browser (Flavivirus, Dengue virus, Hepacivirus, Hepatitis C virus) will add metadata search panels and enable you to include these criteria. Some sequences have more metadata fields defined than others. Queries based on metadata only retrieve sequences for which those fields are defined.
Improved Annotations in ViPR
[Figure: bar chart and pie chart of sequence counts by annotation category, Jul 30, 2018 release.]
Categories: Identical types between ViPR & GB; ViPR new annotations; GB unique annotations; ViPR improved type precision; ViPR lost type precision; Different types between ViPR & GB; Others
Bar labels: 83603, 27336, 4437, 60, 12498, 1319, 255 (y-axis: 0 to 100000)
Pie labels: 65%, 21%, 3%, 10%, 1%
Total sequences: 223,324
Sequences >= 400 nt: 128,815
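The comparison categories can be expressed as a small decision function. The sketch below is an assumption about how such a comparison could be coded, not the logic ViPR actually uses: it treats a type as a string such as "1" (genotype only) or "1b" (genotype plus subtype), treats a missing annotation as `None` or an empty string, and maps the case where both are missing to "Others". The function name `compare_annotations` is hypothetical.

```python
def compare_annotations(vipr, genbank):
    """Classify one sequence by comparing its ViPR and GenBank (GB) types.

    Hypothetical sketch: "more precise" is taken to mean that one label
    extends the other, e.g. "1b" extends "1".
    """
    if not vipr and not genbank:
        return "Others"  # assumption: neither source has a type
    if not genbank:
        return "ViPR new annotations"
    if not vipr:
        return "GB unique annotations"
    if vipr == genbank:
        return "Identical types between ViPR & GB"
    if vipr.startswith(genbank):
        return "ViPR improved type precision"
    if genbank.startswith(vipr):
        return "ViPR lost type precision"
    return "Different types between ViPR & GB"


print(compare_annotations("1b", "1"))   # ViPR improved type precision
print(compare_annotations("1", "1b"))   # ViPR lost type precision
print(compare_annotations("1a", "1b"))  # Different types between ViPR & GB
```

The prefix test is a simplification; labels such as recombinants ("1b/2k") would need explicit handling.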
Genotype/Subtype Distribution
[Figure, July 2018 release. Top panel: log2 number of sequences per genotype/subtype (y-axis: 0.00 to 16.00). Bottom panel: share of sequences by continent (Africa, Asia, Europe, North America, Oceania, South America; y-axis: 0% to 100%).]
Genotypes/subtypes on the x-axis: 1, 1a, 1b, 1c, 1e, 1g, 1h, 1l, 1m, 1n, 2, 2a, 2b, 2c, 2f, 2i, 2j, 2k, 2m, 2q, 3, 3a, 3b, 3g, 3h, 3i, 3k, 4, 4a, 4d, 4f, 4g, 4k, 4l, 4m, 4n, 4o, 4r, 4v, 4w, 5, 5a, 6, 6a, 6c, 6e, 6f, 6g, 6h, 6i, 6j, 6l, 6m, 6n, 6o, 6p, 6q, 6r, 6s, 6t, 6u, 6v, 6w, 6xa, 6xb, 6xd, 6xe, 6xf, 7, 7a
GT1 78%, GT3 11%
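A distribution like this one can be derived from a flat list of type labels. The sketch below is illustrative only: the input is toy data rather than ViPR counts, the helper names `genotype_of` and `summarize` are hypothetical, and the genotype is assumed to be the leading digits of each label.

```python
import math
from collections import Counter


def genotype_of(label):
    """Leading digits of a type label, e.g. "1b" -> "1", "6xa" -> "6"."""
    digits = ""
    for ch in label:
        if not ch.isdigit():
            break
        digits += ch
    return digits


def summarize(labels):
    """Return (log2 count per label, percentage share per genotype)."""
    per_label = Counter(labels)
    per_genotype = Counter(genotype_of(x) for x in labels)
    total = len(labels)
    log2_counts = {k: math.log2(v) for k, v in per_label.items()}
    shares = {k: 100.0 * v / total for k, v in per_genotype.items()}
    return log2_counts, shares


# Toy input, not ViPR data.
toy = ["1a"] * 4 + ["1b"] * 4 + ["3a"] * 2
log2_counts, shares = summarize(toy)
print(log2_counts)  # {'1a': 2.0, '1b': 2.0, '3a': 1.0}
print(shares)       # {'1': 80.0, '3': 20.0}
```

A log2 scale keeps rare subtypes visible next to dominant ones such as 1a and 1b.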
Leveraging Annotation Results for Analysis
[Figure with groups labeled GT1, GT2, GT3, GT4, GT5, GT6.]
Other Comparative Analysis Tools in ViPR
SEQUENCE FEATURE DEFINITION
Protein Name: NS5a
Sequence Feature Name: Hepatitis C Virus_NS5a_RAS_31(1)
Sequence Feature ID: Hepatitis C virus_NS5a_SF3
Reference Strain: H77-1a
Reference Sequence Accession: NC_004102
Reference Position: 31

SOURCE STRAIN(S)
Source Strain: H77-1a
VT Number: -N/A-
Source Position: 31
Source Accession: NC_004102
3D Protein Structure: 1CWX
Evidence Codes: EXP
Comment: L31F/M/V substitutions conferred resistance to NS5A inhibitor treatment for certain genotype infections. Source: HCV Guidance: Recommendations for Testing, Managing, and Treating Hepatitis C [http://www.hcvguidelines.org/print/92]

VARIANT TYPES
Strain Count   Variant Type   Phenotypic Variant Type   Sequence Variation (position 31)   Total Variation
11933          VT-1           No                        L                                  0
670            VT-2           Yes                       M                                  1
55             VT-3           Yes                       V                                  1
25             VT-4           No                        I                                  1
5              VT-5           No                        P                                  1
3              VT-6           No                        S                                  1
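A variant-type table of this shape can be tallied from the residues observed at one reference position. The sketch below rests on assumptions rather than on ViPR's documented procedure: variant types are numbered by decreasing strain count (which matches the order of the table), total variation is taken as the number of residues differing from the reference, and the function name `variant_types` and the input are hypothetical toy data.

```python
from collections import Counter


def variant_types(residues, reference):
    """Rank residues seen at one position into VT-1, VT-2, ... by strain count.

    Returns rows of (variant type, residue, strain count, total variation),
    where total variation is 0 for the reference residue and 1 otherwise.
    """
    counts = Counter(residues)
    rows = []
    for rank, (residue, n) in enumerate(counts.most_common(), start=1):
        rows.append((f"VT-{rank}", residue, n, 0 if residue == reference else 1))
    return rows


# Toy residues at NS5A position 31 for seven strains (not ViPR data).
for row in variant_types(list("LLLLMMV"), reference="L"):
    print(row)
```

Whether a variant type is phenotypic (here, M and V) comes from curated evidence such as the resistance comment, so it cannot be computed from the counts alone.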
Improving Typing Tool via Data Mining
• Known limitations
– Reference tree defines subtype boundaries
– Limited gold standard annotations curated by experts
• Big data
– Genome sequences: 223,324
– Fragments
– Highly similar
– Regional diversity between subtypes
[Figure: number of sequences (y-axis: 0 to 4000) versus CD-HIT threshold (x-axis: 0.80 to 1.00); series: CDS.]
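The effect of the clustering threshold can be illustrated with a toy greedy clustering. This is a deliberately simplified stand-in, not CD-HIT itself: identity here is a position-by-position match without alignment, and the function names `identity` and `greedy_cluster` and the sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative it matches, longest first."""
    clusters = []  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGTTCGTAA", "TTTTTTTTTT"]
for t in (0.80, 0.90, 1.00):
    print(t, len(greedy_cluster(toy, t)))  # 2, 3, and 4 clusters
```

Raising the threshold splits the data into more clusters, so more representative sequences survive the redundancy filter.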
A section of HCV reference alignment
Expand the Reference Tree
Current reference tree
– GT1b
Expanded reference tree
– GT1b
– Leaf label format: GB-type_accession|new
Testing expanded reference tree
Testing data: 8186 sequences annotated as GT1 in ViPR / GT1b in GenBank
                                         Count   Fraction
Unique testing sequences                 6381    1.00
Typed as GT1b using the expanded tree    5699    0.89
Typed as GT1 using the expanded tree     668     0.10
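The reported fractions follow directly from the counts, as the short check below shows. The variable names are arbitrary, and only the three counts come from the slide.

```python
counts = {
    "Unique testing sequences": 6381,
    "Typed as GT1b using the expanded tree": 5699,
    "Typed as GT1 using the expanded tree": 668,
}

total = counts["Unique testing sequences"]
fractions = {name: n / total for name, n in counts.items()}
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")  # 1.00, 0.89, 0.10

# Sequences in neither of the two listed outcomes.
remaining = (
    total
    - counts["Typed as GT1b using the expanded tree"]
    - counts["Typed as GT1 using the expanded tree"]
)
print(remaining)  # 14
```

The 14 remaining sequences are not broken down in the table.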
Leverage subtype metadata in GenBank
Summary & Plan
• An automated HCV subtyping pipeline
– accurate
– efficient
• Comprehensive, improved subtype annotations in ViPR
• Many comparative genomic analysis tools in ViPR
• Future plan: Verify new candidate reference sequences
Acknowledgements
Christian Zmasek, Richard Scheuermann
Sherry He, Christian Suloway, Jyothsna Reddy, Xiaomei Li, Sam Zaremba
Donald Smith
HHSN272201400028C