Microbial Bioinformatics Tools BIOL 4800/01, Fall 2016

Post on 11-Jun-2020


BIOL 4800/01 Syllabus Fall 2016


Microbial Bioinformatics Tools
BIOL 4800/01, Fall 2016
When: 1230-1320 W, 1230-1520 F
Where: Coates 169
Instructor: J. Cameron Thrash, Ph.D.
Email: thrashc@lsu.edu
Twitter handle: @DrJCThrash
My office: A112 Life Sciences Annex
Office hours: By appointment (see below)
Prerequisite: Permission of the Department
Recommended textbooks: Practical Computing for Biologists, Haddock & Dunn; Phylogenomics, DeSalle & Rosenfeld
Course website(s): Moodle

eCommunication Policy: The best way to contact me is through email and/or twitter. I will try to respond to email or twitter messages within 6 hours, except on weekends or between 2200 and 0700. I may respond much quicker, because like you I am glued to my devices, but I do have a life outside of teaching and research (when I’m lucky). If you want an appointment, email me with 1) a short description of your issue, and 2) the desired time and 3) duration of the meeting. This will be subject to my availability. I accept and encourage twitter follows, but I do not accept any other social media friend requests.

Course description. In modern biology, the need for competence in computational tools is becoming as ubiquitous as that for traditional techniques like PCR. This course will provide basic training in navigating the command-line environment, utilizing common tools for genomics and ecology, submitting jobs to High Performance Computing clusters, and managing input and output files. It is NOT a programming class. Prior computational experience is helpful, but not required, as the goal of this course is to bring neophytes to a basic level of competence with some common bioinformatics methods. While the focus will be applying these to microbiological research, many tools are system/organism independent. Classes will take place in a computer lab and will have access to the LSU High Performance Computing (HPC) infrastructure. Each week will consist of two hours of theory/practical lecture and two hours of practical hands-on computer laboratory exercises (plus take-home exercises). The 4800/01 course can be taken for credit by upper-level undergraduates and graduate students equally (3 credit hours).

Course learning outcomes. By the end of this course, you should be able to:

• Understand the basics of a HPC infrastructure
• Remotely access a HPC cluster using the command line
• Navigate and manipulate the file structure within a Linux environment
• Complete basic file manipulation tasks using Linux commands
• Write basic shell scripts for parsing input and output files and sending jobs to the compute nodes
• Be able to assess program performance using data subsets to accurately estimate usage requirements
• Download genomic information from public databases directly to a HPC cluster
• Execute parallel (threaded) analyses using BLAST, HMMER3, and multiple alignment tools
• Understand the modern sequencing platform methodologies, capabilities and limitations
• Execute phylogenetic inferences from multiple alignments with different platforms
• Perform basic automated microbial genome assembly and annotation with the A5 pipeline
• Assess the core and pan-genome of a group of closely related microorganisms
• Complete operational taxonomic unit clustering and analysis using Mothur.

How we’re going to get there (Course Philosophy and Format)

Figure: Circular representation of multiple bacterial genomes (Grote et al. 2012, mBio)


Philosophy. This course is designed to get you to a basic working knowledge of many of the common tools used in modern bioinformatics, particularly as applied to microbial genomics. This is a combined lecture/laboratory course, with the laboratory portion spent utilizing computers instead of a typical wet lab. While there will be some lecture component during the Wednesday class, as much as possible this period will have active learning exercises instead of me simply standing around talking. Extensive research on education and the neuroscience of learning has shown there are much more effective ways for us to learn than by sitting and listening to a person stand in the front of the room and talk. You don’t have to come to class to learn that way anyway, for there are endless lectures and resources available online, many from the most eminent scientists in their fields. Some of these will be part of your pre-class assignments. Therefore, I endeavor to make class time as effective as possible for stimulating your investment in the material and activating all modes of thinking. The added benefit of being able to do this work yourselves in the lab portion of the class will help complete the process.

Classroom mechanics. The one-hour Wednesday class will be chiefly concerned with introducing the tools we will be learning to use. The three-hour Friday class will sometimes have a practical lecture for instructional purposes and then at least two hours to work on your assignments (detailed below), which will be graded according to a specific rubric, and involve a short written component. There will be some assigned readings/podcasts/web-videos/lecture slides you will be responsible for before each class period, listed as “Readings, etc.” in the class schedule, below. I will also introduce these pre-lecture assignments each week by email. There will be a short online Moodle quiz on the material for Wednesdays that closes one hour before class.

Computational Requirements. Our classes will be conducted in a computer lab. Prior to the first class, you need to have requested an account with LSU HPC for access to the supercomputer SuperMike-II (https://accounts.hpc.lsu.edu/allocations.php). You will use this account for completing classroom exercises and major assignments. For in-class exercises, you will be using the lab computers or a personal laptop and logging on through a terminal. For your major assignments, you will need another computer with terminal login capabilities so that you may access SuperMike-II remotely. All exercises involving significant computational effort will require the use of a class computing allocation. Details for operating in the HPC environment will be presented during the first two weeks of the course.

You will be graded on the following:
Quizzes 10%
Weekly Assignments 90%

There will be 1000 total points, graded accordingly:
A- 900-929; A 930-969; A+ >969
B- 800-829; B 830-869; B+ 870-899
C- 700-729; C 730-769; C+ 770-799
D- 600-629; D 630-669; D+ 670-699
F <600
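As a quick check on the scale above, the cutoffs can be written as a small lookup. This is only an illustrative sketch in Python; the function name `letter_grade` is hypothetical, and treating each band as "at or above its lower bound" for non-integer totals is an assumption, since the syllabus lists whole-point ranges only.

```python
# Lower bound of each band, highest first (A+ is handled separately
# because the syllabus states it as "> 969" rather than a range).
GRADE_FLOORS = [
    (930, "A"), (900, "A-"),
    (870, "B+"), (830, "B"), (800, "B-"),
    (770, "C+"), (730, "C"), (700, "C-"),
    (670, "D+"), (630, "D"), (600, "D-"),
]


def letter_grade(points):
    """Map a point total (out of 1000) to a letter grade."""
    if points > 969:
        return "A+"
    for floor, letter in GRADE_FLOORS:
        if points >= floor:
            return letter
    return "F"


for total in (985, 930, 899, 600, 599):
    print(total, letter_grade(total))
```

With the stated weights, 900 of the 1000 points come from weekly assignments and 100 from quizzes.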

Late assignments. Assignments will sacrifice 25% of their points per day they are late.

Dealing with challenges. Making mistakes and running into roadblocks is inherent to the process of learning. By design, this course will challenge you to figure out solutions, not simply give you the path from point A to point Z. The reason for this is that the struggle to overcome whatever challenge you face is where the true learning happens. Therefore, while your first impulse when you encounter a problem will probably be to email me, this will not be met with the type of response you are looking for UNLESS you have done all of the following problem-solving efforts first, in this order:

1. Think. Review your commands, inputs, and outputs. See if you can figure out what went wrong. Often it’s simply that a space is missing somewhere, a comma is misplaced, or you’re using a ‘ instead of a `. Try to find this yourself before bothering anyone else.

2. Consult the internet. The people who have developed and use the programs you are learning have created a massive amount of online resources. Often googling your error message will allow you to find the problem. Creative google searching can usually do the rest.

3. Consult your peers. Is everyone in the class having the same problem, or has someone else discovered the solution? While this may seem like the same thing as asking the professor, asking your classmates fosters peer-to-peer instruction, which reinforces concepts for those who get the opportunity to teach their solution and gives a different perspective to those who are seeking answers.

4. Consult Dr. Thrash. If you’re still having problems, by all means contact me or set up an appointment for office hours. Solutions are sometimes very simple but obscure.

Other important information

Absences/Code of Student Conduct. You are expected to have read, understand, and adhere to the LSU Absence Policy (http://saa.lsu.edu/important-lsu-policies) and the Code of Student Conduct (http://saa.lsu.edu/code-student-conduct). Our goal should be to learn, not simply to get grades. In science, as in life, your integrity is one of, if not the, most valuable asset you have. Preserve it, protect it, cultivate it.

Students with Disabilities. If anyone has a disability that may require accommodation, you should immediately contact the office of Services for Students with Disabilities to officially document the needed accommodation. The instructor must be presented with this documentation during the first week of class.

Time requirements. It is expected that you will have read or viewed the assigned material prior to class for the background necessary to properly participate in the activities and think critically about the concepts addressed. As a general policy, for each hour you are in class, you (the student) should plan to spend at least two hours preparing for the next class. Since this course is for three credit hours, you should expect to spend around six hours outside of class each week reading or working on assignments for the class.

Class schedule

The schedule is preliminary and subject to change depending on how quickly we are moving through the material. Details on your pre-class readings, etc., are supplied below.

Class  Week  Date                Subject                                      Readings, etc.  Assignment
1      1     August 24th (W)     The command line environment                 1-6
2      1     August 26th (F)     HPC Tutorial                                                 WA01
3      2     August 31st (W)     Basic Linux commands                         7-11
4      2     September 2nd (F)   More Linux, shell text editors                               WA02
5      3     September 7th (W)   More Linux, database access, bash scripts    12
6      3     September 9th (F)   Download, manipulate fasta/Genbank files                     WA03
7      4     September 14th (W)  Local alignment & dynamic programming        13-15
8      4     September 16th (F)  BLAST                                                        WA04
9      5     September 21st (W)  Multiple sequence alignment                  16-19
10     5     September 23rd (F)  clustalW, MUSCLE, Gblocks                                    WA05
11     6     September 28th (W)  Single gene phylogeny                        20-24
12     6     September 30th (F)  RAxML, FastTree, clustalw                                    WA06
13     7     October 5th (W)     Genome sequencing                            25, 26          WA07
       7     October 7th (F)     Fall Break
14     8     October 12th (W)    Hidden Markov Models                         27, 28
15     8     October 14th (F)    HMMER3                                                       WA08
16     9     October 19th (W)    Intro to A5                                  29, 30
17     9     October 21st (F)    A5 pipeline and evaluation stats                             WA09
18     10    October 26th (W)    Annotation: the SEED viewer                  31, 32
19     10    October 28th (F)    Analyzing and comparing annotations                          WA10
20     11    November 2nd (W)    Microbiomes I: background                    33-35
21     11    November 4th (F)    Microbiomes I: OTUs and ecological analysis                  WA11
22     12    November 9th (W)    Microbiomes II: Mothur                       36, 37
23     12    November 11th (F)   Microbiomes II: Mothur/analysis                              WA12
24     13    November 16th (W)   Assessing core and pangenomes                38-40
25     13    November 18th (F)   Orthology determination, GetHomologues                       WA13
       14    November 23rd (W)   Thanksgiving
       14    November 25th (F)   Thanksgiving
26     15    November 30th (W)   Presenting WA13
27     15    December 2nd (F)    Presenting WA13


Readings, etc. to be completed before class

1. Read the Syllabus
2. Apply for an HPC account (https://accounts.hpc.lsu.edu/login_request.php)
3. Software Carpentry Unix shell tutorials (https://v4.software-carpentry.org/shell/index.html) - Introduction, Files and Directories, Creating and Deleting
4. Review the HPCJumpstart.pdf and the HPC@LSU website (http://www.hpc.lsu.edu), familiarizing yourself with Accounts and Allocations, the LSU HPC Usage Policy, the User Guide for SuperMike II, and the Computational Biology tools available.
5. Haddock and Dunn, Chapter 4 (Optional)
6. Haddock and Dunn, Chapter 20 (Optional)
7. Software Carpentry Unix shell tutorials (https://v4.software-carpentry.org/shell/index.html) - Pipes and Filters through remaining tutorials.
8. Software Carpentry regex tutorials (https://v4.software-carpentry.org/regexp/index.html) - all.
9. Haddock and Dunn, Chapter 2 (Optional)
10. Haddock and Dunn, Chapter 3 (Optional)
11. Haddock and Dunn, Chapter 5 (Optional)
12. DeSalle and Rosenfeld, Chapter 4 (Optional)
13. Web research on the BLAST suite.
14. Eddy, S. R. (2004). What is dynamic programming? Nature Biotechnology, 22(7), 909–910. (Optional)
15. DeSalle and Rosenfeld, Chapter 5 (Optional)
16. Wikipedia intro to MSA: https://en.wikipedia.org/wiki/Multiple_sequence_alignment
17. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797.
18. Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution, 17(4), 540–552.
19. DeSalle and Rosenfeld, Chapter 6 (Optional)
20. Wikipedia intro to phylogenetic trees: https://en.wikipedia.org/wiki/Phylogenetic_tree
21. Slides from Dr. Jonathan Eisen
22. Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 -- approximately maximum-likelihood trees for large alignments. PLoS ONE, 5(3), e9490.
23. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21), 2688–2690.
24. DeSalle and Rosenfeld, Chapter 8 (Optional)
25. Metzker, M. L. (2010). Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1), 31–46.
26. Evolution of DNA Sequencing Methods, talk by Jonathan Eisen: https://www.youtube.com/watch?v=s9UbA7VyISQ
27. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755.
28. Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLOS Computational Biology, 7(10), e1002195.
29. Tritt, A., Eisen, J. A., Facciotti, M. T., & Darling, A. E. (2012). An Integrated Pipeline for de Novo Assembly of Microbial Genomes. PLoS ONE, 7(9), e42304.
30. Coil, D., Jospin, G., & Darling, A. E. (2015). A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics, 31(4), 587–589.
31. Edwards, D. J., & Holt, K. E. (2013). Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microbial Informatics and Experimentation, 3(1), 2.
32. Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., et al. (2014). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(Database issue), D206–14.
33. Goodrich, J. K., Di Rienzi, S. C., Poole, A. C., Koren, O., Walters, W. A., Caporaso, J. G., et al. (2014). Conducting a Microbiome Study. Cell, 158(2), 250–262.
34. Seekatz, A. M., Aas, J., Gessert, C. E., Rubin, T. A., Saman, D. M., Bakken, J. S., & Young, V. B. (2014). Recovery of the Gut Microbiome following Fecal Microbiota Transplantation. mBio, 5(3), e00893-14.


35. Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann, M., Hollister, E. B., et al. (2009). Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology, 75(23), 7537–7541.
36. Mothur MiSeq SOP
37. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K., & Schloss, P. D. (2013). Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), 5112–5120.
38. Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences, 102(39), 13950–13955. doi:10.1073/pnas.0506758102
39. Grote, J., Thrash, J. C., Huggett, M. J., Landry, Z. C., Carini, P., Giovannoni, S. J., & Rappé, M. S. (2012). Streamlining and Core Genome Conservation among Highly Divergent Members of the SAR11 Clade. mBio, 3(5), e00252-12.
40. Contreras-Moreira, B., & Vinuesa, P. (2013). GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Applied and Environmental Microbiology, 79(24), 7696–7701. doi:10.1128/AEM.02411-13

Staying Organized

Part of any good computational biology workflow is keeping your inputs and outputs organized, and all your processes and the contents of each file and directory documented. This not only allows someone else to understand and reproduce your work, but prevents you from forgetting the valuable steps you took to produce your work as well. It’s a horrible feeling to enter a directory a year after working on a project and not remember the contents of the files or how they were created. Throughout the semester, we will utilize a common core set of organizational procedures to facilitate keeping organized. Each week will have a separate directory in your home directory where you will store inputs and outputs, and possibly include subdirectories. You will document the contents of each file and subdirectory in a README document, including one for each directory. Finally, for each assignment you will create a brief summary report, described below. All three of these documents will be instrumental in your grade. (You may find that when you branch out on your own, a different system may suit you. Regardless, it is important to leave a transparent trail of all your work so that it can be recreated at any point in the future. The point here is simply to enforce good practices in computational biology, and we have to pick one system in advance.)

Reports

Each week you will be completing a series of tasks using a tool or set of tools. As part of your assignments you need to include a short written report with the elements below. If the report includes only text, create it with a text editor (e.g. nano) and save as ~/<working directory>/report.txt. If it includes graphics, save as a .docx or .pdf file, and upload to ~/<working directory>/report.docx (or .pdf).

1. Name(s), date
2. General (1-line) summary of objective(s) and purpose(s)
3. Working directory
4. Programs used, including basic scripts, and relevant reference(s)
5. Commands, inputs, outputs and results/evaluation of output for NEW operations.*
   • Organize in sections according to the rubric in outline form.
   • Include specific file names.
   • For batch jobs, indicate the important command(s) and the name of the PBS script.
   • For repeating tasks, only detail the first instance, then indicate that this was repeated and note variation in input/output file names. Similarly, output only has to be shown for a first instance.
   • *For operations you are repeating from previous assignments (blastp, muscle, etc.), you may simply reference a previous report for your workflow, but you need to be specific enough that one could find the correct command and repeat it.
6. Personal reflection. What did you learn in addition to using the assigned tool(s)? What would you do differently? What are you still confused about?


Weekly Assignments (WA)

WA01. Learning the command line and HPC tutorial (40 pts)
a. Path homework (relative vs. absolute). Create a ~/week1/paths.txt file on SuperMike-II with the following, one per line:
   i. A relative path to your home directory on Mike
   ii. The absolute path to your home directory on Mike
   iii. A relative path to your work directory
   iv. The absolute path to your work directory
   v. A relative path to another student’s home directory
   vi. The absolute path to another student’s home directory
b. Logging. Create a tab-delimited ~/week1/report.txt file that includes the unique commands you have used (history is very helpful here). Make sure to include what you have learned about the HPC@LSU and what you are still confused about. Create a tab-delimited ~/week1/README file that contains each of the files in each directory, a tab over, and a brief description of contents.
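The relative vs. absolute distinction in part a can be illustrated with Python's standard `posixpath` module. This is a hedged sketch rather than a solution: the directory names (`/home/alice`, `/work/alice`, `/home/bob`) are hypothetical stand-ins, the real locations on SuperMike-II may differ, and the helper `relative_from` is introduced here only for illustration. The point is that an absolute path is the same from anywhere, while a relative path depends on the directory you start from.

```python
import posixpath

# Hypothetical locations; check the real ones with `pwd` after logging in.
home = "/home/alice"
work = "/work/alice"
other_home = "/home/bob"


def relative_from(start, target):
    """Return the path to `target` as seen from the directory `start`."""
    return posixpath.relpath(target, start=start)


rows = [
    ("home (relative, from work)", relative_from(work, home)),
    ("home (absolute)", home),
    ("work (relative, from home)", relative_from(home, work)),
    ("work (absolute)", work),
    ("other home (relative, from home)", relative_from(home, other_home)),
    ("other home (absolute)", other_home),
]

# Tab-delimited output, one entry per line.
for label, path in rows:
    print(f"{label}\t{path}")
```

On the cluster itself, the same information comes from `pwd` and `cd`; the sketch only makes the dependence on the starting directory explicit.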

WA02. Building on our Linux skills and incorporating shell text editors (60 pts)
a. In groups of two (number-assigned), research and teach the class about one of the following (a good starting point will be the “cheat sheets” and their respective web locations):
   i. head, tail
   ii. wc
   iii. grep
   iv. > vs. >>
   v. sort
   vi. uniq
   vii. sed
   viii. piping
   You will need to explain the tool/concept, what it can be used for or where it is used, and provide an example using a basic fasta file. Each group will have 4 minutes. Perform the whole presentation on the command line (i.e., don’t create powerpoints for this). We will start on Wednesday and continue into Friday if necessary.
b. Create a set of five piped commands that utilize any three of the Linux tools (filters) you learned last week to manipulate a fasta file. Document these commands, their purpose, and the input and output in your report. Use nano to create the text while logged into SuperMike-II.
c. Logging. Create a tab-delimited ~/week2/report.txt file that includes the unique commands you have used (history is very helpful here). Create a tab-delimited ~/week2/README file that contains each of the files in each directory, a tab over, and a brief description of contents.

WA03. Bash scripts, downloading, and Linux practice (70 pts)
Create two separate bash scripts using the tools we’ve covered thus far (or others that you know about), run them, and document their input/output in a ~/week3/README file:
a. A bash script you run in your working directory
b. A PBS submission script, submitted via qsub
Download the genome sequences for a group of closely related (same Family) microorganisms from GenBank and practice manipulating fasta files with basic linux commands.
c. Compile the names, GenBank entries, and phylogeny in a tab-delimited file for five organisms.
d. Download protein fastas, GenBank files, nucleotide fastas, and scaffold fastas for the organisms you identified (total = 20 files).
e. Using piping, linux commands, fastaToTab, tabToFasta, and genbank_to_fasta.py, complete the following:
   i. Convert your genbank files to .fasta
   ii. Compare the number of genes in the .faa files you downloaded with the converted files


   iii. Split your .faa files into fasta files of 100 genes each
   iv. Create a single file with all the gene annotations from all your genomes
f. Logging. README and report files.

WA04. Learning to execute the three basic aspects of the blast suite - making a database, searching sequences against a database, and querying the database for additional information using alternative search input. (80 pts)
a. Make a database from your genome scaffolds using makeblastdb.
b. Execute a protein blast against a translated nucleotide database with tblastn using 100 aa sequences from a given genome against another student’s scaffold database (with different organisms than yours).
c. Perform an assessment of blast efficiency using blastp against IMGv4, following the 10, 100, 1000 rule. You will need to split protein fasta sequences into subsets of 10, 100, and 1000 protein sequences and run blastp with these subsets against the database using 1, 2, 4, and 16 processors (12 total blastp job submissions). Using your standard out information, create two graphs of the performance, one with number of sequences vs. time for a given set of processors, the other with the amount of time per sequence search vs. number of processors. Produce a written summary of your results to accompany your graphs.
d. Using the information from your 10, 100, 1000 assessment, execute a threaded BLAST search of the hypothetical proteins in your five genome dataset against the IMGv4 database with the ideal number of processors and time requested.
e. Collect the sequences for the top 100 hits to one of these proteins using blastdbcmd. This will require you to use a set of piped linux commands in conjunction with blastdbcmd, which, among other things, accepts sequence accession numbers as input and outputs a variety of information, including the sequence data in fasta format.

f. README and report files.

WA05. Produce and visualize threaded multiple-sequence alignments, edit for poorly curated sites, and evaluate variance between two different alignment programs. (70 pts)
a. In-class research on fasta vs. phylip formatted alignments
   i. Description of the difference between the two
   ii. List of three tools/sites that do conversion
b. In-class research on alignment viewers - find three
c. Pick three different protein sequences in any of your genomes, including RecA, and get the top 20 hits from the IMGv4 database. For each gene, place the query and hit sequences into a single .faa file (21 total sequences for each of the 3 protein fastas).
d. Align each file with both MUSCLE and CLUSTAL. Visualize the alignments with graphical software and compare them by eye. Describe how the alignment for a given gene differs between programs.
e. Edit with Gblocks using the settings from Sassera et al. 2011 and note the alignment variation. You may need to convert your alignments from phylip to fasta format first.

f. README and report files.

WA06. Execute phylogenetic inferences using three different programs for 2 of the genes from your genomes and 60 top hits to IMGv4. (70 pts)
a. Constructing a tree on paper
b. Identify a ribosomal protein with >100 amino acids, and one other gene in your genome that has to do with central metabolism, pathogenicity, or respiration. Perform a blast of their protein sequences against IMGv4, and collect the top 50 hits (initial sequence included). You will also need to pick 2 outgroups from blast hits with considerably less identity to your query sequence than the top 50 hits.
c. Perform MUSCLE alignments and cull with Gblocks using the settings from Sassera et al. 2011.
d. Execute phylogenetic analysis with PBS submissions to the cluster using ClustalW (to create a tree this time, not an alignment), FastTree2, and RAxML. The latter will need to be run in a threaded format with 16 processors. Some of these will need input data in phylip format. Use 1000 bootstraps for ClustalW and RAxML.
e. Using a tree-viewer, output and compare the topology and node confidence between the different trees for a given gene. Be sure to root your tree on your outgroup sequence. Compile a short summary with tree graphics describing these variables and add to your report.
f. README and report files.

WA07. Take Home Essay (there will be an in-class exercise worth 10 pts). Find a “microbiome” study in the primary literature, and identify an important organism in the system. Create a maximum two-page proposal for sequencing of this organism’s genome (50 pts). In your proposal you must include:
a. Your motivation for sequencing this particular strain. What makes it important? Why should we care about this organism? Include ecological data that demonstrate where/when this organism is found. What will sequencing this organism help you to understand about the system it’s in (think about things like physiology, population genetics, etc.)?
b. Phylogenetic context. Where in the tree of life does this organism sit? What are its closest relatives? Are any of these already sequenced?
c. Sequencing parameters. What technology would you like to use? Why? How much DNA will you need, how many lanes/runs/etc. will you use, and how much coverage do you expect to get?
d. All information needs to be completely referenced with primary literature, except in formulating how much coverage you will get for a given technology. This can come from websites, but must be cited nonetheless. No references, no credit for entire assignment.
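The coverage question in WA07 part c comes down to simple arithmetic: average depth is the total number of sequenced bases divided by the genome size. The sketch below shows that calculation under stated assumptions. The function names are hypothetical, and the read count, read length, and genome size are placeholder numbers, not figures for any particular platform or organism; real values should come from cited sources, as part d requires.

```python
import math


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Smallest whole number of reads that reaches the target depth."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Placeholder numbers for illustration only.
coverage = expected_coverage(2_000_000, 250, 5_000_000)
n_reads = reads_needed(50, 250, 5_000_000)
print(f"{coverage:.0f}x coverage; {n_reads} reads needed for 50x")
```

This is an idealized average; it ignores reads lost to quality filtering and uneven coverage across the genome.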

WA08. Creating, searching, and scanning HMMs for several of your hypothetical proteins using HMMER3. (70 pts)
a. HMM scavenger hunt
b. Create HMMs for three of the hypothetical proteins in your various genomes that are ≥100 amino acids, using the top 30 blastp hits to IMG as the foundation for your multiple alignments
c. hmmsearch one of these HMMs against RefSeq. Compare your results to blastp searches against RefSeq, using only the top 15 hits from each search.
d. As a class, combine all of your HMMs into a single file and create a HMM database to search against using hmmscan. One person must host this database in their BIOL4800 directory.
e. hmmscan a new hypothetical protein sequence (you will have to identify a different hypothetical from one of your genomes) against this database, and note the best hit.
f. README and report files. Be sure to include which models your proteins match best. Note which group created that model, and what sequences were used to create it.

WA09. Execute a genome assembly using the A5 pipeline and analyze the completed assembly using assemstats2.py (70 pts). You will work in groups of two to complete this assignment, and you will need to consult two other groups to compare your assemblies. The completed assignment will include a copy of the output files from the A5 assembly, a table of your genome statistics, two tables comparing your genome stats to two other groups, a brief half page write up, and a completed submission to RAST.
a. In-class text assembly with your group
   i. http://ivory.idyll.org/blog/the-assembly-exercise.html (Titus Brown)
b. Reflective writing
   i. What worked and didn’t with your “genome” assembly?
   ii. If the “genome” had errors how would you correct for them?
   iii. What would have happened if there were repeat segments in your genome?
c. Download the raw sequencing data for a microbial genome of choice in GenBank.
d. Complete an assembly of a microbial genome using A5.
e. Once assembled, examine the assembly stats created by A5.
   i. What do these statistics tell you about how well or poorly A5 assembled your genome sequences?
f. Evaluate your assembly compared to those of your classmates. Pick two others and create a table comparing all three assemblies.
   i. How do your genomes compare?
   ii. What made your genome have a “better” or “worse” assembly when compared to that of others?


   iii. This discussion should be roughly a half page long and address the above questions as well as the question in e.
g. Submit your assembly to RAST to be annotated.
h. README and report files (include the reflective writing done earlier).
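To make the idea of a "table of genome statistics" concrete, here is a minimal sketch of the kind of summary an assembly-statistics script might produce from a contig FASTA file. It is not the course's assemstats2.py and makes no claim about what that script or A5 actually reports. The function names are hypothetical, the toy contigs are invented for illustration, and N50 (the contig length at which the running total of contigs, sorted longest first, reaches half the assembly size) is included as a commonly used statistic that the syllabus itself does not name.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(lengths):
    """Length of the contig at which half the total assembly is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lines):
    """Summarize contig count, total size, GC content, and N50."""
    seqs = [seq.upper() for _, seq in read_fasta(lines)]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return {
        "contigs": len(seqs),
        "total_bp": total,
        "gc_percent": 100.0 * gc / total if total else 0.0,
        "n50": n50(lengths),
    }


# Toy contigs for illustration; a real run would pass an open file instead.
toy = [">c1", "GGCCGGCC", ">c2", "ATAT", ">c3", "GC", "AT"]
stats = assembly_stats(toy)
print(stats)
```

Because `assembly_stats` accepts any iterable of lines, the same function works on `open("contigs.fasta")` (a hypothetical file name) without loading anything else.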

WA10. Obtain the annotated output from RAST, compare several subsystems with other strains, present a summary of the basic features of your assembly, and perform several rudimentary analyses between the genes from your assembly and those of your other genomes (70 pts). Working with your proposal/assembly group:
a. SEED viewer scavenger hunt.
b. Determine the four most closely related organisms with sequenced genomes in the SEED database to the genome you’ve assembled. Hint, you’re going to want to identify conserved genes that can be used to look for other organisms.
c. Put together a table comparing your assembled contig data with four other closely related genomes, including the following information for all of them: Organism name, Isolation source, Genome size (Mbp), number of contigs, GC content %, Total number of genes, Number of protein coding genes, Number of tRNAs
d. Compare, for your assembled genome and the closest relative, the gene presence/absence profile for the following subsystems: glycolysis and gluconeogenesis, flagellar motility, and multi-drug efflux pumps.
e. Pick three proteins from your assembly, including one ribosomal protein, blast these proteins against RefSeq and construct phylogenies for each gene with at least 15 members, including an outgroup. This will result in three total trees. You may use whatever alignment and tree-building algorithm you wish, making sure to cull with Gblocks. Output the trees and summarize how the topologies are similar and/or different from each other. Also note whether or not the genomes of your other four organisms are present in the RefSeq results, and whether or not they are still the closest neighbors to your genome.

f. README and report files.

WA11. Learning to measure and estimate microbial diversity in preparation for using Mothur. (50 pts)
a. In-class short reflective essay on the definition of OTUs and how they are used in microbial ecology - explain from memory. Then take 10 minutes to do web research on primary literature. Re-define OTUs again, citing your references.
b. Mark, release, recapture/rarefaction/relative abundance worksheet. Questions and writings posed in the worksheet need to be part of the report, along with final tables and graphs.
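The diversity measures behind this kind of worksheet can be computed by hand from a list of per-OTU counts. The sketch below shows relative abundance together with two indices that WA12 names, Chao1 richness and inverse Simpson diversity. It rests on assumptions: the counts are invented, the function names are hypothetical, Chao1 is written in its bias-corrected form S_obs + F1(F1-1)/(2(F2+1)), where F1 and F2 are the numbers of OTUs seen exactly once and twice, and inverse Simpson uses the simple 1/Σp² form. Mothur's own calculators may use slightly different formulas, so treat the numbers as a way to understand the indices rather than to reproduce Mothur output.

```python
def relative_abundance(counts):
    """Fraction of all sequences that belong to each OTU."""
    total = sum(counts)
    return [c / total for c in counts]


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = [c for c in counts if c > 0]
    f1 = sum(1 for c in observed if c == 1)  # singletons
    f2 = sum(1 for c in observed if c == 2)  # doubletons
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


def inverse_simpson(counts):
    """1 / sum(p_i^2); larger values mean a more even community."""
    return 1.0 / sum(p * p for p in relative_abundance(counts))


# Invented counts for eight OTUs in one sample (illustration only).
otu_counts = [10, 5, 3, 2, 2, 1, 1, 1]

print("relative abundance:", [round(p, 2) for p in relative_abundance(otu_counts)])
print("Chao1 richness:", chao1(otu_counts))
print("inverse Simpson:", round(inverse_simpson(otu_counts), 2))
```

In this toy sample, eight OTUs are observed and the three singletons push the Chao1 estimate to nine, which is the sense in which the index estimates unseen richness.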

WA12. OTU analysis of LSU Mikereauxbiome data using Mothur. (90 pts)
You will be working in groups of two, analyzing data from four samples in one of several different sample types. Most of your workflow will come from following the Mothur MiSeq SOP, but there will be some steps that are modified and/or left out from the SOP. We will identify some of these in class.
a. Create a workflow for your analysis. Include as many specific commands as possible, and annotate these with their purpose.
b. Obtain the sequence data for the samples with which you will be working, being sure to include both forward and reverse reads. Perform a complete Mothur run, through OTU clustering and taxonomic assignment.
c. Complete the following analyses of your data:
   i. Rarefy your data
   ii. Chao1 richness and inverse Simpson diversity
   iii. Rarefaction curves of OTUs vs. sampling effort
   iv. Table of # seqs, coverage, # OTUs, Inverse Simpson
   v. Relative abundance heatmap with Jaccard index
   vi. Venn diagram of shared OTUs, including predicted # of overlapping OTUs
   vii. Create four rank abundance curves - one each for the top 20 OTUs in each of your samples. Compare your top 10 taxa with those from the other groups.

WA13. Complete core and pan-genome analysis of closely related genomes with GetHomologues, reporting the various additional outputs. Incorporate additional material from some of the other tools you’ve dealt with thus far. Format for final presentation to the class. You may work individually or in groups. You will need to cd to the get_homologues directory in /project/jcthrash/tools/, complete your work, and then move your output to your home directory. ONLY ONE RUN CAN BE COMPLETED AT A TIME, so if you want to use this option, you must coordinate with your classmates. Another option is to install the get_homologues pipeline directly in your home directory. (100 pts)

a. GetHomologues scavenger hunt
b. Pick a genome from one taxon in the top 5 OTUs of your Mothur analysis, plus six additional closely related strains (same genus), and perform two clustering runs using get_homologues, one with COGtriangles, one with OrthoMCL
c. Create Venn diagram output showing the intersection of these clusters
d. Using the intersecting clusters only, create a dendrogram showing relationships among your taxa with gene presence-absence information
e. Using the intersecting clusters, create core and pan-genome extrapolation curves like those in Tettelin et al. 2005, Figs 2, 3.
f. Create a comparison of the following:
   i. The dendrogram created in d, above
   ii. Phylogenetic trees based on the amino acid sequences from RecA, a ribosomal protein, and a DNA polymerase
g. Integrate this material with microbial ecology data from your Mothur analysis. What is the relative abundance of your organisms in the datasets you examined? Where do they sit on the rank-abundance curves?
h. Identify at least three candidate pathways present in your organism that can help explain why it is dominant in your ecological data. Describe how the gene content of these pathways is different or similar to the other 4 organisms with which you are comparing it.
i. Create a polished presentation of this information, with no more than one slide per element, with brief summaries for each section, that is no more than 15 minutes long.
j. Prepare one question for each group based on their presentation.
k. README and report files.
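The core and pan-genome ideas in WA13 reduce to set operations on gene-cluster membership: the core genome is the set of clusters present in every genome, and the pan-genome is the set present in at least one. The sketch below illustrates this with invented data. The strain names, cluster IDs, and function names are all hypothetical, and this is not how get_homologues is implemented or invoked. The accumulation function follows a single genome order, whereas published curves such as those in Tettelin et al. 2005 are typically built from many orderings.

```python
def core_and_pan(presence):
    """Return (core, pan) cluster sets from {genome: set_of_cluster_ids}."""
    sets = list(presence.values())
    if not sets:
        return set(), set()
    return set.intersection(*sets), set.union(*sets)


def accumulation(presence, order):
    """Core and pan-genome sizes as genomes are added in the given order."""
    core, pan, rows = None, set(), []
    for name in order:
        clusters = presence[name]
        core = set(clusters) if core is None else core & clusters
        pan |= clusters
        rows.append((name, len(core), len(pan)))
    return rows


# Invented presence/absence data: which clusters each strain carries.
presence = {
    "strainA": {"c1", "c2", "c3"},
    "strainB": {"c1", "c2", "c4"},
    "strainC": {"c1", "c5"},
}

core, pan = core_and_pan(presence)
print("core:", sorted(core))
print("pan:", sorted(pan))
for name, n_core, n_pan in accumulation(presence, ["strainA", "strainB", "strainC"]):
    print(f"{name}\tcore={n_core}\tpan={n_pan}")
```

As genomes are added, the core can only shrink and the pan-genome can only grow, which is the shape the extrapolation curves in part e describe.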

Additional Resources

1. Learning regex: http://www.regexr.com
2. Linux cheat sheet: http://peoplesofttutorial.com/learn-basic-linux-commands-using-linux-cheat-sheet/
3. Rosalind programming training: http://rosalind.info/problems/locations/
4. Software Carpentry training: http://software-carpentry.org/index.html
5. Elements of Bioinformatics: http://elements.eaglegenomics.com