Organizing your Research and Data Management
UI Library Workshop
Friday, April 3, 2020
Summary
• Workflows
• Formats/Volume
• Data Dimensionality and Format
• Storage
• Backing Up
• File Organization
• File Naming
• Managing PDF Comments and Annotations
Data Management: Four stage process
Raw data and associated metadata → Data ingestion → Data cleaning → Analysis Step 1 → Analysis Step 2 → Output generation → Output data, visualizations, associated metadata
1. Collect 2. Store 3. Describe/Document 4. Apply/Use
Adapted from: DataONE, 2012
Flow charts: simplest form of workflow
Informal Workflows
Flow chart of inputs, steps, and outputs: Temperature data + Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
From: DataONE, 2012
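The flow chart above can be sketched as a short script. This is only an illustration in Python (the original chart uses R); the sample readings, the column names, and the `-999` missing-value code are assumptions made up for the example.

```python
import csv
import io
import statistics

# Hypothetical readings; a real project would read these from files.
RAW_CSV = """station,temperature_c,salinity_psu
A,12.1,34.9
B,11.8,35.1
C,-999,35.0
D,12.4,
"""

def import_data(text):
    """Data import: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows, missing="-999"):
    """Quality control: drop rows with blank or sentinel values, convert to float."""
    cleaned = []
    for row in rows:
        values = (row["temperature_c"], row["salinity_psu"])
        if any(v in ("", missing) for v in values):
            continue
        cleaned.append({
            "station": row["station"],
            "temperature_c": float(values[0]),
            "salinity_psu": float(values[1]),
        })
    return cleaned

def summarize(rows, column):
    """Analysis: mean and sample standard deviation for one column."""
    data = [row[column] for row in rows]
    return {"mean": statistics.mean(data), "sd": statistics.stdev(data)}

rows = clean(import_data(RAW_CSV))
summary = {col: summarize(rows, col) for col in ("temperature_c", "salinity_psu")}
print(summary)
```

Each function corresponds to one box in the chart, so the intermediate products (imported data, clean data, summary statistics) stay visible and can be saved or inspected separately.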
Informal Workflows
species name
lat/long
Data ingestion
Data and metadata
QA/QC (e.g., check for appropriate data types, outliers)
Determine range (calculate convex hull of points)
Compile list of unique taxa
Generate output data and visualizations for taxa frequencies in stated range
Data, visualizations, and metadata
FORcounts
Calculate taxa frequencies
Workflow diagrams: a simple example
From: DataONE, 2012
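A minimal sketch of the steps named in this diagram (QA/QC, range as a convex hull, unique taxa, taxa frequencies). The occurrence records are invented, and treating latitude/longitude as flat x/y coordinates is a simplifying assumption that only holds for small study areas.

```python
from collections import Counter

# Hypothetical occurrence records: (species name, latitude, longitude).
RECORDS = [
    ("Pinus ponderosa", 46.7, -117.0),
    ("Pinus ponderosa", 46.9, -116.8),
    ("Larix occidentalis", 47.1, -116.5),
    ("Abies grandis", 46.5, -116.6),
    ("Larix occidentalis", 46.8, -116.7),
]

def qa_qc(records):
    """Keep records with a non-empty name and coordinates in valid ranges."""
    return [(name, lat, lon) for name, lat, lon in records
            if name and -90 <= lat <= 90 and -180 <= lon <= 180]

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Pop points that would make a clockwise (or straight) turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))

clean = qa_qc(RECORDS)
hull = convex_hull([(lon, lat) for _, lat, lon in clean])  # (x, y) = (lon, lat)
frequencies = Counter(name for name, _, _ in clean)
unique_taxa = sorted(frequencies)

print(unique_taxa)
print(frequencies.most_common())
print(hull)
```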
Data Formats and Volume
• Two considerations:
• Fancy files (.docx, .dat, .xlsx) – program overhead, valuable/necessary for a specific tool
• Simple files (.txt, .csv) – usually cleaner, fewer encoding issues, more accessible to other tools/systems
• Some exceptions like .pdf and .shp
• Data re-use implications
• Future users (including yourself) will appreciate simpler formats
• .csv vs .xlsx (Excel) vs .wks (Lotus) OR
• .txt vs Word vs WordPerfect
• Volume
• Primary issue: large amounts of data will affect your organizational approach, and possibly the structures and storage you use. Figure out the general volume early!
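As an illustration of the two points above, the sketch below writes records to plain CSV using only Python's standard library and makes a back-of-the-envelope volume estimate. The field names, sample values, and the one-million-row figure are assumptions for the example, and the estimate ignores the header line.

```python
import csv
import io

# Hypothetical records.
rows = [
    {"site": "A", "date": "2020-04-03", "weight_g": 12.5},
    {"site": "B", "date": "2020-04-04", "weight_g": 14.1},
]

def to_csv_text(records):
    """Serialize dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def estimate_bytes(records, expected_rows):
    """Rough volume estimate: average encoded row size times expected row count."""
    body = to_csv_text(records).splitlines()[1:]  # skip the header
    average = sum(len(line.encode("utf-8")) + 1 for line in body) / len(body)
    return int(average * expected_rows)

text = to_csv_text(rows)
print(text)
print(estimate_bytes(rows, expected_rows=1_000_000), "bytes (approx.)")
```

Running an estimate like this on a small sample early in a project gives a first idea of whether a spreadsheet, a database, or a specialized format will be needed.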
Dimensionality in Data Modeling
• Start with core “fact(s)” or variables of interest
• e.g. an observation of an animal
• You might record a number of facts about the animal
• e.g. weight, length, a behavior
• There might be relevant other dimensions
• e.g. time, space, environmental variables, observer/instrument information, proximity to relevant vegetation, etc.
• The more of these “dimensions”, the more complex your data may become
Multi-dimensional Datasets
• Lower dimensional data might be fine as a spreadsheet
• Easy to add/ingest new data, easy to manage
• Higher dimensional data needs more control and management
• Relational databases, other database systems
• Size, complexity becomes a problem that database software can mitigate
• For large 4-dimensional data (x, y, z, t)
• Hierarchical Data Format (HDF) and similar (e.g. netCDF)
• For very large time-series data, these formats are fast and efficient: reading and writing is typically much quicker than with a relational DB.
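To make the relational option concrete, here is a small sketch using SQLite from Python's standard library. The table layout, column names, and values are hypothetical; the point is that facts about the animal and the other dimensions (time, space, observer) each get an explicit, typed place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript("""
CREATE TABLE observer (
    observer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    observer_id INTEGER NOT NULL REFERENCES observer(observer_id),
    observed_on TEXT NOT NULL,      -- time dimension (ISO 8601 date)
    latitude REAL, longitude REAL,  -- space dimension
    weight_g REAL, length_cm REAL,  -- facts about the animal
    behavior TEXT
);
""")
conn.execute("INSERT INTO observer VALUES (1, 'Observer A')")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "2020-04-01", 46.73, -117.00, 310.0, 24.5, "foraging"),
        (2, 1, "2020-04-02", 46.74, -117.01, 295.0, 23.8, "resting"),
    ],
)

# Join the two tables and summarize per observer.
result = conn.execute("""
    SELECT o.name, COUNT(*), AVG(ob.weight_g)
    FROM observation AS ob JOIN observer AS o USING (observer_id)
    GROUP BY o.name
""").fetchall()
print(result)
```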
“Not Only SQL” (NoSQL) Databases
• If your data is characterized by the three V’s (velocity, volume, variety), a NoSQL database might be the best form for organizing it
• Heterogeneous data:
• Images + spatial data + time series + text
• Both semi-structured and unstructured data are harder in a pure SQL environment
• Changing, evolving, or “unfinished” data
• If you expect to change the schema, you might need something more flexible than an RDBMS
• Ultimately, your data model/format should be closest to the problem you are trying to solve
• A few types:
• Key-Value Store
• Just a simple table of keys and values. Only way to retrieve data is via the key
• Document
• Like a Key-Value Store, but the value is usually a structured document – such as a JSON or XML document
• Advantage here is for items that have varying attributes, e.g. one record has 5 fields, another has 8, another has 2.
• Graph
• Database that prioritizes relationships between records, not just tables. Also useful for cases in which the attributes vary in number.
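The three types can be imitated with plain Python structures to show how they differ. This is a toy illustration, not how a real NoSQL database is used or implemented; all keys, field names, and relations below are invented.

```python
import json

# Key-value store: lookups happen only through the key.
key_value = {
    "sample:001": "raw/sample_001.csv",
    "sample:002": "raw/sample_002.csv",
}

# Document store: each value is a structured document; fields may differ per record.
documents = {
    "obs-1": {"species": "elk", "weight_kg": 290, "behavior": "grazing"},
    "obs-2": {"species": "elk", "photo": "img_0042.jpg"},
}

# Graph: relationships between records are stored explicitly as edges.
edges = [
    ("obs-1", "observed_by", "observer-A"),
    ("obs-2", "observed_by", "observer-A"),
    ("obs-1", "near", "obs-2"),
]

def with_field(field):
    """Document query: which records carry a given (optional) field?"""
    return sorted(key for key, doc in documents.items() if field in doc)

def neighbors(node, relation):
    """Graph query: follow one kind of relationship from a record."""
    return sorted(dst for src, rel, dst in edges if src == node and rel == relation)

print(key_value["sample:001"])
print(json.dumps(documents["obs-2"]))
print(with_field("photo"))
print(neighbors("obs-1", "near"))
```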
Storage Types
• Local Storage
• Good temporary solution, assuming you have the space
• Networked Drive Storage
• Better solution for both temporary and longer-term storage, usually because local support is available in cases of emergency or failure.
• Cloud Storage
• Convenient storage and sharing platform, but there are issues to consider, including syncing problems (can affect active read/write) and security for sensitive data
• Physical Media (Flash Drives, External HDs)
• Meant for moving data. Not necessarily reliable, and not a good option in most cases.
Storage Options at University of Idaho
• UI Cloud/Networked Option:
• OneDrive, approx. 1 TB of space
• Both web and “app” (aka mounted drive)
• Cloud Options
• Dropbox – 2 GB
• Google Drive – 15 GB
• Other similar providers
• Some resources like the Open Science Framework
• Occasionally, other local resources through Dept., Lab, IBEST/CRC/NKN
• NKN – approx. 2 TB
Backing Up Data
• 3-2-1 Rule
• Have at least 3 copies of your data
• Store them in 2 different media
• Keep 1 copy off-site (geographically differentiated)
• Common problems
• Corrupted data, failed hard drive, laptop lost/stolen, mistakes (deletions, user error)
• Example plan:
• One copy on local hard drive
• One copy on OneDrive (geographic replication off-site)
• One copy on a physical media device
Adapted from: Univ. of Virginia, 2020
Image from: PartitionWizard, 2020
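A backup plan can be checked against the 3-2-1 rule mechanically. The sketch below is one possible way to express the rule; the `Copy` record and the media labels are hypothetical and describe the example plan above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    location: str   # where the copy lives (hypothetical label)
    medium: str     # kind of storage medium
    offsite: bool   # stored away from the primary site?

def check_321(copies):
    """Report which parts of the 3-2-1 rule a set of copies satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.medium for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
    }

plan = [
    Copy("local hard drive", "internal disk", offsite=False),
    Copy("OneDrive", "cloud", offsite=True),
    Copy("external drive", "external disk", offsite=False),
]

report = check_321(plan)
print(report)
```

A plan with only the local copy fails all three checks, which is the situation the common problems listed above turn into data loss.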
Folder Organization
• Hierarchical:
• Project Name
• Folder 1
• Folder 2
• Subfolder 1
• Subfolder 2
• File 1
• File 2
• Pros: Clear, Easy to Use, Similar items stored together
• Cons: Items can only go in one place, can become too granular (too deep)
Adapted from Malinowski C. 2016
Folder Organization
• Tag-based:
• File 1 raw data
• File 2 manuscript
• Folder 1 raw data
• Folder 3 R scripts
• File 3 R code
• Pros: Items can go in more than one category, quick/easy
• Cons: Need a consistent tag list/code list, understanding collaborator’s tags can be challenging
Adapted from Malinowski C. 2016
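One way to keep tags consistent is to validate them against an agreed list and build a lookup from tag to files. The file names and tag list here are hypothetical, and real tagging tools (operating system tags, reference managers) store this information differently.

```python
from collections import defaultdict

# Agreed tag list shared with collaborators (hypothetical).
ALLOWED = {"raw data", "manuscript", "r code", "site a"}

# Each file can carry more than one tag.
TAGS = {
    "site_a_2020-04-03.csv": {"raw data", "site a"},
    "draft_v02.docx": {"manuscript"},
    "clean_temps.R": {"r code", "site a"},
}

def build_index(file_tags, allowed):
    """Map each tag to the files that carry it; reject tags outside the agreed list."""
    index = defaultdict(set)
    for filename, tags in file_tags.items():
        unknown = tags - allowed
        if unknown:
            raise ValueError(f"{filename}: tags not in the agreed list: {sorted(unknown)}")
        for tag in tags:
            index[tag].add(filename)
    return index

index = build_index(TAGS, ALLOWED)
print(sorted(index["site a"]))
```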
Folder Organization
• You ultimately want:
• A structure that allows you to quickly and easily find what you’re looking for
• To include documentation and descriptive information
• Organize folders into meaningful categories:
• Primary/secondary/tertiary levels of analysis
• Subject/collection method/time/space/data type
• Code/Reports separate from data itself
Adapted from Malinowski C. 2016
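A hierarchical layout that keeps code and reports separate from data, and includes a place for documentation, can be created with a few lines of Python. The folder names are only a suggestion, not a standard; the example builds the tree in a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: data, code, reports, and documentation kept apart.
LAYOUT = ["data/raw", "data/processed", "code", "reports", "docs"]

def create_project(root, name, layout=LAYOUT):
    """Create the project folder tree and a README for descriptive information."""
    project = Path(root) / name
    for relative in layout:
        (project / relative).mkdir(parents=True, exist_ok=True)
    (project / "README.txt").write_text(
        "Describe the project, the folders, and the file naming rules here.\n"
    )
    return project

with tempfile.TemporaryDirectory() as tmp:
    project = create_project(tmp, "salinity_study")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
```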
File Naming
• File naming convention (FNC):
• a framework for naming your files in a way that describes what they contain and how they relate to other files (Brandt, 2017)
• Principles
• Be consistent
• Be descriptive
• Imagine someone else trying to understand your file from the name
• Make it machine readable (computable) and human readable (comprehensible)
Adapted from Bryan, J. 2015
File naming conventions
• Time stamp is always useful
• YYYY-MM-DD: ISO 8601 date/time format
• Use leading zeros, except at the start of the file name
• Use only one period – it’s just clearer to use a period to denote the file extension
• Avoid using generic names – MyData, FirstProject, FinalData
Adapted from Bryan, J. 2015
From Strasser, 2011
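The conventions above can be enforced by generating and parsing names in code. The sketch below assumes one particular pattern (date, project, description, three-digit sequence number); the pattern and the helper names `make_name` and `parse_name` are choices made for this example, not part of any standard.

```python
import re
from datetime import date

def slug(text):
    """Lowercase and replace runs of non-alphanumeric characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def make_name(when, project, description, sequence, extension):
    """Build a name: ISO 8601 date, underscores between fields, zero-padded sequence."""
    return f"{when.isoformat()}_{slug(project)}_{slug(description)}_{sequence:03d}.{extension}"

PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<project>[a-z0-9-]+)"
    r"_(?P<description>[a-z0-9-]+)_(?P<sequence>\d{3})\.(?P<extension>[a-z0-9]+)$"
)

def parse_name(filename):
    """Split a conforming file name back into its fields (machine readable)."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"does not follow the naming convention: {filename}")
    return match.groupdict()

name = make_name(date(2020, 4, 3), "Salinity Study", "Raw Temps", 7, "csv")
print(name)
print(parse_name(name))
```

Because the date comes first and numbers are zero-padded, an alphabetical file listing is also a chronological one, and the single period is reserved for the extension.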
Dealing with Notes
Notecards & the Zettelkasten Method:
- Both focus on the connection between reading, writing, and thinking
- Bibliographic Reference Cards
- Direct Quotations Cards
- Summary Cards
Digital Tools:
- EverNote, OneNote, some specialized software tools, e.g. Bear and tools at the Zettelkasten site
Resources:
- http://www.raulpacheco.org/2018/11/note-taking-techniques-i-the-index-card-method
- https://zettelkasten.de
Annotations
Aside from the techniques above, consider annotation tools:
- Mendeley.com’s application has a built-in annotator
- Hypothes.is provides a browser extension for annotating documents
References
• DataONE. 2012. “DataONE Education Module: Analysis and Workflows.” Retrieved from: http://www.dataone.org/sites/all/documents/L10_Analysis Workflows.pptx
• University of Virginia Library Research Data Services + Sciences. 2020. “Data Storage and Backups.” Retrieved from: https://data.library.virginia.edu/data-management/plan/storage/
• Malinowski, C. 2016. FileOrg. http://libraries.mit.edu/data-management/files/2014/05/FileOrg_20160121.pdf
• Bryan, J. 2015. “naming things.” Presentation at a Reproducible Science Workshop. Retrieved from: http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf
• Strasser, C. 2011. “Best Practices for Ecological & Environmental Data Management.” Retrieved from: https://www.dataone.org/sites/all/documents/ESA11_SS3_carly.pdf
• Vera. 2020. “Best Practice: 3-2-1 Backup Strategy for Home Users & Businesses [Clone Disk].” Retrieved from: https://www.partitionwizard.com/clone-disk/backup-strategy.html