A peer-reviewed version of this preprint was published in PeerJ on 9 October 2018. View the peer-reviewed version (peerj.com/articles/5453), which is the preferred citable publication unless you specifically need to cite this preprint. Lazarus DB, Renaudie J, Lenz D, Diver P, Klump J. 2018. Raritas: a program for counting high diversity categorical data with highly unequal abundances. PeerJ 6:e5453 https://doi.org/10.7717/peerj.5453
Raritas and RaritasVox: Programs for counting high diversity categorical data with highly unequal abundances

David Lazarus Corresp., 1 , Johan Renaudie 1 , Dorina Lenz 2 , Patrick Diver 3 , Jens Klump 4

1 Museum für Naturkunde, Berlin, Germany

2 Leibniz-Institut für Zoo- und Wildtierforschung, Berlin, Germany

3 Divdat Consulting, Wesley, Arkansas, United States

4 CSIRO, Mineral Resources, Kensington, Australia

Corresponding Author: David Lazarus

Email address: [email protected]

Data on the occurrences of many types of difficult-to-identify objects are often still acquired by human observation, e.g. in biodiversity and paleontologic research. Existing computer counting programs used to record such data have various limitations, including inflexibility and cost. We describe a pair of new open-source programs for this purpose - Raritas and RaritasVox - which share a similar graphical user interface for mouse-based counting, and a common file output format. Raritas is written in Python and can be run as a standalone app for recent versions of either MacOS or Windows, or from the command line as easily customized source code. RaritasVox in addition supports voice-based counting but is written in Java and is more complex to install or modify. Both programs explicitly support a rare category count mode which makes it easier to collect quantitative data on rare categories, e.g. rare species which are important in biodiversity surveys. Lastly, as to our knowledge no standards exist yet, we describe a new stratigraphic occurrence data (SOD) unitary file format which combines extensive metadata and a flexible structure for recording occurrence data of species or other categories in a series of samples.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26836v1 | CC BY 4.0 Open Access | rec: 9 Apr 2018, publ: 9 Apr 2018


David B. Lazarus1, Johan Renaudie1, Dorina Lenz2, Patrick Diver3 and Jens Klump4

1 - Museum für Naturkunde - Leibniz-Institut für Evolutions- und Biodiversitätsforschung, Berlin, Germany

2 - Leibniz-Institut für Zoo- und Wildtierforschung, Berlin, Germany

3 - Divdat Consulting, Wesley, Arkansas, USA

4 - CSIRO, Mineral Resources, Kensington, Australia

Corresponding author - David Lazarus, [email protected]

Author contributions

DBL created the main program specifications, designed the GUI and wrote the paper. JR wrote Raritas, DLenz and JK designed the voice functions and wrote RaritasVox. DBL and PD created the SOD format.



Introduction

Human observations as a source of scientific data

Quantitative data about many aspects of the natural world are collected in modern science with the use of instruments, but a substantial amount of observational data is still collected by human observation. This is particularly common in ecology, organismal biology and behavioral sciences, where numeric data on the frequencies of occurrences of biologic phenomena are desired, but the objects/phenomena to be counted are too complex to identify by instruments or fully computerized image analysis systems. Up until the spread of desktop computers, such counts were done mostly either with the aid of mechanical counter buttons (including arrays of several buttons, to allow counting of multiple categories) or tallied by hand on printed list forms. Both methods are slow and require re-entering the count values into a computer afterwards before analysis, adding additional time and possibilities for error. Computer 'point-counting' programs can in principle replace these methods and at the same time provide additional functions that mechanical methods cannot, such as continuous statistical summaries of the data as it is being collected, which provides useful feedback to the observer on how complete or accurate the dataset being collected is.

Despite these obvious advantages, counting programs have yet to fully replace manual methods. There are many reasons for this, including cost, inflexibility, compatibility and inadequate ease of use. Numerous inexpensive or free simple tally counter programs are available that can replace mechanical counter buttons (e.g. dozens of simple smartphone/tablet apps, or more sophisticated desktop apps, e.g. Versacount: Kim & DeRisi, 2010). None of these however are well suited to counting larger numbers of categories, which is common in ecology, and in related fields such as paleontology. The need to count many objects in many categories is particularly acute in biodiversity related disciplines, e.g. field surveys of species diversity; species counts of fossil assemblages in micropaleontology. In such studies the diversity of objects and total numbers of objects available for study are both very high. Several programs have been developed to assist in biodiversity assessments (e.g. 'OrgaCount': www.aquaecology.de; 'Beecam': www.avansee.com). As many micropaleontologists work in commercial (oil industry) settings, there are also several sophisticated counting programs available (many as commercial products) for counting large numbers of microfossils: Polpal (Nalepka & Walanus, 2003); Foramsampler (Mcgann et al., 2006); Counter (Zippi, 2007); Stratabug (Stratadata, 2014); Bugwin (Bugware, 2016). These programs, whether for biologists or industrial micropaleontologists, however frequently are limited in one or more ways. Many are embedded in larger, more specialized packages with features for a single discipline, e.g. stratified ecologic sampling, biostratigraphic range charting, petrologic thin section analyses. Programs are often complex to install, or are lacking in flexibility, adaptability and/or ease of use. Many are also closed-source, expensive, and are dependent on the commercial provider to maintain. There is thus a need for a program that is relatively simple, free, open-source, less specialized and thus adaptable to counting a variety of different types of objects, and that works with different operating systems. Most importantly, it must be as easy to use as mechanical methods, since a program that is significantly slower will, based on our experience, normally be rejected by users. Users often need to count thousands of objects (see 'Rarity' below), and an even marginally slower data entry method will create an unacceptable cumulative loss of the user's time. This is particularly true in counting objects such as microfossils, or in field biodiversity surveys, where vast numbers of specimens are available and can be quickly identified by the user, making data entry the time-limiting factor in data collection.

Rarity

In addition to the general need for flexible, efficient counting programs, there is also a specific need to count objects which have very different relative abundances. Many classes of objects in the observable world show a characteristic pattern of unequal relative abundances that can be approximated by power laws, including incomes, internet traffic, plankton sizes, and the sizes of interstellar mineral grains (Mathis et al., 1977, Reed & Hughes, 2002, Buonassissi & Dierssen, 2010). Biologic entities, in particular species abundances in ecology and paleontology, also typically show such distributions, with a few species being relatively common, and the remainder uncommon or quite rare (Preston, 1948, Brown et al., 2002). Counting objects at random from such unevenly distributed populations results in many counts of the few common species, but very few counts of rarer species. For example, in both the complete dataset, and in individual samples, counts of fossil radiolarians in Neogene Southern Ocean sediments show a few very common species, and many rare species (Figs. 1, 2). Even with >700,000 individuals, a substantial fraction of the species are represented by 10 or fewer individuals. Thus, in order to encounter at least one individual of all rare species very large numbers of specimens need to be examined. For example, several thousand individuals needed to be examined in order to recover 95% of the estimated total species diversity (ca 200 species) in the single sample counted in Fig. 2 (Fig. 3).

Ecologists and paleontologists thus sometimes decide to base studies only on the small number of species that are relatively common and thus whose abundances are easy to quantify. Many applied micropaleontologic studies for example use the environmental preferences of a relatively small number of common species to reconstruct past environmental conditions (Imbrie & Kipp, 1971, CLIMAP project members, 1976). Not all scientific questions can however be addressed by examination of only a small number of common species. Unlike, e.g. mineral grains, each biologic species is unique, with its own potential to contribute to ecosystem function and, over the longer term, to evolutionary change. Biodiversity research in particular is concerned with documenting total species richness and understanding threats to it, e.g. how current and past environmental change affects it. The findings of such research feed into important decisions on biodiversity conservation, land use and other global issues (e.g. the 'Rio' Convention on Biological Diversity: www.cbd.int). Reasonably accurate estimates of total diversity - crucial in biodiversity studies - can only be made when the majority of the diversity has been counted. Extrapolations from less complete data tend to have unacceptably high error values (Colwell et al., 2012). There is thus a major effort to understand the total species richness of modern and past biologic systems (Mora et al., 2011), and consequently, the need to collect quantitative data on many rare species (Roberts et al., 2016).

One approach to achieving this is based on the human ability to scan large populations to identify a subset of target individuals much more rapidly than the same person could fully identify and record the identity of each individual in the population. As a simple example, it is much faster to scan a large crowd of people to identify a single category of persons of interest ('tall men with beards'), than to identify each person in a crowd and record all of their names. Similarly, one can quickly skip individuals belonging to a specific category to target other individuals. Biologists and paleontologists collecting data on rare species make use of this ability by first counting all individuals encountered to identify common species, then, mentally blocking out the common species, continuing to count only species that are not in the 'common' group. In this 'rare category' mode individuals of common species can be scanned over much more rapidly, and their counts for the total area viewed estimated afterwards based on their abundances in 'all species' mode. Larger total numbers of individuals are thereby examined, and a better estimate of total species richness can be obtained (Gannon, 1971, Hinds, 1999, Stevenson et al., 2010). A good counting program for such work should offer options that support this style of efficient counting of only rare taxa. This ability is however, to our knowledge, normally not offered in currently available counting programs, which are mostly designed to support counts of smaller numbers of species and individuals in support of applied (paleo)environmental research.

Materials and Methods

Raritas and RaritasVox are two new programs for counting (tallying) multiple categories of objects which meet these criteria. Both offer a flexible mouse-driven interface for counting highly diverse lists of taxa, including both buttons for more common taxa, and hierarchical menus to select rare taxa. An additional feature of the programs is the definition of a new file format for storing such count data that uniquely combines the data and detailed metadata in a user-friendly spreadsheet style layout. Compiled apps, source code, user guides, sample configuration and output files are all publicly available at https://github.com/plannapus/Raritas.

The programs provide explicit support of dual-mode (all vs rare only) counting, and indeed this feature is the basis for the program names. In standard mode, all individuals seen are counted. In 'rare only' mode, commonly occurring objects are no longer counted: only rare objects are. Not having to pause to enter a count for the most frequently seen object types makes counting rare object categories much faster. However, in order to be able to combine counts for common and rare types together, it is also necessary to know the magnitude of observational effort made in each counting mode, as the total frequencies of common objects are estimated for the 'rare objects only' interval based on their frequency in 'all object' counting, and the observational effort spent in 'rare' mode. A computer program that supports rare-only counting must therefore be able to monitor observational effort in parallel to recording individual object counts. This is provided for by a separate counter for observational effort, a 'track' counter which the user updates periodically while counting.
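The way counts from the two modes can be combined may be sketched in a few lines of Python. This is a minimal illustration and not the Raritas source code: it assumes that observational effort is measured in tracks and that excluded (common) categories occur at the same rate per track in both modes. The function name, argument names and example values are all hypothetical.

```python
def combine_counts(all_mode, rare_mode, tracks_all, tracks_rare, excluded):
    """Estimate whole-sample counts from dual-mode counting.

    all_mode / rare_mode: dicts of category -> tally in each mode.
    tracks_all / tracks_rare: observational effort (tracks) spent in each mode.
    excluded: categories that were skipped during rare-only mode.
    Assumption (for illustration only): excluded categories occur at the
    same rate per track in both modes.
    """
    if tracks_all <= 0:
        raise ValueError("at least one track must be counted in all-object mode")
    scale = (tracks_all + tracks_rare) / tracks_all
    combined = {}
    for name in set(all_mode) | set(rare_mode):
        if name in excluded:
            # Extrapolate from the rate observed in all-object mode.
            combined[name] = all_mode.get(name, 0) * scale
        else:
            # Rare categories were counted directly in both modes.
            combined[name] = all_mode.get(name, 0) + rare_mode.get(name, 0)
    return combined


# Hypothetical example: 4 tracks in all-object mode, 12 in rare-only mode.
estimate = combine_counts(
    all_mode={"common sp.": 120, "rare sp.": 2},
    rare_mode={"rare sp.": 5},
    tracks_all=4,
    tracks_rare=12,
    excluded={"common sp."},
)
print(estimate["common sp."], estimate["rare sp."])
```

In this made-up example the common category is scaled by the ratio of total to all-mode effort (16/4), while the rare category is simply summed across the two modes.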

The main program, Raritas, is written in Python (van Rossum et al., 2010). The second - RaritasVox - is written in Java, and was in fact the initial test development version. This older version provides most, though not all, of the features of the main Python version in mouse-based counting. In addition it provides a unique option to register counts directly from voice input by the user, who simply speaks the category names. Regardless of method or program variant, the same type of output, setup and configuration files are used.

These programs' ease of use involves both ease of configuration and ease of use during primary operation. Raritas and RaritasVox are configured almost entirely from the contents of a simple tabular type file which can be created easily by users using a spreadsheet program. The file contains a list of which objects (e.g. species) are to be counted, and how these are to be presented to the user (button labels and other details). This also simplifies the program as there is no need to write code for configuration, other than reading the configuration file.

Detailed metadata is captured for each dataset and saved with the data in the output files. This is often a weakness in other (e.g. commercial) programs where relatively little information is captured. Reliance on program-external metadata capture such as embedding all metadata in filenames is obviously limited in extent, not well structured and in our experience has not been very reliable, particularly when metadata needs to be understandable over the long-term (i.e. by other than the file creators).

Raritas has been programmed in Python because it is a popular, well supported, and relatively easy to learn multi-paradigm scripting computer language. It is more likely to be understandable to workers in fields such as taxonomy/systematics than the more complex, object-oriented compiled language Java. RaritasVox was programmed in Java in order to make use of specialized libraries for voice recognition: the Sphinx open-source speech recognition engine (Walker et al., 2004) (http://www.speech.cs.cmu.edu/sphinx/doc/Sphinx.html), and to ensure speed, which is needed for the complex task of voice recognition - Java code generally executes much faster than Python code. Both programs run quickly on all hardware tested (desktop and laptop computers with Intel 'i' series processors, running Windows 7-10; OS X 10.9-12). Raritas consists of ca 650 lines of Python code; RaritasVox of nearly 4,000 lines of Java. The use of Python, plus the much smaller size of the code, makes customization of Raritas's features possible by technically savvy users, without the need to employ a professional programmer. Python also provides excellent packages for some functions such as plotting data that allow the program to produce better outputs for the user without having to write additional code (e.g., matplotlib). Python is not without problems - installing the various software modules (packages), including packages used by other packages (dependencies) that an application needs can be very difficult for a non-specialist, depending in part on the local python environment used. Raritas is therefore offered both as a fully bundled program (double-clickable) with all needed packages included for Mac OS X 10.11+ as well as for Windows 7 and 10; and also as source code: the former providing ease-of-use for non specialists; the latter customizability. RaritasVox is also available either as a bundled app (a .jar file) or as source code. The bundled versions are each ca 100 Mb in size.

Installation

No special installation procedure is needed for the Raritas program when used as the bundled app. Using the source code version of Raritas (python) requires installing only two python packages (and their dependencies): matplotlib and wxPython (Hunter, 2007, Dunn, 2014). These must be installed using the appropriate python or OS package manager for the user's python system, which will automatically install any dependencies. Some python distributions already include both packages as part of their standard installation, thus requiring no special installations by the user. RaritasVox requires a Java environment (available for free download, often installed previously in many systems) in addition to the app itself. Installing the source code version of RaritasVox is considerably more complicated: details are given in Appendix 1.
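Before running the source code version, a user may want to check whether the two required packages are already present in the local python environment. The short script below is one possible way to do so using only the standard library; it is not part of Raritas, and the function name is hypothetical. Note that wxPython is imported under the module name wx.

```python
import importlib.util

# Package name -> module name used at import time.
# wxPython is imported as "wx".
REQUIRED_MODULES = {"matplotlib": "matplotlib", "wxPython": "wx"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the names of packages whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("matplotlib and wxPython are both available.")
```

Any package reported as missing would then be installed with whichever package manager the local python system uses.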

Configuration file and starting the program

Both programs read a single configuration file on starting - by default, the one previously used, or a new one chosen by the user. The file (Fig. 4; Appendix 2) is in tab-text format and is just a list of taxa names and how each should be presented to the user in the GUI interface. All names are available by drop-down list by default. Names can also be shown as buttons (with abbreviations to ensure the button label fits). If a second set of names of higher level categories is provided for the primary names, the name list is parsed into multiple lists with multiple drop-down menus, thus providing structure to longer name lists and more rapid access to taxa names.
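How a tab-text name list of this kind might be turned into buttons and grouped drop-down menus is sketched below. The actual column layout is defined in Fig. 4 and Appendix 2 and is not reproduced here, so the three columns used (name, abbreviation, group), the sample rows and the function name are all hypothetical placeholders chosen only for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-column layout: full name, button abbreviation
# (blank = no button), higher level category (blank = ungrouped).
SAMPLE_CONFIG = (
    "name\tabbreviation\tgroup\n"
    "Species A\tSp. A\tGroup 1\n"
    "Species B\t\tGroup 1\n"
    "Species C\t\tGroup 2\n"
)


def parse_config(text):
    """Split a tab-text name list into button labels and grouped menus."""
    buttons = {}                # abbreviation -> full name
    menus = defaultdict(list)   # category -> names for one drop-down menu
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        name = (row.get("name") or "").strip()
        if not name:
            continue
        abbreviation = (row.get("abbreviation") or "").strip()
        if abbreviation:
            buttons[abbreviation] = name
        group = (row.get("group") or "").strip() or "All taxa"
        # Every name stays reachable from a drop-down list.
        menus[group].append(name)
    return buttons, dict(menus)


buttons, menus = parse_config(SAMPLE_CONFIG)
print(buttons)
print(menus)
```

Keeping every name in a menu, and adding a button only when an abbreviation is supplied, mirrors the behavior described above in which all names are available by drop-down list by default.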

Bundled versions of either program are started by the usual double-click on the app icon or other standard GUI methods. The source code version of Raritas is started by a standard 'python raritas.py' statement (optionally including a path name, if appropriate) at the command line. Once the program starts all interaction takes place via the GUI interface that then appears. RaritasVox cannot be run directly from the source code as Java is a compiled language - any customized version of the RaritasVox Java code must first be compiled and linked either via the command line or a programming tool such as an IDE.

GUI interface for manual counting

The main elements of the GUI interface for either version, once started, are: the metadata window, the counting window, the rare count configuration window and the collector curve window.

Metadata window (Fig. 5). When the program is first started a window appears which provides a pop-up list of primary counting style options (file types), based on the SOD file specification (described below). The next window collects the metadata appropriate for the file type, e.g. field names that are used in the rest of the program for the material to be counted. At the moment the program supports two types of primary data, both for microfossil occurrences: assemblages of microfossils from deep-sea sediments obtained by the international deep-sea drilling programs, or fossils from samples obtained from geologic sections on land, but other types can be defined. The metadata window also provides a few run-time options for configuring the interface and behavior during counting. Importantly, the user chooses which taxa name list configuration file they want to use via a normal file open dialog at this time. When ready the 'start counting' button is clicked and the counting window appears.

Counting window (Fig. 6). This is the main window that is used for most interaction with the program. The upper part of the window is populated with the buttons for counting common species, with labels as defined in the configuration file. Less common taxa are shown in the form of popup lists, organized into higher level categories, again as defined in the configuration file. Putting less common taxa into lists and common taxa on buttons allows most counts to be done quickly with a button, while the comparatively slow process of selecting from a list is reduced to a minimum. Lists are needed however as they can be of arbitrary length, while the number of buttons is limited by screen size. Counting is active whenever the window is present. Clicking on a button or selecting a taxon from the lists adds the species to the count data structures. A list of recently counted objects is given in the sub-window (lower middle of main window). A button is provided on the right to count observational effort ('Track', for the number of 'tracks' scanned on a microscope slide) and a counter shows the total tracks counted.

Clicking on 'Rare Count Mode' brings up a dialog (Fig. 7), where the counted objects are listed in order of descending abundance, and the user can choose which to exclude from further counting. When the dialog is dismissed counting resumes, with, for those taxa to be excluded, the taxa buttons greyed out and pop-up list items inactivated.
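The state that a counting window of this kind has to maintain can be illustrated with a small data structure. The class below is a hypothetical sketch and is not taken from the Raritas or RaritasVox source: it keeps a tally per category, a short list of recent entries, a track counter for observational effort, and a set of categories excluded in rare count mode.

```python
from collections import Counter, deque


class TallySession:
    """Hypothetical in-memory state behind a counting window."""

    def __init__(self, recent_size=5):
        self.counts = Counter()                  # category -> tally
        self.recent = deque(maxlen=recent_size)  # most recent entries
        self.tracks = 0                          # observational effort
        self.excluded = set()                    # skipped in rare count mode

    def count(self, name):
        """Register one object; ignored if the category is excluded."""
        if name in self.excluded:
            return False
        self.counts[name] += 1
        self.recent.append(name)
        return True

    def add_track(self):
        """Record that one more track has been scanned."""
        self.tracks += 1

    def by_abundance(self):
        """Categories in descending order of tally."""
        return self.counts.most_common()


session = TallySession()
for name in ["A", "A", "B", "A"]:
    session.count(name)
session.add_track()
print(session.by_abundance())

# Switching to rare count mode: exclude the most abundant category.
session.excluded.add("A")
accepted = session.count("A")
print(accepted, session.counts["A"])
```

The descending list returned by by_abundance corresponds to what a rare count dialog would present to the user, and the excluded set plays the role of the greyed-out buttons and inactivated list items.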

Determining which species to exclude in rare count mode is not trivial. As this is a key feature of Raritas we include the following suggestions, which are based on our experience of counting ca 700,000 total specimens (several thousand specimens per sample in over 100 samples) for the study published in (Renaudie & Lazarus, 2013). The tally to use to trigger the switch to rare-only counting, and the percentage threshold for species to be ignored during 'rare' count mode should, as a rule of thumb, maximize the number of specimens to ignore while minimizing the error on the abundant species percentages. In (Renaudie & Lazarus, 2013), we chose to stop the full count mode when ca. 2,000 specimens were already counted and to ignore in 'rare' count mode species with a percentage higher than ~5% of the community. Doing so allowed us to keep the error to ca. 10% of the investigated value. In other words, for a species that was present at 5% abundance in full count mode, the theoretical standard error is slightly below 10% of this 5% value, i.e. a theoretical percentage for the species between ca. 4.5 and ca. 5.5% (Drooger, in Zachariasse et al., 1978) (Fig. 8a). These cut-off values eliminated 59.7% of the specimens during rare-only mode (median of all samples counted, but varying from one sample to the other: black line on Fig. 8b for the median, dark grey area for the interquartile range and light grey area for the total range). An additional, important criterion that was taken into consideration is that all samples encountered had at least one species above the 'ignore in rare-only mode' percent threshold. Using a higher threshold than 5% would have meant that some samples would have had to be counted entirely in full count mode, as no species would have been abundant enough to exclude. In our study, there were on average ca three (mean = 2.9) percent of the species above the cut-off threshold per sample (blue and red lines on Fig. 7b).
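The figures quoted above can be checked with a short calculation. The sketch below assumes that the 'theoretical standard error' is the ordinary binomial standard error of a proportion, sqrt(p(1 - p)/n); the text attributes the value to Drooger (in Zachariasse et al., 1978) without giving a formula, so this choice is an assumption, and the function name is hypothetical.

```python
import math


def proportion_standard_error(p, n):
    """Binomial standard error of a proportion p estimated from n specimens."""
    return math.sqrt(p * (1.0 - p) / n)


p, n = 0.05, 2000           # 5% abundance after ca. 2,000 specimens
se = proportion_standard_error(p, n)
relative_error = se / p     # error as a fraction of the 5% value

print(f"standard error = {100 * se:.2f} percentage points")
print(f"relative error = {100 * relative_error:.1f}%")
print(f"range = {100 * (p - se):.2f}% to {100 * (p + se):.2f}%")
```

Under this assumption the standard error is about 0.49 percentage points, i.e. just under 10% of the 5% value, which is consistent with the range of ca. 4.5 to ca. 5.5% given in the text.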

The 'Show Collector's Curve' menu item (Raritas, or button, RaritasVox) brings up the fourth main GUI element - a diversity accumulation plot (Fig. 9) showing the relationship of the total number of object types seen (species) to the total number of objects counted (specimens). For typical biologic data these curves are roughly logarithmic in shape - at first rising rapidly, then, as species already seen previously are increasingly re-encountered, flattening out. The curve's slope will eventually become zero when all object types in the sample have been detected (compare to Fig. 2). The user can decide when the curve has become close enough to this state for his/her purposes, and thus stop counting only when the data completeness quality is adequate. If a series of samples are counted to the point where they have the same apparent slope at the end of this dynamically generated diversity accumulation curve, they will share the property of being 'fairly' sampled, and relative differences in diversity will be shown without bias (Alroy, 2010, Colwell et al., 2012). This type of feedback is important to ensuring good quality observations and is something that cannot be provided by simple mechanical count systems. It is however rarely implemented in programs known to us.
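The data behind such a collector's curve are simple to derive from the order in which objects were counted. The sketch below is illustrative only and is not the plotting code used by Raritas: it ignores rare count mode, the function names are hypothetical, and the 'recent slope' is just one possible way of putting a number on how flat the end of the curve has become.

```python
def collectors_curve(sequence):
    """Return (specimens counted, distinct categories seen) after each entry."""
    seen = set()
    curve = []
    for n, name in enumerate(sequence, start=1):
        seen.add(name)
        curve.append((n, len(seen)))
    return curve


def recent_slope(curve, window):
    """New categories per specimen over the last `window` entries."""
    if window < 1 or len(curve) <= window:
        raise ValueError("window must be >= 1 and shorter than the curve")
    (n0, s0), (n1, s1) = curve[-window - 1], curve[-1]
    return (s1 - s0) / (n1 - n0)


counted = ["A", "A", "B", "A", "C", "A", "B", "A", "A", "A"]
curve = collectors_curve(counted)
print(curve[-1])
print(recent_slope(curve, window=5))
```

In this toy sequence no new category appears in the last five entries, so the recent slope is zero; with real data a user would compare such end-of-curve slopes between samples when judging whether they have been sampled to a similar degree of completeness.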

Voice interface

RaritasVox has a similar GUI to Raritas, with only fairly minor differences in the layout of elements or functional behavior (e.g., RaritasVox allows colors to be assigned to taxa names as an aid to accurate name selection in the interface), and thus is not described separately here - details are given in Appendix 1. The main difference in functionality is the ability to use a voice driven counting mode, selected via a control button from the main counting window. The motivation was the observation that, for some users, the constant change of focus between microscope and counting program (or paper sheet) while counting microfossils under a microscope places a strain on the user's vision. Some researchers affected by this problem had developed a voice-based counting procedure: calling out species identifications and recording the counts as audio recordings, then later playing them back and transferring the species counts into their counting sheets. RaritasVox was conceived as a way, by using speech recognition, to make this process more efficient and ergonomic.

Between 2009, when RaritasVox was developed, and today, speech recognition has made tremendous advances and has become a commonplace functionality in many everyday applications, e.g. Apple's "Siri". Speech recognition systems can be classified into two categories. "Speaker dependent" systems use "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training, including RaritasVox, are called "speaker independent" systems. RaritasVox however makes use of the fact that the counting process uses an independent vocabulary that is defined in a configuration file (Fig. 10; Appendix 2). Users may not only use their own short terms for species rather than the full taxonomic name, e.g. "pachyleft" instead of "Globigerina pachyderma sinistral", but can also modify the configuration file so that the program can better recognize an individual's normal pronunciation style. This is for example useful for users with different native languages, as vowels in particular are often pronounced differently, even for Latin taxa names. For example "Prunopyle" is pronounced proo-no-peil by English speakers, and proo-no-peel-ae by Germans.

At the time RaritasVox was first being planned (2009) only a few cross-platform packages were

available. The speech recognition software Sphinx and Java were chosen as the best combination

for an open-source, cross platform speech recognition package and language environment for our

purposes. For Sphinx the elemental components of speech sounds are interchangeably referred to

as "phones" or "phonemes" (see http://www.speech.cs.cmu.edu/sphinx/doc/Sphinx.html and

http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Only phonemes listed in the phoneme set of the

CMU Pronouncing Dictionary (around 40) can be used and it expects that the language used is

English. Only words consisting of one or more phonemes that are present in the customized

dictionary file (Fig. 10) can be recognized as "correct". The software will search for words

consisting of phonemes present in the dictionary which match best to the speech input. In

RaritasVox the spoken word is recognized, confirmation is shown on screen, and a count

command for that item is generated (Fig. 11).
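Because only phonemes from the CMU set can be used, a user-edited dictionary file is easy to break with a single mistyped symbol. The following is a minimal sketch, in Python rather than the Java used by RaritasVox, of how such a file could be checked before use. The function name `parse_dictionary` and the error messages are hypothetical and are not part of RaritasVox or Sphinx; the line layout (word followed by space-separated phonemes) is assumed from Fig. 10, and the phoneme list is the 39-symbol CMU set without stress digits.

```python
# Hypothetical helper, not part of RaritasVox or Sphinx: checks a CMU-style
# pronunciation dictionary before it is handed to a recognizer.

# The 39 phoneme symbols of the CMU Pronouncing Dictionary (stress digits omitted).
CMU_PHONEMES = frozenset("""
AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH
""".split())


def parse_dictionary(text):
    """Return {word: [phoneme, ...]} from lines of the form 'WORD PH PH ...'."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phonemes = fields[0], fields[1:]
        if not phonemes:
            raise ValueError(f"line {line_no}: '{word}' has no pronunciation")
        unknown = [p for p in phonemes if p not in CMU_PHONEMES]
        if unknown:
            raise ValueError(
                f"line {line_no}: unknown phoneme(s) {unknown} in '{word}'"
            )
        entries[word] = phonemes
    return entries


# Three entries copied from the extract shown in Fig. 10.
SAMPLE = """
ZIGZAG Z IH G Z AE G
MITA M IY T AH
BUNNYEARS B AH N IY IH R Z
"""

vocabulary = parse_dictionary(SAMPLE)
print(sorted(vocabulary))
```

A check of this kind would catch, for example, a German-style vowel written with a symbol outside the CMU set before the recognizer is started.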

RaritasVox was not used to collect research data and was only briefly tested for accuracy (Table

1).

Using a list of 18 words and 108 voice entries, four words were incorrectly identified (<4%),

resulting in 8 incorrect counts (7.5%). This is similar to accuracy in much more sophisticated,

general voice recognition systems [27], which is possible as RaritasVox uses a very limited

vocabulary. The count error rate may be too large for data collection where rare occurrences are

important (e.g. biostratigraphy) but adequate for others such as gross assemblage composition,

particularly when combined with statistical data reduction procedures such as factor analysis that

are insensitive to small amounts of random data scatter [13]. The accuracy is in any event

controllable by the user as they can, by monitoring the computer screen, correct errors before they

are counted using the spoken 'Remove' command to delete the last (incorrect) identified word.
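The correction step described above can be pictured as a tally that keeps a history of recognized words, so that the most recent one can be withdrawn. The sketch below is an illustration only: the class `VoiceTally`, its method `hear` and the upper-case spelling of the command are assumptions, and it does not reproduce how RaritasVox itself handles the 'Remove' command.

```python
from collections import Counter


class VoiceTally:
    """Hypothetical tally that can withdraw the most recently recognized word."""

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.counts = Counter()
        self.history = []  # recognized words, in the order they were counted

    def hear(self, word):
        """Process one recognized word; 'REMOVE' withdraws the last count."""
        if word == "REMOVE":
            if self.history:
                last = self.history.pop()
                self.counts[last] -= 1
                if self.counts[last] == 0:
                    del self.counts[last]
            return
        if word not in self.vocabulary:
            return  # words outside the defined vocabulary are ignored
        self.history.append(word)
        self.counts[word] += 1


tally = VoiceTally(["ZIGZAG", "MITA", "SPYRO"])
# "SPYRO" stands for a misrecognized word that the user withdraws immediately.
for spoken in ["MITA", "ZIGZAG", "SPYRO", "REMOVE", "MITA"]:
    tally.hear(spoken)
print(dict(tally.counts))
```

Keeping the full history rather than only the last word means that repeated 'Remove' commands could step back through several entries, which may or may not match the behavior of the actual program.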

Output files

SOD File Format

In addition to the diversity accumulation plots, which can be saved as graphics as often as desired

(the matplotlib library used in Raritas supports various file formats, e.g. png, pdf, jpg, tif), the

program saves the primary count data. This necessitates choosing, or creating, a format for the

data files, as there is no universal community database which would allow a direct upload

solution. Despite a great deal of biostratigraphic or other data of the form of species by

samples/observations having been generated globally for many decades, no generally accepted or

even widely known file format exists for such data. Other fields have developed community data

formats for such data matrices, e.g. the BIOM format for biological observation matrices

(McDonald et al., 2012), as well as standard protocols to exchange information directly between

computer systems e.g. Darwin Core (Wieczorek et al., 2012). These formats are however of

limited use for paleontologic fossil occurrence matrices since they lack any way to store

metadata, general or individual sample, that is related to geologic age (sample position in section,

formation name, etc), and the metadata in general is optimized for biologic, not paleontologic

observations. One of the major biologic exchange protocols (ABCD: (Berendsohn, 2007),

http://wiki.tdwg.org/ABCD/) does have, via the EFG extension (http://www.geocase.eu/efg) the

ability to transmit both biologic and geologic data, but is a communication protocol, not storage

format, and the xml definition is not readable by normal users.

Within the field of paleontology, data on occurrences, outside of micropaleontology, are

dominated by simple taxa lists for a single locality (one sample). This is exemplified by the main

data input format of the most widely used paleontology community database PBDB (Alroy et al.,

2001), where data is entered, taxon by taxon, for one sample at a time. Within micropaleontology

taxa-by-sample data matrices are common (often referred to as 'range charts') but data is usually

given in the format of individual publications, without metadata in the files, in numerous

variations of a simple taxa-by-sample table. This is also the file format used by the deep-sea

drilling programs (DSDP, ODP, IODP), which have not generally captured micropaleontology

data except in a very limited form on-ship, using database entry forms, or simply archived data

copied from publications, with only minimal metadata stored separately from the data files.

Lastly there are several more comprehensive data file formats that are associated with

commercial micropaleontology, i.e. the oil industry. These formats include metadata, details of

stratigraphy etc, but are not compatible with each other and are mostly meant for internal use in

proprietary commercial programs, not for open file exchange. Most also tend to be quite user

unfriendly, giving sample and taxa names in separate definition blocks from the actual occurrence

data, and use a long, non-tabular, list type structure that makes comprehension difficult. There is

thus a need for a public (non-proprietary) file format that combines metadata and the

taxa-by-occurrences data in a single file, provides for geologic age or section information and which is

easy for scientists to read and use.

We have therefore adopted a new 'open file format': Stratigraphic Occurrence Data format, which

we abbreviate here simply as SOD format. This format originally was developed in response to

the need to merge metadata and occurrence data in user typed files, in order to manage a large

number of fossil occurrence matrix files that were being digitized from the literature for upload

into a database that provides a micropaleontologic equivalent to the PBDB: NSB (Lazarus, 1994,

Spencer-Cervato, 1999). This database reports occurrences of microfossils in deep-sea sediment

sections, and the data is mostly derived from studies that report the occurrences in the form of

simple samples by species tables, one table per section, per higher fossil group. The file format

itself is deliberately meant to be visually similar to the source publication data tables, being

essentially an enhanced version of the publication's tabular data matrix. This makes the file easily

read by users, and equally makes the transcription (keying-in) of data from publications into the

format relatively simple - in some cases, where a publication file is available in digital form,

simply by reformatting some of the fields, rather than re-entry of primary values. SOD format

however is significantly different from an 'ordinary' user data table in that it is based on a formal,

extendable definition of content. This definition adds more structure and detail for both taxa and

sample names, and uses the otherwise empty 'corner' of the matrix at the intersection of the row

and column labels to include, in a structured way, more general metadata about the occurrence

data in the file.

The file is laid out in 4 graphical blocks: general metadata: upper left corner block; taxa

metadata: left columns below metadata block; sample metadata: rows to right of corner metadata

block; and the occurrence data itself in the remaining lower right block (Fig. 12). Flexibility is

provided for in two ways. The individual fields in each block can be populated by different actual

data types, depending on the overall record type as determined by the 'File Type' field. Currently

there are only two defined file types, for deep-sea drilling data and more traditional land section

data (O and L, respectively). These differ both in general metadata (Site location vs geographic

name and geographic coordinates), and in the way in which sample names are structured:

deep-sea drilling samples ('O' files) use a consistent Site-Hole-Core-Section-Interval format, while land

sections are more variably defined, but usually include some combination of geologic formation,

vertical position in section and sample name (usually unique to each study); with additional

information often recorded on geologic age or biostratigraphic zone and lithology. SOD 'L'

formatted files include all these fields. Within the broad constraints of total fields available, the

number of file types using this layout is open to indefinite expansion. The SOD layout itself is

also extensible, as the version is written in the first metadata field in each file. The field

definitions and thus the data expected in each field are determined by these control fields, and

different layouts can be defined, for example with additional rows for sample name fields. This

flexibility however requires a separate source of information that defines, for the user and

programmer, what the field contents must be for each 'File Type' or SOD version number. These

definition requirements are the fundamental difference between regular data files as found in the

literature, and the SOD format. The definitions are given in two ways (which also allows cross

checking for data consistency). First, the tabular file definition requires full labeling - each cell,

row or column that holds data has an adjacent cell with fixed text content defining the data cell(s)

adjacent, so that the content resembles a simple key:value non-relational database structure. This

means the files are largely self documenting, and provides sufficient explanatory information to

users so that they can create new data files from a template file (containing labels but no data

values). Second, programs that read SOD files are expected to have a definition table of some

sort which gives the location and meaning of each cell for each file type and each SOD version.

Currently this is implemented in a table in the NSB database and used by programs (both a

python script and an R procedure at present) that read and upload SOD data into the NSB system.

This definition list could also be included (e.g. as a second 'page' in a spreadsheet file) with the

data files themselves. A full list of current SOD field definitions and additional details of the

format are given in Appendix 3.
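The consistent Site-Hole-Core-Section-Interval structure of 'O' type sample names is what allows a program to split them into separate fields. As a rough illustration, the sketch below parses a name written in the style used in the figure captions of this paper ("751A-6H-6, 98-100 cm"). The regular expression, the class `DrillSample` and the function `parse_drill_sample` are assumptions made for this example; they are not taken from the SOD definition in Appendix 3 or from the NSB upload scripts, and real sample names may use variants this pattern does not cover.

```python
import re
from typing import NamedTuple


class DrillSample(NamedTuple):
    site: int
    hole: str
    core: int
    core_type: str
    section: int
    top_cm: float
    bottom_cm: float


# Assumed pattern, modelled only on names like "751A-6H-6, 98-100 cm".
SAMPLE_PATTERN = re.compile(
    r"""
    ^\s*
    (?P<site>\d+)(?P<hole>[A-Z])         # site number and hole letter
    -(?P<core>\d+)(?P<core_type>[A-Z])   # core number and core type letter
    -(?P<section>\d+)                    # section number
    ,\s*(?P<top>\d+(?:\.\d+)?)           # top of the sampled interval
    -(?P<bottom>\d+(?:\.\d+)?)           # bottom of the sampled interval
    \s*cm\s*$
    """,
    re.VERBOSE,
)


def parse_drill_sample(name):
    """Split an 'O' type sample name into its structured components."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a Site-Hole-Core-Section-Interval name: {name!r}")
    return DrillSample(
        site=int(match["site"]),
        hole=match["hole"],
        core=int(match["core"]),
        core_type=match["core_type"],
        section=int(match["section"]),
        top_cm=float(match["top"]),
        bottom_cm=float(match["bottom"]),
    )


sample = parse_drill_sample("751A-6H-6, 98-100 cm")
print(sample)
```

Land section ('L') sample names are, as noted above, more variably defined, so a single pattern of this kind would not be expected to work for them.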

Over 500 files have been created in SOD format, either typed or edited by users as described

above, or generated by the Raritas program during counting of microfossils. Raritas generates

only data for one sample at a time, but otherwise the output is identical to that used for complete

sample by taxa matrices in other SOD files. SOD formatted files are not intended to replace

more complex, formally controlled, computer-to-computer data exchange formats, defined in xml

or other systems. SOD is best viewed as complementary, providing a user accessible format that

encourages the capture of the metadata needed to adequately document stratigraphic occurrence

data, which until now has often not been done. It should also be noted that the SOD format is

much more flexible and can accommodate many more types of data than the current versions of

Raritas programs themselves, which are 'hard wired' to work e.g. with Taxa and Sample Names.

Future versions of these programs ideally should be modified to read the fields needed for the

metadata window, and output data file formats, directly from a SOD definition file.

Diversity vs number of specimens

The program outputs, in addition to the main count data, the cumulative diversity vs number of

counted objects history as a simple tab-text data file. This data can be useful for fitting

rarefaction curves in subsequent data analyses.
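The history file described here is simply the running number of distinct categories seen after each counted object. A minimal sketch of how such a history could be derived and written as tab-delimited text is given below. The function names and the column headings "specimens" and "species" are assumptions for illustration and do not describe the actual Raritas output layout.

```python
import csv
import io


def accumulation_history(counted_names):
    """Yield (number of specimens, number of species) after each counted object."""
    seen = set()
    for n_specimens, name in enumerate(counted_names, start=1):
        seen.add(name)
        yield n_specimens, len(seen)


def write_history(counted_names, handle):
    """Write the accumulation history as tab-delimited text with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["specimens", "species"])  # assumed column headings
    writer.writerows(accumulation_history(counted_names))


# A short, invented counting sequence using names that appear in this paper.
sequence = [
    "Antarctissa strelkovi",
    "Antarctissa denticulata",
    "Antarctissa strelkovi",
    "Prunopyle",
    "Antarctissa strelkovi",
]

buffer = io.StringIO()
write_history(sequence, buffer)
print(buffer.getvalue())
```

A table in this form can be read directly by most statistics packages for curve fitting.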

Results

The degree to which biodiversity assessments can be improved using our software depends on a

variety of factors - the distribution of taxon abundances (evenness) and absolute diversity of the

target population(s) being counted; and the ability of the user to mentally mask out taxa and focus

only on those not excluded. Most people can easily keep a 'skip' list of several taxa in mind when

counting, but not a much larger list, e.g. a dozen or more taxa. Thus the improvement in counting

with Raritas tends to be best when the abundances are significantly uneven and the total diversity

is less than a few hundred categories. In the example shown in Figures 1 and 7 of this paper,

from Antarctic Pleistocene radiolarian assemblages, by eliminating the 6 most common species

(cumulative abundance of >74% of the specimens in the sample) nearly 3/4 of the specimens can

be skipped, allowing an effective sampling of the rarer taxa that is 4X what would have been

possible by counting all specimens. In practice we have found that we more typically increase

our effective sample size by 2-3X by using rare count mode. These increased effective sample

sizes significantly improve the accuracy of diversity estimates, although the precise amount will

depend on total sample size, evenness and absolute diversity (Colwell et al., 2012).
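The relation between the skipped fraction and the gain in effective sample size can be made explicit. If a fraction f of all specimens belongs to skipped taxa, then each specimen actually tallied in rare count mode stands for 1/(1 - f) specimens examined, so skipping about 74% gives a factor of roughly 4. The sketch below assumes that the same number of specimens is tallied in both modes and that skipping a specimen costs no effort; the function name is hypothetical.

```python
def effective_sample_gain(skipped_fraction):
    """Return the multiplier 1 / (1 - f) for a skipped fraction f of specimens.

    Assumes equal tallying effort in both modes and no cost for skipping.
    """
    if not 0.0 <= skipped_fraction < 1.0:
        raise ValueError("skipped_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - skipped_fraction)


# Skipped fractions chosen to bracket the 2-3X and 4X gains quoted in the text.
for fraction in (0.50, 0.67, 0.74):
    gain = effective_sample_gain(fraction)
    print(f"skip {fraction:.0%} of specimens -> about {gain:.1f}x effective sample")
```

Under these assumptions, the more typical gains of 2-3X correspond to skipping between one half and two thirds of the specimens.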

Discussion and Conclusions

The programs described here provide useful tools for counting populations with large numbers of

categories and unequal abundances of individuals in categories. They are, as programmed, best

suited to micropaleontology studies, but with only minor modification can be adapted to many

other uses in biodiversity research and other fields. The SOD definition provides a flexible,

internally documented yet easy to read file format for storing and exchanging occurrence data,

either for individual populations or matrices with multiple sets of observations. The Raritas

program described here has proved itself in actual use over several years in the junior author's

research group in Berlin. As noted above, it has been used to count >700,000 specimens

belonging to several hundred different species in >100 radiolarian microfossil assemblages, as

part of a study of biodiversity change in the Southern Ocean over the last 20 my (Renaudie &

Lazarus, 2013). It has been used by several individuals in other projects including students, on a

variety of computers.

Acknowledgements

The authors wish to thank the numerous individuals for the open-source software tools used to

create the Raritas programs.

References

Stevenson RJ, Pan Y, van Dam H. 2010. Assessing environmental conditions in rivers and

streams with diatoms, p. 57–85. In Smol JP, Stoermer EF (ed), The Diatoms: Applications for

the Environmental and Earth Sciences, Cambridge University Press, Cambridge.

Alroy J, Marshall CR, Bambach RK, Bezusko K, Foote M, Fürsich FT, Hansen TA, Holland SM,

Ivany LC, Jablonski D, Jacobs DK, Jones DC, Kosnik MA, Lidgard S, Low S, Miller AI,

Novack-Gottshall PM, Olszewski TD, Patzkowsky ME, Raup DM, Roy K, Sepkoski JJ, Jr.,

Sommers MG, Wagner PJ, Webber A. 2001. Effects of sampling standardization on estimates

of Phanerozoic marine diversification. Proceedings of the National Academy of Sciences

(USA) 98:6261–6266.

Alroy J. 2010. Fair sampling of taxonomic richness and unbiased estimation of origination and

extinction rates, p. 55–80. In Alroy, J, Hunt G (ed), Quantitative Methods in Paleobiology,

The Paleontological Society,

Berendsohn W. 2007. Access to Biological Collection Data. ABCD Schema 2.06 - ratified

TDWG standard. TDWG Task Group on Access to Biological Collection Data, BGBM,

Berlin http://www.bgbm.org/TDWG/CODATA/Schema/default.htm.

Brown JH, Gupta VK, Li BL, Milne BT, Restropo C, West GB. 2002. The fractal nature of

nature: power laws, ecological complexity and biodiversity. Phil Trans R Soc 357:619–626.

Bugware. 2016. Bugwin. http://www.bugware.com

Buonassissi CJ, Dierssen HM. 2010. A regional comparison of particle size distributions and the

power law approximation in oceanic and estuarine surface waters. Journal of Geophysical

Research 115:C10028 (1–12).

CLIMAP members. 1976. The surface of the ice-age earth. Science 191:1131–1137.

Colwell RK, Chao A, Gotelli NJ, Lin S-Y, Mao CX, Chazdon RL, Longino JT. 2012. Models and

estimators linking individual-based and sample-based rarefaction, extrapolation and

comparison of assemblages. J Plant Ecol 5:3–21.

Dunn R. 2014. wxPython, version. 3.0. wxpython.org.

Gannon JE. 1971. Two counting cells for the enumeration of zooplankton micro-crustacea. Trans

Am Micros Soc 90:486–490.

Hinds WC. 1999. Aerosol Technology: Properties, Behavior, and Measurement of Airborne

Particles, 2nd Edition, Wiley, Hoboken, NJ.

Hunter JD. 2007. matplotlib: A 2D graphics environment. Computing in Science and Engineering

9:90–95.

Imbrie J, Kipp NG. 1971. A new micropaleontological method for quantitative paleoclimatology:

application to a late Pleistocene Caribbean core, p. 71–181. In Turekian KK (ed), Late

Cenozoic Glacial Ages, Yale University Press, New Haven.

Kim CC, DeRisi JL. 2010. VersaCount: customizable manual tally software for cell counting.

Source Code Biol Med 5:web.

Lazarus DB. 1994. The Neptune Project - a marine micropaleontology database. Math Geol

26:817–832.

Mathis JS, Rumpl W, Nordsieck KH. 1977. The size distribution of interstellar grains. The

Astrophysical Journal 217:425–433.

McDonald D, Clemente JC, Kuczynski J, Rideout JR, Stombaugh J, Wendel D, Wilke A, Huse S,

Hufnagle J, Meyer F, Knight R, Caporaso JG. 2012. The Biological Observation Matrix

(BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience 1:1–

6.

Mcgann M, Mcgann LB, Bonomassa O, Devries P, Luther J, Malmberg S, Nelson G, Pratt SIII.

2006. Foramsampler v. 3.0 - microfossil sample data management software. Anuário do

Instituto de Geociências 29:278–279.

Mora C, Tittensor DP, Adl SM, Simpson AGB, Worm B. 2011. How many species are there on

earth and in the ocean? PLoS Biology 9:1–8 (web).

Nalepka D, Walanus A. 2003. Data processing in pollen analysis. Acta Paleobot 43:125–134.

Preston FW. 1948. The commonness, and rarity, of species. Ecology 29:254–283.

Reed WJ, Hughes BD. 2002. From gene families and genera to incomes and internet file sizes:

why power-laws are so common in nature. Physical Review E 66:67103–67106.

Renaudie J, Lazarus D. 2013. On the accuracy of paleodiversity reconstructions: a case study in

Antarctic Neogene radiolarians. Paleobiology 39:491–509.

Roberts TE, Bridge TC, Caley MJ, Baird AH. 2016. The Point Count Transect Method for

Estimates of Biodiversity on Coral Reefs: Improving the Sampling of Rare Species. PLoS

One 11:e0152335.

Spencer-Cervato C. 1999. The Cenozoic deep sea microfossil record: explorations of the

DSDP/ODP sample set using the Neptune database. Palaeontologica Electronica 2:web.

Stratadata. 2014. Stratabugs biostratigraphic data management software.

http://www.stratadata.co.uk

van Rossum G, Drake J. 2010. Python Language Reference, version 2.7.

Walker W, Lamere P, Kwok P, Raj B, Singh R, Gouvea E. 2004. Sphinx-4: a Flexible Open

Source Framework for Speech Recognition, Sun Microsystems, Mountain View, CA.

Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, Robertson T, Vieglais D.

2012. Darwin Core: An evolving community-developed biodiversity data standard. PlosOne

7:1–8.

Zachariasse WJ, Riedel WR, Sanfilippo A, Schmidt RR, Brolsma MJ, Schrader HJ, Gersonde R,

Drooger MM, Broekman JA. 1978. Micropaleontological counting methods and techniques-

an exercise on an eight metres section of the lower Pliocene of Capo Rossello, Sicily. Utrecht

Micropaleontological Bulletins 17:79-176.

Zippi P. 2007. Counter 4.5. PAZ Software. http://www.pazsoftware.com.

Supporting Information Appendices

S1 Appendix 1 - User Guides

S2 Appendix 2 - Sample Files

S3 Appendix 3 - SOD Definition

Figure 1

Assemblages with common and rare taxa

Microfossil assemblage as seen in the microscope (late Pleistocene, Southern Ocean, ODP

Site 751). Specimens marked by black arrows all belong to Antarctissa strelkovi or A.

denticulata. Other radiolarian species are marked by white arrows. Unmarked individuals are

not targets for counting - broken radiolarians and diatom valves. Most individuals in this

target assemblage belong to just a few species (particularly A. strelkovi and A. denticulata),

making discovery of rarer taxa difficult.

Figure 2

Ranked relative abundances of fossil radiolarian species in single samples and

combined multisample datasets.

Counts of species, sorted by abundance, of Neogene Southern Ocean radiolarian

assemblages, showing total dataset (several dozen samples) and a single sample (Deep-sea

drilling sample ODP 751A-6H-6, 98-100 cm). Despite a total count of 7071 specimens within

the single sample, the majority of the species are represented by 6 or fewer individuals. From

data in (Renaudie & Lazarus, 2013) SOM.

Figure 2 plot: Ranked Species Abundances, Antarctic Neogene Radiolaria. Axes: Species Rank (0 to 500) and N Specimens (1 to 10^5). Series: full dataset, N = 714,853; single sample, N = 7,071.

Figure 3

Cumulative diversity vs sample size curve and estimated true diversity for a single

sample.

Species-accumulation curve on a typical sample (sample ODP 751A-6H-6, 98-100 cm shown

in Fig 1). Bold black curve is the species accumulation curve; light grey curve is a de

Caprariis type curve-fit; dashed light grey line its asymptote (i.e. species diversity at infinite

sample size). From (Renaudie & Lazarus, 2013).

Figure 3 plot. Axes: Number of Specimens (0 to 9000) and Number of Species (0 to 250).

Figure 4

Configuration file to populate interface with category names.

Configuration file format (a plain text file, here formatted for easier reading). Only a few

fields - the 'Genus' and 'Species' components of a taxonomic name, and button (yes/no) - are

mandatory. A couple of fields, e.g. 'Recognition Name', are used only by RaritasVox.

Figure 5

Dialog to enter general sample metadata.

Metadata window used for Raritas. Information about the sample to be counted is entered

here, including observer, date, class of objects being counted ('Fossil Group'), and sample

identification information. RaritasVox has additional options (not shown), e.g. 'Save list of

counted species with diversity' which, if checked, creates a second output file that gives the

entire history of counting.

Figure 6

Main counting window with buttons, hierarchical category menus and count status

information.

Main counting window. Objects to be counted are presented in two forms: an array of clickable buttons in

the upper part of the window, and as a set of pop-up lists in the lower left and center part of the window.

The number of lists and their contents is automatically built from the configuration file higher category

labels for object entries. Button labels are also taken from this file on start-up. Other buttons or menu items

control program behavior and call up other features e.g. voice recognition (RaritasVox only), show count

plot, switch to Rare Count mode etc. A scrolling list of the most recently counted objects is shown in the

lower middle. The 'Track' counter and clickable (large rectangular) button are on the lower right and are

used to record observation effort in both regular and rare count modes. Note, in this image rare count mode

has already been activated; thus some buttons are greyed out.

Figure 7

Dialog to configure rare count mode.

Configure rare count mode dialog. The object counts list, sorted by count frequencies, is

presented and the user selects those objects (here, species names) that will be skipped and

no longer counted in rare count mode.

Figure 8

Relationships between sample size and uncertainty of abundance estimates in

generalized and actual biodiversity data.

Panel A (left) - Epsilon (size of confidence interval, relative to the abundance value, for a

given species relative abundance in a population) plotted on a p (percent) vs N (number of

specimens) landscape. Rule of thumb used in [12] marked by dashed lines (Renaudie &

Lazarus, 2013) highlighted. Panel B (right) - Shows, for data reported in (Renaudie & Lazarus,

2013), red line: the percent of samples that have at least one species with percent higher

than p; blue line: the percent of species having a proportion higher than p in at least one

sample, and black line with shading: the cumulative proportion of specimens of species with

proportion higher than p (mean, interquartile range and total range over all 107 samples).

Figure 8 plot. Panel A axes: N (100 to 5000) and p (0.1 to 50), with contours for ε = 5%, 10%, 25%, 50%, 100% and 200%. Panel B: Proportion above p (20 to 100%) for Samples, Species and Specimens. Annotated values: 2.9 and 59.7.

Figure 9

Collecting curve, showing history of cumulative diversity vs sample size.

Count plot window, showing a simple graphic of how total diversity of objects ('species') is

increasing with increased numbers of counted objects ('specimens'). The window appears

whenever the user clicks the 'show count plot' button in the main counting window. This

graphic is calculated and plotted anew with each invocation. The shape of the curve provides

important feedback for the user, see text for details.

Figure 9 plot: Collector's curve. Axes: Number of Specimens (0 to 400) and Number of Species (0 to 120).

Figure 10

RaritasVox defined vocabulary and pronunciation configuration file.

Configuration file for voice recognition using RaritasVox (extract only). Spoken words are on

the left and the phoneme pronunciations on the right.

APROX AH P R OW K S
STOCKI S T OW K IY
AMPRADIOSA AE M P R AE D IY OW S AH
ARTANNULATUS AA R T AE N AH L EY T AH Z
AXIRREGULAR AE K S IH R EH G Y AH L ER
BGRAN B IY G R AE N
CRYPTBUSS K R IH P T B AH S
GONDWANA G AO N D W AE N AH
LOPHOHADRA L OW F AH HH AE D R AH
MITA M IY T AH
PODPAPILIS P AA D P AE P IH L IH S
PSEUDODICT S UW D OW D IH K T
ZYGO S IY G ER
SPYRO S P IY R ER
CORNUTELLA K AO R N Y UH T EH L AH
CALOCYCLAS K AE L OW S AY K L AH Z
BUNNYEARS B AH N IY IH R Z
ZIGZAG Z IH G Z AE G
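As an illustration of the layout shown in this extract, the following Python sketch reads such lines into a dictionary. It is a sketch only, not part of RaritasVox: the function name `parse_vocabulary` is hypothetical, and it assumes each non-empty line holds one spoken word followed by whitespace-separated phoneme symbols (which appear to be ARPAbet-style codes).

```python
from typing import Dict, List


def parse_vocabulary(text: str) -> Dict[str, List[str]]:
    """Map each spoken word to its list of phoneme symbols."""
    vocab: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words without a pronunciation
        word, phonemes = parts[0], parts[1:]
        vocab[word] = phonemes
    return vocab


# Three lines taken from the extract above.
sample = """\
STOCKI S T OW K IY
MITA M IY T AH
ZIGZAG Z IH G Z AE G
"""

vocab = parse_vocabulary(sample)
print(vocab["MITA"])
```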


Figure 11

Main counting window for RaritasVox.

Screenshot of RaritasVox in voice-counting mode. A list of acceptable words is shown in the top window; the currently recognized word appears in large letters in the middle of the screen (to make it easy to see at a glance, e.g. when working at a microscope); button controls are below this, and summary panes of count activity are at the bottom.


Figure 12 (on next page)

Example of SOD file format with data blocks framed.

Example of SOD file output (the main data output file produced by Raritas), with the 4 main areas (blocks) marked by bold lines. Metadata about the data file is stored in the upper left block; object labels and linked data such as author names, if known, are in the lower left block; sample information is in the upper right block; and the actual counting data are in the lower right block. Output from the Raritas program contains only a single column of data, but the SOD format definition permits the sample name and count values to repeat indefinitely (to the right of this figure). Note that only a few selected rows are shown here; the full file has ca. 400 taxon names.


SOD v.: 2.1 File Type: O Fossil Group: radiolaria
Source ID: Source Name: Source Citation: Site: 751
Entered By: dbl Entry Date: 18-01-2017 Checked By: Check Date: Hole: A
Leg Info: 120 Leg Qualifier: Core: 12H
File Creation Method: Raritas Section: 2
Occurrences Data Type: C Keys: Interval top: 12-14
Comments: 1 tracks observed Depth(mbsf):
Abundance: A
Preservation: G
Genus: GQ: Species: SQ: Subspecies: Author: Taxon Code: Higher Taxon: Taxon Comments:
Acrosphaera murrayana Collo/Entact/Phaeo 3
Acrosphaera spinosa Collo/Entact/Phaeo 1
Actinomma golownini Spumellaria 13
Anomalocantha dentata Spumellaria 1
Antarctissa strelkovi Nassellaria 18
Antarctissa ballista Nassellaria 8
Botryostrobus auritus/australis Nassellaria 1
Cycladophora humerus Nassellaria 8
Cycladophora golli regipileus Nassellaria 2
Cycladophora golli golli Nassellaria 1
Dendrospyris ? sakaii Nassellaria 3
Dendrospyris rhodospyroides Nassellaria 2
Dictyophimus ? planctonis Nassellaria 14
Druppatractus irregularis Spumellaria 8


Table 1 (on next page)

Recognition accuracy in a simple test run of RaritasVox.

Accuracy of spoken entry using RaritasVox for a short list of species name abbreviations. Each name was spoken in random order 6 times. Note the independence of the spoken and data names, e.g. zigzag for L. robusta. The spoken and formal names are linked in the Vox configuration file.


Genus               GQ  Species        SQ     spoken name   VOX count  Errors
Amphicraspedum          prolixum       gr.    aprox         5          1
Amphipyndax             stocki                stocki        6          0
Amphisphaera            radiosa               ampradiosa    6          0
Artostrobus             annulatus             artannulatus  6          0
Axoprunum               irregularis           axirregular   5          1
Buryella                granulata             bgran         6          0
Calocyclas              spp.                  calocyclas    7          1
Cornutella              sp.                   cornutella    6          0
Cryptocarpium           bussonii       gr.    cryptbuss     8          2
Gondwanaria         ?   sp.                   gondwana      6          0
Lithomelissa            robusta               zigzag        6          0
Lophocyrtis             hadra                 lophohadra    5          1
Acrosphaera             cuniculiauris         bunnyears     6          0
Mita                ?   sp.                   mita          6          0
Podocyrtis              papilis               podpapilis    5          1
Pseudodictyophimus      gracilipes            pseudodict    6          0
Spyrocyrtis             A              n.sp.  spyro         6          0
Zygocircus              buetschli             zygo          7          1
                                                            101        1
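The independence of spoken and formal names noted in the Table 1 caption can be sketched as a simple lookup followed by a tally. The code below is an illustrative assumption, not the RaritasVox implementation: the dictionary `SPOKEN_TO_TAXON` and the function `tally` are hypothetical names, the three spoken-to-formal pairs are taken from Table 1, and the input word list is invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Spoken name -> formal name, using three pairs from Table 1.
SPOKEN_TO_TAXON = {
    "zigzag": "Lithomelissa robusta",
    "bunnyears": "Acrosphaera cuniculiauris",
    "stocki": "Amphipyndax stocki",
}


def tally(recognized_words: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count recognized words under their formal names; collect unknown words."""
    counts: Counter = Counter()
    unknown: List[str] = []
    for word in recognized_words:
        taxon = SPOKEN_TO_TAXON.get(word.lower())
        if taxon is None:
            unknown.append(word)  # word is not in the configured vocabulary
        else:
            counts[taxon] += 1
    return counts, unknown


# Hypothetical stream of recognized words.
counts, unknown = tally(["zigzag", "stocki", "ZIGZAG", "bunnyears", "foo"])
print(counts)
print(unknown)
```

Keeping the spoken word separate from the stored name lets the user pick short, acoustically distinct words without changing the taxon names written to the output data.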

