
An Alternative Information Plan

Matteo Monti, Steen Rasmussen, Marco Moschettini, Lorenzo Posani

SFI WORKING PAPER: 2017-07-021

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

www.santafe.edu

SANTA FE INSTITUTE


An alternative information plan

Matteo Monti a,b,c1,∗, Steen Rasmussen a,d,∗, Marco Moschettini c2 and Lorenzo Posani e

a Center for Fundamental Living Technology (FLinT), University of Southern Denmark (SDU), Denmark;
b École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;
c1 Complex Systems Group, University of Bologna, Bologna, Italy;
c2 Department of Computer Science and Engineering, University of Bologna, Bologna, Italy;
d Santa Fe Institute (SFI), New Mexico, USA;
e Laboratoire de Physique Statistique, École Normale Supérieure, Paris, France

∗Corresponding authors. Emails: [email protected], [email protected]

June 27, 2017

Abstract

We present and evaluate an alternative architecture for data storage in distributed networks that ensures privacy and security, which we call RAIN1. The RAIN network architecture offers a distributed file storage service that: (1) has privacy by design, (2) is open source, (3) is more secure, (4) is scalable, (5) is more sustainable, (6) has community ownership, (7) is inexpensive, and (8) is potentially faster, more efficient and more reliable. RAIN has the potential to democratize and disrupt cloud storage by eliminating the middleman: the large centralized data centers. Further, we propose that a RAIN-style privacy- and security-by-design architecture could form the backbone of multiple current and future infrastructures, ranging from online services and cryptocurrency to parts of government administration.

Keywords: distributed storage; privacy by design; internet of things; community ownership; sustainability; bio-inspired design

1 RAIN is a metaphor for what comes after the clouds.


Part A

Introduction

A.1 Background

The Internet data storage services provided today violate privacy, are expensive, and come at a high environmental cost. More than 3% of the world's power consumption is currently due to data centers, their CO2 footprint surpassed that of global air traffic in 2013 [1], and their power consumption is growing rapidly [2]. Our preliminary estimates suggest that a distributed infrastructure made of low-energy devices, finely distributed in close proximity to the clients they service, would have significantly lower power requirements than a centralized paradigm, which requires active cooling and increases the load on long-range Internet routing infrastructure.

Further, the high entrance cost to the market creates monopolies where only the largest companies are capable of offering scalable, cost-efficient services. Owned by the community of its users, e.g. citizens, businesses and organizations, this network service will be free to join, democratic, and designed to guarantee the privacy and security of the data it stores.

Since it is now possible to have cheap, energy-efficient, fast, reliable, and always-online computing nodes in our homes and businesses, our architecture relies on privately owned computing devices (e.g. Raspberry Pis with flash drives). Our network design leverages the collective storage power of these devices: every node will store parts of other nodes' data to guarantee redundancy and reliability, and a carefully designed cryptographic architecture will prevent unwanted access to the stored data.

A.2 State of the art

A wide variety of technologies has been developed along the way from the mainframe-terminal paradigm to the Internet of Things and decentralized networks. In this paper we will underline, when needed, which technologies still need to be perfected to achieve our goal: developing a headless, peer-to-peer infrastructure offering a level of reliability, security and performance comparable to that of a centralized paradigm to an arbitrarily large user base. However, most of this work will arguably consist in using technologies that are already in place, under the stronger reliability assumptions offered by the possibility of having permanently online, low-energy devices in our homes and businesses.

A.2.1 Technologies

Hardware. Internet of Things technology is arguably at the core of RAIN's architecture, as we plan to develop our entire infrastructure on home-based single-board computers such as the Raspberry Pi. While no significant difference would exist in terms of security and reliability between a network made of, e.g., personal desktops and one made of single-board computers, the latter make the deployment of RAIN's network feasible, as their cost and energy requirements are small enough to make a RAIN node part of a normal network setup.

Decentralization and distribution. Distributed computing significantly predates peer-to-peer technology: Grid Computing (see, e.g., [3]) quickly became an effective replacement for supercomputers as soon as the computing power of many slower nodes exceeded that of a mainframe. When an increasing number of users gained access to high-speed Internet connections, Grid Computing projects (like BOINC [4]) were developed to harness not only the computing power of arrays of nodes in large computing facilities, but also the spare resources of volunteers' personal computers.

A wide variety of Internet services is now offered by Cloud infrastructure, which builds on top of the Grid an abstraction layer that autonomously maps services to the hardware that offers them. However, it relies on a similar semi-centralized infrastructure, where multiple servers in one or more facilities collectively offer a service by load-balancing requests.

In order to process and make accessible the increasing amount of communication produced by a growing population of Internet of Things devices, Edge computing technology (see, e.g., [5] [6]) decentralizes the paradigm further, using finely distributed cloud nodes that are geographically close to the source of data to perform pre-processing and improve distribution performance.

Redundancy & retrievability. The problem of reliably storing and retrieving information on faulty devices is addressed in centralized as well as decentralized contexts by error-correcting codes (such as Reed-Solomon encoding [7]) and RAID (Redundant Array of Independent Disks) technology (see, e.g., [8]), and in peer-to-peer contexts by Distributed Hash Tables, as in [9] (see, e.g., [10]).

Security. The issue of storing data on untrusted nodes while guaranteeing confidentiality and integrity has been addressed in the past by means of asymmetric ciphers and Authenticated Data Structures [11] [12]. Specifically in the context of headless networks, the issue of removing trusted third-party authorities is currently addressed by blockchain technology (see, e.g., [13]): we later argue that Authenticated Distributed Hash Tables could in principle be used to substitute that technology in a context where all nodes have a relatively high uptime.

P2P security. More specifically, in the field of peer-to-peer networks, where no party is trusted a priori, we will use Proofs of Space [14] [15] as a means to perform hardware commitment and limit Sybil attacks, and distributed, verifiable random number generation to guarantee the security properties of the global organization of our network.

A.2.2 Related projects

Cubbit. A startup project founded by two of the authors (Moschettini and Posani) that implements the redundancy strategy and recovery procedures described in this paper to offer a distributed file storage service using single-board computers and home-grade Internet connections. Metadata is stored in a traditional client-server architecture.

Storj. A cooperative storage cloud based on Bitcoin blockchain technology, where cryptocurrency is exchanged for storage.

Lima. A startup project offering a specialized single-board device to interface personal storage devices with an Internet connection. No redundancy across multiple devices is implemented.

IPFS. An open-source distributed file system protocol designed to persistently store and make objects available on the Internet. There is no central point of failure, and nodes do not need to trust each other.

Telhoc. A company specialized in securing and optimizing peer-to-peer database storage.

A.3 Potential applications

The RAIN architecture is a bio-inspired network with redundancies, distributed control, high energy efficiency, error correction, self-repair, and obvious potential for autonomous adaptation (learning) in later versions. Together with these high-level properties, the architecture's privacy by design, open source, community ownership, scalability, and performance make it a candidate architecture for supporting a number of additional services besides cloud storage. These include secure communication (text, voice, video), content delivery, search engines, finance (cryptocurrency), digital manufacturing, transportation (autonomous vehicles), as well as parts of secure e-governance (state administration). In short, we can envision a RAIN-style architecture with privacy and security by design at the center of our emerging infrastructure of infrastructures.

A.4 Structure of this work

This work is divided into three parts.

The first part, Architecture Feasibility, demonstrates that a distributed infrastructure with a redundancy strategy based on Reed-Solomon encoding easily yields a data lifetime of 10^16 years (∼10^6 times the age of the Earth), with a storage ratio (how much we need to write across the network vs. the size of the data uploaded) of 1.5, using a local village of 36 physical nodes where the primary data is stored. Assumptions, proofs and estimates are detailed in this section. This part provides a proof of principle for the proposed RAIN architecture.

The second part, Architecture Challenges, defines the scientific and technical challenges of designing data privacy and security, network organization, as well as data, metadata and credential handling. As this cryptographic part of the architecture is both technical and lengthy, these details will be published elsewhere. In this part we also briefly touch on the environmental impact of our distributed infrastructure and that of a traditional centralized client-server paradigm. This part outlines the next steps needed in developing the RAIN architecture.

In the Potential Impact part, we provide an overview of the potential impact a large distributed digital infrastructure like RAIN could have. RAIN could support the development of communitarian services including telecommunication, content delivery, cryptocurrency, and distributed administration (nation-state and regional government), which are currently services managed in a centralized manner through trusted third parties.


Part B

Architecture Feasibility

Our network will be free to join by anyone with a headless single-board computer (e.g. Raspberry Pi), a storage device and a home-grade Internet connection.

Nomenclature

Network: the set of interconnected single-board computers and servers operating our service.

Client: a device (e.g., personal computer, smartphone) that a user uses to access our service.

Node: a headless single-board computer, attached to a storage device (e.g., HDD, SSD), owned by a user and persistently connected to his/her home Internet connection.

Drive: an online environment where files can be stored and shared by one or more users.

Our Free and Open Source software will be available to the end user in the form of downloadable clients. Once installed on a new user's personal computer, our client will download a platform-specific disk image including a lightweight operating system and our pre-configured software, and use it to initialize the user's single-board node. This initialization procedure also serves to bind the node to its user's account.

Once the node has been initialized, the user is asked to power it and persistently connect it to a home-grade Internet connection. Whenever a storage device (e.g., a hard disk or a flash drive) is connected to the node, half of its capacity is allocated to its owner's account.

A user can associate one or more clients with his/her account, and use them to access the service via a Graphical User Interface that allows file tree navigation. Users can create drives, namely storage environments that can be shared among one or more users. Each drive has a distinct file tree and an owner, who can manage other users' privileges or revoke their access.

If a user's node experiences temporary downtime or permanent failure, the availability of its owner's files is unaffected. Any user's files can be accessed from any Internet connection. Whenever a client is not available, access is guaranteed by a Web-based interface.

B.1 Node properties

Our network is based on unreliable nodes subject to downtime (temporary unreachability due to connectivity issues) and failure (permanent loss of data due to hardware breakage or wear, or to human interaction). In order for it to reliably store data, a redundancy strategy must be implemented.

Our redundancy model depends on threenode properties:

Lifetime ([L] = s): the amount of time between the moment a node enters the network and its first unrecoverable failure.

Downtime (d ∈ [0, 1]): the fraction of time a node is unreachable due to temporary network- or power-related issues.

Upload and download speed ([S_u] = [S_d] = B/s): the average amount of data that can be transferred from and to a node per time unit.

Disclaimer (A). As an example scenario, we model our expected network behavior for a set of nodes in Italy, in the urban area of Bologna.

Disclaimer (B). Wherever little or no data is available, we will try to provide the most relaxed yet still conservative estimate. This will often result in overly conservative estimates, which we will relax only when more data becomes available.

B.1.1 Lifetime

Hazard function

Let R(t) be the survival rate of a node, i.e., the probability that a node will not experience an unrecoverable failure before t.

The hazard function h(t) is defined as the limit fraction of nodes experiencing failure per time unit:

h(t) = −(1/R(t)) · dR(t)/dt

We can express R as a function of h by

R(t) = exp(−∫_0^t h(t′) dt′)

and note how, if h(t) = const, R is an exponential function and failure is a memoryless process. From R(t) we get the probability density of lifetimes l(t) by

l(t) = −dR(t)/dt

and the expected lifetime L by

L = ∫_0^∞ t l(t) dt

For h(t) = h = const we have L = 1/h.
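The constant-hazard case can be checked numerically; the short sketch below is ours (the hazard value h is arbitrary) and simply integrates t · l(t) to recover L = 1/h:

```python
import math

# Check that a constant hazard h gives the exponential survival
# R(t) = exp(-h t) and an expected lifetime L = 1/h.
h = 0.5      # constant hazard rate (failures per year), arbitrary choice
dt = 1e-3    # integration step (years)
T = 100.0    # horizon >> 1/h, so the neglected tail is negligible

L = 0.0
t = 0.0
while t < T:
    l = h * math.exp(-h * t)   # lifetime density l(t) = -dR/dt
    L += t * l * dt            # accumulate the integral of t * l(t)
    t += dt

print(round(L, 2))  # 2.0 = 1/h
```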

Exponential approximation

In our case, failure can be caused by technical malfunction or by user interaction. We can therefore expand

h(t) = h_tech(t) + h_user(t)

Technical malfunction. The failure rate of devices subject to wear (storage devices among them) is commonly described by the bathtub curve:

• Manufacturing imperfections generally result in early failures (infant mortality), producing a higher failure rate in the early stages of the device's usage.

• The failure rate is then lower for a certain period of time, during which in general only failures of random nature occur.

• The failure rate increases again in the late stages of the device's usage due to wear.

Storage devices connected to our network, however, will not necessarily be new. The age distribution a(t) of random, alive, not newly bought devices is

a(t) = R(t) / ∫_0^∞ R(t′) dt′

Therefore, let b(t) be a bathtub hazard function and p_new the probability that a storage device is newly purchased when the node is initialized; h_tech then becomes

h_tech(t) = p_new b(t) + (1 − p_new) ∫_0^∞ a(t′) b(t + t′) dt′    (1)

Figure 1 shows the effect of the above convolution on an experimental [16] bathtub function. As p_new decreases, the hazard function becomes smoother and closer to its asymptotic value.

Under the assumption p_new ≪ 1, h_tech can be safely approximated by a constant equal to the asymptotic value of b.
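Equation 1 can be explored numerically. The sketch below is ours: the bathtub function b and all its parameters are invented for illustration, not taken from [16].

```python
import math

# Illustrative bathtub hazard (failures/year): decaying infant
# mortality + constant random failures + quadratic wear-out.
def b(t):
    return 0.05 * math.exp(-3.0 * t) + 0.014 + 0.0005 * t**2

dt, T = 0.01, 40.0
ts = [i * dt for i in range(int(T / dt))]

# Survival R(t) = exp(-integral_0^t b) and the age distribution
# a(t) = R(t) / integral_0^inf R(t') dt' of not-newly-bought devices.
R, cum = [], 0.0
for t in ts:
    R.append(math.exp(-cum))
    cum += b(t) * dt
norm = sum(R) * dt
a = [r / norm for r in R]

def h_tech(t, p_new):
    # Equation 1: p_new b(t) + (1 - p_new) integral a(t') b(t + t') dt'
    conv = sum(a[i] * b(t + ts[i]) for i in range(len(ts))) * dt
    return p_new * b(t) + (1 - p_new) * conv

# p_new = 1 recovers the raw bathtub; p_new = 0 gives the smoothed
# hazard of devices of unknown age (cf. Figure 1).
print(h_tech(0.0, 1.0), h_tech(0.0, 0.0))
```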

User interaction. The rate of failures caused by user interaction will depend strongly on the usability, performance and reputation of our network, and is more difficult to estimate now.

[Figure 1: Hazard function h_tech as per Equation 1 for p_new ∈ {0.00, 0.33, 0.66, 1.00}. For p_new = 1, h_tech reduces to the bathtub function b. As p_new decreases, h_tech becomes smoother and higher.]

User fidelity is subject to infant mortality (like every other Internet service, our network will have a bounce rate of users who try our service but do not find it appealing to their needs), but not to wear. h_user will therefore be higher for young nodes, and reach a plateau after a short trial period.

When discussing our upcoming security challenges, we will see how we plan to store no critical data on newly connected nodes. This will contribute to reducing the bounce rate effect and crop away the higher, non-uniform part of h_user, which we will therefore assume constant.

Experimental data

Hard Disk Drives (HDD). A four-year study on HDD wear [16] (see Figure 1 for p_new = 1) provides experimental data in agreement with a bathtub curve model:

Regime              Failure rate
Infant mortality    5.1%
Random failures     1.4%
Wear                11.8%

Under the limit approximation h_HDD(t) = h_HDD = 0.126/year we get an expected lifetime

L_HDD ≈ 8 years

Solid State Drives (SSD). Due to their extended lifetime, and to the fact that their high cost delays adoption in the enterprise market, less experimental data is available concerning SSD lifetime.

Due to the absence of mechanical parts, SSD wear is not determined by time. Each flash memory block can undergo a limited number of rewrites before it becomes unusable. Load-balancing (wear-leveling) algorithms are implemented to evenly distribute the load across the sectors even if the same file is rewritten multiple times. To provide an estimate of SSD lifetimes, we then need an estimate of how much data will be written to the drive per time unit.

For a node in our network, SSD write speed is bounded by its download bandwidth, which we can preliminarily estimate (see later) at 6.5 Mbit/s ≈ 70 GB/day. It is important to remark that this is a large overestimate, as it assumes that every user will use 100% of their nominal connection bandwidth solely to flood our service with write requests.

An 18-month experimental study [17] has been carried out to determine how much data could be written to 6 different Solid State Drives. Data was continuously written to them at maximum throughput for the whole duration of the experiment. The first SSD to experience a failure broke down after 728 TB of data had been written to it (the last two failed around the 2.5 PB mark).

Together with the write overestimate above, this yields an SSD lifetime expectancy of

L_SSD ≫ 28.5 years
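The 28.5-year bound follows directly from the two figures above; a one-line check (our arithmetic, using the paper's numbers):

```python
# First SSD failure in [17] after 728 TB written; the assumed
# worst-case write rate is 70 GB/day (6.5 Mbit/s of continuous writes).
written_tb = 728
rate_gb_per_day = 70

years = written_tb * 1000 / rate_gb_per_day / 365.25
print(round(years, 1))  # ≈ 28.5
```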

Node lifetime. No data is available on the failure rate of common single-board computers like the Raspberry Pi; most informal failure reports refer to physical damage rather than wear. Due to this, and to the fact that single-board node failures are recoverable, we will assume the failure rate of nodes due to single-board computer failure to be negligible.

As we said, we cannot provide estimates for the user contribution to the failure rate. In line with our policy of providing conservative estimates, we will then set the average expected node lifetime to

L = 2 years

B.1.2 Speed and downtime

Download speed. Accurate, high-quality data is available [18] from global Content Delivery Networks, which can use their service as a means to study the download speed distribution of a broad sample of Internet users.

Upload speed. Data on upload speeds is more difficult to obtain, as common users use their connection mostly to download data (hence the bandwidth asymmetry in home-grade Internet connections). Data is explicitly gathered [19] by organizations that offer connection speed tests online. Their sampling, however, is biased, as connectivity issues often lead users to run speed tests, and only the more experienced users are aware of speed test services.

Downtime. Downtime data for home-grade Internet connections is not publicly available and is difficult to gather. Uptime (real time = uptime + downtime) monitoring would require a node to be permanently active in users' homes to monitor their connection for extended periods of time.

Experimental procedure

In order to get a reliable estimate of speed and downtime, we developed a script to monitor home-grade Internet connections for extended periods of time using a Raspberry Pi 3.

• The script only runs when the Raspberry Pi is connected to a router. Every log entry is paired with the router's physical address (MAC address) to permit analysis.

• Every two seconds, the script tries to ping an external node (Google Public DNS, address 8.8.8.8). Both success and failure are logged.

• Every ten minutes, the external IP address of the node is logged. We will later use IP consistency as a means to detect botnet attacks.

• Every thirty minutes, a speed test (using Ookla's Speedtest) is run, determining and logging round-trip time, download and upload speed with the aid of an external reliable server.

• Every time the script detects a new router (i.e., a new MAC address), the connection's provider and location (latitude and longitude) are determined and logged using the ip-api.com geolocation API, and the router's vendor is determined and logged using the macvendors.com MAC Vendor Lookup API.
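The core of such a monitor can be sketched as follows. This is our own illustration, not the script from the repository; the endpoint choices (Google DNS for pings, api.ipify.org for the external IP) and the Linux ping flags are assumptions.

```python
import json
import subprocess
import time
import urllib.request

PING_TARGET = "8.8.8.8"  # Google Public DNS, as in the study

def ping_once(target=PING_TARGET, timeout_s=2):
    """Return True if a single ICMP echo succeeds (Linux ping flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), target],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0
    except OSError:  # no usable ping binary available
        return False

def external_ip():
    """Fetch the external IP (logged every ten minutes in the study)."""
    with urllib.request.urlopen("https://api.ipify.org?format=json") as r:
        return json.load(r)["ip"]

def monitor(log, duration_s=48 * 3600):
    """Append (timestamp, event) pairs, pinging every two seconds."""
    start = time.time()
    while time.time() - start < duration_s:
        log.append((time.time(), "ping_ok" if ping_once() else "ping_fail"))
        time.sleep(2)
```

The speed-test and geolocation steps would hook into the same loop at their own intervals.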

Results. We monitored 7 home-grade Internet connections, each for up to 48 hours. Raw data (with MAC addresses obfuscated for privacy reasons) is available1.

Downtime data

1https://github.com/rainvg/supplementary-material/tree/master/The%20Alternative%20Information%20Plan%20(2017)/netmonitor


#   Ping FAIL   Ping OK   Downtime
α   78          91088     (8.5 ± 1) · 10^−4
β   0           2018      (0 ± ?)
γ   15          93013     (1.6 ± 0.4) · 10^−4
δ   68          3690      (1.8 ± 0.2) · 10^−2
ε   14          116101    (1.2 ± 0.3) · 10^−4
ζ   4           2288      (1.74 ± 0.04) · 10^−3
η   62          117561    (5.2 ± 0.7) · 10^−4

Speed data

#   Tests   Upload                    Download
α   108     (0.855 ± 0.005) Mbit/s    (12.3 ± 0.2) Mbit/s
β   3       (72.7 ± 0.5) Mbit/s       (83 ± 5) Mbit/s
γ   103     (9.04 ± 0.04) Mbit/s      (9.29 ± 0.03) Mbit/s
δ   5       (25.1 ± 0.2) Mbit/s       (63 ± 3) Mbit/s
ε   130     (19.2 ± 0.2) Mbit/s       (47.2 ± 0.3) Mbit/s
ζ   4       (0.82 ± 0.09) Mbit/s      (9 ± 2) Mbit/s
η   131     (0.843 ± 0.004) Mbit/s    (12.9 ± 0.2) Mbit/s

Our experiment provides the following estimates:

Average downtime          0.003
Average upload speed      17 Mbit/s
Average download speed    34 Mbit/s

In line with our policy of keeping all our estimates conservative and compatible with a worst-case scenario, for the rest of this work we will assume:

d = 0.005
S_u = 0.2 MB/s
S_d = 0.5 MB/s

B.2 Reed-Solomon erasure codes

An erasure code is a forward error correction code which transforms a block of N symbols into a message of K symbols (K > N) such that the original block can be recovered from a subset of the message. We use Reed-Solomon encoding [7], an optimal erasure code (i.e., the original block can be recovered from any N symbols of the message) that uses polynomial oversampling and redundancy over finite fields.

B.2.1 Polynomial interpolation

Matrix preliminaries

Let K be a field. We define an (N − 1)-degree polynomial on K by

p_{a_0,...,a_{N−1}}(x) = Σ_{i=0}^{N−1} a_i x^i

with a_0, . . . , a_{N−1}, x ∈ K.

Let x = (x_0, . . . , x_{N−1}) be a vector of N distinct values in K, which we call sampling points. We can define x's Vandermonde matrix by raising each sampling point successively to the powers 0, . . . , N − 1 along its row:

V(x) = | x_0^0      · · ·  x_0^{N−1}     |
       | ...        · · ·  ...           |
       | x_{N−1}^0  · · ·  x_{N−1}^{N−1} |

It is possible to prove that the determinant of a Vandermonde matrix is

det(V(x)) = ∏_{0≤i<j≤N−1} (x_i − x_j)

which is non-null whenever x_i ≠ x_j for all i ≠ j.

Now let δ_{i,j} denote the Kronecker delta (δ_{i,i} = 1, δ_{i,j≠i} = 0). For any j ∈ [0, N − 1] we can solve

| x_0^0      · · ·  x_0^{N−1}     |   | a_0^{(x,j)}     |   | δ_{0,j}   |
| ...        · · ·  ...           | · | ...             | = | ...       |    (2)
| x_{N−1}^0  · · ·  x_{N−1}^{N−1} |   | a_{N−1}^{(x,j)} |   | δ_{N−1,j} |

for a_0^{(x,j)}, . . . , a_{N−1}^{(x,j)}. Note that the right-hand side of Equation 2 constrains the polynomial with coefficients a_0^{(x,j)}, . . . , a_{N−1}^{(x,j)} to take value 1 in x_j and 0 in each x_{i≠j}. As j can take N distinct values, Equation 2 defines N linear systems of N equations.

We now define the Lagrange basis {l^{(x,j)}}_{j∈[0,N−1]} for x by

l^{(x,j)} = p_{a_0^{(x,j)}, ..., a_{N−1}^{(x,j)}}

and by Equation 2 we have, for each i, j ∈ [0, N − 1],

l^{(x,j)}(x_i) = δ_{i,j}.

Let y = (y_0, . . . , y_{N−1}) ∈ K^N define a particular set of desired polynomial values corresponding to the sampling points x. Since polynomials form a vector space, we have that

p^{(x,y)} = Σ_{j=0}^{N−1} y_j l^{(x,j)}

satisfies

p^{(x,y)}(x_i) = y_i

as the value of each l^{(x,j)} is 1 only in x_j, and 0 in all the other sampling points.

In other words, for any set of N distinct points x and N values y, it is possible to determine the coefficients of a polynomial whose value in each x_i is y_i.
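A pure-Python sketch of this construction, ours and over the rationals for exactness (Section B.2.3 instantiates the same construction over GF(2^8)): build the Lagrange basis for the sampling points and combine it with the desired values y.

```python
from fractions import Fraction

def mul_linear(p, c):
    """Multiply polynomial p (coefficients, low order first) by (X - c)."""
    q = [Fraction(0)] * (len(p) + 1)
    for i, coef in enumerate(p):
        q[i + 1] += coef       # coef * X^(i+1)
        q[i] -= c * coef       # -c * coef * X^i
    return q

def lagrange_coeffs(xs, ys):
    """Coefficients of the unique degree < N polynomial p with
    p(xs[j]) = ys[j]: p = sum_j ys[j] * l^(x,j) (Section B.2.1)."""
    n = len(xs)
    coeffs = [Fraction(0)] * n
    for j in range(n):
        num, denom = [Fraction(1)], Fraction(1)
        for m in range(n):
            if m != j:
                num = mul_linear(num, xs[m])   # numerator prod (X - x_m)
                denom *= xs[j] - xs[m]         # makes l^(x,j)(x_j) = 1
        w = ys[j] / denom
        for i in range(n):
            coeffs[i] += w * num[i]
    return coeffs

def eval_poly(p, x):
    r = Fraction(0)
    for coef in reversed(p):   # Horner's rule
        r = r * x + coef
    return r

a_true = [Fraction(3), Fraction(-2), Fraction(0), Fraction(1)]  # 3 - 2X + X^3
xs = [Fraction(k) for k in range(4)]       # 4 distinct sampling points
ys = [eval_poly(a_true, x) for x in xs]    # sampled values y_i = p(x_i)
print(lagrange_coeffs(xs, ys) == a_true)   # True
```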


Computational complexity

As we have seen, polynomial interpolation is a two-step procedure. For a given x, the coefficients {a_i^{(x,j)}}_{i,j} of the Lagrange basis on x can be determined as per Equation 2.

Indeed, Equation 2 can be rewritten as

| a_0^{(x,j)}     |   | x_0^0      · · ·  x_0^{N−1}     |^{−1}   | δ_{0,j}   |
| ...             | = | ...        · · ·  ...           |        | ...       |
| a_{N−1}^{(x,j)} |   | x_{N−1}^0  · · ·  x_{N−1}^{N−1} |        | δ_{N−1,j} |

where the Kronecker deltas simply select columns of the inverse matrix. Therefore the Lagrange basis for one x can be computed with one matrix inversion.

Once the Lagrange basis matrix has been computed, the coefficients a_0^{(x,y)}, . . . , a_{N−1}^{(x,y)} of p^{(x,y)} are then given by

| a_0^{(x,y)}     |   | x_0^0      · · ·  x_0^{N−1}     |^{−1}   | y_0     |
| ...             | = | ...        · · ·  ...           |        | ...     |    (3)
| a_{N−1}^{(x,y)} |   | x_{N−1}^0  · · ·  x_{N−1}^{N−1} |        | y_{N−1} |

which requires only one matrix-vector multiplication.

It is a known result that matrix inversion can be computed in sub-cubic time. Strassen's algorithm [20], for example, allows N × N matrix inversion in O(N^2.807) time. Other algorithms are known with faster asymptotic complexity (see the Coppersmith-Winograd family of algorithms), but their large constant factors make them efficient only for very large matrices.

Matrix-vector multiplication can be optimized on finite semirings [21]. By allowing O(N^{2+ε}) preprocessing on the matrix only, it is possible to compute a matrix-vector multiplication in O(N² / (ε log N)²) time.

The two above results make the polynomial interpolation process efficient, especially for relatively small matrices, whenever we need to interpolate several polynomials on the same x_i's but for distinct y_i's which, as we will see, is the case at hand.

B.2.2 Galois Fields

Finite fields (Galois fields) exist that contain afinite number of elements. A field with q ele-ments exists if and only q can be expressed inthe form

q = pk,

where p is a prime number and k is a positiveinteger. Therefore, a field GF (28) exists with256 elements, and each element of GF (28) canbe represented by exactly one byte.

Computational remark. Summation in GF(2^8) reduces to bitwise XOR-ing. Multiplication and inversion are more complex. However, unlike GF(2^32) and GF(2^64), which could take advantage of the larger registers of 32- and 64-bit CPUs, GF(2^8) is small enough that multiplications and divisions can be pre-computed and stored in two exhaustive 64 KB lookup tables, small enough to fit in the L2 cache of a typical ARM CPU.
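To make the 64 KB figure concrete: 256 × 256 one-byte products are exactly 65,536 entries. The sketch below is ours (the reduction polynomial 0x11b is an illustrative choice the paper does not fix); it builds the exhaustive multiplication table with carry-less "Russian peasant" multiplication.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11b)."""
    r = 0
    while b:
        if b & 1:
            r ^= a             # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= 0x11b         # reduce modulo the field polynomial
        b >>= 1
    return r

# Exhaustive lookup table: 256 * 256 = 65,536 one-byte entries = 64 KB.
MUL = [[gf_mul(a, b) for b in range(256)] for a in range(256)]
print(len(MUL) * len(MUL[0]))  # 65536
```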

B.2.3 Polynomial oversampling for redundancy

Let s = s_0, . . . , s_{S−1} be a string of S bytes that we want to reliably store. Let N, K be integers such that N < K ≪ S. We can organize s in L = ⌈S/N⌉ padded blocks a^{(i)} of N bytes:

a_j^{(i)} = s_{Ni+j}   if Ni + j < S
a_j^{(i)} = 0          otherwise

such that each a^{(i)} can be interpreted as the GF(2^8) coefficients of an (N − 1)-degree polynomial. Note that since GF(2^8) is finite, we must have N ≤ 256 to prevent degenerate polynomials over different coefficients. Let

p^{(i)} = p_{a_0^{(i)}, ..., a_{N−1}^{(i)}}

We can now define L data blocks b^{(i)}, each K bytes long, by

b_j^{(i)} = p^{(i)}(j)

(note how, again, since GF(2^8) is finite, we must also have K ≤ 256).

Now, let x ∈ [0, K − 1]^N be a vector with distinct components x_0 ≠ . . . ≠ x_{N−1}, and let y^{(i)} be defined by

y_j^{(i)} = b_{x_j}^{(i)}

We know from Section B.2.1 that we can use x and y^{(i)} as inputs for polynomial interpolation. In particular, we have

a_j^{(x,y^{(i)})} = a_j^{(i)}

Due to oversampling, any N-subset of components of b^{(i)} is sufficient to recover the original a^{(i)}.

Summing up. To implement Reed-Solomon redundancy, we organize s in N-byte-long sequences a^(i). We interpret each a^(i) as the coefficients of a degree-(N−1) polynomial that we oversample in 0, ..., K−1, with K > N. Oversampling produces a b^(i) ∈ K^K from each a^(i) ∈ K^N. As we just showed, from any N components of each b^(i) we can recover the corresponding a^(i) by means of polynomial interpolation.

To implement redundancy, we can now store each component of b^(i) separately: b^(i)_0 will be stored on one node, b^(i)_1 on an independent node, and so on. As a result, K nodes will each be storing one component of b^(i), and as long as any N of them are online and reachable, the original value of a^(i) can be retrieved.

Samples generation. Let i ≠ k. Since b^(i) and b^(k) are the result of polynomial oversampling on two independent blocks a^(i) and a^(k), knowledge of the components of b^(i) does not contribute to the recovery of a^(k). Therefore, while each component of b^(i) needs to be stored on an independent node, we can safely store a component of b^(i) on the same node that also stores a component of b^(k), as this does not affect the overall probability of recovering a^(i) or a^(k).

Indeed, the most efficient way to implement redundancy is to let each node store one of the data samples s^(0), ..., s^(K−1) defined by

s^(i)_j = b^(j)_i

Figure 2 shows the complete redundancy process, starting from s, to produce s^(0), ..., s^(K−1).

Data retrieval. For any set of distinct i_0, ..., i_{N−1}, if we retrieve from the network s^(i_0), ..., s^(i_{N−1}) then, for each k ∈ [0, L−1], interpolation on {s^(i_j)_k}_{j∈[0,N−1]} yields a^(k), whose concatenation produces the original s. Figure 3 displays the whole retrieval process, starting from s^(i_0), ..., s^(i_{N−1}) to produce s.

Note how the above procedure requires L polynomial interpolations for different values of s^(i_j)_k, but the sampling points remain i_j for each polynomial. As shown in Section B.2.1, this makes the procedure particularly efficient for large values of L, as it allows us to invert only one matrix and to amortize reconstruction time by means of preprocessing on the Lagrange basis matrix.

B.3 Redundancy strategy

B.3.1 Villages

Decay and recovery

In Section B.2, we have seen how Reed-Solomon erasure codes allow us to generate, from an S-byte string of data, K data samples of size ⌈S/N⌉. As long as any N data samples are available, the original string can be retrieved. This property alone, however, does not significantly increase the lifetime of the string.

Let X denote a random variable representing the lifetime of a node. Modeling failure as an exponential process (see Section B.1.1), the probability of a node being alive at time t is

P(X > t) = exp(−t/L)

and the probability of H independent nodes being alive at time t,

P(X_0 > t ∧ ... ∧ X_{H−1} > t) = ∏_{i=0}^{H−1} P(X_i > t) = exp(−Ht/L),

follows an exponential distribution with average L/H, i.e., it takes on average L/H for one of H independent nodes to experience a failure. Since exponential processes are memoryless, the average time t̃_d for a string to become unrecoverable is

t̃_d = L ∑_{i=N}^{K} (1/i) = O(L(log(K) − log(N)))

Whenever at least N data samples are available, polynomial interpolation and re-sampling allow us to generate missing data samples and store them on new nodes. In order to extend data lifetime beyond the order of magnitude of node lifetime, we will implement real-time data monitoring and recovery procedures.
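The expected unrecoverability time can be evaluated directly from the harmonic sum above (L, the mean node lifetime, in arbitrary units):

```python
def decay_time(L, N, K):
    """Mean time for the number of live samples to drop from K below N:
    the level-i waiting time is L/i, summed over i = N..K."""
    return L * sum(1.0 / i for i in range(N, K + 1))
```

For the preliminary parameters N = 8, K = 12 used later, this gives roughly 0.51·L, i.e., erasure coding alone keeps data recoverable for only about half a node lifetime, which motivates the active recovery procedures.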

Optimal data distribution

A file is the minimum unit of data whose integrity we are safeguarding: failing to retrieve any block of a file is equivalent to failing to recover the whole file. Under this premise, the probability of failed retrieval is minimized by storing corresponding samples of the blocks in the same file on the same set of nodes.

Storing distinct files on distinct sets of nodes reduces correlations among failed retrievals without affecting their expected number. On the other hand, if multiple files are stored on the same set of nodes, upon failure of a node more files will simultaneously need to be recovered. This will make the recovery procedure longer and increase the probability of the redundancy level falling under N before the recovery procedure is completed. An optimal data distribution strategy would therefore store each file on a distinct set of nodes.


Figure 2: Reed-Solomon redundancy process. A string s is organized in blocks a^(0), ..., a^(L−1) of N bytes each. Each block is interpreted as polynomial coefficients and the polynomial is oversampled to produce b^(0), ..., b^(L−1) (each arrow represents a polynomial oversampling). All the corresponding components of each b^(i) are then reorganized into data samples s^(0), ..., s^(K−1), each of which is stored on an independent node.

Figure 3: Reed-Solomon retrieval process. N data samples s^(i_0), ..., s^(i_{N−1}) are recovered from N independent nodes. They are reorganized in L N-component subsets of the original b^(0), ..., b^(L−1). Polynomial interpolation (as per Equation 3; each arrow represents a polynomial interpolation on the same Lagrange basis) is then used to reconstruct a^(0), ..., a^(L−1), which can be reorganized into the original string s.

Polling overhead

Failure of a node can be caused not only by failure of its storage device, but also by damage to its single-board computer or by user-generated permanent disconnection. The latter causes cannot be detected by the node experiencing the failure, and therefore cannot be notified to the nodes that share the redundancy of files with it. Real-time, distributed monitoring of uptime and data availability must therefore be implemented via polling (namely, iterated active checking for the availability of a resource).

File size statistics. In order to estimate the polling overhead caused by storing each file on a distinct set of nodes, we need to determine the average number of distinct files each node will be storing.

In order to do so, we implemented an automated online survey to anonymously scan the size distribution of the personal files of 34 volunteers. Volunteers were prompted with an online form that allowed them to select one or more folders from their computer, and were asked to select any folder where personal files were stored (e.g., Desktop, Documents, Music, Pictures, and so on). 1.01 · 10^6 files were scanned.

Figure 4: Experimental file size distribution generated by our scanner on the personal files of 34 volunteers. The bottom 98 percentiles are represented (the top 2 were cropped to allow semilogarithmic representation).

Figure 4 shows the resulting file size distribution. After a steeply decreasing regime, the distribution shows a dent around 3.8 MB, followed by an exponential distribution for larger files.

The experiment shows an average file size of

〈S〉 = 5.56 MB

Polling bandwidth. If we preliminarily underestimate N = 8, K = 12, the average data sample size will be 695 KB, and the time needed to transfer it will be approximately 7 s. In order for recovery procedure time to benefit from the reduced concurrency induced by independently storing files, polling frequency must be of the same order of magnitude, e.g., 10 s. A full node with C = 1 TB storage capacity will contribute to storing on average 1.58 · 10^6 distinct files, which would result in 1.89 · 10^7 polling connections. The size of a single, empty UDP packet is 52 B. The polling bandwidth required to monitor files distributed independently across the nodes would therefore be at least 98.3 MB/s, far beyond the upload speed S_u (see Section B.1.2).
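The estimate above can be reproduced in a few lines. Taking 1 TB as 2^40 bytes is our assumption; it matches the quoted figures:

```python
avg_file = 5.56e6        # bytes, measured average file size <S>
N, K = 8, 12             # preliminary redundancy parameters
capacity = 2 ** 40       # bytes, full node storage (1 TB)
packet = 52              # bytes, size of an empty UDP packet
poll_period = 10         # seconds between polls of each resource

sample_size = avg_file / N                 # ~695 KB per data sample
files_per_node = capacity / sample_size    # ~1.58e6 distinct files
connections = K * files_per_node           # one poll per fellow sample holder
bandwidth = connections * packet / poll_period  # bytes per second, ~1e8
```

The result is on the order of 10^8 B/s, which is what rules out fully independent file placement and motivates the villages below.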

Daemons and villages

In order to reduce the polling overhead stemming from real-time, distributed data monitoring, we will divide the storage capacity of each node into partitions of Z bytes. Each partition will be managed by a daemon, namely, an independent instance of a data-managing software. Daemons will be organized in groups of K that we call villages. Each file in our network will be stored by a village, and each daemon will store the same components of each data block of each file.

Remark: Z must be the same for every daemon in the same village. Since every time a file is uploaded to a village the same space is occupied on every daemon (each daemon is storing a data block of the same size), any space allocated beyond the storage capacity of the smallest daemon would be wasted.

B.3.2 Redundancy parameters

Lazy recovery

Polynomial interpolation and re-sampling allow us to recover data samples that went missing. In order to recover any number of missing data samples, however, at least N other data samples need to be gathered on the same node (polynomial interpolation requires at least N samples).

Due to the fixed networking cost of interpolation, it is therefore more efficient to allow multiple data samples of the same file to be lost before performing a recovery procedure, thus amortizing that fixed cost by re-sampling more missing data samples at a time.

We will model a village with the following rules:

• The village is always composed of K daemons. Whenever a daemon experiences a failure, it is immediately replaced by a new, empty one.

• Whenever the redundancy level of a file (namely, how many distinct data samples of the file are stored by the village) reaches a recovery threshold T (h = T/N will be called recovery ratio), a recovery procedure is performed.

Let us call sources the daemons that, at the beginning of the recovery procedure, have a data sample of the file, and sinks the daemons that don't. The procedure is implemented as follows:

– One of the sinks downloads N data samples of the file from N sources.

– The sink recovers the original file and re-samples all the missing data samples.

– The sink stores one of the new data samples, and sends one of the other K − T − 1 to each other sink.
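The village rules above can be explored with a toy simulation of a single file's redundancy level. One simplification is assumed here: if a failure lands while a recovery is in progress, the recovery simply restarts.

```python
import random

def simulate_file(L, K, N, T, t_r, horizon, rng):
    """Return the time at which the file is lost, or None if it survives
    until `horizon`. Failures among the `level` live daemons form a
    Poisson process of rate level/L; at or below the threshold T a
    recovery is in progress and completes after t_r, unless another
    failure lands first (in which case it restarts)."""
    t, level = 0.0, K
    while t < horizon:
        dt = rng.expovariate(level / L)  # time to the next failure
        if level <= T and t_r < dt:      # recovery completes first
            t += t_r
            level = K
            continue
        t += dt
        level -= 1
        if level < N:
            return t                     # fewer than N samples left
    return None
```

With a recovery time much shorter than node lifetimes the file survives indefinitely; with recovery effectively disabled it is lost after a handful of failures, matching the decay-time analysis above.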

Parameters

To sum up, our redundancy strategy depends on four redundancy parameters, for which adequate values need to be determined:

• Block size (N): the number of coefficients in a polynomial.

• Village size (K): the number of daemons in a village, equal to the number of data samples per file. r = K/N will be called storage ratio.

• Recovery threshold (T): the redundancy level that triggers a recovery procedure for a file. h = T/N will be called recovery ratio.

• Storage size (Z): the amount of storage space managed by a daemon.


B.4 Redundancy performance

The efficiency and reliability of our redundancy strategy will be determined by the following values:

Data lifetime ([L*] = s): the average lifetime of a file stored by our network.

Recovery bandwidth ([B*] = B/s): the average amount of bandwidth used by a daemon to perform the recovery procedures needed to maintain file redundancy.

Data downtime (d* ∈ [0, 1]): the average fraction of time a file is unretrievable due to temporary network- or power-related issues.

B.4.1 Data lifetime

Recovery time

Assuming 100% storage occupancy², whenever a node experiences a failure, its village will be storing NZ worth of files. The files that will need recovery after a node n experiences a failure will be those stored by n and exactly T other daemons.

As we have seen earlier, the time a file spendsin a redundancy level l is on average L/l. There-fore the probability of any file being hosted byT + 1 daemons at any time is

pc =L(T + 1)−1

L∑Ki=T+1 i

−1 + tr≤

(T + 1)−1∑Ki=T+1 i

−1≤ (T + 1)−1

log(K + 1)− log(T + 1)

where the denominator on the right-hand side of the last inequality stems from

L ∑_{i=T+1}^{K} (1/i) ≥ L ∫_{T+1}^{K+1} (1/x) dx = L(log(K+1) − log(T+1))

The probability for a file whose redundancy level is T+1 to be hosted by any given daemon is

p_h = (T+1)/K

therefore the average amount of data that will need to be recovered when a node experiences a failure is

C = NZ p_c p_h ≤ NZ / (K(log(K+1) − log(T+1)))

² This is likely to be a large overestimate, as it would require each and every user in the network to occupy all of his/her storage space.

The average data transfer load during a recovery procedure is as follows:

          Upload                         Download
Source    C/T                            0
Sink      ((K−T−1)/N) · C/(K−T)          (1 + (K−T−1)/N) · C/(K−T)

Most of the data transfer is carried out by the sinks. As a home-grade Internet connection has an asymmetric upload/download bandwidth, the time required to complete a recovery procedure is therefore (see Section B.1.2):

t_r = max( (C/T)·S_u^{−1}, ((K−T−1)/N)·(C/(K−T))·S_u^{−1}, (1 + (K−T−1)/N)·(C/(K−T))·S_d^{−1} )
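The bound on C and the resulting recovery time can be evaluated numerically. The connection speeds Su (upload) and Sd (download) in the example are placeholder values standing in for the Section B.1.2 figures, which are not restated here:

```python
import math

def recovery_load(N, K, T, Z):
    """Upper bound on the data C needing recovery after one node failure."""
    return N * Z / (K * (math.log(K + 1) - math.log(T + 1)))

def recovery_time(C, N, K, T, Su, Sd):
    """Completion time: the slowest of source upload, sink upload
    and sink download, per the max() expression above."""
    return max(
        (C / T) / Su,
        ((K - T - 1) / N) * (C / (K - T)) / Su,
        (1 + (K - T - 1) / N) * (C / (K - T)) / Sd,
    )

# Example with the Section B.4.4 parameters and assumed Su = 1 MB/s, Sd = 10 MB/s:
C = recovery_load(24, 36, 28, 100e9)
t_r = recovery_time(C, 24, 36, 28, 1e6, 1e7)
```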

Recovery failure probability

A file is permanently lost if its redundancy level goes under N before its recovery procedure is completed. The probability of any alive daemon surviving at least t_r is p_s = exp(−t_r/L). The probability of losing exactly k daemons out of T in time t_r is

p_l^(k) = (T! / (k!(T−k)!)) (1 − p_s)^k p_s^{T−k}

and by the Chernoff bound, the probability p_l of losing T − N + 1 or more data samples is

p_l = ∑_{k=T−N+1}^{T} p_l^(k) ≤ exp(−T·D̃(h, 1 − p_s))

with

D̃(r, d) = D((r − 1)/r || d)

where D(a || p) is the relative entropy between a Bernoulli(a) and a Bernoulli(p) distribution:

D(a || p) = a log(a/p) + (1 − a) log((1 − a)/(1 − p))
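The exact binomial tail and the Chernoff bound above can be compared directly (the bound applies when 1 − p_s < (h−1)/h, i.e., when losses are rare relative to the recovery margin):

```python
import math

def p_loss_exact(T, N, p_s):
    """P(losing T - N + 1 or more of T daemons within t_r): binomial tail."""
    q = 1 - p_s
    return sum(math.comb(T, k) * q ** k * p_s ** (T - k)
               for k in range(T - N + 1, T + 1))

def p_loss_chernoff(T, N, p_s):
    """exp(-T * D((h-1)/h || 1 - p_s)) with h = T/N."""
    a, p = (T - N) / T, 1 - p_s  # (h-1)/h = (T-N)/T
    D = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
    return math.exp(-T * D)
```

With the parameters chosen later (T = 28, N = 24) and, say, p_s = 0.999, the exact tail is several orders of magnitude below the bound, so the bound is conservative.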

Lifetime

As we have seen in Section B.3.1, the average time for the redundancy level of a file to decay from K to T is

t_d = L ∑_{i=T+1}^{K} (1/i) ≥ L ∫_{T+1}^{K} (1/x) dx = L(log(K) − log(T+1))


Figure 5: Data lifetime (years) for different values of the recovery ratio h (1.16, 1.2, 1.25, 1.33), as a function of the village size K. Here we have r = 1.5 and Z = 100 GB.

The time for the redundancy level of a file to decay from K to T and for the file to be recovered is t_d + t_r. The average number of these redundancy cycles is 1/p_l. Therefore the expected data lifetime is

L* = (t_d + t_r)/p_l

Figure 5 shows data lifetime for different values of the recovery ratio, as a function of the village size. Data lifetime can be made sufficiently large for any practical purpose by villages of manageable size, even when Z is as large as 100 GB.
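Given t_r and p_l from the previous paragraphs, the expected lifetime follows directly; the decay sum also satisfies the logarithmic lower bound quoted above. The numeric values used in the assertions are illustrative:

```python
import math

def decay_to_threshold(L, K, T):
    """Mean time t_d for the redundancy level to decay from K to T."""
    return L * sum(1.0 / i for i in range(T + 1, K + 1))

def data_lifetime(L, K, T, t_r, p_l):
    """L* = (t_d + t_r) / p_l: mean cycle length times mean cycle count."""
    return (decay_to_threshold(L, K, T) + t_r) / p_l
```

Because p_l shrinks roughly exponentially in T (via the Chernoff bound), L* grows extremely quickly with the recovery margin, which is why the plotted lifetimes span so many orders of magnitude.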

Remarks

Reliability calculations depend critically on the assumption of independence. In a centralized storage paradigm, disk failures can be correlated for any number of reasons: disks coming from the same manufacturing lot, disks operating in the same physical environment, and so forth. Reliability estimates, therefore, tend to be inflated when the assumption of independence is unfounded [22] [23].

In a distributed paradigm, however, storage devices will be purchased from independent lots, and each will in general operate in an independent physical environment under independent conditions. Therefore, reliability estimates can be considered more trustworthy in a distributed context.

B.4.2 Recovery bandwidth

Recovery bandwidth can be easily computed from the results of Section B.4.1. Upon node failure, on average C worth of data will need recovery.

Figure 6: Recovery bandwidth (KB/s) for different values of the recovery ratio h (1.16, 1.2, 1.25, 1.33), as a function of the village size K. Here we have r = 1.5 and Z = 100 GB.

Upon recovery, T sources will transfer data to K − T sinks, and from the table in Section B.4.1 we get the total amount of data Q that is transferred every time a recovery procedure is performed,

Q = 2C(1 + (K − T − 1)/N),

and the recovery bandwidth

B* = (Q/K)·(L/K)^{−1} = Q/L

Figure 6 shows the recovery bandwidth for different values of the recovery ratio h, as a function of the village size K. As expected, B* is an increasing function of h. Recovery bandwidth reaches a plateau for large values of K, on which it does not strongly depend. For large villages and relaxed recovery ratios, B* is on the order of 10 to 20 KB/s even when Z is as large as 100 GB.

Remarks

Unlike data lifetime, whose value can easily be made high enough for any practical purpose, recovery bandwidth is small but non-negligible. Our estimates, however, can be reduced by the following arguments:

Full usage: we assumed that every user in the village will use 100% of their allocated storage space, which is unlikely to be true in real-world scenarios.

Immortal files: we assumed that files are never deleted. Files whose lifetime is shorter than a redundancy cycle (which is on the order of several months) will be deleted before undergoing any recovery procedure.

No access: we assumed that files are never accessed. Whenever a file is downloaded by a user, its data samples are collected and interpolated by a client. Since the most costly part of a recovery procedure is, in fact, already carried out, missing samples can be re-sampled and uploaded, and the redundancy of the file restored.

Recovery procedures can also be designed to be carried out when network load is low. For example, when the redundancy level of a file reaches T + 1, an early recovery procedure could be scheduled for the following night. In the likely event that no additional failure occurs in the next few hours, the recovery procedure will not significantly affect user experience, and will make use of the routing infrastructure when it tends to be more idle.

B.4.3 Downtime

As we have seen, the probability of a file's redundancy level being k is

p_c^(k) = (L·k^{−1}) / (L ∑_{i=T+1}^{K} i^{−1} + t_r)

and the downtime of a file whose redundancy level is k is given by the binomial

d(k) = ∑_{n=k−N+1}^{k} (k! / (n!(k−n)!)) d^n (1 − d)^{k−n}

therefore

d* = ∑_{k=T+1}^{K} p_c^(k) d(k)

Figure 7 shows data downtime for different values of the recovery ratio h, as a function of the village size K. As expected, downtime is a decreasing function of h (as the recovery procedure is performed before the redundancy level allows a significant probability of enough nodes being simultaneously offline).
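The downtime expression can be evaluated as below; d is the single-node downtime fraction from Section B.1, and the values used in the assertions are arbitrary placeholders:

```python
import math

def data_downtime(L, K, N, T, t_r, d):
    """d*: per-level binomial downtime d(k), mixed over the stationary
    distribution p_c^(k) of the redundancy level k in (T, K]."""
    norm = L * sum(1.0 / i for i in range(T + 1, K + 1)) + t_r
    total = 0.0
    for k in range(T + 1, K + 1):
        p_k = (L / k) / norm
        # file unreachable when more than k - N of its samples are offline
        d_k = sum(math.comb(k, n) * d ** n * (1 - d) ** (k - n)
                  for n in range(k - N + 1, k + 1))
        total += p_k * d_k
    return total
```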

B.4.4 Conclusions

Parameters and performance

As a result of the tradeoffs defined by reliability, polling overhead, and recovery bandwidth, we chose the following parameters:

Block size (N): 24
Village size (K): 36
Recovery threshold (T): 28
Storage size (Z): 100 GB

Figure 7: Data downtime for different values of the recovery ratio h (1.16, 1.2, 1.25, 1.33), as a function of the village size K. Here we have r = 1.5 and Z = 100 GB.

which result in the following performance:

Data lifetime (L*): 7.33 · 10^8 years
Recovery bandwidth (B*): 11.5 KB/s
Data downtime (d*): 9.51 · 10^{−10}

Conclusions

In the previous sections, we described our redundancy strategy, designated four redundancy parameters (block size, village size, recovery threshold, and storage size), and discussed a model that allowed us to derive analytical expressions for three key performance indexes (data lifetime, recovery bandwidth, and data downtime).

A Reed-Solomon-based redundancy strategy with distributed, real-time data monitoring and recovery procedures proves to be reliable and efficient, and a viable strategy to store data using single-board computers, consumer-grade storage devices, and home-grade Internet connections.

Part C

Architecture

Challenges

Showing that an efficient and reliable redundancy strategy can be designed to store data on a distributed network of nodes is indispensable, but not sufficient to build a network that can actually be used to securely store and share files. In this section, we will provide an overview of some of the foreseeable challenges for the full development of RAIN.


C.1 Security challenges

As we discussed in the Introduction, the RAIN network should guarantee confidentiality (any unauthorized access to the content stored on the network should be impossible), integrity (it should be impossible to alter content stored on the network without the tampering being detected), and availability (it should be difficult for attackers with limited hardware resources to make some content permanently unavailable or temporarily inaccessible).

The security protocols of traditional client-server paradigms are designed under the assumption that servers will be in physically secure environments beyond the reach of attackers. Discretion over who is granted access to the data, however, is left to the provider of the service and not to the user: in the recent past, many of the largest Internet companies in the world, handling the personal data of billions of citizens, have guaranteed access to government-grade attackers [24] [25].

Since each node of the network will be beyond our physical reach, and in the hands of potentially malicious users, RAIN will need radically different security protocols.

Attack model

RAIN's security protocols will be designed under the following attack model (namely, a set of axiomatic assumptions about the extent of an attacker's access to our network and its intents; see, e.g., [26]):

Access

1. The attacker will have read-only access to all the raw data persistently stored in our network.

2. The attacker will have physical access to up to 1% of the nodes in our network.

3. The attacker will be able to gain physical access to any node in the network in one day.

4. The attacker will have at its disposal a computational power comparable to that of the whole world in the foreseeable future.

5. The attacker will not have access to unknown computational technology, or know cryptanalytic exploits to NSA Suite B Cryptography algorithms (namely, AES, ECDSA, ECDH, and SHA-256/384).

Intents

1. The primary objective of the attacker will be to gain access to plaintext data stored by the network's users.

2. The secondary objective of the attacker will be to alter the data stored by the network's users.

3. The tertiary objective of the attacker will be to severely disrupt the network's service (e.g., by making large amounts of data unavailable) in order to undermine users' trust in the network and convince them to revert to a classical client-server paradigm.

C.2 Network organization

C.2.1 Honest Geppetto attacks

As we have seen, correlations among failures disrupt data lifetime. For example, whenever T − N + 1 nodes in the same village fall under the control of the attacker, an undetectable Honest Geppetto attack (here we use the same nomenclature as in [27]) can be delivered: malicious nodes can behave identically to honest nodes until some recovery threshold is reached. When the redundancy of some files they are hosting reaches T, the malicious nodes can simultaneously wipe their content, thus making data permanently unavailable.

C.2.2 Proofs of Persistent Storage

It is easy to see that no distributed storage network can guarantee availability under a Google attack (namely, a variant of a Sybil attack where the attacker has under its control an amount of hardware resources far larger than the size of the whole network). Especially in the case of an Honest Geppetto attack, where it is impossible to distinguish malicious from honest nodes before they perform the attack, there is no effective way of selectively storing data on honest nodes, and the probability of that happening by chance is vanishingly low. The only feasible defense against a Google attack is to have a large enough network [27] (or to make its size grow faster than the reward for an attacker to invest its resources in a Google attack).

As per Assumption 2, the attacker might have under its control an extensive number of physical nodes, which we still assume to be a minority. A malicious node, however, might spawn more daemons than its storage space allows, and go undetected until the actual occupancy of files overflows its capacity. By carefully overbooking its storage, without any hardware cost, the attacker might multiply the number of daemons under its control.

In order to prevent this from happening, we will make use of Proofs of Persistent Space (PoPS), i.e., protocols that allow one party (the verifier) to check whether the other party (the prover) is persistently occupying some committed space [14].

PoPS protocols with polylog(N) communication complexity exist [15] that, however, are vulnerable to tradeoffs which, at a CPU cost, could allow the prover to maliciously save part of its committed space. In order to address this issue, we developed a provably secure, tight-bound PoPS that uses inherently sequential intertwined functions and timeouts to prevent the on-the-go recomputation of the storage commitment.

By forcing each daemon to commit the storage space it is making available, we will prevent the attacker from controlling a larger percentage of daemons than of physical nodes.

C.2.3 Authenticated Distributed Hash Tables

Publicly verifiable systems exist (namely, blockchains [13]) that allow distributed bookkeeping on a public ledger of transactions without the need for the nodes involved to trust each other, or a third party. Blockchains, however, require each node to keep a full copy of the whole transaction ledger, whose size grows in time.

As we will later see, in order to guarantee availability, we will need to perform some rare operations on the network, like the addition of a new node or the creation of a village, whose result needs to be verifiable by any other node.

In order to do this in a way that is scalable and does not require each node to keep a list of every global transaction that has ever occurred on the network (as in a blockchain paradigm), a protocol based on authenticated hash tables (for related work, see [12] and [28]) is under development.

Our strategy will be to organize the entries ofa hash table in a Merkle tree where each valuecan be retrieved by navigating the tree along apath uniquely determined by the key.

This data structure can also be distributed (an authenticated distributed hash table, ADHT) among multiple nodes, each holding one or more entries and their corresponding Merkle proofs. It is possible to show that non-conflicting Merkle proofs can be merged with each other, which allows for local edits that can be propagated without corrupting pre-existing entry proofs.
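A toy version of the Merkle-tree hash table idea can be sketched as follows. This is an illustration only; names, tree depth, and hashing conventions are ours, not those of the protocol under development:

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

DEPTH = 4  # 16 leaf slots; a real table would be far deeper

def leaf_index(key: bytes) -> int:
    """The path to a value is uniquely determined by its key's hash."""
    return int.from_bytes(H(key)[:4], "big") % (1 << DEPTH)

def build(leaves):
    """Return all tree levels, leaves first, root level last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels

def prove(levels, idx):
    """Sibling hashes along the path from leaf idx to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[idx ^ 1])
        idx //= 2
    return proof

def verify(root, idx, leaf, proof):
    """Recompute the root from one leaf and its O(log n) proof."""
    node = leaf
    for sib in proof:
        node = H(node + sib) if idx % 2 == 0 else H(sib + node)
        idx //= 2
    return node == root
```

A node holding one entry plus its proof can convince any peer of the entry's inclusion with logarithmic data, without anyone storing the full table.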

A consensus mechanism can then be devised to sequentialize updates, based not (as in, e.g., [13]) on CPU-intensive proofs of work, but on the same PoPS in place to prevent Sybil attacks.

Our preliminary results suggest that it should also be possible to partition the set of nodes into interconnected communities bound to verify separate segments of the ADHT. Allowing for parallel validation of updates would be a significant breakthrough in the field of decentralized ledgers, as it would remove the limit on the number of operations per second that can be carried out on the ledger.

A paper on ADHT is under development.

C.2.4 Random beacon

As we will see, many of our protocols will rely on the possibility of generating globally verifiable random numbers. While in principle distributed random number generation could be performed by a Byzantine consensus protocol among all the nodes, its communication complexity would be as high as O(N²) [29], which a planetary network, potentially with billions of nodes, cannot afford.

The development of a reliable random beacon (as in [30], but free from unilateral management) is therefore of paramount importance in the development of our network.

Using time-hard functions

It has been shown [31] that inherently sequential, time-hard functions exist (e.g., the square root over GF(p) with p mod 4 = 3, which requires at least log₂(p) − 2 inherently sequential multiplications on the field) that, while being wallclock-time hard to compute, can be verified with very little computational effort (one multiplication in the GF(p) square-root case).
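A minimal sketch of the asymmetry exploited here, with a deliberately small prime (a real deployment would use a large p, making the exponentiation take a tunable wallclock time):

```python
p = 10007  # a prime with p % 4 == 3; small, for illustration only

def slow_sqrt(a: int) -> int:
    """Sequential part: for p = 3 (mod 4), sqrt(a) = a^((p+1)/4) mod p,
    costing ~log2(p) squarings that cannot be parallelized."""
    return pow(a, (p + 1) // 4, p)

def fast_verify(a: int, r: int) -> bool:
    """Cheap part: a single modular multiplication."""
    return r * r % p == a % p
```

Note that only quadratic residues have square roots mod p, so a contributor would submit a = x² to guarantee verifiability.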

This allows us to implement a timed protocol where multiple nodes can seed the generation of the random number, but a malicious node is unable to adjust its seed to affect the final generated value, because doing so would require a wallclock-time effort longer than the window it has to provide its seed.

Using again PoPS as a commitment mechanism to prevent a single node with limited hardware resources from spawning an arbitrarily large number of daemons [15], we can limit the fraction of the seeders pool (namely, the daemons that can contribute to seed the random number generation) that even a very large attacker can control.

Using ADHT, we can now publicly keep track of the number of nodes in the seeders pool, and monitor how many times each daemon has contributed to the seed.

Now, if we program each honest daemon to contribute a seed with probability r/S, S being the number of seeders and r a limited number (e.g., 100), each number will be seeded on average by r independent seeds, and the probability of all of them being malicious (the outcome will be random if at least one of the seeds is random) will decay exponentially in r.
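The exponential-decay claim is easy to quantify: a number seeded by r independent contributors is attacker-controlled only if every seed is malicious.

```python
def p_beacon_compromised(f: float, r: int) -> float:
    """Probability that all r independent seeds of one beacon output are
    malicious; f is the attacker's fraction of the seeders pool."""
    return f ** r
```

Even an attacker holding half the pool (f = 0.5) controls a 100-seed number with probability below 10^{-30}.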

Daemons will be free to randomly choose which random numbers to contribute to, but if a daemon contributes too often (note how this can be monitored with an authenticated hash table), its contributions will be considered invalid and be filtered out (this prevents flooding).

This mechanism (which we will describe in more detail in an upcoming paper) is similar to the process of mining [13], but with two major differences: the hardware commitment is in terms of storage space, not computing power, and multiple daemons will seed the same random number, thus improving the overall security of the mechanism.

Remark: even if this mechanism can sustainably generate only a relatively small number of random numbers per time unit (e.g., one random number every 30 minutes), a larger number of random numbers can be generated by using the random beacon to select a limited random subset of nodes from the pool (the set will change every, e.g., 30 minutes), and using a traditional Byzantine consensus algorithm on that subset to generate random numbers at a higher frequency.

Assumptions 2 and 3 can be used to guarantee the security of random numbers generated using this protocol.

C.2.5 Village formation

Once the publicly verifiable generation of random numbers is guaranteed, a protocol to organize daemons into villages will be designed in order to verify the trustworthiness of daemons and to minimize time correlations induced by the attacker, without compromising the performance of the network. Trustworthiness will be verified via distributed uptime and IP consistency tests. Memory-hard functions on committed storage can also be used to detect botnets. New daemons will be assigned as freeloaders to pre-existing villages. New villages will be formed by randomly selecting daemons from the freeloaders pool only when storage is saturated. This minimizes the attacker's possibility of introducing correlations in the formation of new villages.

C.2.6 Bootstrapping connectivity

In order for data to be stored and retrieved by users on different drives, each node of our network needs to be reachable by any other. This poses a challenge both in terms of networking and security: on the one hand, NAT-traversal and bootstrapping techniques need to be reliable and scalable; on the other, the attacker must be prevented from undermining the connectedness of the network. Non-secure, worldwide network bootstrapping solutions, however, are already in place, developed, e.g., in the scope of the trackerless torrent protocol [9] and the IPFS protocol [32].

An exponential network topology can be designed to guarantee O(log(N)) connectivity and O(log(N)) crossing time, and ADHT can be used to perform timely global bans on the bootstrap network.

C.3 Data

C.3.1 Storage table synchronization

Once daemons are successfully organized into villages, each daemon will have some amount of storage allocated by its fellow villagers. A village-wide ledger will be designed to keep track of newly uploaded files and their hashes. Since each villager will be in charge of updating only its part of the ledger, no concurrency protocol needs to be implemented. Incremental synchronization protocols (e.g., Git) already exist to minimize communication overhead and optimize the reliability of update distribution when nodes have non-negligible downtime.

Villagers will be in charge of monitoring the legality of each other's operations, and a fair bidding protocol will be devised to ban villagers that misbehave or deny access to their data.

C.3.2 Distributed data monitoring

As we discussed, villagers will monitor each other's data availability. This will be done by means of Merkle trees [11] and signatures. This will be non-trivial, as we want to make sure that space involved in an ADHT commitment can still be used to store actual data: PoPSs should serve the purpose of proving that a daemon has some hardware available and uniquely committed to the network, but they should not interfere with how that space is used.
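As a rough illustration of the monitoring primitive (a generic Merkle-tree sketch, not the paper's actual implementation), a villager can prove that a data block belongs to a committed dataset by supplying O(log n) sibling hashes:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root of a Merkle tree over hashed leaves (last node duplicated on odd levels)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes proving that leaves[index] is in the tree."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))  # (sibling, am-I-right-child)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, is_right in proof:
        node = h(sibling + node) if is_right else h(node + sibling)
    return node == root

blocks = [b"block-%d" % i for i in range(5)]
root = merkle_root(blocks)
assert verify(blocks[3], merkle_proof(blocks, 3), root)
```

A monitoring villager only needs to store the root; any challenged peer answers with a logarithmically sized proof.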

A bidding protocol will be designed to trigger a recovery procedure, along with secure, global ways to find substitute nodes from the freeloaders pool whenever nodes become permanently unavailable.


C.4 Metadata

C.4.1 Synchronization

Data is uploaded by users to their nodes' villages, using the storage separately allocated by each village to their account. When data is organized in file trees and shared with other users, however, synchronization becomes a relevant issue. Like any other piece of data, drive metadata will be encrypted; unlike data, however, nodes will be able to edit it so as to make changes to the file tree. In the context of a shared drive, multiple users can simultaneously try to push their edits to the drive's metadata, and atomicity needs to be guaranteed.

The problem is widely discussed in the literature, and addressed by pessimistic algorithms [33] (which involve acquiring multiple locks before pushing an update to multiple nodes, useful when inconsistencies are either very dangerous or very frequent) and optimistic algorithms [34] (which allow for temporary inconsistencies among the copies and focus on quickly resolving them).

C.4.2 Efficient ban

Metadata in a drive is encrypted with a symmetric key which, when a new user is invited to join the drive, is encrypted with his/her public asymmetric key and shared. When a user is banned, however, a new symmetric key needs to be generated in order to exclude the banned user from access to the data stored in the drive. When the number of users in the drive is large, sharing the new key with all of them is a costly procedure. Similar problems are found, e.g., in the field of premium television channels, where new keys need to be distributed only to subscribers.

Algorithms efficiently addressing this problem exist in the literature [35]. On our side, an O(log(N)) protocol is under development to efficiently share new keys with all the users except one.
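While the O(log(N)) protocol mentioned above is still under development, the complexity class can be illustrated with a standard complete-binary-tree (logical key hierarchy) argument: encrypting the new drive key under the sibling subtree keys along the banned user's path covers every other user with only ⌈log₂ N⌉ ciphertexts. The sketch below only counts those subtrees; all names are illustrative and this is not the paper's protocol.

```python
import math

def covering_keys(n_users: int, banned: int):
    """(level, node) indices of the complete-binary-tree subtrees that together
    cover every leaf except `banned`: the siblings along the banned leaf's path."""
    depth = math.ceil(math.log2(n_users))
    path = banned
    siblings = []
    for level in range(depth, 0, -1):
        siblings.append((level, path ^ 1))  # sibling subtree at this level
        path //= 2
    return siblings

# Rekeying a 1-million-user drive needs only ~20 encryptions:
print(len(covering_keys(1_000_000, banned=123456)))  # 20
```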

C.4.3 Data-metadata synchronization

While atomicity will be guaranteed in the context of metadata synchronization, data and metadata are in principle independent. A procedure needs to be devised to prevent orphan data, i.e., data that is not referenced by any metadata. This can be done by allowing villages to communicate. Whenever data is stored on a village, it will be marked as temporary until a metadata village confirms that it references it. When a reference is removed from a drive, the village storing its metadata will notify the village storing the data to delete it and save space.

C.5 Credentials

C.5.1 High-entropy authentication mechanisms

In order to prevent the attacker from pursuing its primary and secondary objectives (see Assumptions 1 and 2), end-to-end cryptography will be implemented using asymmetric keys generated at the moment of signup. Since, as we said in Section: User Experience, users should be able to access their data from anywhere, a user's private key will need to be persistently stored on the network, encrypted with a key that can be generated by an authentication mechanism. In a client-server context, the encrypted key would be shared by the server only when the user proves able to decrypt it, the server thus acting as a bottleneck for any brute-force or hash table attack (for example, a server can lock an account for a few minutes if too many incorrect attempts are made). This protects low-entropy passwords from attacks (although, in the case of an attacker in control of the server, it completely undermines the security of an end-to-end encryption based on weak passwords). Due to Assumption 1, however, there is no place in the network where we can store an encrypted private key without the attacker being able to retrieve it. Due to Assumption 4, the attacker can perform a brute-force or hash table attack on the key, of which it has the ciphertext.

This issue can be addressed in two parallel ways. On the one hand, the decryption function can be made hard to compute [36]. Even without acting as effectively as a server bottleneck, this will slow down brute-force attackers, but it comes at a cost in terms of user experience: the login process should not take more than a few seconds on a standard machine. On the other hand, we will try to increase the entropy of our authentication mechanism. In order to do so without making its secret difficult to memorize, we plan to enforce the use of passphrases, rather than passwords, by design. By leveraging associative memory, this can be done in two ways.

Image-word associations Research has already been carried out on the possibility of using images as a part of an authentication mechanism [37]. Here we propose an example of an image-based authentication mechanism. Upon signup, the user is prompted with a collection of images. Images are representational, but without a uniquely identifiable semantic content. The user is asked to pick a certain number of images (e.g., 10) that are easy to associate with a personal memory, and to associate one or more words with each. During sign-in, the user is prompted with the images he/she selected, and asked to enter the corresponding words. If words were uniformly picked among the 2000 most common words in the user's native language, a 10-word passphrase would be approximately as secure as a 110-bit random key. In order to verify the hypothesis that word-image associations are easier to memorize and have a higher entropy than passwords, we developed an online experiment where password entropy and memory persistence are measured. Preliminary results are encouraging as far as memorization is concerned, with some users remembering with 100% efficiency up to 286-bit-equivalent passphrases. However, some users tend to choose words that describe the image, rather than a memory associated with it. This immensely reduces the final entropy of the passphrase.
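The ~110-bit figure quoted above follows directly from the uniform-choice assumption: a passphrase of n words drawn uniformly from a vocabulary of size v carries n · log₂(v) bits of entropy.

```python
import math

def passphrase_entropy_bits(n_words: int, vocab_size: int) -> float:
    """Entropy of a passphrase of n_words drawn uniformly from a vocabulary."""
    return n_words * math.log2(vocab_size)

print(passphrase_entropy_bits(10, 2000))  # ~109.7 bits, the ~110-bit figure above
```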

Word-word associations Word-word associations can be used instead of images [38]. Upon signup, the user is asked to enter a certain number of words (e.g., 10), and to associate with each another word with a personal, easily recallable meaning. During sign-in, the user is prompted again with the words he/she chose, and asked to enter the corresponding words. In order to filter out trivial associations, we plan to run a crowdsourced free word association experiment, where a network of the most common associations is built depending on nationality. This would allow us to provide the user with an estimate of his passphrase entropy, and to filter out the less robust ones.

C.5.2 Distributed hash tables on GBN

When signing in from a new client, a user will only be asked to enter his username and his passphrase. It will therefore be necessary to store distributed login information at a global level on the network, in a way that allows the information to be retrieved from anywhere in a limited number of steps. This can be done by organizing nodes in large redundancy groups (namely, Distributed Hash Tables like [9]), and storing the login information for each user in a group that can be algorithmically determined from his/her username.
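A minimal sketch of the algorithmic group assignment (the hash function and group count are illustrative choices, not the paper's specification):

```python
import hashlib

def login_group(username: str, n_groups: int) -> int:
    """Deterministically map a username to one of n_groups redundancy groups."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_groups

# Any client computes the same group without consulting a central directory:
assert login_group("alice", 1024) == login_group("alice", 1024)
```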

C.6 Environmental impact

RAIN's architecture is distributed and relies on nodes whose hardware is limited both in number of transistors and in power requirements. A key milestone will therefore be to study how a large-scale transition from a centralized to a distributed paradigm could affect the environmental impact of the Internet.

Here we report preliminary estimates and qualitative arguments for this infrastructure's energy efficiency.

C.6.1 Efficiency arguments

Low-energy nodes The power dissipation for a capacitance C over a voltage V that oscillates with a frequency ν is given by

P = ½ C V² ν

which defines an essential part of the dissipation of a digital computer: the dissipation due to the creation and removal of a bit of information. Single-board computers (like the Raspberry Pi), on which we plan to develop our infrastructure, are built from components optimized for low-energy performance (e.g., ARM processors are often used in cellular phone technologies (see e.g. [39] [40]), where battery life is of paramount importance and cooling is passive). The reduced clock speed and size (RISC processors require a smaller number of transistors) of the nodes on which we plan to develop our architecture could therefore make them more energy efficient than faster servers, the overall complexity of each task being equal.
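A hedged numerical illustration of the formula (all component values below are invented for the example, not measured):

```python
# Switching dissipation P = (1/2) * C * V^2 * f for an effective capacitance
# C, supply voltage V and clock frequency f. The capacitance, voltages and
# clocks below are illustrative assumptions, not benchmark data.

def switching_power(c_farads: float, v_volts: float, f_hertz: float) -> float:
    return 0.5 * c_farads * v_volts ** 2 * f_hertz

# Same effective capacitance; lower voltage (quadratic) and clock (linear):
server = switching_power(1e-9, 1.35, 3.2e9)  # hypothetical server core
sbc    = switching_power(1e-9, 1.2, 1.2e9)   # hypothetical low-energy node
print(server / sbc)
```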

Embedded energy Hardware manufacturing processes contribute significantly to the energy budget of Internet-mediated transactions. For example, [41] shows that the production process of a server represents nearly 20% of its overall lifetime energy cost. Moreover, as suggested in [42], servers are replaced at an increasing pace to supply the performance needed to offer up-to-date services, with an expected lifetime currently around 3 years.

Low-energy nodes have a smaller number of transistors and often rely on larger-scale technology (e.g., the 14 nm Skylake microarchitecture currently used to produce the Intel Xeon series vs the 40 nm technology of the Broadcom BCM2837, currently in use in the Raspberry Pi 3). These two effects, along with the reduced size of their components, could have a significant energy-saving impact on the manufacturing process.

Datacenter infrastructure It is estimated [43] that approximately 50% of the total power consumption of a datacenter is due to servers and storage, while the rest of the balance mainly includes cooling (~25%) and powering of the involved distribution network hardware (~20%).

Due to its distributed architecture, and to the low-energy nature of its nodes, our infrastructure would be passively cooled, and would use an infrastructure that is already in place to service citizens and organizations.

Long-range routing infrastructure The geographical distribution of the nodes in our network would be significantly more fine-grained than that of a traditional, centralized paradigm. By storing data on nodes that are geographically close to its users, it could be possible to reduce by orders of magnitude the average routing distance needed to provide access to the data.

For example, accessing Facebook from Bologna, Italy probably involves a communication with one of its datacenters in Europe, the closest being in Ireland [44], at a linear distance of more than 1500 km. Conversely, for a sufficiently deep service penetration, it is not unreasonable to assume that most of the personal data for a user in Bologna could be stored by a node in Bologna, thus reducing the routing distance to a few kilometers.

It is not easy, however, to estimate the energetic impact that this change of paradigm would have on the current routing infrastructure, which is optimized to service a centralized architecture. For example, it has been shown [45] that a significantly larger share of the energy expenditure in communication networks is due to access networks (i.e., the part of the routing infrastructure that directly provides access to the users) than to core networks (i.e., the faster routing infrastructure that carries data in long-range communications). Shorter-range communications could therefore be less optimized, not due to inherent properties of the routing technology, but due to optimization for a centralized paradigm.

Intensive data processing A centralized architecture needs a constant stream of revenue in order to be sustainable. As we have seen in Part A, some of the largest owners of datacenters make intensive use of the data collected through their services to generate profit. Independently of the privacy issues that this practice raises, it is easy to argue that it comes at a significant cost in terms of computing power.

A distributed architecture could be sustained locally by the same community that uses it. This would make it unnecessary to find different, privacy-invasive and CPU-consuming ways of processing data to, e.g., provide better-targeted advertisement services.

C.6.2 Node efficiency comparison

In order to determine an order-of-magnitude energy efficiency ratio between a single-board computer and a server, we performed a set of cryptographic benchmarks on a Raspberry Pi 3 single-board computer and on a Dell PowerEdge R815 server mounting 4 AMD Opteron 6328 8-core 3.2 GHz processors.

Remark: we chose to benchmark cryptographic operations because of the static nature of the content stored and distributed by our network: in a real-case scenario where the CPU cost of secure delivery of content over the Web is reduced to the cost of an HTTPS transaction, it was shown that 70% of the CPU cost was due to SSL processing [46].

Experimental procedure The benchmarks were run on the Raspberry Pi on a clean installation of Raspbian Jessie, and on the PowerEdge on a clean installation of Springdale Linux 6.

We used the benchmark software integrated with the OpenSSL library. Both platforms used OpenSSL 1.0.1e-fips. AES encryption was benchmarked with 256-bit keys in CBC mode.

The benchmark was run both in single-thread mode and in multi-thread mode, using all the available cores (4 and 32, respectively). 15 single-thread and 20 multi-threaded benchmarks were run on each platform.

Remark: during the test we observed a significant dependency of the speed of the Raspberry Pi on its temperature. No cooling measure was taken; the multi-threaded tests were looped repeatedly and the values taken when the speed stopped showing a decreasing trend.

Our results are as follows (all values are in MB/s):

          1 thread          > 1 thread
RPI       36.90 ± 0.02      96.9 ± 0.2
PEdge     175.26 ± 0.08     3981 ± 5

Values for the maximum-load power consumption of the Raspberry Pi and the PowerEdge were obtained from [47] and [48], respectively:

          Maximum load power
RPI       3.7 W
PEdge     563 W

These figures result in an encryption efficiency of 26.2 MB/J for the Raspberry Pi, against 7.1 MB/J for the PowerEdge.
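The efficiency figures follow directly from the two tables above (multi-threaded throughput divided by maximum-load power):

```python
# Reproducing the efficiency figures from the benchmark tables above:
# throughput at full load divided by maximum-load power gives MB/J.

def efficiency_mb_per_joule(throughput_mb_s: float, power_w: float) -> float:
    return throughput_mb_s / power_w

print(round(efficiency_mb_per_joule(96.9, 3.7), 1))   # Raspberry Pi: 26.2
print(round(efficiency_mb_per_joule(3981, 563), 1))   # PowerEdge:    7.1
```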

Obviously, more detailed investigations are needed to determine the energy and environmental impact of the proposed distributed data storage architecture, as well as how it compares to the current centralized data storage paradigm.

Part D

Potential Impact

A distributed, open-source, community-owned infrastructure for the storage, distribution and manipulation of information has multiple applications beyond the storage of personal and other sensitive user data. Here we propose some in the fields of Services, Finance and State Administration.

D.1 Services

RAIN's first goal is to reliably and securely store private data. This service can be offered in a scalable way, as its design involves connecting more nodes to the network as new users start using the service. A wide variety of critical services, however, could be hosted on RAIN's infrastructure once its adoption is deep enough. Here we propose some examples.

D.1.1 Messaging platform

Due to the simultaneous scaling of demand and supply (the more users use the service, the more nodes are connected to it) and to the relatively low bandwidth requirements (whose bottleneck is often determined by mobile phone connection speeds), messaging services are among the easiest additional services that could be implemented on top of a pre-existing storage and distribution infrastructure for personal data.

Indeed, classical storage villages could be used as a means to store outgoing messages, available for any recipient to retrieve. Messages could be end-to-end encrypted using the same strategy that guarantees the security of shared files. In principle, this could guarantee an uncircumventable secrecy of communications, while allowing our users to exchange messages and media without having to alter their usual user experience.

Real-world scenario: WhatsApp

Throughout 2016, WhatsApp's one billion users sent an average of 64 · 10^9 messages per day, and 1.6 · 10^9 images per day [49].

Using the online interface WhatsApp Web and a network inspector, we selected 50 random images from a WhatsApp conversation to determine how images are compressed when dispatched, and we measured an average image size of (111 ± 8) KB.

Overestimating a text message size at 1 KB, we therefore obtain a total service bandwidth of 241.6 TB/day, or 2.79 GB/s.

As a rough estimate, a secure messaging service equivalent in size to WhatsApp could therefore be hosted on RAIN's infrastructure by dedicating 1% of the upload bandwidth of 1.40 · 10^6 of its nodes (here we used Su = 0.2 MB/s; recall Section B.1).
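The estimate can be reproduced step by step from the figures quoted above:

```python
# Back-of-the-envelope reproduction of the WhatsApp estimate above.
messages_per_day = 64e9    # messages/day [49]
images_per_day = 1.6e9     # images/day [49]
image_size_kb = 111        # measured average image size
message_size_kb = 1        # deliberate overestimate for text messages

total_kb = messages_per_day * message_size_kb + images_per_day * image_size_kb
tb_per_day = total_kb / 1e9          # KB -> TB
gb_per_s = total_kb / 1e6 / 86400    # KB -> GB, per second

print(round(tb_per_day, 1), round(gb_per_s, 2))  # 241.6 TB/day, 2.8 GB/s

# Nodes needed when each dedicates 1% of Su = 0.2 MB/s upload bandwidth:
nodes = gb_per_s * 1e3 / (0.01 * 0.2)
print(round(nodes / 1e6, 2))  # ~1.4 million nodes
```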

D.1.2 Content Delivery Network

A personal data storage service involves storing a large amount of data with a rapidly decaying distribution of popularity: files are accessed only by those users with whom they were explicitly shared. Publicly available content follows a more heavy-tailed distribution of popularity [50].

In order to provide access to very popular data, a one-to-one redundancy strategy, where the number of copies of each file is tuned only to prevent data loss, is not sufficient.

A large enough network already deployed to store personal data could make use of the spare storage resources of its users and aid, e.g., the distribution of web pages with an unprecedented geographical extensiveness. Locally cached data would be faster and less expensive to distribute, and the long-range routing infrastructure would see its load reduced.

A peer-to-peer Web content delivery network is already under investigation by the IPFS project [51], relying for the distribution of content on the personal computers of the users accessing the content. A similar protocol could be implemented on top of RAIN's infrastructure with the following advantages:

• Uptime: dedicated, permanently online embedded computers would have a higher uptime than personal computers. This would significantly reduce the redundancy needed to make a file available to large audiences, as each node would be providing the same content for a more extended amount of time.

• Storage space: the protocol could make use of a node's spare storage space to cache Web content. That space would already be allocated by the user to mass storage purposes; therefore the protocol would not be in competition for resources with the rest of the user's environment.

• Computational resources: the protocol would run on dedicated machines instead of personal computers. The user experience would therefore be unaffected by the protocol.

Real-world scenario: Wikipedia

Throughout February 2017, the online encyclopedia Wikipedia had 585 · 10^6 pageviews [52], for a total size (late 2014) of 23 TB [53]. By using the Random article feature and a network inspector, we sampled the data transferred to load 300 random Wikipedia articles. The average data transfer per article is 243 KB, with a standard deviation of 195 KB. Under the approximation of a Gaussian distribution of sizes, the error on the average is 11 KB.

The above sums to 142 TB/month, or 54 MB/s. As a rough estimate, Wikipedia could therefore be hosted on RAIN's infrastructure by dedicating 1% of the upload bandwidth of 54 · 10^3 of its nodes (here we used Su as per Section B.1).

D.1.3 Social network

RAIN's infrastructure is designed to guarantee privacy-preserving sharing of files among its users, while organizing data in a way consistent with that of a social network. A social network preserving the network's cryptographic architecture could be seen as just an extension of the software already in place.

Having to manage only a limited number of accounts, each node in the network could store and process information dispatched by others, related to the users to which its owner is connected. Moreover, while a universal protocol could be established to distribute data across the network, the way contents are organized and offered to the user would depend on the software running on each user's node. This would allow, on the one hand, transparency, and on the other, the possibility of personalizing the organization of content, as each node could organize its user's content in a different way depending on the software it is running.

Real-world scenario: Facebook

As of 2014, the total storage capacity of Facebook, the largest social network in the world, was 300 PB [54]. User base growth data is available [55] and shows an approximately linear growth from 1.276 · 10^9 users during the first quarter of 2014 to 1.860 · 10^9 at the end of 2016. Under the assumption of a steady production of data by its users, we can extrapolate Facebook's current storage capacity by multiplying the old figure by the square of the users' ratio, obtaining approximately 650 PB.

Under the assumption of each node having a storage capacity of 1 TB, and dedicating 1% of the storage capacity of each node to hosting an equally large social network, approximately 65 · 10^6 nodes could host the same 1.86 · 10^9 social network accounts, while providing privacy, transparency and customizability to the service.

D.1.4 Search engine

Multiple projects (e.g., [56] [57]) exist to implement distributed crawling, indexing and data mining to provide independent, open-source search engine services. Distributed crawling would be a task especially suited for a network already providing CDN services, as the crawling and indexing could be performed in place by the same nodes hosting the content, saving the bandwidth needed to download the content and index it on a separate server. Multiple computing infrastructures³ already implement similar mechanisms, sending code to be locally executed on the machine hosting the data rather than moving the data to enable its processing on a separate service.

Real-world scenario: Google

As of 2016, Google's index comprises 130 · 10^12 distinct pages, for a total size of 100 PB [58]. Under the overestimate of a 10% daily update rate, a distributed network of 100 · 10^6 non-overlapping crawlers could keep the database updated with as few as 1.5 page requests per second per node. Storing the index in a Distributed Hash Table with 10× redundancy, the same number of nodes could store the whole index with an occupancy of 10 GB per node.

Such a large-scale distributed database, however, would likely require the design of new indexing and routing algorithms optimized for distributed systems to make the performance of a query comparable to that of a centralized paradigm. As of 2009, a query to the Google index involved an order of magnitude of 10^3 distinct servers [59], resulting in an extensive use of internal datacenter bandwidth whose performance cannot be matched by that of a distributed system.

³ For example, the WLCG (Worldwide LHC Computing Grid), which stores and processes the data produced by CERN's Large Hadron Collider, runs data analysis programs directly on the nodes storing the data, as it is easier to move programs than the data they need.

D.2 Cryptocurrency

As we said in Section C.2.3, a large enough set of nodes with a relatively high uptime, organized in a network where the number of daemons per node is limited by a hardware commitment, and managed for some critical aspects by a trustworthy or verifiable publicly available random beacon, can be used to implement verifiable bookkeeping of a distributed ledger by means of authenticated hash tables.

By means of authenticated hash tables, a set of account balances can be efficiently and securely stored, thus implementing a cryptocurrency with the following advantages:

• Due to the low communication and CPU complexity of the process, it could address the issue of the environmental impact of cryptocurrencies (like Bitcoin [60]) whose security relies on the constant solution of CPU-intensive cryptographic problems [13].

• Its constant persistent-space complexity would guarantee scalability: while the most commonly used cryptocurrencies require each user to store the list of all the transactions that ever occurred on the network, the memory requirement for this one would not increase with time. Indeed, authenticated hash tables would store the state of the ledger, rather than the list of all its updates: while T transactions involving, e.g., the same two accounts in a blockchain-based cryptocurrency would result in a permanent storage requirement of T new entries in the blockchain for all the nodes in the network, in an AHT-based cryptocurrency those updates would just successively change the value of the same field.

• Cryptocurrencies based on mining encourage the formation of mining pools (namely, groups of nodes that join their computational power in order to solve cryptographic problems more efficiently). However, it has been shown [61] [62] that a coordinated group of users controlling the absolute majority, e.g., of the Bitcoin network's computing power could disrupt its security. This is far from being a theoretical weakness: for example, a single mining pool reached 50% of Bitcoin's total mining power in June 2014 [63]. A cryptocurrency where mining pools are not rewarded would help avoid this critical scenario, while addressing the democracy issue inherent in the high entry cost of cost-efficient mining hardware [64].

D.2.1 Transaction syntax

Authenticated hash tables allow any node with O(1) previous knowledge of the status of the hash table to verify that an edit has been made to it. By grouping one or more edits together, it is possible to define a set of legal transactions and, by recursion, a legal hash table as an empty hash table or any hash table obtained from a legal transaction on a legal hash table.

For example, a basic cryptocurrency protocol could involve the following transactions:

Signup: a new entry is added to the hash table. The key is a user name; the content includes an initial balance of zero and the user's public key. The transaction includes the proof of the addition of the entry to the hash table, and a signature with the user's public key.

Payment: a user A pays another user B some amount. The transaction involves a proof of the existence of A, a proof of the existence of B, the balance of A, the balance of B, the public key of A, and a signature with the public key of A describing the transaction. The proof also includes an edit of A's entry (reducing its balance by the amount of the transaction) and an edit of B's entry (increasing its balance by the amount of the transaction). The transaction is accepted if both users exist, if A has enough money in its balance, and if a valid signature is provided.
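The payment rule above can be sketched as follows (a toy in-memory ledger; signature verification and hash-table proofs are replaced by placeholders, and all names are illustrative):

```python
# Toy sketch of the payment transaction rule: both accounts must exist,
# the sender must cover the amount, and the sender's signature must verify.
# The real protocol would check AHT proofs instead of a local dict.

def apply_payment(ledger: dict, sender: str, recipient: str,
                  amount: int, signature_valid: bool) -> bool:
    """Apply a payment iff the acceptance conditions hold; return success."""
    if sender not in ledger or recipient not in ledger:
        return False
    if not signature_valid or ledger[sender]["balance"] < amount:
        return False
    ledger[sender]["balance"] -= amount
    ledger[recipient]["balance"] += amount
    return True

ledger = {"A": {"balance": 10}, "B": {"balance": 0}}
assert apply_payment(ledger, "A", "B", 7, signature_valid=True)
assert not apply_payment(ledger, "A", "B", 7, signature_valid=True)  # overdraft
print(ledger["B"]["balance"])  # 7
```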

In this paradigm, the initial currency distribution process could be uniform in time (e.g., every account receives some amount per time unit, the time being measured, e.g., in number of transactions) or exponential (i.e., each account receives proportionally to its balance). This eliminates the need for energetically expensive mining by replacing it with minting: currency is initially distributed to all the nodes that take part in ensuring the security of the network, e.g., by storing proofs of persistent storage and double-checking the stream of updates received by the network. Note how neither activity is CPU or energy intensive.


Since authenticated hash tables can be used to provably add, retrieve and remove entries from a hash table, depending on the set of legal transactions one defines, it is possible to implement a wide variety of financial constructs in the cryptocurrency.

D.2.2 Universal transaction syntax

As we have seen, a globally accepted transaction syntax could allow a set of peers offering a hardware commitment to check a stream of updates to a distributed ledger. This allows us to easily define universal syntaxes. Consider, for example, the case of a syntax where:

• A set of fields in the distributed ledger defines a finite set of rules, e.g. in the form of regular expressions.

• All updates are sent along with the proof of the rule they are applying. Every node can verify that the provided rule actually lies in the ledger, and check that the rule is respected by the update.

• Rules can exist to update every field, including those that store rules. For example, one or more rules can define a distributed voting mechanism to update the fields storing the rules themselves. This would allow users in the network to define and arbitrarily update the transaction syntax without having to alter the software that checks that it is respected.
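
The rule-checking step described in the list above can be sketched as follows; the ledger layout and rule identifiers are hypothetical:

```python
# Illustrative sketch: rules live in the ledger as regular expressions,
# and every update must cite the rule it claims to satisfy.
import re

ledger = {
    "rule:balance": r"^balance:[A-Z]=\d+$",  # balances: non-negative integers
    "rule:rule": r"^rule:\w+=\^.*\$$",       # rules themselves can be updated
}

def check_update(rule_id, update):
    # 1. the cited rule must actually lie in the ledger
    pattern = ledger.get(rule_id)
    if pattern is None:
        return False
    # 2. the update must respect the rule
    return re.fullmatch(pattern, update) is not None

assert check_update("rule:balance", "balance:A=150")
assert not check_update("rule:balance", "balance:A=-3")
# even a rule update must satisfy the rule-updating rule:
assert check_update("rule:rule", r"rule:balance=^balance:[A-Z]=\d+$")
```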

D.2.3 Taxation system

Some cryptocurrencies already implement, or plan to implement, transaction fees as a form of redistribution of wealth after the mining resources have been exhausted.

Instead of implementing a fee for accepting a block, thus splitting the value of the fee among multiple accounts, one or more community accounts could be included in the cryptocurrency. The syntax of a transaction could involve, for example, paying a small fraction of the amount to a community account.

Since each transaction would be verified by the whole community, tax evasion would be infeasible: our cryptocurrency would implement taxes in the very language used for transactions. Instead of putting the community accounts under the control of any trusted third party, they could be managed by distributed, verifiable algorithms.
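
A minimal sketch of such a taxed transaction, assuming an illustrative 2% rate and a single community account (both assumptions, not part of the proposal):

```python
# Sketch of the taxed-transaction syntax described above: the tax is part
# of the transaction language itself, so every verifying peer enforces it.
TAX_RATE = 0.02  # illustrative fraction routed to the community account

def taxed_payment(ledger, sender, receiver, amount):
    tax = amount * TAX_RATE
    if ledger[sender] < amount + tax:
        raise ValueError("insufficient balance to cover payment plus tax")
    ledger[sender] -= amount + tax
    ledger[receiver] += amount
    ledger["community"] += tax  # enforced by every verifier, not a third party
    return ledger

ledger = {"alice": 100.0, "bob": 0.0, "community": 0.0}
taxed_payment(ledger, "alice", "bob", 50.0)
assert ledger["bob"] == 50.0 and ledger["community"] == 1.0
```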

Sponsoring FOSS, Creative Commons & free knowledge

In Section D.1.2, we described how a Content Delivery Network could be hosted on top of a large enough personal data storage network like RAIN. Combining that possibility with a taxable cryptocurrency integrated in the same network could have remarkable implications in terms of how knowledge is produced and made available.

For example, authors of Free and Open Source Software, Creative Commons media and free knowledge in general could make their content available via RAIN's distribution infrastructure. A distributed, verifiable algorithm could then monitor how often each piece of content is accessed and used, and use funds from the community account to appropriately reward its authors.

The possibility for authors to be sponsored by the community in the production of freely accessible content would significantly contribute to moving past a paradigm where content is either sold (and mostly brokered by progressively useless third parties that profit on its distribution) or made available for free on a volunteer basis.

D.2.4 Privacy vs taxability

Among the arguably most significant factors limiting the large-scale adoption of cryptocurrencies as a viable financial instrument is the concern that the anonymity they provide would make it simpler to evade taxes [65] [66] and to fund illegal [67] [68] and terrorist [69] activities.

In particular, in the case of cryptocurrencies where all accounts are anonymous and new accounts can be easily opened, the issue of fair taxation is highly significant. Most taxation systems, in fact, implement a progressive mechanism where those who earn more have to pay a larger percentage of their income in tax. In a context where there is no way to trace accounts back to their users, there is no way to implement this, as any user can easily split his/her resources across a large number of anonymous alias accounts under his/her control.

On the other hand, we want to make sure to protect the privacy of the transactions of each user. In order to address the problem posed by the trade-off between anonymity and taxability/traceability, we have developed a toy model cryptocurrency based on ADHT (Authenticated Distributed Hash Tables) that, if proven to be secure, could offer a good compromise between the two.


Goal of the model In this model, we have a set of citizens that make use of an ADHT to implement a cryptocurrency, and an authority (their government). Citizens want their privacy safeguarded from mass surveillance programs, but they also want to make sure that everyone pays a fair tax, and that, under certain conditions (e.g., provided with a warrant), the authority will be able to uncover the transactions of any specific user.

The transaction syntax Our model will be simplified in that the amount of tax paid by each account will be a function of the amount of currency received by that account during, e.g., the previous month.

We will use the following transaction syntax for our cryptocurrency (see D.2.1) in order to provide all the properties described in the previous paragraph:

Signup: in order to open an account, a user will need to first receive an identifier from the authority. The authority will require the user to provide his/her personal details in order to receive an identifier. The identifier will be signed by the authority and will not publicly disclose the identity of the user. The authority, however, will know which identifier belongs to which user.

Splicing: two users can create two ephemeral accounts and transfer to them a matching amount of currency. The transaction will only be accepted if signed by both the users and a set of witnesses (namely, owners of non-ephemeral accounts), randomly selected by a global random beacon.

A splicing transaction will occur in this form:

$$(a,\, b) \xrightarrow[(a,\, b,\, w_0,\, \ldots,\, w_{K-1})]{n} (c,\, d)$$

where a and b are accounts, n identifies an amount of currency, and those indicated under the arrow are the signatories of the transaction. Both a and b will spend n units of currency, and two new accounts, c and d, will be opened with n units of currency in them. Only the signatories will know which ephemeral account belongs to which user (namely, whether a's owner is also c's owner or a's owner is also d's owner).

An ephemeral account cannot receive money, and it can only make one operation involving its whole balance before being closed, i.e., it can be used to create a new ephemeral account or to make a payment to a non-ephemeral account (see next transaction). Ephemeral accounts also expire: if they are not used for longer than a certain time limit, they are locked and absorbed into a community account.

Payment: as in our first cryptocurrency example, any account can make a payment, but only non-ephemeral accounts can receive a payment. When a non-ephemeral account receives a payment, it will publicly add the corresponding amount to a log that will then be used to pay the necessary tax. This allows progressive taxation: accounts are nominal, but transactions are obfuscated, so an account that receives more can be required to pay more.

Ratting out: when provided with a warrant signed by the authority, a witness to a splicing transaction will publicly log the correspondence between input and output accounts involved in the transaction, encrypted with the authority's public key.
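
The splicing rule can be sketched as follows; signature verification is abstracted to set membership, and all account and witness names are illustrative:

```python
# Hypothetical sketch of the splicing rule: both input owners and the K
# witnesses must sign, and n units move from (a, b) into fresh ephemeral
# accounts (c, d). Real signature checks are abstracted to a `signed_by` set.
def apply_splicing(ledger, a, b, n, c, d, witnesses, signed_by):
    required = {a, b} | set(witnesses)
    if not required <= signed_by:       # every required signatory signed
        return False
    if ledger[a] < n or ledger[b] < n:  # both inputs must cover n
        return False
    ledger[a] -= n
    ledger[b] -= n
    ledger[c] = n                       # new ephemeral accounts
    ledger[d] = n
    return True

ledger = {"a": 10, "b": 10}
ok = apply_splicing(ledger, "a", "b", 5, "c", "d",
                    witnesses=["w0", "w1"],
                    signed_by={"a", "b", "w0", "w1"})
assert ok and ledger == {"a": 5, "b": 5, "c": 5, "d": 5}
```

Only the signatories learn whether c belongs to a's owner or to b's owner; the ledger records just the transition.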

In this model, while the authority indeed has a complete correspondence table between citizens and accounts, each account can make anonymous transactions by means of a sequence of splicing transactions.

Let N be the total number of accounts in the cryptocurrency. At each splicing transaction, a new ephemeral account is opened and the indetermination on its owner doubles: in the limit 2^n ≪ N, each ephemeral account can be tracked back to O(2^n) non-ephemeral accounts, n being the number of splicing transactions that have contributed to the generation of the ephemeral account.
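
The growth of the anonymity set can be illustrated numerically; this is a toy model of the O(2^n) claim above, capped at the total population N:

```python
# Toy illustration of anonymity-set growth: each splicing round pairs
# accounts, so the set of candidate original owners of any resulting
# ephemeral account doubles, while 2**n remains well below N.
def anonymity_set_size(n_splices, n_accounts):
    # doubles per splice, capped at the total account population
    return min(2 ** n_splices, n_accounts)

N = 1_000_000
assert anonymity_set_size(0, N) == 1      # a fresh account is fully traceable
assert anonymity_set_size(10, N) == 1024  # after 10 splices: 2^10 candidates
assert anonymity_set_size(30, N) == N     # growth saturates at N
```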

The existence of witnesses, however, guarantees that under certain circumstances (e.g., provided with a warrant) the authority can force a witness into uncovering which account generated which during a splicing transaction. This allows it, if needed, to successively track a transaction back to the non-ephemeral account that generated it. However, since the result of the operation is publicly logged by the witness, the authority will have no way of implementing an undercover mass surveillance program: each ratting out transaction is visible to all the citizens, who can then decide to limit the power of the authority, or select a new one.

D.2.5 Peer to peer lending & microcredit

Lending has traditionally been mediated by financial institutions that lend private citizens' financial resources stored in bank accounts.


It is easy to see, however, that loans can be implemented in the syntax of a cryptocurrency based on authenticated hash tables. A peer to peer lending and microcredit platform could therefore be integrated in the cryptocurrency without the need for third party mediation. A similar paradigm is implemented by crowdfunding platforms, which still act as lending mediators but grant users control over what their money is used for.
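
As a sketch of how a loan could be expressed as ledger edits, with the field names and repayment terms entirely hypothetical:

```python
# Hypothetical loan construct expressed as ledger edits (not the paper's
# actual syntax): opening a loan records principal, rate, and due date
# alongside the transfer, so repayment can later be verified by any peer.
def open_loan(ledger, lender, borrower, principal, rate, due):
    if ledger[lender]["balance"] < principal:
        return False
    ledger[lender]["balance"] -= principal
    ledger[borrower]["balance"] += principal
    ledger[borrower].setdefault("loans", []).append(
        {"lender": lender, "owed": principal * (1 + rate), "due": due})
    return True

ledger = {"lender": {"balance": 500.0}, "borrower": {"balance": 0.0}}
assert open_loan(ledger, "lender", "borrower", 200.0, 0.05, due="2025-12")
assert ledger["borrower"]["balance"] == 200.0
assert round(ledger["borrower"]["loans"][0]["owed"], 2) == 210.0
```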

In order to address the limited economic dynamicity that could stem from relying on private citizens to actively loan their resources, publicly verifiable, democratic lending pools could be formed to cooperatively select whose loans to grant.

Automatic relending options could also be provided by the cryptocurrency, since not only data, but also scripts can be stored in authenticated hash tables and executed in a verifiable way at appropriate times.
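
One possible shape of such an automatic relending script, entirely illustrative: when funds return to a pool, queued loan requests are granted in order while the pool can cover them.

```python
# Illustrative relending sketch: reallocate a pool's idle balance to the
# next approved loan requests in a FIFO queue.
from collections import deque

def relend(pool_balance, requests, min_keep=0.0):
    """Grant queued (borrower, amount) requests while the pool covers them."""
    granted = []
    while requests and pool_balance - requests[0][1] >= min_keep:
        borrower, amount = requests.popleft()
        pool_balance -= amount
        granted.append((borrower, amount))
    return pool_balance, granted

requests = deque([("ada", 40.0), ("ben", 70.0), ("eva", 10.0)])
balance, granted = relend(100.0, requests)
assert granted == [("ada", 40.0)]  # ben's 70 exceeds the remaining 60
assert balance == 60.0
```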

D.3 Discussion

We have introduced a distributed communication and data storage architecture that provides a viable alternative to the growing centralized data storage solutions.

Initially (Part B) we demonstrate the feasibility of the proposed architecture. (a) We show that expected data lifetimes of the same order of magnitude as the age of the Earth (∼7·10^8 years) can easily be obtained using Reed-Solomon redundancy strategies with a redundancy factor of 1.5 using 35 data storage nodes. (b) Based on measurements of end-user grade Internet connectivity and Raspberry Pi data storage nodes, we demonstrate that the RAIN architecture's performance easily matches or exceeds that of the central data storage paradigm. We measured average data upload speed (∼17 Mbit/s) and download speed (∼34 Mbit/s), as well as estimated data recovery bandwidth (∼12 KB/s) and downtime probability (∼10^−9), matching or exceeding the existing centralized data storage paradigm. (c) Finally, the end-user costs for getting access to the proposed RAIN data storage architecture are significantly lower than the costs based on the centralized data storage paradigm.

Secondly (Part C) we define the scientific and technical challenges for designing data privacy and security, network organization, as well as data, metadata and credential handling. In this part we also sketch the energy and environmental issues for the proposed distributed architecture and how it differs from large-scale server centers. This part outlines the necessary steps involved in developing the RAIN data storage architecture.

Finally (Part D) we provide an overview and discuss the potentially disruptive impact of a distributed, privacy- and security-by-design, community-owned digital infrastructure. We believe a discussion of potential impact is also an important component of the architecture design. In reflecting on the impact we in particular need to evaluate the critical connection between, on the one hand, cyberprivacy and security, and, on the other hand, the potential political power structure within a modern society.

RAIN could support the development of communitarian services including telecommunication, content delivery, cryptocurrency, and distributed administration (nation state and regional governmental), which currently are services managed in a centralized manner through trusted third parties. Implementation of a RAIN-style architecture could thus distribute power from global centralized trusted third parties to local citizens and businesses, while at the same time presumably reducing the significant energy requirement and resulting CO2 impact of centralized data storage.

Acknowledgements
We are grateful for partial financial support from the European Commission sponsored SYNENERGENE project as well as for constructive discussions with and suggestions from Sofia Farina, Per Odling, Leif Rasmussen and Piper Stover. Lucinda Voldsgaard is acknowledged for proofreading of the manuscript.

References

[1] A. Vaughan, "How viral cat videos are warming the planet," The Guardian. https://www.theguardian.com/environment/2015/sep/25/server-data-centre-emissions-air-travel-web-google-facebook-greenhouse-gas.

[2] T. Bawden, "Global warming: Data centres to consume three times as much energy in next decade, experts warn," Independent. http://www.independent.co.uk/environment/global-warming-data-centres-to-consume-three-times-as-much-energy-in-next-decade-experts-warn-a6830086.html.

[3] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., 1999.


[4] "BOINC - open-source software for volunteer computing." http://boinc.berkeley.edu/.

[5] M. M. Gaber, F. Stahl, and J. B. Gomes, Pocket Data Mining - Big Data on Small Devices, vol. 2 of Studies in Big Data. Springer International Publishing.

[6] S. A. et al., "Mobile-edge computing – introductory technical white paper." https://portal.etsi.org/Portals/0/TBpages/MEC/Docs/Mobile-edge_Computing_-_Introductory_Technical_White_Paper_V1%2018-09-14.pdf.

[7] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, pp. 300–304, 1960.

[8] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books.

[9] A. Loewenstern and A. Norberg, "DHT protocol." http://www.bittorrent.org/beps/bep_0005.html.

[10] H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Looking up data in P2P systems," Commun. ACM.

[11] R. C. Merkle, "A digital signature based on a conventional encryption function," Advances in Cryptology - CRYPTO 1987, p. 369, 1988.

[12] A. Miller, M. Hicks, J. Katz, and E. Shi, "Authenticated data structures, generically," SIGPLAN Not., vol. 49, no. 1, pp. 411–423.

[13] S. Nakamoto, "Bitcoin: A peer-to-peer electronic cash system." https://bitcoin.org/bitcoin.pdf.

[14] L. Ren and S. Devadas, "Proof of space from stacked expanders," Cryptology ePrint Archive, 2016.

[15] S. Dziembowski, S. Faust, V. Kolmogorov, and K. Pietrzak, "Proofs of space," Advances in Cryptology - CRYPTO 2015, pp. 585–605, 2015.

[16] B. Beach, "How long do disk drives last?." https://www.backblaze.com/blog/how-long-do-disk-drives-last/.

[17] G. Gasior, "The SSD endurance experiment: They're all dead." http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead.

[18] M. M. et al., "State of the internet report / Q3 2016 report." https://www.akamai.com/us/en/our-thinking/state-of-the-internet-report/.

[19] "Speedtest market reports." http://www.speedtest.net/reports/.

[20] V. Strassen, "Gaussian elimination is not optimal," Numerische Mathematik, vol. 13, pp. 354–356, 1969.

[21] R. Williams, "Matrix-vector multiplication in sub-quadratic time: (some preprocessing required)," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 995–1001, Society for Industrial and Applied Mathematics, 2007.

[22] E. Pinheiro, W.-D. Weber, and L. A. Barroso, "Failure trends in a large disk drive population," in 5th USENIX Conference on File and Storage Technologies, FAST 2007, 2007.

[23] B. Schroeder and G. A. Gibson, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?," in Proceedings of the 5th USENIX Conference on File and Storage Technologies, FAST 2007, USENIX Association, 2007.

[24] G. Greenwald and E. MacAskill, "NSA Prism program taps in to user data of Apple, Google and others." https://www.theguardian.com/world/2013/jun/06/us-tech-giants-nsa-data.

[25] G. Greenwald, E. MacAskill, L. Poitras, S. Ackerman, and D. Rushe, "Microsoft handed the NSA access to encrypted messages." https://www.theguardian.com/world/2013/jul/11/microsoft-nsa-collaboration-user-data.

[26] "Attack model." https://en.wikipedia.org/wiki/Attack_model.

[27] S. Wilkinson, T. Boshevski, J. Brandof, J. Prestwich, G. Hall, P. Gerbes, P. Hutchins, and C. Pollard, "Storj - a peer-to-peer cloud storage network." https://storj.io/storj.pdf.


[28] S. A. Crosby and D. S. Wallach, "Authenticated dictionaries: Real-world costs and trade-offs," ACM Trans. Inf. Syst. Secur., vol. 14, no. 2, pp. 17:1–17:30.

[29] J.-P. Martin and L. Alvisi, "Fast byzantine consensus," IEEE Trans. Dependable Secur. Comput., vol. 3, no. 3, pp. 202–215.

[30] "NIST randomness beacon." https://www.nist.gov/programs-projects/nist-randomness-beacon.

[31] A. K. Lenstra and B. Wesolowski, "A random zoo: sloth, unicorn, and trx," IACR ePrint Archive.

[32] "libp2p - modular peer-to-peer networking stack." https://github.com/libp2p.

[33] P. A. Bernstein and N. Goodman, "The failure and recovery problem for replicated databases," in Proceedings of the Second Annual ACM Symposium on Principles of Distributed Computing, PODC '83, pp. 114–122, 1983.

[34] Y. Saito and M. Shapiro, "Optimistic replication," ACM Comput. Surv., vol. 37, no. 1, pp. 42–81.

[35] Y. Dodis and N. Fazio, Public Key Broadcast Encryption for Stateless Receivers, pp. 61–80. Springer Berlin Heidelberg.

[36] C. Percival, "Stronger key derivation via sequential memory-hard functions."

[37] M. Gibson, M. Conrad, and C. Maple, "Evaluating the effectiveness of image-based password design paradigms using a newly developed metric." http://perisic.com/preprints/EvaluateImagePasswordMetrics.pdf.

[38] R. Pond, J. Podd, J. Bunnell, and R. Henderson, "Word association computer passwords: The effect of formulation techniques on recall and guessing rates," Computers & Security, vol. 19, no. 7, pp. 645–656, 2000.

[39] "iPhone 7 - technical specifications." http://www.apple.com/iphone-7/specs/.

[40] "Samsung Galaxy S7," Wikipedia, the free encyclopedia, 2017. [Online; accessed Apr. 7, 2017].

[41] K. Craig-Wood and P. Krause, "Towards the estimation of the energy cost of internet mediated transactions," Preliminary version of a technical report of the EEC (Energy Efficient Computing) SIG of the ICT KTN (Knowledge Transfer Network), 2013.

[42] J. Edwards, "New technologies mean shorter server life cycles."

[43] M. Dayarathna, Y. Wen, and R. Fan, "Data center energy consumption modeling: A survey," IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 732–794, 2016.

[44] "Clonee data center - Facebook." https://www.facebook.com/CloneeDataCenter/.

[45] S. Lambert, W. V. Heddeghem, W. Vereecken, B. Lannoo, D. Colle, and M. Pickavet, "Worldwide electricity consumption of communication networks," Opt. Express, vol. 20, no. 26, pp. B513–B524.

[46] L. Zhao, R. Iyer, S. Makineni, and L. Bhuyan, "Anatomy and performance of SSL processing," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005, ISPASS '05, pp. 197–206, IEEE Computer Society, 2005.

[47] "Power consumption." https://www.pidramble.com/wiki/benchmarks/power-consumption.

[48] "Power efficiency comparison of the Dell PowerEdge R815 and HP ProLiant DL585 G7 rack servers." http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-2u-R815-versus-HP-4u-DL585.pdf.

[49] "How many text and photo messages are sent using the top messaging apps globally each day?." https://askwonder.com/q/how-many-text-and-photo-messages-are-sent-using-the-top-messaging-apps-globally-each-day-57738fa9b3159159003df760.

[50] L. A. Adamic and B. A. Huberman, "Zipf's law and the internet," Glottometrics, vol. 3, pp. 143–150, 2002.

[51] J. Benet, "IPFS - content addressed, versioned, P2P file system." https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf.

[52] "Page views for Wikipedia, both sites, normalized." https://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm.


[53] Wikipedia, "Wikipedia: Size of Wikipedia — Wikipedia, the free encyclopedia." http://en.wikipedia.org/w/index.php?title=Wikipedia%3ASize%20of%20Wikipedia&oldid=775138473, 2017. [Online; accessed 13-April-2017].

[54] P. Vagata and K. Wilfong, "Scaling the Facebook data warehouse to 300 PB."

[55] "Number of monthly active Facebook users worldwide as of 4th quarter 2016 (in millions)." https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/.

[56] "Majestic-12: distributed search engine." https://www.majestic12.co.uk/.

[57] "YaCy - the peer to peer search engine." http://yacy.net/en/index.html.

[58] "How search works - the story."

[59] J. Dean, "Challenges in building large-scale information retrieval systems."

[60] K. J. O'Dwyer and D. Malone, "Bitcoin mining and its energy footprint," in 25th IET Irish Signals and Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014), pp. 280–285.

[61] Andes, "Bitcoin's kryptonite: The 51% attack." https://bitcointalk.org/index.php?topic=12435.

[62] I. Eyal and E. G. Sirer, "Majority is not enough: Bitcoin mining is vulnerable," 2013. https://arxiv.org/abs/1311.0243.

[63] R. Gill, "Cex.io slow to respond as fears of 51% attack spread," Coindesk. http://www.coindesk.com/cex-io-response-fears-of-51-attack-spread/.

[64] S. Valfells and J. H. Egilsson, "Minting money with megawatts," in Proceedings of the IEEE, vol. 104, pp. 1674–1678, 2016.

[65] O. L. Bateman, "Bitcoin might make tax havens obsolete," Motherboard. https://motherboard.vice.com/en_us/article/bitcoin-might-make-tax-havens-obsolete.

[66] E. Zwirn, "No, you can't avoid taxes by investing in bitcoin," New York Post. http://nypost.com/2017/04/08/no-you-cant-avoid-taxes-by-investing-in-bitcoin/.

[67] J. Bearman, "The untold story of Silk Road, part 1," Wired. https://www.wired.com/2015/04/silk-road-1/.

[68] J. Bearman, "The untold story of Silk Road, part 2: The fall," Wired. https://www.wired.com/2015/05/silk-road-2/.

[69] S. Higgins, "ISIS-linked blog: Bitcoin can fund terrorist movements worldwide," Coindesk. http://www.coindesk.com/isis-bitcoin-donations-fund-jihadist-movements/.
