Hive2Hive — An Open-Source Library for Distributed File Synchronization and Sharing

Master Project Report

Sebastian Golaszewski, Christian Lüthold, Nico Rutishauser
Zurich, Switzerland
Student ID: 09-911-983, 09-714-981, 09-706-821

Supervisor: Dr. Thomas Bocek, Andri Lareida, Prof. Dr. Burkhard Stiller
Date of Submission: March 7, 2014

University of Zurich
Department of Informatics (IFI)
Binzmühlestrasse 14, CH-8050 Zürich, Switzerland

Communication Systems Group, Prof. Dr. Burkhard Stiller

Master Project Report
Communication Systems Group (CSG)
Department of Informatics (IFI)
University of Zurich
Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
URL: http://www.csg.uzh.ch/


Zusammenfassung (Summary)

Although peer-to-peer systems have lost importance in recent years, they account for a large part of the total data traffic on the Internet. The largest share of this traffic stems from file sharing activities, which are made possible by numerous decentralized services. Another observable trend is the growing number of Internet-capable devices. Since users usually own several devices, there is an increasing demand for synchronization mechanisms. To meet this growing and ubiquitous need, a wide variety of services have become established in recent years; most of them are based on the centralized client-server approach and store data in external data centers. In addition, most services store personal and private data unencrypted, so users do not know who else has access to their data. Besides this loss of control on the part of the users, centralized systems suffer from the single-point-of-failure property and are therefore vulnerable to targeted attacks.

The goal of this project is the development of an open-source Java library that solves the various tasks of data synchronization and sharing in a decentralized manner. Furthermore, security for users and data should be kept as high as possible. Another fundamental goal is to offer users the same ease of use that they know from centralized services, while exploiting the advantages of a distributed peer-to-peer system.

The Hive2Hive project is the corresponding implementation of these requirements. Hive2Hive is a free and easy-to-use library that supports all required user and file operations. Complete decentralization is achieved through the use of a distributed hash table. The library takes full responsibility for interactions with the network and thus provides the necessary level of abstraction. In addition, its design allows the easy development of extensions for services that want to benefit from the distributed properties on the one hand and complement the basic functionality on the other. To document Hive2Hive extensively, an illustrative website has been created and published.


Abstract

Although peer-to-peer systems have attracted less attention over the last years, they still account for a large part of overall Internet traffic. Most of this traffic is related to file sharing activity that is made possible by numerous decentralized services. Another observable trend is the increasing number of devices connected to the Internet. Since users tend to possess multiple devices, the need for synchronization mechanisms is rising. Various services have emerged in the last couple of years to address this pervasive user requirement. However, these solutions are mostly based on centralized client-server approaches and thus store all user data in external data centers. There, personal and private data is often stored unencrypted, so owners cannot know who else might have access to it, which causes a loss of user control. Furthermore, such centralized systems suffer from the single-point-of-failure property and hence are vulnerable to targeted attacks.

The mission of this project is to develop an open-source Java library for distributed file synchronization and sharing that ensures maximum user and data security. The fundamental ambition is to offer the same user experience that is already known from popular, centralized services, while profiting from the advantages of a decentralized peer-to-peer system.

The resulting Hive2Hive project represents the corresponding implementation. It is a free and easy-to-use library that supports all required user and file operations. Total decentralization is achieved by building functionality on top of a distributed hash table. The library takes full responsibility for all network interaction and thus provides the necessary level of abstraction. Moreover, the Hive2Hive library is designed to be easily extendable by other services that intend to profit from decentralized properties and, at the same time, enrich its set of supported operations. To provide fundamental documentation and guidance for the library, an appealing project website has been created and published.


Acknowledgments

We would like to thank several people for their support in the realization of this master project. First of all, we express our deep gratitude to our supervisor, Dr. Thomas Bocek, for his competent assistance, enthusiasm and ever cooperative interaction. We would also like to thank our assistant supervisor, Andri Lareida, for his feedback and advice. Furthermore, we sincerely thank Prof. Dr. Burkhard Stiller for enabling this project within the Communication Systems Group. Last but not least, we thank our significant others and our families for their ongoing encouragement throughout the project.


Contents

Zusammenfassung
Abstract
Acknowledgments

1 Introduction
  1.1 Motivation
  1.2 Description of Work
  1.3 Outline

2 Related Work

3 Proposed Solution
  3.1 Layer Roles
  3.2 Underlying Concept
    3.2.1 Location Keys
    3.2.2 UserProfile
    3.2.3 FileIndex
    3.2.4 MetaFile and Chunk
    3.2.5 FolderIndex
    3.2.6 UserPublicKey
    3.2.7 Locations and Notifications
    3.2.8 UserProfileTask
    3.2.9 Content Key
  3.3 Security Aspects
    3.3.1 User Protection
    3.3.2 Encryption
    3.3.3 Authenticity
  3.4 P2P Aspects
    3.4.1 Put Version Conflicts
    3.4.2 Concurrent Modification
    3.4.3 Garbage Collection
  3.5 Optimizations
  3.6 Demonstration Client
  3.7 Testing

4 Conclusions
  4.1 Conclusion
  4.2 Lessons Learned
  4.3 Future Work

Bibliography
List of Figures
List of Tables

A Additional Documentation
  A.1 The Name Hive2Hive
  A.2 Library Setup
  A.3 Developer API
    A.3.1 Configuration
    A.3.2 Peer Creation
    A.3.3 User Management
    A.3.4 File Management
    A.3.5 Process Interaction
  A.4 Implemented Functionality
    A.4.1 Register
    A.4.2 Login
    A.4.3 Logout
    A.4.4 Add
    A.4.5 Modify
    A.4.6 Delete
    A.4.7 Move
    A.4.8 Recover
    A.4.9 Share
  A.5 Process Framework
    A.5.1 Requirements and Overview
    A.5.2 Process States
    A.5.3 Process Listeners
    A.5.4 Rollback
    A.5.5 Asynchronous Process Components
    A.5.6 Computing Processes
    A.5.7 Process Factory
  A.6 Class Diagrams

B Contents of the CD


Chapter 1

Introduction

This chapter begins with the motivation for this project, followed by a description of the objectives that are pursued. Before going into the proposed solution, a brief outline of this document is given.

1.1 Motivation

Although the term peer-to-peer (P2P) has attracted less attention over the past years, P2P systems are still popular and account for a large portion of Internet traffic [4][8]. Most of this P2P traffic is related to file sharing, which is made feasible by applications like BitTorrent (BT) [1], but new types of P2P services are also emerging. Another trend that can be observed is the increasing number of devices connected to the Internet [7], of which mobile devices account for a large part [6]. Synchronization between such devices therefore becomes more important, since users tend to possess more than one device on which data is created, accessed and modified. Many centralized systems, such as Dropbox¹ or Google Drive², offer data transfer solutions that enable multiple devices to synchronize their data. However, users are bound to the respective pricing and terms of service. Another downside for users is the loss of control over their personal and private data when uploading to one of these centralized solutions. Furthermore, recent events, such as the forced shutdown of the MegaUpload³ platform, have shown that the single-point-of-failure property of centralized systems constitutes a major problem.

On the other hand, systems like BitTorrent-Sync⁴ (BT-Sync) offer a decentralized solution for synchronizing files of any size among several devices. However, BT-Sync does not currently offer any versioning features that would allow the recovery of mistakenly deleted or modified content. Also, synchronization is only possible if the corresponding devices are online. To ensure privacy, BT-Sync uses folder-based secrets which can be used to share directories among several devices or users. These secrets have to be exchanged out-of-band. Hence, at this moment in time, there is no generalized and free solution for the functionality described above.

¹ https://www.dropbox.com/
² https://drive.google.com/
³ https://mega.co.nz/
⁴ http://www.bittorrent.com/sync

1.2 Description of Work

This work describes the development of an open-source Java library for file synchronization and sharing that builds on a distributed hash table (DHT) provided by the TomP2P⁵ framework. Furthermore, corresponding security mechanisms had to be designed and applied to support user and data privacy. Since a basic concept and prototype implementation had already been worked out and presented [6] as part of another P2P project at the University of Zurich, it was the mission of this project to abstract and improve the overall concept of the preliminary work. In particular, the existing parts needed to be modified, adapted and revised in terms of security, stability and efficiency, such that the final product represents a self-contained and fully viable library.

The interface of the library supports several user management tasks that allow a potential application to register and log in users of the service. The respective reverse operations are provided as well. For file management, the library provides interfaces to add, update and delete files. In addition, operations for sharing files and assigning read and write permissions are supported. Since the library also assists in recovering older files by maintaining a file version history, appropriate interfaces are defined and implemented. Because the existing prototype did not support large files, this missing functionality had to be added. Concretely, the library was required to handle both small and large files, distinguishing between them internally so that the operation interface is used in the same way for both. Another requirement was to rethink, improve, encapsulate and model all supported functions as separate processes such that they can be executed in parallel. These requirements have been satisfied by the finalized version of the open-source library.

To demonstrate that the library can serve as a basis for developing file-based synchronization tools, an appropriate application has been created. Also, to provide fundamental documentation and guidance for the library, an appealing project webpage has been designed, created and hosted⁶.

1.3 Outline

After this chapter has given an insight into the motivation and the target of this project, Chapter 2 discusses related approaches and shows how this project relates to them. The proposed solution is then presented in Chapter 3. Besides the underlying model, the major security and peer-to-peer aspects are discussed. In addition, some optimization techniques are pointed out. Finally, Chapter 4 draws the conclusion of this project, lists some lessons learned and proposes some valuable approaches for future work.

⁵ http://tomp2p.net/
⁶ http://www.hive2hive.com/


Chapter 2

Related Work

The subject of synchronizing and sharing files over a global network is not completely new. Various services have emerged in the last couple of years to address this pervasive user requirement. With the availability of multiple providers comes the inevitable competition, which in turn leads to the appearance of innovative technology and features. Among the most popular providers are corporations like Dropbox, Google and Microsoft, which all provide synchronization and sharing solutions for both consumer and business markets. However, their implementations, concretely Dropbox, Google Drive and OneDrive¹, are based on a centralized client-server approach and thus store all user data in large external data centers. Regrettably, such private data is often stored as clear text and without any sort of encryption. Therefore, users have no control over their private data and cannot know who else might have access to it. In addition, such centralized systems suffer from the single-point-of-failure property and hence are vulnerable to targeted attacks.

More recent approaches try to avoid such liabilities of central instances and opt for a decentralized system structure instead. The BitTorrent Sync application, as an example, uses a peer-to-peer network overlay in order to profit from its advantages, like scalability, heterogeneity, reliability and fault-tolerance. Here, the deployment of a distributed hash table (DHT) makes it possible to organize and store files in a decentralized manner where no central elements are required at all. Unfortunately, BitTorrent Sync currently only provides synchronization and sharing of files among devices that are online at the same time. This represents a major drawback compared to centralized solutions, where files can be buffered on a server for the sake of asynchronous communication.

Another application that is based on a peer-to-peer network and addresses this issue is Wuala², a distributed file sharing system where files can temporarily be stored on dedicated peers. Wuala's mission is not only to store, share and publish files on the Internet; it also consistently aims at making such operations private and secure. Whereas older peer-to-peer sharing clients send files over the network as-is, Wuala distinguishes itself from other services by supporting client-side encryption of files that enter the network. Due to its original architecture, which included a central bootstrap entity, and its acquisition by LaCie³, the distinction between its peer-to-peer and client-server parts started to fade [2]. Since the Wuala project is closed-source, a further downside is the non-transparent application of the cryptographic methods that are used to encrypt the data in the network.

With Box2Box [6], an initial prototype of a file synchronization and sharing application with a focus on security and privacy aspects has been realized. In contrast to the Wuala system, Box2Box is fully distributed and has no central entity at all. Nevertheless, Box2Box was not built for public use, but rather to establish and test a proof of concept.

Hence, the mission of the Hive2Hive project was to address all these desired properties. Along with the fundamental support of file synchronization and sharing operations, the full set of advantages of decentralized systems, and in particular of peer-to-peer systems, is exploited. Another distinction from services like BitTorrent Sync is the ambition to provide users with the same user experience that they already know from popular, centralized services, including asynchronous communication between devices. Furthermore, the Hive2Hive library is not only a more abstracted, improved and generalized version of Box2Box, but is also developed as an open-source project that intentionally discloses its internal implementation. In doing so, the realization of every component, such as the cryptographic parts, is exposed to external expertise, which creates confidence, trust and transparency.

¹ https://onedrive.live.com/
² http://www.wuala.com/
³ http://www.lacie.com/


Chapter 3

Proposed Solution

First, this chapter shows the role of Hive2Hive and explains on which layer it is located. After that, the model with all its components is introduced, explained and justified. The Hive2Hive library attaches much importance to security aspects, which are discussed thereafter. Since distributed systems, compared to centralized approaches, need additional consideration, the subsequent section is devoted to specific implementation details that are motivated by peer-to-peer characteristics. Furthermore, some optimization work that has been done is described. To complete this chapter, a brief description of the implemented demonstration client and the testing methods used is provided.

3.1 Layer Roles

The Hive2Hive project extends the functionality of common peer-to-peer libraries by supporting user and file management operations. The setup of the underlying DHT, including the routing and storing mechanisms, however, is not part of this project. Instead, the Hive2Hive library relies on one of the most advanced open-source peer-to-peer projects, the TomP2P framework, which takes full responsibility for the interaction with the DHT and thus offers the necessary level of abstraction. The following basic operations provided by the TomP2P framework are used by this project:

• bootstrap: connecting peers to each other to form an overlay network

• put: storing data in the network

• get: fetching data from the network

• remove: deleting data from the network

• protect: protecting data against unauthorized modification by means of a digital signature

• messaging: sending direct and routed messages over the network


• replication: automatic duplication of stored data to deal with churn

Many applications would use and interact with the TomP2P framework directly. However, as depicted in Figure 3.1, Hive2Hive positions itself between such applications and the framework in order to offer extended functionality and simplified usage. This set of extensions can be accessed on top of the underlying peer-to-peer framework and includes user management operations like registration, login and logout. File management operations allow users that are logged in to add, delete and share files and folders with other users. Another benefit of the Hive2Hive library is its focus on security, which protects not only the data in the network, but also the privacy of each and every user. One advantage of a generic file synchronization library is that it can be used for almost any application that requires data to be exchanged among different clients. Arbitrary data can be written to files and synchronized across multiple peers in the network.

Figure 3.1: Hive2Hive is located between TomP2P and an application.

3.2 Underlying Concept

This section introduces the model of Hive2Hive and shows how its components interact. First, the namespace of the DHT is explained and it is shown which content is stored where. Then, particular objects, their associations and interactions are examined. Hive2Hive stores several data types in the network at various locations. Table 3.1 gives an overview of the stored objects, their location key, their content key and the encryption mechanism used. A corresponding visualization is given in Figure 3.2. Furthermore, a class diagram of the model, showing the corresponding associations, is attached in Appendix A.6.

Object          | Location Key                | Content Key       | Encryption
----------------|-----------------------------|-------------------|--------------------------------------------
UserProfile     | hash(userID, password, pin) | "USER PROFILE"    | symmetric with password and pin (256 bit)
MetaFile        | hash(public key)            | "META FILE"       | hybrid with RSA (2048 bit) and AES (256 bit)
Chunk           | hash(random String)         | "FILE CHUNK"      | hybrid with RSA (2048 bit) and AES (256 bit)
UserPublicKey   | hash(userID)                | "USER PUBLIC KEY" | public, not encrypted
Locations       | hash(userID)                | "USER LOCATIONS"  | public, not encrypted
UserProfileTask | hash(userID)                | timestamp         | hybrid with RSA (2048 bit) and AES (256 bit)

Table 3.1: Overview of where and how the objects are stored.

Figure 3.2: Visualization of the model in the DHT.
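The addressing scheme of Table 3.1 can be illustrated with a minimal sketch that models the DHT as an in-memory map indexed first by location key and then by content key. The class `InMemoryDht` and its methods are hypothetical and only mimic put, get and remove semantics; they are not the TomP2P or Hive2Hive API, and plain strings stand in for the hashed 160-bit keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for a DHT; not the TomP2P API. */
final class InMemoryDht {
    // location key -> (content key -> stored bytes)
    private final Map<String, Map<String, byte[]>> storage = new HashMap<>();

    /** Stores a copy of the data under the given location and content key. */
    void put(String locationKey, String contentKey, byte[] data) {
        storage.computeIfAbsent(locationKey, k -> new HashMap<>())
               .put(contentKey, data.clone());
    }

    /** Returns a copy of the stored data, or empty if nothing is stored there. */
    Optional<byte[]> get(String locationKey, String contentKey) {
        Map<String, byte[]> slot = storage.get(locationKey);
        if (slot == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(slot.get(contentKey)).map(byte[]::clone);
    }

    /** Removes the entry and reports whether something was actually deleted. */
    boolean remove(String locationKey, String contentKey) {
        Map<String, byte[]> slot = storage.get(locationKey);
        return slot != null && slot.remove(contentKey) != null;
    }

    public static void main(String[] args) {
        InMemoryDht dht = new InMemoryDht();
        // Both objects share the location hash(userID) but differ in the content key.
        dht.put("hash(alice)", "USER PUBLIC KEY", new byte[] {1});
        dht.put("hash(alice)", "USER LOCATIONS", new byte[] {2});
        System.out.println(dht.get("hash(alice)", "USER LOCATIONS").isPresent());
    }
}
```

The sketch shows why UserPublicKey and Locations can share the location key hash(userID): the content key disambiguates the objects stored at the same location.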


3.2.1 Location Keys

The location key indicates the location where the data is stored in the namespace of the DHT. In the case of TomP2P, a location key is 160 bits long, allowing 2^160 different location keys. Each peer in the DHT has a unique peer ID that is in the same range as the possible location keys. For any given location key, the peer with the closest peer ID is responsible for storing and replicating the object.

For each object stored in the DHT, the location key must always be known; otherwise the data is not retrievable because one does not know which peer to ask.

In Hive2Hive, the location key is derived from either a string or a public key. Since the public key cannot be chosen manually, a random distribution of the data is achieved. When a string is used, the location can be chosen deliberately, which makes it possible to explicitly select a peer for storage.

Figure 3.3 illustrates the concept of a location key with an example. If the location key is Alice's user ID Alice (5147), the content is stored at node 4607 because it is the closest one. In contrast, the user ID Bob (1375) would lead to peer 1032. If the public key of an asymmetric key pair is used as the location key, the content is statistically distributed at random over the whole DHT. In the example, the target node of the public key's hash is 243. A good balance of the data is required for a stable network.

Figure 3.3: Example for the location keys in the overlay network.
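The lookup in this example can be sketched in a few lines of Java. Two simplifications are assumptions of this sketch and are not taken from the library: SHA-1 is used to obtain a 160-bit key from a string, and closeness is measured as plain numeric distance, whereas a real DHT may use a different metric (for instance an XOR-based one). The class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class LocationKeySketch {

    /** Derives a 160-bit key from a string; SHA-1 is an assumption of this sketch. */
    static BigInteger keyFromString(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // interpret the 20 bytes as an unsigned number
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Returns the peer ID with the smallest numeric distance to the key (simplified metric). */
    static BigInteger closestPeer(BigInteger key, List<BigInteger> peerIds) {
        if (peerIds.isEmpty()) {
            throw new IllegalArgumentException("at least one peer is required");
        }
        BigInteger best = peerIds.get(0);
        for (BigInteger peer : peerIds) {
            BigInteger distance = peer.subtract(key).abs();
            if (distance.compareTo(best.subtract(key).abs()) < 0) {
                best = peer;
            }
        }
        return best;
    }
}
```

With the peers 243, 1032 and 4607 from Figure 3.3 (plus a made-up peer 7000), `closestPeer` maps the key 5147 to peer 4607 and the key 1375 to peer 1032, matching the example in the text.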

3.2.2 UserProfile

The central element of the user management is the UserProfile. The UserProfile, in the following also called profile, has not much in common with an account known from social websites like Google+¹ or Facebook². The UserProfile must be kept private because it contains all relevant data to read and manipulate the user's files. Details about the profile's encryption and further precautions are provided in Section 3.3.1. The UserProfile holds a tree of all files of the user. For every file or folder that is stored in Hive2Hive, the user creates an index. All indices form a tree equivalent to the file tree on the user's disk. An index can be either of type file or folder (see Sections 3.2.3 and 3.2.5).

Furthermore, the UserProfile also maintains the user's default protection and encryption keys. The protection key is used for protecting private content in the DHT. How and why the protection keys are used is explained in detail in Section 3.3.3. The purpose of the encryption key is described in Section 3.3.2.
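The index tree held by the UserProfile can be pictured with the following minimal sketch. All class and field names are hypothetical and chosen for illustration only; they do not reproduce the library's actual classes, and keys, hashes and sharing information are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical index tree; the names are illustrative, not the library's classes. */
final class IndexTreeSketch {

    /** Common base of file and folder indices. */
    static abstract class Node {
        final String name;
        FolderNode parent; // null for the root folder

        Node(String name) {
            this.name = name;
        }

        /** Path relative to the root folder, built by walking up the parents. */
        String path() {
            if (parent == null || parent.parent == null) {
                return name;
            }
            return parent.path() + "/" + name;
        }
    }

    /** Leaf of the tree, representing a single file. */
    static final class FileNode extends Node {
        FileNode(String name) {
            super(name);
        }
    }

    /** Inner node of the tree, representing a folder with children. */
    static final class FolderNode extends Node {
        final List<Node> children = new ArrayList<>();

        FolderNode(String name) {
            super(name);
        }

        /** Attaches a child to this folder and returns it for chaining. */
        <T extends Node> T add(T child) {
            child.parent = this;
            children.add(child);
            return child;
        }
    }
}
```

Because the tree mirrors the directory layout on disk, the relative path of any index can be recovered by walking from the node up to the root, as `path()` does here.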

3.2.3 FileIndex

A FileIndex holds the per-file RSA encryption key pair. The public and the private key have two functions: the public key is used as the location key to store and retrieve the MetaFile; the private key is used to decrypt it (the purpose of a MetaFile is explained below). Besides the key pair, the FileIndex holds an MD5 hash of the newest file version, which simplifies the synchronization process. Changes on disk and in the network can be compared much faster using this stored hash than by re-hashing all files whenever a comparison takes place. As a drawback, the hash needs to be updated whenever the content of the file is updated. Nevertheless, it has been implemented this way to enable a fast file synchronization during login.
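The hash comparison can be illustrated with the JDK's MessageDigest class. This is a minimal sketch of the idea, not the Hive2Hive implementation; the class and method names are hypothetical, and the file content is passed as a byte array for brevity.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class FileHashSketch {

    /** Computes the MD5 hash of the given file content. */
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns true if the local content differs from the version whose hash is stored in the index. */
    static boolean hasChanged(byte[] localContent, byte[] hashInIndex) {
        return !Arrays.equals(md5(localContent), hashInIndex);
    }
}
```

Only the 16-byte hash has to be kept in the index, so the comparison against the network state needs no download of the file itself.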

3.2.4 MetaFile and Chunk

The MetaFile is a separate object in the DHT, located at a random peer. A MetaFile has a list of FileVersions, each of which has a list of file chunks. For a better distribution of the heavy data in the network, all files are chunked with a user-configurable chunk size. Each chunk is encrypted with the same private key located in the MetaFile. The corresponding public key is also stored in the MetaFile. The location key of each chunk is randomly generated, ensuring that the chunks are evenly distributed over all peers in the network.

The chunks are the crucial data in the DHT; for an efficient replication, the chunk size is intended to be rather small and proportional to the assumed bandwidth. This may differ for each application; some need to synchronize chunks over the Internet while others only act in a local area network, which provides higher bandwidths, lower error rates and lower latencies.
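Splitting a file into chunks of a configurable size can be sketched as follows. The class and method names are hypothetical, and the sketch works on an in-memory byte array rather than on a file stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkingSketch {

    /** Splits the data into chunks of at most chunkSize bytes; the last chunk may be shorter. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(data.length, offset + chunkSize);
            chunks.add(Arrays.copyOfRange(data, offset, end));
        }
        return chunks;
    }
}
```

Each resulting chunk would then be encrypted and stored under its own randomly generated location key.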

3.2.5 FolderIndex

In contrast to the FileIndex, a FolderIndex does not have an MD5 hash or an encryption key pair. A folder has additional sharing functionality and needs to hold a protection key in case it is shared with other users. The reason for and the mechanics of this key are described in Section 3.3.3.

To keep track of all sharers, a list of users having access to the folder is kept in the FolderIndex in the UserProfile. Knowing which users have access to which files is required for the push notifications: clients currently online receive messages when a file has been added, updated, moved or deleted.

¹ https://plus.google.com/
² https://www.facebook.com

3.2.6 UserPublicKey

The encryption key pair, which is stored in the UserProfile, is used to secure all communication among users and among clients of the same user. The need to secure the communication in a P2P network becomes apparent when looking at the layers below. When TCP is used, the session is not necessarily secured. In the case of UDP, a connectionless protocol, packet arrival is not verified and encryption must be applied at the application layer. Intermediate routers and Internet service providers (ISPs) could read and manipulate the message (man-in-the-middle attacks). Encrypting the message requires additional computation power but provides better security and privacy.

The public key of each user is published in the DHT such that everyone can find it. The location key used is the hash of the user id, such that everyone knowing the user id can retrieve the public key. Messages can be encrypted with this public key. Since the private key is kept in the UserProfile, only the receiver can decrypt and read the message. Additionally, all messages that traverse the network are signed beforehand. The signature is created with the private key of the sending user. The receiver verifies the message’s sender with the aid of the sender’s public key, which can easily be fetched from the network. Thus, the origin of the message can be verified. This mechanism applies not only to communication between different users, but also to the information flow between different clients of the same user.

3.2.7 Locations and Notifications

For the fast and efficient synchronization of currently online clients, the clients push updates using direct messages. These messages are called notifications. To know which clients of a user are online, a mechanism to look up their locations is required. For this purpose, a list of locations per user is published in the DHT. It holds a list of peer addresses (public IP and port) of clients that are online. The Locations should always be up to date: when a client logs in, it adds itself to this list; when it logs out, it removes itself (friendly logout). Since unfriendly logouts need to be considered as well, the Locations are cleaned up every time a client detects an inconsistent state. The Locations are stored under the hash of the user id, allowing every user who knows the user’s id to find and read them.

The procedure is shown by an example in Figure 3.4, where Alice wants to send a message to Bob. First, she gets Bob’s public key from the DHT (1.). Then, the Locations of Bob are read (2.). In this case, Bob currently has only one client online. Alice encrypts the message with Bob’s public key, signs it with her own private key and sends it to Bob’s client (3.). Before decrypting and executing the message with his own private key, Bob checks Alice’s signature with the help of her UserPublicKey (4.).

Figure 3.4: Example showing how Alice sends a message to Bob.
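Steps 3 and 4 of this procedure can be sketched with the standard JDK cryptography classes. The sketch is an illustration under assumptions: the padding scheme (OAEP), the signature algorithm (SHA256withRSA), the 2048-bit key size and the decision to sign the ciphertext are choices made here and are not specified in this report. The class and method names are hypothetical, and plain RSA as used here only fits short messages.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.SignatureException;
import javax.crypto.Cipher;

final class MessageSketch {
    private static final String RSA = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Alice encrypts with Bob's public key. */
    static byte[] encrypt(byte[] plain, PublicKey receiver) {
        return rsa(Cipher.ENCRYPT_MODE, receiver, plain);
    }

    /** Bob decrypts with his own private key. */
    static byte[] decrypt(byte[] cipherText, PrivateKey receiver) {
        return rsa(Cipher.DECRYPT_MODE, receiver, cipherText);
    }

    private static byte[] rsa(int mode, java.security.Key key, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance(RSA);
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Alice signs the encrypted message with her private key. */
    static byte[] sign(byte[] data, PrivateKey sender) {
        try {
            Signature signature = Signature.getInstance("SHA256withRSA");
            signature.initSign(sender);
            signature.update(data);
            return signature.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Bob checks the signature with Alice's public key before decrypting. */
    static boolean verify(byte[] data, byte[] signatureBytes, PublicKey sender) {
        try {
            Signature signature = Signature.getInstance("SHA256withRSA");
            signature.initVerify(sender);
            signature.update(data);
            return signature.verify(signatureBytes);
        } catch (SignatureException e) {
            return false; // malformed signature counts as invalid
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The receiver first calls verify and only decrypts the message if the signature is valid.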

3.2.8 UserProfileTask

Depending on the application, it can be the case that no client of a given user is online. The Locations are then empty and cannot be used to notify the user. This is where the UserProfileTask comes into play. The name may sound confusing, but has its justification: a user only needs to notify another user if the latter needs to make some modifications in his own UserProfile. Remember, the UserProfile is private and can only be modified if the credentials are known. Any user can assign a UserProfileTask to any other user. Similar to the messages, the UserProfileTask is encrypted with the receiver’s public key.

A UserProfileTask is again located under the well-known hash of the receiver’s user id. The attentive reader may have noticed that the UserProfileTasks, the Locations and the UserPublicKey are all stored at the same peer.

When a user is logged out, multiple UserProfileTasks can be queued in the DHT, allowing the user to fetch them in the correct order. This is implemented using the content key, which is explained in the next section. The order of the UserProfileTasks needs to be preserved because some tasks depend on other tasks. Consider the following example: Bob receives a UserProfileTask from Alice that invites him to a shared folder. Some moments later, Alice adds one or multiple files to the shared folder. Since Bob is not online, the UserProfileTasks are queued in the network until he logs in the next time. The sharing task needs to be processed first; otherwise Bob does not know what to do with Alice’s files because he does not know that she shared the folder with him.

3.2.9 Content Key

The content key is the second dimension, after the location key, that TomP2P offers. For a given location key, one of 2^160 different buckets (indicated by the content key) can be chosen. This allows storing various data under a single location key³.

In Hive2Hive, the content keys are mostly hashes of constants that depend on the type of the data. This design decision reduces the probability of hash collisions. Concretely, a UserProfile has the content key hash("USER PROFILE"), a MetaFile has the content key hash("META FILE") and so on. In the case of the UserProfileTask, the current timestamp is used as the content key instead of a constant. Instead of using a one-way hash to calculate the content key, the timestamp is encoded into the content key such that it can be reconstructed. This allows a best-effort chronological ordering of the UserProfileTasks at the peer storing them. When the receiver logs in, the tasks can be fetched and processed in order.
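The effect of encoding the timestamp reversibly into the content key can be sketched with a sorted map. The encoding used here (the timestamp taken directly as the numeric value of the key) is an assumption for illustration; the report does not specify the actual encoding. All names are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class TaskQueueSketch {
    // Tasks stored under one location key, sorted by their content key.
    private final TreeMap<BigInteger, String> tasksByContentKey = new TreeMap<>();

    /** Encodes the timestamp into a content key (assumption: plain numeric value, no hashing). */
    static BigInteger contentKeyFor(long timestampMillis) {
        return BigInteger.valueOf(timestampMillis);
    }

    /** Reconstructs the timestamp, which would be impossible with a one-way hash. */
    static long timestampOf(BigInteger contentKey) {
        return contentKey.longValueExact();
    }

    /** Note: two tasks with an identical timestamp would overwrite each other in this sketch. */
    void put(long timestampMillis, String task) {
        tasksByContentKey.put(contentKeyFor(timestampMillis), task);
    }

    /** Returns the queued tasks in chronological order. */
    List<String> fetchInOrder() {
        return new ArrayList<>(tasksByContentKey.values());
    }
}
```

Even if the task for an added file arrives at the storing peer before the sharing task, the receiver still processes the sharing task first, because it carries the smaller timestamp.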

3.3 Security Aspects

In this section, the most fundamental security concerns and aspects that are considered in the Hive2Hive project are presented.

3.3.1 User Protection

UserCredentials

To protect and distinguish users in the peer-to-peer network, every user has to provide a set of parameters:

• User ID: Unique identifier of the user in the network. This parameter is public.

• User Password: The password the user needs to log in to the network. This parameter must be kept secret.

• User Pin: A per-password pin of the user. If the user password changes, it is recommended to change the pin as well. This parameter must be kept secret.

The different parameters are required in different applications of the library in order to ensure a user’s privacy.

³ With the domain key, TomP2P offers a third dimension after the location key and the content key. As the domain key is hardly used in Hive2Hive, it is omitted in this report.

User Profile Protection

To keep the content of the UserProfile private, it is symmetrically encrypted whenever it is put into the DHT. The corresponding AES encryption key is derived from the user’s password and pin by means of the PBKDF2 key derivation function. The pin is used to generate a fixed local salt (as opposed to client-server systems, where the salt is stored next to the hashed password on the server side). The salt is required to counter attackers that fetch a UserProfile and try to crack the encryption password by brute-force dictionary attacks. (Since this library is open source, the key derivation function is known to everybody.)
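A key derivation of this kind can be sketched with the PBKDF2 implementation that ships with the JDK. The way the salt is generated from the pin (a SHA-256 hash), the PRF (HMAC-SHA256) and the iteration count are assumptions made for this illustration; the report does not state the parameters Hive2Hive uses. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

final class ProfileKeySketch {
    private static final int ITERATIONS = 65536; // assumed value

    /** Derives an AES key of keyBits length from password and pin; the pin only feeds the salt. */
    static SecretKeySpec deriveAesKey(String password, String pin, int keyBits) {
        try {
            // Assumption: the fixed local salt is the SHA-256 hash of the pin.
            byte[] salt = MessageDigest.getInstance("SHA-256")
                    .digest(pin.getBytes(StandardCharsets.UTF_8));
            PBEKeySpec spec = new PBEKeySpec(password.toCharArray(), salt, ITERATIONS, keyBits);
            byte[] raw = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec).getEncoded();
            return new SecretKeySpec(raw, "AES");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same password and pin always yield the same key, so no key material has to be stored anywhere; a different pin yields a different key even for an identical password.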

User Profile Location

Another measure to protect the profile from unauthorized access is to hide it in the DHT. It should not be possible for an attacker to download and crack a specific UserProfile. Therefore, the profile’s location is derived from a hash of a combination of the credentials, including the password and pin. This prevents a potential attacker from targeting a specific user; the attacker can only fetch profiles of random users. Cracking the encrypted UserProfile of an unknown user, however, would hardly be worthwhile.

3.3.2 Encryption

Hive2Hive generally encrypts all data that traverses the network or is stored in the DHT. The same applies to messages. This keeps the content private and protects it from being read by clients that do not have the correct decryption key.

The encryption of data depends on what the data represents. This is due to the fact that different data needs to be found and accessed in different ways or might even be decrypted by more than one user. An overview of how different data is encrypted can be found in Table 3.1.

Currently, the Hive2Hive library distinguishes between symmetric AES encryption (128-bit, 192-bit or 256-bit), asymmetric RSA encryption (512-bit, 1024-bit, 2048-bit or 4096-bit) and hybrid encryption, which is a mixture of both. Hybrid encryption is recommended for large data as it improves the encryption/decryption time.
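The idea of hybrid encryption is that the bulk data is encrypted with a fresh symmetric key and only this short key is encrypted asymmetrically. The following sketch uses the standard JDK classes; the cipher modes (AES in GCM mode, RSA with OAEP padding) and the key sizes are assumptions chosen for the illustration and are not taken from this report. The class and method names are hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class HybridEncryptionSketch {
    private static final String RSA = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";
    private static final String AES = "AES/GCM/NoPadding";

    /** Result of a hybrid encryption: RSA-encrypted AES key, IV and AES ciphertext. */
    static final class Sealed {
        final byte[] wrappedKey;
        final byte[] iv;
        final byte[] cipherText;

        Sealed(byte[] wrappedKey, byte[] iv, byte[] cipherText) {
            this.wrappedKey = wrappedKey;
            this.iv = iv;
            this.cipherText = cipherText;
        }
    }

    static KeyPair newRsaKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static Sealed encrypt(byte[] plain, PublicKey receiver) {
        try {
            // 1. Fresh symmetric key and IV for the (possibly large) data.
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            SecretKey aesKey = generator.generateKey();
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher aes = Cipher.getInstance(AES);
            aes.init(Cipher.ENCRYPT_MODE, aesKey, new GCMParameterSpec(128, iv));
            byte[] cipherText = aes.doFinal(plain);
            // 2. Only the short AES key is encrypted with RSA.
            Cipher rsa = Cipher.getInstance(RSA);
            rsa.init(Cipher.ENCRYPT_MODE, receiver);
            return new Sealed(rsa.doFinal(aesKey.getEncoded()), iv, cipherText);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] decrypt(Sealed sealed, PrivateKey receiver) {
        try {
            Cipher rsa = Cipher.getInstance(RSA);
            rsa.init(Cipher.DECRYPT_MODE, receiver);
            SecretKey aesKey = new SecretKeySpec(rsa.doFinal(sealed.wrappedKey), "AES");
            Cipher aes = Cipher.getInstance(AES);
            aes.init(Cipher.DECRYPT_MODE, aesKey, new GCMParameterSpec(128, sealed.iv));
            return aes.doFinal(sealed.cipherText);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Plain RSA can only encrypt inputs shorter than its key size, whereas this scheme handles data of arbitrary length with a single RSA operation.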

3.3.3 Authenticity

In order to ensure authenticity, data that is stored in the network, as well as messages traversing it, are signed by the owner or sender, respectively.

Signing Data (Content Protection)

Since put and get operations in a DHT are public, a major problem in peer-to-peer systems is the verification of authorized content access. In order to avoid accidental (e.g., selection of an already used key) and deliberate (e.g., manipulation attacks) modification of network content, such as overwrites or deletes, any accessor’s modification permission has to be verified.

Thus, Hive2Hive uses a mechanism that is referred to as Content Protection and provided by the underlying TomP2P framework. This mechanism allows content to be signed with a Protection Key when it is put into the DHT for the first time. As long as this DHT location (composed of location key and content key) has not been allocated yet, it can be used. Due to the signature, only users owning the correct key are able to modify (overwrite, delete) the corresponding content.
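The principle of content protection can be modelled in a simplified way: the first put binds a slot to a public key, and every later put must carry a signature that verifies against that key. The sketch below is a stand-alone model and does not use or reproduce the TomP2P API; the class, its methods and the string-based slot identifier (standing for the combination of location key and content key) are hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.SignatureException;
import java.util.HashMap;
import java.util.Map;

final class ProtectedStoreSketch {
    private final Map<String, byte[]> content = new HashMap<>();
    private final Map<String, PublicKey> protectionKeys = new HashMap<>();

    /** Stores the data if the slot is free or if the signature matches the slot's protection key. */
    boolean put(String slot, byte[] data, PublicKey claimedKey, byte[] signature) {
        PublicKey required = protectionKeys.containsKey(slot) ? protectionKeys.get(slot) : claimedKey;
        if (!verify(data, signature, required)) {
            return false; // caller does not own the correct protection key
        }
        protectionKeys.put(slot, required);
        content.put(slot, data);
        return true;
    }

    byte[] get(String slot) {
        return content.get(slot);
    }

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(byte[] data, PrivateKey key) {
        try {
            Signature signature = Signature.getInstance("SHA256withRSA");
            signature.initSign(key);
            signature.update(data);
            return signature.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private static boolean verify(byte[] data, byte[] signatureBytes, PublicKey key) {
        try {
            Signature signature = Signature.getInstance("SHA256withRSA");
            signature.initVerify(key);
            signature.update(data);
            return signature.verify(signatureBytes);
        } catch (SignatureException e) {
            return false; // malformed signature counts as invalid
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this model, a second user who signs with a different key pair cannot overwrite the slot, while the owner can update it at any time.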

Default Protection Keys

During the registration process, a user is assigned a generated public/private key pair, which is referred to as the default protection keys and stored in the UserProfile. This per-user key pair has an essential role because it is used not only to sign and protect the user’s profile, Locations and public key, which are all stored in the DHT, but also for private (i.e., not shared) content, such as MetaFiles and data chunks. This protection key pair is used by default for this user in order to avoid the expensive creation of further key pairs.

Generated Protection Keys

However, as soon as content in the DHT is about to be shared with other users, things become slightly more complicated. If a user grants full write access for some data to other users, he cannot just share his own, secret default protection key. Instead, a new protection key pair must be generated and the content needs to be re-protected (signed) with it. After that, the newly created key pair has to be shared between the participants. For read-only access, it is not strictly necessary to generate such a new key pair. Nevertheless, the content is re-protected because the user could upgrade the rights and/or invite further friends.

Another case where new protection keys need to be generated is the protection of UserProfileTask objects. If a user puts such a UserProfileTask into the queue of another user, it is signed with the new protection key. Then the protection key is added to the UserProfileTask, and the UserProfileTask is encrypted. The receiver can decrypt the task and remove it from the DHT with the help of the enclosed protection keys.

3.4 P2P Aspects

Although a peer-to-peer network overlay structure comes with many advantages, there are several aspects that are more complicated than in centralized systems. This section addresses these problems and presents the most important mechanisms to cope with them.

3.4.1 Put Version Conflicts

If two nodes modify the same data simultaneously and try to put the modified data into the network, a version conflict may appear. In order to detect and avoid version conflicts, Hive2Hive uses a versioning approach, which is based on the fourth key (after the location, content and domain key) provided by TomP2P, the version key.

The idea is that every node that wants to store modified data has to verify whether the new data is based on the old data. To do so, a version key is generated for every modified data object before it is put into the DHT. A version key consists of a timestamp and a hash value of the modified data. Each put requires the version key of the data it is based on. The two following examples illustrate this concept:

Example 1

Figure 3.5 shows an example of a version conflict. The version X of some data is stored on multiple nodes due to replication. A client node U1 of user U modifies the version X (1.) and stores the new version as version Y1 on all nodes that are holding X (2.). A second client node U2 of user U did not, for example, get a notification about the modification of X. U2 also modifies it (3.) and tries to store version Y2 on all nodes (4.), which are now holding version Y1 (because the node U1 was faster). In order to detect and avoid this version conflict, each node holding Y1 checks whether Y2 is based on Y1. If not, the node denies the put request. This is exactly the case when U2 tries to put Y2. The putting client node U2 becomes aware of the conflict through the returned so-called future object provided by TomP2P. The future object contains a map with the peer addresses of all peers that were asked to store the version Y2 (successfully or not) as well as information about the put result of every contacted peer.

Example 2

The second example is visualized in Figure 3.6. Assume a replication rate of 5, so that for each object five nodes are responsible for storing it. The version X of some data is stored on the nodes a, b, c, d and e. A client node U1 of user U modifies the version X (1a.) and starts to store version Y1, which is successfully stored on the nodes a, b and c (2a.). At the same time, client node U2 modifies version X too (1b.) and also starts to store version Y2, which is successfully stored on nodes d and e (2b.). The version Y1 is denied at nodes d and e (3a.). The same happens for version Y2, which is denied at nodes a, b and c (3b.). Version Y1 (based on X) is denied at nodes d and e because it is based on X and not on Y2, the version now stored there. Version Y2 (also based on X) is denied at nodes a, b and c because its based-on version key is not Y1. Both client nodes become aware of the conflict through the returned future objects.

Figure 3.5: Example I of a version conflict.

If a version conflict has been detected, the processes are supposed to roll back the current modifications and to fetch the newest version again, which is then used as the base version for the desired modifications. The main drawback of this strategy is that race conditions may recur until one participant wins.
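The based-on check performed by each storing node can be modelled in a few lines. This is a simplified, hypothetical model and not the TomP2P API: version keys are plain strings here, whereas the real version key consists of a timestamp and a hash.

```java
final class VersionedNodeSketch {
    private String latestVersionKey;

    VersionedNodeSketch(String initialVersionKey) {
        this.latestVersionKey = initialVersionKey;
    }

    /** Accepts the new version only if it is based on the version this node currently holds. */
    boolean put(String newVersionKey, String basedOnKey) {
        if (!latestVersionKey.equals(basedOnKey)) {
            return false; // version conflict: the caller worked on outdated data
        }
        latestVersionKey = newVersionKey;
        return true;
    }

    String latestVersionKey() {
        return latestVersionKey;
    }
}
```

Replaying Example 2 with five such nodes that initially hold X: the puts of Y1 (based on X) succeed on a, b and c, the puts of Y2 (based on X) succeed on d and e, and every further put of Y1 or Y2 is denied because no node holds X any more.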

3.4.2 Concurrent Modification

A further problem, which overlaps with the previous one and also affects the storing of data in the network, is the concurrent modification of a particular object in the DHT. A high churn rate, caused by a large number of joining and/or leaving nodes, can confuse the nodes about their responsibilities. Data that is concurrently modified and stored may end up on two sets of replication nodes because two peers think that they are responsible for the put and replication procedure.

Figure 3.6: Example II of a version conflict.

Unfortunately, there is no suitable strategy to avoid this phenomenon. Therefore, the only approach is to accept this problem and to do some extra work to detect it. In order to detect concurrent modifications, the node that has put some new data into the network validates each put through a subsequent get. This is performed by a so-called digest get. The digest result is a list that contains the version keys of all versions currently stored on a node. With this kind of history it is possible to check whether the new version has been stored on the nodes. As long as the newest version key appears in this list, everything went right. The version key does not necessarily have to be the first entry because, during the time between the put and the get, another node could already have put a newer version. If the entry is missing on a node, it is a strong indication that a concurrent modification happened. In this case the validating node needs to check which version takes precedence. This is evaluated by comparing the version keys: the older version wins. The process that detects a version conflict and did not win has to roll back.

There are two problems with this strategy. First, a node may not get notified that its version has lost. At the latest, this node realizes at its next put that something went wrong. Second, the validation through a get is problematic because it may happen that, even after several gets, the get request does not see the other modified versions, which may lead to inconsistencies. This extremely rare case still requires some improvements in Hive2Hive. The following example illustrates the problem and the chosen approach:

Example

Client node U1 of user U modifies the version X to version Y1 (see point 1a in Figure 3.7) and stores it on nodes a, b, c, d and e (2a.). Client node U2 of user U also modifies version X, to version Y2 (1b.), and stores it on nodes f, g and h (2b.). The reason is a high churn rate, which makes other peers think that they are responsible for data Y2.

Meanwhile, nodes c, d and h went offline (see Figure 3.8). For the verification, client node U1 performs a digest get operation, which finds nodes a, b, e, f and g. Client node U2 also performs a digest get operation and finds nodes e, f and g. Client node U1 checks the history from nodes a, b and e (3a.) and sees that its version Y1 appears in the history. The same happens at client node U2 (3b.), which sees that its version Y2 appears at nodes f and g. But client node U1 also sees that the two answering nodes f and g do not have the version Y1; hence a concurrent modification is detected (4a.). Client node U2 also detects a concurrent modification thanks to the answer of node f (4b.). The nodes compare the version keys and detect that version Y2 is newer than version Y1; thus client node U2 has to roll back, that is, remove its version Y2.

Figure 3.7: Example for a concurrent modification (Part I).
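The evaluation of the digest results can be sketched as follows. The model is a simplification under assumptions: a version key is reduced to its timestamp (a long value), each digest is the list of version keys one node returned, and a competing version is any entry newer than the common base version. The class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.List;

final class DigestCheckSketch {

    /** A concurrent modification is suspected if the own version is missing on any answering node. */
    static boolean conflictDetected(long ownVersion, Collection<List<Long>> digests) {
        for (List<Long> digest : digests) {
            if (!digest.contains(ownVersion)) {
                return true;
            }
        }
        return false;
    }

    /**
     * The older version wins: the own version has to be rolled back if a node that
     * lacks it holds a competing version (newer than the base) that is older than the own one.
     */
    static boolean mustRollback(long ownVersion, long baseVersion, Collection<List<Long>> digests) {
        for (List<Long> digest : digests) {
            if (digest.contains(ownVersion)) {
                continue;
            }
            for (long version : digest) {
                if (version > baseVersion && version < ownVersion) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Applied to the example with X = 1, Y1 = 100 and Y2 = 200, both clients detect the conflict, but only the client that stored Y2 has to roll back.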

3.4.3 Garbage Collection

A characteristic of a DHT is that all data is spread over the network, so that the bulk of the storage space at each peer is used for foreign data, that is, data of other users. The main reason is the replication, which multiplies the required storage space (factors between 3 and 6 are default). These circumstances require careful storage management. However, inactive users or data residues, which appear despite a careful design, cause dead data that uses superfluous storage space and network traffic (for replication).

In Hive2Hive, all data that is stored in the network has a time-to-live (TTL) value. The underlying TomP2P framework automatically removes content from the network if its TTL has expired without having been refreshed. The default setting in Hive2Hive is a TTL of at least one year, but it can be configured individually by the developer.
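The TTL mechanism can be modelled with a map from keys to expiration times. The sketch is a stand-alone illustration, not the TomP2P implementation; the names are hypothetical, and the current time is passed in explicitly to keep the behaviour deterministic.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class TtlStoreSketch {
    // Expiration time (in milliseconds) per stored key; the payload is omitted for brevity.
    private final Map<String, Long> expiresAt = new HashMap<>();

    void put(String key, long ttlMillis, long nowMillis) {
        expiresAt.put(key, nowMillis + ttlMillis);
    }

    /** Extends the lifetime of an existing entry; returns false if the entry is unknown. */
    boolean refresh(String key, long ttlMillis, long nowMillis) {
        if (!expiresAt.containsKey(key)) {
            return false;
        }
        expiresAt.put(key, nowMillis + ttlMillis);
        return true;
    }

    /** Removes all entries whose TTL has expired and returns how many were removed. */
    int removeExpired(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = expiresAt.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= nowMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean contains(String key) {
        return expiresAt.containsKey(key);
    }
}
```

Data of active users survives because it is refreshed in time, whereas data of inactive users disappears after the TTL without any explicit delete operation.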

Figure 3.8: Example for a concurrent modification (Part II).

3.5 Optimizations

The described concept supports all high-level operations defined in Section 1.2. During the implementation phase, some optimizations that had not been considered during the design phase were made. The most important ones are described in the following.

Concurrent Operations

A single high-level operation (for example login or add) often requires multiple long-lasting network calls. Parallelizing such calls (where possible) improves the performance. Each operation is atomic and consists of multiple process steps and sub-processes. For example, a process step makes a network call and starts the next step when it is done. In addition, processes can be rolled back in case of an error.

The process as a whole is again asynchronous and can be controlled by the developer. Multiple processes may run concurrently. Further details of the process framework are provided in Appendix A.5.

UserProfile Manager

As mentioned above, all operations can run in parallel. Race conditions in which one process overwrites the changes of another process were soon detected. Although the concurrency handling described in Section 3.4.2 is elaborate and trustworthy, such errors slow down the execution of processes dramatically because processes need to roll back and restart.

Most of the processes need to read and/or modify the UserProfile. As a solution, a central, cross-process mechanism for fetching and modifying the UserProfile was built. Requests for read-only or read-and-write access can be made, which are queued in two separate queues: Qw for write requests and Qr for read-only requests. Multiple processes may read the profile in parallel; thus they can be served at the same time. Allowing multiple processes to modify the UserProfile simultaneously would lead to race conditions and inconsistencies. Thus, a request for a change blocks until the change has been made (or the maximum allowed modification time has expired). As soon as the process releases the lock, the changes are applied and uploaded, the processing of the queues continues and waiting processes are served with the updated profile.

Qw has a higher priority than Qr because every write request includes a fetch of the newest profile from the DHT, which is then modified. After this fetch, not only the foremost process in Qw, but also all processes in Qr can be served with the fetched and as yet unmodified profile.

Each node in the network has its own local UserProfile manager, so concurrency issues among multiple peers are not reduced. However, concurrency issues within a peer are eliminated. This UserProfile manager not only speeds up the processes because version conflicts are avoided, but also improves the network efficiency. The UserProfile does not have to be fetched by every process that tries to read it, but only once by the UserProfile manager, which then distributes it among the waiting processes.
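The serving policy of the two queues can be illustrated with a deliberately simplified, single-threaded model. It only shows which waiting processes share one fetch of the profile; locking, timeouts and the upload of the modified profile are left out. The class and its methods are hypothetical and do not correspond to the Hive2Hive implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ProfileManagerSketch {
    private final Deque<String> writeQueue = new ArrayDeque<>(); // Qw
    private final Deque<String> readQueue = new ArrayDeque<>();  // Qr
    private int fetchCount;

    void requestWrite(String process) {
        writeQueue.add(process);
    }

    void requestRead(String process) {
        readQueue.add(process);
    }

    /**
     * One round: the profile is fetched once and handed to the foremost writer (if any)
     * and to all readers that are waiting at that moment. Returns the served processes.
     */
    List<String> serveRound() {
        List<String> served = new ArrayList<>();
        if (writeQueue.isEmpty() && readQueue.isEmpty()) {
            return served;
        }
        fetchCount++; // stands for a single get of the UserProfile from the DHT
        String writer = writeQueue.poll();
        if (writer != null) {
            served.add(writer);
        }
        served.addAll(readQueue);
        readQueue.clear();
        return served;
    }

    int fetchCount() {
        return fetchCount;
    }
}
```

With two readers and two writers waiting, the first round serves the first writer and both readers with a single fetch; the second writer is served in the next round with a freshly fetched profile.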

Caching

The policy of encrypting all messages makes it necessary to know the receiver’s public key for the encryption. As all messages are signed with the sender’s private key, the receiver needs to know the sender’s public key, too. The public key of each user is stored in the DHT (see Section 3.2.6) during the registration. These keys never change but would need to be fetched every time a message is sent or received. Caching a key at each peer once it has been fetched from the DHT is therefore justified. The cache grows over time whenever the user interacts with another user for the first time. When the user logs out, the cache is written to a (hidden) meta file on disk. At the next login, the cache is loaded into memory again. This optimization is advantageous when a user has shared folders with one or multiple other users.
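Such a cache can be sketched as a map in front of the DHT lookup. The lookup itself is passed in as a function, so the sketch does not depend on any network code; the class name, the generic key type K and the lookup counter are hypothetical. Persisting the cache to disk at logout is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class KeyCacheSketch<K> {
    private final Map<String, K> cache = new HashMap<>();
    private final Function<String, K> dhtLookup; // stands for the get of a public key from the DHT
    private int lookups;

    KeyCacheSketch(Function<String, K> dhtLookup) {
        this.dhtLookup = dhtLookup;
    }

    /** Returns the cached key or fetches it once from the DHT on the first interaction. */
    K publicKeyOf(String userId) {
        K key = cache.get(userId);
        if (key == null) {
            lookups++;
            key = dhtLookup.apply(userId);
            cache.put(userId, key);
        }
        return key;
    }

    int lookups() {
        return lookups;
    }
}
```

Because the public keys never change, no invalidation logic is needed; every further message to or from the same user is served from memory.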

3.6 Demonstration Client

For testing and demonstrating the developer interface and the proper operation of the functionality, the creation of a minimalistic client was required. When the console-based application starts, a peer is created. By means of simple text inputs, the user can configure the node, connect to the peer-to-peer network and use all high-level functions. Screenshots 3.9 and 3.10 show sample screens of the console.

The demonstration client helped to conduct integration tests and to improve the developer API so as to guarantee a high usability.

Figure 3.9: Demonstration client (Screenshot I).

3.7 Testing

A test-driven development approach has been pursued from the beginning. The tests are organized in an independent project and cover all important components of the Hive2Hive library. With the help of the JUnit framework⁴, several unit and integration tests are implemented. Moreover, the already mentioned demonstration client served for manual testing purposes.

⁴ http://junit.org/

Figure 3.10: Demonstration client (Screenshot II).

Chapter 4

Conclusions

This chapter contains the conclusion of this project, a recapitulation of the achieved goals and some lessons learned. Furthermore, it mentions possible directions for future improvement and extension of the Hive2Hive library.

4.1 Conclusion

The task of this project was to develop an open-source Java library for distributed file synchronization and sharing. Moreover, this library should not only be based on the distributed hash table provided by the underlying TomP2P framework, but also focus on offering corresponding extensions that provide a high level of user and data security. Along with the fundamental support of file synchronization and sharing operations, the full set of advantages of decentralized, and in particular peer-to-peer, systems is exploited. Another fundamental ambition was to provide users with the same user experience that is already known from popular, centralized services. This includes, among others, the handling of rather large files and the asynchronous communication and transmission of data between peers.

The resulting Hive2Hive project represents a free and easy-to-use library that supports operations or tasks which shall be executed in a decentralized manner. With Hive2Hive, decentralization is achieved by building functionality on top of an underlying peer-to-peer network structure. The library takes full responsibility for the network interaction and thus provides the necessary level of abstraction.

The main focus of the library’s current state lies on user and file management. Concretely, an extensive set of fundamental operations is provided for applications that need to store, back up, synchronize or share files in and over the network. However, the Hive2Hive library is designed to be easily extendable for other services that intend to profit from decentralized properties and, at the same time, enrich its set of supported operations. In order to provide fundamental documentation and guidance for the library, an appealing project website has been designed, created and hosted¹.

¹ http://www.hive2hive.com

4.2 Lessons Learned

This section lists the lessons learned during the life cycle of this project. The most important lesson learned is to always have a profound understanding of the problems. The main problems of a distributed file storage service had already been identified in the prototype project Box2Box [6]. Even though the largest part of the Hive2Hive project was the implementation phase, a detailed planning and conception phase took place at the start of the project, in which the Box2Box approach was reviewed and revised. Each planned use case was discussed over multiple meetings until an appropriate model was found that provides the required level of security but keeps the use cases simple. During the implementation phase, such precise planning turned out to be extremely helpful. Even though some approaches were updated and restructured later, the overall concept remained the same.

Since the supervisor of this project is the core developer of TomP2P, the choice of this software as the underlying peer-to-peer framework was not arbitrary. This enabled a tight cooperation between the two projects. If needed, new functionality was added to TomP2P, for example the version key. These new features were delivered very fast, mostly within a few days, such that the development of Hive2Hive never stalled because of waiting times. In the course of this close interaction, Hive2Hive helped to detect and remove several bugs in TomP2P.

Furthermore, allocating fields of expertise to the developers turned out to be a successful practice. One member was a specialist for the interaction with TomP2P and managed the low-level functions (from the Hive2Hive perspective). Another developer implemented high-level operations in the file and user management. The third team member supervised the overall software architecture and led the required code refactoring processes. However, everyone also knew details of the others’ fields, and the boundaries were not strict. Interestingly, such fields of expertise were never assigned explicitly, but emerged during the daily work. Looking back, they turned out to be useful because, when problems arose, it was clear who had to be asked for help. At the same time, such a practice minimizes conflicts and interference in the code.

4.3 Future Work

To carry the development of the Hive2Hive library forward, many ideas for improvements, alternative solutions and extensions can be named.

A first proposed improvement is to secure the bootstrap mechanism. Concretely, the peer-to-peer system could support a sandboxed environment in which peers can trust each other. This could be useful, for example, in cases where Hive2Hive is deployed as a small company-internal network. A peer that wants to bootstrap would then have to know the network password first.

A second improvement for Hive2Hive would be to share the file configuration that is actually required by all nodes in the network. Currently, each node configures such constants locally, like the maximum allowed file size or the maximum number of versions. Another approach could be that the initial node (the master peer), which creates the network, defines and uploads a general configuration to a constant location. Other joining clients can then fetch it automatically. Such a solution would ensure that the file configuration has to be set only once and that it is applied globally. However, the challenge would be to verify that all joining nodes are using this configuration.

One of the major challenges during the Hive2Hive project was the design and implementation of a folder tree structure for the file system that copes with all of the security mechanisms without introducing unnecessary network load. The current solution works, but has some minor drawbacks. One shortcoming is the time-consuming encryption key generation, another the strong coupling with the UserProfile, which often has to be loaded from the network. Therefore, implementing the Cryptree [5] proposal sounds very promising. Cryptree would allow for a smarter key and access control management that has been optimized for peer-to-peer networks.

By now, the library provides chunking of files, which certainly helps to manage larger files in a peer-to-peer network. However, the handling of large files remains one of the biggest challenges in peer-to-peer networks. An approach that might help to reduce the network load due to large files is data deduplication. With this approach, a check for the existence of exactly the same data in the network could be made prior to each upload. If the data already exists, only a reference is stored. Unfortunately, this approach raises several questions and problems that have to be solved. One major issue is the conflict between data encryption and deduplication: the latter relies on finding identical binary data patterns, which the former ideally removes.

Another well-known approach for large files in peer-to-peer networks is to use erasure codes instead of the usual replication mechanisms. Erasure coding allows the replication factor to be reduced from traditional values of about 3 to 6 to approximately 1.3. But there are several questions to answer, such as who is responsible for the data chunks and where the calculation takes place. Another point is that TomP2P currently does not support erasure coding. Thus, it would require a major rebuild of the underlying framework.

Hive2Hive is mainly designed as a library for synchronizing and sharing files in a decentralized manner, which may be used by third-party applications. Besides the classical domain of syncing and sharing files, such as images or documents, the library could serve as the underlying platform for applications from other domains. The existing messaging and user management operations are well suited for a chat application, for example. The library uses state-of-the-art encryption and there are no central control instances. Another idea could be to realize a key manager. Often, having several dozen accounts with user names and passwords leads people (with a flair for privacy and safety) to manage their passwords with key manager applications. So far, there are just a few applications that allow their key databases to be synchronized across several devices. Current solutions mostly use central services, like servers, which have to be trusted. A key manager that is based on an overlay network and provides full encryption could be an interesting alternative.

The library provides general interfaces for user management, network configuration, file sharing and synchronization. However, the current design does not yet offer access to the underlying DHT functionality, such as get or put, in a more generic way. The same applies to encryption and write protection. A controlled and reasonable exposure of rather internal library services could support developers even more when creating applications.

One possible future direction for the Hive2Hive library could be an optimization for user-controlled nano data centers (UNaDa). UNaDas use home routers, which are online most of the time, to offer services to their users. Routers usually provide high availability and attached storage capacity.

Another possible direction could be an optimization for mobile devices, which would open up new application domains. However, such a direction would require much effort, including careful redesign and conceptualization, to let the library run in environments with limited resources.

Bibliography

[1] B. Cohen: Incentives Build Robustness in BitTorrent, in 1st Workshop on Economics of Peer-to-Peer Systems (P2PECON), Berkeley, CA, USA, June 2003.

[2] B. Stiller et al.: Internet Economics VIII, [Online], http://www.csg.uzh.ch/teaching/hs13/inteco/extern/IFI-2014.01.pdf, Universität Zürich, pp. 103-133.

[3] BBC: Megaupload file-sharing site shut down, [Online], http://www.bbc.co.uk/news/technology-16642369, May 2013.

[4] CISCO: Cisco Visual Networking Index: Forecast and Methodology, 2012-2017, [Online], http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.pdf, March 2014.

[5] D. Grolimund, L. Meisser, S. Schmid, R. Wattenhofer: Cryptree: A folder tree structure for cryptographic file systems, in SRDS, IEEE Computer Society, 2006, pp. 189-198.

[6] Lareida et al.: Box2Box - A P2P-based File-Sharing and Synchronization Application, 2013.

[7] M. Meeker, L. Wu: Internet trends, [Online], http://allthingsd.com/tag/mary-meeker/, May 2013.

[8] Miniwatts Marketing Group: Internet Growth Statistics, [Online], http://www.internetworldstats.com/emarketing.htm, January 2008.

[9] Sandvine Inc. ULC: Global Internet Phenomena Report 1H, [Online], https://www.sandvine.com/downloads/general/global-internet-phenomena/2013/sandvine-global-internet-phenomena-report-1h-2013.pdf, May 2013.

[10] V. Valancius, N. Laoutaris, L. Massoulie, C. Diot, P. Rodriguez: Greening the Internet with nano data centers, in Proceedings of the 5th international conference on Emerging networking experiments and technologies, ACM, 2009, pp. 37-48.

List of Figures

3.1 Hive2Hive is located between TomP2P and an application.

3.2 Visualization of the model in the DHT.

3.3 Example for the location keys in the overlay network.

3.4 Example showing how Alice sends a message to Bob.

3.5 Example I of a version conflict.

3.6 Example II of a version conflict.

3.7 Example for a concurrent modification (Part I).

3.8 Example for a concurrent modification (Part II).

3.9 Demonstration client (Screenshot I).

3.10 Demonstration client (Screenshot II).

A.1 The logo of the Hive2Hive project.

A.2 Process states.

A.3 UML class diagram of the underlying model.

A.4 UML class diagram of the process framework.

List of Tables

3.1 Overview about where and how the objects are stored.

A.1 Configuration interfaces.

A.2 Components of the process framework.

A.3 Process states explained in detail.

Appendix A

Additional Documentation

This chapter provides a brief documentation of the written code and how it can be used. A more detailed guide can be found on the project website¹. First, the name of the project is explained. Then, the proper setup of the library is described and, with the aid of some code snippets, its basic usage is shown. Next, the focus lies on how the high-level operations are implemented using the model explained in Section 3.2. Finally, the implemented process framework is documented.

A.1 The Name Hive2Hive

The former name Box2Box was discarded to indicate that this project starts again from scratch, including new concepts and ideas. Furthermore, the old name is derived from the name of one of the big centralized file synchronization services. When thinking of distributed systems, swarm behavior is not far-fetched. Busy bees fulfill their tasks in an altruistic manner but still protect their hive from attackers. This led to the beehive analogy, which is represented in the project's name. Figure A.1 shows the logo of the project.

A.2 Library Setup

There are two ways to download and import the Hive2Hive library in order to use it in an application. The recommended way is to download a stable official release that comes directly with all necessary and stable dependencies, such as the TomP2P framework. All required JAR files are packed and delivered as a ZIP. If the most recent state of development is required, Hive2Hive can be cloned from the version control system (VCS) repository² and bound to the application. This also requires cloning TomP2P from its VCS repository³.

¹ http://hive2hive.com
² https://github.com/Hive2Hive/Hive2Hive
³ https://github.com/tomp2p/TomP2P/

Figure A.1: The logo of the Hive2Hive project

INetworkConfiguration: Provides the necessary network settings. The utility class NetworkConfiguration provides some factory methods to easily create default and custom configurations (e.g., node ID, bootstrap address/port).

IFileConfiguration: Provides the necessary file settings. The utility class FileConfiguration provides some factory methods to easily create default and custom configurations (e.g., max. file size, number of versions, chunk size).

Table A.1: Configuration interfaces.

A.3 Developer API

The Hive2Hive developer API was designed to be easy to understand and use while remaining open to extensions and improvements. This section gives a simple introduction to the basic API. Note that this introduction does not cover how the API can be extended or how system-internal components can be changed and improved. Further, it is assumed at this point that the Hive2Hive library has been set up successfully and imported into the application.

A.3.1 Configuration

All configuration of the Hive2Hive framework is done by means of the two interfaces stated in Table A.1.

Listing A.1 shows the creation of the configurations for a master and a client peer. The difference between the two is that a master peer is the first node in the network, whereas the other peer bootstraps to the master node by the bootstrap address (and an optional port) provided by the configuration. Further, it demonstrates how the configuration for the files can be set in a similar way.

// master peer configuration
INetworkConfiguration masterConfig = NetworkConfiguration.create("masterID");

// node peer configuration with bootstrap address
// (InetAddress is java.net.InetAddress; getByName may throw UnknownHostException)
INetworkConfiguration nodeConfig = NetworkConfiguration.create("nodeID", InetAddress.getByName("192.168.1.100"));

// default file configuration
IFileConfiguration defaultFileConf = FileConfiguration.createDefault();

// custom file configuration:
// max. file size: 10 MB
// nr. of versions: 20
// max. size of all versions: 200 MB
// chunk size: 1 MB
IFileConfiguration customFileConf = FileConfiguration.createCustom(10 * 1024 * 1024, 20, 20 * 10 * 1024 * 1024, 1024 * 1024);

Listing A.1: Configuration demo code.

A.3.2 Peer Creation

As already stated in the previous section, the P2P network consists of two types of peers: one master peer, which represents the very first peer in the network, and several normal peers that bootstrap either to the master peer or to any other connected peer. Whether a peer is a master or not is implicitly defined in its instance of INetworkConfiguration: if no bootstrap parameters are provided, a master peer is created. When creating a peer, the developer API returns an IH2HNode interface that contains the functionality necessary to interact with the library. To create a peer, the H2HNode class is used, whose node creation factory method expects both the network and the file configuration. Once created, the node can be connected to the network (see Listing A.2).

// masterConfig and defaultFileConf are the configurations created in Listing A.1
IH2HNode node = H2HNode.createNode(masterConfig, defaultFileConf);
node.connect();
//...
node.disconnect();

Listing A.2: Peer creation demo code.

A.3.3 User Management

The Hive2Hive user management provides access to the registration and login states of users. The necessary operations can be found in the IUserManager interface, an implementation of which is returned by instances of IH2HNode. Most operations require some parameters, for example a user's specific credentials, which are defined in an object of the UserCredentials class. An exemplary usage is shown in Listing A.3.

IUserManager userManager = node.getUserManager();
UserCredentials credentials = new UserCredentials("userID", "password", "1234");

// register the user
userManager.register(credentials);

// login the user and provide the local root directory path
userManager.login(credentials, Paths.get(System.getProperty("user.home")));

// logout the currently logged in user
userManager.logout();

Listing A.3: User management demo code.

A.3.4 File Management

The file management, on the other hand, provides operations to manage files in the network, like adding, modifying or sharing. These operations can be found in the IFileManager interface, an implementation of which is returned by instances of IH2HNode. Here again, the required parameters have to be passed. In Listing A.4 a new file is created, uploaded and then shared. Furthermore, the interaction when recovering an old file version is shown. In the end, the file is moved and finally deleted.

IFileManager fileManager = node.getFileManager();

// rootFolder is assumed to be a java.io.File pointing to the root directory provided at login
File folder = new File(rootFolder, "demo-folder");
File file = new File(folder, "demo-file");

// add a file
fileManager.add(file);

// share a folder with another user (write permission)
fileManager.share(folder, "otherUser", PermissionType.WRITE);

// update a file
fileManager.update(file);

// recover a file's other version
IVersionSelector versionSelector = new IVersionSelector() {
    @Override
    public IFileVersion selectVersion(List<IFileVersion> availableVersions) {
        return availableVersions.get(0);
    }
};
fileManager.recover(file, versionSelector);

// move a file in the file hierarchy
File otherFolder = new File(rootFolder, "other-demo-folder");
File movedFile = new File(otherFolder, file.getName());
fileManager.move(file, movedFile);

// delete a file (at its new location)
fileManager.delete(movedFile);

Listing A.4: File management demo code showing some supported file operations.

A.3.5 Process Interaction

As shown in the examples above, the user and file managers execute processes that interact with the network. By default, such processes are executed instantly and automatically. However, it is sometimes necessary or desirable to start a process manually. To allow this, each manager can be configured to execute its operations either automatically or manually. How to enable or disable the autostart is shown in Listing A.5.

IUserManager userManager = node.getUserManager();
userManager.configureAutostart(true);

IFileManager fileManager = node.getFileManager();
fileManager.configureAutostart(false);

Listing A.5: Autostart enabling/disabling demo code.

Furthermore, it is important to note that the background processes created by operations of the user or file manager run asynchronously (i.e., in a separate thread), so these manager operations are non-blocking. As such operations are executed in the background, it can be useful to keep track of them. Such tracking can be achieved by retaining the instance of the IProcessComponent interface that is returned by each manager operation. This interface allows querying information about the process, like its current state or progress, and also offers control options like pausing, resuming and cancelling. Furthermore, it allows attaching listeners of the interface IProcessComponentListener, which basically listen for a process' success or failure but can be extended to do something else, too. Because processes run asynchronously, controlling a process can be tricky: it might, for instance, already have completed at the moment a pause is called. Therefore, as soon as the ability to control a (potentially very short) process instance is required, the autostart should be disabled (see above). Concerning the listeners, any listener of type IProcessComponentListener that is attached to an already finished (succeeded or failed) process component is triggered instantly. Listing A.6 illustrates the points described above.

IProcessComponentListener registerListener = new IProcessComponentListener() {
    @Override
    public void onSucceeded() {
        System.out.println("Registering succeeded.");
    }

    @Override
    public void onFailed(RollbackReason reason) {
        System.out.println("Registering failed. Reason: " + reason.getHint());
    }

    @Override
    public void onFinished() {
        // ...
    }
};

// automatic process starting
IUserManager userManager = node.getUserManager();
userManager.configureAutostart(true); // default
IProcessComponent registerProcess = userManager.register(credentials);

// in this example, the process might already have finished because it
// was autostarted and runs in a separate thread
String id = registerProcess.getID();
ProcessState state = registerProcess.getState();
double progress = registerProcess.getProgress();

// controlling might not work if the process has already finished
registerProcess.pause();
registerProcess.resume();
registerProcess.cancel(new RollbackReason("cancellation reason"));

registerProcess.attachListener(registerListener); // triggers even if the process has already finished

Listing A.6: Process interaction demo code.
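The rule that a listener attached to an already finished process is triggered instantly can be illustrated with a minimal sketch. The class ListenerSketch and its single-callback Listener interface are hypothetical simplifications for illustration only; they are not the Hive2Hive process framework and ignore failure and rollback handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a process component; not the Hive2Hive implementation.
class ListenerSketch {
    /** Simplified listener with a single callback. */
    interface Listener {
        void onSucceeded();
    }

    private final List<Listener> listeners = new ArrayList<>();
    private boolean succeeded = false;

    /** Attaches a listener; fires it right away if the process has already succeeded. */
    synchronized void attachListener(Listener listener) {
        if (succeeded) {
            listener.onSucceeded();
        } else {
            listeners.add(listener);
        }
    }

    /** Marks the process as succeeded and notifies all waiting listeners exactly once. */
    synchronized void succeed() {
        if (succeeded) {
            return;
        }
        succeeded = true;
        for (Listener listener : listeners) {
            listener.onSucceeded();
        }
        listeners.clear();
    }

    public static void main(String[] args) {
        ListenerSketch process = new ListenerSketch();
        process.attachListener(() -> System.out.println("early listener notified"));
        process.succeed();
        // Attached after completion, but still notified.
        process.attachListener(() -> System.out.println("late listener notified"));
    }
}
```

Guarding both methods with the same lock means a listener cannot slip in between the state change and the notification loop, which is the race that makes late attachment tricky in asynchronous code.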

In case the processes need to be started manually, Listing A.7 can serve as a starting point. More fundamental information about processes can be found in Appendix A.5.

// manual process starting
IUserManager userManager = node.getUserManager();
userManager.configureAutostart(false);

// start a registration
IProcessComponent registerProcess = userManager.register(credentials);

// here, the process has not yet started
registerProcess.attachListener(registerListener);
registerProcess.start();

Listing A.7: Manual process start demo code.

A.4 Implemented Functionality

This section shows and explains how the Hive2Hive functionality is implemented and how it works internally. For the sake of a better understanding, some simplifications have been made.

A.4.1 Register

In order to interact with the network, a user needs to announce himself. This is done by means of a one-time registration. Each user has some unique credentials, consisting of user ID, password and PIN, which are the required registration parameters.

1. Check if user already exists.

2. If not, the UserProfile for this user is created.

3. The created UserProfile is put to the DHT.

4. The new user Locations (empty) are put to the DHT.

5. The user’s public key is put to the DHT.
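The five registration steps can be sketched against an in-memory map that stands in for the DHT. Everything in this sketch is an assumption made for illustration: the class name RegisterSketch, the key prefixes and the plain strings used in place of the (normally encrypted) UserProfile, Locations and public key objects. The actual Hive2Hive process steps and TomP2P calls are not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a HashMap stands in for the DHT.
class RegisterSketch {
    final Map<String, Object> dht = new HashMap<>();

    /** Returns false if the user ID is already taken, true after a successful registration. */
    boolean register(String userId, String publicKey) {
        // 1. Check if the user already exists.
        if (dht.containsKey("profile:" + userId)) {
            return false;
        }
        // 2. and 3. Create the UserProfile and put it to the DHT.
        dht.put("profile:" + userId, "UserProfile of " + userId);
        // 4. Put the new (empty) Locations to the DHT.
        dht.put("locations:" + userId, new ArrayList<String>());
        // 5. Put the user's public key to the DHT.
        dht.put("publickey:" + userId, publicKey);
        return true;
    }

    public static void main(String[] args) {
        RegisterSketch network = new RegisterSketch();
        System.out.println(network.register("alice", "alice-public-key"));
        System.out.println(network.register("alice", "another-key"));
    }
}
```

The second call in main is rejected because step 1 finds the profile stored by the first call.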

A.4.2 Login

As soon as a user is registered to the network, he can log in to represent himself online. This triggers the network to update the logged-in client's state (file synchronization, messages, etc.). In order to do so, the correct credentials have to be provided again. As soon as the client is logged in, he starts to receive push notifications when another client has changed something.

1. The user’s UserProfile is fetched from the DHT.

2. A client session is created.

3. The user Locations are fetched from the DHT.

4. The user Locations in the DHT are updated (addition of own location and detection of possible unfriendly leaves of other clients) and put to the DHT. Also, it is evaluated whether this client is the master client.

5. Files are synchronized (inconsistencies due to offline phase).

6. If this client is master, the UserProfileTask queue is processed.

A.4.3 Logout

Logout is the reverse operation of login and can be regarded as a friendly leave in the network. In order to detect file changes during the offline phase of a client, the last synchronous state is persisted on the client's disk. This information is considered on the next login.

1. The user Locations are fetched from the DHT.

2. The user Locations in the DHT are updated (removal of own location) and put to the DHT.

3. Persist current disk state on the client (including hashes of the files, cached public keys).
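How the persisted state might be used at the next login can be sketched by comparing two maps from file path to file hash: the state persisted at logout and the state found on disk at login. The class OfflineChangeSketch and its method names are hypothetical, and the hashes are plain placeholder strings; how Hive2Hive actually stores and evaluates this information is not described in this section.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: detects changes made while the client was offline.
class OfflineChangeSketch {
    /** Paths that exist now but were not part of the persisted state. */
    static Set<String> added(Map<String, String> persisted, Map<String, String> current) {
        Set<String> result = new TreeSet<>(current.keySet());
        result.removeAll(persisted.keySet());
        return result;
    }

    /** Paths that were part of the persisted state but are gone now. */
    static Set<String> deleted(Map<String, String> persisted, Map<String, String> current) {
        Set<String> result = new TreeSet<>(persisted.keySet());
        result.removeAll(current.keySet());
        return result;
    }

    /** Paths present in both states whose hash differs. */
    static Set<String> modified(Map<String, String> persisted, Map<String, String> current) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldHash = persisted.get(entry.getKey());
            if (oldHash != null && !oldHash.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> persisted = new TreeMap<>();
        persisted.put("a.txt", "hash-1");
        persisted.put("b.txt", "hash-2");
        Map<String, String> current = new TreeMap<>();
        current.put("b.txt", "hash-2-changed");
        current.put("c.txt", "hash-3");
        System.out.println("added: " + added(persisted, current));
        System.out.println("deleted: " + deleted(persisted, current));
        System.out.println("modified: " + modified(persisted, current));
    }
}
```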

A.4.4 Add

Once logged in, the user can add files and folders to the network. Adding a folder is easier than adding a file because a folder does not have a corresponding MetaFolder or chunks.

Add a File

1. First, the file size is validated (the limit is configurable at node setup).

2. Get the UserProfile.

3. Add the FileIndex to the UserProfile and put the UserProfile back to the DHT.

4. Generate an RSA key pair which is used for the encryption of all chunks.

5. Split the file into multiple chunks and encrypt them with the previously generated public key. Upload them to the DHT at random locations, such that they are statistically spread over all nodes.

6. Create a new MetaFile, add the generated key pair and all chunk location keys. Add it to the DHT.

7. All clients having access to the newly added file are notified, such that they can download it.

When someone needs to download the file, he needs to search for the MetaFile key pair in the UserProfile. With this MetaFile key pair, he can find and decrypt the corresponding MetaFile. In the MetaFile, he finds the locations of all chunks and the key pair to decrypt them.
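Steps 4 and 5 can be sketched with the Java standard library. The section does not spell out the cipher details. Because RSA on its own can only encrypt a few hundred bytes, the sketch assumes a hybrid scheme in which each chunk is encrypted with a fresh AES key that is then wrapped with the RSA public key. This hybrid scheme, the class ChunkSketch, the key sizes and the cipher modes are assumptions for illustration, not a description of the Hive2Hive implementation.

```java
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical sketch of chunking plus hybrid (AES + RSA) encryption.
class ChunkSketch {
    private static final String RSA = "RSA/ECB/OAEPWithSHA-1AndMGF1Padding";
    private static final String AES = "AES/GCM/NoPadding";

    /** One encrypted chunk: the RSA-wrapped AES key, the GCM nonce and the ciphertext. */
    static final class EncryptedChunk {
        final byte[] wrappedKey;
        final byte[] nonce;
        final byte[] cipherText;

        EncryptedChunk(byte[] wrappedKey, byte[] nonce, byte[] cipherText) {
            this.wrappedKey = wrappedKey;
            this.nonce = nonce;
            this.cipherText = cipherText;
        }
    }

    /** Splits the data into chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            chunks.add(Arrays.copyOfRange(data, offset, Math.min(data.length, offset + chunkSize)));
        }
        return chunks;
    }

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts the chunk with a fresh AES key and wraps that key with the RSA public key. */
    static EncryptedChunk encrypt(byte[] chunk, KeyPair keys) {
        try {
            KeyGenerator aesGenerator = KeyGenerator.getInstance("AES");
            aesGenerator.init(128);
            SecretKey aesKey = aesGenerator.generateKey();
            byte[] nonce = new byte[12];
            new SecureRandom().nextBytes(nonce);
            Cipher aes = Cipher.getInstance(AES);
            aes.init(Cipher.ENCRYPT_MODE, aesKey, new GCMParameterSpec(128, nonce));
            byte[] cipherText = aes.doFinal(chunk);
            Cipher rsa = Cipher.getInstance(RSA);
            rsa.init(Cipher.WRAP_MODE, keys.getPublic());
            return new EncryptedChunk(rsa.wrap(aesKey), nonce, cipherText);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Unwraps the AES key with the RSA private key and decrypts the chunk. */
    static byte[] decrypt(EncryptedChunk chunk, KeyPair keys) {
        try {
            Cipher rsa = Cipher.getInstance(RSA);
            rsa.init(Cipher.UNWRAP_MODE, keys.getPrivate());
            Key aesKey = rsa.unwrap(chunk.wrappedKey, "AES", Cipher.SECRET_KEY);
            Cipher aes = Cipher.getInstance(AES);
            aes.init(Cipher.DECRYPT_MODE, aesKey, new GCMParameterSpec(128, chunk.nonce));
            return aes.doFinal(chunk.cipherText);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A downloader who holds the key pair from the MetaFile would call decrypt on every fetched chunk and concatenate the results in the order of the stored chunk location keys.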

Add a Folder

1. Get the UserProfile

2. Add the FolderIndex to the UserProfile and put the UserProfile back to the DHT.

3. Notify all clients / users having access to the new folder, such that they can create one on their disk, too.

A.4.5 Modify

Once a file is added, users might want to modify / update it. Hive2Hive supports storing multiple versions of a file. This scenario describes how to add a new version. Modifications can only be made to files because folders do not have a version history.

1. Similar to adding a file, its size is validated

2. Fetch the MetaFile from the DHT

3. Split the updated version into multiple chunks, encrypt them with the RSA key pair located in the MetaFile.

4. The previously fetched MetaFile is supplemented with the new chunks' location keys and put in the DHT

5. Since the file's MD5 hash has changed, the hash is updated in the FileIndex in the UserProfile

6. All clients having access to the updated file are notified, such that they can download the new version

The node configuration allows setting the maximum number of versions and the maximum number of bytes allowed for each file (summed over all versions). It can happen that the actual values exceed these limits, in which case old file versions need to be cleaned up. If a cleanup is necessary, old versions are removed from the MetaFile and the old chunks are deleted from the DHT.
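The cleanup rule can be sketched as a loop that drops the oldest version until both limits are respected. The class VersionCleanupSketch is hypothetical, versions are reduced to their sizes in bytes (oldest first), and the decision to always keep the newest version is an assumption of this sketch that the text does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the version cleanup; versions are represented by their sizes only.
class VersionCleanupSketch {
    /**
     * Returns the sizes of the versions that remain (oldest first) after enforcing
     * the maximum number of versions and the maximum total size.
     * Assumption: the newest version is never removed.
     */
    static List<Long> cleanup(List<Long> versionSizes, int maxVersions, long maxTotalBytes) {
        List<Long> kept = new ArrayList<>(versionSizes);
        long total = 0;
        for (long size : kept) {
            total += size;
        }
        while (kept.size() > 1 && (kept.size() > maxVersions || total > maxTotalBytes)) {
            // Remove the oldest version; its chunks would be deleted from the DHT here.
            total -= kept.remove(0);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(5L, 5L, 5L, 5L);
        // Four versions of 5 bytes each, at most 3 versions and 12 bytes in total.
        System.out.println(cleanup(sizes, 3, 12));
    }
}
```

In the example, the first removal satisfies the version count (three versions, 15 bytes) and a second removal is needed to satisfy the size limit (two versions, 10 bytes).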

A.4.6 Delete

Deleting a file implies deleting all versions of this file in the DHT. Note that deleting a file is irreversible because it removes all previous versions as well. Deleting a folder is much easier than deleting a file because no chunks and no MetaFile need to be removed from the DHT.

Delete a File

1. Get the UserProfile

2. Remove the FileIndex from the index tree in the UserProfile and put the UserProfile back to the DHT

3. Delete the file on disk (or move it to the trash folder)

4. Fetch the MetaFile from the DHT

5. Iteratively delete all chunks

6. Delete the MetaFile

7. Notify all clients that they should delete the file as well

Delete a Folder

1. Delete the folder on disk (or move it to the trash folder)

2. Remove the FolderIndex from the index tree in the UserProfile

3. Notify all clients that they should delete the folder as well

A.4.7 Move

Moving a file works the same way as moving a folder. The model makes it possible to merely relink the index in the UserProfile. No MetaFile or file data needs to be moved in the network.

1. Get the UserProfile

2. Move the index from the old parent (FolderIndex) to the new parent (FolderIndex).

3. Put the UserProfile back to the DHT.

4. Notify the user's own clients about the movement of the file, such that they can move it on their disk too.

The notification when a file has moved is slightly more complicated when considering movements of shared files / folders. The client that has moved the file needs to distinguish between three different kinds of users:

• Users that had access to the old location and have access to the new location

• Users that had access to the old location but do not have access to the new location. They should be notified to delete the file at the old location.

• Users that did not have access to the old location but have access to the new location. From their perspective, the movement is a simple add of the file.

Each of these three classes of users needs to be notified with a different kind of message.
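The three classes of users follow directly from set operations on the users with access to the old and to the new location. The class MoveNotificationSketch and its method names are hypothetical and only illustrate this classification; the actual notification messages are not shown.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: decides which kind of message each user receives after a move.
class MoveNotificationSketch {
    /** Users with access to both locations: they receive a "move" message. */
    static Set<String> moveRecipients(Set<String> oldAccess, Set<String> newAccess) {
        Set<String> result = new TreeSet<>(oldAccess);
        result.retainAll(newAccess);
        return result;
    }

    /** Users who lose access: they are told to delete the file at the old location. */
    static Set<String> deleteRecipients(Set<String> oldAccess, Set<String> newAccess) {
        Set<String> result = new TreeSet<>(oldAccess);
        result.removeAll(newAccess);
        return result;
    }

    /** Users who gain access: for them, the move looks like a newly added file. */
    static Set<String> addRecipients(Set<String> oldAccess, Set<String> newAccess) {
        Set<String> result = new TreeSet<>(newAccess);
        result.removeAll(oldAccess);
        return result;
    }

    public static void main(String[] args) {
        Set<String> oldAccess = new TreeSet<>(Arrays.asList("alice", "bob"));
        Set<String> newAccess = new TreeSet<>(Arrays.asList("bob", "carol"));
        System.out.println("move: " + moveRecipients(oldAccess, newAccess));
        System.out.println("delete: " + deleteRecipients(oldAccess, newAccess));
        System.out.println("add: " + addRecipients(oldAccess, newAccess));
    }
}
```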

A.4.8 Recover

Hive2Hive supports file versioning (see Section A.4.5) and stores, depending on the developer's configuration, multiple versions of a given file. Recovering an old version does not overwrite the newest version on disk but creates a new file with a predefined suffix to indicate that it is a recovered file.

Note that recovering a previously deleted file is not possible because deleting a file means deleting all of its versions.

1. Get the UserProfile, find the FileIndex of the file to recover

2. With the FileIndex, the MetaFile can be fetched from the DHT. All versions are referenced there

3. Ask the user which version he would like to recover

4. Download the selected version under a new name

5. Add the recovered file to the DHT using the Add File functionality

A.4.9 Share

When a user shares a folder with one or multiple other users, they can read and/or modify the content of this folder. By definition, users can share whole folders but not single files. Sharing can either be read-only or give full write access. Users with write access can invite more users or kick unwelcome sharers.

Since all data in the DHT is write protected, the protection key needs to be handed over to the other users in case they have write access. If a user has read-only access, the protection key is kept private, such that the user has no chance to update a file. Let's look at the functionality step by step:

1. Get the UserProfile, find the FolderIndex to share

2. Check recursively whether a sub-folder or parent folder is already shared. If so, sharing is not possible because folders must not be shared within shared folders.

3. Generate a new protection key for this share.

4. Recursively change all protection keys of the enclosed files (MetaFiles, Chunks) from the default protection key to the newly generated protection key.

5. Add the new sharer to the list of sharers in the FolderIndex and put the UserProfile back into the DHT.

6. Notify the new sharer about the folder they now have access to.

7. Notify the existing sharers about the new sharer so that they can update the list of sharers in their FolderIndex, too.

Sharing is a cost-intensive call, especially if a user wants to share a folder that already has a lot of content. However, TomP2P allows updating the protection key without uploading the data again. To speed up the sharing process, the protection keys can be updated in parallel.
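The check in step 2 can be sketched in Java as follows. FolderNodeSketch is a hypothetical stand-in for a folder entry in the user profile and is not the FolderIndex class of Hive2Hive; it only illustrates how a walk upwards and a recursive walk downwards together rule out nested shares.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a folder entry in the user profile tree.
final class FolderNodeSketch {
    final String name;
    final FolderNodeSketch parent;
    final List<FolderNodeSketch> children = new ArrayList<>();
    boolean shared;

    FolderNodeSketch(String name, FolderNodeSketch parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    /** True if this folder or any folder above it is shared. */
    boolean isSharedOrHasSharedAncestor() {
        for (FolderNodeSketch n = this; n != null; n = n.parent) {
            if (n.shared) {
                return true;
            }
        }
        return false;
    }

    /** True if any folder below this one is shared. */
    boolean hasSharedDescendant() {
        for (FolderNodeSketch child : children) {
            if (child.shared || child.hasSharedDescendant()) {
                return true;
            }
        }
        return false;
    }

    /** Step 2: sharing is allowed only if no enclosing or nested share exists. */
    boolean canBeShared() {
        return !isSharedOrHasSharedAncestor() && !hasSharedDescendant();
    }
}
```

In this sketch, once a folder is marked as shared, neither its parent folders nor its sub-folders pass canBeShared(), while sibling folders are unaffected.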

A.5 Process Framework

Internally, all Hive2Hive operations are executed by means of processes and process steps. For this reason, the Hive2Hive project features its own process framework, which is explained in this section. The framework is designed primarily to achieve two goals: it must facilitate the project’s core operations, and it must be as extensible and reusable as possible to allow future additions, improvements and extensions.

A.5.1 Requirements and Overview

In order to successfully process and execute chains of operations, some basic requirements need to be satisfied. The following features are supported:

• Any operation or functionality is representable by (nested) processes, subprocesses and process steps, based on its specific divisibility.

• Any operation or functionality, or rather its underlying process structure, is accessible and controllable via a single interface.

• Any operation executed by a process component is able to be rolled back in case of a cancellation (due to manual interaction or execution failure).

• Any process component can be defined to run in an asynchronous manner.

How these features are implemented can be seen in the UML diagram in Figure A.4. Table A.2 gives an overview of the components of the process framework.

IProcessComponent: This is the basic interface defining all common functionalities of process components and provides the "public" API that is typically used by the client. For example, it offers the unique ID of the component, its current state and progress as well as all attached IProcessComponentListeners. This interface extends the IControllable interface which defines operations to start, pause, resume, stop and cancel any process component.

ProcessComponent: This is the abstract base class for all process components and directly implements IProcessComponent. It keeps track of a component’s most essential properties and functionalities. Some Template Methods are defined for component specific routine implementations.

Process: This is the abstract base class for all composite (container) process components that may contain other ProcessComponents.

ProcessStep: This is the abstract base class for all (leaf) process components that represent a specific operation and do not contain further components.

ProcessDecorator: This is the abstract base class for all process component Decorators that provide additional behavior or state to existing components.

IProcessComponentListener: This is the basic process component listener interface.

Table A.2: Components of the process framework.
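The relationship between the composite, leaf and template-method roles in Table A.2 can be sketched in Java. All class names ending in Sketch are hypothetical simplifications and not the framework's actual classes; only the method name doExecute() is taken from this report, and state handling, listeners and rollback are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical versions of the component types in Table A.2.
abstract class ComponentSketch {
    /** Template method: subclasses supply the actual work in doExecute(). */
    final void start() {
        doExecute();
    }

    protected abstract void doExecute();
}

/** Leaf: represents one concrete operation. */
abstract class StepSketch extends ComponentSketch {}

/** Composite: runs its children in insertion order. */
class CompositeSketch extends ComponentSketch {
    private final List<ComponentSketch> children = new ArrayList<>();

    CompositeSketch add(ComponentSketch child) {
        children.add(child);
        return this;
    }

    @Override
    protected void doExecute() {
        for (ComponentSketch child : children) {
            child.start();
        }
    }
}

/** Example leaf that records its name in a shared log. */
class LogStepSketch extends StepSketch {
    private final String name;
    private final List<String> log;

    LogStepSketch(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    protected void doExecute() {
        log.add(name);
    }
}
```

Because a composite is itself a component, composites can be nested to any depth, and the client only ever calls start() on the outermost one.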

Figure A.2: Process states.

READY: Represents a component that is ready to be executed.

RUNNING: Represents a component that is currently being executed.

ROLLBACKING: Represents a component that is currently being rolled back.

PAUSED: Represents a component that is currently paused, whether it is running or rollbacking.

SUCCEEDED: Represents a component that has finished successfully.

FAILED: Represents a component that has finished unsuccessfully and failed.

Table A.3: Process states explained in detail.

A.5.2 Process States

Every ProcessComponent is required to remain in a valid state represented by the enum ProcessState. To ensure this, a minimal set of states and valid state transitions, of which the ProcessComponent keeps track, are defined. The possible states are described in Table A.3 and the valid transitions are shown in Figure A.2.
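A state machine of this kind can be sketched in Java as an enum with explicit successor sets. The state names follow Table A.3, but the transition sets below are an assumption made for illustration, since Figure A.2 is not reproduced in this text; StateSketch and StateHolderSketch are hypothetical names.

```java
import java.util.EnumSet;
import java.util.Set;

// State names follow Table A.3; the transition sets are an assumption.
enum StateSketch {
    READY, RUNNING, ROLLBACKING, PAUSED, SUCCEEDED, FAILED;

    /** States that may directly follow this one (assumed, not taken from Figure A.2). */
    Set<StateSketch> successors() {
        switch (this) {
            case READY:
                return EnumSet.of(RUNNING);
            case RUNNING:
                return EnumSet.of(PAUSED, ROLLBACKING, SUCCEEDED);
            case PAUSED:
                return EnumSet.of(RUNNING, ROLLBACKING);
            case ROLLBACKING:
                return EnumSet.of(PAUSED, FAILED);
            default:
                // SUCCEEDED and FAILED are treated as final states.
                return EnumSet.noneOf(StateSketch.class);
        }
    }

    boolean canMoveTo(StateSketch next) {
        return successors().contains(next);
    }
}

/** Keeps track of the current state and rejects invalid transitions. */
final class StateHolderSketch {
    private StateSketch state = StateSketch.READY;

    StateSketch state() {
        return state;
    }

    void moveTo(StateSketch next) {
        if (!state.canMoveTo(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }
}
```

Centralizing the check in moveTo() means a component can never silently end up in an invalid state; an illegal transition fails loudly instead.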

A.5.3 Process Listeners

In order to observe a process component it is possible to attach listeners to it. The framework so far provides the IProcessComponentListener interface, which notifies about a component’s success or failure, and the IProcessResultListener, which notifies about the completion of a result. Both interfaces may be extended to meet other needs. It is highly recommended that processes are observed, as this is the only way to get notified about their successful or unsuccessful execution.

A.5.4 Rollback

Every process component has the ability to roll itself back. This means that every operation that has been executed by a Process or ProcessStep needs to be reverted. For Process container classes this might, depending on the implementation, just mean rolling back all child components in reverse order. ProcessStep subclasses that need to roll something back are required to override the doRollback() method, because it does nothing by default.

A rollback can be triggered in two ways. First, the cancel() method may be called externally, in which case a RollbackReason object has to be provided. Second, any process component in the composite may fail. In this case, a ProcessExecutionException must be raised by the component in the doExecute() method. This automatically causes the framework to roll back, and therefore cancel, the whole process composite and to notify all listeners about the failure.
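The second trigger, a failing step that causes a rollback in reverse order, can be sketched in Java. The method names doExecute() and doRollback() are taken from the text; StepFailedSketch stands in for ProcessExecutionException, and RollbackStepSketch and SequentialRunnerSketch are hypothetical simplifications. The sketch assumes that only steps which completed successfully are rolled back.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the framework's ProcessExecutionException.
class StepFailedSketch extends Exception {
    StepFailedSketch(String message) {
        super(message);
    }
}

abstract class RollbackStepSketch {
    abstract void doExecute() throws StepFailedSketch;

    /** Does nothing by default; steps that change something override it. */
    void doRollback() {}
}

final class SequentialRunnerSketch {
    private final List<RollbackStepSketch> steps = new ArrayList<>();

    SequentialRunnerSketch add(RollbackStepSketch step) {
        steps.add(step);
        return this;
    }

    /** Returns true on success; on failure, undoes completed steps in reverse order. */
    boolean run() {
        Deque<RollbackStepSketch> executed = new ArrayDeque<>();
        for (RollbackStepSketch step : steps) {
            try {
                step.doExecute();
                executed.push(step); // most recent step ends up on top
            } catch (StepFailedSketch e) {
                while (!executed.isEmpty()) {
                    executed.pop().doRollback();
                }
                return false;
            }
        }
        return true;
    }
}
```

Using a stack for the executed steps gives the reverse order for free: the step that ran last is the first one to be reverted.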

A.5.5 Asynchronous Process Components

By default, every ProcessComponent is intended to run synchronously. In other words, it runs on the calling thread and blocks until the execution has finished. To run whole process hierarchies or single nested process components asynchronously, the framework provides the AsyncComponent decorator class. This decorator wraps the original component and executes it on a separate thread. Such wrapping allows different components or whole processes to execute simultaneously.

Although this decorator is a very convenient tool, it has to be used with caution, because asynchronous execution is harder to reason about. For example, asynchronous components should be independent of any other components, as they are executed and, in case of a failure or manual interaction, rolled back in their own background thread.
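The idea behind such a decorator can be sketched in Java with a plain thread and a CountDownLatch. This is not the AsyncComponent class of Hive2Hive; RunnableComponentSketch and AsyncSketch are hypothetical names, and rollback handling is omitted.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical minimal component interface for this sketch.
interface RunnableComponentSketch {
    void start();
}

/** Decorator that runs the wrapped component on its own thread. */
final class AsyncSketch implements RunnableComponentSketch {
    private final RunnableComponentSketch decorated;
    private final CountDownLatch done = new CountDownLatch(1);

    AsyncSketch(RunnableComponentSketch decorated) {
        this.decorated = decorated;
    }

    @Override
    public void start() {
        Thread worker = new Thread(() -> {
            try {
                decorated.start();
            } finally {
                done.countDown(); // signal completion even if the component throws
            }
        }, "async-component");
        worker.start(); // returns immediately; the caller is not blocked
    }

    /** Blocks until the wrapped component has finished. */
    void await() throws InterruptedException {
        done.await();
    }
}
```

Because the decorator implements the same interface as the component it wraps, it can be placed anywhere a synchronous component is expected.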

A.5.6 Computing Processes

Sometimes a process is required to compute a value or fetch a result from the network. In such situations the IResultProcessComponent interface is useful, as it provides the notifyResultComputed() method, which is executed as soon as the result is ready. Process components that implement this interface should notify their attached IProcessResultListeners so that the client learns about the available result. If such a computing component needs to run asynchronously, wrapping it with the AsyncResultComponent decorator is the best option.
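How a computing component might hand its result to listeners can be sketched in Java. Only the method name notifyResultComputed() is taken from the text; ResultListenerSketch, ResultComponentSketch, FileCountSketch and their signatures are hypothetical and do not reproduce the IResultProcessComponent or IProcessResultListener interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener type modelled on the idea of a result listener.
interface ResultListenerSketch<T> {
    void onResultReady(T result);
}

/** Hypothetical computing component that publishes one result to its listeners. */
abstract class ResultComponentSketch<T> {
    private final List<ResultListenerSketch<T>> listeners = new ArrayList<>();

    void attachListener(ResultListenerSketch<T> listener) {
        listeners.add(listener);
    }

    /** Called as soon as the result is available. */
    protected void notifyResultComputed(T result) {
        for (ResultListenerSketch<T> listener : listeners) {
            listener.onResultReady(result);
        }
    }

    final void start() {
        notifyResultComputed(compute());
    }

    protected abstract T compute();
}

/** Example: counts the entries of a file list. */
final class FileCountSketch extends ResultComponentSketch<Integer> {
    private final List<String> files;

    FileCountSketch(List<String> files) {
        this.files = files;
    }

    @Override
    protected Integer compute() {
        return files.size();
    }
}
```

A client attaches its listener before calling start() and receives the computed value through the callback rather than as a return value, which also works unchanged when the component is later run on another thread.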

A.5.7 Process Factory

Hive2Hive uses a simple process factory that composes all processes used by its core functionality. It expects the process-specific parameters, takes care of nesting the process components and returns only the IProcessComponent interface to the client.

A.6 Class Diagrams

Figure A.3: UML class diagram of the underlying model.

Figure A.4: UML class diagram of the process framework.

Appendix B

Contents of the CD

• Report in PDF and PS format.

• Report as source files including all figures.

• Abstract and its German counterpart "Zusammenfassung" as plain-text files.

• The Hive2Hive library source code as JAR file, including all dependencies.

• Related work papers in PDF format.

• Other material used and produced during the project. These include pictures of whiteboard discussions, diagrams, marketing material (logo and website), etc.

• Project website as offline version.
