Datafying Bitcoin
Tariq B. Ahmad
https://github.com/tariq786/datafying_bitcoin
Motivation
● Bitcoin is a virtual peer-to-peer cryptocurrency.
● All Bitcoin transactions are publicly available (who sent, who received, and how much) but pseudonymous.
● This publicly available data is called the "blockchain", a distributed ledger. Its current size is around 70 GB of binary data, and it has been growing every day since 2009.
Blockchain Size
Bitcoin Transaction Types
One-to-one transaction
Many-to-many transaction
Block
A block contains Bitcoin transactions. There are almost 400,000 blocks today. The blockchain contains all these blocks linked together like a doubly linked list.
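The linkage described above can be sketched in a few lines: each block holds a reference to its predecessor, and the node's index also tracks the successor. This is a minimal illustrative model, not the actual on-disk format (field names are assumptions):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    height: int                       # position in the chain
    tx_ids: List[str]                 # transactions in this block
    prev: Optional["Block"] = None    # link to the previous block
    next: Optional["Block"] = None    # link to the next block


def append_block(tip: Optional[Block], height: int, tx_ids: List[str]) -> Block:
    """Append a new block after the current tip, wiring links both ways."""
    block = Block(height, tx_ids, prev=tip)
    if tip is not None:
        tip.next = block
    return block


genesis = append_block(None, 0, ["coinbase_tx"])
tip = append_block(genesis, 1, ["tx_a", "tx_b"])
```

Walking `prev` pointers from the tip reaches the genesis block, which is how the full history stays verifiable.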
Data
● Historical Data
  ○ Almost 400,000 blocks (each minting new bitcoins)
  ○ More than 104 million transactions so far
● Live Data
  ○ About 2 transactions per second
  ○ Propagated through the peer-to-peer network
69 GB (2009-2016)
Query
The evolution of the Bitcoin transaction fee per block.
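A transaction's fee is not stored explicitly; it is the value of its inputs minus the value of its outputs, so computing the per-block fee means joining each input back to the output it spends. A minimal sketch over simplified transaction records (the field names here are illustrative, not the exact `getrawtransaction` schema):

```python
def block_fee(transactions, utxo):
    """Total fee in a block: sum over non-coinbase txs of (inputs - outputs).

    transactions: list of dicts with "vin" (list of (txid, vout) pairs
                  identifying spent outputs) and "vout" (output values in BTC).
    utxo:         maps (txid, vout) -> BTC value of the output being spent.
    """
    fee = 0.0
    for tx in transactions:
        if not tx["vin"]:          # coinbase tx has no real inputs
            continue
        spent = sum(utxo[prev] for prev in tx["vin"])
        created = sum(tx["vout"])
        fee += spent - created
    return fee


utxo = {("aa", 0): 1.0, ("bb", 1): 0.5}
txs = [
    {"vin": [], "vout": [25.0]},                     # coinbase, contributes no fee
    {"vin": [("aa", 0), ("bb", 1)], "vout": [1.4]},  # leaves ~0.1 BTC as fee
]
```

Plotting this quantity per block height over ~200,000 blocks gives the fee-evolution query the deck describes.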
Working with Data
● Run a full node locally on AWS => store the entire blockchain ledger on AWS.
● Query the blockchain via JSON-RPC in Python.
● Two RPC calls per block (~200,000 relevant blocks and 6.5 GB of text storage)
  ○ Average time per RPC call = 1.45 sec (a huge performance bottleneck; the workaround is to reduce this to one RPC call per block by storing all blocks in JSON format on disk/HDFS)
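The two per-block RPC calls can be sketched as plain JSON-RPC requests against a local Bitcoin Core node. This is a hedged sketch: the URL, port, and auth header are placeholders assuming a locally configured `bitcoind` with RPC enabled, and only the request-building part runs without a node:

```python
import json
import urllib.request

# Placeholder endpoint for a local Bitcoin Core node (assumption: bitcoind
# running with -server and rpcuser/rpcpassword configured).
RPC_URL = "http://127.0.0.1:8332"


def rpc_payload(method, params):
    """Build a JSON-RPC 1.0 request body in the shape Bitcoin Core expects."""
    return json.dumps({"jsonrpc": "1.0", "id": "datafy",
                       "method": method, "params": params})


def rpc_call(method, params, auth_header):
    """Perform one RPC call against the node (requires a running node)."""
    req = urllib.request.Request(
        RPC_URL,
        data=rpc_payload(method, params).encode(),
        headers={"Authorization": auth_header,
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

# The two calls the deck describes, issued once per block:
#   1. getblock <blockhash>          -> block JSON listing its txids
#   2. getrawtransaction <txid> 1    -> decoded transaction JSON
```

At 1.45 s per call, two calls per block is what motivates dumping the JSON to disk/HDFS once and processing it from there.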
Bitcoin Node ↔ App
1. get block RPC call → block JSON
2. get transaction RPC call → transaction JSON
Data Pipeline
Bitcoin Node → Ingestion (Netcat relay) → File System (local disk) → Batch / Stream processing → Database → Visualization
Accomplishments and Challenges
● Complex query (Bitcoin transaction fee evolution) working end to end
● Working with a sea of JSONs (2 JSONs per block) in Apache Spark is complex; it takes time to scale the results
● Ideally, the three modes (batch, streaming, and API) are compared for throughput, latency, and cost
● Public APIs have rate limits. After much searching, found the Toshi API (https://toshi.io), which has no rate limits
Comparison

Mode            # of processed blocks   Time (minutes)   Storage
RPC Batch       186,846                 162              Local File System
RPC Batch       186,846                 69               HDFS
RPC Streaming   187,990                 177              -
API Streaming   187,990                 222              -
API Batch       187,990                 3.1              HDFS

Storing data on HDFS pays off, with Spark processing taking only 3.1 minutes in API mode and 69 minutes in RPC mode (62 of those minutes are RPC-call overhead for get transaction).
Visualization
Zooming in to check for discontinuities
About Me
PhD in Computer Engineering: Parallel Computing & Computer Security
In love with Linux
Likes disruptive technology
Thank you + Q&A