1
Data Persistence in Large-scale Sensor Networks with Decentralized Fountain Codes
Yunfeng Lin, Ben Liang, Baochun Li
INFOCOM 2007
2
Outline
Introduction
Preliminaries
Persistent Data Access
  Two-way random walks
  EDFC and ADFC
Discussion of Multiple Encoded Blocks
Performance Evaluation
Conclusion
3
Introduction (1/5)
It has been a conventional assumption that measured data in individual sensors are gathered and processed at powered sinks, connected to the Internet via data aggregation.
This assumption may not realistically hold for large-scale sensor networks deployed in inaccessible geographical regions.
4
Introduction (2/5)
Our proposed vision: the sensors collaboratively store measured data over a historical period of time; at a later time of convenience, a collector collects the measured data directly from the sensors.
PUSH model: sensors send data periodically.
PULL model: sensors are passively polled by the collector.
5
Introduction (3/5)
We propose a novel decentralized implementation of fountain codes in sensor networks. Data can be encoded in a distributed fashion.
A sensor disseminates its data to a random subset of sensors in the network.
Each sensor only encodes data it has received. The collector is able to decode the original data by collecting a sufficient number of encoded data blocks.
6
Introduction (4/5)
Our decentralized implementation of fountain codes does not require the support of a generic routing layer: no routing tables or geographic routing protocols are needed. Random walks are used to disseminate data.
7
Introduction (5/5)
(Figure: sensing nodes disseminate their sensed data as source blocks to caching nodes, which store encoded blocks; even after some nodes fail, the collector can still decode the data from the surviving caching nodes.)
8
Preliminaries: Why Fountain Codes?
Replication (backup sensors): but a large number of replicas are required.
Error-correcting codes: implemented in a centralized fashion.
Random linear codes: decentralized, but the decoding process is computationally expensive, O(K³).
Fountain codes: low decoding complexity, O(K ln K), with superior decoding performance.
"Digital Fountain Codes V.S. Reed-Solomon Code For Streaming Applications", S. K. Chang
9
Preliminaries: LT Codes
In LT codes, K source blocks can be decoded from any subset of K + O(√K ln²(K/δ)) encoded blocks with probability 1 − δ.
Degree: the number of source blocks used to generate an encoded block.
The degree distribution of encoded blocks in LT codes follows the Robust Soliton distribution.
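The decoding process can be illustrated with a minimal peeling decoder in Python (a sketch for intuition, not the paper's implementation): repeatedly find an encoded block of degree one, recover its source block, and XOR it out of the remaining encoded blocks.

```python
def peel_decode(encoded):
    """Peeling decoder for LT codes.

    encoded: list of (source_id_set, xor_value) pairs, where xor_value is the
    bitwise XOR of the listed source blocks (integers for simplicity).
    Returns a dict source_id -> value for every source block it can recover.
    """
    blocks = [(set(ids), val) for ids, val in encoded]  # work on copies
    recovered = {}
    progress = True
    while progress:
        progress = False
        # A degree-one encoded block directly reveals one source block.
        for ids, val in blocks:
            if len(ids) == 1:
                (sid,) = ids
                if sid not in recovered:
                    recovered[sid] = val
                    progress = True
        # Substitute recovered source blocks into the remaining encoded blocks.
        remaining = []
        for ids, val in blocks:
            for sid in list(ids):
                if sid in recovered:
                    ids.discard(sid)
                    val ^= recovered[sid]
            if ids:
                remaining.append((ids, val))
        blocks = remaining
    return recovered
```

With the Robust Soliton degree distribution, any K + O(√K ln²(K/δ)) such blocks suffice with probability 1 − δ.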
10
Preliminaries: LT Codes
Ideal Soliton distribution:
ρ(i) = 1/K if i = 1
ρ(i) = 1/(i(i−1)) for i = 2, 3, ..., K
Let R = c ln(K/δ) √K.
Robust Soliton distribution:
τ(i) = R/(iK) for i = 1, ..., K/R − 1
τ(i) = R ln(R/δ)/K for i = K/R
τ(i) = 0 for i = K/R + 1, ..., K
μ(i) = (ρ(i) + τ(i)) / β, where β = Σ_{i=1}^{K} (ρ(i) + τ(i))
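The distribution can be computed directly from these formulas; a minimal Python sketch (rounding the spike index K/R to the nearest integer is an implementation assumption):

```python
import math

def robust_soliton(K, c, delta):
    """Return the Robust Soliton distribution as a list: mu[i-1] = mu(i)."""
    R = c * math.log(K / delta) * math.sqrt(K)
    # Ideal Soliton component rho(i).
    rho = [1.0 / K] + [1.0 / (i * (i - 1)) for i in range(2, K + 1)]
    # tau(i) component, with the spike at i = K/R.
    spike = int(round(K / R))
    tau = [0.0] * K
    for i in range(1, spike):
        tau[i - 1] = R / (i * K)
    tau[spike - 1] = R * math.log(R / delta) / K
    beta = sum(r + t for r, t in zip(rho, tau))  # normalization constant
    return [(r + t) / beta for r, t in zip(rho, tau)]
```

With the next slide's example parameters (K = 10000, c = 0.2, δ = 0.05) the spike lands at K/R = 41.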
11
Preliminaries: LT Codes
Example of the Robust Soliton distribution with K = 10000, c = 0.2, and δ = 0.05.
K/R = 41: spike!
The encoded blocks with a degree higher than K/R are not essential in decoding!
12
Preliminaries: Random Walks on Graphs
We describe random walks in the context of disseminating a source block. Each sensor is a node in the graph. The next hop is chosen at random from the neighbors of the current node. A random walk corresponds to a time-reversible Markov chain. In this paper, we choose a variant of the Metropolis algorithm: a generalization of natural random walks to a Markov chain with a non-uniform steady-state distribution.
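Given a transition matrix, disseminating a block with a fixed-length random walk can be sketched as follows (the dict-of-dicts matrix representation and fixed walk length are illustrative assumptions):

```python
import random

def random_walk(P, start, steps, rng=None):
    """Simulate one random walk of a fixed length over transition matrix P.

    P: dict node -> dict node -> probability (each row sums to 1).
    Returns the node where the walk ends (where the block would be cached).
    """
    rng = rng or random.Random()
    node = start
    for _ in range(steps):
        nxt, probs = zip(*P[node].items())
        node = rng.choices(nxt, weights=probs)[0]
    return node
```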
13
Preliminaries: Metropolis Algorithm
The Metropolis algorithm computes the transition matrix P for a desired steady-state distribution π = (π_1, π_2, ..., π_N).
N(i): neighbors of node i; M: maximal node degree in the graph.
P_ij = min(1, π_j/π_i) / M   if j ∈ N(i) and j ≠ i
P_ij = 0                     if j ∉ N(i) and j ≠ i
P_ij = 1 − Σ_{k≠i} P_ik      if j = i
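The rule above translates directly into code; a sketch assuming the graph is given as neighbor sets and π is the desired steady-state distribution:

```python
def metropolis_matrix(neighbors, pi):
    """Build the Metropolis transition matrix P for target steady-state pi.

    neighbors: dict node -> set of neighbor nodes (symmetric).
    pi: dict node -> desired steady-state probability.
    Returns P as a dict of dicts, P[i][j].
    """
    M = max(len(nbrs) for nbrs in neighbors.values())  # maximal node degree
    P = {i: {} for i in neighbors}
    for i in neighbors:
        off = 0.0
        for j in neighbors[i]:
            P[i][j] = min(1.0, pi[j] / pi[i]) / M
            off += P[i][j]
        P[i][i] = 1.0 - off  # self-loop absorbs the remaining probability
    return P
```

Detailed balance holds because π_i P_ij = min(π_i, π_j)/M = π_j P_ji, so π is indeed the stationary distribution.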
14
Persistent Data Access: Decentralized Fountain Codes
(Figure: K sensing nodes and N caching nodes. In the baseline two-way scheme, each caching node draws a degree d and requests d source blocks from sensing nodes via two-way random walks; the returned source blocks are combined into an encoded block.)
15
Persistent Data Access: Decentralized Fountain Codes
We seek to construct decentralized fountain codes with only one traversal of random walks, from the sensing nodes to the caching nodes. Cache nodes encode and store the source blocks; the collector decodes the source blocks.
We propose two heuristic algorithms, EDFC and ADFC, which guarantee the Robust Soliton distribution of LT codes.
16
Persistent Data Access: Exact Decentralized Fountain Codes
Because of the randomization introduced by random walks, the number of distinct source blocks received by a node is uncertain. We must disseminate more than d source blocks to each node of degree d.
Redundancy coefficient x_d ≥ 1: assume each node of degree d receives x_d · d blocks, chosen so that Pr(receive fewer than d distinct blocks) is small.
17
Persistent Data Access: Exact Decentralized Fountain Codes
The number of random walks:
b = N Σ_{d=1}^{K} x_d d μ(d)
Probabilistic forwarding tables (steady-state probability for a node of degree d):
π_d = x_d d / b = x_d d / (N Σ_{i=1}^{K} x_i i μ(i))
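These two quantities can be computed together; a small sketch (the μ and x_d values passed in are placeholders, not those from the paper's optimization):

```python
def edfc_walk_parameters(mu, x, N):
    """Number of random walks b and per-node absorption probability pi_d.

    mu[d-1] = mu(d): code-degree distribution; x[d-1] = redundancy
    coefficient x_d; N = number of nodes.
    """
    K = len(mu)
    b = N * sum(x[d - 1] * d * mu[d - 1] for d in range(1, K + 1))
    # Steady-state probability that one node of degree d absorbs a given walk.
    pi = [x[d - 1] * d / b for d in range(1, K + 1)]
    return b, pi
```

As a sanity check, the expected absorption over all N nodes sums to one: N Σ_d μ(d) π_d = 1.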
18
Persistent Data Access: Exact Decentralized Fountain Codes
(Figure: the K sensing nodes push their source blocks to the N caching nodes via one-way random walks driven by π_d, the forwarding table, and the number of random walks; each caching node of degree d encodes the blocks it keeps, and the collector decodes from the encoded blocks.)
19
Persistent Data Access: Exact Decentralized Fountain Codes
The steps of EDFC are:
Step 1. Degree generation: draw d from the Robust Soliton distribution.
Step 2. Compute the steady-state distribution π_d.
Step 3. Compute the probabilistic forwarding table by the Metropolis algorithm.
Step 4. Compute the number of random walks b.
Step 5. Block dissemination based on the probabilistic forwarding table.
Step 6. Encoding by bitwise XOR of a subset of d source blocks.
The source node IDs are attached to the encoded block!
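Step 6 can be sketched as follows (integer payloads and a uniform choice of the subset are illustrative assumptions):

```python
import random

def encode_block(received, d):
    """Encode one cache node's block (EDFC, Step 6): XOR a random subset of
    d of the distinct source blocks the node received.

    received: dict source_id -> block value (int payload for simplicity).
    Returns (chosen_ids, xor_value); the IDs travel with the encoded block.
    """
    chosen = random.sample(sorted(received), d)  # d distinct source blocks
    value = 0
    for sid in chosen:
        value ^= received[sid]
    return set(chosen), value
```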
20
Persistent Data Access: Exact Decentralized Fountain Codes
Overhead ratio:
g_1 = b / b_0 = Σ_{d=1}^{K} x_d d μ(d) / Σ_{d=1}^{K} d μ(d)
Violation probability Pr(Y < d | X = d): the probability that a node that chose code degree d (X = d) receives fewer than d distinct source blocks (Y < d). It is evaluated from the binomial distribution of the number of walks absorbed by the node (each of the b walks terminates there with probability π_d) together with a coupon-collector argument for how many of the absorbed blocks are distinct.
Optimization problem:
minimize g_1 = Σ_{d=1}^{K} x_d d μ(d) / Σ_{d=1}^{K} d μ(d)
subject to Pr(Y < d | X = d) ≤ δ_d and x_d ≥ 1, for d = 1, ..., K/R
Trade-off between coding performance and communication overhead.
21
Persistent Data Access: Exact Decentralized Fountain Codes
Solve the optimization problem with MATLAB. Parameter setting:
δ_d (constraints on the violation probabilities) = 0.05
N (the number of total nodes) = 2000
K (the number of sensing nodes) = 1000
c = 0.01, δ = 0.05
By further numerical computation, the overhead ratio g_1 = 1.4508.
22
Persistent Data Access: Approximate Decentralized Fountain Codes
Design a new distribution ν(·) to be the hypothetical chosen degree distribution, in an attempt to avoid the redundant random walks of EDFC.
Number of random walks:
b = N Σ_{d=1}^{K} d ν(d) = N E, where E = Σ_{i=1}^{K} i ν(i)
Steady-state distribution of the random walks:
π_d = d / (N Σ_{i=1}^{K} i ν(i)) = d / (N E)
23
Persistent Data Access: Approximate Decentralized Fountain Codes
Optimization problem:
minimize Σ_{i=1}^{K} (ν'(i) − μ(i))²
subject to Σ_{i=1}^{K} ν(i) = 1 and ν(i) ≥ 0 for i = 1, ..., K
where ν'(·) is the actual degree distribution of a node: with p the probability that any particular disseminated block is actually received,
Pr(Y ≥ 1 | X = d) = 1 − (1 − p)^d
ν'(d) = Σ_{d'=d}^{K} Pr(Y = d | X = d') Pr(X = d') = Σ_{d'=d}^{K} C(d', d) p^d (1 − p)^{d'−d} ν(d')
Goal: minimize the mean-square error between ν'(·) and μ(·).
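Under this binomial-thinning reading of ν'(·), with p the per-block retention probability, the actual distribution can be sketched as (a sketch of the model, not the paper's exact computation):

```python
from math import comb

def actual_degree_dist(nu, p):
    """Thin a chosen degree distribution nu(.) by per-block success
    probability p: a node that requests d' blocks holds Binomial(d', p)
    of them. nu[d-1] = nu(d); returns nu_prime[d-1] = nu'(d) for d >= 1."""
    K = len(nu)
    nu_prime = [0.0] * K
    for d in range(1, K + 1):
        for dp in range(d, K + 1):
            nu_prime[d - 1] += comb(dp, d) * p**d * (1 - p) ** (dp - d) * nu[dp - 1]
    return nu_prime
```

The missing mass Σ_{d'} (1 − p)^{d'} ν(d') corresponds to nodes that end up holding no blocks at all.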
24
Persistent Data Access: Approximate Decentralized Fountain Codes
The steps of ADFC are:
Step 1. Degree generation from the chosen degree distribution ν(·).
Step 2. Compute the steady-state distribution π_d.
Step 3. Compute the probabilistic forwarding table by the Metropolis algorithm.
Step 4. Compute the number of random walks b.
Step 5. Block dissemination based on the probabilistic forwarding table.
Step 6. Encoding by bitwise XOR of all received source blocks.
The source node IDs are attached to the encoded block!
25
Persistent Data Access: Approximate Decentralized Fountain Codes
Overhead ratio of ADFC (b: the number of random walks in ADFC; b_0: the number of random walks in the ideal algorithm):
g_2 = b / b_0 = Σ_{d=1}^{K} d ν(d) / Σ_{d=1}^{K} d μ(d)
By further numerical computation, the overhead ratio g_2 is only 0.2326. Less transmission cost is required. But...
26
Persistent Data Access Approximate Decentralized Fountain Codes
Parameter Setting N (number of total nodes) = 2000 K (number of sensing nodes) = 1000 c = 0.01, = 0.05 Robust Soliton distributionδ
chosen degree distribution actual degree distribution)(υ
inaccuracy!
27
Discussion of Multiple Encoded Blocks
Does it improve the coding performance if different encoded blocks are maintained?
(Figure: sensing nodes send source blocks to a cache node, which stores only encoded blocks, so it may lose some information.)
28
Discussion of Multiple Encoded Blocks
Theorem 2: When the code-degree distribution conforms to the Robust Soliton distribution, even if the source blocks on each node are not encoded, the collector must visit Ω(K) nodes in order to collect all source blocks with probability 1 − δ, where δ is a small positive number.
Y_{i,j} is a random variable that assumes the value 1 if source block j is collected when visiting the i-th node:
Pr(Y_{i,j} = 1) = Σ_{d=1}^{K} Pr(Y_{i,j} = 1 | X_i = d) Pr(X_i = d) = Σ_{d=1}^{K} (d/K) μ(d) = c ln(K/δ)/K
where Σ_{d=1}^{K} d μ(d) = c ln(K/δ) is the average degree of an encoded block [3].
29
Discussion of Multiple Encoded Blocks
Z_j assumes the value 1 if source block j is collected after visiting M nodes:
Z_j = 1 − Π_{i=1}^{M} (1 − Y_{i,j})
Pr(Z_j = 0) = Π_{i=1}^{M} Pr(Y_{i,j} = 0) = (1 − c ln(K/δ)/K)^M
Let E denote the event that all blocks are collected after visiting M nodes:
Pr(E) = Π_{j=1}^{K} Pr(Z_j = 1) = (1 − (1 − c ln(K/δ)/K)^M)^K
All blocks are collected with probability 1 − δ:
(1 − (1 − c ln(K/δ)/K)^M)^K ≥ 1 − δ
30
Discussion of Multiple Encoded Blocks
Apply logarithms to both sides:
K ln(1 − (1 − c ln(K/δ)/K)^M) ≥ ln(1 − δ)
By using similar approximations, we obtain M ≥ K/c, i.e., M = Ω(K).
The collector needs to visit Ω(K) nodes to collect all K source blocks.
31
Performance Evaluation
We implement both the original centralized and the proposed decentralized implementations of fountain codes, to evaluate their effectiveness and performance.
Centralized implementation of fountain codes: about 1000 lines of C++ code, with optimized implementations of the encoding and decoding algorithms.
Decentralized implementation of fountain codes: also simulated in C++.
32
Performance Evaluation
We use the two-dimensional geometric random graph as the topological model: N sensors are uniformly distributed on a unit disk, K sensing nodes are uniformly chosen among the N sensors, and each node has radio range r.
We set K = 10000, N = 20000, and r = 0.033 in most experiments; the average number of neighbors for each node is then 21.
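The topology model can be sketched as follows (uniform points on the unit disk, edges within radio range r; a sketch, not the paper's simulator, and the O(N²) edge loop is only meant for small instances):

```python
import math
import random

def geometric_random_graph(N, r, seed=0):
    """Place N nodes uniformly on the unit disk; connect pairs within distance r.
    Returns (positions, neighbors) with neighbors[i] a set of node indices."""
    rng = random.Random(seed)
    pos = []
    while len(pos) < N:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:  # rejection sampling inside the disk
            pos.append((x, y))
    neighbors = {i: set() for i in range(N)}
    for i in range(N):
        for j in range(i + 1, N):
            if math.dist(pos[i], pos[j]) <= r:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return pos, neighbors
```

The expected degree of an interior node is about N·r² (the fraction πr²/π of the unit disk's area), which for N = 20000 and r = 0.033 gives roughly 21 neighbors, matching the slide.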
33
Performance Evaluation: Communication Cost and Decoding Ratio
Two main performance metrics: communication cost and decoding ratio.
Communication cost: determined by the length and the number of random walks.
Decoding ratio: the number of nodes that need to be visited by a collector for decoding, normalized by the number of sensing nodes: a measure of fault tolerance!
34
Performance Evaluation: Communication Cost and Decoding Ratio
(Figure: the impact of the length of random walks on the decoding ratio. Each data point shows the average and the 95% confidence interval from 10 experiments.)
35
Performance Evaluation: Communication Cost and Decoding Ratio
(Figure: the ratio of the dissemination costs of EDFC and ADFC to that of the two-way algorithm.)
36
Performance Evaluation: Multiple Encoded Blocks Cannot Do Better
Theorem 2: Keeping multiple encoded blocks on each node does not offer any asymptotic performance advantage over keeping a single encoded block.
The collector needs to visit close to K nodes even if the source blocks are not encoded.
(Figure: the number of nodes to be visited before collecting all source blocks.)
37
Performance Evaluation: Overestimation of K and N
Sensor failures are common events in large-scale sensor networks.
It is not feasible to push updated values of K and N to all nodes in the network whenever they change, so K and N are updated only periodically, and each node may overestimate K and N.
38
Performance Evaluation: Overestimation of K and N
(Figure: the consequence of overestimating N, the number of total nodes; actual N = 20000.)
39
Performance Evaluation: Overestimation of K and N
(Figure: the impact of overestimating K, the number of sensing nodes; estimated K = 10000.)
EDFC is more robust!
40
Conclusion
In this paper, we seek to improve fault tolerance and data persistence in sensor networks through a decentralized implementation of fountain codes that disseminates the original data throughout the network with random walks.
Fountain codes retain their superior decoding performance and low decoding complexity as the number of nodes scales up.
The proposed algorithms provide near-optimal fault tolerance with minimal demand on local storage.