Cooperative regenerating codes
for distributed storage systems
Kenneth Shum
(Joint work with Yuchong Hu)
22nd July 2011
Multiple node failures
• Large-scale storage system
– Google data center, example from Kannan’s talk.
– 800,000 servers, failure rate = 4% per year
– Repair in 2 days
– Mean number of failed servers in 2 days = 175.
• The lazy-repair policy in TotalRecall
– A repair process is triggered only after the number
of failed nodes has reached a certain threshold.
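The figure of 175 failed servers can be checked with a few lines of arithmetic. This is a minimal sketch that assumes failures are spread evenly over a 365-day year; that model and the variable names are assumptions, only the three input numbers come from the slide.

```python
# Expected number of server failures during one repair window.
# Inputs are the figures quoted on the slide; the even-spread model
# (365-day year, constant failure rate) is an assumption.
servers = 800_000
annual_failure_rate = 0.04
repair_window_days = 2

failures_per_year = servers * annual_failure_rate
mean_failed_in_window = failures_per_year * repair_window_days / 365

print(round(mean_failed_in_window))  # prints 175
```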
Jul, 2011 kshum 2
Jointly repair multiple failures
[Figure: storage nodes and newcomers, with data exchange.]
Can we further reduce the repair-bandwidth?
Hu et al. (JSAC, Feb 2010)
Distributed storage (erasure coding)
Wu, Dimakis ISIT09
[Figure: source data A1, A2, B1, B2 stored on four nodes and read by a data collector.]
Node 1: A1, A2
Node 2: B1, B2
Node 3: A1+B1, 2A2+B2
Node 4: 2A1+B1, A2+B2
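The four-node example can be checked mechanically: a data collector that reads any two nodes should see four linearly independent packets. The sketch below assumes arithmetic over the prime field GF(5), which the slide does not state, and uses a hypothetical helper `rank_mod_p`.

```python
from itertools import combinations

P = 5  # assumed prime field size; the slide does not name the field

# Each stored packet is a coefficient vector over (A1, A2, B1, B2).
NODES = {
    1: [(1, 0, 0, 0), (0, 1, 0, 0)],  # A1, A2
    2: [(0, 0, 1, 0), (0, 0, 0, 1)],  # B1, B2
    3: [(1, 0, 1, 0), (0, 2, 0, 1)],  # A1+B1, 2A2+B2
    4: [(2, 0, 1, 0), (0, 1, 0, 1)],  # 2A1+B1, A2+B2
}

def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p) by Gaussian elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)  # inverse by Fermat's little theorem
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                factor = rows[i][col]
                rows[i] = [(a - factor * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# A pair of nodes is sufficient when its four packets have rank 4.
recoverable = {
    pair: rank_mod_p(NODES[pair[0]] + NODES[pair[1]]) == 4
    for pair in combinations(NODES, 2)
}
print(all(recoverable.values()))  # prints True
```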
Naive Repair
[Figure: the node storing A1, A2 fails; a newcomer rebuilds A1, A2 from the surviving nodes (B1, B2), (A1+B1, 2A2+B2) and (2A1+B1, A2+B2).]
4 packets required.
Repair with "code alignment"
[Figure: the node storing A1, A2 fails; a newcomer rebuilds A1, A2 from the surviving nodes (B1, B2), (A1+B1, 2A2+B2) and (2A1+B1, A2+B2).]
3 packets required.
Solve:
P1 = A1 + 2A2
P2 = 2A1 + A2
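One way to arrive at the two equations P1 and P2 with only three downloads is sketched below. The choice of helper packets (each surviving node sends the sum of its two packets) and the field GF(5) are assumptions made for illustration; the slide only shows the resulting system.

```python
from itertools import product

P = 5  # assumed prime field; the slide does not state it

def repair_node1(B1, B2, node3, node4, p=P):
    """Rebuild (A1, A2) from three downloaded packets (assumed helper choice)."""
    from_node2 = (B1 + B2) % p               # interference term B1+B2
    from_node3 = (node3[0] + node3[1]) % p   # (A1+B1) + (2A2+B2)
    from_node4 = (node4[0] + node4[1]) % p   # (2A1+B1) + (A2+B2)
    P1 = (from_node3 - from_node2) % p       # A1 + 2*A2
    P2 = (from_node4 - from_node2) % p       # 2*A1 + A2
    # Solve the 2x2 system by Cramer's rule; the determinant is 1*1 - 2*2 = -3.
    inv_det = pow((1 * 1 - 2 * 2) % p, p - 2, p)
    A1 = ((P1 - 2 * P2) * inv_det) % p
    A2 = ((P2 - 2 * P1) * inv_det) % p
    return A1, A2

# Exhaustive check over every possible file (A1, A2, B1, B2).
ok = all(
    repair_node1(
        B1, B2,
        ((A1 + B1) % P, (2 * A2 + B2) % P),
        ((2 * A1 + B1) % P, (A2 + B2) % P),
    ) == (A1, A2)
    for A1, A2, B1, B2 in product(range(P), repeat=4)
)
print(ok)  # prints True
```

The determinant of the system is -3, so this particular repair needs a field in which 3 is non-zero; GF(3) would not do.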
Multiple failures, separate repair
[Figure: the nodes storing (B1, B2) and (2A1+B1, A2+B2) fail; each newcomer is repaired separately from the surviving nodes (A1, A2) and (A1+B1, 2A2+B2).]
8 packets in total, 4 packets per newcomer.
Multiple failures, cooperative repair (I)
[Figure: the same two failures; the newcomers download from the surviving nodes (A1, A2) and (A1+B1, 2A2+B2), and a link labelled B1, B2 appears on the newcomer side.]
6 packets in total, 3 packets per newcomer.
Multiple failures, cooperative repair (II)
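[Figure: the same two failures. Links from the surviving nodes are labelled A1, A1+B1 (to the first newcomer) and A2, 2A2+B2 (to the second newcomer); links between the newcomers are labelled 2A1+B1 and B2.]
6 packets in total, 3 packets per newcomer.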
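The six-packet scheme can be traced step by step. The sketch below follows one reading of the figure (which newcomer receives which packet is inferred from the link labels) and again assumes GF(5); the function name is hypothetical.

```python
from itertools import product

P = 5  # assumed prime field

def cooperative_repair(A1, A2, B1, B2, p=P):
    """Rebuild nodes (B1, B2) and (2A1+B1, A2+B2) with 6 transmissions."""
    node1 = (A1, A2)
    node3 = ((A1 + B1) % p, (2 * A2 + B2) % p)

    # Phase 1: each newcomer downloads one packet from each survivor (4 packets).
    x_A1, x_A1B1 = node1[0], node3[0]    # sent to newcomer X
    y_A2, y_2A2B2 = node1[1], node3[1]   # sent to newcomer Y

    # Local decoding at the newcomers.
    x_B1 = (x_A1B1 - x_A1) % p           # X recovers B1
    x_to_y = (2 * x_A1 + x_B1) % p       # X forms 2A1+B1 for Y
    y_B2 = (y_2A2B2 - 2 * y_A2) % p      # Y recovers B2, which it sends to X
    y_keep = (y_A2 + y_B2) % p           # Y forms A2+B2 for itself

    # Phase 2: the newcomers exchange one packet each (2 packets).
    new_node2 = (x_B1, y_B2)
    new_node4 = (x_to_y, y_keep)
    packets_sent = 4 + 2
    return new_node2, new_node4, packets_sent

# Exhaustive check over every possible file.
all_ok = all(
    cooperative_repair(A1, A2, B1, B2)
    == ((B1, B2), ((2 * A1 + B1) % P, (A2 + B2) % P), 6)
    for A1, A2, B1, B2 in product(range(P), repeat=4)
)
print(all_ok)  # prints True
```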
Outline of the talk
• Is it optimal in terms of repair-bandwidth?
• What is the tradeoff between storage and
repair-bandwidth for cooperative repair?
• Can we achieve the Pareto-optimal operating
points on the tradeoff curve by linear network
coding?
– Exact repair
– Functional repair
Information flow graph
[Figure: information flow graph with source S, vertices In_i and Out_i for i = 1, ..., 5, vertices In_j, Mid_j and Out_j for the newcomers j = 6, 7, and a data collector.]
Is this regenerating code optimal?
[Figure: cooperative repair (II) repeated.]
6 packets in total, 3 packets per newcomer.
First cut
[Figure: first cut in the information flow graph.]
$B \le 4\beta_1$
Second cut
[Figure: second cut in the information flow graph, with vertices In_i, Mid_i and Out_i for i = 1, ..., 4 and a data collector.]
$B \le 2 + \beta_1 + \beta_2$
A linear programming problem
• Minimize $2\beta_1 + \beta_2$ (repair bandwidth)
• Subject to
$4 \le 4\beta_1$
$4 \le 2 + \beta_1 + \beta_2$
$\beta_1, \beta_2 \ge 0$
At least 3 packets
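The bound of 3 packets can be reproduced by solving the linear program directly. The sketch below reads the two garbled unknowns as $\beta_1$ (packets sent by a surviving node to a newcomer) and $\beta_2$ (packets exchanged between the two newcomers); that naming is an assumption. Since the objective has positive coefficients and the variables are non-negative, the minimum is attained at a vertex of the feasible region, so it is enough to enumerate the vertices.

```python
from fractions import Fraction
from itertools import combinations

# Constraints in the form a1*beta_1 + a2*beta_2 >= rhs.
CONSTRAINTS = [
    ((4, 0), 4),  # 4 <= 4*beta_1               (first cut)
    ((1, 1), 2),  # 4 <= 2 + beta_1 + beta_2    (second cut)
    ((1, 0), 0),  # beta_1 >= 0
    ((0, 1), 0),  # beta_2 >= 0
]
OBJECTIVE = (2, 1)  # repair bandwidth 2*beta_1 + beta_2

def feasible(pt):
    return all(a[0] * pt[0] + a[1] * pt[1] >= rhs for a, rhs in CONSTRAINTS)

def vertices():
    """Intersect every pair of constraint boundaries and keep feasible points."""
    for (a, ra), (b, rb) in combinations(CONSTRAINTS, 2):
        det = a[0] * b[1] - a[1] * b[0]
        if det == 0:
            continue  # parallel boundaries
        x = Fraction(ra * b[1] - a[1] * rb, det)
        y = Fraction(a[0] * rb - ra * b[0], det)
        if feasible((x, y)):
            yield (x, y)

def cost(v):
    return OBJECTIVE[0] * v[0] + OBJECTIVE[1] * v[1]

best = min(vertices(), key=cost)
best_value = cost(best)
print(best, best_value)
```

The minimum is reached at $\beta_1 = \beta_2 = 1$ with value 3, matching the six-packet scheme (3 packets per newcomer).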
Non-homogeneous download traffic
[Figure: first cut, with unequal link labels a, b, c, d on the links into the newcomers.]
$B \le a + b + c + d$
Non-homogeneous traffic
[Figure: second cut in the information flow graph, with unequal link labels e, f, g, h, i, j; built up over four slides, adding one bound per slide.]
$B \le 2 + f + j$
$B \le 2 + h + i$
$B \le 2 + e + j$
$B \le 2 + g + i$
The same LP problem
• Minimize
• Subject to
At least 3 packets
TRADEOFF BETWEEN
STORAGE AND REPAIR-BANDWIDTH
Storage vs Repair-bandwidth (S., ICC 2011; Kermarrec, Le Scouarnec and Straub, Netcod 2011.)
[Plot: storage per node (100 to 140) against repair bandwidth per failed node (120 to 180) for file size = 420, d = 8, k = 4. Two curves: one-by-one repair, and repairing 3 newcomers jointly.]
Fair comparison? (repair degree = 8)
[Figure, two panels, each showing surviving nodes and newcomers.]
One-by-one repair: number of connections per newcomer = 8
Cooperative repair: number of connections per newcomer = 8 + 2
MBCR and MSCR
[Plot: storage per node (100 to 140) against repair bandwidth per failed node (120 to 180), with curves for one-by-one repair and cooperative repair. Two points are marked: minimum bandwidth cooperative repair (MBCR) and minimum storage cooperative repair (MSCR).]
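The two marked points have closed-form coordinates in the cooperative regenerating code literature cited in this talk. The expressions below are not printed on the slides; they are quoted from memory of those papers and should be treated as an assumption. They do agree with the two explicit constructions presented in this talk (B = 8, d = k = 2, r = 2 for MBCR and B = 6, d = k = 3, r = 2 for MSCR). Function names are hypothetical.

```python
from fractions import Fraction

def mscr_point(B, k, d, r):
    """(storage per node, repair bandwidth per newcomer) at minimum storage.

    Assumed formulas: alpha = B/k, gamma = B(d+r-1) / (k(d+r-k)).
    """
    alpha = Fraction(B, k)
    gamma = Fraction(B * (d + r - 1), k * (d + r - k))
    return alpha, gamma

def mbcr_point(B, k, d, r):
    """(storage per node, repair bandwidth per newcomer) at minimum bandwidth.

    Assumed formula: alpha = gamma = B(2d+r-1) / (k(2d+r-k)).
    """
    gamma = Fraction(B * (2 * d + r - 1), k * (2 * d + r - k))
    return gamma, gamma

# Parameters of the plot: file size 420, d = 8, k = 4, 3 newcomers.
print(mscr_point(420, 4, 8, 3), mbcr_point(420, 4, 8, 3))
# Setting r = 1 gives the one-by-one repair end points.
print(mscr_point(420, 4, 8, 1), mbcr_point(420, 4, 8, 1))
```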
How much can we improve?
[Plot: storage per node (450 to 500) against repair bandwidth per failed node (480 to 550) for file size = 2275, d = 30, k = 5. Two curves: one-by-one repair, and repairing 10 newcomers jointly.]
When d is large, joint repair does not have a significant advantage over one-by-one repair.
How much can we improve?
[Plot: storage per node (150 to 200) against repair bandwidth per failed node (180 to 260) for file size = 616, d = 8, k = 4. Two curves: one-by-one repair, and repairing 10 newcomers jointly.]
Repair-bandwidth reduction is more prominent when d is not so large.
AN EXPLICIT CONSTRUCTION FOR
MINIMUM-BANDWIDTH
COOPERATIVE REPAIR
An explicit construction for MBCR
(S., Hu, ISIT 2011.)
Require d = k, r = n – d
• B = 8 information packets
• n = 4 nodes
• Each node stores 5 packets.
• Repair r = 2 failures simultaneously
• No. of connections for each DC = k = 2
• No. of helpers for each failed node = d = 2
[Inset plot labels: minimum repair-bandwidth, storage per node.]
Min-Bandwidth point
[Plot: storage per node (3.5 to 6) against repair bandwidth per failed node (5 to 9), repairing 2 new nodes cooperatively.]
Data Distribution
8 data packets: A, B, C, D, E, F, G, H
Node 1: A, B, C, D, F+G
Node 2: C, D, E, F, H+A
Node 3: E, F, G, H, B+C
Node 4: G, H, A, B, D+E
(+ denotes XOR)
5 packets per node: 4 systematic, 1 parity-check
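Whether a data collector can rebuild the file from any k = 2 nodes is a rank question over GF(2). The sketch below encodes every stored packet as a bitmask over the eight data packets and checks all six pairs of nodes; the helper names are hypothetical.

```python
from itertools import combinations

SYMBOLS = "ABCDEFGH"

def packet(expr):
    """Encode e.g. 'F+G' as a bitmask over the 8 data packets (+ is XOR)."""
    mask = 0
    for name in expr.split("+"):
        mask ^= 1 << SYMBOLS.index(name)
    return mask

# Node contents as listed on the data-distribution slide.
NODES = [
    [packet(e) for e in "A B C D F+G".split()],
    [packet(e) for e in "C D E F H+A".split()],
    [packet(e) for e in "E F G H B+C".split()],
    [packet(e) for e in "G H A B D+E".split()],
]

def rank_gf2(vectors):
    """Rank over GF(2), keeping one reduced vector per leading bit."""
    pivots = {}
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)

pair_ranks = {
    (i + 1, j + 1): rank_gf2(NODES[i] + NODES[j])
    for i, j in combinations(range(4), 2)
}
print(pair_ranks)
```

Every pair reaches rank 8, so any two nodes hold enough information to recover all eight packets.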
Data collection
[Figure: the four storage nodes and a data collector that recovers A, B, C, D, E, F, G, H.]
Data collection
[Figure: the data collector downloads A, B, C, D, E, F, F+G and H+A, and from them recovers A, B, C, D, E, F, G, H.]
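Exact Repair: how to repair?
[Figure: the nodes storing (A, B, C, D, F+G) and (E, F, G, H, B+C) are rebuilt from the nodes storing (C, D, E, F, H+A) and (G, H, A, B, D+E); the links between the newcomers are labelled B+C and F+G.]
Total repair-bandwidth = 10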
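The count of 10 packets can be reproduced by simulating the repair. The sketch below follows one reading of the figure: each newcomer downloads its four systematic packets from the two surviving nodes, then computes and sends the parity packet that the other newcomer needs. Which packets travel over which link is an assumption, and the function names are hypothetical.

```python
import random

def build_nodes(data):
    """Node contents from the data-distribution slide; '+' is XOR (^)."""
    A, B, C, D, E, F, G, H = data
    return [
        [A, B, C, D, F ^ G],
        [C, D, E, F, H ^ A],
        [E, F, G, H, B ^ C],
        [G, H, A, B, D ^ E],
    ]

def repair_nodes_1_and_3(nodes):
    """Rebuild nodes 1 and 3 (list indices 0 and 2) from nodes 2 and 4."""
    node2, node4 = nodes[1], nodes[3]
    sent = 0

    # Phase 1: each newcomer downloads 2 packets from each helper.
    C, D = node2[0], node2[1]   # node 2 -> newcomer 1
    A, B = node4[2], node4[3]   # node 4 -> newcomer 1
    sent += 4
    E, F = node2[2], node2[3]   # node 2 -> newcomer 3
    G, H = node4[0], node4[1]   # node 4 -> newcomer 3
    sent += 4

    # Phase 2: each newcomer sends one parity packet to the other.
    b_plus_c = B ^ C            # newcomer 1 -> newcomer 3
    f_plus_g = F ^ G            # newcomer 3 -> newcomer 1
    sent += 2

    new1 = [A, B, C, D, f_plus_g]
    new3 = [E, F, G, H, b_plus_c]
    return new1, new3, sent

rng = random.Random(0)
nodes = build_nodes([rng.randrange(256) for _ in range(8)])
new1, new3, sent = repair_nodes_1_and_3(nodes)
print(new1 == nodes[0], new3 == nodes[2], sent)  # prints True True 10
```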
Exact Repair: how to repair?
[Figure: the nodes storing (C, D, E, F, H+A) and (E, F, G, H, B+C) are rebuilt from the nodes storing (A, B, C, D, F+G) and (G, H, A, B, D+E). Link labels include C, D, D+E, H+A, F+G, E and F.]
Total repair-bandwidth = 10
AN EXPLICIT CONSTRUCTION FOR
MINIMUM-STORAGE COOPERATIVE
REPAIR
An explicit construction for MSCR (S., ICC 2011.)
Require d = k
• B = 6 information packets
• n nodes
• Each node stores 2 packets.
• Repair r = 2 failures simultaneously
• No. of connections for each DC = k = 3
• No. of helpers for each failed node = d = 3
[Inset plot labels: minimum repair-bandwidth, storage per node.]
The min-storage point
[Plot: storage per node against repair bandwidth per failed node (1 to 7) for k = 3, d = 3, r = 2, B = 6, with curves for non-cooperative and cooperative repair. Marked point: storage cost per node = 2, repair bandwidth per node = 4.]
Data retrieval
MDS code with dimension k = 3
[Figure: the source data is encoded into two codewords, which are spread over the storage nodes (2 packets per node); a data collector decodes.]
Repair : phase 1
[Figure: two storage nodes are lost; each of the two newcomers downloads from the surviving storage nodes and decodes.]
Repair: phase 2
[Figure: each newcomer re-encodes, and the two newcomers exchange packets.]
Repair bandwidth per node = 8/2 = 4
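The two repair phases can be sketched with a concrete MDS code. The example below uses Reed-Solomon codewords over GF(7) with n = 5 nodes as a stand-in; the field, the value of n, the evaluation points and all function names are assumptions, while k = 3, r = 2 and B = 6 come from the slides. Node i stores the i-th symbol of each of the two codewords.

```python
P = 7                  # assumed prime field with at least N non-zero points
N, K, R = 5, 3, 2      # N is assumed; K and R are from the slides
POINTS = list(range(1, N + 1))  # evaluation point of node i is POINTS[i]

def poly_eval(coeffs, x, p=P):
    """Evaluate a polynomial (lowest degree first) at x over GF(p)."""
    y = 0
    for c in reversed(coeffs):
        y = (y * x + c) % p
    return y

def interpolate_at(samples, x, p=P):
    """Lagrange-interpolate K samples [(xi, yi)] and evaluate at x."""
    total = 0
    for xi, yi in samples:
        num, den = 1, 1
        for xj, _ in samples:
            if xj != xi:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p
    return total

def encode(data):
    """Split B = K*R packets into R messages; node i keeps one symbol of each."""
    messages = [data[j * K:(j + 1) * K] for j in range(R)]
    return [[poly_eval(m, x) for m in messages] for x in POINTS]

def cooperative_repair(nodes, failed):
    """Repair exactly R failed nodes; returns rebuilt contents and packet count."""
    helpers = [i for i in range(N) if i not in failed][:K]
    rebuilt = {f: [None] * R for f in failed}
    sent = 0
    for j, f in enumerate(failed):
        # Phase 1: newcomer f downloads symbol j of codeword j from K helpers.
        samples = [(POINTS[h], nodes[h][j]) for h in helpers]
        sent += K
        # Phase 2: it re-encodes codeword j for every failed node and
        # sends the symbols that belong to the other newcomers.
        for g in failed:
            rebuilt[g][j] = interpolate_at(samples, POINTS[g])
            if g != f:
                sent += 1
    return rebuilt, sent

nodes = encode([1, 2, 3, 4, 5, 6])
rebuilt, sent = cooperative_repair(nodes, failed=[0, 4])
print(rebuilt[0] == nodes[0], rebuilt[4] == nodes[4], sent)  # prints True True 8
```

Eight packets for two newcomers gives the 8/2 = 4 packets per node quoted on the slide.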
The construction is optimal
[Plot: storage per node against repair bandwidth per failed node (1 to 7) for k = 3, d = 3, r = 2, B = 6, with curves for non-cooperative and cooperative repair. Marked point: storage cost per node = 2, repair bandwidth per node = 4.]
EXISTENCE OF COOPERATIVE
REGENERATING CODES UNDER
FUNCTIONAL REPAIR
Existence of optimal linear
regenerating codes in general
(S., Hu, Netcod 2011.)
• Sustainable storage system
– Will it work after arbitrarily many repairs?
• Technical difficulty: The information flow
graph is unbounded.
• Can we work over a fixed finite field, for an
unlimited number of regenerations?
– Yes if we can construct an exact regenerating code.
– The answer is also “yes” for cooperative functional
repair in general.
Trellis structure
[Figure: trellis with stage 0, stage 1, stage 2, ...]
Message vector (row vector): $m$
After stage 0: $mT_0$; after stage 1: $mT_0T_1$; after stage 2: $mT_0T_1T_2$
$T_s$ is the "transfer matrix" in stage $s$.
Flow in information flow graph
[Figure: information flow graph from the source S through vertices Out_i, In_i, Mid_i, Out_i (i = 1, ..., 4) to a data collector DC, with numbers marked on the links.]
The cut-set bound says that the cut capacity is at least 8.
Can we construct a flow with value 8?
Cross-sectional flow pattern
[Figure: information flow graph annotated with flow values on its links.]
A recursive construction of flow
[Figure: one segment of the information flow graph between stage s and stage s+1, with incoming flow pattern (g1, g2, g3, g4) and outgoing flow pattern (h1, h2, h3, h4).]
1. Identify a set of cross-section flow patterns, say H.
2. For any cross-section flow pattern (h1, h2, h3, h4) in H at stage s+1, we can find a flow in this segment of the graph such that (g1, g2, g3, g4) is also in H.
3. Each pattern corresponds to a submatrix of the transfer matrix.
4. By the Schwartz-Zippel lemma, we can find the local encoding vectors so that all such determinants are non-zero, if the finite field is sufficiently large.
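The role of the field size in step 4 can be illustrated on the smallest case. A 2x2 determinant is a polynomial of degree 2 in the matrix entries, so the Schwartz-Zippel lemma bounds the probability that it vanishes at a uniformly random point of GF(p) by 2/p. The sketch below is a toy stand-in for the determinants in the construction, not the construction itself; it counts by brute force and compares with the bound.

```python
from fractions import Fraction
from itertools import product

def nonzero_det_fraction(p):
    """Exact fraction of 2x2 matrices over GF(p) with non-zero determinant."""
    nonzero = sum(
        1 for a, b, c, d in product(range(p), repeat=4) if (a * d - b * c) % p
    )
    return Fraction(nonzero, p ** 4)

def schwartz_zippel_lower_bound(p, degree=2):
    """Lower bound on Pr[polynomial != 0]: 1 - degree / field size."""
    return max(Fraction(0), 1 - Fraction(degree, p))

for p in (2, 3, 5, 7, 11):
    print(p, nonzero_det_fraction(p), schwartz_zippel_lower_bound(p))
```

As p grows, both the exact fraction and the bound approach 1, which is why a sufficiently large field always admits a good choice of coefficients.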
Summary
• Multiple node failures in medium-scale to
large-scale storage system
• Formulation as a linear program
• Functional repair: a linear regenerating code
over a fixed finite field that matches the cut-set
bound on repair-bandwidth exists.
• Exact repair: two families of explicit code
constructions
– Minimum-bandwidth point: d=k, r = n – d
– Minimum-storage point: d=k, r arbitrary
References
• Y. Wu and A. G. Dimakis, Reducing repair traffic for erasure coding-based storage
via interference alignment, ISIT, Jul, 2009.
• Y. Hu, Y. Xu, X. Wang, C. Zhan and P. Li, Cooperative recovery of distributed storage
systems from multiple losses with network coding, J. Sel. Area Comm., vol. 28, no.
2, pp.268-275, Feb, 2010.
• K. W. Shum, Cooperative Regenerating Codes for Distributed Storage Systems, ICC,
Jun, 2011.
• A.-M. Kermarrec and N. Le Scouarnec and G. Straub, Repairing Multiple Failures
with Coordinated and Adaptive Regenerating Codes, Netcod, Jul, 2011.
• K. W. Shum and Y. Hu, Existence of Minimum-Repair-Bandwidth Cooperative
Regenerating Codes, Netcod, Jul, 2011.
• K. W. Shum and Y. Hu, Exact Minimum-Repair-Bandwidth Cooperative
Regenerating Codes for Distributed Storage Systems, ISIT, Aug, 2011.