Cooperative regenerating codes for distributed storage systems
Kenneth Shum (Joint work with Yuchong Hu)
22nd July 2011

Multiple node failures
• Large-scale storage system
– Google data center, example from Kannan’s talk.
– 800000 servers, fail rate = 4% per year
– Repair in 2 days
– Mean number of failed servers in 2 days = 175.
• The lazy-repair policy in TotalRecall
– A repair process is triggered only after the number of failed nodes has reached a certain threshold.

Jul, 2011 kshum 2

Jointly repair multiple failures
[Diagram: storage nodes and newcomers, with data exchange among the newcomers]
Can we further reduce the repair-bandwidth?
Hu et al. (JSAC, Feb 2010)

Distributed storage (erasure coding)
Wu, Dimakis ISIT09
[Diagram: a file of four packets A1, A2, B1, B2 stored on four nodes holding (A1, A2), (B1, B2), (A1+B1, 2A2+B2) and (2A1+B1, A2+B2); a data collector connects to the nodes]

Naive Repair
[Diagram: the four-node example; the failed node (A1, A2) is regenerated by a newcomer]
4 packets required.

Repair with "code alignment"
[Diagram: the four-node example; the failed node (A1, A2) is regenerated by a newcomer]
3 packets required.
Solve: P1 = A1 + 2A2, P2 = 2A1 + A2

Multiple failures, separate repair
[Diagram: the four-node example with two failed nodes, each repaired separately]
8 packets in total
4 packets per newcomer

Multiple failures, cooperative repair (I)
[Diagram: the four-node example with two failed nodes repaired jointly]
6 packets in total
3 packets per newcomer

Multiple failures, cooperative repair (II)
[Diagram: the four-node example with two failed nodes repaired jointly]
6 packets in total
3 packets per newcomer

Outline of the talk
• Is it optimal in terms of repair-bandwidth?
• What is the tradeoff between storage and repair-bandwidth for cooperative repair?
• Can we achieve the Pareto-optimal operating points on the tradeoff curve by linear network coding?
– Exact repair
– Functional repair

Information flow graph
[Diagram: information flow graph with source S, storage nodes In1/Out1 to In5/Out5, newcomers In6/Mid6/Out6 and In7/Mid7/Out7, and a data collector]

Is this regenerating code optimal?
[Diagram: the cooperative repair example]
6 packets in total
3 packets per newcomer

First cut
[Diagram: a cut in the information flow graph separating the source from the data collector]
$B \le 4\beta_1$

Second cut
[Diagram: a second cut in the information flow graph]
$B \le 2 + \beta_1 + \beta_2$

A linear programming problem
• Minimize $2\beta_1 + \beta_2$ (repair bandwidth)
• Subject to
$4 \le 4\beta_1$
$4 \le 2 + \beta_1 + \beta_2$
$\beta_1, \beta_2 \ge 0$
At least 3 packets

Non-homogeneous download traffic
[Diagram: the first cut, with link capacities a, b, c, d from the surviving nodes to the newcomers]
$B \le a + b + c + d$

Non-homogeneous traffic
[Diagram: the second cut, with link capacities e, f, g, h, i, j]
$B \le 2 + f + j$
$B \le 2 + h + i$
$B \le 2 + e + j$
$B \le 2 + g + i$

The same LP problem
• Same objective and constraints as the linear program above
• At least 3 packets

TRADEOFF BETWEEN STORAGE AND REPAIR-BANDWIDTH

Storage vs Repair-bandwidth
(S., ICC 2011; Kermarrec, Le Scouarnec and Straub, Netcod 2011.)
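The two end points of the storage versus repair-bandwidth tradeoff can be sketched numerically. The slides do not state closed-form expressions, so the formulas below are an assumption taken from the cooperative regenerating codes literature: at the minimum-storage point (MSCR) each node stores B/k and each newcomer downloads B(d+r-1)/(k(d+r-k)); at the minimum-bandwidth point (MBCR) storage and repair bandwidth are both B(2d+r-1)/(k(2d+r-k)). The function names are hypothetical. The values agree with the numbers quoted in the talk: 3 packets per newcomer for the four-node example (B = 4, k = d = r = 2), 5 packets for the MBCR example, and storage 2 with bandwidth 4 for the MSCR example.

```python
from fractions import Fraction


def mscr_point(B, k, d, r):
    """Assumed minimum-storage cooperative point.

    Returns (storage per node, repair bandwidth per newcomer) for file
    size B, k nodes per data collector, d helpers and r joint repairs.
    """
    alpha = Fraction(B, k)
    gamma = Fraction(B * (d + r - 1), k * (d + r - k))
    return alpha, gamma


def mbcr_point(B, k, d, r):
    """Assumed minimum-bandwidth cooperative point (storage == bandwidth)."""
    value = Fraction(B * (2 * d + r - 1), k * (2 * d + r - k))
    return value, value


if __name__ == "__main__":
    # Parameter sets that appear in the talk: (B, k, d, r).
    examples = {
        "four-node example": (4, 2, 2, 2),
        "tradeoff plot": (420, 4, 8, 3),
        "MBCR construction": (8, 2, 2, 2),
        "MSCR construction": (6, 3, 3, 2),
    }
    for label, params in examples.items():
        print(label, "MSCR:", mscr_point(*params), "MBCR:", mbcr_point(*params))
```

Exact rational arithmetic via `Fraction` avoids rounding when the operating points are not integers.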
[Figure: storage per node (100 to 140) versus repair bandwidth per failed node (120 to 180); file size = 420, d = 8, k = 4; curves: one-by-one repair, repairing 3 newcomers jointly]

Fair comparison?
repair degree = 8
• One-by-one repair: number of connections per newcomer = 8
• Cooperative repair: number of connections per newcomer = 8 + 2
[Diagram: surviving nodes and newcomers under each repair mode]

MBCR and MSCR
[Figure: storage per node (100 to 140) versus repair bandwidth per failed node (120 to 180); curves: one-by-one repair, cooperative repair; marked points: minimum-bandwidth cooperative repair (MBCR), minimum-storage cooperative repair (MSCR)]

How much can we improve?
[Figure: storage per node (450 to 500) versus repair bandwidth per failed node (480 to 550); file size = 2275, d = 30, k = 5; curves: one-by-one repair, repairing 10 newcomers jointly]
When d is large, joint repair does not have significant advantage over one-by-one repair.

How much can we improve?
[Figure: storage per node (150 to 200) versus repair bandwidth per failed node (180 to 260); file size = 616, d = 8, k = 4; curves: one-by-one repair, repairing 10 newcomers jointly]
Repair-bandwidth reduction is more prominent when d is not so large.

AN EXPLICIT CONSTRUCTION FOR MINIMUM-BANDWIDTH COOPERATIVE REPAIR

An explicit construction for MBCR
(S., Hu, ISIT 2011.)
Require d = k, r = n – d
• B = 8 information packets
• n = 4 nodes
• Each node stores 5 packets.
• Repair r = 2 failures simultaneously
• No. of connections for each DC = k = 2
• No. of helpers for each failed node = d = 2
• Minimum repair-bandwidth
• Storage per node

Min-Bandwidth point
[Figure: storage per node (3.5 to 6) versus repair bandwidth per failed node (5 to 9); repairing 2 new nodes cooperatively]

Data Distribution
8 data packets: A, B, C, D, E, F, G, H
Node 1: A, B, C, D, F+G
Node 2: C, D, E, F, H+A
Node 3: E, F, G, H, B+C
Node 4: G, H, A, B, D+E
5 packets per node: 4 systematic, 1 parity-check (XOR)

Data collection
[Diagram: a data collector connects to the storage nodes and recovers A, B, C, D, E, F, G, H]

Data collection
[Diagram: the data collector downloads A, B, C, D, F+G and C, D, E, F, H+A from two nodes and decodes A, B, C, D, E, F, G, H]

Exact Repair
How to repair?
[Diagram: the nodes storing A, B, C, D, F+G and E, F, G, H, B+C are lost and regenerated exactly; B+C and F+G are among the packets transferred]
Total repair-bandwidth = 10

Exact Repair
How to repair?
[Diagram: exact repair for another failure pattern]
Total repair-bandwidth = 10

Min-Bandwidth point
[Figure: storage per node (3.5 to 6) versus repair bandwidth per failed node (5 to 9); repairing 2 new nodes cooperatively]

AN EXPLICIT CONSTRUCTION FOR MINIMUM-STORAGE COOPERATIVE REPAIR

An explicit construction for MSCR
(S. ICC 2011.)
Require d = k
• B = 6 information packets
• n nodes
• Each node stores 2 packets.
• Repair r = 2 failures simultaneously
• No. of connections for each DC = k = 3
• No. of helpers for each failed node = d = 3
• Minimum repair-bandwidth
• Storage per node

The min-storage point
[Figure: storage per node (1 to 7) versus repair bandwidth per failed node (1 to 7); k = 3, d = 3, r = 2, B = 6; curves: non-cooperative, cooperative; marked point: storage cost per node = 2, repair bandwidth per node = 4]

Data retrieval
MDS code with dimension k = 3
[Diagram: source data → encode → two codewords → storage nodes (2 packets each) → data collector → decode]

Repair: phase 1
[Diagram: two storage nodes are lost; each newcomer downloads from the storage nodes and decodes]

Repair: phase 2
[Diagram: the newcomers re-encode and exchange packets]
Repair bandwidth per node = 8/2 = 4

The construction is optimal
[Figure: storage per node (1 to 7) versus repair bandwidth per failed node (1 to 7); k = 3, d = 3, r = 2, B = 6; curves: non-cooperative, cooperative; marked point: storage cost per node = 2, repair bandwidth per node = 4]

EXISTENCE OF COOPERATIVE REGENERATING CODES UNDER FUNCTIONAL REPAIR

Existence of optimal linear regenerating codes in general
(S., Hu, Netcod 2011.)
• Sustainable storage system
– Will it work after arbitrarily many repairs?
• Technical difficulty: The information flow graph is unbounded.
• Can we work over a fixed finite field, for unlimited number of regenerations?
– Yes if we can construct an exact regenerating code.
– The answer is also “yes” for cooperative functional repair in general.

Trellis structure
[Diagram: trellis with Stage 0, Stage 1, Stage 2, ...]
• $m$: message vector (row vector)
• $mT_0$, $mT_0T_1$, $mT_0T_1T_2$: vectors after stages 0, 1 and 2
• $T_0$ is the "transfer matrix" in stage 0
• $T_1$ is the "transfer matrix" in stage 1
• $T_2$ is the "transfer matrix" in stage 2

Flow in information flow graph
[Diagram: information flow graph with source S, nodes In1/Mid1/Out1 to In4/Mid4/Out4, a data collector DC, and flow values on the edges]
The cut-set bound says that the cut capacity is at least 8.
Can we construct a flow with value 8?

Cross-sectional flow pattern
[Diagram: the information flow graph with the flow values on each cross-section marked]

A recursive construction of flow
[Diagram: one segment of the graph between Stage s and Stage s+1, with incoming flows g1, g2, g3, g4 and outgoing flows h1, h2, h3, h4]
1. Identify a set of cross-section flow patterns, say H.
2. For any cross-section flow pattern (h1, h2, h3, h4) in H at stage s+1, we can find a flow in this segment of the graph such that (g1, g2, g3, g4) is also in H.
3. Each pattern corresponds to a submatrix of the transfer matrix.
4. By the Schwartz-Zippel lemma, we can find the local encoding vectors so that all such determinants are non-zero, if the finite field is sufficiently large.

Summary
• Multiple node failures in medium-scale to large-scale storage systems
• Formulation as a linear program
• Functional repair: a linear regenerating code over a fixed finite field that matches the cut-set bound on repair-bandwidth exists.
• Exact repair: two families of explicit code constructions
– Minimum-bandwidth point: d = k, r = n – d
– Minimum-storage point: d = k, r arbitrary

References
• Y. Wu and A. G. Dimakis, Reducing repair traffic for erasure coding-based storage via interference alignment, ISIT, Jul. 2009.
• Y. Hu, Y. Xu, X. Wang, C. Zhan and P. Li, Cooperative recovery of distributed storage systems from multiple losses with network coding, J. Sel. Area Comm., vol. 28, no. 2, pp. 268-275, Feb. 2010.
• K. W. Shum, Cooperative Regenerating Codes for Distributed Storage Systems, ICC, Jun. 2011.
• A.-M. Kermarrec, N. Le Scouarnec and G. Straub, Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes, Netcod, Jul. 2011.
• K. W. Shum and Y. Hu, Existence of Minimum-Repair-Bandwidth Cooperative Regenerating Codes, Netcod, Jul. 2011.
• K. W. Shum and Y. Hu, Exact Minimum-Repair-Bandwidth Cooperative Regenerating Codes for Distributed Storage Systems, ISIT, Aug. 2011.
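The XOR-based minimum-bandwidth construction from the Data Distribution slide (8 packets A to H, four nodes storing 5 packets each) can be checked mechanically. The sketch below represents every stored packet as a bitmask over GF(2), verifies that any k = 2 nodes hold 8 independent combinations, and replays one cooperative repair of the first and third nodes listed on that slide. All helper names are hypothetical, and the repair schedule is an assumption: it is one schedule consistent with the stated total repair-bandwidth of 10, not necessarily the exact transfers drawn in the slides.

```python
from itertools import combinations

PACKETS = "ABCDEFGH"


def vec(expr):
    """Encode a packet such as 'F+G' as a bitmask over GF(2)."""
    mask = 0
    for name in expr.split("+"):
        mask ^= 1 << PACKETS.index(name)
    return mask


# Node contents as listed on the Data Distribution slide (0-based indices here).
NODES = [[vec(p) for p in row.split()] for row in (
    "A B C D F+G",
    "C D E F H+A",
    "E F G H B+C",
    "G H A B D+E",
)]


def rank_gf2(vectors):
    """Rank of a list of bitmasks, by Gaussian elimination over GF(2)."""
    pivots = {}  # leading bit position -> basis vector with that leading bit
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)


def in_span(target, vectors):
    """True if target is an XOR of some of the given vectors."""
    return rank_gf2(list(vectors) + [target]) == rank_gf2(vectors)


def any_two_nodes_decode():
    """Data collection: every pair of nodes must determine all 8 packets."""
    return all(rank_gf2(NODES[i] + NODES[j]) == len(PACKETS)
               for i, j in combinations(range(len(NODES)), 2))


def cooperative_repair():
    """Assumed schedule for repairing nodes 0 and 2 from survivors 1 and 3."""
    # Phase 1: each newcomer downloads 2 packets from each of the d = 2 survivors.
    phase1 = {0: {1: "C D", 3: "A B"}, 2: {1: "E F", 3: "G H"}}
    # Phase 2: each newcomer receives 1 packet computed by the other newcomer.
    phase2 = {0: (2, "F+G"), 2: (0, "B+C")}

    received = {new: [] for new in phase1}
    for new, helpers in phase1.items():
        for helper, names in helpers.items():
            for packet in map(vec, names.split()):
                assert in_span(packet, NODES[helper])  # helper can produce it
                received[new].append(packet)
    after_phase1 = {new: list(pkts) for new, pkts in received.items()}
    for new, (sender, name) in phase2.items():
        packet = vec(name)
        assert in_span(packet, after_phase1[sender])  # sender can compute it
        received[new].append(packet)

    repaired = all(in_span(p, received[new])
                   for new in received for p in NODES[new])
    total = sum(len(pkts) for pkts in received.values())
    return repaired, total


if __name__ == "__main__":
    print("any two nodes decode:", any_two_nodes_decode())
    print("(exactly repaired, packets transferred):", cooperative_repair())
```

Each newcomer receives 5 packets in this schedule, which matches the per-node storage of 5 at the minimum-bandwidth point.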