I. INTRODUCTION

Failures are prevalent in large-scale storage clusters and manifest in disks and various storage components [10], [15], [30], [36]. Field studies report that disk replacements in production are more frequent than estimated by vendors [30], [36], and latent sector errors are commonly found in modern disks [4]. To maintain data availability guarantees in the face of failures, practical storage clusters often stripe data with redundancy across multiple nodes via either replication or erasure coding. Replication creates identical data copies and is adopted by earlier generations of storage clusters, yet it incurs substantial storage overhead, especially with today's tremendous growth of data storage. In contrast, erasure coding creates a limited amount of redundant data through coding computations, and provably maintains the same level of fault tolerance with much less storage redundancy than replication [41]. Today's large-scale storage clusters increasingly adopt erasure coding to provide low-cost fault-tolerant storage (e.g., [2], [10], [13], [26], [28]), and reportedly save petabytes of storage compared to replication [13], [26].

While storage-efficient, erasure coding incurs a high repair penalty. As an example, consider Reed-Solomon (RS) codes [34], a popular erasure coding construction used in production [2], [10], [26], [28]. At a high level, RS codes encode k data chunks into n coded chunks for some parameters k and n > k, such that any k out of the n coded chunks can reconstruct (or decode) all original k data chunks. However, repairing a lost chunk of RS codes requires retrieving k available chunks for decoding, implying that both the bandwidth and I/O costs of a single-chunk repair are amplified k times; in contrast, under replication, repairing a lost chunk can be done simply by retrieving another available chunk copy.

The high repair penalty is a fundamental issue in all erasure coding constructions: the repair traffic increases as the storage redundancy decreases [8]. Thus, there have been extensive studies on improving the repair performance of erasure coding, such as theoretically proven erasure codes that minimize the repair traffic or I/Os during repair (e.g., [8], [13], [35]), or repair-efficient techniques that apply to all practical erasure codes including RS codes (e.g., [5], [20], [21], [24], [37], [38]). Conventional repair approaches are reactive, meaning that a repair operation is triggered only after a node failure is detected. However, if we can predict impending failures in advance, we may proactively repair the data of a node with an impending failure, thereby mitigating the repair penalty before the actual failure occurs.

Recent studies show that machine learning can achieve accurate prediction of disk failures in production environments with thousands of disks [6], [18], [23], [42], [43], [45]; in some cases, the prediction accuracy can reach at least 95% [6], [18], [23], [45] with a very small false alarm rate. Motivated by the potential of highly accurate disk failure prediction, we can accurately pinpoint a soon-to-fail (STF) node and accelerate a repair operation by coupling two repair methods: (i) migration, in which we relocate the currently stored chunks of the STF node to other healthy nodes, and (ii) reconstruction, in which we reconstruct (or decode) the chunks of the STF node by retrieving chunks from the healthy nodes in the storage cluster, as in conventional reactive repair approaches. Migration addresses the bandwidth and I/O amplification issues inherent in erasure coding, while reconstruction exploits the aggregate bandwidth resources of all healthy nodes. An open question is how to carefully couple migration and reconstruction so as to maximize the repair performance; the simplified estimate below illustrates the trade-off.
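To make the trade-off concrete, the following sketch gives a back-of-the-envelope estimate of the total repair time when m of the STF node's chunks are migrated and the rest are reconstructed. It is a deliberately simplified model rather than the analysis developed in this paper: it assumes every node has the same effective bandwidth, that migration is bottlenecked by the STF node's uplink, that reconstruction traffic is spread evenly across the healthy nodes, and that the two streams proceed in parallel without interfering; all parameter values (chunk count, chunk size, node count, bandwidth) are illustrative.

    # Back-of-the-envelope estimate (illustrative parameters, not the model
    # analyzed in this paper): m chunks of the STF node are migrated and the
    # remaining C - m chunks are reconstructed, with both streams in parallel.
    C = 128          # chunks stored on the soon-to-fail (STF) node
    K = 6            # RS codes: K helper chunks are read to rebuild one lost chunk
    CHUNK_MB = 256   # chunk size (MB)
    N_HEALTHY = 30   # healthy nodes acting as helpers/requestors
    BW = 100         # effective per-node bandwidth (MB/s)

    def repair_time(m):
        """Estimated repair time (seconds) when m chunks are migrated."""
        t_mig = m * CHUNK_MB / BW                          # migrated data leaves via the STF node's uplink
        t_rec = (C - m) * K * CHUNK_MB / (N_HEALTHY * BW)  # K-fold traffic, spread over healthy nodes
        return max(t_mig, t_rec)                           # the two streams run concurrently

    best_m = min(range(C + 1), key=repair_time)
    print(f"migration-only:      {repair_time(C):6.1f} s")
    print(f"reconstruction-only: {repair_time(0):6.1f} s")
    print(f"coupled (m={best_m}):      {repair_time(best_m):6.1f} s")

Under these illustrative numbers, migration alone is throttled by the STF node's single uplink, reconstruction alone pays the k-fold traffic amplification, and splitting the chunks between the two methods yields the smallest estimated repair time, which motivates the careful coupling pursued below.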
We present FastPR, a Fast Predictive Repair approach that carefully couples the migration and reconstruction of the chunks of the STF node, with the primary objective of minimizing the total repair time. FastPR schedules both the migration and the reconstruction of the STF node's chunks in a parallel fashion, so as to exploit the available bandwidth resources of the underlying storage cluster. We address two repair scenarios: scattered repair, which stores the repaired chunks of the STF node across all other existing nodes in the storage cluster, and hot-standby repair, which stores the repaired chunks of the STF node in dedicated hot-standby nodes. We present an in-depth study of FastPR through mathematical analysis, large-scale simulation, and Amazon EC2 experiments, and make the following contributions:

• We first present a mathematical analysis of the optimal predictive repair that minimizes the total repair time. We show