Mitigating Metastable Failures in Distributed Systems with Offline Reinforcement Learning


This paper introduces a load-shedding mechanism that mitigates metastable failures through offline reinforcement learning (RL). Previous studies have heavily focused on heuristics that are reactive and limited in generalization, while online RL algorithms face challenges in accurately simulating system dynamics and acquiring data with sufficient coverage. In contrast, our algorithm leverages offline RL to learn directly from existing log data. Through extensive empirical experiments, we demonstrate that our algorithm outperforms rule-based methods and supervised learning algorithms in a proactive, adaptive, generalizable, and safe manner. Deployed in a Java compute service with diverse execution times and configurations, our algorithm exhibits faster reaction time and attains the Pareto frontier between throughput and tail latency.

ICLR'2023 (Workshop)