ARIMA, ETS, and Prophet shine with clear seasonality and limited features. Tree ensembles handle heterogeneous signals and nonlinearities, and make strong baselines. LSTMs and Transformers capture long-range dependencies when sequential structure dominates. We outline training cadence, hyperparameter routines, and debuggability, prioritizing stability, transparent failure modes, and predictable performance rather than chasing leaderboard victories that crumble under drift.
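Before reaching for any of these models, it helps to have a dumb-but-honest baseline to beat. A minimal sketch, with hypothetical daily CPU data, of the seasonal-naive forecast that every candidate model should outperform:

```python
# Seasonal-naive baseline: forecast each point as the value from the
# same position one season earlier. Data below is hypothetical.
def seasonal_naive(history, season, horizon):
    """Repeat the last full season to produce the next `horizon` points."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]

daily_cpu = [50, 55, 70, 90, 80, 60, 52]  # one week of daily peaks
print(seasonal_naive(daily_cpu, season=7, horizon=3))  # → [50, 55, 70]
```

If a Transformer cannot beat this on held-out weeks, the sequence model is adding risk, not signal.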
Point forecasts ignore risk. Quantile regression, conformal prediction, and Bayesian approaches yield uncertainty bands. Translate the P90 or P95 into preemptive headroom for spiky services, while letting smoother workloads ride closer to median. This lets operations encode risk appetite explicitly, aligning capacity with error bars rather than wishful thinking or folklore-driven safety margins.
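One lightweight way to turn a point forecast into headroom is split conformal prediction: take an empirical quantile of held-out forecast residuals and add it on top of the point estimate. A minimal sketch, with hypothetical calibration residuals in vCPUs:

```python
import math

def conformal_headroom(residuals, coverage=0.95):
    """Split-conformal style headroom: the empirical `coverage` quantile
    of absolute calibration residuals, added on top of a point forecast."""
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    # conformal rank: ceil((n + 1) * coverage), clipped to n
    k = min(n, math.ceil((n + 1) * coverage))
    return scores[k - 1]

# Hypothetical residuals (actual - forecast) from a calibration window
resid = [-3, 1, 4, -2, 6, 0, 2, -5, 3, 1]
point_forecast = 120
provision = point_forecast + conformal_headroom(resid, coverage=0.9)
print(provision)  # → 126
```

A spiky service gets a wide band and more headroom automatically; a smooth one earns a tight band and rides near its median, which is exactly the risk-appetite knob the text describes.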
New services lack history, while mature ones drift with product changes. Bootstrapping from analog services, hierarchical pooling, and transfer learning helps when data are scarce. Online learning, sliding windows, and decay factors adapt to evolving patterns. We show how to detect regime shifts early and recover gracefully without overreacting to transient anomalies.
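For regime-shift detection, a one-sided CUSUM over forecast inputs is a simple starting point: it accumulates sustained deviations above a reference level and flags only when they persist, so transient anomalies are ignored. A minimal sketch with hypothetical values and tuning constants `k` (slack) and `h` (threshold):

```python
def cusum_detect(series, target, k=0.5, h=5.0):
    """One-sided CUSUM: accumulate deviations above target + k and
    flag a regime shift once the running sum exceeds threshold h.
    Returns the index where the shift is flagged, or None."""
    s = 0.0
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target - k))  # single spikes decay back to 0
        if s > h:
            return i
    return None

stable = [10.0, 10.4, 9.8, 10.1]
shifted = stable + [13.0, 13.5, 14.0, 13.8]  # hypothetical step change
print(cusum_detect(shifted, target=10.0))  # → 5
```

Because `s` resets toward zero, one bursty sample does not trip the detector; only a sustained shift does, which is the "recover gracefully without overreacting" behavior the paragraph calls for.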
Common pitfalls include feedback loops that chase noise, runaway scale-ins, and blind faith in a single metric. Build hysteresis, minimum floors, and backpressure-aware targets into your system. Test chaos scenarios, simulate bursty surprises, and practice disaster drills. Recovery should be boring, scripted, and measurable, turning near-misses into documentation and safer defaults for tomorrow.
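Hysteresis and floors can be as simple as an asymmetric dead band around the current replica count: scale up on a small gap, scale down only on a large one, and never drop below the floor. A minimal sketch; the thresholds are illustrative, not recommendations:

```python
def decide_scale(current, desired, floor, up_pct=0.10, down_pct=0.30):
    """Asymmetric hysteresis for replica counts: eager up, reluctant down.
    `floor` is the minimum capacity; the dead band prevents flapping."""
    desired = max(desired, floor)        # enforce the minimum floor
    if desired > current * (1 + up_pct):
        return desired                   # scale up on a modest gap
    if desired < current * (1 - down_pct):
        return desired                   # scale down only on a large gap
    return current                       # inside the dead band: hold

print(decide_scale(100, 95, floor=10))   # → 100 (hold: avoids chasing noise)
print(decide_scale(100, 120, floor=10))  # → 120 (scale up)
print(decide_scale(100, 5, floor=10))    # → 10  (floor stops runaway scale-in)
```

The asymmetry encodes the cost structure directly: adding capacity late is expensive in SLO terms, while removing it late merely costs money for a while.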
Operators need to know why capacity changed. Provide feature attributions, quantile choices, and confidence intervals with every action. Keep immutable logs and diffable plans. Scheduled reviews and postmortems invite constructive skepticism. Human oversight strengthens automation by correcting drift early and ensuring that strategic priorities, not opaque heuristics, steer the cloud at critical moments.
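A diffable plan can be as plain as a canonical JSON record emitted with every action, carrying the attribution, quantile choice, and interval the paragraph asks for. A minimal sketch; the field names and values are hypothetical:

```python
import json

def plan_record(action, attributions, quantile, interval):
    """Canonical JSON record explaining one capacity change.
    Sorted keys and sorted attributions make records stably diffable."""
    return json.dumps({
        "action": action,
        "quantile": quantile,     # e.g. which band drove the decision
        "interval": interval,     # [lo, hi] forecast band
        # strongest drivers first, so reviewers see the "why" immediately
        "attributions": sorted(attributions.items(), key=lambda kv: -abs(kv[1])),
    }, sort_keys=True)

rec = plan_record("scale_up:+8",
                  {"deploy_spike": 0.3, "weekly_seasonality": 0.6},
                  quantile="P95", interval=[110, 134])
```

Append-only storage of such records gives the immutable log, and a plain text diff between consecutive records gives reviewers the before/after picture in a postmortem.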