Known Issues-1: AWS Rebalance Recommendation Signal
- Frequent Rebalance Recommendation Signals:
Issue: AWS may send a Rebalance Recommendation signal if it determines that a Spot Instance has an increased likelihood of being reclaimed. This signal can be intercepted by the OpCenter, which uses a rules-based approach to decide whether to maintain the current Spot Instance or proactively capture a snapshot and transition to a new instance.
Impact: Frequent rebalance signals can interrupt a job many times in succession. After repeated spot reclaims and migrations, jobs can fail with checkpoint errors.
Example: In the job example screenshot below, the OpCenter encountered numerous rebalance signals, resulting in 17 interruptions over a short period. These interruptions reset the buffer cache, causing processing delays and timeouts.
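The difference between a rebalance recommendation and an actual reclaim notice can be sketched as follows. This is a minimal illustration, not the OpCenter's implementation. It assumes the raw bodies of the two EC2 instance metadata paths documented by AWS have already been fetched (or are None when the path returns 404). The function name `classify_signal` and its return labels are hypothetical.

```python
import json
from typing import Optional

# EC2 instance metadata paths documented by AWS. Fetching them is left out
# so that the sketch runs offline.
REBALANCE_PATH = "/latest/meta-data/events/recommendations/rebalance"
INSTANCE_ACTION_PATH = "/latest/meta-data/spot/instance-action"


def classify_signal(rebalance_body: Optional[str], action_body: Optional[str]) -> str:
    """Return 'reclaim', 'rebalance', or 'none' (hypothetical labels).

    Each argument is the raw JSON body of the matching metadata path,
    or None if that path returned 404 (no signal pending).
    """
    # An instance-action document means AWS has scheduled the interruption.
    if action_body:
        action = json.loads(action_body)
        if action.get("action") in ("terminate", "stop", "hibernate"):
            return "reclaim"
    # A rebalance recommendation only reports elevated risk; no reclaim is scheduled yet.
    if rebalance_body:
        if "noticeTime" in json.loads(rebalance_body):
            return "rebalance"
    return "none"


print(classify_signal('{"noticeTime": "2020-10-27T08:22:00Z"}', None))
print(classify_signal(None, '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'))
```

A rebalance recommendation can arrive well before any reclaim notice, or without one ever following, which is why acting on every recommendation can produce interruptions that were not strictly necessary.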
- Default Rebalance Threshold Settings:
Issue: The rebalance signal threshold on a new OpCenter is set to 64 GB by default. With this setting, the OpCenter may act on rebalance signals that are never followed by an actual spot reclaim.
Impact: Jobs that depend on a large buffer cache can slow down significantly or fail, because each unnecessary migration resets the cache and can lead to timeouts.
Resolution: Increasing the rebalance threshold to 125 GB can help, because jobs are then checkpointed only on actual spot reclaim signals rather than on every rebalance signal.
Note: Changes to the rebalance threshold will only apply to newly launched jobs and not to those already in progress.
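The effect of the threshold change, and of the note about running jobs, can be sketched as follows. This is a hypothetical model rather than OpCenter code. It assumes the threshold is compared against a job's memory footprint, and that only jobs at or above the threshold are checkpointed on a rebalance signal. The comparison direction and the names `Job`, `launch`, and `acts_on_rebalance` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    memory_gb: float
    # Copied from the OpCenter setting at launch; later changes do not reach this job.
    rebalance_threshold_gb: float


def launch(name: str, memory_gb: float, current_threshold_gb: float) -> Job:
    """Create a job that keeps the threshold in effect at launch time."""
    return Job(name, memory_gb, current_threshold_gb)


def acts_on_rebalance(job: Job) -> bool:
    # ASSUMPTION: jobs at or above the threshold checkpoint on a rebalance
    # signal; smaller jobs wait for the actual spot reclaim notice.
    return job.memory_gb >= job.rebalance_threshold_gb


# The same 96 GB job launched before and after raising the threshold.
before = launch("job-before-change", memory_gb=96, current_threshold_gb=64)
after = launch("job-after-change", memory_gb=96, current_threshold_gb=125)

print(acts_on_rebalance(before))  # True: still reacts to rebalance signals
print(acts_on_rebalance(after))   # False: waits for an actual reclaim
```

Under these assumptions, a job launched before the change keeps reacting to rebalance signals until it finishes, so it has to be relaunched to pick up the new threshold.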