AWS Known Issues

Sateesh Peri

Known Issues-1: AWS Rebalance Recommendation Signal

  1. Frequent Rebalance Recommendation Signals:

Issue: AWS may send a Rebalance Recommendation signal if it determines that a Spot Instance has an increased likelihood of being reclaimed. This signal can be intercepted by the OpCenter, which uses a rules-based approach to decide whether to maintain the current Spot Instance or proactively capture a snapshot and transition to a new instance.

Impact: Frequent rebalance signals can lead to numerous interruptions, and jobs that are interrupted repeatedly can fail with checkpoint errors.

Example: In the job example screenshot below, the OpCenter encountered numerous rebalance signals, resulting in 17 interruptions over a short period. These interruptions reset the buffer cache, causing processing delays and timeouts.

  2. Default Rebalance Threshold Settings:

Issue: The rebalance signal threshold on a new OpCenter is set to 64GB by default. At this setting, the OpCenter may act on rebalance signals that are never followed by an actual spot reclaim, so jobs are checkpointed and moved unnecessarily.

Impact: This can cause jobs that require substantial buffer cache to fail or slow down significantly due to frequent cache resets and timeouts.

Resolution: Increasing the rebalance threshold to 125GB can help by ensuring that jobs are only checkpointed on actual spot reclaim signals rather than on every rebalance signal.

Note: Changes to the rebalance threshold will only apply to newly launched jobs and not to those already in progress.
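The effect of raising the threshold can be illustrated with a small sketch. This is not the OpCenter's actual rule set, which is not documented here. It assumes, purely for illustration, that the threshold is compared against a job's memory size and that jobs below the threshold ignore rebalance signals and are checkpointed only on an actual reclaim notice. The names `Signal`, `Action`, `decide`, `job_memory_gb`, and `rebalance_threshold_gb` are hypothetical.

```python
from enum import Enum


class Signal(Enum):
    REBALANCE = "rebalance-recommendation"  # elevated reclaim risk, no reclaim yet
    INTERRUPTION = "spot-interruption"      # actual spot reclaim notice


class Action(Enum):
    KEEP_RUNNING = "keep-running"
    CHECKPOINT_AND_MIGRATE = "checkpoint-and-migrate"


def decide(signal: Signal, job_memory_gb: float, rebalance_threshold_gb: float) -> Action:
    """Hypothetical rule: decide how to react to an AWS Spot signal."""
    # An actual reclaim notice always forces a checkpoint and migration.
    if signal is Signal.INTERRUPTION:
        return Action.CHECKPOINT_AND_MIGRATE
    # ASSUMPTION: jobs smaller than the threshold ignore rebalance signals
    # and wait for an actual reclaim notice instead.
    if job_memory_gb < rebalance_threshold_gb:
        return Action.KEEP_RUNNING
    # Jobs at or above the threshold react proactively to rebalance signals.
    return Action.CHECKPOINT_AND_MIGRATE


if __name__ == "__main__":
    job_gb = 96  # hypothetical job size
    for threshold in (64, 125):
        action = decide(Signal.REBALANCE, job_gb, threshold)
        print(f"threshold={threshold}GB, rebalance signal -> {action.value}")
```

Under these assumptions, a hypothetical 96GB job is migrated on every rebalance signal at the 64GB default but keeps running at 125GB, and it is still checkpointed when an actual reclaim notice arrives.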