GridGain has fairly advanced and flexible support for preemption, or preemptive scheduling. Preemption occurs when a grid job decides, or is instructed, to stop executing on its current node, save its state, migrate to another node, and resume execution there from the saved state (a distributed continuation). For example, the collision SPI can decide that a certain job needs to stop and move to another node. Another example is a job that monitors its own execution performance, determines that its current pace is insufficient, and preemptively moves to another, presumably better-performing, node.
Preemption is a complex multi-step process, and GridGain supports it with several components working together:
- Collision SPI provides a centralized point through which every job passes for resource-contention handling, and where the job can be buffered, executed or cancelled
- Failover SPI provides a means of re-mapping a given job to a new node
- Checkpoint SPI provides means for storing and retrieving intermediate state
- GridJob cancellation and session attributes provide the API mechanics needed to stop a job and to pass user information between the SPIs, the job and, if required, the job's siblings executing on other nodes
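The interplay of these components can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `CheckpointStore`, `SumJob`, `execute`, `runWithPreemption` and the per-node "step budget" are hypothetical names invented for illustration and are not GridGain APIs. The step budget stands in for a collision decision that stops the job, the store stands in for a checkpoint SPI, and the loop over nodes stands in for failover re-mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of preemption; none of these types are GridGain APIs.
final class PreemptionSketch {

    /** Stand-in for a checkpoint store that every node can reach. */
    static final class CheckpointStore {
        private final Map<String, long[]> saved = new HashMap<>();

        void save(String key, long[] state) { saved.put(key, state.clone()); }

        long[] load(String key) {
            long[] state = saved.get(key);
            return state == null ? null : state.clone();
        }

        void remove(String key) { saved.remove(key); }
    }

    /** A job that sums 1..limit and can be stopped between steps. */
    static final class SumJob {
        final String id;
        final long limit;

        SumJob(String id, long limit) {
            this.id = id;
            this.limit = limit;
        }

        /**
         * Runs on one simulated node for at most stepBudget steps.
         * Returns the final sum, or null if the job was preempted
         * (in which case its state has been saved to the store).
         */
        Long execute(CheckpointStore store, long stepBudget) {
            // Resume from saved state if an earlier node checkpointed this job.
            long[] state = store.load(id);
            long next = state == null ? 1 : state[0];
            long sum = state == null ? 0 : state[1];

            long steps = 0;
            while (next <= limit) {
                if (steps == stepBudget) {
                    // Simulated "stop" instruction: save progress and give up the node.
                    store.save(id, new long[] {next, sum});
                    return null;
                }
                sum += next;
                next++;
                steps++;
            }
            store.remove(id); // Finished: the checkpoint is no longer needed.
            return sum;
        }
    }

    /** Stand-in for failover: re-maps the job to the next node until it completes. */
    static long runWithPreemption(SumJob job, CheckpointStore store, long[] nodeBudgets) {
        for (long budget : nodeBudgets) {
            Long result = job.execute(store, budget);
            if (result != null) {
                return result;
            }
        }
        throw new IllegalStateException("job did not finish on any node");
    }
}
```

For example, `runWithPreemption(new SumJob("job-1", 10), store, new long[] {4, 100})` performs four steps on the first simulated node, checkpoints, and finishes the remaining work on the second node instead of starting over.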
Preemption is not a trivial mechanism, but it can be the only solution in systems with complex resource management requirements.
One real-life example where preemption support is required is priority-based near real-time (nRT) grid tasks. In nRT cases, a certain task must be given all local resources so that it completes within a given QoS specification. Short of outright cancelling all jobs currently running on the node, the only way to do this is to preempt them onto other node(s) without losing the work performed up to that moment.
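The selection step in such a scenario might look like the following minimal sketch. It assumes each running job carries a numeric priority and that a higher number means more urgent. `RunningJob` and `selectForPreemption` are hypothetical names used only for illustration and are not part of GridGain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical priority-based selection; not a GridGain API.
final class PriorityPreemptionSketch {

    /** Minimal description of a job currently running on the local node. */
    static final class RunningJob {
        final String id;
        final int priority;

        RunningJob(String id, int priority) {
            this.id = id;
            this.priority = priority;
        }
    }

    /**
     * Returns the ids of running jobs whose priority is strictly lower than
     * that of the incoming task. These are the candidates to checkpoint and
     * move to other nodes so the incoming task can use the local resources.
     */
    static List<String> selectForPreemption(List<RunningJob> running, int incomingPriority) {
        List<String> toPreempt = new ArrayList<>();
        for (RunningJob job : running) {
            if (job.priority < incomingPriority) {
                toPreempt.add(job.id);
            }
        }
        return toPreempt;
    }
}
```

Jobs with a priority equal to or higher than the incoming task's are left running in this sketch. A real policy might also take into account how much work each job has already completed.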