Explain deadlock detection and recovery in detail

Deadlock Detection and Recovery in Operating Systems

Deadlock occurs when a set of processes are permanently blocked because each one is holding resources while waiting for other resources held by others. In a detection-and-recovery approach, the operating system allows deadlocks to occur, periodically checks for them, and then takes action to recover. This is common in general-purpose systems where maximum resource utilization is desired.

Why Use Deadlock Detection?

  • Prevention and avoidance can be overly restrictive and reduce throughput.
  • Detection lets the system run with high utilization and resolves deadlocks only when they actually happen.
  • Useful when resource requests are unpredictable and dynamic.

Key Ideas and Terminology

  • Resource Allocation Graph (RAG): A directed graph showing which process holds or requests which resource.
  • Wait-For Graph (WFG): A simplified graph with only processes as nodes; an edge Pi → Pj means Pi is waiting for a resource currently held by Pj.
  • Available: Vector showing how many instances of each resource type are free.
  • Allocation: Matrix showing how many resource instances are currently allocated to each process.
  • Request (or Need): Matrix showing how many instances each process is currently requesting.

When to Run Deadlock Detection

  • Periodically: Run the detector at fixed intervals.
  • On demand: Trigger when CPU utilization drops, when processes wait too long, or when resource contention spikes.
  • Per resource class: Some systems detect per subsystem (e.g., I/O vs. memory) for faster checks.

Detection with Single Instance per Resource Type (Using Wait-For Graph)

If every resource type has only one instance, detection reduces to cycle finding in the Wait-For Graph. A cycle implies deadlock.

  1. Build WFG: draw Pi → Pj if Pi is waiting for a resource currently held by Pj.
  2. Run a cycle-detection algorithm (DFS or topological reduction).
  3. If a cycle exists, all processes on the cycle are deadlocked.
// Detect cycle in Wait-For Graph (WFG) using DFS
DFS(node u):
  mark u as visiting
  for each v in Adj[u]:
    if v is visiting: cycle found (deadlock)
    else if v is unvisited: DFS(v)
  mark u as visited

for each process p:
  if p is unvisited: DFS(p)
if any DFS reported a cycle: deadlock present

Time complexity is roughly O(V + E), where V is the number of processes and E is the number of wait-for edges.

Detection with Multiple Instances per Resource Type (Matrix-Based Algorithm)

When each resource type may have multiple instances (e.g., several printers, memory frames), detect deadlock using a safety-like test.

  1. Let Available, Allocation, and Request represent the current state.
  2. Initialize:
    • Work = Available
    • Finish[i] = false for all processes that hold resources (Allocation[i] ≠ 0); otherwise true
  3. Find a process i such that:
    • Finish[i] = false, and
    • Request[i] ≤ Work (component-wise)
  4. If found, pretend i completes: Work = Work + Allocation[i]; set Finish[i] = true; repeat step 3.
  5. If no such i exists, then any process with Finish[i] = false is deadlocked.
// Multi-instance deadlock detection
Work = Available
for each process i:
  Finish[i] = (Allocation[i] == 0) ? true : false

repeat:
  found = false
  for each process i:
    if !Finish[i] and Request[i] <= Work:
      Work = Work + Allocation[i]
      Finish[i] = true
      found = true
  // stop if we cannot make progress
until found == false

// Processes with Finish[i] == false are deadlocked

Quick Examples

  • Single-instance (WFG): P1 waits for a file locked by P2; P2 waits for a device held by P1. Cycle P1 → P2 → P1 means a deadlock.
  • Multi-instance: If no process’s Request fits into Available and some processes still hold resources (Finish = false), those processes are deadlocked.

Deadlock Recovery Techniques

Once a deadlock is detected, the OS must break it. Common strategies:

1) Process Termination

  • Abort all deadlocked processes: Fast but wastes work.
  • Abort one process at a time (select a victim), re-run detection after each abort until the cycle is broken.

2) Resource Preemption

  • Temporarily take resources from some processes and give them to others to resolve the wait cycle.
  • Requires choosing:
    • Victim resource/process to preempt
    • Rollback method to return the victim to a consistent state
    • Starvation control to ensure the same process is not always chosen

3) Rollback and Checkpointing

  • Periodically checkpoint process state to stable storage.
  • On deadlock, rollback one or more processes to a safe checkpoint (freeing resources) and restart them later.

Victim Selection: How to Choose What to Kill or Preempt

  • Priority: Prefer to preempt/abort lower-priority processes.
  • Cost of lost work: Avoid killing a process that has done a lot of computation or is near completion.
  • Resources held: Prefer victims holding many or critical resources (to free more capacity).
  • Rollback distance: Choose processes with recent checkpoints.
  • Process type: Interactivity, user importance, or SLA requirements can influence choice.
  • Starvation prevention: Use aging or rotation so the same process isn’t always the victim.

Comparing Recovery Methods

  • Terminate all: Simple, but heavy loss of progress.
  • Terminate one-by-one: Balanced; repeated detection adds overhead.
  • Preempt resources: Preserves more work, but needs checkpointing and careful consistency handling.

Practical Tips for BTech CSE Students

  • Use WFG cycle detection for single-instance resource models.
  • Use the matrix-based detection for multi-instance resource models.
  • After detection, apply victim selection with clear criteria to minimize cost and avoid starvation.
  • Run detection at sensible intervals to balance overhead and responsiveness.

Summary

Deadlock detection identifies circular waits either by finding cycles in a Wait-For Graph (single-instance) or by simulating allocation using Available/Allocation/Request (multi-instance). After detection, the OS recovers by terminating processes, preempting resources, or rolling back to checkpoints. Careful victim selection and reasonable detection frequency ensure good performance and fairness in a real operating system.