Explain deadlock detection and recovery in detail BTECH CSVTU Que...

Deadlock Detection and Recovery in Operating Systems

Deadlock occurs when a set of processes are permanently blocked because each one is holding resources while waiting for other resources held by others. In a detection-and-recovery approach, the operating system allows deadlocks to occur, periodically checks for them, and then takes action to recover. This is common in general-purpose systems where maximum resource utilization is desired.

Why Use Deadlock Detection?

Prevention and avoidance can be overly restrictive and reduce throughput.
Detection lets the system run with high utilization and resolves deadlocks only when they actually happen.
Useful when resource requests are unpredictable and dynamic.

Key Ideas and Terminology

Resource Allocation Graph (RAG): A directed graph showing which process holds or requests which resource.
Wait-For Graph (WFG): A simplified graph with only processes as nodes; an edge Pi → Pj means Pi is waiting for a resource currently held by Pj.
Available: Vector showing how many instances of each resource type are free.
Allocation: Matrix showing how many resource instances are currently allocated to each process.
Request (or Need): Matrix showing how many instances each process is currently requesting.

When to Run Deadlock Detection

Periodically: Run the detector at fixed intervals.
On demand: Trigger when CPU utilization drops, when processes wait too long, or when resource contention spikes.
Per resource class: Some systems detect per subsystem (e.g., I/O vs. memory) for faster checks.

Detection with Single Instance per Resource Type (Using Wait-For Graph)

If every resource type has only one instance, detection reduces to cycle finding in the Wait-For Graph. A cycle implies deadlock.

Build WFG: draw Pi → Pj if Pi is waiting for a resource currently held by Pj.
Run a cycle-detection algorithm (DFS or topological reduction).
If a cycle exists, all processes on the cycle are deadlocked.

// Detect cycle in Wait-For Graph (WFG) using DFS
DFS(node u):
  mark u as visiting
  for each v in Adj[u]:
    if v is visiting: cycle found (deadlock)
    else if v is unvisited: DFS(v)
  mark u as visited

for each process p:
  if p is unvisited: DFS(p)
if any DFS reported a cycle: deadlock present

Time complexity is roughly O(V + E), where V is the number of processes and E is the number of wait-for edges.

Detection with Multiple Instances per Resource Type (Matrix-Based Algorithm)

When each resource type may have multiple instances (e.g., several printers, memory frames), detect deadlock using a safety-like test.

Let Available, Allocation, and Request represent the current state.
Initialize:
- Work = Available
- Finish[i] = false for all processes that hold resources (Allocation[i] ≠ 0); otherwise true
Find a process i such that:
- Finish[i] = false, and
- Request[i] ≤ Work (component-wise)
If found, pretend i completes: Work = Work + Allocation[i]; set Finish[i] = true; repeat step 3.
If no such i exists, then any process with Finish[i] = false is deadlocked.

// Multi-instance deadlock detection
Work = Available
for each process i:
  Finish[i] = (Allocation[i] == 0) ? true : false

repeat:
  found = false
  for each process i:
    if !Finish[i] and Request[i] <= Work:
      Work = Work + Allocation[i]
      Finish[i] = true
      found = true
  // stop if we cannot make progress
until found == false

// Processes with Finish[i] == false are deadlocked

Quick Examples

Single-instance (WFG): P1 waits for a file locked by P2; P2 waits for a device held by P1. Cycle P1 → P2 → P1 means a deadlock.
Multi-instance: If no process’s Request fits into Available and some processes still hold resources (Finish = false), those processes are deadlocked.

Deadlock Recovery Techniques

Once a deadlock is detected, the OS must break it. Common strategies:

1) Process Termination

Abort all deadlocked processes: Fast but wastes work.
Abort one process at a time (select a victim), re-run detection after each abort until the cycle is broken.

2) Resource Preemption

Temporarily take resources from some processes and give them to others to resolve the wait cycle.
Requires choosing:
- Victim resource/process to preempt
- Rollback method to return the victim to a consistent state
- Starvation control to ensure the same process is not always chosen

3) Rollback and Checkpointing

Periodically checkpoint process state to stable storage.
On deadlock, rollback one or more processes to a safe checkpoint (freeing resources) and restart them later.

Victim Selection: How to Choose What to Kill or Preempt

Priority: Prefer to preempt/abort lower-priority processes.
Cost of lost work: Avoid killing a process that has done a lot of computation or is near completion.
Resources held: Prefer victims holding many or critical resources (to free more capacity).
Rollback distance: Choose processes with recent checkpoints.
Process type: Interactivity, user importance, or SLA requirements can influence choice.
Starvation prevention: Use aging or rotation so the same process isn’t always the victim.

Comparing Recovery Methods

Terminate all: Simple, but heavy loss of progress.
Terminate one-by-one: Balanced; repeated detection adds overhead.
Preempt resources: Preserves more work, but needs checkpointing and careful consistency handling.

Practical Tips for BTech CSE Students

Use WFG cycle detection for single-instance resource models.
Use the matrix-based detection for multi-instance resource models.
After detection, apply victim selection with clear criteria to minimize cost and avoid starvation.
Run detection at sensible intervals to balance overhead and responsiveness.

Summary

Deadlock detection identifies circular waits either by finding cycles in a Wait-For Graph (single-instance) or by simulating allocation using Available/Allocation/Request (multi-instance). After detection, the OS recovers by terminating processes, preempting resources, or rolling back to checkpoints. Careful victim selection and reasonable detection frequency ensure good performance and fairness in a real operating system.