Explain the different types of vector memory access schemes. Which is better, and why?
Vector Memory Access Schemes in Vector Processors (CSE, B.Tech)
In vector processors, performance depends heavily on how elements of a vector are fetched from and stored to memory. The main vector memory access schemes are unit-stride, strided, and indexed (gather/scatter). Each scheme maps vector elements to memory differently and achieves different bandwidths due to banking, caching, and address generation behavior. Masked (predicated) loads and stores, covered last, are a modifier that can be applied to any of these three rather than a separate addressing pattern.
1) Unit-Stride (Contiguous) Access
What it is: Elements are accessed from consecutive memory locations (stride = 1). Example: A[0], A[1], A[2], …
- How hardware helps: Streaming prefetch, coalesced loads/stores, burst transfers, and banked memory all favor contiguous access.
- Pros: Highest sustained bandwidth, minimal bank conflicts, best cache/TLB locality, simplest addressing.
- Cons: Requires data to be laid out contiguously in memory.
2) Strided Access
What it is: Elements are accessed with a fixed step (stride s ≠ 1). Example: A[0], A[s], A[2s], … Often used for column-wise access in matrices.
- How hardware helps: Vector address generators support programmable stride; some systems include conflict-avoidance techniques.
- Pros: Works well for regular non-contiguous patterns (e.g., submatrices, interleaved arrays).
- Cons: Can suffer bank conflicts and lower bandwidth when the stride interacts poorly with the number of memory banks.
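The column-wise case mentioned above can be sketched in a few lines of Python. This is only an illustrative model: the helper `strided_addresses` and the 4 x 6 row-major matrix are assumptions made for the example, and the list `matrix` stands in for word-addressed memory.

```python
def strided_addresses(base, stride, vector_length):
    """Word addresses generated for one vector access: base, base+s, base+2s, ..."""
    return [base + i * stride for i in range(vector_length)]

ROWS, COLS = 4, 6
# Flat row-major storage; each value equals its own word address for easy checking.
matrix = list(range(ROWS * COLS))

# Row 1 is contiguous: unit stride, starting at address 1 * COLS.
row_1 = [matrix[a] for a in strided_addresses(base=1 * COLS, stride=1, vector_length=COLS)]

# Column 2 needs a stride equal to the row length (COLS words).
col_2 = [matrix[a] for a in strided_addresses(base=2, stride=COLS, vector_length=ROWS)]

print(row_1)  # [6, 7, 8, 9, 10, 11]
print(col_2)  # [2, 8, 14, 20]
```

In this model a row is a unit-stride vector, while a column of the same matrix is a strided vector whose stride is the row length.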
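Bank-conflict rule of thumb: With B memory banks and word-address stride s, a long strided stream touches B / gcd(s, B) distinct banks. Bandwidth is best when gcd(s, B) = 1, because consecutive elements cycle through all B banks. If gcd(s, B) > 1, only a subset of banks is used, so accesses return to a still-busy bank sooner, creating hotspots and stalls.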
Example (B = 8 banks, addresses map by addr mod 8):
- Unit-stride (s = 1): banks used = 0,1,2,3,4,5,6,7 → good
- Stride s = 2: banks used = 0,2,4,6 → conflicts (only 4 of 8 banks)
- Stride s = 3: banks used = 0,3,6,1,4,7,2,5 → good (gcd(3,8)=1)
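The bank sequences in this example can be reproduced with a short Python sketch. It assumes the same simple mapping (bank = word address mod B) and a base address of 0. The function names are hypothetical, and real machines may use other interleaving or hashing schemes.

```python
from math import gcd

def banks_touched(stride, num_banks, num_elements):
    """Bank index hit by each element, assuming bank = word_address mod num_banks."""
    return [(i * stride) % num_banks for i in range(num_elements)]

def distinct_banks(stride, num_banks):
    """Number of different banks a long strided stream visits: B / gcd(s, B)."""
    return num_banks // gcd(stride, num_banks)

for s in (1, 2, 3, 4, 8):
    print(f"stride={s}: banks={banks_touched(s, 8, 8)}, distinct={distinct_banks(s, 8)}")
```

Under this model, a stride equal to the bank count (s = 8) is the worst case: every element maps to the same bank.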
3) Indexed Access (Gather/Scatter)
What it is: Elements are accessed using an index vector that supplies one element offset (or, on some machines, one full address) per vector element. Two operations:
- Gather: Load from arbitrary memory addresses into a vector register using an index vector.
- Scatter: Store vector elements to arbitrary memory addresses using an index vector.
- Pros: Very flexible; essential for sparse matrices, graphs, and irregular data structures.
- Cons: Typically the slowest scheme: non-sequential addresses reduce cache effectiveness, increase TLB misses, and make bank conflicts unpredictable.
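As a rough illustration of the two operations, the following Python sketch models memory as a list and treats each index as a word offset from a base address. The functions `gather` and `scatter` are hypothetical helpers for this example, not any particular ISA's instructions.

```python
def gather(memory, base, indices):
    """Load memory[base + idx] for every idx in the index vector."""
    return [memory[base + i] for i in indices]

def scatter(memory, base, indices, values):
    """Store values[k] to memory[base + indices[k]] for every k.

    Elements are written in order here, so with duplicate indices the last
    write wins; actual hardware ordering rules vary and are not modeled.
    """
    for i, v in zip(indices, values):
        memory[base + i] = v

memory = [10 * k for k in range(16)]   # memory[k] == 10 * k
idx = [7, 0, 12, 3]                    # arbitrary, non-sequential offsets

vec = gather(memory, 0, idx)
print(vec)                             # [70, 0, 120, 30]

scatter(memory, 0, idx, [v + 1 for v in vec])
print(memory[7], memory[0], memory[12], memory[3])  # 71 1 121 31
```

Because the offsets in `idx` follow no pattern, successive accesses can land in any bank, cache line, or page, which is why this scheme is hard for the memory system to optimize.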
4) Masked/Conditional Vector Loads and Stores
What it is: A predicate (mask) enables/disables per-element memory access. Often combined with strided or indexed patterns to avoid invalid or out-of-bounds accesses.
- Pros: Safe handling of tails and sparse conditions; avoids branches.
- Cons: Can reduce effective bandwidth because not all lanes are active.
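Tail handling is a common use of masking, and a minimal Python sketch of it follows. The vector length `VL = 4`, the helper names, and the pass-through value used for inactive lanes are all assumptions for illustration; real ISAs differ in how inactive lanes are treated.

```python
def masked_load(memory, base, mask, passthrough):
    """Load memory[base + i] only where mask[i] is set; otherwise keep passthrough[i]."""
    return [memory[base + i] if m else p
            for i, (m, p) in enumerate(zip(mask, passthrough))]

def masked_store(memory, base, mask, values):
    """Store values[i] to memory[base + i] only where mask[i] is set."""
    for i, (m, v) in enumerate(zip(mask, values)):
        if m:
            memory[base + i] = v

VL = 4                      # assumed hardware vector length
src = list(range(10))       # 10 elements: two full vectors plus a tail of 2
dst = [0] * len(src)

for base in range(0, len(src), VL):
    # Lanes that would fall past the end of the array are disabled.
    mask = [base + i < len(src) for i in range(VL)]
    v = masked_load(src, base, mask, [0] * VL)
    masked_store(dst, base, mask, [2 * x for x in v])

print(dst)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

In the last iteration only two of the four lanes are active, which shows both the benefit (no out-of-bounds access, no scalar cleanup loop) and the cost (half the lanes do no useful work).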
Hardware Features That Influence Performance
- Memory banking/interleaving: Distributes addresses across banks to increase parallelism; effectiveness depends on stride vs. bank count.
- Alignment and bursts: Aligned, contiguous blocks allow wider bursts and fewer transactions.
- Streaming buffers/prefetchers: Greatly benefit unit-stride; limited benefit for random indexed patterns.
- TLB and cache behavior: Contiguity improves hit rates; random indices can thrash the TLB and caches.
Which Scheme Is Better and Why?
Best overall for performance: Unit-stride (contiguous) access.
- Delivers the highest sustained memory bandwidth due to sequential bursts and coalescing.
- Minimizes bank conflicts, latency, and control overhead.
- Achieves superior cache and TLB locality, reducing stalls.
Rank (typical): Unit-stride > Well-chosen strided (gcd(stride, banks) = 1) > Indexed gather/scatter.
Practical Tips to Achieve “Better” Access
- Transform data layout to favor contiguity (e.g., Structure of Arrays over Array of Structures).
- Reorder loops or use blocking/tiling so inner loops use unit-stride vectors.
- For strided access, choose strides relatively prime to the number of memory banks where possible.
- For sparse/irregular data, batch indices to improve locality or reorder data once, then use unit-stride thereafter.
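The first tip can be made concrete by comparing the word addresses needed to read one field from each layout. The sketch below assumes points with three one-word fields (x, y, z) and a base address of 0; the function names are hypothetical.

```python
FIELDS_PER_POINT = 3  # assumed record layout: x, y, z, one word each

def aos_field_addresses(field, n):
    """Array of Structures: x0 y0 z0 x1 y1 z1 ... so one field is a strided vector."""
    return [i * FIELDS_PER_POINT + field for i in range(n)]

def soa_field_addresses(field, n):
    """Structure of Arrays: x0..x(n-1) y0..y(n-1) z0..z(n-1) so one field is contiguous."""
    return [field * n + i for i in range(n)]

def stride_of(addresses):
    """Constant step between consecutive addresses (assumes a regular pattern)."""
    return addresses[1] - addresses[0]

n = 4
aos_y = aos_field_addresses(field=1, n=n)
soa_y = soa_field_addresses(field=1, n=n)

print(aos_y, "stride", stride_of(aos_y))  # [1, 4, 7, 10] stride 3
print(soa_y, "stride", stride_of(soa_y))  # [4, 5, 6, 7] stride 1
```

Reading all y values is a stride-3 access in the AoS layout but a unit-stride access in the SoA layout, which is why the SoA form is usually preferred for vector code.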
Conclusion: While all schemes are necessary, unit-stride is “better” for pure throughput and efficiency. Strided access is acceptable with careful stride–bank planning. Indexed (gather/scatter) is the most flexible but typically the slowest, used when the data pattern requires it.
