The Cache and Memory Hierarchy
The cache and memory hierarchy: an L1 cache for each SM and a single L2 cache that serves all operations (load, store and texture).

Registers: Each SM has 32K 32-bit registers. Each thread can access only its own registers, not those of other threads. A CUDA kernel may use at most 63 registers per thread. As the number of threads, and with it the workload and resource demand, increases, the registers available to each thread decrease gradually from 63 to 21. Register bandwidth is approximately 8,000 GB/s.

L1 + Shared Memory: On-chip memory that can be used either to cache data for individual threads (register spilling/L1 cache) or to share data among multiple threads (shared memory). This 64 KB of memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. Shared memory makes it possible for threads to work together within a single thread block, allowing large on-chip data reuse, an...
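
The register figures quoted above (32K registers per SM, a cap of 63 per thread, falling to 21) can be checked with a little arithmetic. The sketch below is an illustration in plain C++, not vendor code. It assumes an SM holds at most 1536 resident threads, a number the text does not state, and it ignores the allocation granularity that real hardware applies. The names `registers_per_thread` and `print_register_budget` are hypothetical helpers introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>

// Figures taken from the text.
const int kRegistersPerSM   = 32 * 1024;  // 32K 32-bit registers per SM
const int kMaxRegsPerThread = 63;         // per-thread cap

// ASSUMPTION (not stated in the text): at most 1536 threads are resident on one SM.
const int kMaxResidentThreads = 1536;

// Simplified model: the register file is divided evenly among resident
// threads, and the result is clamped to the per-thread cap.
int registers_per_thread(int resident_threads) {
    assert(resident_threads > 0 && resident_threads <= kMaxResidentThreads);
    return std::min(kMaxRegsPerThread, kRegistersPerSM / resident_threads);
}

// Prints the per-thread register budget for a few thread counts.
void print_register_budget() {
    const int thread_counts[] = {256, 512, 768, 1024, 1536};
    for (int t : thread_counts) {
        std::printf("%4d resident threads: %2d registers per thread\n",
                    t, registers_per_thread(t));
    }
}
```

Under these assumptions the budget stays at the cap of 63 while the thread count is low (32768 / 512 = 64, clamped to 63) and reaches 21 at 1536 resident threads (32768 / 1536 rounds down to 21), which matches the range given in the text.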