The Cache and Memory Hierarchy

 

Fermi's memory hierarchy provides an L1 cache in each SM and a single, unified L2 cache that services all operations (load, store, and texture).


  • Registers
Each SM contains a file of 32 K 32-bit registers. Each thread can access only its own registers. A CUDA kernel may use at most 63 registers per thread. As the workload (and hence the resource requirements) rises with the number of threads, the number of registers available per thread smoothly decreases from 63 to 21. Register bandwidth is roughly 8,000 GB/s.
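This register budget can be influenced at compile time. A minimal sketch, assuming Fermi's limit of 1,536 resident threads per SM (the kernel name and sizes here are illustrative, not from the original post):

```cuda
// Hypothetical kernel showing how to cap per-thread register use.
// __launch_bounds__(256, 6) promises blocks of at most 256 threads with at
// least 6 resident blocks per SM (1,536 threads), so the compiler limits
// each thread to roughly 32768 / 1536 ~= 21 registers.
__global__ void __launch_bounds__(256, 6)
saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // held in this thread's own registers
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

The same cap can be applied globally by compiling with `nvcc --maxrregcount=21`.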

  •  L1+Shared Memory

On-chip memory that can be used either to share data among the threads of a block (shared memory) or to cache data for individual threads (register spilling / L1 cache). This 64 KB of memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. Shared memory lets threads cooperate within a single thread block, enables substantial on-chip data reuse, and significantly lowers off-chip traffic. Only threads in the same thread block can access that block's shared memory. It offers very high bandwidth (about 1,600 GB/s) and low-latency access (10–20 cycles) to moderate volumes of data.
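Both uses can be seen in a short sketch: a block stages data in shared memory for on-chip reuse, and the host selects the 48 KB shared-memory split for that kernel (the kernel name and tile size are illustrative assumptions):

```cuda
// Sketch: block-level data reuse through shared memory.
__global__ void blockReverse(float *d, int n)
{
    __shared__ float tile[256];                    // on-chip, visible to this block only
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    if (base + t < n)
        tile[t] = d[base + t];                     // one global load per element
    __syncthreads();                               // threads of the block cooperate
    if (base + t < n)
        d[base + t] = tile[blockDim.x - 1 - t];    // reuse stays on chip
}

// Host side: prefer the 48 KB shared / 16 KB L1 configuration for this kernel.
// cudaFuncSetCacheConfig(blockReverse, cudaFuncCachePreferShared);
```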


                                                    Fig: Fermi Memory Hierarchy

  • Local Memory

Local memory is a region of memory used to hold "spilled" registers. Register spilling happens when a thread block needs more register storage than an SM can provide. An automatic variable normally resides in a register, with two exceptions: (1) arrays indexed with quantities the compiler cannot determine to be constant, and (2) large arrays or structures that would consume too much register space. Whenever a kernel uses more registers than are available on the SM, the compiler may also choose to spill other variables to local memory.
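Both exception cases can be demonstrated in one small kernel. This is an illustrative sketch (the kernel and array names are assumptions), and `nvcc -Xptxas -v` will report the resulting spill loads/stores and local-memory usage:

```cuda
// Sketch of the two cases that push automatic variables out of registers.
__global__ void spillDemo(const int *idx, float *out)
{
    float scale = 2.0f;        // plain scalar: stays in a register
    float big[64];             // case (2): large array, likely placed in local memory
    for (int i = 0; i < 64; ++i)
        big[i] = i * scale;
    // case (1): index not known at compile time, so the array cannot
    // live in registers and is accessed through local memory instead
    out[threadIdx.x] = big[idx[threadIdx.x] & 63];
}
```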

  • L2 Cache

A 768 KB L2 cache shared among the 16 SMs services all loads and stores from and to global memory, including copies to and from the CPU host, as well as texture requests. The L2 cache subsystem also implements atomic operations, which coordinate access to data that must be shared across thread blocks or even across kernels.
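A common use of these L2-backed atomics is a global counter that every block updates; a minimal sketch (kernel name is illustrative):

```cuda
// Sketch: atomics on global memory are serviced through the L2 cache,
// so thread blocks on different SMs can safely share one counter.
__global__ void countPositives(const float *x, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0f)
        atomicAdd(counter, 1);   // serialized correctly across all SMs
}
```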

 

  • Global memory

Global memory is accessible to the host (CPU) and to all threads, but its access latency on Fermi is long (400–800 cycles) compared with shared memory.
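Because global memory is the only level visible to both sides, the host typically stages data there before and after kernel launches. A minimal host-side sketch (sizes illustrative, error checking omitted):

```cuda
// Sketch: moving data between host memory and device global memory.
float h[256] = {0};
float *d = nullptr;
cudaMalloc(&d, sizeof h);                             // allocate global memory
cudaMemcpy(d, h, sizeof h, cudaMemcpyHostToDevice);   // host -> global memory
/* ... launch kernels that read and write d ... */
cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);   // global memory -> host
cudaFree(d);
```

Kernels hide this long latency by keeping many threads resident, so some warps compute while others wait on memory.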

 

       Conclusion:

With a configurable, error-protected memory hierarchy and support for languages such as C++ and FORTRAN, Fermi is the first computing architecture to deliver such a high level of double-precision floating-point performance from a single chip. As a result, Fermi is the world's first full-featured GPU computing architecture.
