Memory Tuning

There are a number of techniques for optimizing application code and turning the memory hierarchy.

Maximize cache reuse

The following snippets of code illustrate the correct way to access contiguous elements (i.e. stride 1) for a matrix in both C and FORTRAN.

Fortran exampleC example
real*8 :: a(m,n), b(m,n), c(m,n)
...
do i = 1, n
  do j = 1, m
    a(j,i) = b(j,i) + c(j,i)
  end do
end do
double a[m][n], b[m][n], c[m][n];
...
for (i=0; i<m; i++) {
  for (j=0; j<n; j++) {
    a[i][j] = b[i][j]+c[i][j];
  }
}

Prefetching

Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. Compiler inserts prefetch instructions into loop -- instructions that move data from main memory into cache in advance of their use. Prefetching may also be specified by the user using directives.

Example: In the following dot-product example, the number of streams prefetched are increased from 2, to 4, to 6, for the same functionality. However, just prefetching a larger number of streams does not necessarily translate into increased performance. There is a threshold value beyond which prefetching more streams can be counterproductive.

2 streams
do i=1,n
  s=s+x(i)*y(i)
end do
dotp=s
4 streams
do i=1,n/2
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/2)*y(i+n/2)
end do
s0=s0+x(n)*y(n)
dotp=s0+s
6 streams
do i=1,n/3
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/3)*y(i+n/3)
  s2=s2+x(i+2*n/3)*y(i+2*n/3)
end do
do i=3*(n/3)+1,n
  s0=s0+x(i)*y(i)
end do
dotp=s0+s1+s2

Fit the problem

Make sure to fit the problem size to memory (256GB/node) as there is no virtual memory available for swap.

  1. Always minimize stride length . For the best-case scenario, stride length 1 is optimal for most systems and in particular the vector systems. If that is not possible, then the low-stride access should be the goal. That increases cache efficiency, as well as sets up hardware and software prefetching. Stride lengths of powers of two is typically the worst case scenario leading to cache misses.
  2. Another approach is data reuse in cache by cache blocking . The idea is to load chunks of the data so it fits maximally in the different levels of cache while in use. Otherwise the data has to be loaded into cache from memory every time it becomes necessary since its not in cache. This phenomenon is commonly known as cache miss . This is costly from the computational standpoint, since the latency for loading data from memory is a few orders higher than from cache, hence the concern. The goal is to keep as much of the data in cache while it is in use and to minimizing loading it from memory.

    This concept is illustrated in the following matrix-matrix multiply example where the indices for the i, j, k loops are set up in such a way so as to fit the greatest possible sizes of the different sub-matrices in cache while the computation is ongoing.

    Example: Matrix multiplication
    Real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb       ! nb is blocking factor
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(j,k)*b(k,i)
              end do
            end do
          end do
        end do
      end do
    end do
    
  3. Another standard issue is the dimension of arrays when they are stored and it is always best to avoid leading dimensions that are a multiple of a high power of two . Users should be particularly aware of the cache line and associativity. Performance degrades when the stride is a multiple of the cache line size.

    Example : Consider an L1 cache that is 16 K in size and 4-way set associative, with a cache line of 64 Bytes.

    Problem : A 16 K 4-way set associative cache has 4 sets of 4 K each (4096). If each cache line is 64 bytes, then there are 64 cache lines per set. Effectively reduces L1 from 256 cache lines to only 4. That results in a 256 byte cache, down from the original 16 K, due to the non-optimal choice of leading dimension.

    Real*8 :: a(1024,50)
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)
    end do
    

    Solution : Change leading dimension to 1028 (1024 + 1/2 cache line)

  4. Encourage Data Prefetching to Hide Memory Latency.
  5. Work within available physical memory.