Thrashing the data cache for fun and profit
Ducks in a Row
In the past [1], I discussed how instruction cache misses can completely ruin your CPU's day when it comes to execution speed, stalling processing for thousands or even millions of cycles while program text is retrieved from slower caches, or even from (gosh!) RAM [2] [3]. Now the time has come to turn our attention to the data cache: where large enough datasets are involved, spatial locality and cache-friendly algorithms are among the first optimizations you should employ.
Three Times Slower
Compare the code presented in Listings 1 and 2. You might use diff [4] to verify that the only difference is in the order of the i and j variables in the array lookup. Listing 1 increments 100 million integers, scanning the array row by row, whereas Listing 2 increments the 10,000 elements of a column before moving to the next column.
Listing 1
row.c
#include <stdio.h>
#define SIZE 10000

int array[SIZE][SIZE] = {0};

int main() {
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            array[i][j]++;
    return 0;
}
Listing 2
column.c
#include <stdio.h>
#define SIZE 10000

int array[SIZE][SIZE] = {0};

int main() {
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            array[j][i]++;
    return 0;
}
Both listings do exactly the same thing, proceeding in what is called row-major (Listing 1) or column-major (Listing 2) order [5]. To see how this plays out in practice, compile both programs:
gcc column.c -o column
gcc row.c -o row
Next, run and time the execution of these logically equivalent access strategies:
$ time ./row

real    0m1.388s
user    0m0.943s
sys     0m0.444s

$ time ./column

real    0m4.538s
user    0m4.068s
sys     0m0.468s
The experiment shows the column-major code taking more than three times as long to execute as its row-major counterpart. That significant result is partially explained by how C, C++, and other languages "flatten" two-dimensional arrays into the one-dimensional address space exposed by RAM.
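The arithmetic behind that flattening is simple: element array[i][j] lives i * SIZE + j elements past the start of the array. A minimal sketch of the equivalence follows (the flat pointer and the sample indices are my illustration, not part of the numbered listings):

#include <stdio.h>
#define SIZE 10000

int array[SIZE][SIZE];

int main() {
    int i = 42, j = 7;           // arbitrary sample indices
    int *flat = &array[0][0];    // view the 2D array as one linear run of ints
    // Row-major flattening: array[i][j] and flat[i * SIZE + j] name the
    // same memory location, so this prints 1.
    printf("%d\n", &array[i][j] == &flat[i * SIZE + j]);
    return 0;
}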
As shown in Figure 1, individual rows line up sequentially in memory, something easily demonstrated with a simple program that prints the dereferenced value of an incrementing pointer (Listing 3). This new program uses C pointer arithmetic (don't try this at home, as the MythBusters would say, because it is generally considered poor practice) to print every location of a small 2D array by walking memory linearly. C really does represent two-dimensional arrays as arrays of arrays, laying them out in row-major fashion (i.e., row by row in memory).
Listing 3
inspect.c
#include <stdio.h>

int a[4][5] = {            // array of 4 arrays of 5 ints each, a 4x5 matrix
    { 1, 2, 3, 4, 5 },     // row 0 initialized to 1, 2, 3, 4, 5
    { 6, 7, 8, 9, 10 },    // row 1 initialized to 6, 7, 8, 9, 10
};                         // rows 2 onward initialized to zeros

int main() {
    int *p = &a[0][0];     // &a[0][0], not a: a decays to int (*)[5], not int *
    for (int i = 0; i < 20; i++)
        printf("%d, ", *p++);
    printf("\b\b \n");     // erase the trailing comma and add a newline
    return 0;
}
Running the third program,
$ ./inspect
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$
indeed produces the output one would expect.
The Right Neighborhood
With every row of the array weighing in at roughly 40KB (10,000 entries of 32 bits each), the difference in access time reveals the nature of the underlying machine. Accessing the array in the natural order of the programming language (implementation choices differ here, with Fortran and Matlab choosing column-major layouts) leverages the spatial locality of the access pattern, reading locations that are contiguous in memory.
Conversely, using the wrong strategy produces code that constantly looks up memory locations 40KB apart. The first approach is cache-friendly, whereas the latter repeatedly evicts cache lines that were just loaded but were used to process only a single value. The more linear the memory access pattern, the easier it is for the CPU to have data already in the cache when it is needed.
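When an algorithm genuinely must visit the matrix column by column, much of the lost locality can be recovered by working in small tiles. The following blocked variant of column.c is a sketch of the idea, not a drop-in for the measurements above; the BLOCK size is an assumption to be tuned so a tile's cache lines fit in L1:

#include <stdio.h>
#define SIZE  10000
#define BLOCK 64   // tile edge; assumed value, tune to the target L1 cache

int array[SIZE][SIZE] = {0};

int main() {
    for (int ii = 0; ii < SIZE; ii += BLOCK)
        for (int jj = 0; jj < SIZE; jj += BLOCK)
            // Within a tile only BLOCK distinct rows are touched, so their
            // cache lines stay resident and each one is reused for several
            // consecutive values of i instead of being evicted after one use.
            for (int i = ii; i < ii + BLOCK && i < SIZE; i++)
                for (int j = jj; j < jj + BLOCK && j < SIZE; j++)
                    array[j][i]++;   // still column order inside the tile
    return 0;
}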
Memory layout and access patterns matter, and a stark demonstration of the difference between the two approaches is provided by Figure 2. The perf stat command [6] executes the requested command while gathering statistics from the CPU's performance counters. Switching from the row-major to the column-major access pattern resulted in a 20-fold increase in L1 cache misses, spiking from 13 million to 273 million!
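To reproduce a comparison of this kind yourself, something like the following should work, although the exact event names vary by CPU and kernel (these generic aliases are an assumption, not the precise command behind Figure 2):

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./row
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./column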
The Cachegrind module of valgrind [7] measures how a program interacts with an idealized cache hierarchy by simulating a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). Although Cachegrind requires tuning to match the size of a CPU's caches, the independence of its simulations from any other process running on the system, its reproducible results, and its compatibility with virtual environments make it the tool of choice when studying a cache performance issue.
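A run might look like the following, with the data cache geometry overridden to 32KB, 8-way associative, 64-byte lines as an example (adjust the numbers to your host CPU; cg_annotate then summarizes the output file Cachegrind writes):

$ valgrind --tool=cachegrind --D1=32768,8,64 ./row
$ cg_annotate cachegrind.out.<pid>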
Figure 3 shows the results of the run, with a 16-fold increase in data cache misses, up from 6 million to 100 million events. That factor is no accident: a typical 64-byte cache line holds 16 of these 4-byte integers, so the row-major scan reuses each loaded line 16 times, whereas the column-major scan uses it only once before eviction. Note how the row-major approach delivered a near-perfect result, with D1 cache misses coming in under one percent.
Fully utilizing the data loaded from the slower storage medium (RAM in this case) is a common optimization pattern. Swapping to disk is, of course, the worst-case scenario, with 13,700 times the latency of the L1 cache. The situation in which memory pages are offloaded to the swap device after a single variable access, only to be reloaded to reach the next location, is known as disk thrashing. The CPU loads RAM in full cache lines [8], and data is loaded from disk in full pages; whenever possible, make full use of all their data before moving on to the next.
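On Linux with glibc, both of those granularities can be queried at run time. A small sketch (the cache line sysconf name is a GNU extension and may report 0 or -1 on other platforms):

#include <stdio.h>
#include <unistd.h>

int main() {
    // Cache line and page size: the two units of bulk data movement
    // discussed above (the first sysconf name is glibc-specific).
    printf("L1 dcache line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("page size:      %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}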
Infos
- "Heavy Rotation" by Federico Lucifredi, ADMIN , issue 14, 2013, pg. 88
- Visualization of L1, L2, RAM, and disk latencies: https://news.ycombinator.com/item?id=702713
- Interactive visualization of cache speeds: http://www.overbyte.com.au/misc/Lesson3/CacheFun.html
- diff (1) man page: https://manpages.ubuntu.com/manpages/bionic/en/man1/diff.1.html
- Row- and column-major order: https://en.wikipedia.org/wiki/Row-_and_column-major_order
- perf stat (1) man page: http://manpages.ubuntu.com/manpages/bionic/man1/perf-stat.1.html
- Cachegrind manual: http://valgrind.org/docs/manual/cg-manual.html
- CPU cache: https://en.wikipedia.org/wiki/CPU_cache