from CSE142L.notebook import *
from notebook import *
from Question import Question
setup_lab()
# if you get something about NUMEXPR_MAX_THREADS being set incorrectly, don't worry. It's not a problem.
If, after these two labs, you still thirst for practical knowledge about using memory effectively, you should read this series of articles: What every programmer should know about memory. It's long but quite good. It's not required reading, but, for the programming assignments that say "make this code go as fast as you can," everything it includes is fair game.
Your grade for this lab will be based on the following components:
| Part | value |
|---|---|
| Reading quiz | 3% |
| Jupyter Notebook | 95% |
| Post-lab survey | 2% |
No late work or extensions will be allowed.
We will grade 5 of the "completeness" problems. They are worth 3 points each. We will grade all of the "correctness" questions.
You'll follow the directions at the end of the lab to submit the lab write up and the programming assignment through gradescope.
Please check gradescope for exact due dates.
We'll use performance counters to measure cache performance on our bare metal machines in the cloud. We'll use the same tools we've used so far to measure IC, CPI, and CT.
Performance counters are hardware components in the CPU that measure performance-related events - like instructions executed, cache misses, branch mispredictions, and so on. In this lab, we will use performance counters extensively to characterize the behavior of programs.
In this lab and future labs, we'll be using a data structure called a tensor. Tensors are a hot topic nowadays (you may have heard of the TensorFlow programming system from Google that accompanies their Tensor Processing Unit specialized processor), but at their heart they are not complicated: They are just multi-dimensional arrays.
We'll be using a 4-dimensional tensor data type called tensor_t. It's defined in ./tensor_t.hpp. Here are the key parts of the source code:
// START_CONSTRUCT
tensor_t( int _x, int _y=1, int _z=1, int _b=1): size(_x, _y, _z, _b), delete_memory(true) {
data = new T[size.x * size.y * size.z * size.b]();
}
// END_CONSTRUCT
// START_GET
const T & get( int _x, int _y=0, int _z=0, int _b=0 ) const {
return data[
_b * (size.x * size.y * size.z) +
_z * (size.x * size.y) +
_y * (size.x) +
_x
];
}
// END_GET
Note that tensor_t is a template, so we can have tensors of many different types: tensor_t<int>, tensor_t<float>, etc.
Although tensor_t is 4-dimensional, it stores its contents in a one-dimensional array of T called data. The get() method maps coordinates into that linear array and returns a reference to the corresponding entry. Since get() returns a reference, we can say things like:
tensor_t<float> t(10, 10, 10, 10);
t.get(1,2,3,4) = 4.0;
float s = t.get(1,2,3,4);
We're going to need to generate cache misses very efficiently in this lab. It's actually pretty hard to reliably create cache misses, especially if you want to out-smart modern cache hardware mechanisms like prefetchers. We also need to precisely control the amount of memory the code touches. You could use a random number generator, but even the simplest ones (like fast_rand()) are pretty computationally intensive. Instead, we are going to use a clever data structure that I don't think has an official name, so I'm going to call it the "miss machine".
The miss-machine is a circular linked list. The links are allocated in a block (of the size of memory footprint we want) and then formed into a circular linked list. Then, you set the next pointer so that the list jumps around at random, hitting every link, and eventually returning to the start. Then, if you want cache misses, you make the footprint bigger than your cache, and you just traverse the list for as long as you like. The sequence of addresses is (as near as the cache can tell) random, and the code required is extremely small -- just a few instructions.
Here's an implementation. Read through the code and comments carefully.
#include<cstdint>
#include<vector>
#include<algorithm>
#include<iostream>
#include"util.hpp"
#include"cfiddle.hpp"
#include <new>
#define BYTES_PER_CACHE_LINE 64
struct MM {
struct MM* next; // I know that pointers are 8 bytes on this machine.
uint8_t junk[BYTES_PER_CACHE_LINE - sizeof(struct MM*)]; // This forces struct MM to take up a whole cache line, abolishing spatial locality.
};
extern "C"
struct MM * miss(struct MM * start, uint64_t count) {
for(uint64_t i = 0; i < count; i++) { // Here's the loop that does the misses. It's very simple.
start = start->next;
}
return start;
}
extern "C"
uint64_t* miss_machine(uint64_t footprint_bytes, uint64_t access_count) {
const uint array_size = footprint_bytes/sizeof(struct MM);
auto array = new struct MM[array_size]();
// This is the clever part: 'index' is going to determine where the pointers go. We fill it with consecutive integers.
std::vector<uint64_t> index;
for(uint64_t i = 0; i < array_size; i++) {
index.push_back(i);
}
// Randomize the list of indexes.
std::random_shuffle(index.begin(), index.end());
// Convert the indexes into pointers.
for(uint64_t i = 0; i < array_size; i++) {
array[index[i]].next = &array[index[(i + 1) % array_size]];
}
MM * start = &array[0];
flush_caches();
enable_prefetcher(0);
start_measurement();
start = miss(start, access_count);
end_measurement();
return reinterpret_cast<uint64_t*>(start); // This is a garbage value, but if we don't return it, the compiler will optimize out the call to miss.
}
// Cfiddle-signature=421aa88a2b583f971443d15383a46a72
One of the central issues in cache-aware programming (i.e., crafting your code to make efficient use of the memory hierarchy) is understanding "how much" memory your program (or part of it) is using and "how big" a particular data structure (e.g., an array) is. It turns out there are a surprising number of ways to measure these things and the answers can be non-intuitive. Not only that, the most obvious ways to measure these characteristics are not the most useful -- if you're interested in fully-utilizing your memory system.
So let's take a look at three questions and see how to answer them in a cache-aware way.
Let's start simple: How many bytes do each of the primitive C++ data types occupy? Here's some code to check. It uses C's sizeof operator (it's not technically a function. Why?) that tells you how many bytes something occupies. The types listed are all the C++ primitive types.
sizeof(char) = 1
sizeof(short int) = 2
sizeof(int) = 4
sizeof(long int) = 8
sizeof(long long int) = 8
sizeof(float) = 4
sizeof(double) = 8
sizeof(long double) = 16
sizeof(int8_t) = 1
sizeof(int16_t) = 2
sizeof(int32_t) = 4
sizeof(int64_t) = 8
sizeof(int64_t*) = 8
sizeof(void*) = 8
The types with a number in their names (like uint64_t) let you specify the exact size (in bits) you'd like to use. They are defined in the cstdint header. The C/C++ standard doesn't give specific sizes for int, long int, etc. Instead, it places some constraints on the range of values they must be able to represent. This is a design "bug" in the language; I don't know of any other modern language that leaves the bit widths of primitive types unspecified. One reason it's a bug is that it hurts portability. The other is that it makes cache-aware programming harder (as we shall see).
Let's try something more complicated. Look at the code below and answer this question. Then, run the code:
#include<cstdint>
#include<iostream>
struct struct_1 {
uint32_t a;
};
struct struct_2 {
uint32_t a;
uint32_t b;
};
struct struct_3 {
uint32_t a;
uint8_t b;
};
struct struct_4 {
uint64_t a;
uint8_t b;
};
struct struct_5 {
uint8_t a;
uint8_t b;
};
struct struct_6 {
uint32_t a;
uint32_t c;
uint8_t b;
};
class class_7 {
uint64_t a;
uint8_t b;
uint8_t c;
} ;
class class_8 {
uint64_t a;
};
class class_9 {
void foo() {};
uint64_t a;
};
class class_10 {
virtual void foo() {};
uint64_t a;
};
class class_11 {
virtual void foo() {};
virtual void bar() {};
uint64_t a;
};
extern "C"
void struct_size() {
std::cout << "\n";
std::cout << "sizeof(struct_1) = " << sizeof(struct_1) << "\n";
std::cout << "sizeof(struct_2) = " << sizeof(struct_2) << "\n";
std::cout << "sizeof(struct_3) = " << sizeof(struct_3) << "\n";
std::cout << "sizeof(struct_4) = " << sizeof(struct_4) << "\n";
std::cout << "sizeof(struct_5) = " << sizeof(struct_5) << "\n";
std::cout << "sizeof(struct_6) = " << sizeof(struct_6) << "\n";
std::cout << "sizeof(class_7) = " << sizeof(class_7) << "\n";
std::cout << "sizeof(class_8) = " << sizeof(class_8) << "\n";
std::cout << "sizeof(class_9) = " << sizeof(class_9) << "\n";
std::cout << "sizeof(class_10) = " << sizeof(class_10) << "\n";
std::cout << "sizeof(class_11) = " << sizeof(class_11) << "\n";
}
// Cfiddle-signature=6bdb9ad41ee4662b301d1394e0a31184
sizeof(struct_1) = 4
sizeof(struct_2) = 8
sizeof(struct_3) = 8
sizeof(struct_4) = 16
sizeof(struct_5) = 2
sizeof(struct_6) = 12
sizeof(class_7) = 16
sizeof(class_8) = 8
sizeof(class_9) = 8
sizeof(class_10) = 16
sizeof(class_11) = 16
What's going on?!?!
What's going on is data alignment. If a memory address, $A$, is $n$-byte aligned, then $A \bmod n = 0$. A particular value is "width-aligned" if it is aligned to its own size in bytes. So, a width-aligned uint64_t would reside at an address that is a multiple of 8 bytes.
In most architectures, width-aligned access is faster than non-width-aligned access. In some architectures, the ISA does not directly support non-width-aligned access (i.e., you can't load a 64-bit value from an address that is not a multiple of 8), because it requires extra hardware and complexity (e.g., if the architecture allows unaligned, multi-byte values, then a single load can access two cache lines instead of one). Instead, they require the compiler to implement these accesses with loads and shifts.
The strangeness in the output of sizeof is a product of this. The C/C++ standards require that the members of a struct be stored in the order they are declared. Modern compilers "pad" the members of the struct to enforce width-alignment. For this purpose, it's assumed that the struct starts at address 0.
For example, consider struct_8 (defined in a code cell below; its fields, in order, are a uint8_t b, a uint64_t a, and a uint8_t c). It's laid out like so:
| Byte | Contents |
|---|---|
| 0 | b |
| 1 | unused |
| 2 | unused |
| 3 | unused |
| 4 | unused |
| 5 | unused |
| 6 | unused |
| 7 | unused |
| 8 | a |
| 9 | a |
| 10 | a |
| 11 | a |
| 12 | a |
| 13 | a |
| 14 | a |
| 15 | a |
| 16 | c |
| 17 | unused |
| 18 | unused |
| 19 | unused |
| 20 | unused |
| 21 | unused |
| 22 | unused |
| 23 | unused |
If this seems inefficient... it is. Or at least, it's a trade-off. It's better for performance this way and memory is plentiful. Really, though, the programmer should re-order the fields of the struct to allow for a more efficient layout. See if you can re-arrange the fields in struct_8 to make it fit in 16 bytes.
Most of our examples use structs instead of classes.
A struct is essentially a class without any methods and with all public fields. C++ has structs because it evolved from C, which did not have classes.
The layout of a struct and a class is similar but not identical, as you can see from the output for class_7, class_8, class_9, class_10, and class_11 above.
#include<cstdint>
#include<iostream>
struct struct_8 {
uint8_t b;
uint64_t a;
uint8_t c;
} ;
extern "C"
void array_size() {
struct struct_8 _8[3];
std::cout << "\n";
std::cout << "sizeof(struct_8) = " << sizeof(struct_8) << "\n";
std::cout << "sizeof(struct_8[3]) = " << sizeof(_8) << "\n";
}
// Cfiddle-signature=0efcfb51e2f8a828a381d2895804dbe0
sizeof(struct_8) = 24
sizeof(struct_8[3]) = 72
Above, we measured the size of a data structure or array in bytes, which is the natural unit for thinking about how much storage it occupies.
A related question is how much data does my program access. If you are interested in writing code that is cache-aware, thinking of data measured in bytes is not that useful. A better choice is to think about data measured in cache lines, because cache lines are the units of memory that the memory hierarchy transfers between caches.
So what is one cache line of memory? The seemingly obvious answer is that it is the number of bytes that a cache line holds. If that were the case, we could just divide the size of a structure by the cache line size. But there is more to it: Cache lines are width-aligned. That means that each cache line of memory starts at an address that is divisible by the cache line's size. This means that the number of cache lines a struct occupies depends on its alignment.
For example, let's assume our cache line size is 16 bytes, and my_struct is 16 bytes long. Here are two possible scenarios for my_struct. In the first, the beginning of my_struct is aligned to a cache line boundary, and my_struct occupies 1 cache line.
| byte range | 0–15 | 16–31 |
|---|---|---|
| cache line | 0 | 1 |
| my_struct | entire struct | |

In the second scenario, my_struct is not aligned (say it starts at byte 8), and it occupies two cache lines:

| byte range | 0–7 | 8–15 | 16–23 | 24–31 |
|---|---|---|---|---|
| cache line | 0 | 0 | 1 | 1 |
| my_struct | | first 8 bytes | last 8 bytes | |
For the width-aligned case, if we access the whole struct, we will incur one compulsory cache miss. But in the second, non-aligned situation we will incur two.
We can modify the miss machine to illustrate this situation. The code is below. There are three differences:
1. We changed the fields of MM. That's just to make the table below more readable.
2. miss() reads from two fields of MM: a at the beginning and h at the end.
3. We allocate array in a complicated way.

For array, instead of using new, we use posix_memalign(), which lets you set the alignment of the allocated memory with its second argument. We allocate array so that the elements it contains will be width-aligned. That is, array % sizeof(MM) == 0.
Having carefully allocated aligned memory, we then unalign it:
array = reinterpret_cast<A*>(reinterpret_cast<uint64_t*>(array) + arg1);
The line above shifts array forward by arg1 8-byte words. So, if arg1 = 4, then array % sizeof(MM) == 32. You can read about reinterpret_cast, but you should use it sparingly (They gave it a hard-to-type name on purpose. Really!).
So, if we call this function with multiple values of arg1, we'll have something like this:
(Table: for each value of arg1 from 0 through 7, the elements array[0] through array[3] -- each one cache line in size, with field a first and field h last -- shift right by arg1 8-byte words relative to the 64-byte cache-line boundaries. Only for arg1 == 0 does each element line up exactly with one cache line; for every other value of arg1, each element straddles a line boundary.)
When arg1 % 8 == 0, an MM occupies one cache line. When arg1 % 8 is anything else, it occupies two. And since miss() accesses both a and h, we will access both of those lines.
The experiments above show that alignment and struct layout can affect the performance of toy programs, but do these kinds of details really make any difference in real code? They do!
How can we tell? Two ways:

1. Write (and tune) a lot of real-world code ourselves and measure the effects.
2. Look at the support that languages and compilers provide for controlling memory layout and alignment -- support that wouldn't exist if these details didn't matter.

Item 1 is lots of fun and I recommend it, but it takes a long time and we only have one quarter. But there are ~~three~~ two programming assignments left...

Let's look at #2. For a long time, compiler writers and language designers have realized that memory layout and alignment are important, so they have provided support in the language to help programmers deal with them.
Since the early versions of C, you could initialize a struct like this:
struct Foo {
char a;
int b;
char c;
};
Foo foo = {7,4,3};
Which will set foo.a = 7, foo.b = 4, and foo.c = 3. We'll call this "positional initialization" or PI. I think PI is super-unreadable, but it has an even bigger problem: If you change the layout of Foo to improve its cache efficiency, you have to update every static initialization of a Foo and there's no way for the compiler to tell if you missed one. So the C99 standard gave us "designated initialization":
Foo foo = {.a = 7, .b=4, .c=3};
This feature appeared as a gcc extension before it showed up in the C standard, and its main use case was making it easier to adjust struct layouts for memory-efficiency reasons. For instance, in the unaligned access example above, we could move MM.h to be near MM.next and reduce the effects of poor alignment.
Usually, the compiler can do a good job of aligning your structs, but there are cases where the programmer needs to force a structure to be aligned to a particular size. The old-fashioned way to do this was to wrap malloc() (the C version of new) in a new memory allocation function that could return memory at the desired alignment. Writing such a function is a moderately interesting programming exercise.
Since 2001, C has had support for aligned allocation through posix_memalign().
C++ has a more elegant solution and allows you to annotate the type with its alignment requirement using alignas():
struct alignas(32) Foo {
char a;
int b;
char c;
};
This guarantees that all instances of Foo will be aligned to 32 bytes.
For both posix_memalign() and alignas(), the alignment value must be a power of 2. Other alignments are not generally useful.
Latency and bandwidth are two fundamental measures of memory performance.
Caches can improve both latency and bandwidth: The L1 cache has lower latency and higher bandwidth than the L2. The L2 out-performs the L3 on both metrics, and the L3 is better than DRAM on both metrics.
Let's start by taking some baseline measurements of the bandwidth and latency capabilities of our test machine. We'll start with DRAM.
Before we measure DRAM's performance, we should be precise about what exactly we mean by "latency" and "bandwidth" for DRAM.
For DRAM latency, we mean the number of seconds it takes for a load instruction to retrieve data assuming that the load misses in all levels of the cache. This means that the memory request "goes all the way to memory" and has to come all the way back.
For DRAM bandwidth, we mean the maximum number of bytes the processor can read or write per second from DRAM rather than from any of the caches. This means that the only bytes that count for DRAM bandwidth are those read or written as part of a cache miss.
Measuring these two values is surprisingly difficult and would make a very good (and pretty challenging) programming assignment for a graduate-level architecture course. It's made more difficult in modern machines by the presence of multiple processors, multiple DRAM interfaces, multiple caches, and multiple cores. We'll discuss some of this complexity in more detail in future labs when we address the impacts of parallelism, but, in this lab, we will stick to the simplest version of these questions.
Given the difficulty of measuring these values, we will rely on Intel's Memory Latency Checker. It can perform a huge array of measurements, but we'll just do two:
!cse142 job run --take NOTHING --lab caches 'mlc --bandwidth_matrix; mlc --latency_matrix'
You are submitting a job for lab "Lab 3: Caches" (caches).
Creating job 26252c41-1acf-4f43-9a03-82513c80d2fa 0.00 0.00
Ready for submission. 2.09 2.09
Job 26252c41-1acf-4f43-9a03-82513c80d2fa is in state 'PUBLISHED'. 3.36 5.45
Job 26252c41-1acf-4f43-9a03-82513c80d2fa is in state 'RUNNING'. 1.04 6.50.............
Job 26252c41-1acf-4f43-9a03-82513c80d2fa is in state 'DONE_RUNNING'. 14.76 21.25..
Job 26252c41-1acf-4f43-9a03-82513c80d2fa succeeded. 3.13 24.39
Writing results 1.00 25.39
Intel(R) Memory Latency Checker - v3.9a
Command line parameters: --bandwidth_matrix
Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Numa node
Numa node 0
0 19252.1
Intel(R) Memory Latency Checker - v3.9a
Command line parameters: --latency_matrix
Using buffer size of 200.000MiB
Measuring idle latencies (in ns)...
Numa node
Numa node 0
0 61.6
Updated these files:
Job Complete 0.36 25.75
The output is a little verbose, but you should see two numbers that look like measured values rather than documentation, both for 'Numa node' 0. The first is the bandwidth in MB/s; the second is the latency in nanoseconds.
60ns is an almost unimaginably short period of time, and 19GB/s seems like a lot of data, but we need to think about these numbers relative to the processor. If you look back at the CPI measurements you collected for Lab 1, you'll probably find CPI values as low as 0.5. At 3.5GHz, that means our processor is executing around 7 billion instructions per second or, on average, one instruction every 0.14ns. So 60ns is about 420 instructions.
For bandwidth, the situation is not great either. On average, about 20% of instructions access memory, and every instruction must itself be loaded from memory to execute. The average length of an x86 instruction is about 2 bytes, and for simplicity, let's assume all data accesses are 4 bytes. That means that each instruction executed needs, on average, 2 + 0.2 * 4 = 2.8 bytes of memory.
Let's compare that to what our measurements show: 19GB/s divided by 7 billion instructions per second is about 2.7 bytes per instruction. Not far off! However, recall from Lab 1 that our processor actually has 6 cores, and the memory bandwidth from its single memory channel is shared among all of them.
We should also keep in mind that our machines are memory-poor: Campus bought them to be as cheap and compact as possible. So they are physically small (so they only have 4 memory slots) and campus only ordered one DIMM for each.
We will need a way to get higher bandwidth and lower latency.
Caches reduce latency and increase bandwidth. Let's refresh our memory about the machine we are running on and see how large this effect is.
Run the cell below to remind yourself how large our L1, L2, and L3 caches are.
!cse142 job run --take NOTHING --force lscpu
You are submitting a job for lab "Lab 3: Caches" (caches).
Creating job 5d71c0ce-4764-4ada-9994-7e0da736fa7b 0.00 0.00
Ready for submission. 2.29 2.29
Job 5d71c0ce-4764-4ada-9994-7e0da736fa7b is in state 'PUBLISHED'. 2.79 5.08
Job 5d71c0ce-4764-4ada-9994-7e0da736fa7b is in state 'RUNNING'. 1.07 6.15
Job 5d71c0ce-4764-4ada-9994-7e0da736fa7b is in state 'DONE_RUNNING'. 1.05 7.21..
Job 5d71c0ce-4764-4ada-9994-7e0da736fa7b succeeded. 3.16 10.37
Writing results 1.00 11.37
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       39 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
Model name:          Intel(R) Xeon(R) E-2146G CPU @ 3.50GHz
BIOS Model name:     Intel(R) Xeon(R) E-2146G CPU @ 3.50GHz
CPU family:          6
Model:               158
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
Stepping:            10
Frequency boost:     enabled
CPU max MHz:         3501.0000
CPU min MHz:         800.0000
BogoMIPS:            7008.00
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
Virtualization:      VT-x
L1d cache:           192 KiB (6 instances)
L1i cache:           192 KiB (6 instances)
L2 cache:            1.5 MiB (6 instances)
L3 cache:            12 MiB (1 instance)
NUMA node(s):        1
NUMA node0 CPU(s):   0-11
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: Split huge pages
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS (kernel), IBPB
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable
Updated these files:
Job Complete 0.37 11.74
The cache sizes shown are totals for the whole CPU, which has 6 cores. So the per-core L1 data caches are 32 KiB each, the L2 caches (also one per core) are 256 KiB each, and the single shared L3 is 12 MiB.
One cool thing we can do with the miss machine is use it to measure cache latency (and the cache size, if we didn't already know it). To do this, we'll run the miss machine with larger and larger memory footprints and then measure the latency. For footprints that fit in the L1 cache, the average latency of these accesses is the L1 cache latency. For arrays bigger than the L1 but that fit in the L2, it's the L2 cache latency. Likewise, for L3 and main memory.
We don't have a handy tool for measuring cache bandwidth. Maybe we'll build one as a future programming assignment...
We do, however, have some technical details about the processor, so we can calculate what the peak (maximum attainable) cache bandwidth should be.
Skylake processors (of which our CPU is an example) can execute two 64-byte loads per cycle and one 64-byte store per cycle. At 3.5GHz, this works out to a bandwidth of 2 * 64 * 3.5e9 = 448GB/s for loads and 224GB/s for stores, for a total of 672GB/s.
The L2 cache can provide one 64-byte load or store per cycle, for 224GB/s.
I don't have any information about the L3.
They are certainly better than DRAM -- at 3.5GHz, the L1's 4-cycle latency is about 1.14ns, some 54 times lower than DRAM's 61.6ns, and its peak bandwidth (in our system with one bank of DRAM) is 672/19.3 ≈ 35 times higher. Not only that, remember that we have six cores, so the total L1 cache bandwidth is 672GB/s * 6 ≈ 4TB/s.
However, the minimum load latency -- 4 cycles -- is still pretty high, considering that it's 4 times longer than the latency for a simple integer arithmetic operation (add, sub, etc.). That means that having lots of load instructions can have a significant impact on CPI -- remember that unoptimized code from Lab 2 with all accesses to local variables on the stack? The 4 cycle L1 latency is part of why -O0 is so slow.
Caches are also slow compared to the register file. Skylake's register files have a total bandwidth (at 3.5GHz) of something like 5TB/s per core (although it's very hard to fully utilize), or 30TB/s across all 6 cores. The register file's latency is a little hard to quantify in a way that's comparable to cache latency, but half a cycle is a reasonable approximation.
The register file is also quite large: The total size is about 1kB. It's reasonable to think of the register file as a software-managed, "L0" cache.
Caches improve program performance by exploiting the fact that memory accesses are not random: Programs access memory in patterns, and there is a lot of repetition in what programs do and a lot of similarity in how programs behave at different times. This is not surprising, since programs are written in languages with constructs like loops and function calls: These constructs naturally give rise to patterns because they cause the same code to execute repeatedly.
As you should recall from lecture, spatial locality is a property of a program (or part of a program) where it accesses memory locations near locations it has accessed recently. Let's see how spatial locality can affect performance.
Here's a simple test program that accesses a 1-dimensional tensor at different "strides". The "stride" of a sequence of accesses is just the distance between consecutive accesses. So, "stride 1" means accessing each element, and "stride 2" means accessing every other element. The outer loop ensures that the total number of accesses the loop performs remains the same, regardless of stride.
We'll discuss temporal locality in the next lab.
In this lab, you have seen how important memory alignment is and how the compiler lays out structs to enforce it.
You have measured spatial locality and seen how its presence or absence can affect performance.