Load-store instruction set architecture
It is an example of a GPR (General Purpose Register) architecture: the operands of arithmetic and logic operations must be registers, not memory addresses.
In early computers, when every memory access took exactly one clock cycle, there was no downside to performing arithmetic and logic operations directly on memory locations. Over the years, however, CPU speed grew much faster than memory access speed, and caches were introduced to address the gap. Memory access latency became highly unpredictable, from near-instant on a top-level cache hit to very slow when a page fault occurs. A load-store ISA factors this issue out by excluding memory operands from arithmetic and logic operations.
Instead, memory operands are first loaded into registers, and the loads can be scheduled so that the timing of the operations is more tractable. In contrast, a non-load-store ISA would have to stall the ALU for an unpredictable number of cycles while waiting for memory.
The load-store ISA was not invented deliberately as such; rather, the memory accesses were factored out of the arithmetic and logic operations into separate load and store instructions.
A pipelined datapath inserts extra registers between the principal datapath stages. In each clock cycle an instruction advances through just one stage of the datapath and stores its interim results in the pipeline registers. In this way multiple instructions can execute at the same time, which increases instruction throughput. However, hazards can occur.
A hazard arises when a dependency between instructions disturbs the operation of the pipelined datapath.
Data hazard: occurs when a previous instruction has not yet written its result to the required register while the next instruction has already started to read that register, so stale data might be used. Solved by stalling or data forwarding.
Branch hazard: occurs when a previous instruction has not yet resolved the branch condition or written the branch target address. Solved by predicting the branch as not taken and continuing to fetch; if the branch turns out to be taken, the pipeline is flushed and execution resumes from the branch target address.
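The data-hazard cases can be sketched as a small detector in Python, assuming a classic MIPS-style 5-stage pipeline with forwarding. The instruction format (name, destination, sources) is invented for illustration.

```python
# Sketch of RAW (read-after-write) hazard detection for a classic
# 5-stage pipeline with forwarding. Instructions are
# (name, dest, sources) tuples; the format is invented here.

def hazards(program):
    """For each instruction, report RAW hazards against the previous
    two instructions and how a forwarding pipeline would handle them."""
    report = []
    for i, (name, dest, srcs) in enumerate(program):
        for back in (1, 2):
            j = i - back
            if j < 0:
                continue
            pname, pdest, _ = program[j]
            if pdest is not None and pdest in srcs:
                if pname == "lw" and back == 1:
                    # A loaded value is not ready until the MEM stage:
                    # forwarding alone cannot help, one stall is needed.
                    report.append((name, pname, "stall + forward"))
                else:
                    report.append((name, pname, "forward"))
    return report

prog = [
    ("lw",  "r1", []),           # r1 <- memory
    ("add", "r2", ["r1", "r3"]), # needs r1 right after the load
    ("sub", "r4", ["r2", "r1"]), # needs r2 from the add
]
print(hazards(prog))
```

Note the load-use case: it is the one data hazard that forwarding alone cannot hide.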
MIMD: Multiple instruction multiple data stream
- The most general form of parallelism: it consists of multiple uniprocessors connected on a single bus or via a network.
SMP: Symmetric Multiprocessor, equivalent to UMA: Uniform Memory Access
- A type of single-address-space multiprocessor in which access to main memory takes the same amount of time no matter which processor makes the request and which data word is requested.
NUMA: Non-Uniform Memory Access, the opposite of UMA
- Some memory accesses are faster than others, depending on which processor makes the request and which word is requested
Single bus vs Network
Single bus: only a small number of processors can be connected via a single bus. Because each processor has its own cache, the single bus and main memory can serve the needs of all the processors (up to a few tens of processors).
Network: as the number of processors increases, the single bus and shared memory become a bottleneck; a physically longer bus reduces bandwidth and increases latency. The memory is therefore distributed among the nodes, so local memory traffic is served at high speed, independent of the number of processors. Inter-processor communication is slower than over a single bus.
Shared memory vs message passing:
- The physically separated memories can be addressed as one unified address space, although with NUMA characteristics.
- Alternatively, if each separated memory is made completely private and inaccessible to remote processors, then inter-processor communication is achieved with message passing.
Direct memory access (DMA) is used to keep the CPU from being tied up while large chunks of data are transferred between disk and main memory (MM). During the transfer, the CPU hands control to the DMA controller, which handles the individual bus transactions between the device and memory. Once the transfer is complete, the CPU is interrupted and checks whether the transfer finished without error.
DMA causes difficulties in terms of both virtual memory system and caches.
For the VM system, the DMA controller is supplied with a start address and the number of bytes to transfer. Should that address be physical or virtual? If virtual, the DMA controller requires extra hardware to hold address translations. If physical, a single transfer cannot cross a page boundary, since consecutive virtual pages need not be contiguous in physical memory. Whichever approach is taken, the OS must make sure not to relocate pages that are involved in a DMA transfer.
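With physical addressing, the OS must split a transfer that spans virtual pages into one (physical address, length) chunk per page, as in scatter-gather DMA. A minimal sketch, assuming 4 KiB pages and a made-up page table:

```python
# Splitting a DMA transfer at page boundaries. Each virtual page may
# map to a non-contiguous physical frame, so the OS hands the DMA
# controller one (physical address, length) chunk per page.
# The page size and the toy page table below are assumptions.

PAGE_SIZE = 4096

# virtual page number -> physical frame number (illustrative only)
page_table = {0: 9, 1: 4, 2: 7}

def dma_chunks(vaddr, nbytes):
    chunks = []
    while nbytes > 0:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        span = min(nbytes, PAGE_SIZE - offset)   # stop at the page edge
        paddr = page_table[vpn] * PAGE_SIZE + offset
        chunks.append((paddr, span))
        vaddr += span
        nbytes -= span
    return chunks

# A 6000-byte transfer starting 1000 bytes into virtual page 0
# crosses into page 1, so it becomes two physically disjoint chunks:
print(dma_chunks(1000, 6000))
```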
For caches, problems may occur during a transfer of data from disk to memory on a multi-thread/multi-core system. Suppose both core 1 and core 2 hold a cached copy of the chunk of memory being overwritten by DMA; when thread B on core 2 next reads from the shared memory, it might get the old data rather than the updated version. Three ways to resolve this problem:
- Route all DMA activity through cache, but it would waste cache space.
- Have the OS invalidate the cached copies on an I/O write, or force a write-back. This is cache flushing.
- The most complex option is selective flushing of individual cache entries.
- False sharing under write-invalidate: suppose processor A writes word X, and processor B holds a copy of the block (block size > 1) containing X, which A's write invalidates. If B then reads word Y, which lies in the same block as X, the read misses. This miss would not occur with a block size of one; it is known as false sharing.
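The false-sharing scenario can be reproduced with a toy write-invalidate simulation, counting B's misses for two block sizes. The cache model (one valid/invalid state per block) is a deliberate simplification for illustration.

```python
# Toy write-invalidate simulation showing false sharing. Each cache
# holds only a set of valid block numbers; everything else about real
# coherence protocols is stripped away for illustration.

def count_b_misses(block_size):
    """Processor A writes word 0, then processor B reads word 1;
    return the number of misses B takes."""
    def block_of(word):
        return word // block_size

    b_valid = {block_of(0), block_of(1)}   # B starts with copies cached
    misses = 0

    # Processor A writes word X = 0: invalidate B's copy of that block.
    b_valid.discard(block_of(0))

    # Processor B reads word Y = 1.
    if block_of(1) not in b_valid:
        misses += 1
    return misses

print(count_b_misses(block_size=2))  # words 0 and 1 share a block: 1 miss
print(count_b_misses(block_size=1))  # separate blocks: 0 misses
```

With block size 2 the invalidation of X takes Y down with it; with block size 1 it does not.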
Virtual Memory system
Two motivations for the adoption of VM:
- The desire to write programs without worrying about the amount of physical memory installed
- The requirement to run multiple processes separately, each protected from and unaware of the others
MM and the virtual address space are divided into fixed-size chunks called pages.
- The page table translates virtual pages into physical pages. It is indexed by the virtual page number and stores the physical page number.
- Each running program has its own page table in MM. The memory system maintains a register that points to the start of the page table.
- Page fault: each page table entry has a valid bit that indicates whether the page is in memory. If it is not, a page fault occurs.
- On a page fault, the OS fetches the required page from disk into MM and replaces the least recently used page.
To save disk writes, a dirty bit is also included in each page table entry to indicate whether the page's content differs from the copy on disk.
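The translation mechanism above can be sketched in a few lines, assuming 4 KiB pages; the page table contents are made up for illustration.

```python
# Sketch of virtual-to-physical translation through a page table,
# assuming 4 KiB pages. The table contents below are made up.

PAGE_SIZE = 4 * 1024  # 2**12 bytes

# virtual page number -> (valid, dirty, physical page number)
page_table = {0: (True, False, 0x2A), 1: (False, False, None)}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split: page number + offset
    valid, dirty, ppn = page_table[vpn]
    if not valid:
        # Valid bit clear: the page is not in memory.
        raise RuntimeError("page fault: OS must fetch the page from disk")
    return ppn * PAGE_SIZE + offset          # same offset, new page number

print(hex(translate(0x0123)))   # page 0 is resident
# translate(0x1123) would raise: page 1's valid bit is clear
```

The page offset passes through unchanged; only the page number is translated.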
- Pages calculation
- Typical page size: 4 KiB = 2^12 bytes
- Typical virtual address space: 2^32 bytes
- Typical physical memory size: 2^30 bytes
- Thus the number of virtual pages is 2^32 / 2^12 = 2^20
- Each page table entry contains an 18-bit physical page number (2^30 / 2^12 = 2^18 physical pages) plus the dirty and valid bits, 20 bits in total, rounded up to 3 bytes
- Total size of the page table: 3 bytes * 2^20 = 3 MiB
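The sizing above, as a runnable calculation (same figures as in the notes):

```python
# Page-table sizing from the figures in the notes.
PAGE_BITS = 12          # 4 KiB pages
VIRT_BITS = 32          # 2**32-byte virtual address space
PHYS_BITS = 30          # 2**30-byte physical memory

num_virtual_pages = 2 ** (VIRT_BITS - PAGE_BITS)   # 2**20 entries
ppn_bits = PHYS_BITS - PAGE_BITS                   # 18-bit physical page number
entry_bits = ppn_bits + 2                          # + valid and dirty bits
entry_bytes = (entry_bits + 7) // 8                # round 20 bits up to 3 bytes

table_bytes = entry_bytes * num_virtual_pages
print(entry_bytes, table_bytes // 2**20)  # 3 bytes per entry, 3 MiB in total
```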
TLB: Translation lookaside buffer
The page table is large and must be stored in MM. Without extra support, every memory access would first require an access to the page table in MM before the cache can even be checked, which is very slow. The TLB therefore caches recently used page table entries; it is usually small and fully associative.
TLB miss: the translation is looked up in the page table in MM. If the page is in MM, the page table entry replaces the least recently used TLB entry and the translation is retrieved. If the page resides on disk, a page fault occurs.