By Nathawat Vongchinsri, Dan Jen
Modified by Jeff Luyau
If you have any questions about our notes, please feel free to email Dan Jen at aphelion112@gmail.com.
Recall from last lecture, we were analyzing the performance of reading 40B from a disk. Here were the specifications of our hypothetical machine.
| Specifications of 1 GHz processor | |
|---|---|
| Cycle | 1 ns |
| Programmed IO Instruction (PIO) | 1 μs |
| Writing a command to disk | 5 μs |
| Interrupt Processing | 5 μs |
| Check if disk ready | 1 μs |
| Read 40B from disk cache to RAM | 40 μs |
| Disk latency | 50 μs |
| Computation time | 5 μs |
Without any utilization techniques, our CPU polled until the disk completed fetching our data. Since it does no other useful task while waiting, we call this busy waiting. Here were the results of our basic read of 40B.
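- Latency: 5 μs (write command) + 50 μs (disk latency) + 40 μs (read 40B) + 5 μs (compute) = 100 μs
- Throughput: 1/100 μs = 10,000 requests/second
- Utilization: 5/100 = 5%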
Note: Throughput is the rate of requests handled per unit time, while latency is the time it takes to handle a single request. Thus, Throughput = 1/Latency if requests are serial.
Can we do better than this? You'd better believe it. While polling may seem like a good method to get the desired task done, it's far too simple as we sit around waiting for the operations to finish, wasting system resources and time. Our first improvement technique is batching.
Batching is the technique in which the system handles several requests as a group, so that overhead that would otherwise be paid once per request is amortized over the whole group. We can observe that no matter how much information we fetch from the disk at once, the disk latency remains constant. So the plan is to fetch enough to satisfy 21 requests at a time, paying the disk latency penalty only once.
- Latency: 21(5 μs) + 21(40 μs) + 50 μs + 5 μs = 1000 μs
- Throughput: 21 requests/1000 μs = 21,000 requests/second
- Utilization: 105/1000 = 10.5%
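The batching arithmetic can be sketched in C as a function of the batch size. This is only an illustration: the function names are made up for this example, and the costs are the ones from the specification table. With a batch size of 1 it reproduces the plain polling numbers (100 μs, 10,000 requests/second, 5%), and with 21 it reproduces the batching numbers.

```c
#include <assert.h>

/* Costs in microseconds, taken from the specification table. */
enum {
    WRITE_CMD_US    = 5,   /* writing a command to disk, once per batch */
    DISK_LATENCY_US = 50,  /* disk latency, once per batch */
    READ_40B_US     = 40,  /* read 40B from disk cache to RAM, per request */
    COMPUTE_US      = 5    /* computation time, per request */
};

/* Time to service one batch of n requests (hypothetical helper). */
static int batch_latency_us(int n)
{
    assert(n >= 1);
    return n * COMPUTE_US + n * READ_40B_US + DISK_LATENCY_US + WRITE_CMD_US;
}

/* Requests completed per second when batches run back to back. */
static double batch_throughput_per_s(int n)
{
    return n * 1e6 / batch_latency_us(n);
}

/* Fraction of the time the CPU spends on useful computation. */
static double batch_utilization(int n)
{
    return (double)(n * COMPUTE_US) / batch_latency_us(n);
}
```

Under these assumptions, larger batches keep improving throughput and utilization, but every request in the batch waits for the whole batch, which is why latency grows.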
By batching, we amortize the cost of disk latency and of writing a command to disk, resulting in much better throughput and, more importantly, much better utilization. But notice that latency got worse. Clearly there is room for improvement. Currently, we are still busy waiting on disk latency. For an entire 50 μs, the CPU does nothing useful. Let's get rid of this by adding device interrupts.
By implementing interrupts, we can scrap the entire polling process altogether. To do this, we have the device, when ready, send a signal to the processor. The processor then saves its own state and lets the interrupt handler service the interrupt request. (Reference: Interrupts on Wikipedia) This way, if the system issues an operation that may have a lengthy service time, it can attend to other requests in the meantime (the device has a buffer of outstanding requests), thus increasing the amount of useful work done by the computer. Here's a short description of how Linux uses both hardware and software to implement interrupts and interrupt handling: Interrupts and interrupt handling
Before:

    while (1) {
        write command to disk
        while (disk not ready)
            /* */;
        read buffer
        compute
    }

After:

    while (1) {
        write command to disk
        block until there is interrupt
        handle interrupt
        check that disk is ready
        read buffer
        compute
    }
By implementing interrupts and interrupt handlers, the system can avoid the cost of spinning during disk latency. Blocking lets the CPU handle other requests running in parallel. As a result, we see in the following throughput calculation that the per-request disk latency penalty does not factor into throughput. Yaaaaay!
| Cost of interrupt | |
|---|---|
| Write command to disk | 5 μs |
| Block until there is interrupt | 50 μs |
| Handling the interrupt | 5 μs |
| Check that disk is ready | 1 μs |
| Read buffer | 40 μs |
| Compute | 5 μs |
- Latency: 106 μs
- Throughput: 1/56 μs ≈ 17,900 requests/second
- Utilization: 5/56 ≈ 8.9%
What the.. ?!?!? We got the latency back down, but throughput and utilization actually got worse than with batching! So we can give up and go back to batching, or we can try to improve what we've got. Let's do the latter. To improve our design, we seek out the bottleneck and pry it open as wide as we can. In our case, the bottleneck is reading the buffer. Can we fix this? Yes, using a technique we learned early in the quarter: DMA (Direct Memory Access)
As we lightly touched on Direct Memory Access (DMA) earlier in the class, recall that DMA is a method through which the system tells the disk to put data directly into memory. Therefore, the system needs to do only a few programmed I/O instructions, since the processor and disk now communicate through memory. To transfer data from the disk to memory with DMA, the system uses a bounded buffer. Again, another description of Linux's DMA: Direct Memory Access (DMA).
Bounded Buffer

1. All DMA slots start off empty.
2. To write a command to the disk, write READY to a slot.
3. On completion, the disk writes DONE.
Now writing a command to the disk is just a write into primary memory! This is SO FAST we don't even consider its cost, because it's negligible, like 100ns or something. So with a DMA addition to our design, we have completely eliminated the overhead of reading the buffer.
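The three bounded-buffer rules above can be sketched as a tiny state machine in C. This is a minimal simulation, not real driver code: the slot count, the struct, and all function names are hypothetical, the "disk" is just another function, and a real implementation would also need volatile accesses and memory barriers.

```c
#include <assert.h>

#define NSLOTS 4  /* hypothetical number of DMA slots */

enum slot_state { SLOT_EMPTY = 0, SLOT_READY, SLOT_DONE };

struct dma_buffer {
    enum slot_state slot[NSLOTS];  /* rule 1: all slots start off EMPTY */
};

/* Move the first slot found in state `from` to state `to`.
   Returns the slot index, or -1 if no slot is in state `from`. */
static int transition(struct dma_buffer *b, enum slot_state from,
                      enum slot_state to)
{
    assert(from != to);
    for (int i = 0; i < NSLOTS; i++) {
        if (b->slot[i] == from) {
            b->slot[i] = to;
            return i;
        }
    }
    return -1;
}

/* Rule 2: the CPU issues a command by writing READY into an empty slot. */
static int cpu_submit(struct dma_buffer *b)
{
    return transition(b, SLOT_EMPTY, SLOT_READY);
}

/* Rule 3: the (simulated) disk finishes a request and writes DONE. */
static int disk_complete(struct dma_buffer *b)
{
    return transition(b, SLOT_READY, SLOT_DONE);
}

/* The CPU consumes a finished slot, making it available again. */
static int cpu_reap(struct dma_buffer *b)
{
    return transition(b, SLOT_DONE, SLOT_EMPTY);
}
```

Every step here is an ordinary memory write, which is the point of the design: issuing a command no longer requires slow programmed I/O.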
As a result of DMA
while(1){
block until interrupt
check disk is ready
compute
}
- Latency: block until interrupt (50 μs + 5 μs) + check disk is ready (1 μs) + compute (5 μs) = 61 μs
- Throughput: 1/11 μs ≈ 90,900 requests/second
- Utilization: 5/11 ≈ 45%
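If the disk handles only 1 request at a time, every request must wait out the full 50 μs disk latency. But if it can work on more than one request at once, those latencies overlap and the CPU spends only 11 μs per request (interrupt handling, ready check, and compute). Therefore, in this example, the throughput and utilization figures assume that the disk can overlap disk latency across multiple requests.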
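When disk latency is overlapped, throughput depends only on the CPU time spent per request. The sketch below makes that model explicit for the interrupt design with and without DMA. The function names are invented for this example, and the costs come from the tables in these notes. It gives 56 μs of CPU time per request without DMA and 11 μs with DMA, and 1/11 μs works out to roughly 90,909 requests per second.

```c
#include <assert.h>

/* Per-request CPU costs in microseconds, from the tables in the notes. */
enum {
    WRITE_CMD_US   = 5,
    INTERRUPT_US   = 5,
    CHECK_READY_US = 1,
    READ_BUFFER_US = 40,
    COMPUTE_US     = 5
};

/* CPU time per request with interrupts but no DMA (disk latency overlapped). */
static int cpu_us_interrupts(void)
{
    return WRITE_CMD_US + INTERRUPT_US + CHECK_READY_US
         + READ_BUFFER_US + COMPUTE_US;
}

/* CPU time per request with interrupts and DMA: command writes and
   buffer reads are treated as negligible, as in the notes. */
static int cpu_us_interrupts_dma(void)
{
    return INTERRUPT_US + CHECK_READY_US + COMPUTE_US;
}

/* Requests per second if the CPU is the only bottleneck. */
static double throughput_per_s(int cpu_us_per_request)
{
    assert(cpu_us_per_request > 0);
    return 1e6 / cpu_us_per_request;
}

/* Fraction of CPU time spent on useful computation. */
static double utilization(int cpu_us_per_request)
{
    assert(cpu_us_per_request > 0);
    return (double)COMPUTE_US / cpu_us_per_request;
}
```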
Whoa, that rocks. But let's be dissatisfied perfectionists. How could we improve? What's our new bottleneck? Interestingly, our biggest bottleneck is now interrupt handling. So forget blocking, let's poll. If we remove interrupts and use polling, we get better performance, because with many requests outstanding at once, some request is always ready. Here is the performance of polling with DMA:
- Latency: 50 μs + 1 μs + 5 μs = 56 μs
- Throughput: 1/6 μs ≈ 166,666 requests/second
- Utilization: 5/6 ≈ 83%
WOW!! Here's a comparison of polling without and with DMA:
Polling/Busy wait:

    while (disk not ready [1 μs])
        /* */;

Polling with DMA:

    while (DMA slots not ready [5 ns])
        schedule(); // run another request
The first point of interest is that polling with DMA does not incur the latency penalty of plain polling: while a slot is not ready, the loop calls schedule() to run another request, so the system keeps doing useful work. Plain polling, as noted earlier in the notes, stalls progress by tying up the CPU until the operation finishes. Secondly, checking whether the data is ready is much faster with DMA, since checking memory is orders of magnitude faster than checking the disk. This can be seen in the checking times: 5 ns for DMA, 1 μs for polling.
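The polling-with-DMA loop can be simulated in a few lines of C. Everything here is a stand-in: `schedule()` is a hypothetical function that pretends to run another request, and the simulated disk finishes after a fixed number of those calls. The point is only that each unsuccessful check of the slot leads to other work instead of an empty spin.

```c
#include <assert.h>

enum slot_state { SLOT_READY, SLOT_DONE };

struct sim {
    enum slot_state slot;   /* the DMA slot being waited on */
    int ticks_until_done;   /* simulated disk: schedule() calls left before DONE */
    int other_work_done;    /* how many times another request got the CPU */
};

/* Hypothetical scheduler stand-in: run another request. While that
   request runs, the simulated disk makes progress on ours. */
static void schedule(struct sim *s)
{
    s->other_work_done++;
    if (s->ticks_until_done > 0 && --s->ticks_until_done == 0)
        s->slot = SLOT_DONE;  /* the disk writes DONE into the slot */
}

/* Poll the slot, yielding to other work whenever it is not ready yet. */
static void wait_for_slot(struct sim *s)
{
    /* Guard against a simulation that could never finish. */
    assert(s->slot == SLOT_DONE || s->ticks_until_done > 0);
    while (s->slot != SLOT_DONE)  /* cheap check of ordinary memory */
        schedule(s);              /* do useful work instead of spinning */
}
```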
The following table provides an easy-to-read chart of the different methodologies we have analyzed, including the motivation, latency, throughput, and utilization for each method.
| Summary of methods | |||||
|---|---|---|---|---|---|
| Method | Motivation | Latency | Throughput | Utilization | Other Notes |
| Polling | simple implementation | 100 μs | 10,000 req/s | 5% | Relatively high throughput; high latency |
| Batching | reduce overhead | 1000 μs | 21,000 req/s | 10.5% | Relatively high throughput; high latency |
| Interrupts | allow system to handle other requests while waiting | 106 μs | 17,900 req/s | 8.9% | Relatively high throughput; lower latency (w/o batching) |
| DMA | transfer data from disk to memory without processor | 61 μs | 90,900 req/s | 45% | Higher throughput; low latency |
| DMA w/ polling | remove latency due to interrupt handling | 56 μs | 166,666 req/s | 83% | Highest throughput; lowest latency |
There are some other techniques for performance improvements that we can always consider. These are batching, dallying, speculation, and buffered I/O.
Other techniques we've used include analysis of overheads and elimination of bottlenecks. These techniques also help in the design of data structures and their corresponding interfaces. What better way to learn about them than by seeing them in action on an important OS data structure: the file system.
A file system is an on-disk data structure that provides a virtual-memory-like abstraction/interface.
Let's analyze some old file systems to see what we can learn.
The RT-11 (Real Time) File System was designed as a real-time system for use by a single user. Small and quick, it was used for a number of business, commercial, and scientific endeavors due to its real-time and data processing capabilities. Its design was straightforward, making it easy to use and efficient with system resources. Source: http://shop-pdp.kent.edu/rthtml/rt11.htm.
Another link: Image of system using RT-11
RT-11, with a 4k memory, used contiguous file allocation. The diagram below illustrates the inherent problem with contiguous file allocation.
The observed problem of this file system structure is external fragmentation. External fragmentation is defined as "the phenomenon in which free storage becomes divided up into many small pieces over time" (Wikipedia fragmentation entry). One way to solve this problem is by using fixed-size block allocation. The FAT file system did just that.
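External fragmentation is easy to demonstrate with a small usage map. The sketch below is an illustration only (the function names and the first-fit search are assumptions, not a description of how RT-11 did it): a disk can have plenty of free blocks in total and still have no contiguous run large enough for a new file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the free blocks in a usage map (true means the block is in use). */
static size_t total_free(const bool *used, size_t n)
{
    size_t free_blocks = 0;
    assert(used != NULL);
    for (size_t i = 0; i < n; i++)
        if (!used[i])
            free_blocks++;
    return free_blocks;
}

/* First fit: return the index of the first run of `need` contiguous
   free blocks, or -1 if there is no such run (or need is 0). */
static long find_contiguous(const bool *used, size_t n, size_t need)
{
    size_t run = 0;  /* length of the current run of free blocks */
    assert(used != NULL);
    for (size_t i = 0; i < n; i++) {
        run = used[i] ? 0 : run + 1;
        if (need > 0 && run == need)
            return (long)(i + 1 - need);
    }
    return -1;
}
```

For the 8-block map used, free, free, used, free, free, used, free, there are 5 free blocks in total, yet a request for 3 contiguous blocks fails.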
The File Allocation Table (FAT) File System, designed by Bill Gates and Marc McDonald, was first used for disk management in Microsoft Disk Basic. In 1980, Tim Paterson implemented FAT in his operating system, 86-DOS. It has since become so universal that many other operating systems have adopted it. (Source: Wikipedia file allocation table entry).
A FAT file system gets its name from one of the structures implemented in it: the file allocation table (surprise). This central table keeps track of used/unused disk space and file locations on disk. The disk is divided into fixed-size blocks of 4KB. The purpose of fixed-size block allocation is to avoid external fragmentation. FAT, like any disk structure using fixed-size blocks, suffers from internal fragmentation, meaning that the data in a block may not use up all the space allocated for it.
The FAT file system disk structure can be represented as such (regions not to scale):
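Internal fragmentation can be quantified with a little integer arithmetic. This sketch assumes the 4KB block size from the notes; the function names are made up for the example.

```c
#include <assert.h>

#define BLOCK_SIZE 4096L  /* 4KB blocks, as in the notes */

/* Number of fixed-size blocks needed to hold file_size bytes. */
static long blocks_needed(long file_size)
{
    assert(file_size >= 0);
    return (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;  /* round up */
}

/* Bytes allocated to the file but not used by it. */
static long wasted_bytes(long file_size)
{
    return blocks_needed(file_size) * BLOCK_SIZE - file_size;
}
```

A 1-byte file and a 4097-byte file both waste 4095 bytes, while a file of exactly 4096 bytes wastes nothing. The waste is always less than one block per file.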
| Boot Sector | Super Block | File Allocation Table | Data Region (for files and directories) ... (uses the rest of partition or disk) |
The boot sector is the part of the disk that is used to start up the system. It usually contains the necessary operating system executables and other bootstrapping programs. The super block contains the information of the file system, like number of nodes and pointer to the first data structure. The file allocation table region is the part of the disk which holds the FAT structure, as detailed above. Lastly, the disk has a data region, which stores all the file data.
blockno_t fat[1024]; // each element is 4B, making the table 1024 * 4B = 4KB = 1 block
The File Allocation Table is a map of the Data Region. There is an array entry for every block number. Each entry takes one of 3 kinds of value: -1 for free, 0 for EOF, and > 0 for the number of the next block in the file. A file can occupy more than one block depending on its size, so the file is essentially a linked list of its blocks.
From the diagram, we can see that the data in blocks 0 and 1 take up only that 1 block. However, data which is greater than the 4KB spans more than one block, as shown with the data in block 7. The number in the entry references the next location of the data. Thus, data beginning in block 7 runs over to block 100 (to user, just 1 contiguous file; to FS, linked list of blocks).
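Following a file's chain through the table can be sketched as below. The encoding (-1 free, 0 EOF, > 0 next block) is the one from the notes; `blockno_t` is assumed to be a 4-byte signed integer, and the function name and its error handling are inventions for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef int blockno_t;   /* assumption: 4-byte signed block number */

#define FAT_ENTRIES 1024
#define FAT_FREE    (-1)
#define FAT_EOF     0

/* Collect the blocks of the file that starts at `first` into out[].
   Returns the number of blocks, or -1 if the chain is invalid
   (out of range, runs into a free block, or longer than max). */
static int fat_chain(const blockno_t *fat, blockno_t first,
                     blockno_t *out, int max)
{
    int n = 0;
    blockno_t b = first;

    assert(fat != NULL && out != NULL);
    for (;;) {
        if (b < 0 || b >= FAT_ENTRIES || fat[b] == FAT_FREE || n >= max)
            return -1;
        out[n++] = b;          /* this block belongs to the file */
        if (fat[b] == FAT_EOF)
            return n;          /* last block of the file */
        b = fat[b];            /* follow the link to the next block */
    }
}
```

With `fat[7] = 100` and `fat[100] = 0`, walking from block 7 yields the two blocks 7 and 100, matching the example in the text.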
However, the question still remains... how do you find the first block of a file? Well, we use directories and directory entries.
Directories map filenames to file contents. They are stored on disk, just like files. Directories are generally implemented as a disk array of directory entries. Each entry represents a file that is considered to be in the directory. The directory entry looks like this:
| FAT directory entry |
|---|
| file name |
| file size |
| beginning block # of file |
So now we know how to find the first block of a file, but how do we find the first directory (the root directory)? Well, the superblock stores the first block # of the root directory. And now we know how FAT works! At least we know enough.
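A directory lookup under this scheme might look like the sketch below. The three fields are the ones listed for a FAT directory entry, but the struct name, field widths, and the NUL-terminated name are assumptions for illustration, not the real on-disk FAT layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int blockno_t;   /* assumption: 4-byte signed block number */

/* Hypothetical in-memory form of a directory entry. */
struct fat_dirent {
    char      name[12];      /* file name, assumed NUL-terminated */
    unsigned  size;          /* file size in bytes */
    blockno_t first_block;   /* beginning block # of file */
};

/* Linear search of a directory (an array of n entries) for `name`.
   Returns a pointer to the matching entry, or NULL if there is none. */
static const struct fat_dirent *dir_lookup(const struct fat_dirent *dir,
                                           size_t n, const char *name)
{
    assert(dir != NULL && name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(dir[i].name, name) == 0)
            return &dir[i];
    return NULL;
}
```

Once the entry is found, its `first_block` is the starting index into the file allocation table, and the rest of the file is reached by following the chain from there.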
Since FAT uses a linked list to allocate a file, seeking takes O(n) time, where n is the offset into the file. For large files, O(n) is too slow to be efficient. Therefore, we need a better data structure, bringing us to inodes.
The inode idea uses a tree instead of a linked list to seek to a given byte.
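The O(n) cost shows up directly if we count table lookups during a seek. In this sketch the function name and the `hops` counter are made up for illustration, the block size is the 4KB from the notes, and the table is trusted to contain no out-of-range links.

```c
#include <assert.h>

typedef int blockno_t;   /* assumption: 4-byte signed block number */

#define BLOCK_SIZE 4096L /* 4KB blocks, as in the notes */

/* Find the block that holds byte `offset` of the file starting at `first`.
   Stores the number of FAT links followed in *hops. Returns -1 if the
   offset lies beyond the file's last block. */
static blockno_t fat_seek(const blockno_t *fat, blockno_t first,
                          long offset, int *hops)
{
    blockno_t b = first;

    assert(offset >= 0);
    *hops = 0;
    for (long skip = offset / BLOCK_SIZE; skip > 0; skip--) {
        if (fat[b] <= 0)   /* 0 is EOF, -1 is free: nothing further */
            return -1;
        b = fat[b];        /* one table lookup per block skipped */
        (*hops)++;
    }
    return b;
}
```

Seeking to byte offset k follows k / 4096 links, so the work grows linearly with the offset.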
File record contains:
With this structure, we only need at most 2 lookups to find blocks 1-1033. Continued in Lec 14.