You are expected to understand this. CS 111 Operating Systems Principles, Fall 2006

Lecture 13 notes

By Nathawat Vongchinsri, Dan Jen

Modified by Jeff Luyau

If you have any questions about our notes, please feel free to email Dan Jen at aphelion112@gmail.com.

Performance II

Recall from last lecture, we were analyzing the performance of reading 40B from a disk. Here were the specifications of our hypothetical machine.

Specifications of 1 GHz processor
  Cycle                               1 ns
  Programmed IO Instruction (PIO)     1 μs
  Writing a command to disk           5 μs
  Interrupt Processing                5 μs
  Check if disk ready                 1 μs
  Read 40B from disk cache to RAM    40 μs
  Disk latency                       50 μs
  Computation time                    5 μs

Polling/Busy wait

Without any utilization techniques, our CPU polled until the disk completed fetching our data. Since it does no other useful task while waiting, we call this busy waiting. Here were the results of our basic read of 40B.

  • Write command to disk: 5 PIO
  • Wait for completion: 50 PIO
  • Read data: 40 PIO (1 PIO per byte)
  • Computation: 5 PIO
Latency:     100 μs
Throughput:  10,000 requests/second
Utilization: 5%

Note: Throughput is the rate of requests handled per unit time, while latency is the time it takes to handle a single request. Thus, Throughput = 1/Latency if requests are serial.
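The three numbers above follow mechanically from the per-step costs. Here is a minimal sketch in C that reproduces them with integer arithmetic. The struct and function names are made up for illustration, and utilization is taken to mean compute time divided by total time, which is what the 5% figure implies.

```c
#include <assert.h>

/* Per-request costs in microseconds (values from the spec table).
 * The struct and function names are illustrative, not from the lecture. */
struct request_cost {
    long write_cmd_us;  /* write command to disk */
    long wait_us;       /* busy-wait for the disk latency */
    long read_us;       /* copy 40 B, 1 PIO per byte */
    long compute_us;    /* useful work */
};

/* Latency: total time to handle one request. */
long latency_us(struct request_cost c) {
    return c.write_cmd_us + c.wait_us + c.read_us + c.compute_us;
}

/* Throughput for serial requests: 1 / latency, in requests per second. */
long throughput_per_sec(struct request_cost c) {
    long lat = latency_us(c);
    assert(lat > 0);
    return 1000000L / lat;
}

/* Utilization: share of the time spent computing, in tenths of a percent. */
long utilization_permille(struct request_cost c) {
    long lat = latency_us(c);
    assert(lat > 0);
    return 1000L * c.compute_us / lat;
}
```

With the costs {5, 50, 40, 5} this gives a latency of 100 μs, 10,000 requests/second, and 50 permille (5%).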

Can we do better than this? You'd better believe it. Polling is simple, but the CPU sits idle for the whole disk latency, wasting time it could spend on useful work. Our first improvement technique is batching.

Batching

Batching is the technique in which the system handles several requests as a group, so that fixed overhead is paid once per group instead of once per request. We can observe that no matter how much information we fetch from the disk at once, the disk latency remains constant. So the plan is to fetch enough data to satisfy 21 requests while paying the disk latency penalty (and the cost of writing the command) only once.

Latency:     21(5 μs) + 21(40 μs) + 50 μs + 5 μs = 1000 μs
Throughput:  21 requests/1000 μs = 21,000 requests/second
Utilization: 105/1000 = 10.5%
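To see how the batch size drives these numbers, here is a small sketch in C that generalizes the calculation to a batch of n requests. The function names are made up for illustration; the costs are the ones from the spec table, and the model (command and disk latency paid once, read and compute paid per request) is the one used in the calculation above.

```c
#include <assert.h>

/* Costs in microseconds, from the spec table. */
enum { WRITE_CMD_US = 5, DISK_LATENCY_US = 50, READ_US = 40, COMPUTE_US = 5 };

/* Time to handle a batch of n requests with busy waiting: the command
 * and the disk latency are paid once, read and compute once per request. */
long batch_latency_us(long n) {
    assert(n > 0);
    return WRITE_CMD_US + DISK_LATENCY_US + n * (READ_US + COMPUTE_US);
}

/* Requests completed per second when batches run back to back. */
long batch_throughput_per_sec(long n) {
    return n * 1000000L / batch_latency_us(n);
}

/* Share of the time spent computing, in tenths of a percent. */
long batch_utilization_permille(long n) {
    return 1000L * n * COMPUTE_US / batch_latency_us(n);
}
```

For n = 21 this yields 1000 μs, 21,000 requests/second, and 105 permille (10.5%); for n = 1 it falls back to the plain polling numbers.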

By batching, we amortize the cost of disk latency and of writing a command to disk, resulting in much better throughput and, more importantly, much better utilization. But notice that latency got worse. Clearly there is room for improvement. Currently, we are still busy waiting on disk latency. For an entire 50 μs, the CPU does nothing useful. Let's get rid of this by adding device interrupts.

Device Interrupts

By implementing interrupts, we can scrap the entire polling process altogether. To do this, we have the device, when ready, send a signal to the processor. The processor then saves its own state and lets the interrupt handler service the interrupt request. (Reference: Interrupts on Wikipedia) This way, while one request is waiting out a lengthy service time, the system can attend to other requests (the device has a buffer of outstanding requests), thus increasing the amount of useful work done by the computer. Here's a small description of how Linux uses both hardware and software to implement interrupts and interrupt handling: Interrupts and interrupt handling

Before                          After
while(1){                       while(1){
   write command to disk           write command to disk
   while(disk not ready)           block until there is interrupt
      /*  */;                      handle interrupt
   read buffer                     check that disk is ready
   compute                         read buffer
}                                  compute
                                }

By implementing interrupts and interrupt handlers, the system can avoid the cost of spinning during disk latency. By blocking, it allows the CPU to handle other requests running in parallel. As a result, we see in the following calculation that the per-request disk latency penalty does not factor into the throughput. Yaaaaay!

Cost of interrupt
  Write command to disk             5 μs
  Block until there is interrupt   50 μs
  Handling the interrupt            5 μs
  Check that disk is ready          1 μs
  Read buffer                      40 μs
  Compute                           5 μs
Latency:     106 μs
Throughput:  1/56 μs ~ 17,900 requests/second
Utilization: 5/56 ~ 8.9%
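The 56 in these formulas is the CPU time each request actually consumes: the 106 μs latency minus the 50 μs spent blocked, during which the CPU can serve other requests. A short sketch in C makes the bookkeeping explicit. The names are invented for illustration; the costs are the ones listed above.

```c
#include <assert.h>

/* Per-step costs in microseconds, from the table above. */
enum {
    WRITE_CMD_US = 5, BLOCKED_US = 50, HANDLE_IRQ_US = 5,
    CHECK_READY_US = 1, READ_BUF_US = 40, COMPUTE_US = 5
};

/* Time from issuing one request to finishing it. */
long irq_latency_us(void) {
    return WRITE_CMD_US + BLOCKED_US + HANDLE_IRQ_US
         + CHECK_READY_US + READ_BUF_US + COMPUTE_US;
}

/* CPU time consumed per request: while blocked, the CPU is free to
 * run other requests, so BLOCKED_US is left out. */
long irq_cpu_time_us(void) {
    return irq_latency_us() - BLOCKED_US;
}

/* Requests per second when the blocked time is fully overlapped. */
long irq_throughput_per_sec(void) {
    long busy = irq_cpu_time_us();
    assert(busy > 0);
    return 1000000L / busy;
}

/* Share of CPU time spent computing, in tenths of a percent. */
long irq_utilization_permille(void) {
    return 1000L * COMPUTE_US / irq_cpu_time_us();
}
```

This gives 106 μs latency, 56 μs of CPU time per request, 17,857 requests/second, and 89 permille, matching the rounded figures above.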

What the.. ?!?!? We got the latency back down, but throughput and utilization actually got worse than with batching! So we can give up and go back to batching, or we can try to improve what we've got. Let's do the latter. To improve our design, we seek out the bottleneck and pry it open as wide as we can. In our case, the bottleneck is reading the buffer. Can we fix this? Yes, using a technique we learned early in the quarter: DMA (Direct Memory Access)

Direct Memory Access

We touched lightly on Direct Memory Access (DMA) earlier in the class: it is a method by which the system tells the disk to put data directly into memory. The system therefore needs only a few programmed IO instructions, since the processor and disk now communicate through memory. To transfer data from the disk to memory with DMA, the system uses a bounded buffer. Again, another description on Linux's DMA: Direct Memory Access (DMA).

  Bounded Buffer
  1. All DMA slots start off empty.
  2. To write a command to the disk, write READY to slot.
  3. On completion, disk writes DONE.

Now writing a command to the disk is just a write into primary memory! This is SO FAST we don't even consider its cost, because it's negligible, like 100ns or something. So with a DMA addition to our design, we have completely eliminated the overhead of reading the buffer.
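The slot protocol can be sketched as a tiny state machine in C. This is only an illustration under stated assumptions: the ring size, the struct layout, and all function names are hypothetical, and the disk side is simulated by an ordinary function call rather than by real hardware writing into memory.

```c
#include <assert.h>
#include <stddef.h>

/* Slot states from the three steps above. */
enum slot_state { SLOT_EMPTY, SLOT_READY, SLOT_DONE };

#define NSLOTS 8                      /* hypothetical buffer size */

struct dma_slot {
    volatile enum slot_state state;   /* written by both CPU and disk */
    char data[40];                    /* where the disk would deposit 40 B */
};

struct dma_ring {
    struct dma_slot slots[NSLOTS];
};

/* CPU side: post a read by marking the first empty slot READY.
 * Returns the slot index, or -1 if every slot is in use. */
int dma_submit(struct dma_ring *r) {
    assert(r != NULL);
    for (int i = 0; i < NSLOTS; i++) {
        if (r->slots[i].state == SLOT_EMPTY) {
            r->slots[i].state = SLOT_READY;
            return i;
        }
    }
    return -1;
}

/* Disk side (simulated): the disk has filled slot i and marks it DONE. */
void disk_complete(struct dma_ring *r, int i) {
    assert(r != NULL && i >= 0 && i < NSLOTS);
    assert(r->slots[i].state == SLOT_READY);
    r->slots[i].state = SLOT_DONE;
}

/* CPU side: if slot i is DONE, consume it and free the slot.
 * Returns 1 if data was collected, 0 if the disk is not finished. */
int dma_collect(struct dma_ring *r, int i) {
    assert(r != NULL && i >= 0 && i < NSLOTS);
    if (r->slots[i].state != SLOT_DONE)
        return 0;
    r->slots[i].state = SLOT_EMPTY;
    return 1;
}
```

Note that both submitting a command and checking for completion are plain memory accesses here, which is the point of the paragraph above.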

As a result of DMA

while(1){
   block until interrupt
   check disk is ready
   compute
}
Latency:     block until interrupt (50 μs + 5 μs) + check disk is ready (1 μs) + compute (5 μs) = 61 μs
Throughput:  1/11 μs ~ 90,900 requests/second
Utilization: 5/11 ~ 45%

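If the disk handles one request at a time, each request still has to wait out the 50 μs disk latency. But if the disk can work on more than one request at once, those waits overlap and the CPU cost per request shrinks to 11 μs (interrupt handling, the ready check, and compute). Therefore, in this example, for throughput and utilization, we assume that the disk can overlap disk latency across multiple requests.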

Whoa, that rocks. But let's be dissatisfied perfectionists. How could we improve? What's our new bottleneck? Interestingly, our biggest overhead is now interrupt handling. So forget blocking, let's poll. If we remove interrupts and use polling, we get better performance, because with many requests outstanding at once, some request is always ready. Here is the performance for polling with DMA:

Latency:     56 μs
Throughput:  166,666 requests/second
Utilization: 5/6 ~ 83%

WOW!! Here's a comparison of polling without and with DMA:

Polling/Busy wait                            Polling with DMA
while (disk not ready [1 μs])                while(DMA slots not ready [5 ns])
    /*  */;                                      schedule();     // run another request

The first point of interest is that polling with DMA does not incur the waiting penalty of plain polling: when no slot is ready, the loop calls schedule() to run another request, so the system keeps doing useful work. Plain polling, as noted earlier in the notes, stalls all progress while the CPU spins waiting for the disk to finish. Secondly, checking whether the data is ready is much faster with DMA, since checking memory is orders of magnitude faster than checking the disk. This can be seen in the checking times: 5 ns for DMA, 1 μs for polling.
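As a rough sketch of the right-hand loop, the C code below makes one polling pass over the DMA slots. Everything here is an assumption made for illustration: the slot array, the number of slots, and a stand-in schedule() that merely counts how often the CPU was handed to other work.

```c
#include <assert.h>

enum slot_state { SLOT_EMPTY, SLOT_READY, SLOT_DONE };

#define NSLOTS 4                /* hypothetical number of DMA slots */

/* Stand-in for the real schedule(): it only counts how many times
 * the CPU was yielded to another request. */
static int other_work_done;
static void schedule(void) { other_work_done++; }

/* One polling pass. Each slot check is a memory read, not a PIO.
 * Every slot the disk has marked DONE is handled and reissued; if
 * nothing was ready, the CPU runs something else instead of spinning.
 * Returns the number of requests handled in this pass. */
int poll_once(enum slot_state slots[NSLOTS]) {
    assert(slots != 0);
    int handled = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (slots[i] == SLOT_DONE) {
            /* compute on the slot's data here */
            slots[i] = SLOT_READY;      /* post the next read */
            handled++;
        }
    }
    if (handled == 0)
        schedule();
    return handled;
}
```

The design choice to notice is that an empty pass costs only a few memory reads plus a context switch to useful work, never a busy wait on the device.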

Summary

The following table provides an easy-to-read chart of the different methodologies we have analyzed, including the motivation, latency, throughput, and utilization for each method.

Summary of methods
Method         | Motivation                                          | Latency | Throughput     | Utilization | Other Notes
Polling        | simple implementation                               | 100 μs  | 10,000 req/s   | 5%          | Lowest throughput; CPU idle while waiting
Batching       | reduce overhead                                     | 1000 μs | 21,000 req/s   | 10.5%       | Relatively high throughput; high latency
Interrupts     | allow system to handle other requests while waiting | 106 μs  | 17,900 req/s   | 8.9%        | Relatively high throughput; lower latency (w/o batching)
DMA            | transfer data from disk to memory without processor | 61 μs   | ~90,900 req/s  | 45%         | Higher throughput; low latency
DMA w/ polling | remove latency due to interrupt handling            | 56 μs   | 166,666 req/s  | ~83%        | Highest throughput; lowest latency

Performance Improvement Techniques

There are some other techniques for performance improvements that we can always consider. These are batching, dallying, speculation, and buffered I/O.

  • Batching - Handling several requests in a group to amortize the overhead and avoid repeated overhead.
  • Dallying - Delaying the handling of requests to create opportunities for batching.
  • Speculation - Operating in advance of a request, predicting and hoping that the work will be useful.
  • Buffered I/O:
    • batching - reduce system calls
    • dallying - batch writes to reduce overhead
    • speculation - batch reads to reduce overhead
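Buffered output is a convenient place to see batching and dallying together. The sketch below is a simplified, hypothetical writer in C (it is not the stdio implementation): small writes are copied into a 4 KB buffer, and the expensive step, which a real version would perform with a write system call, happens only when the buffer fills. The flush_count field stands in for the number of system calls.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 4096

/* Illustrative buffered writer. flush_count stands in for the number
 * of write system calls a real implementation would issue. */
struct buffered_out {
    char   buf[BUF_SIZE];
    size_t used;
    long   flush_count;
};

/* Hand the buffered bytes to the OS (simulated by counting). */
void bo_flush(struct buffered_out *o) {
    assert(o != NULL);
    if (o->used > 0) {
        o->flush_count++;
        o->used = 0;
    }
}

/* Dally: copy into the buffer and flush only when it is full, so many
 * small writes are batched into one large one. */
void bo_write(struct buffered_out *o, const char *p, size_t n) {
    assert(o != NULL && p != NULL);
    while (n > 0) {
        size_t room  = BUF_SIZE - o->used;
        size_t chunk = n < room ? n : room;
        memcpy(o->buf + o->used, p, chunk);
        o->used += chunk;
        p += chunk;
        n -= chunk;
        if (o->used == BUF_SIZE)
            bo_flush(o);
    }
}
```

Writing 10,000 single bytes through bo_write triggers only two flushes, with the remaining 1,808 bytes waiting for a final bo_flush. The price of dallying is that data sits in the buffer for a while before it reaches the disk.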

Other techniques we've used include analysis of overheads and elimination of bottlenecks. These techniques also help in the design of data structures and their corresponding interfaces. What better way to learn about them than by seeing them in action on an important OS data structure: the file system.

File systems I

A file system is an on-disk data structure that provides a virtual-memory-like abstraction/interface.

Let's analyze some old file systems to see what we can learn.

RT-11 File System (1970)

Background

The RT-11 (Real Time) File System belongs to RT-11, a real-time system designed for use by a single user. Small and quick, it was used for a number of business, commercial, and scientific endeavors because of its real-time and data-processing capabilities. Its design was straightforward, which made it easy to use and efficient with system resources. Source: http://shop-pdp.kent.edu/rthtml/rt11.htm.

Another link: Image of system using RT-11

Problem with RT-11

RT-11, with 4K of memory, used contiguous file allocation. The diagram below illustrates the inherent problem with contiguous file allocation.

rt11.jpg

The observed problem of this file system structure is external fragmentation. External fragmentation is defined as "the phenomenon in which free storage becomes divided up into many small pieces over time" (Wikipedia fragmentation entry). One way to solve this problem is by using fixed-size block allocation. The FAT file system did just that.

FAT File System (1977)

Background

The File Allocation Table (FAT) File System, designed by Bill Gates and Marc McDonald, was first used for disk management in Microsoft Disk Basic. In 1980, Tim Paterson implemented FAT in his operating system, 86-DOS. It has since become so universal that many other operating systems have adopted it. (Source: Wikipedia file allocation table entry).

A FAT file system gets its name from one of its structures: the file allocation table (surprise). This central table keeps track of used/unused disk space and file locations on disk. The disk is divided into fixed-size blocks of 4 KB. The purpose of fixed-size block allocation is to avoid external fragmentation. FAT, like any disk structure using fixed-size blocks, suffers from internal fragmentation, meaning that the data in a block may not use up all the space allocated for it.
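Internal fragmentation is easy to quantify: a file occupies a whole number of blocks, and whatever is left over in the last block is wasted. A minimal sketch in C, assuming the 4 KB block size from these notes (the function names are made up):

```c
#include <assert.h>

#define BLOCK_SIZE 4096L    /* 4 KB blocks, as in the notes */

/* Number of fixed-size blocks needed to hold `size` bytes. */
long blocks_needed(long size) {
    assert(size >= 0);
    return (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Bytes allocated to the file but not used by it. */
long internal_waste(long size) {
    return blocks_needed(size) * BLOCK_SIZE - size;
}
```

A 1-byte file wastes 4,095 bytes, a 4,096-byte file wastes none, and a 4,097-byte file needs two blocks and again wastes 4,095 bytes.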

Link: Differences between FAT16 and FAT32

Implementations

Basic Disk Layout

The FAT file system disk structure can be represented as such (regions not to scale):

| Boot Sector | Super Block | File Allocation Table | Data Region (for files and directories) ... (uses the rest of partition or disk) |
  • Sector: 512B. This is the size of single disk reads/writes.
  • Fixed-size allocation units: blocks (4 KB)
  • Super block contains information such as file system version, disk size, used blocks, and block number for first block in root directory.

The boot sector is the part of the disk that is used to start up the system; it holds the bootstrap code that gets the operating system loaded. The super block contains information about the file system itself, such as its version, the disk size, and the block number of the first block of the root directory. The file allocation table region is the part of the disk which holds the FAT structure, as detailed above. Lastly, the disk has a data region, which stores all the file data.

Link: The numbers behind the FAT FS

File Allocation Table
blockno_t fat[1024]; // each element is 4 B, so the table is 1024 * 4 B = 4 KB = 1 block

The File Allocation Table is a map of the Data Region. There is an array entry for every block number. An array entry has 3 possible values: -1 for free, 0 for EOF, and > 0 for the number of the next block in the file. Each file can occupy more than one block depending on its size, so the file is essentially a linked list of its blocks.

file_alloc_tbl2t.jpg

From the diagram, we can see that the data in blocks 0 and 1 each take up only that 1 block. However, data larger than 4 KB spans more than one block, as shown with the data in block 7. A positive entry gives the number of the block where the data continues. Thus, data beginning in block 7 runs over to block 100 (to the user, just 1 contiguous file; to the FS, a linked list of blocks).
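Following a file through the table looks like this in C. This is a sketch under the encoding given above (-1 free, 0 EOF, positive values for the next block); the constant names, the signed blockno_t typedef, and the function name are assumptions made for illustration.

```c
#include <assert.h>

typedef int blockno_t;      /* signed so that -1 can mark a free block */

#define FAT_FREE (-1)
#define FAT_EOF  0
#define NBLOCKS  1024

/* Count the blocks of the file whose first block is `first` by
 * walking the linked list stored in the table.
 * Returns -1 if the chain is longer than the disk (a cycle). */
int fat_chain_length(const blockno_t fat[NBLOCKS], blockno_t first) {
    int len = 0;
    for (blockno_t b = first; len < NBLOCKS; b = fat[b]) {
        assert(b >= 0 && b < NBLOCKS);
        assert(fat[b] != FAT_FREE);   /* a file never runs through a free block */
        len++;
        if (fat[b] == FAT_EOF)
            return len;
    }
    return -1;
}
```

With fat[0] = 0, fat[7] = 100, and fat[100] = 0, as in the example above, the file starting at block 0 has length 1 and the file starting at block 7 has length 2.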

However, the question still remains... how do you find the first block of a file? Well, we use directories and directory entries.

Directories

Directories map filenames to file contents. They are stored on disk, just like files. Directories are generally implemented as a disk array of directory entries. Each entry represents a file that is considered to be in the directory. The directory entry looks like this:

   FAT directory entry
-------------------------
file name
file size
beginning block # of file

So now we know how to find the first block of a file, but how do we find the first directory (the root directory)? Well, the superblock stores the first block # of the root directory. And now we know how FAT works! At least we know enough.
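A directory lookup can then be sketched as a linear scan over an array of entries. The three fields come from the entry layout shown above; the field widths, the NUL-terminated name, and the function name are assumptions for illustration and do not reflect the real on-disk FAT format.

```c
#include <assert.h>
#include <string.h>

typedef int blockno_t;

/* One directory entry. Field sizes are assumed, not the real layout. */
struct fat_dirent {
    char      name[12];      /* file name, NUL-terminated for simplicity */
    unsigned  size;          /* file size in bytes */
    blockno_t first_block;   /* beginning block # of file */
};

/* Scan a directory (an array of entries) for `name`.
 * Returns the file's first block, or -1 if there is no such file. */
blockno_t dir_lookup(const struct fat_dirent *dir, int nentries,
                     const char *name) {
    assert(dir != NULL && name != NULL);
    for (int i = 0; i < nentries; i++) {
        if (strcmp(dir[i].name, name) == 0)
            return dir[i].first_block;
    }
    return -1;
}
```

Opening a file by path would repeat this step once per path component, starting from the root directory whose first block is recorded in the superblock.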

FAT Seek Time

Since FAT stores each file as a linked list of blocks, seeking takes O(n) time, where n is the offset into the file: to reach a file's k-th block, we must follow k table entries starting from its first block. That is too slow for random access into large files. Therefore, we need a better data structure, bringing us to inodes.
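The linear cost shows up directly if we count the table entries followed during a seek. A minimal sketch in C, assuming 4 KB blocks and the table encoding described earlier in these notes (the function name and the hops out-parameter are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

typedef int blockno_t;

#define FAT_EOF    0
#define BLOCK_SIZE 4096

/* Find the block that holds byte `offset` of the file starting at
 * block `first`. *hops receives the number of table entries followed.
 * Returns -1 if the offset lies beyond the end of the file. */
blockno_t fat_seek(const blockno_t *fat, blockno_t first,
                   long offset, int *hops) {
    assert(fat != NULL && hops != NULL && offset >= 0);
    blockno_t b = first;
    *hops = 0;
    for (long skip = offset / BLOCK_SIZE; skip > 0; skip--) {
        if (fat[b] <= FAT_EOF)   /* EOF, or a free entry in a corrupt chain */
            return -1;
        b = fat[b];
        (*hops)++;
    }
    return b;
}
```

Seeking to offset 0 follows no entries, while seeking into the k-th block follows k of them, so the work grows linearly with the offset.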

The Inode Idea (1970)

The inode idea uses a tree instead of a linked list to seek to a given byte.

File record contains:

  1. n direct block pointers for first n blocks (array)
  2. 1 indirect block pointer, which points to a block filled with block pointers. Specifically, these cover file blocks n through 1023 + n.

With this structure, we need at most 2 lookups to find any of the file's first n + 1024 blocks (blocks 0-1033 for n = 10). Continued in Lec 14.
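The two-level lookup can be sketched in C as follows. The sketch takes n = 10 direct pointers, an assumption chosen because it matches the 1033 quoted above, and 1024 pointers per indirect block (4 KB block / 4 B pointer). The struct and function names are illustrative, and the indirect block is passed in as an in-memory array rather than read from disk.

```c
#include <assert.h>
#include <stddef.h>

typedef int blockno_t;

#define NDIRECT   10      /* n: assumed value, not stated in the notes */
#define NINDIRECT 1024    /* pointers per indirect block: 4 KB / 4 B */

struct inode {
    blockno_t direct[NDIRECT];   /* file blocks 0 .. n-1 */
    blockno_t indirect;          /* disk block holding more pointers */
};

/* Pointer lookups needed to locate file block i:
 * 1 via a direct pointer, 2 via the indirect block, -1 if out of range. */
int inode_lookups(long i) {
    assert(i >= 0);
    if (i < NDIRECT)
        return 1;
    if (i < NDIRECT + NINDIRECT)
        return 2;
    return -1;
}

/* Map file block i to a disk block. `indirect_block` is an in-memory
 * copy of the block that ino->indirect points to. */
blockno_t inode_block(const struct inode *ino,
                      const blockno_t indirect_block[NINDIRECT], long i) {
    assert(ino != NULL && indirect_block != NULL && i >= 0);
    if (i < NDIRECT)
        return ino->direct[i];
    if (i < NDIRECT + NINDIRECT)
        return indirect_block[i - NDIRECT];
    return -1;
}
```

Unlike the FAT chain walk, the cost here does not depend on how far into the file the block lies: block 1033 takes the same two lookups as block 10.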

 
2006fall/notes/lec13.txt · Last modified: 2007/09/28 00:25 (external edit)
 