by Jesse Chen, Victoria Pan, and Matt Esquivel
modified by Jonathan Chang
December 10, 2006
Accessing data stored on hard disks is a common but costly operation for many applications. Techniques for improving performance in this area must adapt to advances in technology. Due to incommensurate scaling, processor speeds improve more quickly than disk speeds (they do not scale at the same rate). As a result, programmers must compensate for this imbalance with techniques that improve performance while maintaining robustness, neutrality, and simplicity.
To improve communication between the processor and the disk, we must first understand how a disk operates and find a way to measure the performance of different strategies. Pictured below is a figure of a typical hard drive.

A hard disk consists of many circular platters stacked on top of each other (mmm... pancakes). Each platter has many rings, known as tracks, where data can be stored. A mechanical arm terminated with a magnetic read head hovers over the track with the data desired and reads the data. To read data the drive must do the following:
Seek Time
The total seek time is the amount of time for the read head to move from the outermost track to the innermost track. In this example our total seek time will be 24 ms. Note, however, that the average seek time is NOT half of 24 ms, as one might suspect, because both the starting track of the head and the track of the data to be read are random. If the read head is on the outermost track, the average seek time to a random piece of data is 12 ms. However, if the read head happens to be on a middle track, the average seek time is 6 ms. Considering both a random head location and a random read track, the average seek time is one third of 24 ms, or 8 ms.
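The one-third figure can be checked numerically. The sketch below is an illustration rather than part of the original notes: it assumes a hypothetical disk with `ntracks` equally spaced tracks and a seek time proportional to the distance travelled, and it averages the seek over every (start, target) pair.

```c
#include <assert.h>
#include <stdlib.h>

/* Average seek time in ms, assuming (hypothetically) that seek time is
 * proportional to the number of tracks crossed and that the start and
 * target tracks are independent and uniformly distributed. */
double average_seek_ms(int ntracks, double full_stroke_ms)
{
    assert(ntracks >= 2);
    double total_distance = 0.0;
    for (int start = 0; start < ntracks; start++)
        for (int target = 0; target < ntracks; target++)
            total_distance += abs(start - target);

    double avg_distance = total_distance / ((double)ntracks * ntracks);
    /* A full stroke crosses (ntracks - 1) tracks. */
    return full_stroke_ms * avg_distance / (ntracks - 1);
}
```

With `average_seek_ms(1000, 24.0)` the result is close to 8 ms, and it approaches exactly one third of the full stroke as the number of tracks grows.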
Rotational Latency
The platters spin at a very fast rate: a 7200 RPM drive makes one revolution in 8.33 ms (7200 RPM = 120 rotations/second = 8.33 ms/rotation). The data to be read must be under the read head; if it is not, the drive must wait until it is. This wait is the rotational latency. If you're lucky, the data you want will already be under the read head, but if you're not, it can be up to one revolution away. The average rotational latency is one half of the time it takes for one revolution; for a 7200 RPM disk that is 4.17 ms.
Sustained Transfer Rate
The sustained transfer rate is the speed at which the disk can output data and for a disk with our specs it is 66 MB/s. On average, the time needed to read one randomly chosen 4KB block is:
read/write random blk = seek + rot lat + transfer =
= 8ms + 4.17ms + 4KB/(66MB/s) =
= 12.23ms (Approximately 12 million CPU cycles, assuming 1 GHz CPU)
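The same arithmetic can be written as a small function, which makes it easy to try other drive parameters. This is only a sketch; the function and parameter names are illustrative and do not come from the notes.

```c
#include <assert.h>

/* Average time in ms to read or write one randomly placed block.
 * Parameter names are illustrative; the notes give only the numbers. */
double random_block_ms(double avg_seek_ms, double rpm,
                       double block_bytes, double bytes_per_sec)
{
    assert(rpm > 0.0 && bytes_per_sec > 0.0);
    double rotation_ms = 60.0 * 1000.0 / rpm;       /* one full revolution */
    double avg_rot_latency_ms = rotation_ms / 2.0;  /* half a turn on average */
    double transfer_ms = block_bytes / bytes_per_sec * 1000.0;
    return avg_seek_ms + avg_rot_latency_ms + transfer_ms;
}
```

Calling `random_block_ms(8.0, 7200.0, 4096.0, 66.0 * 1024 * 1024)` gives roughly 12.23 ms, matching the hand calculation. The transfer term contributes only about 0.06 ms, so seek and rotation dominate.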
Consider the following user program, which repeatedly reads a 4 KB block, computes on it, and writes the result:

    while (1) {
        char buf[4096];
        read(fd, buf, 4096);
        compute(buf);
        write(fd2, buf, 4096);
    }
The average time for one iteration of this loop is:
time for 1 iter = read random blk + compute + write random blk
= 12.23ms + 1ms + 12.23ms
= 25.46 ms/iter
39.28 iterations per second
157 KB computed per second
How can we improve the performance of the previous example? Notice that the most expensive operations are seeks and rotational latency, so to improve performance we must avoid them. If data is placed randomly on the disk, we pay these overheads every time we read or write data. We must therefore modify the file system so that data is laid out intelligently.
Many common programs exhibit locality of reference: immediately after an access to item x, we are likely to access an item close to x. Spatial locality means that after accessing a certain memory location, we will most likely access memory locations near it. This is particularly true for reading instructions, since the next instruction is usually very close to the current one. Temporal locality means that if a certain memory location is accessed now, it is likely to be accessed again in the near future. This happens when we repeatedly access a variable and operate on it.
The file system should aim to keep blocks from a single file (or files in the same directory) in close proximity on disk. In this way, reading multiple blocks is faster because the only latency comes from data transfer, which is almost negligible. File fragmentation is when a file's data is not located contiguously on the disk. Proximity should be maintained to exploit locality of reference and improve performance.
Below are some approaches that can be implemented by the kernel to improve performance.
Recall the definition of speculation: requesting data in advance in the hope that it will be useful.
To achieve speculation, we can do the following:
Example - Buffer one track:
Read blocks [i to 63] if block i is requested. The calculations below are based on the example code above.
read 1 track = seek + rot lat (not avg, but complete) + transfer =
= 8ms + 8.33ms + 0ms = 16.33ms
1st [iteration]: read 1 track + compute + write
= 16.33ms + 1ms + 12.23ms = 29.56ms
2nd-64th: compute + write
= 0ms + 1ms + 12.23ms = 13.23ms
Average computation = 13.49ms/iter
74 iterations per second
296KB computed per second
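The averaging step above spreads a one-time cost over 64 iterations. A minimal sketch of that calculation follows; the function names are hypothetical and the numbers are the ones used in the example.

```c
#include <assert.h>

/* Average time per iteration when the first iteration pays a one-time
 * cost and the remaining (n - 1) iterations pay a cheaper steady cost. */
double amortized_ms(double first_ms, double rest_ms, int n)
{
    assert(n >= 1);
    return (first_ms + (n - 1) * rest_ms) / n;
}

/* Throughput in KB/s given the time per iteration and KB per iteration. */
double throughput_kb_per_s(double ms_per_iter, double kb_per_iter)
{
    assert(ms_per_iter > 0.0);
    return 1000.0 / ms_per_iter * kb_per_iter;
}
```

Here `amortized_ms(29.56, 13.23, 64)` is about 13.49 ms, and passing that result to `throughput_kb_per_s` with 4 KB per iteration gives roughly 296 KB/s. The write still costs a full random access on every iteration, which is why buffering the track alone does not help more.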
Notes:
Recall Definitions:
We can also improve performance by dallying: delaying a request in the hope that it can be batched with future requests, or that it will turn out not to be needed at all. For example, as mentioned in our course reader, a request to overwrite a disk block may be delayed in the hope that a second request will write to the same block. If the second request shows up, the first can be discarded and only the second is performed. Performance is enhanced by not wasting time on a request that is not needed. How long should we wait? There is no specific answer; it depends on system and application specifics.
Dallying can also increase chances for batching, which combines several operations into one to reduce setup overhead (6-3 in your reader).
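To make the two ideas concrete, here is a minimal sketch of a write-back buffer. It is an illustration under stated assumptions, not how any particular kernel does it: the structure names, the 64-entry limit, and the `disk_write` stub are all hypothetical. A write to a block that is already pending replaces the earlier data (dallying pays off), and `wb_flush` sends everything that remains in one pass (batching).

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define MAX_PENDING 64

/* Hypothetical write-back buffer: names and sizes are illustrative. */
struct pending_write {
    int  blockno;
    char data[BLOCK_SIZE];
};

struct write_buffer {
    struct pending_write pending[MAX_PENDING];
    int npending;
    int disk_writes;   /* counts blocks actually sent to the disk */
};

/* Stub standing in for a real device write. */
static void disk_write(struct write_buffer *wb, int blockno, const char *data)
{
    (void)blockno;
    (void)data;
    wb->disk_writes++;
}

/* Batching: write out everything that is still pending in one pass. */
void wb_flush(struct write_buffer *wb)
{
    for (int i = 0; i < wb->npending; i++)
        disk_write(wb, wb->pending[i].blockno, wb->pending[i].data);
    wb->npending = 0;
}

/* Dallying: hold the write in memory. A later write to the same block
 * replaces the earlier one, which then never reaches the disk. */
void wb_write(struct write_buffer *wb, int blockno, const char *data)
{
    assert(blockno >= 0);
    for (int i = 0; i < wb->npending; i++) {
        if (wb->pending[i].blockno == blockno) {
            memcpy(wb->pending[i].data, data, BLOCK_SIZE);
            return;
        }
    }
    if (wb->npending == MAX_PENDING)
        wb_flush(wb);              /* buffer full: stop dallying */
    wb->pending[wb->npending].blockno = blockno;
    memcpy(wb->pending[wb->npending].data, data, BLOCK_SIZE);
    wb->npending++;
}
```

Three calls to `wb_write` that touch only two distinct blocks result in two disk writes after `wb_flush`. A real implementation would also flush on a timer, since data held only in memory is lost in a crash.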
How can we achieve dallying/batching? Here's an example:
We use (as in the speculation technique) a buffer cache as a virtual file system. This reduces seek time for expensive operations such as read and write. Reads are served from the cache, and writes update the cache instead of the disk. Dallying and batching let us perform more operations at once, reducing seek time even further.
[Iterations]
1st: read + compute + write
= [seek + rot lat (not avg) + transfer] + compute + [seek + rot lat + transfer]
= 8ms + 8.33ms + 0ms + 1ms + 8ms + 8.33ms + 0ms = 33.66ms
2nd-64th: compute = 1ms
Note: For the 2nd to 64th iteration, assume read and write are in cache.
Average = 1.51ms/iter = 662.25iter/s = 2649KB/s
We see much better performance here than with speculation, because the expensive write overhead is paid only in the first iteration. Dallying and batching work together here: dallying delays the read and write operations, which gives the opportunity to batch them into the first iteration.
Batching can also create opportunities to reduce latency by reordering, which introduces us to disk scheduling.
Disk scheduling decides the optimal order to perform a sequence of requests such that the total latency is reduced, by reducing the movement of the disk arm. In other words, the main goal is to maximize the overall throughput and not necessarily the individual delay of each request. At the same time, the situation where a certain disk request is never executed, known as starvation, must be avoided.
Here's a simple model:
LISK (Linear Disk): |0|1|.............|N| (0 to N)
Here are five scheduling algorithms:
A common example is a line. The first person that gets in line will be served first, the second will be served second, etc.
FCFS is basically a queue.
5,6,10,5,6,5,6,5,6,...
Notice that although block 10 was requested 3rd it will never get served as long as requests for blocks 5 and 6 continue.
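* But is it optimal? Not always. SSTF is greedy, so it usually produces a short schedule, but not necessarily the shortest one: with h = 10 and requests {8, 11, 13}, SSTF serves 11, 13, 8 (8 units), while 8, 11, 13 takes only 7 units. It can also starve requests.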
For example:
h = 10
{b0, b1, b2, b3} = {11, 0, 10, 1}
FCFS: 11, 0, 10, 1
time = 31 units (1+11+10+9)
SSTF: 10, 11, 1, 0
time = 12 units (0+1+10+1)
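The two schedules above can be reproduced with a short sketch. It models the disk as the linear array of blocks described earlier and measures time as head movement only. The function names are hypothetical, and the tie-breaking rule in `sstf_schedule` is an arbitrary choice made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Total head movement when requests are served in arrival order (FCFS). */
int fcfs_cost(int head, const int *req, int n)
{
    int cost = 0;
    for (int i = 0; i < n; i++) {
        cost += abs(req[i] - head);
        head = req[i];
    }
    return cost;
}

/* Serve requests shortest-seek-first. Writes the service order into
 * order[0..n-1] and returns the total head movement. On a tie the
 * earlier request wins (an arbitrary choice for this sketch). */
int sstf_schedule(int head, const int *req, int n, int *order)
{
    assert(n >= 0);
    int *done = calloc(n > 0 ? n : 1, sizeof *done);
    assert(done != NULL);
    int cost = 0;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done[i])
                continue;
            if (best < 0 || abs(req[i] - head) < abs(req[best] - head))
                best = i;
        }
        done[best] = 1;
        cost += abs(req[best] - head);
        head = req[best];
        order[served] = head;
    }
    free(done);
    return cost;
}
```

With `h = 10` and the requests `{11, 0, 10, 1}`, `fcfs_cost` returns 31 and `sstf_schedule` returns 12 with the order 10, 11, 1, 0, matching the numbers above. This sketch schedules a fixed batch; it does not model requests that arrive while earlier ones are being served, which is where starvation comes from.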
Example of starvation:
Sequence: {10, 11, 0, 1, 12, 13, 14, 15, 16...}
SSTF: 10, 11, 12, 13, 14, 15, 16...
0 and 1 are starved because 10 through 16 are hogging the scheduler.
To minimize starvation
We use the idea: take requests in chunks, within chunks use SSTF, between chunks use FCFS
Let's use the idea of an elevator! We have a direction, either up or down, and we keep going in that direction until we reach the end (no more requests in that direction). After reaching the end, we switch directions and go all the way to the other end. Here is some pseudocode to demonstrate this idea:
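
    #define MAX_REQ 64
    enum direction { UP, DOWN };

    enum direction d;   /* head direction */
    int h;              /* current head position */
    int b[MAX_REQ];     /* pending requests b[0..n-1], all >= 0 */
    int n;              /* number of pending requests */

    /* Returns the next block to service, or -1 if there are no requests. */
    int getNextBlock(void) {
        if (n == 0)
            return -1;                      /* no requests */
        for (int pass = 0; pass < 2; pass++) {
            int best = -1;
            for (int i = 0; i < n; i++) {
                /* a request at h itself counts as "ahead" in either direction */
                int ahead = (d == UP) ? (b[i] >= h) : (b[i] <= h);
                int closer = best < 0 || ((d == UP) ? (b[i] < best) : (b[i] > best));
                if (ahead && closer)
                    best = b[i];
            }
            if (best >= 0)
                return best;                /* nearest request in direction d */
            d = (d == UP) ? DOWN : UP;      /* nothing ahead: reverse direction */
        }
        return -1;                          /* not reached when n > 0 */
    }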
Using the same set: {11, 0, 10, 1}
| Elevator Direction | Order | Time |
|---|---|---|
| d = UP | 10, 11, 1, 0 | 12 units |
| d = DOWN | 10, 1, 0, 11 | 21 units |
Notice that this algorithm does not suffer from starvation! However, if you look carefully, the middle sectors are actually serviced more often than the sectors on the end. In particular, for every time a sector on the end is serviced, a sector in the middle is serviced twice.
So how do we solve this minor issue and make service fair for every sector? We disconnect the cables and drop the elevator to the floor! In other words, we move only in one direction but in a circular manner (when we reach the end, wrap around back to the beginning).
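A minimal sketch of the circular variant follows, for a fixed batch of requests. It is an illustration only: the function name is hypothetical, and charging the wrap-around seek as ordinary head movement is an assumption, since the notes do not say how to account for it.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Circular elevator sketch: sweep upward from the head position, then
 * wrap around to the lowest pending block and keep sweeping upward.
 * Writes the service order into order[0..n-1] and returns the head
 * movement, counting the wrap-around seek as ordinary movement
 * (an assumption; the notes do not say how to charge for it). */
int circular_elevator(int head, const int *req, int n, int *order)
{
    assert(n >= 0);
    if (n == 0)
        return 0;
    int *sorted = malloc(n * sizeof *sorted);
    assert(sorted != NULL);
    for (int i = 0; i < n; i++)
        sorted[i] = req[i];
    qsort(sorted, n, sizeof *sorted, cmp_int);

    int first = 0;                      /* first block at or above head */
    while (first < n && sorted[first] < head)
        first++;

    int cost = 0;
    for (int k = 0; k < n; k++) {
        int blk = sorted[(first + k) % n];
        cost += abs(blk - head);
        head = blk;
        order[k] = blk;
    }
    free(sorted);
    return cost;
}
```

For the set {11, 0, 10, 1} with h = 10, this sketch services 10, 11, 0, 1 for 13 units of movement. That is slightly worse than the plain elevator going up (12 units), which is the price paid for treating every sector equally.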
Properties
Some processes may issue disk requests synchronously, which can cause poor performance in the other algorithms. Process A may issue successive requests only after the previous request has completed, so that it only has one pending request at any given moment. So what may happen is that at the moment when the request is completed, the scheduler assumes that Process A has no further requests (since Process A has not yet issued the next request) and moves on to perform Process B's requests. This is known as deceptive idleness.
The anticipatory scheduling algorithm allows the disk to handle Process A's requests consecutively, which improves performance by reducing the time spent moving the disk head when switching between requests from different processes.
Properties
An example:
Process A and Process B both start synchronously requesting disk accesses (with A's initial request being sent infinitesimally earlier than B's initial request), with a small delay (1 unit of time) between the completion of each process's request and its next issued request.
Order of requested block numbers for each process:
A: 1, 2, 3, 4, 5
B: 11, 12, 13, 14, 15
The disk scheduler sees and processes the requests in the following order, with h = 1 initially.
Using Circular Elevator Scheduling: 1, 11, 2, 12, 3, 13, 4, 14, 5, 15
time = 86 units
With Anticipatory Scheduling (with 1 time unit wait time): 1, 2, 3, 4, 5, 11, 12, 13, 14, 15
time = 14 units spent moving the disk head + 9 units waiting = 23 units
This is just a hypothetical example with an arbitrary wait time, chosen to demonstrate a situation where Anticipatory Scheduling can offer great benefits. In practice, the wait time would be based on the typical delay between requests on the system.
For more information about Anticipatory Scheduling, see: http://www.cs.rice.edu/~ssiyer/r/antsched/html/html.html
But wait? Last lecture, the professor emphasized that the careful ordering of disk writes will preserve the invariants for file system correctness. As a reminder, these invariants were:
Invariants
Recall that if the ordering of atomic writes is carefully chosen, none of the invariants will be violated (except for the 4th invariant, which is OK to violate). If none of these invariants are violated, we are assured that if the system crashes during an operation, the file system will continue to operate correctly.
So how can we get file system correctness while still gaining the performance benefits of smart disk scheduling (provided with dallying and batching)? It is important to note that some writes will affect the invariants while others will not. Only changes to non-data blocks affect the invariants. These are writes to things such as inodes and the free block bitmap, so careful ordering is crucial here. However, data block writes can be done in any order, so any disk scheduling method can be used there.
There is, however, a small problem. Suppose that there is a sequence of blocks to be written denoted by:
old blocks: ABCDE new blocks: A'B'C'D'E'
Using a First Come First Serve order of scheduling data writes, the possible outcomes of the resulting file (in the case of your computer crashing in the middle of a write) are:
A'BCDE A'B'CDE A'B'C'DE A'B'C'D'E
For a non-FCFS order, any intermediate state is possible (any 1 block write, any 2 block writes, etc.)
Imagine writing your CS111 take-home final (in your dreams) and suddenly the power goes out. When you reboot, you'd probably like to see all of your changes or none at all, not some in-between mixture. It would be a pain to track down which changes were saved and which were not. In some situations, partially saving a file could corrupt it, making all of the data unreadable. This is why we want writes to a file to be atomic: either the write happened completely or it did not affect the file at all.
A very simple way to get atomic writes without sacrificing too much performance is to write the file twice. According to the Golden Rule of Atomicity, you should never write over the only copy because in the case of a system crash, you will not be able to restore either the original file or the modified file.
One way to achieve File System Robustness is to reserve an area of the disk for a journal or log to keep track of changes to the file system. It works in the following fashion:
To make a file system change
After a crash
If a crash occurs before the COMMIT RECORD is written to the log, the system will ignore the log entry on reboot, and the original file will be intact. If a crash occurs after the COMMIT RECORD, the system will copy the modified blocks to the main disk, replacing the original blocks with the new blocks. If yet another crash happens during the copy, on reboot, the modified blocks will be copied to the main disk once again. After the copy is complete, the COMMIT RECORD is cleared, which will indicate that the main disk now contains the modified copy in its entirety.
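The commit and recovery behavior described above can be sketched with an in-memory stand-in for the disk. Everything here is hypothetical and simplified for illustration: the structure layout, the tiny block size, and the function names are assumptions, and a real journal would also have to make each log write itself reliable.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 16      /* tiny blocks keep the sketch readable */
#define NBLOCKS     8
#define LOG_SLOTS   4

/* Hypothetical in-memory "disk": main area plus a reserved journal. */
struct disk {
    char main_area[NBLOCKS][BLOCK_SIZE];
    int  log_blockno[LOG_SLOTS];            /* destination of each logged block */
    char log_data[LOG_SLOTS][BLOCK_SIZE];   /* new contents of each logged block */
    int  log_count;
    int  commit_record;                     /* nonzero once the entry is committed */
};

/* Step 1: record the new contents in the journal only. */
void journal_log_block(struct disk *d, int blockno, const char *data)
{
    assert(d->log_count < LOG_SLOTS && blockno >= 0 && blockno < NBLOCKS);
    assert(!d->commit_record);
    d->log_blockno[d->log_count] = blockno;
    memcpy(d->log_data[d->log_count], data, BLOCK_SIZE);
    d->log_count++;
}

/* Step 2: writing the commit record is the point of no return. */
void journal_commit(struct disk *d)
{
    d->commit_record = 1;
}

/* Step 3, also run on every reboot: copy committed blocks to the main
 * area, then clear the commit record. Copying is idempotent, so a crash
 * in the middle is handled by simply running this again. */
void journal_recover(struct disk *d)
{
    if (d->commit_record) {
        for (int i = 0; i < d->log_count; i++)
            memcpy(d->main_area[d->log_blockno[i]], d->log_data[i], BLOCK_SIZE);
        d->commit_record = 0;
    }
    d->log_count = 0;   /* uncommitted entries are simply discarded */
}
```

If `journal_recover` runs before `journal_commit`, the main area is untouched; if it runs after, the main area receives the new blocks, and running it a second time changes nothing. The main area is never the only copy of the data being modified, which is the point of the Golden Rule of Atomicity.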
The following are details on how to write a single block to the log:
Writing a block to log