
Lecture 14 Scribe Notes

by Angel Darquea, Kristie Van, and Chris Wible

File System Robustness (Continued)

Failure Models

Things were made to be broken, or at least to last a long time before they break. In Lecture 13 we were introduced to the Golden Rule of Atomicity and the concept of journaling. Journaling adds to the robustness of the file system and greatly improves recovery from a crash, but it still relies on the assumption that the disk itself never fails completely and catastrophically, which, as we know, is not true. ALL YOUR DISKS ARE BELONG TO ATROPHY, and your data could be lost forever.

With that said, file systems are designed around the most common failures (e.g., battery failures, operating system crashes, or disk hardware going bad) so that the system can at least compensate for them. File system failures can be described using two different failure models:

A Fail-Stop failure occurs when a device stops responding to any requests. The device ceases to function, but does not malfunction. As such, data is likely to be recoverable, as the disk won't randomly erase itself or otherwise obliterate any data. This model works well for situations like power failures.

Unfortunately, the fail-stop model is not very realistic. Instead, the more general Byzantine failure model is used. The Byzantine model allows for any behavior from a system that's failing. Following a Byzantine disk failure, data may or may not be recoverable; that is, any behavior that follows the disk failure is allowed. If you have some time next weekend, we highly recommend some hot cocoa and a printout of The Byzantine Generals Problem; you should enjoy it.

But of course, not everything is lost. We can still maintain robustness with simple yet powerful tools, explained in detail below.

Redundancy and RAID

Redundancy is one way to improve robustness and prevent data loss. By keeping multiple copies of data on independent disks, that data can survive a catastrophic Byzantine failure on one of the disks. Redundancy also helps to uphold the Golden Rule of Atomicity, which states that the only copy of a big object must never be modified. (Note: journals are an example of achieving robustness through redundancy.)

The degree to which redundancy aids robustness varies depending on the hardware involved. If the redundant elements fail at the same time, then the redundancy provides little improvement in robustness. Fortunately, a disk's expected lifetime is only a statistical estimate, so identical disks rarely fail at the same moment, and disk redundancy does improve the robustness of the system.

Failure of a drive can be expressed as a probability at any given time. The function p(t), where t is time and p(t) is the probability of failure at the time t, is called a Probability Density Function, or PDF. For most disks, the probability of failure follows a Bathtub Curve. The likelihood of failure is relatively high at first, when manufacturing defects are likely to surface. Then, for the normal lifetime of the disk, the chance of failure decreases and stays low, before sharply increasing at the end of the disk's expected lifespan.

The PDF can be integrated to obtain a Cumulative Distribution Function, or CDF. The CDF describes the probability that a device will have failed by a certain time. From the CDF, we can obtain the Median Time To Failure (MTTF), the time at which the cumulative probability of failure is 50%. (Elsewhere, MTTF usually abbreviates mean time to failure; these notes use the median.) The MTTF is an easy quantity to use when comparing the robustness of disks and arrays.

Using probability formulas from Statistics class, we can easily show why using multiple disks increases MTTF, and, hence, improves robustness of a file system.

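Probability of failure for a file system with a single disk, with time t scaled to the range [0, 1]: P(disk has failed by time t) = P(t) = t

Probability of failure for a file system with two mirrored disks: P(both disks have failed by time t) = P'(t) = P(t) * P(t) = t^2
Here, the failures of the two disks are independent events, so we multiply their probabilities. Since t^2 < t for 0 < t < 1, the two-disk system is less likely to have failed by any given time.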
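
As a rough check of the arithmetic, here is a minimal sketch that solves for the median time to failure under the toy model P(t) = t. It assumes t is scaled to [0, 1], that disks fail independently, and that data is lost only when every mirrored disk has failed. The function name is ours, not something from lecture.

```python
def median_time_to_failure(num_disks: int) -> float:
    """Solve t**num_disks == 0.5 for t under the toy model P(t) = t.

    Assumes t is scaled to [0, 1], disks fail independently, and data is
    lost only when all num_disks mirrored copies have failed.
    """
    if num_disks < 1:
        raise ValueError("need at least one disk")
    return 0.5 ** (1.0 / num_disks)


for n in (1, 2, 3):
    print(n, round(median_time_to_failure(n), 3))
```

Under this model the median moves from 0.5 for one disk to about 0.707 for two, which is the sense in which the second disk increases MTTF.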

The use of multiple disks for redundancy is most often referred to as RAID, short for Redundant Array of Inexpensive Disks. RAID arrays come in a few different flavors, which are denoted by numbers. Each RAID flavor has specific advantages and disadvantages. Generally, a RAID system is seen and handled as a single disk by the OS.

RAID 0

(Not covered in lecture)

Technically, RAID 0 is not RAID, because there is no redundancy. A RAID 0 configuration uses a storage scheme called striping. The data is divided into small chunks, which are then distributed across all the disks in the array. This technique provides great disk performance. Read and write speed both increase roughly in proportion to the number of disks in the array, so a two-disk RAID 0 configuration is about twice as fast as a single disk. RAID 0 also allows 100% utilization when all drives in the array are the same size. Unfortunately, these advantages come at a great cost. Robustness is much lower in a RAID 0 system than with a single disk setup. If one disk in the array fails, the entire array stops functioning, and all data is lost. Like performance, the probability of failure at any given time increases with the number of disks. For that reason, RAID 0 is generally not recommended for any system storing important data.
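
The striping idea and its cost can be sketched in a few lines. This is an illustration only: it assumes round-robin striping with one block per chunk and independent disk failures, and both function names are hypothetical.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block index on that disk).

    Assumes round-robin striping with one block per stripe unit.
    """
    if logical_block < 0 or num_disks < 1:
        raise ValueError("invalid block number or disk count")
    return logical_block % num_disks, logical_block // num_disks


def array_failure_probability(p_disk: float, num_disks: int) -> float:
    """Probability that at least one of num_disks independent disks fails.

    In RAID 0 a single disk failure loses the whole array.
    """
    return 1.0 - (1.0 - p_disk) ** num_disks


print(locate_block(5, 2))
print(array_failure_probability(0.1, 2))
```

Consecutive logical blocks land on different disks, which is where the speedup comes from, while the failure probability of the array grows with every disk added.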

RAID 1

RAID 1 uses simple mirroring across the disks in the array. At least two disks are required. Each disk contains an identical copy of the same data. A RAID 1 array is very robust because an array of N disks can continue operating after N-1 disk failures. Robustness increases as more disks are added to the array, but each additional disk adds less than the one before. Read performance also improves a great deal, because the array can retrieve data from more than one disk at a time. This is especially helpful when the OS is fetching non-contiguous data, because one disk can read while the other(s) seek. Write performance, on the other hand, is about the same as a single disk, because every disk must perform each write operation. Utilization is the pitfall of RAID 1. The effective capacity of the entire array is the size of the smallest disk in the array. Thus, the array stores at most 1/N of its total raw capacity.
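
A toy model of mirroring, assuming fail-stop disk failures only (a failed disk simply stops answering). The class and method names are hypothetical and exist only for this sketch.

```python
class MirroredArray:
    """Toy RAID 1: every write goes to all live disks; a read uses any live disk."""

    def __init__(self, num_disks: int) -> None:
        if num_disks < 2:
            raise ValueError("RAID 1 needs at least two disks")
        self.disks = [dict() for _ in range(num_disks)]  # block number -> bytes
        self.alive = [True] * num_disks

    def write(self, block: int, data: bytes) -> None:
        # Each live disk stores its own full copy of the block.
        for disk, ok in zip(self.disks, self.alive):
            if ok:
                disk[block] = data

    def fail(self, disk_index: int) -> None:
        # Fail-stop: the disk stops responding but does not corrupt data.
        self.alive[disk_index] = False

    def read(self, block: int) -> bytes:
        for disk, ok in zip(self.disks, self.alive):
            if ok:
                return disk[block]
        raise IOError("all mirrors have failed")


array = MirroredArray(3)
array.write(0, b"hello")
array.fail(0)
array.fail(1)
print(array.read(0))  # still served by the last surviving disk
```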

RAID 3 & 4

Unlike RAID 0 and 1, RAID 3 and 4 both require a minimum of three disks. Striping is used to store data on most of the disks. (RAID 3 and 4 employ byte- and block-level striping, respectively.) The remaining disk(s) are used to store Parity data. Parity is a form of redundancy. The contents of the parity disk(s) are derived from the contents of the data disks using some function (often XOR). If a data disk fails, its contents can be recovered from the contents of the remaining data disk(s) and parity disk(s). However, the array can only recover from as many failures as there are parity disks. Thus, RAID 3/4 is not as robust as a RAID 1 configuration. It allows for better utilization and performance, though. In most cases, being able to lose one disk is enough, because it can be replaced before a second disk fails. Adding more data disks to a RAID 3/4 system actually decreases robustness, because the probability of multiple disk failure increases, while the maximum number of safe disk failures remains the same. However, adding more parity disks to the array will increase robustness.
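
To make the parity idea concrete, here is a small sketch using XOR as the parity function. It assumes three data disks and one parity disk, each holding a single two-byte block; the byte values are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must have equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # three data disks
parity = xor_blocks(data)                       # contents of the parity disk

# Pretend disk 1 failed: rebuild it from the surviving data disks plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same function computes the parity and recovers a lost block. With only one parity block per stripe, a second simultaneous failure leaves too little information to rebuild anything.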

RAID 5

(Not covered in lecture)

RAID 5 employs block-level striping and parity like RAID 4. However, instead of using a dedicated parity disk, parity data is distributed across all the disks in the array. This improves performance, increases flexibility, and makes "hot-swapping" easier.
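
One way to picture distributed parity is to print which disk holds the parity block for each stripe. The rotation used here (parity moves one disk to the left on each successive stripe) is an assumption made for illustration; real implementations offer several layouts, and the function names are hypothetical.

```python
def parity_disk_for_stripe(stripe: int, num_disks: int) -> int:
    """Return which disk holds parity for a given stripe.

    Assumes a simple rotation where parity moves one disk to the left on
    each successive stripe.
    """
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (num_disks - 1 - stripe) % num_disks


def layout(num_stripes: int, num_disks: int) -> list[list[str]]:
    """Build a table of 'P' (parity) and 'D' (data) per stripe and disk."""
    return [
        ["P" if d == parity_disk_for_stripe(s, num_disks) else "D"
         for d in range(num_disks)]
        for s in range(num_stripes)
    ]


for row in layout(4, 4):
    print(" ".join(row))
```

Each row is one stripe and each column one disk. No single disk ends up holding all of the parity, so parity updates are spread across the array rather than queued on one disk.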

Virtual Memory

At this point, we return to the concept of virtual memory, which we discussed while writing Ursula Moneybags' operating system (in Lecture 2). As a refresher, we used virtual memory to keep us from accidentally modifying the kernel and also to forbid Ursula's son, Louis, from modifying our code. In this way, we dealt with an important part of process isolation.

However, we still have yet to deal with the issue of preventing unprivileged code (i.e., app code) from accessing or modifying memory that it does not "own."

Segmentation

Contiguous regions of memory are called segments, each of which has its own base address and size. Any access outside of an application's assigned segment is illegal and generates a trap to the kernel. In this design, the instructions that change the base address or the size of an application's segment are dangerous and must be privileged.
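
The bounds check that segmentation hardware performs can be sketched as follows. This assumes a single segment described by hypothetical base and size registers, with a Python exception standing in for the trap to the kernel.

```python
class SegmentationFault(Exception):
    """Raised where real hardware would trap to the kernel."""


def translate(offset: int, base: int, size: int) -> int:
    """Translate an offset within a segment to a physical address.

    Assumes one segment described by base and size; offsets outside
    [0, size) are illegal.
    """
    if not 0 <= offset < size:
        raise SegmentationFault(
            f"offset {offset:#x} outside segment of size {size:#x}"
        )
    return base + offset


print(hex(translate(0x10, base=0x4000, size=0x1000)))
```

If application code could change base or size itself, it could make any address pass this check, which is why those instructions need privilege.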

One of the weaknesses of this type of memory management is fragmentation, of which there are three types: internal, external, and data. Because segments vary in size and each must occupy a contiguous region, this strategy suffers from external fragmentation, which occurs when free memory is broken up into many small pieces, none of which is large enough on its own to hold a new segment. Although this could technically be resolved by moving the existing segments together in a procedure called compaction, doing so is extremely expensive and should be used sparingly (if at all). An alternative is to use paged virtual memory.
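
A tiny illustration of external fragmentation, using made-up hole sizes: the total free memory is enough for the request, but no single contiguous hole is. The function name is hypothetical.

```python
def can_allocate(free_holes: list[int], request: int) -> bool:
    """Return True if some single free hole can hold the request.

    free_holes lists the sizes of the free gaps between segments.
    """
    return any(hole >= request for hole in free_holes)


holes = [100, 50, 120]                  # 270 units free in total
print(sum(holes) >= 200)                # enough free memory overall
print(can_allocate(holes, 200))         # but is any one hole big enough?
print(can_allocate([sum(holes)], 200))  # the same check after compaction
```

Compaction fixes the problem by merging the holes into one, at the price of copying every segment that sits between them.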

Paged Virtual Memory

Paged virtual memory solves external fragmentation through fixed-size allocations, which are referred to as pages. In general, the size of a page is defined by the system architecture. (Note: in the x86 architecture, a single page is 4KB, and the page directory base register is %cr3.) By breaking memory into pages, it is possible to do page mapping, which lets a process use contiguous virtual addresses even though the underlying physical pages are scattered throughout memory.
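
Page mapping can be sketched with a flat lookup table. This is a simplification: x86 actually walks a multi-level structure rooted at %cr3, and the dictionary, the function name, and the page numbers below are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # 4 KB pages, as on x86


def translate_paged(virtual_address: int, page_table: dict[int, int]) -> int:
    """Translate a virtual address using a flat page table.

    page_table maps virtual page numbers to physical page numbers.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise LookupError(f"page fault at {virtual_address:#x}")
    return page_table[page_number] * PAGE_SIZE + offset


# Virtual pages 0 and 1 are adjacent, but their physical pages are not.
page_table = {0: 7, 1: 2}
print(hex(translate_paged(0x0FFF, page_table)))  # last byte of virtual page 0
print(hex(translate_paged(0x1000, page_table)))  # first byte of virtual page 1
```

The two virtual addresses are one byte apart, yet they resolve to physical pages 7 and 2. Since every page is the same size, any free physical page can back any virtual page, so external fragmentation does not arise.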

* Virtual memory - address space types