You are expected to understand this. CS 111 Operating Systems Principles, Fall 2006

Lecture 16 notes

by Miho Akiyoshi, Tom Lyttelton, Kamron Farrokh, Jumpei Yoshida
Modified by Benson Luk

Topics covered

  • Robustness in the file system
  • Fault tolerance
  • Distributed systems

Robustness in the File System

Faults and Failures

Faults and failures are two useful terms used when we discuss problems in computer systems:

Fault: A defect in materials, design, or implementation that MAY cause an error and lead to a failure.
Failure: When a system doesn't produce the intended result.

Example in terms of driving:

  • Intended result: Stay on the road.
  • Fault: A flat tire.
  • Failure: Losing control and driving off the road.

A fault (flat tire) in the system prevented us from the intended result (staying on the road), leading to failure (driving off the road).

Now, let's think about writes to the disk in terms of faults and failures.

Goal of file system: Durability

  • Intended result: All data is in a valid state.
  • Fault: Crashes or power outages.
  • Failure: A file ends up in an erroneous state.

If the system crashes in between writes to different blocks of a file, the file ends up in a half-written state, contrary to our intended result.

File System Fault Tolerance

Atomicity

Crashes in the middle of writing to a file can leave the disk in an inconsistent state. When writing to a file, we want to make sure that even in the case of a fault, either:

  1. All the changes are still able to complete, or
  2. We revert to the original, unchanged state of the file.

This is the definition of atomicity.

Golden Rule of Atomicity

Problems we run into when trying to implement atomicity:

  • As the disk head can only be in one place at a time, it is difficult and in some cases physically impossible to make large changes in one atomic step. We must use multiple steps to change a file.
  • If there is a fault between the atomic steps of a change, then the file might be in a half-finished or erroneous state (failure).
  • The fault may prevent us from completing the rest of the changes correctly. We also cannot undo the changes to revert back to the original file, because we do not keep a record of what changes were made. The file is now in an erroneous state, and we have no way of changing it to a valid state.

This leads us to the Golden Rule of Atomicity: Never modify the only copy (that is in a valid state.)

We should modify a copy of a file until all changes to it are completed (that is, the copy is in a changed, but valid, state.) This way, if a fault occurs while we are modifying the copy, we will always have the valid, unmodified original file to fall back to.

Once the changes to the copy are complete, the copy will be in a valid state. We can then overwrite the original file with the contents of the copy. If a fault occurs during this process, we still have the valid, changed copy of the file and can retry overwriting the original.
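The Golden Rule can be sketched at the level of an ordinary file. This is a minimal sketch, not the method from lecture: the function name atomic_update is hypothetical, and instead of overwriting the original with the copy's contents it swaps the finished copy into place with a rename (os.replace), which is a different technique that serves the same rule. The original is never touched until the copy is complete.

```python
import os
import tempfile

def atomic_update(path, new_contents):
    """Write new_contents to a copy, then swap the copy in for the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the copy in the same directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the copy to disk first
        # Up to this point the original file is still valid and unmodified.
        os.replace(tmp_path, path)
    except BaseException:
        # A failure before the swap leaves the original intact; discard the copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the write to the copy fails partway through, the original file still holds its old, valid contents.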

Obtaining Atomicity by Modifying a Copy

Original (bad) idea:

diskatomicityidea1.png

  1. Write A', B', C' to copy disk.
  2. Copy data from copy disk onto true file system.
  3. On reboot, copy the data from the copy disk into true file system.

One reason that the above strategy doesn't work:

  • If a crash happens during step 1, we end up with an inconsistent picture on the true disk. Example:
    1. We attempt to write A', B', C' to the copy disk.
    2. The system crashes during this step. A' is written, but not B' or C'.
    3. The system is rebooted. We copy A', B, C from the copy disk to the true disk.
    4. The true disk now has A', B, C. It is in a half-finished state with no way of finishing or reverting to the original.

This method of trying to obtain atomicity is no different from using no method at all. Any erroneous state in the copy disk caused by a fault is directly transferred to the true disk after rebooting.

Revised (better) idea: Commit Record

The problem with the original idea is that we always copy from the copy disk, whether the copy disk is valid or not. We should not copy from the copy disk until we know that the data on the copy disk is valid. We introduce a way of signifying that the changes on the copy disk are valid: a commit record.

Commit Record: We need a record that tells us that we have a valid, changed copy on the copy disk. Once we have written this record, we should copy this data over to the true disk. Note that in the atomic step of writing this record, we are committing to going from the valid original data to the changed valid data.

diskatomicityidea2.png

  • New steps, now including a commit record:
    1. Write A', B', C' to copy disk.
    2. Wait for writes to succeed.
    3. Write "commit record" on true disk, saying that blocks 10, 11, 12 on copy disk are correct.
    4. Copy data to true file system.
    5. On reboot, if a "commit record" is present, redo the copy in step 4.
    6. When data is copied, clear "commit record."

Now, if we write to the commit record, the changes happen to the file system, and if we have not written to the commit record, the changes will not happen. The changes either have no effect, or the changes completely occur. This satisfies the definition of atomicity.

To demonstrate the benefit of using this method, let us analyze all the possible areas where our write could go wrong:

  • Write A', B', and computer crashes: The true disk remains unchanged. Nothing happens after reboot. (atomicity satisfied)
  • Write A', B', C', but didn't write the commit record: The true disk remains unchanged. Nothing happens after reboot. (atomicity satisfied)
  • Write the commit record and crash before copying data onto the disk: When we reboot, we look at the commit record and copy the relevant data from the copy disk. Then we delete the commit record. After the reboot, all changes will happen to the true disk. (atomicity satisfied)
  • Copy some (or all) of the copy disk data, but computer crashes again before we delete the commit record: When we reboot, we (re)write A', B', and C' since we still have the commit record. Then delete the commit record. All changes will happen to the true disk. (atomicity satisfied)
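The case analysis above can be checked mechanically with a toy simulation. This is a sketch under stated assumptions: the Machine class, the Crash exception, and the crash_at counter are all hypothetical names, disks are modeled as dictionaries, and each single write (including the write of the commit record) is assumed to be atomic. The simulation crashes once before every possible write and checks what the true disk holds after recovery.

```python
class Crash(Exception):
    """Stands in for a power failure."""

class Machine:
    def __init__(self, crash_at=None):
        self.true_disk = {10: "A", 11: "B", 12: "C"}
        self.copy_disk = {}
        self.commit_record = None      # list of block numbers once committed
        self.crash_at = crash_at       # crash just before this write (None = never)
        self.writes = 0

    def _before_write(self):
        if self.crash_at is not None and self.writes == self.crash_at:
            self.crash_at = None       # crash only once
            raise Crash()
        self.writes += 1

    def update(self, changes):
        for block, data in changes.items():    # write new data to the copy disk
            self._before_write()
            self.copy_disk[block] = data
        self._before_write()
        self.commit_record = sorted(changes)    # the commit point
        self.install()

    def install(self):
        for block in self.commit_record:        # copy data to the true disk
            self._before_write()
            self.true_disk[block] = self.copy_disk[block]
        self._before_write()
        self.commit_record = None               # clear the commit record

    def recover(self):
        """Run on reboot: redo the copy only if a commit record is present."""
        if self.commit_record is not None:
            self.install()

OLD = {10: "A", 11: "B", 12: "C"}
NEW = {10: "A'", 11: "B'", 12: "C'"}

def run_with_crash(crash_at):
    m = Machine(crash_at)
    try:
        m.update(dict(NEW))
    except Crash:
        m.recover()
    return m.true_disk

# There are 8 writes in total, so crash points 0..7 cover every case; 8 means no crash.
outcomes = [run_with_crash(n) for n in range(9)]
print(all(o in (OLD, NEW) for o in outcomes))
```

In this model, a crash before the commit record is written leaves the old data, and a crash at any later point ends with the new data after recovery. No crash point produces a mixture.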

Final (best) idea: A Journal

With the previous idea, our copy disk is a copy of our entire true disk. Yet we only use a very small portion of the copy disk at a time. (We modify one file's worth of blocks on the copy disk, then copy those over to the true disk. We have no need for the rest of the copy disk's data during these steps.) We can accomplish the same results as the previous idea, but with much better disk utilization, by using write-ahead logging.

diskatomicityidea3.png

Write-ahead logging: With write-ahead logging, we don't have a copy disk that takes the size of our entire true disk. Instead, we create a much smaller journal somewhere on our true disk. The journal consists of a copy of the data we want to write, followed by the commit record, which also tells us which blocks to write this data to. This way, we only keep copies of the blocks that are being changed, rather than a copy of the entire disk.
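A journal of this shape can be sketched in a few lines. This is an illustrative model only: the journal is a Python list, the disk is a dictionary, and the function names log_update and replay are hypothetical. Each update appends its data entries first and its commit record last; replay installs only updates whose commit record made it into the journal.

```python
def log_update(journal, changes):
    """Append the new data, then a commit record naming the target blocks."""
    blocks = sorted(changes)
    for block in blocks:
        journal.append(("data", changes[block]))
    journal.append(("commit", blocks))

def replay(journal, disk):
    """On reboot: install committed updates in order, drop an uncommitted tail."""
    pending = []
    for kind, payload in journal:
        if kind == "data":
            pending.append(payload)
        else:
            # payload lists the target blocks in the same order as the pending data
            for block, data in zip(payload, pending):
                disk[block] = data
            pending = []
    # Anything still pending never reached its commit record and is ignored.
    journal.clear()
    return disk

disk = {10: "A", 11: "B", 12: "C"}
journal = []
log_update(journal, {10: "A'", 11: "B'"})
journal.append(("data", "C'"))   # a second update that crashed before its commit
replay(journal, disk)
print(disk)
```

Here the first update is installed in full and the second, uncommitted one has no effect, so block 12 keeps its old contents.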

Media Fault Tolerance

Write-ahead logging won't help tolerate media faults. If the entire disk fails, the file system also fails because there is only one copy of the disk. What if we want to remain robust in the case of these faults?

Robustness principle: Redundancy => keep more than one copy.

RAID 1

raid1.png

The principle of RAID (Redundant Array of Inexpensive/Independent Disks) 1 is treating multiple disks like a single disk interface. The RAID 1 structure keeps two identical disks. However, since it treats multiple disks like a single disk interface, the OS sees only one disk (implemented by two real disks with equal geometry). If one disk fails, RAID 1 will still write/read to the other disk.

As a result, the write command occurs twice, as follows:

  Write (block, data)
      disk0.write(block, data)
      disk1.write(block, data)

Note that a read normally goes to only one disk. The second disk is consulted only if the first read fails, so the read succeeds if either disk contains the correct data:

  Read (block)
      if (disk0.read(block) succeeds)
          return its data
      if (disk1.read(block) succeeds)
          return its data
      fail
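The same read and write logic can be written as a small runnable sketch. The Disk, DiskError, and Raid1 names are hypothetical, disks are modeled as in-memory dictionaries, and a failed disk is assumed to report its failure by raising an error rather than returning bad data.

```python
class DiskError(Exception):
    pass

class Disk:
    def __init__(self):
        self.blocks = {}
        self.failed = False

    def write(self, block, data):
        if self.failed:
            raise DiskError("disk failed")
        self.blocks[block] = data

    def read(self, block):
        if self.failed:
            raise DiskError("disk failed")
        return self.blocks[block]

class Raid1:
    """Presents two mirrored disks as a single disk."""

    def __init__(self, disk0, disk1):
        self.disks = [disk0, disk1]

    def write(self, block, data):
        successes = 0
        for disk in self.disks:            # every write goes to both disks
            try:
                disk.write(block, data)
                successes += 1
            except DiskError:
                pass
        if successes == 0:
            raise DiskError("both disks failed")

    def read(self, block):
        for disk in self.disks:            # fall back to the mirror on failure
            try:
                return disk.read(block)
            except DiskError:
                continue
        raise DiskError("both disks failed")

d0, d1 = Disk(), Disk()
array = Raid1(d0, d1)
array.write(7, "hello")
d0.failed = True
print(array.read(7))
```

The sketch ignores the question of how a replaced or revived disk is brought back in sync with its mirror.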

It is important to remember that adding more components that can themselves fail does not improve reliability across the board.

RAID 4

raid4-fixed-.png

RAID-4 adds parity to the RAID structure. Specifically, in this example it uses a three-disk structure in which two of the three disks hold data, and the third disk holds the data of the first disk XOR the data of the second.

This greatly increases the robustness of the RAID structure by giving us a means for recovery: if any one disk fails, the data on that disk is the data of the other two disks XORed together.
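The recovery property follows from the fact that XOR is its own inverse. A minimal sketch, with a hypothetical helper xor_blocks and made-up block contents:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

disk0 = b"file sys"
disk1 = b"tem data"
parity = xor_blocks(disk0, disk1)        # contents of the third disk

# Lose any one disk and rebuild it from the other two.
rebuilt_disk0 = xor_blocks(disk1, parity)
rebuilt_disk1 = xor_blocks(disk0, parity)
rebuilt_parity = xor_blocks(disk0, disk1)
print(rebuilt_disk0 == disk0, rebuilt_disk1 == disk1, rebuilt_parity == parity)
```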

What if any two disks fail? RAID-4 cannot tolerate two simultaneous disk failures with this method, so nothing can be recovered and the entire disk system fails. We are trading better reliability in the short term for worse reliability in the long term.

It is also important to understand that the MTTF (mean time to failure) of a RAID-4 is not necessarily higher than the MTTF of any one of its three disks. Consider the following graph, where we use disks whose cumulative probability of failure grows linearly and reaches 100% at 1 year. Even though we use three disks instead of one, the MTTF has not changed. Though we have gained short-term reliability with the three-disk structure, it comes at the cost of long-term reliability:

pd.png

Why does this happen? Let us denote a working disk as W and a failed disk as F, and let P(t) be the probability that a single disk has failed by time t (assuming disks fail independently). There are four possible combinations of three disks in which at least two disks have failed (since only two disks have to fail for RAID-4 to fail):

  • FFW, FWF, and WFF, each with probability P(t)^2*[1-P(t)], or P(t)^2 - P(t)^3
  • FFF, with probability P(t)^3

If we add these four probabilities up, we get 3P(t)^2 - 2P(t)^3 as the probability that at least two of the three disks will have failed by time t. Plotting this alongside P(t) yields the graph shown above.
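The claim about the MTTF can be checked numerically for the simplified disk. This sketch assumes the linear model from the graph (failure probability equal to t for t between 0 and 1 year) and uses the standard fact that the mean lifetime equals the integral of the survival probability, 1 minus the cumulative failure probability. The function names are hypothetical.

```python
def p_single(t):
    """Cumulative failure probability of one disk: linear, 100% at 1 year."""
    return min(max(t, 0.0), 1.0)

def p_raid4(t):
    """Probability that at least two of three independent disks have failed."""
    p = p_single(t)
    return 3 * p**2 - 2 * p**3

def mttf(cdf, steps=100_000):
    """Midpoint-rule integral of the survival probability over [0, 1] year."""
    dt = 1.0 / steps
    return sum((1.0 - cdf((i + 0.5) * dt)) * dt for i in range(steps))

print(p_raid4(0.25), p_single(0.25))   # early on, the array is less likely to have failed
print(p_raid4(0.75), p_single(0.75))   # later on, it is more likely to have failed
print(mttf(p_single), mttf(p_raid4))   # both come out at about half a year
```

The two curves cross at half a year, which is why the short-term gain and the long-term loss cancel out in the mean.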

The Bathtub Curve

In our graph above, we used a simplified disk with a linear cumulative distribution graph. This means that the disk is equally likely to fail at any particular time during its lifetime. (The probability distribution graph is a flat, horizontal line.) The probability and cumulative distribution graphs for our simplified disk are shown below:

simple.gif

In the real world, disks and other components do not have an equal probability of failing at any particular time. Components have a higher probability of failing in the beginning and end of their lifetime.

The higher probability of failure in the beginning is due to manufacturing defects. Many new components come with slight defects that are not caught in the testing process at the factory. These components may fail soon after they are installed in a system.

The higher probability of failing at the end of the lifetime of a component is due to normal wear and tear. As components get older, the parts of the component wear down and do not function as well as when new. Wear and tear will cause any product to eventually fail as it gets older.

The probability and cumulative distribution graphs for a sample "real" component are shown below. The shape of the probability distribution starts high, goes down, and then goes back up as time passes. The shape of this curve gives it the name "Bathtub Curve".

real.gif

Networking & OS

Introduction

Given that a network is, like a bus and a pipe, a link abstraction, why would a network change the OS?

Unlike a bus or pipe, networks suffer from:

  • Loss (e.g. network downtime)
  • Attack (e.g. DoS attack)
  • Unsolicited requests
  • Increased latency
  • Network congestion

Network Effect

Given all the problems and complexities associated with a network, why bother having networks at all? The answer is the Network Effect, which says:

Value of Network ∝ (Size of Network)²

This means that a large network can give large numbers of people access to large amounts of data. Very large networks are, therefore, extremely useful.

Goal of a Network

The goal of a network is to allow each computer on a network to utilize resources of every other computer on the network. This way, we can create large, powerful networks such as the Internet, or Google's servers:

Google Clusters (Hundreds)

  • Each with > 1000 Machines
  • 300TB storage

Receive Livelock: A Priority Problem

The biggest obstacle to our goal is network growth. Network growth occurs exponentially, and as it grows, the need for hosts to support the network load also grows exponentially.

requestgraph.png

Unrealistic ideal: The first graph represents perfect network handling, in which our computer is able to handle an infinite number of requests. Obviously, this ideal is grossly unrealistic.

Practical ideal: The second graph represents a more realistic ideal, in which our computer is able to handle a finite number of requests. Once this capacity (called the maximum loss-free request rate) is reached, the rate of request handling remains constant even if the input request rate increases. The implication behind this graph is that request receipt and request handling are disjoint - that is, that the increase of one does not affect the other (after a certain maximum point.)

Receive Livelock: In the third graph, the requests out equals the requests in up to a certain point. But after this point, as the input requests increase, the output actually decreases. This is a serious problem which was actually present in older versions of Linux. The cause of this problem is handling requests with interrupts.

Interrupts

  • Handling interrupts takes priority over all other processes
  • If there is an interrupt per packet, then the priority of Input Requests > priority of Output Requests
  • Send enough input requests, and we will stop handling output requests altogether

This is a priority problem known as receive livelock, in which we get no work done because of the overhead of receiving too many requests. It amounts to a denial of service (DoS), and it is also the key exploit of the now-infamous DDoS (distributed denial of service) attack, in which many clients bombard a server with requests in an attempt to bring it to a halt. This problem, however, is solvable by handling request receipt with polling rather than interrupts.
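A toy model makes the shape of the third graph concrete. Everything in this sketch is an assumption made for illustration: the per-packet costs are invented numbers, the CPU is modeled as a fixed budget of microseconds per second, and polling is modeled as picking up a packet only when there is time to finish handling it (excess packets are dropped without costing CPU time).

```python
CPU_US_PER_SECOND = 1_000_000   # CPU time available each second, in microseconds
RECEIVE_COST_US = 100           # hypothetical cost to receive one packet
HANDLE_COST_US = 400            # hypothetical cost to finish handling one request

def handled_with_interrupts(arrival_rate):
    """Every arriving packet interrupts first; handling only gets the leftover time."""
    receive_time = min(CPU_US_PER_SECOND, arrival_rate * RECEIVE_COST_US)
    leftover = CPU_US_PER_SECOND - receive_time
    return min(arrival_rate, leftover // HANDLE_COST_US)

def handled_with_polling(arrival_rate):
    """The server receives a packet only when it has time to handle it as well."""
    capacity = CPU_US_PER_SECOND // (RECEIVE_COST_US + HANDLE_COST_US)
    return min(arrival_rate, capacity)

for rate in (1000, 2000, 5000, 10000):
    print(rate, handled_with_interrupts(rate), handled_with_polling(rate))
```

With these numbers the interrupt-driven server keeps up until 2000 requests per second, then falls off and reaches zero at 10000, while the polling server stays flat at 2000 once saturated.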

Client/Service Architecture for Distributed Systems

csa.png

The above diagram explains how a client and service should ideally interact with one another.

  • Client and service only communicate through messages
  • Client request is triggered by either user or program
  • Server response is triggered only by client request - it therefore needs to constantly wait for new requests that may come in at any time

Event Driven Programming

  • Divide the processing path of a program into small chunks called "callbacks". These are triggered by external events.
  • No "callbacks" should block. If a callback blocks, the event loop blocks, causing our server to block and thereby preventing any work from being done.
  • In a network server, example events include receiving a message and being ready to send a message. We represent connections as file descriptors.

Our event loop thus looks like this. Note that, in every iteration of the event loop, we block until at least one of our connections is ready for reading or writing, and then proceed with that iteration of the loop. This is done to avoid busy waiting, in which the loop would spin and burn CPU time while no connection has anything to do.

  while(1) {
      wait until at least one connection is readable or writable;
      for(every connection c) {
          if(c->fd.isReadable && something to read) read;
          if(c->fd.isWritable && something to write) write;
      }
  }
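For comparison, here is a minimal runnable sketch of such a loop using Python's standard selectors module. The echo behavior, the function names serve and interest, and the bounded iteration count are all choices made for this illustration (a real server would loop forever). Connections are assumed to be non-blocking sockets registered with a bytearray of pending output as their data.

```python
import selectors
import socket

def interest(outbox):
    """Always watch for input; watch for writability only when output is pending."""
    return selectors.EVENT_READ | (selectors.EVENT_WRITE if outbox else 0)

def serve(sel, iterations):
    for _ in range(iterations):
        # Block until at least one connection is ready, so the loop never spins.
        for key, events in sel.select():
            conn, outbox = key.fileobj, key.data
            if events & selectors.EVENT_READ:
                data = conn.recv(4096)          # will not block: conn is readable
                if not data:                    # peer closed the connection
                    sel.unregister(conn)
                    conn.close()
                    continue
                outbox.extend(data)             # echo: queue the bytes for sending
            if events & selectors.EVENT_WRITE and outbox:
                sent = conn.send(outbox)        # send what the socket will accept
                del outbox[:sent]
            sel.modify(conn, interest(outbox), outbox)

# Usage: one connection, simulated with a local socket pair.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)
sel = selectors.DefaultSelector()
sel.register(server_side, selectors.EVENT_READ, bytearray())
client_side.sendall(b"ping")
serve(sel, 2)                    # one iteration to read, one to write
reply = client_side.recv(4096)
print(reply)
```

Registering for write events only while output is pending matters: a socket is writable almost all the time, so watching it unconditionally would make select return immediately and turn the loop into a busy wait.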
 
2006fall/notes/lec16.txt · Last modified: 2007/09/28 00:25 (external edit)
 