You are expected to understand this. CS 111 Operating Systems Principles, Spring 2006

Lecture 16 Notes

by Kou, Min Sup Jin, Tung Dang

File System Robustness

Remember last time we discussed two ways to make file systems robust.

  1. FSCK: Enforce file system invariants at boot time.
    • Pro: Guaranteed that the file system invariants remain true.
    • Con: Long delay at boot time, since the whole disk must be checked.
  2. Write Ordering: Always maintains 3 out of the 4 invariants (only space leaks are allowed).
    • Con: Possibility of space leaks (blocks marked in use that nothing references), but FSCK can alleviate this somewhat
    • Con: Somewhat complicated to set up and order.

Problems can still exist even with these two robustness mechanisms.

Why? Because these two systems try to promote File System Integrity, but these two mechanisms only operate from a File System Perspective. They don't concern themselves with the disk, but merely how the file system is organized. What kinds of problems do these two mechanisms fail to address?

  • Hardware Failure: Hardware faults, Disk corruption. Something physically wrong with the disk.
    • Example: Coke™ spilled on the disk.
    • Possible Solution: New disk formats: RAID.
  • Integrity: Doesn’t promise local data integrity as the system writes block by block (need to make sure write order is correct).
    • Example: (Half-way write to a file.)
    • Possible Solution: New write format: Journaling.

3. Journaling

  • Problem: We need to make changes to multiple blocks, and the changes should happen atomically: either completely or not at all, no matter when the power is pulled.

What we need are file system ATOMIC TRANSACTIONS.

An ATOMIC TRANSACTION is a series of operations on persistent state, possibly 
but not necessarily including changes to persistent state, that happen either 
ALL AT ONCE (and permanently), or NOT AT ALL.  This all-or-nothing property is 
what makes a transaction useful for maintaining disk consistency.
Transactions:

A transaction is a set of changes to persistent state that obeys the ACID properties. Atomic transactions satisfy all four ACID properties, which are:

Atomicity
Consistency
Isolation        (NB not idempotence, see below)
Durability
  • ATOMICITY: The transaction happens all at once or not at all. It is never the case that *some* of the operations happened, but the rest didn't.

(Parallels the Synchronization problem.)

  • Example: Consider a bank account with a starting balance of 10.
    If two people access the account simultaneously, one calling deposit(5) and the other withdraw(5), the ending balance should be 10.
  • What if we had an ending balance of 15?
    Then this property has been violated: no serial ordering of the two transactions could lead to an ending balance of 15.
  • CONSISTENCY: The persistent state is consistent both before and after the transaction. For example, the file system satisfies the 4 invariants. In our case we are actually using transactions to achieve consistency.
  • ISOLATION: The transactions can be serialized: the result of a series of transactions is as if the transactions were executed, each in isolation, in some single order, with no interleaving.
  • DURABILITY: Once the transaction happens, its changes are permanent: they are written to persistent storage and never forgotten.
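
The bank account example can be checked mechanically. The sketch below is only an illustration: the helper names (serial_outcomes, lost_update) are made up here, and the "bad" interleaving is one assumed way the two operations could overlap. It enumerates every serial order of deposit(5) and withdraw(5) and compares the result with an interleaving in which both parties read the balance before either writes.

```python
from itertools import permutations

def deposit(balance, amount):
    return balance + amount

def withdraw(balance, amount):
    return balance - amount

def serial_outcomes(start, operations):
    """Return the set of final balances over every serial order."""
    outcomes = set()
    for order in permutations(operations):
        balance = start
        for operation, amount in order:
            balance = operation(balance, amount)
        outcomes.add(balance)
    return outcomes

def lost_update(start):
    """One bad interleaving: both parties read before either one writes."""
    seen_by_depositor = start                     # depositor reads 10
    seen_by_withdrawer = start                    # withdrawer reads 10
    balance = withdraw(seen_by_withdrawer, 5)     # withdrawer writes 5
    balance = deposit(seen_by_depositor, 5)       # depositor overwrites it with 15
    return balance

operations = [(deposit, 5), (withdraw, 5)]
legal = serial_outcomes(10, operations)
bad = lost_update(10)
print("serial outcomes:", legal, "interleaved outcome:", bad)
```

Every serial order ends at 10, so an ending balance of 15 cannot come from running the two transactions one after the other.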
Each atomic transaction ends in a single COMMIT POINT.  This ends the group of 
operations.  If the system crashes before the commit point, then none of the 
changes in the transaction can be stored permanently, due to the Atomicity 
property (all-or-nothing).  If the system crashes AFTER the commit point -- 
after the commit returns -- then ALL of the changes in the transaction must be 
stored permanently, because of the Durability property; for instance, the 
reboot process might complete the transaction.

Question: Would ACID properties be helpful when we’re trying to maintain a robust and correct file system?

Answer: What if all the changes to the file system happened inside transactions? If the plug is pulled, the transaction mechanism takes care of the consistency issue, as transactions ensure that each change either happens completely or doesn’t happen at all.

File system changes in transactions -> File system is always consistent!

Problem solved! \ (^_^) /

Transaction Implementation

Question: How do we implement transactions?

Guess: Hardware? Unfortunately, Hardware implementation of transactions would be much more difficult and outside the scope of the lecture.

Answer/Idea: Do everything twice: keep a log (journal) of all file system changes. (The journal is a record of all file system changes, kept as a circular log so that it does not grow infinitely.) Journal mechanism characteristics:

  1. Durably write what you’re going to do. You write what you are going to do into this special area of the disk.
  2. Wait for that journal entry to be stored durably.
  3. Once the journal entry is stored durably then we actually make changes to the main part of the disk.
  4. Replaying the Journal: On reboot, rather than scan the whole disk, we can just scan the journal! And we make those changes to the file system again.
A common implementation technique for atomic transactions is called WRITEAHEAD 
LOGGING; we used this today in the context of journaling.  "Writeahead" means 
that the log is written BEFORE the transaction commits.  Here's how writeahead 
logging works in the context of journaling.
1. All changes are first written to the log.
2. After the transaction's log entries are written, a COMMIT RECORD
    is written to the log.  Writing this commit record is the commit point.
3. After the commit record is written, the changes are written to the main
    part of the file system.
4. When these are done, a COMPLETION RECORD is written to the log, marking the
    transaction as complete.
After a crash, the reboot procedure *recovers* the log by *replaying* any 
committed transactions:
  For each entry in the log,
     If the entry has a completion record,
        Skip it
     Else if the entry has a commit record,
        Write the logged changes to the main part of the file system
        Then write a completion record
     Else, the entry has no completion or commit record
        Skip it (do not perform its changes)
This recovery might write the logged changes to the main part of the file 
system multiple times.  For instance, the computer might crash after writing 
changes to the main part of the file system, but before writing the completion 
record; the recovery procedure would then write those changes again.  For this 
reason, it is important that log entries be IDEMPOTENT: that performing them 
two or more times has the same effect as performing them once.  (I 
accidentally promoted this requirement to an ACID property, but it is really 
just an implementation technique.)
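
The recovery pseudocode above can be sketched in a few lines. This is a toy model, not a real journal format: the log is assumed to be a list of dictionaries, the "disk" is a dictionary from block number to contents, and every logged change is an absolute write ("set block 1 to X"), which is what makes replay idempotent.

```python
def recover(log, disk):
    """Replay committed-but-incomplete transactions.

    log  -- list of dicts: {"changes": {block: data}, "committed": bool, "completed": bool}
    disk -- dict mapping block number to data (the main part of the file system)
    """
    for entry in log:
        if entry["completed"]:
            continue                      # already fully applied: skip it
        if entry["committed"]:
            for block, data in entry["changes"].items():
                disk[block] = data        # absolute write, safe to repeat
            entry["completed"] = True     # write the completion record
        # else: no commit record, so the changes are never performed

disk = {1: "old-dir", 2: "old-inode", 3: "old-bitmap"}
log = [
    {"changes": {1: "new-dir", 2: "new-inode"}, "committed": True, "completed": False},
    {"changes": {3: "new-bitmap"}, "committed": False, "completed": False},
]
recover(log, disk)
after_first_replay = dict(disk)

# Pretend the machine crashed just before the completion record reached the
# disk, so the same committed entry gets replayed a second time.
log[0]["completed"] = False
recover(log, disk)
print(disk)
```

Replaying the committed entry twice leaves the disk exactly as one replay did. A logged change such as "append to block 2" or "add 5 to the link count" would not have this property, because each replay would change the result.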
There are other ways to design a transaction log.  For example, you might 
store BOTH the old data AND the new data for each change.  This would allow 
steps 1 and 3 above to happen somewhat in parallel; the reboot procedure can 
UNDO any changes that were written to disk too early.  Such designs are common 
in databases.
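
A minimal sketch of the old-data-plus-new-data idea, under assumptions that are not from the lecture: each log record names a transaction, a block, and both versions of the block, and the log record is assumed to reach the disk before the block itself is overwritten. All names here (apply_early, recover_with_undo) are hypothetical.

```python
def apply_early(disk, log, txn_id, block, new_data):
    """Log old and new data, then overwrite the main disk before commit."""
    log.append({"txn": txn_id, "block": block, "old": disk[block], "new": new_data})
    disk[block] = new_data

def recover_with_undo(disk, log, committed):
    """Undo, newest first, every change made by an uncommitted transaction."""
    for record in reversed(log):
        if record["txn"] not in committed:
            disk[record["block"]] = record["old"]

disk = {1: "a", 2: "b"}
log = []
apply_early(disk, log, "T1", 1, "A")
apply_early(disk, log, "T2", 2, "B")
apply_early(disk, log, "T2", 2, "B2")

# Crash: only T1 reached its commit record, so T2 must be rolled back.
recover_with_undo(disk, log, committed={"T1"})
print(disk)
```

Walking the log backwards matters: T2 changed block 2 twice, and undoing the newer record first is what restores the original contents.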
Example of Journaling vs. Non-Journaling (e.g., deleting a file from a UNIX-like file system)

Non-Journal:

Requires 3 actions:

  1. Mark the directory entry as free.
  2. Mark the inode as free. (Assumption: Unlinking the last link to the file.)
  3. Mark the data blocks as free (free block bitmap).

Pro: Simpler, more efficient, and doesn't use as much disk space as journaling.
Con: Only guarantees block-by-block atomic writes. Space leaks are possible, as only 3 out of the 4 invariants hold at any given time.

Journal:

  1. Write the changes to the journal (in any order)
    • What happens if I lose power in phase #1?
    • Since there’s no commit record (a single block), we know this journal entry is invalid, so the change never occurs at all.
  2. Write a commit record to the journal (a single write to the disk.)
    • At this point, the transaction has happened. The changes are committed durably to disk; as far as the file system is concerned, the delete has happened.
  3. Make the changes to the main file system.
    • What happens if I lose power in phase #3?
    • Then I redo the records in the journal! Again, the idempotency principle saves us.
  4. Mark the transaction as complete. Note the commit record is NOT the same as the completion record.
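
The four phases above can be exercised with a toy crash-injection model. Everything in it is an assumption made for illustration: the "disk" is a dictionary with three file system structures and a single journal slot, and Crash stands in for losing power at a chosen point.

```python
class Crash(Exception):
    """Stands in for the power being pulled."""

def fresh_disk():
    # Hypothetical toy disk: three file system structures plus one journal slot.
    return {
        "fs": {"dirent": "used", "inode": "used", "bitmap": "used"},
        "journal": {"changes": {}, "committed": False, "completed": False},
    }

def journaled_delete(disk, crash_at=None):
    """Run the four phases, raising Crash at the requested point."""
    fs, journal = disk["fs"], disk["journal"]
    changes = {"dirent": "free", "inode": "free", "bitmap": "free"}

    journal["changes"] = dict(changes)          # phase 1: log the changes
    if crash_at == "after-1":
        raise Crash
    journal["committed"] = True                 # phase 2: commit record
    if crash_at == "after-2":
        raise Crash
    for index, (name, value) in enumerate(changes.items()):
        fs[name] = value                        # phase 3: main file system
        if crash_at == "mid-3" and index == 0:
            raise Crash
    journal["completed"] = True                 # phase 4: completion record

def replay(disk):
    """Reboot: redo a committed transaction that never completed."""
    journal = disk["journal"]
    if journal["committed"] and not journal["completed"]:
        disk["fs"].update(journal["changes"])
        journal["completed"] = True

def outcome(crash_at):
    disk = fresh_disk()
    try:
        journaled_delete(disk, crash_at)
    except Crash:
        pass
    replay(disk)
    return sorted(set(disk["fs"].values()))

results = {point: outcome(point) for point in ("after-1", "after-2", "mid-3", None)}
print(results)
```

After replay, the three structures are either all still "used" (crash before the commit record) or all "free" (crash at any later point). The half-finished state left by the "mid-3" crash never survives a reboot.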
Summary: Journaling

Problem: We’re making file system changes that modify more than one block, yet we can only change one block at a time safely. The change should happen atomically, either all at once or not at all, no matter when a failure occurs.

Solution: Keep a log or a journal of our transactions.

  • Pro: Journals can contain data and metadata.
  • Pro: Application data operations are consistent (System call writes/reads are consistent.)
  • Con: Uses more disk space. (Not such an issue anymore.)
  • Con: Performance (slower write, because we write everything twice!)
Dealing with Mechanical Failure: RAID

Question: How do we defend against mechanical failure?

Solution 1: Replication – Most basic way of achieving robustness.

  • Implementation: Write the same data onto two different disks.
    • Achieves Robustness vs. Mechanical Failure by separating faults from failures.
      • Fault: Mechanical problem of some sort.
      • Failure: Data loss. Irrecoverable error (application level).
      • Replication: Fault does not necessarily lead to failure.
Replication ensures that a fault does not necessarily imply a failure: you need more than one fault to have a failure.
Replication helps avoid failure when the first, relatively rare, fault occurs.
Replication is actually the simplest version of RAID.

Solution 2: Redundant Array of Independent Disks (RAID).

RAID uses multiple disks for performance and robustness. (RAID is defined by multiple levels, each with its own setup and designation. On top of that, no one can really agree on which levels should exist, or what some of the more obscure levels mean…)

Different levels of RAID differ not only in terms of performance and efficiency but also in terms of Mean Time to Fault and Mean Time to Failure.

MTTFault/MTTFailure (mean time to fault/failure) is the average length of time it takes a component to fault/fail.

RAID 1: Mirroring.

[Figure: raid1.gif]

  • What is it? 2 copies of the file system on 2 separate disks.
  • Mean time to fault lowers (more components that can fault.)
  • Mean time to failure rises (robustness.)
  • Space Efficiency: 50%.
  • Speed Performance: roughly 100%.

You might actually be able to get more performance and have faster SEEKS with a mirrored setup. (By having two independent seek heads.)
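
To put rough numbers on "mean time to fault lowers, mean time to failure rises", here is a back-of-the-envelope sketch. It rests on assumptions the lecture did not state: disks fault independently, at a constant rate (the memoryless, exponential model), and a faulted disk is never repaired or replaced. The 100,000-hour figure is made up for the example.

```python
def mean_time_to_first_fault(disk_mttf, num_disks):
    """Expected time until any one of num_disks independent disks faults."""
    return disk_mttf / num_disks

def mirror_mean_time_to_failure(disk_mttf):
    """Expected time until BOTH disks of an unrepaired mirror have faulted.

    The first fault arrives after disk_mttf / 2 on average; the surviving
    disk then lasts another disk_mttf on average (memoryless assumption).
    """
    return mean_time_to_first_fault(disk_mttf, 2) + disk_mttf

single_disk_mttf = 100_000  # hours, a made-up figure
first_fault = mean_time_to_first_fault(single_disk_mttf, 2)
data_loss = mirror_mean_time_to_failure(single_disk_mttf)
print(first_fault, data_loss)
```

Under these assumptions the mirrored pair sees its first fault twice as soon as a single disk would, but loses data later than a single disk would. Replacing the faulted disk promptly, which this model ignores, pushes the time to failure out much further.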

RAID 0: Striping

[Figure: raid0.gif]

  • What is it? Spread the file system across more than one disk.
  • Mean time to failure: Goes down. (no redundancy = no robustness.)
  • Mean time to fault: Goes down. (more components that can fault.)
  • Space Efficiency: 100%
  • Speed Performance: ≤ 300%.
    • Why? SEEK times decrease: Thus, our reads are faster (if we’re lucky.)
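
One simple way to spread blocks across disks is round-robin. The sketch below assumes a stripe unit of one block and three disks, matching the ≤ 300% figure above; real arrays usually stripe in larger chunks.

```python
def locate(logical_block, num_disks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks

layout = [locate(block, 3) for block in range(6)]
print(layout)
```

Consecutive logical blocks land on different disks, so a large sequential read can keep all three disks busy at once.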

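RAID 1+0 (also written RAID 10): Mirror + Striping combination. (The standard RAID 2 is something else: bit-level striping with Hamming-code error correction.)

RAID 4: Parity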

[Figure: raid4.gif]

  • What is it? The last disk holds the parity bits of the other disks.
  • Mean time to fault lowers (More faultable components)
  • Mean time to failure rises (parity provides redundancy, so a single disk fault loses no data)
  • Space Efficiency: (number of disks - 1) / number of disks (for the above example with 4 disks, 75%).
  • Speed Performance: 100% * number of storage disks (for above example ≤ 300%).
    • Why? SEEK times decrease. Up to 3 times faster with 3 disks for data.
  • Con: The parity disk (disk 4) is accessed HEAVILY.

Parity Bit Explanation

Given data bits A, B, C, D, define the parity bit X = A XOR B XOR C XOR D. With this you can design a file recovery system.

You can take away any one of the five bits and recover its value by XORing the remaining four, as long as you know which bit was taken away.
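
The same trick works on whole blocks, since XOR operates bit by bit. A small sketch, with the 4-bit values chosen arbitrarily and the function names made up for the example:

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """XOR all the data blocks together to get the parity block."""
    return reduce(xor, blocks, 0)

def recover_missing(surviving_blocks, parity_block):
    """XOR the parity with every surviving block to rebuild the lost one."""
    return reduce(xor, surviving_blocks, parity_block)

a, b, c, d = 0b1011, 0b0110, 0b1110, 0b0001
x = parity([a, b, c, d])

# Suppose the disk holding c dies, and we know it was that one.
rebuilt_c = recover_missing([a, b, d], x)
print(bin(x), bin(rebuilt_c))
```

The recovery only works because we know which block is missing. Parity alone can tell us that some bit flipped, but not which one.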

RAID 5: Distributed Parity bits

[Figure: raid5.gif]

  • What is it? In contrast to RAID 4, no single disk holds all the parity bits. The parity is scattered across the striped disks, so no one disk sees more operations than any other.
  • Mean time to fault: Goes down (More faultable components)
  • Mean time to failure: Goes up (Robustness!)
  • Space Efficiency: Comparable to RAID 4.
  • Speed Performance: Comparable to RAID 4, or faster.
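
One way to scatter the parity is to rotate it by one disk per stripe. The placement rule below is an assumption chosen for simplicity; real RAID 5 implementations use several different layouts.

```python
def parity_disk(stripe, num_disks):
    """Disk holding parity for a stripe, rotating backwards from the last disk."""
    return (num_disks - 1 - stripe) % num_disks

def data_disks(stripe, num_disks):
    """Disks holding data for a stripe: every disk except the parity disk."""
    skip = parity_disk(stripe, num_disks)
    return [disk for disk in range(num_disks) if disk != skip]

placement = [parity_disk(stripe, 4) for stripe in range(8)]
print(placement)
```

Over any four consecutive stripes, each of the four disks holds parity exactly once, so parity updates are spread evenly instead of hammering one disk.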
Distributed Systems

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” – Leslie Lamport

The main problem of distributed systems: FAILURE, and dealing with it.

What is so different about a networked system compared to our own computers?

Question: We have a mini network inside our computers: the BUS. It connects the CPU, the disk, memory, and so on. Yet failure isn't nearly as big a problem there as in distributed systems. Why?

Answer: Correlated failure. Inside one computer, if one part fails, it is likely that the other parts fail with it. On a distributed network, however, this locality of failure does not hold, and there are many more parts that can fail independently. So we have to cope with a very short mean time to fault.

Additional Problems with Distributed Systems:

  • Higher Latency
    • Longer wires = Longer latency.
  • Congestion and Loss
  • Information Security: ATTACKS
    • It is highly unlikely that the hard disk is trying to eat your CPU. However, there is a greater chance that people are trying to eat your precious data if you are tapped into a distributed network.
    • The fundamental difference between local and distributed networks is ATTACKS.
Network Attack #1: Receive Livelock

Question: How does the network receive a packet?

Answer: The network card sends an interrupt whenever a packet is available.

Roughly 10000 packets a second are possible, which means… 10000 interrupts a second. How do we service 10000 interrupts a second? We use some sort of DMA queue for received data.

If the number of interrupts grows so large that all we can do is handle interrupts, we are unable to service any of the packets. This is receive livelock: the machine is busy, but no useful work gets done.

Under high load, we actually want to switch to POLLING instead of interrupts, so that a flood of packets cannot starve packet processing.
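
A toy model shows why interrupts collapse under load while polling does not. All the numbers are invented: each interrupt is assumed to cost 50 microseconds, fully processing a packet another 100 microseconds, and the CPU has one second of time per second. Interrupts pre-empt everything else, so their cost is paid first. Polling is assumed to cost nothing extra, and packets it cannot get to are dropped for free.

```python
def delivered_with_interrupts(arrival_rate, interrupt_cost_us, process_cost_us,
                              cpu_budget_us=1_000_000):
    """Packets/second fully processed when every arrival raises an interrupt."""
    leftover_us = max(0, cpu_budget_us - arrival_rate * interrupt_cost_us)
    return min(arrival_rate, leftover_us // process_cost_us)

def delivered_with_polling(arrival_rate, process_cost_us, cpu_budget_us=1_000_000):
    """Packets/second processed when the kernel polls the receive queue."""
    return min(arrival_rate, cpu_budget_us // process_cost_us)

for rate in (5_000, 10_000, 20_000):
    by_interrupt = delivered_with_interrupts(rate, 50, 100)
    by_polling = delivered_with_polling(rate, 100)
    print(rate, by_interrupt, by_polling)
```

In this model, interrupt-driven delivery falls to zero once interrupt handling alone uses the whole CPU, while polling levels off at the machine's capacity.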


 
2006spring/notes/lec16.txt · Last modified: 2006/09/26 11:42 (external edit)
 