====== Lecture 16 Scribe Notes ======
//by Izumi Wong-Horiuchi, Ryan LaFontaine, and Ryosuke Shinoda//\\
===== Copy-on-Write Fork =====
==== Simple Way of Forking (without Copy-on-Write)====
We implemented fork (in our minilab) in this way:\\
* copy the process descriptor\\
* copy the stack (process isolation)\\
\\
\\
Picture of the fork process:\\
{{before.jpg|}}
\\
The function addrallowed checks whether an access to a virtual address is permitted.\\
\\
  addrallowed(va, atype, cpl)
    va:    virtual address
    atype: access type (r/w)
    cpl:   current privilege level (0: kernel, 3: user)
Sample usage (what the processor does on each access to va):\\
  if (addrallowed(va, atype, cpl))
      return pmap(floor(va / PGSIZE) * PGSIZE) + va % PGSIZE;
  else
      pagefault;
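The check-then-translate step above can be sketched in Python. This is only an illustration: the dictionaries ''pmap'' and ''perms'', the ''PageFault'' exception, and the 4096-byte ''PGSIZE'' are assumptions made for the sketch, not the minilab's actual data structures.

```python
PGSIZE = 4096  # assumed page size


class PageFault(Exception):
    """Stands in for the hardware page fault."""


def translate(va, atype, cpl, pmap, perms):
    """Return the physical address for va, or raise PageFault.

    pmap  maps a page-aligned virtual address to a physical page address.
    perms maps the same key to (allowed access types, least-privileged cpl allowed).
    Both tables are hypothetical stand-ins for a real page table.
    """
    page = (va // PGSIZE) * PGSIZE            # floor(va / PGSIZE) * PGSIZE
    entry = perms.get(page)
    if entry is None:                         # unmapped address
        raise PageFault(hex(va))
    allowed_types, max_cpl = entry
    if atype not in allowed_types or cpl > max_cpl:
        raise PageFault(hex(va))
    return pmap[page] + va % PGSIZE


# Two of the mappings used in the tables later in these notes.
PMAP = {0x800000: 0x2000, 0xB00000: 0x1000}
PERMS = {0x800000: ("r", 3), 0xB00000: ("rw", 3)}

print(hex(translate(0x800010, "r", 3, PMAP, PERMS)))  # 0x2010
```

A write to 0x800010 would raise ''PageFault'' here, because that page is mapped read-only.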
\\
\\
However, fork in a real OS is implemented differently.\\
* copy the process's entire address space, so that the child's virtual address space starts out as a copy of the parent's\\
\\
\\
**Copying everything can waste space because:**
- the child might not need all that memory.
- since the parent's code is mapped read-only AND the child's code is mapped read-only, it is safe to share code pages.
For example, a process's code is designed not to be modified (robustness). Therefore, when fork creates the child process, the code doesn't need to be copied; the code in physical memory can be shared between the parent and the child.
\\
\\
Picture of the improved fork process with shared code:\\
{{after.jpg|}}
\\
\\
==== Defining Copy-on-Write Method ====
To make forking more efficient and save space, the Copy-on-Write method is used.\\
**Copy-on-Write** (COW) is the idea that, instead of copying everything in the parent's virtual memory into the child at fork time, the kernel copies only the mappings, so that parent and child initially share the same physical pages. Later, when a process writes to a shared page, the kernel allocates a new physical page that only the writing process can access and copies the data into it. In this way, a copy is created on writing and not before.\\
\\
==== Efficiency of COW ====
\\
Without Copy-on-Write:\\
N - number of pages of user memory
C - cost of copying one page
----------------------------------------
After the fork:
N more pages created.
NC time to copy pages in fork()
\\
With Copy-on-Write:\\
W - total number of pages written after the fork
F - cost of a page fault
----------------------------------------
After the COW fork:
W(F+C) time expected, and only W more pages created.
\\
COW is therefore cheaper whenever W(F+C) < NC, that is, when only a small fraction of the pages is written after the fork.
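To get a feel for the two cost expressions, here is a small numeric sketch. It reads C as the cost of copying one page (the only reading under which NC and W(F+C) are both times), and every number below is made up for illustration.

```python
def eager_fork_cost(n_pages, copy_cost):
    """Time for a fork that copies every page up front: N * C."""
    return n_pages * copy_cost


def cow_fork_cost(pages_written, fault_cost, copy_cost):
    """Time for a COW fork: each written page costs one fault plus one copy."""
    return pages_written * (fault_cost + copy_cost)


# Hypothetical numbers, chosen only for illustration.
N, C = 1000, 1.0      # pages of user memory, cost to copy one page
W, F = 20, 2.0        # pages written after the fork, cost of one page fault

print(eager_fork_cost(N, C))    # 1000.0
print(cow_fork_cost(W, F, C))   # 60.0
```

With these numbers COW wins easily; it would lose only if the processes wrote to more than N*C/(F+C) pages, about a third of them here.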
\\
==== Implementing COW ====
**Once pages are shared, the kernel must:**\\
- keep track of the number of processes sharing each page.\\
- when swapping, remove an evicted physical page from every pmap function referencing it.\\
\\
**1. Copy only the mappings of the shared pages into the child process's address space, not the pages themselves.**\\
{{cowbefore.jpg|}}
^ Parent ^^| ^Child ^^^^
^ va ^ pmap(va) ^ addrallowed(va) | ^va ^ pmap(va) ^ addrallowed(va)^|
|0x800000|0x2000|Read-Only, cpl:3| |0x800000|0x2000|Read-Only, cpl:3||
|0x801000|0x0000|RW, cpl:3| |0x801000|0x2000|RW, cpl:3||
|0xB00000|0x1000|RW, cpl:3| |0xB00000|0x1000|RW, cpl:3||
|≥0xC00000|.....|(RW), cpl:0| |≥Kernel |.....|(RW), cpl:0||
|other| X | X | |other| X | X ||
\\
**2. Make all pages read-only. Remember which pages were originally RW (shown as R below; these are the COW pages).**\\
^ Parent ^^| ^Child ^^^^
^ va ^ pmap(va) ^ addrallowed(va) | ^va ^ pmap(va) ^ addrallowed(va)^|
|0x800000|0x2000|Read-Only, cpl:3| |0x800000|0x2000|Read-Only, cpl:3||
|0x801000|0x0000|R, cpl:3| |0x801000|0x2000|R, cpl:3||
|0xB00000|0x1000|R, cpl:3| |0xB00000|0x1000|R, cpl:3||
|≥0xC00000|.....|(RW), cpl:0| |≥Kernel |.....|(RW), cpl:0||
|other| X | X | |other| X | X ||
\\
**3. Use the page fault handler to copy a page when a process first writes to it.**\\
  Page_Fault_Handler(va, atype, cpl) {
      if (atype == WRITE and current->addrallowed(va, atype, cpl) is COW) {
          allocate a new physical page (evicting a page if none is free);
          copy data from the current page into the new physical page;
          change pmap so that va maps to the new page;
          change addrallowed to allow writes;
      }
      return;
  }
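The three steps can be tried out in a toy user-level simulation. This is a sketch under stated assumptions, not kernel code: the classes ''Memory'' and ''Process'', the permission strings, and the reference-count dictionary are all hypothetical names invented for the illustration.

```python
class Memory:
    """Toy physical memory: page number -> contents, plus reference counts."""

    def __init__(self):
        self.pages = {}
        self.refcount = {}
        self.next_free = 0

    def alloc(self, data):
        ppn = self.next_free
        self.next_free += 1
        self.pages[ppn] = data
        self.refcount[ppn] = 1
        return ppn


class Process:
    def __init__(self, mem):
        self.mem = mem
        self.pmap = {}   # va -> physical page number
        self.perm = {}   # va -> "R", "RW", or "COW"

    def fork(self):
        child = Process(self.mem)
        for va, ppn in self.pmap.items():
            child.pmap[va] = ppn                  # step 1: share, do not copy
            self.mem.refcount[ppn] += 1
            if self.perm[va] in ("RW", "COW"):    # step 2: read-only, but remembered
                self.perm[va] = "COW"
            child.perm[va] = self.perm[va]
        return child

    def read(self, va):
        return self.mem.pages[self.pmap[va]]

    def write(self, va, data):
        if self.perm[va] == "COW":                # step 3: the "page fault"
            self._cow_fault(va)
        if self.perm[va] != "RW":
            raise PermissionError(hex(va))
        self.mem.pages[self.pmap[va]] = data

    def _cow_fault(self, va):
        old = self.pmap[va]
        if self.mem.refcount[old] > 1:            # still shared: private copy
            self.mem.refcount[old] -= 1
            self.pmap[va] = self.mem.alloc(self.mem.pages[old])
        self.perm[va] = "RW"                      # sole owner writes in place


mem = Memory()
parent = Process(mem)
parent.pmap[0xB00000] = mem.alloc("parent data")
parent.perm[0xB00000] = "RW"

child = parent.fork()
child.write(0xB00000, "child data")
print(parent.read(0xB00000), "/", child.read(0xB00000))
```

Note the design choice in ''_cow_fault'': when the faulting process is the last one still referencing the page, no copy is needed and the page is simply made writable again. This is why the kernel has to track how many processes share each page.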
\\
Figure: after the child writes to 0xB00000\\
{{cowafter.jpg|}}
^ Parent ^^| ^Child ^^^^
^ va ^ pmap(va) ^ addrallowed(va) | ^va ^ pmap(va) ^ addrallowed(va)^|
|0x800000|0x2000|Read-Only, cpl:3| |0x800000|0x2000|Read-Only, cpl:3||
|0x801000|0x0000|R, cpl:3| |0x801000|0x2000|R, cpl:3||
|0xB00000|0x1000|R, cpl:3| |0xB00000|0xA000|RW, cpl:3||
|≥0xC00000|.....|(RW), cpl:0| |≥Kernel |.....|(RW), cpl:0||
|other| X | X | |other| X | X ||
\\
===== Disk Scheduling =====
Because of prefetching and batching, reads and writes of sectors are typically performed in groups. **Disk scheduling** decides the order in which pending disk block reads and writes are performed. This can have an important effect on performance because the cost of reading or writing to disk is significant. Much of this cost comes from the sweep (seek) performed by the arm of the drive. If a good disk scheduling algorithm can reduce the number or cost of sweeps, we get better performance.\\
**Sample Request Order**\\
Consider the following request order for the cost calculations of the following disk scheduling algorithms.\\
We will also assume that the cost of moving from block b1 to block b2 is |b1 - b2|.\\
We will only count the cost of the sweeps, not the cost of the actual reads or writes.
^ Time --> ^^^^^^
|0|10|1|11|2|12|
==== First Come First Serve ====
In a **FCFS** disk scheduling algorithm, we read or write blocks in the order they are requested. Therefore, based on the example request order above we can calculate the following cost:
Cost = |0 - 10| + |10 - 1| + |1 - 11| + |11 - 2| + |2 - 12|
= 10 + 9 + 10 + 9 + 10
= 48 units
* Advantage: no starvation.\\
* Disadvantage: Potential for many long seeks back and forth across the drive, causing large cost.
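The FCFS cost calculation can be checked with a few lines of Python. The helper name ''seek_cost'' is made up for this sketch, and the head is assumed to start at block 0, as in the calculation above.

```python
def seek_cost(order, start=0):
    """Total head movement when blocks are visited in the given order."""
    cost, head = 0, start
    for block in order:
        cost += abs(head - block)   # cost of moving from b1 to b2 is |b1 - b2|
        head = block
    return cost


requests = [0, 10, 1, 11, 2, 12]    # the sample request order
print(seek_cost(requests))          # 48
```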
==== Shortest Seek Time First ====
Let's assume we know the disk head's position. In the **SSTF** algorithm, we always serve next the pending request with the shortest seek time from the disk head's current position.\\
\\
Assume the head starts at 0. Then requests will be performed in the following order based on the above sample:
SSTF Order: 0, 1, 2, 10, 11, 12
Cost = |0 - 1| + |1 - 2| + |2 - 10| + |10 - 11| + |11 - 12|
= 1 + 1 + 8 + 1 + 1
= 12 units
* Advantage: Reduced cost
* Disadvantage: Starvation is possible!
**Starvation**\\
Consider the situation where, in the previous sample, new requests for block 2 keep arriving. The head then stays near block 2, and the requests for blocks 10, 11, and 12 may never be performed. We must therefore come up with another algorithm that does not suffer from starvation.
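A minimal sketch of SSTF on the sample request order, assuming all requests are known up front and the head starts at block 0. The function names are invented for this example.

```python
def sstf_order(requests, head=0):
    """Repeatedly pick the pending block closest to the current head position."""
    pending = list(requests)
    order = []
    while pending:
        nearest = min(pending, key=lambda b: abs(b - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest
    return order


def seek_cost(order, start=0):
    """Total head movement when blocks are visited in the given order."""
    cost, head = 0, start
    for block in order:
        cost += abs(head - block)
        head = block
    return cost


order = sstf_order([0, 10, 1, 11, 2, 12])
print(order)              # [0, 1, 2, 10, 11, 12]
print(seek_cost(order))   # 12
```

The starvation problem is visible in the loop: if new requests near the head were appended to ''pending'' on every iteration, ''min'' would keep choosing them and the far blocks would never be picked.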
==== Elevator Scheduling ====
**Elevator Scheduling**, as the name implies, is inspired by the algorithm used to determine which floor an elevator stops at. Consider the following example of the requests for an elevator:
^ Floors ^ Requests ^
| 9 | |
| 8 | |
| 7 | UP |
| 6 | DOWN |
| 5 | UP |
| 4 ^E (UP)^
Let's assume that the elevator is at Floor 4 and is going UP. It will stop at floors 5 and 7 because those requests are in the same direction the elevator is currently heading. Once those requests are completed, the elevator will go back down to the 6th floor and handle that request. We can use the same idea for disk scheduling. Here is the basic idea:
* Sweep smaller blocks to larger.
* Shortest Seek Time First in that direction.
* When you reach the end of the disk, turn around.
Consider the following example of requests. Assume that the head starts at position 0. At time t = 0, the requests are for blocks 0, 2 and 3. After the request for block 2 has completed, requests for blocks 1, 50, and 100 are added.
|{{elevator_scheduling_example.gif|}}|
Notice how, after block 2 is processed, block 1 would have the shortest seek time. However, it is not in the direction the head is moving and therefore is not processed until the head reaches the end and turns around. This is similar to how, in the elevator example above, floor 6 is skipped until all the UP requests have been completed. **This algorithm has no starvation**.
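The example above can be replayed with a short simulation. This is a sketch with one simplification that should be named plainly: the head reverses as soon as no requests remain ahead of it, rather than travelling all the way to the physical end of the disk. The ''arrivals'' argument (requests that show up right after a given block is serviced) is a device invented for this illustration.

```python
def elevator_order(initial, arrivals=None, head=0):
    """Serve requests sweeping toward larger blocks first, then reverse.

    arrivals maps a block number to the requests that arrive
    immediately after that block has been serviced (hypothetical input).
    """
    arrivals = dict(arrivals or {})
    pending = set(initial)
    order = []
    direction = +1                                  # smaller blocks to larger
    while pending:
        ahead = [b for b in pending if (b - head) * direction >= 0]
        if not ahead:                               # nothing left this way
            direction = -direction                  # turn around
            continue
        nxt = min(ahead, key=lambda b: abs(b - head))   # SSTF in this direction
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
        pending.update(arrivals.pop(nxt, []))
    return order


# Requests 0, 2, 3 at t = 0; blocks 1, 50, 100 arrive after block 2 completes.
print(elevator_order([0, 2, 3], {2: [1, 50, 100]}))   # [0, 2, 3, 50, 100, 1]
```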
===== Disk Robustness =====
Journals and write ordering address power failure. These additions to file systems are based on the failure model that some writes are not committed. Unfortunately, this does not take into account all failures. Consider, for example, the situation where a disk physically fails, like a disk explosion! Journaling will not help with this problem, and all data on the disk would be lost. One way to try to prevent this type of data loss is through redundancy.
**Disk Failure Profile: **
A bathtub curve shows the failure probability of a disk over time. In the beginning, the failure probability is high because of manufacturing errors. As time passes, the failure probability decreases and there is a constant region of random failures. Later, the failure rate increases once again, due to the hardware physically wearing out.\\
Probability of failure vs. time:
^Bathtub Curve^
|{{bathtub_curve_graph.gif|}}|
==== RAID - Redundant Arrays of Inexpensive Disks ====
RAID provides the OS with an interface like a single disk. However, all writes are written to multiple pieces of disk hardware. This adds robustness by storing data in multiple places. There exist multiple RAID configurations or "levels". RAID 1 and RAID 4 are discussed below.
=== RAID 1 ===
In a RAID 1 configuration, multiple disks store the same data. This way, if one disk fails, the system still contains the data on the other disk(s) and data is not lost. Note that this greatly reduces storage efficiency, because if there are N disks, only 1/N of the storage can be utilized.
^ RAID 1 - Read ^ RAID 1 - Write ^
| {{raid_read.gif?300x303}} | {{raid_write.gif?300x302}} |
We can compare the probability that the disk has failed by time t, for both a single disk and for a RAID 1 configuration with 3 disks.
^Failure Probability^
|{{RAID_1disk_vs_3disks.gif|}}|
**Median Time to Failure (MTTF): **
Time at which probability of failure = 1/2
* Advantage: Increased Robustness
* Advantage: Increased Read Speed\\
* Disadvantage: Decreased Efficiency (Space utilization is 1/3 for 3 disks)
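As a rough way to see the robustness gain, suppose that disks fail independently and that a failed disk is never replaced. Both are assumptions made for this sketch; the notes do not state them. Then a mirror loses data only when every copy has failed.

```python
def mirror_failure_probability(p_single, n_disks):
    """Probability that all n mirrored disks have failed by time t.

    p_single is the probability that one disk has failed by time t.
    Assumes independent failures and no replacement of failed disks.
    """
    return p_single ** n_disks


# At the single-disk median time to failure, p_single = 1/2.
print(mirror_failure_probability(0.5, 1))   # 0.5
print(mirror_failure_probability(0.5, 3))   # 0.125
```

So at the moment a single disk has a 1/2 chance of having failed, the 3-disk mirror under these assumptions has only a 1/8 chance of having lost the data.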
=== RAID 4 ===
A RAID 4 configuration is another way to use multiple disks to store data. RAID 4 requires a minimum of 3 disks. It uses multiple disks which each store unique data, with one disk designated as a parity disk. The parity disk contains the XOR (Exclusive OR) of the data stored on each sector of the data disks. Therefore, if one disk fails, the data on the failing disk can be reconstructed from the XOR of the remaining disks. However, if two disks fail simultaneously, data will be lost.
**RAID 4 Example:**\\
|{{raid_4_parity.gif|}}|
* Advantage: Fast read performance (allows reading from multiple data disks in parallel)
* Advantage: Increased Robustness (if one disk fails that data can be recovered)
* Disadvantage: Decreased write performance (the parity disk must be updated with each write)
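The XOR reconstruction can be demonstrated on a few bytes. This is a toy sketch: the "disks" are short byte strings with made-up contents, and ''parity'' is a name chosen for the example.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks (one sector each) and the parity disk computed from them.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity_disk = parity(data)

# Suppose data disk 1 fails: XOR the surviving data disks with the parity disk.
rebuilt = parity([data[0], data[2], parity_disk])
print(rebuilt == data[1])   # True
```

This works because XOR-ing a value with itself cancels it out, so everything except the missing block disappears from the result. With two blocks missing there is only one equation for two unknowns, which is why a double failure loses data.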
===== Distributed System =====
**Effect of the network on the kernel and applications:**\\
Covered in CS118:
* Latency (communicating computers experience long delays)
* Loss (dropped messages)
* Congestion (as more data is sent, network performance drops)
Topics in CS111:
* Unsolicited communication
* Attack!
**Network effect:** The value of a network is at least proportional to the number of nodes plugged in.\\
**Common distributed system interaction pattern:**\\
{{untitled.jpg|}}\\
**Remote procedure call:** Makes client/server interactions look like function calls.
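To illustrate how a remote procedure call can look like a function call, here is a minimal sketch. Everything in it is an assumption made for the example: the JSON message format, the ''Stub'' class, and the ''server_handle'' function are invented, and a direct function call stands in for the network.

```python
import json

# --- server side ---
def add(a, b):
    return a + b


PROCEDURES = {"add": add}          # procedures the server is willing to run


def server_handle(request_bytes):
    """Unmarshal a request, run the named procedure, marshal the reply."""
    request = json.loads(request_bytes)
    result = PROCEDURES[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode()


# --- client side ---
class Stub:
    """Turns attribute access into a request sent over the transport."""

    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, name):
        def call(*args):
            request = json.dumps({"proc": name, "args": args}).encode()
            reply = self._transport(request)       # would cross the network
            return json.loads(reply)["result"]
        return call


server = Stub(server_handle)       # direct call stands in for the network
print(server.add(2, 3))            # 5
```

The caller writes ''server.add(2, 3)'' as if it were local. A real system would also have to deal with the latency and loss listed above, which this sketch ignores.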
==== Peer-to-Peer Computing ====
Idealistic View: Clients communicate with each other => better utilization.
{{p2p.jpg|}}