by Izumi Wong-Horiuchi, Ryan LaFontaine, and Ryosuke Shinoda
In our minilab, we implemented fork in the following way:
Figure: the fork process.
The function addrallowed checks whether an access to a virtual address is allowed.
addrallowed (va, atype, cpl)
va: virtual address, atype: access type (r/w), cpl: current privilege level (0: kernel, 3: user)
Sample usage:
    processor: va
    if (addrallowed(va, atype, cpl))
        return pmap(floor(va / PGSIZE) * PGSIZE) + va % PGSIZE;
    else
        pagefault;
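The check-then-translate step above can be sketched in Python. This is a minimal illustration, not the minilab code: the `pmap_table` and `perms` dictionaries, the `PageFault` exception, the `translate` name, and the 4 KiB `PGSIZE` are all assumptions made for the example.

```python
PGSIZE = 0x1000  # assumed page size (4 KiB)

class PageFault(Exception):
    """Raised when an access is not allowed (hypothetical stand-in for a hardware fault)."""

# Hypothetical per-process tables, keyed by page-aligned virtual address.
pmap_table = {0x800000: 0x2000, 0x801000: 0x0000}
# (allowed access types, least-privileged cpl that may use the page)
perms = {0x800000: ("r", 3), 0x801000: ("rw", 3)}

def page_of(va):
    """Round a virtual address down to the start of its page."""
    return (va // PGSIZE) * PGSIZE

def addrallowed(va, atype, cpl):
    """Return True if an access of type atype ('r' or 'w') at level cpl is permitted."""
    entry = perms.get(page_of(va))
    if entry is None:
        return False
    allowed, max_cpl = entry
    return atype in allowed and cpl <= max_cpl

def translate(va, atype, cpl):
    """Map a virtual address to a physical address, or raise PageFault."""
    if not addrallowed(va, atype, cpl):
        raise PageFault(hex(va))
    return pmap_table[page_of(va)] + va % PGSIZE
```

For instance, `translate(0x800123, "r", 3)` yields `0x2123`, while a write to the same read-only page raises `PageFault`.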
However, a real OS implements fork differently.
Copying everything can waste space.
For example, the code of a process is designed not to be modified (for robustness). Therefore, when fork creates a child process, the code doesn't need to be copied; instead, the parent and the child can share the code in physical memory.
Figure: the improved fork process with shared code.
To make forking more efficient in both time and space, the Copy-on-Write method is used.
Copy-on-Write (COW) is the idea that, instead of copying everything in the parent's virtual memory into the child's when forking, the kernel initially lets the parent and the child share the same physical pages. Then, when a process first writes to a shared page, the kernel allocates a new physical page that only that process can access and copies the data into it. In this way, a copy is created on writing and not before writing.
Without Copy-on-Write, fork costs:
N: number of pages of user memory, C: cost of copying one page
After the fork: N more pages are created, and fork() spends NC time copying pages.
When we use Copy-on-Write:
W: total number of pages written, F: cost of a page fault
After the COW fork: W(F+C) time expected.
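To see when Copy-on-Write pays off, the two cost formulas can be compared directly. The sketch below uses made-up numbers chosen only for illustration; the function names are hypothetical. COW wins whenever W(F+C) < NC, that is, when W < NC/(F+C).

```python
def eager_fork_cost(n_pages, copy_cost):
    """Cost of copying every user page at fork time: N * C."""
    return n_pages * copy_cost

def cow_fork_cost(pages_written, fault_cost, copy_cost):
    """Expected cost with copy-on-write: W * (F + C)."""
    return pages_written * (fault_cost + copy_cost)

def break_even_writes(n_pages, fault_cost, copy_cost):
    """Number of written pages at which both approaches cost the same."""
    return n_pages * copy_cost / (fault_cost + copy_cost)

# Illustrative values only (not measurements): a page fault is assumed
# to cost four times as much as copying a page.
N, C, F, W = 1000, 1, 4, 50
print(eager_fork_cost(N, C))       # 1000
print(cow_fork_cost(W, F, C))      # 250
print(break_even_writes(N, F, C))  # 200.0
```

With these numbers COW is cheaper as long as fewer than 200 of the 1000 pages are written after the fork.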
To share pages between the parent and the child, the kernel must:
1. Copy only the page mappings (not the pages themselves) into the child process's address space.
| Parent va | Parent pmap(va) | Parent addrallowed(va) | Child va | Child pmap(va) | Child addrallowed(va) |
|---|---|---|---|---|---|
| 0x800000 | 0x2000 | Read-Only, cpl:3 | 0x800000 | 0x2000 | Read-Only, cpl:3 |
| 0x801000 | 0x0000 | RW, cpl:3 | 0x801000 | 0x2000 | RW, cpl:3 |
| 0xB00000 | 0x1000 | RW, cpl:3 | 0xB00000 | 0x1000 | RW, cpl:3 |
| ≥0xC00000 | ..... | (RW), cpl:0 | ≥Kernel | ..... | (RW), cpl:0 |
| other | X | X | other | X | X |
2. Make all pages read-only, but remember which pages were originally RW.
| Parent va | Parent pmap(va) | Parent addrallowed(va) | Child va | Child pmap(va) | Child addrallowed(va) |
|---|---|---|---|---|---|
| 0x800000 | 0x2000 | Read-Only, cpl:3 | 0x800000 | 0x2000 | Read-Only, cpl:3 |
| 0x801000 | 0x0000 | R | 0x801000 | 0x2000 | R |
| 0xB00000 | 0x1000 | R | 0xB00000 | 0x1000 | R |
| ≥0xC00000 | ..... | (RW), cpl:0 | ≥Kernel | ..... | (RW), cpl:0 |
| other | X | X | other | X | X |
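3. Use the page fault handler to make a private copy when a process writes to a shared page (below, the child writes to 0xB00000).
    Page_Fault_Handler(va, atype, cpl) {
        if (atype == WRITE and current->addrallowed(va, atype, cpl) is COW) {
            evict a page;  // obtain a free physical page
            copy data from the current physical page into the new page;
            change pmap to point to the new page;
            change addrallowed to allow writes;
        }
        return;
    }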
| Parent va | Parent pmap(va) | Parent addrallowed(va) | Child va | Child pmap(va) | Child addrallowed(va) |
|---|---|---|---|---|---|
| 0x800000 | 0x2000 | Read-Only, cpl:3 | 0x800000 | 0x2000 | Read-Only, cpl:3 |
| 0x801000 | 0x0000 | R, cpl:3 | 0x801000 | 0x2000 | R, cpl:3 |
| 0xB00000 | 0x1000 | R, cpl:3 | 0xB00000 | 0xA000 | RW, cpl:3 |
| ≥0xC00000 | ..... | (RW), cpl:0 | ≥Kernel | ..... | (RW), cpl:0 |
| other | X | X | other | X | X |
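
The three steps above can be simulated in a few lines of Python. This is a rough sketch under several assumptions, not the kernel's actual code: the `Process` class, the `physmem` dictionary standing in for physical memory, the `free_pages` list, and the `write` helper are all hypothetical, and the child simply inherits the parent's mappings. A real kernel would also track how many processes share each page; that is omitted here.

```python
PGSIZE = 0x1000  # assumed page size

class Process:
    def __init__(self, pmap, perms):
        self.pmap = dict(pmap)    # page-aligned va -> physical page
        self.perms = dict(perms)  # page-aligned va -> "r" or "rw"
        self.cow = set()          # pages that were RW before the fork

# Hypothetical physical memory (one value per page) and free-page list.
physmem = {0x2000: "code", 0x0000: "data", 0x1000: "stack"}
free_pages = [0xA000]

def cow_fork(parent):
    # Step 1: copy only the mappings, not the page contents.
    child = Process(parent.pmap, parent.perms)
    # Step 2: make RW pages read-only in both, remembering they were RW.
    for proc in (parent, child):
        for va, perm in proc.perms.items():
            if perm == "rw":
                proc.perms[va] = "r"
                proc.cow.add(va)
    return child

def page_fault_handler(proc, va, atype):
    # Step 3: on a write to a COW page, give the process a private copy.
    page = (va // PGSIZE) * PGSIZE
    if atype == "w" and page in proc.cow:
        new_pa = free_pages.pop()                   # obtain a free physical page
        physmem[new_pa] = physmem[proc.pmap[page]]  # copy the data
        proc.pmap[page] = new_pa                    # change pmap
        proc.perms[page] = "rw"                     # allow writes again
        proc.cow.discard(page)
        return True
    return False

def write(proc, va, value):
    page = (va // PGSIZE) * PGSIZE
    if proc.perms.get(page) != "rw":
        if not page_fault_handler(proc, va, "w"):
            raise PermissionError(hex(va))
    physmem[proc.pmap[page]] = value

parent = Process(
    {0x800000: 0x2000, 0x801000: 0x0000, 0xB00000: 0x1000},
    {0x800000: "r", 0x801000: "rw", 0xB00000: "rw"},
)
child = cow_fork(parent)
write(child, 0xB00000, "child stack")
```

After the write, the child's 0xB00000 maps to the new page 0xA000 with RW permission, while the parent still maps 0x1000 and sees its original data.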
Because of prefetching and batching, sectors are typically read and written in groups. Disk scheduling decides the order in which pending disk block reads and writes are performed. This can have an important effect on performance because reading from or writing to disk is expensive, and much of this cost comes from the sweep performed by the arm of the drive. If a good disk scheduling algorithm reduces the cost or the number of sweeps, we get better performance.
Sample Request Order
The cost calculations for the disk scheduling algorithms below all use the following request order.
We also assume that the cost of moving from block b1 to block b2 is |b1 - b2|.
We will only count the cost of the sweeps, not the cost of the actual reads or writes.
| Time --> | |||||
|---|---|---|---|---|---|
| 0 | 10 | 1 | 11 | 2 | 12 |
In a First-Come, First-Served (FCFS) disk scheduling algorithm, we read or write blocks in the order they are requested. For the sample request order above, the cost is:
Cost = |0 - 10| + |10 - 1| + |1 - 11| + |11 - 2| + |2 - 12|
= 10 + 9 + 10 + 9 + 10
= 48 units
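The same calculation can be written as a short Python function. The name `seek_cost` is made up for this sketch; it only sums |b1 - b2| over consecutive requests, as assumed above.

```python
def seek_cost(order):
    """Total head movement when blocks are visited in the given order."""
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

requests = [0, 10, 1, 11, 2, 12]  # sample request order, served FCFS
print(seek_cost(requests))  # 48
```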
Let's assume we know the disk head's position. The Shortest Seek Time First (SSTF) algorithm always serves next the pending request with the shortest seek time from the current head position.
Assume the head starts at 0. Then requests will be performed in the following order based on the above sample:
SSTF Order: 0, 1, 2, 10, 11, 12
Cost = |0 - 1| + |1 - 2| + |2 - 10| + |10 - 11| + |11 - 12|
= 1 + 1 + 8 + 1 + 1
= 12 units
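A minimal sketch of SSTF ordering in Python, assuming all requests are known up front and ties go to the earlier request. The function names are hypothetical.

```python
def sstf_order(head, requests):
    """Repeatedly pick the pending block closest to the current head position."""
    pending = list(requests)
    order = []
    while pending:
        nearest = min(pending, key=lambda b: abs(b - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest  # the head is now at the block just served
    return order

def seek_cost(order):
    """Total head movement when blocks are visited in the given order."""
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

order = sstf_order(0, [0, 10, 1, 11, 2, 12])
print(order)             # [0, 1, 2, 10, 11, 12]
print(seek_cost(order))  # 12
```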
Starvation
Suppose that, in the previous sample, requests for block 2 keep arriving. The head then stays near block 2, and the requests for blocks 10, 11, and 12 may never be performed. We must therefore come up with another algorithm that does not suffer from starvation.
Elevator Scheduling, as the name implies, is inspired by the algorithm used to determine which floor an elevator stops at. Consider the following example of the requests for an elevator:
| Floors | Requests |
|---|---|
| 9 | |
| 8 | |
| 7 | UP |
| 6 | DOWN |
| 5 | UP |
| 4 | E (UP) |
Let's assume that the elevator is at Floor 4 and is going UP. It will stop at floors 5 and 7 because those requests are in the same direction the elevator is currently heading. Once those requests are completed, the elevator will go back down to the 6th floor and handle that request. We can use the same idea for disk scheduling: the head keeps moving in its current direction, serving the requests that lie ahead of it, and only then turns around.
Consider the following example of requests. Assume that the head starts at position 0. At time t = 0, the requests are for blocks 0, 2 and 3. After the request for block 2 has completed, requests for blocks 1, 50, and 100 are added.
Notice how after block 2 is processed, block 1 would have the shortest seek time. However, it is not in the direction of the head and therefore is not processed until the head reaches the end and turns around. This is similar to how, in the elevator example above, floor 6 is skipped until all the UP requests have been completed. This algorithm has no starvation.
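The example can be replayed with a small Python simulation. This is a simplified sketch: the head turns around as soon as no requests remain ahead of it, rather than travelling to the physical end of the disk, and the `arrivals` argument (requests that show up right after a given block is served) is a device invented for this example.

```python
def elevator(head, initial, arrivals):
    """Serve requests in the current direction; reverse only when none lie ahead.

    arrivals maps a block to the requests that arrive right after it is served.
    """
    pending = set(initial)
    direction = 1  # +1 means moving toward higher block numbers
    order = []
    while pending:
        ahead = [b for b in pending if (b - head) * direction >= 0]
        if not ahead:
            direction = -direction  # nothing left this way: turn around
            continue
        nxt = min(ahead, key=lambda b: abs(b - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
        pending.update(arrivals.get(nxt, ()))
    return order

# Head starts at 0; blocks 1, 50 and 100 are requested after block 2 completes.
print(elevator(0, [0, 2, 3], {2: [1, 50, 100]}))  # [0, 2, 3, 50, 100, 1]
```

Block 1 is served last even though it was the closest request when it arrived, because it lay behind the head.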
Journals and write ordering address power failures. These additions to file systems are based on a failure model in which some writes are not committed. Unfortunately, this model does not cover all failures. Consider, for example, the situation where a disk physically fails, like a disk explosion! Journaling will not help with this problem, and all data on the disk would be lost. One way to guard against this type of data loss is redundancy.
Disk Failure Profile: A bathtub curve shows the failure probability of a disk over time. In the beginning, the failure probability is high because of manufacturing errors. As time passes, the failure probability decreases and there is a constant region of random failures. Later still, the failure rate increases once again because the hardware physically wears out.
Probability of failure vs. time:
Figure: Bathtub Curve.
RAID (Redundant Array of Independent Disks) provides the OS with an interface like that of a single disk. Underneath, however, data is written to multiple pieces of disk hardware. This adds robustness by storing data in multiple places. There are multiple RAID configurations, or "levels". RAID 1 and RAID 4 are discussed below.
In a RAID 1 configuration, multiple disks store the same data. This way, if one disk fails, the system still has the data on the other disk(s) and no data is lost. Note that this greatly reduces storage efficiency: with N disks, only 1/N of the total storage can be used.
Figures: RAID 1 - Read, RAID 1 - Write.
We can compare the probability that the disk has failed by time t, for both a single disk and for a RAID 1 configuration with 3 disks.
Figure: Failure Probability.
Median Time to Failure (MTTF): Time at which probability of failure = 1/2
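As a rough numerical illustration of this comparison, the sketch below makes three assumptions that are not in the notes: disks fail independently, failed disks are never replaced, and each disk fails at a constant rate (an exponential model, which covers only the flat middle of the bathtub curve). Under those assumptions a 3-disk RAID 1 loses data only when all three disks have failed, so its failure probability is the single-disk probability cubed. The function names and the rate value are made up.

```python
import math

def single_disk_failed_by(t, rate):
    """P(one disk has failed by time t) under an assumed constant failure rate."""
    return 1.0 - math.exp(-rate * t)

def raid1_failed_by(t, rate, n_disks=3):
    """P(all n mirrored disks have failed by time t), assuming independence."""
    return single_disk_failed_by(t, rate) ** n_disks

def median_time_to_failure(rate, n_disks=1):
    """Time at which the failure probability reaches 1/2."""
    # Solve (1 - exp(-rate * t)) ** n_disks == 1/2 for t.
    return -math.log(1.0 - 0.5 ** (1.0 / n_disks)) / rate

rate = 0.1  # illustrative failures per year, not a measured value
print(median_time_to_failure(rate, 1))
print(median_time_to_failure(rate, 3))
```

Under this model the median time to failure of the 3-disk mirror is a little more than twice that of a single disk.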
A RAID 4 configuration is another way to use multiple disks to store data. RAID 4 requires a minimum of 3 disks. The data disks each store unique data, and one additional disk is designated as the parity disk. Each sector of the parity disk contains the XOR (Exclusive OR) of the corresponding sectors on the data disks. Therefore, if one disk fails, its data can be reconstructed from the XOR of the remaining disks. However, if two disks fail simultaneously, data will be lost.
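The parity idea can be demonstrated with a few bytes in Python. This is only a sketch of the XOR arithmetic, with each "disk" reduced to a two-byte sector; the `parity` function name and the sample bytes are made up.

```python
from functools import reduce

def parity(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data disks, one sector each (illustrative contents).
data = [b"\x01\x02", b"\x0f\x00", b"\xa0\x55"]
parity_disk = parity(data)

# Suppose disk 1 fails: XOR the surviving data disks with the parity disk.
recovered = parity([data[0], data[2], parity_disk])
print(recovered == data[1])  # True
```

The recovery works because XOR-ing a value with itself cancels it out, leaving only the missing disk's contents.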
Figure: RAID 4 example.
Effect of network on kernel and applications:
Covered in CS118
Topics in CS111
Network effect:
The value of a network is at least proportional to the number of nodes plugged in
Common distributed system interaction pattern
Remote procedure call:
Makes client/server interactions look like function calls.
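A toy Python sketch of this idea follows. Everything in it is hypothetical: the `Stub` class, the JSON message format, and the `server_handle` function are invented for illustration, and the "network" is just a direct function call standing in for a real transport.

```python
import json

# Server side: the functions that clients are allowed to call remotely.
def add(a, b):
    return a + b

SERVER_FUNCS = {"add": add}

def server_handle(request_bytes):
    """Unmarshal a request, run the named function, marshal the reply."""
    request = json.loads(request_bytes.decode())
    result = SERVER_FUNCS[request["func"]](*request["args"])
    return json.dumps({"result": result}).encode()

# Client side: a stub that turns method calls into request messages.
class Stub:
    def __init__(self, transport):
        self.transport = transport  # callable: request bytes -> reply bytes

    def __getattr__(self, name):
        def call(*args):
            message = json.dumps({"func": name, "args": args}).encode()
            reply = self.transport(message)
            return json.loads(reply.decode())["result"]
        return call

remote = Stub(server_handle)  # in-process stand-in for a network connection
print(remote.add(2, 3))  # 5
```

To the caller, `remote.add(2, 3)` looks like an ordinary function call, even though the arguments and the result travel as messages.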