UNIX Internals: The New Frontiers

Paperback

Author: Uresh Vahalia

ISBN-10: 0131019082

ISBN-13: 9780131019089

Category: UNIX

This book offers an exceptionally up-to-date, in-depth, and broad-based exploration of the latest advances in UNIX-based operating systems. Focusing on the design and implementation of the operating system itself, not on the applications and tools that run on it, this book compares and analyzes the alternatives offered by several important UNIX variants and covers several advanced subjects, such as multiprocessors and threads.

Compares several important UNIX variants, highlighting the issues and alternative solutions for various operating system components (e.g., kernel memory allocation): System V Release 4 (SVR4) from Novell, Inc.; Berkeley Software Distribution (4.xBSD) from the University of California; OSF/1 from the Open Software Foundation; SunOS and Solaris from Sun Microsystems; DEC OSF from Digital Equipment Corporation; and HP-UX from Hewlett-Packard. Describes advanced technologies such as multiprocessor and multithreaded systems, log-structured file systems, and modern memory architectures. Provides many programming examples and features over 200 figures. Contains 15-20 exercises for each chapter, many open-ended and expandable to research assignments. Includes an extensive list of references.

Chapter 11: Advanced File Systems

11.1 Introduction

Operating systems need to adapt to changes in computer hardware and architecture. As newer and faster machines are designed, the operating system must change to take advantage of them. Often, developments in some components of the computer outpace those in other parts of the system. This changes the balance of resource utilization characteristics, and the operating system must reevaluate its policies accordingly.

Since the early 1980s, the computer industry has made very rapid strides in the areas of CPU speed and memory size and speed [Mash 87]. In 1982, UNIX was typically run on a VAX 11/780, which had a 1-mips (million instructions per second) CPU and 4-8 megabytes of RAM, and was shared by several users. By 1995, machines with a 100-mips CPU and 32 megabytes or more of RAM have become commonplace on individual desktops. Unfortunately, hard disk technology has not kept pace, and although disks have become larger and cheaper, disk speeds have not increased by more than a factor of two. The UNIX operating system, designed to function with moderately fast disks but small memories and slow processors, has had to adapt to these changes.

Using traditional file systems on today's computers results in severely I/O-bound systems, unable to take advantage of the faster CPUs and memories. As described in [Stae 91], if the time taken by an application is c seconds for CPU processing and i seconds for I/O, then the performance improvement from making the CPU infinitely fast is restricted to the factor (1 + c/i); a worked example of this bound appears after the excerpt. If i is large compared to c, then reducing c yields little benefit. It is essential to find ways to reduce the time the system spends doing disk I/O, and one obvious target for performance improvements is the file system.

Throughout the mid- and late 1980s, an overwhelming majority of UNIX systems had either s5fs or FFS (see Chapter 9) on their local disks. Both are adequate for general time-sharing applications, but their deficiencies are exposed when used in diverse commercial environments. The vnode/vfs interface made it easier to add new file system implementations to UNIX. Its initial use, however, was restricted to small, special-purpose file systems that did not seek to replace s5fs or FFS. Eventually, the limitations of s5fs and FFS motivated the development of several advanced file systems that provide better performance or functionality. By the early 1990s, many of these had gained acceptance in mainstream UNIX versions. In this chapter, we discuss the drawbacks of traditional file systems, consider various ways of addressing them, and examine some of the major file systems that have emerged as alternatives to s5fs and FFS.

11.2 Limitations of Traditional File Systems

The s5fs file system was popular due to its simple design and structure. It was, however, very slow and inefficient, which motivated the development of FFS. Both file systems have several limitations, which can be broadly divided into the following categories:

Performance - Although FFS performance is significantly better than that of s5fs, it is still inadequate for a commercial file system. Its on-disk layout restricts FFS to using only a fraction of the total disk bandwidth. Furthermore, the kernel algorithms force a large number of synchronous I/O operations, resulting in extremely long completion times for many system calls.

Crash recovery - The buffer cache semantics mean that data and metadata may be lost in the event of a crash, leaving the file system in an inconsistent state. Crash recovery is performed by a program called fsck, which traverses the entire file system, finding and fixing problems as best it can. For large disks, this program takes a long time, since the whole disk must be examined and rebuilt. This results in unacceptable delays (downtime) before the machine can reboot and become available.

Security - Access to a file is controlled by permissions associated with user and group IDs. The owner may allow access to the file to him- or herself only, to all users in a certain group, or to the whole world. In a large computing environment, this mechanism is not flexible enough, and a finer-grained access-control mechanism is desirable. This usually involves some type of access control list (ACL), which allows the file owner to explicitly allow or restrict different types of access to specific users and groups. The UNIX inode is not designed to hold such a list, so the file system must find other ways of implementing ACLs. This may require changing the on-disk data structures and file system layout.

Size - There are many unnecessary restrictions on the size of the file system and of individual files. Each file and file system must fit in its entirety on a single disk partition. We could devote the entire disk to a single partition; even so, typical disks are only one gigabyte or smaller in size. Although that may seem large enough for most purposes, several applications (for example, in the database and multimedia domains) use much larger files. In fact, the constraint that the file size be less than 4 gigabytes (since the size field in the inode is 32 bits long) is also considered too restrictive.

Let us now examine the performance and crash recovery issues in greater detail, identify their underlying causes, and explore ways in which they may be addressed.

11.2.1 FFS Disk Layout

Unlike s5fs, FFS tries to optimize the allocation of blocks for a file so as to increase the speed of sequential access. It tries to allocate the blocks of a file contiguously on disk whenever possible. Its ability to do so depends on how full and fragmented the disk has become. Empirical evidence [McVo 91, McKu 84] shows that it can do an effective job until the disk approaches about 90% of its capacity.

The major problem, however, is the rotational delay it introduces between contiguous blocks. FFS is designed to read or write a single block in each I/O request. For an application reading a file sequentially, the kernel will perform a series of single-block reads. Between two consecutive reads, the kernel must check for the next block in the cache and issue the I/O request if necessary. As a result, if the two blocks are on consecutive sectors on the disk, the disk would rotate past the beginning of the second block before the kernel issues the next read. The second read would then have to wait for a full disk rotation before it can start, resulting in very poor performance.

To avoid this, FFS estimates the time it would take for the kernel to issue the next read and computes the number of sectors the disk head would pass over in that time. This number is called the rotational delay, or rotdelay. The blocks are interleaved on disk such that consecutive logical blocks are separated by rotdelay blocks on the track, as shown in Figure 11-1. For a typical disk, a complete rotation takes about 15 milliseconds, and the kernel needs about 4 milliseconds between requests. If the block size is 4 kilobytes and each track has 8 such blocks, the rotdelay must be 2 (a worked version of this calculation also appears after the excerpt).

Although this avoids the problem of waiting for a full disk rotation, it still restricts throughput (in this example) to at most one-third of the disk bandwidth. Increasing the block size to 8 kilobytes reduces the rotdelay to 1 and increases throughput to one-half of the disk bandwidth. This is still well short of the maximum throughput supported by the disk, and the restriction is caused solely by the file system design. If the file system reads and writes entire tracks (or more) in each operation, rather than one block at a time, it can achieve I/O rates close to the actual disk bandwidth.

On many disks, this problem disappears for read operations. This is because the disk maintains a high-speed cache, and any disk read stores an entire track in the cache. If the next operation needs a block from the same track, the disk can service the request directly from its cache at the speed of the I/O bus, without losing time in rotational waits. Disk caches are usually write-through, so each write is propagated to the appropriate place on disk before returning. If the cache were not write-through, a disk crash would lose data that the user was told had been successfully written to disk. Hence, although an on-disk cache improves read performance, write operations continue to suffer from the rotational delay problems and do not utilize the full disk bandwidth. ...
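
The (1 + c/i) bound quoted in the excerpt can be illustrated with a small C sketch. This is not code from the book; the sample values of c and i below are made up purely to show the trend.

```c
#include <stdio.h>

int main(void)
{
    /* { c, i } pairs in seconds: CPU time and I/O time for a hypothetical job */
    const double samples[][2] = {
        { 9.0, 1.0 },   /* mostly CPU-bound: a faster CPU helps a lot   */
        { 5.0, 5.0 },   /* evenly split                                 */
        { 1.0, 9.0 },   /* I/O-bound: a faster CPU helps very little    */
    };
    size_t n;

    for (n = 0; n < sizeof samples / sizeof samples[0]; n++) {
        double c = samples[n][0];
        double i = samples[n][1];
        /* with an infinitely fast CPU, c drops to 0 and only i remains,
         * so the best possible speedup is (c + i) / i = 1 + c/i        */
        printf("c = %.1fs, i = %.1fs: speedup bound = %.2fx\n",
               c, i, 1.0 + c / i);
    }
    return 0;
}
```

The last case is the point the excerpt makes: when i dominates c, even an infinitely fast CPU yields only about an 11% improvement, which is why reducing disk I/O time matters more than faster CPUs.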
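The rotdelay arithmetic can be made concrete in the same way, using only the figures quoted in the excerpt (a 15 ms rotation and roughly 4 ms of kernel overhead between single-block requests). The function and variable names below are illustrative; this is a sketch, not FFS code.

```c
#include <stdio.h>

/* Worked example of the FFS rotdelay calculation described in the excerpt. */
static void rotdelay_example(int block_kb, int blocks_per_track)
{
    const double rotation_ms = 15.0;  /* one full disk rotation            */
    const double kernel_ms   = 4.0;   /* gap before the next I/O request   */
    double ms_per_block = rotation_ms / blocks_per_track;

    /* blocks the head passes over while the kernel prepares the next read */
    int rotdelay = (int)(kernel_ms / ms_per_block);

    /* one logical block is transferred in every (rotdelay + 1) block slots */
    printf("%d KB blocks, %d per track: rotdelay = %d, "
           "throughput <= 1/%d of the disk bandwidth\n",
           block_kb, blocks_per_track, rotdelay, rotdelay + 1);
}

int main(void)
{
    rotdelay_example(4, 8);   /* the excerpt's first case:  rotdelay 2, 1/3 */
    rotdelay_example(8, 4);   /* the excerpt's second case: rotdelay 1, 1/2 */
    return 0;
}
```

This reproduces the excerpt's figures: rotdelay 2 and one-third of the bandwidth for 4-kilobyte blocks, rotdelay 1 and one-half for 8-kilobyte blocks.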

1. Introduction. Introduction. The Mandate For Change. Looking Back, Looking Forward. The Scope of This Book. References.
2. The Process and the Kernel. Introduction. Mode, Space, and Context. The Process Abstraction. Executing In Kernel Mode. Synchronization. Process Scheduling. Signals. New Processes and Programs. Summary. Exercises. References.
3. Threads and Lightweight Processes. Introduction. Fundamental Abstractions. Lightweight Process Design - Issues To Consider. User-Level Threads Libraries. Scheduler Activations. Multithreading in Solaris and SVR4. Threads In MACH. Digital UNIX. MACH 3.0 Continuations. Summary. Exercises. References.
4. Signals and Session Management. Introduction. Signal Generation and Handling. Unreliable Signals. Reliable Signals. Signals in SVR4. Signals Implementation. Exceptions. MACH Exception Handling. Process Groups and Terminal Management. The SVR4 Sessions Architecture. Summary. Exercises. References.
5. Process Scheduling. Introduction. Clock Interrupt Handling. Scheduler Goals. Traditional UNIX Scheduling. The SVR4 Scheduler. Solaris 2.x Scheduling Enhancements. Scheduling in MACH. The Digital UNIX Real-Time Scheduler. Other Scheduling Implementations. Summary. Exercises. References.
6. Interprocess Communications. Introduction. Universal IPC Facilities. System V IPC. MACH IPC. Messages. Ports. Message Passing. Port Operations. Extensibility. MACH 3.0 Enhancements. Discussion. Summary. Exercises. References.
7. Synchronization and Multiprocessing. Introduction. Synchronization in Traditional UNIX Kernels. Multiprocessor Systems. Multiprocessor Synchronization Issues. Semaphores. Spin Locks. Condition Variables. Read-Write Locks. Reference Counts. Other Considerations. Case Studies. Summary. Exercises. References.
8. File System Interface and Framework. Introduction. The User Interface to Files. File Systems. Special Files. File System Framework. The Vnode/VFS Architecture. Implementation Overview. File-System-Dependent Objects. Mounting a File System. Operations on Files. Analysis. Summary. Exercises. References.
9. File System Implementations. Introduction. The System V File System (s5fs). S5fs Kernel Organization. Analysis of S5fs. The Berkeley Fast File System. Hard Disk Structure. On-Disk Organization. FFS Functionality Enhancements. Analysis. Temporary File Systems. Special-Purpose File Systems. The Old Buffer Cache. Summary. Exercises. References.
10. Distributed File Systems. Introduction. General Characteristics of Distributed File Systems. Network File System (NFS). The Protocol Suite. NFS Implementation. UNIX Semantics. NFS Performance. Dedicated NFS Servers. NFS Security. NFS Version 3. Remote File Sharing (RFS). RFS Architecture. RFS Implementation. Client-Side Caching. The Andrew File System. AFS Implementation. AFS Shortcomings. The DCE Distributed File System (DCE DFS). Summary. Exercises. References.
11. Advanced File Systems. Introduction. Limitations of Traditional File Systems. File System Clustering (Sun-FFS). The Journaling Approach. Log-Structured File Systems. The 4.4BSD Log-Structured File System. Metadata Logging. The Episode File System. Watchdogs. The 4.4BSD Portal File System. Stackable File System Layers. The 4.4BSD File System Interface. Summary. Exercises. References.
12. Kernel Memory Allocation. Introduction. Functional Requirements. Resource Map Allocator. Simple Power-of-Two Free Lists. The McKusick-Karels Allocator. The Buddy System. The SVR4 Lazy Buddy Algorithm. The MACH-OSF/1 Zone Allocator. A Hierarchical Allocator for Multiprocessors. The Solaris 2.4 Slab Allocator. Summary. Exercises. References.
13. Virtual Memory. Introduction. Demand Paging. Hardware Requirements. 4.3BSD - A Case Study. 4.3BSD Memory Management Operations. Analysis. Exercises. References.
14. The SVR4 VM Architecture. Motivation. Memory-Mapped Files. VM Design Principles. Fundamental Abstractions. Segment Drivers. The Swap Layer. VM Operations. Interaction with the Vnode Subsystem. Virtual Swap Space in Solaris. Analysis. Performance Improvements. Summary. Exercises. References.
15. More Memory Management Topics. Introduction. MACH Memory Management Design. Memory Sharing Facilities. Memory Objects and Pagers. External and Internal Pagers. Page Replacement. Analysis. Memory Management in 4.4BSD. Translation Lookaside Buffer (TLB) Consistency. TLB Shootdown in MACH. TLB Consistency in SVR4 and SVR4.2 UNIX. Other TLB Consistency Algorithms. Virtually Addressed Caches. Exercises. References.
16. Device Drivers and I/O. Introduction. Overview. Device Driver Framework. The I/O Subsystem. The poll System Call. Block I/O. The DDI/DKI Specification. Newer SVR4 Releases. Future Directions. Summary. Exercises. References.
17. Streams. Motivation. Overview. Messages and Queues. Stream I/O. Configuration and Setup. STREAMS ioctls. Memory Allocation. Multiplexing. FIFOs and Pipes. Networking Interfaces. Summary. Exercises. References.
Index.

From Barnes & Noble

Fatbrain Review

"There are more flavors of UNIX than most brands of ice cream." Thus begins the herculean task of elucidating and describing the design and implementation of the OS itself: the internals of SVR4, 4.4BSD, and OSF/1. Discusses SunOS, Solaris, DEC OSF, and HP-UX, but not AIX. Explains the process and kernel, threads, job control, scheduling, interprocess communication, multiprocessor synchronization, file systems (distributed and advanced), memory management, architecture, device drivers, STREAMS, and I/O. Good reference sections, clean exposition; assumes familiarity with UNIX. A recommended companion volume is The Magic Garden Explained: The Internals of UNIX System V Release 4 by Goodheart, a very good exposition of UNIX SVR4 internals only.