					  Linux: Ch. 6.7.3; Ch 20.1-20.9
• Multi-user, multi-tasking, UNIX system
• Goals
   – Speed, efficiency
   – Standardization (e.g., POSIX)
• Components
   – Kernel
      • E.g., virtual memory, process management
   – Libraries
      • Direct system calls to interact with kernel
      • Functions using system calls (e.g., buffered I/O; printf)
      • Non-system call tools
   – System utilities
      • E.g., network daemons (servers)
• Only processes in user mode can be pre-empted
   – e.g., if a higher priority process becomes runnable
     while a system call is executing, it will not preempt
     the running process during that system call
• Process scheduling
   – Time sharing
      • Credit & priority based scheduling
      • Process with most credits allocated CPU
          – Process with zero credits loses CPU
      • Credits modified
          – On timer interrupt
              » Current process loses one credit
          – When no runnable process has credits left, perform recrediting
              » credits = credits/2 + priority
   – Real-time
      • FCFS or RR
      • Schedules first by priority, then by recent waiting time
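The time-sharing rules above can be sketched in a few lines of C. This is a minimal illustration, not kernel code: `struct task`, `timer_tick`, and `pick_next` are hypothetical names, and the sketch assumes recrediting is triggered when no runnable task has any credits left.

```c
#include <stddef.h>

/* Hypothetical, simplified task record: not the kernel's task_struct. */
struct task {
    int credits;   /* remaining time-slice credits */
    int priority;  /* added back at every recrediting */
    int runnable;  /* non-zero if the task can run */
};

/* Recrediting rule from the slide: credits = credits/2 + priority. */
int recredit(int credits, int priority)
{
    return credits / 2 + priority;
}

/* Timer interrupt: the running task loses one credit (never below zero). */
void timer_tick(struct task *current)
{
    if (current->credits > 0)
        current->credits--;
}

/*
 * Return the index of the runnable task with the most credits, or -1 if no
 * runnable task has credits left. In that case every task is recredited
 * and the caller is expected to call pick_next() again.
 */
int pick_next(struct task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].runnable && tasks[i].credits > 0 &&
            (best < 0 || tasks[i].credits > tasks[best].credits))
            best = (int)i;
    }
    if (best < 0) {
        for (size_t i = 0; i < n; i++)
            tasks[i].credits = recredit(tasks[i].credits, tasks[i].priority);
    }
    return best;
}
```

Because recrediting keeps half of the old credits, a task that was blocked (and so kept its credits) ends up with more credits than one that used its whole time slice.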
               Linux Kernel
• Single “monolithic” address space
  – All code and data for kernel in same address
    space (not message passing based)
     • Includes: scheduling, virtual memory management,
       device drivers, file systems, networking
• Kernel modules can be dynamically loaded and unloaded
     Dynamically Loaded Code
• Normally, a program has all references to data and
  functions resolved
• Dynamically loadable code can have unresolved references
   – E.g., your program refers to a function named Foobar,
     but the function is not defined in your program code
   – This code is not designed to run independently
• Process of loading dynamic code (or libraries)
   – File of code is loaded into memory
   – Run-time linker attempts to resolve unresolved
     references using a run time symbol table
   – Run time symbol table is defined by the program
     loading the shared library
      • Run time link errors if some references unresolved
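As a toy model of the resolution step, the sketch below keeps a run-time symbol table as an array of name/address pairs and resolves a list of needed names against it. It is an assumption-laden illustration, not how a real run-time linker is implemented: `struct symbol`, `resolve`, `link_all`, and `host_symbols` are made-up names, and every symbol is assumed to be a function taking and returning `int`.

```c
#include <stddef.h>
#include <string.h>

typedef int (*func_ptr)(int);

/* One entry in a toy run-time symbol table: a name and its address. */
struct symbol {
    const char *name;
    func_ptr    addr;
};

/* Function exported by the loading program; the body is illustrative. */
int foobar(int x) { return x + 1; }

/* Symbol table defined by the program that loads the dynamic code. */
const struct symbol host_symbols[] = {
    { "Foobar", foobar },
};

/* Look up one unresolved name; NULL models a run-time link error. */
func_ptr resolve(const struct symbol *table, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].addr;
    return NULL;
}

/* Resolve every needed name; returns how many are left unresolved. */
size_t link_all(const struct symbol *table, size_t n,
                const char *const *needed, func_ptr *out, size_t count)
{
    size_t unresolved = 0;
    for (size_t i = 0; i < count; i++) {
        out[i] = resolve(table, n, needed[i]);
        if (out[i] == NULL)
            unresolved++;
    }
    return unresolved;
}
```

A non-zero result from `link_all` corresponds to the "run time link errors" case: the loaded code refers to a name the loading program never defined.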
               Kernel Modules
• Modules are arbitrary sections of kernel code
• Run in kernel (privileged) mode
• Typically: device driver, file system access,
  networking protocol
• Explicit & implicit loading/unloading
• Implicit:
   – When a process requires use of a module (e.g., device
     driver), “module requestor” can load
   – When a module has not been used in a sufficiently long
     time, it can be automatically unloaded
• Startup & cleanup routines
   – Startup: register services, reserve resources
   – Cleanup: called before unloading
• Kernel: service & h/w resource tables
           Process Management
• fork, exec (heavy weight), clone (threads)
• Process identity
   – PID, credentials (user ID, group ID), personality (for
     emulation libraries)
• Process environment
   – Command line arguments; shell variables
• Context
   – Scheduling context (e.g., registers)
   – Accounting info, open file table, file system context,
     signal handler table, address space info
• Same PCB structure for all process types
   – Thread shares some of the data structures of its parent
   – Each PCB is just a series of pointers into kernel tables
              Process Control Blocks
[Figure: PCB1, PCB2, and PCB3 in memory. Each PCB has an address-space
field that points into the kernel's address-space table, which holds
entries for address space 1 and address space 2.]
      Interrupt Service Handlers
• In some cases, turning off interrupts is bad
   – The interrupt handler is long running
   – Multi-processor
• Linux splits interrupt handlers into two parts
   – Top half
       • Prioritized interrupt structure
       • Only interrupts with higher priority can interrupt
   – Bottom half (for longer portion of interrupt handlers)
       •   Runs with all interrupts enabled
       •   Scheduled after running top half of handler
       •   Synchronized kernel critical sections for running bottom halves
       •   Bottom halves run using simple scheduler
SMP: Symmetric Multi-Processing

• Linux 2.0
  – Only one processor at a time can be executing kernel code
  – Implemented with a single busy wait semaphore
    (spinlock, see p. 202 of text)
• Linux >= 2.2
  – Multiple spinlocks in kernel
  – Limited execution of kernel code by more than one
    processor
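A busy-wait lock of the kind mentioned above can be sketched in user space with C11 atomics. This is only an illustration of the idea: the `spinlock` type and the function names are hypothetical and do not match the kernel's own spinlock implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space spinlock; the kernel's spinlock_t differs. */
typedef struct {
    atomic_flag locked;
} spinlock;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once: returns true if the lock was free and is now held. */
bool spin_trylock(spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait (spin) until the lock is acquired. */
void spin_lock(spinlock *l)
{
    while (!spin_trylock(l))
        ;  /* keep testing the flag instead of sleeping */
}

/* Release the lock so another processor's spin loop can succeed. */
void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

With one such lock guarding all kernel entry, only one processor can be in the kernel at a time, as in the Linux 2.0 description. Using several locks, each guarding a separate data structure, is what allows more than one processor to run kernel code concurrently.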
IPC: Interprocess Communication
• Signal
   – Asynchronous events (not data); used to inform a
     process that an event has occurred
   – Sent by one user process to another or from kernel to a
     user process (e.g., to inform when a child dies)
   – Only used by user process (kernel uses wait queues)
• Semaphore
   – Can be used between heavy weight processes
• Pipe: communication channel from parent to child
• Sockets
• Shared memory
   – Between light weight processes (in same HWP)
   – Between heavy weight processes
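The pipe case can be illustrated with the standard `pipe`, `fork`, `read`, and `write` calls. The helper `pipe_to_child` is a hypothetical name. As a shortcut for this sketch, the child reports how many bytes it received through its exit status, so messages must be shorter than 256 bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Send msg from the parent to a forked child through a pipe. The child
 * returns the number of bytes it read as its exit status. Returns that
 * count, or -1 on error. Intended for messages under 256 bytes.
 */
int pipe_to_child(const char *msg)
{
    int fd[2];                       /* fd[0]: read end, fd[1]: write end */
    if (pipe(fd) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                  /* child: uses the read end only */
        char buf[256];
        ssize_t total = 0, n;
        close(fd[1]);
        while (total < (ssize_t)sizeof buf - 1 &&
               (n = read(fd[0], buf + total,
                         sizeof buf - 1 - (size_t)total)) > 0)
            total += n;
        _exit((int)total);
    }

    close(fd[0]);                    /* parent: uses the write end only */
    size_t len = strlen(msg);
    ssize_t written = write(fd[1], msg, len);
    close(fd[1]);                    /* child sees end-of-file */

    int status;
    if (waitpid(pid, &status, 0) == -1 || written != (ssize_t)len ||
        !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Each side closes the end it does not use. Closing the write end in the parent is what lets the child's `read` loop terminate.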
        Ch. 20.6: Linux Memory
• Memory allocation
  – Frames allocated by page allocator
     • Real memory allocated in ‘regions’ (sets or sequences of
       frames)
  – Static allocation
     • Device drivers during boot time
  – Dynamic
     • User processes; each consists of a series of regions
     • Kernel functions
• Virtual memory
  – LRU-type page replacement policy
  – copy-on-write mechanisms (e.g., fork)
  – Reference counts for frames
   Execution & Loading of User Programs
• Loading binary, executable files
   – Multiple file formats
   – e.g., “a.out” format, ELF format
   – Handled by multiple loader routines in kernel
• ELF file format
   – Header
   – Several page aligned sections
• Loader maps sections of ELF file to separate
  regions of virtual memory
• Region
   – Continuous sequence of pages of address space of process
   – No overlap with other regions
• Address space of a process is a series of regions
   – E.g., one region for program text (code)
        • Another region for the data
        • Another region for the stack
• Information per region (vm_area_struct)
   –   Read, write, execute permissions for process
   –   Any files associated with region
   –   Table of function pointers for page management functions
   –   Region type
• Region index structure (per process)
   – Allows lookups of region by virtual address
   – Balanced binary tree
fork System Call Implementation
• Creates a new child process
   – child receives copies of the parent region descriptors
     and page tables
• Reference count of each resident page in parent is
  incremented
   – parent and child now share same frames of memory
• Private regions: local writable data in parent
   – Both parent & child page table entries set to read only
      • And marked for copy-on-write
   – If either process tries to modify a copy-on-write page
      • Reference count of frame checked
      • Page still shared? If yes, then:
          –   Copy to new frame
          –   Decrement reference count of source frame
          –   Unmark copy-on-write of page in writing process
          –   Set page to read/write in writing process
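The copy-on-write machinery is invisible to user programs, but its effect can be observed: a write in the child does not change the parent's data. The sketch below demonstrates only that observable behaviour. `fork_private_write` is a hypothetical helper, and the child's value is passed back through its exit status, so it is limited to 0..255.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that overwrites a private variable. Stores the value the
 * child saw after its write in *child_value and returns the parent's value
 * afterwards, or -1 on error.
 */
int fork_private_write(int initial, int *child_value)
{
    volatile int data = initial;   /* lives in a private, writable region */

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        data = data + 1;           /* this write changes the child's copy only */
        _exit(data & 0xff);        /* report the child's value (0..255) */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    *child_value = WEXITSTATUS(status);
    return data;                   /* parent's copy is unchanged */
}
```

For example, calling `fork_private_write(41, &v)` is expected to return 41 in the parent while `v` receives 42 from the child.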
             Frame Allocation
• Frames/pages: 4096 bytes (4K)
• Buddy heap
  – Used for allocation of contiguous regions
  – Emphasis on contiguous regions because
     • a) DMA requires contiguous frames (doesn’t use MMU)
     • b) efficiency in TLB usage
         – reduces changes to page tables, which reduces memory access
           time by reducing flushing of TLBs
• Buddy Heap
  – Free frames are maintained on lists of 1, 2, 4, 8, 16, 32,
    64, 128, 256, 512 frames
  – Release operation tries to iteratively merge freed blocks with
    their buddies into double sized blocks
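The arithmetic behind a buddy heap is short enough to sketch. The helpers below are hypothetical and assume the conventional buddy-system layout, which the slide does not spell out: a block of 2^order frames starts at a frame index that is a multiple of 2^order, so its buddy is found by flipping one bit of that index.

```c
#include <stddef.h>

#define MAX_ORDER 9   /* free lists for 2^0 .. 2^9 = 1 .. 512 frames */

/* Smallest order whose block (2^order frames) holds n frames; -1 if none. */
int order_for(size_t n)
{
    if (n == 0)
        return -1;
    for (int order = 0; order <= MAX_ORDER; order++)
        if (((size_t)1 << order) >= n)
            return order;
    return -1;
}

/* First frame of the buddy of the block that starts at `frame`. */
size_t buddy_of(size_t frame, int order)
{
    return frame ^ ((size_t)1 << order);
}

/* First frame of the double-sized block formed by merging with the buddy. */
size_t merged_start(size_t frame, int order)
{
    return frame & ~((size_t)1 << order);
}
```

On release, an allocator following this scheme would check whether `buddy_of(frame, order)` is also free, merge the pair into a block of order + 1 starting at `merged_start(frame, order)`, and repeat until the buddy is in use or `MAX_ORDER` is reached.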
         Swapping & Paging
• Linux does not implement whole-process swapping
• Relies exclusively on paging
• Policy algorithm
  – Which pages to write to disk
  – When to write to disk
• Paging mechanism
  – Transfer of frame data to/from disk
                      Swap Out
• Implements page replacement policy
• Form of clock algorithm (LRU approximation)
• Successively scan process list
   – If all have 0 in swap_cnt, resets this field in all
     processes to number of frames for process
   – Otherwise, selects process with largest swap_cnt field
     (most frames)
   – if this process cannot be victimized, then sets swap_cnt
     field to 0, continues to next process
   – Pages that have Accessed flag set are not victimized on
     this pass; Accessed flag is cleared
• See Chapter 16, Bovet & Cesati (2001).
  Understanding the Linux Kernel. O’Reilly.
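The Accessed-flag rule in the swap-out scan is the core of the clock (second-chance) algorithm. The sketch below shows just that rule over an array of pages. It leaves out the per-process `swap_cnt` selection, and `struct page` and `clock_pick_victim` are hypothetical names, not kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page record; only the Accessed flag matters here. */
struct page {
    bool accessed;   /* set by hardware when the page is referenced */
};

/*
 * Advance the clock hand until a page with a clear Accessed flag is found.
 * Pages with the flag set are not victimized on this pass; their flag is
 * cleared instead. Returns the victim's index and leaves *hand just past
 * it. n must be greater than zero.
 */
size_t clock_pick_victim(struct page *pages, size_t n, size_t *hand)
{
    for (;;) {
        size_t i = *hand;
        *hand = (*hand + 1) % n;
        if (!pages[i].accessed)
            return i;
        pages[i].accessed = false;   /* give the page a second chance */
    }
}
```

The loop always terminates: after at most one full sweep every flag has been cleared, so the next page examined is chosen.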
  Memory Allocation “Priority”
• “atomic”
  – Interrupt routines
  – Request satisfied (memory available) or fails
    immediately if no memory is available
  – Memory is to be used for Direct Memory Access
  – Copying data from a device to memory (e.g., a block of
    disk data from disk device)
  – On some computer architectures, not all frames of
    memory can be used for DMA
• User process
  – Stall (block) if insufficient memory is available
• Memory allocator used by kernel routines (e.g.,
  interrupt handlers)
• Allocates variable amounts of memory
• Analogous to malloc
  void *malloc(size_t size); // allocate size bytes of memory
• acquires entire pages, and then splits into smaller
  pieces
• Allocates until explicitly freed; pages are locked
  and cannot be victimized
     Buffer cache, Page cache,
     Virtual-memory system
• Buffer cache
  – Kernel’s main cache for block-oriented devices (e.g.,
    disk drives)
  – Caches pages of disk contents
• Page cache
  – Cache for disk- and network-based file systems
  – Caches pages of file contents; each page is for a
    particular offset into file
• Virtual-memory system
  – Manages the contents of each process's address space
     •   Creates virtual-memory pages
     •   Region management: process has regions containing pages
     •   Page table management
     •   Manages loading from disk/writing back out to disk
      Region Types Defined By
• Backing store
   – “demand zero”
      • When process first tries to access (read or write) page in
        region, a frame is allocated, entered into page table, and it is
        initialized with zeros
   – File backing
      • virtual-memory page is viewport onto page of the file contents
      • When process first tries to access page in region
          – Page table filled with address of page in kernel page cache
          – Each page in kernel page cache for file is for particular offset
            into file
          – Same frame used by page cache and process page tables
• Reaction to writes
   – Private: copy-on-write
      • First write causes a new frame to be allocated, and contents
        copied prior to write
   – Shared: frame is updated
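Both properties can be observed from user space with `mmap`. In the sketch below, an anonymous mapping plays the role of a demand-zero region, and the `MAP_PRIVATE` / `MAP_SHARED` flag selects the reaction to writes. `byte_after_child_write` is a hypothetical helper, and `MAP_ANONYMOUS` is assumed to be available, as it is on Linux.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous (demand-zero) page with the given sharing flag
 * (MAP_PRIVATE or MAP_SHARED), let a forked child write 'X' into it, and
 * return what the parent sees in that byte afterwards. Returns -1 on error.
 */
int byte_after_child_write(int sharing)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            sharing | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        p[0] = 'X';         /* private: copy-on-write; shared: frame updated */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = p[0];      /* 0 if the page was private, 'X' if shared */

    munmap(p, len);
    return result;
}
```

With `MAP_PRIVATE` the parent still reads the initial zero, because the child's write went to a new frame. With `MAP_SHARED` the parent reads the child's `'X'`.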
                    Page Tables
• Current location of page of virtual memory
   –   Disk (actual disk location is per region)
   –   Physical memory frame
   –   Read only?
   –   Copy-on-write?
   –   Accessed flag (set by hardware when page accessed)
• When a page access is made to a page that is not
  in a frame
   – Region lookup is performed for the process using the
     region index
   – If needed (e.g., file backing region), the file is obtained
     for this page of the region
   – Region page-management functions are called
   – Physical frame is allocated
   – Page table updated
Chapter 20.7: Linux File System
• Broad concept of files: anything capable of
  handling the input or output of a stream of data
   – Regular files & directories on disk, devices, IPC,
     network connections
   – Also: Process, kernel, & device info (proc file system)
• VFS (Virtual File System) Objects
   – inode: file as a whole
      • Identified by a number pair: file system, inode number
   – file: read/write point in open file; refers to inode
   – file-system: collection of files in a directory hierarchy
      • Gives access to inodes
• Generic methods, independent of underlying
  file system
   – E.g., read, write
     Linux File System Formats
• ext2fs: “second extended file system”
   – Scheduling: clusters physically adjacent I/O requests
   – Tries to allocate logically adjacent blocks of a file into
     physically adjacent blocks on disk
   – Block group: a contiguous block sequence
      • Disk file system partitioned into multiple block groups
   – Block allocation
      • Keep related information (disk blocks) in the same block group
           – File: same block group as inode
           – Nondirectory inode: same block group as parent inode
           – Directory inode: dispersed to other block group
      • Bit map of free blocks in a block group
• ext3fs
   – Journaling file system
                proc file system
• Enables process access to system information
   – Via normal file system calls (e.g., read)
• The following are represented as files
   – One directory per process (see next slide)
   – Information about kernel & loaded device drivers
• E.g., “ps” process status command reads files in
  /proc
          Linux 7.1 Example
Script started on Wed Apr 17 12:54:57 2002
[cprince@rattus /proc]$ ls
1      11912 12165 30583 5359 833            9029   9799            iomem      partitions
1066   11915 12166 30587 539      8411       9030   9885            ioports    pci
1067   11916 1218    30588 6      8412       932    bus             irq         scsi
1068   1192   12185 30589 605     8415       941    cmdline         kcore      self
1070   11950 1219    307    610   8416       942    cpuinfo         kmsg       slabinfo
1078   11951 1222    308    624   848        943    devices         ksyms      stat
1082   1197   1223   309    639   860        944    dma             loadavg    swaps
11626 1199    1225   310    7     888        945    dri             locks      sys
11627 1200    2      312    7314 8921        946    driver          mdstat     sysvipc
1178   1202   27097 32235 7315 8922          9788   execdomains     meminfo    tty
1180   1204   27098 4       760   8925       9792   fb               misc       uptime
11805 1210    28015 4374    77    8926       9793   filesystems     modules    version
11806 1214    28016 4377    773   8929       9794   fs               mounts
1182   1216   3      5      787   8930       9797   ide             mtrr
11832 12164 30049 5358      8147 896         9798   interrupts      net
[cprince@rattus /proc]$ ls 1
cmdline cpu cwd environ exe fd maps          mem      root   stat   statm     status
[cprince@rattus /proc]$ exit
Script done on Wed Apr 17 12:55:29 2002
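Because /proc entries behave like ordinary files, they can be read with ordinary I/O calls. The helper below is a small sketch using only standard C I/O. `read_first_line` is a hypothetical name, and nothing in it is specific to /proc.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a file into buf (without the trailing newline).
 * Returns the line length, or -1 if the file cannot be opened or read.
 */
int read_first_line(const char *path, char *buf, size_t buflen)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *ok = fgets(buf, (int)buflen, f);
    fclose(f);
    if (ok == NULL)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return (int)strlen(buf);
}
```

On a Linux system with /proc mounted, `read_first_line("/proc/version", buf, sizeof buf)` would be expected to fill `buf` with the kernel version string. The exact text depends on the machine.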
  Chapter 20.8: Linux Input/Output:
  Block devices (e.g., hard disk, CD-ROM)
• Block buffer cache
   – For active and completed I/O
   – Data structure per buffer: buffer_head:
       • device, offset within block device, size of buffer
       • lock, dirty time
• Request manager
   – Request: list of buffer_heads describing I/O to be performed on a
     single device, in a contiguous range of sectors
   – Separate list of requests for each block device driver
   – C-SCAN scheduling
   – Attempts to merge requests in per-device lists
• Some subsystems handle I/O somewhat differently
   – e.g., virtual memory system accessing swap device
   – Still go through request manager
   – Use buffer_heads to label a page of memory for active I/O only
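C-SCAN services requests in one direction only: sectors at or beyond the current head position in ascending order, then a wrap to the lowest pending sector. The sketch below orders a plain array of sector numbers that way. It is a simplified illustration with hypothetical names and does not model the kernel's request lists or merging.

```c
#include <stdlib.h>

static unsigned long cscan_head;   /* head position used by the comparator */

/* Sectors at or beyond the head come first; each group is ascending. */
static int cscan_cmp(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    int x_wrapped = x < cscan_head;    /* served only after the wrap */
    int y_wrapped = y < cscan_head;

    if (x_wrapped != y_wrapped)
        return x_wrapped - y_wrapped;
    return (x > y) - (x < y);
}

/* Sort pending request sectors into C-SCAN service order. */
void cscan_order(unsigned long *sectors, size_t n, unsigned long head)
{
    cscan_head = head;   /* file-scope state: not safe for concurrent use */
    qsort(sectors, n, sizeof sectors[0], cscan_cmp);
}
```

With the head at sector 53 and pending sectors 98, 183, 37, 122, 14, 124, 65, 67, the resulting order is 65, 67, 98, 122, 124, 183, 14, 37.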
Chapter 20.8: Linux Input/Output (2)
• Character devices
   – Terminal devices are handled specially
      • for other character devices, I/O operation just passed to device
   – tty discipline
      • Buffering & flow control on the data stream for the device
      • Manages connecting standard input/output streams to running
        processes
          – Different processes can be obtaining input/output to the terminal
            over time
• Network devices: Data transferred through
  networking subsystem of kernel
Chapter 20.9: Linux Interprocess Communication
• Signals
    void (*signal(int signum, void (*sighandler)(int)))(int);
        • Establish a signal handler
    int pause(void); // wait for a signal
    int kill(pid_t pid, int sig); // send a signal
    unsigned int alarm(unsigned int seconds);
    int setitimer(int which, const struct itimerval *value,
              struct itimerval *ovalue); // microsecond timer
•   Semaphores
•   Pipe
•   Sockets
•   Shared memory
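The timer calls listed under Signals can be combined with `signal` and `pause` as in the following sketch. `wait_for_timer` and `on_alarm` are hypothetical names. A repeating interval is used, as a design choice for this example, so that the process cannot sleep forever if a tick arrives just before `pause` is called.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_fired;

/* Handler: only sets a flag, which is safe to do inside a signal handler. */
static void on_alarm(int sig)
{
    (void)sig;
    alarm_fired = 1;
}

/*
 * Sleep until a SIGALRM from a repeating interval timer arrives.
 * usec must be in the range 1..999999. Returns 0 on success, -1 on error.
 */
int wait_for_timer(long usec)
{
    struct itimerval on = {
        .it_interval = { .tv_sec = 0, .tv_usec = usec },  /* repeat period */
        .it_value    = { .tv_sec = 0, .tv_usec = usec },  /* first expiry */
    };
    struct itimerval off = {
        .it_interval = { .tv_sec = 0, .tv_usec = 0 },
        .it_value    = { .tv_sec = 0, .tv_usec = 0 },
    };

    alarm_fired = 0;
    if (signal(SIGALRM, on_alarm) == SIG_ERR)
        return -1;
    if (setitimer(ITIMER_REAL, &on, NULL) == -1)
        return -1;

    while (!alarm_fired)
        pause();                         /* returns after a handler has run */

    setitimer(ITIMER_REAL, &off, NULL);  /* disarm the timer */
    signal(SIGALRM, SIG_DFL);
    return 0;
}
```

For example, `wait_for_timer(10000)` blocks for roughly 10 milliseconds. `alarm` offers the same mechanism at one-second granularity.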
    Signal Example: Receiving Process
/* signal.c starts */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Each handler reports which signal was delivered. */
void handleSIGINT(int sig)
{
    printf("received SIGINT (sig= %d)\n", sig);
}

void handleSIGQUIT(int sig)
{
    printf("received SIGQUIT (sig= %d)\n", sig);
}

void handleSIGUSR1(int sig)
{
    printf("received SIGUSR1 (sig= %d)\n", sig);
}

int main() {
    printf("pid of handler process is: %d\n", getpid());
    signal(SIGINT, handleSIGINT);
    signal(SIGQUIT, handleSIGQUIT);
    signal(SIGUSR1, handleSIGUSR1);

    for (;;) {
        pause();   /* sleep until a signal arrives */
    }
}
/* signal.c ends */
  Signal Example: Signals from
         the keyboard
Script started on Tue Apr 30 12:55:13 2002
[cprince@rattus cs5631]$ ./signal
pid of handler process is: 2517
received SIGINT (sig= 2)            (^C was typed)
received SIGQUIT (sig= 3)           (^\ was typed)
    Signal Example: Signals from
         another process - 1
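#include <signal.h>
#include <stdio.h>
int main() {
    int pid;

    printf("Enter pid of process to send signal to: ");
    if (scanf("%d", &pid) != 1)
        return 1;               /* input was not a number */

    /* Send SIGUSR1 to the process the user named. */
    if (kill(pid, SIGUSR1) == -1) {
        perror("kill");
        return 1;
    }
    return 0;
}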
  Signal Example: Signals from
       another process - 2
[cprince@rattus cs5631]$ ./signal &
pid of handler process is: 2700
[cprince@rattus cs5631]$ ./send
Enter pid of process to send signal to: 2700
received SIGUSR1 (sig= 10)