									                         SCSI Mid-layer

Eric Youngdale

   2nd Annual Linux Storage Management Workshop
   October 2000

Main points of this talk:
  –   Historical evolution of Linux SCSI.
  –   Explain state of the art in Linux 2.2.
  –   Discuss changes for 2.4.
  –   Discuss pending changes in the 2.5 kernel.
             Block devices and Linux
• Linux has a generic block device layer with
  which all filesystems will interact.
• SCSI is no different in this regard – it
  registers itself with the block device layer
  so it can receive requests.
• SCSI also handles character device requests
  and ioctls that do not originate in the block
  device layer.
          What is the “Mid-Layer”?

• Linux SCSI support can be viewed as 3 levels.
• Upper level is device management, such as
  tape, cdrom, disk, etc.
• Lower level talks to host adapters.
• Middle layer is essentially a traffic cop,
  handling requests from the rest of the kernel
  and dispatching them to the rest of SCSI.
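Here is a minimal user-space sketch of how the three levels divide the work. Every name in it (`upper_driver`, `host_adapter`, `midlayer_dispatch`, and so on) is hypothetical and much simpler than the real kernel structures. It only assumes that an upper-level driver turns a request into a SCSI command (here a READ(10) CDB) and that a lower-level driver hands that command to an adapter.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily simplified model of the three SCSI levels. */

struct sketch_cmd {
    unsigned char cdb[16];   /* SCSI command descriptor block */
    int len;                 /* number of CDB bytes in use */
};

/* Upper level: knows the device class (disk, tape, cdrom, ...) and
 * turns a generic request into a SCSI command. */
struct upper_driver {
    const char *name;
    void (*build_cmd)(unsigned long sector, struct sketch_cmd *cmd);
};

/* Lower level: knows how to talk to one kind of host adapter. */
struct host_adapter {
    const char *name;
    int (*queuecommand)(const struct sketch_cmd *cmd);
};

/* Middle layer: the "traffic cop" that routes a request through the
 * upper level and hands the resulting command to the host adapter. */
static int midlayer_dispatch(const struct upper_driver *upper,
                             const struct host_adapter *host,
                             unsigned long sector)
{
    struct sketch_cmd cmd;

    assert(upper != NULL && host != NULL);
    memset(&cmd, 0, sizeof cmd);
    upper->build_cmd(sector, &cmd);
    return host->queuecommand(&cmd);
}

/* Example upper level: a disk READ(10) of one block at a 32-bit LBA. */
static void disk_build_cmd(unsigned long sector, struct sketch_cmd *cmd)
{
    cmd->cdb[0] = 0x28;                  /* READ(10) opcode */
    cmd->cdb[2] = (sector >> 24) & 0xff; /* LBA, big-endian */
    cmd->cdb[3] = (sector >> 16) & 0xff;
    cmd->cdb[4] = (sector >> 8) & 0xff;
    cmd->cdb[5] = sector & 0xff;
    cmd->cdb[8] = 1;                     /* transfer length: 1 block */
    cmd->len = 10;
}

/* Example lower level: a fake adapter that just reports the opcode. */
static int fake_queuecommand(const struct sketch_cmd *cmd)
{
    return cmd->cdb[0];
}
```

With `struct upper_driver sd = { "disk", disk_build_cmd };` and `struct host_adapter h = { "fake", fake_queuecommand };`, calling `midlayer_dispatch(&sd, &h, 1234)` returns `0x28`, the READ(10) opcode as seen by the fake adapter.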
        State of the art in Linux-2.2
• Error handling is better for drivers that
  use the new error handling code, which was
  introduced in 2.2.
• Queue management fundamentally
  unchanged since the Linux 1.x days. “The
  Code that Time Forgot”. Lots of dinosaurs
  running around in the code.
• Rest of mid-level largely stagnant.
              What was wrong in 2.2?
• The elevator algorithms in 2.2 allowed requests to
  grow regardless of the capabilities of the
  underlying device.
• All SCSI disks were handled in a single queue.
• Disk driver had to split requests that had become
  too large.
• No common logic for verifying that requests had
  not become too large.
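The splitting that the 2.2 disk driver had to do can be pictured with a small sketch: a request that the elevator has already merged is cut into pieces no larger than the host's limit. `split_request`, `struct chunk`, and the `max_sectors` limit are hypothetical names, not the 2.2 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: the request has already grown past what the
 * host adapter can take, so it is carved into host-sized pieces. */
struct chunk {
    unsigned long sector;   /* first sector of this piece */
    unsigned long nsect;    /* number of sectors in this piece */
};

/* Split [sector, sector + nsect) into pieces of at most max_sectors.
 * Returns the number of pieces written to out (at most out_len). */
static size_t split_request(unsigned long sector, unsigned long nsect,
                            unsigned long max_sectors,
                            struct chunk *out, size_t out_len)
{
    size_t n = 0;

    assert(max_sectors > 0);
    while (nsect > 0 && n < out_len) {
        unsigned long this_len = nsect < max_sectors ? nsect : max_sectors;

        out[n].sector = sector;
        out[n].nsect = this_len;
        sector += this_len;
        nsect -= this_len;
        n++;
    }
    return n;
}
```

For a 300-sector request starting at sector 100 and a 128-sector limit, it produces pieces of 128, 128 and 44 sectors. In 2.4 this work disappears because the request is never allowed to grow that large in the first place.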
     What was wrong in 2.2 (cont)

• Character device requests not placed in the queue.
• SMP safety was clumsily handled, leading
  to race conditions and poor performance.
• Poor scalability.
• Many drivers continue to use old error
  handling code.
             Queue handling in 2.2

[Figure: a single “Disk Queue Head” shared by all SCSI disks, starting with Disk1]

               Changes for Linux-2.4
• Block device layer was generalized to
  support a “request_queue_t” abstract
  datatype that represents a queue.
• Contains function pointers that drivers can
  use for managing the size of requests
  inserted into queues.
• Requests can no longer grow too large
  to be handled at one time.
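A minimal sketch of the idea, assuming a simplified queue type whose driver-supplied hook is consulted before any merge. `sketch_queue`, `merge_ok`, and `try_back_merge` are illustrative names; the real 2.4 `request_queue_t` has its own set of function pointers that this does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real 2.4 definitions. */
struct sketch_request {
    unsigned long sector;
    unsigned long nr_sectors;
};

struct sketch_queue {
    unsigned long max_sectors;  /* per-device limit supplied by the driver */
    /* Driver-supplied hook: may req absorb next? */
    bool (*merge_ok)(const struct sketch_queue *q,
                     const struct sketch_request *req,
                     const struct sketch_request *next);
};

/* One driver's policy: merge only adjacent requests whose combined
 * size still fits in a single command for this device. */
static bool size_limited_merge_ok(const struct sketch_queue *q,
                                  const struct sketch_request *req,
                                  const struct sketch_request *next)
{
    return req->sector + req->nr_sectors == next->sector &&
           req->nr_sectors + next->nr_sectors <= q->max_sectors;
}

/* The generic layer asks the queue's hook before merging, so a request
 * never grows past what the device can handle at one time. */
static bool try_back_merge(struct sketch_queue *q,
                           struct sketch_request *req,
                           const struct sketch_request *next)
{
    assert(q->merge_ok != NULL);
    if (!q->merge_ok(q, req, next))
        return false;
    req->nr_sectors += next->nr_sectors;
    return true;
}
```

With `max_sectors = 128`, merging a 64-sector request into an adjacent 64-sector one succeeds, but a further 8-sector merge is refused because the result would exceed the device limit.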
               Changes for 2.4 (cont)

• No longer any need for splitting requests.
• No need for ugly logic to scan a queue for a
  queueable request.
• SMP locking in mid-layer cleaned up to
  provide finer granularity.
               Changes for 2.4 (cont)
• A SCSI queuing library was created – a set
  of functions for queue management that are
  tailored to different sets of requirements.
• SCSI was modified to use a single queue for
  each physical device.
• Character device requests and ioctls are
  inserted into the same queue at the tail, and
  handled the same as other requests.
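One queue per device, with character-device requests and ioctls appended at the tail, can be sketched as below. The names (`device_queue`, `insert_block`, `queue_special`, `dispatch_next`) and the choice to keep block requests sorted only behind the most recently queued special request are assumptions made for illustration, not the 2.4 implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a single request queue per physical device. */
enum req_kind { REQ_BLOCK, REQ_SPECIAL };

struct qreq {
    enum req_kind kind;
    unsigned long sector;     /* meaningful for REQ_BLOCK only */
    struct qreq *next;
};

struct device_queue {
    struct qreq *head;
    struct qreq *tail;
};

/* Character-device requests and ioctls: always appended at the tail. */
static void queue_special(struct device_queue *q, struct qreq *r)
{
    assert(r->kind == REQ_SPECIAL);
    r->next = NULL;
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
}

/* Elevator-style insert for block I/O: keep block requests sorted by
 * sector, but never move one ahead of an already-queued special request. */
static void insert_block(struct device_queue *q, struct qreq *r)
{
    struct qreq **link = &q->head;
    struct qreq *p;

    for (p = q->head; p; p = p->next)
        if (p->kind == REQ_SPECIAL)
            link = &p->next;          /* start after the last special one */

    while (*link && (*link)->sector <= r->sector)
        link = &(*link)->next;

    r->next = *link;
    *link = r;
    if (r->next == NULL)
        q->tail = r;
}

/* The dispatcher treats every request the same way: take the head. */
static struct qreq *dispatch_next(struct device_queue *q)
{
    struct qreq *r = q->head;

    if (r) {
        q->head = r->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return r;
}
```

Because block requests are never sorted ahead of a queued special request, an ioctl runs after everything that was queued before it, and the dispatcher needs no separate code path for it.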
                                      Queuing library
Maintainability is a problem if multiple copies of
 code perform similar functions.
__inline static int
__scsi_merge_requests_fn(request_queue_t * q,
                   struct request * req, struct request * next,
                   int use_clustering,
                   int dma_host)
{
        /* Appropriate contents */
}
               Queuing library (cont)
#define MERGEREQFCT(_FUNCTION, _CLUSTER, _DMA)              \
static int _FUNCTION(request_queue_t * q,                   \
           struct request * req,                            \
           struct request * next)                           \
{                                                           \
    return __scsi_merge_requests_fn(q, req, next, _CLUSTER, _DMA); \
}

    MERGEREQFCT(scsi_merge_requests_fn_, 0, 0)
    MERGEREQFCT(scsi_merge_requests_fn_d, 0, 1)
    MERGEREQFCT(scsi_merge_requests_fn_c, 1, 0)
    MERGEREQFCT(scsi_merge_requests_fn_dc, 1, 1)
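The same write-once, specialise-by-macro pattern can be shown in a self-contained form. The merge rule below is a toy: `seg_info`, the segment limits, and all function names are hypothetical. Only the structure (one generic inline function, a macro that stamps out one wrapper per flag combination, and a table that picks a wrapper once) mirrors the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Toy limits, hypothetical. */
#define SKETCH_MAX_SEGS      64
#define SKETCH_DMA_MAX_SEGS  16

struct seg_info {
    unsigned long nr_segments;  /* scatter-gather segments in use */
    int starts_contiguous;      /* 1 if it begins where the previous one ends */
};

/* Generic logic, written once: may request b be appended to request a? */
static inline int sketch_merge_ok_generic(const struct seg_info *a,
                                          const struct seg_info *b,
                                          int use_clustering, int dma_host)
{
    unsigned long n, limit;

    assert(a != NULL && b != NULL);
    n = a->nr_segments + b->nr_segments;
    limit = dma_host ? SKETCH_DMA_MAX_SEGS : SKETCH_MAX_SEGS;
    /* With clustering, a physically contiguous boundary shares one segment. */
    if (use_clustering && b->starts_contiguous && n > 0)
        n--;
    return n <= limit;
}

/* Stamp out one wrapper per flag combination; each passes constants,
 * so the compiler can drop the branches that do not apply. */
#define SKETCH_MERGEFCT(fn, clustering, dma)                  \
static int fn(const struct seg_info *a,                       \
              const struct seg_info *b)                       \
{                                                             \
    return sketch_merge_ok_generic(a, b, clustering, dma);    \
}

SKETCH_MERGEFCT(sketch_merge_ok_,   0, 0)
SKETCH_MERGEFCT(sketch_merge_ok_d,  0, 1)
SKETCH_MERGEFCT(sketch_merge_ok_c,  1, 0)
SKETCH_MERGEFCT(sketch_merge_ok_dc, 1, 1)

typedef int (*sketch_merge_fn)(const struct seg_info *, const struct seg_info *);

/* Pick the variant once (e.g. when a host is registered) instead of
 * testing the flags on every merge. */
static sketch_merge_fn sketch_select_merge_fn(int use_clustering, int dma_host)
{
    static const sketch_merge_fn table[2][2] = {
        { sketch_merge_ok_,  sketch_merge_ok_d  },  /* no clustering */
        { sketch_merge_ok_c, sketch_merge_ok_dc },  /* clustering */
    };
    return table[use_clustering != 0][dma_host != 0];
}
```

Only one copy of the merge logic has to be maintained, yet the caller pays for the flag tests only once, when it selects the variant.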
               Changes for 2.4 (cont)

• In 2.2, there were separate functions and
  code paths for initializing SCSI when
  compiled into the kernel and when loaded
  as a module.
• In 2.4, this was cleaned up – redundant code
  was removed, and the same code is used to
  initialize SCSI both as a module and when
  compiled into the kernel.
          Upcoming changes for 2.5

• All drivers will be forced to use new error
  handling code.
• Disk driver will be updated to handle a larger
  number of disks.
• SMP locking will be cleaned up some more
  to improve scalability.
            Old error handling code
• Essentially a bad state machine.
• Has tons of SMP problems that are not
  easily fixed.
• Tries to resolve errors while allowing new
  requests to be queued.
• Many kernel reliability problems are
  caused by the old error handling code.
• Needs to be discarded in the worst way.
           New error handling code
• The new error handling code has been
  available since the 2.1.75 kernel.
• To force driver authors to update their
  drivers, the old error handling code will
  simply be removed. Drivers that have not
  been updated will fail to compile.
• Orphaned drivers will be handled on a case-
  by-case basis.
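The general shape of the new approach, which stops new commands first and then escalates through increasingly drastic recovery steps, is sketched below. The ordering of abort, then device reset, then bus reset, then host reset reflects the new code's general strategy, but `eh_ops`, `run_error_handler`, and the other names are hypothetical parts of a user-space model, not the kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of escalating error recovery. */
enum eh_result { EH_FAILED = 0, EH_SUCCESS = 1 };

struct eh_ops {
    enum eh_result (*abort_cmd)(void *ctx);
    enum eh_result (*device_reset)(void *ctx);
    enum eh_result (*bus_reset)(void *ctx);
    enum eh_result (*host_reset)(void *ctx);
};

struct host_state {
    int accepting_new_commands;
};

/* Quiesce the host, then try each step in order until one succeeds.
 * Returns the step that succeeded (1..4), or 0 if all failed. */
static int run_error_handler(struct host_state *host,
                             const struct eh_ops *ops, void *ctx)
{
    int i, result = 0;

    assert(host != NULL && ops != NULL);
    enum eh_result (*steps[4])(void *) = {
        ops->abort_cmd, ops->device_reset, ops->bus_reset, ops->host_reset
    };

    host->accepting_new_commands = 0;      /* no new requests during recovery */
    for (i = 0; i < 4; i++) {
        if (steps[i] != NULL && steps[i](ctx) == EH_SUCCESS) {
            result = i + 1;
            break;
        }
    }
    /* In this sketch the host stays quiesced if every step failed. */
    host->accepting_new_commands = (result != 0);
    return result;
}

/* Trivial demo callbacks. */
static enum eh_result eh_fail(void *ctx) { (void)ctx; return EH_FAILED; }
static enum eh_result eh_succeed(void *ctx) { (void)ctx; return EH_SUCCESS; }
```

The contrast with the old code is the first assignment: recovery runs with the host quiesced, rather than while new requests continue to be queued.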
               Further SMP cleanups
• All low-level drivers currently use
  io_request_lock for SMP safety.
• This lock is also used by all other block
  devices on the system to protect their
  request queues.
• Plans are in the works to switch the block
  device layer to use a per-queue lock,
  thereby isolating SCSI from other devices.
                SMP Cleanups (cont).

• Low-level drivers don’t need to protect
  queue – they don’t have access to it.
• Each low-level driver should have a
  separate lock – ideally one per instance of
  host, but could be a driver-wide lock
  initially. This should be up to the low-level
  driver.
                 SMP Cleanups (cont)

• Block device layer has a number of arrays,
  indexed by major/minor:

• Access is not protected by any locks.
• Impossible for block drivers to resize these
  arrays without introducing a race condition.
              Large numbers of disks
• Current disk driver allocates 8 majors,
  allowing for only 128 disks.
• Plans are in the works to allow disk driver
  to dynamically allocate major numbers.
• Would support up to about 4000 disks
  before major numbers are exhausted.
• Possible to go beyond this by using fewer
  bits for partitions.
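The disk counts follow from simple arithmetic. This sketch assumes 8-bit minor numbers (256 minors per major) and 16 minors per disk (the whole disk plus 15 partitions); the figure of 255 usable majors in the usage note is likewise an assumption. `max_disks` is a hypothetical helper.

```c
#include <assert.h>

/* Each block major provides 256 minor numbers. A disk uses
 * 2^partition_bits of them: one for the whole disk, the rest for
 * partitions. */
static unsigned long max_disks(unsigned long majors, unsigned partition_bits)
{
    unsigned long minors_per_major = 256;
    unsigned long minors_per_disk;

    assert(partition_bits <= 8);
    minors_per_disk = 1UL << partition_bits;
    return majors * (minors_per_major / minors_per_disk);
}
```

`max_disks(8, 4)` gives 128, matching the current driver; `max_disks(255, 4)` gives 4080, the "about 4000" figure; and dropping to 3 partition bits doubles that to 8160.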
                                 Wish list.

• Implement some SCSI-3 features (larger
  commands, sense buffers).
• Improve support for shared busses.
• Support target-mode.
• Check module add/remove code for SMP
  safety, implement locks.
• Improvements related to high-availability.
   The major goal of the rewrite of SCSI queuing has
been accomplished. A number of architectural
problems were resolved at the same time.
   There are still some interesting tasks to be
addressed for 2.5.
   See for more
info, and for
“todo” list.

The notes for this talk are on the website.
