Brocade Fabric

					                                     1 YEAR UPGRADE
                                     BUYER PROTECTION PLAN




Building SANs with Brocade Fabric Switches
How to Design, Implement, and Maintain Storage Area
Networks (SANs) with Brocade Fabric Switches
• Step-by-step instructions for establishing your SAN requirements—such as high
  availability, performance, and cost savings—and translating those requirements
  into an effective SAN design
• Detailed examples to guide you through the process of installing and
  troubleshooting your Brocade SAN
• Practical discussions about SAN components and popular SAN configurations
  such as storage consolidation, disaster tolerance, and LAN-free backup
                                                                         Chris Beauchamp Author
                                                                         Josh Judd Author
                                                                         Benjamin Kuo Contributor
solutions@syngress.com

With more than 1,500,000 copies of our MCSE, MCSD, CompTIA, and Cisco
study guides in print, we continue to look for ways we can better serve the
information needs of our readers. One way we do that is by listening.

Readers like yourself have been telling us they want an Internet-based service
that would extend and enhance the value of our books. Based on
reader feedback and our own strategic plan, we have created a Web site
that we hope will exceed your expectations.

Solutions@syngress.com is an interactive treasure trove of useful information
focusing on our book topics and related technologies. The site
offers the following features:
  •   One-year warranty against content obsolescence due to vendor
      product upgrades. You can access online updates for any affected
      chapters.
  •   “Ask the Author”™ customer query forms that enable you to post
      questions to our authors and editors.
  •   Exclusive monthly mailings in which our experts provide answers to
      reader queries and clear explanations of complex material.
  •   Regularly updated links to sites specially selected by our editors for
      readers desiring additional reliable information on key topics.

Best of all, the book you’re now holding is your key to this amazing site.
Just go to www.syngress.com/solutions, and keep this book handy when
you register to verify your purchase.

Thank you for giving us the opportunity to serve your needs. And be sure
to let us know if there’s anything else we can do to help you get the
maximum value from your investment. We’re listening.


www.syngress.com/solutions
Syngress Publishing, Inc., the author(s), and any person or firm involved in the writing, editing, or
production (collectively “Makers”) of this book (“the Work”) do not guarantee or warrant the results to
be obtained from the Work.
There is no guarantee of any kind, expressed or implied, regarding the Work or its contents. The Work is
sold AS IS and WITHOUT WARRANTY. You may have other legal rights, which vary from state to state.
In no event will Makers be liable to you for damages, including any loss of profits, lost savings, or other
incidental or consequential damages arising out from the Work or its contents. Because some states do not
allow the exclusion or limitation of liability for consequential or incidental damages, the above limitation
may not apply to you.
You should always use reasonable care, including backup and other appropriate precautions, when working
with computers, networks, data, and files.
Syngress Media®, Syngress®, and “Career Advancement Through Skill Enhancement®,” are registered
trademarks of Syngress Media, Inc. “Ask the Author™,” “Ask the Author UPDATE™,” “Mission Critical™,”
“Hack Proofing™,” and “The Only Way to Stop a Hacker is to Think Like One™” are trademarks of
Syngress Publishing, Inc. Brands and product names mentioned in this book are trademarks or service marks
of their respective companies. “Brocade®,” “SilkWorm®,” and the Brocade logo are registered trademarks of
Brocade Communications Systems, Inc., in the United States and/or any other countries.
KEY      SERIAL NUMBER
001      Q3G4T9U2F5
002      6YHQ94MLE4
003      VMERKJ6C4N
004      XD7Y4B39UN
005      8SRT9U6N7H
006      3W7YRNTEP4
007      LHB65TR46T
008      4DB9R5LZMR
009      N835M4KBAZ
010      QT6Y4RTWFC
PUBLISHED BY
Syngress Publishing, Inc.
800 Hingham Street
Rockland, MA 02370
Building SANs with Brocade Fabric Switches
Copyright © 2001 by Syngress Publishing, Inc. All rights reserved. Printed in the United States of
America. Except as permitted under the Copyright Act of 1976, no part of this publication may be
reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the
prior written permission of the publisher, with the exception that the program listings may be entered,
stored, and executed in a computer system, but they may not be reproduced for publication.
Printed in the United States of America
1 2 3 4 5 6 7 8 9 0
ISBN: 1-928994-30-X
Technical Editors: Chris Beauchamp, Josh Judd, Benjamin Kuo
Acquisitions Editor: Catherine B. Nolan
Developmental Editor: Kate Glennon
Copy Editor: Beth A. Roberts
Freelance Editorial Manager: Maribeth Corona-Evans
Cover Designer: Michael Kavish
Page Layout and Art by: Shannon Tozier
Indexer: Jennifer Coker
  Syngress Acknowledgments

We would like to acknowledge the following people for their kindness and support
in making this book possible.
Greg Reyes, Jack Cuthbert, Doug Wesolek, Maggie Conroy, Julie Chiu, Elaine Tite,
Jeff Seltzer, and Chris Mingrone at Brocade, for championing the idea of a Brocade
SANs book. Also special thanks to Viet Dao, John Gareri, Mark Murphy, Jay Rafati,
Ron Totah, Ezio Valdevit, John Bae, James Carpignano, Steve Daheb, Derek Granath,
Jay Kidd, Omy Shani, James Bleess, Owen Higginson, Leo Kappeler, Chris M.
Nguyen, Mark Peluso, and Henry Robinson for their help in making this book a
reality.
Ralph Troupe of Callisma for his invaluable insight and guidance. Ralph’s expertise in
SAN architecture and design solutions for next-generation storage networking
implementations helped define our vision for this book.
Richard Kristof and Duncan Anderson of Global Knowledge, for their generous
access to the IT industry’s best courses, instructors, and training facilities.
Karen Cross, Lance Tilford, Meaghan Cunningham, Kim Wylie, Harry Kirchner,
Kevin Votel, Kent Anderson, and Frida Yara of Publishers Group West for sharing
their incredible marketing experience and expertise.
Mary Ging, Caroline Hird, Simon Beale, Caroline Wheeler, Victoria Fuller, Jonathan
Bunkell, and Klaus Beran of Harcourt International for making certain that our
vision remains worldwide in scope.
Anneke Baeten and Annabel Dent of Harcourt Australia for all their help.
David Buckland, Wendi Wong, Daniel Loh, Marie Chieng, Lucy Chong, Leslie Lim,
Audrey Gan, and Joseph Chan of Transquest Publishers for the enthusiasm with
which they receive our books.
Kwon Sung June at Acorn Publishing for his support.
Ethan Atkin at Cranbury International for his help in expanding the
Syngress program.

This book was designed and written to provide information about storage area
networking architectures. Every effort has been made to make this book as complete
and accurate as possible. However, the information in this book is provided to you
“AS IS,” without warranty of any kind, including, without limitation, any implied
warranty of merchantability or fitness for a particular purpose.
    The authors and Brocade Communications Systems, Inc., shall have no liability or
responsibility to any person or entity with respect to any loss, cost, liability, or
damages arising from the information contained in this book or the computer programs
that accompany it, and specifically disclaim any implied.




  Brocade Acknowledgments

This book truly represents a complete Brocade team effort. We would like to
acknowledge several people in particular. Without their help, dedication, and
knowledge, this book would not have been possible. The thorough technical review by Viet
Dao, John Gareri, Mark Murphy, Jay Rafati, Ron Totah, and Ezio Valdevit shaped our
manuscripts into a book that Brocade can be proud of. John Bae, James Carpignano,
Steve Daheb, Derek Granath, Jay Kidd, and Omy Shani provided several timely
contributions to the content. We also incorporated material written by others within
Brocade: James Bleess, Owen Higginson, Leo Kappeler, Chris M. Nguyen, Mark
Peluso, and Henry Robinson. We would also like to thank Maggie Conroy and Doug
Wesolek for their guidance and help with the publication process.
                                               —Josh Judd and Chris Beauchamp




Authors
   Chris Beauchamp is a Senior SAN Architect for Brocade
   Communications Systems, Inc. Chris moved to Brocade in 1998 as a
   Systems Engineer supporting several strategic customers with the
   application and qualification of SilkWorm fabric switches. Chris now focuses on
   SAN design and architecture, with an emphasis on scalability and
   troubleshooting. His specialties include Sun servers, storage performance
   analysis and capacity planning, Fibre Channel trace analysis, scripting in
   various languages, and SAN administration. Chris holds a Master of
   Science in Computer Engineering from Villanova University and a
   Bachelor of Science in Computer Science degree from West Chester
   University. Chris’s background includes positions as a Systems
   Administrator at General Electric and a Systems Engineer at Sun
   Microsystems. Chris currently resides outside of San Jose in the Santa
   Cruz Mountains with his wife Sarah and daughter Meagan.

   Josh Judd is a Senior SAN Architect with Brocade Communications
   Systems, Inc. In addition to writing technical literature, he provides
   senior-level strategic support for major OEMs and end-users of Brocade
   storage network products worldwide. When he first went to work for
   Brocade, he was the company’s Senior IT Specialist, responsible for
   escalations in every area of the company’s network, server, and desktop
   infrastructure. Josh’s career as an IT consultant has given him a diverse skill set,
   which includes senior-level expertise in several UNIX variants, Windows
   9x/NT/2k administration, RAID configuration and optimization, storage
   virtualization and clustering software (such as that produced by VERITAS
   Software), and network engineering with many vendors, including Cisco,
   Foundry, Lucent, and 3Com. Before joining Brocade four years ago, Josh
   worked at IBM Global Services, LSI Logic, and Taos Mountain
   Consulting. He lives in San Jose, California.




Special Contributor
   Benjamin F. Kuo is a Software Development Manager at TROIKA
   Networks. Headquartered in Westlake Village, CA, TROIKA Networks is
   a developer of Fibre Channel Host Bus Adapters, dynamic multipathing,
   and management software for Storage Area Networks. Ben manages
   development of network management software at TROIKA and is also
   active as the chair of the HBA API subgroup of the Storage Networking
   Industry Association (SNIA), where he spearheads efforts to develop
   interoperable standards in storage networking. Ben’s background includes
   positions at Paracel Inc. (now Celera Genomics), IBM, Micropolis, and
   Echelon Corp. Ben also runs socalTECH.com, a Web site and daily
   newsletter focused on high tech in Southern California. Ben holds a
   Bachelor’s degree in Electrical Engineering from the University of
   Southern California and is a member of the IEEE. Ben lives in
   Moorpark, California, with his wife Jennifer and son Jonathan.




Contributor
   Alex Neefus is the Lead Interoperability Test Engineer at Lamprey
   Networks, Inc. Lamprey Networks offers certification testing services and
   test tool development to the Fibre Channel industry. Alex has worked on
   developing testing tools for the SANmark program hosted by the FCIA.
   This program certifies Fibre Channel devices for conformance and
   interoperability. Alex has also co-authored and written a number of
   conformance test suites for both the FCIA and the University of New
   Hampshire Interoperability Lab. Alex’s background also includes working
   for the UNH Interoperability Lab in the Fibre Channel Consortium for
   over a year and a half. At the lab his primary work is in developing
   procedures and tools for testing Fibre Channel products, and working with
   members of the industry to solve interoperability problems with devices
   on the market. Alex resides in Durham, New Hampshire.
Contents

Foreword

Chapter 1 Introduction to SANs
    Introduction
    Overview of SANs
    Taming the Storage Monster
    Benefits of Building a SAN
        Ensuring High Availability
        Consolidating Storage
        Reducing Network Congestion from Backup
        Accelerating Backup Cycles
        Speeding Up Data Access
        Increasing Server Cycles
        Ensuring Disaster Tolerance
    When to Deploy a SAN
        Designing Around the Application
        Assessing Speed, Bandwidth, and Distance Requirements
        Data Sharing and Consolidation Needs
            Resource Sharing
            Volume-Level Sharing
            File-Level Sharing
    Steps to a Successful SAN Deployment
    Summary
    Solutions Fast Track
    Frequently Asked Questions

Learn When to Deploy a SAN
Things to consider when deciding whether a SAN is the right solution:
  • The primary application that needs to be solved
  • Speed, bandwidth, and distance requirements
  • The amount of data sharing or consolidation needed
  • The cost of the SAN infrastructure required, such as switches, cables,
    and HBAs

Chapter 2 Fibre Channel Basics
    Introduction
    The Architecture of SANs
        Fibre Channel Protocol
        Classes of Service
        Storage Network Topologies
        Fabric Services
            Fibre Channel Protocol Basics
        Fibre Channel Levels
        ULPs
    Classes of Service
        Class 1
        Class 2
        Class 3
        Class 4
        Class F
    Storage Network Topologies
        Point-to-Point Topology
        Fibre Channel Arbitrated Loop (FC-AL) Topology
        Switched Fabric Topology
    Fabric Services
        Login Server
        Name Server
        Fabric/Switch Controller
        Management Server
        Time Server
        Other Services
    Summary
    Solutions Fast Track
    Frequently Asked Questions

Master Fabric Services
Fabric services provide information to nodes in a switched fabric topology.
Services can be distributed across all switches, creating the appearance
of single-service type servers. In this chapter, we discuss a number of
different fabric services, including:
  • Login Server
  • Name Server
  • Fabric/Switch Controller
  • Management Server
  • Time Server

Chapter 3 SAN Components and Equipment
    Introduction
    Overview of Fibre Channel Equipment
        Cabling and Media
        GBICs and Connectors
        Hubs
        Switches
        Storage
        Host Bus Adapters
        Routers and Bridges
    Cabling and GBICs
        Copper Versus Optical: Selecting Your Media
            Copper Cabling
            Multimode Optical Cabling
            Single-Mode Optical Cabling
        Connecting with Connectors
            The DB-9 Copper Connector
            The HSSDC Copper Connector
            The SC Optical Connector
            High-Density Fiber-Optic Connectors
        Comparing GBICs to Fixed Media
            Using a GBIC
            Pros and Cons of Using GBICs
            GBIC Ports on Equipment
            Serialized Versus Nonserialized
            Common Problems with GBICs
        Media Interface Adapters
    Using Hubs
        Simple Electrical Hubs
        Managed Hubs
            LIP Service: Fibre Channel LIPs, Problems, and Solutions
            Getting Out of the Loop: Migrating to Switched Fabric
    Using Switches and Fibre Channel Fabrics
        Basic Switch Types
            Entry-Level Switches
            Scalable Fabric Switches
            Core Fabric Switches

Understand Fibre Channel Equipment
WARNING
  Any single-mode or multimode laser can damage your eyes if it is
  transmitted at 1300 nm. The 1300 nm wavelength is not in the visible
  spectrum, so you will not see a laser being transmitted like in 850 nm
  fiber. A 1300 nm laser is dangerous, because it can cause severe retina
  damage.

        Features of Fibre Channel Switches
            Zoning
            Classes of Service
            Fabric Services
            Redundancy
            Buffer Credits per Port
            Self-Configuring Ports
            Auto-Negotiating Speeds
            IP over Fibre Channel Broadcasting
            Firmware Upgrade Methods
            Loop Operation: Making Your Switch Act Like a Hub
            FSPF Compliance
        Management Interfaces
            Serial Port
            Telnet
            SNMP
            Web-Based Management
            Application-Based Management
            SCSI Enclosure Services
            Connecting Your Servers with Host Bus Adapters
        Connecting Hosts to the Fabric
            HBA Types
            Speeds
            Ports
            Combination Adapters
            Fabric-Capable Versus Loop Adapters
            HBA-Based LUN Masking
            Persistent Binding
            Default LUN Access Permissions
            Upper-Level Protocol Access Permissions
            Dynamic Versus Static Discovery
            Configuration Management Software
            HBA API Support
            Remote Boot across the SAN
            Hot-Plug Support
    Connecting Legacy Devices into Your SAN
        Basic Features of Routers
            Number of SCSI Buses
            Types of SCSI Ports, Termination
            Selective LUN Presentation
            Extended Copy Support
            Management Interfaces
    Bridging and Routing to IP Networks and Beyond
        Fibre Channel to DWDM
        Fibre Channel across IP Networks
        IP over Fibre Channel to Gigabit Ethernet
    Fibre Channel Storage
        Individual Disk Drives and JBODs
        High-End Storage Arrays
            Selective LUN Presentation
            LUN Export across Multiple Ports
            Snapshot Backup Volumes
    Summary
    Solutions Fast Track
    Frequently Asked Questions

Chapter 4 Overview of Brocade SilkWorm Switches and Features
    Introduction
    Selecting the Right Switch
        Entry-Level Switches
            SilkWorm 2010 (8 Ports) and 2210 (16 Ports)
            SilkWorm 2040 (8 Ports) and 2240 (16 Ports)
            SilkWorm 2050 (8 Ports) and 2250 (16 Ports)
        Scalable Fabric Switches
            SilkWorm 2400
            SilkWorm 2800

Simplify SAN fabric management with Brocade WEB TOOLS
Brocade WEB TOOLS is a software utility that enables you to manage and
monitor your fabric through a Web browser interface and Java plug-in.
Using WEB TOOLS, you can view all switches in the SAN from a single
interface from any workstation in your enterprise, even at a remote
location.

            SilkWorm 6400 Integrated Fabric
        SilkWorm 12000 Core Fabric Switch
    Understanding the Brocade Fabric OS
        Fabric OS Core Functions
        Fibre Channel Services for Reconfiguration
        Dynamic Routing Services
        Facilities for End-to-End SAN Management
        Brocade Command Line Interface
    Using Optional Brocade Features
        Brocade Zoning
        Extended Fabrics
        Fabric Watch
        Understanding Loop Support, QuickLoop, and Fabric Assist
        Brocade WEB TOOLS
    Future Capabilities in the Brocade Intelligent Fabric Services Architecture
            Brocade ISL Trunking
            Brocade Frame Filtering
            More Robust Hardware-Enforced Zoning
            Enhanced End-to-End Performance Analysis
            Secure Fabric OS
    Summary
    Solutions Fast Track
    Frequently Asked Questions

Chapter 5 The SAN Design Process
    Introduction
    Looking at the Overall Lifecycle of a SAN
        Data Collection
        Data Analysis
        Architecture Development
        Prototype and Testing
        Transition
        Release to Production
        Maintenance
    Conducting Data Collection
        Creating an Interview Plan
        Conducting the Interviews
            What Overall Business Problem Are You Trying to Solve?
            What Are the Business Requirements of the Solution?
            What Is Known about the Nodes that Will Attach to the SAN?
            Which SAN-Enabled Applications Do You Have in Mind?
            Which Components of the Solution Already Exist?
            Which Components Are Already in Production?
            Which Elements of the Solution Need to Be Prototyped and Tested?
            What Equipment Will Be Available for Testing?
            How and When Are Backups to Be Done?
            What Will Be the Traffic Patterns in the Solution?
            What Do We Know about Current Performance Characteristics?
            What Do We Know about Future Performance Characteristics?
            How Much Downtime Is Acceptable to Production Components During Implementation?
            How Much Downtime Is Acceptable for Routine Maintenance? How Much Downtime Is Acceptable for Upgrades and Architectural Changes?
            When Do You Need Each Piece of the Solution to Be Complete?

Master the seven phases of the SAN design lifecycle:
 1. Data Collection
 2. Data Analysis
 3. Architecture Development
 4. Prototype and Test
 5. Transition
 6. Release to Production
 7. Maintenance

                                          Summary List of Questions               176
                                     Conduct a Physical Assessment                176
                                  Analyzing the Collected Data                    177
                                     Processing What You Have Collected           177
                                     Establishing Port Requirements               182
                                          Simple Case                             183
                                          Moderate Case                           185
                                          Complex Case                            186
                                     Preparing an ROI Analysis                    187
                                          The Return On Investment Proposition    188
                                     The Rest of the Process and the
                                       Repetition of the Cycle                    190
                                  Summary                                         191
                                  Solutions Fast Track                            192
                                  Frequently Asked Questions                      193
                               Chapter 6 SAN Applications
                               and Configurations                                 195
                                  Introduction                                     196
                                  Configuring a High-Availability Cluster          196
                                      Typical HA Application or Database Server    198
                                      Microsoft Cluster Server                     200
                                  Using a SAN for Storage Consolidation            203
                                          Shared Storage Using a Web Farm          206
                                      Storage Partitioning Using Switch Zoning     208
                                          Switch Zoning Configuration for
                                             Departmental SANs                    208
                                      Storage Partitioning Using Storage LUN
                                       Masking                                    210
                                      Storage Partitioning Using HBA LUN
                                       Masking                                    210
                                      Partitioning with Software                  211
                                  LAN-Free Backup Configuration                   212
                                  SAN Server-Free Backup                          213
                                      SAN-Based Third-Party Copy Data Movers      215
                                  Making Your Enterprise Disaster Tolerant        216
                                      Data Replication and Remote Backup          218

Answer Your Questions about SAN Applications and Configurations

Q: I would like to cluster my databases for better performance. What databases
   can I use?
A: Most major databases now support fabric switch-based clustering, including
   Oracle Parallel Server, IBM DB2 Parallel Edition, and Microsoft SQL Server.
Q: I would like to have my Exchange Mail Server highly available. What should
   I do?
A: Brocade has developed HA solutions for the Exchange Server that can be used
   in setting up your desired SAN configuration.

                      Metropolitan Area Network Solutions          219
                   Summary                                         222
                   Solutions Fast Track                            222
                   Frequently Asked Questions                      226
                Chapter 7 Developing a SAN Architecture            227
                   Introduction                                     228
                   Identifying Fabric Topologies and SAN
                    Architectures                                  229
                       Useful Topologies                           235
                           Scalability                             236
                           Cascade Topology                        236
Develop a SAN Architecture
                           Ring Topology                           237
                           Mesh Topologies                         238
                           Core/Edge or Star Topologies            242
                           Topologies at a Glance                  246
                           Complex Topologies                      246
                   Working with the Core/Edge Topology             246
                       Scaling without Downtime                    248
                           Adding an Edge Switch                   248
                           Upgrading the Core                      250
                   Determining Levels of Availability              256
                   Configuring Traffic Patterns                    261
                       Leveraging Tiers                            261
                           Exploiting Locality                     266
                           Using Any-to-Any Connectivity           268
                   Evaluating Performance Considerations           269
                       When Is Over-Subscription Bad?              270
                       Considerations Outside the Fabric           270
                   Summary                                         272
                   Solutions Fast Track                            273
                   Frequently Asked Questions                      275
                Chapter 8 SAN Troubleshooting                      277
                   Introduction                                     278
                   The Troubleshooting Approach: The SAN
                     Is a Virtual Cable                            278
                                  A Typical Scenario: “I Cannot See My Disks”   279
                                  Where to Start and What Data to Gather        283
                                      Take a Snapshot: Describe the
                                        Problem and Gather Information          284
                                  Troubleshooting Tools                         287
                                      Using the Switch LEDs                     287
                                      Switch Diagnostics                        289
                                      Helpful Commands                          290
                                      SAN Profile                               308
                                      What Data Can a Host Provide?             312
                                      When to Use portLog and Other
                                        Advanced Tools                          314
                               Troubleshooting the Fabric                       316
                                  What to Look for in a Malfunctioning Fabric   317
                                      Host Behavior                             317
                                      SAN Profile                               317
                                      Switch LEDs                               318
                                      The errShow Command                       318
                                      The switchShow Command                    318
                                      The topologyShow Command                  320
                                      The nsShow and nsAllShow Commands         320
                                  Now that You Suspect a SAN Issue:
                                   Digging Deeper                               321
                                      Timeout of Edge Devices During
                                        Fabric Bring Up                         321
                                      Port Configuration Conflict or
                                        Missing Fabric License                  322
                                      Segmented Fabrics                         323
                               Troubleshooting Devices that Cannot Be Seen      327
                                  What to Look for in the Fabric                329
                                  Are the Host and Storage Visible via
                                   switchShow on Their Respective Switches?     329
                                  Do the Devices Show Up in the Name
                                   Server?                                      332
                                      Rule Out Zoning Issues                    333
                                      Edge Device Not in the Name Server        334

SAN Troubleshooting

When you start the troubleshooting process, determine whether the issue is
fabric-related or device-related. A fabric-related issue impacts many devices,
while a device-related issue affects only a few devices.

                           Troubleshooting Marginal Links                   335
                              Marginal Point-to-Point/Fabric Device Links   335
                              Marginal Loop Connections                     337
                              Nx_Port (Host/Storage) Behavior with
                                a Marginal Port in the Loop                 338
                                   Marginal GBIC/Cable                      338
                                   Connected Device                         339
                              Fault Isolation                               339
                              How the Switch Can Help: Fabric Watch
                                and QuickLoop Zoning                        339
                              Overview of SilkWorm Port Error Statistics    341
                           Troubleshooting I/O Pauses                       342
                           Summary                                          344
                           Solutions Fast Track                             345
                           Frequently Asked Questions                       347
                       Chapter 9 SAN Implementation,
                       Maintenance, and Management                          349
                          Introduction                                       350
                          Installation Considerations                        351
                              How to Cable Your SAN for Ease of
                                Operation                                   351
                              Racking Considerations                        354
                              In-Band or Out-of-Band Management?            356
                                    IPFC In-Band Guidelines                 357
                              Setting Switch Parameters                     358
                              What Fabric OS Version Should I Use?          361
                              Licenses                                      366
                          Automating Switch Administration Activities       367
                              Fabric OS APIs                                367
                              Expect Scripting                              369
                                    A Switch Management Wrapper
                                     Using Expect                           369
                          Brocade Zoning Considerations                     372
                              Where to Zone?                                373
                              Hard Zoning or Soft Zoning?                   375

Use licenseShow to Determine What Licenses Are Installed on Your Switch

core1:admin> licenseShow
SRzy9Sz9zeTS0zAG:
     Web license
bbSz9eQb9zccT0AQ:
     Zoning license
RdzdSRcSyzSe0eTn:
     QuickLoop license
cSczRScd9RdTd0SY:
     Fabric license

                              Hard Zoning and Soft Zoning
                                Differences                            378
                              Zone Management                          378
                              Scripting Zoning Operations              379
                         Zoning Tips                                   381
                     Validating Your Fabric                            382
                         Baseline Your SAN Profile                     382
                         Fault Injection                               384
                         Running an I/O Load                           385
                              Types of Load                            386
                              I/O Generators                           387
                     SAN Maintenance                                   391
                         The Configuration Log: Key Information
                           to Gather and Maintain about Your SAN       391
                         Backing Up and Restoring a Switch
                           Configuration                               393
                         Bringing Up a Fabric                          394
                         Expanding a Fabric: Merging Fabrics, Adding
                           a Switch, or Replacing a Switch             395
                         Upgrading Your Fabric                         398
                         Issues Applicable to Both Hot and
                           Cold Upgrades                               398
                              Performing a Hot Fabric Upgrade          399
                              Performing a Cold Fabric Upgrade         400
                              How to Automate firmwareDownload         400
                         Replacing or Adding an Edge Device in
                           the Fabric                                  401
                     Summary                                           403
                     Solutions Fast Track                              405
                     Frequently Asked Questions                        408
                  Appendix Building SANs with
                  Brocade Fabric Switches Fast Track                   409
                  Index                                                431
                                                               Foreword




Why Write a Book about SANs?
During the last few years, Storage Area Networks (SANs) have fundamentally changed
the way organizations design, build, and manage their enterprise networks. As a superior
alternative to direct-attached storage models, SANs have enabled a wide range of
new configurations and applications. In turn, those applications have generated a
variety of benefits for the organizations that have implemented them. These advantages
include superior scalability, simplified storage management, optimized resource
sharing, higher availability, and tremendous cost savings, to name just a few.
    As a primary facilitator of the networked storage model, Brocade
Communications Systems actively seeks out new opportunities to raise industry
awareness about the value of SANs. One of our primary goals at Brocade is to help
educate all kinds of organizations about the advantages a networked storage environ-
ment can offer.
    Based on feedback from our customers and business partners, we realized that
there was no self-contained, effective guidebook for implementing Fibre Channel
SANs. To help fill that void, we have joined with Syngress Publishing to bring you
Building SANs with Brocade Fabric Switches. This book details the design, installation,
configuration, and troubleshooting of Brocade-based SANs: basically everything
you need to know before beginning your own SAN implementation.

Who Should Read This Book?
Building SANs with Brocade Fabric Switches is written for anyone who plans to design,
build, and manage SANs using Brocade switches and software. In particular, this book
provides a “how to” reference that describes what you can do today, given the tech-
nologies currently available.
    By necessity, the focus is Brocade-centric and features the theory of operation
behind Brocade SilkWorm switches and Fabric OS. However, this book is not
intended to be a comprehensive guide for every configuration and scenario possible.
After all, with the rapid expansion of the SAN marketplace, there will undoubtedly
be other technologies available in the not-so-distant future.

What Does the Book Contain?
In addition to providing an overview of current technology, tools, products, and
design topologies, this book should serve as a guideline for actual SAN implementa-
tion. For instance, the book begins with a detailed analysis of technology require-
ments and the benefits of implementing a SAN. Next, you can learn about Fibre
Channel concepts and definitions as well as the full range of SAN components.
    We then introduce you to the Brocade SilkWorm series of Fibre Channel
switches, including guidelines for integrating these switches into your existing IT
environment. The book concludes with examples of design processes, popular SAN
applications, and detailed troubleshooting and maintenance tips. In addition, each
chapter features a high-level summary and FAQs for anyone who needs a quick
overview of the SAN basics.
    Our goal is to make this book a valuable tool for implementing your own
SAN infrastructure and to show how a well-designed SAN can deliver a
competitive advantage for your organization. We welcome your feedback on our
efforts. If you have any comments or suggestions about this book, please let us know
at www.syngress.com/solutions.
                                          —Kumar Malavalli
                                           Vice President, Technology
                                           Brocade Communications Systems




 www.syngress.com
                                      Chapter 1


Introduction
to SANs




 Solutions in this chapter:

     s   Overview of SANs
     s   Taming the Storage Monster
     s   Benefits of Building a SAN
     s   When to Deploy a SAN
     s   Steps to a Successful SAN Deployment


         Summary

         Solutions Fast Track

         Frequently Asked Questions




2   Chapter 1 • Introduction to SANs



    Introduction
    In the early 1980s, direct-attach disk storage through interconnects such as the
    Small Computer System Interface (SCSI) became the standard way to connect
    high-speed, high-performance storage to computer systems. This approach worked
    well for the amount of data typically handled at the time.
        However, as computer systems became faster and data storage needs grew,
    the parallel bus architecture of SCSI started hitting performance
    and distance limits. In response, Fibre Channel was developed to
    provide gigabit-speed serial networking capabilities for storage. Fibre Channel
    maps several protocols onto a single network architecture: the SCSI protocols
    for storage, the Internet Protocol (IP) for networking, and the Virtual Interface
    (VI) protocol for clustering. The Fibre Channel standard combines long
    distances of up to 10 km, simplified serial cabling over multiple media types,
    gigabit speeds, and the ability to carry more than one protocol simultaneously
    over the same wire. These features won adoption for Fibre Channel throughout
    the 1990s as a replacement for parallel SCSI, and Fibre Channel is now used for
    most high-capacity, high-end direct storage devices.
        With the advent and market acceptance of Fibre Channel as a point-to-point
    replacement for parallel bus SCSI, a new technique has emerged that combines
    pure storage usage with networking—the Fibre Channel Storage Area Network
    (SAN). A SAN is a network of storage and system components, all communi-
    cating on a Fibre Channel network, that can be used to consolidate and share
    storage, provide high-performance links to data devices, add redundant links to
    storage systems, speed up data backup, and support high-availability clustered
    systems.
        This chapter provides an overview of what a SAN is, what types of problems
    are best solved with a SAN, and some steps to make a SAN deployment successful.
    After reading this chapter, you should be able to determine whether you should
    deploy a SAN, identify the types of applications best served by SAN technology,
    and create a deployment plan for your SAN.

    Overview of SANs
    Throughout the 1980s, the standard way of connecting hosts to storage devices
    was point-to-point, direct-attach storage through interfaces such as Integrated
    Drive Electronics (IDE) and parallel SCSI (Figure 1.1). Parallel SCSI offered
    relatively fast (5 or 10 MByte/sec) access to SCSI-enabled disks, and several disks
    could be connected at once to the same computer through the same interface.
    This worked well for the time: fairly reliable, fast connections allowed
    administrators to connect internal and external storage using simple ribbon
    cabling or multiconductor external cables. However, as storage subsystems grew
    larger and computers faster, a new problem emerged. External storage, which at
    one time was just a simple disk drive on the desk next to a machine, started to
    get bigger. Tape libraries, Redundant Array of Inexpensive Disks (RAID) arrays,
    and other SCSI devices began to require more and more space, so the parallel SCSI
    connection had to be stretched farther and farther from the host. Input/Output
    (I/O) rates also increased, pushing on the physics of keeping signal integrity in a
    large bundle of wires (8- and 16-bit data bus widths). Simple parallel SCSI
    variants were devised to enable longer distances and to address the signal
    integrity issues. However, they all eventually ran up against the difficulties of
    sending high-speed signals across the parallel SCSI bus architecture.

Figure 1.1 Parallel SCSI Bus Connection

[Diagram: a host and three devices (SCSI ID 1, SCSI ID 2, SCSI ID 3) attached to a shared parallel SCSI bus.]




        The solution to all of this was slow in coming, but eventually the storage
    industry settled on using a serial protocol with high-speed transceivers, offering
    good noise immunity, ease of cabling, and plentiful bandwidth. Different
    specifications (Serial Storage Architecture [SSA] and Fibre Channel, as well as more
    advanced parallel SCSI technologies) competed for adoption, and companies
    began experimenting with different serial communications media. New high-speed
    circuits made serial transfers (using a simple pair of wires to transmit bits
    serially, in order, rather than a large number of wires to transfer several bytes or
    words of data at a time) the most practical solution to the signal problems. The
    high speed of the circuits enabled Fibre Channel to offer data rates of up to
    100 MByte/sec, versus the 10 to 20 MByte/sec limits of parallel SCSI.
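
To put the rate comparison above in concrete terms, the following minimal Python sketch estimates how long an ideal bulk transfer would take at each nominal rate. It assumes the quoted figures are peak rates in megabytes per second (about 100 for 1 Gbit/sec Fibre Channel, 10 to 20 for parallel SCSI) and ignores protocol overhead and device limits. The 2000 MB data set and all names in the code are hypothetical and chosen only for illustration.

```python
# Illustrative only: nominal peak rates, read as MByte/sec (an assumption),
# with no allowance for protocol overhead or device limits.
NOMINAL_RATES_MB_PER_S = {
    "parallel SCSI (low)": 10.0,
    "parallel SCSI (high)": 20.0,
    "Fibre Channel": 100.0,
}


def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Return the ideal time in seconds to move size_mb megabytes."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    size_mb = 2000.0  # hypothetical data set size
    for name, rate in NOMINAL_RATES_MB_PER_S.items():
        seconds = transfer_seconds(size_mb, rate)
        print(f"{name:22s} {seconds:7.1f} s")
```

Real transfers would be slower than these ideal figures, but the ratio between the interfaces is the point of the comparison.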
        When Fibre Channel was first applied to storage connections, the
    primary attractions of the technology were the extended distances and simplified
    cabling that it offered. This extension of direct-attach operation
    basically replaced the old parallel SCSI attachments with a high-speed serial line
    (Figure 1.2). The new Fibre Channel connections offered a much faster interface
    and simplified cabling (four copper wire connections through DB-9 connectors,
    as well as optical cabling), and could be used to distribute storage as far as 10 km
    away from a host computer, or 30 km away with optical extenders.

    Figure 1.2 Using Fibre Channel to Extend Distances from Storage

    [Diagram: a host connected to a storage array by a Fibre Channel link of up to 10 km.]




        The connections to disks at this time began using the Fibre Channel
    Arbitrated Loop (FC-AL) protocol, which enabled disks to negotiate their
    addresses and traffic on a loop topology with a host (Figure 1.3). Because
    storage was now easy to cable and distribute, users were able to
    add separate racks of equipment to attach to hosts. A new component, the Fibre
    Channel hub, began to be used to make it easier to plug in devices. The hub, a
    purely electrical piece of equipment that simply connected pieces of a Fibre
    Channel loop together, made it possible to dynamically add and remove storage
    from the network without requiring a complete reconfiguration. As these
    components began to be used in increasingly complex environments, manufacturers
    began to add “intelligence” to these Fibre Channel hubs, enabling them to
    deal independently with such issues as failures in the network and noise in the
    network from loops being added and removed. An alternative to the hub came in
    the form of the Fibre Channel switch, which, unlike a hub, did not just connect
    pieces of a loop, but instead offered the packet-switching ability of
    traditional network switches.


Figure 1.3 Arbitrated Loop Disk Configuration Attached to a Single Host

[Diagram: one host and six disks connected in a Fibre Channel Arbitrated Loop.]



     Because a Fibre Channel network was now available, additional hosts (not just
storage) were attached to take advantage of it. With the addition
of SAN-aware software, it was suddenly possible to share storage between two
different devices on the network. Storage sharing was the first realization of the
modern SAN, with companies in the multimedia and video production areas
paving the way by using the Fibre Channel network to share enormous data
files between workstations, distribute jobs for rendering, and make fully digital
production possible (Figure 1.4).
     The next big step in Fibre Channel evolution came with the increased reliability
and manageability of a Fibre Channel switched fabric. Early implementations
of FC-AL were sometimes difficult to manage, unstable, and prone to interoperability
problems between components. Because the FC-AL protocol was quite
complex, a loop could sometimes reach a state in which nothing on it could
communicate or stay operational. The solution was a move to a
switched fabric architecture, which not only enhanced the manageability and reliability
of the connection, but also provided switched, high-speed connections between all
nodes of a network instead of a shared loop. As a result, each port on a switch now
provides a full 1 Gbit/sec of available bandwidth rather than just a portion of the
total 1 Gbit/sec of bandwidth shared between all the devices connected to the
loop. Fabrics now make up the majority of Fibre Channel installations. A typical
Fibre Channel switched fabric installation (Figure 1.5) has multiple hosts and
storage units all connected into the same Fibre Channel network cloud through
one or more Fibre Channel switches.
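
The bandwidth difference between a shared loop and a switched fabric can be illustrated with a small Python sketch. It is an idealized model, not a measurement: it assumes every device is active at once, that loop bandwidth is divided evenly, and that arbitration overhead and contention at a shared target port or inter-switch link can be ignored. The function names are hypothetical.

```python
LINK_GBIT_PER_S = 1.0  # nominal link rate quoted in the text


def loop_share_gbit(active_devices: int, link: float = LINK_GBIT_PER_S) -> float:
    """Average bandwidth per device when one loop is shared by all devices."""
    if active_devices < 1:
        raise ValueError("need at least one device")
    return link / active_devices


def fabric_port_gbit(active_devices: int, link: float = LINK_GBIT_PER_S) -> float:
    """Bandwidth per device when every device has its own switch port."""
    if active_devices < 1:
        raise ValueError("need at least one device")
    return link  # independent of how many other ports are in use


if __name__ == "__main__":
    for n in (2, 8, 32):
        loop = loop_share_gbit(n)
        fabric = fabric_port_gbit(n)
        print(f"{n:3d} devices: loop {loop:.3f} Gbit/sec each, "
              f"fabric {fabric:.3f} Gbit/sec each")
```

Under these assumptions the per-device share on a loop shrinks as devices are added, while each fabric port keeps the full link rate.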


    Figure 1.4 Multiple Host Arbitrated Loop for Storage Sharing

    [Diagram: three hosts, three disks, and a RAID unit connected in a Fibre Channel Arbitrated Loop.]



    Figure 1.5 Switched Fabric, Multiple Host, and Storage Unit Configuration

    [Diagram: three hosts, three JBODs, a RAID unit, and a tape unit connected to a central Fibre Channel switch.]


    Today, the modern SAN looks much like any other modern computer network.
Network infrastructure components such as switches, hubs, bridges, and routers help
transport frame-level information across the network. Host bus adapters (HBAs)
play the role of network interface cards, connecting computer systems to the
network; the name carries over from the SCSI Host Bus Adapters they replaced.
Figure 1.6 shows an example of how these components could be used in
conjunction with Fibre Channel switches.

Figure 1.6 Typical Deployed SAN Configuration with Multiple Hosts, Storage,
and Tape Devices

[Diagram components: a database server, a web server, and other hosts, each with an HBA; several Fibre Channel switches joined by ISLs (Inter-Switch Links) to form a Fibre Channel cloud; a Fibre Channel hub; a RAID array, a storage array, and a JBOD; a tape array and legacy parallel SCSI storage, shown with Fibre Channel-to-SCSI routers; and a remote SAN, shown with a Fibre Channel-to-DWDM bridge.]

       Resources for SAN Information
        Rather than relying on just the equipment vendor, an effective way to
        learn and become an expert on the technology is to track the industry and
        attend conferences, meetings, and tutorial sessions about the subject.
             Additional resources for learning more about SAN technology are
        the industry organizations devoted to this area. The Storage Networking
        Industry Association (SNIA) offers white papers and educational
        resources, holds technical tutorial sessions and Storage Networking
        World conferences, and supports both the users and vendors involved in
        the storage networking field. More information can be found at
        www.snia.org. The Fibre Channel Industry Association (FCIA) provides
        resources for users and vendors, conducts the SANmark suite of Fibre
        Channel interoperability tests, and holds conferences and meetings to
        help promote Fibre Channel technology. Their site can be found at
        www.fibrechannel.org.



    Taming the Storage Monster
    The advent of SANs has been driven by today’s insatiable appetite for storage.
    The Internet, e-mail, multimedia, and the increasing digital nature of society
    have resulted in an ever-increasing demand for ways to store, retrieve, and back
    up that data.
        For example, e-mail has been on a staggering growth path in the last few
    years, as more and more people have gone online and businesses have made
    e-mail a critical part of their communications infrastructure. According to the
    Year-End 2000 Mailbox Report, there are over 891 million e-mail mailboxes now
    in existence. Corporate mail usage grew 34 percent in 2000, bringing with it a
    huge increase in the need for data storage to save all of that e-mail. Multimedia
    attachments, the movement of business processes to e-mail, and just the sheer
    volume of e-mail have made the storage and backup of e-mail one of the most
    pressing requirements of IT departments.

     The Internet has also affected the need for storage, with increasing numbers
of Web servers and storage required to support those Web servers. As information
is increasingly digitized and published on the Web, there is an insatiable appetite
for storage to contain that information. Music and full-motion video, even with
compression, take an immense amount of disk space, and the movement of stu-
dios and companies to run a “full digital” shop has resulted in an enormous
demand for storage capacity. Databases, which used to be considered big if they
were gigabytes in size, are now well beyond a terabyte—with companies talking
about eventually having to manage petabytes of database storage.
     In addition, with caching servers, Web load balancing, and Web farms built to
distribute the processing load for Web traffic, the data being presented on Web
sites has to be duplicated 10, 20, and even 100 times to serve those distributed
hosts with information. With the increased connectivity of the Internet,
information and content are being generated and distributed faster than ever
before in history; so much so, in fact, that the University of California at
Berkeley recently released a study that claims that more data will be created in
the next two years than was produced in the history of mankind.
     All of this data has to go somewhere, and it has outgrown both the space
available on local, direct-attached storage and what can practically be managed
there. Because local storage is relatively fixed and difficult to expand, and
because its local nature makes it difficult to manage, organizations have started
to look for a better way to manage this data. The solution has come in the form of
very large storage arrays, capable of storing many terabytes of data, and farms of
inexpensive Just a Bunch of Disks (JBOD) enclosures. All of this needs to be
connected, and the logical way to connect high-speed, block-oriented traffic is
through a Fibre Channel SAN. Increased manageability, the ability to centrally
manage storage, and consolidation of storage space have made the SAN a necessity
in any growing enterprise.
     Data is growing at such a rapid pace that IBM recently reported that storage
sales now exceed server sales at a 70:30 ratio. The requirements to store data are
increasing at a greater rate than the requirement for CPU cycles, and the entire
industry is changing as a result. This shift has meant that data is now managed
separately from the machines that consume that data, making SANs an ideal choice
for breaking the dependency between hosts and storage and for increasing the
manageability and usability of a corporation’s investment in data storage.

        Implementing a SAN is an ideal technique for taming the storage require-
     ment monster that has resulted from the growth of the Internet and increased
     connectivity of our electronic age.

     Benefits of Building a SAN
     A number of practical, real-world uses for SANs have emerged in recent years.
     Knowledgeable administrators have figured out the types of problems that SAN
     technology best solves. SANs are typically used for the most business-critical,
     technically challenging problems a company faces. Critical, high-availability
     systems used for e-mail, database, and file servers have been the first to switch
     to SANs. A need to consolidate storage and centrally manage volumes has resulted
     in a trend toward using SANs for storage consolidation. With the increase in data
     growth, backups have also become a problem, with companies looking to accelerate
     backup cycles. Protocols such as IP available on Fibre Channel also make SANs
     attractive for some general networking applications, and Virtual Interface (VI)
     clustering support allows installations to leverage their SAN infrastructure for
     VI-enabled clustering applications. Finally, the distance capabilities of Fibre
     Channel and bridges to Metropolitan Area Networks (MANs) and even Wide Area
     Networks (WANs) have enabled a new level of disaster tolerance for storage
     resources.

     Ensuring High Availability
     As the Internet and digital data have grown exponentially, Web caching
     techniques, Web load balancers, distributed server clusters, and other techniques
     have been used to handle the demands of serving up Web requests for static
     pages. Images, files, and Web pages that do not change often can be copied across
     a bank of hosts, all of which can service a request from a user.
         However, these techniques cannot be applied in many critical applications.
     For example, an e-mail server requires one single, consistent image for e-mail
     storage. Back-end databases of e-commerce applications require combining live,
     real-time inventory data with live pricing data to service requests correctly. None
     of these can be cached across a Web server due to the real-time, non-cacheable
     nature of the information. This dependency on a consistent, single image of data
     cannot be solved by just replicating data or sharing across a cluster. The result
     is a new, critical point of failure in the e-mail server or database. Especially
     with the growth in data, more and more vital data is being trusted to those
     single points of
failure, raising the stakes and potential losses if those services go down. The
concern over these critical points of failure has resulted in a renewed focus on
highly available (fault tolerant) solutions, particularly in the storage area. In
combination with failover software packages such as Microsoft Cluster Server or
VERITAS Cluster Server, high-availability hardware and software have come to the
forefront in ensuring the performance and availability of these critical systems.
     For example, SANs are ideal for high-availability solutions that manage and
run very large Microsoft Exchange databases. With the immense increase in data
stored in Exchange servers all over the world, there has been an increase in the
amount of back-end storage required for serving those Exchange installations.
Because of the nondistributed nature of Exchange mail databases, data has become
concentrated on single hosts and storage units, creating a single point of failure
that could cripple many businesses. The natural solution has been to use
application clustering techniques combined with a robust, fully redundant
high-availability SAN to support those clusters and share redundant storage
between hosts.
     High-availability systems are now regularly used for ensuring fault-tolerant
access to storage. A focus on eliminating single points of failure has stimulated
demand for fault-tolerant equipment configurations, specific fault-tolerant net-
work equipment, and techniques for ensuring high availability. SAN technology is
ideal for these types of solutions. It allows host-to-host connectivity for heartbeat,
equipment status, and network communications, as well as for sharing critical
storage between alternate and backup servers.
     The availability of SAN connections has solved one of the big problems with
high-availability, clustered installations: access to the same data across a network.
In combination with high-availability features in storage arrays and other equip-
ment, the SAN allows for multiple redundant paths to be made from multiple
redundant hosts, dramatically increasing the reliability of critical systems. In addi-
tion, with flexible SAN interconnections, the large amount of data that needs to
be accessed can more easily be managed separately, rather than being captive to a
potential failure in a host.
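
As a rough illustration of why multiple redundant paths raise reliability, the
sketch below computes the chance that at least one of several paths is up. The
99.9 percent per-path availability and the assumption that paths fail
independently of one another are hypothetical, chosen only to show the
arithmetic; real fabrics can share failure modes (power, firmware, cabling
routes), so actual gains may be smaller.

```python
def combined_availability(path_availability: float, paths: int) -> float:
    """Probability that at least one of `paths` independent paths is up.

    Assumes every path has the same availability and that failures are
    independent, which is a simplification made for illustration.
    """
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("need at least one path")
    # All paths must be down at once for access to be lost.
    return 1.0 - (1.0 - path_availability) ** paths


# Hypothetical figure: each host-to-storage path is up 99.9% of the time.
single_path = combined_availability(0.999, 1)
dual_path = combined_availability(0.999, 2)
print(f"one path:  {single_path:.6f}")
print(f"two paths: {dual_path:.6f}")
```

Under these assumptions, a second path turns roughly one hour of downtime per
thousand into roughly one hour per million, which is the kind of improvement the
redundant SAN designs described above aim for.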

Consolidating Storage
As data needs have increased, it has become increasingly difficult to manage the
hundreds of hosts and local disks attached to those hosts. In order to manage this
growth, administrators have begun to centralize their storage resources. Large
storage arrays and pooled storage are much more efficient and far more
manageable than local storage.

          As storage needs have increased, the model of attaching local storage to hosts
     has broken down. Administrators figured out that, even though a company as a
     whole might own enough storage for all of its needs, that storage was not
     necessarily in the right place. For example, a Web server might be running out of
     space, with no more space available on local disks and not enough SCSI
     connections to add more external storage, while the database server next to it
     has gigabytes free. In the old model of local storage, there was no way to take
     advantage of that fact. You ended up purchasing much more storage than you
     needed, because you had a very low rate of utilization, yet you never had enough
     capacity where it was needed. You also ended up purchasing more servers than you
     needed, not because you needed more CPU cycles, but because you needed more
     storage slots.
          With the advent of the Fibre Channel network, the ability of clients and
     storage devices to coexist on one network and share storage has spawned a new
     crop of solutions that take advantage of that sharing. Sharing of storage, which
     previously was limited to vertical markets such as video editing and multimedia,
     has become a general technique used anywhere that storage is more easily managed
     in a pool, such as in Internet Service Provider (ISP) and Application Service
     Provider (ASP) installations. Indeed, most corporate IT environments can take
     advantage of this technique.
          Through software such as VERITAS Volume Manager, Tivoli SANergy, and
     DataCore SANsymphony, users are now able to allocate and share storage among
     multiple hosts.
          By using the SAN infrastructure, large centralized pools of disks can be
     divided between hosts, and new volumes allocated as needed from the general
     pool. This results in a huge increase in the efficiency of storage use,
     eliminating the pools of expensive, local, unusable storage. Instead, one large,
     easily managed virtual storage pool can be centrally administered, and storage
     costs and administration can be centralized and consolidated.
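
The difference between stranded local storage and a shared pool can be shown
with a few lines of arithmetic. The sketch below is illustrative only: the host
names and capacities are made up, and it ignores RAID overhead, growth
headroom, and everything else a real capacity plan would include.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    local_capacity_gb: int  # disks physically attached to this host
    needed_gb: int          # what its applications actually require


def local_shortfalls(hosts):
    """Hosts that run out of space when each can use only its own disks."""
    return {h.name: h.needed_gb - h.local_capacity_gb
            for h in hosts if h.needed_gb > h.local_capacity_gb}


def pooled_shortfall(hosts):
    """Shortfall when the same total capacity is one pool carved up on demand."""
    total_capacity = sum(h.local_capacity_gb for h in hosts)
    total_needed = sum(h.needed_gb for h in hosts)
    return max(0, total_needed - total_capacity)


# Hypothetical figures: the Web server is full while the database server
# next to it has capacity to spare.
hosts = [Host("web", 100, 130), Host("db", 100, 40)]

print(local_shortfalls(hosts))   # {'web': 30}
print(pooled_shortfall(hosts))   # 0
```

With local disks, the Web server is 30 GB short even though the company owns
more than enough storage overall; treated as a pool, the same disks cover both
hosts with room left over.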
          Sharing is accomplished through this high-level software, which discovers and
     manages all of the storage on the network. Drivers and software in the host manage
     which machines do and do not get access to a specific part of a storage device. In
     general, a central administrator is able to allocate arbitrary pieces of storage to
     specific hosts, all while the network and all hosts are running in real time.
          A typical example of this is an ISP with a large number of user Web homepage
     accounts. Extensive pools of clustered and independent Web servers help to ease
     the traffic load and provide redundancy on the Internet, while being tied into a
     single- or dual-redundant SAN. Storage allocation and sharing software is run on
     all of these hosts, and the different Web homepage accounts are allocated to
     different Web servers. When a failure on a host or storage device occurs, either an
automated process or manual intervention will re-allocate those user volumes to
another Web server, or fail over to another storage device, resulting in uninter-
rupted service and no dependency of specific users on a local disk. In some cases,
multiple Web servers can access the same, read-only data on the SAN, providing a
high-bandwidth pipe and eliminating the need for expensive, redundant copies of
the same data.

Reducing Network Congestion from Backup
A typical problem any administrator faces is that of data backup. Because of the
huge growth in data, even on local disks, and the increasing criticality of the data
stored on networks, backup has become very important. Software packages such
as VERITAS NetBackup, Legato NetWorker, and other packages have long relied
on agents that transport data over IP connections to a central backup host. The
result has been a noticeable slowdown due to the vast amount of data being
transported in IP packets over Ethernet connections, and not just late at night.
The backup window for many enterprises has extended from overnight to include
hours of peak system operation, simply because there is too much data to fit
into the more traditional and convenient backup windows.
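
A back-of-the-envelope calculation shows how quickly a backup outgrows the
overnight window. The sketch below is only an estimate under stated assumptions:
the data volume and sustained throughput figures are hypothetical, and it
ignores tape drive speed, compression, and protocol overhead.

```python
def backup_hours(data_gb: float, throughput_mb_per_sec: float) -> float:
    """Hours needed to move data_gb at a sustained rate (1 GB = 1000 MB)."""
    if throughput_mb_per_sec <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_gb * 1000 / throughput_mb_per_sec
    return seconds / 3600


# Hypothetical: 500 GB to back up, and about 10 MB/sec sustained over a
# shared Ethernet segment.
ethernet_hours = backup_hours(500, 10)

# The same data at a hypothetical 100 MB/sec sustained rate.
faster_hours = backup_hours(500, 100)

print(f"at 10 MB/sec:  {ethernet_hours:.1f} hours")
print(f"at 100 MB/sec: {faster_hours:.1f} hours")
```

At the slower rate the job takes well over half a day, which is how a backup
that once ran overnight ends up competing with daytime traffic.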
     Anecdotal stories from system administrators illustrate how entire corporate
networks have become swamped with daily backups over IP, slowing not only
e-mail, but critical file servers, print servers, and Web access. Some shops have
gone as far as to install separate, high-speed Ethernet networks in an attempt to
offload this problem.
     SANs lend themselves to several techniques that directly help the backup
problem. One of these techniques is the use of IP over Fibre Channel to offload
the network congestion on the Ethernet network. IP, when transported over
Fibre Channel, is identical in form and function to IP over Ethernet and other
networks. Taking advantage of the fact that there are already Fibre Channel
connections into a network for access to shared data, administrators have installed
IP drivers into their servers and entirely offloaded the backup function onto the
Fibre Channel network. This frees the corporate Ethernet from the immense job
of transporting IP backup traffic, and takes advantage of the increased bandwidth
efficiency that is characteristic of Fibre Channel. Due to the connection-oriented
protocols built on Fibre Channel, IP traffic impacts the Fibre Channel network
less and helps administrators gain better usage out of their networks. In addition,
an increasing number of applications can perform shared backups over a SAN
using the backup devices’ native SCSI protocol, which greatly increases the
efficiency of the backup process.

     Accelerating Backup Cycles
     Another reason for SAN implementation, one that attacks the problem of overall
     backup cycles, has been the development of third-party copy. Taking the
     advantages of the Fibre Channel network a step further, specialized hardware
     devices called data movers work in conjunction with next-generation backup
     software to skip the IP transport of backup data entirely; they move data that
     needs to be backed up directly from storage devices on the network to tape
     backup devices on the same network. Because the transfer is direct (no copy to
     server memory), it is very fast and drastically reduces the CPU processing power
     needed for backup. Eliminating the time spent copying data between storage and
     local host memory frees up valuable CPU cycles for something else: for example,
     running the applications that the host was installed to run. Companies such as
     Chaparral Network Storage and Crossroads Systems have been developing these
     third-party copy devices as part of their Fibre Channel-to-SCSI bridges, in
     conjunction with different backup vendors, who are now able to move data across
     a Fibre Channel network without the intervention of hosts.

     Speeding Up Data Access
     The key word for SANs is speed, speed, and more speed. As a block-level protocol,
     SCSI over Fibre Channel Protocol (FCP) is the fastest and most efficient
     networking technology available to transport block-type data from storage to
     hosts. Companies that previously were using TCP/IP-based networking technology
     over Ethernet and attempted to migrate that to Gigabit Ethernet have found that,
     despite the similar wire speeds of the two technologies, the protocols available
     on Gigabit Ethernet do not use the bandwidth as efficiently as Fibre Channel does.
         By using Fibre Channel, companies have found that they can speed up data
     access between hosts and storage. In addition, IP over Fibre Channel uses the
     network more efficiently than IP over Gigabit Ethernet, which makes standard
     TCP/IP networking over a shared SAN an attractive solution.

     Increasing Server Cycles
     A growing problem has emerged with high-speed networks based on IP.
     Companies have been using clustering techniques (running many, coordinated
     servers in tandem to distribute processing) to attempt to get past the problems of
     limited CPU speeds and server scalability. However, clustering techniques rely on
the latency of a network to determine what kinds of scaling are achievable. On
most Ethernet networks, adding clustered servers produces a negative scaling
effect: each added server contributes less and less processing power, because
more and more CPU cycles must be dedicated just to coordinating the cluster.
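
The negative scaling effect can be pictured with a toy model. This is a sketch
under loud assumptions, not a measurement: it supposes that every server spends
a fixed fraction of its CPU coordinating with each other member of the cluster,
and the 10 percent figure is invented purely for illustration.

```python
def useful_capacity(servers: int, overhead_per_peer: float) -> float:
    """Useful work from a cluster, in units of one uncoordinated server.

    Toy model: each server loses `overhead_per_peer` of its CPU for every
    other server it must coordinate with. Both the linear form and the
    parameter value are assumptions made for illustration.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    usable_fraction = max(0.0, 1.0 - overhead_per_peer * (servers - 1))
    return servers * usable_fraction


# Hypothetical 10% coordination cost per peer on a high-latency network.
for n in (1, 2, 4, 6, 8, 10):
    print(n, "servers ->", round(useful_capacity(n, 0.10), 2))
```

In this model the cluster's useful capacity rises at first, flattens, and then
falls as servers are added. Lowering the per-peer cost, which is what a
lower-latency interconnect is meant to do, pushes that turning point further out.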
    The Virtual Interface (VI) Architecture protocol, a standard proposed by Intel,
Microsoft, and Compaq for reducing the use of the CPU for network transfers,
has emerged as the leading protocol for network communications for clustered
environments. By providing a simplified model for direct hardware access to
clustered machines, VI eliminates the complex IP stack in favor of a
hardware-based, Direct Memory Access (DMA) approach to transferring data across a
network. The FC-VI standard maps the VI protocol (also available on Ethernet and
proprietary interconnects) to the Fibre Channel protocol, and makes low-latency,
direct access available to clustering applications.
    The primary areas in which the VI protocol is being used today include clus-
tered databases such as Oracle Parallel Server and IBM DB2, both of which
natively support the VI protocol over Fibre Channel and other networks. A signif-
icant base of researchers and other developers are also using the VI protocol for
scientific computing and distributing computational tasks across large networks
of machines.
    SANs are beginning to be used in this area to take advantage of the FC-VI
protocol. Businesses are using the VI protocol to free server cycles on their data-
base servers, and recent third-party copy records have been set using VI-capable
hardware. An administrator running a clustered database such as the Oracle
Parallel Server or IBM DB2 should consider taking advantage of the SAN infra-
structure and installing an FC-VI-capable HBA to further accelerate the database
cluster.

Ensuring Disaster Tolerance
One of the major advantages of SAN technology is its high-performance, long-
distance capability. Initially, SAN technology was mostly used to extend to larger
distances within a building or campus. However, recently this has been applied to
the problem of disaster tolerance: being able to keep an operation up and run-
ning even if catastrophe strikes. For example, data center managers are now using
Fibre Channel technology bridged through MANs to make their installations
more disaster tolerant.
    A typical example of using MANs for disaster tolerance is brokerage houses
located on Wall Street. A common scenario is a large data center that supports
     customer operations and the trading floor in Manhattan, which needs to have a
     live backup site to handle the possibility of a power or telecommunications
     outage, natural disaster, or other major catastrophe. Brokerages are now
     connecting live SANs, through Fibre Channel-to-Dense Wavelength Division
     Multiplexing (DWDM) bridges and other types of MAN technology, directly to SANs
     in New Jersey. The two data centers continually share data and replicate (mirror)
     data between the independently running sites, so that if a catastrophe strikes,
     all information is up to date and still available. This technique is also heavily
     used in continental Europe, where operations can be spread between countries
     through these metropolitan connections or through dark fiber (leased optical
     fiber that the customer lights with its own equipment to carry high-bandwidth
     data).

     When to Deploy a SAN
     Before deciding to deploy a SAN, consider whether implementing a SAN is the
     right thing for the situation at hand. Frequently in technology, people decide to
     implement something before they have evaluated whether the technology is
     actually the best for their needs. The result is often disappointing for the user
     and for the vendors involved when, after a lot of money and time has been spent,
     it turns out there was little chance that the solution would have solved the
     overall problem in the first place. On the other hand, if the problem is first
     considered and matched to the best technology, the odds for success are much
     greater. Things to consider when deciding whether a SAN is the right solution:
          •   The primary application that needs to be served
          •   Speed, bandwidth, and distance requirements
          •   The amount of data sharing or consolidation needed
          •   The cost of the SAN infrastructure required, such as switches, cables,
              and HBAs


     Designing Around the Application
     The most important part of determining whether to deploy a SAN is to focus on
     the actual business application that will be served with the SAN deployment.
     Unlike Ethernet networking technology, Fibre Channel SAN technology really
     should be applied on an application-by-application basis. Equipment and
     software deployment is entirely driven by what types of applications need to be
     served, as opposed to being just an interconnect to plug in all of the desktops in
an organization. Typical applications are storage consolidation, Web hosting farms,
database and business-critical transaction servers, and workgroup data sharing. The
issues to consider are what those applications need from the network: data
sharing, faster backup, or disaster tolerance. An understanding of the benefits of
deploying a SAN is vital to driving the design, software, and hardware used to
deploy a network.
     When starting a SAN deployment, carefully rank and prioritize the goals of the
project. Equipment, software, and solutions are all geared toward specific types
of applications, and understanding what is needed is very important in
determining whether a new feature that a vendor is trying to sell is critical to
making an application deployment successful.

Assessing Speed, Bandwidth,
and Distance Requirements
Key factors in deciding to use a SAN are speed and bandwidth requirements. For
the most demanding applications, a SAN might be the only option. One
Gbit/sec (100 MB/sec) components are widely distributed now; 2 Gbit/sec (200
MB/sec) switches, storage, and HBAs are starting to hit the market; and Fibre
Channel standards of up to 10 Gbit/sec are already in development. If unim-
peded access to storage is required, Fibre Channel exceeds the speeds available
from legacy techniques such as parallel SCSI, and simplified cabling and connec-
tions make it far more reliable. Compared with other technologies, such as IP-
based file sharing and Network Attached Storage (NAS), the Fibre Channel
protocol provides for more usable bandwidth and faster data transfer.
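
The text equates 1 Gbit/sec with 100 MB/sec rather than the 125 MB/sec that
dividing by 8 would suggest. The usual explanation, which is background rather
than something stated in this chapter, is that Fibre Channel at these speeds
uses 8b/10b line coding, so each data byte occupies 10 bits on the wire, and
that the nominal 1 Gbit/sec link signals at 1.0625 Gbaud. A minimal sketch of
that arithmetic, treating both figures as assumptions:

```python
def payload_mb_per_sec(line_rate_gbaud: float,
                       data_bits: int = 8,
                       coded_bits: int = 10) -> float:
    """Byte rate left after line coding, in MB/sec (1 MB = 10**6 bytes).

    The default 8-data-bits-per-10-coded-bits ratio models 8b/10b coding.
    Frame headers and inter-frame gaps are not modeled here.
    """
    data_bits_per_sec = line_rate_gbaud * 1e9 * data_bits / coded_bits
    return data_bits_per_sec / 8 / 1e6


# Assumed signaling rates for the 1 Gbit/sec and 2 Gbit/sec generations.
one_gig = payload_mb_per_sec(1.0625)
two_gig = payload_mb_per_sec(2.125)
print(f"1 Gbit/sec link: about {one_gig:.2f} MB/sec before framing overhead")
print(f"2 Gbit/sec link: about {two_gig:.2f} MB/sec before framing overhead")
```

The results land a little above the rounded 100 and 200 MB/sec figures quoted
above; the remainder is consumed by framing, which this sketch leaves out.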
    Distance is another factor in deciding to use SAN infrastructure. If data needs
to be distributed across a building, campus, or city, a storage network is perfect.
Long cable lengths, multiple cabling options, and robust components make a SAN
a perfect fit for distributing data. For example, many companies have solved the
problem of not having enough space for all of the data and hosts they need in a
single server room by using Fibre Channel to link equipment in different parts of
the building. Large storage arrays, which infrequently need access by
administrators, can be housed in a “lights out” facility separate from production
servers, which often need connection to monitors, keyboards, and administration.
Deployment of
     these solutions is also relatively easy: optical fiber is connected on either end to
     devices or switch ports, and full bandwidth operation is seamlessly available.




        SANs Versus NAS: What’s the Difference?
         First-time data administrators are often confused as to whether they
         should be using SANs or NAS. Both techniques are useful, but for dif-
         ferent types of applications.
               NAS uses common client networks such as Ethernet to connect
         client computers to a host file server. Unlike SANs, the client does not
         directly communicate with the storage. Instead, the client computer
         uses a high-level network file system protocol such as Network File
         System (NFS), which runs over TCP/IP on Ethernet. Data exchange
         occurs at the file level, unlike a SAN, where data is accessed at the
         block level over Fibre Channel.
               In general, NAS techniques are best used for client-to-host con-
         nections, and SANs are better suited to high-speed file sharing and
         host-to-storage connections. NAS connections are typically easier
         to deploy over existing infrastructures, albeit much slower. SANs are
         typically used where bandwidth and speed are most important, and
         block-level, direct connections are required. Both techniques can
         coexist in the same installation. In fact, some NAS systems require a
         back-end SAN to support their operation.



     Data Sharing and Consolidation Needs
     Determining whether data will be shared on a SAN is important. Because data
     sharing is one of the major motivations behind moving to SAN technology, it is
     critical to understand exactly how data will be shared across the SAN. Areas to
     consider when assessing data sharing and consolidation needs include:
          •   Will information be shared at a file level or volume level?
          •   Are resources such as storage arrays (static shares) shared, or is file
              sharing required as part of the workflow (dynamic shares)?

Resource Sharing
The simplest form of storage network data sharing and consolidation is simple
resource sharing. In this case, a large storage array or storage farm is shared
among many machines, but access to each array or disk is statically allocated.
Each machine on the network is assigned storage, and does not change that
ownership very often, if at all. This is typically used to partition a large
storage array among many hosts and can be a very convenient way to manage storage
resources. This resource partitioning can be done in a variety of ways, including
zoning at the switch level, storage-based Logical Unit Number (LUN) masking,
HBA-based LUN masking, or by a virtualization device that presents “virtual
LUNs” to the hosts. Resources are generally allocated once and are infrequently
changed or modified, and rarely are resources switched between hosts.
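
Conceptually, static LUN masking is a lookup table that says which host may see
which LUN. The sketch below models only that idea; the class, the host
identifiers, and the LUN numbers are all hypothetical, and it does not
represent the interface of any real array, HBA driver, or switch.

```python
class LunMaskingTable:
    """Toy model of static LUN masking: a host sees only its assigned LUNs."""

    def __init__(self):
        # Maps a host identifier to the set of LUN numbers it may access.
        self._allowed = {}

    def assign(self, host_id: str, lun: int) -> None:
        """Grant a host access to one LUN (done rarely, by an administrator)."""
        self._allowed.setdefault(host_id, set()).add(lun)

    def can_access(self, host_id: str, lun: int) -> bool:
        """Check whether a request from this host for this LUN is permitted."""
        return lun in self._allowed.get(host_id, set())

    def visible_luns(self, host_id: str):
        """LUNs presented to a host; everything else is hidden from it."""
        return sorted(self._allowed.get(host_id, set()))


# Hypothetical partitioning of one array between two hosts.
table = LunMaskingTable()
table.assign("web-server", 0)
table.assign("web-server", 1)
table.assign("db-server", 2)

print(table.visible_luns("web-server"))   # [0, 1]
print(table.can_access("db-server", 0))   # False
```

Because the table changes so rarely, it matches the static allocation described
above: each host's view of the array is fixed until an administrator edits it.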

Volume-Level Sharing
Volume-level sharing is the sharing of resources at the volume level: for example,
sharing a LUN between two systems for a clustering application, or moving
volumes between machines as part of a digital media workflow. This generally
requires the intervention of software that can mount and unmount volumes as
part of an operating system, and also might require translators between different
operating system formats.

File-Level Sharing
File-level sharing is sharing of resources at a file level. This means writing and
reading files on a single volume from different machines. This is typically
done in SAN configurations with one machine that has write permissions to a
volume, and many machines with read permissions to the files on that volume.
However, if per-user security (only certain users have access to certain files) is
desired, or if many machines need to write to the same volume, a more
advanced file system is required. Some techniques (such as global file systems)
can be used in this case, but often a better fit is NAS technology and more
traditional network file-sharing techniques such as NFS.
    A SAN is an ideal solution for:
     •   Block-level access to shared storage
     •   High bandwidth requirements
     •   Need for expandability
     •   Required access to very large centralized storage arrays
     •   Need for redundant, highly available access to storage
     •   Clustered server configurations
     •   Distributed applications
     •   Need for disaster tolerance
     •   Backing up large amounts of data nightly
     •   Running clustered databases (Oracle Parallel Server, IBM DB2)
     •   Need for a highly scalable infrastructure
     •   Centralized storage management
    A SAN is not appropriate for:
     •   A small amount of storage with no sharing required
     •   File-level, client access to volumes only
     •   No storage consolidation required


     Steps to a Successful
     SAN Deployment
     As with any advanced technology, the most difficult part of working with the
     technology is the actual deployment of the hardware and software. This section
     outlines at a high level some recommended steps to take to help ensure success
     in SAN deployment, and explains the overall process of deploying a SAN. Later
     chapters discuss the SAN design process in more detail, breaking the process into
     seven steps: data collection, data analysis, architecture development, testing a
     prototype, transitioning existing hardware, release to production, and maintenance.
          The first step to successful SAN deployment is to evaluate the intended goals
     of the deployment. A firm ranking of the top items to achieve with the
     deployment is key to evaluating hardware and software options as well as determining
     topology and design layout. The channel characteristics of storage devices make
     topology selection and overall architectural design critical in SANs, particularly
     for high availability. If high availability is the primary goal, consider
     dual-redundant SAN fabrics, fault-tolerant components, and a topology that allows for
     redundant HBAs and storage ports. On the other hand, if data consolidation and
cost reduction are the goals, the design can use a topology without redundant,
separate fabrics. The same considerations go into other applications
such as backups, databases, or disaster tolerance. Defining the goals and
developing a detailed technical plan is an important factor in success. Future chapters
cover the data collection and analysis phases in detail.
    As with any technology solution, a great deal of a SAN deployment should be
the investigation of the different software and hardware that will be part of the
solution. Because hardware and software are constantly changing, and new innova-
tions and equipment are available every day, speaking with various vendors,
attending trade shows, and talking to current users will help in determining the
best options for a SAN. Although Fibre Channel equipment, for the most part, is
now fully interoperable and will work together, it is still a good idea to get a sense
of the different options available. Storage arrays, the foundation of a SAN and the
most important part of the solution, should be a primary focus when designing a
network. However, because everything in a network has to work together, the
infrastructure and HBAs are also a critical part of the formula.
    The next step is installing a SAN prototype and testing the install, as covered
in future chapters. Networking is a complex area, and SANs are no exception. An
important deployment risk reduction item is the actual installation of a SAN
testbed to prototype the installation. Creating a SAN prototype allows for testing
the ultimate installation and working out any issues that might be encountered
with both the software and hardware being used. Unless an outside party is doing
the installation, consider it a necessity to personally install the setup, to make sure
that all of the components work together.
    It is usually best to set up a lab with all the required power, cabling, and
hardware needed to install and test equipment. This includes enough host machines to
run the application and to install and test HBAs and software, along with
racks and benches for the storage, switches, and other hardware that will be used. In
fact, for mission-critical SANs, maintaining a testbed in parallel with the production
SAN should be seriously considered, so that changes can be tried there before they
are rolled out to the actual production network. The testbed allows for pretesting
configuration changes and new components, debugging production problems, validating
changes, and verifying that new versions of network and system software and firmware
behave as expected.

        SANmark and Other Interoperability Programs
         The Fibre Channel Industry Association runs a program called
         SANmark, which certifies equipment against Fibre Channel interoper-
         ability suites defined by the industry.
               Run in conjunction with the University of New Hampshire, the
         SANmark tests define how different kinds of equipment need to inter-
         operate with other types of Fibre Channel hardware. Several levels
         of SANmark certification exist, and the standards are constantly
         evolving to make sure that the latest hardware is tested against
         existing standards.
               Because of the importance of interoperability in the Fibre Channel
         area, many companies publish compatibility matrices that describe
         which components and versions of software and hardware have been
         tested against common configurations.


          The best way to select hardware and software is to start with the information
     available from different vendors on suggested configurations and interoperable
     hardware. Vendor “interoperability labs” and certifications give excellent pointers on
     how to pick the right products that will work together.
          Interoperability labs are now a standard part of almost every vendor’s support
     structure. These labs, where vendors extensively test and qualify equipment and
     certify configurations, are set up to make sure that all of the equipment they
     provide is compatible with other vendors’ equipment. Extensive testing by vendors ensures that
     the hardware and software work together as expected. Stress testing, configuration
     testing, and negative testing under various loads and in different configurations
     flush out problems before a piece of equipment is shipped. Most vendors make all
     of this information available to users and are happy to share what configurations
     they have certified for use with their equipment. Researching the interoperability
     information from the software and hardware vendors should provide a good idea
     of what works together.
     After investigating the options, installing a SAN testbed, and selecting what
seems to be the right software and hardware, it’s testing time. Even with
preconfigured and specified configurations, it’s best to set up a real-world configuration
to help test the deployment on a small scale. This configuration should include a
representative sample of everything that will be deployed on the SAN, with
real-life applications and representative data and with sufficient load generated to
replicate a limited deployment. No configuration or combination of hardware
and software is foolproof, so testing a real-life configuration in a controlled envi-
ronment before rollout can help to flush out the last issues and major showstop-
pers that could derail the deployment. Key areas to cover in testing include:
     •   Installing hardware from all of the major vendors that have been selected
     •   Testing for interoperability of components with the versions of software
         and firmware that will actually be deployed
     •   Testing all crucial functionality with the software and applications
         running
     •   Testing for error handling and tolerance
     Simple testing includes plugging components into and unplugging them from the
network, powering down components to see how they recover, and moving cables. More
complex testing involves running heavy traffic to components, setting up the
application, and running simulated loads. The most important part of this testing
is running a simulated or actual load on the application that is being deployed
and making sure that, even under real-life conditions, everything works as
expected.
     The final step in a successful SAN deployment is staging the actual deploy-
ment into the enterprise. Staging the deployment helps to minimize risk and
maximize the probability of success. Rather than moving to a solution in one fell
swoop, it is better to deploy on a limited basis in certain areas and expand that
deployment once everything is up and running smoothly. A technique that is
frequently used to minimize deployment risk is in-place staging of SAN deploy-
ment. In this technique, the equipment and software are set up and tested where
the network will permanently be installed. All the testing in the previous step is
done where the SAN deployment will eventually be installed, so no equipment is
moved, damaged, or inadvertently reconfigured. Instead, when the time comes to
     deploy, a cable is connected and the SAN configuration is instantly live and
     available. Taking advantage of long-distance Fibre Channel cabling to enable “live”
     installation of new storage, large enterprises use this technique whenever they add
     new SAN hardware. The new storage is tested, run through diagnostics and stress
     tests, and then added onto the production network with the addition of cabling
     and modifications to zoning, all without moving the equipment or reconfiguring
     the setup.
          Carefully staging the deployment, applying changes on a limited basis, and
     then rolling it out gradually will minimize any risk and ensure that everything
     operates smoothly. Future chapters discuss the steps of the SAN design process in
     more detail, how to analyze options and the underlying hardware and software,
     how to design a network, and how to best take advantage of the tools available.

Summary
The need for more data storage is constantly growing, with the Internet, e-mail,
multimedia, and the sheer generation of data demanding more and more storage.
With predictions calling for more data to be created, stored, and managed in the
next two years than was produced in the history of mankind, it is very important
to address those storage needs in a scalable and reliable way.
    In this chapter, we covered a bit of the history behind the SAN: its roots in
parallel SCSI connections, its evolution from just a SCSI replacement to more
sophisticated, loop-based storage sharing, to its current incarnation in highly
scalable, reliable switched fabric networks. Fibre Channel networks are now being
deployed for the most business-critical and important areas in enterprises. Through
this evolution, SAN technology now offers a robust platform to establish and
support the most important business applications.
    The benefits of building a SAN include ensuring high-availability access to
data, consolidating storage resources and management, reducing backup windows
and traffic, freeing host CPU cycles for other important tasks, and ensuring data
availability through disaster tolerance techniques. Building a Fibre Channel SAN
enables more reliable, highly scalable, high-bandwidth access to data. A SAN is
typically used for the most business-critical, technically challenging problems a
company faces.
    Deploying a SAN takes some planning. It is important to consider the appli-
cation in use, speed and bandwidth requirements, and whether data sharing and
consolidation offer any benefit. It is also important to consider the budget for the
project. The keys to a successful SAN deployment are evaluating the goals for the
technology; fully investigating the software and hardware to purchase; taking the
time and resources to install a testbed; working with vendors to select the right
combination of software and hardware; and testing the configuration thoroughly.
Finally, stage the deployment so that problems can be solved on a limited scale
first, before rolling it out on a larger basis.
    SAN technology has the ability to meet the most demanding business needs
and is the only technology currently available that meets the distance, bandwidth,
and reliability requirements of critical applications. With the explosive growth of
data storage requirements, this technology enables the efficient use and
management of data resources. By following proven techniques and carefully planning
deployments, Fibre Channel SANs can help solve the most difficult data storage
problems.

     Solutions Fast Track
     Overview of SANs
              SAN technology evolved from direct-attach interconnects like Small
              Computer Systems Interface (SCSI).
              Fibre Channel supports SCSI, Internet Protocol (IP), and the Fibre
              Channel Virtual Interface (FC-VI) Protocol.
              The distance between Fibre Channel nodes can be as much as 10 km.
              Fibre Channel supports copper, multimode optical, and single-mode
              optical media.
              SAN technology has moved from Fibre Channel Arbitrated Loop to full
              Fibre Channel switch fabric.


     Taming the Storage Monster
              Data storage needs are increasing rapidly.
              Requirements due to databases, e-mail, multimedia, and the Internet
              have dramatically increased the required amount of storage for data.
              Disk farms, storage arrays, and storage consolidation are the keys to
              solving the storage problem.


     Benefits of Building a SAN
              Fibre Channel is ideal for supporting high-availability configurations and
              business-critical back-end operations, due to the ability to set up redun-
              dant networks and clusters.
              SAN technology allows for storage consolidation and data pooling for
              more efficient use of storage resources.
              Backup windows are shrinking, and backup traffic on the LAN can be
              easily reduced by using a SAN to reduce network congestion due to
              backup.
              Block-level, high-speed access through SCSI-Fibre Channel Protocol
              (FCP) can accelerate data access between storage and hosts, and can
     free up host resources that would be occupied serving files and data
     through IP.
     Cluster protocol access through FC-VI frees up CPU cycles in hosts
     and enables clustered database operations.
     One of the major advantages of SAN technology is its long-distance
     capability for disaster tolerance.


When to Deploy a SAN
     The most important part of determining whether to deploy a SAN is to
     focus on the actual business application that will be served with the
     SAN deployment.
     Speed and bandwidth requirements determine if the technology is right
     for the application. Compared with other technologies, such as IP-based
     file sharing and Network Attached Storage (NAS), the Fibre Channel
     protocol provides for more usable bandwidth and faster data transfer.
     A SAN is ideal for block-level access to shared storage.
     Fibre Channel works well for centralized access to storage arrays,
     redundant connections, clustered configurations, and disaster tolerance.


Steps to a Successful SAN Deployment
     Data collection: Evaluate the goals of the deployment to determine
     options in achieving high availability, redundancy, fault tolerance, data
     consolidation, cost reduction, and so forth.
     Data analysis: Investigate the hardware and software options that
     support those goals.
     Architecture development: Design and install a SAN testbed to set up
     configuration and components. Select the software and hardware
     carefully to avoid any interoperability problems.
     Testing the prototype: Test the configuration for interoperability,
     functionality, error handling, and fault tolerance.
     Transition existing hardware in a controlled release to production:
     Stage the deployment by rolling out the setup gradually, making changes
     on a limited basis to minimize risk.

     Frequently Asked Questions
     The following Frequently Asked Questions, answered by the authors of this book,
     are designed to both measure your understanding of the concepts presented in
     this chapter and to assist you with real-life implementation of these concepts. To
     have your questions about this chapter answered by the author, browse to
     www.syngress.com/solutions and click on the “Ask the Author” form.


     Q: Is Fibre Channel more expensive to deploy than Gigabit Ethernet?
     A: The cabling, GBICs, and transceivers are physically identical to those used for Gigabit Ethernet.
         Costs of Fibre Channel switches and other equipment are very close to those
         of Gigabit Ethernet components.

     Q: Is there any way to preserve the investment in legacy SCSI storage in an
         enterprise?
     A: Yes, through the use of Fibre Channel-to-SCSI bridges.

     Q: Where can I get expert help in setting up a SAN?
     A: There are numerous system integrators and resellers who can help. Check
         with the equipment or software vendors.

     Q: Is interoperability a problem with Fibre Channel?
     A: No, the earlier problems with interoperability in Fibre Channel were
         mostly due to Fibre Channel Arbitrated Loop (FC-AL) technology. Switched
         fabric technology eliminates these problems and provides very reliable perfor-
         mance. However, as with any technology, it is still a good idea to check for
         equipment compatibility with the respective vendors.
                                         Chapter 2


Fibre Channel Basics




 Solutions in this chapter:

     •   The Architecture of SANs
     •   Fibre Channel Protocol Basics
     •   Classes of Service
     •   Storage Network Topologies
     •   Fabric Services


         Summary

         Solutions Fast Track

         Frequently Asked Questions





     Introduction
     Storage Area Network (SAN) infrastructures are built using new technologies that,
     although related to and derived from other technologies such as SCSI and IP
     networking, have their own set of terminology and concepts. Like standard computer
     system networking, Fibre Channel also has its own stack of protocol levels, ranging
     from the physical connectors and media (FC-0) to upper-level protocols (FC-4).
     Each of these levels defines a different and separate part of how Fibre Channel
     equipment communicates. An understanding of these protocol levels, although not
     required, helps in understanding the equipment and how to debug and monitor
     it. The different FC-4 protocols (FCP, IP, Virtual Interface [VI], and
     others) are tied directly to the different kinds of applications (storage, networking,
     and clustering), and enable Fibre Channel to support a robust set of uses.
         This chapter introduces some of the basics of Fibre Channel and reviews the
     underlying architecture of Storage Area Networks (SANs). You will discover the
     major parts of the Fibre Channel protocol, the primary physical components
     involved, and how they relate to the software and applications running on a SAN.
     At the end of this chapter you will be able to determine the kinds of protocols
     you need to run in your network, and better understand the various SAN
     topologies and terminology.

     The Architecture of SANs
     SANs provide a topology for connecting a number of hosts to storage devices. An
     exciting part of Information Technology (IT), SANs allow more users access to
     more data at faster rates. The concept of a SAN is to provide an infrastructure
     over which large amounts of data can be transferred robustly between servers and
     storage devices such as Just a Bunch of Disks (JBODs), tape drives, and
     Redundant Array of Independent Disks (RAID) systems. SANs also enable the
     sharing of storage devices such as tape silos and RAID systems. Although there
     are some efforts in the industry directed to using Gigabit Ethernet and
     InfiniBand technologies to implement SANs, the primary SAN infrastructure
     available today is Fibre Channel based.
         SAN storage is useful for business, because the high level of connectivity
     allows you to consolidate all your storage needs in a SAN, which is easily
     expanded as you require more space. SANs are also accessible to everyone on the
     network, which makes it easy to share large projects. Another advantage of using
     a SAN as a means to distribute data across your network is speed. The most
     common protocol used to implement a SAN is the Fibre Channel protocol,
which most commonly operates at 1 Gbit/sec (a data transfer rate of 100 MB/sec).
There are also 2 Gbit/sec devices that have just come to market, and plans for 10
Gbit/sec devices. These high speeds mean that data transfers are less susceptible to bottlenecks.
    A Fibre Channel SAN also provides the advantage of increased reliability. The
Fibre Channel protocol uses both buffer-to-buffer and end-to-end flow control,
and it calculates a Cyclic Redundancy Check (CRC) on every transmitted
frame. Reliability can also be increased through redundancy by developing
fallback connections over a large geographic area. Fibre Channel is designed to
transmit over distances large enough to place equipment beyond the area affected by
a localized disaster such as an earthquake. This means that a SAN can stay fully
operational if it is designed with redundant links at remote points.
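
To make the per-frame CRC idea concrete, the following minimal sketch appends a 32-bit checksum to a payload and verifies it on receipt. It is an illustration under stated assumptions: Python's `zlib.crc32` is used only as a readily available 32-bit CRC, and the helper names and trailer layout are hypothetical. The sketch does not reproduce Fibre Channel's actual frame format or bit ordering.

```python
import zlib


def append_crc(payload: bytes) -> bytes:
    """Return the payload followed by a 4-byte CRC-32 of the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    if len(frame) < 4:
        return False
    payload, trailer = frame[:-4], frame[-4:]
    return (zlib.crc32(payload) & 0xFFFFFFFF) == int.from_bytes(trailer, "big")


# Hypothetical payload; a receiver accepts the intact frame and rejects
# a copy in which a single bit was flipped in transit.
frame = append_crc(b"example data block")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
```

A receiver that calls `check_crc` on every frame can discard damaged frames instead of passing bad data to the application.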
    Another advantage of a SAN is scalability. SANs provide storage that is not
server attached, which improves performance by avoiding bottlenecks on the
connection to one machine. SANs also provide affordable scalability, because a
storage device can be directly attached to the SAN. Since disks are detached from
direct host attachment, multiple devices can allocate the same storage area
without performance limitations (Figure 2.1). Server-detached storage can offer a
more cost-effective storage solution, since a server is no longer necessary in order
to distribute a file system over a multihost network.
    A SAN implemented using the Fibre Channel protocol incorporates the ben-
efits of a channeled connection and a network. A channel is a high-speed infor-
mation conduit but, unlike a network, it is hardware-intensive. Channels
specialize in streaming data between two devices, such as your computer and a
storage subsystem. Some examples of channel protocols are Small Computer
System Interface (SCSI) and High-Performance Parallel Interface (HiPPI). A net-
work, on the other hand, specializes in connectivity, allowing flexibility to add
and remove nodes from the environment. Examples of network protocols are
Token Ring, Ethernet, and Asynchronous Transfer Mode (ATM). Fibre Channel
incorporates the flexibility of a network with the high speed and reliability of a
channel—essentially allowing you to connect a large number of devices without
degrading performance.
    When we talk about a SAN, we generally think of transporting SCSI data
over Fibre Channel. Although this is what is most commonly used in a SAN,
Fibre Channel supports many other protocols. Some other protocols that can be
transported over Fibre Channel are HiPPI, Internet Protocol (IP), Fiber
Distributed Data Interface (FDDI), and ATM, although IP, SCSI, and Virtual
Interface (VI) are the predominant protocols transported on Fibre Channel today.

     Figure 2.1 Storage Server Versus SAN Architecture
     [Figure: two diagrams. Storage Server diagram labels: RAID, tape, server (marked
     as the bottleneck), clients. SAN diagram labels: RAID, tape, switch, servers,
     Ethernet, clients.]

          A SAN is constructed from three primary types of elements: target devices,
     initiating devices, and interconnecting devices.
          A target device is usually a storage device on a SAN. There are many different
     types of storage devices, including tape drives, JBODs, RAIDs, and IP targets. A
tape drive is commonly used for backup of other storage devices, which might hold
a database or critical file system. Fibre Channel tape technology is an emerging
technology, and testing procedures have just recently been developed. We discuss
Fibre Channel components in further detail in Chapter 3, “SAN Components
and Equipment.”
    Fibre Channel storage disks and Fibre Channel-capable tape drives are the
most common types of target devices. Fibre Channel disks have a Fibre Channel
controller on them. In general, Fibre Channel disks are contained in a JBOD. In a
JBOD, each disk is visible to the SAN, each is assigned an address, and each is
treated as an autonomous device even though the physical disks are located in the
same enclosure. SCSI disks might also be contained in a RAID, in which case the
RAID controller will make the array of disks appear to be one disk to the SAN.
The RAID is one disk in the sense that it will take a single address on the SAN.
Another type of target might be an IP target. Since IP is a protocol commonly
used over Fibre Channel, you might see devices communicate by passing IP
packets back and forth. In this case, there is no distinction between a target device
and an initiating device, since both devices can initiate exchanges of Fibre
Channel and IP frames.
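
The difference in how a JBOD and a RAID appear to the SAN can be summarized with a small sketch. It assumes only what the paragraph above states: each disk in a JBOD takes its own address, while a RAID takes a single address. The `Enclosure` type and `san_addresses` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Enclosure:
    kind: str        # "JBOD" or "RAID"
    disk_count: int  # number of physical disks in the enclosure


def san_addresses(enclosure: Enclosure) -> int:
    """Return how many addresses the enclosure occupies on the SAN."""
    if enclosure.kind == "JBOD":
        # Every disk is visible to the SAN as an autonomous device.
        return enclosure.disk_count
    if enclosure.kind == "RAID":
        # The RAID controller presents the whole array as one disk.
        return 1
    raise ValueError("unknown enclosure kind: " + enclosure.kind)
```

For example, an eight-disk JBOD would occupy eight addresses, while an eight-disk RAID would occupy one.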
    An initiating device is a device that actively seeks out and interacts with target
devices on the SAN. Examples are servers and workstations, which are often
called hosts. A Host Bus Adapter (HBA) is a Peripheral Component Interconnect
(PCI) or other bus-type adapter that resides in a host machine. That machine can be a
server, a workstation, or other device that would request information from a
group of disks or storage. It could conceivably be an automated tape backup
system. The distinction between a target and an initiator is that an initiator
actively searches for a target with which to initiate a transfer, while a target is a
passive device. There is often a fine line between the two, since some devices
(such as IP devices or bridge devices) might read and write to each other. When
a device opens an exchange, it acts as an initiator.
    From an infrastructure perspective, the most important components in a SAN
are the interconnecting devices, namely switches. Switches create the foundation of
a Fibre Channel SAN and provide a high-speed interconnect for routing frames
from one device to another. Switches provide fabric services, additional ports for
scalability, and the linking capability of the SAN over a wide distance. Although a
Fibre Channel SAN can technically exist without any switches using arbitrated
loop topology (discussed later in this chapter), a loop-only topology does have its
challenges. Arbitrated loop topologies can be subject to performance issues,
which can be avoided by connecting the SAN in a switched topology. Switches
     are responsible for correctly routing frames from one node to another over the
     entire network, with a group of one or more interconnected switches called a
     fabric (Figure 2.2).
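
One way to picture a fabric is as a graph whose nodes are switches and whose edges are inter-switch links. The sketch below models a small, hypothetical four-switch fabric and finds a fewest-hop path between two switches with a breadth-first search. The switch names and links are invented for illustration, and breadth-first search is used only because it is simple; it is not presented as the path selection method that real switches use.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical fabric: each switch lists the switches it links to directly.
FABRIC: Dict[str, List[str]] = {
    "sw1": ["sw2", "sw3"],
    "sw2": ["sw1", "sw4"],
    "sw3": ["sw1", "sw4"],
    "sw4": ["sw2", "sw3"],
}


def fewest_hop_path(fabric: Dict[str, List[str]],
                    src: str, dst: str) -> Optional[List[str]]:
    """Return a fewest-hop list of switches from src to dst, or None."""
    if src not in fabric or dst not in fabric:
        return None
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in fabric[path[-1]]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # dst is not reachable from src
```

Because `sw1` and `sw4` are joined through both `sw2` and `sw3`, the loss of either intermediate switch would still leave a path, which is the kind of redundancy a multi-switch fabric can provide.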
     Figure 2.2 Interconnected Switches Make Up a Fabric
     [Figure: a fabric of six interconnected switches. Labeled devices: one JBOD, four
     RAID arrays, four hosts, a workstation, an FC/SCSI bridge, and a SCSI tape library.]

        There are two other common devices encountered in a SAN architecture:
     hubs and routers. A Fibre Channel hub provides a function similar to that of an
     Ethernet hub. A hub is a box with a number of ports to which devices can be attached,
     which simplifies device interconnection. The bandwidth on the hub, which is 1
     Gbit/sec, is shared among all the connected devices. There are two types of hubs,
     managed and unmanaged. An unmanaged hub simply provides physical wiring
between all the connected devices. It does not do any signal processing. A device
that is transmitting data, regardless of what the data is, will be connected to the
other devices on an unmanaged hub. A managed hub, on the other hand, will
wait to connect a device to the other devices on the hub until it sees valid
transmission data from the device. The disadvantage of unmanaged hubs is that they
increase disruption time during the insertion of a new device, and they will also
allow a device that is no longer functioning properly to continue transmitting
bad data over the link. This can leave a Fibre Channel loop in an
unusable state and stop traffic between other devices on the hub. For these
reasons, most hubs are managed hubs.
    The terms router and bridge are often used interchangeably in Fibre Channel
terminology. A router generally connects two different protocols, such as Fibre
Channel and Ethernet, or Fibre Channel and SCSI. When the two terms are
distinguished, a router is usually a one-to-many or many-to-many connector,
whereas a bridge generally connects in a one-to-one manner.
    The American National Standards Institute (ANSI) began work on Fibre
Channel in 1988, and since then the X3T11 Task Group has developed over 20
standards. Fibre Channel’s complexity is not without reward, however: Fibre
Channel presently transmits at 1.0625 Gbit/sec over all types of physical media.
Many companies have recently increased that speed to 2.125 Gbit/sec, and
specifications have been published for 10 Gbit/sec Fibre Channel as well. Because
each Fibre Channel byte is encoded in a 10-bit block, a 1.0625 Gbit/sec link
provides a transfer rate of approximately 100 MB/sec.
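
The arithmetic behind that figure can be checked with a short sketch. The function
name below is illustrative only, and the calculation assumes 8b/10b encoding (10
line bits per data byte), so it applies to the 1.0625 and 2.125 Gbit/sec rates
discussed here and ignores framing overhead.

```python
def payload_rate_mb_per_sec(line_rate_gbit: float) -> float:
    """Return usable megabytes per second for an 8b/10b-encoded link.

    Assumes every data byte costs 10 bits on the wire and ignores
    frame headers, primitives, and idle time.
    """
    line_bits_per_sec = line_rate_gbit * 1_000_000_000
    bytes_per_sec = line_bits_per_sec / 10  # 10 line bits per data byte
    return bytes_per_sec / 1_000_000


for rate in (1.0625, 2.125):
    print(f"{rate} Gbit/sec -> {payload_rate_mb_per_sec(rate):.2f} MB/sec")
```

The raw result for the base rate is slightly above 100 MB/sec; the round figure of
100 MB/sec is the commonly quoted approximation.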


NOTE
     Right now there are over 20 Fibre Channel standards projects with many
     more to come. The main organization involved in the standards process
     is T11 (www.t11.org). Copies of all the current standards are available at
     the T11 Web site. The Fibre Channel Industry Association (FCIA, at
     www.fibrechannel.com) has also started the SANmark program to test
     the conformance of Fibre Channel devices to those standards, based on a
     sample set of interoperability tests. Devices can be certified to a number
     of SANmark Conformance Documents (SCDs). A device’s ability to pass
     these tests is an indication of its ability to interoperate with other Fibre
     Channel devices.

          Fibre Channel is most easily understood if it is broken down into layers.
     There are five Fibre Channel layers, labeled FC-0 to FC-4. We can group them
     further by thinking of FC-0 and FC-1 as the physical and signaling layers. FC-2
     is a link, or protocol, layer. The FC-3 layer specifies common services such as the
     Name Server, which provide services for all nodes on a Fibre Channel network.
     The FC-4 layer specifies the mapping of Upper-Level Protocols (ULPs) onto the
     Fibre Channel protocol.
          The physical media is the FC-0 layer. Although it is called Fibre Channel, it
     can be carried over either fiber-optic cables or copper twisted-pair type cables.
     There are two common types of fiber-optic cable: single-mode and multimode.
     Single-mode cable can transmit over longer distances (100 km) than
     multimode fiber (500 m).
          Fibre Channel transmits in 8b/10b-encoded characters, and the signaling
     interface is the FC-1 layer. This means that for each 10 bits transmitted,
     you actually receive 8 bits of information, which are encoded into a character. The
     8b/10b encoding of characters provides a low level of error detection, because if
     bits are lost or inadvertently changed, invalid characters will be received. Four
     transmission characters make a transmission word. Certain transmission words are then
     used as the primitives in the Fibre Channel protocol for signaling purposes.
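
As a rough illustration of that low-level error detection, the sketch below relies
on one property of 8b/10b that goes beyond this chapter: every valid 10-bit
character contains four, five, or six 1 bits. The function name is hypothetical,
and the check is necessary but not sufficient; a real decoder uses full lookup
tables and tracks running disparity. The two K28.5 bit patterns are the commonly
documented forms of that special character.

```python
def ones_count_is_plausible(code_group: int) -> bool:
    """Necessary (not sufficient) validity check for a 10-bit 8b/10b character.

    Valid characters always carry 4, 5, or 6 one bits, so any other count
    means the character was corrupted in transit.
    """
    if not 0 <= code_group < (1 << 10):
        raise ValueError("code group must fit in 10 bits")
    return bin(code_group).count("1") in (4, 5, 6)


# K28.5, the special character used in primitives, in its two disparity forms.
K28_5_NEG = 0b0011111010
K28_5_POS = 0b1100000101

corrupted = K28_5_NEG | 0b0000000001  # flip the last bit from 0 to 1

print(ones_count_is_plausible(K28_5_NEG))
print(ones_count_is_plausible(K28_5_POS))
print(ones_count_is_plausible(corrupted))
```

A single flipped bit is caught here because it pushes the count of 1 bits outside
the allowed range; some bit errors would pass this simple check and only be caught
by a table lookup or a later disparity error.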
          We consider primitives and transmission words to occur at the FC-2 level.
     Primitives control the flow of frames on a Fibre Channel link. Frames are sets of
     transmission words that contain routing headers and a payload. The payload is
     where ULP information is stored, such as SCSI commands or data. The mapping
     of SCSI commands or data into Fibre Channel frames is a ULP activity that
     occurs in the FC-4 layer.
          Devices in a SAN are generally interconnected with a switch. A single switch
     or a group of interconnected switches is commonly referred to as a fabric,
     which provides certain services to the nodes attached to it. The services provided
     are part of the FC-3 layer and include a Name Server, Time Server, Alias Server,
     and so on. The Name Server is a distributed database that registers all devices on
     a fabric and responds to requests for address information. On a fabric, all services
     are conceptually distributed, meaning that the same server provides service to all
     nodes independent of direct switch attachment.

Fibre Channel Protocol
This chapter explains the concepts of the Fibre Channel protocol. The goal is to
gain a high-level understanding of the mechanisms of the protocol, such as
arbitration, arbitrated loop address selection, and frame generation and transfers.
     We abstract the Fibre Channel protocol by dividing it into five layers and
analyze how each layer interacts with the other layers. We will further analyze the
FC-4 layer in detail, because it is the Fibre Channel layer that controls the
mapping of ULPs that can be transported over Fibre Channel. We discuss SCSI and
IP primarily, but also consider HiPPI, ATM, and IPI-3.

Classes of Service
Classes of service are different semantics used to transfer frames using various
verification and buffering mechanisms. The Classes of Service section later in this
chapter describes the different classes of service and their uses:
     •   Class 1: Acknowledged connection-oriented service
     •   Class 2: Acknowledged connectionless service
     •   Class 3: Unacknowledged connectionless service
     •   Class 4: Connection-oriented fractional bandwidth
     •   Class F: Inter-switch communication format


Storage Network Topologies
In the Storage Network Topologies section later in this chapter, we will look at
different topologies and discuss how differences in architecture can affect data
flow over your SAN. There are three primary topologies, and the goal in that
section is to understand how different functions can be achieved in a SAN by using
a single topology or a combination of topology models. We look at examples of
topologies and define the terminology for referring to nodes in the topology:
     •   Point-to-point topology
     •   Arbitrated loop topology
     •   Switched fabric topology

     Fabric Services
     Fabric services provide information to nodes in a switched fabric topology.
     Services can be distributed across all switches, creating the appearance of a
     single server for each service type. The services provided by the different servers
     on a fabric make the interconnection of hundreds to thousands of devices seamless.
     They provide addressing, device-type, and connection-type information to requesting
     nodes. In this chapter, we discuss a number of different fabric services, including:
          •   Login Server
          •   Name Server
          •   Fabric/Switch Controller
          •   Management Server
          •   Time Server


     Fibre Channel Protocol Basics
     Fibre Channel was developed to combine the benefits of channel and network
     technologies. Channels are directly connected devices that do not require large
     amounts of logic to be incorporated. Channels are hardware-intensive because they
     are designed for fast transfer of large amounts of data between buffers. Examples
     of channels are HiPPI and the serial connection made between serial ports on
     two computers. Networks, on the other hand, are capable of handling very large
     numbers of nodes. Networks used to be software-intensive because packets needed
     to be routed to one of many devices on a network. Most of today’s networks use
     hardware-forwarding. Networks also have to adapt “on the fly” to devices being
     added and removed. Fibre Channel was developed to incorporate the best fea-
     tures of both.
         Fibre Channel allows data to be transferred at faster speeds. The base speed of
     Fibre Channel is 1 Gbit/sec. Many devices, however, are running at double speed
     right now, and the 10 Gbit/sec specification is presently in draft form. Another
     advantage of Fibre Channel is that it incorporates the ability to dynamically
     connect large numbers of nodes over a very wide area. Using single-mode fiber,
     elaborate SANs can span many kilometers. This adds the benefit of being able to
     incorporate redundancy for mission-critical applications. Fibre Channel was
     designed to produce redundant, dynamically reconfigurable SANs that would
     provide storage even after a natural disaster had damaged a large portion of the
     infrastructure.
    Fibre Channel is primarily used to transport the SCSI and IP protocols. The
benefits of using Fibre Channel for the mapping of these protocols are increased
speed, connectivity, and longer connection distances. There are three primary
topologies for Fibre Channel devices. The first topology is point-to-point. This
topology is used between two devices. In point-to-point topology, there is no
addressing, since all frames (Fibre Channel packets) are intended for “the other
device.” Device connections to switches are sometimes called point-to-point
connections.
    The second topology is arbitrated loop. In this topology, devices are connected
together in a loop, with the receive fiber coming from an upstream device and
the transmit fiber going to a downstream device. An 8-bit Arbitrated Loop
Physical Address (AL_PA) identifies devices on that loop.
    For specifics regarding FC-AL, you can consult the state diagrams in FC-AL-2
(www.t11.org). This section provides the basics on how the arbitrated loop
protocol works. As a system administrator or end user, you do not need to
understand the protocol in detail. However, it is important to understand the
concepts, because doing so will make diagnosing problems in your SAN faster and
easier.
    The third topology is the switched fabric configuration, which enables you to
connect a large number of devices. A switched fabric topology is sometimes
referred to as a point-to-point topology as well. The switched fabric topology is
easily scalable, allowing devices to be added and removed with little disruption to
the rest of the attached nodes. A switched topology allows more efficient use of
bandwidth by using circuits in the switches to route paths between nodes, as
opposed to arbitrated loop, where there is one path between a set of nodes on
the loop.
    Information is transferred in frames, which contain a header and a payload. The
header contains routing information. It specifies where the frame came from,
what kind of frame it is, and where it is going. Frames start with a Start
Of Frame (SOF) primitive, which indicates the class of service the frame is being
transmitted in, specifying the connection type. The class of service, discussed in
detail later in the chapter, is a set of universal rules for nodes handling frames of
that class. Classes of service handle tasks such as frame acknowledgment and transfer
verification.
    Information transfer in Fibre Channel is analogous to writing a paper. A total
message or idea is broken down into parts. Each part, like a paragraph made up of
sentences, is a collection of related information composed of sequences. A
sequence (sentence) is a collection of frames (words) that fit together logically. In
Fibre Channel there is punctuation around the “words” as well. Frames start with
     an SOF and end with an End Of Frame (EOF). There are multiple fields in a
     frame. The first six words, fields zero through five, are mandatory and are called
     the frame header. The frame header specifies the source of the frame, the destination
     of the frame, and what type of frame it is. The payload contains the data from the
     ULP frame.
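
To make the six-word header concrete, here is a minimal sketch that packs one into
24 bytes. The chapter does not list the individual fields, so the field names and
bit positions below are taken from the Fibre Channel framing standard as commonly
documented and should be checked against that standard before being relied on. The
class name, default values, and example addresses are purely illustrative, and no
range checking is done on the fields.

```python
import struct
from dataclasses import dataclass


@dataclass
class FrameHeader:
    """Six-word (24-byte) frame header; field layout is an assumption (see text)."""
    r_ctl: int           # routing control: what kind of frame this is
    d_id: int            # 24-bit destination address
    s_id: int            # 24-bit source address
    fc_type: int         # ULP carried in the payload (0x08 is SCSI-FCP)
    cs_ctl: int = 0
    f_ctl: int = 0       # 24-bit frame control
    seq_id: int = 0
    df_ctl: int = 0
    seq_cnt: int = 0     # position of this frame within its sequence
    ox_id: int = 0xFFFF
    rx_id: int = 0xFFFF
    parameter: int = 0

    def pack(self) -> bytes:
        """Serialize the header as six big-endian 32-bit words."""
        words = (
            (self.r_ctl << 24) | self.d_id,
            (self.cs_ctl << 24) | self.s_id,
            (self.fc_type << 24) | self.f_ctl,
            (self.seq_id << 24) | (self.df_ctl << 16) | self.seq_cnt,
            (self.ox_id << 16) | self.rx_id,
            self.parameter,
        )
        return struct.pack(">6I", *words)


# Example values are arbitrary; they only show where each field lands.
header = FrameHeader(r_ctl=0x06, d_id=0x010200, s_id=0x010300, fc_type=0x08)
raw = header.pack()
print(len(raw), raw.hex())
```

The destination, source, and type fields correspond to the "where it is going,"
"where it came from," and "what kind of frame" information described above; the
SOF, payload, and EOF would surround this header on the wire.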

     Fibre Channel Levels
     When discussing Fibre Channel, it is usually easiest to break down the
     technology into a number of levels. You can also use the level approach when trying
     to debug a situation where something is broken. You can first determine the
     failure level, and work with the components of that level. In this section, we
     abstract the Fibre Channel protocol by dividing it into five layers and analyzing
     how each layer interacts with the other layers. The layers, designated FC-0 to
     FC-4, include all aspects of the technology, from the physical media to the ULPs
     that are transported on Fibre Channel.
     Figure 2.3 is a commonly used visual aid to help envision how the different
     layers interact.
     Figure 2.3 Fibre Channel Layers

          FC-4   Fibre Channel Upper Level Protocol (ULP) Mappings
          FC-3   Fibre Channel Common Services
          FC-2   Fibre Channel Framing and Flow Control
          FC-1   Fibre Channel Encode and Decode
          FC-0   Fibre Channel Physical Media



         The FC-0 layer is the lowest-level layer; we have already seen most of the
     components of this layer by looking at the different media types and connectors
     involved in creating a Fibre Channel connection. The FC-0 layer specifies how
     light is transmitted over fiber and how transmitters and receivers work for all
     media types. This layer deals with the physics of transmitting and receiving a
     signal at different transfer rates. When there is a problem with a GBIC or fiber
     line, you know you have an FC-0 level problem. Most of the work being done at
     the FC-0 level is electrical engineering work in designing transmitter and
     receiver components.
     The FC-1 layer is the signal encoding and decoding layer. When you consider
the FC-0 and FC-1 layers together, they are generally referred to as the signaling
interface. This layer is responsible for taking the serialized signal and decoding it
into data characters you can use. The FC-1 layer uses 8b/10b encoding. This
means that for each 10 bits sent, you get 8 bits of actual data. The other two bits
are encoding overhead rather than simple parity: the code is chosen to keep the
signal DC-balanced and to guarantee enough transitions for clock recovery. The bits
are encoded into two kinds of characters: K characters and D
characters. In Fibre Channel, all primitives (LIP, SOF, OPN, CLS, IDLE, and so
on) are delimited by a K character, which is often referred to as a special character.
Data characters are used to provide all the other 8-bit values.
     The FC-2 layer is the Fibre Channel protocol level, which is responsible for
framing and flow control. The protocol level operates on primitives that are
encoded at the FC-1 level as a special character followed by three data
characters (D characters). Primitives drive the state machines that control things like
arbitration, loop initialization, and data-carrying frames. The FC-2 layer is where
the firmware embedded on chips in your Fibre Channel devices is active. The
FC-2 layer controls the flow of data by sending the correct primitives to initiate
transfers. Although the FC-2 layer sends the frames, the payload of the frames is
not part of the FC-2 layer. The FC-2 layer is responsible for correctly filling in
the frame headers that are used to route the frames.
     The FC-3 layer is the Fibre Channel common services layer. An example of
common services is the Name Server, which provides to requestors the addresses
of other fabric-connected devices. Fabric servers are necessary to provide
centralized resources to all attached nodes. Although there might be server agents on
each individual switch, all the Name Server agents share their information
through a switch protocol, which makes the Name Server on each switch
identical to every other Name Server. This creates the illusion of a single Name
Server. This concept is called distributed fabric services, and the same theory is
applied to all servers, like the Time Server, which synchronizes with all of the
other switches as well. The behavior of all servers that can be implemented in a
distributed fabric is specified in FC-GS-3, a Fibre Channel standard that specifies
generic services.
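
The idea of distributed fabric services can be sketched as a toy model in which
every switch keeps a full replica of the Name Server database. This is only a
conceptual illustration: the class, method, and device names are hypothetical, and
the direct dictionary update stands in for the real inter-switch protocol, which
this sketch does not attempt to reproduce.

```python
class Switch:
    """Toy switch whose Name Server agent keeps a full replica of the database."""

    def __init__(self, name):
        self.name = name
        self.peers = []        # other switches in the same fabric
        self.name_server = {}  # device name -> fabric address

    def register(self, device, address):
        """Register a locally attached device and share the entry with every peer."""
        self.name_server[device] = address
        for peer in self.peers:
            peer.name_server[device] = address

    def lookup(self, device):
        """Answer a query from the local replica; None if the device is unknown."""
        return self.name_server.get(device)


def build_fabric(*names):
    """Create switches and make each one a peer of all the others."""
    switches = [Switch(n) for n in names]
    for sw in switches:
        sw.peers = [p for p in switches if p is not sw]
    return switches


sw1, sw2, sw3 = build_fabric("sw1", "sw2", "sw3")
sw1.register("disk_a", 0x010200)  # device attached to sw1
print(hex(sw3.lookup("disk_a")))  # a host on sw3 sees the same entry
```

Because every replica holds the same entries, a node gets the same answer no matter
which switch it is attached to, which is the "single Name Server" illusion described
above.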
     FC-4 is the Fibre Channel ULP mappings layer. This layer specifies how
ULPs like SCSI, IP, HiPPI, IPI-3, and ATM can be carried over a Fibre Channel
conduit. The most commonly transported protocol is SCSI. SCSI Fibre Channel
Protocol (SCSI-FCP) is the standard that specifies how to encapsulate SCSI
commands and data in the Fibre Channel protocol. The FC-4 layer is responsible for
making sure that the ULP data or commands get broken down appropriately and
packaged correctly in the Fibre Channel frames. The frame is then passed down to the
     FC-2 layer where the node might query an FC-3 server to obtain the destination
     address for the frame. The FC-2 layer then adds this information into a header on
     the frame and sends the frame to the FC-1 encoder, which converts the frame into
     bits that can be sent over the physical wire at the FC-0 level.

     ULPs
     The FC-4 layer specifies the mapping of different ULPs to Fibre Channel. To
     recap, ULPs are the protocols that can be transferred over Fibre Channel. A wide
     variety of protocols can be transported over Fibre Channel. The advantage for
     most network protocols of mapping over Fibre Channel is increased speed. For
     channel protocols, the advantage is added scalability and dynamic reconfiguration.
     The following are some specifics on protocols and their mappings on Fibre
     Channel:
          •   Small Computer System Interface (SCSI): The most widely used
              ULP on Fibre Channel networks. SCSI is a parallel interface standard
              capable of speeds up to 80 MB/sec. SCSI devices can be chained
              together to create a channel with multiple nodes. SCSI gains speed from
              Fibre Channel since Fibre Channel operates at a base speed of 100
              MB/sec. FCP is the name of the FC-4 protocol for SCSI.
          •   Internet Protocol (IP): IP is a standard networking protocol. IP over
              Fibre Channel is used for different reasons than Gigabit Ethernet, and
              has many uses, such as offloading backup traffic, in-band access for
              managing devices, and so on. Fibre Channel allows faster transfer speeds
              than most Ethernet technology.
          •   Virtual Interface (VI): VI is a standard protocol defined for low-level
              clustering communications over Fibre Channel. This protocol is used by
              distributed databases, file systems, and other clustering applications to
              efficiently transfer cluster information over a network between hosts.
          •   Intelligent Peripheral Interface (IPI): IPI is an ANSI-defined
              standardized protocol for controlling peripherals from a host computer.
              IPI-3 is the level-three part of IPI that deals with packetized
              communication between a host and a peripheral device.
          •   High-Performance Parallel Interface (HiPPI): HiPPI is a channel
              used to transfer large amounts of data at 800 Mbit/sec or more to
              supercomputers that have the processing power to use that much data at such
         a fast rate. The data can be read either between a file system and a
         processor, or between memories on separate systems to create parallel
         machines. Fibre Channel is a good conduit in the point-to-point
         topology since it can provide the speed at the FC-0 and FC-1 levels to
         achieve the goals of HiPPI.
     •   Fiber Distributed Data Interface (FDDI): FDDI is one of the first
         protocols developed for fiber-optic technology. FDDI networks use
         token passing and support transfer rates of up to 100 Mbit/sec. FDDI
         networks were typically used as backbones for WANs, but are now being
         commonly replaced by Fibre Channel and high-speed Ethernet
         technologies.


Classes of Service
Classes of service specify what mechanisms will be used for the transmission of
data. Different classes of service are used for different types of data. Flow control is
one mechanism that is specified by class of service. End-to-end flow control is when a
receiving port transfers a frame back to the sender to verify receipt of the
transmitted frame. When the transmitter receives the acknowledge frame (ACK) back, it
is allowed to increase its credit by one, so it can send another frame. Buffer-to-buffer
flow control is used between fabric ports and node ports, or between two node ports,
to indicate the maximum number of frames the device can receive. A Receiver Ready
(R_RDY) primitive signal is sent on the link to indicate that the device can receive
a frame. If a certain number of R_RDYs are sent, it indicates that the device has
enough buffer space to accept that number of frames. In addition to flow control,
classes of service also specify whether the connection is dedicated. In a
connection-type transfer, you cannot send frames that are not addressed to the dedicated
receiver. In addition, you cannot send frames in a class other than the connection
class. This guarantees that the connection can utilize the full bandwidth.
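
The credit mechanism behind buffer-to-buffer flow control can be sketched from the
sender's point of view. This is a simplified model, not an implementation of the
standard: the class and method names are hypothetical, and the idea that the sender
starts with a known number of credits (one per receive buffer advertised by the
other port) is an assumption not spelled out in the text above.

```python
class BufferToBufferLink:
    """Toy sender-side view of buffer-to-buffer flow control."""

    def __init__(self, receiver_buffers: int):
        # Assumed starting credit: one per buffer the receiver advertised.
        self.credit = receiver_buffers
        self.sent = 0

    def send_frame(self) -> bool:
        """Send one frame if credit is available; return False if we must wait."""
        if self.credit == 0:
            return False
        self.credit -= 1
        self.sent += 1
        return True

    def receive_r_rdy(self) -> None:
        """An R_RDY means the receiver freed one buffer, so restore one credit."""
        self.credit += 1


link = BufferToBufferLink(receiver_buffers=2)
print(link.send_frame())  # first frame uses one credit
print(link.send_frame())  # second frame uses the last credit
print(link.send_frame())  # no credit left, sender must wait
link.receive_r_rdy()      # receiver signals a free buffer
print(link.send_frame())  # sending can resume
```

End-to-end flow control follows the same counting pattern, except that credit is
restored by an ACK frame from the destination port rather than by an R_RDY from the
neighboring port.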

Class 1
Class 1 service is a dedicated connection class between a transmitter and a
receiver. Class 1 connections emulate the features of channel protocols. All frames
sent are acknowledged, meaning that an ACK frame is sent back for every frame
transmitted. The connection is dedicated, which means that the communicating
devices use the full bandwidth of the connection. No other devices can
communicate with the connected devices as long as the Class 1 connection is open.
          In a Class 1 connection, since the connection is dedicated, you are assured that
     frames will arrive in the order in which they were sent. Only end-to-end flow
     control is used in a Class 1 connection. Class 1 connections are used for time-
     critical applications and for transferring streaming data such as sound or video.
          Intermix is an optional service of devices that support Class 1 connections.
     Intermix allows the transmitter to send Class 2 and Class 3 frames when there are
     no Class 1 frames to be transmitted, allowing the device to use the bandwidth
     more efficiently. The Class 2 or Class 3 frames cannot be sent to the same device
     with which the Class 1 connection is established.

     Class 2
     Class 2 service provides a connectionless conduit between two ports. This means
     that the frame is transferred to the switch, and the switch is responsible for
     attempting delivery at the switch’s earliest convenience. Class 2 allows devices to
     share all the available bandwidth. Class 2 frames use both buffer-to-buffer and
     end-to-end flow control, so the transmitter will receive either a positive or
     negative acknowledgment for each frame. This is an acknowledged class of service.

     Class 3
     Class 3 service is similar to Class 2, except there is no end-to-end confirmation
     of the data transfer. This is the preferred class of service for SCSI, and therefore is
     the class of service most often used in transfers over a SAN. Class 3 uses
     buffer-to-buffer flow control, which is controlled at the FC-2 level using the R_RDY
     primitive. Class 3 allows devices to share the bandwidth of the SAN: devices can
     operate at full speed when there is little traffic, but the
     bandwidth is shared when there is heavy traffic. It is ideal for distributed
     storage solutions like SANs.

     Class 4
     Class 4 service is less common and is most similar to Class 1. In Class 4
     service, the bandwidth is divided into Virtual Circuits (VCs). For this reason, Class
     4 is known as a fractional bandwidth class of service. Within a VC, the bandwidth
     allocated is guaranteed. A node can divide the bandwidth into a number of VCs
     that share the connection. VCs can be established with a number of other ports.
     In a Class 4 connection, both buffer-to-buffer and end-to-end flow control are
     used. Frame ordering within VCs is guaranteed. For Class 4, intermix is required
     with Class 2 and Class 3 frames.

Class F
Class F service is used for internal control and coordination of the fabric. Class F
frames can be sent only between switches, so all devices are instructed to ignore
them. Switches use Class F frames to coordinate services like the Name Server
and resolve transmission hierarchy.
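
The properties of the classes described in this section can be collected into a
small lookup table. The structure and names below are illustrative only. Values are
taken from the descriptions above; where the text does not state a property (as for
Class F), the entry is left as None rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceClass:
    name: str
    connection_oriented: Optional[bool]  # None = not stated in this section
    end_to_end: Optional[bool]           # ACK-based flow control
    buffer_to_buffer: Optional[bool]     # R_RDY-based flow control


CLASSES = {
    "1": ServiceClass("Class 1", True, True, False),
    "2": ServiceClass("Class 2", False, True, True),
    "3": ServiceClass("Class 3", False, False, True),
    "4": ServiceClass("Class 4", True, True, True),
    "F": ServiceClass("Class F", None, None, None),
}


def acknowledged_classes():
    """Names of the classes in which the sender gets an ACK for each frame."""
    return [c.name for c in CLASSES.values() if c.end_to_end]


print(acknowledged_classes())
```

A table like this is handy when reading a trace: the SOF primitive identifies the
class, and the class tells you whether to expect ACK frames, R_RDY primitives, or
both.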

Storage Network Topologies
As mentioned earlier, there are three primary types of topologies in Fibre
Channel: point-to-point, arbitrated loop, and switched fabric. Each topology is useful
for different purposes, and the topologies can be combined to form a SAN
specific to the solution you need. Point-to-point connections limit you to two
devices, so you will generally use a point-to-point connection only when you
have two systems that need to talk to only each other at high transfer rates, or if
one of the devices is acting as a switch or bridge. The arbitrated loop topology is
useful for connecting many devices, but because only one device can win arbitration
and transmit at a time, it severely limits your bandwidth. In any large SAN, you will
need to use a switched topology as the backbone of your connection. There might be
one or many switches that form your fabric. However, just because your SAN
incorporates a fabric topology does not mean that the other topologies cannot be
integrated into the SAN as well. A fabric port can easily be connected as part of an
arbitrated loop. Point-to-point connections are established when a single device
is plugged into a switch port. Figure 2.4 is a diagram of a fictional SAN that
incorporates all the different topology features.

Point-to-Point Topology
Point-to-point connections in Fibre Channel are limited to a few specific
situations. The primary use of the point-to-point topology is to connect devices
directly to switches or other bridge devices. Rarely would a target device and
initiating device be connected in a point-to-point topology. This is generally a
waste of resources; since Fibre Channel components are faster than all disks
manufactured today, it is not likely that the disk being attached could fully
utilize the bandwidth provided to it. Furthermore, rarely would a single host
system require data streamed at that rate or be able to process it. There are
exceptions, particularly if Fibre Channel is being used to build parallel systems, as
in the case where HiPPI would be the ULP used for memory sharing.

     Figure 2.4 Fibre Channel Topologies
     [Figure: a fabric of four switches joined by Inter-Switch Links (ISLs). Labeled
     elements: RAID arrays, JBODs, hosts, a workstation, an IP/Fibre Channel bridge,
     a WAN, an FC/SCSI bridge, a SCSI tape library, and an arbitrated loop.]

          In a point-to-point topology there is no addressing, since any data transmitted
     is intended for the other device. A point-to-point topology can be set up by con-
     necting device A’s transmit fiber into device B’s receive connector, and vice versa
     (Figure 2.5). Point-to-point connections have a very simple initialization routine,
since no address assignment needs to be resolved. Point-to-point connections also
have the advantage of allowing the two devices the entire bandwidth of the line
at all times.
Figure 2.5 Point-to-Point Topology
[Figure: a Fibre Channel device and a storage device joined by two fibers. Each
device's Transmit (tx) connects to the other device's Receive (rx).]


Fibre Channel Arbitrated Loop (FC-AL) Topology
The arbitrated loop topology is a configuration used to connect up to 127
devices without a switch. Devices in an arbitrated loop are connected in a ring
formation. The transmit fiber of the upstream device goes into the receive port of
the downstream device. This is repeated around the loop until the first device
receives the transmit fiber of the last device (Figure 2.6).
Figure 2.6 Arbitrated Loop Topology
[Figure: seven FC-AL disks and one host connected in a ring.]

         The arbitrated loop initialization process discussed earlier is complicated. The
     complexity is necessary in order to fairly assign all devices in the loop an AL_PA.
     The initialization state machines are difficult to implement and because of this,
     interoperability among different vendors’ products is a major issue in arbitrated
     loop topology. By arbitrating for control of the loop, the arbitrated loop
     configuration allows all devices connected on the loop to share the bandwidth of a
     single line. For this reason it is important to put on a loop only devices that can
     afford to suffer the performance degradation that sharing bandwidth entails. You
     might consider a looped topology if the devices on the loop would rarely be accessed
     concurrently. It is generally best to place a small group of storage devices on a
     loop rather than hosts, which constantly require access to the loop. A loop
     configuration is also good for archiving across many drives that will be accessed
     rarely, but then need to dump large quantities of data across the network, such as
     automated backup systems.

     Switched Fabric Topology
     Fabrics allow you to expand your SAN as need dictates, and they allow thousands
     of devices to be interconnected. The switched fabric topology is easily scalable,
     allowing devices to be added and removed with little disruption to the rest of the
     attached nodes. This is a distinct advantage over an arbitrated loop topology,
     which requires a reinitialization of all nodes every time a node is added. Imagine
     how unstable a network with hundreds of nodes would be if all devices were
     reset every time a device was inserted.
          Switches also allow more efficient use of bandwidth by using circuits in the
     switches to route paths between nodes. In this way, many transfers can occur at
     once using full bandwidth. This is also an advantage over the arbitrated loop
     topology. Switches have two types of ports for attaching devices: F_Ports and
     FL_Ports. FL_Ports are fabric loop ports; arbitrated loops are attached to these
     ports. F_Ports are fabric ports to which point-to-point connections are established.
     Switches also have E_Ports, which are used to connect to other switches. E_Ports
     communicate in Class F frames to distribute information about the different servers
     and to set up circuits for the passing of frames to the appropriate nodes over the fabric.
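
The bandwidth argument can be made concrete with a little arithmetic. The sketch below is an idealized model, not a measurement: it assumes the nominal 100 MB/sec link rate quoted in this chapter, a non-blocking switch, and transfers between distinct pairs of devices. The function names are our own.

```python
LINK_MB_PER_SEC = 100  # nominal Fibre Channel link rate (assumed for illustration)


def loop_aggregate_mb_per_sec(active_pairs: int) -> int:
    """On an arbitrated loop, only one pair owns the loop at a time,
    so the aggregate never exceeds one link's bandwidth."""
    return LINK_MB_PER_SEC if active_pairs > 0 else 0


def fabric_aggregate_mb_per_sec(active_pairs: int) -> int:
    """In a switched fabric, each distinct pair gets its own circuit,
    so the idealized aggregate grows with the number of pairs."""
    return active_pairs * LINK_MB_PER_SEC


for pairs in (1, 4, 8):
    print(pairs, loop_aggregate_mb_per_sec(pairs), fabric_aggregate_mb_per_sec(pairs))
```

Real fabrics fall short of this upper bound when several initiators compete for the same target port or the same inter-switch link.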
         A fabric provides a way for devices to communicate with each other over
     long distances. In order to find a port, the fabric needs an address for
     identification. Nodes attached to a fabric receive a 24-bit address. The address has the
     format XXYYZZ and is carried in the Destination ID (D_ID) of frames
     intended for the device and in the Source ID (S_ID) of frames sent from the
     device. XX is the domain. This two-digit hexadecimal number refers to the physical
switch itself. YY is the area, which corresponds to the port on the switch to
which the device is attached. ZZ is the AL_PA, the address assigned to the device
during the arbitrated loop initialization process. This ZZ value is set to 0x00 for
point-to-point connections between the switch and the edge device. See Figure 2.7
for an illustration of switched fabric addressing. Brocade addressing is discussed
further in Chapter 7.
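
As a minimal sketch of the XXYYZZ layout just described, the following Python splits a 24-bit fabric address into its domain, area, and AL_PA bytes. The type and function names are our own, chosen for illustration; the two sample values are the host address 011e00 and the loop disk address 0514ef from Figure 2.7.

```python
from typing import NamedTuple


class FabricAddress(NamedTuple):
    domain: int  # XX: identifies the switch
    area: int    # YY: identifies the port on that switch
    al_pa: int   # ZZ: loop address, 0x00 for a point-to-point device


def parse_fabric_address(address: int) -> FabricAddress:
    """Split a 24-bit fabric address (XXYYZZ) into its three bytes."""
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("fabric addresses are 24 bits wide")
    return FabricAddress(
        domain=(address >> 16) & 0xFF,
        area=(address >> 8) & 0xFF,
        al_pa=address & 0xFF,
    )


host = parse_fabric_address(0x011E00)   # point-to-point device: al_pa is 0x00
disk = parse_fabric_address(0x0514EF)   # loop device: al_pa is 0xEF
print(host)
print(disk)
```

The zero AL_PA byte of the host address is what marks it as a point-to-point device, while the disk keeps the AL_PA it obtained during loop initialization.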
Figure 2.7 Switched Fabric Addressing
[Figure: a switched fabric showing F_Port, E_Port, and FL_Port connections. Labeled devices and their addresses: Host 011e00, Storage Array 081600, Host 0a1602, FC-AL Disk 0514ef.]

Fabric Services
The switches that form a fabric save information about the devices connected
directly to them in databases. The switches also provide services for notifying
devices of changes on the fabric that affect the way the device functions.
          Switches distribute this information among themselves through Class F
     service frames. Switches exchange information in their servers so that all individual
     switch servers contain the same information. This creates a singular fabric entity
     and makes it appear that there is only one of each type of server. For instance, the
     Name Server information is shared among all Name Servers on all attached
     switches. This creates a distributed Name Server that has information about all
     devices on the fabric. By distributing the servers, the switch structure becomes
     transparent to attached nodes.
          There are a number of fabric services defined in FC-GS-3 (Generic
     Services). The Alias Server manages aliases for multicast groups and hunt groups.
     The Time Server distributes time information for setting timers and expiration
     times. The Key Distribution Server provides encryption keys for secure connections
     between two nodes. In this section, we cover in detail the Login Server, Name
     Server, Fabric/Switch Controller, Management Server, and Time Server.
          Like nodes, fabric services have addresses, but the address of a fabric service
     is a fixed value called a well-known address. Well-known addresses are reserved by
     the standard.
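
Because well-known addresses are fixed, they can be held in a simple lookup table. The sketch below collects only the four addresses given in this section; the table and function names are our own, and the table is not a complete list of the addresses the standard reserves.

```python
# Well-known addresses named in this section (not an exhaustive list).
WELL_KNOWN_ADDRESSES = {
    0xFFFFFE: "Login Server (fabric port)",
    0xFFFFFD: "Fabric Controller",
    0xFFFFFC: "Name Server (directory services)",
    0xFFFFFB: "Time Server",
}


def describe_destination(d_id: int) -> str:
    """Label a frame's D_ID if it is one of the well-known addresses above."""
    service = WELL_KNOWN_ADDRESSES.get(d_id)
    if service is None:
        return f"{d_id:06X}: not one of the well-known addresses listed here"
    return f"{d_id:06X}: {service}"


print(describe_destination(0xFFFFFC))
print(describe_destination(0x011E00))
```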

     Login Server
     The fabric port is at well-known address FFFFFE. It is sometimes called the
     Login Server because a device is required to send a Fabric Login (FLOGI) frame
     to this port before it can communicate with the rest of the fabric. A port that
     needs to connect to the fabric must log in with this server. The node sends a
     FLOGI frame with the S_ID field filled in only for its AL_PA value. The Login
     Server then sends a response with the D_ID field filled in with the device’s
     AL_PA and newly assigned domain and area values (see the Switched Fabric
     Topology section earlier in this chapter).
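
The address assignment performed at login can be sketched as follows. This models only the arithmetic on the S_ID and D_ID fields described above, not the FLOGI frame itself; the function name and its parameters are hypothetical.

```python
def fabric_login(flogi_s_id: int, domain: int, area: int) -> int:
    """Return the D_ID of the login response for a given FLOGI S_ID.

    The device fills in only the AL_PA byte of its S_ID; the switch
    supplies its own domain and the area of the port the device is on.
    """
    if flogi_s_id & ~0xFF:
        raise ValueError("only the AL_PA byte of the S_ID should be filled in")
    if not (0 <= domain <= 0xFF and 0 <= area <= 0xFF):
        raise ValueError("domain and area are one byte each")
    al_pa = flogi_s_id & 0xFF
    return (domain << 16) | (area << 8) | al_pa


# A loop disk with AL_PA 0xEF on area 0x14 of the switch with domain 0x05:
print(f"{fabric_login(0x0000EF, domain=0x05, area=0x14):06x}")  # 0514ef
```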

     Name Server
     Directory services can be accessed at well-known address FFFFFC. The Name
     Server is the primary feature of directory services. It is a database used to store
     information about devices attached to a fabric. The Name Server gets information
     from a device through the Port Login (PLOGI) frame at initialization and
     through subsequent registration frames. The Name Server acts like a database:
     entries can be looked up, added, or deleted. Nodes transmit request frames to the
     Name Server and receive a response containing the information requested or
     confirmation of the action requested. One of the most common requests is an
     RFT_ID frame. An RFT_ID is a request to Register
FC-4 types (ULPs). A device does this so that the Name Server has a record of
what type of device it is. Often, a host computer will send a request to return the
address of all devices that support a certain type of ULP, such as FCP. This way, a
host can find all the SCSI disks on the fabric. Another common request is a Get
All Next (GA_NXT). This request obtains all information about the node whose
address is the next highest after the specified address. This command is useful for
devices that are trying to map the fabric, such as a fabric management utility, or
for devices that are trying to find appropriate hosts with which to begin transfers.
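
The database-like behavior of the Name Server can be illustrated with a toy in-memory model. This is a sketch of the three operations described above (register FC-4 types, look up ports by type, and get the next entry), not the FC-GS wire format; the class and method names are our own, and wrapping around to the lowest address when no higher one exists is an assumption of the sketch.

```python
from bisect import bisect_right


class ToyNameServer:
    """In-memory stand-in for the Name Server database."""

    def __init__(self):
        self._fc4_types = {}  # fabric address -> set of FC-4 type labels

    def register_fc4_types(self, address, fc4_types):
        """Model of an RFT_ID request: record the ULPs a port supports."""
        self._fc4_types.setdefault(address, set()).update(fc4_types)

    def ports_supporting(self, fc4_type):
        """Return the addresses of all ports registered for one FC-4 type."""
        return sorted(a for a, types in self._fc4_types.items() if fc4_type in types)

    def get_all_next(self, address):
        """Model of GA_NXT: return the entry with the next-highest address.

        Wraps to the lowest address when nothing higher is registered
        (an assumption of this sketch). Returns None for an empty database.
        """
        addresses = sorted(self._fc4_types)
        if not addresses:
            return None
        nxt = addresses[bisect_right(addresses, address) % len(addresses)]
        return nxt, sorted(self._fc4_types[nxt])


ns = ToyNameServer()
ns.register_fc4_types(0x081600, ["FCP"])
ns.register_fc4_types(0x0514EF, ["FCP"])
ns.register_fc4_types(0x011E00, ["IP"])
print([f"{a:06x}" for a in ns.ports_supporting("FCP")])
```

A management utility could map the whole fabric by calling get_all_next repeatedly, starting from address 0, until the first address comes around again.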

Fabric/Switch Controller
The Fabric Controller, at well-known address 0xFFFFFD, provides a state change
notification service, which notifies every registered device when a change in
fabric topology occurs. Devices that use this
service are generally hosts that want to keep track of a number of storage targets.
A device registers for state change notification by transmitting a State Change
Registration frame (SCR) to the well-known address. When there is a change in
fabric topology, the Switch Controller transmits a Registered State Change
Notification (RSCN) frame to the device. The RSCN frame is simply a
notification to the device that there has been a change. It is up to the device to query
the Name Server to assess the state of the fabric at this time.
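
The register-then-notify pattern described here can be sketched in a few lines. The classes below are a toy model with hypothetical names; the point they illustrate is that the notification carries no details, so the host must query the Name Server again after each RSCN.

```python
class ToyFabricController:
    """Keeps a list of registered devices and notifies them of changes."""

    def __init__(self):
        self._registered = []

    def state_change_registration(self, on_rscn):
        """Model of an SCR: remember who wants to be notified."""
        self._registered.append(on_rscn)

    def topology_changed(self):
        """Model of sending an RSCN to every registered device."""
        for notify in self._registered:
            notify()


class ToyHost:
    def __init__(self, query_name_server):
        self._query_name_server = query_name_server
        self.known_targets = []

    def on_rscn(self):
        # The RSCN only says that something changed; ask the Name Server what.
        self.known_targets = self._query_name_server()


targets = ["081600"]
controller = ToyFabricController()
host = ToyHost(lambda: list(targets))
controller.state_change_registration(host.on_rscn)

targets.append("0514ef")       # a new device joins the fabric
controller.topology_changed()  # registered hosts are notified and re-query
print(host.known_targets)
```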

Management Server
The Management Server provides information about the fabric without stipulation
as to zone. A zone is a collection of nodes defined to reside in a zone set.
Multiple zones can be defined. Nodes within a zone are aware of other nodes
within that zone, but not of nodes outside their zone. For instance, a Name
Server query will not return information for nodes outside the requestor’s zone.
The Management Server provides a single access point for managing the fabric
as well as three services. First is the Fabric Configuration Server, which provides
information to management entities trying to discover the fabric topology. The
second service is the Unzoned Name Server, which provides access to Name Server
information for nodes within all zones. The final service is the Fabric Zone Server,
which allows management entities to control zone participation and access
present zone information.
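
The difference between an ordinary (zoned) Name Server query and the Unzoned Name Server can be shown with a small sketch. The zone names, node names, and functions below are hypothetical, chosen only to illustrate the visibility rule described above.

```python
def zoned_query(zones, requestor):
    """Return the nodes a requestor can see: members of its own zones only."""
    visible = set()
    for members in zones.values():
        if requestor in members:
            visible |= members
    visible.discard(requestor)
    return sorted(visible)


def unzoned_query(zones):
    """Return every node in every zone, as a management entity would see it."""
    return sorted(set().union(*zones.values())) if zones else []


zones = {
    "backup_zone": {"host_a", "tape_library"},
    "database_zone": {"host_b", "raid_array"},
}
print(zoned_query(zones, "host_a"))
print(unzoned_query(zones))
```

A node that belongs to more than one zone sees the union of those zones in this model.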

     Time Server
     The Time Server is provided so devices can maintain system time with each
     other. The Time Server is accessed at well-known address identifier FFFFFB. A
     client will send a Get_Time frame to the Time Server, which then responds with
     a Get_Time_Response frame containing the time offset, in seconds.

     Other Services
     Switch manufacturers often provide many other common services, such as the
     Alias Server, which acts like a Name Server to handle the aliases for multicast
     groups. A multicast group is a group of nodes that receives data destined for a
     multicast address. The Alias Server keeps a registry of all nodes belonging to a
     multicast address, and also handles registration and deregistration of nodes from
     multicast groups. The Alias Server is not involved in the routing of frames for
     any group.

Summary
An exciting part of IT, SANs are allowing more users access to more data at faster
rates. The purpose of a SAN is to provide an infrastructure over which large
amounts of data can be transferred robustly between servers and storage devices
such as JBODs and RAID systems. SANs provide three key advantages: speed,
reliability, and scalability.
     Fibre Channel is the primary SAN technology. Currently, the most popular
protocols used over Fibre Channel are FCP and IP. A SAN implemented using
the Fibre Channel protocol incorporates the benefits of a channeled connection
and a network.
     A Fibre Channel SAN is constructed from initiating devices, switches, target
devices, hubs, repeaters, and bridges. A target device is a storage device on a SAN,
and there are many different types of storage devices, including tape drives,
JBODs, RAIDs, and IP targets. An initiating device is a device that actively seeks
out and interacts with target devices on the SAN. Examples are a server or work-
station, and they are often called hosts. Switches create the foundation of your
Fibre Channel SAN and provide a high-speed interconnect for routing frames
from one device to another. Switches provide the linking capability of a SAN
over a wide distance, as well as additional ports for scalability.
     Fibre Channel is most easily understood if it is broken down into its five
layers, which are labeled FC-0 to FC-4. The physical media is the FC-0 layer.
The signaling interface is the FC-1 layer. Fibre Channel transmits in
8b/10b-encoded characters: each 8 bits of information are encoded into a
10-bit transmission character, so for every 10 bits transmitted, 8 bits of
information are received. Four transmission characters make a transmission
word. Primitives and transmission words are at the FC-2 level. Primitives control
the flow of frames on a Fibre Channel link. A fabric provides certain services to
the nodes attached to it. These services are part of the FC-3 layer, and include a
Name Server, Time Server, Login Server, and others. On a fabric, all services are
conceptually distributed, meaning that the same server provides service to all
nodes, independent of direct switch attachment. SCSI data mapped into Fibre
Channel frames is the ULP mapping referred to as the FC-4 layer.
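
The cost of 8b/10b encoding is easy to work out. Taking the 1.0625 Gbit/sec line rate quoted in this chapter, the arithmetic below shows how many data bytes per second survive the encoding; the function name is our own, and framing and protocol overhead are ignored.

```python
LINE_RATE_BITS_PER_SEC = 1_062_500_000  # 1.0625 Gbit/sec on the wire
BITS_PER_TRANSMISSION_CHARACTER = 10
DATA_BITS_PER_CHARACTER = 8
CHARACTERS_PER_WORD = 4


def data_bytes_per_second(line_rate_bits_per_sec: int) -> int:
    """Data bytes carried per second once 8b/10b encoding is removed."""
    characters_per_sec = line_rate_bits_per_sec // BITS_PER_TRANSMISSION_CHARACTER
    return characters_per_sec * DATA_BITS_PER_CHARACTER // 8


print(data_bytes_per_second(LINE_RATE_BITS_PER_SEC))            # 106250000
print(CHARACTERS_PER_WORD * BITS_PER_TRANSMISSION_CHARACTER)    # 40 bits on the wire per word
print(CHARACTERS_PER_WORD * DATA_BITS_PER_CHARACTER)            # 32 data bits per word
```

The result, 106.25 million bytes per second before frame overhead, is consistent with the nominal 100 MB/sec figure usually quoted for Fibre Channel.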
     There are three topology types for a SAN: point-to-point, arbitrated loop, and
switched fabric. Most often, your SAN will contain examples of all three
topologies. In a switched fabric, a point-to-point connection is used to connect a
single node to a switch F_Port. Arbitrated loop is a topology used to connect a number of
     devices and can be connected to a switch through an FL_Port. Devices on an
     arbitrated loop share the bandwidth of one line.
         One or more interconnected switches are called a fabric. Switches distribute
     data about devices connected to them among the entire fabric to provide
     distributed services. The switches that form a fabric save information about the
     devices connected directly to them in databases. They also provide services for
     notifying devices of changes on the fabric that affect the way the device functions.
         Classes of service specify what mechanisms will be used for the transmission
     of data. Different classes of service are used for different types of data. Class 1 ser-
     vice provides a dedicated connection using end-to-end flow control. Class 2 ser-
     vice is connectionless and uses end-to-end flow control. Class 3 is used for SCSI.
     It uses buffer-to-buffer flow control and is connectionless. Class 4 service provides
     fractional bandwidth connections.
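
The classes of service summarized above lend themselves to a small table. The sketch below records only what this summary states about each class; the type and field names are our own, and an empty string marks a property the summary does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    name: str
    connection: str    # "dedicated", "connectionless", or "fractional bandwidth"
    flow_control: str  # "" where the summary does not say


SERVICE_CLASSES = [
    ServiceClass("Class 1", "dedicated", "end-to-end"),
    ServiceClass("Class 2", "connectionless", "end-to-end"),
    ServiceClass("Class 3", "connectionless", "buffer-to-buffer"),
    ServiceClass("Class 4", "fractional bandwidth", ""),
]


def connectionless_classes():
    """Names of the classes that do not set up a connection first."""
    return [c.name for c in SERVICE_CLASSES if c.connection == "connectionless"]


print(connectionless_classes())
```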

     Solutions Fast Track
     The Architecture of SANs
              A Fibre Channel SAN provides the advantages of increased speed, relia-
              bility, and scalability.
              Fibre Channel presently transmits at 1.0625 Gbit/sec over single- and
              multimode optical and copper cabling.
              A SAN implemented using the Fibre Channel protocol incorporates the
              benefits of a channeled connection and a network.
              A SAN is constructed from three primary types of elements: initiating
              devices, switches, and target devices.
              A target device is a storage device on a SAN. Device enclosures like
              tapes, JBODs, or RAIDs are the most common type of target device.
              An initiating device is a device that actively seeks out and interacts with
              target devices on the SAN.
              Switches create the foundation of the Fibre Channel SAN. A group of
              interconnected switches is called a fabric.

Fibre Channel Protocol Basics
     Fibre Channel is primarily used to transport the SCSI and IP protocols.
     Devices are identified by an 8-bit Arbitrated Loop Physical Address
     (AL_PA) in an arbitrated loop topology, and a 24-bit address for
     switched fabric connections.
     Frames start with a primitive Start Of Frame (SOF) and end with an
     End Of Frame (EOF) primitive.
     There are five Fibre Channel layers, designated FC-0 through FC-4.
     The FC-0 layer is the physical media layer and includes the media
     selection and connectors.
     The FC-1 layer is the signal encoding and decoding layer. The FC-1
     layer uses 8b/10b encoding.
     The FC-2 layer is the Fibre Channel protocol layer.
     The FC-3 layer is the Fibre Channel common services layer. The
     services are servers in a Fibre Channel fabric that manage connections
     between devices connected remotely through the switched fabric.
     The FC-4 layer is the Fibre Channel ULP mappings layer.


Classes of Service
     Classes of service specify what mechanisms are required for transmission
     of different types of data.
     Class 1—Acknowledged connection-oriented service.
     Class 2—Acknowledged connectionless service.
     Class 3—Unacknowledged connectionless service.
     Class 4—Fractional bandwidth connection-oriented service.
     Class F—Used for inter-switch communication.

     Storage Network Topologies
              There are three primary types of topologies in Fibre Channel:
              point-to-point, arbitrated loop, and switched fabric.
              The primary use of the point-to-point topology is to connect devices
              directly to switches or other bridge devices.
              The arbitrated loop topology allows up to 127 devices in a ring
              formation to share the bandwidth of a single line without a switch.
              Fabrics allow thousands of devices to be interconnected.
              Switches have three types of ports. FL_Ports are fabric loop ports that
              attach arbitrated loops to the fabric. F_Ports are fabric ports that connect
              single devices to the fabric in a point-to-point topology. E_Ports connect
              a switch to another switch.
              Fabric-attached devices have a three-part address. The first segment
              indicates the physical switch, the second part indicates the physical port, and
              the last part is the arbitrated loop address in a loop device or 0x00 for a
              fabric device.


     Fabric Services
              Switches exchange information in their servers so that all individual
              switch servers contain the same information. This creates distributed
              servers.
              The fabric port is used to log a device into the fabric. The response
              frame from login assigns the device its 24-bit address.
              The Name Server is used as a database to register and store information
              about all devices on the fabric.
              The Fabric Controller at well-known address 0xFFFFFD provides state
              change notification service to registered nodes. State change notification
              is a service that notifies devices when a change in fabric topology
              occurs.
              The Management Server provides information about the fabric without
              stipulation as to zone.

Frequently Asked Questions
The following Frequently Asked Questions, answered by the authors of this book,
are designed to both measure your understanding of the concepts presented in
this chapter and to assist you with real-life implementation of these concepts. To
have your questions about this chapter answered by the author, browse to
www.syngress.com/solutions and click on the “Ask the Author” form.


Q: When should I use a hub rather than a switch?
A: Hubs can be used on small SANs to interconnect devices in an arbitrated
   loop topology or connect two devices directly. Hubs should not be used
   when there is more than one active host, since there will be more
   competition for the limited bandwidth of the loop. The more hosts that are added to
   the loop, the less efficient it is, because a larger percentage of time is spent
   arbitrating for control of the loop than actually transmitting data. Remember
   that a switch creates circuits to maximize bandwidth, while all devices
   plugged into a hub share the bandwidth of one line.

Q: How does Fibre Channel compare to SCSI in terms of performance?
A: SCSI LVD (wide) has a maximum transfer rate of 80 MB/sec, as opposed to
   Fibre Channel’s 100 MB/sec. The advantage of using Fibre Channel over
   SCSI is not entirely speed, however. Fibre Channel allows you the unique
   opportunity to create a switched network with Fibre Channel devices. Not
   only can you attach more devices together, but the performance is actually
   increased as well. Fibre Channel also allows you to use fiber-optic cable as a
   media type, which extends the distance over which you can connect devices
   to as much as 10 km per cable length.

Q: What are some other SAN technologies?
A: Right now, Fibre Channel is by far the most common protocol in the SAN
   marketplace. No other technology has the ability to incorporate the aspects of
   networks and channels in the way Fibre Channel does. In the past, FDDI was
   a popular technology that used a loop configuration similar to Fibre
   Channel’s arbitrated loop. Some emerging technologies in the SAN industry
   are InfiniBand and Gigabit Ethernet.

     Q: How do I know if my new device will interoperate properly with my
         existing SAN?
     A: It is always difficult to know. However, devices are getting dramatically better
         at working across multivendor switched fabrics. The Fibre Channel Industry
         Association (FCIA) has also started an initiative to document common
         interoperability problems and develop testing specification documents to
         determine whether a device contains interoperability bugs. The SANmark Program
         has been active for a little over a year, and devices can now be certified as
         SANmark-compliant. Devices that pass these tests are probably the most
         interoperable devices.
                                      Chapter 3


SAN Components
and Equipment




 Solutions in this chapter:

     •   Overview of Fibre Channel Equipment
     •   Cabling and GBICs
     •   Using Hubs
     •   Using Switches and Fibre Channel Fabrics
     •   Connecting Legacy Devices Into Your SAN
     •   Bridging and Routing to IP Networks and Beyond
     •   Fibre Channel Storage


         Summary

         Solutions Fast Track

         Frequently Asked Questions
60   Chapter 3 • SAN Components and Equipment



     Introduction
     Whether you are building a small or large network, one aspect of a robust Fibre
     Channel deployment is the SAN components used to build your solution. By
     understanding the different components, features of those components, and how
     they are best used, you can plan and deploy a reliable, scalable network. Upfront
     qualification, testing, and selection of equipment are important pieces of making
     your SAN deployment work. Understanding which features you will be using in
     your equipment will help guide your testing process. This chapter discusses the
     different components of a SAN and their major features, and guides you in
     selecting the right equipment for the job.
         Fibre Channel has its own various types of connectors and media, including
     both optical and copper interfaces, and varied ways to connect between the
     different kinds of media. Fibre Channel uses fiber-optic and high-speed copper
     media to bring together the speed and reliability of a channeled technology, with
     the scalability of networking technologies. This is the perfect medium for
     transporting large amounts of data quickly to many different nodes across a network.
         The standardized use of Gigabit Interface Converters (GBICs) has made
     switching between media types simple and easy, and mixed-media networks are
     standard. This chapter explains the different connector and cabling options, and
     how to select the right one for your application. It also covers the kinds of
     network topologies you can implement and why.
         Hubs, switches, Host Bus Adapters (HBAs), and storage make up the
     components of a SAN. Hubs serve as the center of simple Fibre Channel-Arbitrated Loop
     (FC-AL) configurations, and range from simple unmanaged hubs to more intelligent
     managed hubs capable of switching frames between ports but not acting as switched
     fabrics. For more reliable, manageable, and scalable networks, Fibre Channel
     switches are used instead of hubs. Switches scale from as few as eight ports to
     64 or more ports, and form the core of a switched fabric. HBAs serve as the entry
     point into the SAN from your servers and hosts, providing translation of Small
     Computer Systems Interface (SCSI) information from the operating system to
     Fibre Channel addresses on the network. High-capacity storage systems can contain
     petabytes of data and form the core of the data storage infrastructure of your
     storage network. Finally, routers and bridges enable you to move data between
     legacy SCSI components and Fibre Channel, as well as to networks based on
     Gigabit Ethernet, Asynchronous Transfer Mode (ATM), and Dense Wavelength
     Division Multiplexing (DWDM).
         This chapter reviews the hardware components that make up a SAN, explains
     the major features and functionality of each, and describes the tools and techniques
available to manage each piece. We will guide you through the features to look
for on each component and how to use these features.

Overview of Fibre Channel Equipment
Fibre Channel shares much of the same terminology as Ethernet networking
with hubs, switches, network interface cards, and routers all representing a part of
the network infrastructure. However, although the names are often the same, the
way they work is quite different. In the development of the Fibre Channel
industry, many of the terms were borrowed from more familiar networking ter-
minology, even though in actual practice the functionality has changed. For
example, in Ethernet a hub exists mostly as an unmanaged electrical device that
allows multiple Ethernet connections to connect to a single point, with all con-
nections seeing the same network traffic. In Fibre Channel, a hub connects each
port to the one next to it in a circuit. It is important not to confuse the Ethernet
use of the terminology with the Fibre Channel terminology and usage. Figure
3.1 identifies the components of a SAN.

Cabling and Media
Many characteristics of your SAN are determined by the physical wiring plan of
your network. The type of media you select impacts the scalability and
functionality of your SAN. This chapter discusses the options for choosing a physical
medium, including the advantages and disadvantages of different types of fiber-optic
cabling and the choice between fiber and copper cables.
     Types of media we discuss in this chapter include:
     •   Copper (Shielded Twisted Pair [STP])
     •   Multimode optical
     •   Single-mode optical


GBICs and Connectors
The cabling and GBICs section in this chapter is dedicated to familiarizing you
with the different types of physical connectors used to connect devices and ter-
minate cable.Your selection of a fixed connector versus a GBIC affects the capa-
bility of your SAN to adapt to new devices and support legacy connector types.
The choice of connector type can also affect your ability to add to your SAN in
the future. Different types of connectors have different considerations and might
     require special handling to use correctly. Types of connectors we discuss in this
     section include:
          •   DB9 (copper)
          •   HSSDC (copper)
          •   SC (optical)
          •   High-density optical connectors (Small Form Factor Pluggable [SFP]):
              - MT-RJ
              - LC

     Figure 3.1 Different SAN Components in a Network
     [Figure: a sample SAN. Components shown: hosts, HBAs, a hub, four switches, a JBOD, RAID arrays, a DWDM link, an FC-to-ATM bridge, an FC/SCSI bridge, a SCSI tape library, and remote SANs.]

Hubs
In Fibre Channel, hubs serve at a very basic level as electrical connections between
the different ports and are used only in FC-AL configurations. Hubs originally
started out as simple electrical devices, which, if a cable was attached to a port,
completed the electrical loop between ports. If a signal is detected, a hub will
complete the circuit and pass traffic through the attached wire. A simple resilient-
loop circuit is used to make sure that connections are maintained through unused
ports. A hub in this sense is a simplification of cabling that reduces the need for
separate transmit and receive wires to all devices in a system (one of the original
ways to connect Fibre Channel equipment). Intelligent managed hubs provide
basic functionality but add more sophisticated error and fault detection, switched
frames, and additional features for managing loop environments.
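
The way a hub closes the loop through unused ports can be sketched as follows. This is a simplified model under our own assumptions: ports are listed in physical order, an unused port is represented by None, and the function name is hypothetical.

```python
def loop_hops(ports):
    """Return the (transmitter, receiver) hops of the loop a hub forms.

    Unused ports (None) are bypassed, so each attached device transmits
    to the next attached device, and the last one transmits to the first.
    """
    attached = [device for device in ports if device is not None]
    if len(attached) < 2:
        return []  # a loop needs at least two attached devices
    return [
        (attached[i], attached[(i + 1) % len(attached)])
        for i in range(len(attached))
    ]


# A four-port hub with port 2 left empty:
for tx, rx in loop_hops(["host", None, "disk1", "disk2"]):
    print(f"{tx} tx -> {rx} rx")
```

Because every frame travels hop by hop around this ring, all attached devices share the bandwidth of the single loop.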
     Hubs support the use of Fibre Channel to connect up to 127 devices in a
loop. Due to the complexity of looped environments and available bandwidth,
the number of devices is generally significantly less than the maximum of 127.
Also, while loops of 127 devices are theoretically possible but impractical,
loops larger than 127 devices cannot be built because of address space limitations.

Switches
Fibre Channel switches, unlike hubs, are primarily used in Fibre Channel
switched fabric installations. Instead of a loop, where traffic is passed between all
nodes (a shared bandwidth and error segment architecture), the Fibre Channel
fabric routes frames directly between initiators and targets across a
full-bandwidth fabric. This means that each connection across a fabric can exist
independently of every other connection. Switches, which can range from as few as eight
ports to 64 ports or more, contain sophisticated switching hardware used to route
frames from any port to any other port. In addition, switches can also be cascaded
through Inter-Switch Links (E_Ports), which allow fabrics to extend to thousands
of nodes and up to a current limit of 239 switches.
    Switches add the intelligence of fabric services such as name services and
management services, and provide a more robust protocol set for connecting
devices. Switches are used in almost all environments to provide a reliable mecha-
nism for connecting hosts to storage and are a necessity for environments with
multiple initiators or more than just a few devices. Fibre Channel switches are
the foundation upon which the rest of the SAN infrastructure is built.

     Storage
     Fibre Channel storage is a key component of a Fibre Channel network, providing
     the shared storage resources that can be accessed through your SAN. Storage is
     usually the area of focus for your SAN, except in the case where the network is
     being used exclusively for IP or Virtual Interface (VI) traffic.
         Fibre Channel storage ranges from single disk drives that support dual-port
     Fibre Channel connections, to banks of disk drives wired together in a cabinet
     (known as Just a Bunch Of Disks, or JBOD), to more sophisticated Redundant
     Array of Independent Disks (RAID) storage devices with hundreds of gigabytes
     of capacity, and finally to enterprise-level storage subsystems that contain a
     terabyte or more of data.

     Host Bus Adapters
     HBAs are used to connect servers and hosts to the Fibre Channel network.
     They are analogous to Network Interface Cards (NICs); the term host bus adapter
     comes from their use in connecting servers to the SCSI bus. HBAs consist of hardware
     and drivers, which interface with operating systems to represent Fibre Channel
     storage as devices in the operating system. HBAs are the gateway to accessing
     your SAN.
          HBAs typically plug into a host’s bus (for example, PCI or SBus), although
     some are embedded on the motherboard. Either way, they translate signals on the
     local computer to frames on a Fibre Channel network. A key part of this process
     is the driver, which controls your Fibre Channel HBA and determines how the
     device behaves with the operating system as well as the general Fibre Channel
     network. Fibre Channel HBAs tend to be far more intelligent than the standard
     network card, providing for negotiation with switches and tracking devices
     that are attached to the network. Robust software and hardware
     functionality enable these components to offload I/O processing from the host,
     monitor network configurations, and support load balancing and failover
     capabilities.

     Routers and Bridges
     Routers in the Fibre Channel sense do not serve the same purpose as routers in
     the networking world. Instead, Fibre Channel routers act as bridges, translating
     Fibre Channel frames to other types of transports. The most common routers
     translate between Fibre Channel and legacy SCSI connections, representing a SCSI bus as a number
of individual Logical Unit Numbers (LUNs) behind a Fibre Channel port. SCSI-
to-Fibre Channel routers are frequently used to connect SCSI tape devices to
Fibre Channel. It is possible that true Fibre Channel routers will emerge in the
future, which might cause some confusion.
    Other types of bridges include Fibre Channel-to-Gigabit Ethernet bridges,
which typically bridge IP frames from Gigabit Ethernet to Fibre Channel, and
Fibre Channel to DWDM or ATM bridges, which transport full Fibre Channel
frames across DWDM or ATM technology for extended fabrics that span from
several kilometers to several hundred kilometers.

Cabling and GBICs
The most basic layer of your SAN is the physical layer, which includes your media
and connector choices. These choices depend on the primary purpose of your
SAN. As with most technologies, improvements arrive constantly, so we will
highlight the connectors and components that are most popular today.
    You must consider a number of factors as you put together the physical layer
of your SAN. One factor is distance: how far apart are the two points you must
connect? Next, you need to consider your existing architecture: if there is
already fiber or copper that can be used, determine if it is compatible with the
components you would like to add. You also need to consider scalability, so your
SAN will be easily upgradeable to allow devices to be added and removed with
a minimal amount of added materials. The final consideration is cost. What
components are the least expensive, and what are their advantages and
disadvantages? This section provides information on specific media and connectors, so that
you can assemble a cost-effective SAN that is efficient, reliable, and scalable.

Copper Versus Optical: Selecting Your Media
In selecting the type of media to use for your SAN, you have two primary
choices: copper and optical. The distinct advantage of copper is that it is inexpensive
compared to all types of optical. The advantage of optical fiber is that it provides
a reliable signal over a longer distance than copper. The choice between the two
types of optical fiber (multimode and single mode) is also one of distance and
cost. There are no speed differences between any of the media types.

Copper Cabling
Copper has the advantage of being the least expensive media by which to connect
components of your SAN. Copper is generally 150-ohm shielded twisted pair,
     although 75-ohm video cable and mini-coaxial cable are also used. Copper cabling,
     when used at a rate of 100 MB/sec, has an effective range of 0 to 25 meters
     without sacrificing quality. Transmitting at half speed and quarter speed increases
     the effective distance of transmission. However, few companies manufacture half- or
     quarter-speed products. Copper is usually terminated with either an HSSDC or
     DB-9 male connector (on the DB-9 connector, the end with the pins is male).
     Although at this time DB-9 is a more common connector, HSSDC is quickly
     becoming more popular. See Table 3.1 for a comparison of media specifications.

     Table 3.1 Copper Media Type Comparison Chart

     Media Type                          Speed (MB/sec)    Optimal Distance (Meters)
     Shielded Twisted Pair (Active)      100               0–30
     Shielded Twisted Pair (Passive)     100               0–15
     Video Cable                         100               0–25
     Shielded Twisted Pair (Active)      200               0–10
     Video Cable                         200               0–10
     Shielded Twisted Pair (Active)      50                0–40
     Video Cable                         50                0–40

     Values in table are estimated lengths based on optimal signal strength.

          Copper is highly durable and easy to store, which makes it useful for a lab or
     area where devices are commonly plugged and unplugged, or when you are
     constantly connecting and disconnecting a device over a short distance to a number of
     different hosts. An advantage to using the DB-9 and HSSDC copper connectors is
     that there is only one way they fit into the complementary connector, which
     means it is impossible to cross a transmit and receive line, a common mistake for
     even experienced individuals dealing with fiber optics. Optical cabling is harder to
     terminate and can be susceptible to scratches. In addition, copper is a better choice
     for short connections within a cabinet. For lengths longer than five meters,
     single-mode or multimode optical fiber might be a better choice.

     Multimode Optical Cabling
     Multimode optical cable is available in 50 micrometers (µm) and 62.5 µm sizes.
     These measurements correspond to the diameter of the fiber—there is no speed
     difference between the two that affects Fibre Channel. Multimode optical cable is
available in 850 nanometer (nm) and 1300 nm wavelengths. The 850 nm wavelength
lies just beyond the red end of the visible spectrum, although an active
transmitter may appear as a faint red glow. The 1300 nm wavelength is not visible
at all, and 1300 nm lasers could severely damage your retinas. As a precaution,
avoid looking directly into any active transmitter or fiber end. Multimode
optical fiber is terminated using a variety of
optical connectors, including SC, LC, and MT-RJ (we discuss connector types
later in the chapter). A 50 µm multimode fiber has an effective range between 0
and 500 meters at a 1 Gbit/sec rate (see Table 3.2 for specifications on other
multimode fibers). The 62.5 µm fiber has about half the range of 50 µm fiber.
    Multimode fiber is the more common media type and is inexpensive
compared to single-mode fiber, although the two are coming closer together in price
as the demand for single mode rises. Multimode transmitting and receiving
components are also much less expensive, because multimode generally uses a
concentrated LED rather than an actual laser. This is possible because multimode fiber is
much wider in diameter than single-mode fiber.
    Many multimode links support a feature called Open Fiber Control (OFC),
which is a feature of the transmitter and receiver pairs rather than of the fiber itself.
In OFC, the transmitter periodically transmits short bursts of light. When the receiver
at the far end detects this light, its own transmitter begins to transmit regularly,
which causes the first transmitter to go out of OFC
mode. The OFC mechanism was designed to avoid the potential hazards of
having unconnected lasers transmitting in a work environment. OFC is becoming
a less common feature, since most multimode transmitters use LEDs rather than
lasers and there is no associated safety risk.
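    The OFC behavior described above can be pictured as a small state machine. The following is a simplified sketch of that idea only, not an implementation of the actual OFC timing rules: the class name, the tick-based clock, and the burst interval are all assumptions made for illustration.

```python
# Simplified model of the probe-then-transmit idea behind OFC.
# PULSE_INTERVAL and the tick clock are hypothetical, chosen for illustration.
PULSE_INTERVAL = 5  # ticks between short probe bursts


class OfcTransceiver:
    """One end of a link: probes with short bursts until it sees light."""

    def __init__(self) -> None:
        self.continuous = False  # False means probe mode (bursts only)

    def is_emitting(self, tick: int) -> bool:
        # In probe mode the transmitter is lit only on burst ticks.
        return self.continuous or tick % PULSE_INTERVAL == 0

    def observe(self, light_seen: bool) -> None:
        # Light from the far end enables regular transmission;
        # losing that light drops the transmitter back to probe mode.
        self.continuous = light_seen


def step(a: OfcTransceiver, b: OfcTransceiver, tick: int, connected: bool) -> None:
    """Advance both ends by one tick over a connected or open fiber."""
    a_out = a.is_emitting(tick)
    b_out = b.is_emitting(tick)
    a.observe(connected and b_out)
    b.observe(connected and a_out)


if __name__ == "__main__":
    left, right = OfcTransceiver(), OfcTransceiver()
    for tick, connected in [(1, True), (5, True), (6, True), (7, False)]:
        step(left, right, tick, connected)
        print(tick, connected, left.continuous, right.continuous)
```

    In this model an unplugged fiber leaves both ends emitting only brief bursts, which is the safety property the mechanism was designed to provide.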

Table 3.2 Multimode Optical Media Comparison

                           Laser/ LED        Speed          Optimal Distance
Media Type                 Type (nm)         (MB/sec)       (Meters)
50 µm multimode            850               100            2–500
62.5 µm multimode          850               100            2–300
50 µm multimode            850               200            2–300
62.5 µm multimode          850               200            2–90
50 µm multimode            850               50             2–1000
62.5 µm multimode          850               50             2–400

Values in table are estimated lengths based on optimal signal strength.

     Single-Mode Optical Cabling
     Single-mode optical fiber (Figure 3.2) is the most expensive media type, but
     preferable for long distances. It most often comes in 1300 nm wavelength, which
     is not visible and can be harmful to your eyes.

     Figure 3.2 Single-Mode Fiber with SC Terminators




          Single-mode optical fiber is approximately 9 µm in diameter. The small
     diameter makes light waves less likely to be altered over long distances, so for
     long-distance SANs, single-mode fiber is the best solution. Because of its small
     diameter, it also theoretically has the highest transmission speed potential (the
     theoretical limit is around 25 Tbit/sec, as opposed to multimode, which is around 10
     Gbit/sec). Single mode is the ideal media to use for long interconnects.
          Single-mode fiber itself is not significantly more expensive than multimode
     fiber or even copper—the added price is in the transmitting components, which
     use lasers rather than LEDs. Since the fiber has such a small diameter, it takes
     added precision to align the laser in the transmitter with the fiber. See Table 3.3
     for specifications on single-mode fibers.

Table 3.3 Single-Mode Optical Media Comparison

                         Laser/LED        Speed           Optimal Distance
Media Type               Type (nm)        (MB/sec)        (Meters)
9 µm single mode         1300             100             2–10,000
9 µm single mode         1300             50              2–10,000
9 µm single mode         1300             200             2–2,000

Values in table are estimated lengths based on optimal signal strength.
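
    The distance and speed figures in Tables 3.1 through 3.3 can be combined into a simple lookup when shortlisting media for a link. The sketch below copies the table values as given; the function name and the data layout are hypothetical, and the output is only as reliable as the estimated distances in the tables.

```python
# (media type, speed in MB/sec, minimum meters, maximum meters), copied from
# Tables 3.1 through 3.3: copper first, then multimode, then single mode.
MEDIA_TABLE = [
    ("shielded twisted pair (active)", 100, 0, 30),
    ("shielded twisted pair (passive)", 100, 0, 15),
    ("video cable", 100, 0, 25),
    ("shielded twisted pair (active)", 200, 0, 10),
    ("video cable", 200, 0, 10),
    ("shielded twisted pair (active)", 50, 0, 40),
    ("video cable", 50, 0, 40),
    ("50 um multimode", 100, 2, 500),
    ("62.5 um multimode", 100, 2, 300),
    ("50 um multimode", 200, 2, 300),
    ("62.5 um multimode", 200, 2, 90),
    ("50 um multimode", 50, 2, 1000),
    ("62.5 um multimode", 50, 2, 400),
    ("9 um single mode", 100, 2, 10000),
    ("9 um single mode", 50, 2, 10000),
    ("9 um single mode", 200, 2, 2000),
]


def candidate_media(distance_m: float, speed_mb_per_sec: int) -> list[str]:
    """List media whose tabulated range covers the distance at this speed."""
    return [
        name
        for name, speed, low, high in MEDIA_TABLE
        if speed == speed_mb_per_sec and low <= distance_m <= high
    ]


if __name__ == "__main__":
    print(candidate_media(20, 100))
    print(candidate_media(400, 100))
```

    Because the table lists copper before optical media, the first entry returned is usually the least expensive option that still covers the distance.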



WARNING
     Any single-mode or multimode laser can damage your eyes if it
     transmits at 1300 nm. The 1300 nm wavelength is not in the visible
     spectrum, so you will not see the light as you might with 850 nm fiber. A
     1300 nm laser is dangerous, because it can cause severe retina damage.




Connecting with Connectors
There are many different types of connectors, and no particular connector makes
a difference in performance as long as the connection is clean. Some connectors
are bonded, which means that the transmit and receive fibers are physically
mounted in the same piece of plastic. This is usually acceptable, but for some less
orthodox wiring systems it might be preferable to select connectors that have
loose transmit and receive fibers. This section covers the most well-known types
of connectors.
    You should try to minimize the total number of connections and patches
when building your SAN. As discussed earlier, light is reflected by poor connec-
tions and patches in the path between devices, so minimizing the number of
patches between devices makes your SAN less susceptible to loss-of-signal errors.

The DB-9 Copper Connector
DB-9 is the standard copper connector, although more organizations are switching
to HSSDC because of its improved reliability and smaller size. DB-9 has the same
appearance as DB-9 serial cabling, so it is important to understand that they are
not the same (Figure 3.3). DB-9 connectors have a metal D-shaped connector rim
     with 9-pin sockets on the female end and either four or eight pins on the male
     end, rather than all nine used in serial cabling. Currently, DB-9 terminated cable is
     less expensive than HSSDC terminated cable.
     Figure 3.3 DB-9 Copper Connector




         Copper cabling is available in two types: passive copper and active copper. Passive
     DB-9 has four pins (two for transmitting and two for receiving) and, like HSSDC,
     is used to terminate shielded twisted pair. In active copper cabling, four pins of a
     DB-9 connector are used to transmit power in addition to the two pairs that are
     used for transmit and receive. Active copper lines get twice the distance of passive
     copper lines. Both active and passive type DB-9 connectors are equally priced.
         Again, it is important when purchasing DB-9 cables not to confuse the
     connector with DB-9 serial cabling. The resistance of the two is not the same,
     and using the wrong cable can severely damage your equipment.

     The HSSDC Copper Connector
     The HSSDC, shown in Figure 3.4, is starting to replace DB-9 connectors on
     some HBAs. The most probable reason is that HSSDC connectors are smaller than DB-9
     connectors, so more can fit on a single interface card. The HSSDC connector
     uses a single plastic squeeze lock, so it is easy to insert and remove. The HSSDC
     connector was specifically designed as a Gigabit copper connector, improving
     on the density and performance of the DB-9 style connector.

Figure 3.4 HSSDC Copper Connector




   Looking Forward
   The InfiniBand architecture, a widely supported and fast-moving effort,
   uses a type of copper connector called HSSDC-2, as well as the same
   types of cabling media as Fibre Channel. The InfiniBand protocol is
   designed as a replacement for the PCI bus on server systems that need
   greater width for I/O. The goal is to avoid the impact on the bus of a
   server by directing data to the appropriate channel in the server.
   InfiniBand will be able to encapsulate many protocols, including FCP, in
   order to transfer Fibre Channel or other ULPs to the appropriate
   adapters; however, this should be transparent to the SAN architecture.
   With devices already coming to market, the ability to include InfiniBand
   servers on your SAN might be a consideration for expansion.



The SC Optical Connector
The SC connector, shown in Figure 3.5, is probably the most widely used optical
connector. The SC connector has been commonly used to replace the ST
connector, which at one time was widely used with legacy fiber technologies. The
SC connector is a square plastic block containing a glass housing for the fiber.
The plastic fits snugly in the connector slot on the board or GBIC. SC
connectors are used to terminate single-mode or multimode fiber.
    SC connectors can be either bonded or unbonded, and come in either single
or duplex quantities. A single-quantity SC patch cord is a piece of single-mode or
multimode fiber terminated at both ends with one SC connector. A duplex
quantity is a pair of fibers, one for transmitting and one for receiving. The plastic
insulation on the fiber is molded together to provide a transmit/receive pair.
There are a total of four SC connectors, two at each end of the fiber. The SC
     connectors on duplex fiber might be bonded together, meaning that the SC
     connector pairs are made of one piece of (or bonded) plastic. Bonded connectors
     have the advantage of making it harder to cross the transmit and receive
     fibers when plugging into the GBIC or port.

     Figure 3.5 Unbonded and Bonded SC Optical Connectors




     High-Density Fiber-Optic Connectors
     High-density connectors represent the next generation of fiber-optic connectors.
     High-density connectors are designed to be small to allow more connections in a
     small space, which might be the back of a PCI adapter card or the faceplate of a
     hub or switch. As the Fibre Channel protocol develops, you can connect more
     nodes reliably, and you will need to have space to make those physical
     connections. The most popular types of high-density connectors are the LC and MT-RJ
     connectors. Neither type uses any new optical technology. In fact, the connectors
     use the same multimode and single-mode fiber that SC connectors use. The
     difference is entirely in the piece of plastic in which the fibers are housed. HSSDC
     copper connectors, discussed earlier in this section, are also high-density
     connectors. The HSSDC connectors are designed to accommodate more copper ports
     on HBAs and switch and hub faceplates.
         LC connectors are bonded pairs of miniature connectors. The design of the
     plastic pieces looks like a small, elongated version of the SC connector. However,
     the LC pair is comparable to the width of a single SC connector. The MT-RJ
     uses a single terminator for pairs of fibers. The plastic design is similar to a
     miniaturized HSSDC without the copper contacts on top of the housing. Instead, the
fiber transmits and receives out of two pinholes in the end of the plastic housing.
MT-RJ connectors are also about the width of a single SC connector.

Comparing GBICs to Fixed Media
GBICs are removable transceivers used in all types of Fibre Channel devices,
including switches, hubs, and HBAs. They are used widely in Fibre Channel and
other network technologies. GBICs offer the option of interfacing with almost all
types of connectors. A GBIC fits into a GBIC port on a device. A large percentage
of Fibre Channel devices have a GBIC slot rather than a fixed media type port.
GBICs convert the electrical signal generated from the device into the appropriate
signal for transmission, depending on the type of connection the GBIC was
designed to make. GBICs can convert the device’s electrical signal to a signal that
is appropriate for single-mode fiber, copper (either HSSDC or DB-9), and multi-
mode fiber. GBICs have a variety of connector types (Figure 3.6) and can be used
to make your SAN connection types homogeneous.


Figure 3.6 SC, HSSDC, and DB-9 GBICs




Using a GBIC
Many devices have GBIC slots. Vendors provide them to make devices easier to
connect to a variety of media. The GBIC should slide easily into the GBIC slot.
     GBIC slots generally have a trap door, which flips up on insertion of the GBIC.
     The GBIC should be inserted with the single socket pointed in and the connec-
     tion end facing out. (If you meet with any resistance, do not force the GBIC,
     since it is most likely upside down.)
         GBICs come in a variety of types, including multimode, single mode,
     HSSDC, and DB-9. It is important to remember to connect the right type of
     media. For instance, multimode fiber will not work in a single-mode GBIC. You
     will most likely need to use GBICs in a switch or hub.

     Pros and Cons of Using GBICs
     The advantage to using GBICs is the versatility they give devices to work within
     a topology. With a range of GBICs, a device can be attached to any SAN. Using
     GBICs, however, breaks a cardinal rule of networking, which is to minimize the
     number of connections. Although GBICs, when working properly, do not
     degrade signal, including a GBIC in the connection introduces another element
     that can malfunction. Although the newer GBICs are highly dependable, they tend
     to be used over and over in different slots. Since their connectors go through so many
     insertions and removals, they tend to break more quickly than a fixed connection
     would. When GBICs are used frequently over a long period of time, they become
     less dependable. GBICs are also expensive, and they are second only to fiber as the
     most often replaced piece of a SAN.

     GBIC Ports on Equipment
     Equipment might or might not have a GBIC port. Without a GBIC port, you are
     limited to the type of connection on board, whereas if you have a GBIC port,
     you have additional options. Switches and hubs almost always have GBIC ports.
     HBAs and storage devices often have fixed media.
         GBIC ports are becoming more common on devices. This allows you to
     choose your media type based on the location of your device and other device
     specifications. With fixed media, however, the vendor has already decided that a
     particular type of port is appropriate, which might limit your options as far as the distance and
     speed at which the device can be connected. Another drawback to a fixed port is
     that a failure in that port requires the unit to be replaced.

     Serialized Versus Nonserialized
     Serial ID GBICs provide serial number, model numbers, and diagnostic data
     through embedded Electrically Erasable Programmable Read-Only Memory
(EEPROM). This allows for better asset tracking and diagnosis of GBIC
problems. This feature is used by SAN management tools and is the only way to
see if you have a mixture of GBICs in your fabric. Serial ID is a very important
feature if you want to use Brocade Fabric Watch to monitor GBICs. See Chapter
4, “Overview of Brocade SilkWorm Switches and Features,” for further discussion
about Fabric Watch.
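    To illustrate how serial ID data supports asset tracking, the sketch below groups a hypothetical GBIC inventory by model and reports whether the fabric contains a mixture. The record fields, model strings, and function names are assumptions for illustration; this is not the EEPROM layout or the interface of Fabric Watch or any other management tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GbicRecord:
    """Hypothetical inventory entry built from a GBIC's serial ID data."""
    port: int
    serial_number: str
    model: str


def models_in_use(records: list[GbicRecord]) -> Counter:
    """Count how many GBICs of each model appear in the inventory."""
    return Counter(record.model for record in records)


def is_mixed(records: list[GbicRecord]) -> bool:
    """True if more than one GBIC model is present."""
    return len(models_in_use(records)) > 1


if __name__ == "__main__":
    inventory = [
        GbicRecord(port=0, serial_number="A1001", model="SW-MULTIMODE"),
        GbicRecord(port=1, serial_number="A1002", model="SW-MULTIMODE"),
        GbicRecord(port=2, serial_number="B2001", model="LW-SINGLEMODE"),
    ]
    print(models_in_use(inventory), is_mixed(inventory))
```

    Without serial ID data, none of these fields would be available to software, which is why nonserialized GBICs are harder to track.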

Common Problems with GBICs
GBICs are highly reliable devices but, as mentioned earlier, their pluggable
functionality causes them to break more frequently than other components. If you are
careful and never force a GBIC, it should last a long time. When trying to
diagnose a connection problem with a device, start by making sure your transmit and
receive wires are not crossed before deciding to replace the GBIC.
    Another common problem with GBICs is finding appropriate GBICs for the
media type you are using. Use only single-mode GBICs with single-mode
devices, and only multimode GBICs with multimode devices. It is a common
mistake to plug a fiber into an already-inserted GBIC and assume it is the correct
mode for the type of fiber you are connecting. Also, be careful when using
various GBICs between equipment from different vendors. Although GBICs can
be used between devices, vendors often ship GBICs that have been fully qualified
and tested specifically with their equipment. It is usually best to stick with the
GBICs that are shipped with a product or provided by the manufacturer, since
this reduces the possibility of support issues with your equipment. Although
GBICs have their issues, it would be almost impossible to develop a scalable SAN
without them.
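    The media-matching rule above lends itself to a simple pre-installation check. The sketch below is one hypothetical way to flag mismatches in a cabling plan before anything is plugged in; the enum values, function names, and plan format are assumptions made for illustration.

```python
from enum import Enum


class Media(Enum):
    SINGLE_MODE = "single-mode fiber"
    MULTIMODE = "multimode fiber"
    HSSDC = "HSSDC copper"
    DB9 = "DB-9 copper"


def check_connection(gbic: Media, cable: Media) -> str:
    """Return 'ok' when the GBIC type matches the cable, else a warning."""
    if gbic is cable:
        return "ok"
    return f"mismatch: {gbic.value} GBIC with {cable.value} cable"


def find_mismatches(plan: dict[str, tuple[Media, Media]]) -> list[str]:
    """Return the names of planned links whose GBIC and cable differ."""
    return [
        link
        for link, (gbic, cable) in plan.items()
        if check_connection(gbic, cable) != "ok"
    ]


if __name__ == "__main__":
    plan = {
        "switch1-port0": (Media.MULTIMODE, Media.MULTIMODE),
        "switch1-port1": (Media.SINGLE_MODE, Media.MULTIMODE),
    }
    print(find_mismatches(plan))
```

    A check like this catches the common mistake of assuming that an already-inserted GBIC matches the fiber in hand.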

Media Interface Adapters
Media Interface Adapters (MIAs) convert copper signals to optical signals by
sitting between a copper port and an optical cable and generating a laser signal from the copper signal.
MIAs convert DB-9 copper connectors to optical SC connectors and are most
commonly used when a device with a fixed media copper port needs to be con-
nected optically to the rest of the SAN. Since the maximum range of an active
copper line is 30 meters, using MIAs extends your connection range to 500
meters, the maximum distance for multimode fiber. MIAs are most commonly
used in this manner to connect legacy devices with fixed copper media.
     You should carefully consider using MIAs. Using an MIA as an adapter adds a
connection, which significantly reduces signal quality.

     Using Hubs
     Fibre Channel hubs are used to connect simple FC-AL environments. Hubs were
     the original interconnect mechanism used for Fibre Channel, and provide connec-
     tivity between nodes in a loop. A hub connects ports, sending frames between indi-
     vidual ports but not routing them to other ports. Simple hubs do this electrically,
     and more intelligent hubs might also switch frames through the loop. Switched
     hubs do not implement a switched fabric protocol; they still maintain the
     FC-AL environment. As a result, they have most of the same reliability, scalability,
     performance, and manageability limits as unmanaged hubs. For this reason,
     switched fabric is becoming the dominant SAN technology. However, for low-end
     installations, hubs can offer a less-expensive alternative to switches. If you need to
     scale your SAN at some point in the future, consider buying an entry-level switch
     instead of a hub.
          This section briefly describes the different kinds of hubs and their major
     features and how hubs are best used in your network.

     Simple Electrical Hubs
     Simple electrical hubs consist of a simple series of circuits that detect whether a
     connection has been plugged into a port on the hub. Think of a hub as being a
     “feature-rich” wire. A resilient-loop circuit simply completes a connection to
     ensure that a loop is continuous throughout the hub. Simple hubs generally sup-
     port only copper and do not include any software functionality. Although these
     hubs are still available, they are used only in the simplest of configurations due to
     their lack of fault tolerance, operational difficulty, shared bandwidth, and difficulty
     in maintaining a stable loop when there is more than one initiator.

     Managed Hubs
     As Fibre Channel has evolved, manufacturers have found that just having simple
     hubs does not address problems of stability and manageability in a network.
     Managed hubs were designed in response to these issues. Managed hubs, unlike
     their simple electrical predecessors, do not just connect wires from port to port.
     Instead, they add more sophisticated functionality such as fault detection on
     ports, settable port modes, and isolated loop operation. More advanced hubs
     enhance performance or usability with frame switching capabilities between
     ports, privately routed frames between initiators and targets (rather than having all
     nodes pass traffic through the entire loop), and advanced diagnostic capabilities.
    A typical managed hub operates much like a simple electrical hub, by con-
necting adjacent ports in a continuous electrical circuit. However, managed hubs
add a fair amount of intelligence to monitoring the port, and might also contain
circuits that can interpret and modify frames received on the port. Instead of
blindly connecting two electrical paths, a managed hub might actually receive the
frames on a port and decide whether to transmit them further down the loop.
Another capability of these hubs is to filter out unwanted frames. For example,
when a marginal connection is sending invalid primitives into the line, a managed
hub will discard them. Managed hubs can also interpret a frame and send it
directly to a downstream port, essentially “switching” the frame and avoiding
the need for every node on the loop to see it. Managed hubs can also
segment a loop through software (loop zoning), and usually include intelligence
to handle the difficulties of managing Loop Initialization Primitive (LIP)
conditions and loop bring-up. LIPs are part of a loop initialization process and are
expected in a healthy and normal loop. However, certain conditions inherent to
loops create scenarios where LIPs cause an interruption to I/O or prevent
devices from effectively communicating on the loop.
    Advanced managed hubs also avoid the problems of simple electrical hubs by
isolating initiators. Initiators on the loop can be configured to see and communi-
cate only with specific storage devices and can be screened from other initiators.
This prevents some problems that can occur when initiators try to reset or other-
wise communicate with each other.
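    The initiator isolation described above can be modeled as a lookup table that maps each initiator to the storage devices it is allowed to see. The sketch below is a hypothetical model of that idea, not the configuration format of any particular hub; the class, method, and device names are invented for illustration.

```python
class LoopZoning:
    """Hypothetical zoning table: initiator name -> visible storage devices."""

    def __init__(self, zones: dict[str, set[str]]) -> None:
        # Copy the sets so later changes by the caller do not alter the table.
        self.zones = {initiator: set(targets) for initiator, targets in zones.items()}

    def may_communicate(self, source: str, destination: str) -> bool:
        """Allow traffic only between an initiator and its zoned storage."""
        if source in self.zones:
            return destination in self.zones[source]
        if destination in self.zones:
            return source in self.zones[destination]
        return False  # neither side is a known initiator


if __name__ == "__main__":
    zoning = LoopZoning({
        "host_a": {"disk_1", "disk_2"},
        "host_b": {"disk_3"},
    })
    print(zoning.may_communicate("host_a", "disk_1"))  # zoned pair
    print(zoning.may_communicate("host_a", "host_b"))  # initiators are screened
```

    In this model initiators never reach one another, which reflects the goal of keeping one host's resets away from the others.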
    The following is a list of the typical capabilities provided by managed hubs.
Each item is discussed in more detail in the sections that follow:
     •   LIP isolation
     •   Automatic port bypass
     •   Signal retiming
     •   Loop zoning
     •   Web interface
     •   Telnet
     •   Port-event logging
     •   SNMP support
    LIP isolation is the capability of a hub to prevent LIPs from being transmitted
to all nodes in a loop. LIPs used to be the primary source of instability in an
     FC-AL configuration, due to the complexity of the FC-AL protocol. With LIP
     isolation, LIPs are isolated from all other parts of the loop, preventing the
     disruption of traffic and avoiding many LIP protocol-related problems.
         Automatic port bypass is the ability of a hub to automatically bypass a port if too
     many errors have been detected. Software keeps track of the number and rate of
     errors coming from devices, and if the error rate exceeds a certain user-defined
     threshold, it will automatically bypass the port. This helps to prevent the whole loop
     from becoming unusable due to a single device having problems. When a device
     in a loop experiences partial failure and is not bypassed, it will normally issue LIPs
     constantly. This essentially eliminates communication between other devices on the loop,
     much like somebody’s cell phone continually ringing during a meeting.
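         The threshold logic behind automatic port bypass can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the firmware of any real hub: the class name, the per-interval error counts, and the rule that a bypassed port stays bypassed until an administrator intervenes are all hypothetical.

```python
class PortBypassMonitor:
    """Hypothetical monitor that bypasses ports with too many errors."""

    def __init__(self, max_errors_per_interval: int) -> None:
        # User-defined threshold: errors tolerated in one sampling interval.
        self.max_errors_per_interval = max_errors_per_interval
        self.bypassed: set[int] = set()

    def record_interval(self, port: int, error_count: int) -> bool:
        """Record one sampling interval; return True if the port is bypassed."""
        if error_count > self.max_errors_per_interval:
            self.bypassed.add(port)
        # A bypassed port stays bypassed in this sketch, even if errors stop.
        return port in self.bypassed


if __name__ == "__main__":
    monitor = PortBypassMonitor(max_errors_per_interval=10)
    for port, errors in [(3, 4), (3, 25), (3, 0), (1, 10)]:
        print(port, errors, monitor.record_interval(port, errors))
```

         Keeping the port bypassed after the error rate drops is a design choice made here for simplicity; a real hub might instead re-enable the port after a quiet period.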
         Signal retiming is the ability of hub hardware to clean up the signal received
     from a device. Instead of just electrically passing a signal through the port
     (including potential errors or noise on the line), a retimed port will take the signal
     and re-encode it on the wire. Any errors will be removed from the signal and
     noise removed from the line.
         Managed hubs, unlike unmanaged hubs, also provide for manageability func-
     tions such as telnet, SNMP, serial ports, and port logging to monitor the ports.
     This makes it easier to configure devices, diagnose problems, and view activity in
     your FC-AL configuration.
         Due to inherent limitations in FC-AL, hubs are being used less frequently in
     Fibre Channel installations and are generally found only on low-end, low-port-count
     installations with only a few initiators. Early problems with instability in
     Fibre Channel were due to the difficulties in implementing the FC-AL protocol.
     Although managed hubs have reduced the problems, the inherent limitations of
     the early technology have driven installations to migrate to fully switched
     fabrics, which are more reliable and provide a higher level of performance and
     manageability.

     LIP Service: Fibre Channel
     LIPs, Problems, and Solutions
     With hub technology and FC-AL, one of the hardest parts of maintaining a net-
     work is managing the LIP process. Because of the complexity of this process and
     various early incompatibilities between equipment, the LIP process often resulted
     in the instability of Fibre Channel installations. Note that LIPs were designed as,
     and continue to be, a healthy aspect of FC-AL. Unfortunately, LIPs can also be
     the cause of loop instability.
                                        SAN Components and Equipment • Chapter 3     79


    A typical problem occurs when nodes are added to or removed from the loop:
when an administrator intentionally changes the loop, when any cable is plugged
in or unplugged, or when a device is powered up or down. When a change to an
FC-AL loop occurs, the LIP process starts, and all of the devices on the loop stop
whatever they are doing and renegotiate for addresses in the loop. Each node passes a
frame through the loop with information used to determine its device ID, the
address that other nodes use to send Fibre Channel frames to it.
    As you can imagine, this is a problem if traffic is being sent across the Fibre
Channel loop—any data that has been sent must be re-sent, drivers and software
must time out, and all transactions must be retried. Even worse, because of early
hardware incompatibilities and the complexity of the LIP process, a loop can
sometimes take minutes to quiesce, and in rare cases the loop might never settle,
preventing transactions from continuing.
    The term LIP storm was coined to describe what happens in these situations,
and an entire generation of managed hubs was designed to minimize or eliminate
this problem. Today, with the newest hubs and switches operating in loop
mode, these problems are minimized.

Getting Out of the Loop:
Migrating to Switched Fabric
Because of the problems described in FC-AL environments, many installations
have been migrating out of loop environments into switched fabric. Loops scale
to only 127 devices, while switched fabrics support hundreds or even thousands
of devices. Because Fibre Channel switches inherently support loop devices, and
in fact implement a superset of the Fibre Channel loop protocol, migration is
fairly straightforward. By migrating to a switched fabric environment, the relia-
bility problems, manageability problems, and bandwidth limitations of loops can
be eliminated fairly easily.
     Brocade switches provide features called QuickLoop and Fabric Assist that
make it easy to migrate from a loop environment directly to switched fabric.
Note that it might be necessary to purchase a separate license for these features.
Hubs can be entirely replaced by switches and, in fact, some low-end Brocade
switches are positioned to directly replace hubs as a component (for instance, the
SilkWorm 2010 switch). Devices that cannot take advantage of switched fabrics
can still operate in private-loop mode, fabric-aware devices can operate in fabric
mode, and operation and capabilities can be maintained with only a simple
equipment swap.
         QuickLoop operates by setting up a virtual loop through switched ports.
     Each of the ports in this case operates as if it were a hub port, but takes advantage
     of the capabilities of the Brocade switch, including switching capabilities.

     Using Switches and
     Fibre Channel Fabrics
     A Fibre Channel switch is logically positioned in the center of a SAN and is
     connected to hosts, storage, or other switches. The fabric infrastructure can be
     viewed as the foundation upon which the rest of the SAN is built. When a frame
     arrives from a device, a switch accepts and then routes that frame to the proper
     destination device. In fact, using the Brocade cut-through routing approach, a
     frame can begin to be forwarded even before it has been completely received. A
     fabric switch also contains a great deal of intelligence, providing services for
     locating other nodes in a network (the Simple Name Server [SNS]), automati-
     cally establishing routes between other switches in the fabric, compartmentalizing
     devices into zones (zoning), as well as monitoring and handling errors (basic
     Brocade Fabric OS functions and Fabric Watch). We discuss fabric services further
     in Chapter 2, “Fibre Channel Basics.”
         Brocade switches also provide functionality that allows private loop devices to
     participate in a fabric and translate the communication between fabric devices
     and older private devices. In fact, the translative mode of operation for a port on
     a Brocade switch will automatically allow any private target node (such as a
     private loop JBOD) to function fully as part of a fabric. This feature is a core piece
     of the Fabric OS and does not require a license. Making this work for a private
     HBA, on the other hand, requires QuickLoop and/or Fabric Assist options.

     Basic Switch Types
     Fibre Channel switches are often classified into different categories, depending on
     capabilities and features. In many cases, the hardware might be based on the same
     underlying architecture or Application-Specific Integrated Circuit (ASIC), but
     the software features are variable and priced accordingly to meet the requirements
     of that class of switch. The exception is highly redundant “core-class”
     switches, which tend to be developed on their own fault-tolerant hardware
     platforms. This section covers the major categories of switches and explains the
     differences between each kind of switch.

Entry-Level Switches
Entry-level switches are focused on small workgroups of eight to 16 ports, are
geared toward low cost, and deliver limited scalability and management. They tend
to be used to replace hubs and offer higher bandwidth and reliability. Entry-level
switches are often integrated into complete storage solutions rather than purchased
separately, and they offer only limited support for cascading switches.
Brocade entry-level switches can be upgraded with a license to improve scalability
or to add functionality such as zoning or Web management.

Scalable Fabric Switches
Fabric switches provide the ability to cascade switches together to create a larger
fabric. By connecting one or more ports between two switches, all of the ports
connected to either switch see one single image of the network, with any nodes
on the switches available to other nodes in the fabric. Essentially, by connecting
the switches together, you can create one large, virtual switch that also has the
advantage of being distributed—even over large distances.
     Fabrics built with fabric switches work as a single fabric, with all ports con-
nected into the network able to view and access any other node on the network
as if it were on the local switch. A unified Name Server and management services
allow you to view and modify fabric information for an entire fabric through
single interfaces.
     An important factor in creating a distributed fabric is understanding the
bandwidth availability of the ISLs. It is important to remember that the speed
available between any two ports can be limited by a lack of available bandwidth
on an ISL and that you might need to employ multiple ISLs to maintain
the necessary bandwidth. We discuss this topic further in Chapter 5, “The SAN
Design Process,” and Chapter 7, “Developing a SAN Architecture.”
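As a rough back-of-the-envelope sketch of the ISL sizing question, the following estimates how many ISLs are needed to carry a given amount of traffic between two switches. It assumes roughly 100 MB/sec of usable bandwidth per 1 Gbit/sec link and assumes that traffic spreads evenly across the ISLs, which real fabrics do not guarantee. The function name and the example workload are hypothetical.

```python
import math


def isls_needed(cross_switch_traffic_mb_s, isl_capacity_mb_s=100.0):
    """Smallest number of ISLs whose combined capacity covers the expected traffic.

    Assumes an even spread of traffic across links (an idealization).
    """
    if cross_switch_traffic_mb_s <= 0:
        return 0
    return math.ceil(cross_switch_traffic_mb_s / isl_capacity_mb_s)


# Hypothetical load: four hosts on one switch, each sustaining 60 MB/sec
# to storage attached to the other switch.
print(isls_needed(4 * 60.0))
```

With a single ISL in this example, the four hosts would have to share one link's worth of bandwidth no matter how fast their own ports are.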

Core Fabric Switches
Core fabric switches are designed to reside in the middle of a large SAN,
interconnecting multiple edge switches to form multihundred-port SANs. Core fabric
switches can also function as standalone or edge switches, of course, but their
robust feature set and internal architecture are designed to allow them to work in
carrier-class environments as well. Other attributes of core fabric switches are
support for protocols other than Fibre Channel (such as InfiniBand), 2 Gbit/sec
support, and advanced fabric services like security, trunking, and frame
filtering.
          Core fabric switches generally provide a high port count, from 64 to 128
     ports, and employ extensive internal interconnects to route frames at full speed.
     These switches are built for scalability and bandwidth, and are designed to route
     as many ports as quickly as possible with the least amount of delay to a frame. In
     addition, they tend to be blade-based: you can add and remove switch blades in a
     chassis to add functionality as needed, to facilitate hot sparing or online repairs,
     and to “pay as you grow.”
          Some enterprise switches do not support arbitrated loop operation or other
     loop devices directly, instead focusing on core switching capability. Brocade high-
     end enterprise switches provide all of the functionality of Brocade mid- and
     entry-level switches, including support for loops.
          For environments in which availability is most important, and you are willing
     to pay a premium for redundancy, highly redundant switches provide fully redun-
     dant components throughout the switch, remove single points of failure, and pro-
     vide extremely high uptime. A premium is paid for highly available backplanes,
     power supplies, redundant circuitry, and software to maintain availability. These
     types of switches include a great deal of logic and circuitry to deal with hardware
     failures within the switch. Beyond redundancy, core fabric switches support
     non-service-interrupting software upgrades, virtually eliminating the need to
     schedule maintenance windows. An alternate approach that provides a level of
     redundancy in the network is deploying a resilient, dual fabric. A resilient, dual
     fabric allows you to remove single points of failure and protect against the
     unlikely event of an entire fabric going down due to a software or hardware
     error, fire, natural disaster, or operator error. For the most highly available net-
     works, you should deploy a dual fabric built with core fabric switches.

     Features of Fibre Channel Switches
     Fibre Channel switches provide many different features, including support for
     GBICs, redundant fans and power supplies, zoning, loop operation, and multiple
     interfaces for management. Each of these features adds to the overall operation of
     your switched network, and understanding the benefits and advantages of each
     can help you design a robust and scalable SAN. This section covers the major
     features of Fibre Channel switches and describes what you need to know about
     each feature and how best to use it. The capabilities of Fibre Channel
     switches are listed here:
     •   Self-configuring ports
     •   Loop mode operation
     •   Switch cascading
     •   Auto-sensing speed detection
     •   Configurable frame buffers
     •   Zoning (physical port- and WWN-based)
     •   IP over Fibre Channel broadcasting
     •   Telnet
     •   Web-based management
     •   Simple Network Management Protocol (SNMP)
     •   SCSI Enclosure Services (SES)


Zoning
Control over which nodes in a network can view and access each other has
become a necessary part of configuring your SAN. Depending on which Fibre
Channel switch you buy, zoning is implemented in different ways and also might
support different kinds of zoning.
    The simplest kind of zoning is port-based zoning, or zoning by a physical
port on the switch. A port zoning entry could be translated something like,
“Only allow devices on switch 1, port 1 to talk to devices on switch 3, port 2.”
WWN-based zoning provides the capability to restrict devices, specified by a port
or node WWN, into zones. This is much more flexible, since it allows nodes anywhere
in a network to maintain the zones they are restricted to. However, it does
have its disadvantages. For example, if you replace a device, the WWN might
change, while the port address stays the same.
    Zoning is classified into two types: hard and soft zoning. Soft zoning uses only
software to enforce zones—usually through selective information presented to
end nodes through the fabric SNS. In soft zoning, nodes in a zone are informed of
each other only through name services. However, frames are not barred from
being transmitted between nodes that are not in the same zone. This works fairly
well, but does suffer if zones change, if hardware caches Name Server tables, or if
you want to guarantee that no frames (intentional or accidental) are sent to
devices outside the zone.
    Hard zoning uses hardware, examining each frame that comes across the
fabric and ensuring that it is allowed to pass through to a node. To end nodes, hard
zoning looks exactly like soft zoning, and it is usually used in conjunction with it.
     With hard zoning, however, no frames, accidental or intentional, can pass through
     to nodes where permission is not given.
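The difference between port-based and WWN-based membership, and between soft and hard enforcement, can be sketched in a few lines. This is a simplified model for illustration only, not how any switch stores or enforces its zoning configuration. All device names, WWNs, and port numbers are made up.

```python
class Device:
    """A node attached to the fabric at a given switch and port."""

    def __init__(self, wwn, switch, port):
        self.wwn = wwn
        self.switch = switch
        self.port = port

    def identities(self):
        # A device can match a zone by WWN or by its physical (switch, port).
        return {self.wwn, (self.switch, self.port)}


# Hypothetical devices; the WWNs are invented for the example.
host_a = Device("10:00:00:00:00:00:00:0a", switch=1, port=1)
disk = Device("20:00:00:00:00:00:00:0d", switch=3, port=2)
host_b = Device("10:00:00:00:00:00:00:0b", switch=2, port=4)
tape = Device("20:00:00:00:00:00:00:0e", switch=2, port=7)
devices = [host_a, disk, host_b, tape]

ZONES = {
    # Port-based: "switch 1, port 1 may talk to switch 3, port 2".
    "port_zone": {(1, 1), (3, 2)},
    # WWN-based: membership follows the device wherever it is plugged in.
    "wwn_zone": {host_b.wwn, tape.wwn},
}


def share_zone(a, b):
    """True if at least one zone contains both devices."""
    return any(
        bool(members & a.identities()) and bool(members & b.identities())
        for members in ZONES.values()
    )


def name_server_query(asker):
    """Soft zoning: the name service lists only devices zoned with the asker."""
    return [d.wwn for d in devices if d is not asker and share_zone(asker, d)]


def forward_frame(src, dst):
    """Hard zoning: every frame is checked, whatever the sender believes."""
    return share_zone(src, dst)


print(name_server_query(host_a))
print(forward_frame(host_a, tape))
```

In this model, soft zoning alone would hide the tape drive from host_a but would not stop a frame that host_a addressed to it anyway. The `forward_frame` check is what closes that gap.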
         Newer hardware is starting to extend the features of zoning further into the
     network, to the level of storage LUNs. As hardware advances and the ability to
     filter traffic beyond the port level all the way down to the LUN level becomes
     available, zoning will enable finer granularity control over the specific logical
     units on a storage unit that a specific initiator on the Fibre Channel fabric can
     access. This will enable better control and allocation of storage through zoning on
     a network.
         If you are sharing storage on your SAN, if the network is large, or if you want
     to closely control access to data and information on your SAN, zoning is a neces-
     sary feature of your switch and should be considered a requirement. On the other
     hand, if the size of your SAN is limited and devices attached to it are very well
     controlled, then zoning might not be as much of a necessity.

     Classes of Service
     The Fibre Channel protocol supports different classes of service. The class of
     service determines the level of error control for transfers. For communication
     between nodes to be successful, a switch has to support different modes of
     operation. The important part of support for classes of service is making sure that
     all of the equipment that you are running supports the same classes of operation.
     Otherwise, the devices will be unable to communicate with each other.
         Most Fibre Channel switches and other hardware devices support Class 3
     operation, which is a connectionless conduit without confirmation of transfers
     across the SAN, the ideal case for SCSI transfers. This is because the upper-layer
     protocol on top of Fibre Channel is already doing the error control. Doing it in
     Fibre Channel as well would just add overhead. The majority of components you
     can buy for Fibre Channel are Class 3 devices and are fully interoperable. In some
     cases, the confirmation of transfers (acknowledgement frame [ACK]) between
     nodes is desired for better error detection, in which case, Class 2 would be used.
     However, Class 2 is not available on all hardware, although most Fibre
     Channel switches support it.
         Another class of service is Class F, which is used for internal control and
     coordination of the fabric. Switches are exclusive users of Class F. Finally, there is
     Class 1, which is rarely implemented in switches, but is supported by some older
     equipment. See Chapter 2, “Fibre Channel Basics,” for further detail on the
     various classes of service defined for Fibre Channel.

Fabric Services
Fabric services are the set of internal services available to devices in a SAN. Fabric
services determine the level of manageability and interoperability of your switch
fabric. These services are used by devices when they first attach to your network
and allow different devices to locate others on the network. This section discusses
fabric services relative to switch and SAN implementations.
     Name Server support is a part of the Fibre Channel standard for switched
fabrics and provides devices with a directory of other devices on the fabric. There
is one Name Server service for an entire fabric, whether it is a single switch or
dozens of switches. However, that service is distributed across every switch in the
fabric and provides a high degree of resiliency. Any node querying the Name
Server will receive the same answer regardless of location—all of the switches
participating in the Name Server service cooperate and present a unified picture
of the fabric. All switches support this functionality since it is a basic part of
switched fabric operation.
     The Management Server is an in-band fabric service that provides basic
management data about the network. Included in this data is topology information:
what is connected where on a switched fabric and basic information
about attached nodes. In addition, the Management Server provides unzoned
access to the Name Server. This is necessary when you have a management station
that needs to know about everything in the network, but does not need to
have access to the storage or hosts in a zone. The Management Server is used by
SAN management software that needs additional information about the fabric,
topology (physical layout) of the network, and other management information.
     Registered State Change Notification (RSCN) is a service of the fabric that
notifies nodes of changes in state of other attached nodes: for example, if a node
is reset, removed, or otherwise undergoes a significant change in status. Most
switches support RSCN, which is critical for operation of your SAN. This is
particularly important for detecting error conditions and informing nodes about
problems. When the state of the node changes, devices using that node are
immediately informed and can react properly, rather than trying requests and timing
out due to errors. This feature is required for hosts that add storage “on the fly,”
since the RSCN is the mechanism by which the host finds out about the newly
available storage. RSCN events are generated by devices and by the fabric itself
for any sort of physical change to the topology of the fabric: for example, when a
device is added to or removed from the SAN, when a switch is added to a fabric, or
when a device has been internally reset, dropped off, and come back onto the
     SAN. RSCNs reduce the need for hardware to repeatedly check for changes in
     equipment condition, called polling, and thus reduce the amount of nondata
     traffic necessary to keep your network up and running.
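The notify-instead-of-poll idea behind RSCN can be sketched as a simple callback registry. This is a conceptual model only: the class and method names are hypothetical, and a real fabric also takes zoning and the details of the registration protocol into account, which this sketch ignores.

```python
class Fabric:
    """Toy fabric that pushes state changes to registered nodes."""

    def __init__(self):
        self.online = set()
        self.listeners = {}  # node name -> callback(changed_device, state)

    def register(self, node, callback):
        """A node asks to be told about state changes (hypothetical API)."""
        self.listeners[node] = callback

    def _notify(self, changed, state):
        # Tell every registered node except the one that changed.
        for node, callback in self.listeners.items():
            if node != changed:
                callback(changed, state)

    def device_added(self, name):
        self.online.add(name)
        self._notify(name, "online")

    def device_removed(self, name):
        self.online.discard(name)
        self._notify(name, "offline")


events = []
fabric = Fabric()
fabric.register("host1", lambda dev, state: events.append((dev, state)))
fabric.device_added("array1")    # host1 learns about new storage immediately
fabric.device_removed("array1")  # ...and about its removal, without polling
print(events)
```

The host never has to ask the fabric whether anything has changed; it simply reacts when its callback is invoked.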
         There are additional services defined in the Fibre Channel standards that are
     not necessarily supported by all, or even any, switches. For example, the Time
     Server is not yet supported on any switching platform that we know of, but it is
     defined as a standard. If you or the management software you buy requires the
     additional services, you should ensure that the switch you buy supports or can
     enable those services. For example, VERITAS SANPoint Control uses the
     Management Server, but many switch vendors do not support that service.
     Brocade does support the Management Server.

     Redundancy
     Because SANs are usually involved in the critical parts of your business, and
     because, unlike regular network traffic, data traffic on a SAN must not be lost or
     corrupted, the need for equipment protection through redundancy is important.
     Redundancy in its most basic capability takes the form of redundant power
     supplies and fans. In the field, power supplies and fans are the most likely
     components to fail. Fan bearings, which are the most mechanical pieces of any
     equipment, receive the most use, and because they are relatively inexpensive, these
     components tend to have a shorter life span than an integrated circuit.
         Redundant and hot-swappable power supplies help alleviate the problem of
     wear and tear on power supplies and the fans that cool them. With a redundant
     power supply, if one of the power supplies fails, circuitry can detect it and shut
     down the offending supply and issue a warning—either through software or
     through an indicator light or buzzer—all while allowing the equipment to con-
     tinue running. A technician can swap in a new replacement unit for the power
     supply in real time, without affecting operation.
         Similarly, other components of a switch can also be made redundant,
     including back planes, circuit boards, memory, and CPUs—albeit at a much
     higher cost.

     Buffer Credits per Port
     An important aspect of Fibre Channel throughput is the number of buffer credits
     that are available on each switch port, particularly for long-distance applications.
     Although optical networks are fast, light still travels at a finite speed: in
     optical fiber, it takes approximately 5 microseconds to cover 1 km. This is slow
enough that over long distances, the number of buffers required to keep operations
running is very important. If there are not enough buffers available on the far end
of a long run of optical cable, the hardware will run out of buffers to receive
frames, throttling the actual amount of data that can be sent over that cable. By
ensuring that you have enough buffers to support this sort of operation, you can
ensure smooth and maximum throughput across your optical link.
    In shorter cable length configurations, buffer credits are less important. Most
Fibre Channel switches are configured with plenty of buffer credits per port
when dealing with distances up to 30 km. However, it is worth knowing what
your switch can support so that you can ensure optimal operation, especially
when you intend to use your switch across long distances. At 1 Gbit/sec, it takes
five 2 KB frame buffers to provide enough buffering to ensure full-bandwidth
performance at 10 km. To ensure full-bandwidth performance at 100 km, you
need approximately 50 buffer credits for each switch port.
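The figures above can be reproduced with a little arithmetic. The sketch below assumes a propagation delay of about 5 microseconds per kilometer of fiber and about 20 microseconds to transmit one full 2 KB frame at 1 Gbit/sec. Both are rounded approximations, and the function name is hypothetical. The sender needs enough credits to keep transmitting for one full round trip, because a credit is only returned after the frame has reached the far end and the acknowledgment has come back.

```python
import math

FIBER_DELAY_US_PER_KM = 5.0  # approximate one-way delay in optical fiber
FRAME_TIME_US = 20.0         # approximate time to send one full ~2 KB frame at 1 Gbit/sec


def credits_for_distance(distance_km, frame_time_us=FRAME_TIME_US):
    """Buffer credits needed to keep a link of the given length full."""
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return math.ceil(round_trip_us / frame_time_us)


print(credits_for_distance(10))   # 5
print(credits_for_distance(100))  # 50
```

Smaller frames take less time to transmit, so a link carrying mostly small frames would need more credits than this estimate to stay full over the same distance.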

Self-Configuring Ports
Fibre Channel has many different modes of operation for ports: loop (FL_Port),
switched fabric (F_Port), and ISLs (E_Port). Even within loop ports there can be
different modes, depending on whether the attached devices are public or private
loop, and if they are initiators or targets. Also, the emerging 2 Gbit/sec and higher
Fibre Channel standards will create even more modes of operation. Self-configuring
ports are able to detect which mode the other side of the link is
operating in and automatically configure themselves to support that mode of
operation. This is particularly important in the case of fabric-supporting devices,
which operate much better and with more reliability if they are operated in
fabric mode (also called point-to-point when a device is connected to a switch).
A self-configuring port analyzes the primitives on the wire to properly configure
operation to match the connected Nx_Port hardware. The term Nx_Port is used
to identify either an N_Port (point-to-point connection) or an NL_Port (loop
connection) for the connecting device. This also supports dynamic reconfiguration
of a network: for example, changing the placement of an ISL should happen
automatically, rather than requiring an administrator to telnet or log in to a Web
interface to control the configuration of a particular port.
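The decision a self-configuring port makes can be summarized as a simple lookup from what is detected on the other end of the link to the port mode. The sketch below shows only that final mapping. The function name and the peer labels are hypothetical, and the detection itself, which a real switch performs by examining link primitives, is not modeled.

```python
# What the switch port becomes, keyed by what it finds on the other end.
PORT_MODE_FOR_PEER = {
    "switch": "E_Port",    # another switch: the link becomes an ISL
    "N_Port": "F_Port",    # fabric-capable device, point-to-point
    "NL_Port": "FL_Port",  # loop device
}


def configure_port(detected_peer):
    """Return the switch port mode that matches the detected peer type."""
    try:
        return PORT_MODE_FOR_PEER[detected_peer]
    except KeyError:
        raise ValueError(f"unrecognized peer type: {detected_peer!r}") from None


for peer in ("switch", "N_Port", "NL_Port"):
    print(peer, "->", configure_port(peer))
```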
    Some switch vendors have specific ports that support only certain operations,
requiring that ISLs be connected on only a few specified ports. All Brocade
switches can support self-configuration on all ports. Certain entry-level products
might require an additional software license to enable this support, but the
capability is present in the hardware. Self-configuring ports make it much easier
     to manage your fabric, eliminating the need to dedicate certain ports to
     certain functions.
         Another feature of switch hardware and software is the ability to manually set
     the configuration of these ports. Sometimes, equipment is not able to properly
     auto-configure a port. A device that supports switched fabrics might not be
     recognized as a fabric device, and its port might be configured as a loop port. In
     order to ensure that those devices operate in the best mode, you might need switch
     software to force configuration of a particular port. Needing to manually set the type
     of port indicates that there is a conflict between the connecting device and the
     switch. If a port is locked to a certain type, it limits the functionality and can cause
     potential problems if other devices are plugged into that port. Once a port is
     locked, it is no longer “plug-and-play.”

     Auto-Negotiating Speeds
     As Fibre Channel moves from 1 Gbit/sec to 2 Gbit/sec and beyond, support for
     auto-negotiation of speeds becomes necessary to support mixed-speed networks.
     Auto-negotiation uses communication with devices attached to a switch to
     determine if they are running at 1 Gbit/sec or 2 Gbit/sec and automatically
     selects the correct speed.

     IP over Fibre Channel Broadcasting
     The use of IP over Fibre Channel (IPFC) is, for the most part, identical to any
     other IP network. Fibre Channel, as a communications medium, does not
     inherently support broadcasting frames to all nodes on a fabric in the way that
     Ethernet or other IP networks do. Fibre Channel broadcast is a function of switches
     that will automatically resend broadcast frames to all attached ports on the Fibre
     Channel network, effectively emulating the broadcast properties of Ethernet networks.
     This helps to fully support NFS file sharing, bootp, ARP, ping, and other
     protocols on top of IP that are dependent on broadcast and that are usually not
     aware of the behavior of IPFC.
         This is a necessary part of fabric operation if you intend to send any IP
     frames across your network. Some HBAs do not react well to IP broadcasts, so
     you might need to use switch zoning to allow them to coexist with other HBAs
     that are running IP.

Firmware Upgrade Methods
Although many users will buy a switch directly out of a box and never look at
the firmware installed on it again, sometimes it is necessary to upgrade the
firmware to fix a bug with newly introduced hardware, add a new feature, or
enable management through a third-party software package. Also, if you are
building a large fabric, it might be desirable to have a unified firmware version
throughout that fabric. This will ensure a consistent feature set and the most
reliable operation in a large heterogeneous fabric.
      Software upgrades can be accomplished in different ways. For fast upgrades of
firmware, look for the capability to download firmware to the switch through
Ethernet. The most basic download method is through the serial port, which
requires an RS232 connection and a PC or other machine that sends the
firmware to a switch through a slow serial link. Brocade switches do not support
the download of firmware through a serial connection and instead use Ethernet
for downloading firmware. However, it is possible to manage all Brocade switches
except the SilkWorm 2800 with a serial connection.
      Another consideration for firmware upgrades is how much impact this will
have on your network. The ideal operation is “hot upgrades,” firmware upgrades
that can be done while equipment is running and can be “rolled in” to production.
Few pieces of equipment currently support this, but equipment that does
can keep downtime to a minimum. Next is upgrade-on-reboot, where firmware
upgrades can be done, but the new firmware does not take effect until the switch
is rebooted. Operation can continue until a reboot is triggered.
      The worst option is offline upgrading. This is required when a component
must be offline to upgrade, or even worse, when all equipment must be upgraded
at the same time. Many pieces of hardware are eliminating this, but you should
still be aware of what kind of work is required when you need to upgrade switch
firmware.
      The good news with firmware upgrades is that, in a dual-fabric SAN, you
can upgrade one fabric at a time. This will enable a firmware upgrade to take
place with no disruption to your environment. Using dual fabrics might require
additional software on your host, such as VERITAS DMP, a multipathing HBA
driver such as the TROIKA driver, or multipathing RAID drivers. Since dual
fabrics are always advisable in uptime-sensitive environments, the firmware
upgrade disruption question is moot for real-world applications. See Chapter 7
for more information on SAN availability models.

     Loop Operation: Making Your Switch Act Like a Hub
     A convenient feature of almost all Fibre Channel switches is their capability to
     act similarly to a Fibre Channel hub. By behaving like a hub, a switch can work
     with FC-AL devices that do not support fabric operation. Because the Fibre
     Channel standard began with simple FC-AL operation, and much of the storage
     hardware and even some of the HBAs available might be only FC-AL and not
     switched fabric-compatible, the ability to make a switch act like a hub can ensure
     that older equipment will still work in your network.
         In general, a set of switch ports or an entire switch might be configured for
     loop operation, specifying which ports are running in loop mode. Some low-end
     switches are actually forced to operate in loop mode, with the ability to be
     license-upgraded to support full fabric.
         The capability for loop operation is a must if you are directly attaching
     storage devices that support only the older FC-AL protocol to your switch. Refer
     to Chapter 4 for further detail regarding how QuickLoop and Fabric Assist
     enable private-loop devices to fully participate in the SAN.

     FSPF Compliance
     Fibre Channel switches from different vendors started out fully compatible with
     end nodes in a network. However, until fairly recently they were not able to pass
     frames between each other (inter-vendor switch frame compatibility). Several
     different routing algorithms for inter-switch routing existed. Brocade switches all
     used a protocol developed at Brocade called Fabric Shortest Path First (FSPF).
     Recent efforts in compatibility have standardized on FSPF for routing
     Fibre Channel frames between switches, and now all vendors are required
     to support this protocol in order to be Fibre Channel standards-compliant.
     FSPF forms the basic protocol for exchanging and routing frames between
     switches in a Fibre Channel fabric.
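The "shortest path first" part of the name refers to choosing, for each destination switch, the route with the lowest total link cost. The sketch below illustrates that idea using Dijkstra's algorithm on a small, made-up fabric. It is not an implementation of the FSPF protocol itself, which also has to discover neighbors and exchange topology information. The switch names and link costs are assumptions chosen for the example.

```python
import heapq


def shortest_paths(links, source):
    """Return the lowest total link cost from `source` to every reachable switch."""
    cost = {source: 0}
    queue = [(0, source)]
    while queue:
        current_cost, switch = heapq.heappop(queue)
        if current_cost > cost.get(switch, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, link_cost in links.get(switch, {}).items():
            new_cost = current_cost + link_cost
            if new_cost < cost.get(neighbor, float("inf")):
                cost[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return cost


# Hypothetical four-switch fabric; the sw3-sw4 link is assumed to be
# faster and is therefore given a lower cost.
links = {
    "sw1": {"sw2": 1000, "sw3": 1000},
    "sw2": {"sw1": 1000, "sw4": 1000},
    "sw3": {"sw1": 1000, "sw4": 500},
    "sw4": {"sw2": 1000, "sw3": 500},
}
print(shortest_paths(links, "sw1"))
```

In this example, frames from sw1 to sw4 would be routed through sw3, since that path has a total cost of 1500 compared with 2000 through sw2.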
         FSPF compliance is most important if you are trying to mix and match Fibre
     Channel switch hardware. Because all switches must follow this standard fully to
     interoperate, you must make sure all switches in your network support the stan-
     dard. In addition, because the standard is very new, it is important to make sure
     that all pieces of hardware you expect to work together have actually been tested
     together. Also check to see what advanced features will be lost when intercon-
     necting switches from other vendors into a Brocade fabric. At this time, no switch
     vendor supports the complete feature set that Brocade switches implement, and it
is possible that the needed functionality might be lost by introducing other ven-
dors’ switches. Since Brocade implements a superset of most other vendors’ fea-
ture sets, it might be more practical to introduce a Brocade switch into another
vendor’s existing fabric than vice versa. Firmware versions and topology are still
very important in these mixed environments, which are still undergoing testing at
the publication of this book.
    The trend toward full inter-vendor switch compatibility is an important basis
for the future and promises not only to reduce cost but also, more importantly,
to permit the interchangeability of hardware in your network.

Management Interfaces
Fibre Channel switches can be managed in many different ways. These different
interfaces let you choose how you configure a switch, and they also accommodate
how you intend to manage it, as well as any other tools you might already have
deployed in your network.

Serial Port
The most basic management interface for Fibre Channel switches and other
equipment is the serial port. Fibre Channel switch equipment generally provides
a standard RS-232 port that allows command-line access to the different
configuration options.

Telnet
Telnet is the standard IP networking facility for logging in to a piece of
equipment from any host attached via Ethernet, or even in-band through Fibre
Channel itself (Figure 3.7). You typically log in to a telnet session and
execute commands at a command line. Telnet has the advantage of being
convenient to run remotely or through a slow connection, and it can also be
scripted, for example to configure switch settings automatically each night or
to automate difficult operations. The disadvantage is that command-line
interfaces tend to be difficult to use, especially for complex operations such
as zoning and for viewing lots of information at once, where a GUI such as
WEB TOOLS is more practical.

SNMP
Simple Network Management Protocol (SNMP) is an IP-based protocol for managing
any kind of network equipment, including Fibre Channel switches. SNMP provides
mostly read-only access to switch status and configuration, as well as
     critical error counters and statistics. Almost all Fibre Channel switches (as well as
     hubs, routers, and even some storage arrays) provide SNMP Management
     Information Base (MIB) information. This is mostly used in conjunction with
     traditional network monitoring applications like HP OpenView, CA Unicenter,
     and Tivoli NetView, but it is also used by Fibre Channel management software
     to gather information from network hardware.

     Figure 3.7 Telnet Session with a Brocade Switch




        Standards for SNMP Management Information
        SNMP defines only the basic protocol that transports management
        information. The actual information transported across the network is
        defined on top of the SNMP protocol through Management Information
        Base (MIB) definitions. It is important to make sure that the equipment
        you buy supports the correct MIBs that enable software to interpret and
        use the information available from your switch.
             The FibreAlliance MIB, now under consideration in the Internet
        Engineering Task Force (IETF) standards organization, is supported by
        most Fibre Channel network hardware. Defined by the FibreAlliance
        organization, which was started by storage manufacturer EMC, the MIB
        provides common information for discovery of topology and equipment
        capabilities in a SAN. The MIB provides information about how many
   ports exist on a piece of equipment; what is connected on each port;
   and even detailed information such as error counters and frame counts.
   In addition, basic asset tracking information such as manufacturer
   strings, model numbers, and other information is presented in this MIB
   definition.
         The Fibre Channel Management MIB, which is supported by some
   earlier switches, is a different MIB that provides much of the same infor-
   mation. This MIB preceded the FibreAlliance MIB and is required by some
   software for accessing parameters not exposed by the FibreAlliance MIB.
   It is best to check with your software vendor to understand if this infor-
   mation is required for operation.
         Most hardware also presents equipment- or vendor-specific MIBs,
   defined by the manufacturer. Often, these MIB definitions are also
   made available to customers and can expose things that are not
   industry-standard: for example, special features of a switch, information
   that is specific to the hardware, or special functions. You should check
   with your manufacturer, or in the software shipped with your hardware,
   for equipment-specific MIB definitions.



Web-Based Management
Web-based management interfaces provide a graphical, Web-based way of
accessing and modifying switch settings. In general, most Web-based management
tools provide a page in your browser where you can view switch status and set
most switch settings. In many cases, the Web-based GUI makes complex tasks such
as zoning much easier and also provides a visual indicator of switch function.
However, not all switches expose all functionality through a GUI; some tasks
might instead require a telnet or serial port session. Brocade switches allow
practically all management through the Web interface, which greatly simplifies
management by more casual users.
    There are two types of Web-based GUIs available: pure HTML-based and
Java-based Web pages. HTML-based interfaces usually present simple HTML
pages to access all switch functionality, whereas Java-based Web pages have
embedded Java applets that run more like a standard application. There are
advantages and disadvantages to each, but you are generally stuck with the type
of interface that ships with your switch. The advantage of pure HTML pages is
the speed of loading a specific page: a Java Virtual Machine (JVM) does not
have to be loaded, which sometimes can take quite a while across a slow link.
Moreover, with Java-based pages, compatibility issues between different Web
browsers mean that you might
     encounter some difficulties if your browser is not exactly the same as the version
     for which the switch GUI developers qualified their Java software. The advantage
     of a Java-based applet is that it offers user-friendly management of complex
     tasks. When you select the switches you intend to use in your network, you
     should compare the common tasks you would like to perform against the
     Web-based GUI tools available, and make sure that you purchase the licenses
     for Web-based management. Web-based tools are very helpful for accomplishing
     day-to-day tasks. In addition, Web-based tools make management of your
     switches easy even from remote locations or offsite.

     Application-Based Management
     Some switches on the market also support application-based management. Instead
     of an embedded Web server or Java application, a separate, externally run program
     manages your switch. These applications are usually based on Java as well, but need
     to be installed on a server. Managing your switch through an application
     sometimes can be faster than running an application from a Web interface, and some
     hardware offers identical interfaces between the Web and an application. Brocade
     Fabric Manager is an example of an externally based management program.
         Application-based management works best when you have a permanent
     network management station where the software can be installed and used
     normally. It is less convenient if you need to move from place to place and do
     not want to reinstall the software on whatever machine you happen to be using
     that day. The host running the management application is also a single point
     of failure for management.

     SCSI Enclosure Services
     SCSI Enclosure Services (SES) is a SCSI protocol-based method of obtaining man-
     agement information. SES support gives some information on the status of SCSI
     equipment on the network. If your software supports the SES standard, you can
     use this feature of Fibre Channel switches to also monitor the basic health and
     well-being of your switch through the same SCSI-based management software.
     SES is generally used by operating systems so that they can incorporate certain
     management functionality into their environment. End users rarely use it, and its
     usage by operating systems is waning due to the advent of sophisticated and
     powerful alternatives such as the Management Server.

Connecting Your Servers with Host Bus Adapters
HBAs connect hosts to your Fibre Channel network. Plugged into a host bus such
as PCI or SBus, an HBA translates operating system SCSI commands into the
proper Fibre Channel frames and protocols on the wire. Compared with Ethernet
network adapters, Fibre Channel adapters are much more intelligent and
frequently contain embedded processors and firmware to negotiate the
Fibre Channel protocol. Fibre Channel HBAs provide advanced functionality
such as persistent binding and HBA-based LUN masking, which are used in
conjunction with switch zoning and storage LUN masking to control and allocate
storage in your SAN.
    This section covers the major types of HBAs and details the features available
on most HBAs. In addition, a discussion of specific issues about using HBAs and
how to implement these features in your own SAN will help you understand
how to best use these components in your network.

Connecting Hosts to the Fabric
HBAs operate by plugging into the internal bus of your host machine (for
instance, PCI or SBus). Loaded with device driver software, an HBA appears to
the operating system as a SCSI adapter. In most cases, the operating system
cannot distinguish an HBA from other storage adapters, such as SCSI adapters.
HBAs even emulate the way legacy SCSI adapters communicate with the operating
system. The HBA maps devices seen on the network to the SCSI bus, target, and
LUN addresses associated with a SCSI adapter.
    An operating system treats an HBA exactly as it does a SCSI adapter, down
to the exact same SCSI commands and packets. The HBA takes these packets and
translates them into the Fibre Channel protocol, adding network headers and
error handling. It transmits the packets across the network, confirms the
response from the storage, and returns the information and status to the
operating system, all as if it were a SCSI adapter.
    More advanced HBAs also do this for the Internet Protocol (IP) and the Virtual
Interface (VI) protocol, presenting network and clustering adapters to the
operating system and software.

HBA Types
HBAs range from low-cost, embedded chips to high-end, dual-channel
multipathing adapters. The most basic HBAs support only small FC-AL loops with a
few devices and contain minimal buffering memory or intelligence. On the high
     end, adapters might include additional buffer memory for better performance and
     throughput, intelligence to handle large fabric deployments, and high-end fea-
     tures such as HBA-based LUN masking and failover capability.

     A Plethora of Protocols
     Fibre Channel networks, although primarily used for storage using the SCSI FCP
     protocol, also can be used for other protocols such as IP for standard networking
     and VI for clustering. Different HBAs support different protocols but, at a
     minimum, support SCSI FCP. It is becoming standard for adapters to support SCSI
     FCP and IP, and newer adapters now support the VI protocol as well. If you are
     designing a SAN and think you might want to use it for routing IP frames, backup,
     or other IP traffic, it is well worth checking to see if your HBA hardware supports
     IP or VI, or can support these protocols through a software or firmware upgrade.

     The FCP/SCSI Protocol
     FCP/SCSI is the primary protocol used to transfer data over the Fibre Channel
     network. Fibre Channel Protocol (FCP) encapsulates standard SCSI commands,
     which are identical to the old SCSI bus commands. Instead of bus signals,
     however, the Fibre Channel standard transmits the commands as a set of frames
     containing the usual command and data phases of the SCSI protocol. HBAs take
     responsibility for translating requests addressed to a SCSI bus, target, and
     LUN and redirecting them to a specific Fibre Channel address and LUN address.
     As a result, applications and operating systems written for SCSI can run on
     top of Fibre Channel unmodified, just as they would on a SCSI bus.

     The IP Protocol
     The IP protocol, the standard for the Internet, runs on top of Fibre Channel by
     following the IPFC protocol. Using the same concepts of IP address and mask,
     IP-capable adapters generally look and behave identically to Ethernet adapters,
     only much faster. By installing the appropriate drivers, you can add a network
     adapter and a set of IP addresses whose traffic is sent over Fibre Channel
     instead of Ethernet or another network. Although not typically used to
     replace Ethernet, IPFC is useful for managing in-band Fibre Channel devices,
     offloading backup traffic, or connecting machines over the same long-distance
     links as your storage, which saves the cost of running a different network
     over a different wire. In installations where server slot space is at a
     premium, IPFC can save an additional slot and network infrastructure.
    One point to note is that, unlike Ethernet, Fibre Channel is not well suited
to broadcast and multicast operation. Some Fibre Channel networks do not
inherently support broadcast packets and rely on software and switch support to
properly broadcast frames to all nodes on the network. Brocade supports
hardware forwarding of broadcast and multicast frames and provides the software
necessary to support both in a multiswitch fabric.

The VI Protocol
VI is a specification that was developed by Intel for low-latency server clustering.
In standard clustering environments, IP over Ethernet or other networks has been
used to pass data from machine to machine. Unfortunately, because of the software
overhead of IP stacks and the many software layers that data must pass through
to send IP frames, clustering has struggled to reach its full potential. To solve
this problem, VI technology removes the traditional IP networking stack and
instead provides a method of directly sending data across a wire to another
computer’s memory. Applications, especially cluster-aware applications like
databases, have started to use this protocol for cluster communications. VI over
Fibre Channel provides the ability to leverage the Fibre Channel infrastructure
to also pass VI traffic.
    Fully clustered databases like Oracle Parallel Server and IBM DB2 both
support the VI protocol in their inter-node communications. When these
databases are configured to use VI on a Fibre Channel card, they run faster and
with significantly less CPU usage.

Speeds
All Fibre Channel adapters today support the 1 Gbit/sec (100 MB/sec) speed that
all Fibre Channel equipment supports. As network infrastructure such as switches
moves toward the new 2 Gbit/sec standard, so do HBAs and storage. This new
standard doubles the speed of current adapters, and standards are firming up
that allow auto-negotiation between 2 Gbit/sec and 1 Gbit/sec speeds and their
protocol differences. The next-generation 10 Gbit/sec standards are now in
development and should eventually allow 10 times the speed of current-generation
SAN components. With most Fibre Channel storage, it is rare that you will even
approach the 100 MB/sec (200 MB/sec full duplex) that current adapters allow
you to reach at 1 Gbit/sec.
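
The 100 MB/sec figure follows from the link's signaling rate and encoding. The
short calculation below assumes the 1.0625 Gbaud line rate and 8b/10b encoding
(10 bits on the wire per data byte) used by 1 Gbit/sec Fibre Channel; neither
value is stated in the text above, and frame overhead is ignored.

```python
LINE_RATE_BAUD = 1.0625e9   # assumed 1 Gbit/sec Fibre Channel signaling rate
WIRE_BITS_PER_BYTE = 10     # assumed 8b/10b encoding: 10 line bits per data byte


def raw_mb_per_sec(line_rate_baud):
    """Raw one-direction data rate in MB/sec, before frame overhead."""
    return line_rate_baud / WIRE_BITS_PER_BYTE / 1e6


one_gig = raw_mb_per_sec(LINE_RATE_BAUD)
two_gig = raw_mb_per_sec(2 * LINE_RATE_BAUD)

print(f"1 Gbit/sec link: {one_gig:.2f} MB/sec per direction")
print(f"2 Gbit/sec link: {two_gig:.2f} MB/sec per direction")
print(f"1 Gbit/sec full duplex: {2 * one_gig:.2f} MB/sec")
```

Once frame headers and other protocol overhead are subtracted, the usable
payload rate comes out at roughly the 100 MB/sec per direction quoted above.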

     Ports
     The number of ports on a Fibre Channel adapter ranges from a single port to
     dual-port adapters that can act as two individual HBAs on a single card.
     Dual-port adapters can reduce cost significantly by providing two separate
     connections on a single card, and they can be useful where there is a need to
     connect to two separate fabrics from a single system.
         One limitation to be aware of with dual-port cards is the overall bandwidth
     you can achieve. The limit comes not from the HBAs but from the computer
     system buses to which they are attached. Some bus architectures, such as PCI
     (33 MHz, 32 bit), cannot handle much more than the 100 MB/sec available on a
     single Fibre Channel port today, so you will probably not get twice the
     performance from the two ports on a single card. Moreover, a multiport HBA is
     a single point of failure: if the card fails, all of its ports fail with it.
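
     The bus limit can be checked with simple arithmetic. The sketch below
     computes the theoretical peak of a parallel bus from its clock and width and
     compares it with the demand of two ports at the 100 MB/sec figure used in
     this chapter. It is a peak figure only; real buses deliver less because of
     arbitration and protocol overhead.

```python
def bus_peak_mb_per_sec(clock_mhz, width_bits):
    """Theoretical peak transfer rate of a parallel bus, in MB/sec."""
    return clock_mhz * (width_bits / 8)


FC_PORT_MB_PER_SEC = 100  # per-port figure quoted in the text

bus_peak = bus_peak_mb_per_sec(33, 32)        # 33 MHz, 32-bit PCI
dual_port_demand = 2 * FC_PORT_MB_PER_SEC     # both ports running flat out

print(f"PCI peak: {bus_peak:.0f} MB/sec")
print(f"Dual-port demand: {dual_port_demand} MB/sec")
if dual_port_demand > bus_peak:
    print("The bus, not the HBA, is the bottleneck.")
```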

     Combination Adapters
     A recent innovation in Fibre Channel adapters is the appearance of combination
     adapters, which combine the functionality of Fibre Channel with other network
     interfaces. For example, combination Gigabit Ethernet and Fibre Channel cards
     exist on the market. These tend to be used where slot space is at a premium, or
     in embedded applications where there is a need to support multiple interfaces in
     a small space.

     Fabric-Capable Versus Loop Adapters
     There are two different classes of HBA capability. Legacy HBAs usually support
     only loop operation and cannot connect to a fabric. These adapters are termed
     private HBAs. They can generally connect only to other private FC-AL devices.
     Some low-end HBAs still support only loop operation, although it is possible
     to upgrade the drivers on these HBAs to support fabric attachment. Sun and HP
     HBAs, which originally were capable of private mode only, are examples of
     legacy HBAs.
         Fabric-capable HBAs support both loop and fabric and can address thousands
     of nodes connected into a switch. They use the fabric Name Server to access
     different fabric devices. In addition, they are aware of the different fabric
     protocols used to monitor and find other nodes in the network and do not
     require special modes in a switch to operate in fabric. In general, it is best to find
a fabric-capable adapter for your switched fabric SAN, since such adapters are
much more advanced and able to deal with the complexities of switched fabric
operation.

HBA-Based LUN Masking
HBA-based LUN masking is the capability of an HBA to selectively hide
(“mask”) storage devices on the network from a host. By masking specific LUNs
from a host, you can control which storage a host maps into the operating
system. This is important when you have mixed operating systems on the same
network, since it prevents data corruption caused by ownership contention.
LUN masking also provides a method for dividing storage capacity in
your network.
    LUN masking is very important in the context of operating systems such as
Windows NT and Windows 2000. Windows operating systems, which are not
natively Fibre Channel-aware, do not expect storage volumes to ever be shared
with any other hosts. If a storage volume is exposed to more than a single host,
the operating system might not be able to mount the file system located on this
disk. In response, the operating system might write a signature on a disk it does
not own and will most likely corrupt any data that is already on that volume. By
masking LUNs so that a host sees only the volumes it is permitted to see and
own, you can avoid these problems entirely.
    LUN masking is also very important when you are mixing operating systems
in a network. Because file system formats vary between operating systems, if a
LUN is formatted for one operating system, another operating system will not
recognize that it is in use. If LUN masking is not used, the second operating
system could assume that, because it does not recognize the file system, it can
write its own data on the same disk, corrupting data that is already there.
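
The masking idea described above can be sketched as a simple filter. In this
illustration a mask is just the set of (WWPN, LUN) pairs a host may see; the
host names and the WWPN are made up, and real HBA drivers store and apply masks
in their own vendor-specific ways.

```python
def visible_luns(discovered, mask):
    """Keep only the discovered (wwpn, lun) pairs that the host's mask allows."""
    return [device for device in discovered if device in mask]


def contested_luns(masks):
    """Return the (wwpn, lun) pairs that more than one host is allowed to see."""
    owners = {}
    for host, mask in masks.items():
        for device in mask:
            owners.setdefault(device, []).append(host)
    return {device: hosts for device, hosts in owners.items() if len(hosts) > 1}


# Hypothetical storage port exporting four LUNs.
ARRAY = "50:00:00:00:00:00:00:01"
discovered = [(ARRAY, lun) for lun in range(4)]

# Each host is allowed to see a different pair of LUNs.
masks = {
    "nt_host": {(ARRAY, 0), (ARRAY, 1)},
    "solaris_host": {(ARRAY, 2), (ARRAY, 3)},
}

nt_view = visible_luns(discovered, masks["nt_host"])
print("nt_host sees:", nt_view)
print("LUNs visible to more than one host:", contested_luns(masks))
```

A check like `contested_luns` is one way to catch a configuration in which two
hosts with different operating systems could both claim the same disk.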

Persistent Binding
Persistent binding, sometimes referred to as LUN mapping, is the mapping of a
Fibre Channel device into an operating system at a specific device location.
    Persistent binding is particularly important for applications that use the oper-
ating system SCSI address to address a device: for example, a fixed device node in
Solaris or a raw volume used by an Oracle database. In both of these applications,
the ID must be persistent and fixed from reboot to reboot.
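
To see why the binding matters, consider a minimal sketch in which target IDs
are handed out in discovery order unless a binding table pins a device to a
fixed ID. The function name, the WWPNs, and the table layout are assumptions
made for this example, not any vendor's configuration format.

```python
def assign_targets(discovery_order, bindings):
    """Map each discovered WWPN to a SCSI target ID.

    Bound WWPNs always receive their configured ID. Unbound WWPNs receive the
    lowest free ID in the order in which they were discovered.
    """
    # IDs named in a binding stay reserved even if that device is absent.
    reserved = set(bindings.values())
    assigned = {}
    next_free = 0
    for wwpn in discovery_order:
        if wwpn in bindings:
            assigned[wwpn] = bindings[wwpn]
            continue
        while next_free in reserved:
            next_free += 1
        assigned[wwpn] = next_free
        reserved.add(next_free)
    return assigned


# Hypothetical disks that happen to be discovered in a different order per boot.
DISK_A = "21:00:00:00:00:00:00:0a"
DISK_B = "21:00:00:00:00:00:00:0b"
first_boot = [DISK_A, DISK_B]
second_boot = [DISK_B, DISK_A]

# Without bindings, the target ID of DISK_A depends on discovery order.
unbound_first = assign_targets(first_boot, {})
unbound_second = assign_targets(second_boot, {})

# With bindings, both boots produce the same mapping.
bindings = {DISK_A: 0, DISK_B: 1}
bound_first = assign_targets(first_boot, bindings)
bound_second = assign_targets(second_boot, bindings)

print(unbound_first[DISK_A], unbound_second[DISK_A])
print(bound_first == bound_second)
```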
    In some implementations of HBAs, persistent binding and LUN masking are
the same. Some vendors use persistent binding to enforce LUN masking: only
      devices that have been persistently bound by hand or through software are
      allowed access into the system.
          Other implementations do not couple persistent binding and LUN masking,
      and instead automatically bind devices as they are discovered, as long as LUN
      masking allows those new devices to show up to the operating system. This
      allows for more flexibility, since devices do not have to be configured
      manually for masking settings.

      Default LUN Access Permissions
      Default LUN access permissions are used by HBA software to determine
      whether a disk device should be mounted and accessible to a host operating
      system. HBA drivers can usually be configured to always allow access (automatic
      mapping of devices to an operating system), or to never allow access (manual
      mapping of devices to an operating system).
          It is important for very large networks to never allow automatic
      access and to require manual mapping of a device to the operating system. For
      example, you might have 20 LUNs exported from a storage array, but only the
      first two LUNs should be accessible to your host. If you set access permissions to
      default to allow automatic access, potentially all of those 20 LUNs would be
      claimed and probed by the operating system. By setting the default access to
      deny, only the intervention of an administrator will allow the operating system
      access to the disks (through LUN masking). This keeps hosts from trampling on
      the data already written to LUNs in a network.
          Typically, for very small networks the HBA drivers are configured to always
      allow access to new devices in the network. For large networks, it is a require-
      ment to never allow access unless an administrator specifically grants access.
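
      The 20-LUN example above can be expressed as a small sketch of the two
      default policies. The function and parameter names are invented for
      illustration; actual HBA drivers expose this choice through their own
      configuration settings.

```python
def accessible_luns(exported, default_allow, granted=frozenset()):
    """Return the LUNs a host may claim under a default-allow or default-deny policy.

    With default_allow, every exported LUN is mapped automatically. Otherwise,
    only LUNs an administrator has explicitly granted are mapped.
    """
    if default_allow:
        return list(exported)
    return [lun for lun in exported if lun in granted]


exported = list(range(20))  # 20 LUNs exported from a storage array

automatic = accessible_luns(exported, default_allow=True)
manual = accessible_luns(exported, default_allow=False, granted={0, 1})

print(f"default allow: host claims {len(automatic)} LUNs")
print(f"default deny:  host claims LUNs {manual}")
```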

      Upper-Level Protocol Access Permissions
      As HBAs have begun to add IP and VI capabilities to their cards, an important
      option that is beginning to appear is control over IP and VI access
      permissions. Like LUN access permissions, this allows you to control which
      devices are allowed to receive IP or VI frames. The major use of this
      currently is to prevent IP or VI frames from being sent to hosts or storage
      that do not understand those protocols. In some cases, receiving these frames
      causes errors or software crashes on these storage devices, even though they
      should not recognize the frames and should ignore them. Setting these
      permissions lets you control this behavior and prevent these types of errors.

Dynamic Versus Static Discovery
Fibre Channel is a networking protocol, and unlike a parallel SCSI bus, devices in a
network are not usually powered up and down at the same time. In fact, devices
can be added or removed at any time, and the whole network continues to stay up.
Older parallel SCSI devices used to require a reboot to discover any new storage
devices attached to a host, because of the static nature of the parallel SCSI bus.
    Dynamic discovery of devices is the capability of an HBA to discover new
devices on a network without rebooting. This allows the most flexibility: by
rescanning drives with operating system software, you can add or remove a
storage volume without rebooting the system. Static discovery occurs when
an HBA requires a reboot to discover new devices, which is still the case for
some hardware.
    Understanding whether your hardware supports dynamic or static discovery is
important. If you want to run a Fibre Channel network around the clock (24x7),
dynamic discovery is necessary.
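
Conceptually, a rescan compares what is visible now with what was visible
before. The sketch below shows that comparison on two lists of device names;
the names are placeholders, and how the lists are obtained depends entirely on
your HBA driver and operating system.

```python
def diff_scans(previous, current):
    """Compare two scans and return (added, removed) device lists, sorted."""
    previous, current = set(previous), set(current)
    added = sorted(current - previous)
    removed = sorted(previous - current)
    return added, removed


# Placeholder device names from two successive scans of the fabric.
before = ["array1:lun0", "array1:lun1", "tape1"]
after = ["array1:lun0", "array1:lun1", "array1:lun2"]

added, removed = diff_scans(before, after)
print("added:", added)
print("removed:", removed)
```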

Configuration Management Software
The suite of software available for most network adapters has usually been limited
to very basic command-line utilities or a few simple configuration pages attached
to the device driver. However, in the Fibre Channel HBA world, the features and
capabilities of cards are more advanced than a few simple configuration settings.
The advent of sophisticated management software has made it possible not just to
change card settings, but also to monitor what a Fibre Channel card sees in the
network, monitor the status of connections, identify externally connected nodes,
and run diagnostics. Figure 3.8 shows the TROIKA SAN Command utility, and
Figure 3.9 shows an example of Emulex’s configuration utility.
    Different vendors provide varying amounts of configuration management
software, ranging from simple command-line utilities to sophisticated GUI
applications.

HBA API Support
The HBA API is a C-level API supported by Fibre Channel HBA manufacturers
to enable the collection and management of information available from HBAs.
This API is used by SAN management software to collect information such as
model numbers, vendor names, hardware and software version numbers, port
speeds and settings, as well as ports attached to a Fibre Channel HBA. Support
      for the HBA API is widespread and is a requirement for managing your Fibre
      Channel HBAs through most management software.
      Figure 3.8 HBA Configuration Management Software (TROIKA SAN Command)




      Figure 3.9 Emulex Configuration Software

    The HBA API consists of two major parts: a shared, common library that is
loaded onto your operating system and accessible to any application, and a
vendor-specific library that supports the HBA that you are installing in your
system. The common library allows applications to call specific functions of
the HBA API generically and is dynamically linked with applications such as
VERITAS SANPoint Control, SANavigator, and other SAN management software.
This library is installed into a common location (such as C:\WINNT\SYSTEM32
on Windows NT and /usr/lib on UNIX). The vendor-specific library is provided
by the HBA manufacturer and is usually installed in a manufacturer’s install
location.
    The HBA API has an advantage over the previous technique used for managing
HBAs, which tended to rely on vendor-specific I/O Controls (IOCTLs). The
API, which was developed as part of the efforts of the Storage Networking
Industry Association (SNIA) to increase interoperability and manageability of
Fibre Channel networks, has the advantage of running with any HBA vendor’s
hardware across different operating systems and allowing different vendors’
hardware to be addressed in the same box.
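
The two-layer arrangement can be pictured with a small sketch. The real HBA API
is a C library; the Python classes, method names, and adapter data below are
hypothetical stand-ins that only illustrate how a common layer can dispatch to
several vendor-specific layers in the same machine.

```python
class VendorLibrary:
    """Stand-in for one manufacturer's library (hypothetical interface)."""

    def __init__(self, vendor, adapters):
        self.vendor = vendor
        self._adapters = adapters  # list of dicts describing each adapter

    def adapter_attributes(self):
        # Tag every adapter record with the vendor that reported it.
        return [dict(adapter, vendor=self.vendor) for adapter in self._adapters]


class CommonLibrary:
    """Stand-in for the shared library that management applications call."""

    def __init__(self):
        self._vendor_libraries = []

    def register(self, library):
        self._vendor_libraries.append(library)

    def all_adapters(self):
        # One generic call collects results from every vendor library.
        adapters = []
        for library in self._vendor_libraries:
            adapters.extend(library.adapter_attributes())
        return adapters


# Made-up adapters from two made-up vendors in the same host.
common = CommonLibrary()
common.register(VendorLibrary("VendorA", [{"model": "A100", "ports": 1}]))
common.register(VendorLibrary("VendorB", [{"model": "B200", "ports": 2},
                                          {"model": "B210", "ports": 2}]))

inventory = common.all_adapters()
for adapter in inventory:
    print(adapter["vendor"], adapter["model"], adapter["ports"])
```

The management application in this sketch never needs to know which vendor
library answered, which is the interoperability benefit described above.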

Remote Boot across the SAN
Remote boot is the capability of an operating system to use a Fibre Channel HBA
to access and mount the boot volume for a system. Unlike parallel bus SCSI,
volumes in a SAN are not limited to a local bus, and additional logic is necessary
to boot an operating system from the network. A boot binding to a specific
volume in the network, identified by WWN and LUN, is required for remote boot
to start. This choice is usually made through software accessible before boot
and startup.
    Remote boot allows for an interesting use of the Fibre Channel SAN.
Because boot volumes can be on a network and be available to practically any
device connected to the SAN, they make it possible to dynamically change
which physical hardware a system boots on by changing the remote
boot binding. For example, you could have the operating system image of your
Web server stored on the network on a disk, and if the hardware that was
running that Web server fails, all you would need to do is reassign the boot
volume to a new server. This makes it possible to easily reallocate functions of
your servers across your network when hardware failure occurs, without the
moving, reinstalling, restoring, or other typically cumbersome and lengthy
processes associated with local storage. This is also used to enable
      advanced disaster tolerance, so that you can boot the image of a host on
      different hardware.
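
      Reassigning a boot volume amounts to moving a binding from one server to
      another. The sketch below models bindings as a table from server name to a
      (WWN, LUN) pair; the server names, the WWN, and the function are invented
      for illustration and do not correspond to any particular HBA boot utility.

```python
def reassign_boot_volume(bindings, failed_server, spare_server):
    """Return a new binding table with the failed server's boot volume moved
    to the spare server. The original table is left unchanged."""
    if spare_server in bindings:
        raise ValueError(f"{spare_server} already has a boot binding")
    updated = dict(bindings)
    updated[spare_server] = updated.pop(failed_server)
    return updated


# Hypothetical binding: the Web server boots from LUN 0 behind this WWN.
bindings = {"web1": ("50:00:00:00:00:00:00:01", 0)}

after_failure = reassign_boot_volume(bindings, "web1", "spare1")
print(after_failure)
```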
          The difficulty of remote booting is that when you use network volumes to
      run your operating system, any error or failure in your network cuts off
      access to everything on the boot volume. Errors will
      not be logged, and in many cases, everything will come to a grinding halt until
      the network error is fixed. A simple case of someone unplugging the Fibre
      Channel connection on the back of a machine could be catastrophic, versus the
      normal situation where an internal disk is unlikely to be removed without
      powering down a machine and opening the case. This situation can be remedied
      by using a dual-fabric configuration.

      Hot-Plug Support
      With newer operating systems and hardware, the capability to swap failed
      equipment while your computer and operating system are still running is
      becoming available. Peripheral Component Interconnect (PCI) hot-plug systems
      allow the removal and insertion of hardware, even while I/O is occurring. This
      makes it possible to fix problems without taking down critical systems or
      rebooting systems. In redundant configurations, this ensures that operation can
      continue even through single hardware failures.
          The typical sequence of events in a hot-plug situation is this: an HBA fails or
      is otherwise showing signs of problems. Through system error logs, an
      administrator determines that a certain piece of hardware is at fault and
      indicates, either through software or through buttons or levers on the system,
      that he or she would like to swap this HBA. This signals the driver software
      and the hardware bus to stop I/O to the card and isolate its connections from
      the remainder of the system. Usually, a light or other indicator will show that
      it is acceptable to remove the offending card, and the administrator will swap
      it with an identical piece of hardware. By pressing another button or lever, or
      by running software, the administrator tells the operating system that a new
      piece of hardware has been swapped in. Finally, that signals the drivers and
      the operating system to start using that hardware again.
          This generally works well. However, there are several Fibre Channel-specific
      complications that need to be addressed with PCI hot-plug. These are Fibre
      Channel settings, storage LUN masking, and zoning settings, which are usually
      tied to a very specific piece of hardware and which will not recognize the
      newly swapped card as a stand-in for the hardware it replaces.
          Fibre Channel settings are the specific settings for how a piece of hardware
      should behave on the Fibre Channel network. Because there are different modes
      of operation, and specific compatibility modes and settings needed to operate
      with different Fibre Channel hardware, these settings are important to maintain.
      Often, these settings are stored in nonvolatile RAM (NVRAM), which is a
      physical component on the HBA. When you swap the HBA, these values are lost,
      and the system instead uses whatever settings are written into the new card.
      Two things should be done to set up these Fibre Channel settings properly.
      First, if possible, replacement cards and spares should be configured ahead of
      time identically to the running system. Timeout values, modes of operation, and
      other settings should be logged and set on all spare hardware so that they do
      not need to be reset on a hot-swap card. In addition, it is best to force
      settings in your driver software (typically done through registry entries or
      configuration management software in Windows systems and through .conf files in
      UNIX), so that the driver forces the setting no matter what is programmed into
      the card.
     Storage LUN masking is the capability of Fibre Channel storage to enforce a
level of security on which devices are allowed to access specific volumes on the
storage. High-end RAID arrays are typically the only pieces of hardware that
contain this functionality. The storage holds settings within its software and
memory that allow only specific devices (HBAs) access to certain LUNs. This is
usually defined with either the port WWN or the node WWN of the HBA that is
installed in your host system. If you swap this HBA, the WWNs will change,
because they are globally unique to that specific HBA. You will have to
reconfigure your storage to accept the new port WWN or node WWN of the HBA that
you have hot-swapped into the system, or no I/O operations will be allowed to
the LUN. To some extent, you can avoid this by knowing in advance the port and
node WWNs of the hot spare and preprogramming your storage to accept them. If
this is not possible, you will have to ensure that your hot-plug procedures
include reprogramming the storage array.
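
The effect of an HBA swap on LUN masking can be sketched in a few lines of
Python. This is a toy model under stated assumptions, not any array vendor's
interface: the MaskingTable class, its methods, and the WWN values are all
hypothetical and exist only to show why the reprogramming step is needed.

```python
# Toy model of storage LUN masking; class, method names, and WWNs are
# illustrative assumptions, not a real array's management API.

class MaskingTable:
    def __init__(self):
        # Maps an HBA port WWN to the set of LUNs it may access.
        self._allowed = {}

    def grant(self, wwn, lun):
        self._allowed.setdefault(wwn.lower(), set()).add(lun)

    def is_allowed(self, wwn, lun):
        return lun in self._allowed.get(wwn.lower(), set())

    def replace_hba(self, old_wwn, new_wwn):
        """Move every LUN grant from a failed HBA to its replacement."""
        luns = self._allowed.pop(old_wwn.lower(), set())
        self._allowed.setdefault(new_wwn.lower(), set()).update(luns)


table = MaskingTable()
failed_hba = "10:00:00:00:c9:aa:aa:01"   # made-up WWN of the original HBA
spare_hba = "10:00:00:00:c9:bb:bb:02"    # made-up WWN of the spare

table.grant(failed_hba, lun=4)

# Immediately after the swap, the array does not know the spare's WWN.
blocked_before = not table.is_allowed(spare_hba, 4)

# Reprogramming the array is the step the hot-plug procedure must include.
table.replace_hba(failed_hba, spare_hba)
allowed_after = table.is_allowed(spare_hba, 4)
```

Preprogramming the spare, as suggested above, would amount to calling grant
for the spare's WWN before the failure ever happens.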
     The last Fibre Channel-specific setting that is important in hot-plug
situations is switch zoning. Switch zoning operates much like storage LUN
masking, since it also relies on the WWNs programmed on an HBA to perform the
zoning operation (unless you are using port zoning, in which case this does not
apply). When you insert a new HBA, its WWNs will no longer match those
programmed into your zones, meaning that any storage that your host previously
had access to will not be available if switch zoning is in effect. There are
several things you can do to minimize the amount of work needed for a hot-plug
swap. One is to use an alias in your switch to define the host, so that it is
easy to change the actual port or node WWN of the alias and have all other zone
settings change automatically. Another is to do as in storage LUN masking and
preprogram, if possible, the WWNs of your spares into the proper zones so that
they are automatically included as part of the zone. Finally, some HBAs allow
you to force a node WWN to apply to an entire set of HBAs in a system and use
that node WWN as the zoning member, which makes it possible to swap in new HBAs
without requiring reconfiguration of your switch zoning. If you use this method,
perform all zoning by using the node WWN of each HBA, rather than the port WWN.
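
The alias approach can be illustrated with a small Python sketch. It assumes a
simplified model in which zones list aliases and aliases list WWNs; the alias
names, zone names, and WWNs are hypothetical and do not correspond to any
switch's command set.

```python
# Toy model of alias-based switch zoning; names and WWNs are hypothetical.

aliases = {
    "mail_host": {"10:00:00:00:c9:aa:aa:01"},   # HBA port WWN (made up)
    "mail_array": {"50:06:04:82:cc:cc:cc:03"},  # storage port WWN (made up)
}

# Zones refer to aliases, never to WWNs directly.
zones = {
    "mail_zone": ["mail_host", "mail_array"],
}


def zone_members(zone_name):
    """Expand a zone's aliases into the set of WWNs it currently admits."""
    members = set()
    for alias in zones[zone_name]:
        members |= aliases[alias]
    return members


def can_communicate(wwn_a, wwn_b):
    """Two WWNs can talk if at least one zone contains both of them."""
    return any(
        wwn_a in zone_members(z) and wwn_b in zone_members(z) for z in zones
    )


spare = "10:00:00:00:c9:bb:bb:02"
array_port = "50:06:04:82:cc:cc:cc:03"

before_swap = can_communicate(spare, array_port)

# After the hot-swap, only the alias changes; every zone that uses the
# alias picks up the new WWN without being edited.
aliases["mail_host"] = {spare}
after_swap = can_communicate(spare, array_port)
```

With many zones referring to the same host, the single alias update replaces
what would otherwise be one edit per zone.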

      Connecting Legacy
      Devices into Your SAN
      When Fibre Channel networks were first put into operation, many storage
      devices were not able to natively communicate using the Fibre Channel protocol.
      However, the parallel SCSI bus was widely supported and many different kinds of
      devices were implemented with parallel SCSI support. Because Fibre Channel
      carries the same SCSI protocol over a different kind of medium, devices called
      Fibre Channel-to-SCSI routers were developed to translate Fibre Channel frames
      to the appropriate parallel SCSI commands. These routers include one or more
      Fibre Channel ports on one side, and one or more parallel SCSI bus connections
      on the other. Devices on the parallel SCSI bus side are presented to the Fibre
      Channel network like any other native Fibre Channel storage device, as LUNs
      available from a Fibre Channel port.
          Fibre Channel routers make it possible to use legacy parallel SCSI devices by
      simply plugging a box into the network on one side and the legacy parallel bus
      on the other side. In particular, tape libraries have relied on Fibre Channel routers
      to help enable direct backup of storage on the SAN.

      Basic Features of Routers
      Fibre Channel routers should more accurately be called bridges, as they bridge
      legacy SCSI devices and Fibre Channel, translating between Fibre Channel
      SCSI-FCP transactions and parallel SCSI bus transactions. A Fibre Channel router
      plugs into the Fibre Channel network on one side and a SCSI bus on the other. To
      the SCSI bus, a router looks like an initiator such as a host. It issues SCSI
      commands such as resets and inquiries, and determines what SCSI devices exist on
      the SCSI bus. To the Fibre Channel network, a router looks like any other
      storage node on the network. Presenting each of the SCSI devices it has found on
      the SCSI bus as a LUN connected to a port, the router takes any frames sent to
      the SCSI devices and translates them from SCSI-FCP to parallel bus SCSI.
      Likewise, any responses received from the SCSI devices are changed to SCSI-FCP,
      tagged with network headers, and sent back to the initiating device.
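
The discovery-and-presentation step described above can be sketched as follows.
This is a minimal illustration that assumes LUN numbers are assigned in bus and
target order; real routers may use a different or configurable mapping, and
every name in the code is hypothetical.

```python
# Toy model of how a Fibre Channel-to-SCSI router might present parallel
# SCSI devices as LUNs. All names here are hypothetical.

from typing import NamedTuple


class ScsiDevice(NamedTuple):
    bus: int        # which parallel SCSI bus on the router
    target_id: int  # SCSI ID on that bus
    kind: str       # e.g. "tape" or "disk"


def build_lun_map(discovered):
    """Assign consecutive Fibre Channel LUN numbers to discovered devices."""
    ordered = sorted(discovered, key=lambda d: (d.bus, d.target_id))
    return {lun: dev for lun, dev in enumerate(ordered)}


def route(lun_map, lun):
    """Return the (bus, target) a frame addressed to this LUN is sent to."""
    dev = lun_map[lun]
    return dev.bus, dev.target_id


# Devices the router found by scanning its SCSI buses as an initiator.
found = [
    ScsiDevice(bus=1, target_id=3, kind="tape"),
    ScsiDevice(bus=0, target_id=5, kind="tape"),
    ScsiDevice(bus=0, target_id=2, kind="disk"),
]

lun_map = build_lun_map(found)
```

A host on the Fibre Channel side only ever sees the LUN numbers; the bus and
target behind each one stay internal to the router.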
    Routers support any type of parallel SCSI device and are used for everything
from legacy RAID arrays and disk storage to non-Fibre Channel-capable tape
drives. Most routers act as Class 3 devices, because they are geared toward the
transport of storage SCSI traffic (which operates best with Class 3). They also
tend to support loop-only operation, rather than full Fibre Channel fabric
protocols, because they are target and not initiating devices. In addition,
these routers support additional levels of error recovery on top of normal SCSI
error recovery, including all error recovery in FCP and the newer FCP-2 error
recovery procedures. The features that distinguish Fibre Channel-to-SCSI routers
include the following:
     •   Number of SCSI buses
     •   Types of SCSI buses
     •   Internal or external SCSI termination
     •   Selective LUN presentation
     •   Extended copy support
     •   SNMP
     •   Telnet
     •   Ethernet ports
     •   Serial ports


Number of SCSI Buses
The most basic routers have one SCSI bus and one Fibre Channel port connected to
the Fibre Channel network. More capable routers have more than one bus and
multiple Fibre Channel ports. The advantages of having more than one bus include
more available bandwidth and isolation of error conditions to a single bus. With
high-bandwidth SCSI devices and the shared-bus architecture of parallel SCSI,
speeds are limited by the amount of bandwidth that can be shared on a SCSI bus.
By providing more than a single SCSI bus, routers with multiple SCSI ports allow
you to reduce the contention among high-speed devices for resources on those
buses.
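
A back-of-the-envelope calculation shows the benefit of a second bus. The rates
below are example inputs chosen for illustration, not measurements of any
product, and the helper functions are hypothetical.

```python
# Rough check of SCSI bus contention. The bus and device rates below are
# example inputs, not measurements of any product.

def bus_utilization(bus_mb_per_s, device_rates_mb_per_s):
    """Fraction of a shared bus consumed if every device streams at once."""
    return sum(device_rates_mb_per_s) / bus_mb_per_s


def split_across_buses(device_rates, bus_count):
    """Greedy split: place each device on the currently least-loaded bus."""
    buses = [[] for _ in range(bus_count)]
    for rate in sorted(device_rates, reverse=True):
        min(buses, key=sum).append(rate)
    return buses


drives = [15, 15, 15, 15]   # four tape drives at an assumed 15 MB/s each
bus_rate = 40               # an assumed 40 MB/s bus

one_bus = bus_utilization(bus_rate, drives)
two_buses = [bus_utilization(bus_rate, b) for b in split_across_buses(drives, 2)]
```

With these inputs a single bus is oversubscribed (utilization above 1.0), while
two buses each stay below their limit.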

      Types of SCSI Ports, Termination
      Another consideration in routers is the types of SCSI ports available. With
      different kinds of connections available for SCSI, you need to make sure that
      the type of SCSI ports available on your router matches the device you are
      running. The different types of SCSI ports that might be supported by your
      device range from narrow, single-ended SCSI to fast, wide, ultra-wide, ultra2,
      and Low Voltage Differential (LVD) devices.
          One area to look at is the type of SCSI termination used by your router.
      Parallel SCSI termination can be either internal or external, depending on the
      manufacturer. In the case of internal termination, flipping a switch can turn on
      termination, enabling the device to be the end of a SCSI chain. With external
      termination, you either need to buy an external terminator or the device needs
      to be in the middle and not at the end of a SCSI chain. Internal termination
      makes it easier to manage this aspect of parallel SCSI operation and also means
      one less component you need to track.

      Selective LUN Presentation
      More advanced routers provide the capability to selectively filter which hosts
      are allowed to access a specific SCSI target. Like HBA-based LUN masking and
      switch zoning, this can be used to help enforce which devices are allowed to
      access a specific storage volume. To configure this LUN presentation,
      management software is used to specify which hosts are allowed to access
      specific LUNs. Hosts are usually identified by the port WWN of the HBA.
          Selective LUN presentation can be used to limit access to a specific SCSI
      device on one side of a router to only one Fibre Channel device, such as to allow
      only a backup server access to your tape drive. It can also be used to partition
      data between hosts, in order to assign different SCSI targets to different hosts.

      Extended Copy Support
      Extended copy support is the capability of a Fibre Channel router to support the
      Extended SCSI Copy command, which is used for server-free backup on your
      SAN. In conjunction with special Fibre Channel SAN-aware backup software,
      routers help form a part of a solution that can offload the backup traffic from the
      hosts on your network to the Fibre Channel router, enabling storage to be
      backed up directly to a tape drive connected to the router instead of requiring
      the intervention of different hosts in the network.
    Extended copy appears as a feature of a Fibre Channel router, which inter-
prets SCSI commands received from backup software to fetch and store data from
storage arrays.

Management Interfaces
Like switches, routers also support different management interfaces, including
serial ports, Ethernet ports, SNMP, FTP, Web interfaces, and also emerging
standards like Sun’s Jiro management standard. Most Fibre Channel routers also
support the FibreAlliance MIB through their SNMP interface, which allows most
management software to derive basic information from the equipment.

Bridging and Routing to
IP Networks and Beyond
As Fibre Channel expands from just a few machines and storage to much larger
networks, interoperability and connectivity are becoming much more important.
Routers and bridges are starting to be used to take Fibre Channel traffic and
send it across regional and wide area networks, as well as for remote backup and
IP traffic.

Fibre Channel to DWDM
Fibre Channel-to-DWDM technology is beginning to be used to help extend the
distances of SAN operation. Fibre Channel-to-DWDM equipment operates by
multiplexing Fibre Channel optical signals onto separate wavelengths of a shared
fiber. Typically, a single switch E_Port is connected to DWDM equipment, which
multiplexes the Fibre Channel signal to a remote DWDM port. On the other side, a
DWDM multiplexer is connected to a remote SAN where the frame is sent out
through a switch E_Port back into the network. To the Fibre Channel switch, the
existence of a DWDM link is almost transparent: no change occurs to the
protocol, information is received at full speed, and there is no indication that
a frame went over a long-distance link. However, because light takes measurable
time to cover the distance, switches do need to be configured with more frame
buffers on the two switch ports connected by the DWDM equipment. Fibre
Channel-to-DWDM equipment is used primarily for regional transport, because
of the distance limitation of Metropolitan Area Networks (MANs).
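
The need for extra frame buffers on a long link can be estimated with simple
arithmetic. The sketch below assumes light travels through fiber at roughly
5 microseconds per kilometer, a full-size frame of 2,148 bytes, and a link
carrying about 100 MB/s; the function name and constants are illustrative, and
actual switch settings should come from the vendor's guidance.

```python
import math

# Rough estimate of frame buffers (buffer-to-buffer credits) for a long
# link. The constants are approximations chosen for illustration.

LIGHT_IN_FIBER_US_PER_KM = 5.0   # roughly 5 microseconds per km of fiber
FULL_FRAME_BYTES = 2148          # maximum-size Fibre Channel frame


def credits_needed(distance_km, link_mb_per_s=100.0):
    """Frames that must be in flight to keep the link full."""
    # 1 MB/s is one byte per microsecond, so this is microseconds per frame.
    frame_time_us = FULL_FRAME_BYTES / link_mb_per_s
    # A buffer is freed only after the frame arrives and the receiver's
    # ready signal travels back, so the round trip is what matters.
    round_trip_us = 2 * distance_km * LIGHT_IN_FIBER_US_PER_KM
    return math.ceil(round_trip_us / frame_time_us)
```

Under these assumptions, credits_needed(10) gives 5 and credits_needed(100)
gives 47, which matches the common rule of thumb of about one buffer for every
2 km at this link speed.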
    Using DWDM has many advantages for Fibre Channel. The first is extending the
reach of Fibre Channel beyond its 10 km limit. DWDM technology can extend Fibre
Channel to MAN distances of up to 100 km. The second is the ability to multiplex
a large number of Fibre Channel connections onto fewer fiber links, reducing the
amount of optical fiber needed between facilities and simplifying long-distance
cabling.

      Fibre Channel across IP Networks
      Developing standards allow for the encapsulation of Fibre Channel within IP
      packets (FC_IP) for transport across any IP-capable network. Like Fibre Channel
      across DWDM, this is targeted at extending SANs across regional and wide area
      distances. It works through special equipment that connects into a Fibre
      Channel network, encapsulates Fibre Channel frames within IP packets, and
      transmits them on IP networks. This allows any IP-capable network, including
      Gigabit Ethernet, ATM, and any other technology, to transport Fibre Channel
      frames.
          Transporting Fibre Channel over IP networks can help extend SANs well
      beyond the campus networks and regional networks now in use. With suitable
      bandwidth available, WAN distances can be possible for SANs with this emerging
      technology.
          The major limitation of running FC_IP is the speed and latency of the IP
      network. The IP network must be able to handle the amount of bandwidth
      generated by Fibre Channel, or it risks becoming the bottleneck. In addition,
      different vendors' FC_IP implementations are not compatible with each other,
      since the standard is still in a draft phase. However, as the standard
      progresses, expect that vendors will begin to interoperate.

      IP over Fibre Channel to Gigabit Ethernet
      Fibre Channel to Gigabit Ethernet routers allow IP frames generated on either
      the Fibre Channel network or the Gigabit Ethernet network to appear
      transparently on the other network. This can be done through dedicated
      hardware, or even through a host computer set up to route IP frames between
      different network interfaces. Standard IP protocol services such as Web pages,
      FTP, and telnet can be seamlessly run across your Fibre Channel network and
      routed across to an IP network like Gigabit Ethernet.
          This is currently done primarily as a way to retrieve management information
      from network equipment through in-band management. In particular, with
      expensive DWDM or remote links being used to connect remote sites, using IP over
      Fibre Channel (IPFC) and bridging that information to other general-purpose
      networks makes it more cost-efficient to manage the remote equipment. However,
      as IPFC becomes more prevalent, these bridges and routers can make it possible
      to merge traffic between IP and Fibre Channel networks.

Fibre Channel Storage
Selecting Fibre Channel storage depends on what kinds of applications you run
and, most importantly, how much data and protection you need. Fibre Channel
storage ranges from individual disk drives that support the Fibre Channel
protocol, to devices with dozens of connections and different kinds of storage
connection ports, such as Fibre Channel, SCSI, and Enterprise System Connection
(ESCON). This section briefly describes the Fibre Channel aspects of this
storage, but does not cover the very complex task of evaluating storage systems.

Individual Disk Drives and JBODs
Individual disk drives, although they support the Fibre Channel protocol, are
rarely deployed alone in a SAN. In general, these individual disk drives are
added to JBOD enclosures that hold four or more single disks in a single,
loop-ready configuration. These drives are wired together into a miniature FC-AL
loop with one or two ports to connect to the drives. The first port of a JBOD is
generally wired to one of the dual ports on each disk drive, and the secondary
port of the disk drives is wired on a secondary loop. To an HBA and system on
the other side, there is no difference between a disk drive and a JBOD. In fact,
you cannot tell electronically or through software that individual disks are
tied together into a JBOD. Most JBOD systems do not add any actual Fibre Channel
functionality and just physically connect the internal Fibre Channel disk drives
to the Fibre Channel network. Differences in JBODs include the number of ports;
the number and type of disk drives included in an enclosure; rack-mount and
standalone options; amount of cooling and the power supplies; temperature
sensors and alarms; and the ability to hot-swap and replace faulty components.
    Fibre Channel RAID systems can also be connected into a Fibre Channel
network with one or more ports, and range from low-end systems with only
several gigabytes of capacity and little cache to higher-end arrays with
hundreds of gigabytes of capacity and extensive cache. They provide varying
levels of redundancy and performance, and generally can be configured into
different RAID levels for differing protection. An important Fibre Channel
consideration when you are selecting a RAID system is how easy the RAID is to
configure. For example, some systems might require specific operating systems
and HBAs in order for configuration software to work. You should make sure that
the RAID system you select supports the other components in your system and has
been qualified with the switches in your system. The capability to configure a
RAID system from the serial port or from an Ethernet port can ease management of
that system. RAID systems are also available in different rack-mount and
standalone configurations and provide redundant cooling and power component
options.




         RAID Levels
         RAID arrays generally support different levels of protection and redun-
         dancy for your data. By selecting different RAID levels, you can trade off
         speed of operation and recovery against the amount of protection for
         your data. The following is a brief description of the most common RAID
         levels available:
               •   RAID 0: Striping. Provides very rapid access to data by
                   striping information across different disks. By distributing
                   data across different spindles (disks), the data can be
                   retrieved very rapidly and at a rate greater than an individual
                   disk drive can support. However, RAID 0 provides no
                   redundancy or protection for data if any disks fail.
               •   RAID 1: Mirroring. Duplicates data across disks on a one-to-
                   one basis. This fully protects the data against the failure
                   of a single disk and gives instant access to data if one disk
                   fails. A duplicate copy of all data is simultaneously written
                   to a mirrored disk, which is made available if the primary
                   disk fails. However, this is fairly expensive since it
                   requires purchasing twice as much capacity.
               •   RAID 3: Striping with Parity. Data is striped as in RAID 0,
                   but with the addition of a dedicated parity disk. This
                   provides speed and fault tolerance, because the contents of
                   any one failed disk can be rebuilt from the remaining disks.
               •   RAID 5: Striping with Distributed Parity. Similar to RAID 3,
                   but parity is instead distributed across all disks, which
                   avoids the bottleneck of a single parity disk while retaining
                   fault tolerance.
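
The parity used by RAID 3 and RAID 5 is conceptually a byte-wise XOR across the
disks in a stripe. The short Python sketch below illustrates that idea along
with the capacity trade-off of each level; the block contents, the three-disk
layout, and the function names are arbitrary examples rather than any array's
implementation.

```python
# XOR parity as used conceptually by RAID 3 and RAID 5. Block contents and
# the three-data-disk layout are arbitrary examples.
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"SAN1", b"SAN2", b"SAN3"]   # one stripe across three data disks
parity = xor_blocks(data)            # stored on the parity disk

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)


def usable_fraction(level, disks):
    """Share of raw capacity left for data at each RAID level."""
    if level == 0:
        return 1.0
    if level == 1:
        return 0.5
    if level in (3, 5):
        return (disks - 1) / disks   # one disk's worth goes to parity
    raise ValueError("level not covered in this sketch")
```

Here rebuilt equals the lost block, and usable_fraction shows why parity-based
levels cost less capacity than mirroring once several disks are in the set.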

High-End Storage Arrays
High-end storage arrays generally support multiple terabytes of data and often
include the capability to support Fibre Channel connections as well as interfaces
such as SCSI, FICON, and other specialized storage interconnects. Built to
support dozens or even hundreds of storage users, the arrays can range from
refrigerator-sized to half a room. In addition to many storage connections, these
devices provide large amounts of memory cache to accelerate disk accesses, as
well as advanced capabilities like LUN masking through selective LUN presenta-
tion, snapshot backup volumes, redundant controllers, and failover and replication
capabilities.

Selective LUN Presentation
Selective LUN presentation is the capability of a storage device to filter or
mask which hosts are allowed to see a LUN. For example, storage can be
configured to show LUN A to host X and LUN B to host Y, but not vice versa,
making it possible to partition the storage entirely by hosts on the network.
This has many advantages, including the capability to allocate the array's
storage to the network from a single interface, to guarantee that users do not
accidentally mount an incorrect volume and corrupt data, and to better control
how hosts see the storage in the network.
    Selective LUN presentation works through hardware and software that
examines frames coming into a storage subsystem from the network. By examining
the frames and comparing the source of those frames with an
administrator-configured list of allowable hosts, the storage array can allow or
deny access to specific LUNs in the array. In addition, arrays can also enforce
the LUN numbers that a host sees, even if those are not the actual physical LUN
numbers of the internal volumes. This is typically used for operating systems
like Windows NT, which requires that every storage device have a LUN 0. That is
fine for simple RAID arrays but impractical on a storage array that might have a
hundred LUNs allocated to dozens of hosts. Through selective LUN presentation,
every host can see a different LUN 0, which is physically different but can be
addressed through the same LUN 0 address.
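
The filtering and LUN renumbering described above can be modeled as a per-host
lookup table. The sketch below is a simplified illustration; the host WWNs,
internal volume numbers, and function name are hypothetical.

```python
# Toy model of selective LUN presentation on an array. Host WWNs and the
# internal volume numbers are hypothetical.

# For each host WWN: host-visible LUN number -> internal volume number.
presentation = {
    "10:00:00:00:c9:aa:aa:01": {0: 17, 1: 18},   # host X
    "10:00:00:00:c9:bb:bb:02": {0: 42},          # host Y
}


def resolve(source_wwn, host_lun):
    """Return the internal volume for a request, or None if access is denied."""
    return presentation.get(source_wwn, {}).get(host_lun)


host_x = "10:00:00:00:c9:aa:aa:01"
host_y = "10:00:00:00:c9:bb:bb:02"
unknown = "10:00:00:00:c9:dd:dd:09"
```

Both hosts address LUN 0, yet resolve maps them to different internal volumes,
and a host that is not in the table is denied outright.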

LUN Export across Multiple Ports
With Fibre Channel, the capability to make data highly available through a
network has resulted in software support for highly available transport of data
from storage. This software can identify volumes that are the same, even if they
are seen at different points of a network.
           High-end storage arrays have a feature that allows a single logical LUN to
      be exported across multiple Fibre Channel ports on an array. For example, a
      single volume for an e-mail database can be exported across two different
      storage ports, across two separate fabrics. If a storage port’s hardware fails
      or a switch fabric is disrupted, the other port can still be accessed across a
      different path. This ability to export LUNs across multiple ports enables
      dynamic multipathing software and drivers to intelligently access the same LUN,
      on either an active-passive or active-active basis.
           Array management software typically must be configured to allow for this
      multiple LUN export. Exporting a LUN across multiple interfaces without the
      addition of dynamic multipathing or volume management software can result in
      data corruption or collisions, because the operating system sees the same
      volume as several separate disks. In addition, the arrays need to identify that
      these different exported LUNs are actually and logically the exact same volume.
      This identification is referred to as Page 83h information, named after the
      SCSI INQUIRY Vital Product Data page that carries the identifiers of storage
      volumes on a storage device such as a RAID.
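
The role of the volume identifier in multipathing can be sketched as a grouping
step: paths that report the same identifier are treated as one volume. The code
below is a toy illustration; the path strings, identifiers, and function names
are made up and do not reflect any particular multipathing driver.

```python
# Toy illustration of how multipathing software could group paths by the
# volume identifier each path reports. Path names and IDs are made up.
from collections import defaultdict

# (path seen by the host, identifier reported for the volume on that path)
discovered_paths = [
    ("hba0:fabricA:array_port1:lun3", "600a0b80-0001"),
    ("hba1:fabricB:array_port2:lun3", "600a0b80-0001"),
    ("hba0:fabricA:array_port1:lun4", "600a0b80-0002"),
]


def group_by_volume(paths):
    """Collapse paths that report the same identifier into one volume."""
    volumes = defaultdict(list)
    for path, volume_id in paths:
        volumes[volume_id].append(path)
    return dict(volumes)


def pick_path(paths, failed=()):
    """Active-passive selection: first path that has not failed."""
    for path in paths:
        if path not in failed:
            return path
    raise RuntimeError("no surviving path to volume")


volumes = group_by_volume(discovered_paths)
mail_paths = volumes["600a0b80-0001"]
```

Without the grouping step, the host would treat the first two entries as two
unrelated disks, which is the corruption risk described above.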

      Snapshot Backup Volumes
      Snapshot backup volumes are a special feature of high-end arrays that take a
      “snapshot” of an operational LUN’s data at a point in time and copy that data
      to another volume. This snapshot, which is made nearly instantaneously while
      traffic is running, sometimes in coordination with a brief halt in traffic to
      the storage LUN, provides an easy way to back up a very busy storage array.
          Because high-end storage arrays are typically highly utilized and attached
      to business-critical systems that must be available 24x7, backup is a very
      challenging task. Usually, backup has to be done on a static system, that is, a
      system that is not being written to for the entire duration of the backup. With
      these critical systems, that never happens, and backups still have to be done.
      In fact, backups are probably even more important for these systems.
          Snapshot backup volumes solve this problem by providing a copy of the data
      at a point in time, which can be backed up without the problem of data
      constantly being changed on a volume. These backup volumes are exported to
      different hosts or backup hardware on the SAN, and at predetermined intervals
      they refresh their information from the “live” data to allow for the next
      backup.
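
The point-in-time behavior can be shown with a small model. This sketch copies
the whole volume on each refresh for simplicity; real arrays work at the block
level and typically use more efficient techniques. The Volume and
SnapshotVolume classes are hypothetical.

```python
# Toy model of a snapshot backup volume: a frozen point-in-time copy that
# is refreshed on a schedule. This sketch copies everything for simplicity.

class Volume:
    def __init__(self):
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data


class SnapshotVolume:
    def __init__(self, source):
        self.source = source
        self.blocks = {}
        self.refresh()

    def refresh(self):
        """Recapture the live volume's contents at this instant."""
        self.blocks = dict(self.source.blocks)


live = Volume()
live.write(0, b"orders-v1")
snap = SnapshotVolume(live)

# The live volume keeps changing while the backup reads the snapshot.
live.write(0, b"orders-v2")
backed_up = snap.blocks[0]

# At the next scheduled period the snapshot is refreshed for a new backup.
snap.refresh()
next_backup = snap.blocks[0]
```

The backup sees a consistent image (the first version) even though the live
volume changed underneath it, and picks up the newer data only after a refresh.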

Summary
The keys to robust Fibre Channel deployment are the components that you
use to build your network. By understanding these components, you can better
design a robust, scalable network. Understanding the features and capabilities
of your hardware will help you select and qualify the best equipment for
your needs.
    The most basic layer of your SAN is the physical layer, which includes your
media and connector choices. There are a number of choices that are dependent
on the primary purpose of your SAN. In selecting the type of media to use for
your SAN, you have two choices: copper and optical. The distinct advantage of
copper is that it is inexpensive compared to all types of optical fiber. The
advantage of optical fiber is that it provides a reliable signal over a longer
distance than copper. There are two types of fiber: single mode and multimode.
The difference between the fibers is the diameter of the core. Smaller-diameter
single-mode fiber can transmit over a far greater distance.
    The cardinal rule for connecting a SAN is to minimize the total number of
connections and patches. SC is the standard optical connector, but high-density
connectors are becoming more popular as more devices are connected to SANs.
Examples of high-density connectors are LC, MT-RJ, and HSSDC. HSSDC and
DB-9 are the copper connectors available. In most SANs, you will need to use
some type of GBIC. GBICs provide an easy way to adapt devices to whatever
connection type you prefer. This lets you customize your SAN based on the
distance and speeds required between devices.
    Hubs serve as the most rudimentary support for Fibre Channel, providing
basic FC-AL connectivity to smaller networks. Simple hubs provide only basic
electrical connectivity and shared bandwidth, with no intelligence. Intelligent
hubs provide much better error recovery and management, especially in
multi-initiator fabrics. Key features to look at in hubs are the different
management interfaces available, error recovery, and LIP isolation.
    Switches form the core of a switched Fibre Channel fabric, and not only switch
frames between nodes but also contain intelligent services to locate and manage
nodes in the fabric. In addition, the capability to work in loop mode also makes
Fibre Channel switches a good replacement for hub deployments. Performance and
functionality such as zoning and the ability to cascade switches together make
switches a key to controlling access in the SAN and scaling your installations.
    HBAs connect hosts to the Fibre Channel network. In conjunction with device
drivers, these devices translate SCSI commands and protocols from operating
systems and send them across the Fibre Channel network. Advanced functionality
such as persistent binding and LUN masking adds to your ability to control how
storage is allocated in your network. HBAs are also capable of carrying non-SCSI
traffic such as IP frames and VI clustering protocols, which are enabling new
applications for your SAN.
           Fibre Channel-to-SCSI routers provide an important bridge from legacy par-
      allel bus SCSI devices to new Fibre Channel networks. By translating between
      Fibre Channel SCSI and older parallel bus SCSI protocols, older non-Fibre
      Channel SCSI devices can seamlessly be used in your network. Advanced features
      such as selective LUN presentation and support for extended copy can also help
      support applications on your SAN.
           Companies are starting to extend their SAN infrastructure over long
      distances, adding disaster tolerance and remote backup to their data centers.
      Technologies such as Fibre Channel-to-DWDM equipment and encapsulation of Fibre
      Channel over IP networks are gaining ground as a way to extend SANs to MANs and
      even WANs. Bridging IP from Fibre Channel is also helping to drive the
      interoperability between standard corporate IP networks and specialized data
      SANs.
           Fibre Channel-capable storage forms the core of a SAN and is present as
      low-cost JBOD disk drive cabinets, midrange RAID arrays, and high-end storage
      subsystems. JBOD arrays present individual Fibre Channel-capable disk drives to
      the network. RAID arrays add error tolerance, cache, and management ability to
      the network, with high-end storage subsystems providing much higher
      functionality and capacity to Fibre Channel and other storage interconnects.
      High-end features such as selective LUN presentation, multiple export of LUNs,
      and snapshot backup can help you better manage the allocation of storage on
      your SAN, add redundancy to your networks, and keep your data available 24x7.
           These components together form the SAN. By studying them and under-
      standing their major features and benefits, you can better select components and
      design your network.

      Solutions Fast Track
      Overview of Fibre Channel Equipment
              Understanding the features of your Fibre Channel equipment is key
              when building a robust infrastructure.
                                    SAN Components and Equipment • Chapter 3       117


     A Fibre Channel network consists of cabling, GBICs, hubs,
     switches, HBAs, and routers.
     Fibre Channel shares much of the same terminology as Ethernet
     networking, but the functionality of similarly named equipment is not
     necessarily identical.


Cabling and GBICs
     Copper cabling is almost always terminated with either an HSSDC or
     DB-9 male connector.
     Multimode optical fiber is terminated using a variety of optical
     connectors, including SC, LC, and MT-RJ.
     Single-mode fiber is the most expensive media type, but preferable for
     long distances.
     Single-mode fiber, because of its small diameter (9 µm), has the highest
     transmission speed potential.
     Copper cabling is available in two types: active and passive. Active copper
     lines provide twice the distance of passive copper lines.
     The HSSDC connector was specifically designed as a Gigabit copper
     connector, improving density and performance over the DB-9 style
     connector.
     GBICs are removable transceivers used in all types of Fibre Channel
     devices, including switches, hubs, and HBAs.
     GBICs offer the option of interfacing with almost all types
     of connectors.
     A Media Interface Adapter (MIA) converts DB-9 copper connectors to
     optical SC connectors.


Using Hubs
     Hubs provide a very basic means of connecting different ports in a
     network.
     Hubs can connect up to 127 devices together in an FC-AL loop.
              Simple hubs contain no intelligence, just electrical connections.
              Managed hubs provide a level of error tolerance and
              management features.
              Managed hubs provide LIP isolation, automatic port bypass, signal
              retiming, and management interfaces.
              Fibre Channel LIPs can be a major source of problems in arbitrated loop
              configurations.
              To avoid the problems that loop architectures caused in earlier
              generations, most people are moving to switched fabric devices instead.


      Using Switches and Fibre Channel Fabrics
              Switches are classified into three categories: entry-level, scalable fabric,
              and core fabric switches.
              Entry-level switches are focused on small workgroups of 8 to 16 ports,
              usually are geared toward low cost, and deliver limited scalability and
              management. Fabric switches provide the capability to cascade switches
              together to create larger fabrics.
              A core fabric switch is designed for interconnecting multiple edge
              switches to form multihundred-port SANs.
              HBAs are used to connect servers to the network. They map SCSI
              commands in the operating system to Fibre Channel frames on the net-
              work. HBAs range from low-end, loop-only devices to high-end, fabric
              multipathing adapters.
              Major protocols supported by HBAs are SCSI-FCP for storage, IPFC for
              networking, and VIFC for clustering.
              HBAs support either 1 Gbit/sec or 2 Gbit/sec speeds; current-generation
              cards support 1 Gbit/sec, and emerging cards support both.
              HBAs can be found in single one-port configurations or dual-port
              adapters for higher density.
              LUN masking enables control of access to devices in the network from
              the HBA.
     Persistent binding is the mapping of a Fibre Channel device into an
     operating system at a specific device location.
     Dynamic discovery is the capability to dynamically add and remove
     drives from your system without reboot.
     HBA API support is an important feature that allows management of
     your HBA by SAN management software.
     Remote booting is the use of an HBA to boot an operating system
     image across the SAN and is used to dynamically change hosts and
     enable ease of disaster recovery.


Connecting Legacy Devices into Your SAN
     A Fibre Channel router, which is also known as a bridge, allows legacy
     parallel SCSI devices to attach to your Fibre Channel SAN.
     A Fibre Channel router plugs into Fibre Channel on one side and
     a SCSI bus on the other.
     Frames are translated from SCSI-FCP to parallel SCSI bus signals
     through routers.
     Routers provide many different features, including different numbers
     of SCSI buses and different support for parallel SCSI protocols
     and termination.
     Advanced features include selective LUN presentation, extended copy
     support, and various management interfaces.
     Selective LUN presentation is the capability of a router to mask the
     presence of devices to different hosts in the network and allow for better
     security and control over resources.
     Extended copy support (third-party copy) allows software to directly
     back up data on the SAN, saving CPU and network traffic.
     Available management interfaces include telnet, SNMP, Ethernet, and
     serial ports.


      Bridging and Routing to IP Networks and Beyond
              Fibre Channel-to-DWDM technology multiplexes Fibre Channel signals
              onto higher bandwidth fiber for transmission over MAN distances (up
              to 100 km).
              Use of DWDM is transparent to Fibre Channel switches, except for
              buffer settings.
              It is necessary to increase buffer credit settings to handle the long
              distances/delays involved in MANs.
              Fibre Channel can also be transported across IP networks running over
              links such as ATM and Gigabit Ethernet.
              FC_IP (not to be confused with IPFC) encapsulates Fibre Channel
              frames in the IP protocol and can be used for remote backup and
              extending SAN distances.
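
The buffer credit point above can be made concrete with a rough estimate. The following is a minimal sketch, not a Brocade setting or formula: it assumes a one-way fiber delay of about 5 microseconds per kilometer, a full-size frame of 2,148 bytes, and 8b/10b encoding at a line rate of 1.0625 Gbaud (2.125 Gbaud for 2 Gbit/sec). The function name and constants are illustrative assumptions.

```python
import math

# Assumed constants (not from this chapter): rough figures for estimation only.
FIBER_DELAY_US_PER_KM = 5.0      # one-way propagation delay in optical fiber
FULL_FRAME_BYTES = 2148          # maximum-size Fibre Channel frame with overhead
BITS_PER_BYTE_ON_WIRE = 10       # 8b/10b encoding sends 10 bits per data byte


def estimate_bb_credits(distance_km: float, line_rate_gbaud: float = 1.0625) -> int:
    """Estimate buffer credits needed to keep a long link full of frames."""
    # Time to serialize one full frame onto the link, in microseconds.
    frame_time_us = FULL_FRAME_BYTES * BITS_PER_BYTE_ON_WIRE / (line_rate_gbaud * 1000.0)
    # A credit is returned only after the frame arrives and the
    # acknowledgment travels back, so use the round-trip time.
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return max(1, math.ceil(round_trip_us / frame_time_us))
```

Under these assumptions, a 100 km link needs about 50 credits at 1 Gbit/sec, or roughly one credit for every 2 km, and about twice that at 2 Gbit/sec. Treat the result as a starting point and check it against your switch vendor's guidance.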

      Fibre Channel Storage
              Fibre Channel storage is important as the core of the data storage on
              your network.
              Fibre Channel storage ranges from simple JBOD devices to multi-
              terabyte storage arrays.
              A JBOD is a cabinet of independent disks, all connected into the Fibre
              Channel network in a loop.
              Hosts individually address disks in a JBOD.
              RAID arrays provide additional protection and performance to
              your storage.
              Different RAID levels are appropriate for different applications.
              High-end storage arrays add support for multiple terabytes of data.
              Other types of connections include parallel SCSI, ESCON, and FICON.
              High-end arrays also generally include a large amount of cache, which is
              used to speed up data access.
        Selective LUN presentation is the ability of high-end storage to control
        access by hosts to data and to ensure data integrity.
        LUN export across multiple ports is used for redundancy and high avail-
        ability, but requires dynamic multipathing software or drivers to work.
        Snapshot backup volumes are used to enable backup on live databases
        and data images.


Frequently Asked Questions
The following Frequently Asked Questions, answered by the authors of this book,
are designed to both measure your understanding of the concepts presented in
this chapter and to assist you with real-life implementation of these concepts. To
have your questions about this chapter answered by the author, browse to
www.syngress.com/solutions and click on the “Ask the Author” form.


Q: Should I buy a switch that supports GBICs or fixed media?
A: This depends on your application. GBICs offer much more flexibility in the
   configuration and media in your network, allowing you to mix copper and
   optical media and different copper and optical connectors. However, GBICs
   do add some cost to your installation and are another possible point of failure.
   Using fixed media generally lowers the cost of your switches and is more reli-
   able. However, a failure of a transceiver in fixed media means a swap of the
   entire switch and not just a GBIC.

Q: I notice that many Fibre Channel devices support a common SNMP MIB.
   How do I get this to work with my network management software?
A: You can get MIB files from the manufacturer of your hardware, usually from a
   distribution disk or through a download from their Web site. Using an MIB
   browser in your network management software, you can browse this data.
   However, a better way to use this is through Fibre Channel management soft-
   ware that understands how to interpret these MIBs and show (for example) a
   visual topology of the network from this data.

      Q: Should I be buying 1 Gbit/sec or 2 Gbit/sec components for my network?
      A: You should consider the bandwidth requirements of your application. 1 Gbit/sec
         networks are the most proven and mature technology. However, as 2 Gbit/sec
         equipment starts to become available, you can expect to see all parts of the
         network (storage, switches, and HBAs) supporting 2 Gbit/sec standards. In
         particular, the higher bandwidth will provide significant performance
         improvements when used on ISLs.

      Q: It looks like most Fibre Channel components provide some sort of masking
         ability to control access to either ports or LUNs. Which one of these
         methods should I use in my network?
      A: Even though different components all offer some sort of zoning, LUN
         masking, or LUN presentation techniques, you will probably need to use all
         of them in one way or another to control your network. For example, switch
         zoning is invaluable for isolating and controlling specific SAN segments, but
         you will have to use either HBA-based LUN masking or storage-based LUN
         presentation to partition your volumes more finely.
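
To see how these layers combine, consider the following toy model. It is a sketch only: the class, field names, and device names are hypothetical, and real zoning and LUN masking are configured on the switch, HBA, and storage array rather than in a single table. The point it illustrates is that a host reaches a LUN only when every layer allows it.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Toy model: a host reaches a LUN only if every layer permits it."""
    # Switch zoning: sets of port names allowed to talk to each other.
    zones: list = field(default_factory=list)
    # Storage-side LUN presentation: (storage port, LUN) -> allowed hosts.
    lun_presentation: dict = field(default_factory=dict)
    # HBA-side LUN masking: host -> set of (storage port, LUN) it will use.
    hba_masks: dict = field(default_factory=dict)

    def can_access(self, host: str, storage_port: str, lun: int) -> bool:
        zoned = any(host in z and storage_port in z for z in self.zones)
        presented = host in self.lun_presentation.get((storage_port, lun), set())
        # A host with no mask configured is treated as unmasked (sees everything).
        mask = self.hba_masks.get(host)
        unmasked = mask is None or (storage_port, lun) in mask
        return zoned and presented and unmasked
```

In this model, zoning decides which hosts can see a storage port at all, while LUN presentation and LUN masking decide which volumes behind that port each host can use.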
Chapter 4

Overview of Brocade SilkWorm Switches and Features



 Solutions in this chapter:

     •   Selecting the Right Switch
     •   Understanding the Brocade Fabric OS
     •   Using Optional Brocade Features
     •   Future Capabilities in the Brocade Intelligent Fabric Services Architecture
     •   Summary
     •   Solutions Fast Track
     •   Frequently Asked Questions




124   Chapter 4 • Overview of Brocade SilkWorm Switches and Features



      Introduction
      This chapter covers the Brocade switch models and describes their differences
      and similarities. It also describes how to select the most effective switch for spe-
      cific requirements or SAN applications, a critical part of the overall process of
      implementing and managing a SAN. A discussion of the Brocade Fabric OS helps
      to explain the core aspects of switch functionality, and how that functionality
      relates to your own installation. In addition, this chapter gives you an overview of
      the features of Fabric OS and how they control the behavior of Brocade
      switched-fabric SANs. Finally, a discussion about the future capabilities of SANs
      assists you with developing your SAN vision. The information in this chapter
      presents an overview of the Brocade product family to provide context for the
      other chapters in the book and to enable this book to be a self-contained refer-
      ence on Brocade SANs. For the latest and most detailed information about
      Brocade products, visit the Brocade Web site at www.brocade.com.

      Selecting the Right Switch
      You can choose among the wide variety of switches available today and deploy
      them in simple, standalone configurations, or network them with other
      switches to build larger or more resilient fabrics.
           Selecting the right switch for your application is extremely important, since
      the switch is the building block of the infrastructure for your SAN. The right
      SAN infrastructure can significantly improve information management and allow
      you to address some of your most challenging business requirements.
           As with other types of technology implementations, selecting a switch is a
      strategic decision. You must understand your current IT infrastructure and
      requirements, as well as how you might need the switch to fit into your strategic
      direction. As a result, you need to understand key factors such as scalability,
      compatibility, and interoperability with other hardware resources. After all, your
      SAN is hardly a static platform, and you should expect to expand it or integrate it
      with other components as new technologies emerge. This is how a SAN provides you
      with investment protection by enabling you to scale and adapt it to meet future
      requirements.
           At a very high level, you should be aware of a few key characteristics when
      selecting a switch, including hardware redundancy features, which reflect the level
      of reliability and availability you require; port count, or the overall size of your
      installation and the number of devices you need to connect; function, that is, the
      capabilities you need for certain types of applications or environments; and finally,
      cost, in planning the overall budget to achieve your SAN goals. You can use these
characteristics to differentiate the various members of the Brocade SilkWorm
family of switches:
     •   Hardware Redundancy: The low-cost SilkWorm 2000/2200 entry-level
         models have a single fixed power supply and cooling mechanism.
         The enterprise-class SilkWorm 2400/2800 products have dual
         hot-swappable power supplies and hot-swappable cooling fans. All except the
         lowest-end SilkWorm 2000 series of switches offer pluggable Gigabit
         Interface Converters (GBICs) to enable fast replacement of optical
         transceivers.
     •   Port Count: Brocade offers 8- and 16-port switches, a 64-port integrated
         fabric, and (in the future) a dual 64-port core fabric switch. You
         can use these switches alone or together to form fabrics consisting of
         hundreds of ports.
     •   Function: The Brocade Fabric OS software installed on all Brocade
         switches is inherently capable of supporting all key features. However,
         software license keys might be required to activate some of these
         features. Each switch comes bundled with certain software keys, and if you
         need features that are not bundled with the product, you will need to
         budget for the additional licenses.
     •   Cost: Brocade has a wide range of switch models, from low-cost
         entry-level departmental switches all the way up to robust, enterprise-class
         fabric switches, integrated fabrics, and core fabric switches.
    One of the first steps in selecting a switch is to decide how many servers and
storage devices you want to connect. There are some general guidelines you can
follow when deciding what type of switch best meets your needs. For networks
of fewer than eight devices with no growth plans, an 8-port switch will suffice. If
you plan to have more than eight devices at some point in time, you will prob-
ably want a 16-port switch. If you have more than 16 devices, you will want a
network built from a number of 16-port switches. If you have over 50 devices,
you will want to either build a network from 16-port switches or consider a
more advanced solution such as the SilkWorm 6400 Integrated Fabric, which
provides a preconfigured network in a convenient package.
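
The guidelines above can be written down as a simple rule of thumb. The function below is a sketch of those rules only; its name, parameters, and return strings are hypothetical, and it treats exactly eight devices as fitting an 8-port switch, which is our assumption. It does not account for availability, inter-switch links, or budget, which the rest of this chapter discusses.

```python
def suggest_switch(devices_now: int, devices_planned: int = 0) -> str:
    """Map device counts to the chapter's rule-of-thumb recommendation."""
    # Size for the larger of today's count and the planned count.
    devices = max(devices_now, devices_planned)
    if devices > 50:
        return "network of 16-port switches or SilkWorm 6400 Integrated Fabric"
    if devices > 16:
        return "network of 16-port switches"
    if devices > 8:
        return "16-port switch"
    return "8-port switch"
```

For example, six devices with no growth plans map to an 8-port switch, while six devices today with twelve planned map to a 16-port switch.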
    If you have business-critical applications that must be continuously available,
you will need to connect them in a resilient manner. You might even want to use
a dual-fabric configuration. We discuss this in more detail in Chapter 5, “The
SAN Design Process,” and Chapter 7, “Developing a SAN Architecture.” Since
      redundancy lies within the fabric, it might be possible to utilize the less expensive
      switches with a single fixed power supply. If one of the switches fails, multi-
      pathing software on the host and storage array will route traffic over the redun-
      dant fabric until the failed switch is replaced.
           For applications that can tolerate occasional downtime for periodic mainte-
      nance, a single fabric is less costly and can be built upon switches with redundant,
      hot-swappable power supplies and fans—so the fabric can withstand the loss of
      one of the components without downtime. We discuss the topics of redundancy
      and resilience in detail in Chapter 7.
           Most Brocade switches support hot-pluggable GBICs that provide both flexi-
      bility—you can choose optical or copper media—and high availability. If a single
      GBIC fails, you can replace it without disturbing the other ports on the switch.
      However, the lowest priced 8-port entry model supports one GBIC and seven
      fixed optical ports.
           The Brocade family of fabric switches includes a wide variety of cost options,
      ranging from the least expensive 8-port SilkWorm 2010 switch with fixed media
      to the highly available 16-port SilkWorm 2800 and the redundantly configured
      SilkWorm 6400 Integrated Fabric. The high end of the family is the SilkWorm
      12000 Core Fabric Switch, which will provide 128 ports in a single enclosure
      when it is released. All Brocade switches are fully interoperable with each other
      and can be mixed in fabrics to provide the optimal balance of cost-effectiveness,
      expansion capability, and high availability.

      Entry-Level Switches
      Brocade entry-level switches are designed for environments with smaller storage
      growth requirements where cost considerations outweigh the need for high
      availability. The 8-port SilkWorm 2010, 2040, and 2050 are the same physical
      1U-high switch with different feature sets enabled (Figure 4.1). The 16-port
      SilkWorm 2210, 2240, and 2250 are the same physical 1.5U-high switch with
      different features enabled (Figure 4.2).
      Figure 4.1 SilkWorm 2000 Series Switch

Figure 4.2 SilkWorm 2200 Series Switch




SilkWorm 2010 (8 Ports) and 2210 (16 Ports)
The SilkWorm 2x10 models are arbitrated loop-only switches and are a viable
alternative to a managed hub-based solution. The SilkWorm 2010 and 2210 deliver
superior performance compared to a hub by providing simultaneous, full-bandwidth
data transfers and enhanced availability through fault isolation and intelligent
Loop Initialization Primitive (LIP) management. The SilkWorm 2010
features seven fixed optical ports and one pluggable GBIC slot, while the 2210
offers 16 pluggable GBIC slots. The fixed media solution in the SilkWorm 2010
reduces cost by eliminating the need to purchase GBICs for the seven fixed ports.
One drawback of this configuration is that, if one fixed optical component fails,
you need to replace the entire switch rather than a single component.
    The SilkWorm 2x10 models are bundled with Brocade Zoning, Brocade WEB TOOLS,
QuickLoop, and the Simple Name Server (single switch) enabled. The 2x10 can
be upgraded to a full-fabric switch with a license key. This capability is key to
investment protection: you might need only a loop switch now but decide in the
future that you need fabric capabilities as well. With the 2x10, you do not need to
perform a forklift upgrade of your SAN in order to add that functionality.

SilkWorm 2040 (8 Ports) and 2240 (16 Ports)
The SilkWorm 2040 and 2240 are the same physical switches as the SilkWorm
2010 and 2210, respectively, with additional support for entry-level fabrics. You
can use these switches in dual-switch configurations to support fabrics or loop
environments of up to 30 ports. Alternatively, you can have one E_Port connection
per switch to connect the switch to a larger Brocade fabric.
    The SilkWorm 2040 is bundled with Brocade Zoning, Brocade WEB TOOLS,
and the Distributed Name Server enabled. It can be upgraded to a full-fabric
switch (SilkWorm 2x50) with a license key. However, you must then also pur-
chase Brocade QuickLoop to connect Fibre-Channel Arbitrated Loop
(FC-AL) hosts.

      SilkWorm 2050 (8 Ports) and 2250 (16 Ports)
      The SilkWorm 2050 and 2250 are the same physical switches as the SilkWorm
      2010 and 2210, respectively, but they feature a full-fabric configuration that
      enables multiple E_Port connections and networked fabrics of multiple switches
      (equivalent to what is provided by the SilkWorm 2400 and 2800 switches). This
      design enables redundant paths between switches.
          In a networked configuration, any device can connect to any other device,
      with I/O taking the shortest available path to the target device. The SilkWorm
      2x50 is bundled with Brocade Zoning, Brocade WEB TOOLS, and the
      Distributed Name Server enabled. However, optional Brocade QuickLoop soft-
      ware is necessary to connect private FC-AL hosts.
          While the SilkWorm 2400 and 2800 switches offer more high-availability
      features (such as redundant, hot-swappable power and cooling), the highest avail-
      ability environment requires that dual fabrics be used with connections from each
      server and storage to both fabrics. Operator errors, natural disasters, and other
      catastrophic events can cause an entire fabric to become inoperable. In that
      respect, a single fabric is a single point of failure. Path failover software—such as
      EMC PowerPath, VERITAS Dynamic Multipathing software, Compaq
      SecurePath, or other solutions—allows traffic to flow over either fabric. In these
      environments, the SilkWorm 2050 and 2250 switches can provide cost-effective
      solutions when availability of an individual switch is not the highest requirement.
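
The idea behind path failover in a dual-fabric design can be sketched in a few lines. This is an illustration only and does not describe how any of the products named above are implemented; the function name, the fabric names, and the up/down map are hypothetical.

```python
from typing import Dict, Optional


def pick_fabric(fabric_up: Dict[str, bool], preferred: str) -> Optional[str]:
    """Return the fabric to send I/O over, or None if no fabric is usable."""
    # Stay on the preferred fabric while it is healthy.
    if fabric_up.get(preferred, False):
        return preferred
    # Otherwise fail over to any other fabric that is still up.
    for name, up in fabric_up.items():
        if up and name != preferred:
            return name
    return None
```

With two independent fabrics, the loss of one entire fabric still leaves a usable path, which is why a low-cost switch in each fabric can be an acceptable trade-off.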


      NOTE
           By enabling data routing, rerouting, self-healing, and high scalability,
           Brocade full-fabric products enable switch-to-switch networking to pro-
           duce a resilient multiswitch fabric. If you think you might eventually
           upgrade to a full fabric, this feature set is highly useful.




      Scalable Fabric Switches
      The Brocade SilkWorm family of switches is designed to work with other
      popular storage hardware and servers to enable a best-of-breed open systems
      environment. By integrating with heterogeneous IT infrastructures, Brocade
      switches help leverage existing storage investments while providing a strategic
      path to manage continued data growth.
    The SilkWorm 2400 and 2800 are the most widely deployed members of the
Brocade SilkWorm family. Both provide a fault-tolerant solution with dual power
supplies and redundant cooling fans. To further increase availability, both the
power supplies and cooling fans can be replaced without taking the switch
offline. These switches are bundled with Brocade Fabric OS, including
Distributed Name Server, FSPF routing, automatic discovery, and advanced diag-
nostic and management functions. Most Brocade switches are sold bundled with
Brocade Zoning and Brocade WEB TOOLS optional software packages.

SilkWorm 2400
The SilkWorm 2400 is an 8-port full-fabric switch (Figure 4.3). It is 1U high
and has two slim power supplies that can be removed and replaced while the
switch is online. These are the same power supplies used in the SilkWorm 2800,
so they are interchangeable between the two models. The
SilkWorm 2400 supports pluggable optical or copper GBICs on all ports and
offers an Ethernet connection and a serial connection for management. All of the
ports are capable of supporting and automatically detecting fabric connection
(F_Port), loop device connection (FL_Port), or connections to other fabric
switches (E_Port).
Figure 4.3 SilkWorm 2400 Fabric Switch




SilkWorm 2800
The SilkWorm 2800 is a 16-port full-fabric switch (Figure 4.4). It is 2U high and
has dual power supplies (the same ones used in the 2400) that can
be removed and replaced while the switch is online. The SilkWorm 2800 supports
pluggable optical or copper GBICs on all ports and offers an Ethernet connection,
a built-in two-line LCD display, and a four-key keypad for management.
This management capability enables you to configure a switch without additional
equipment (such as an ASCII terminal). All ports are capable of automatically
      detecting and supporting either fabric connection (F_Port), loop device connec-
      tion (FL_Port), or connections to other fabric switches (E_Port). All ports can be
      used in any mode.

      Figure 4.4 SilkWorm 2800 Fabric Switch




      SilkWorm 6400 Integrated Fabric
      The SilkWorm 6400 is a 64-port integrated fabric designed for data center envi-
      ronments (Figure 4.5). It harnesses the strong networking capability of Brocade
      switches to create an integrated 64-port solution that is typically half the cost of
      director-class products. In addition, it enables interconnection of a large number
      of hosts and storage devices for enterprise-wide distributed applications. The
      SilkWorm 6400 is simple to install, cable, and configure. The high-port-count
      solution features a modular design of six SilkWorm 2250 switches in a custom
      housing. The six switches are preconfigured and preconnected to form a highly
      available fabric with 64 user ports and no single point of failure. Brocade Fabric
      Manager software enables consolidated management of the switch modules.
           The SilkWorm 6400 is interoperable with other SilkWorm switches, and it
      supports private loop Host Bus Adapter (HBA) environments through Brocade
      QuickLoop. The switch is bundled with Fabric OS, Brocade Zoning, Brocade
      WEB TOOLS, and Brocade Fabric Watch.
           The internal topology of the SilkWorm 6400 is designed to provide a
      cost-effective high-availability solution. Two of the switch modules operate as
      the core of the fabric, and each uses eight ports to provide dual connections to
      each of the four edge switch modules. The remaining eight ports on each core
      module are used for device connections. Each edge switch module provides two
      connections to each of the core switch modules, leaving 12 ports for device
      connections. The use of dual core switch modules and dual connections from each
      edge switch ensures that no single failure can bring down the entire integrated
      fabric.
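
The port arithmetic behind the 64 user ports can be checked with a short calculation. The sketch below uses only the figures given in this section (six 16-port modules, two acting as core, four as edge, with dual links between every core and edge module); the constant and function names are our own.

```python
PORTS_PER_MODULE = 16          # each SilkWorm 2250 module has 16 ports
CORE_MODULES = 2
EDGE_MODULES = 4
LINKS_PER_CORE_EDGE_PAIR = 2   # dual connections between every core and edge


def user_ports() -> dict:
    """Count the ports left for devices after inter-switch links are cabled."""
    core_isl = EDGE_MODULES * LINKS_PER_CORE_EDGE_PAIR   # ISL ports per core module
    edge_isl = CORE_MODULES * LINKS_PER_CORE_EDGE_PAIR   # ISL ports per edge module
    core_free = PORTS_PER_MODULE - core_isl
    edge_free = PORTS_PER_MODULE - edge_isl
    return {
        "per_core": core_free,
        "per_edge": edge_free,
        "total": CORE_MODULES * core_free + EDGE_MODULES * edge_free,
    }
```

Each core module keeps 8 ports and each edge module keeps 12, so the total is 2 x 8 + 4 x 12 = 64 user ports out of 96 physical ports.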

Figure 4.5 SilkWorm 6400 Integrated Fabric




Fabric Manager
By providing a centralized view of the SilkWorm 6400 switch modules, Brocade
Fabric Manager simplifies SAN administration and maintenance of the SilkWorm
6400 Integrated Fabric. A portable, Java-based management application that is
easy to install on a Windows management station, Fabric Manager makes it easy
to view the status of all switch modules, drill down to individual switch modules,
and access Brocade WEB TOOLS.

SilkWorm 12000 Core Fabric Switch
To further support enterprise-level SAN deployment, Brocade has developed the
SilkWorm 12000 Core Fabric Switch, which will provide up to 128 ports of
connectivity in a single enclosure (Figure 4.6). The switch will be the first model
based on a third-generation ASIC that enables auto-sensed link speeds of 1 and 2
Gbit/sec. With its high-performance and high-reliability characteristics, the
SilkWorm 12000 will provide the same capabilities as director-class switches, but
with improved intelligence, scalability, and interoperability. With superior built-in
intelligence, the SilkWorm 12000 will enable a centrally managed core-to-edge
network model based on proven core backbone networking methodology.
    The SilkWorm 12000 will also feature a protocol-independent backplane
that supports 2 Gbit/sec Fibre Channel blades on release, and in the future,
10 Gbit/sec Fibre Channel blades. The protocol-independent design will also
support emerging storage protocols such as Small Computer Systems Interface
over IP (iSCSI), Fibre Channel over IP (FC_IP), and InfiniBand. In addition to
      Fibre Channel, InfiniBand, and IP, the SilkWorm 12000 will support an optional
      Application Platform blade that enables the deployment of high-performance
      fabric services such as storage virtualization and third-party copy. With the
      Application Platform integrated into the switch, higher data rates will be possible
      and management between switches and applications will be much easier.

      Figure 4.6 SilkWorm 12000 Core Fabric Switch




      NOTE
           At the time of printing, the SilkWorm 12000 represents an unreleased
           product. This section is, therefore, a visionary statement of future SAN
           design capabilities.




      Understanding the Brocade Fabric OS
      The Brocade Fabric OS enables you to easily configure, manage, and maintain a
      SAN for your specific needs. The de facto industry standard, Fabric OS simplifies
      management for both FC-AL and switched-fabric SANs. Fabric OS allows you
      to discover the network of connected storage and host devices and automatically
      determines the available data paths through the switches and fabric. In addition,
      Fabric OS enables you to customize your fabric via telnet commands or a Web-
      based Graphical User Interface (GUI). Figure 4.7 shows the high-level functions
      provided by Fabric OS.

Figure 4.7 Fabric OS Functions

     [Layered diagram, labeled "Fabric OS", top to bottom]

     Optional Fabric Software:  Brocade Zoning, Brocade SES, Brocade WEB TOOLS,
                                Brocade QuickLoop, Extended Fabrics, Remote Switch

     Base Services
         Fabric Services:       Distributed Simple Name Server
                                Alias Server (Multicast)
                                Routing Services, FSPF
                                Universal Port
         Management Services:   Management Server
                                SNMP Agent
                                Telnet / Serial (Front Panel)
                                Web Server
                                Fabric Watch

     Fibre Channel Platform:    SilkWorm Switch Family




Fabric OS Core Functions
Fabric OS provides core functions such as:
     •   Automatic discovery of devices: Fabric devices log in to the Simple
         Name Server (SNS). Translative mode is automatically set to allow fabric
         initiators to communicate with private loop targets.
     •   Universal port support: Fabric OS identifies port types and automatically
         initializes each connection specific to the attached Fibre Channel
         system, whether it is another switch, host, private loop, or fabric-aware
         target system.
     •   Continuous monitoring of ports for exception conditions: Fabric
         OS disables data transfer to ports when they fail. Ports are automatically
         enabled after the exception condition is corrected.


Fibre Channel Services for Reconfiguration
Fabric OS provides a standard set of Fibre Channel services that provide fault tol-
erance and automatic reconfiguration when a new switch is introduced to the
fabric. These services include:
     •   Management Server: Supports in-band discovery of fabric elements
         and topology.
     •   Simple Name Server (SNS): Incorporates the latest Fibre Channel
         standards and registers information about SAN hosts and storage devices.
         It also provides a Registered State Change Notification (RSCN) when a
         device state changes or a new device is introduced.
     •   Alias Server: Supports the multicast service that broadcasts data to all
         members of a group.
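
The register-then-notify behavior described for the Simple Name Server can be pictured with a small model. The sketch below is illustrative only: the class `NameServer`, its method names, and the WWN strings are hypothetical and are not part of Fabric OS or any Brocade interface. It simply shows devices registering and subscribers being told when a device appears or disappears, which is the role the text assigns to RSCN.

```python
class NameServer:
    """Toy model of a name server that notifies listeners of state changes."""

    def __init__(self):
        self.devices = {}      # WWN -> device type ("host" or "storage")
        self.listeners = []    # callbacks that receive (event, wwn)

    def subscribe(self, callback):
        """Register a callback to be told about device state changes."""
        self.listeners.append(callback)

    def _notify(self, event, wwn):
        for callback in self.listeners:
            callback(event, wwn)

    def register(self, wwn, device_type):
        """Record a device and announce that it is online."""
        self.devices[wwn] = device_type
        self._notify("online", wwn)

    def deregister(self, wwn):
        """Forget a device and announce that it is offline."""
        if self.devices.pop(wwn, None) is not None:
            self._notify("offline", wwn)


# Hypothetical WWNs, used only to exercise the model.
events = []
sns = NameServer()
sns.subscribe(lambda event, wwn: events.append((event, wwn)))
sns.register("10:00:00:00:c9:00:00:01", "host")
sns.register("50:06:04:82:00:00:00:02", "storage")
sns.deregister("10:00:00:00:c9:00:00:01")
print(events)
```

In a real fabric the notification travels to the attached devices themselves; the callback list here stands in for that delivery.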


      Dynamic Routing Services
      Fabric OS provides dynamic routing services for high availability and maximum
      performance. These routing services include:
           •   Dynamic path selection via link-state protocols: Uses Fabric
               Shortest Path First (FSPF) to select the most efficient route for
               transferring data in a multiswitch environment.
           •   Load sharing to maximize throughput through Inter-Switch
               Links (ISLs): Supports high throughput by using multiple ISLs between
               switches.
           •   Automatic path failover: Automatically reconfigures alternate paths
               when a link fails. Fabric OS distributes the new configuration
               fabric-wide and reroutes traffic without manual intervention.
           •   In-order frame delivery: Guarantees that frames arrive in order.
           •   Automatic rerouting of frames when a fault occurs: Reroutes
               traffic to alternative paths in the fabric without interruption of service
               or loss of data.
           •   Routing support for link costs: Enables network managers to manually
               configure the link costs of individual ISLs to create custom FSPF
               behavior that supports unique network management objectives.
           •   Support for high-priority protocol frames (useful for clustering
               applications): Ensures that frames identified as priority frames receive
               priority routing to minimize latency.
           •   Static routing support: Allows network managers to configure fixed
               routes for some data traffic and ensure resiliency during a link failure.
           •   Automatic reconfiguration: Automatically reroutes data traffic onto
               new ISLs when they are added to the SAN fabric.
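
The ideas of cost-based path selection and automatic failover listed above can be sketched in a few lines. This is not the FSPF implementation: it uses Dijkstra's algorithm as a stand-in for a link-state shortest-path calculation, and the switch names and link costs are made up for illustration.

```python
import heapq


def shortest_path(links, source, destination):
    """Return (total_cost, path) for the lowest-cost route, or None if unreachable."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


# Hypothetical four-switch fabric: each key is an ISL, each value its link cost.
links = {
    ("sw1", "sw2"): 1000,
    ("sw2", "sw4"): 1000,
    ("sw1", "sw3"): 1000,
    ("sw3", "sw4"): 2000,  # manually raised cost makes this route less preferred
}

best = shortest_path(links, "sw1", "sw4")

# Simulate a failed ISL: drop it and recompute the route.
surviving = {isl: cost for isl, cost in links.items() if isl != ("sw2", "sw4")}
failover = shortest_path(surviving, "sw1", "sw4")

print(best)
print(failover)
```

Raising the cost of one ISL steers traffic away from it while keeping it available as an alternate path, which is the effect the "routing support for link costs" item describes.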


Facilities for End-to-End SAN Management
Fabric OS includes an extensive set of facilities for end-to-end SAN manage-
ment, including:
    •   Management Server based on FC-GS-3: Permits in-band access to
        fabric discovery.
    •   An SNMP agent and a series of comprehensive Management
        Information Bases (MIBs):
        -   Assists with monitoring and configuring the switches.
        -   Provides an extensive set of trap conditions.
        -   Immediately alerts administrators about critical exception conditions.
    •   In-band (through IP or over a Fibre Channel link) or external
        Ethernet interface: Gathers SNMP information and provides access to
        all the switches in the fabric through a single fabric connection.
    •   Syslog daemon interface: Directs exception messages to up to six
        recipients for comprehensive integration into a host-based management
        infrastructure.
    •   Switch beaconing: Identifies an individual switch among a group of
        remotely managed fabric elements.


Brocade Command Line Interface
The command line interface can be an excellent tool for managing your switch.
You can log in to the command line interface in two ways: you can
telnet into the switch or, on some models, connect an ASCII console to
the DB9 serial port and log in directly.

Using Optional Brocade Features
Brocade offers a wide variety of optional features designed to simplify the
deployment, management, and administration of SAN fabric environments. These
optional features are designed to help you fully leverage your SAN resources to
ensure a fast Return On Investment (ROI).




      NOTE
           SilkWorm switch features are enabled through a licensing system. When
           you purchase additional feature products, you simply enter a license key
           into the switch to enable the feature.




      Brocade Zoning
      Brocade Zoning, now bundled with almost every switch, provides advanced SAN
      management capabilities. It enables the separation of a fabric into smaller, isolated
      subfabrics to address closed user group requirements. As a result, zoning is an
      excellent way to enhance security within a fabric.
          With zoning, fabric-connected devices are arranged into logical groups of
      devices over a single physical configuration, enabling you to segregate certain
      devices from other devices. This can be helpful when you have devices that do
      not interoperate well, or when you want to separate a development environment
      from a production environment without purchasing additional switches. Devices
      in a zone only “see” other devices in that zone and can access only those mem-
      bers. Any device not included in a given zone is not available to the devices in
      that zone.
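
The visibility rule described above (a device can reach another only if they share a zone) can be expressed as a simple membership test. The following is a minimal sketch under that reading; the zone names, device names, and both function names are hypothetical and do not correspond to Fabric OS commands.

```python
def can_access(zones, device_a, device_b):
    """Return True if the two devices share at least one zone."""
    return any(
        device_a in members and device_b in members
        for members in zones.values()
    )


def visible_devices(zones, device):
    """Return every other device that shares a zone with `device`."""
    visible = set()
    for members in zones.values():
        if device in members:
            visible |= members
    visible.discard(device)
    return visible


# Hypothetical configuration separating development from production.
zones = {
    "production_zone": {"prod_host", "prod_array"},
    "development_zone": {"dev_host", "dev_array"},
}

print(can_access(zones, "prod_host", "prod_array"))  # same zone
print(can_access(zones, "dev_host", "prod_array"))   # different zones
print(sorted(visible_devices(zones, "dev_host")))
```

A device that belongs to more than one zone would see the union of those zones' members, which is why `visible_devices` accumulates across all matching zones.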
          Brocade Zoning is available in both software and hardware formats, and you
      can intermix both formats within a fabric. In general, software-enforced zoning
      provides more flexibility, while hardware-enforced zoning provides the highest
      level of security.
          Brocade Zoning involves zone specification, enforcement, and management.
      You can use a set of telnet commands (either in-band or out-of-band) to create,
      delete, and display zones; to add or remove zone members; and to configure a set
      of zones. Further information about Brocade Zoning appears in Chapter 6, “SAN
      Applications and Configurations,” and Chapter 9, “SAN Implementation,
      Maintenance, and Management.”

      Extended Fabrics
      Because you might need to connect multiple data centers over a long distance,
      Brocade offers Extended Fabrics support. Extended Fabrics reconfigures the
      switch to support the rigors of transmitting I/O over long distances in conjunc-
      tion with technologies such as Dense Wave Division Multiplexing (DWDM).
This feature extends all of the scalability, reliability, and performance benefits of
Fibre Channel SANs beyond the native 10 km distance specified by the Fibre
Channel standard. Moreover, it enables the use of full-performance applications
over extended distances, including disaster recovery, remote backup, extended
storage consolidation, remote mirroring, and tape consolidation.
     Extended Fabrics can be especially useful if you want to connect remote
locations—such as a disaster recovery facility—with the high performance and
reliability associated with a Fibre Channel SAN. Using Extended Fabrics, you
can leverage existing high-speed public and private networks to connect your
Fibre Channel SANs over Metropolitan Area Networks (MANs) and Wide Area
Networks (WANs).




   Brocade Fabric Access Layer API
   As storage networking proliferates, storage applications increasingly
   need to access and control fabric resources directly, and this need has
   become a critical requirement. Fabric Access, the Brocade Fabric OS
   Access Layer API, provides a flexible way for applications to access a
   variety of SAN information. Through Fabric Access, applications can con-
   trol the fabric for functions such as discovery, access (zoning) manage-
   ment, performance, and switch control.
         Fabric Access consists of a host-based library that interfaces the
   application to switches in the fabric through an out-of-band TCP/IP con-
   nection or an in-band IP-capable HBA. You can develop your own SAN
   management applications from the API or take advantage of applica-
   tions from third-party software developers, such as EMC and VERITAS.
   Key benefits of Fabric Access include:
          •   Single point of access to the fabric
          •   Secure access control
          •   Multifabric access
          •   Transaction-based management
          •   Object-oriented XML interface
          •   Multiplatform support
          •   Conformity to industry standards


      Fabric Watch
      An optional SAN monitor for Brocade SilkWorm switches running Fabric OS
      2.2 or higher, Fabric Watch enables each switch to constantly watch fabric ele-
      ments for potential faults—and automatically alert you about problems before
      they become costly failures. Fabric Watch tracks a variety of SAN fabric ele-
      ments, events, and counters—including fabric-wide events, ports, GBICs, and
      environmental parameters. Using Fabric Watch, you can quickly identify and iso-
      late faults while optimizing fabric-wide performance. In addition, Fabric Watch
      integrates easily with standard enterprise systems management tools—sending
      traps as opposed to requiring polls to report exceptions. Compared to other types
      of management protocols—such as SNMP—Fabric Watch provides a more robust
      solution that enables proactive management of your SAN environment.

      Understanding Loop Support,
      QuickLoop, and Fabric Assist
      Brocade switches can replace intelligent hubs by creating virtual loops. This is
      done by logically connecting ports on one or two SilkWorm switches, with one
      or more private loop devices connected to each of the ports. Each switch port
      and the devices attached to it form a “looplet” that can independently transfer
      data at 100 MB/sec. Unlike hub-based environments, bandwidth is not shared
      across looplets, so full-bandwidth data transfers can occur simultaneously on all
      switch ports. Multiple hosts can simultaneously transfer data to different looplets
      in parallel. Also unlike hubs, Brocade switches provide superior fault isolation
      capabilities, preventing errors on a single device from disrupting the entire SAN.
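
The bandwidth difference between a shared hub loop and independent looplets comes down to simple arithmetic. The sketch below uses the 100 MB/sec looplet figure from the text; that a hub loop also runs at 100 MB/sec in total is an assumption made for comparison, and both function names are hypothetical.

```python
LOOP_BANDWIDTH_MB_S = 100  # per-looplet rate quoted in the text


def hub_share_per_transfer(active_transfers):
    """Bandwidth each transfer gets when all devices share one hub loop.

    Assumes the hub loop carries 100 MB/sec in total.
    """
    return LOOP_BANDWIDTH_MB_S / active_transfers


def switch_aggregate(active_looplets):
    """Total bandwidth when every looplet transfers at full rate at once."""
    return LOOP_BANDWIDTH_MB_S * active_looplets


print(hub_share_per_transfer(4))  # four transfers contending for one loop
print(switch_aggregate(4))        # four looplets transferring in parallel
```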
          The Brocade QuickLoop software product provides a cost-effective solution
      for using private loop hosts with your switched-fabric-based SAN. QuickLoop is
      an alternative to various hub-based solutions, and since you are connecting these
      devices to a switch as opposed to a hub, private loop environments should exhibit
      significant performance and reliability improvements. As a result, QuickLoop can
      help protect your existing technology investment in hub solutions while inte-
      grating them into a higher performance environment. If you want to connect
      private loop HBAs to your Brocade Fabric, you will need QuickLoop or
      QuickLoop with Fabric Assist.
          QuickLoop operates as a translator between the legacy loop initiators and the
      fabric target devices. When setting up QuickLoop, you specify which ports are
      connected to loop initiators and which fabric targets will be available to them.



NOTE
     Only private loop initiators (HBAs) require the QuickLoop/Fabric Assist
     products. Public loop devices, and even private loop targets (storage) use
     the Fabric OS built-in translative mode feature instead.



    QuickLoop offers two modes of private loop host attachment: Hub
Emulation and Fabric Assist. With Hub Emulation, QuickLoop actually builds
private loops across a set of switch ports on one or two switches. A combination
of QuickLoop functions and Brocade Zoning, Fabric Assist creates virtual loops
capable of spanning the entire fabric while allowing private loop hosts to func-
tion as if they were attached to a physical loop switch.
    Brocade Fabric OS assigns a “phantom” fabric address to each private loop
target device, enabling each device to be registered transparently in the fabric. To
migrate a private loop host to the fabric, a QuickLoop port is reconfigured to
operate in Fabric Loop Attach (FLA) mode. Next, a single private loop host is
attached to the port (which cannot connect to any other devices). A Fabric Assist
zone is then configured to include the private loop host and all attached storage
with which it needs to communicate. The storage can be private loop, public
loop, or fabric aware. The private loop host appears to reside on a dedicated
private loop with all of the storage in the Fabric Assist zone.

Brocade WEB TOOLS
To simplify SAN fabric management, Brocade WEB TOOLS is a software utility
that enables you to manage and monitor your fabric through a Web browser
interface and Java plug-in. Using WEB TOOLS, you can view all switches in the
SAN from a single interface from any workstation in your enterprise—even at a
remote location. You can perform administration and configuration tasks for the
entire SAN fabric, fabric switches, and individual ports. The utility presents a
graphical representation for each switch licensed for WEB TOOLS, and you can
manage and manipulate each switch through the GUI.
    In addition to showing a graphical representation of each switch, the
WEB TOOLS screen indicates the status of the switch. When a switch has a
warning, you can click on that switch to obtain a detailed view to see power
supply status, GBIC/link status, and activity indicators. This approach enables you
to glance at a display and know immediately if there is a problem with the fabric
so you can take corrective action.

      Future Capabilities in the Brocade
      Intelligent Fabric Services Architecture
      To provide a powerful yet flexible framework for addressing critical SAN require-
      ments, Brocade has developed the Intelligent Fabric Services Architecture, which
      will provide both the basic switching functions and the advanced services that
      improve manageability, availability, security, and scalability (Figure 4.8). This
      architecture, which will help transform the network into an intelligent SAN fabric,
      consists of the following building blocks:
           •   The SilkWorm family of fabric switches
           •   Advanced Fabric Services
           •   Open Fabric Management tools
           •   Enterprise-class security products


      NOTE
           At the time of printing, components of the Brocade Intelligent Fabric
           Services Architecture represent unreleased products or product functions.
           This section is, therefore, a visionary statement of future SAN design
           capabilities.



          Beginning with the SilkWorm 12000, the Intelligent Fabric Services
      Architecture will enable a wide range of advanced switch fabric services,
      including Brocade ISL Trunking, Brocade Frame Filtering, more robust hardware-
      enforced zoning, more comprehensive performance monitoring, and enhanced
      security with Brocade Secure Fabric OS.

      Brocade ISL Trunking
      Brocade ISL Trunking enables as many as four Fibre Channel links between
      switches to be combined to form a single logical ISL with an aggregate speed of
      up to 8 Gbit/sec. These high-speed trunks simplify network design, optimize
      bandwidth utilization, and ensure that server-to-storage performance remains
      balanced under heavy network loads (Figure 4.9).


Figure 4.8 The Brocade Intelligent Fabric Services Architecture

[Figure: layered architecture diagram, from top to bottom.
 Applications: SAN Management and Administration, Storage Resource Management,
   Storage Administration, Storage Virtualization, LAN-Free Backup.
 Services: Advanced Fabric Services, Enterprise-Class Security, Open Fabric
   Management, all built on Fabric OS.
 Intelligent Switching Platform: SilkWorm 2000, 2200, 2400, 2800, 6400, 12000.
 Multiprotocol Support: Fibre Channel, IP, InfiniBand.]




Figure 4.9 Brocade ISL Trunking Relieves Congestion and Enables High-Speed
Data Traffic

[Figure: two panels showing traffic flows between a pair of switches.
 Panels: “Without Trunking” (labeled “Congestion”) and “With Trunking.”
 Labels: “Optimal bandwidth utilization using load balancing on up to four
   2 Gbit/sec links” and “ASIC Preserves In-Order Delivery.”
 Individual flows are labeled 0.5 G, 1 G, 1.5 G, and 2 G.]



   ISL Trunking is an optional software product available with all Brocade 2
Gbit/sec Fibre Channel fabric switches. This new technology is ideal for optimizing
the performance and simplifying the management of a multiswitch SAN fabric
containing Brocade 2 Gbit/sec switches. When two, three, or four adjacent ISLs are
used to connect two switches, the switches automatically group the ISLs into a
single logical ISL, or a “trunk.” The throughput of the resulting trunk is 4, 6, or 8
Gbit/sec, respectively.
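
The trunk throughput figures quoted above follow directly from multiplying the number of ISLs by the 2 Gbit/sec link speed. The sketch below reproduces that arithmetic and adds a simple capacity check; the function names and the sample flow rates are hypothetical, and the check ignores protocol overhead.

```python
LINK_SPEED_GBIT = 2      # speed of each ISL, from the text
MAX_LINKS_PER_TRUNK = 4  # largest trunk described in the text


def trunk_throughput(isl_count):
    """Aggregate speed in Gbit/sec of a trunk built from `isl_count` ISLs."""
    if not 1 <= isl_count <= MAX_LINKS_PER_TRUNK:
        raise ValueError("a trunk combines between 1 and 4 ISLs")
    return isl_count * LINK_SPEED_GBIT


def fits_in_trunk(flows_gbit, isl_count):
    """True if the combined flows fit within the trunk's aggregate speed."""
    return sum(flows_gbit) <= trunk_throughput(isl_count)


print([trunk_throughput(n) for n in (2, 3, 4)])

flows = [2, 1.5, 0.5, 1, 2]  # hypothetical flow rates in Gbit/sec (7 in total)
print(fits_in_trunk(flows, 4))
print(fits_in_trunk(flows, 3))
```

Because the trunk is treated as one logical link, the check compares the sum of the flows against the aggregate speed rather than placing each flow on a particular ISL.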

      Brocade Frame Filtering
      Another advanced feature incorporated into Brocade 2 Gbit/sec switch hardware
      is the ability to filter Fibre Channel frames to increment counters or to perform
      other actions such as blocking the frame itself. Overall, Frame Filtering enables a
      variety of new capabilities for monitoring and managing SAN fabrics, including
      the ability to:
           •   Increase zoning capabilities and security
           •   Facilitate the deployment of new SAN management applications that
               improve visibility into, and control over, the fabric
           •   Enable new fabric services for clustering, virtualization, and shared
               file systems
          Wire-speed filtering of each frame based on the content of several fields in
      both the header and the payload enables fabric zoning based on Logical Unit
      Number (LUN), network protocol, or I/O request type. This approach enables
      fabric-wide heterogeneous LUN masking managed from a central point.
          Brocade Frame Filtering combines the security of hardware zoning with the
      cabling flexibility of software zoning. When an administrator moves a cable from
      one port to another, the Frame Filtering capabilities can monitor the unique
      address of the device, change the zone, and block inappropriate traffic from
      reaching it. This critical zoning improvement helps ensure security
      while minimizing the time and effort required to manage the SAN fabric and its
      associated zones.
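
To make the idea of filtering on frame fields concrete, the sketch below models a filter that inspects the source, destination, and LUN of each frame, counts what it sees, and blocks anything not explicitly permitted. It is a software illustration of the concept only: the `Frame` and `FrameFilter` names, the field layout, and the device names are hypothetical, and the real feature operates in switch hardware.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Simplified frame carrying only the fields this sketch inspects."""
    source: str
    destination: str
    lun: int


class FrameFilter:
    """Forward frames that match an allowed (source, destination, LUN) rule."""

    def __init__(self, allowed):
        self.allowed = set(allowed)
        self.counters = Counter()

    def process(self, frame):
        """Return True if the frame is forwarded, False if it is blocked."""
        key = (frame.source, frame.destination, frame.lun)
        if key in self.allowed:
            self.counters["forwarded"] += 1
            return True
        self.counters["blocked"] += 1
        return False


# Hypothetical rule: host_a may reach LUN 0 on array_1 and nothing else.
frame_filter = FrameFilter({("host_a", "array_1", 0)})
results = [
    frame_filter.process(Frame("host_a", "array_1", 0)),
    frame_filter.process(Frame("host_a", "array_1", 5)),
    frame_filter.process(Frame("host_b", "array_1", 0)),
]
print(results, dict(frame_filter.counters))
```

Keying the rule on the LUN as well as the two endpoints is what allows the LUN-level masking mentioned in the text, as opposed to zoning on endpoints alone.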

      More Robust Hardware-Enforced Zoning
      Hardware-enforced zoning by World-Wide Name (WWN), port ID, or Arbitrated
      Loop Physical Address (AL_PA) simplifies administration while providing the
      highest level of secure control over data access. This capability provides
      administrators with much more flexibility in how they partition storage and servers
      to secure the overall fabric.


Enhanced End-to-End Performance Analysis
Enhanced end-to-end performance analysis enables more effective tracking of
resource utilization on a fabric-wide basis. Administrators can capture I/O perfor-
mance levels associated with specific initiator and target device IDs anywhere in
the fabric, independent of fabric topology. In addition to reducing management
cost through more proactive capacity planning, this capability enables reporting at
a level required to demonstrate adherence to service level agreements.

Secure Fabric OS
Within the Intelligent Fabric Services Architecture, Brocade provides Secure
Fabric OS, the most comprehensive SAN security architecture available. Based on
state-of-the-art networking security technologies, Secure Fabric OS addresses
vulnerabilities in the SAN fabric and supports authentication methods at the
following access points:
     •   User access to the management interface
     •   Management console access to the fabric
     •   Server access to the fabric
     •   Switch access to an existing fabric
    To prevent unauthorized configuration or management changes, Fabric OS
employs policies with multilevel passwords, extensive use of Access Control Lists
(ACLs), and centralization of fabric configuration changes on “trusted” switches.
Fabric OS prevents WWN spoofing—the practice of assuming the WWN of
another server to gain access to storage in its zone—at both the HBA and server
level by locking certain WWNs to certain ports. With Secure Fabric OS, new
switches are assigned digital certificates, enabling an existing fabric to authenticate
any switch that joins the fabric. While Secure Fabric OS prevents unauthorized
access to the fabric from the outside, Brocade Zoning ensures that devices can
access only their authorized storage resources.
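
Locking a WWN to a port, as described above, amounts to rejecting a login when a known WWN shows up somewhere other than its assigned port. The following is a minimal sketch of that check; the function name, the lock table, and the WWN values are hypothetical, and treating unlocked WWNs as allowed on any port is an assumption of the sketch rather than a statement about Secure Fabric OS.

```python
def login_allowed(port_locks, wwn, port):
    """Permit a login unless the WWN is locked to a different port."""
    locked_port = port_locks.get(wwn)
    return locked_port is None or locked_port == port


# Hypothetical lock table: this WWN may only log in on port 3.
port_locks = {"10:00:00:00:c9:aa:bb:01": 3}

print(login_allowed(port_locks, "10:00:00:00:c9:aa:bb:01", 3))  # expected port
print(login_allowed(port_locks, "10:00:00:00:c9:aa:bb:01", 7))  # spoof attempt
print(login_allowed(port_locks, "10:00:00:00:c9:aa:bb:02", 7))  # no lock defined
```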



      Summary
      Before selecting which switches to use to build your SAN fabric configuration,
      you need to consider a wide variety of variables. As with all resource planning,
      you should identify and prioritize your key requirements, both current
      and future, before comparing the available products in the marketplace. Brocade
      offers a full range of Fibre Channel switches, from hub alternatives to highly
      available fabric switches, integrated fabrics, and core fabric switches with 64 ports
      of connectivity. Each Brocade SilkWorm switch provides unique characteristics
      designed for entry-level SANs, medium-sized data centers, or very large
      enterprises.
          Along with these switches, Brocade provides Fabric OS, a real-time operating
      system designed to deliver all the key functions for managing your SAN environ-
      ments. Fabric OS includes a wide range of basic functions, including Fibre
      Channel services and rerouting services. It also features an API that you can use
      to write your own SAN management applications (or you can take advantage of
      applications developed by third-party software vendors).
          To simplify the deployment, management, and ongoing administration of
      fabric-related tasks, Brocade offers a comprehensive set of software products.
      Products such as Brocade QuickLoop, Zoning, and WEB TOOLS can help you
      fully leverage your existing hardware investments and help position you for con-
      tinued growth. In addition, Brocade provides optional features—such as Brocade
      Extended Fabrics and Fabric Watch—designed for specific types of SAN environ-
      ments or functions. By taking advantage of the full range of these products, you
      can significantly increase the overall return on your SAN investment.
          Brocade will also provide strategic software functions as part of its Intelligent
      Fabric Services Architecture. Functions such as ISL Trunking and Frame Filtering
      will provide high-performance and high-reliability characteristics to meet the
      most demanding enterprise-level requirements. In addition, enhanced hardware
      zoning functions and Secure Fabric OS will greatly improve security within SAN
      environments, enabling SANs to grow in a safe, controlled manner.

      Solutions Fast Track
      Selecting the Right Switch
               Identify your requirements for availability, port density, functionality,
               and cost.
     Decide whether you need an arbitrated loop or full-fabric environment.
     Learn which switch functions best satisfy your requirements.
     Consider what strategic direction you want to take, and whether your
     current switches will scale easily to meet your needs.


Understanding the Brocade Fabric OS
     Fabric OS is the operating system for all Brocade SilkWorm switches.
     Key functions include auto-discovery, in-order frame delivery, zoning,
     and others.
     Fabric OS provides the capability to work with other storage
     management applications.


Using Optional Brocade Features
     You can use Brocade Zoning to isolate devices into separate,
     virtual SANs.
     Zoning is ideal for multiple customer environments where data security
     is critical.
     Extended Fabrics enables the benefits of Fibre Channel technology at
     distances up to 100 km.
     Fabric Watch tracks switch and fabric events to help you optimize
     fabric-wide performance and proactively identify problems before they
     happen.
     QuickLoop integrates private loop-based devices into switched fabric
     environments.
     QuickLoop helps support legacy devices to protect existing investments
     while also providing performance and reliability advantages.
     WEB TOOLS is an advanced monitoring tool that sends alerts about
     fabric events to help prevent downtime.
     You can use a Web browser interface and Java plug-in to monitor
     switched-fabric SANs from any workstation in your enterprise.


      Future Capabilities in the Brocade
      Intelligent Fabric Services Architecture
               The Brocade Intelligent Fabric Services Architecture includes the
               SilkWorm family of fabric switches, advanced fabric services, open fabric
               management tools, and enterprise-class security products.
               ISL Trunking is an optional product ideal for optimizing performance
               of Brocade 2 Gbit/sec Fibre Channel fabric switches.
               Frame Filtering enables a variety of new capabilities for monitoring and
               managing SAN fabrics and enhancing both security and reliability.
               Secure Fabric OS is the most comprehensive SAN security architecture
               available, addressing vulnerabilities in the SAN fabric and supporting
               multiple authentication methods.

      Frequently Asked Questions
      The following Frequently Asked Questions, answered by the authors of this book,
      are designed to both measure your understanding of the concepts presented in
      this chapter and to assist you with real-life implementation of these concepts. To
      have your questions about this chapter answered by the author, browse to
      www.syngress.com/solutions and click on the “Ask the Author” form.


      Q: Can I use Brocade switches in an arbitrated loop environment?
      A: Yes. Brocade offers switches that are viable alternatives to managed hubs. In
          addition, the Brocade QuickLoop product enables the integration of private
          loop-based devices into a switched fabric.

      Q: What switch is most capable of providing high availability at a low cost?
      A: For a relatively inexpensive switch that provides high-availability characteris-
          tics, try the SilkWorm 2400 and 2800. Both switches have redundant, hot-
          swappable components at a cost-effective price point.

      Q: What is the most reliable way to keep certain hosts from interacting with
          other hosts or storage devices?
A: Brocade Zoning enables you to configure distinct zones to restrict
   communication between various devices.

Q: How can I perform fabric administration and management?
A: Brocade offers several tools for simplifying these types of tasks. Fabric Watch,
   Fabric Access, and WEB TOOLS all provide timesaving functions that help
   reduce SAN management costs.

Q: What kind of switch is most suited for a very large enterprise SAN?
A: The Brocade SilkWorm 6400 provides 64 ports of connectivity with high-
   availability characteristics, making it ideal for data center environments and
   enterprise SAN implementations.
                                        Chapter 5


The SAN Design
Process




 Solutions in this chapter:

     •   Looking at the Overall Lifecycle of a SAN
     •   Conducting Data Collection
     •   Analyzing the Collected Data


         Summary

         Solutions Fast Track

         Frequently Asked Questions




150   Chapter 5 • The SAN Design Process



      Introduction
      We intend this book to allow you to effectively design, implement, and maintain
      storage networks. Doing so requires an understanding of the processes in each of
      the seven phases of a SAN’s lifecycle, and their relationships with each other.
      Without taking a moment to review the process from the highest level, it is easy
      to get lost in the details of SAN hardware.
          In this chapter, we provide that high-level view. We show how the SAN
      design process is really an ongoing lifecycle. We take you through the process
      from the moment the decision is made to deploy a SAN through the release of the
      SAN to production. Then we explain the extent to which the process should be
      repeated when upgrades and architectural changes are needed. We also provide
      detail on the first two parts of the lifecycle.
          The processes presented here are derived from other areas of Information
      Technology (IT), and they are normal parts of any large-scale IT project. For
      example, when implementing a SAN, you should interview the people who will have
      a key interest in the finished product; the same is true when putting in a Local
      Area Network (LAN) or Wide Area Network (WAN). Much of this material
      should be second nature to any IT network architect, Database Administrator
      (DBA), or senior systems administrator. For more advanced users, who already
      understand these techniques in general, this chapter will serve as reference
      material showing how these processes are applied to SANs in particular. We have
      attempted in this book to provide material that will allow both the beginner and
      the expert to successfully design a SAN.
          More attention must be paid to SAN design than to the design of most other
      networking technologies. This is because SANs typically have more stringent
      availability and performance requirements than other networks. A SAN is similar
      to a traditional network in its requirements, but is also somewhat like a channel
      (for example, a CPU/RAM interconnect mechanism, or a PCI bus). Channels
      require very high performance, and are almost assumed to be 100 percent reliable.
      This is in stark contrast to the traditional Ethernet LAN, where things like
      five-nines uptime for all node connections, in-order packet delivery, and tuned
      approaches to bandwidth management are rare indeed.
          Fortunately, SANs provide the tools necessary to achieve these performance
      and availability goals. For example, it is commonplace in a Fibre Channel SAN to
      use a dual-fabric approach to SAN architecture. This means having two completely
      separate networks for data to travel over, and potentially using both networks
      as active paths. While it is certainly possible to do this sort of thing using
      IP/Ethernet networks, it is substantially more difficult, since Fibre Channel was
designed with this in mind, and Ethernet was not. The SAN designer must provide
for higher availability and spend some time thinking about performance, but
will know going into the process that these goals are entirely achievable.
     We should also note here that the process outlined in this chapter is designed
to make a complex SAN design successful. With less complex designs (that is, the
majority of SAN deployments to date), it is perfectly acceptable to skip over
much of the process. For example, if you are deploying a SAN with only three
servers and two storage arrays, spending much time on architectural analysis is
unnecessary. The complexity is presented here so that users with complex
requirements will have it available to them; users with simpler scenarios can use
their judgment about which parts to incorporate into their design process.
     The seven phases of the lifecycle of a SAN can be grouped, at the very highest
level, into three broad categories: design, implementation, and maintenance.
The first of these, designing the SAN, includes the collection and the analysis
of data, which defines the requirements of the network. We will go into
detail on these first two phases of the design process in this chapter. These phases
provide a solid launch pad for your journey through the remainder of the SAN’s
lifecycle.
     The third and fourth phases of the SAN lifecycle, architecture development
and prototype testing, complete the design process. Implementing the SAN
encompasses the transition phase and the release to production phase, the fifth
and sixth phases of the lifecycle. These phases are discussed in Chapters 6 and 7
of this book. Chapters 8 and 9 cover troubleshooting, maintenance, and management,
which make up the final phase of the lifecycle model.
     When you are finished reading this chapter, you should have a solid under-
standing of the design processes, and have a valuable reference tool to enable pro-
ject planning on any future SAN deployments.

Looking at the Overall Lifecycle of a SAN
Any SAN will go through certain phases over the course of its life. Depending
on the size and complexity of the SAN, some phases might take months to com-
plete, and some might be only glanced over. For example, a single-switch SAN
does not require much in the way of network design. However, if the solution
involves hundreds of devices, including storage components from many different
vendors that were not already pretested and determined to be interoperable, it
could require extensive testing or validation.
          When an existing SAN must undergo a fundamental change, be it at the
      architectural level or simply the introduction of a new type of storage array, you
      should cycle back through the phases of SAN development. This will ensure that
      the critical applications running on the SAN are not unexpectedly disrupted by
      changes. However, when the change is fundamental but small (like adding a new
      type of storage array), it is possible to take a fast track through this process.
          The SAN’s lifecycle, which can be described at a high level as design, imple-
      mentation, and maintenance, translates directly into action-oriented phases on the
      part of the SAN designer: data collection, data analysis, architecture development,
      prototype and testing, transition, release to production, and maintenance. See
      Figure 5.1 for a flowchart of these phases and their relationships to each other.

      Figure 5.1 An Overview of the Lifecycle of a SAN

      [Flowchart with three categories. Design: Data Collection, Data Analysis,
      Architecture Development, Prototype and Test. Implementation: Transition,
      Release to Production. Maintenance: Add / Change / Remove / Management /
      Troubleshooting. A path labeled "Upgrade / Architectural Change" runs across
      the top of the three categories.]


Data Collection
You must define the requirements of the SAN before building it. What business
problem is being solved by the SAN? What are the overall goals of the project?
To determine the requirements, you should interview all affected parties to find
out what they hope to achieve (in other words, their goals and objectives),
and develop both a detailed technical requirements document and a timeline for
the project.

Data Analysis
Once you have gathered input from all parties, you must analyze it and put it
into a meaningful format. The first two phases together will allow you to start
with the business goals that are driving the project, and determine at a high level
the necessary technical properties required of the SAN. Once this phase is
completed, all business requirements should be translated into technical requirements.
The technical requirements document will be created during the collection
phase, and completed during the analysis phase. You will also have created a
working document for a Return On Investment (ROI) proposition to justify the
expense of the project.

Architecture Development
Now that you have a list of technical requirements, you will develop a SAN
architecture that meets those requirements. This process will involve balancing
many factors. For example, there might be a tradeoff between performance
considerations and cost. It might be necessary for you to cycle back to the data
collection and analysis phases to gather more input, so that any compromises
reflect the views of all affected parties. When finished, you will have a detailed
architecture of the SAN that you intend to build. A SAN architecture includes the fabric
topologies of all related fabrics, the storage vendors involved, the SAN-enabled
applications being used, and other considerations that affect the overall SAN
solution. This step is the most likely to be skipped over quickly when the SAN
requirements are small.

Prototype and Testing
SANs deal directly with the mission-critical data of today’s enterprises. When
building any mission-critical solution, you must test it before releasing it to
production. In this phase, you will build a prototype of the SAN solution and test it
      to ensure that it will function properly when released. This should be done using
      nonproduction systems. It might be necessary to cycle back to the architecture
      development phase if problems are found.
           Wherever possible, build a test bed identical to the solution you are
      implementing. This will provide the greatest assurance of success in production.
      However, budgetary concerns, limits on time and space, and other factors will
      usually prevent this from being practical. Imagine a 200-port SAN. Now imagine
      200 hosts and storage arrays plugged into it. Now imagine asking the CFO to
      buy another 200 devices to test with, and to provide administrators, space, power,
      and cooling for all of it.
           Because of this, the test phase will be a balance of conducting your own
      testing and leveraging other organizations’ test results. Finding a document that
      says “vendor X already tested or certified this configuration” might be as good as
      or better than testing it yourself. Even if the components of a solution have been
      tested by you and/or others to your satisfaction, you must test all aspects of the
      complete system prior to releasing it to production. This is due to the fundamental
      nature of a large networked system, where interactions, timing, and other factors
      can produce different results from devices tested individually. The actual final test
      will occur during the release to production phase, but creation of the test plan
      should occur in this phase. At the end of this phase, all parties with an interest in
      the outcome of the project will approve it, and the transition to production will
      begin.

      Transition
      Now that you have a working prototype, and all interested parties have signed off
      on it, you will begin to transition your existing hardware onto the new SAN. If a
      SAN is already in place, this phase might be as simple as adding a new node to
      the SAN, or changing the Inter-Switch Link (ISL) architecture. If the SAN is
      completely new, it might involve a long migration process consisting of moving
      one production system at a time. In any case, there might be a need to cycle
      between this phase and the release-to-production phase repeatedly. Once a com-
      ponent has completed the transition onto the SAN, release to production can
      occur for that component.

      Release to Production
      Once a component has been transitioned onto the new SAN, it must be tested
      again and then approved before becoming a part of the enterprise’s production
      environment. Since there might be many components that must be transitioned
and released, it might be necessary to cycle between the transition and release-to-
production phases repeatedly until all components have entered production. After
this phase is complete, the SAN will enter the maintenance phase.

Maintenance
This is the useful life of the SAN. All of the benefits that prompted the SAN
designer to implement the SAN in the first place are found in this phase. It is
therefore desirable to have a SAN spend as much time as possible in this phase,
and as little as possible in the other phases. The goal of this phase is to keep the
SAN running as well as possible for as much of the time as possible, and to
expand its capabilities only according to the original, tested, and approved
parameters. This phase includes adding, changing, or removing components, as well as
managing, monitoring, and troubleshooting existing components.
     During the maintenance phase, no changes should be made to the SAN that
fall outside of the original blueprint that was established in the previous phases.
Any such change necessitates a repetition of the entire lifecycle. For example, if
the SAN were originally built using vendor X storage arrays, an additional
vendor X array could be added as part of maintenance, but an array from vendor
Y would require thought and testing before its introduction. It might not require
much thought and testing, but it must, in any case, be looked into.


NOTE
     Any fundamental change to the SAN requires a repetition of the
     entire lifecycle.



   In summary, the seven phases of the SAN design lifecycle are:
     1. Data Collection
     2. Data Analysis
     3. Architecture Development
     4. Prototype and Test
     5. Transition
     6. Release to Production
     7. Maintenance
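
The phases and the cycles between them can be written down as a small state table. The following Python sketch is only an illustration of the flow described in this section; the names `Phase`, `NEXT_PHASES`, and `can_move` are hypothetical and do not come from any Brocade tool.

```python
from enum import Enum


class Phase(Enum):
    DATA_COLLECTION = 1
    DATA_ANALYSIS = 2
    ARCHITECTURE_DEVELOPMENT = 3
    PROTOTYPE_AND_TEST = 4
    TRANSITION = 5
    RELEASE_TO_PRODUCTION = 6
    MAINTENANCE = 7


# Moves between phases as described in the text (an interpretation, not a spec).
NEXT_PHASES = {
    Phase.DATA_COLLECTION: {Phase.DATA_ANALYSIS},
    Phase.DATA_ANALYSIS: {Phase.ARCHITECTURE_DEVELOPMENT},
    # Architecture work may cycle back to gather more input.
    Phase.ARCHITECTURE_DEVELOPMENT: {
        Phase.PROTOTYPE_AND_TEST,
        Phase.DATA_COLLECTION,
        Phase.DATA_ANALYSIS,
    },
    # Problems found in testing send the design back to architecture.
    Phase.PROTOTYPE_AND_TEST: {Phase.TRANSITION, Phase.ARCHITECTURE_DEVELOPMENT},
    Phase.TRANSITION: {Phase.RELEASE_TO_PRODUCTION},
    # Transition and release repeat until every component is in production.
    Phase.RELEASE_TO_PRODUCTION: {Phase.TRANSITION, Phase.MAINTENANCE},
    # A fundamental change restarts the lifecycle.
    Phase.MAINTENANCE: {Phase.DATA_COLLECTION},
}


def can_move(current: Phase, target: Phase) -> bool:
    """Return True if the text describes a move from current to target."""
    return target in NEXT_PHASES[current]
```

A table like this makes it easy to see that the only way out of maintenance is back to data collection, which is the point of the NOTE above.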



      Conducting Data Collection
      The data collection phase of SAN design is the foundation upon which the SAN
      will be built. It is vital that the information collected in this phase be both
      complete and accurate. If the SAN requirements are poorly defined, the resulting
      SAN will almost certainly meet business objectives poorly. You should therefore
      take your time with this phase.
          Some of the information you will collect is generic to any major IT project.
      If you already have an established data collection process in your company, inte-
      grate the SAN-specific material from this section into that process.
          Data collection consists of determining which people you will need to
      interview, interviewing them, and conducting a physical assessment of existing
      equipment and facilities. When this process is complete, you will have a technical
      requirements document consisting of a list of the business problems that the SAN
      will solve, the business requirements for the SAN, characteristics of all devices
      that will be attached to it, and detailed information about all relevant facilities.
      You will also have a timeline for implementation.

      Creating an Interview Plan
      Who has a stake in the SAN solution? Well, you could argue that every person
      who uses a system attached to the SAN has a stake in it. While true, this is not
      useful for creating an interview list, because there would be too many people
      involved. Similarly, you could argue that only the person who initiated and
      “owns” the project should be consulted. Again, this is not useful, because it leaves
      out people who have a strong interest in the project, and might have knowledge
      that is critical to its success.
          A balanced approach to creating an interview list is critical. You can view the
      people on this list as a SAN solution “core team.” Think about having all of these
      people together in a room, trying to solve the SAN solution problem
      together. Try to include everyone needed to solve the problem, but nobody else.
      Typically, a core team might include:
           •   At least one systems administrator
           •   At least one storage administrator
           •   A network administrator
           •   A DBA, if a database server will be involved
           •   At least one application specialist associated with each application that
               will run on the SAN
           •   At least one manager who can act as an overall “owner” of the project
    It is probable that you will be one of these people, in addition to being the
SAN designer. Unless you are an external consultant, this is typically the case.
    Once you have a list of the desired members of the core team, you must con-
tact them and ask them to take time to help with the project. Ensure that each
team member has allocated the necessary time and that their management appre-
ciates the demands of participating in this team. As the SAN design goal of the
team might require a long-term process, getting this buy-in initially will mini-
mize disruption to the team later. Often in the past, SAN design teams did not
include network administrators, as the focus was on the storage side. Experience
has shown that SANs are networks, and should be coordinated with the tradi-
tional IP network groups to ensure that proper networking experience is at hand.
    Whenever possible, schedule an interview as a face-to-face, one-on-one
meeting. This format will allow you to communicate the questions and understand
the answers most effectively. You should also have a group meeting with the entire
core team after conducting individual interviews. This will allow you to resolve
any differences before analyzing the data, and review the analysis as a team.

Conducting the Interviews
Now that you know whom to interview and have a schedule of when you will
interview them, you need to know what questions to ask, and what format to
put the collected data into. This section contains a suggested set of questions that
you should ask, and some detail on what each question is about. It is followed
by a summary that could be used to create an interview form.


NOTE
     Not every person you interview will be able to answer every question.
     Between the members of the core team, the expertise necessary to answer
     all of these questions should be completely represented. Some members
     might provide conflicting answers. You will be in a key position to resolve
     these differences, and achieve a compromise. It is vital that all affected
     parties agree with the deployment strategy before implementation begins.
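
One way to prepare for the group meeting is to collate the individual answers and flag the questions that nobody could answer or that drew conflicting answers. The Python sketch below assumes a simple, hypothetical data layout (a dictionary of answers per role); the function names, role names, and sample answers are all illustrative.

```python
from collections import defaultdict


def collate_answers(interviews):
    """Group answers by question.

    `interviews` maps an interviewee's role to a dict of question -> answer.
    A missing question or an answer of None means "could not answer".
    """
    by_question = defaultdict(dict)
    for role, answers in interviews.items():
        for question, answer in answers.items():
            if answer is not None:
                by_question[question][role] = answer
    return by_question


def review(questions, interviews):
    """Return (unanswered questions, questions with conflicting answers)."""
    by_question = collate_answers(interviews)
    unanswered = [q for q in questions if q not in by_question]
    conflicting = [
        q for q in questions
        if len(set(by_question.get(q, {}).values())) > 1
    ]
    return unanswered, conflicting


# Hypothetical sample data for illustration only.
questions = ["host count", "storage device count", "any-to-any connectivity"]
interviews = {
    "systems administrator": {"host count": "12", "any-to-any connectivity": "no"},
    "storage administrator": {"host count": "14", "storage device count": None},
}
unanswered, conflicting = review(questions, interviews)
```

With this sample data, `unanswered` holds the storage device count question and `conflicting` holds the host count question, which gives the core team a short agenda of items to resolve.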


      What Overall Business Problem Are You Trying to Solve?
      A business problem that would initiate a SAN design might be something like:
           •   “We need to keep our business running in case of a disaster like an
               earthquake or fire.”
           •   “Our backups take so long to finish that they are impacting our ability
               to process customer orders.”
           •   “We need to save money on storage by utilizing free space
               more efficiently.”
          Chapter 6 discusses some of the more common business problems that SANs
      can solve. Brocade maintains a series of documents that detail specific SAN
      solutions. These documents are known as Brocade SOLUTIONware configuration
      guidelines and are available on the Brocade Web site at www.brocade.com/SAN.


      NOTE
           A SAN might be intended to solve multiple business problems. In this
           case, you should separate each business problem into a different set
           of questions and answers. You will correlate these during the
           analysis phase.




      What Are the Business Requirements of the Solution?
      Once you know the business problem that you need to solve, it should be easy to
      figure out what the business requirements of the solution must be. This is simply
      a matter of rephrasing the previous answers, with more specific criteria:
           •   “The SAN must allow all functionality of all business-critical servers at
               site X to resume within Y minutes at site Z.”
           •   “The SAN must allow the following list of servers to complete backups
               within X minutes: …”
           •   “The SAN must allow the following list of servers access to the
               corresponding list of storage arrays: …”
   This is useful because it acts as a migratory step toward turning the business
problem into a matching technical solution.

Moving from Business Requirements to Technical Requirements
You should not deploy a SAN simply for the sake of adopting the “hot new
technology.” SANs are hot because they solve important business problems and
allow companies to make more money. The benefit could be fairly direct: for example,
a matter of saving more money on IT than the project cost, since SANs are very
efficient at providing a clear ROI. ROI is often achieved by management
efficiencies, resource efficiencies, or better utilization of resources. On the other
hand, it could be indirect: making IT systems more efficient and thus increasing
users’ productivity.
     The first key to a successful SAN deployment is the accurate and complete
statement of what business problem(s) you intend for the SAN to solve. Unfortunately,
you cannot turn a business problem into a technical solution without work.
There is no silver bullet to make your backups run faster so that your users will
not have to work on a slow system. However, there are tape libraries that run fast
and can be shared by many devices. These, when combined with an appropriate
Fibre Channel fabric and a SAN-enabled backup application, could amount to
the same thing as the silver bullet.
     In order to know which hardware and software will solve your business
problem, you have to define in a technical way what you need to accomplish.
This is a necessary intermediate step between the business problem and the pur-
chase of specific technical solutions.
     It is fairly straightforward to change a sentence like, “We need to keep our
business running in case of a disaster like an earthquake or fire” into a sentence
like, “The SAN must allow all functionality of all business-critical servers at site
X to resume within Y minutes at site Z.” Once you have done this, you will have
the business requirements of the solution. You know that you have a business
requirements statement when you could phrase it like this, and still have it make
sense: “Our business will run better if we have a SAN that can allow all
functionality of all business-critical servers at site X to resume within Y minutes at site
Z.” The components of the business requirements statement are “our business will
run better” (or something to that effect) followed by a reasonably specific
statement about what the SAN must do to make that happen.
     However, you will still not have the technical requirements detailed. This is not
something that you, the SAN designer, can simply ask in an interview. This is a
large part of what you will bring to the table as the SAN designer once you have
      gathered the data and then analyzed it in the next phase. A technical requirements
      document set should list, in detail:
           •   All of the devices that are to be attached to the SAN
           •   Their locations
           •   The communication patterns between them (random I/O, streaming
               access such as video, I/O-intensive database access)
           •   Their performance characteristics (reads, writes, max/min/typical
               throughputs)
           •   What software will run on them relative to the SAN (for example, a
               LAN-free backup application, or anything SAN-specific)
           •   How all of this is expected to change over time (storage growth,
               server growth)
          The technical requirement statement would be, “The SAN needed to meet
      the business requirements outlined must have the following characteristics: …”
      This would be followed by the body of the technical requirements document.
      The rest of the questions to ask in the interview process will provide you with
      the body of this document.

      What Is Known about the Nodes that Will Attach to the SAN?
      You should try to get as much information as possible about every node that
      will attach to the SAN. The relevant questions cover each host, each storage
      device, the facilities where hosts and storage will be located, and the SAN
      itself. Questions about each host could include the following:
           •   What operating system is installed? What patch or service pack level?
           •   Are fabric HBA/controller drivers available? Are they well tested?
           •   What type of connection is supported (private loop, public loop,
               or fabric)?
           •   Which applications will run on this host (databases, e-mail, data
               replication, file sharing)?
           •   How much storage does it require?
           •   How will its storage requirements change over time?
           •   Physically, what are its dimensions? How heavy is it?
           •   Does it rack mount? Does it have a rack kit? Will it sit on a shelf?
           •   If there is a management console, what type is it? (Is it a traditional
               keyboard/video/mouse combo [KVM], or is it a serial connection, like a
               TTY?) Does it need to be permanently attached? (For example, a Sun
               SPARC server could have a keyboard, mouse, and monitor permanently
               attached, or it could be managed through a serial port attached to
               a modem.)
           •   How many HBAs will it have?
           •   If it has more than one HBA, what software will be used to provide
               failover or performance enhancements of multiple paths?
           •   Do these interfaces exist, or do they need to be purchased? (You should
               keep track of every piece of equipment that you need to buy for the
               project, for budgeting and ROI analysis.)
           •   If they exist, what are the make, model, and version?
           •   If not, what kind will be purchased to meet the objective?
           •   How many Ethernet interfaces will it have?
           •   In what temperature range will it operate?
           •   Will it need a telephone line for management?
           •   Where will the node be physically located?
   These questions could be used to create an interview form for each host,
which might look like the following:


            Field                                 Options / value to record
            OS
            HBA Drivers                           fabric | PTP | private loop | public loop
            HBA Count
            DMP / Failover Support
            HBA New or Existing                   New: make, model, version
                                                  Existing: make, model, version
            Application List
            Initial Storage Requirements
            Projected Storage Requirements
            Dimensions
            Weight
            Mounting                              rack mount | rack shelf | floor | table top
            Console Type                          KVM | switched KVM | TTY | terminal server | modem
            Console Location
            Ethernet Interface List
            Operating Temperature
            Power Requirements                    Voltage, Amperage, Connector Type
            Need Telephone Line for Management?
            Physical Location
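
If you keep interview results electronically, the host form can be mirrored by a simple record type. The Python sketch below covers only a subset of the form's fields, and every class and field name in it is hypothetical, chosen for illustration. The `open_items` helper lists the fields that still need an answer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Connection(Enum):
    FABRIC = "fabric"
    PTP = "PTP"
    PRIVATE_LOOP = "private loop"
    PUBLIC_LOOP = "public loop"


class Mounting(Enum):
    RACK_MOUNT = "rack mount"
    RACK_SHELF = "rack shelf"
    FLOOR = "floor"
    TABLE_TOP = "table top"


@dataclass
class HostInterview:
    """A subset of the host interview form; None or [] means not yet answered."""
    name: str
    os: Optional[str] = None
    hba_driver_connections: List[Connection] = field(default_factory=list)
    hba_count: Optional[int] = None
    failover_software: Optional[str] = None
    applications: List[str] = field(default_factory=list)
    initial_storage_gb: Optional[float] = None
    projected_storage_gb: Optional[float] = None
    mounting: Optional[Mounting] = None
    physical_location: Optional[str] = None

    def open_items(self) -> List[str]:
        """Names of fields that still have no answer."""
        return [k for k, v in vars(self).items() if v is None or v == []]


# Hypothetical host used only to show the record in use.
host = HostInterview(name="db01", os="Solaris 8", hba_count=2,
                     mounting=Mounting.RACK_MOUNT)
remaining = host.open_items()
```

Restricting multiple-choice fields to an enumeration keeps answers comparable across hosts when you reach the analysis phase.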


          Questions about each storage device could include the following:
           •   What are the make, model, and version information?
           •   What type of connection is supported (private loop, public loop, fabric,
               SCSI, SSA)?
           •   How many hosts can this device serve?
           •   If it is a multiport device, does it have limits on how many hosts can
               access it through each port?
           •   Physically, what are its dimensions? How heavy is it?
           •   What is its capacity in gigabytes?
           •   Does it rack mount? Does it have a rack kit? Will it sit on a shelf?
           •   If there is a management console, what type is it? Does it need to be
               permanently attached?
           •   How many Fibre Channel interfaces will it have?
           •   Do these interfaces exist, or do they need to be purchased?
           •   If they exist, what are the make and model? If not, what kind will
               be purchased?
           •   How many Ethernet interfaces will it have?
           •   In what temperature range will it operate?
           •   Will it need a telephone line for management?
           •   Where will the node be physically located?
           •   What is the firmware level?
           •   For tape libraries, what is the capacity of each cartridge, number of
               cartridges the library can hold, number and speed of drives, and number
               of transports?
           •   SCSI or Fibre Channel interface? What type of SCSI (wide/narrow,
               differential/single ended)?


NOTE
  Obviously, some of these questions do not relate directly to the SAN
  deployment. However, they are generally relevant whenever making a
  large architectural change in a data center. For example, it is necessary to
  know what temperature a server can operate at in case the server is in a
  location where temperature control is an issue. In this case, adding a
  large number of switches might increase the room temperature beyond
  operating levels. As always, use your judgement about which questions
  to include in your interview, and which to skip over.

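
The tape library answers (cartridge capacity, cartridge count, number and speed of drives) can be checked against a backup-window requirement with simple arithmetic. The Python sketch below is an idealized estimate: it assumes that all drives stream in parallel at their rated speed and that 1 GB equals 1,000 MB. The function names and all figures are hypothetical.

```python
def library_capacity_gb(cartridge_capacity_gb, cartridge_count):
    """Total capacity of a tape library when every slot holds a cartridge."""
    return cartridge_capacity_gb * cartridge_count


def backup_minutes(data_gb, drive_count, drive_speed_mb_per_s):
    """Best-case minutes to back up data_gb with all drives streaming at once."""
    aggregate_mb_per_s = drive_count * drive_speed_mb_per_s
    seconds = (data_gb * 1000) / aggregate_mb_per_s
    return seconds / 60


# Hypothetical figures for illustration only.
capacity = library_capacity_gb(cartridge_capacity_gb=100, cartridge_count=20)
minutes = backup_minutes(data_gb=900, drive_count=4, drive_speed_mb_per_s=15)
fits_window = minutes <= 300  # example requirement: finish within 300 minutes
```

Real backups rarely sustain rated drive speed, so treat the result as a lower bound on the time required, not a prediction.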



      NOTE
           While it is possible to manage an entire fabric through a single Ethernet
           connection, this is not the method that Brocade currently recommends.
           You should plan on one Ethernet connection per Brocade switch, in
           addition to planning connections for hosts and other SAN devices. For the
           highest availability, it is also advisable to balance switches across
           multiple electrical circuits, even if an Uninterruptible Power Supply (UPS)
           protects them.



          Questions about facilities where hosts and storage will be located could
      include the following:
           •   Who is responsible for this facility?
           •   Are there any existing optical cables, and what type?
           •   Is there sufficient electrical power?
           •   What about cooling?
           •   Is there enough rack space?
           •   What is the network infrastructure?
           •   Physical access? If the location is on an upper floor, is there a
               freight elevator?
          Answers to questions about the SAN itself must be considered preliminary.
      They will indicate preconceptions that members of the core team have, but
      all members should be prepared to be flexible on these preconceptions as the
      SAN design process progresses. Questions about the SAN itself could include
      the following:
           •   Are there any distance considerations? (For example, long cable runs
               between floors of a building, campuswide networks, or MAN/WAN
               connections.)
           •   How many hosts will attach to the SAN?
           •   How many storage devices will attach to the SAN?
           •   If known at this point, do they require any-to-any connectivity?
               Alternatively, are there groups of devices that need to communicate
               only among themselves?


Which SAN-Enabled Applications
Do You Have in Mind?
Will the SAN use a serverless backup application? How about clustering soft-
ware? How about volume management? This category of software requires special
attention because of its close ties to the SAN hardware you choose to build the
solution. For example, if you plan to use vendor X serverless backup software,
you must make sure that your backup hardware (tape libraries, Fibre Channel/
SCSI gateways, etc.) is supported.

Which Components of the Solution Already Exist?
Any hardware or software that is already in place and that must be included in
the solution will create points for you to build around. You must find out as
many details as possible about everything in this category. When you have finished
the interviews and are conducting the physical assessment, you should personally
inspect every piece of hardware. This will prevent surprises later in the process.
Make sure that you find out exactly where all hardware is located, and how to
access it.
    You must pay special attention to devices that already exist and already have
Fibre Channel interfaces. Find out which kinds of HBAs are installed in hosts,
and which driver revisions are installed on them. Find out code levels for RAID
arrays and Fibre Channel tape libraries. Find out if upgrades to driver/code levels
are planned or at least allowed.


NOTE
     You must know if each device is public loop, private loop, or full fabric.
     Some devices might even be SCSI and require additional hardware to
     bridge between SCSI and Fibre Channel.



     If possible, you should not use private loop drivers on initiators unless the
device does not support fabric drivers or is not easy to upgrade. Private loop
hosts require special licenses, typically Brocade QuickLoop and Zoning. Find out
if the existing devices are configured as full-fabric devices. If not, find out if their
drivers support full fabric, or if they can be upgraded to full fabric. This is not
intended to discourage incorporation of private loop devices into a fabric:


      QuickLoop and Fabric Assist exist specifically to enable this to occur. However, if
      a device can support full fabric, then integration into the SAN will be easier if it
      does so.

      Which Components Are Already in Production?
      Components that are in production require special attention in two areas:
           s   Duplicate equipment might be desired for testing.
           s   The transition phase is more complex.
         It is vital to know as much as possible about production systems that are
      going to transition onto the SAN. Therefore, somebody intimately familiar with
      and responsible for every such system should be included on the core team.

      Which Elements of the Solution
      Need to Be Prototyped and Tested?
      For relatively simple solutions that involve only components already certified to
      work together, it might be that you do not have to do any testing at all. For
      example, if your SAN-based solution follows a Brocade
      SOLUTIONware document, you might feel that you need to do only minimal
      validation. This is in contrast to a solution for which no documentation or
      testing information exists, which generally requires extensive validation.
          For more complex solutions involving a large number of devices that might
      be from many different vendors, you might feel that every single element needs
      to be tested in combination before release to production can occur. You should
      get input on this from every member of the core team. If any team member feels
      that you should conduct in-house testing on a component, you should strongly
      consider doing so.

      What Equipment Will Be Available for Testing?
      Any existing equipment that is not in production, and any equipment that is
      going to be purchased specifically for this project might be good material with
      which to build a test bed. Existing equipment that is in production is not
      suitable for testing. If existing equipment already in production will be
      transitioned onto the SAN, it might be beneficial to budget for a representative
      sample of duplicate, nonproduction systems with which to prototype the solution.
      It is generally a good idea to have such systems available for testing in any
      case. It may also be possible to borrow systems to test with, so it is probably
      worth asking your vendors for such loans.


    Whether or not test equipment is available, you should research what testing
vendors or third-party organizations have already done. In this way,
you will avoid duplicating their efforts. If you cannot get representative test
equipment for an element that needs to be prototyped, it might be acceptable,
and necessary, to rely entirely upon the work done by others to validate
the solution.
    Again, with many solutions, this is a perfectly acceptable way to go. If you do
not feel that in-house testing is warranted, then you can save time and money by
skipping the prototype and test phase. Just make sure that you have documentation
certifying the solution before you make this decision.

How and When Are Backups to Be Done?
You need to get a list of everything that relates to the system’s backups:
     •   What backup hardware will be used?
     •   What backup software will be used for each host?
     •   Which storage arrays will be backed up by which tape libraries?
     •   When will these backups occur?
     •   How long can they take?
     •   How much data needs to be backed up?
     •   Will snapshots be used? How do they work?
     •   Will split mirrors be used? How do they work?
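
The answers to "how much data" and "how long" combine into a rough throughput requirement: the amount of data divided by the length of the backup window. The following Python sketch shows the arithmetic only. The function name and the 500 GB and 8-hour figures are hypothetical, and it assumes 1 GB = 1,024 MB.

```python
def required_backup_rate(data_gb, window_hours):
    """Return the sustained MB/sec needed to move data_gb within window_hours."""
    if window_hours <= 0:
        raise ValueError("backup window must be longer than zero hours")
    megabytes = data_gb * 1024      # assumption: 1 GB = 1,024 MB
    seconds = window_hours * 3600
    return megabytes / seconds


# Hypothetical example: 500 GB must be backed up within an 8-hour window.
rate = required_backup_rate(500, 8)
print(f"{rate:.1f} MB/sec sustained")
```

A figure like this is a floor, not a design target: it ignores tape drive streaming behavior, compression, and any other traffic sharing the same path.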


What Will Be the Traffic Patterns in the Solution?
You should produce a matrix showing every initiator-to-target communication
expected in the SAN. This is necessary to determine performance characteristics,
and to set up zoning on the fabric:
     •   Which hosts will use a specific storage array?
     •   Which hosts in a cluster will talk directly to each other over the SAN?
     •   Which backup devices will be performing serverless backups?
     •   Which arrays will they be backing up?
   Create a table that lists, in one column, every device on the SAN that can act
as an initiator. This will include every host, every storage virtualization product,


      and every serverless backup server. It might include storage arrays, if they have
      data replication capabilities. Then put a second column next to it with all of the
      targets that each initiator will communicate with (Table 5.1).

      Table 5.1 Initiator-to-Target Mapping

      SAN Traffic Patterns

      Initiators                           Targets
      host1                                array3
      host2                                array1
                                           array2
                                           tape1
      host3                                array1
      host4                                array1
                                           array2
      tape1                                array1
                                           array3
      array3                               array4
      array4                               array3

           Note that some devices on a SAN can act as both an initiator and a target.
       If so, they will appear in both columns. See array3 and array4 in Table 5.1.
      This is how you would indicate that array3 and array4 perform data replication
      between them.
           You will not necessarily be able to build this table by interviewing one person;
      it will likely be developed over the course of the interview process, changed as the
      implementation takes place, and maintained for the life of the SAN.
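
Because this table will be revised many times, it can help to keep it in a form that is easy to query. The following is a minimal Python sketch, not a prescribed format, that holds the Table 5.1 mapping in a dictionary, lists every flow, and finds the devices that appear in both columns. The variable names are illustrative.

```python
# Initiator-to-target mapping transcribed from Table 5.1.
traffic = {
    "host1": ["array3"],
    "host2": ["array1", "array2", "tape1"],
    "host3": ["array1"],
    "host4": ["array1", "array2"],
    "tape1": ["array1", "array3"],
    "array3": ["array4"],
    "array4": ["array3"],
}

# One (initiator, target) pair per expected communication.
flows = [(init, tgt) for init, tgts in traffic.items() for tgt in tgts]

initiators = set(traffic)
targets = {tgt for _, tgt in flows}

# Devices that appear in both columns act as initiator and target.
dual_role = sorted(initiators & targets)
print(len(flows), "flows; dual-role devices:", dual_role)
```

In addition to array3 and array4, tape1 shows up in both columns, because host2 writes to it and it reads from array1 and array3.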

      What Do We Know about Current
      Performance Characteristics?
      Any devices that currently exist, and will be transitioned onto the SAN, are can-
      didates for empirical performance testing.
          Create a second set of columns next to the traffic pattern columns, as shown
      in Table 5.2. You will need entries for peak utilization and sustained utilization.
      Obviously, you will only be able to enter current data for initiators that already
      exist, and already communicate with the same targets they will talk to after the
      SAN is complete.


Table 5.2 Current Traffic

SAN Traffic Patterns                   Current Peak          Current Sustained
Initiators      Targets                (MB/sec)              (MB/sec)
host1           array3                 10                    5
host2           array1
                array2
                tape1
host3           array1                 50                    10
host4           array1
                array2
tape1           array1
                array3
array3          array4
array4          array3

    In this example, host1 and host3 already exist, and are already talking to
array3 and array1, respectively. The other devices either have yet to be added or
are not yet talking to the same targets that they will be after the SAN is up, or
their performance data is simply unavailable.
    If the owner of a system has already done this kind of analysis, you will
simply transfer the numbers to your table. If not, you should work with the
owner to get the performance information, as this might have a substantial
impact on your SAN design.
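
If you keep the traffic table electronically, the blank cells in Table 5.2 become a to-do list. The sketch below is one possible representation, assuming a Python dictionary keyed by (initiator, target) with `None` marking a flow that has no measurement yet. The layout is an illustration, not a required format.

```python
# (initiator, target) -> (peak MB/sec, sustained MB/sec), from Table 5.2.
# None marks a flow for which no current data exists.
current = {
    ("host1", "array3"): (10, 5),
    ("host2", "array1"): None,
    ("host2", "array2"): None,
    ("host2", "tape1"): None,
    ("host3", "array1"): (50, 10),
    ("host4", "array1"): None,
    ("host4", "array2"): None,
    ("tape1", "array1"): None,
    ("tape1", "array3"): None,
    ("array3", "array4"): None,
    ("array4", "array3"): None,
}

measured = {flow: data for flow, data in current.items() if data is not None}
missing = [flow for flow, data in current.items() if data is None]

print(f"{len(measured)} flows measured, {len(missing)} still to estimate")
```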

Gathering Performance Data
On almost any kind of system, some facility exists for measuring performance.
More often than not, there will be multiple options for gathering disk I/O per-
formance information.
     For example, on a Windows NT system, you might use the diskmon feature.
You have to install this from the Windows NT Resource Kit. If you do not install
diskmon, standard Windows perfmon will not have a disk monitoring tool.
Alternatively, you could install a package like Intel’s Iometer, and use that to
generate a simulated load and measure performance. This tool is presently available
as a free download from Intel’s Web site.
     Under Sun’s Solaris operating system, performance can be measured using the
iostat utility, the GUI utility perfmeter, or one of a number of third-party utilities like
Extreme SCSI. There are similar tools in every UNIX variant. We are providing


      examples for Solaris only, since the details of these commands will vary between
      every flavor of UNIX, and providing examples for every variant is impractical.
      Refer to the man pages for your particular version of UNIX for the exact syntax.
      There are also a number of options for generating loads under Solaris, ranging
      from the dd command, to—again—a utility like Extreme SCSI.
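
Whichever tool you use, you will end up with a series of throughput samples that must be reduced to the two numbers the traffic table needs. The following Python sketch shows one way to do that. It assumes that "peak" means the largest sample and "sustained" means the average over the measurement period; the sample values are invented for illustration.

```python
from statistics import mean


def summarize(samples_mb_per_sec):
    """Reduce throughput samples to (peak, sustained) in MB/sec.

    Assumption: peak is the largest sample and sustained is the average.
    Your site may prefer a percentile or a busy-hour average instead.
    """
    if not samples_mb_per_sec:
        raise ValueError("need at least one sample")
    return max(samples_mb_per_sec), mean(samples_mb_per_sec)


# Hypothetical samples taken at regular intervals during the business day.
samples = [4.0, 5.5, 10.0, 6.0, 3.5, 1.0]
peak, sustained = summarize(samples)
print(f"peak {peak:.1f} MB/sec, sustained {sustained:.1f} MB/sec")
```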


      NOTE
           Tools like Iometer, dd, and Extreme SCSI should be used with care. It is
           tempting to use them to generate maximum load. A more useful test to
           run is to generate a representative load. Try to determine what your
           application will actually be doing in terms of read/write ratio, and total
           bandwidth consumption, and use these tools to generate that kind of
           load on the system.



          In cases where performance data cannot be collected empirically (such as
      when the system in question does not exist yet), there is still hope. Most hosts
      are not capable of generating sustained load at full wire speed. They are generally
      going to be limited by other factors. These could include:
           •   CPU speed: Although Fibre Channel has much lower overhead than
               the TCP/IP stack, it still takes a fast processor to get near to full
               performance on a 1 Gbit/sec Fibre Channel link, simply because the
               processor will be busy running whatever task is actually generating
               the I/O. While almost all hosts now shipping have sufficiently fast
               CPUs, you also need to estimate how much of that CPU resource is taken
               up by other tasks the host is performing that do not result in disk I/O
               (such as running a TCP/IP stack). Moreover, many data centers have
               servers with older CPUs that might not be capable of running at
               1 Gbit/sec even without taking these tasks into consideration.
           •   PCI bus speed: A 1 Gbit/sec Fibre Channel link running full duplex
               carries 200 MB/sec. A 32-bit 33 MHz PCI bus can only sustain about
               120 MB/sec. A 64-bit 33 MHz or 32-bit 66 MHz PCI bus can handle about
               240 MB/sec, and a 64-bit 66 MHz bus can handle about 480 MB/sec. Even
               on the higher rate buses, you must bear in mind that it is a shared
               bus. If you put two Fibre Channel HBAs onto a bus that can handle
               240 MB/sec, that will be the total possible full-duplex speed for
               both HBAs. Therefore, you would on


         average get 120 MB/sec out of each interface. In a balanced read/write
         environment, for example, this could mean that you get only 60
         MB/sec of read performance out of each card. Also bear in mind that
         there may be other cards on the bus taking up some of that bandwidth.
     •   HBA speed: Although designed to work on a 1 Gbit/sec SAN, many
         HBAs cannot achieve or at least cannot sustain full 1 Gbit/sec transfers.
         Newer HBAs typically have better performance. Older HBAs might
         only be able to achieve 60 MB/sec, regardless of the other possible
         issues.
     •   RAID controller speed: Many RAID controllers cannot sustain 100
         MB/sec per interface on all interfaces simultaneously. Some barely
         operate at 30 MB/sec per interface, which is more than acceptable for
         many applications! Finding out the limits of your RAID array should be
         as simple as calling the vendor’s support channel. Of course, you might
         also check third-party testing results, such as those published by many
         industry magazines, for an unbiased opinion.
     •   RAM quantity and speed: If your system is short on RAM, it might
         spend a lot of time paging. If it does, performance will be substantially
         degraded.
     •   Disk seek time: If your application does a lot of random I/O, the disk
         heads will have to jump all over the platter. Since a disk that spends
         its time seeking delivers data an order of magnitude or more slower than
         a Fibre Channel link, you might have to allocate substantially less
         bandwidth for random I/O applications like a file server than for
         sequential I/O applications like a video server or decision support system.
     •   Application overhead: This ties into the CPU-limit factor. How much
         CPU do you have, and how much of it is free for handling I/O?
     •   Write speed of tape device: Most tape drives cannot come anywhere
         near 100 MB/sec. It is usually sufficient to ask a vendor for performance
         data in the case of tape drives, although optimistic compression ratios
         can inflate the performance numbers they provide.
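
The shared-bus arithmetic from the PCI bus item can be written out as a short Python sketch. It assumes the HBAs are the only cards on the bus and that they share it equally; the function name and the `read_fraction` parameter are illustrative.

```python
def per_hba_bandwidth(bus_mb_per_sec, hba_count, read_fraction=0.5):
    """Split a shared PCI bus evenly across HBAs.

    Returns (total per HBA, read share per HBA) in MB/sec. Assumes the HBAs
    are the only cards on the bus and share it equally.
    """
    if hba_count < 1:
        raise ValueError("need at least one HBA")
    per_hba = bus_mb_per_sec / hba_count
    return per_hba, per_hba * read_fraction


# Two HBAs on a bus that can handle about 240 MB/sec, balanced read/write.
total, reads = per_hba_bandwidth(240, 2)
print(f"{total:.0f} MB/sec per HBA, of which {reads:.0f} MB/sec is reads")
```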
     In addition, if anything is known about the application that is running on the
host, you might be able to make a good guess about how much load it will even
try to place on the disk subsystem. For example, if you know that the host is an
intranet Web server, and that it receives only 500 hits a day, you can safely guess
that its I/O requirements will be minimal.


          Once you have collected your best empirical or estimated numbers for each
      factor, use the lowest common denominator approach (that is, take the slowest
      factor) to estimate the maximum bandwidth that the system could need. The
      overall system will not outperform its weakest link.
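
As a sketch of the weakest-link estimate, the Python fragment below takes a set of per-factor limits and reports the smallest. The PCI bus and HBA figures are taken from the examples above; the RAID controller figure is hypothetical, and the factor names are only labels.

```python
# Per-factor limits in MB/sec for one hypothetical host.
limits = {
    "fc_link": 100,          # 1 Gbit/sec Fibre Channel, one direction
    "pci_bus": 120,          # 32-bit 33 MHz PCI bus
    "hba": 60,               # older HBA
    "raid_controller": 80,   # hypothetical vendor figure
}

# The system cannot outperform its slowest factor.
bottleneck = min(limits, key=limits.get)
estimate = limits[bottleneck]
print(f"limited by {bottleneck}: about {estimate} MB/sec")
```

Upgrading any factor other than the bottleneck leaves the estimate unchanged, which is why it is worth identifying the limiting factor by name rather than recording only the number.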
          Also note that on systems with multiple HBAs, I/O load might be distributed
      across these HBAs. Achieving active-active distribution across HBAs might
      require third-party applications like the VERITAS Dynamic Multipathing
      software, Troika’s HBA driver, or one of the storage vendors’ dual-path products.
      If this is the case, you might estimate that each HBA will usually have a
      fraction of the total load. In a dual-fabric, active/active HBA architecture,
      each HBA normally has 50 percent of the total load. If a system is capable of
      sustaining 70 MB/sec, then each HBA will normally sustain 35 MB/sec. Note that
      this might change during system maintenance: if you shut down one path, the
      remaining path could then take on the full 70 MB/sec, so the design should
      incorporate the worst-case scenario. It is usually also good practice to add some
      padding to the top of this estimate (perhaps 10 percent) to allow for the
      unexpected.
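
The per-HBA figures above can be worked through as follows. This Python sketch assumes that load is split evenly across active paths and that the worst case is a single path being shut down; the function name and the default 10 percent padding are illustrative.

```python
def per_hba_design_load(system_mb_per_sec, hba_count, padding=0.10):
    """Return (normal, design) MB/sec per HBA.

    normal: load split evenly across all active paths.
    design: one path is down, the survivors share the full load,
            and padding is added for the unexpected.
    """
    if hba_count < 1:
        raise ValueError("need at least one HBA")
    normal = system_mb_per_sec / hba_count
    surviving_paths = max(hba_count - 1, 1)
    design = system_mb_per_sec / surviving_paths * (1 + padding)
    return normal, design


# A dual-HBA host capable of sustaining 70 MB/sec.
normal, design = per_hba_design_load(70, 2)
print(f"normal {normal:.0f} MB/sec, design for {design:.0f} MB/sec per HBA")
```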


      NOTE
           Unlike physical-disk counter data, logical-disk counter data is not col-
           lected by the NT operating system by default. To obtain performance
           counter data for logical drives or storage volumes, you must type
           diskperf -yv at the command prompt. This will cause the disk perfor-
           mance statistics driver used for collecting disk performance data to
           report data for logical drives or storage volumes. By default, the NT oper-
           ating system uses the diskperf -yd command to obtain only physical
           drive data. For more information about using the diskperf command,
           type diskperf -? at the command prompt.




      What Do We Know about
      Future Performance Characteristics?
      Performance numbers change over time. Consider a customer database for a
      catalog retail company. Perhaps you will install the SAN in February, because
      this is your slow month of the year, and you can get the necessary downtime. You
      might know that the database host will start talking to its storage array(s) at a
      sustained rate of 5 MB/sec during the business day, with a peak of only 10 MB/sec.


However, when the Christmas season comes along and your business picks up,
you might move to a 50 MB/sec sustained rate, peaking at 70 MB/sec. Because
of the potential for substantial changes in performance requirements over time, it
is essential to plan for both current and projected performance. Most of this
might be educated guesswork, since many of the systems you are going to deploy
might not yet exist.
     Again, you will need to come up with numbers for both sustained traffic and
peak traffic for each communication. Also try to determine what days/times peak
performance will occur. This will be added to your table (Table 5.3).

Table 5.3 Adding Traffic Projections

SAN Traffic            SAN Peak             SAN Sustained
Patterns               Performance (MB/sec) Performance (MB/sec)   Peak Times
Initiators   Targets   Initial   Expected   Initial    Expected    Initial   Expected
host1        array3    10        10         5          5           M–F       same
                                                                   8a–5p
host2        array1    0         70         0          50
             array2    0         70         0          50
             tape1     20        20         0          0
host3        array1    50        50         10         20          M–F       + Sa
                                                                   8a–5p     10a–4p
host4        array1    0         90         0          50
             array2    0         90         0          50
tape1        array1    0         20         0          0           Sa        same
                                                                   5p–9p
             array3    0         20         0          0           Sa        same
                                                                   9p–11p
array3       array4    10        30         5          5
array4       array3    5         5          0          0

    Again, you can only enter data for systems about which you can make an
educated guess. If you can estimate the peak traffic only from the limitations
of a system, you might not have any way of guessing when this peak
would occur. You should also enter projected data for systems that you know
you will add later.
    In Table 5.3, host2 and the application it is running might not exist yet, so
every piece of data about that system is pure guesswork. Let us say that host2 is a
Return Merchandise Authorization (RMA) system, and your rapidly growing


      company has never had an RMA system before. You might not be able to reliably
      guess when customers are going to call in with RMA requests most often, or
      even how many RMAs you are going to get in a given day. The best you can do
      is determine what performance the hardware and software you are installing
      could reasonably run at, and design the SAN to support it all the time it could be
      in use. While this approach might result in over-engineering your network, this is
      better than the alternative. During future design phases, you can adjust or
      scale back the SAN design accordingly, as well as incorporate other
      additions and changes.
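
One quick way to see what designing for the worst case implies is to total the expected peaks arriving at each target. The Python sketch below does this for the expected peak column of Table 5.3, assuming the values are in MB/sec as in Table 5.2. It is deliberately pessimistic, because it assumes every peak coincides, which the peak-time columns suggest is unlikely.

```python
from collections import defaultdict

# (initiator, target, expected peak in MB/sec) transcribed from Table 5.3.
expected_peak = [
    ("host1", "array3", 10),
    ("host2", "array1", 70),
    ("host2", "array2", 70),
    ("host2", "tape1", 20),
    ("host3", "array1", 50),
    ("host4", "array1", 90),
    ("host4", "array2", 90),
    ("tape1", "array1", 20),
    ("tape1", "array3", 20),
    ("array3", "array4", 30),
    ("array4", "array3", 5),
]

# Sum the expected peaks arriving at each target.
per_target = defaultdict(int)
for _initiator, target, mb_per_sec in expected_peak:
    per_target[target] += mb_per_sec

for target, total in sorted(per_target.items()):
    print(f"{target}: {total} MB/sec if every peak coincides")
```

A total far above what a single target interface can deliver, as with array1 here, is a prompt to look at when the peaks really occur and at how many interfaces the array has.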
          For backup devices, peak usage will always correspond with your backup
      schedule. This will usually not correspond with peak usage of the rest of the
      system. This is particularly useful knowledge when planning an ISL architecture,
      because you can often count on having low nonbackup-related utilization of ISLs
      during backup windows. An obvious exception to this is a SAN that is used
      solely for performing LAN-free backups.
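
You can check the assumption that backups avoid host peaks directly against the peak times in your table. The following Python sketch uses the expected peak times from Table 5.3. The two-letter day codes and the (day, start hour, end hour) layout on a 24-hour clock are assumptions made for this example.

```python
WEEKDAYS = ("Mo", "Tu", "We", "Th", "Fr")

# Host peak windows as (day, start_hour, end_hour), from Table 5.3.
host_peaks = {
    "host1": [(day, 8, 17) for day in WEEKDAYS],
    "host3": [(day, 8, 17) for day in WEEKDAYS] + [("Sa", 10, 16)],
}

# tape1 backup windows: Saturday 5p-9p and 9p-11p.
backup_windows = [("Sa", 17, 21), ("Sa", 21, 23)]


def overlaps(a, b):
    """True if two (day, start, end) windows share any time."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


conflicts = [
    (host, peak, backup)
    for host, peaks in host_peaks.items()
    for peak in peaks
    for backup in backup_windows
    if overlaps(peak, backup)
]
print(conflicts or "no host peak overlaps a backup window")
```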

      How Much Downtime Is Acceptable to
      Production Components During Implementation?
      It will likely be necessary to shut down some existing production devices during
      implementation, to ensure a safe transition onto the SAN. For example, you
      might have to shut down a host to install an HBA. Determine how much down-
      time is acceptable for each host, and at what times this can occur. Generally, you
      should try to schedule more downtime than you think you need to ensure that
      any unforeseen issues that arise during the implementation can be handled within
      the downtime window.

      How Much Downtime Is Acceptable for Routine
      Maintenance? How Much Downtime Is Acceptable
      for Upgrades and Architectural Changes?
      These two questions are intimately related, because—to an end user—there is
      really no difference between downtime to a production system for maintenance,
      and downtime for an upgrade. Once systems are in production, you will want to
      keep them running as much as possible.
          Many upgrades can be accomplished with zero downtime by using a double-
      or triple-redundant fabric architecture. No matter how well you plan the upgrade
      and maintenance processes beforehand, you will need to shut down specific hosts


on occasion. For example, you might want to upgrade an HBA driver, which
would typically require a reboot.


NOTE
     Wherever possible, a redundant fabric architecture should be used. This
     will ensure the best performance and reliability, and will simplify mainte-
     nance tasks. In a redundant fabric architecture, every host has at least
     two paths to every storage device it connects to, and these paths tra-
     verse two completely unconnected fabrics. While it might appear on the
     surface to be more expensive, if hosts are to be dual-attached anyway, it
     is actually less expensive to attach them to two separate fabrics than to
     use one larger fabric, or a director-class switch. This does not even
     include the downtime ROI calculation, which, in high-availability environ-
     ments, will usually overshadow the entire cost of the SAN. More details
     about redundant and resilient fabrics are provided in Chapter 7.



    You should therefore determine in advance when you will be able to
schedule downtime for every host and storage array, and for the fabric itself. You
might not have to use every scheduled outage, but having them available to you
when you do need them is essential.
    One way to do this is to make a list of applications and services provided by
the hosts on the SAN, and determine an owner for each. Take your list of SAN
devices and map these devices to the applications and services they affect. This will
provide a mapping of application/service owners, who are typically responsible for
scheduling downtime, to devices that typically require downtime. Have each owner
approve the downtime calendar for each device that affects his or her service.
    The mapping of owners to devices should be kept up to date as changes in
personnel, applications, and/or SAN infrastructure occur.
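
The owner-to-device mapping described above can be derived mechanically from the two lists. The Python sketch below shows the idea; every application, owner, and device name in it is hypothetical.

```python
# Hypothetical applications and their owners.
app_owner = {
    "customer-db": "alice",
    "rma": "bob",
}

# Hypothetical SAN devices and the applications each one affects.
device_apps = {
    "host1": ["customer-db"],
    "host2": ["rma"],
    "array1": ["rma"],
    "array3": ["customer-db"],
    "switch1": ["customer-db", "rma"],
}

# For each device, who must approve its downtime calendar?
approvers = {
    device: sorted({app_owner[app] for app in apps})
    for device, apps in device_apps.items()
}

for device, owners in approvers.items():
    print(f"{device}: {', '.join(owners)}")
```

Shared devices such as switches tend to collect the longest approver lists, which is one reason their outages are the hardest to schedule.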

When Do You Need Each Piece
of the Solution to Be Complete?
Once you have a table detailing which of the initiators communicate with which
targets, you can begin to create a timeline for the project. Other members of the
core team will tell you something like, “the customer database application must
be online by mid-June.” It is your task to define which SAN components you


      need to accomplish this, and to develop a timeline for adding these components
      that meets their requirements.

      Summary List of Questions
      This is a high-level list of some of the questions that should appear on a SAN
      design interview form:
           •   What overall business problem are you trying to solve?
           •   What are the business requirements of the solution?
           •   What is known about the nodes that will attach to the SAN?
           •   Which SAN-enabled applications do you have in mind?
           •   Which components of the solution already exist?
           •   Which components are already in production?
           •   Which elements of the solution need to be prototyped and tested?
           •   What equipment will be available for testing?
           •   How and when are backups to be done?
           •   What will the traffic patterns in the solution be?
           •   What do we know about current performance characteristics?
           •   What do we know about future performance characteristics?
           •   How much downtime is acceptable to production components during
               implementation?
           •   How much downtime is acceptable for routine maintenance?
           •   How much downtime is acceptable for upgrades and architectural
               changes?
           •   When do you need each piece of the solution to be complete?


      Conduct a Physical Assessment
      You should now have the location of every piece of hardware that currently
      exists. In addition, you should know where each piece of hardware in the even-
      tual SAN will be located.
          Look at each piece of hardware. Make sure that it does exist, and has all
      necessary pieces to function. This could include things like power cords, keyboard,


mouse, monitor, Ethernet card, Ethernet cable, HBAs, and Fibre Channel cables.
Note the physical dimensions of the hardware, and its power/cooling require-
ments. Does it rack mount? Does it have a network interface? How many Fibre
Channel interfaces does it have? How much does it weigh? You should already
have this information from the interview process, but you should verify that the
information you were given is correct.
    Go to each location where SAN equipment or nodes will be installed, and
again check to see that your information was correct. Notice how the equipment
will fit into the space available. Notice how the equipment will enter the
building. You should also have a meeting with the person in charge of the facility
to discuss power, cooling, and equipment locations.

Analyzing the Collected Data
Now that you have collected information from all key stakeholders in the
project, and verified the accuracy of this information, you will analyze it to
determine the characteristics of the required solution. When you have completed this
process, you will have a list of technical requirements, and an ROI analysis to
justify the project.

Processing What You Have Collected
You have a matrix detailing communication between nodes. Attempt to group
the nodes by communication patterns. The purpose of this is to determine the
amount of known locality in the SAN. Locality of reference is a concept
prevalent in many areas of computer science, from disk drive construction to LAN
design. Locality is important in SAN design because if you can localize traffic into
specific areas of a SAN, you directly improve the SAN’s performance and
reliability. This will allow a more cost-effective SAN design as well, because you
avoid overdesigning the network to handle nonexistent cross traffic. Locality is
discussed in greater detail in Chapter 7.
     A SAN with a great deal of known locality might be constructed out of
many separate fabrics, with no ISLs whatsoever. A SAN with little or no known
locality might require a high-performance ISL architecture (Table 5.4).


      Table 5.4 Initiator-to-Target Mapping for Locality Example

      SAN Traffic Patterns

      Initiators                          Targets
      host1                               array3
      host2                               array1
                                          array2
                                          tape1
      host3                               array1
      host4                               array1
                                          array2
      tape1                               array1
                                          array3
      array3                              array4
      array4                              array3

          In Table 5.4, array3 would be grouped with host1, tape1, and array4. Apart
      from tape1, which also communicates with host2 and array1, none of those devices
      will need to communicate with any of the other devices. They could be grouped
      onto a single switch; without tape1's cross traffic, they could even be put onto
      a totally separate fabric. You might find it helpful to do the grouping in a
      diagram. For another example, look at Figure 5.2.
Figure 5.2 SAN Diagram without Grouping
[Diagram labels: Servers; Storage; SAN.]

    Nothing is known about the communication patterns in this SAN.
Consequently, there is no way to optimize ISLs for performance. After grouping
the initiators with their targets, the SAN diagram could look something like
Figure 5.3. If you look carefully, you will notice that there are only 12
connections into this SAN. If there are fewer connections than there are ports in
your switches, you do not really need to go through the grouping exercise because
localization of traffic will happen automatically. Grouping is only useful if you
will be using ISLs; however, because most systems scale well past the size of the
largest switches available, it will be a frequent exercise. To keep the examples
readable, we will assume that each one deals with a subset of the devices that
the SAN will support.
Figure 5.3 SAN Diagram with Simple Grouping
[Diagram labels: SAN; Group 1; Group 2; Group 3; Group 4.]

   Making a diagram such as this will allow you to see at a glance what the
communication patterns for your SAN are.
   This example is simplistic, and in large SANs, there will likely be conflicts.
When you cannot effectively group all of the communication patterns, you
should focus on grouping faster-performing devices. For example, if you find that
the bulk of traffic will be between host1, array3, and array4, these could be
grouped separately from tape1 and host2 if necessary. This could happen if you
find that there are so many interrelationships that you end up with a great many
devices in very few, very large groups. The grouping technique does not help
performance if you only have one big group. It could also happen if you have
a few devices that are shared by a great many devices, such as a large RAID array
in a storage consolidation solution.
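    The grouping exercise can also be sketched in code. The following Python
example is only an illustration, not a tool described in this book: it treats the
initiator-to-target pairs of Table 5.4 as an undirected graph and reports each
connected set of devices as a candidate performance group. The function name
performance_groups and its ignore parameter are hypothetical. Passing tape1 in
ignore reflects the assumption that the tape device is a low-bandwidth device
that may cross groups.

```python
from collections import defaultdict

# Initiator-to-target pairs taken from Table 5.4.
TRAFFIC = {
    "host1": ["array3"],
    "host2": ["array1", "array2", "tape1"],
    "host3": ["array1"],
    "host4": ["array1", "array2"],
    "tape1": ["array1", "array3"],
    "array3": ["array4"],
    "array4": ["array3"],
}


def performance_groups(traffic, ignore=()):
    """Return sorted groups of devices that exchange traffic.

    Devices named in `ignore` (for example, low-bandwidth shared devices)
    are left out so that they do not merge otherwise separate groups.
    """
    # Build an undirected adjacency map, skipping ignored devices.
    neighbors = defaultdict(set)
    for initiator, targets in traffic.items():
        for target in targets:
            if initiator in ignore or target in ignore:
                continue
            neighbors[initiator].add(target)
            neighbors[target].add(initiator)

    groups, seen = [], set()
    for device in sorted(neighbors):
        if device in seen:
            continue
        # Depth-first walk collects everything reachable from this device.
        stack, group = [device], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    # With every device included, tape1 ties all devices into one group.
    print(performance_groups(TRAFFIC))
    # Setting tape1 aside leaves two independent groups.
    print(performance_groups(TRAFFIC, ignore={"tape1"}))
```

With tape1 included, every device in Table 5.4 lands in one large group, which is
the "one big group" situation in which grouping does not help. With tape1 set
aside, host1, array3, and array4 form one group, and host2, host3, host4, array1,
and array2 form another.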
    Another way to combat this “group growth” problem is to account for multiple
interfaces on storage arrays. Let us say that you have a redundant fabric
architecture. Your RAID array has eight interfaces, and each host will access only
two of them, one interface on each fabric. List each interface on the array
separately in your traffic pattern table. Then, associate servers or groups of
servers with specific interfaces. With the array listed as a single entity, a
diagram of the communication could look something like Figure 5.4.
Figure 5.4 SAN Grouping Diagram with Single-Entity Arrays
[Diagram labels: RAID Array1; SAN A; SAN B; four server groups.]

         If, however, you separate the interfaces, your diagram could look more like
      Figure 5.5.
         You can indicate that a device crosses groups but does not need much in the
      way of performance by varying the line color, weight, or pattern. Figure 5.6
      shows that the tape robot crosses all groups, but does not need much bandwidth.

Figure 5.5 SAN Grouping Diagram with Separated Interfaces
[Diagram labels: RAID Array1, with different array controllers attaching to
different groups; SAN A and SAN B, each divided into Groups 1 to 4; Server
Groups 1 to 4.]

Figure 5.6 SAN Grouping Diagram with Tape Robot Addition
[Diagram labels: RAID Array1; Tape1, with one interface going to multiple
groups; SAN A and SAN B, each divided into Groups 1 to 4; Server Groups 1 to 4.]

    If you are able to make relatively small performance groups, your SAN will
benefit greatly from applying the principle of locality. For now, you simply need
to be able to determine the category of architecture you will require: one that has
lots of known locality (has well-defined performance groups), or one that does
not. This will affect how many switch ports you need to allot for ISLs. If traffic is
localized within an area of the SAN, it will obviously not need to make use of
ISLs leaving that area. In this case, you will be able to get superior performance
even with far fewer ISLs, resulting in more ports available for servers and storage.

      Establishing Port Requirements
Now you will determine how many switch ports you will need to purchase.
(This is a general estimate for calculating ROI; it might be a bit more or less than
your final estimate.)
    Take the ports you found out about during the interview process. Make sure
that you account for all ports on each node. Some RAID arrays have many ports,
and many hosts have at least two HBAs. Add up these ports to get the total
number of exposed ports your SAN will require. You will then divide this by the
number of different fabrics you will be using. If you have dual-redundant fabrics,
you will divide by two. If you have triple-redundant fabrics, divide by three, and
so on. This will give you the number of required exposed ports per fabric. The
number of “overhead” ports you must allocate for ISLs and for unused ports will
depend on several factors:
     •   The total number of required ports per fabric.
     •   The amount of known locality.
     •   Your need to manage all switches as a single entity.
     •   The physical layout of your SAN (any MAN/WAN connections, campus
         connections between buildings, or connections between floors of a
         building) might dictate use of additional ISLs and less than perfect
         utilization of the ports on each switch.
     •   Your applications’ expected performance characteristics.
     •   The rate of expected growth in port count of the fabric.
     •   Your maintenance policies regarding port usage on network devices. For
         example, you might require that a certain number of ports be left available
         for expansion or troubleshooting during the course of normal operation.

Simple Case
If the number of required exposed ports is less than the number of ports on a
single switch, you will generally need zero ports for ISLs. In this case, you will
require one switch per fabric. However, as larger switches utilize more hardware
internally to connect the higher number of user ports, a decision might need to
be made between using a larger switch and using a network of smaller ones.
The appropriate decision will depend on performance requirements, budget, and
design factors. In addition, if you have made small performance groups that have
no components in common, you might be able to localize traffic 100 percent,
and require no ISLs. You would have many small, unconnected SAN islands if
you follow this approach. One reason not to use isolated islands is that
requirements change. Someday you might need access between islands at a moment’s
notice. A robust architecture can achieve your immediate connectivity
requirements, and give you the flexibility to handle change as well.
     You will require each fabric to be a network if this is not the case, or if
you wish to design flexibility into your configuration. You will have to reserve
ports for the ISLs that form this network. Simple case requirements include the
following:
     •   Fewer ports required than exist on a single switch, or…
     •   Each performance group is well defined and smaller than the number of
         ports on a single switch.
     •   Future requirements for growth and change are minimal.
    Assume that you have two 16-port arrays (32 storage ports total), 10 dual-HBA
servers (20 ports), and two single-port tape libraries (two ports). Your total
port count is 54. However, assume further that you are using a dual-redundant
SAN architecture. Your port count per fabric is 27. You are building the fabric out
of 16-port switches. It is possible that some ISLs are required. You will need to
determine how many are needed.
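    The port arithmetic in this example can be checked with a few lines of
Python. This is a minimal sketch for illustration; the function names
exposed_ports and ports_per_fabric are hypothetical, and the device counts are
the ones given in the example above.

```python
def exposed_ports(nodes):
    """Total exposed ports for a list of (device_count, ports_per_device) pairs."""
    return sum(count * ports for count, ports in nodes)


def ports_per_fabric(total_ports, fabrics):
    """Divide the exposed ports across redundant fabrics, rounding up."""
    if fabrics < 1:
        raise ValueError("at least one fabric is required")
    return -(-total_ports // fabrics)  # ceiling division


if __name__ == "__main__":
    # Two 16-port arrays, ten dual-HBA servers, two single-port tape libraries.
    nodes = [(2, 16), (10, 2), (2, 1)]
    total = exposed_ports(nodes)
    print(total, ports_per_fabric(total, fabrics=2))  # prints "54 27"
```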

Variant A
With a relatively small fabric like this and relatively high locality, you can assume
that you will have about 14 free ports per switch. Two switches with two ISLs
between them will yield 28 ports per fabric. You are using a dual-redundant
architecture, so there will be two fabrics, for a total of four switches. Your
grouping diagram will look like Figure 5.7.

Figure 5.7 Determining ISL Requirements for Variant A
[Diagram labels: RAID Array1 (interfaces A1-1 to A1-4); RAID Array2 (interfaces
A2-1 to A2-4); SAN A and SAN B, each divided into Group 1 and Group 2; Server
Group 1 (5 hosts); Server Group 2 (5 hosts); links labeled 4 on the array side
and 5 on the server side.]

          This grouping would result in an actual implementation resembling Figure 5.8.

Figure 5.8 Variant A Implementation
[Diagram labels: RAID Array1 (interfaces A1-1 to A1-4); RAID Array2 (interfaces
A2-1 to A2-4); SAN A; SAN B; Server Group 1 (5 hosts); Server Group 2 (5 hosts);
links labeled 4 on the array side and 5 on the server side.]

      Variant B
If you decide that you cannot guarantee the localization of traffic for some
reason, grouping will not help. Assuming also that you have a requirement for
high performance between the switches, you would add two ISLs per switch to
the estimate, for a total of about four ISLs per switch. Your architecture might
look like Figure 5.9.
Figure 5.9 Adding ISLs for High Performance in Variant B
[Diagram labels: 36 Ports per Fabric (Balance storage and hosts across the
3 switches for best performance.); RAID Array1 (interfaces A1-1 to A1-4); RAID
Array2 (interfaces A2-1 to A2-4); SAN A; SAN B; Server Group 1 (5 hosts); Server
Group 2 (5 hosts); links labeled 4 on the array side and 5 on the server side.]

    The same technique can be applied to any SAN, no matter how complex. In
fact, the larger the SAN, the greater the benefits will be from grouping traffic.
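
    The difference between Variants A and B comes down to how many ports remain
for devices once ISL ports are set aside. The sketch below is illustrative only;
usable_ports is a hypothetical helper, and the figures are the 16-port switches
and the 27 ports per fabric from the running example.

```python
def usable_ports(switches, ports_per_switch, isl_ports_per_switch):
    """Ports left for servers and storage after reserving ISL ports."""
    free_per_switch = ports_per_switch - isl_ports_per_switch
    if free_per_switch < 0:
        raise ValueError("more ISL ports than switch ports")
    return switches * free_per_switch


REQUIRED = 27  # exposed ports per fabric in the running example

# Variant A: two switches, about two ISL ports each.
variant_a = usable_ports(switches=2, ports_per_switch=16, isl_ports_per_switch=2)
# Variant B: about four ISL ports each, tried with two and then three switches.
variant_b_two = usable_ports(switches=2, ports_per_switch=16, isl_ports_per_switch=4)
variant_b_three = usable_ports(switches=3, ports_per_switch=16, isl_ports_per_switch=4)

print(variant_a, variant_a >= REQUIRED)              # 28 True
print(variant_b_two, variant_b_two >= REQUIRED)      # 24 False
print(variant_b_three, variant_b_three >= REQUIRED)  # 36 True
```

With four ISL ports per switch, two switches leave only 24 device ports, which is
why Variant B needs a third switch per fabric and arrives at the 36 ports shown in
Figure 5.9.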

Moderate Case
If the required exposed port count is about double or triple the per-switch port
count, and some locality is known, you will be able to use very few ISLs. In this
case, estimate two ISLs per switch. Let us say that you need 26 ports per fabric,
and you are using 16-port switches. Two ISLs per switch means that you actually
get 14 ports per switch. Two switches will give you 28 ports, so you would budget
for two switches per fabric, or four switches total in a dual-fabric SAN.
     Moderate case requirements include the following:
     •   No more than three times as many ports are required as are present
         on a single switch.
     •   Performance groups are reasonably well defined. Some locality is known.
     •   Future requirements for growth and change are minimal.

      NOTE
           The low port count/high locality/low ISL count configurations work well
           for either two or three switches. Two switches would be cascaded
           together with two ISLs, with 16-port switches yielding 28 ports. Three
           switches would be connected in a ring, supporting about 40 devices. If
           you are over that limit, a four-switch full mesh can support about 50
           devices. The full-mesh architecture does not scale well beyond that point,
           and none of these work well if you have performance groups with more
           than 13 or 14 members. It is feasible to build ring or partial-mesh
           topology fabrics with higher port counts, but it is generally better to use
           a core/edge topology for higher port count solutions. These topologies
           are explained in detail in Chapter 7.
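
    The device counts quoted in this note can be reproduced with simple
arithmetic. The Python sketch below is an illustration under stated assumptions:
the two-switch cascade uses two ISLs, as the note says, while the ring and the
full mesh are assumed to use one ISL per link. The function names are
hypothetical.

```python
def cascade_ports(ports_per_switch=16, isls=2):
    """Two switches joined by `isls` links; each link uses a port on both."""
    return 2 * (ports_per_switch - isls)


def ring_ports(switches, ports_per_switch=16, isls_per_link=1):
    """Each switch in a ring connects to exactly two neighbors."""
    return switches * (ports_per_switch - 2 * isls_per_link)


def full_mesh_ports(switches, ports_per_switch=16, isls_per_link=1):
    """Each switch in a full mesh connects to every other switch."""
    return switches * (ports_per_switch - (switches - 1) * isls_per_link)


if __name__ == "__main__":
    print(cascade_ports())      # 28
    print(ring_ports(3))        # 42, "about 40 devices"
    print(full_mesh_ports(4))   # 52, "about 50 devices"
    # Each added switch costs every existing switch another port, so the
    # full mesh gains little beyond this point.
    print([full_mesh_ports(n) for n in (6, 8, 10, 12)])  # [66, 72, 70, 60]
```

Under these assumptions, the usable port count of a full mesh of 16-port switches
peaks at 72 and then falls, which is consistent with the note's remark that the
full mesh does not scale well.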




      Complex Case
If you need more ports than one of these configurations will handle, you will
need to allocate about four ISLs per switch. You might use fewer than four ISLs
on some switches, and perhaps nothing but ISLs will be present on other
switches. In the complex case for port count estimates, the intent is to average
the ISL requirements.
    Until a detailed architecture is developed, you will have to make general
estimates for a few things. If you have any distance requirements, add two ISLs per
switch. If you have very high-performance requirements, and very little known
locality, add two ISLs per switch.
    Take the estimated number of ISLs per switch (I) and subtract it from the
number of ports per switch (PS). Divide the total required ports per fabric (P) by
this number and round up. This is the estimated number of switches (S) that you
need to budget for. For estimating complex SAN switch counts, S = P / (PS – I),
rounded up to the next whole number.
    For example, if you have a need for 30 ports per fabric (P=30), are using
16-port switches (PS=16), and each switch will use about two ISLs (I=2), then the
number of switches you estimate needing per fabric is 30/(16–2). This is 2.14,
which rounds up to 3. If you have a single fabric, this is the number of switches
you should budget for. If you have a dual-fabric SAN, you should budget for six
switches. Complex case requirements include the following:
     •   Any number of exposed ports might be required.
     •   Performance groups might or might not be defined.
     •   Future requirements for growth and change are significant.
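
    The switch-count estimate described above can be written as a short
function. This is a sketch for illustration; the name estimate_switches is
hypothetical, and the inputs correspond to P, PS, and I as defined in the text.

```python
import math


def estimate_switches(ports_per_fabric, ports_per_switch, isls_per_switch, fabrics=1):
    """Return (switches per fabric, switches in total) for a budget estimate.

    Implements S = P / (PS - I), rounded up, then multiplies by the fabric count.
    """
    free_per_switch = ports_per_switch - isls_per_switch
    if free_per_switch <= 0:
        raise ValueError("ISLs consume every port; no ports remain for devices")
    per_fabric = math.ceil(ports_per_fabric / free_per_switch)
    return per_fabric, per_fabric * fabrics


if __name__ == "__main__":
    # The worked example: P=30, PS=16, I=2, in a dual-fabric SAN.
    print(estimate_switches(30, 16, 2, fabrics=2))  # (3, 6)
```

The same function reproduces the moderate case estimate: 26 ports per fabric on
16-port switches with two ISLs per switch gives two switches per fabric.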


Preparing an ROI Analysis
In any business transaction, it is important to understand the economic benefits
or the Return On Investment (ROI) that your company will receive. Preparing an
ROI analysis for your SAN project will show how your company will not only
recover the capital investment, but also save additional money through gains in
time, management, and other efficiencies.
    During the interview process, you made a list of all of the equipment that
you would need to purchase. To begin the ROI analysis of your SAN, determine
which components are specific to the SAN project. For example, if your company
will need to buy additional storage arrays whether or not a SAN is used,
these would not be included on the expense side of the analysis. If the SAN is
expected to prevent you from having to buy an array, this cost savings would go
onto the benefit side of the analysis. You should include any hardware you intend
to buy for testing that will not be used elsewhere.
    When accounting for staff time spent on the project, make sure that you only
charge the project for time spent beyond what would be spent by not building the
SAN. If the SAN is expected to save staff time in the long run, apply this to the
benefit side. Your ROI analysis will be a living document, and will be updated as
the SAN project develops.

The Return On Investment Proposition
Technical justifications for SAN infrastructure deployments can often be made
more credible by adding an ROI analysis for the proposed implementation.
Follow the guide in the following sections to produce an ROI analysis based on
SAN solutions to particular problems.

Step One: Pick a Theme or Scenario
Most implementations have a purpose. That purpose could be a server or storage
consolidation to improve infrastructure usage and gain economies of scale,
ensuring storage and server resources are utilized in the most cost-effective
manner. High-availability clustering can improve the availability of mission-critical
applications, thus ensuring business continuance and the cost saving associated
with it. SAN-based backup deployments improve data integrity by performing
backups and restores more efficiently and quickly, again saving in business
continuance time and effort.

      Step Two: Identify the Affected Infrastructure Components
      Most SAN deployments will focus on affected servers. Servers can be grouped
      according to the applications they run or the functional areas they support.
      Examples of application groupings include Web servers, file and print servers,
      messaging servers, database servers, and application servers. Functional support
      servers might include financial and personnel systems or engineering applications.
      Once the server groups are known, get the characteristics of servers in each
      group. For example, if your solution fits into a storage consolidation theme, you
      should consider factors such as:
     •   Amount of attached disk storage
     •   Storage growth rates
     •   Storage space reserved for growth (headroom)
     •   Availability requirements
     •   Server downtime and an associated downtime cost
     •   Server hardware and software costs
     •   Maintenance costs
     •   The administration effort required to keep the servers up and running

      Step Three: Identify the SAN-Enabled Benefits
      The scenario approach allows you to focus more closely on the benefits. Server
      and storage consolidation, for example, will concentrate on benefits accrued from
      more efficient use of server and storage resources, improved staff productivity,
      lower platform costs, and better use of the infrastructure. Simply take the list of
      characteristics you developed in step two, and show how a SAN can provide ben-
      efits in those areas. Establishing specific cost savings is one of the two key ele-
      ments in the ROI process, so be sure to look hard for every area of benefit.

      Step Four: Identify the SAN-Related Costs
Determining the costs associated with the scenario involves identifying the new
components specifically required to build and maintain the SAN. These can
include software licenses, switches, Fibre Channel HBAs, optical cables, and any
service costs associated with the deployment. Be careful to include only those
items that relate directly to the SAN implementation. This is the second key
element in the ROI process: if you do not correctly estimate expenses, the ROI
might be substantially better or worse than your estimate.

Step Five: Calculate the ROI
There are several standard ROI calculations in common use, such as net present
value (in dollars), internal rate of return (as a percentage), and payback period (in
months). Briefly, these can be defined as:
     •   Net Present Value (NPV): A method used in evaluating investments
         in which all cash flows are discounted to their present value at a given
         discount rate and then summed.
     •   Internal Rate of Return (IRR): The discount rate at which the present
         value of the future cash flows of an investment equals the cost of
         the investment.
     •   Payback Period: The length of time needed to recoup the cost of a
         capital investment on a nondiscounted basis.
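    The three measures can be illustrated with a short Python sketch. The cash
flows below are invented purely for illustration (an initial outlay followed by
four equal yearly savings) and do not come from this book. The function names are
hypothetical, and the IRR is found by simple bisection, which assumes that the
NPV changes sign only once in the search interval.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs now, the rest one period apart."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows, low=-0.99, high=10.0, tol=1e-9):
    """Discount rate at which NPV is zero, located by bisection."""
    f_low, f_high = npv(low, cash_flows), npv(high, cash_flows)
    if f_low * f_high > 0:
        raise ValueError("NPV does not change sign in the search interval")
    while high - low > tol:
        mid = (low + high) / 2
        f_mid = npv(mid, cash_flows)
        if f_low * f_mid <= 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return (low + high) / 2


def payback_period(cash_flows):
    """Periods until cumulative undiscounted cash flow reaches zero, or None."""
    cumulative = 0.0
    for period, cf in enumerate(cash_flows):
        if cumulative + cf >= 0:
            if period == 0:
                return 0.0
            # Interpolate within the period in which the balance turns positive.
            return period - 1 + (-cumulative / cf)
        cumulative += cf
    return None


if __name__ == "__main__":
    # Hypothetical yearly figures: 100,000 spent now, 40,000 saved each year.
    flows = [-100_000, 40_000, 40_000, 40_000, 40_000]
    print(round(npv(0.10, flows), 2))   # NPV at a 10 percent discount rate
    print(round(irr(flows), 4))         # IRR as a fraction
    print(payback_period(flows) * 12)   # payback in months: 30.0
```

Your accounting department may use different conventions, such as the timing of
cash flows within a period, so treat this only as a way to check the arithmetic.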
    Detailed explanations of these techniques and how to use them can be found
in most accounting textbooks. It is likely that your company has a preferred
method for calculating ROI. You should determine which method this is, and if
there are standard forms for presenting your analysis. Asking your accounting
department might be a good first step.
    This approach to calculating ROI allows you to focus on a particular project
or infrastructure-based problem. It allows you to reduce deployment risk by
deploying SANs in phases by scenario. Deploying by scenario will keep
investments limited to the solution at hand and create an investment base for
future deployments. The initial investment will improve the ROI on other
scenarios by reducing some of the investment required to deploy them.

      The Rest of the Process and
      the Repetition of the Cycle
Now you have the following documents:
     •   Detailed results from the interview process, which define what the SAN
         project needs to accomplish. This includes:
         •   A technical requirements document
         •   A timeline for accomplishing the tasks associated with implementing
             the SAN
         •   A list of everything that you will need to buy to make the
             project work
     •   A rough idea of how the SAN will be designed.
     •   An ROI analysis to justify continuing with the project.
    These will be used and maintained throughout the life of the SAN. The
timeline will be the framework in which all activities in the SAN’s lifecycle will
reside. In later chapters, you will enter the architecture development phase and
use these documents to develop a detailed architecture for your SAN. This will in
turn be used to develop a test plan. These documents will be used in the approval
process for implementation, and will be kept up to date during the maintenance
phase as part of the SAN’s documentation set. If any major changes to the SAN
are needed, the lifecycle will be repeated and another set of documentation will
be produced.

Summary
The SAN design process consists of seven phases, which are cycled through as
needed throughout the life of your SAN. Data collection and analysis together
define the requirements of your SAN. These requirements feed into the
architecture development process to produce a SAN design blueprint. After you have a
plan in place for your SAN, you must test certain components to ensure that the
design works the way you thought it would, before you can begin to transition and
release it into production. Once the SAN has entered production, it falls into an
ongoing maintenance phase, and continues in that phase until a change occurs
that causes the cycle to repeat.
    The first two phases (data collection and analysis) are critical to the health of
the SAN. Simply put, if the information on which the design is based is
incomplete and/or inaccurate, the design will be incorrect.
    Data collection consists of a series of interviews, collecting the answers into a
meaningful format (a technical requirements document), and verifying the
accuracy of the collected data. It is imperative that all key stakeholders in the SAN
project be included on the interview list.
    While listed as a separate phase, data analysis actually coincides with data
collection. The objective of the analysis phase is to turn the raw data, which is
generally in the form of business requirements, into a more technical format: the
technical requirements document. Some of this occurs “on the fly” during the
interview process. However, certain tasks are done after the interviews are
complete. For example, detailed port count and performance requirements are
generated “on the fly,” whereas an ROI proposition is created after the fact. Once
the requirements of the SAN are well defined, the remaining phases can take place.
These phases are covered in subsequent chapters.

Solutions Fast Track
Looking at the Overall Lifecycle of a SAN
         The SAN design process is a cycle.
         This process consists of seven phases:
         1. Data Collection
         2. Data Analysis
         3. Architecture Development
         4. Prototype and Test
         5. Transition
         6. Release to Production
         7. Maintenance
         Whenever there is a fundamental change to the SAN, the cycle
         should repeat.


      Conducting Data Collection
               Data collection is the foundation on which a SAN is built.
               You should interview everybody who has an interest in the project.
               During the interview process, create a technical requirements document.


      Analyzing the Collected Data
               There are several things that you need to get out of data analysis:
               — The number of different fabrics that will make up the SAN solution
               — The port count and performance characteristics of each fabric
               — An estimate of the hardware required to meet these requirements
               You might be able to localize traffic for better performance if you can
               create well-defined groups.
               Prepare an ROI proposition to justify your SAN project.

Frequently Asked Questions
The following Frequently Asked Questions, answered by the authors of this book,
are designed to both measure your understanding of the concepts presented in
this chapter and to assist you with real-life implementation of these concepts. To
have your questions about this chapter answered by the author, browse to
www.syngress.com/solutions and click on the “Ask the Author” form.


Q: Once I have designed my SAN, shouldn’t it be done? I don’t want to have to
   keep reinventing the wheel!
A: Yes and no. After a SAN enters production, it is “done” until you want to
   change it in a fundamental way. As long as you are happy with leaving your
   SAN the way it is, there is no reason why you would have to repeat the
   design cycle. Simply adding a new storage array does not require a repetition
   of the cycle. Moreover, events that do cause the cycle to repeat might cause it to
   repeat relatively quickly. For example, if you decide to go through the design
   process because you are adding a new type of storage array to the SAN, and
   want to validate that doing so won’t break anything, you will be able to take
   a fast track through most of the process. After all, adding this device will not
   by any stretch of the imagination require that you change your fabric
   topology, or affect much of your SAN architecture.

Q: Every end user in my company is a stakeholder in the SAN. Do I need to
   interview everybody?
A: No. It is true that everybody who uses a system is a stakeholder in that
   system. However, we mean something a little less broad. When we refer to a
   stakeholder, we mean somebody whose job revolves around taking care of
   one or more of the systems that will attach to the SAN. This can include
   systems, database, and storage administrators, as well as other technical people. It
   can also include people responsible for the data that resides on these systems.
   For example, a manager responsible for a call center at a phone-in catalog
   company might be a key stakeholder in the SAN, because he or she is
   responsible for the data entered into that company’s business system, which
   is attached to the SAN. Why is this person a key stakeholder? Because he or
   she might have something to say about the availability and performance
   requirements of the system. When in doubt, try to include anybody on the
          team who wants to be there. It is usually better to have more data than you
          need, rather than less.

      Q: Do I need to wait until data collection is complete before beginning data
          analysis?
      A: Actually, the data collection and analysis phases are most effective if there is
          some degree of overlap. If you have analyzed data from the first interview
          when you go into the second, you will be able to better understand the
          answers, and might also be able to direct the line of questioning along more
          useful lines. Be careful not to develop firm convictions too early on, though.
          Always approach SAN design scientifically. Never start an interview with a
          firm preconception of the outcome! Collection and analysis are divided into
          two phases because some of the analysis naturally occurs after all data collec-
          tion is complete. For example, you can’t prepare an ROI proposition until
          you have a fairly complete picture of what the SAN will need to accomplish,
          and some idea of the technical infrastructure that will be involved.
                                      Chapter 6


SAN Applications
and Configurations




 Solutions in this chapter:

     •   Configuring a High-Availability Cluster
     •   Using a SAN for Storage Consolidation
     •   LAN-Free Backup Configuration
     •   SAN Server-Free Backup
     •   Making Your Enterprise Disaster Tolerant


         Summary

         Solutions Fast Track

         Frequently Asked Questions







      Introduction
      This chapter covers configurations for some of the most common Storage Area
      Network (SAN) applications. The surest route to SAN success is to base your
      installations on proven configurations: equipment layouts that have been found
      through practice and repeated refinement to be best suited for the desired
      application.
          In this chapter, we review the major SAN configuration types and discuss the
      advantages of each. To keep the material generally useful, this chapter does not
      go into vendor-specific detail such as vendor names or driver revisions. We give
      you enough information to understand and set up your own
      cluster, but we do not identify a specific storage or Host Bus Adapter (HBA)
      vendor for use in that cluster; thus, the material will be useful to users of any
      storage or HBA. If you do want low-level detail that identifies specific
      configuration information such as vendor, model, and revision level, Brocade
      SOLUTIONware might be helpful.




         Using Brocade SOLUTIONware
         Brocade provides a number of pretested configurations on its Web site
         (www.brocade.com) for administrators who wish to configure their own
         SANs, and integrators who want to have a head start on developing
         solutions for general deployment. Brocade SOLUTIONware guides can be
         used to help define the basic configurations for your SAN and can be
         modified to fit your solution. SOLUTIONware guides offer very specific
         model and part numbers for configurations similar to the solutions
         given in this chapter. The solutions are specific as to the model of
         storage, switch, and HBAs, but can be extended to similar models. Thus,
         a SOLUTIONware guide can be used either as a “cookbook” for building a SAN
         identical to the one discussed in the guide, or as a “reference solution”
         from which other, similar solutions can be derived.



      Configuring a High-Availability Cluster
      High-availability (HA) clusters are used to support critical business applications.
      They provide a redundant, fail-safe installation that can tolerate equipment,
software, and/or network failures, and continue running with as little impact
upon business as possible. HA clusters have been in use for some time now.
However, until the advent of Fibre Channel, they were very limited in size and
reliability. This is because clusters require shared storage, and sharing Small
Computer System Interface (SCSI) storage subsystems is difficult and unreliable.
In fact, sharing a SCSI device between more than two initiators is completely
impractical due to SCSI cabling limitations and SCSI’s poor support for multiple
initiators. Thus, clustering technology has been greatly enhanced by the network
architecture of SANs. SANs provide ease of connectivity, and the ability to
interconnect an arbitrarily large number of devices. Because of this, SANs can support
more than just dual failover, and can be easily extended to support many-to-one
failover configurations.
     The advantages of an HA cluster fall into three categories: availability, man-
ageability, and scalability. Availability is the capability of a cluster to be tolerant of
hardware, network, or software errors. In short, it is the capability of a system to
“stay up.” Clustering software automatically detects error conditions, and restarts
or transfers applications from one server to another. Little or no downtime will
result from these problems, as the HA software can be configured to automati-
cally act to correct them. Manageability is the set of processes with which an
administrator keeps a system running. HA clusters enhance manageability by
allowing all servers in each cluster to be managed as a group. Moreover, when
software and/or hardware needs to be upgraded, each node in a cluster can be
upgraded separately without taking important applications or the system
offline—a concept known as a rolling upgrade. Scalability is the ability of a system
to grow. Factors that limit scalability might prevent you from adding servers to a
data center. For example, you might not have enough rack space, power, network
connections, or budget to add another server. Cluster-aware applications can take
advantage of the distributed nature of the cluster to distribute processing,
dynamically balancing load between servers. Thus, an HA cluster can provide better
utilization of the server resources you already have, saving the data center and
budgetary resources for scalability in other areas. Moreover, adding servers to the
cluster can be easier with HA software. A new server can be added to the cluster
while other servers are still online. Then, applications can be transferred to the
new server to distribute the load evenly, in much the same way that a rolling
upgrade is accomplished.
     The next section covers the configurations of an HA application or database
server, and Microsoft Cluster Server (MSCS) on Windows NT/2000.
      Typical HA Application or Database Server
      A typical HA application or database server consists of several components:
           •   Two or more redundantly configured servers
           •   One or more Fibre Channel SANs to enable the sharing of storage
           •   At least one fault-tolerant/redundant storage volume
           •   An interconnect for cluster messaging (which might also be Fibre Channel)
           •   A software mechanism for providing failover operation
          Through clustering software, the application server continually communicates
      with the clustered spare, using network heartbeats to indicate to the other machines
      that everything is operating correctly. This heartbeat is typically carried over a
      dedicated network for clustering traffic. If a problem occurs (for example, a software
      crash on the operational server or a hardware component failure), the heartbeat
      stops, which tells the other server that something has failed or is otherwise
      inoperable. When the heartbeat is lost, the spare server takes over the function provided by the
      application service. Depending on the clustering software, either the entire server or
      only specific services on the server can be failed over or failed back.
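
The heartbeat logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm used by any particular clustering product; the class name, the five-second timeout, and the timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Standby-side view of the active server's heartbeat (illustrative only)."""
    timeout_s: float          # assumed: silence longer than this means failure
    last_heartbeat_s: float   # time the most recent heartbeat arrived
    is_active: bool = False   # becomes True once the standby has taken over

    def on_heartbeat(self, now_s: float) -> None:
        # Each heartbeat from the active server resets the silence timer.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        # Take over once the heartbeat has been silent longer than the timeout.
        if not self.is_active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.is_active = True
        return self.is_active


monitor = StandbyMonitor(timeout_s=5.0, last_heartbeat_s=0.0)
monitor.on_heartbeat(now_s=2.0)
active_at_6s = monitor.check(now_s=6.0)   # 4 s of silence: no failover yet
active_at_8s = monitor.check(now_s=8.0)   # 6 s of silence: standby takes over
```

A real product would also have to guard against a failed heartbeat network being mistaken for a failed server, which is one reason more than one heartbeat path is generally used.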


      NOTE
           HA clusters can use either an active/active or an active/passive model of
           operation. In the active/passive model, the passive server does not provide
           any service until a failure condition causes it to assume control of the
           cluster and become the active server. Thus, the passive, or standby server
           is not utilized at all in normal operation. The active/active model, by con-
           trast, allows each server to provide service—in other words, to be uti-
           lized—even during normal operation. The downside to active/active
           clustering is that when a failure does occur, performance will be
           impacted, since there will suddenly be fewer resources providing the
           same services.
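
A rough way to see the active/active trade-off is to work out what happens to per-server load when a node is lost. The sketch below assumes that load is spread evenly over the surviving servers, which real workloads only approximate; the function name and the 60 percent figure are hypothetical.

```python
def utilization_after_failure(nodes: int, utilization: float, failed: int = 1) -> float:
    """Per-node utilization once `failed` nodes are lost and load is spread evenly."""
    survivors = nodes - failed
    if survivors < 1:
        raise ValueError("no surviving nodes")
    # Total work is unchanged; it is simply divided among fewer servers.
    return utilization * nodes / survivors


two_node = utilization_after_failure(nodes=2, utilization=0.60)    # about 1.2
four_node = utilization_after_failure(nodes=4, utilization=0.60)   # about 0.8
```

Under these assumptions, a two-node active/active cluster running at 60 percent per node would be asked for about 120 percent of one server after a failure, while a four-node cluster would rise to about 80 percent.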



          HA clusters are also used to enable rolling upgrades—the upgrading of one of
      the machines in a cluster to new software or hardware. In a rolling upgrade,
      resources are manually transferred to a standby server from the operational server,
      and the operational server is taken offline for maintenance. Hardware or software is
added or upgraded, tested, and then the operational server is brought back online
and resources are transferred back to the server. Then, the standby server is similarly
upgraded, without end users encountering a substantial interruption in service.
This enables systems to be upgraded and maintained without affecting critical
business operations.


NOTE
     Almost all HA servers in use today require a small interruption in service
     during failover. This is the time it takes the standby server to decide that
     the primary has actually failed, and then to start up the applications that
     that server had been running. The technology required to provide a true
     zero-downtime failover is understood. Some HA databases even imple-
     ment zero downtime failover; however, this functionality is not in general
     use, due to its complexity and the need to have applications specifically
     written to take advantage of it.



     Redundant components are used throughout a system to provide high avail-
ability. Eliminating all single points of failure is important to ensure that the
failure of a single component does not bring down the entire cluster. Redundant
HBAs are used to provide two paths to the cluster’s shared storage from each
host. Redundant, noncoupled fabrics provide separate network paths to the
storage. We discuss redundant fabrics further in Chapters 5 and 7. In addition to
redundancy, fault-tolerant equipment designed with dual power supplies, inter-
nally redundant controllers, and circuitry is preferable in these environments.
     To eliminate single points of failure, a dual-fabric SAN architecture is used.
These fabrics are separated intentionally to prevent loss of service due to operator
error, the need for a rolling upgrade, or major software or hardware problems on
one of the fabrics. By using separate redundant paths, physical cabling problems
are minimized, and disruption of the network from operator error can be isolated
to just one segment.
     Dual HBAs are used in the servers to connect to each fabric. Multipathing
software, which is normally provided by your RAID vendor, is used to detect
I/O errors and redirect traffic to the other HBA, thus avoiding unnecessary
cluster node failover. Third-party software such as VERITAS’ Dynamic
Multi-Pathing (DMP) product can also be used to provide this functionality. Finally,
some HBA vendors (such as TROIKA) provide multipathing software support
built into their HBA drivers.
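
The behavior of multipathing software can be pictured as a retry across paths. The following Python sketch is a simplified model and does not represent the interface of any vendor's multipathing product; the exception type, the path names, and the two stand-in HBA functions are all hypothetical.

```python
class PathError(Exception):
    """Raised by a path when an I/O cannot be completed (hypothetical)."""


def send_io(request, paths):
    """Try each path in order; return (path_name, result) from the first that works."""
    errors = []
    for name, submit in paths:
        try:
            return name, submit(request)
        except PathError as exc:
            errors.append(f"{name}: {exc}")
    # Only when every path has failed does the error reach the cluster software.
    raise PathError("all paths failed: " + "; ".join(errors))


def failed_hba(request):
    raise PathError("link down")          # simulates a cable or HBA fault on fabric A


def healthy_hba(request):
    return f"completed {request}"


used_path, result = send_io("read block 42", [("hba1/fabricA", failed_hba),
                                               ("hba2/fabricB", healthy_hba)])
```

Because the I/O succeeds on the second path, the application never sees the error and the cluster has no reason to fail the node over.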

      NOTE
           The overriding design principle in an HA environment is to keep all
           failures at as low a level as possible. You want to minimize the chance of
           the HA software (the highest level of protection) having to actually per-
           form a failover simply because a fan or power supply (the lowest level)
           fails. This is particularly critical in large environments, where it is statisti-
           cally predictable that even components with high Mean Time Between
           Failures (MTBF) will fail frequently, simply because there are so many of
           them. Therefore, you should not assume that the presence of an HA
           environment eliminates the need for fault-tolerant components.
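
The statistical point in this note can be made concrete with a little arithmetic. The sketch below assumes a constant failure rate, so that the expected number of failures is simply total operating hours divided by MTBF; the component count and the 500,000-hour MTBF are hypothetical values chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def expected_failures_per_year(component_count: int, mtbf_hours: float) -> float:
    """Expected failures per year across a population, assuming a constant failure rate."""
    return component_count * HOURS_PER_YEAR / mtbf_hours


single = expected_failures_per_year(1, 500_000)       # a single fan or power supply
fleet = expected_failures_per_year(2_000, 500_000)    # 2,000 of the same component
```

With these numbers a single component would be expected to fail about once in 57 years, but a population of 2,000 such components would be expected to produce roughly 35 failures a year.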



          Critical to any HA application server is the use of at least one HA storage
      device, supporting either active/passive or active/active storage controllers. In the
      active/passive case, a single Logical Unit Number (LUN) is exported to two or
      more separate Fibre Channel ports. One port is used at a time, with either an
      automatic or manual failover causing the alternate port to become active. Only
      one port can be active at any given time. For active/active devices, this limitation
      does not apply. A LUN or target is exported to any number of paths. Traffic can
      flow across these multiple paths from any node to this LUN.

      Microsoft Cluster Server
      Microsoft Cluster Server is the most common way that Windows administrators
      add HA capabilities to their critical IT systems. Figure 6.1 shows a typical MSCS
      configuration. The basics of how a generic MSCS configuration works and
      descriptions of the critical parts are explained later in this section.
          An MSCS cluster consists of two (Windows NT, Windows 2000 Advanced
      Server) or four (Windows 2000 Data Center) server nodes connected via redundant
      HBAs to a dual-ported storage subsystem via a redundant/resilient architecture
      SAN (see Chapter 7, “Developing a SAN Architecture,” for further explanation of a
      redundant/resilient architecture). The most common MSCS setup in use today is
      the two-node cluster. The two servers are configured as Active and Standby nodes
      in the cluster. This employs the active/passive model of HA. The Active node owns
      the cluster LUN(s), but the Standby has the right to take ownership when required.
      Having ownership rights does not imply sharing. MSCS uses a shared-nothing
architecture, which means that only one server can use a LUN at a time. The
management software provided by the storage vendor must allow a LUN to be accessed
by two or more hosts for use in a cluster environment.

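Figure 6.1 Microsoft Cluster Server Configuration

[Figure: an Active Cluster Server and a Standby Cluster Server, each with two HBAs (HBA 1 and HBA 2), connect through Fabric A and Fabric B (a dual-fabric SAN architecture) to controllers C1 and C2 of a dual-controller storage array. The two servers are joined by a heartbeat link and by the Ethernet LAN.]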
           The clustering software uses one or more heartbeat networks to ensure that
      the Standby server knows the status of the Active server at all times. These
      heartbeats can be sent over a dedicated (private) IP network (either Ethernet or IP
      over Fibre Channel), a public Local Area Network (LAN), or by using a shared
      disk volume. Generally, at least one dedicated connection is used to ensure that
      the heartbeat is not interrupted by outages to the public network.
           When the Standby server detects the absence of the heartbeat, it triggers the node
      failover function. The Standby node then takes ownership of the cluster LUN
      and continues serving the application’s clients. Once the primary node server has
      been fixed, a manual failback operation will restore the original configuration.
      Manual failback is recommended to avoid a ping-pong effect that can result from
      an automatic failback setting. The clustered software can also be moved over manually to
      enable server maintenance or rolling upgrades.
           A dual fabric and dual HBAs are typically used in HA clusters to ensure that
      there are fully redundant hardware components in the system. The recommended
      storage is a dual-controller storage array to allow for redundant connections to
      the data.
           Microsoft Windows 2000 Data Center is a four-node cluster configuration
      that has been certified using fabric switches. Unlike a two-node MSCS cluster, Data Center absolutely
      requires a SAN, since SCSI storage cannot support more than two initiators.




         Microsoft WHQL Certification
         Microsoft Windows Hardware Quality Lab (WHQL) provides lists and ref-
         erences to hardware that has passed certification by Microsoft for cer-
         tain applications. WHQL certification is often useful in determining if
         hardware and drivers for that hardware have been certified for working
         with Microsoft operating system software and advanced OS capabilities
         like MSCS. Special certifications are given to hardware (storage, HBAs,
         and network components) for running with MSCS, which ensures that
         Microsoft has tested all of the hardware for that application and certi-
         fied that it works. Brocade switches are cluster-certified through Original
         Equipment Manufacturers (OEMs) in bundle configurations along with
         the OEM storage and qualified HBAs. It is also possible to build custom
         HA solutions with Brocade switches, in addition to buying pre-inte-
         grated bundles.
              Microsoft provides an online database of WHQL-qualified hard-
         ware, which can be searched by keyword or category of equipment
         (www.microsoft.com/hcl).

Using a SAN for Storage Consolidation
One of the major uses of SAN technology has been for storage consolidation.
The availability of a storage network has enabled data center administrators to
centralize where their storage resources are located, making better use of precious
storage dollars through storage sharing and pooling. By having storage available in
a central pool, usage of that data space is more efficient and the data is easily
backed up, managed, and accessed. This section describes the major methods of
using a SAN for storage consolidation, reviews several of the techniques used to
accomplish this goal, and provides sample configurations.
     Before the use of storage networks, storage was dedicated to a specific host.
This often resulted in underutilized or poorly distributed storage capacity. For
example, you might have 100 GB available on one server that was being used for
users’ home directories, but only 1 GB available on another server that was
running a business-critical e-commerce database. With dedicated storage, it was
impossible to reallocate those disks from the less critical, and less
resource-constrained, system to the more important one. You would have to buy additional
storage for the critical database server, even though you already owned 100 GB
of unused storage! With the use of a SAN, reallocation could be as simple as
reassigning some of the disks associated with less important user files to the
business-critical database in a few minutes without any recabling or downtime.
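
The reallocation described above can be modeled as a simple change of ownership in a shared pool. The Python sketch below is a toy model rather than the interface of any storage management product; the class, the host names, and the assumption that the 100 GB is made up of four 25 GB disks are all invented for the example.

```python
class StoragePool:
    """Tracks which host each SAN disk is assigned to (illustrative model)."""

    def __init__(self):
        self.owner = {}      # disk name -> host name
        self.size_gb = {}    # disk name -> capacity in GB

    def add_disk(self, disk, size_gb, host):
        self.owner[disk] = host
        self.size_gb[disk] = size_gb

    def capacity(self, host):
        return sum(self.size_gb[d] for d, h in self.owner.items() if h == host)

    def reassign(self, disk, new_host):
        # On a SAN this is a change to access control, not to cabling.
        self.owner[disk] = new_host


pool = StoragePool()
for i in range(4):
    pool.add_disk(f"disk{i}", 25, "home-dirs")   # 100 GB set aside for user files
pool.add_disk("disk4", 1, "ecommerce-db")        # 1 GB for the database

pool.reassign("disk0", "ecommerce-db")
pool.reassign("disk1", "ecommerce-db")
```

After the two reassignments, the database host has 51 GB and the home directory host has 50 GB, with no new hardware purchased.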
     With dedicated storage, whether externally attached or internal to a server,
adding storage capacity typically requires the full reconfiguration of a system:
shutting down the host, connecting the new storage, and restarting the system.
External arrays might have the capability to add disks “on the fly,” but what if
you need to add a new cabinet? There are always fundamental limits to direct-
attach storage that do not apply to storage attached via a SAN.
     With Fibre Channel SANs, making storage available to hosts in a network is
less complex. You connect any Fibre Channel storage device to your switch, and
the device is immediately available to any of the hosts that are connected to the
fabric. Minimal configuration of the fabric is required for those storage resources
to show up on the SAN, and minimal effort is required for that storage to
become available to hosts in the network. Usually,
all that is required is to configure the new device into the fabric’s zoning table,
if necessary, and to import those newly available volumes into the operating
system. On Windows 2000 and Solaris, this does not require a reboot, and can be
accomplished entirely without downtime.

      NOTE
           Some storage devices require specialized drivers, which might have dif-
           ferent procedures for adding capacity. You should check with your storage
           vendor to find out how to add access to new storage arrays “on the fly.”



          Storage consolidation is supported in almost any configuration of the fabric.
      The topology of a Fibre Channel network for storage consolidation and sharing
      really is the simplest case for SAN layout. You can use a “SAN islands” approach,
      with an interface on each multiport storage array connected to each island. You
      can also construct one large fabric. The route you take will depend on your
      performance and management goals. These topics are covered in greater detail in
      Chapters 5 and 7.
          With Fibre Channel, devices can be dynamically added to or removed from the
      network at any time. This is an advantage in terms of flexibility and control, but also
      a potential difficulty due to the way operating systems handle dynamic volumes.
      For example, some operating systems have been built to assume that they own all
      storage they are connected to, and will even try to overwrite data on any volume
      that they come across. This is particularly the case for Windows NT, which will
      write an operating system signature on anything that it discovers, even if it has
      previously been claimed by another system in the network. These problems have
      led to a number of hardware- and software-based approaches, like Brocade
      Zoning, for controlling which systems on your network have access to the
      devices you have added. To truly add storage “on the fly,” one or more layers of
      functionality might be required. For example, you might want to be able to
      dynamically resize logical volumes. Additional software and configuration might
      be required in order to facilitate this; VERITAS Volume Manager
      and VERITAS File System, for instance, provide this functionality.
          Operating systems currently available generally do not have native support for
      Fibre Channel volumes. Instead of directly representing Fibre Channel devices to
      an operating system, Fibre Channel HBA drivers abstract storage devices in a
      Fibre Channel network, and present them to the operating system as SCSI
      targets. This leads to an area where configuration is important. You need to
      understand how this mapping occurs, so that you can translate between the devices as
      they appear in the operating system, and the reality of where they are located in
      the SAN. Figure 6.2 depicts a simple Fibre Channel SAN.
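
One way to keep track of this mapping is a small lookup table that ties each SCSI target ID seen by the operating system to the Fibre Channel device behind it. The Python sketch below is only an illustration: the World Wide Port Names, switch port labels, and table layout are hypothetical, and each HBA driver has its own way of recording such bindings.

```python
from typing import NamedTuple


class FcDevice(NamedTuple):
    wwpn: str          # World Wide Port Name of the storage port (made up here)
    switch_port: str   # where that port attaches to the fabric (made up here)


# Hypothetical binding table: SCSI target ID -> Fibre Channel device.
BINDINGS = {
    0: FcDevice("50:00:00:00:00:00:00:01", "switch1/port4"),
    1: FcDevice("50:00:00:00:00:00:00:02", "switch2/port7"),
}


def locate(target_id: int) -> FcDevice:
    """Translate an OS-visible SCSI target ID into its place in the SAN."""
    return BINDINGS[target_id]


def target_for(wwpn: str) -> int:
    """Reverse lookup: which SCSI target ID does the OS use for this WWPN?"""
    for target_id, device in BINDINGS.items():
        if device.wwpn == wwpn:
            return target_id
    raise KeyError(wwpn)
```

Keeping a record like this up to date makes it much easier to tell which physical array is affected when the operating system reports a problem on a given target.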
Figure 6.2 Simple Fibre Channel SAN for Storage Consolidation

[Figure: four servers with shared access to Fibre Channel storage.]



     The access management techniques currently in use fall into one of three cat-
egories: switch zoning, LUN masking, and software control. In any SAN where
storage is expected to be shared between different hosts, it will be necessary to
utilize at least one of these methods to control which device has access to which
      storage volumes. Examples of the different kinds of storage partitioning tech-
      niques follow. Zoning is also discussed from an implementation perspective in
      Chapter 9, “SAN Implementation, Maintenance, and Management.”

      Shared Storage Using a Web Farm
      A common use of HA clustering is for making Web farms and their data available
      at all times. Unlike most HA scenarios, Web farms do not necessarily require
      specialized HA software to be used as a group. This is because read-only file systems
      can be used to provide shared access to storage, and front-end IP load balancing
      switches provide the rest of the solution. In this use of Fibre Channel SANs, a
      read-only, centralized storage array is used to support a large number of Web
      servers. This approach helps to enormously reduce the costs of acquiring and
      managing storage.
           The traditional approach to designing a Web farm involves buying large
      amounts of local storage for all the servers. With a SAN, far less storage can be
      used, and it can be managed in a centralized way. Multiple Web servers are con-
      nected on a single Fibre Channel network, with shared access to the same pieces
      of storage. Static Web data is kept on this storage, which has been mounted read-
      only on all of the Web servers. Because of the high speeds of Fibre Channel
      versus the slower IP network connections to the Internet, many systems can
      access the same storage with no impact on performance.
           In a typical configuration for Web farm storage sharing, as shown in Figure 6.3,
      a large farm of Web servers would be connected to the Internet through an IP
      load balancer (Layer 4 switch). This allows traffic to be distributed to the least
      busy server, while making all servers in the farm appear as one logical entity to
      the clients on the Internet. All of the hosts would have access to the same vol-
      umes on a read-only basis. One “content master” host would need to have read-
      write access. A shared file system is required to enable this configuration, unless
      some care is taken in the design.
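
The "least busy server" decision made by the load balancer can be sketched very simply. The Python example below assumes that "least busy" means "fewest active connections," which is only one of several measures a real Layer 4 switch might use; the server names and connection counts are hypothetical.

```python
def pick_server(active_connections: dict) -> str:
    """Return the server with the fewest active connections (ties: first listed)."""
    return min(active_connections, key=active_connections.get)


farm = {"web1": 12, "web2": 7, "web3": 9}   # hypothetical connection counts
chosen = pick_server(farm)
farm[chosen] += 1                            # the new request is now counted
```

Because every Web server mounts the same read-only volumes, it does not matter to the client which server is chosen; each one returns the same content.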
           This technique drastically reduces the cost of a Web farm, because you gain a
      great amount of efficiency in the use of your storage resources. Instead of dozens
      of replicated environments with the same Web data, a single large storage array is
      used to support Web requests. This results in significant savings in data storage,
      and simplifies management of the data. Less manpower is required to manage
      fewer storage devices, use of floor space is minimized, maintenance contract costs
      are reduced, and electricity and cooling costs go down.
Figure 6.3 Web Farm Using a SAN

[Figure: the Internet connects to a Web load balancer, which feeds four Web servers over an Ethernet backbone. All four Web servers attach to shared Fibre Channel storage.]
      Storage Partitioning Using Switch Zoning
      Brocade Zoning, the use of the fabric to partition storage and servers into
      different accessible areas, can be used to divide storage into separate pieces. For
      example, each disk within a Just A Bunch Of Disks (JBOD) could be assigned to
      a different host. By zoning different pieces of storage with different servers,
      a shared network can be used to allocate storage among hosts. Zoning
      within the fabric is particularly important in larger configurations, as it not only
      acts to provide partitions to control disk access, but also provides “broadcast
      containers” similar to IP Virtual LANs (VLANs). Thus, Brocade Zoning acts to
      increase the scalability and reliability of fabrics.
          To set up storage partitioning, administrators typically map out on paper what
      kind of storage distribution they would like across their SAN. Specific storage
      targets are assigned to specific servers and workstations. The administrator then
      uses a Graphical User Interface (GUI) (such as Brocade WEB TOOLS) or a
      Command Line Interface (CLI) (such as the Brocade telnet CLI) to create
      individual zones for all of the servers and workstations.
          If your SAN contains multi-LUN devices such as RAID storage arrays, cur-
      rent fabric zoning cannot partition individual LUNs to different hosts, although
      hardware offering this capability is on the horizon. Using the software supplied
      with such a RAID or using HBA-based LUN masking might be required in
      these cases. These should not be seen as a substitute for fabric zoning as they do
      not provide an equivalent security model, do not allow for centralized manage-
      ment, and do not act as Registered State Change Notification (RSCN) con-
      tainers. Instead, these techniques are supplemental to fabric zoning.
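
The way LUN masking supplements fabric zoning can be pictured as two checks that must both pass. The Python sketch below is a conceptual model only, not the data format of any switch, array, or HBA; the zone names, host names, and LUN numbers are hypothetical.

```python
# Fabric zoning: which host ports may talk to which storage ports.
ZONES = {
    "zone_a": {"host1", "array1"},
    "zone_b": {"host2", "array1"},
}

# LUN masking (in the array or the HBA driver): which LUNs each host may use.
LUN_MASKS = {
    ("host1", "array1"): {0, 1},
    ("host2", "array1"): {2},
}


def zoned_together(host, array):
    return any(host in members and array in members for members in ZONES.values())


def can_access(host, array, lun):
    """Access requires both a shared zone and a LUN mask entry."""
    return zoned_together(host, array) and lun in LUN_MASKS.get((host, array), set())


host1_lun0 = can_access("host1", "array1", 0)   # zoned and unmasked
host2_lun0 = can_access("host2", "array1", 0)   # zoned, but LUN 0 is masked from host2
host3_lun0 = can_access("host3", "array1", 0)   # not zoned with the array at all
```

In this model the fabric decides whether a host can reach the array at all, and the mask then narrows that access to individual LUNs.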

      Switch Zoning Configuration for Departmental SANs
      When an administrator wants to take advantage of storage consolidation and
      enhanced manageability, he or she might want to collect several departments onto
      one large storage network. In the example in Figure 6.4, three departments
      (Engineering, Finance, and Marketing) have been consolidated onto a single
      SAN. Each department has a dedicated storage array for its operations and a set
      of hosts. Hosts and storage throughout the company can be connected into a
      single Fibre Channel storage network. By using switch zoning, engineering hosts
      have access to only the engineering storage array, finance hosts have access to
      only the finance array, and marketing hosts have access to only marketing data on
      the marketing storage array. Using the Brocade Zoning tools, the SAN adminis-
      trator would create three zones: Zone1, Zone2, and Zone3. Engineering Host A,
Engineering Host B, and Engineering Storage Array would be added to Zone1,
as shown in Figure 6.4. Finance Server would be added to Zone2 with Finance
Storage Array. Marketing Host and Marketing Storage Array would make up
Zone3. All of these zones would be added to a single zone configuration, which
would be set as the active configuration for the fabric. The advantages of this
partitioning include the capability to install a single Fibre Channel-connected
infrastructure in a building or campus that can support any desktop or server
connection to storage, the ability to centrally manage access to storage by an
administrator through the switches, and the ability to centrally back up all storage
through the network.
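
The departmental zoning described above can be modeled in a few lines of Python. This is only an illustrative sketch of the access rule (two devices can communicate when some zone in the active configuration contains both), not Brocade's zoning implementation. The class name ZoneConfig and the device names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ZoneConfig:
    """A named collection of zones, as would be enabled on the fabric."""
    name: str
    zones: Dict[str, Set[str]] = field(default_factory=dict)

    def add_zone(self, zone_name: str, members: Set[str]) -> None:
        # Store a copy so later changes to the caller's set do not leak in.
        self.zones[zone_name] = set(members)

    def can_communicate(self, a: str, b: str) -> bool:
        # Two devices can see each other only if some zone contains both.
        return any(a in m and b in m for m in self.zones.values())


# Hypothetical device names standing in for WWNs or port addresses.
cfg = ZoneConfig("departmental_cfg")
cfg.add_zone("Zone1", {"eng_host_a", "eng_host_b", "eng_array"})
cfg.add_zone("Zone2", {"finance_server", "finance_array"})
cfg.add_zone("Zone3", {"marketing_host", "marketing_array"})

print(cfg.can_communicate("eng_host_a", "eng_array"))      # True
print(cfg.can_communicate("eng_host_a", "finance_array"))  # False
```

In a real fabric the members would be WWNs or port addresses rather than readable names, and the switch, not the host, enforces the rule.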
Figure 6.4 A Departmental SAN Partitioning Using Switch Zoning

[Figure: Zone1 contains two Engineering servers and the Engineering storage array; Zone2 contains the Finance Server and the Finance storage array; Zone3 contains the Marketing Server and the Marketing storage array.]

          An administrator would set this up by first defining which departments had
      storage that could benefit from central connectivity. A Fibre Channel SAN would
      be wired throughout a building or campus, and storage for each department
      brought online within specific zones. Individual servers and hosts would be
      added, one at a time, to the appropriate zones to which they belong by identi-
      fying their World-Wide Names (WWNs) or port addresses and assigning them
      into a zone set.

      Storage Partitioning Using Storage LUN Masking
      Storage partitioning can also be accomplished using LUN masking on the storage.
      Storage arrays from all major vendors have added this feature to control host
      access to storage volumes. The storage administrator determines which hosts talk
      to which storage volumes. Usually, this is done by specifying the port or node
      WWNs of the HBAs (and thus, hosts) that are connected in the network, and
      which physical LUNs they are allowed to access. If a host that has not been
      granted access attempts to access a volume, the storage array rejects any
      commands to that device from the alien host. Rogue hosts on a network, human
      error, and operating system vagaries will not jeopardize the integrity of your data.
      Storage LUN masking controls access where the data is being stored. However, in
      an environment where many storage arrays exist, this decentralized management
      model might be time consuming to manage. Storage array manufacturers might
      charge a substantial extra fee for the capability and software to manipulate storage
      LUN masking. You should make sure that you understand up front whether your
      array provides this capability, and if so, at what cost. Finally, if you use more than
      one manufacturer’s array in your SAN, you need to ensure that you have
      expertise in each LUN management application.
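
As a rough illustration of storage-side masking, the sketch below keeps a table of which initiator WWNs may address each LUN and rejects everything else. The StorageArray class, its method names, and the WWN values are all hypothetical. Real arrays expose this through vendor-specific management software.

```python
from typing import Dict, Set


class StorageArray:
    """Toy model of an array that enforces LUN masking by initiator WWN."""

    def __init__(self) -> None:
        self._allowed: Dict[int, Set[str]] = {}  # LUN -> WWNs granted access

    def grant(self, lun: int, wwn: str) -> None:
        self._allowed.setdefault(lun, set()).add(wwn.lower())

    def handle_command(self, wwn: str, lun: int) -> str:
        # Enforcement happens at the array, so an unlisted host is refused
        # no matter how that host itself is configured.
        if wwn.lower() in self._allowed.get(lun, set()):
            return "ACCEPTED"
        return "REJECTED"


array = StorageArray()
array.grant(0, "10:00:00:00:C9:00:00:01")   # hypothetical WWN of host A
array.grant(1, "10:00:00:00:C9:00:00:02")   # hypothetical WWN of host B

print(array.handle_command("10:00:00:00:c9:00:00:01", 0))  # ACCEPTED
print(array.handle_command("10:00:00:00:c9:00:00:01", 1))  # REJECTED
```

Because the check lives in the array, a misconfigured or rogue host cannot bypass it, which is the property the text attributes to storage LUN masking.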

      Storage Partitioning Using HBA LUN Masking
      The other end of your storage network is the HBA. This is another feasible point
      for controlling which hosts can or cannot access certain LUNs. HBAs also offer
      LUN masking functionality for this reason. An HBA can mask what devices the
      operating system accesses. This control is typically exercised through a console
      application, registry settings, text files, or third-party software. Host-based LUN
      masking produces much the same effect as storage LUN masking. It is typically
less expensive, although it requires the active participation of every HBA on your
network. The security model is substantially weaker, since one host (whether
intentionally, through operator error, or through software malfunction) can
compromise data integrity on your SAN. If even one host on your network does not
have the LUN masking set up correctly, it could mean corrupted data. A rogue
host could also gain access to your data without permission. This does not
mean that HBA LUN masking should not be used, but rather that it should be
used in conjunction with fabric zoning for maximum effectiveness. This type of
multilayer security is common in traditional IP networks: any security consultant
will tell you that the correct place to provide security is not just at the host or in
the network, but everywhere.
     Some HBAs allow you to change LUN masking “on the fly,” meaning that
changes you make to masking are reflected immediately on the network. Some
software might require a reboot of the system for the changes to take effect. In
addition, LUN access changes need to be propagated across a network to every
host that is accessing common storage, which is not a trivial feat. The principal
advantages of HBA-based LUN masking are cost and accessibility. When HBA-based
LUN masking is available for an HBA, the LUN masking is generally included as
a standard part of the cost of purchasing the hardware.

Partitioning with Software
To tackle the problems of allocating volumes across devices, software companies
have come up with a number of solutions to help you control which systems in
your network have access to which devices. These solutions usually exist as
drivers that are layered on top of file systems. Like HBA LUN masking drivers,
this software must be loaded on every system in order to work. The existence of
hosts that are accessing storage in the same zone as these machines but that are
not running the software is almost guaranteed to result in data corruption. These
software applications have one thing in common: they act as a filter for other
drivers that control which LUNs are seen by a host, and selectively allow or
disallow access to these devices depending on administrator or user requests.
    Some software packages actually allow multiple hosts to have read-write access
to the same device at the same time. This is a shared volume/shared file system
approach. Other software packages merely allow convenient and dynamic
reallocation of resources between hosts, with only one host having control at a time.
    Data-sharing applications are loaded onto servers in a network, and generally
all machines in a network are required to have cooperating servers, or must be
zoned into special areas for volume sharing. Some software applications also
      require the installation of a metadata server to coordinate access to volumes.
      Software applications like VERITAS Volume Manager, Tivoli SANergy, and HP
      LUN Manager allow you to manipulate which hosts in the network are allowed
      to see which volumes in a network. Through metadata passed between devices,
      information about which systems are using which volumes is exchanged with the
      drivers on all of the systems. When there is a request to share or unshare a
      volume, the software tells the hosts that are no longer allowed access to unmount
      a volume. It then tells the hosts that now have access to go ahead and mount and
      access the volume. All of this can happen automatically, without additional
      user intervention.
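
The unmount-then-mount ordering described above can be sketched as follows. This is a simplified model under stated assumptions, not the behavior of any named product: the Host and VolumeCoordinator classes and the host names are invented for illustration, and real packages exchange metadata between drivers rather than calling methods directly.

```python
from typing import Dict, Iterable, List, Set


class Host:
    """A cooperating host running the (hypothetical) sharing driver."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mounted: Set[str] = set()


class VolumeCoordinator:
    """Tracks which hosts may use each volume and orders the hand-off."""

    def __init__(self, hosts: Iterable[Host]) -> None:
        self.hosts: Dict[str, Host] = {h.name: h for h in hosts}
        self.owners: Dict[str, Set[str]] = {}

    def reassign(self, volume: str, new_owners: Set[str]) -> List[str]:
        old = self.owners.get(volume, set())
        log: List[str] = []
        # Step 1: hosts losing access unmount first.
        for name in sorted(old - new_owners):
            self.hosts[name].mounted.discard(volume)
            log.append(f"unmount {volume} on {name}")
        # Step 2: only then do newly granted hosts mount the volume.
        for name in sorted(new_owners - old):
            self.hosts[name].mounted.add(volume)
            log.append(f"mount {volume} on {name}")
        self.owners[volume] = set(new_owners)
        return log


a, b = Host("animator_a"), Host("animator_b")
coord = VolumeCoordinator([a, b])
coord.reassign("vol1", {"animator_a"})
print(coord.reassign("vol1", {"animator_b"}))
```

The sketch only works because every host is registered with the coordinator. A host outside its control could still write to the volume, which is why the text insists that every host in the zone run the software.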
          These techniques are suited for very dynamic environments. With all of the
      packages available today, sharing of data can be changed “on the fly,” from minute
      to minute if necessary. A common use of this has been for rendering farms at
      Hollywood computer graphics firms, animation houses, and special effects
      companies. Volumes are shared to individual animators to store work in progress. The
      high speeds of the Fibre Channel network are used to transfer the many gigabytes
      of data generated by those animators. Volumes are shared between workstations
      and groups, and can be reallocated based on changing workload requirements.
      Shared file systems also can be used to facilitate the creation of Web farms.
          The disadvantage of using software is primarily one of security. All of the
      software available today requires that all of the hosts running in the SAN zone be
      loaded with the software. Accidentally attaching hosts that are not running the
      software could cause massive data corruption. The software relies on cooperating
      hosts to be loaded properly to control individual access to volumes. Because of
      this, these software packages are best used in tightly controlled environments, and
      in conjunction with fabric zoning.

      LAN-Free Backup Configuration
      Traditional backup systems used SCSI direct-attached tape storage as a method to
      back up business-critical data accessed by application servers. This meant that
      each application server had its own tape storage, which backed up the data stored
      on locally attached disks. Server RAM, I/O bus, and CPU resources were used to
      drive the backup process. To combat management problems associated with
      coordinating the growing number of local tape drives and libraries, inefficient use
      of secondary storage resources, and ineffective use of personnel, companies
      implemented LAN-based backup using a server-client model.
     A central backup server would be installed on the LAN. The application
servers and workstations were configured as clients of this backup server. The
central backup server would accept requests from backup agents running on its
clients, and transfer the data through the LAN to the locally attached tape
resources it managed. This method provided a centralized, easy-to-manage backup
scheme and allowed greater efficiency by sharing tape resources over the
network. Unfortunately, the LAN-based solution has several shortcomings.
     Backup jobs require a large amount of block data movement, which in this
scenario is carried across the LAN. With so much data being generated every day
and backup windows extended into normal working hours, LAN connections
become swamped with backup jobs. End users complain that they cannot access
network resources, and systems administrators see network performance dwindle.
Administrators can attempt to minimize the effect of the backups on the
operation of the LAN by running backups after business hours. However, the amount
of data being backed up continues to increase at nearly exponential rates. With
24x7 operations, the LAN-based solution has turned out to be infeasible for
most enterprise environments.
     LAN-free backups using storage networks have helped solve these problems.
Because Fibre Channel is a high-bandwidth channel, and has been designed from
inception as a separate network for bulk data movement, the bandwidth problems
that appear in running backups over LANs disappear. This separate network, in
addition to offloading traffic from the LAN, also performs its operations with
less CPU overhead than the LAN approach requires. This is due to the fact that
Fibre Channel connections do not need to go through the server’s TCP/IP stack,
and because certain levels of error checking are accomplished in Fibre Channel
hardware.
     It is necessary in many solutions to have a host to manage the shared storage
and to store the backup database, which is used to locate and recover data.
VERITAS NetBackup is an example of backup software that facilitates LAN-free
backup.

SAN Server-Free Backup
Server-free backup is the use of the SAN to remove backup traffic from the
current Ethernet or other IP network without requiring a separate dedicated server.
A SAN-based server-free backup is therefore also LAN-free. Because Fibre
Channel storage networks are now used for data sharing or consolidation, it is
natural to design and implement a server-free fabric-wide backup scheme. This
      type of backup implementation contrasts with legacy LAN-based approaches,
      where each server reads the data and sends it over IP on the corporate LAN to
      another server with locally attached hardware, and with LAN-free schemes, where
      backup traffic is isolated to a separate Fibre Channel-based network. Instead
      of one host backing up to another, whether on a LAN or a SAN, a data
      mover, at the request of a host, reads from the disk and writes directly to
      SAN-based shared tape resources, without the requesting host ever having to be
      directly in the data path.
          What can be a data mover? Just about anything. Old servers gathering dust
      can be recycled and turned into data movers. Native “smart” Fibre Channel-
      attached tape drives might have embedded data mover functionality. Fibre
      Channel-to-SCSI bridges and routers, which allow legacy SCSI device
      attachment to the SAN, almost universally have this feature. Typically, data movers
      either are Network Data Management Protocol (NDMP)-based or use the
      Extended Copy command, sometimes called third-party copy.
          NDMP is an open standard protocol for enterprise-wide backup of heteroge-
      neous storage. NDMP clients and servers pass metadata about the backup job
      status as well as the data itself. Traditionally, NDMP was used in the
      network-attached storage model.
          Extended Copy is a SCSI protocol command that allows a remote block-level
      copy to occur. The reason this method is called third-party copy is that the host
      actually requesting the copy command does not send it directly to the devices in
      question. Instead, it sends the request to a third-party device, which then sends the
      command to the appropriate targets. In general, Fibre Channel-to-SCSI routers
      and native Fibre Channel tape drives use Extended Copy, while NDMP uses
      legacy hosts for data movement from disk to tape.
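
The division of labor in third-party copy can be shown with a small simulation. It is a sketch only: the DataMover and Host classes are hypothetical, the "disk" and "tape" are plain Python lists, and nothing here reflects the actual parameter format of the SCSI Extended Copy command. The point is that the requesting host sends a short request and handles none of the data blocks.

```python
from typing import List


class DataMover:
    """Stand-in for a bridge, router, or tape drive with copy capability."""

    def __init__(self) -> None:
        self.blocks_moved = 0

    def copy(self, disk: List[bytes], tape: List[bytes], start: int, count: int) -> int:
        # The mover reads from disk and writes to tape on the host's behalf.
        chunk = disk[start:start + count]
        tape.extend(chunk)
        self.blocks_moved += len(chunk)
        return len(chunk)


class Host:
    """The requesting host: it issues the command but never touches the data."""

    def __init__(self) -> None:
        self.blocks_handled = 0

    def request_backup(self, mover: DataMover, disk: List[bytes], tape: List[bytes]) -> int:
        # Only a small copy descriptor (source, destination, range) is sent.
        return mover.copy(disk, tape, start=0, count=len(disk))


disk = [bytes([i]) * 512 for i in range(8)]   # eight 512-byte blocks
tape: List[bytes] = []
host, mover = Host(), DataMover()
copied = host.request_backup(mover, disk, tape)
print(copied, host.blocks_handled, mover.blocks_moved)  # 8 0 8
```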
          Why do server-free backups? Typically, backup used to be done through a
      centralized console or backup server. This backup server would communicate
      with backup agents on all of the hosts or servers in the corporate Ethernet
      network at a convenient hour, and request certain files to be sent from storage to
      tape. This works well for installations with a few file servers and small data sets.
      Enterprise companies are now finding that all other IP traffic comes to a halt
      when backups are occurring, and server CPU cycles are being saturated by all
      of the IP traffic. Moreover, with the immense data growth occurring throughout
      the industry, the amount of time a backup takes has extended from a few hours
      each night to sometimes an entire day. In extreme cases, the time required to
      back up an enterprise data set, even incrementally, has gone beyond the
      Ethernet network’s capability to transport that data within a day. To make matters
worse, the move toward 24x7, always-on Web-based business models has left no
time of the day available for loading the corporate network. Continuous
customer access and nonstop transaction processing have become critical for
business survival.
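
A quick calculation shows how a backup can outgrow a 24-hour day on Ethernet. All three figures below are assumptions chosen for illustration (a 2 TB data set, roughly 10 MB/s of usable throughput on 100 Mb/s Ethernet, and roughly 100 MB/s on a 1 Gb/s Fibre Channel link). Substitute measured values for your own environment.

```python
def backup_hours(data_mb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_mb megabytes at a sustained throughput."""
    return data_mb / throughput_mb_per_s / 3600.0


DATA_SET_MB = 2_000_000      # assumed 2 TB enterprise data set
ETHERNET_MB_S = 10.0         # assumed effective rate on 100 Mb/s Ethernet
FIBRE_CHANNEL_MB_S = 100.0   # assumed rate for a 1 Gb/s Fibre Channel link

lan_hours = backup_hours(DATA_SET_MB, ETHERNET_MB_S)
san_hours = backup_hours(DATA_SET_MB, FIBRE_CHANNEL_MB_S)
print(f"LAN: {lan_hours:.1f} h, SAN: {san_hours:.1f} h")  # LAN: 55.6 h, SAN: 5.6 h
```

Under these assumptions the LAN backup cannot finish within a day, while the same data set fits comfortably in a nightly window on the SAN.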
    Today, the best solution to these dilemmas is server-free backup. Existing or
new Fibre Channel storage networks can take full advantage of server-free
backup technology. For most new SAN installations using LAN-free only
backups, the hardware functionality exists by default. What problems are solved in
the data center? In server-free backup, each data mover writes
directly to tape, without the need for agents or IP traffic, or any consumption of
CPU time on other servers. A master backup server coordinates tape sharing
between media servers. Other features such as snapshot copy, which exist as
hardware- and software-bundled features for enterprise RAIDs, in addition to a
backup software application option, enhance server-free backup implementations
for database servers. The idea is that instead of the data being read from disk
drives, into the memory of servers, and sent through an IP network to the
backup server, the data is block-copied from disk to tape directly. Figure 6.5
shows an example of this, where three different hosts share a single tape drive
across the SAN.

SAN-Based Third-Party Copy Data Movers
Third-party copy backup systems are very similar to the LAN-free systems
discussed previously. However, with this technique, specialized pieces of hardware
and software called “data movers” are used to back up critical data from storage
arrays in the SAN without the need for a dedicated server or servers to handle
the data copying and movement.
    All third-party copy data movers support the SCSI Extended Copy command. The third-party
copy hardware actually moves the data from the disk to tape. The backup software
controls this operation without the need for the servers in the network to get
involved in the actual movement of data. Agents are not required to run on the
server, and critical servers are not occupied backing up data.
    Third-party copy hardware, such as some Crossroads or Chaparral Fibre
Channel/SCSI bridges, works in conjunction with third-party copy-enabled
backup application software (from VERITAS, Legato, Computer Associates, and
others). This backup software operates identically to normal backup software in
terms of user interface and operation. Overall, this technique increases
performance, greatly reduces the time needed for backup windows, and eliminates the
task of backup job processing from the CPU of the server.

      Figure 6.5 Using Storage Networks for Server-Free Backup

      [Figure: Host A, Host B, and Host C share two Fibre Channel storage devices and one Fibre Channel tape device across the SAN.]

           A typical third-party copy configuration with currently available hardware
      uses a Fibre Channel-to-SCSI router as a bridge to legacy SCSI tape drives,
      and combines that functionality with third-party copy support (Figure 6.6).

      Making Your Enterprise Disaster Tolerant
      As computer systems become increasingly central to business operations, the
      integrity and availability of those systems have become one of the most important
      charters of IT organizations. Today, the loss of access to computer systems and data
      for even a few minutes can mean millions of dollars of lost revenue and damage
      to reputation. Because of the criticality of those systems and data, service-level
      agreements and in some cases government regulations require that systems be
      available to provide business continuance in the face of major disasters.

Figure 6.6 Third-Party Data Mover Configuration

[Figure: Host A, Host B, and Host C; a “command to move data” goes from a host to a Fibre Channel/SCSI bridge labeled as the data mover; the actual block data movement runs from Fibre Channel storage through the bridge to Fibre Channel tape.]

    Enterprises that are implementing a SAN are finding that the ability to
mirror and operate their devices spread across large distances helps to provide
disaster tolerance to their critical installations. The presence of a SAN helps enhance
the ability of a company to protect and recover data in the case of a disaster, and
provides the tools to enable an administrator to design a disaster-tolerant system.
Fibre Channel technology can be coupled with technologies like Dense Wavelength
Division Multiplexing (DWDM) for Metropolitan Area Networking (MAN), and
tunneling through existing high-speed Wide Area Networks (WANs). Thus, it is
      now possible to separate key data sites across great distances, while still allowing
      them to share disk subsystems and backup devices. The following sections
      summarize the creation of a geographically separated, but fully connected SAN. They
      describe how Brocade switches are uniquely suited to handle these needs with
      features known as Remote Switch and Extended Fabrics.

      Data Replication and Remote Backup
      Data replication is used to make enterprises disaster tolerant across long distances,
      particularly in the case where bandwidth is not readily available or is very
      expensive. Data replication is the technique of taking a snapshot of your operational
      storage images at a specific point in time, and moving it across a network to a
      geographically separate storage facility. This data is moved, on schedule, across a
      potentially slower network over large distances, and replicated at the backup
      facility. Data replication is typically done every day or at most every few hours,
      and it can be done across a city or across the globe. Because there is a delay
      between updates, data can sometimes be sent over a slow link and reassembled on
      the other side.
           Data replication can be done directly across Fibre Channel, enabling very fast
      replication and minimal mismatch between a replica and live data. This technique
      can also be used across existing IP and other network infrastructures to transport
      the data across large distances. For example, it is possible to tunnel Fibre Channel
      connections over an Asynchronous Transfer Mode (ATM) network using the
      Brocade Remote Switch product and an appropriate Fibre Channel-to-ATM
      gateway. This is an optionally licensed product available for all Brocade SilkWorm
      2000 and higher switches.
           Remote backup is the use of a long-distance link to enable backup to a
      remote site. Normal backup techniques are used, with the difference being that
      backup tapes and media are stored far away from the servers being backed up.
      This helps ensure the safety of that data in the case of a geographically limited
      disaster. Like remote replication, the task can be done via a Fibre Channel-to-
      ATM or Fibre Channel-to-IP gateway. Figure 6.7 shows a typical remote backup
      configuration utilizing existing WAN infrastructure.

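Figure 6.7 Remote Backup over WANs

[Figure: At the primary site, a server and Fibre Channel storage connect through a Fibre Channel/ATM gateway to an ATM network; at the recovery site, a second Fibre Channel/ATM gateway connects to Fibre Channel tape.]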

Metropolitan Area Network Solutions
A recent innovation in optical technology, DWDM hardware has enabled the
transport of native Fibre Channel over greater distances, up to 100 km. DWDM
allows for real-time, full-speed transport of Fibre Channel to match the very high
bandwidth requirements of real-time, mission-critical applications.
          This approach is used for creating disaster-tolerant solutions, by enabling
      remote mirroring of operations across large distances. Because these solutions can
      transmit full-speed Fibre Channel frames, two separate data centers can share the
      same data, and can create remote mirrors of data. This remotely mirrored data
      allows for a hot-standby system that can take over the operations of a failed
      system at a moment’s notice, with all data intact and no need for data recovery. In
      fact, DWDMs can be used in conjunction with HA software to allow this failover
      to occur both automatically and quickly.
          Brocade switches compensate for the signal delays that occur when
      transmitting frames over long distances by using extended amounts of
      buffering (buffer-to-buffer credits) available on Inter-Switch Links (ISLs). This
      delay is caused by the finite speed at which light travels through the glass in the
      fiber-optic cable. By configuring two switches to use extended buffer credits on
      the long-distance E_Ports, Brocade switches can handle this delay without
      losing bandwidth.
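
To see why distance demands more buffer credits, a back-of-the-envelope estimate helps. The sketch below is not Brocade's sizing formula. It assumes light takes about 5 microseconds to cover 1 km of fiber, that every frame is full size (2148 bytes), and that the link carries about 106.25 bytes per microsecond (a 1 Gb/s Fibre Channel link). It also ignores switch latency, so real settings may reserve more credits than this.

```python
import math

FIBER_DELAY_US_PER_KM = 5.0   # assumed: about 5 microseconds per km of glass
FRAME_BYTES = 2148            # assumed: a full-size Fibre Channel frame
LINK_BYTES_PER_US = 106.25    # assumed: data rate of a 1 Gb/s link, bytes per microsecond


def credits_needed(distance_km: float) -> int:
    """Frames that must be in flight to keep a link of this length busy."""
    frame_time_us = FRAME_BYTES / LINK_BYTES_PER_US
    # A credit is returned only after the frame arrives and the
    # acknowledgement (R_RDY) travels back, hence the round trip.
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return math.ceil(round_trip_us / frame_time_us)


for km in (10, 50, 100):
    print(km, "km ->", credits_needed(km), "credits")
```

With fewer credits than this estimate, the sending port would sit idle while waiting for acknowledgements, and the usable bandwidth of the long-distance link would drop.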
          A Brocade switch can be configured to handle extended distance fabrics by
      installing the Extended Fabrics software license, and then setting the long-dis-
      tance fabric settings to “1” in the switch configure command. Individual switch
      E_Ports can either be set to handle distances of 50 km (mode 1) or 100 km
      (mode 2, 60). See the Brocade Extended Fabrics documentation for details
      on configuration.
          One of the most typical uses of Fibre Channel SANs over MANs is sharing
      data for disaster tolerance. Banks, brokerages, and other businesses in Manhattan
      are some of the biggest users of this technology. These organizations require
      real-time backup of data to a remote site. A combination of disaster tolerance
      requirements and the cost of real estate in Manhattan has resulted in a large number of
      organizations establishing disaster recovery sites in New Jersey for a secondary
      operations center.
          Figure 6.8 shows a typical disaster tolerance configuration used for a MAN
      topology. Two parts of the same SAN exist on either side of a MAN (DWDM),
      operating just as if they were not geographically separated. Data is mirrored
      between both sides of the SAN, and failover software can be used to provide high
      availability with backup servers on either side of the link.

Figure 6.8 Metropolitan Area Network, Bridging Fibre Channel over DWDM

[Figure: Site A and Site B each have a server and Fibre Channel storage; the sites are linked by a pair of DWDM units, and data is mirrored between the two storage devices.]

      Summary
      In this chapter, we discussed the most common overall applications of SANs, and
      some sample configurations for those applications.
           HA clusters are used with Fibre Channel networks to support mission-critical
      business applications utilizing redundant SAN components. These can include
      database clusters like Oracle Parallel Database Server, and specialized failover
      packages such as Microsoft Cluster Server.
           Storage consolidation is another typical application of Fibre Channel net-
      works. SANs have made it possible for data center administrators to centralize
      their storage resources, making better use of the storage they have, and leveraging
      their budget for storage through more efficient allocation and capacity planning.
      With some attention paid to managing the allocation of storage through fabric
      zoning, LUN masking, or storage sharing software, the use of a SAN for storage
      consolidation can pay off handsomely in better use of IT budgets, enhanced man-
      ageability of data, and more reliable operation of your data center.
           The use of a SAN for LAN-free backup, server-free backup, and newer tech-
      niques like third-party copy, offers solutions to backing up the vastly growing
      amounts of storage in your enterprise. Because of the bandwidth and efficiency
      of Fibre Channel, SANs are fundamentally better suited for backup than LANs
      are. Server-free backup configurations offer even more efficient backup of storage
      resources, requiring more advanced software capable of directly backing up data
      without the intervention of servers. Finally, third-party copy provides even more
      efficient use of the network without requiring even a backup server.
           With the advent of DWDM equipment that can transmit high-speed Fibre
      Channel data across MANs, enterprises can add disaster tolerance to their data
      centers. By enabling remote mirroring and replication of data, clustering across
      long geographical distances enhances the ability of the enterprise to keep critical
      systems up and running in the most extreme conditions.

      Solutions Fast Track
      Configuring a High-Availability Cluster
               HA clusters are used for redundant, fail-safe installations of mission-crit-
               ical business applications.
               Clustering provides availability, manageability, and scalability.
     Availability is the capability of a cluster to tolerate hardware, network, or
     software errors.
     The most common use of clustering is two servers configured to share
     storage through Fibre Channel.
     Redundant HBAs and switches should be used to provide fault tolerance.
     The use of dynamic multipathing software, drivers, or HBAs can provide
     higher levels of availability to your cluster.


Using a SAN for Storage Consolidation
     Storage consolidation enables administrators to centralize
     storage resources.
     Consolidation provides more efficient use of storage, enhances
     manageability, and improves accessibility.
     Almost any layout of a storage network can be used for
     storage consolidation.
     Consolidation requires attention to how operating systems treat
     shared volumes.
     In order to properly partition data in a consolidation environment, you
     need to use fabric zoning, LUN masking on storage or the host, or soft-
     ware to control permissions.
     It is generally best to use fabric zoning even when also using another
     access control product to achieve a more effective security model, and to
     provide a “broadcast container,” which can increase the scalability and
     reliability of a SAN.
     An example of a typical storage consolidation setup is a shared SAN
     used to provide data storage for a Web farm, where many servers read
     the same disks to present data.
     Storage LUN masking is used to ensure that only specific hosts are
     allowed access to specific logical units of a storage array. The advantage
     of storage LUN masking is that the storage guarantees which host is
     allowed access to any volume.


               HBA LUN masking is also used to limit what storage a host can see, and
               requires that every host in the network participate in the same
               masking scheme.
               Software partitioning provides another type of control over LUN
               presentation, but it generally requires upper-level software and demands
               that every host in the network be loaded with that software.
               Switch zoning, available in Brocade switches, provides a convenient way
               to allocate storage to hosts, and to consolidate different departments into
               a single company network.
               Switch zoning does not currently support control at the LUN level, only
               at the port and WWN levels. Upcoming products will add this capability.
               For now, other access control techniques might need to be used in addi-
               tion to switch zoning to provide access control at the LUN level.
               Storage LUN masking provides another way to control access to
               volumes in a shared SAN.
               High-end storage arrays provide the capability to specify the port or
               node WWN of a host HBA, and specify which volumes in the array will
               respond to requests.
               By using storage LUN masking, you can ensure that only hosts with
               permission can read or write from a specified volume.
               Storage LUN masking requires the participation of the storage only to
               enforce permissions.
               HBAs provide access control to volumes through LUN masking.
               LUN masking controls which volumes an operating system can see
               through a particular HBA.
               HBA LUN masking requires the participation of all of the hosts in the
               network to avoid contention for storage resources.


      LAN-Free Backup Configuration
               Traditional backup systems used SCSI direct-attached tape storage. The
               LAN-based client-server backup model, although an improvement,
               cannot keep up with the ever-increasing amounts of data sent over the
               LAN connection. LAN-free backups using storage networks solve LAN-based
               problems by offloading traffic from the LAN and increasing bandwidth.


SAN Server-Free Backup
     Server-free backup goes a step beyond LAN-free backup: it removes backup
     traffic from the servers as well as from the LAN.
     Backup is done directly on the SAN for each device, rather than each
     host being involved in data transfer.
     Third-party copy provides an even more efficient way to transfer data to
     tape, freeing a backup server from needing to directly access disks and
     copy data to tape.


Making Your Enterprise Disaster Tolerant
     Fibre Channel SANs are ideal for mirroring and accessing data across
     large distances.
     It is now possible to separate critical systems many miles apart.
     Brocade switches provide extended credits on ISLs to enable high
     performance and reliable long-distance operation.



      Frequently Asked Questions
      The following Frequently Asked Questions, answered by the authors of this book,
      are designed to both measure your understanding of the concepts presented in
      this chapter and to assist you with real-life implementation of these concepts. To
      have your questions about this chapter answered by the author, browse to
      www.syngress.com/solutions and click on the “Ask the Author” form.


      Q: My configuration does not look exactly like any of these. Is this a problem?
      A: These examples represent typical configurations for applications. Many real-
          life configurations might be more complex.

      Q: Where can I get more information on configurations?
      A: Interoperability programs through most Fibre Channel manufacturers provide
          example configurations, along with more detailed version numbers and spe-
          cific model information for tested configurations. Brocade has an extensive
          set of SAN solutions called SOLUTIONware Guides available on their Web
          site at www.brocade.com/SAN.

      Q: I am trying to control access to storage, and do not know what type of con-
          trol I need: zoning, LUN masking, or software? What should I do?
      A: The kind of control over your storage depends entirely on your application.
          Analyzing how dynamic your environment is will determine whether you
          can just use zoning or software. In most cases, you might actually use a com-
          bination of these techniques to achieve what you need.

      Q: I would like to cluster my databases for better performance. What databases
          can I use?
      A: Most major databases now support fabric switch-based clustering, including
          Oracle Parallel Server, IBM DB2 Parallel Edition, and Microsoft SQL Server.

      Q: I would like to have my Exchange Mail Server highly available. What
          should I do?
      A: Brocade has developed HA solutions for the Exchange Server that can be
          used in setting up your desired SAN configuration. For more information,
          visit the Brocade Web site: www.brocade.com/SAN.
                                      Chapter 7


Developing a SAN
Architecture




 Solutions in this chapter:

     •   Identifying Fabric Topologies and
         SAN Architectures
     •   Working with the Core/Edge Topology
     •   Determining Levels of Availability
     •   Configuring Traffic Patterns
     •   Evaluating Performance Considerations


         Summary

         Solutions Fast Track

         Frequently Asked Questions



228   Chapter 7 • Developing a SAN Architecture



      Introduction
      In Chapter 5, “The SAN Design Process,” you performed the requirements
      analysis to determine what your SAN needs to accomplish. In Chapter 6, “SAN
      Applications and Configurations,” you explored some of the solutions that could
      be built on your SAN. At this point in the book, you should know the following
      information about your SAN:
           •   How many ports you need for hosts
           •   How many ports you need for storage
           •   What the traffic patterns will be
           •   The network’s performance requirements
           •   Where all of the equipment will be located
           •   What, if any, MAN/WAN or campus distances will be involved
           •   What type of solution you are building (such as storage consolidation)
           •   How all of this will likely change over time
          In this chapter, you will take this information and determine the fabric
      topology or topologies that best suit your needs as part of your overall SAN
      architecture. We discuss the different categories of fabric topologies that you
      could apply and which topology is most appropriate in any given case.
      Furthermore, we describe how you could use multiple fabrics to form a highly
      reliable and scalable SAN architecture. We delve into detail on one particular
      topology, the core/edge fabric, also commonly known as a star topology network.
      There are subtle differences between “normal” star networks and a core/edge SAN
      that require using the new term. However, for the most part, if you think of a
      star network, you will not be far off base.


      NOTE
           Do not view the focus on the star topology as an indictment of the other
           approaches you could take. We chose to highlight this approach because
           it is simple. One of the strongest features of Brocade Fabric OS is the
           robust implementation of Fabric Shortest Path First (FSPF), which makes
           arbitrarily complex networks possible. However, the simplicity principle
           of “Occam’s razor” tells us to use simpler solutions whenever we can, and
           the core/edge SAN is simple indeed. Thus, we felt it to be an appropriate
           design to highlight.


    Methods for providing the best redundancy in all topologies and a discussion
of performance follow. Finally, we talk about how to deal with SANs that must
span long distances.

Identifying Fabric Topologies
and SAN Architectures
Before embarking on a discussion of SAN architectures, we will review a set of
working definitions for those SAN components particular to these architectures.
The terminology for describing fabric topologies, SAN architectures, and their
components is still evolving. Much of the current terminology is derived from
other networking technologies, like Ethernet, frame relay, or even high-perfor-
mance computing node interconnects in supercomputers. Since there are occa-
sionally multiple terminologies that might apply to the same entity, the terms we
use in this book are the ones that we consider the most useful in describing any
given SAN. For example, the term ring topology fabric is quite self-explanatory, and
is therefore useful in describing “a fabric in which the Inter-Switch Links (ISLs)
form a logical ring.”
     •   Fabric A fabric consists of one or more interconnected Fibre Channel
         switches. The term fabric can refer to the physical switches or to a set of
         global software components such as the routing tables, zoning
         configurations, and Name Servers.
     •   SAN A SAN can consist of one or more related fabrics and connected
         edge devices. It is possible to build a SAN using various networking
         technologies. However, all SANs discussed in this book are Fibre
         Channel fabric-based. Several emerging technologies might complement
         Fibre Channel in the future, so it is important to make the distinction.
         For now, though, the vast majority of SANs in production (and all
         SANs based on Brocade technology) use Fibre Channel.
     •   Fabric Topology Fabric topology is the arrangement of the switches
         that form a fabric. This term is used in the context of ISL
         interconnection and does not relate to the way in which nodes are
         connected to the fabric. Moving a storage device from one port to another
         does not change the fabric topology.
     •   Resilient Core/Edge Fabric Topology Resilient core/edge fabric
         topology is when two or more switches act as a core to interconnect
         multiple edge switches. Nodes are attached to these edge switches. We
         discuss this topology in greater detail later in the chapter. The
         convention for describing a simple core/edge fabric involves stating the
         number of edge switches, the number of core switches, and the number of
         ISLs used to interconnect each edge switch to each core switch. (For now,
         all switches are assumed to be 16 ports.) It is written as 16ex4cx1i, and
         is read as “A simple core/edge fabric consisting of sixteen edge switches,
         each of which is connected to four core switches by one ISL.” A shorter
         reading could be “A sixteen edge by four core by one ISL core/edge
         fabric.” Figure 7.1 illustrates this nomenclature.

               Figure 7.1 Core/Edge Fabric Nomenclature

               [Figure: the descriptor 16e x 4c x 1i broken into its parts:
               16 edge (16e), by 4 core (4c), by 1 ISL (1i).]

           •   Node A node is any device (usually either a host or a storage device)
               that attaches to a fabric. The terms node and edge device can be used
               interchangeably.
           •   Node Count Node count is the number of nodes attached to a fabric.
               Each node might take up one or more ports in one or more fabrics.
           •   Fabric Port Count Fabric port count is the number of ports available
               for connection by nodes in a fabric. The term port count used alone
               refers to a fabric port count.
           •   SAN Port Count SAN port count is the number of ports available for
               connection by nodes in the entire SAN. This is the sum of the port
               counts in all fabrics that make up the SAN.
           •   SAN Architecture SAN architecture is the overall design or structure
               of a storage network solution. This includes one or more related
               fabrics, each of which has a topology. It might also include other
               networks over which the SAN is bridged or tunneled, such as a
               Metropolitan Area Network (MAN). More broadly, SAN architecture can
               include software components (such as path failover or data backup
               software) and the nodes that are attached to the fabric(s).
•   Hop Count Hop count can be defined in several ways. For the purpose
    of evaluating SAN designs, the hop count is identical to the
    number of ISLs that a frame must traverse to reach its destination. If
    traffic is localized to a switch, the hop count is zero. If traffic has to cross
    one ISL, the hop count is one.
•   Latency Each hop takes a small amount of time. This time is referred to
    as the latency of the link. It is a very small amount of time
    (2 microseconds maximum across a switch, 1 microsecond typical), and will
    influence performance only when the hop is a long-distance link (as in a
    SAN/MAN or SAN/WAN). This number is usually so small when
    compared to disk access times that it normally can be treated as
    inconsequential and eliminated as a factor.
•   Over-Subscription Whenever more nodes could potentially contend
    for the use of a resource (such as an ISL) than that resource could
    simultaneously support, that resource is said to be over-subscribed.
    Over-subscription can be a desirable attribute in a fabric topology and is
    common in most networks as a cost/benefit trade-off, as long as it does
    not produce unacceptable levels of congestion. Although a design might
    have over-subscription, it does not necessarily mean that it will have
    congestion. Most nodes cannot sustain full Fibre Channel speeds,
    typically running at 50 to 80 percent of the maximum 100 MB/sec. For
    performance-limiting congestion to occur, several devices must not only
    all operate at their peak at the exact same time, but must also sustain
    simultaneous peak operation. As most traffic is bursty, as well as relatively
    random, the chance of significant congestion is usually reduced to an
    insignificantly small amount. The exception to this rule is traffic such as
    video streaming. This kind of application produces a long, constant
    stream of data. If designing for this type of traffic, you must ensure
    adequate bandwidth for these streaming sources or localize the traffic
    onto a switch.
•   ISL Over-Subscription Ratio The over-subscription ratio for an ISL
    is the number of different ports that could contend for the use of its
    bandwidth. This should be calculated for an edge switch in a core/edge
    SAN by making a ratio of the number of free ports (non-ISL) on that
    switch to the number of ISLs. For example, in the 16ex4cx1i SAN, each
    edge switch has four ISLs (one ISL going to each of four cores). Since
    each edge switch has 16 ports, there are 12 left. 12:4 reduces to 3:1, so
    the over-subscription ratio is 3:1 for this SAN. Keep in mind that this
    just indicates the number of potential devices that could contend for
    access to an ISL. This does not necessarily mean that they actually are
    contending for access.


      NOTE
           In discussing over-subscription ratios, assume that all ports operate at
           the same speed: for example, 1 Gbit/sec. In a network where some ports
           are different speeds (some are 1 Gbit/sec, while others are 2 Gbit/sec),
           the process of calculating over-subscription can be fairly complex.



                     The worst-case scenario of meaningful over-subscription for an ISL
               on a 16-port edge switch is 15:1. This ratio means that 15 devices could
               be contending for the use of one ISL. This is not a property of Brocade
               switches. It is a mathematical property of “networks built with 16-port
               switches where all ports operate at the same speed.” One could argue
               that more than 15 nodes outside a switch could contend for access to it.
               However, this is not a meaningful definition of ISL over-subscription,
               since the nodes would be subject to performance limitations of node
               over-subscription. If two hosts are trying to access one storage port, it
               does not matter how well the network is built: the over-subscribed
               storage port will be the limiting factor. Over-subscription of a storage
               port is a completely different performance metric. See the definitions for
               fan-in and fan-out.
                     As switches get larger, over-subscription will continue to be a design
               factor. Networks and node counts will continue to grow faster than the
               ability to build larger switches. (In fact, the Fibre Channel addressing
               standards themselves limit the maximum size of a switch to 256 ports,
               while a fabric could be significantly larger than that.) Networking will
               still be required in SANs and will have the potential for congestion
               regardless of the vendor or the switch size.


•   Congestion Congestion is the realization of potential
    over-subscription. A congested link is one on which multiple devices are
    actually contending for bandwidth. These devices will be throttled down
    so that together they consume no more than the total bandwidth of the
    link, assuming that all the devices are peaking and congesting the link
    at the same time.
        A frequent point of confusion is the difference between congestion
    and blocking. Blocking means that the data actually does not get to its
    destination, whereas congestion means that the data might simply have
    to wait a bit.
        As an analogy, consider the checkout line in a supermarket. When
    there are more customers in the store than there are cashiers, the
    checkout line is over-subscribed. However, if only one customer wants
    to check out at a time, the presence of other customers in the store is
    irrelevant to him or her. If more customers need to check out than there
    are cashiers, the checkout line is congested. This will have an impact on
    how quickly each checkout can occur. In a congested but nonblocking
    supermarket, each customer might have to wait in line a bit, but will
    eventually get to go through the checkout. If the supermarket took a
    blocking approach, excess customers would be turned away, and would
    not be allowed to wait in line. Some networks block; Brocade Fibre
    Channel networks do not.
•   Fabric Shortest Path First (FSPF) The FSPF protocol was developed
    by Brocade and subsequently adopted by the Fibre Channel
    standards community for allowing switches to discover the fabric
    topology, and route frames correctly. (As an interesting side note,
    Brocade proposed FSPF as a standard to ANSI one and a half years
    before its adoption. A number of competing protocols were proposed by
    other companies in the interim, but the T11 standards committee
    selected only FSPF.) FSPF also provides for load sharing between
    equal-cost links. This capability is the key to eliminating congestion from a
    fabric. Fabric topologies that have many equal-cost links (such as the
    resilient core/edge topology) benefit greatly from FSPF load sharing
    features. Fabrics with many ISLs, but few equal-cost paths (like the full
    mesh), do not use FSPF as efficiently.
•   Single Point of Failure (SPOF) A single point of failure in a SAN is
    any component (either hardware or software) that could bring down
    an entire SAN solution. This could be a switch or an ISL in a
    non-resilient topology, or a host with no clustering software installed. In an
    environment where uptime is critical, even distributed fabric services
    (such as the Name Server) can be viewed as single points of failure.
    This is why dual fabrics are always recommended in high-availability
    applications.
           •   Fan-out This is the ratio of host ports to a single storage port. It is the
               view of the SAN from the storage port’s perspective: “How many
               different hosts will be trying to access me via this port?” If you have many
               hosts accessing a single storage port, you can be assured that all of them
               together will not use more than the 1 Gbit/sec the port has available.
               Thus, if you have a fan-out of 10:1 (10 hosts sharing access to one
               storage port), it is possible that on average each host will get only
               one-tenth of the available bandwidth. Using high fan-out or ISL
               over-subscription ratios can be a perfectly acceptable way to build a SAN,
               as long as the performance characteristics of the applications involved
               are well understood.
           •   Fan-in This is the ratio of storage ports to a single host port. It is the
               view of the SAN from the host port’s perspective: “How many different
               storage devices will I be trying to access from this HBA port?” Fan-in
               and fan-out are both useful for planning the aggregate bandwidth needed
               in the ISL matrix. If, for example, you have a fan-in of 3:1 (three storage
               ports for each host port), and your ISLs are 3:1 over-subscribed (three
               devices could potentially contend for access to one ISL), proper device
               placement will allow this SAN to run with an actual congestion ratio of
               1:1. This is because you can guarantee that certain devices will simply
               never try to talk to each other. Thus, the SAN will not adversely affect
               performance, despite the ISLs being over-subscribed.
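
The core/edge descriptor and the ISL over-subscription arithmetic defined above can be sketched in a few lines of Python. This is only an illustration, not a Brocade tool: the function names parse_core_edge and isl_oversubscription are hypothetical, and the code assumes, as the definitions do, that every switch has 16 ports and that all ports run at the same speed.

```python
import re
from fractions import Fraction


def parse_core_edge(notation):
    """Split a descriptor such as '16ex4cx1i' into (edge, core, ISLs per pair)."""
    match = re.fullmatch(r"(\d+)e\s*x\s*(\d+)c\s*x\s*(\d+)i", notation.strip())
    if match is None:
        raise ValueError(f"not a core/edge descriptor: {notation!r}")
    edge, core, isls = (int(group) for group in match.groups())
    return edge, core, isls


def isl_oversubscription(notation, ports_per_switch=16):
    """Ratio of free (non-ISL) ports to ISL ports on one edge switch."""
    _, core, isls = parse_core_edge(notation)
    isl_ports = core * isls                     # ISLs leaving each edge switch
    free_ports = ports_per_switch - isl_ports   # ports left for nodes
    if isl_ports == 0 or free_ports < 0:
        raise ValueError("descriptor does not fit the assumed switch size")
    return Fraction(free_ports, isl_ports)      # Fraction reduces 12:4 to 3:1


ratio = isl_oversubscription("16ex4cx1i")
print(f"{ratio.numerator}:{ratio.denominator}")  # 3:1
```

Passing "16ex1cx1i" to the same function gives the 15:1 worst case for a 16-port edge switch with a single ISL.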
           There are several categories of information for which we need to define
      terms. These categories cover such topics as how the switches are interconnected
      (the topology), what strategy is used for optimizing traffic patterns (locality or
      tiering), and what the distance requirements are (MAN/WAN). For distance to
      affect topology, the distances involved do not have to be long; they just
      have to affect your ability to cable components together. Even having SAN
      components located in different rooms or floors of the same building might
      affect the SAN topology you choose. You will need to know how many fabrics are
      involved, and whether they all have the same topology. The last two items are
      part of the SAN architecture, not the fabric topology.


    A SAN that attaches to a long-distance network such as a MAN or WAN is
described as being “over” that technology. A SAN tunneled over a WAN is
referred to as “SAN over WAN” or “SAN/WAN.” If a specific WAN technology
is known to be in use, then you can refer to technologies in the same way. For
example, if Fibre Channel is used on the SAN, and ATM is used on the WAN,
you can refer to this as “FC over ATM” or “FC/ATM.”
    The following sections provide terms and definitions for the pieces that are
not quite so easily defined. In particular, we provide terms for fabric topologies
and SAN architectural strategies. Each of the topologies that we cover includes a
description of the topology, a picture of a representative network built using that
topology, and a summary list of its properties. All examples and properties assume
that the networks are built using 16-port switches, unless otherwise noted. Size
limits are based on current production-level support. Production level implies that
Brocade has tested fabrics to these size limits, and/or has customers who have
successfully done so. This does not mean that all vendors will support the
maximum size listed, but rather that somebody does, and that there are no
fundamental technical reasons why fabrics of that size cannot be deployed in
production environments.
    Also note that sizes are based on the most symmetrical designs. It is usually
possible to increase the size of a fabric by “hanging another switch off the edge.”
This means that you would take a “pure” design like a core/edge fabric, and plug
another switch or two into it anywhere you like to increase the port count.

Useful Topologies
The topologies for which we provide definitions are not the only topologies
possible, and this list does not represent all of the standard topologies that can
be applied to fabrics. Instead, we have attempted to compile a list of topologies
that we have found to be the most useful in practical, real-world SANs. For
example, we include the core/edge topology, but not the tree topology. This is
because we have not seen any instance where the tree is the best way to solve a
design problem, not because there is anything that would prevent you from using it.
     You can also combine these topologies to form complex networks, provided
that you do not exceed the supported number of switches or the hop count limit
defined by the Fibre Channel standards. (There can be no more than seven
hops between an initiator and a target.) These complex networks can be desirable
when building SAN/MAN or SAN/WAN solutions, but should otherwise be avoided
to keep the network as simple as possible.
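
One way to check a combined design against the hop count limit is to model the switches and ISLs as a graph and measure the longest shortest path between any two switches. The Python sketch below does that. The name max_hop_count and the example link lists are hypothetical, invented for illustration, and the code assumes that frames always follow a shortest path.

```python
from collections import deque

MAX_HOPS = 7  # hop count limit quoted in the text


def max_hop_count(links):
    """Largest number of ISLs on a shortest path between any two switches."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    worst = 0
    for start in neighbors:
        # Breadth-first search gives the hop count from 'start' to every switch.
        hops = {start: 0}
        queue = deque([start])
        while queue:
            switch = queue.popleft()
            for nxt in neighbors[switch]:
                if nxt not in hops:
                    hops[nxt] = hops[switch] + 1
                    queue.append(nxt)
        if len(hops) != len(neighbors):
            raise ValueError("fabric is segmented: some switches are unreachable")
        worst = max(worst, max(hops.values()))
    return worst


cascade = [(i, i + 1) for i in range(7)]   # eight switches in a line
ring = cascade + [(7, 0)]                  # same switches with the ends joined

print(max_hop_count(cascade))              # 7
print(max_hop_count(ring))                 # 4
print(max_hop_count(cascade) <= MAX_HOPS)  # True
```

Under these assumptions an eight-switch cascade sits exactly at the limit, which is one reason to check the figure again whenever a switch is added to the end of a line.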


      Scalability
      One of the properties we discuss for each topology is its ability to scale well.
      There are two metrics for measuring scalability: the size that the topology can
      scale to (in terms of port count and switch count), and the ease of performing
      this process. A number of scalability factors are involved in determining these
      metrics, including the fundamental “geometric” properties of the design, and the
      way that the channel-like fabric services perform in it. Certain topologies, such as
      the resilient core/edge topology, are well suited to taking a “start small and grow
      large” approach without needing much effort to add switches. However, some
      topologies, like the partial mesh, might scale to a large size, but require extensive
      recabling and possibly even downtime to do so. There is nothing fundamentally
      wrong with the latter categories of design, but they might not be well suited for
      dynamic environments that have high uptime requirements.

      Cascade Topology
      A cascaded fabric, illustrated in Figure 7.2, is like a bus network: it is a line of
      switches with one connection between each switch and the switch downstream
      of it. The switches on the ends are not connected.

      Figure 7.2 A Cascaded Fabric




          Cascaded fabrics are very inexpensive, easy to deploy, and easy to expand.
      However, they have the lowest reliability and limited scalability. They are most
      appropriate in situations where most if not all traffic can be localized onto
      individual switches, and the ISLs are used primarily for management traffic.
          There are cascade variations that use more than one ISL between switches.
      This will eliminate ISLs as a single point of failure and greatly increase the
      reliability of the solution. However, this also greatly increases the cost of the
      solution, and each switch can still be a single point of failure. Table 7.1 charts
      the properties of a cascade topology.

Table 7.1 Properties of a Cascade Topology

Limit of scalability (edge port count):   114 ports / 8 switches

Properties                  Ratings
Ease of scalability         3
Performance                 1
Ease of deployment          3
Reliability                 1
Cost (per edge port)        3

Ratings indicate how well the topology meets the ideal requirements of that
property (1 = Not well, 2 = Moderately well, 3 = Very well).
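
The edge port count in Table 7.1 follows from simple arithmetic: in a cascade, the two end switches each give up ports for one neighbor, and every other switch gives up ports for two. A minimal Python sketch of that calculation follows, assuming 16-port switches as elsewhere in this chapter. The function name cascade_edge_ports is hypothetical.

```python
def cascade_edge_ports(switches, ports_per_switch=16, isls_per_link=1):
    """Ports left for hosts and storage in a cascade (a line of switches)."""
    if switches < 1:
        raise ValueError("need at least one switch")
    if switches == 1:
        return ports_per_switch                             # no ISLs at all
    end_switch = ports_per_switch - isls_per_link           # one neighbor
    middle_switch = ports_per_switch - 2 * isls_per_link    # two neighbors
    return 2 * end_switch + (switches - 2) * middle_switch


print(cascade_edge_ports(8))                   # 114
print(cascade_edge_ports(8, isls_per_link=2))  # 100
```

The second call shows the price of the more reliable variation with two ISLs between switches: under the same assumptions, 14 edge ports are given up to the extra links.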


Ring Topology
A ring is like a cascaded fabric, but with the ends connected (Figure 7.3).
Figure 7.3 A Ring Topology




    The ring has superior reliability to the cascade because traffic can route
around an ISL failure or a switch failure. It does cost more than a cascade, but not
significantly more. The ring is usually preferable to the cascade for this reason.
    Rings still do not perform well, or scale well. The ISLs are still very much
over-subscribed and will perform acceptably only when significant localization of
traffic is possible. Of course, you can improve performance by using more than
one ISL per switch interconnect, but (just as in a cascade) this substantially
reduces the scalability and increases the cost of the fabric. To scale the fabric, you
must disconnect at least one ISL, which might be disruptive to a production SAN.
           This design is fine if you plan for your SAN to start small and stay small. It
      can also be used when implementing a SAN over a MAN or WAN, where the
      topology of the MAN/WAN might dictate the topology of the Fibre Channel
      network. (Rings are common MAN/WAN topologies.) Finally, it is a good
      choice when the ISLs are mostly used for management—when performance over
      the ISLs is not a concern—but reliability of the ISL structure is still required, and
      cost is a driving factor. For example, this design might be well suited for an
      enterprise backup SAN with many backup libraries dedicated to server
      groups. (The ISLs do not carry a large volume of traffic, but what
      they do carry is important.) If your SAN performance requirements are low
      (as in many Windows NT environments), the ring architecture might also work
      for you. Table 7.2 charts the properties of a ring topology.

      Table 7.2 Properties of a Ring Topology

      Limit of scalability (edge port count) 112 ports / 8 switches

      Properties                                      Ratings
      Ease of scalability                             2
      Performance                                     1
      Ease of deployment                              3
      Reliability                                     3
      Cost (per edge port)                            3

      Ratings indicate how well the topology meets the ideal requirements of that
      property (1 = Not well, 2 = Moderately well, 3 = Very well).
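
The reliability ratings of the cascade (1) and the ring (3) can be checked with a small graph experiment. The sketch below models each fabric as a set of switch-to-switch links and tests whether the remaining switches can still reach each other after any single ISL or switch failure. It is a simplified model (one ISL per interconnect, no devices attached), and all function names are hypothetical.

```python
from collections import deque


def build_links(switches, ring):
    """Return the ISLs of a cascade, or of a ring when ring=True."""
    links = [(i, i + 1) for i in range(switches - 1)]
    if ring and switches > 2:
        links.append((switches - 1, 0))  # connect the two ends
    return links


def is_connected(nodes, links):
    """Breadth-first search: can every switch in nodes reach every other?"""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        if a in nodes and b in nodes:  # links to a failed switch are ignored
            neighbors[a].add(b)
            neighbors[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbors[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


def survives_any_single_failure(switches, ring):
    """True if the fabric stays whole after any one ISL or switch failure."""
    nodes = range(switches)
    links = build_links(switches, ring)
    for failed in links:  # one ISL down
        if not is_connected(nodes, [l for l in links if l != failed]):
            return False
    for failed in nodes:  # one switch down, and its ISLs go with it
        remaining = [n for n in nodes if n != failed]
        if not is_connected(remaining, links):
            return False
    return True


if __name__ == "__main__":
    print("cascade survives:", survives_any_single_failure(8, ring=False))
    print("ring survives:   ", survives_any_single_failure(8, ring=True))
```

The one extra ISL that closes the ring is what changes the result, which matches the observation that the ring costs only slightly more than the cascade.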

      Mesh Topologies
      Technically, almost any topology can be described as some sort of mesh. Since
      this is not a very useful definition—and above all else, a definition must be
      useful—we will discuss and provide working definitions for two meshes: the full
      mesh and the partial mesh.

Full-Mesh Topology
In a full-mesh topology (Figure 7.4), every switch is connected directly to every
other switch. Using 16-port switches, the largest useful full mesh consists of eight
switches, each of which has nine available ports. This gives you a total of 72
available ports. A ninth switch adds nothing (nine switches with eight available
ports each still total 72), and any switch beyond that actually reduces the number of
available ports. Figure 7.5 shows the maximum size of a full mesh.
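
The port arithmetic behind the full-mesh limit is easy to tabulate. In the sketch below, each switch spends one port on an ISL to every other switch and keeps the rest for edge devices. The function name is illustrative, and single ISLs between each pair of switches are assumed.

```python
def full_mesh_edge_ports(switches, ports_per_switch=16):
    """Edge ports in a full mesh: each switch spends one port per peer."""
    free_per_switch = ports_per_switch - (switches - 1)
    return switches * max(free_per_switch, 0)


if __name__ == "__main__":
    for n in range(2, 13):
        print(f"{n:2d} switches -> {full_mesh_edge_ports(n):3d} edge ports")
```

Running the loop shows the total climbing to 72 at eight switches and falling once the mesh grows past nine.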

Figure 7.4 A Full-Mesh Topology




Figure 7.5 Maximum Size of a Full-Mesh Topology




NOTE
     Scaling a full mesh can require unplugging edge devices. If you have a
     4-switch full mesh (52 edge ports) and you use all the ports with edge
     devices, you will need to unplug one device from each switch in the
     mesh in order to add another switch. Make sure that you plan for this by
     either leaving some ports free on all switches, or making sure that there
     are some devices that can withstand downtime on each switch. Because
     of this, full meshes do not have a high rating for ease of scalability.

           Full meshes are best used when you know that your fabric will not grow
      beyond four or five switches, since the cost of the ISLs becomes prohibitive after
      that. Also, networks that use six or more switches are very good candidates for a
      core/edge design, which will cost less, perform better, and scale much better than
      a full mesh. The good thing about performance on a full mesh is that there will
      never be more than one hop between switches unless there is a failure. However,
      because load sharing across ISLs improves performance more than a low hop
      count does, a full mesh is not a good choice for most performance-critical
      applications. Table 7.3 charts the properties of a full-mesh topology.


      NOTE
           Special cases:
              A 2-switch full mesh is identical to a 2-switch cascade.
              A 3-switch full mesh is identical to a 3-switch ring.




      Table 7.3 Properties of a Full-Mesh Topology

      Limit of scalability (edge port count) 72 ports / 8 switches

      Properties                                     Ratings
      Ease of scalability                            1
      Performance                                    2
      Ease of deployment                             3
      Reliability                                    3
      Cost (per edge port)                           1

      Ratings indicate how well the topology meets the ideal requirements of that
      property (1 = Not well, 2 = Moderately well, 3 = Very well).

      Partial-Mesh Topology
      The common definition for a partial mesh (Figure 7.6) is broad enough to
      encompass almost all SANs that are not full meshes. A partial mesh is defined by
      Brocade as follows: “A partial mesh is similar to a full mesh, but with some of the
      ISLs removed. In most cases, this will be done in a structured pattern (for
      example, each switch will directly connect to its neighbor, and to every other
switch across from it).” While this definition is not in general use outside of
Brocade, it describes a desirable variant on the full mesh.
Figure 7.6 A Partial-Mesh Topology




    The network in Figure 7.6 might be useful if you knew that traffic would
never flow between the two left-hand switches. The network still is fully resilient
to failure, but you are not paying a price premium for an ISL that will not be
used. Partial meshes also scale farther than full meshes. Figure 7.7 shows a partial
mesh that has 176 free ports. (Remember that the largest full mesh has 72 ports.)
Each switch is connected to its neighbor. Two switches are skipped before the
next connection. The worst-case hop count between switches in the event of an
ISL failure is three hops.

Figure 7.7 Maximum Size of a Partial-Mesh Topology

          While these networks can be scaled to produce a large number of edge ports,
      they still have performance characteristics that are less than ideal. None of the
      networks listed thus far will benefit much from FSPF load-sharing capabilities, for
      example. Since bandwidth is more frequently a concern than hop count (Fibre
      Channel latency being effectively a second-order effect in most real-world
      performance calculations), the ability of a topology to load share across ISLs is
      key to its performance.
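
One rough way to compare load-sharing potential is to count the equal-length shortest paths between two switches. The sketch below does this with a breadth-first search, on the assumption that load sharing takes place across routes of equal cost and that every ISL has the same cost. It is a proxy for illustration, not a model of FSPF itself, and the topology builders and switch names are hypothetical.

```python
from collections import deque


def shortest_path_count(links, src, dst):
    """Return (hops, number of distinct shortest paths) from src to dst."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {src: 0}
    ways = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:  # first visit fixes the hop count
                dist[nxt] = dist[node] + 1
                ways[nxt] = ways[node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:  # another equally short route
                ways[nxt] += ways[node]
    return dist.get(dst), ways.get(dst, 0)


def ring(n):
    return [(f"s{i}", f"s{(i + 1) % n}") for i in range(n)]


def full_mesh(n):
    return [(f"s{i}", f"s{j}") for i in range(n) for j in range(i + 1, n)]


def core_edge(edges, cores):
    # Every edge switch has one ISL to every core switch.
    return [(f"e{i}", f"c{j}") for i in range(edges) for j in range(cores)]


if __name__ == "__main__":
    print("ring, neighbors:     ", shortest_path_count(ring(8), "s0", "s1"))
    print("full mesh, any pair: ", shortest_path_count(full_mesh(8), "s0", "s1"))
    print("core/edge, 2 cores:  ", shortest_path_count(core_edge(8, 2), "e0", "e1"))
    print("core/edge, 4 cores:  ", shortest_path_count(core_edge(16, 4), "e0", "e1"))
```

In this model a full mesh always offers exactly one shortest path between any pair of switches, while a core/edge fabric offers one per core switch.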
          Moreover, partial meshes can be difficult to scale without downtime. The
      procedure for moving from the full-mesh fabric in Figure 7.5 to the partial-mesh
      fabric in Figure 7.7 would require not only adding new switches and potentially
      disconnecting nodes, but actually disconnecting ISLs that were already in place.
      The same is true for scaling between many partial-mesh designs. This would be
      disruptive to many production SANs, especially if redundant fabrics were not
      used. Therefore, meshes—either full or partial—are recommended only for
      networks that will change infrequently. They might also be used as a static
      component of a network. For example, a full mesh could be used in an environment
      where the “SAN islands” architecture was employed, or as the core of a complex
      core/edge design (which we discuss later). Table 7.4 charts the properties of a
      partial-mesh topology.

      Table 7.4 Properties of a Partial-Mesh Topology

      Limit of scalability (edge port count)          176+ ports / 16+ switches

      Properties                                      Ratings
      Ease of scalability                             1
      Performance                                     1
      Ease of deployment                              2
      Reliability                                     3
      Cost (per edge port)                            2 to 3

      Ratings indicate how well the topology meets the ideal requirements of that
      property (1 = Not well, 2 = Moderately well, 3 = Very well).

      Core/Edge or Star Topologies
      In a resilient core/edge fabric (Figure 7.8), two or more switches will reside in
      the center of the fabric (the core) and interconnect a number of other switches
      (the edge). Switches that reside in the middle of the fabric are referred to as core
switches. The switches that are interconnected by the core switches are referred
to as edge switches. Devices such as hosts and storage are attached to free ports
on the edge switches. These ports are referred to as edge ports. Free ports (if any)
on the core switches should be reserved for use as ISLs, in order to avoid limiting
the fabric’s growth potential.
Figure 7.8 A Core/Edge or Star Topology (diagram labels: Core, Edge)




    We will focus on the core/edge fabric as being a solution for scalable fabrics
for a number of reasons. The core/edge topology is:
     •   Easy to grow without downtime (“pay as you grow”).
     •   Easy to transition to future large core fabric switches, and good at
         providing investment protection.
     •   Economical, with a good cost-to-performance ratio.
     •   Simple and easy to understand.
     •   Well tested and reliable.
     •   Proven in traditional data networks such as Ethernet.
     •   Capable of exhibiting stellar performance, with full utilization of FSPF
         load sharing and redundancy features.
     •   Conducive to performance analysis. Performance in a core/edge SAN is
         deterministic, and you can easily determine how much bandwidth any
         given switch has to “get to” any other switch.
     •   Scalable to hundreds of ports now, and thousands in the future.
     •   Able to solve most design problems, and a good choice when design
         requirements are not well known.

          Again, we must stress that the simple core/edge SAN is by no means the only
      way to build scalable networks with Brocade switches. There are limitless ways in
      which Brocade switches could be interconnected. We will focus on this particular
      topology in the hopes that it will be beneficial to the majority of designers, and
      we will leave the limitless variations on this and other topologies as exercises for
      the reader.

      Simple Resilient Core/Edge Topology
      A simple resilient core/edge fabric has two or more core elements, each of which
      consists of a single switch. All of the core/edge fabrics depicted thus far are
      simple. Two core elements are recommended to maintain a high level of resilience
      and avoid a single point of failure (Figure 7.9). Table 7.5 charts the properties of a
      simple resilient core/edge topology.
      Figure 7.9 A Simple Resilient Core/Edge Topology
      (Diagram callout: two core elements, each of which is a single switch)




      Table 7.5 Properties of a Simple Resilient Core/Edge Topology

      Limit of scalability (edge port count)          224 ports / 20 switches

      Properties                                      Ratings
      Ease of scalability                             3
      Performance                                     3
      Ease of deployment                              3
      Reliability                                     3
      Cost (per edge port)                            2

      Ratings indicate how well the topology meets the ideal requirements of that
      property (1 = Not well, 2 = Moderately well, 3 = Very well).

      Complex Core Resilient Core/Edge Topology
      A complex core resilient core/edge SAN (Figure 7.10) has two or more core
      elements, each of which consists of multiple switches. These designs are more
complex, more expensive to build and maintain, and are generally not necessary.
However, in certain cases, you might want to expand beyond 16 edge switches,
and large-port-count core switches might not be available. In this case, you could
build a larger fabric out of complex core elements. Table 7.6 charts the properties
of a complex core/edge topology.
Figure 7.10 A Complex Core Resilient Core/Edge Topology




Table 7.6 Properties of a Complex Core Resilient Core/Edge Topology

Limit of scalability (edge port count)        300+ ports / 28+ switches

Properties                                    Ratings
Ease of scalability                           2
Performance                                   2
Ease of deployment                            1
Reliability                                   3
Cost (per edge port)                          1

Ratings indicate how well the topology meets the ideal requirements of that
property (1 = Not well, 2 = Moderately well, 3 = Very well).

Composite Resilient Core/Edge Topology
A composite resilient core/edge SAN has two or more cores (Figure 7.11). Each
core consists of two or more single-switch elements. It could be viewed as two
      stars “glued together at the edge.” This is useful when using a tiered approach to
      performance optimization (we discuss tiered SANs later).
      Figure 7.11 A Composite Resilient Core/Edge Topology
      (Diagram callout: two cores, each of which has two elements, each of which is a single switch)




      Topologies at a Glance
      Table 7.7 provides a compilation of all of the characteristics of the
      topologies listed thus far, so that you can compare them easily.

      Complex Topologies
      It is possible to build arbitrarily complex architectures by connecting switches
      together in a seemingly random way. For example, a ring could be connected to
      the edge of a core/edge fabric, which could also have a cascade and a smaller
      core/edge fabric hanging off it. What would you call this SAN other than
      “complex?” While this approach might be desirable in certain cases, it is usually better
      to use a more structured approach to network design. This will ensure flexibility
      and the maintainability of the network.
           Even without deviating from the “simple” core/edge network—or, indeed,
      from a simple full mesh, or ring—it is possible to provide for variations in perfor-
      mance requirements. Figure 7.21 later in this chapter is an example of an asym-
      metrical core/edge network, which is still geometrically “simple.”

      Working with the Core/Edge Topology
      Building one or more core/edge fabrics is, at the time of this writing, the best
      way to implement a general-purpose scalable SAN. Again, more experienced
      designers should feel free to implement whatever topology they see fit; we are
      simply advocating the core/edge approach because it can solve most problems
      easily, which is, after all, what most users want. This section provides you with
      some tips on how to use your core/edge fabric most effectively.

      Table 7.7 Topology Properties Comparison

      Topology                Cascade      Ring         Full Mesh    Partial Mesh   Simple Core/Edge   Complex Core/Edge
      Limit of scalability    114 ports /  112 ports /  72 ports /   176+ ports /   224 ports /        300+ ports /
      (edge port count)       8 switches   8 switches   8 switches   16+ switches   20 switches        28+ switches

      Properties              Ratings      Ratings      Ratings      Ratings        Ratings            Ratings
      Ease of scalability     3            2            1            1              3                  2
      Performance             1            1            2            1              3                  2
      Ease of deployment      3            3            3            2              3                  1
      Reliability             1            3            3            3              3                  3
      Cost (per edge port)    3            3            1            2 to 3         2                  1

      Ratings indicate how well the topology meets the ideal requirements of that property
      (1 = Not well, 2 = Moderately well, 3 = Very well).
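
If some properties matter more to your site than others, the ratings in Table 7.7 can be combined into a single weighted score. The sketch below is one possible way to do that and is not a method prescribed by this chapter. The weights are hypothetical, a plain weighted sum is an assumption, and the partial-mesh cost rating of "2 to 3" is taken as 2.5.

```python
PROPERTIES = ("scalability", "performance", "deployment", "reliability", "cost")

# Ratings copied from Table 7.7, in the order given by PROPERTIES.
RATINGS = {
    "Cascade": (3, 1, 3, 1, 3),
    "Ring": (2, 1, 3, 3, 3),
    "Full Mesh": (1, 2, 3, 3, 1),
    "Partial Mesh": (1, 1, 2, 3, 2.5),  # "2 to 3" approximated as 2.5
    "Simple Core/Edge": (3, 3, 3, 3, 2),
    "Complex Core/Edge": (2, 2, 1, 3, 1),
}


def rank(weights=None):
    """Sort topologies by weighted rating sum, best first.

    weights maps a property name to a multiplier; properties that are
    not listed get a weight of 1.
    """
    weights = weights or {}
    scores = []
    for name, ratings in RATINGS.items():
        score = sum(weights.get(prop, 1) * value
                    for prop, value in zip(PROPERTIES, ratings))
        scores.append((name, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print("Equal weights:")
    for name, score in rank():
        print(f"  {name:18s} {score}")
    print("Reliability x3, cost x2:")
    for name, score in rank({"reliability": 3, "cost": 2}):
        print(f"  {name:18s} {score}")
```

With equal weights the simple core/edge topology comes out on top, which agrees with the emphasis this chapter places on it.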

      Scaling without Downtime
      You can scale a resilient core/edge network without any downtime. This requires
      a small amount of planning and attention to detail, but it is well worth the effort.
      Here are two examples of how you can scale these networks easily: adding an
      edge switch and upgrading the core.
          The procedures listed here are designed to be applicable in a large, real-world
      production environment, where a fine degree of control over the SAN is desired.
      Consequently, they are much more complex than the plug-and-play approach,
      which would be used in a less-demanding environment. For example, it is
      possible to add an edge switch by simply plugging it in, turning it on, and allowing
      the Brocade Fabric OS’s plug-and-play features to do the configuration for you.
      In larger production environments, however, it is usually desirable to control the
      process manually, to ensure that the new switch not only enters the fabric, but
      also does so in accordance with your site-specific administration policies. For
      example, Fabric OS can automatically pick a domain ID for you. Most large sites
      have a structured approach to domain ID assignment, and in these cases you
      would want to assign the domain ID manually before adding the new switch to
      the fabric. The following manual procedures are not necessary, but are
      recommended in these more tightly controlled environments. We leave it to you to
      decide what level of manual control you wish to take in this process, and how
      much control you want to give to the Fabric OS.


      NOTE
           It is best to use dual fabrics to achieve the highest possible uptime.
           The use of resilient fabrics, like the mesh, ring, or core/edge, is part of a
           fully redundant—or High-Availability (HA)—SAN solution, but not the
           whole of it.




      Adding an Edge Switch
      You might want to implement only part of the edge of your core/edge fabric at
      the beginning, and add the rest of the edge switches later. In other words, you
      can take the “pay-as-you-grow” route. This is easy to do without downtime, as
      long as you do not use the free ports in the core switches for edge devices, but
      rather reserve them for scalability.

    Let us say that your architecture (as shown in Figure 7.12, Step C) is a 10-
switch resilient core/edge fabric. Let us say further that you have only imple-
mented half of the edge (Figure 7.12, Step A), and then decide that you want to
add one more edge switch (Figure 7.12, Step B).
Figure 7.12 Adding an Edge Switch
(Diagram panels: Step A, 4ex2cx2i; Step B, 5ex2cx2i; Step C, 8ex2cx2i)



Step One: Setting Up the New Switch
Set the new switch up by itself in the location where it will attach to the SAN
(bolt it to the rack), but do not attach it to the SAN at this point.


NOTE
     We discuss the process of adding a switch to the fabric in greater detail
     in Chapter 9, “SAN Implementation, Maintenance, and Management.”


    Power on the switch, assign it a host name and an IP address, and then use
either the telnet or Brocade WEB TOOLS interface to configure its domain ID,
switch name, and any special variations of fabric parameters—such as Error-
Detect Time Out Value (E_D_TOV)—that you are using at your site. Ensure that
no zoning configuration is in effect.

Step Two: Connecting the New Switch
Issue the portDisable command on the new switch for each port that you
intend to use as an ISL. For example, if you are using ports 0 through 3 as the
ISL ports, you would execute the following commands:
newswitch:admin> portDisable 0
newswitch:admin> portDisable 1
newswitch:admin> portDisable 2
newswitch:admin> portDisable 3

          Attach cables to these ports and route them to the appropriate core switches.
      For example, ports 0 and 1 could be connected to one core switch, and 2 and 3
      could be connected to the other core switch.
          Re-enable the first port (port 0). At this point, the new switch should join
      the fabric. After the fabric has reconfigured, you can issue the fabricShow com-
      mand to see if it has indeed joined the fabric:
      newswitch:admin> portEnable 0
      newswitch:admin> fabricShow

         Once you have verified that the switch joined the fabric successfully,
      re-enable the remaining ports. While not necessary, this procedure will help ensure
      minimal disruption to your fabric.
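
If you add edge switches often, it can help to generate the Step Two command list as a checklist instead of typing it from memory. The sketch below only builds the text of the commands shown above (portDisable, portEnable, and fabricShow) in the order described. It does not connect to a switch, and the function name is hypothetical. Cabling and checking the fabricShow output remain manual steps.

```python
def edge_switch_join_commands(isl_ports):
    """Return the Step Two commands for a new edge switch, in order."""
    ports = list(isl_ports)
    if not ports:
        raise ValueError("at least one ISL port is required")
    # Disable every future ISL port before any cable is attached.
    commands = [f"portDisable {p}" for p in ports]
    # After cabling, bring up a single ISL so the switch joins the fabric.
    commands.append(f"portEnable {ports[0]}")
    # Confirm that the switch joined before enabling anything else.
    commands.append("fabricShow")
    # Only then re-enable the remaining ISL ports.
    commands += [f"portEnable {p}" for p in ports[1:]]
    return commands


if __name__ == "__main__":
    for line in edge_switch_join_commands(range(4)):
        print(f"newswitch:admin> {line}")
```

Keeping fabricShow between the first portEnable and the rest mirrors the intent of the procedure: verify with one ISL up before adding the others.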

      Upgrading the Core
      As technology progresses, inevitably there will be a “bigger, better, faster” product
      available that you might wish to purchase. It is quite likely that you will want this
      new product to reside in the core of your fabric, as this position will probably
      benefit the most from the features that caused you to buy the product in the first
      place. This approach is common in all areas of networking: when you build a
      LAN in a rapidly growing company, you might use several entry-level Ethernet
      switches as your backbone. When the environment grows, you might replace
      them with workgroup switches, and then perhaps a bladed chassis, but the
      entry-level switches that you used at the start will still be included in the network; they
      will simply not form the core of the network anymore.
           For this example, we will say that you have a fully populated 18-switch
      resilient core/edge fabric (16ex2cx1i), as shown in Figure 7.13.
      Figure 7.13 A Fully Populated 18-Switch Resilient Core/Edge Fabric

    Since the core switches in this design are fully populated, you cannot add any
more edge switches. However, you can change the core to a larger switch without
downtime. This will allow you to add more edge switches and will even give you
the first two new edge switches “for free” in the bargain!

Step One: Adding the New Core Switches
As in the previous example, you should put the new core switches in their correct
physical location, but not attach them to the fabric at this point. You should
configure their IP addresses, domain IDs, and so on, and then clear the zoning
configuration as before. However, you do not need to disable the ISL ports
one at a time. Instead, use the switchDisable command to disable all ports on
the new switch at once. Again, this is not necessary, and you can allow Fabric OS to do the
work for you if your site does not require such fine control.
    Telnet into the first of the two existing core switches that you want to replace.
Issue the switchDisable command:
oldswitch:admin> switchDisable

    Verify that all traffic is going over the remaining core switch. After you are
sure that the SAN is functioning correctly on its one remaining core, you can
power off the disabled core. Remove the ISL cables from the switch and the rack
cable management structure. The switch should now be completely separated
from the SAN (Figure 7.14). Cable the first new core switch—the one that you
have configured and disabled—to one of the edge switches.
Figure 7.14 Removing the First Core Switch




Step Two: Configuring the New Core Switches
Issue the switchEnable command. As before, the new switch should join the
fabric. You might see a fabric reconfigure message on the telnet session. After the
fabric has reconfigured, you can issue the fabricShow command:

      newswitch:admin> switchEnable
      newswitch:admin> fabricShow

        One at a time, cable in the remaining edge switches. When you are done, the
      SAN should look like Figure 7.15.

      Figure 7.15 Adding the New Core Switch




        Repeat the procedure with the other new core switch, and you will have the
      SAN shown in Figure 7.16.
      Figure 7.16 Newly Installed Core Switches




          You also have two extra switches (the former core switches that are sitting off
      to the side disconnected). Follow the procedure for adding an edge switch to
      bring these back into the fabric (Figure 7.17). This set of procedures allows you
      to future-proof your SAN, because it gives you an architecture that has a
      migration path to future technologies already built into it.

Figure 7.17 Redeploying the Former Core Switches
(Diagram callouts: install core fabric switches; redeploy to the edge)




   Target Designs for a Core/Edge SAN
   If you have a rough idea of your port count and performance require-
   ments, but do not want to spend much time designing an optimized
   SAN architecture, you can pick one of the following four designs and
   “be done with it.” You do not have to build the entire design at once.
   You can build toward the target design, and “pay as you grow.”
         Design One (the 10-switch core/edge), Design Two (the 20-switch
   core/edge), and Design Three (the 18-switch core/edge) are “pure”
   designs. They are simple, symmetrical, and easy to understand and
   deploy. You can “start small and grow large” with them. Design Four
   (the 14-switch core/edge) is a combination of Designs One and Three.
   The approach used in Design Four is less symmetrical, but might be
   desirable if parts of your SAN have different performance requirements,
   or if you are expanding an existing SAN.
   Design One
   The 10-switch core/edge design, shown as Design One in Figure 7.18, is
   ideal for SANs that require high performance, have little known locality,
   and will not immediately grow beyond 96 edge devices.

   Identification                8 edge by 2 core by 2 ISL (8ex2cx2i)
   Edge ports                   96
   ISL over-subscription        3:1
   Switch count                 10



          Figure 7.18 Design One (10-Switch Core/Edge)




         Design Two
         Design Two, the 20-switch core/edge design shown in Figure 7.19, is
         very similar to Design One. It uses twice as many switches, produces
         twice as many ports, and performs exactly the same way. This design is
         appropriate for SANs that require high performance, have little known
         locality, and will not immediately grow beyond 192 edge devices.

         Identification                16 edge by 4 core by 1 ISL (16ex4cx1i)
         Edge ports                   192
         ISL over-subscription        3:1
         Switch count                 20


          Figure 7.19 Design Two (20-Switch Core/Edge)





Design Three
Design Three, the 18-switch core/edge design shown in Figure 7.20,
produces more ports, and takes fewer switches to implement than the
previous design. It is substantially more cost-effective. The tradeoff is
that the performance is lower, so it is appropriate if high performance is
not critical, or if some locality is known.

Identification             16 edge by 2 core by 1 ISL (16ex2cx1i)
Edge ports                224
ISL over-subscription     7:1
Switch count              18

Figure 7.20 Design Three (18-Switch Core/Edge)




Design Four
Design Four, the 14-switch core/edge design shown in Figure 7.21,
balances the higher performance of Design One with the lower cost and
higher port count of Design Three. This approach is appropriate if you
have some devices that require high performance, and some that either
do not require that degree of performance or have known locality. Many
variants are possible that use this approach, and it is possible to yield
anywhere between 96 ports and 224 ports.

Identification           12 edge by 2 core by [1 or 2] ISL
                        (12ex2cx[1/2]i)
Edge ports              160
ISL over-subscription   3:1/7:1
Switch count            14


          Figure 7.21 Design Four (14-Switch Core/Edge)
          (Diagram callout: like the SilkWorm 6400)




               It is also useful to take this approach when expanding a SilkWorm
         6400 Integrated Fabric. The internal ISL topology of the SilkWorm 6400
         is like an incomplete Design One SAN. You can complete the SAN exactly
         like Design One by attaching four switches to the SilkWorm 6400’s outer
         modules (modules 1 and 6), attaching eight switches to get Design Four,
         or attaching some switches like Design One and some like Design Four.
               These four designs will solve most SAN design problems and will all
         migrate easily to incorporate high-port-count core fabric switches in the
         future.
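
The identifiers used for these designs (for example, 8ex2cx2i) carry enough information to recompute the edge port count, ISL over-subscription, and switch count listed for each one. The sketch below parses that notation and redoes the arithmetic, assuming 16-port switches throughout and taking over-subscription to be the ratio of edge ports to ISL ports on one edge switch. The class and property names are hypothetical.

```python
import re
from dataclasses import dataclass

NOTATION = re.compile(r"^(\d+)ex(\d+)cx(\d+)i$")


@dataclass(frozen=True)
class CoreEdgeDesign:
    edges: int  # number of edge switches
    cores: int  # number of core switches
    isls: int   # ISLs from each edge switch to each core switch
    ports_per_switch: int = 16

    @classmethod
    def parse(cls, text, ports_per_switch=16):
        match = NOTATION.match(text)
        if match is None:
            raise ValueError(f"not an <n>ex<n>cx<n>i identifier: {text!r}")
        edges, cores, isls = (int(g) for g in match.groups())
        return cls(edges, cores, isls, ports_per_switch)

    @property
    def isl_ports_per_edge(self):
        return self.cores * self.isls

    @property
    def edge_ports(self):
        return self.edges * (self.ports_per_switch - self.isl_ports_per_edge)

    @property
    def oversubscription(self):
        """Edge ports per ISL port on one edge switch (3.0 means 3:1)."""
        free = self.ports_per_switch - self.isl_ports_per_edge
        return free / self.isl_ports_per_edge

    @property
    def switch_count(self):
        return self.edges + self.cores

    @property
    def core_ports_used(self):
        """ISL ports consumed on each core switch."""
        return self.edges * self.isls


if __name__ == "__main__":
    for ident in ("8ex2cx2i", "16ex4cx1i", "16ex2cx1i"):
        d = CoreEdgeDesign.parse(ident)
        print(f"{ident}: {d.edge_ports} edge ports, "
              f"{d.oversubscription:g}:1, {d.switch_count} switches, "
              f"{d.core_ports_used} ports used per core")

    # Design Four mixes both styles. One split that fills two 16-port cores
    # (an assumption, the sidebar does not spell it out): 4 edge switches
    # with 2 ISLs per core and 8 edge switches with 1 ISL per core.
    mixed = 4 * (16 - 2 * 2) + 8 * (16 - 2 * 1)
    print("Design Four edge ports:", mixed)
```

The core_ports_used value is a quick check that a design fits its core switches: Designs One, Two, and Three each consume all 16 ports on every core.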



      Determining Levels of Availability
      Resiliency is the capability of a fabric topology to withstand failures. This is
      equivalent to the fault-tolerant level of availability seen in many high-uptime RAID
      arrays or enterprise-class servers. The core/edge, mesh, and ring topologies
      provide at least two internal fabric routes and are considered resilient because each
      topology can withstand a switch or ISL failure while the remaining switches
      remain operational. Thus, the fabric can “heal” an “injury” without any operator
      intervention. This self-healing capability is enabled by the Brocade-authored
      FSPF protocol. (FSPF is now the standard Fibre Channel routing protocol. Visit
      www.t11.com for details on this and other Fibre Channel standards.)
           Redundancy is the duplication of components up to and including the entire
      fabric to prevent the failure of the SAN solution. For example, an airplane
      hydraulic system is resilient to failures. However, most jumbo jets also have
      redundant hydraulic systems so that the jet will not crash even if the resiliency
      fails to keep the system up. Redundancy is equivalent to the HA model popular
with critical server farms using products like VERITAS Cluster Server or
Microsoft Cluster Server (MSCS). In a redundant SAN architecture, you must
have at least two completely separate fabrics—just as an HA server solution
requires at least two completely separate servers. This type of design not only
accounts for software and hardware failures, but also for the human errors that
can be the more common cause of system downtime. As the fabrics are indepen-
dent, an incorrect operator-initiated change on one fabric (such as disabling an
entire switch instead of just a port) would not affect the redundant SAN. Since
the other SAN has a separate connection to the devices that have been discon-
nected from the first fabric, the services performed by those devices can continue.
Because of this, even a larger (director-class) single switch with highly available
characteristics is still a single point of failure, and should be used with caution in
HA environments.
    There are four primary categories of availability in SAN architecture. In order
of increasing availability, they are:
     •   Single fabric, nonresilient: All switches are connected to form a
         single fabric, which contains at least one single point of failure. The
         cascade topology is an example of this category of SAN.
     •   Single fabric, resilient: All switches are connected to form a single
         fabric, but there is no single point of failure that could cause the fabric
         to segment. The ring, mesh, and core/edge topologies are examples of
         single, resilient fabrics.
     •   Dual fabric, nonresilient: Half of the switches are connected to form
         one fabric, and the other half form an identical, separate fabric. Within
         each fabric, at least one single point of failure exists. This design can be
         used in combination with dual-attached hosts and storage devices to
         keep a solution running even if one fabric fails.
     •   Dual fabric, resilient: Half of the switches are connected to form one
         fabric, and the other half form an identical, separate fabric. Neither
         fabric has a single point of failure that could cause the fabric to segment.
         This design can be used in combination with dual-attached hosts and
         storage devices to keep an application running even if one entire fabric
         fails due to operator error, catastrophe, or quality issues. This is generally
         the best design approach for HA requirements. Another key benefit of a
         dual-SAN design is the capability to take half of the dual fabric offline
         for upgrades or maintenance without affecting production operations on
         the remaining fabric.
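The four categories above can be checked mechanically for a small fabric. The following is a minimal sketch, not a Brocade tool: it models each fabric as a list of switch names plus a list of ISLs (pairs of switch names), and treats a fabric as resilient when no single switch or ISL failure splits the surviving switches into separate fabrics. The function names and the example topologies are illustrative assumptions.

```python
def is_connected(switches, isls):
    """Return True if every switch can reach every other switch over the ISLs."""
    switches = set(switches)
    if not switches:
        return True
    adjacency = {s: set() for s in switches}
    for a, b in isls:
        # Ignore ISLs that touch a switch that is no longer present.
        if a in switches and b in switches:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(switches))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return seen == switches


def is_resilient(switches, isls):
    """Treat a fabric as resilient if no single switch or ISL failure segments it."""
    if not is_connected(switches, isls):
        return False
    # Fail each switch in turn; the survivors must still form one fabric.
    for failed in switches:
        survivors = [s for s in switches if s != failed]
        if not is_connected(survivors, isls):
            return False
    # Fail each ISL in turn.
    for i in range(len(isls)):
        if not is_connected(switches, isls[:i] + isls[i + 1:]):
            return False
    return True


def availability_category(fabrics):
    """fabrics: list of (switches, isls) tuples, one per independent fabric."""
    count = "Dual" if len(fabrics) >= 2 else "Single"
    resilient = all(is_resilient(sw, isls) for sw, isls in fabrics)
    return f"{count} fabric, {'resilient' if resilient else 'nonresilient'}"


# Hypothetical four-switch examples.
cascade = (["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")])
ring = (["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])

print(availability_category([cascade]))
print(availability_category([ring, ring]))
```

The check looks only at segmentation of the fabric. A device attached to a single failed switch still loses its connection, which is why the dual-fabric categories depend on dual-attached hosts and storage.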


         Figure 7.22 depicts the failure of a switch in a cascade topology. Switches A
      and B are unable to communicate with the remaining switches when the switch
      marked with the “X” fails, resulting in segmentation of the fabric into three
      smaller fabrics.

      Figure 7.22 Switch Failure in a Nonredundant Cascade Topology
      [Diagram labels: switches A, B, C, D, and E; Segmentation]




            This fabric is neither resilient nor redundant. The uptime of this fabric
      solution could be improved by duplicating the fabric, as in Figure 7.23. In this
      example, even though one of the fabrics segmented, all devices are attached to
      another SAN that is still working.
            A switch failure in a ring, mesh, or core/edge topology SAN does not cause a
      loss of communication with the remaining switches, as shown in Figure 7.24 for
      the core/edge topology. If one core switch fails, the alternative core switch can
      still communicate with all edge switches. This will allow the SAN to continue to
      operate. The failover to alternate paths is performed by FSPF and is completely
      transparent to the users of the fabric. These topologies are therefore considered
      resilient.
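The core switch failure described above can be illustrated with a small path calculation. This is only a sketch of the idea, under the simplifying assumption that every ISL has the same cost, so the best route is the one with the fewest hops. It is not an implementation of FSPF, and the switch names and helper functions are hypothetical.

```python
from collections import deque


def build_core_edge(cores, edges):
    """Connect every edge switch to every core switch with one ISL."""
    adjacency = {s: set() for s in cores + edges}
    for edge in edges:
        for core in cores:
            adjacency[edge].add(core)
            adjacency[core].add(edge)
    return adjacency


def fail_switch(adjacency, failed):
    """Return a copy of the fabric with one switch (and its ISLs) removed."""
    return {s: n - {failed} for s, n in adjacency.items() if s != failed}


def shortest_paths(adjacency, source, target):
    """Return (hop_count, number_of_equal_cost_paths), or (None, 0) if unreachable."""
    if source not in adjacency or target not in adjacency:
        return None, 0
    hops = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                paths[neighbor] = paths[node]
                queue.append(neighbor)
            elif hops[neighbor] == hops[node] + 1:
                # Another route of the same length reaches this switch.
                paths[neighbor] += paths[node]
    if target not in hops:
        return None, 0
    return hops[target], paths[target]


fabric = build_core_edge(["core1", "core2"], ["edge1", "edge2", "edge3", "edge4"])
print(shortest_paths(fabric, "edge1", "edge4"))

degraded = fail_switch(fabric, "core1")
print(shortest_paths(degraded, "edge1", "edge4"))
```

With both cores up, two equal-length routes exist between any pair of edge switches. With one core down, one route of the same length remains, so the edge switches can still communicate.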
            In order to ensure maximum uptime, you should use both resilience and
      redundancy. Figure 7.25 shows a resilient and redundant SAN solution.
            Redundant fabrics are desirable because they provide the most effective pro-
      tection against hardware and software failures, and protect against user error. For
      example, if a SAN administrator were to telnet into SAN A and accidentally
      damage the zoning configuration—making Fabric A unusable—SAN B would
      not be affected.


Figure 7.23 Switch Failure in a Redundant Cascade Topology
[Diagram labels: Devices; switches A, B, C, D, and E]

Figure 7.24 Switch Failure in a Resilient Ring, Mesh, or Core/Edge Topology
[Diagram labels: Failed Core Switch; Alternate Path]


      Figure 7.25 A Resilient and Redundant SAN Solution
      [Diagram labels: Hosts; SAN A; SAN B; Storage]




      NOTE
           Whenever you have a mission-critical application that requires the
           highest uptime possible, you should use resilient/redundant fabrics to
           build the SAN infrastructure, in conjunction with other HA software
           and components.



Configuring Traffic Patterns
There are a number of approaches possible for optimizing traffic patterns within
a fabric, or an overall SAN. This section discusses two approaches: the tiered
approach and the localized approach.


Leveraging Tiers
In many SANs today, traffic predictably flows between hosts and storage devices,
not between hosts and other hosts, or storage devices and other storage devices.
If your SAN is used primarily to support hosts accessing storage, you can opti-
mize traffic flow in your SAN by following the tiered approach described in
this section.


NOTE
     The trend in leveraging tiers might change as IP over Fibre Channel,
     third-party copy, peer-to-peer copy, and cluster technologies (VI) are
     more widely adopted. These technologies require host-to-host or
     storage-to-storage data flow. However, these new technologies will
     represent a relatively small fraction of the traffic in a SAN, so even if
     these technologies are widely adopted, the tiered approach will still
     be valid.



    For example, in Figure 7.26, you have a 4-switch full mesh with hosts
attached to two of the switches and storage attached to the others. Since we can
assume that the only traffic patterns we will see with any regularity are between
hosts and storage, we can change the ISL topology as seen in Figure 7.27.


      Figure 7.26 Analyzing Traffic Flow in a Full Mesh
      [Diagram labels: Hosts; Storage; Traffic; No Traffic]




          Removing the unused ISLs increases the number of available ports without
      affecting resiliency. This improves the scalability and the cost/benefit ratio of the
      SAN, with very little effort.


Figure 7.27 Optimizing Traffic Flow by Using a Two-Tier Approach
[Diagram labels: Hosts; Storage; Traffic]




    In the example in Figure 7.28, the switches on the right are the host tier
switches and the switches on the left comprise the storage tier. This is a tiered SAN
because of the way it is being used, not because of its topology. The initial
topology was a full mesh, and the modified topology is a ring. It can also be
viewed as a two-tier SAN from a performance standpoint.
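The port savings from moving the four-switch full mesh to the two-tier ring can be counted directly. The sketch below assumes one ISL between each pair of connected switches and uses hypothetical switch names; each ISL consumes one port on each of the two switches it joins.

```python
from itertools import combinations


def full_mesh_isls(switches):
    """One ISL between every pair of switches."""
    return list(combinations(switches, 2))


def remove_same_tier_isls(isls, tier_of):
    """Keep only ISLs that join a host-tier switch to a storage-tier switch."""
    return [(a, b) for a, b in isls if tier_of[a] != tier_of[b]]


# Hypothetical names: two host-tier switches and two storage-tier switches.
tier_of = {
    "host_sw1": "host",
    "host_sw2": "host",
    "stor_sw1": "storage",
    "stor_sw2": "storage",
}

mesh = full_mesh_isls(list(tier_of))
tiered = remove_same_tier_isls(mesh, tier_of)

# Every ISL removed frees one port on each end.
freed_ports = 2 * (len(mesh) - len(tiered))

print(len(mesh), "ISLs in the full mesh")
print(len(tiered), "ISLs after removing same-tier links")
print(freed_ports, "ports freed for hosts or storage")
```

Each switch keeps two ISLs after the change, which is what makes the remaining topology a ring and leaves an alternate route if one ISL or switch fails.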


      Figure 7.28 Storage Tier and Host Tier in a Two-Tier SAN
      [Diagram labels: Hosts; Host Tier; Storage Tier; Storage]




      NOTE
           Performance-related terms such as tiered SAN or localized traffic do not
           refer to a topology. However, the use of these performance techniques
           might influence the choice of topology for a SAN.


    The benefit of tiered SANs is that they do not need ISLs or bandwidth
optimization between switches on the same tier. Figure 7.29 is an example of a
three-tier SAN. In addition to a host and storage tier, it also has a core tier. This
SAN is a variation on the core/edge architecture that exploits the performance
characteristics of tiered SANs while maintaining the advantages of the core/edge
SAN. It is actually a composite of two 18-switch resilient core/edge SAN
components. Tiered SANs are effective for managing your SAN. When you need
to add hosts, you also add a host switch. When you need to add storage, you add
a storage switch if ports are not available. This organization makes it easier to
manage your SAN and simplifies your job as a SAN administrator.

Figure 7.29 A Three-Tier SAN
[Diagram labels: Hosts; Hosts; Core Tier; Storage]


      Exploiting Locality
      Locality is another performance optimization strategy. It is the opposite of a
      tiered approach and will always outperform the tiered SAN. However, it also
      takes more planning.
          You can attain the best performance in any network design by understanding
      the traffic patterns the network will transport. If these patterns are well
      understood, it might be possible to localize traffic by putting ports that need to
      communicate with each other close together. This concept is known as locality and
      is used not just in SAN design, but also throughout computer science.
          The amount of known locality in a SAN will combine with application
      performance requirements to determine the number of ISLs required to achieve the
      SAN design objectives. This in turn will affect the number of edge ports and the
      cost of the SAN infrastructure.
          In Figure 7.30, the servers attached to Switch A are using storage also
      attached to Switch A. The servers attached to Switch B are using storage also
      attached to Switch B. No data should cross the ISL connecting the two switches.
      In the SAN shown, traffic has been 100 percent localized.

      Figure 7.30 Localizing Traffic 100 Percent
      [Diagram labels: Switch A; Switch B; ISL; Traffic Flow]


    Sometimes it is impossible to determine anything at all about locality during
the SAN design phase. In the case of Figure 7.31, locality is at zero percent. The
single ISL in Figure 7.31 is over-subscribed, because many more devices could be
trying to use it than it can simultaneously support. Over-subscription is not
necessarily a bad thing and does not always affect SAN performance. Only
congestion (the realization of the potential of over-subscription) can affect
performance.
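The distinction between over-subscription and congestion can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 15-port example, and the offered loads are assumptions, and all speeds are expressed in MB/sec with 100 MB/sec per port.

```python
def oversubscription_ratio(device_ports, isl_count, port_speed=100, isl_speed=100):
    """Potential demand on the ISLs divided by what the ISLs can carry."""
    return (device_ports * port_speed) / (isl_count * isl_speed)


def is_congested(offered_load, isl_count, isl_speed=100):
    """Congestion exists only when actual demand exceeds ISL capacity."""
    return offered_load > isl_count * isl_speed


# Hypothetical case: 15 device ports on a switch share a single ISL.
ratio = oversubscription_ratio(device_ports=15, isl_count=1)
print(f"over-subscription ratio: {ratio:.0f}:1")

# Fifteen hosts averaging 4 MB/sec each offer 60 MB/sec to the ISL.
print("congested at 60 MB/sec: ", is_congested(60, isl_count=1))

# If the same hosts burst to 10 MB/sec each, the ISL cannot keep up.
print("congested at 150 MB/sec:", is_congested(150, isl_count=1))
```

The ratio describes only what could happen if every device transmitted at full speed at once. Whether the link is actually congested depends on the load the devices really offer.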

Figure 7.31 Localizing Traffic Zero Percent
[Diagram labels: Switch A; Switch B; Traffic Flow]




    Frequently, locality knowledge will be neither zero percent nor 100 percent.
For example, it might be possible to localize all traffic between a group of hosts
and a RAID array, but the hosts might be sharing access to a tape device that
needs to be located on a different switch (Figure 7.32).
    While smaller SANs will not benefit as greatly from the use of locality as large
SANs will, all SANs will benefit somewhat. The use of locality will reduce cost,
increase scalability, increase reliability, and improve performance. It is always worth
making some effort to use locality in a SAN design. However, for low-bandwidth
applications, the management benefits of organizing your edge devices in a tiered
fashion are significant, and zero-percent locality can be quite acceptable.
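The effect of locality on ISL requirements can be estimated with a simple model. This sketch assumes that traffic which is not localized must cross the ISLs and that each ISL carries 100 MB/sec; the total traffic figure is hypothetical, and the function names are illustrative.

```python
import math


def isl_demand(total_traffic, locality):
    """Traffic (MB/sec) that must cross ISLs, given the localized fraction."""
    if not 0.0 <= locality <= 1.0:
        raise ValueError("locality must be between 0.0 and 1.0")
    return total_traffic * (1.0 - locality)


def isls_needed(total_traffic, locality, isl_speed=100):
    """Smallest number of ISLs whose combined speed covers the demand."""
    return math.ceil(isl_demand(total_traffic, locality) / isl_speed)


# Hypothetical switch whose attached devices generate 400 MB/sec in total.
for locality in (0.0, 0.5, 0.9):
    print(f"locality {locality:.0%}: "
          f"{isl_demand(400, locality):.0f} MB/sec across ISLs, "
          f"{isls_needed(400, locality)} ISL(s) needed for bandwidth")
```

At 100 percent locality the bandwidth calculation calls for no ISLs at all, but at least one ISL (two for resilience) is still needed to keep the switches in one fabric.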


      Figure 7.32 Using Locality in a SAN Design
      [Diagram labels: 90% of traffic; 10% of traffic]




      Using Any-to-Any Connectivity
      A well-designed SAN should employ locality to minimize congestion as much as
      possible. However, congestion usually occurs only in unique traffic conditions,
      such as the following:
           •   The SAN application is extremely bandwidth-intensive. For
               example, certain video applications use a large I/O size, on the order
               of 64 KB or larger, and typically consume 80 MB/sec to 100 MB/sec of
               bandwidth. More common are low-bandwidth applications such as
               Online Transaction Processing (OLTP) or e-commerce, where the
               typical I/O size is approximately 2 KB to 8 KB, and bandwidth
               consumption peaks at 16 MB/sec.
           •   The majority of I/O is streaming, as opposed to bursty at peak
               utilization. In a SAN environment, it is unlikely that all devices will
               concurrently require peak utilization, since most SAN traffic is bursty
               and in this respect similar to LAN traffic.
           •   The majority of SAN devices support 100 MB/sec or 200
               MB/sec throughput. However, most current storage and server
               devices are incapable of sustaining 100 MB/sec or 200 MB/sec
               throughput. For example, some of the fastest tape drives can deliver a
               maximum native throughput of only 14 MB/sec. In many cases, network
               congestion is only theoretically possible. Some newer SAN deployments
               might end up with a larger mix of higher throughput devices than
               existing SANs that have evolved gradually over time. Even when the
               hardware is capable of maximizing the HBA’s theoretical bandwidth, it is
               rare to have an application that needs to do so on a sustained basis. Due
               to the typical capabilities of the SAN edge devices and normal traffic
               patterns, congestion rarely occurs in most SANs.
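The bandwidth figures in the list above follow from I/O size multiplied by I/O rate. The sketch below shows that arithmetic. The I/O sizes come from the text, but the I/O rates are hypothetical values chosen for illustration, and 1 MB is treated as 1,000 KB.

```python
def bandwidth_mb_per_sec(io_size_kb, ios_per_sec):
    """Approximate bandwidth consumed by a workload, in MB/sec."""
    return io_size_kb * ios_per_sec / 1000.0


# Hypothetical OLTP workload: 8 KB I/Os at 2,000 I/Os per second.
oltp = bandwidth_mb_per_sec(io_size_kb=8, ios_per_sec=2000)

# Hypothetical video workload: 64 KB I/Os at 1,500 I/Os per second.
video = bandwidth_mb_per_sec(io_size_kb=64, ios_per_sec=1500)

print(f"OLTP:  {oltp:.0f} MB/sec")
print(f"Video: {video:.0f} MB/sec")
```

Even at a high transaction rate, small I/Os consume only a fraction of a 100 MB/sec link, while a large-block streaming workload can approach the link speed on its own.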
     If you cannot determine locality or host/storage locations in advance, or simply
feel that your SAN does not have much potential for congestion and you do not
need to use locality, you can plan for any-to-any connectivity. This could also be
necessary if you are using a clustering application or a data replication application
that requires hosts to talk to other hosts and storage to talk to other storage.
     The core/edge topology is the preferred way to build any-to-any connectivity
SANs, since its symmetrical and deterministic performance characteristics
are well suited to that role. It might be desirable to use a “thicker” ISL structure
for an any-to-any SAN. This provides a lower degree of over-subscription. In a case
where traffic patterns are totally unknown, high levels of over-subscription are
likely to turn into congestion. The core/edge designs that are best suited for this
are the 8ex2cx2i and 16ex4cx1i networks. Each of these SANs provides for a 3:1
over-subscription ratio, which is usually perfectly acceptable.
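The 3:1 figure can be reproduced from the design names. The sketch below rests on two assumptions that are not stated in this section: that a name such as 8ex2cx2i means 8 edge switches, 2 core switches, and 2 ISLs from each edge switch to each core switch, and that every switch has 16 ports running at the same speed. The function names are illustrative.

```python
import re


def parse_design(name):
    """Split a name like '8ex2cx2i' into (edges, cores, isls_per_core)."""
    match = re.fullmatch(r"(\d+)ex(\d+)cx(\d+)i", name)
    if match is None:
        raise ValueError(f"unrecognized design name: {name!r}")
    edges, cores, isls_per_core = (int(g) for g in match.groups())
    return edges, cores, isls_per_core


def edge_oversubscription(name, ports_per_switch=16):
    """Return (device ports per ISL port, total device ports) for a design."""
    edges, cores, isls_per_core = parse_design(name)
    isl_ports = cores * isls_per_core          # ISL ports used on each edge switch
    device_ports = ports_per_switch - isl_ports
    if device_ports <= 0:
        raise ValueError("no ports left for devices on the edge switches")
    if edges * isls_per_core > ports_per_switch:
        raise ValueError("core switches do not have enough ports for this design")
    return device_ports / isl_ports, edges * device_ports


for design in ("8ex2cx2i", "16ex4cx1i"):
    ratio, total_ports = edge_oversubscription(design)
    print(f"{design}: {ratio:.0f}:1 over-subscription, {total_ports} device ports")
```

Under these assumptions both designs leave 12 device ports and 4 ISL ports on each edge switch, which gives the same 3:1 ratio while the larger design offers twice as many device ports.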

Evaluating Performance Considerations
Building a fabric with very low ISL over-subscription can eliminate performance
constraints within the fabric. However, this might not be desirable. If the devices
attached to the fabric are more tightly limited than the ISL topology of the fabric
(that is, they cannot generate enough traffic to fill the ISLs already in place),
any extra ISLs will add nothing to the solution but cost. This section describes
that principle, and will help you evaluate how much performance you need to build
into your fabric.


      When Is Over-Subscription Bad?
      Since over-subscription is only a potential for link contention, it is never a
      problem in and of itself. In fact, over-subscription is the normal state in almost all
      networks in use today. For example, over-subscription is deliberately designed
      into the Internet as a way of reducing cost.
           Congestion, which is the realization of the potential of over-subscription, is
      usually only a problem when the congestion is sustained. If you have a link that
      becomes congested for a total of five minutes per day and is underutilized the
      rest of the time, there is not really a justification for adding another link. If,
      however, the link is congested half of the day, you should re-evaluate the SAN’s
      performance optimization strategy. The crossover point is usually this: If the
      congestion is significantly affecting your application’s performance, you need to
      eliminate or at least reduce it. Fabric OS provides visibility into the fabric’s
      performance by providing accurate performance metrics for ports and switches,
      as well as methods for setting thresholds for proactive notification of
      over-utilization. Refer to the Fabric Watch manual for information on thresholds
      and the Fabric OS manual on general performance measurements.
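The difference between brief and sustained congestion can be expressed as the fraction of time a link spends at or near full utilization. The sketch below is not a Fabric Watch or Fabric OS interface. It assumes you have already collected one utilization sample per minute by some means, and the 95 percent and 25 percent thresholds are hypothetical values that you would tune for your own applications.

```python
def congested_fraction(samples, threshold=0.95):
    """Fraction of utilization samples at or above the congestion threshold."""
    if not samples:
        return 0.0
    congested = sum(1 for utilization in samples if utilization >= threshold)
    return congested / len(samples)


def needs_attention(samples, threshold=0.95, sustained_fraction=0.25):
    """Flag a link only when congestion is sustained, not merely occasional."""
    return congested_fraction(samples, threshold) >= sustained_fraction


MINUTES_PER_DAY = 24 * 60

# Link congested for five minutes per day and lightly used otherwise.
brief = [1.0] * 5 + [0.3] * (MINUTES_PER_DAY - 5)

# Link congested for half of the day.
sustained = [1.0] * (MINUTES_PER_DAY // 2) + [0.3] * (MINUTES_PER_DAY // 2)

print("five minutes per day:", needs_attention(brief))
print("half of the day:     ", needs_attention(sustained))
```

The measure that finally matters is application performance, so a check like this is only a prompt to look more closely at the affected link.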

      Considerations Outside the Fabric
      The fabric itself can affect performance by being congested or having many
      long-distance (for example, 50 to 100 km) and relatively latency-heavy links.
      Again, we must stress that “latency-heavy” is relative. Even such a long-distance
      link is typically at least one order of magnitude faster than the storage devices
      that it serves. Moreover, the higher degree of latency on these links is caused by
      the speed with which light travels through the glass fiber-optic cable; since there
      is as yet no way to make a signal travel faster than light, this latency must be
      considered acceptable. These considerations can create bottlenecks in a SAN, which
      will limit the performance of applications using it. However, much more
      frequently, devices outside the fabric create the bottleneck. Some things that do
      so include:
           •   CPU speed
           •   PCI bus speed
           •   Resource sharing
           •   Application I/O profile
           •   File system block access profile


    See Chapter 5, “The SAN Design Process,” for a detailed discussion of these
items. In short, if you know that you have performance limits outside the fabric,
there is no reason to design the fabric itself to avoid over-subscription. If the
devices attached to the fabric cannot or do not generate 1 Gbit/sec of sustained
traffic (they very rarely do), it is not necessary for the SAN to support that. The
built-in over-subscription will never become congestion in this case. It will
simply save you money, since you will not have to use as many ports for ISLs. Of
course, if you find subsequently that there is a need for more bandwidth, it is a
simple matter of connecting more ports to achieve additional bandwidth.
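Two points from this section lend themselves to quick arithmetic: the propagation delay of a long-distance link, and the way the slowest component sets the ceiling for an entire data path. The sketch below is illustrative. The refractive index of 1.468 is a typical value for glass fiber and is an assumption, as are the component names and every throughput figure other than the 14 MB/sec tape drive mentioned earlier in the chapter.

```python
SPEED_OF_LIGHT_KM_PER_SEC = 299_792.458


def one_way_propagation_ms(distance_km, refractive_index=1.468):
    """Time for light to traverse a fiber of the given length, in milliseconds."""
    seconds = distance_km * refractive_index / SPEED_OF_LIGHT_KM_PER_SEC
    return seconds * 1000.0


def end_to_end_limit(**limits_mb_per_sec):
    """Return (component, throughput) for the slowest component in a data path."""
    name = min(limits_mb_per_sec, key=limits_mb_per_sec.get)
    return name, limits_mb_per_sec[name]


for km in (50, 100):
    print(f"{km} km link: about {one_way_propagation_ms(km):.2f} ms one way")

# Hypothetical backup path: all values in MB/sec.
component, limit = end_to_end_limit(hba=100, pci_bus=80, isl=100, tape_drive=14)
print(f"bottleneck: {component} at {limit} MB/sec")
```

In the backup example the ISL could be shared by several such streams before it became the limiting factor, which is the reasoning behind accepting over-subscription in the fabric.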



      Summary
      A fabric consists of one or more interconnected Fibre Channel switches. A SAN
      includes one or more related fabrics and everything attached to them. Some
      fabric topologies are better suited for general purpose use—such as the core/edge
      topology—while other topologies—such as the cascade, ring, full mesh, and par-
      tial mesh—might be useful in more limited or special-case deployments.
           Core/edge fabrics have many useful traits. They can scale to accommodate a
      large number of ports, and do so easily. A resilient core/edge fabric can be scaled
      “on the fly.” Switches can be added or replaced in the fabric without downtime.
      This includes the core switches, as long as the core is resilient. Core/edge fabrics
      also handle varying degrees of locality well. This topology is the most frequently
      recommended by Brocade and is the best general-purpose choice.
           There are four availability models in SAN architecture. In order of increasing
      availability, they are:
           1. Single fabric, nonresilient
           2. Single fabric, resilient
           3. Dual fabric, nonresilient
           4. Dual fabric, resilient
           The last of the models (dual fabric, resilient) is always recommended for
      applications where high uptime is a strong requirement. This kind of SAN
      consists of two usually identical, completely unconnected fabrics, neither of which
      contains a single point of failure. It is also possible to build SANs with more than
      two fabrics for even greater availability. (For example, the “triple fabric, resilient”
      topology consists of three fabrics.)
           While a tiered approach to SAN architecture can simplify management and
      storage resource planning, the most effective approach to performance tuning a
      SAN is to localize traffic within areas of the SAN. Applying the principle of
      locality takes a certain amount of work on the front end of a project and adds to
      the complexity of managing the growth of a fabric. However, when properly
      applied, locality greatly enhances performance and scalability.
           Using locality is one way to ensure that the ISLs in a fabric will not become
      congested. Over-subscription within a fabric is not necessarily a bad thing. A
      well-designed fabric can actually benefit from deliberate use of over-subscription,
      which can drive down cost, improve manageability, and decrease network
      complexity. Over-subscription affects performance only when it becomes congestion.


    You should look for performance limitations outside the fabric (such as host
and storage limitations, or application I/O profiles) before building a
low-over-subscription fabric. This will produce a topology that will provide the most
appropriate performance characteristics for your environment. However, if you
are unable to do this kind of in-depth analysis, there are a number of core/edge
topologies presented in this chapter from which you can choose a general-purpose
approach to SAN design. These topologies are well tested, reliable, scalable,
and perform well with most I/O profiles.

Solutions Fast Track
Identifying Fabric Topologies and SAN Architectures
        A fabric consists of one or more interconnected Fibre Channel switches.
        A SAN includes one or more related fabrics and everything attached
        to them.
        In a resilient core/edge fabric topology, two or more switches act as a
        core to interconnect multiple edge switches. This is the best
        “general-use” topology available, especially when combined with the
        dual-fabric approach to SAN architecture.
        In order to select the right topology, you must first decide the
        requirements for your SAN architecture. This includes redundancy and
        scalability in addition to port count.
        In general, the cascade, ring, full mesh, and partial mesh are best used
        in architectures where the individual fabrics that comprise the SAN
        will not change much. This could be true in a static, low-growth
        environment, or in a “SAN islands” design.
        The resilient core/edge topology is the best choice for general use and
        for situations where SAN requirements are either unknown or might
        change frequently.
        The resilient core/edge topology can be combined with dual fabrics to
        achieve maximum performance, reliability, and scalability.


      Working with the Core/Edge Topology
               The core/edge topology offers a number of key advantages over other
               topologies. Core/edge fabrics are:
               —Easy to scale without downtime.
               —Capable of scaling to a large number of ports.
               —Flexible in terms of their cost-to-performance ratios. (You can add
                 switches to the core with a clear knowledge of how doing so will
                 affect both cost and performance.)
               —Easy to understand, manage, and performance-tune.
               —Well-tested and reliable.
               Several core/edge fabrics can be used as “cookie-cutter fabrics” when
               design information is incomplete or might change frequently.


      Determining Levels of Availability
               There are four levels of availability that a SAN architecture might
               employ. The dual-fabric, resilient approach is the most reliable and the
               most frequently recommended.
               In most cases, this approach is not more expensive to implement than
               the other three approaches, and it might be less expensive in some cases.
               This approach allows for the failure of anything up to and including an
               entire fabric without application downtime.


      Configuring Traffic Patterns
               Tiered fabrics allow simplified management and storage resource plan-
               ning, but are the worst-case scenario from the standpoint of locality.
               Locality is the most effective approach to performance tuning in a SAN,
               but it is frequently unattainable.
               You should view locality as a “moving target,” since network complexity
               increases over time. However, it is worth getting as much locality as is
               practical into a SAN, since all SANs benefit in several ways from this
               technique.


Evaluating Performance Considerations
         Over-subscription is never a bad thing in and of itself. It is only when
         over-subscription becomes congestion that problems might arise.
         Latency is almost never a driving consideration in real-world SAN
         performance, since fabric latency is at least one order of magnitude
         lower than typical disk subsystem latency. Exceptions to this rule include
         clustering software and some highly performance-sensitive applications.
         In almost all cases, considerations outside the fabric will limit perfor-
         mance, such as CPU speed of hosts or the I/O profile of an application.


Frequently Asked Questions
The following Frequently Asked Questions, answered by the authors of this book,
are designed both to measure your understanding of the concepts presented in
this chapter and to assist you with real-life implementation of these concepts. To
have your questions about this chapter answered by the author, browse to
www.syngress.com/solutions and click on the “Ask the Author” form.


Q: What is the difference between a fabric and a SAN? I have usually heard these
   terms used interchangeably.
A: A SAN is a storage area network. It could be built from any underlying
   technology, not just Fibre Channel fabrics. While certain traditional
   networking technologies are not fundamentally well suited to SAN construction
   (for example, Gigabit Ethernet), a number of emerging technologies can also
   be used to build enterprise-class SANs. Thus, SAN is a fairly general term
   and is not limited to one specific approach. A fabric, on the other hand,
   is very specific. It is a set of interconnected Fibre Channel switches. The
   terms are frequently used interchangeably because Fibre Channel fabrics are,
   at this point, pretty much the only available production-level technology for
   SANs. Right now, nearly all SANs are Fibre Channel fabrics.

Q: All of this sounds very complicated. I just want to build the thing! Is there a
   fabric that I can just build?


      A: Sure. You can build a dual-fabric, resilient core/edge SAN. You can pick one
          of the designs from the core/edge target design sidebar earlier in this chapter,
          and you will have a design that will probably do what you need. It might not
          be the least expensive way to solve your design problem, but it is certainly the
          approach that requires the least planning. We presented the more advanced
          design material because many users want to take more control of their SANs,
          not because you must apply all of it in order to get the SAN you need.

      Q: I have heard that you should always try to minimize the number of hops
          between hosts and their storage. Is this always true?
      A: The best performance is always obtained by localizing traffic within a switch.
          This is a zero-hop scenario. Once you go outside of a single switch and cross
          an ISL, you can get to your destination in one hop. In most cases, getting the
          best performance requires an additional hop. Networks with equal-cost paths
          of two or more hops make better use of FSPF load-sharing capabilities. Since
          bandwidth usually makes much more of a difference to performance than latency,
          the added bandwidth from FSPF load sharing more than compensates for the
          extra hop. From a practical standpoint, when using a core/edge design, this
          means that you should always put the hosts and storage devices around the
          edge, rather than locating some devices directly on the core.
                                        Chapter 8


SAN Troubleshooting


 Solutions in this chapter:

     •   The Troubleshooting Approach: The SAN Is a Virtual Cable
     •   Troubleshooting the Fabric
     •   Troubleshooting Devices that Cannot Be Seen
     •   Troubleshooting Marginal Links
     •   Troubleshooting I/O Pauses


         Summary

         Solutions Fast Track

         Frequently Asked Questions

278   Chapter 8 • SAN Troubleshooting



      Introduction
      A SAN is a complex system that can consist of multiple switches, hosts, storage
      devices, routers, and hubs. A SAN can also be as simple as a single switch with
      attached storage and hosts. A breakdown of the individual components yields a
      range of subcomponents, from simple subcomponents, such as cables, to complex
      subcomponents, such as switches. At a macro level, the fabric itself is considered a
      component that might require troubleshooting. Switches are logically positioned
      in the middle of the network between hosts and storage, and have visibility to
      both storage and hosts. This visibility into both sides of the storage network
      enables you to use switches to determine the cause of any malfunction in the
      SAN. This chapter presents a structured process for identifying marginal or faulty
      SAN components by helping you figure out where to start and then to methodically
      home in on the problem. Specific areas of focus include troubleshooting the
      following symptoms and SAN components:
           s   Fabric
           s   “Missing” devices
           s   Marginal links
           s   Input/Output (I/O) interruptions
           The context of your problem influences how you interpret the data output by
      the variety of commands available in Fabric OS. For example, focus on the port
      state information in the switchShow output when you are troubleshooting a port
      issue, and on the switch status information from the same command when investigating
      a fabric issue. We will cover the details of how to troubleshoot using Fabric OS
      commands such as switchShow, errShow, portStatsShow, and others.
      Understanding host behavior and interpreting host information is also an
      important part of the troubleshooting process we discuss in this chapter.

      The Troubleshooting Approach:
      The SAN Is a Virtual Cable
      When first approaching troubleshooting, think of the SAN as a virtual cable.
      Storage traditionally involved connecting a Small Computer Systems Interface
      (SCSI) disk via a SCSI cable to a host; with this scenario, you focus on four
      components: the storage, the Host Bus Adapter (HBA), the host’s OS, and the
      cable/terminator. Troubleshooting a SAN is more challenging, but still has many
things in common with the traditional storage troubleshooting process. To the
operating system, the SAN provides a link to a disk, just as a traditional SCSI
connection would.
     You can apply the same “tried-and-true” process of elimination used to
troubleshoot a direct-attach SCSI issue or Ethernet network issue to SAN
troubleshooting. At a macro level, if you consider the SAN a virtual cable, the
issue can reside in three possible areas: the host, the “cable,” or the storage.
Troubleshooting can work like a binary search when you start investigating these
areas. Start in the middle and determine whether you are “above” or “below” the
problem, and then keep dividing the suspect path until you resolve the problem.
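
The binary-search idea can be sketched in a few lines of Python. This is only an illustration: the component names in `path` and the `works_up_to` check are hypothetical stand-ins for whatever tests you actually run (switchShow, nsShow, host logs), and the sketch assumes exactly one fault somewhere along the path.

```python
from typing import Callable, Sequence


def bisect_data_path(path: Sequence[str], works_up_to: Callable[[int], bool]) -> str:
    """Return the first suspect component on an ordered data path.

    `path` lists components from one end of the virtual cable to the other.
    `works_up_to(i)` returns True when everything from path[0] through
    path[i] has been verified healthy. A single fault is assumed, so the
    check flips from True to False exactly once along the path.
    """
    lo, hi = 0, len(path) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1  # the fault lies beyond the midpoint
        else:
            hi = mid      # the fault is at or before the midpoint
    return path[lo]


# Hypothetical single-switch data path, ordered from host to storage.
path = ["host OS", "HBA", "host cable", "switch", "storage cable", "storage"]
fault_index = 4  # pretend the storage-side cable is bad
print(bisect_data_path(path, lambda i: i < fault_index))  # storage cable
```

Each check halves the list of suspects, which is why starting in the middle of the SAN is usually faster than walking the path from one end.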
     When troubleshooting with a simple single-switch configuration, a single
host, and a single storage device, you need to focus on the HBA, the Gigabit
Interface Converter (GBIC), the host’s OS, the cable, the switch, and the storage.
Brocade fabrics run a single-image distributed operating system known as Fabric
OS. Fabric OS delivers functionality such as Name Server, Registered State
Change Notification (RSCN), Zoning, and security. These functions are part of
the SAN and are also variables in the troubleshooting equation. A large SAN can
consist of dozens of switches and is capable of growing to thousands of ports.
Knowing where in the SAN to initiate troubleshooting can be daunting. The
next section uses a typical SAN troubleshooting scenario—a host unable to “see”
its disks—to illustrate the method of resolving the problem by treating the SAN
as a virtual cable and working with a process of elimination.

A Typical Scenario: “I Cannot See My Disks”
We provide the scenario described in this section to introduce the trou-
bleshooting process and to establish a framework with which you are familiar.
Some terms, commands, and concepts may seem foreign. This is okay. We address
everything discussed in this section in greater detail later in the chapter.
    When a host cannot see its disks, one thing to check is whether the storage device is
logically connected to the switch by reviewing the output from the switchShow
command. If the device is not logically connected (that is, it does not show up as
an Nx_Port), you need to focus on the port initialization. Notice that port 15 in
Figure 8.1 indicates a logically connected device, as this port is connected as an
F_Port. Port 14 is an example of an unsuccessful device connection, as the device
connected to port 14 is connected as a G_Port. A G_Port indicates an incom-
plete connection to the fabric. Initially knowing that the missing device is not
logically connected eliminates the host and everything on that side of the data
      path from the suspect list, as depicted in Figure 8.2. This includes all aspects of
      the host’s OS, the HBA driver settings and binaries, the HBA Basic Input Output
      System (BIOS) settings, the HBA GBIC, the cable going from the switch to the
      host, the GBIC on the switch side of that cable, and all switch settings related to
      the host. That is quite a lot for one command! If the missing device is logically
      connected to the switch, you need to check to see if the device is present in the
      Simple Name Server (SNS).

      Figure 8.1 Example of a Successful and Unsuccessful Device Connection
      core2:admin> switchshow
      switchName:        core2
      switchType:        2.4
      switchState:       Online
      switchRole:        Subordinate
      switchDomain:      5
      switchId:          fffc05
      switchWwn:         10:00:00:60:69:10:9b:5b
      switchBeacon:      OFF
      port 0: sw     Online             E-Port   10:00:00:60:69:11:f9:f7 "edge1"
                                        (upstream)
      port   1: sw    Online            E-Port   10:00:00:60:69:10:9b:52 "edge2"
      port   2: sw    Online            E-Port   10:00:00:60:69:11:f9:f7 "edge1"
      port   3: sw    Online            E-Port   10:00:00:60:69:10:9b:52 "edge2"
      port   4: sw    Online            E-Port   10:00:00:60:69:12:f9:8c "edge3"
      port   5: sw    Online            E-Port   10:00:00:60:69:12:f9:8c "edge3"
      port   6: —    No_Module
      port   7: —    No_Module
      port   8: —    No_Module
      port   9: —    No_Module
      port 10: —     No_Module
      port 11: id     Online            E-Port   10:00:00:60:69:12:f9:8c "edge3"
      port 12: —     No_Module
      port 13: —     No_Module
      port 14: cu     Online            G-Port //incomplete fabric connection
      port 15: id     Online            F-Port   50:06:04:82:bc:01:9a:0c
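
If you have captured switchShow output to a text file, the port check described above can be automated. The following Python sketch assumes the port-line layout shown in Figure 8.1 (port number, media type, state, then an optional port type); other Fabric OS releases may format these lines differently, and the function names are illustrative only.

```python
import re

# Matches lines such as "port 14: cu     Online            G-Port".
# The port type is optional because empty ports show only a state.
PORT_LINE = re.compile(r"^\s*port\s+(\d+):\s+(\S+)\s+(\S+)(?:\s+([A-Z]+-Port))?")


def parse_ports(switchshow_text):
    """Return {port number: {"media", "state", "type"}} from switchShow text."""
    ports = {}
    for line in switchshow_text.splitlines():
        match = PORT_LINE.match(line)
        if match:
            number, media, state, port_type = match.groups()
            ports[int(number)] = {"media": media, "state": state, "type": port_type}
    return ports


def incomplete_ports(ports):
    """Ports that are online but stuck as G-Port (incomplete fabric connection)."""
    return sorted(n for n, p in ports.items()
                  if p["state"] == "Online" and p["type"] == "G-Port")


SAMPLE = """\
port 13: --     No_Module
port 14: cu     Online            G-Port
port 15: id     Online            F-Port   50:06:04:82:bc:01:9a:0c
"""

ports = parse_ports(SAMPLE)
print(incomplete_ports(ports))  # [14]
```

A port reported here is the place to focus on port initialization before looking at anything on the host side.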

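Figure 8.2 The SAN Virtual Cable

[Figure: a virtual SAN cable running between Storage and Host. The storage side of the cable is marked "Problem" and the host side is marked "OK".]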

     The SNS is a directory service provided by the fabric. Initiators query the
Name Server much in the same way you would query a telephone directory
looking for a particular person or service. If a device is not in the Name Server, it
is essentially invisible to other devices in the fabric. When a device connects to
the fabric, that device will register itself with the Name Server. This is similar to
the situation in which you change neighborhoods and have your name listed in
the new telephone directory. When an initiator, which is normally an HBA,
enters the fabric, it queries the Name Server to identify all accessible devices and
obtain the addresses of these devices, just like you might scan your telephone
directory for a name. Some targets also will query the Name Server. Then the
initiator starts the process of establishing a connection with those devices for
which the Name Server provides addresses.
     Check the Name Server for the presence of your missing device by issuing
the nsShow command on the switch to which the device is attached (see the
sample output in Figure 8.3). This lists all of the nodes connected to that
switch, allowing you to determine whether a particular node is accessible on the
network. An alternate method is to check the Name Server list in the WEB TOOLS
Graphical User Interface (GUI) on any switch in the fabric, as it contains a
consolidated list of all devices in the fabric. Note that we started the process in the
middle of the virtual SAN cable, which is the fabric. This is the process we
described earlier as being like a binary search algorithm. You start in the middle
of the data path, figure out whether you are “above” or “below” the problem, and
keep dividing the suspect path in half until you identify the problem.

      Figure 8.3 nsShow Sample Output
      core2:admin> nsshow
      The Local Name Server has 9 entries {
          Type Pid       COS       PortName              NodeName             TTL(sec)
      *N       021a00;         2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; 895
               Fabric Port Name: 20:0a:00:60:69:10:8d:fd
          NL    051edc;          3;21:00:00:20:37:d9:77:96;20:00:00:20:37:d9:77:96; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          NL    051ee0;          3;21:00:00:20:37:d9:73:0f;20:00:00:20:37:d9:73:0f; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          NL    051ee1;          3;21:00:00:20:37:d9:76:b3;20:00:00:20:37:d9:76:b3; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          NL    051ee2;           3;21:00:00:20:37:d9:77:5a;20:00:00:20:37:d9:77:5a; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          NL    051ee4;          3;21:00:00:20:37:d9:74:d7;20:00:00:20:37:d9:74:d7; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          NL    051ee8;          3;21:00:00:20:37:d9:6f:eb;20:00:00:20:37:d9:6f:eb; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          NL    051eef;          3;21:00:00:20:37:d9:77:45;20:00:00:20:37:d9:77:45; na
               FC4s: FCP [SEAGATE ST318304FC           0005]
               Fabric Port Name: 20:0e:00:60:69:10:9b:5b
          N      051f00;        2,3;50:06:04:82:bc:01:9a:0c;50:06:04:82:bc:01:9a:0c; na
               FC4s: FCP [EMC          SYMMETRIX        5267]
               Fabric Port Name: 20:0f:00:60:69:10:9b:5b
      }
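
Checking captured nsShow output for a particular device can also be scripted. The sketch below assumes the entry layout shown in Figure 8.3 (port type, PID, class of service, port WWN, and node WWN separated by semicolons); the function names and the second WWN in the usage lines are hypothetical.

```python
import re

WWN = r"(?:[0-9a-f]{2}:){7}[0-9a-f]{2}"

# Matches entry lines such as
# " NL    051edc;          3;21:00:...:96;20:00:...:96; na"
ENTRY = re.compile(
    r"^\s*\*?\s*(NL|N)\s+([0-9a-f]{6});\s*([\d,]+);(" + WWN + r");(" + WWN + r");",
    re.IGNORECASE,
)


def parse_nsshow(text):
    """Return one dict per Name Server entry found in nsShow text."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            port_type, pid, cos, port_wwn, node_wwn = match.groups()
            entries.append({
                "type": port_type.upper(),
                "pid": pid.lower(),
                "cos": cos,
                "port_wwn": port_wwn.lower(),
                "node_wwn": node_wwn.lower(),
            })
    return entries


def is_registered(text, port_wwn):
    """True if a device with this port WWN appears in the Name Server list."""
    wanted = port_wwn.lower()
    return any(entry["port_wwn"] == wanted for entry in parse_nsshow(text))


SAMPLE = """\
The Local Name Server has 2 entries {
    Type Pid       COS       PortName              NodeName             TTL(sec)
    NL    051edc;          3;21:00:00:20:37:d9:77:96;20:00:00:20:37:d9:77:96; na
    N     051f00;        2,3;50:06:04:82:bc:01:9a:0c;50:06:04:82:bc:01:9a:0c; na
}
"""

print(is_registered(SAMPLE, "50:06:04:82:BC:01:9A:0C"))  # True
print(is_registered(SAMPLE, "10:00:00:00:c9:00:00:01"))  # False
```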


    At this point, if the device is not present in the Name Server, you have
narrowed your search along the virtual SAN cable to the Name Server interface
between the storage and the switch. The missing-device process defined in this
section is summarized in flowchart form in Figure 8.4. Note that Figure 8.4 is an
excerpt from the complete missing-device troubleshooting process, which is shown
in Figure 8.25. We will go deeper into this missing-device troubleshooting
process and flowchart later in the chapter.

Figure 8.4 Flowchart Excerpt of Troubleshooting a Missing Device
(See Figure 8.25 for the Complete Flowchart.)

Storage device not visible to host
    |
    v
Is the storage device present in switchShow?
    No  -> Issue between storage device and switch. Not a host issue.
    Yes -> continue
    |
    v
Is the storage device visible in the Name Server?
    No  -> Issue between storage device and switch. Not a host issue.
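
The two decisions in this flowchart excerpt can be expressed as a small Python function. This is a sketch of the excerpt only: the two boolean inputs stand for the results of your own switchShow and nsShow checks, and the final branch simply points to the remaining steps of the complete flowchart, which are not reproduced here.

```python
def triage_missing_device(in_switchshow: bool, in_name_server: bool) -> str:
    """Follow the Figure 8.4 excerpt for a storage device the host cannot see."""
    if not in_switchshow:
        # The device never completed its connection to the switch port.
        return "switchShow: issue between storage device and switch, not a host issue"
    if not in_name_server:
        # The port is up, but the device did not register with the Name Server.
        return "Name Server: issue between storage device and switch, not a host issue"
    return "storage side looks healthy; continue with the complete flowchart"


print(triage_missing_device(in_switchshow=True, in_name_server=False))
```

Both "No" branches lead to the same conclusion in the excerpt; the sketch only adds which check failed so that you know where to look next.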




Where to Start and What Data to Gather
As stated in the previous section, SAN troubleshooting should begin in the
center of the SAN and proceed outward. Once you know where to start trou-
bleshooting, the next question is how to proceed. Start the troubleshooting pro-
cess by gathering a preliminary set of data, and then analyze this data to identify
where the problem resides: the host, the fabric, or the storage. Then gather
additional data from the appropriate area and home in on the cause of the problem.
A plethora of data is available from the switches, hosts, and storage. Knowing
what data to look at and when to look at it is fundamental to the SAN
troubleshooting process.

      Take a Snapshot: Describe the
      Problem and Gather Information
      Start with a general description of the problem and identify as much supporting
      detail as possible. At the very least, this should include a statement about what the
      “bad” behavior is, and a statement about what you are doing or have done to
      expose this behavior. Note that this is not the same as describing what you have
      done that causes the behavior. You might be doing something correctly, like plugging
      in a disk array and adding it to a zone, yet it might affect something else in
      the fabric if there is an underlying problem that is exposed whenever a zone
      change occurs.
          For example, an HBA responding incorrectly to an RSCN could fail when
      the new zone configuration is enabled. An RSCN is a fabric service for which an
      edge device optionally registers. When a device registers for an RSCN, it is
      asking the fabric to send that device a notice anytime something in the fabric
      changes. For example, when a new device is added to the fabric, any devices that
      registered for RSCNs will receive a notice. The registered device receiving the
      RSCN then checks the Name Server to see what has changed and takes appropriate
      action. For example, if the registered device is a host and a new disk drive
      is added to the SAN, the host might create the necessary device operating system
      structures so the new device is accessible to the user.
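
The register, notify, and re-query sequence can be pictured with a toy model in Python. This is a deliberate simplification for illustration, not how Fabric OS implements RSCN: zoning is ignored, and the `Fabric` and `Host` classes and their methods are invented for this sketch.

```python
class Fabric:
    """Toy fabric: a Name Server plus a list of RSCN registrations."""

    def __init__(self):
        self.name_server = set()   # port WWNs currently registered
        self.rscn_listeners = []   # callbacks of devices that registered for RSCNs

    def register_for_rscn(self, callback):
        self.rscn_listeners.append(callback)

    def add_device(self, wwn):
        # A device joins the fabric and registers with the Name Server...
        self.name_server.add(wwn)
        # ...and every device that registered for RSCNs is told something changed.
        for notify in self.rscn_listeners:
            notify(self)


class Host:
    """Toy host that re-reads the Name Server whenever it receives an RSCN."""

    def __init__(self, name):
        self.name = name
        self.known = set()
        self.discovered = []

    def on_rscn(self, fabric):
        current = set(fabric.name_server)
        self.discovered.extend(sorted(current - self.known))
        self.known = current


fabric = Fabric()
registered = Host("host1")
unregistered = Host("host2")
fabric.register_for_rscn(registered.on_rscn)

fabric.add_device("21:00:00:20:37:d9:77:96")
print(registered.discovered)    # ['21:00:00:20:37:d9:77:96']
print(unregistered.discovered)  # []
```

The host that did not register never learns about the new disk, which mirrors the optional nature of RSCN registration described above.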
          This information will help you with the problem resolution, and might be
      necessary if you need to contact Brocade or any Brocade-authorized support
      channel. Some examples of a general problem description include:
           s   When I enable a switch zoning configuration with cfgEnable, storage
               devices are no longer accessible to the host.
           s   There are frequent pauses in I/O when I copy large files between arrays.
           s   My edge device sometimes connects as an N_Port, and other times it
               connects as a Node Loop (NL)_Port when I power it up.
           s   The fabric segments and the following error message is logged (provide
               error message in your description). It does this under normal operation, even
               when I do not touch any device on the SAN.
          Include the answers to the following questions with your problem description:
           s   Can you recreate the problem on demand? If so, how? (Go into detail.)
           s   Is the problem intermittent? If so, how frequently does it occur?
     s   Has anything at all changed recently on the fabric? If so, what? (Provide
         a complete list.)
     s   Is the problem localized or fabric-wide? For example, is the problem
         happening with other devices in the fabric, or just locally with a single
         device attached to the switch?
     s   Is this an initial install and the device was never working, or was the
         device working and now it has stopped working?
   Other information to record:
     s   If there are any error messages, include them with the problem
         description.
     s   Firmware and driver versions for the affected HBA and storage devices.
     s   Firmware and operating system versions for affected hosts and all
         fabric switches.
     s   External switch information, such as LED state.
     s   External HBA and port information, such as LED state.
     s   A diagram of the SAN configuration.
     s   If long-distance links are present, include information about the length
         and quality of the lines, and the mechanism being used to achieve the
         distance (for example, “The line is 10 km long, and we are using Long
         Wavelength [LWL] GBICs,” or “It is 80 km long, and we are using a
         Dense Wave Division Multiplexor [DWDM] and the Extended Fabrics
         product”).
    Finally, gather supportShow information from the switches. The
supportShow command is a switch command used to gather information about
the switch and the fabric; it can provide valuable clues about what is happening
in your switch network. It is like a macro in that it executes a long list of switch
commands that Brocade identifies as important for the troubleshooting process.
Note that the commands that supportShow executes vary between Fabric
OS releases. The v2.4.1 supportShow command executes the following switch
commands:
     s   version
     s   uptime
     s   tempShow
           s   psShow
           s   licenseShow
           s   diagShow
           s   errDump
           s   switchShow
           s   portFlagsShow
           s   portErrShow
           s   mqShow
           s   portSemShow
           s   portShow
           s   portRegShow
           s   portRouteShow
           s   fabricShow
           s   topologyShow
           s   qlShow
           s   faShow
           s   portCfgLport
           s   nsShow
           s   nsAllShow
           s   cfgShow
           s   configShow
           s   faultShow
           s   traceShow
           s   portLogDump
          One benefit of supportShow is that you do not have to repeatedly retrieve
      various types of data, since most of the data you need is available from
      supportShow in one place. Because the output streams rapidly in a telnet window,
      turn on capture mode before executing the command so that the output
      can be captured to a text file for later review.

NOTE
     It is important to execute the supportShow command at the time the
     problem is occurring, rather than waiting until the fabric is functioning
     normally.



    Due to the large volume of data created by supportShow, you might choose
to gather the supportShow data once and then selectively issue a subset of its
commands as part of your troubleshooting process.

Troubleshooting Tools
Many tools are available to the SAN troubleshooter. Many of these tools are
switch commands. Other tools include the switch LEDs, host information such as
the Solaris /var/adm/messages file, Fibre Channel analyzers, and the
diagnostics available on many storage arrays. Rarely is it possible to use a single
tool to successfully troubleshoot a problem. It is more common to use several
tools to attain a successful resolution of a problem.

Using the Switch LEDs
A significant amount of information can be gathered just by looking at the
switch LEDs. At a rudimentary level, it is possible to identify that a device
has faulted or is not yet online by looking for a “fast yellow.” If the switch is
located in another room, you can get a visual real-time LED status using the
WEB TOOLS interface. Fast flickering green lights are a sign of a healthy SAN.
By physically observing the switches that comprise a SAN, it is possible to detect
patterns and identify a marginal or faulty component. For example, if you have a
situation in which you are trying to identify a device that is repeatedly toggling
online and offline, you can use the switch LEDs.
    While observing a functional fabric, you can easily identify a potentially dis-
ruptive device by scanning for a port that goes offline (no LED light), sends light
(steady yellow), comes online (steady green), and then cycles through the same
steps (blank, yellow, green). You also want to look for correlations or patterns,
such as one device going offline followed by a group of devices going offline and
back online again. This situation is common in QuickLoop configurations when
the first device going offline is sending a Loop Initialization Primitive (LIP),
which then causes the other devices to LIP.

         How to Identify a Healthy SAN Using the LEDs
         A settled and healthy fabric should have solid green or fast flickering
         green lights. A solid green light indicates an active link, while a fast flick-
         ering green light indicates I/O activity.




         How to Identify a SAN Problem Using the LEDs
         A yellow light or blinking yellow light indicates a problem with your
         SAN. An LED that transitions from yellow to green, however, is not a
         problem. A powered-off edge device, or edge device that is not yet
         online, might cause the switch LEDs to blink yellow.


          Another helpful use for the LEDs is for fabric “bring up.” When bringing up
      a fabric, one sign that the fabric has reached convergence is
      steady green lights. When the fabric is coming up, the Inter-Switch Links (ISLs)
      go through initialization, which appears to the observer as flickering green and
      yellow lights prior to the fabric fully converging. Once the fabric is converged,
      the lights go to a steady green. Then, as I/O in the fabric begins, you will see
      flickering green lights on the ISL ports and the edge device ports.
          A slowly flashing switch power LED indicates that the switch failed the
      Power-On Self-Test (POST) and is not able to come online. Refer to the associated
      switch manual for the location of the power LED. Table 8.1 lists the port
      LEDs and their definitions (you can also find this table in the Brocade SilkWorm
      2800 Hardware Reference Manual).

      Table 8.1 Front Panel LED Port Indicators

      Port LED                              Definition
      No light showing                      No light or signal carrier (no module, no cable) for media interface
      Steady yellow                         Receiving light or signal carrier, but not yet online
      Slow yellow (flashes two seconds)     Disabled (result of diagnostics, switchDisable, or portDisable command)
      Fast yellow (flashes a half second)   Error, fault with port
      Steady green                          Online (connected with external device over cable)
      Slow green (flashes two seconds)      Online, but segmented (loopback cable or incompatible fabric parameters)
      Fast green (flashes a half second)    Internal loopback (diagnostic)
      Flickering green                      Online and frames flowing through port
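
When you record LED observations in notes or scripts, the table can be kept as a simple lookup. The Python sketch below copies the definitions from Table 8.1; treating only steady green and flickering green as "healthy" follows the guidance given earlier in this section, and the function names are illustrative.

```python
# LED state -> definition, taken from Table 8.1.
LED_MEANINGS = {
    "no light": "No light or signal carrier (no module, no cable)",
    "steady yellow": "Receiving light or signal carrier, but not yet online",
    "slow yellow": "Disabled (diagnostics, switchDisable, or portDisable)",
    "fast yellow": "Error, fault with port",
    "steady green": "Online (connected with external device over cable)",
    "slow green": "Online, but segmented",
    "fast green": "Internal loopback (diagnostic)",
    "flickering green": "Online and frames flowing through port",
}

# States expected on a settled, healthy fabric.
HEALTHY_STATES = {"steady green", "flickering green"}


def describe_led(state):
    """Return the Table 8.1 definition for an observed LED state."""
    return LED_MEANINGS.get(state.strip().lower(), "Unknown LED state")


def is_healthy(state):
    """True for the states a settled, healthy port is expected to show."""
    return state.strip().lower() in HEALTHY_STATES


for observed in ["Flickering green", "Fast yellow", "Slow green"]:
    flag = "ok" if is_healthy(observed) else "investigate"
    print(f"{observed}: {describe_led(observed)} [{flag}]")
```

A state flagged "investigate" is not always a fault; an unused port with no module, for example, legitimately shows no light.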


Switch Diagnostics
A robust set of switch diagnostics is available so you can validate the operational
level of a SilkWorm switch. Several of these diagnostics, such as
portLoopbackTest, are also helpful in the troubleshooting process. For example,
if you suspect a bad GBIC or switch port, you can use portLoopbackTest to
confirm your suspicion. Using portLoopbackTest for troubleshooting is discussed
in the section “Troubleshooting Marginal Links” later in the chapter. The
supportShow diagnostic command in particular, discussed in detail later in this
chapter, is very helpful to the troubleshooting process. The Brocade Fabric OS
manuals provide detailed descriptions of the usage of diagnostic commands.
To see what diagnostic commands are available online, issue the command
diagHelp at the switch prompt. The following list of diagnostic commands is
available in the V2.4.1 Fabric OS:
     s   ramTest System DRAM diagnostic
     s   portRegTest Port register diagnostic
     s   centralMemoryTest Central memory diagnostic
     s   cmiTest CMI bus connection diagnostic
     s   camTest QuickLoop CAM diagnostic
           s   portLoopbackTest Port internal loopback diagnostic
           s   sramRetentionTest SRAM Data Retention diagnostic
           s   cmemRetentionTest Central Mem Data Retention diagnostic
           s   crossPortTest Cross-connected port diagnostic
           s   spinSilk Cross-connected line-speed exerciser
           s   diagClearError Clear diag error on specified port
           s   diagDisablePost Disable Power-On-Self-Test
           s   diagEnablePost Enable Power-On-Self-Test
           s   setGbicMode Enable tests only on ports with GBICs
           s   setSplbMode Enable 0=Dual, 1=Single port LB mode
           s   supportShow Print version, error, portLog, etc.
           s   diagShow Print diagnostic status information
           s   parityCheck Dram Parity 0=Disabled, 1=Enable
           s   spinFab ISL link diagnostic
           s   loopPortTest L_Port cable loopback diagnostic


      Helpful Commands
      With dozens of switch commands at your disposal, it can be difficult to deter-
      mine which command to use in a given situation. An annotated list of helpful
      commands follows in this section, with additional commands highlighted as they
      relate to specific issues discussed in following sections. This list of commands is a
      starting point for gathering data and initiating your troubleshooting process.
      While the information generated by these commands is also available in
      supportShow, you will want to use individual commands as you advance
      through the troubleshooting process. SupportShow creates a significant amount
      of data and is helpful when you want to perform the original snapshot of the
      configuration and environment (to report a problem to your switch supplier), or
      you are not sure what data to capture.

NOTE
    Although the switch commands are shown with various capitalization as
    originally coded in Fabric OS, the commands are no longer case-sensitive
    and can be entered with all lowercase if desired.



    Entering the command help at the switch prompt generates a list of com-
mands available to the user as shown in Figure 8.5. Entering the command help
<command> generates a help page (similar to UNIX man pages) for that spe-
cific command. Many commands differ by the extension show or dump (for
example, errShow and errDump). The difference is that show commands
require you to type a return between entries, while the dump commands stream
data to the screen without any pauses. Dump commands are used when you have
a facility for logging command output to a file. It might be necessary to execute
commands on more than one switch in the fabric, especially if the location of the
problem is unclear.


NOTE
    As of Fabric OS 2.4.1, there is no time synchronization among the
    switches, which can make troubleshooting a challenge if the clocks
    between the switches are skewed. Before you begin troubleshooting
    your fabric, you should make a note of any time skew so that you can
    compensate for it when reading command outputs. You should also
    make an effort to keep switch clocks set correctly during normal
    operation to avoid this problem.
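
Once you have noted the skew of each switch clock, compensating for it is simple arithmetic. The Python sketch below assumes you have already transcribed log entries into (switch, timestamp, message) tuples and measured how far each switch clock runs ahead of a reference clock; the switch names, times, and messages are invented for illustration.

```python
from datetime import datetime, timedelta


def to_reference_time(events, clock_offsets):
    """Shift each event onto the reference clock and sort chronologically.

    events        -- iterable of (switch, timestamp, message)
    clock_offsets -- switch -> timedelta by which that switch's clock runs
                     ahead of the reference clock (negative if it runs behind)
    """
    corrected = []
    for switch, stamp, message in events:
        offset = clock_offsets.get(switch, timedelta(0))
        corrected.append((stamp - offset, switch, message))
    return sorted(corrected)


# Hypothetical skew: core2 runs five minutes fast; edge1 is the reference.
offsets = {"core2": timedelta(minutes=5)}
events = [
    ("core2", datetime(2001, 6, 1, 10, 7, 0), "port 14 offline"),
    ("edge1", datetime(2001, 6, 1, 10, 3, 0), "link reset on port 2"),
]

for stamp, switch, message in to_reference_time(events, offsets):
    print(stamp.isoformat(), switch, message)
```

After correction, the core2 event (10:02 on the reference clock) sorts ahead of the edge1 event (10:03), the reverse of what the raw timestamps suggest.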




Figure 8.5 Use the help Command to See What Commands Are Available or
Type help <command> for Help About a Specific Command
dev172:admin> help


agtcfgSet                    Set SNMP agent configuration
agtcfgShow                   Print SNMP agent configuration
agtcfgDefault                Reset SNMP agent to factory default
                                  .
                                  .
                                  .
      qlHelp                            Print quick loop help info
      routeHelp                         Print routing help info
      trackChangesHelp                  Print Track Changes help info
      zoneHelp                          Print zoning help info


      dev172:admin> help errShow


      NAME
             errShow - display the error log


      SYNOPSIS
             errShow


      AVAILABILITY
             all users


      DESCRIPTION
             This command displays the error log, prompting the user to type
             return between each log entry. It is identical to errDump, except


                                            .
                                            .
                                            .
      SEE ALSO
             errDump, uptime



      The errShow Command
      The errShow command provides a listing of up to 64 logged errors and is
      helpful for identifying where a problem might reside. It sends messages to the
      console and to the error log. Note that the error log is cleared after a reboot or
      power cycle; if you want to maintain error logs that persist after reboots or power
cycles, consider using the syslog facilities of the switch to log errors to persistent
storage. See syslogdIpAdd, syslogdIpRemove, and syslogdIpShow for further
detail on how to set up persistent logging.
    When examining errShow data, which can be quite wordy, look for trends
or patterns. For example, look for an excessive number of errors associated with a
specific port. In addition, watch for high error-count values, which indicate a
repeated error that has been logged many times. Logging a repeat count keeps
errors that occur multiple times from consuming the space provided for the error
log. Note that every error has an associated severity level. A
warning (error level 3) is just that: a warning. An error (error level 2) or critical
(error level 1) message is more severe and requires further attention.
    An excerpt from the errShow help entry is provided in Figure 8.6. Please
refer to the help page or the Fabric OS manual for details on interpreting Diag
Err#, as the list of codes is lengthy. A Diag Err# usually indicates a problem
with hardware, so contact your switch supplier for further assistance.
    In addition to software errors, errShow logs environmental issues, such as
over-temperature conditions, and equipment issues such as fan failures or power
supply failures. A detailed list of error messages, descriptions, probable causes, and
actions is maintained in the Fabric OS Reference Manual Version 2.4
(Publication Number 53-0001569-01).

Figure 8.6 Excerpt from the errShow help Entry
Each entry in the log has the same format:


   Error Number
   ——————
   taskId (taskName): Time Stamp (count)
          Error Type, Error Level, Error Message
   Diag Err#


Error Number           Starting from one. If there are more error than
                       the size of the log, only the most recent errors
                       are shown.


Task Id & Name         The ID and name of the task recording the error.



      Time Stamp             The date and time of the first occurrence of
                                 the error.


      Error Count            For errors that occur multiple times, the repeat
                             count is shown in parenthesis. The maximum count
                             is 999.


      Error Type             An uppercase string showing the firmware module
                             and error type. The switch manual contains a
                             detailed explanation of each error type.


      Error Level            0     panic (the switch reboots)
                             1     critical
                             2     error
                             3     warning
                             4     information
                             5     debug


      Error Message          Additional information about the error.


          Figure 8.7 is an example of an errShow message. The fabric is segmented,
      meaning that the switch that generated this message is logically disconnected
      from the SAN, and any devices in the SAN that are not directly connected to
      this switch are inaccessible to this switch. Moreover, any devices located on this
      switch are unable to access other devices in the fabric. The error level is a
      warning (3). The task ID (0x10f6f4f0) can be cross-referenced by issuing the
      telnet command “i” to obtain additional information on the task in question.
      The Task Name is self-explanatory, and interpreting it is somewhat intuitive. For
      example, tTransmit is the transmit task. The Task Name can be helpful in identi-
      fying the nature of the problem. Finally, the error message indicates that there is a
      discrepancy between the zone information contained on this switch and the zone
      information contained in the rest of the fabric. When the switch tried to join
      the fabric with this conflicting information, the join request was denied; hence,
      the segmented fabric. The message even identifies the zone that is causing the
conflict; in this case, it is the “red” zone. This zone should be checked and com-
pared to the rest of the fabric, and if the zone information is different, either cor-
rect or delete it.
Figure 8.7 errShow Example

        Error 04
        --------
        0x10f6f4f0 (tTransmit): May 15 09:38:57 (6)
         Error FABRIC-SEGMENTED, 3, port 0, zone conflict: content mismatch: red

        Callouts:
          0x10f6f4f0                             Task Id
          tTransmit                              Task Name
          FABRIC-SEGMENTED                       Error Type
          3                                      Warning Level
          port 0                                 Port
          zone conflict: content mismatch: red   Error Message
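
The entry layout described in Figure 8.6 lends itself to simple scripted triage of a captured log. The following Python sketch is illustrative only: the regular expressions and the function names parse_entry and needs_attention are assumptions modeled on the sample entry in Figure 8.7, not part of Fabric OS, and other firmware versions may format entries differently.

```python
import re

# Severity names taken from the Error Level list in Figure 8.6.
LEVELS = {0: "panic", 1: "critical", 2: "error",
          3: "warning", 4: "information", 5: "debug"}

# Assumed patterns, modeled on the sample entry in Figure 8.7.
HEADER = re.compile(r"(0x[0-9a-fA-F]+) \((\w+)\): (.+?)(?: \((\d+)\))?\s*$")
DETAIL = re.compile(r"Error (\S+), (\d), (.*)$")


def parse_entry(header_line, detail_line):
    """Return a dict describing one errShow entry, or raise ValueError."""
    head = HEADER.search(header_line)
    detail = DETAIL.search(detail_line)
    if head is None or detail is None:
        raise ValueError("unrecognized errShow entry")
    level = int(detail.group(2))
    return {
        "task_id": head.group(1),
        "task_name": head.group(2),
        "timestamp": head.group(3),
        # The repeat count is shown only for errors logged more than once.
        "count": int(head.group(4)) if head.group(4) else 1,
        "error_type": detail.group(1),
        "level": level,
        "severity": LEVELS.get(level, "unknown"),
        "message": detail.group(3).strip(),
    }


def needs_attention(entry):
    """Levels 0 through 2 (panic, critical, error) call for follow-up."""
    return entry["level"] <= 2


entry = parse_entry(
    "0x10f6f4f0 (tTransmit): May 15 09:38:57 (6)",
    "Error FABRIC-SEGMENTED, 3, port 0, zone conflict: content mismatch: red",
)
print(entry["task_name"], entry["severity"], entry["count"])
```

Sorting parsed entries by count or grouping them by port is one way to surface the trends and patterns discussed earlier.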




The portErrShow Command
The portErrShow command is effective for troubleshooting
marginal ports. This command provides an error summary for all ports associated
with the switch and provides a status of all ports from a link integrity perspective.
The key to interpreting the statistics is looking for a very high number of errors
relative to the frames transmitted and frames received. For example, if 2,000,000
frames have been received and only three Cyclic Redundancy Check (CRC)
errors have been logged, the ratio of CRC errors to frames received is very
low and the associated port is not suspected of being marginal. On the
other hand, if 2,000,000 frames have been received and 10,000 CRC errors have
been logged, the ratio of CRC errors to frames received is high and
the associated port should be examined further. A rough guideline is to look for
errors in excess of 0.5 percent of the total number of frames transferred.
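
The ratio test described above is simple arithmetic, and a short Python sketch makes it concrete. The function names and the default threshold parameter are hypothetical; the 0.5 percent figure is the rough guideline from the text, and the comparison is written as "at or above" so that the 10,000-error example, which sits exactly at 0.5 percent, is flagged.

```python
def error_ratio(errors, frames):
    """Fraction of frames affected; 0.0 when no frames were counted."""
    return errors / frames if frames else 0.0


def is_suspect(errors, frames, threshold=0.005):
    """Apply the rough 0.5 percent guideline to one port counter."""
    return error_ratio(errors, frames) >= threshold


# The two cases from the text: 3 and 10,000 CRC errors in 2,000,000 frames.
for crc_errors in (3, 10_000):
    ratio = error_ratio(crc_errors, 2_000_000)
    print(crc_errors, f"{ratio:.4%}", is_suspect(crc_errors, 2_000_000))
```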
    Another important trend to watch is a steadily increasing number of errors.
You can track increasing errors by sampling every five or ten seconds and moni-
toring the delta between the samples. Simple Network Management Protocol
(SNMP) polling can be used to facilitate this. Also, the optionally licensed Fabric
      Watch product can be used to note changes in error rates over time and send out
      an SNMP trap or error log entry. Streaming errors are a strong indicator of trouble and
      require close monitoring, even if the error rate is less than one percent. While
      the error count relative to frames transmitted or received might be low, a steadily
      increasing number of errors indicates a marginal port.
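
The sampling approach can be sketched as follows. This Python example assumes that a counter such as crc_err has already been collected at a fixed interval (for instance by SNMP polling) into a list; the sample values, the function names, and the choice of three consecutive rising intervals as the trigger are all hypothetical.

```python
def deltas(samples):
    """Differences between consecutive samples of one counter."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def steadily_increasing(samples, min_growing=3):
    """True when the counter rose in at least min_growing consecutive intervals."""
    run = 0
    for delta in deltas(samples):
        run = run + 1 if delta > 0 else 0
        if run >= min_growing:
            return True
    return False


# Hypothetical crc_err readings taken every ten seconds on two ports.
stable_port = [12, 12, 12, 12, 12]
marginal_port = [12, 15, 19, 26, 40]
print(deltas(marginal_port))
print(steadily_increasing(stable_port), steadily_increasing(marginal_port))
```

Looking at the deltas rather than the totals is what catches a port whose overall error ratio is still low.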
           The portErrShow statistics shown in Figure 8.8 were gathered from a
      switch that had a marginal NL_Port (HBA), connected to port 6. It turned out
      that the Gigabit Link Module (or GLM, a connector similar to a GBIC) was
      failing and causing a degraded signal. Note how high the enc_in and CRC
      errors are!

      Figure 8.8 portErrShow Example
                   frames      enc   crc   too   too   bad   enc  disc  link  loss  loss  frjt  fbsy
                 tx    rx    in   err  shrt  long   eof   out    c3  fail  sync   sig
      -------------------------------------------------------------------------------------------
      port 0:  2.9g  1.7g     0    12     0     0     0     0     0     0     2     1     0     0
      port 1:  305m  3.0g     0     0     0     0     0     0     0     0     1     0     0     0
      port 2:  1.2g  892m     0     0     0     0     0     0     0     0   556    27     0     0
      port 3:  1.1m   25m     0     0     0     0     0    82     0     4     9     4     0     0
      port 4:     0     0     0     0     0     0     0     0     0     0     0     0     0     0
      port 5:  9.5m  4.0g     0     0     0     0     0     0     0     0  1.4k  1.4k     0     0
      port 6:  668m  4.0g  6.0m   66m     0     0   236   51m     0    87    54    11     0     0


          The primary error statistics on which to focus, and their definitions,
      are as follows:
     •   enc_in Received data: the number of 8b/10b encoding errors that have
         occurred inside frame boundaries. This counter is generally zero,
         although occasional errors might occur on a normal link and give a
         nonzero result. (Minimum compliance with the link-bit error rate
         specification on a link continuously receiving frames would cause
         approximately one error every 20 minutes.) Reinitialization or reboots of the
         associated Nx_Port can also cause these errors, resulting in a low
         error count.
     •   crc_err Received frames: the number of CRC errors detected. A CRC
         error indicates that the contents of a frame are no longer valid.
         Reinitialization or reboots of the associated Nx_Port can also cause
         these errors, resulting in a low count.
     •   too_long Received frames: the number of frames that were longer than
         the maximum Fibre Channel frame size (such as a frame whose payload
         exceeds 2112 bytes).
     •   bad_eof The number of frames received with a badly formed
         end-of-frame.
     •   enc_out Receive link: the number of 8b/10b encoding errors recorded
         outside frame boundaries. This number might become nonzero during
         link initialization, but it indicates a problem if it increments faster than
         the allowed link-bit error rate (approximately once every 20 minutes).
     •   er_disc_c3 Receive link: the number of Class 3 frames discarded. Class
         3 frames can be discarded due to timeouts or invalid or unreachable
         destinations. This quantity could increment at times during normal
         operation, but might be used for diagnosing problems in some situations.
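
To apply these definitions to captured output such as Figure 8.8, the abbreviated counters first have to be expanded. The Python sketch below assumes that the k, m, and g suffixes stand for thousands, millions, and billions; that reading, the column names in COLUMNS, and the helper names are assumptions made for illustration, so check them against the help page for your firmware.

```python
# Assumed meaning of the counter suffixes in portErrShow output.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

# Column order as read from the Figure 8.8 header.
COLUMNS = ["frames_tx", "frames_rx", "enc_in", "crc_err", "too_shrt",
           "too_long", "bad_eof", "enc_out", "disc_c3", "link_fail",
           "loss_sync", "loss_sig", "frjt", "fbsy"]


def parse_count(text):
    """Expand an abbreviated counter such as '1.4k' into an integer."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(text)


def parse_row(line):
    """Split one 'port N: ...' row into (port number, counters by column)."""
    label, _, rest = line.partition(":")
    port = int(label.split()[1])
    values = [parse_count(value) for value in rest.split()]
    return port, dict(zip(COLUMNS, values))


rows = [
    "port 0: 2.9g 1.7g 0 12 0 0 0 0 0 0 2 1 0 0",
    "port 6: 668m 4.0g 6.0m 66m 0 0 236 51m 0 87 54 11 0 0",
]
for line in rows:
    port, stats = parse_row(line)
    received = stats["frames_rx"]
    ratio = stats["crc_err"] / received if received else 0.0
    print(f"port {port}: crc_err/frames_rx = {ratio:.4%}")
```

With these assumptions, port 6 stands well above the 0.5 percent guideline while port 0 is far below it, which matches the diagnosis given for Figure 8.8.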


NOTE
     A steady increase in errors between samples is a very strong sign that the
     associated port is not functioning properly.



    Marginal link troubleshooting and related troubleshooting commands are dis-
cussed in more detail in the “Troubleshooting Marginal Links” section later in
this chapter.

The switchShow Command
The switchShow command is another powerful command that has many uses
in the troubleshooting process. An excerpt from the switchShow help entry is
provided in Figure 8.9. The command is helpful for troubleshooting fabric as well as
edge device connectivity issues. It is likely to be one of the first commands you will
execute as part of your troubleshooting process. The nature of the problem will
dictate what switchShow data to focus on and how to interpret this data. As
shown in Table 8.2 and Figure 8.9, switchShow data is loosely organized into
three categories.

      Table 8.2 How switchShow Data Relates to the SAN Functional Areas

      Fabric-Related           Edge Device-Related           Miscellaneous
      switchState              port state                    switchId
      switchRole                                             switchBeacon
      switchDomain                                           switchType
      port state                                             switchName


      Figure 8.9 switchShow Definitions
      This switchShow command displays switch and port status information.
      Some information varies with the switch model, e.g. number of
      ports, and Domain ID values. The lines of the display show:


      switchName         The switch's symbolic name.
      switchType         The switch's model and revision numbers.
      switchState        The switch's state: Online, Offline, Testing, Faulty.
      switchRole         The switch's role: Principal, Subordinate, Disabled.
      switchDomain       The switch's Domain ID: 0-31 or 1-239.
      switchId           The switch's embedded port D_ID.
      switchWwn          The switch's Worldwide Name.
      switchBeacon       The switch's beaconing state (either ON or OFF).


      The switch summary is followed by one line per port:


      port number            The port number: 0-7 or 0-15.


      module type            The port module type (GBIC or other):
                              — - no module present
                              sw - shortwave laser
                              lw - longwave laser
                              cu - copper
                              id - serial ID


      port state             The port's state:

                        No_Card     - no interface card present
                        No_Module - no module (GBIC or other) present
                        No_Light    - the module is not receiving light
                        No_Sync     - receiving light but out of sync
                        In_Sync     - receiving light and in sync
                        Laser_Flt - module is signaling a laser fault
                        Port_Flt    - port marked faulty
                        Diag_Flt    - port failed diagnostics
                        Lock_Ref    - locking to the reference signal
                        Testing     - running diagnostics
                        Online      - the port is up and running
comment                The comment field may be blank, or may show:
                        Disabled    - the port is disabled
                        Bypassed    - the port is bypassed (loop only)
                        Loopback    - the port is in loopback mode
                        E_Port      - fabric port, shows WWN of attached
                                      switch
                        F_Port      - pt-pt port, shows WWN of attached
                                      N_Port
                        G_Port      - pt-pt but not yet E_Port or F_Port
                        L_Port      - loop port, shows number of
                                      NL_Ports


                        if a port is configured as a long-distance port,
                        the long distance level is shown in the format of
                        "Lx", x being the long-distance level number.
                        See portCfgLongDistance for the level description.


    When troubleshooting issues involve the fabric services or a switch’s ability to
participate in the fabric, the important parts of switchShow data to focus on are
switchState, switchRole, and switchDomain.
    Port state is applicable from a fabric perspective for observing the state of
expansion ports (E_Ports). E_Ports associated with ISLs are the ports used to
      connect multiple switches together forming a fabric. Port state is also useful for
      troubleshooting connectivity problems with end devices (F_Ports and FL_Ports).
          In a running fabric, the switchState should always be online. If not, access to
      and from the switch is not possible. It is possible that the switch may be in a tran-
      sitory state as it comes online from a power cycle or reboot, so check again to
      make sure this is not the case. It is also possible that the switch has been manually
      disabled using the switchDisable command.
          A switch can operate as a principal or subordinate switch, or it can be disabled;
      the switchRole variable indicates which. There is only one principal switch in the
      fabric, and if the principal fails, another switch will assume this role. The principal
      switch facilitates bringing up the fabric and assigning domain IDs. A
      switch domain ID is an address that identifies the switch in a fabric. Domain IDs
      are automatically assigned by the principal switch as part of the fabric
      initialization process. It is possible to manually assign a domain ID as well. SilkWorm
      1000 series switches use the domains 0–31, and SilkWorm 2000 series switches
      and beyond use the domains 1–239. If a switch is not a principal, it operates in a
      subordinate switch role. If the switch role indicates disabled, access to and from
      the switch is not possible and it is likely that someone disabled the switch by
      typing switchDisable, or the switch was unable to obtain a domain ID. When a
      switch is disabled, a comment of “unconfirmed” accompanies the domain ID
      (Figure 8.10). Normally, a switch will be in the disabled state after the
      switchDisable command is issued. The “unconfirmed” attribute could also be caused by a
      problem with the fabric, which causes a switch to be unable to confirm its
      domain ID even though the switch is enabled. When the switch is disabled, the
      LEDs will blink yellow every two seconds and the port state will indicate disabled.

      Figure 8.10 Switch Disabled and Unconfirmed Domain
      core1:admin> switchshow
      switchName:        core1
      switchType:        2.4
      switchState:       Offline
      switchRole:        Disabled
      switchDomain:      1 (unconfirmed)
      switchId:          fffc01
      switchWwn:         10:00:00:60:69:10:8d:fd
      switchBeacon:      OFF
port   0: sw     Laser_Flt      Disabled
port   1: sw       In_Sync      Disabled
port   2: sw       In_Sync      Disabled
port   3: sw       In_Sync      Disabled
port   4: sw       In_Sync      Disabled
port   5: sw       In_Sync      Disabled
port   6: —      No_Module      Disabled
port   7: —      No_Module      Disabled
port   8: —      No_Module      Disabled
port   9: —      No_Module      Disabled
port 10: —       No_Module      Disabled
port 11: —       No_Module      Disabled
port 12: —       No_Module      Disabled
port 13: —       No_Module      Disabled
port 14: —       No_Module      Disabled
port 15: —       No_Module      Disabled
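
The fabric-related checks on switchState, switchRole, and switchDomain can be scripted against captured switchShow text. The following Python sketch is a minimal illustration; the function names are hypothetical, and the sample text is the summary portion of Figure 8.10.

```python
def parse_switch_header(text):
    """Collect the 'switchXxx: value' summary lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key.startswith("switch"):
            fields[key] = value.strip()
    return fields


def fabric_findings(fields):
    """List the fabric-related conditions that deserve a closer look."""
    findings = []
    if fields.get("switchState") != "Online":
        findings.append("switchState is " + fields.get("switchState", "missing"))
    if fields.get("switchRole") == "Disabled":
        findings.append("switchRole is Disabled")
    if "unconfirmed" in fields.get("switchDomain", ""):
        findings.append("domain ID is unconfirmed")
    return findings


sample = """switchName:        core1
switchType:        2.4
switchState:       Offline
switchRole:        Disabled
switchDomain:      1 (unconfirmed)
switchId:          fffc01
switchWwn:         10:00:00:60:69:10:8d:fd
switchBeacon:      OFF"""

fields = parse_switch_header(sample)
for finding in fabric_findings(fields):
    print(finding)
```

Because the line is split only at the first colon, the colons inside the switchWwn value are preserved.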


    The SilkWorm 1000 series of switches uses the domain IDs 0–31, and the
SilkWorm 2000 series and beyond switches use the domain IDs 1–239. Normally,
a domain ID is automatically assigned when a switch joins the fabric; however,
there are circumstances that can result in domain ID conflicts. This can happen
when connecting two online switches that have already been assigned the same
domain ID. When two switches in a fabric have the same domain ID, the fabric
segments along an ISL so that domain IDs are unique within each segment.
    The port state information generated by switchShow is pertinent to
fabric-related issues if an ISL port is affected. One issue that relates to ISLs involves the
port’s inability to fully initialize. While the port is online, it remains in a generic
port (G_Port) state because it could not initialize as an E_Port. Another issue that
affects ISLs occurs when the link is unable to initialize, resulting in the port not
coming online at all. This could be caused by a marginal link, an offline switch
connected to the other end of the ISL, or a fabric initialization issue. In either
circumstance, it is incumbent upon the SAN administrator to establish whether the port
is an ISL port or an edge-device port that is not connected, as there is no way to tell
the type of device connected until after the port initializes. Execute the commands
      portDisable and portEnable, providing the offending port number as an argu-
      ment to try to reinitialize the port.
           The Switch Name is assigned by the user and does not have to be unique in
      the fabric. However, uniquely naming each switch can make your SAN
      administration easier. With some Fabric OS versions, WEB TOOLS might not function
      properly if the Switch Name does not match the switch’s actual host name. You
      assign a Switch Name with the switchName command.
           The switchId value is the switch’s 24-bit Destination ID (D_ID) address in
      the fabric. This is the Fibre Channel address that another switch would use to
      send a frame to the switch itself, rather than to a device connected to the
      switch. This value might appear in portLog data, for example, when the switch
      probes an edge device for Name Server information.
           Using the switchBeacon command, you can have the switch flash a
      back-and-forth pattern (from left to right, and right to left) in yellow to identify
      the switch. This is helpful if you are doing maintenance and need to identify a
      switch that is positioned in a rack with many other switches. Finally, the
      switchType information indicates the switch model and revision in the form
      model.revision, as shown in Table 8.3.

      Table 8.3 switchType Values and Associated Architecture

      SwitchType Value                  Switch Model
      1                                 1000 series
      2                                 2800
      3                                 2400
      4                                 20x0
      5                                 22x0
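
As a small illustration of the model.revision form, the Python sketch below splits a switchType value and looks up the model in a dictionary copied from Table 8.3. The function name decode_switch_type is hypothetical, and values outside the table are reported as unknown rather than guessed.

```python
# Model numbers copied from Table 8.3.
MODELS = {1: "1000 series", 2: "2800", 3: "2400", 4: "20x0", 5: "22x0"}


def decode_switch_type(value):
    """Split a switchType string such as '2.4' into (model name, revision)."""
    model, _, revision = value.strip().partition(".")
    return MODELS.get(int(model), "unknown"), revision


print(decode_switch_type("2.4"))
```

For the switchType of 2.4 shown in Figure 8.10, the lookup gives model 2800, revision 4.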

           Information in the port state section includes the port state, the type of
      media, the World-Wide Name (WWN) of the connected device, the
      switch name if the attached device is a switch, private and phantom device
      information, and upstream or downstream information.
           The port state will typically be online or offline; however, as shown in Figure
      8.4, a laser fault is also indicated when encountered. The type of interface media is
      shown as well, indicating the type of GBIC used. SW is for shortwave GBICs, LW
      is for longwave GBICs (for longer distances), and ID is for serial ID GBICs.
      Serial ID GBICs are smart GBICs with serial number and state information.
    A private device is normally a loop device that does not perform a Fabric
Login (FLOGI) and uses an 8-bit address. A phantom address is a 24-bit translated
address for an 8-bit device. A phantom is created for the embedded port so that
services and other devices within the SAN can communicate with the devices on
a private loop. The switch recognizes only device addresses of 24 bits in length.
Therefore, services on the switch that need to communicate with the private
devices need to have a 24-bit proxy for their 8-bit addresses. Each device that
wants to communicate with devices on a private loop needs to be “represented”
on the loop directly. This is done by creating a phantom device for each host that
wants to communicate with devices on the private loop. Each phantom acts
on behalf of a device that wishes to communicate with devices on the loop.
    The terms upstream and downstream designate a particular switch’s position
in reference to the principal switch in the fabric. These paths are used in the
process for assigning switch domain IDs. In Figure 8.11, notice that switch core1 is
the principal switch, and all “stream” designators are downstream. For switch
edge1, the path to the principal switch is upstream through port 2. There is also a
downstream path from switch edge1. This path is used by switch core2 to access
switch core1; hence, port 3 is designated as a downstream port. The principal
switch has no upstream ports.
    The port state section of the switchShow output is very helpful in identi-
fying edge-device connection issues. These issues can involve a range of prob-
lems, from missing devices to devices initializing with the wrong topology (for
example, a loop-configured device initializing as point-to-point topology). The
explanation of port states and associated comments is fairly straightforward.
When in doubt, check to see that the port is online, assuming a device is
attached, and that the topology is correct (F_Port or L_Port). If neither of these
values is present, you will need to do further analysis.

The nsShow Command
An excerpt from the nsShow help entry is provided in Figure 8.12. The most
important thing about nsShow output is whether the device in which you are
interested appears in the command output. If a device does not appear in the
Name Server, other devices will not be able to access it. There are some instances
where initiators bypass the Name Server and directly communicate with a device
by using a previously obtained address or by scanning a table of addresses. This
behavior is considered suspect, as it bypasses a standard methodology. Note that
hard zoning prevents such activities from occurring, ensuring that all devices
behave appropriately within the SAN.

      Figure 8.11 Upstream and Downstream Paths in Reference to
      switchShow Output

             switchName:   core1
             switchType:   2.4
             switchState:  Online
             switchRole:   Principal
             switchDomain: 1
             switchId:     fffc01
             switchWwn:    10:00:00:60:69:10:8d:fd
             switchBeacon: OFF
             port 0: sw Laser_Flt
             port 1: sw Online    E-Port 10:00:00:60:69:10:9b:52   "edge2" (downstream)
             port 2: sw Online    E-Port 10:00:00:60:69:11:f9:f7   "edge1" (downstream)
             port 3: sw Online    E-Port 10:00:00:60:69:10:9b:52   "edge2"
             port 4: sw Online    E-Port 10:00:00:60:69:12:f9:8c   "edge3" (downstream)
             port 5: sw Online    E-Port 10:00:00:60:69:12:f9:8c   "edge3"

             switchName:   edge1
             switchType:   2.4
             switchState:  Online
             switchRole:   Subordinate
             switchDomain: 2
             switchId:     fffc02
             switchWwn:    10:00:00:60:69:11:f9:f7
             switchBeacon: OFF
             port 0: sw No_Light
             port 1: sw Online    E-Port 10:00:00:60:69:10:9b:5b "core2"
             port 2: sw Online    E-Port 10:00:00:60:69:10:8d:fd "core1" (upstream)
             port 3: sw Online    E-Port 10:00:00:60:69:10:9b:5b "core2" (downstream)
             port 4: — No_Module



             [Diagram: core switches core1 (principal) and core2, with edge
             switches edge1, edge2, and edge3]




      NOTE
           If the device is not in the Name Server, it is most likely invisible to the
           rest of the fabric and therefore inaccessible.

Figure 8.12 nsShow help Page
NAME
       nsShow - display local Name Server information


SYNOPSIS
       nsShow


AVAILABILITY
       all users


DESCRIPTION
This command displays local Name Server information, which
includes information about devices connected to this switch,
and cached information about devices connected to other
switches in the fabric.


The message "There is no entry in the Local Name Server" is displayed
if there is no information in this switch, but there still may be
devices connected to other switches in the fabric. The command
nsAllShow shows information from all switches.


Each line of output shows:
*               an asterisk indicates a cached entry from another switch.
Type            U for unknown, N for N_Port, NL for NL_Port.
Pid             The 24-bit Fibre Channel address.
COS             A list of classes of service supported by the device.
PortName        The device's port Worldwide Name.
NodeName        The device's node Worldwide Name.
TTL             The time-to-live (in seconds) for cached entries, or
                'na' (not-applicable) if the entry is local.


There may be additional lines if the device has registered any of
the following information (the switch automatically registers
SCSI inquiry data for FCP target devices): FC4s supported,
      (node) IP address, IPA, port and node symbolic names, fabric
      port name, hard address and/or port IP address.


           Often, the returned SCSI inquiry data is meaningful and indicates telling
      information such as the vendor, model, and the firmware revision level of the
      attached device, as shown in Figure 8.13. For HBAs, SCSI inquiry data occasion-
      ally is not returned and the Name Server entry is a bit sparse, so it is harder to
      identify the device. Some vendors are starting to allow administrators to manually
      populate this field to allow the textual information to be site-specific, such as
      node names or locations.
      Figure 8.13 The nsShow Output Explained

       Type Pid    COS     PortName                NodeName                 TTL(sec)
       N    0a1000;    2,3;20:00:00:e0:69:40:13:19;10:00:00:e0:69:40:13:19; na
       NL   0a19cb;      3;21:00:00:20:37:26:b0:6c;20:00:00:20:37:26:b0:6c; na
           FC4s: FCP [SEAGATE ST39102FCSUN9.0G0D29]
       NL   0a19cc;      3;21:00:00:20:37:26:84:22;20:00:00:20:37:26:84:22; na
           FC4s: FCP [SEAGATE ST39102FCSUN9.0G0D]
       N    0a1b21;    2,3;50:06:04:84:35:46:b5:4d;50:06:04:84:35:46:b5:4d; na
           FC4s: FCP [EMC     SYMMETRIX        5265]
       N    0a1c21;    2,3;50:06:04:84:3a:3b:1f:4d;50:06:04:84:3a:3b:1f:4d; na
           FC4s: FCP [EMC     SYMMETRIX        5265]

       Callouts:
         Entry 0a1000 is an HBA, so not much information is shown.
         The Seagate disks support Class 3 service, and the EMC supports Classes 2 and 3.
         The EMC entries are an EMC storage device with 5265 firmware.
         FCP = SCSI over Fibre Channel




         The difference between a device’s node WWN and its port WWN can be
      confusing. A device has only one node WWN and can potentially
      have one or more port WWNs. This way, it is possible to uniquely identify
multiple paths or interfaces to the same device. For example, today’s Just a Bunch
of Disks (JBOD) systems usually have two ports (A and B), and each port has an
associated port WWN. This enables two paths to connect to the same disk. How
do you know it is the same disk? The node WWN is the same for each path,
with each path having a unique port WWN. In Figure 8.14, if the disk with Port
ID (PID) 0a19cb is connected on both ports A and B, the node WWN stays the
same (20:00:00:20:37:26:b0:6c), the A port would have a WWN of
21:00:00:20:37:26:b0:6c, and the B port would have a WWN of
22:00:00:20:37:26:b0:6c.
Figure 8.14 The Difference between Port WWN and Node WWN

          Port A:    Port WWN = 21:00:00:20:37:26:b0:6c
          Device:    Node WWN = 20:00:00:20:37:26:b0:6c
          Port B:    Port WWN = 22:00:00:20:37:26:b0:6c




    The use of node WWNs and port WWNs is not always strictly followed, and
the Fibre Channel specifications are not clear on their usage. A node WWN
sometimes is used to represent an entire system and all ports (Port WWNs) asso-
ciated with that system.
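
Where a vendor does follow the one-node, many-ports convention, grouping Name Server entries by node WWN shows which devices are reachable over more than one path. The Python sketch below assumes that the port and node WWN pairs have already been extracted from nsShow output; the function name and the sample list (which uses the WWNs from Figures 8.13 and 8.14) are illustrative only, and the result is only as reliable as the device's use of node WWNs.

```python
from collections import defaultdict


def paths_by_node(entries):
    """Group port WWNs under the node WWN they report.

    entries is an iterable of (port_wwn, node_wwn) pairs.
    """
    groups = defaultdict(list)
    for port_wwn, node_wwn in entries:
        groups[node_wwn.lower()].append(port_wwn.lower())
    return dict(groups)


entries = [
    ("21:00:00:20:37:26:b0:6c", "20:00:00:20:37:26:b0:6c"),  # disk, port A
    ("22:00:00:20:37:26:b0:6c", "20:00:00:20:37:26:b0:6c"),  # same disk, port B
    ("21:00:00:20:37:26:84:22", "20:00:00:20:37:26:84:22"),  # a second disk
]

multipath = {node: ports for node, ports in paths_by_node(entries).items()
             if len(ports) > 1}
for node, ports in multipath.items():
    print(node, "is reachable through", len(ports), "ports")
```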
    The Name Server also provides information about a device’s PID. Knowing
how to decode a PID is helpful in translating a device’s SAN logical address into
a SAN physical location. If you know a device’s PID, you know the physical port
that device is attached to, the domain ID of the switch that device is attached to,
and whether that device is an N_Port or an NL_Port. Figure 8.15 explains this
decoding process further.

The topologyShow Command
The topologyShow command displays the fabric topology, as seen by the local
switch. topologyShow output consists of a list of all domains that are part of the
fabric, and for each of those domains, all the possible paths to reach these
domains from the local switch. In addition, topologyShow displays the total
number of switches in the fabric, and the domain ID of the local switch. It is also
helpful to issue the switchShow command to identify directly connected
switches. Look for E_Ports and the name of the switch located at the other end
of the E_Port to create a SAN topology. Perform a switchShow for every switch
      in the fabric. First, write down the name of the switch on which the command is
      issued. For each E_Port on that switch, write down the name of the switch to
      which the E_Port connects. Then draw a line between the switch on which the
      command is being run and the switch that shows up on the other end of the
      E_Port. The data in Figure 8.16 indicates that switch edge3 is directly connected
      to switches core1 and core2. To identify direct-connect switches in the
      topologyShow output, look for domain entries with a hop count of one. To
      obtain additional information on the switches in the fabric, such as their IP
      address, use the fabricShow command.
      Figure 8.15 How to Interpret the Port Addressing

                 Port Addressing
                 0x XX 1Y ZZ where:
                      •      XX is a value from 0x01 to 0xEF inclusive and indicates the domain ID of the
                             switch to which the device is attached
                      •      The “1” is always present on 2000 series switches
                      •      Y is the port number (0-F hex) to which the device is attached
                      •      ZZ is the AL_PA for a loop device, or 00 for an F_Port

                 An example: 021500
                      XX=02 Domain ID of the switch
                      Y=5   Port number
                      ZZ=00 If 00, the device is on an F_Port;
                            if nonzero, ZZ is the AL_PA of the device on the FL_Port
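
The decoding rules in Figure 8.15 can be sketched in a few lines of Python. This is only an illustration of the XX1YZZ layout described above for 2000 series switches; the function name decode_pid and the DecodedPid type are hypothetical and are not part of any Brocade tool.

```python
from typing import NamedTuple


class DecodedPid(NamedTuple):
    domain_id: int   # XX: domain ID of the switch
    port: int        # Y: switch port number (0-15)
    al_pa: int       # ZZ: AL_PA, or 0 for an F_Port
    port_type: str   # "F_Port" or "FL_Port"


def decode_pid(pid: str) -> DecodedPid:
    """Decode a PID written as six hex digits in the XX1YZZ layout."""
    text = pid.strip().lower()
    if text.startswith("0x"):
        text = text[2:]
    if len(text) != 6:
        raise ValueError("a PID is six hex digits, got %r" % pid)
    value = int(text, 16)

    domain_id = value >> 16
    middle = (value >> 8) & 0xFF
    al_pa = value & 0xFF

    # Figure 8.15: the upper nibble of the middle byte is always 1.
    if middle >> 4 != 0x1:
        raise ValueError("middle byte of %r is not of the form 1Y" % pid)
    if not 0x01 <= domain_id <= 0xEF:
        raise ValueError("domain ID out of range in %r" % pid)

    port_type = "F_Port" if al_pa == 0 else "FL_Port"
    return DecodedPid(domain_id, middle & 0x0F, al_pa, port_type)


if __name__ == "__main__":
    for sample in ("021500", "661600", "0214DC"):
        print(sample, decode_pid(sample))
```

Applied to the example in the figure, 021500 decodes to domain 2, port 5, on an F_Port. The same function reports domain 102 for PID 661600, which agrees with the local domain ID of the switch that lists that PID in Figure 8.17.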


      SAN Profile
      It is recommended that you create a profile of your fabric when it is functioning
      normally so that you always have a baseline against which to compare the current
      state of your SAN. You will want to create a profile before making any changes to
      the SAN, such as firmware upgrades or additions or deletions of switches or edge
      devices. This information can be captured from a logging facility within telnet and
      stored as a text file.


Figure 8.16 Use topologyShow to Determine the Number of Online Switches
in the SAN

         5 domains in the fabric; Local Domain ID: 3
         Domain    Metric    Hops    Out Port    In Ports    Flags    Name
         --------------------------------------------------------------------
            1       1000       1         0      0x00000002     D      "core1"
                                         2      0x00000008     D
            2       2000       2         0      0x00000000     D      "edge1"
                                         2      0x00000000     D
                                         3      0x00000000     D
                                         1      0x00000000     D
            4       2000       2         0      0x00000000     D      "edge3"
                                         2      0x00000000     D
                                         3      0x00000000     D
                                         1      0x00000000     D
            5       1000       1         3      0x00000001     D      "core2"
                                         1      0x00000004     D




         [Diagram: core switches core1 and core2; edge switches edge1, edge2, and edge3]
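
As a sketch of how the hop-count rule can be applied mechanically, the following Python reads topologyShow text in the layout of Figure 8.16 and lists the domains that are one hop away. The regular expressions assume exactly that layout, and the names parse_topology and direct_neighbors are hypothetical; output from other firmware versions may need different patterns.

```python
import re

# Output copied from Figure 8.16.
SAMPLE = '''\
5 domains in the fabric; Local Domain ID: 3
Domain    Metric    Hops    Out Port    In Ports    Flags    Name
--------------------------------------------------------------------
   1       1000       1         0      0x00000002     D      "core1"
                                2      0x00000008     D
   2       2000       2         0      0x00000000     D      "edge1"
                                2      0x00000000     D
                                3      0x00000000     D
                                1      0x00000000     D
   4       2000       2         0      0x00000000     D      "edge3"
                                2      0x00000000     D
                                3      0x00000000     D
                                1      0x00000000     D
   5       1000       1         3      0x00000001     D      "core2"
                                1      0x00000004     D
'''

HEADER = re.compile(r"(\d+) domains in the fabric; Local Domain ID: (\d+)")
# First line of each domain entry: domain, metric, hops, out port, in ports,
# flags, quoted name. Continuation lines (extra paths) do not match.
DOMAIN = re.compile(
    r'^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+\d+[ \t]+0x[0-9a-fA-F]+'
    r'[ \t]+\S+[ \t]+"([^"]*)"',
    re.MULTILINE,
)


def parse_topology(text):
    """Return the domain count, local domain ID, and remote domain entries."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("no topologyShow header line found")
    domains = [
        {"domain": int(d), "metric": int(m), "hops": int(h), "name": name}
        for d, m, h, name in DOMAIN.findall(text)
    ]
    return {
        "domain_count": int(header.group(1)),
        "local_domain": int(header.group(2)),
        "domains": domains,
    }


def direct_neighbors(topology):
    """Names of the switches that are one hop from the local switch."""
    return [d["name"] for d in topology["domains"] if d["hops"] == 1]


if __name__ == "__main__":
    topo = parse_topology(SAMPLE)
    print("domains in fabric:", topo["domain_count"])
    print("direct neighbors:", direct_neighbors(topo))
```

For the sample data the direct neighbors are core1 and core2. The local switch is counted in the header total but has no entry of its own, so four remote entries plus the local switch account for the five domains reported.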




     When you finish your maintenance or suspect a problem, take a new profile
and compare the baseline profile to your current profile. Any discrepancies
require further investigation. For troubleshooting purposes, a profile should con-
sist of the following information extracted from a healthy SAN:
     •    The number of domains in the fabric, which can be obtained from
          topologyShow outputs.
     •    The overall topology of the fabric, again from topologyShow and
          switchShow outputs.


                •    The number of noncached Name Server entries for each switch in the
                     fabric, which can be obtained by issuing the command nsShow.
                •    The total number of Name Server entries, which can be determined by
                     issuing the command nsAllShow.
          You can also obtain this data by issuing the command supportShow for every
      switch and then pulling the required data out of the log. Another option is to
      automate the acquisition of data and then parse out the necessary fields. Figure 8.17
      and Table 8.4 are examples of the necessary data collection and what a SAN
      profile looks like. The data to collect is bolded in Figure 8.17 as well.

      Figure 8.17 Data to Collect When Establishing a SAN Profile
      BigSAN102:admin> nsShow
      The Local Name Server has 2 entries {
          Type Pid            COS   PortName         NodeName               TTL(sec)
          N         661600;
                3;50:00:60:e8:02:76:b9:04;50:00:60:e8:02:76:b9:04; na
                FC4s: FCP [HITACHI OPEN-9              0112]
                Fabric Port Name: 20:06:00:60:69:10:67:c4
          N         661b00;
                3;50:00:60:e8:02:76:b9:00;50:00:60:e8:02:76:b9:00; na
                FC4s: FCP [HITACHI OPEN-9              0112]
                Fabric Port Name: 20:0b:00:60:69:10:67:c4
      }
      BigSAN102:admin> nsAllShow
      16 Nx_Ports in the Fabric {
              641300 661600 661b00 6a1100 6b1000 6b1101 6b1600 6d1100
              6d1200 6d1300 7215e1 761d01 761e00 771d00 771f00 781e00
      }
      BigSAN102:admin> topologyShow


      26 domains in the fabric; Local Domain ID: 102


      Output truncated. Make sure you capture all domain IDs in the fabric.
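
One way to parse out the necessary fields is sketched below in Python. It assumes the captured telnet session looks like Figure 8.17; the function name build_profile and the dictionary keys are hypothetical, and the patterns match only the three summary lines shown in the figure.

```python
import re

# Abbreviated capture in the layout of Figure 8.17.
CAPTURE = '''\
BigSAN102:admin> nsShow
The Local Name Server has 2 entries {
    Type Pid            COS   PortName         NodeName               TTL(sec)
    N         661600;
          3;50:00:60:e8:02:76:b9:04;50:00:60:e8:02:76:b9:04; na
    N         661b00;
          3;50:00:60:e8:02:76:b9:00;50:00:60:e8:02:76:b9:00; na
}
BigSAN102:admin> nsAllShow
16 Nx_Ports in the Fabric {
        641300 661600 661b00 6a1100 6b1000 6b1101 6b1600 6d1100
        6d1200 6d1300 7215e1 761d01 761e00 771d00 771f00 781e00
}
BigSAN102:admin> topologyShow
26 domains in the fabric; Local Domain ID: 102
'''


def build_profile(capture):
    """Extract the profile fields from one switch's captured session."""
    local = re.search(r"The Local Name Server has (\d+) entr", capture)
    fabric = re.search(r"(\d+) Nx_Ports in the Fabric \{(.*?)\}", capture, re.S)
    topo = re.search(
        r"(\d+) domains in the fabric; Local Domain ID: (\d+)", capture
    )
    # Only look for PIDs inside the nsAllShow braces, not in the nsShow body.
    pids = re.findall(r"\b[0-9a-fA-F]{6}\b", fabric.group(2)) if fabric else []
    return {
        # None means the summary line was not found in the capture.
        "local_ns_entries": int(local.group(1)) if local else None,
        "nx_ports_declared": int(fabric.group(1)) if fabric else None,
        "pids": pids,
        "domains": int(topo.group(1)) if topo else None,
        "local_domain": int(topo.group(2)) if topo else None,
    }


if __name__ == "__main__":
    profile = build_profile(CAPTURE)
    for key, value in profile.items():
        print(key, "=", value)
```

A field that comes back as None means the corresponding summary line was not present in the capture, which is worth checking by hand before the profile is stored as a baseline.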


Table 8.4 Formatted SAN Profile

Switch                           Local NS Entries
BigSAN100                        0
BigSAN101                        0
BigSAN102                        2
BigSAN103                        4
BigSAN104                        1
BigSAN105                        1
BigSAN106                        4
BigSAN107                        4
BigSAN108                        0
BigSAN109                        0
BigSAN110                        0
BigSAN111                        0
BigSAN112                        0
BigSAN113                        0
BigSAN114                        0
BigSAN115                        0
BigSAN116                        0
BigSAN117                        0
BigSAN118                        0
BigSAN119                        0
BigSAN120                        0
BigSAN121                        0
BigSAN122                        0
BigSAN123                        0
BigSAN124                        0
BigSAN125                        0
Total Nodes                      16
Total Switches                   26


      What Data Can a Host Provide?
      A host can provide a significant amount of data to aid the SAN troubleshooting
      process. Think again of the SAN as a virtual cable. A working virtual SAN cable
      means that edge devices that are expected to communicate with each other are
      successfully connected as N_Port or NL_Port (verify this with switchShow),
      and that the devices are present in the Name Server (verify this with nsShow).
      Assuming that zoning is properly configured, these edge devices should be able to
      communicate with each other, just as if they were directly connected to each other
      with a cable.
           A host can indicate if devices are visible to that host. In a Windows
      environment, do this by running Disk Administrator; in a UNIX environment, do this by
      issuing the format command. Many tools from HBA vendors are GUI-based and
      allow for real-time, live viewing of connection status to storage devices. Some
      examples of these tools are TROIKA’s SAN Command and JNI’s EZ Fibre. If the
      devices do not show up at the host when these commands are issued, the next
      step is to see why these devices are not visible to the host. The key to this involves
      reviewing the host log files. For Solaris, the message log file is normally
      /var/adm/messages. You can watch the SAN HBA events in real time
      by running tail -f /var/adm/messages. For a Microsoft environment, you can
      use the Event Viewer to see the HBA-related activity. You might need to change
      the verbosity levels or set the HBAs to debug mode to see detailed data in the
      message logs. An example of a log from a Solaris host is provided in Figure 8.18 to
      familiarize you with the data and how you might use it to assist in the
      troubleshooting process.
           In Figure 8.18, the HBA recognizes and has visibility to seven JBOD disks. At
      this point, you can conclude that the SAN virtual cable is working fine and that
      the host has visibility to the devices at a SAN level. The next step is to see if the
      devices are visible to the operating system. Based on the next set of error messages,
      the indications are that the devices are not visible to the operating system and that
      the HBA should be investigated further. The problem illustrated in Figure 8.18
      occurred because the host HBA drivers were not configured to bind with any
      targets; hence, the disks were not presented to the operating system. To resolve this
      issue, it is necessary to follow the HBA directions for binding SAN targets.
      Figure 8.18 Solaris Host SAN-Related Messages


         Solaris 2.8 host, JNI HBA

        (The link was reset)
        May 16 20:08:29 sun1 jnic: [ID 619166 kern.notice] jnic0: Loss of sync detected
        May 16 20:08:29 sun1 jnic: [ID 229332 kern.notice] jnic0: Link Down

        (The devices go away)
        May 16 20:08:29 sun1 jnic: [ID 957663 kern.notice] jnic0: Port 0214DC (WWN 2000002037D97796:2100002037D97796) removed.
        May 16 20:08:29 sun1 jnic: [ID 832114 kern.notice] jnic0: Port 0214E0 (WWN 2000002037D9730F:2100002037D9730F) removed.
        May 16 20:08:29 sun1 jnic: [ID 515504 kern.notice] jnic0: Port 0214E1 (WWN 2000002037D976B3:2100002037D976B3) removed.
        May 16 20:08:29 sun1 jnic: [ID 355286 kern.notice] jnic0: Port 0214E2 (WWN 2000002037D9775A:2100002037D9775A) removed.
        May 16 20:08:29 sun1 jnic: [ID 682426 kern.notice] jnic0: Port 0214E4 (WWN 2000002037D974D7:2100002037D974D7) removed.
        May 16 20:08:29 sun1 jnic: [ID 944579 kern.notice] jnic0: Port 0214E8 (WWN 2000002037D96FEB:2100002037D96FEB) removed.
        May 16 20:08:29 sun1 jnic: [ID 846798 kern.notice] jnic0: Port 0214EF (WWN 2000002037D97745:2100002037D97745) removed.

        (The link comes back up)
        May 16 20:08:35 sun1 jnic: [ID 184835 kern.notice] jnic0: Link Up

        (The devices return)
        May 16 20:08:35 sun1 jnic: [ID 475247 kern.notice] jnic0: Port 0214DC (WWN 2000002037D97796:2100002037D97796) available.
        May 16 20:08:35 sun1 jnic: [ID 861423 kern.notice] jnic0: Port 0214E0 (WWN 2000002037D9730F:2100002037D9730F) available.
        May 16 20:08:35 sun1 jnic: [ID 887895 kern.notice] jnic0: Port 0214E1 (WWN 2000002037D976B3:2100002037D976B3) available.
        May 16 20:08:35 sun1 jnic: [ID 172876 kern.notice] jnic0: Port 0214E2 (WWN 2000002037D9775A:2100002037D9775A) available.
        May 16 20:08:35 sun1 jnic: [ID 900463 kern.notice] jnic0: Port 0214E4 (WWN 2000002037D974D7:2100002037D974D7) available.
        May 16 20:08:35 sun1 jnic: [ID 121592 kern.notice] jnic0: Port 0214E8 (WWN 2000002037D96FEB:2100002037D96FEB) available.
        May 16 20:08:35 sun1 jnic: [ID 549483 kern.notice] jnic0: Port 0214EF (WWN 2000002037D97745:2100002037D97745) available.

        (Now there looks to be an issue with the bindings: an HBA issue?)
        May 16 21:30:45 sun1 jnic: [ID 232625 kern.notice] jnic0: Target12 Lun0: Initialization failed: No fibre channel bindings provided.
        May 16 21:30:45 sun1 jnic: [ID 353184 kern.notice] jnic0: Target13 Lun0: Initialization failed: No fibre channel bindings provided.
        May 16 21:30:45 sun1 jnic: [ID 473743 kern.notice] jnic0: Target14 Lun0: Initialization failed: No fibre channel bindings provided.
        May 16 21:30:45 sun1 jnic: [ID 594302 kern.notice] jnic0: Target15 Lun0: Initialization failed: No fibre channel bindings provided.
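
The reading of Figure 8.18 described above can be sketched as a small Python script that separates SAN-level visibility from operating-system-level failures. The patterns below match only the JNI message wording shown in the figure, and the function name summarize_hba_log is hypothetical; other drivers and driver versions log differently.

```python
import re

# A few lines in the format of Figure 8.18.
LOG = """\
May 16 20:08:29 sun1 jnic: [ID 229332 kern.notice] jnic0: Link Down
May 16 20:08:29 sun1 jnic: [ID 957663 kern.notice] jnic0: Port 0214DC (WWN 2000002037D97796:2100002037D97796) removed.
May 16 20:08:29 sun1 jnic: [ID 832114 kern.notice] jnic0: Port 0214E0 (WWN 2000002037D9730F:2100002037D9730F) removed.
May 16 20:08:35 sun1 jnic: [ID 184835 kern.notice] jnic0: Link Up
May 16 20:08:35 sun1 jnic: [ID 475247 kern.notice] jnic0: Port 0214DC (WWN 2000002037D97796:2100002037D97796) available.
May 16 20:08:35 sun1 jnic: [ID 861423 kern.notice] jnic0: Port 0214E0 (WWN 2000002037D9730F:2100002037D9730F) available.
May 16 21:30:45 sun1 jnic: [ID 232625 kern.notice] jnic0: Target12 Lun0: Initialization failed: No fibre channel bindings provided.
"""

PORT_EVENT = re.compile(
    r"Port ([0-9A-Fa-f]{6}) \(WWN [^)]*\) (removed|available)\."
)
INIT_FAILED = re.compile(r"Target(\d+) Lun(\d+): Initialization failed: (.+)")


def summarize_hba_log(log):
    """Track the last known state of each port and collect target failures."""
    last_state = {}   # PID -> True if the most recent event was "available"
    failures = []     # (target, lun, reason)
    for line in log.splitlines():
        event = PORT_EVENT.search(line)
        if event:
            last_state[event.group(1).upper()] = event.group(2) == "available"
            continue
        failed = INIT_FAILED.search(line)
        if failed:
            failures.append(
                (int(failed.group(1)), int(failed.group(2)), failed.group(3))
            )
    return {
        "visible": sorted(p for p, up in last_state.items() if up),
        "missing": sorted(p for p, up in last_state.items() if not up),
        "init_failures": failures,
    }


if __name__ == "__main__":
    summary = summarize_hba_log(LOG)
    print("visible at SAN level:", summary["visible"])
    print("still missing:", summary["missing"])
    for target, lun, reason in summary["init_failures"]:
        print("Target%d Lun%d failed: %s" % (target, lun, reason))
```

An empty "missing" list together with initialization failures corresponds to the situation in the figure: the devices are visible to the HBA, so the next place to look is the HBA binding configuration.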


      When to Use portLog and Other Advanced Tools
      The portLog debugging tool is a low-level tool for debugging the SAN. The
      portLog facilities are available in two forms: portLogDump and portLogShow.
      The help page for portLogShow is reasonably detailed and helpful for decoding
      portLog entries. To effectively understand portLog data, you will need a solid
      background in Fibre Channel fundamentals. Training on decoding portLog data is
      available from Brocade (www.brocade.com/education_services).
          An annotated example of a portLog entry is shown in Figure 8.19 to provide
      some insight into how a portLog entry is decoded. You will most likely
      encounter portLog data when entering the supportShow command, which calls
      portLogDump, or if you are requested to obtain this data by Brocade support.
      Figure 8.19 portLog Entry Example

           21:01:30.216 tReceive   Rx3              6   116   22fffffc,00011600,00f2ffff,03000000

           Field           Meaning
           21:01:30.216    Timestamp
           tReceive        Task
           Rx3             Class 3 frame received
           6               Port 6
           116             Frame payload size
           22fffffc        D_ID = name server
           00011600        S_ID = Domain 1, Port 6
           03000000        ELS -> PLOGI
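
The annotations in Figure 8.19 can be reproduced with a short Python sketch. It assumes, based only on the figure, that the low 24 bits of the first word are the D_ID, the low 24 bits of the second word are the S_ID, and the top byte of the fourth word is the ELS command code; the third word is left uninterpreted. The function names are hypothetical, and the portLogShow help page remains the authority on the field layout.

```python
# Well-known addresses named in this chapter (see Table 8.5).
WELL_KNOWN = {
    0xFFFFFC: "Name Server",
    0xFFFFFE: "Fabric F_Port",
}

# A few standard ELS command codes; not an exhaustive list.
ELS_COMMANDS = {
    0x03: "PLOGI",
    0x04: "FLOGI",
    0x05: "LOGO",
}


def describe_address(address):
    """Name a well-known address, or split a PID into domain and port."""
    if address in WELL_KNOWN:
        return WELL_KNOWN[address]
    # XX1YZZ layout from Figure 8.15: domain in the top byte, port in the
    # low nibble of the middle byte.
    return "Domain %d, Port %d" % (address >> 16, (address >> 8) & 0x0F)


def decode_frame_words(args):
    """Decode the four comma-separated hex words of an Rx or Tx entry."""
    words = [int(word, 16) for word in args.split(",")]
    if len(words) != 4:
        raise ValueError("expected four words, got %d" % len(words))
    d_id = words[0] & 0xFFFFFF
    s_id = words[1] & 0xFFFFFF
    els_code = words[3] >> 24
    return {
        "d_id": describe_address(d_id),
        "s_id": describe_address(s_id),
        "els": ELS_COMMANDS.get(els_code, "unknown (0x%02x)" % els_code),
    }


if __name__ == "__main__":
    print(decode_frame_words("22fffffc,00011600,00f2ffff,03000000"))
```

For the entry in the figure this yields a PLOGI from Domain 1, Port 6 to the Name Server, which matches the annotations.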



          For Fibre Channel developers and people who are intimately involved with
      SANs, a programmer’s guide is available (Fabric Programming Guide Revision 2.1
      Publication number 53-0001561-01Rev. A 4/11/00). The guide is available from the
      Brocade Web site (www.brocade.com); however, a login and password are required.
      Instructions for obtaining a login and password are posted on the Web site.
          Another low-level debugging tool is a Fibre Channel analyzer. Companies
      such as Finisar (www.finisar.com) and Xyratex (www.xyratex.com) manufacture
      Fibre Channel analyzers. An analyzer is typically used in a development environ-
      ment and rarely to debug production environments. Analyzers can generate a
      tremendous amount of data. An analyzer is usually inserted into the SAN
      between the switch and an edge device, or between two switches. Normally, a


detailed analysis and troubleshooting effort is required to identify where to insert
the analyzer into the SAN and what data the analyzer should look for. Again, an
extensive background in Fibre Channel is necessary to effectively use an analyzer.




   In-Depth Troubleshooting with Fibre Channel Analyzers
   Although configuring SANs is getting easier with each new generation
   of equipment, it is often useful to have the appropriate tools for con-
   figuring and testing your SAN. As in standard Ethernet-based networks,
   and even in local parallel SCSI bus installations, network sniffers and bus
   analyzers are very handy tools to really understand what is going on.
   Fibre Channel cable testers can be purchased for nominal amounts, link
   activity analyzers for several hundred dollars, and full-blown protocol
   analyzers for several thousand dollars.
        Fibre Channel cable testers provide simple connectivity tests for a
   cable; in the case of copper cables, they test for connectivity between
   two ends of a cable. Similar optical tools are available for checking the
   amount of light that is transmitted through an optical cable, and they
   provide convenient diagnostic capabilities for cable integrity.
        An affordable alternative to full-blown protocol analyzers is a link
   activity analyzer. Link activity analyzers attach to Fibre Channel cables
   and analyze basic activity on the link. Basic functionality includes LEDs
   to indicate when traffic is being sent and received, as well as informa-
   tion such as MB/sec counters, online or offline information, error lights
   for CRC errors, and optical signal quality indicators. These types of link
   activity analyzers are ideal for isolating specific problem areas in a SAN,
   and identifying questionable links or devices.
        Finally, for the most information about what is happening on a
   SAN, protocol analyzers are the best tools available. These tools will
   record every bit of information that comes across a wire, and through
   user software can play back activity, show errors, highlight questionable
   transactions, and more. Ranging from simple two-channel analyzers
   embedded in a PC to multichannel testers that can test all of the ports
   of a Fibre Channel switch in a single box, these analyzers are invaluable
   if you really want to know what is going wrong with your network.
        These tools can be invaluable for debugging problems directly at
   the source, and are often bundled with training and classes to help you
   learn the basic protocol and debugging techniques. For many problems



         you encounter in a development environment, a protocol analyzer is the
         only tool that will help you really see what is going on. However, in pro-
         duction environments it is unnecessary to invest in a full analyzer for
         day-to-day operation.



      Troubleshooting the Fabric
      A problem with the fabric is a pervasive issue that often affects more than one
      device. When a fabric issue is experienced in a resilient SAN, it might have no
      impact on SAN functionality since the SAN redundancy compensates for the
      marginal situation. Table 8.5 provides a high-level review of problematic fabric
      symptoms and associated possible causes. Fabric issues are normally associated
      with heterogeneous storage and server environments in which all devices have
      not been tested as a system.

      Table 8.5 Symptoms Indicative of a Fabric Problem

      Symptom                      Possible Causes

      Multiple edge devices        •    Fabric segmentation (zone conflict, mismatched
      are inaccessible from             fabric parameters)
      multiple hosts               •    Switch failure
                                   •    Edge device timeout or communication conflict
                                        when accessing the Name Server (FFFFFC) or
                                        Fabric F_Port (FFFFFE)
                                   •    Unconfirmed domain
                                   •    Message Queue (MQ) issues
                                   •    Hosts and/or storage attempted to access the
                                        fabric prior to fabric convergence
                                   •    Domain ID conflict
                                   •    Port configuration conflict
                                   •    No fabric license installed

      Incompletely initialized     •    Marginal link
      ISLs: ISL port initializes   •    Fabric initialization error
      as a G_Port or does not
      come online

          The remainder of this section identifies what tools to use and data to analyze
      when a fabric issue is suspected. Symptoms are explained in further detail and
      specific issue traits are identified. Where possible, workarounds or corrective
      actions are specified.


What to Look for in a Malfunctioning Fabric
If a switch is unable to join the fabric, all devices on that switch become
inaccessible to the fabric and possibly to each other. When edge devices time out or are
unable to properly communicate with fabric services, communication between
numerous edge devices is interrupted and some devices become inaccessible.


NOTE
     When initially identifying a fabric issue, look for a large number of edge
     devices that are behaving marginally or not communicating at all. See if
     you can identify a pattern. Is the outage random throughout the fabric,
     or can you correlate the outage to a particular switch? Does the outage
     correlate to one particular host type or storage device?




Host Behavior
Hosts that are involved with a fabric problem exhibit a variety of symptoms, one
of which is that some or all edge devices become inaccessible. You can verify this
situation for UNIX hosts using the command format to see if any devices have
disappeared. For Microsoft Windows 2000, start up the Disk Management utility
and check if any devices have disappeared. The Solaris /var/adm/messages file
and the Microsoft Event Viewer might provide further insight into the issue. ISL
initialization issues normally are invisible to the host, as the fabric will reroute
around failed ISLs and ensure connectivity, unless the ISL failure results in the
SAN becoming segmented, in which case edge devices will become inaccessible
to the host. Another possible symptom on the server is reduced performance of
the application. In the event of an ISL failure, the fabric will reroute the traffic as
mentioned. When this happens, the traffic typically has to share ISLs with
more devices than normal, possibly resulting in reduced performance due to
congestion on the ISLs. Utilities supplied by your HBA vendor can also be helpful in
identifying host SAN status.

SAN Profile
If you suspect a SAN issue, create a new SAN profile and compare your baseline
SAN profile to your newly created SAN profile. Any unexplained discrepancies


      require further investigation—whether one or more switches have dropped out,
      or if there are several missing Name Server entries.
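
A baseline-versus-current comparison can be sketched in Python as follows. The profile layout (a dictionary with "domains", "total_ns", and a per-switch "local_ns" mapping) is a hypothetical structure chosen for this illustration; it mirrors the fields listed in Table 8.4 but is not a Brocade format.

```python
def compare_profiles(baseline, current):
    """Return a list of human-readable discrepancies between two profiles."""
    findings = []

    if current["domains"] != baseline["domains"]:
        findings.append(
            "domain count changed: expected %d, found %d"
            % (baseline["domains"], current["domains"])
        )

    if current["total_ns"] != baseline["total_ns"]:
        findings.append(
            "total Name Server entries changed: expected %d, found %d"
            % (baseline["total_ns"], current["total_ns"])
        )

    for switch, expected in sorted(baseline["local_ns"].items()):
        actual = current["local_ns"].get(switch)
        if actual is None:
            # No data usually means the switch could not be reached.
            findings.append("%s: no data collected" % switch)
        elif actual != expected:
            findings.append(
                "%s: local Name Server entries expected %d, found %d"
                % (switch, expected, actual)
            )
    return findings


if __name__ == "__main__":
    # Baseline values taken from Table 8.4 (two switches shown).
    baseline = {
        "domains": 26,
        "total_ns": 16,
        "local_ns": {"BigSAN102": 2, "BigSAN103": 4},
    }
    # Hypothetical current state: one switch has dropped out.
    current = {
        "domains": 25,
        "total_ns": 12,
        "local_ns": {"BigSAN102": 2},
    }
    for finding in compare_profiles(baseline, current):
        print(finding)
```

An empty result means the current state matches the baseline for the fields that were collected; every entry in the list is a discrepancy that still needs an explanation.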

      Switch LEDs
      If you can observe the SAN switches while the problem is occurring, try to
      detect an LED pattern. Focus on the ISL ports first. Any yellow lights (blinking
      or steady) indicate that manual intervention is required. At this point, log in to
      the switch with yellow lights and issue the command supportShow to extract
      debugging information for further analysis. If the switch is disabled (all ports
      blinking a slow yellow), issuing a switchEnable command will bring the switch
      back into the SAN. If a port is yellow (blinking or steady), you can bring the
      device back online by issuing the command portDisable and then a
      portEnable on the yellow port. Issue the command switchShow to verify the
      port state or to check for a disabled switch.

      The errShow Command
      Start the troubleshooting process by reviewing errShow data for every switch in
      the fabric. Fabric segmentation and Message Queue (MQ) errors are indicative of
      an error that will cause the switch and its connected devices to become inaccessible
      to the fabric. Fabric segmentation is also caused by zone conflicts, incompatible
      fabric parameters, or a domain ID conflict. Review the errShow output as a starting point.

      The switchShow Command
      When investigating fabric issues, you need to look at switchShow for port state
      information and for fabric-related information. Issue the switchShow command
      on every switch in the fabric. Examine the port state section of the switchShow
      data for incompletely initialized E_Ports, which will show up as G_Ports or as
      ports that are not online. If the port does not reinitialize itself, then manually
      reinitialize the ISL by executing the commands portDisable and portEnable,
      providing the offending port number as an argument.
            A fabric issue that has less impact involves incomplete ISL initialization. If ISL
      initialization issues occur, it is usually during fabric bring up. ISL initialization
      issues can also occur during a fabric reconfiguration, which is triggered when an
      ISL is added or removed or when a switch is added or removed. If the SAN is
      designed to be resilient, an incomplete ISL initialization minimally impacts the
      fabric, since there are multiple ISLs connecting the switches and edge devices are
      still able to communicate with each other. On the other hand, if the SAN is not


resilient, an ISL initialization problem may result in a segmentation of the SAN
and many devices may lose communications with the SAN.
     Resilient topologies deliver at least two internal fabric routes and are
considered more resilient because each topology is capable of sustaining a switch or ISL
failure while the remaining switches and fabric remain operational. This
self-healing capability is enabled by Fabric Shortest Path First (FSPF) and is depicted
in Figure 8.20.
Figure 8.20 In a Resilient SAN, an ISL Failure Does Not Affect Communication



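   [Diagram: a cascade topology on the left and a resilient topology on the right;
   labels: switches A, B, and C, and the failed ISL]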



     Figure 8.20 also depicts the failure of an ISL in a cascade topology, which is
the SAN located on the left. Note that switches A and B are unable to
communicate with the remaining switches when the ISL marked with the “X” fails.
However, a similar ISL failure in a resilient topology SAN (located on the
right) does not sever communications between the remaining switches. If the ISL
fails, it is still possible for switch A to communicate with switch C, using several
paths, such as the path highlighted in Figure 8.20. In a resilient topology, an ISL
failure might go unnoticed unless some type of monitoring is used (such as
Fabric Watch, a separately licensed product available from Brocade). Additionally,
with the loss of an ISL, there may also be performance degradation due to a loss
of overall available bandwidth.
     When reviewing the fabric-related information of switchShow, search for a
switch that is disabled or has an unconfirmed domain. An unconfirmed domain
indicates that the switch was unable to communicate with the principal switch in
the fabric to obtain a domain ID. To resolve either situation, issue the command
switchDisable followed by switchEnable to enable the switch to join the fabric.


      The topologyShow Command
      The topologyShow information is straightforward. You have to issue the
      topologyShow command on only one switch, unless that switch happens to be
      disabled or segmented. If this is the case, the topologyShow data will indicate
      the number of switches in your fabric as one, and you need to pick another
      switch to obtain the topologyShow information. The number of domains
      should equal the number of switches in the SAN. You can reference your SAN
      profile to establish the expected number of switches in the SAN. If there is an
      unexplained discrepancy, you most likely have a failed, segmented, or disabled
      switch. You can use switchShow data to identify a disabled or segmented switch.
          If a switch that is supposed to be part of the fabric does not show up in the
      topologyShow output (the previous SAN profile helps here), the administrator
      should identify the switch, log in to it, and try first a portDisable-portEnable
      sequence on any of the ports that should be an E_Port. If this does not work, try
      a switchDisable-switchEnable sequence.

      The nsShow and nsAllShow Commands
      Issue the command nsAllShow on any switch in the fabric to obtain the total
      number of edge devices registered with the Name Server. Note that issuing the
      nsAllShow command on a switch that is segmented or disabled will return
      Name Server data for only that switch and not the entire fabric. If there is an
      unexplained discrepancy between this number and the number of Name Server
      entries recorded in your SAN profile, you will need to identify which switches
      are associated with the missing Name Server entries. First, check to see if there
      are a number of missing devices; if so, then it is likely that one of the switches has
      segmented or is offline. This should have been seen in the prior step. If you are
      unsure of what devices are missing, issue the command nsShow on each switch
      in the SAN and compare the number of Name Server entries to your SAN profile.
      Next, attempt to correlate the missing Name Server entries. Are the missing
      entries all associated with any particular switch or edge device? Once you rule
      out a segmented or disabled switch, determine if the port associated with the
      missing devices is online. If the port is not online, bring the port online by
      executing the commands portDisable and portEnable, supplying the questionable
      port number as an argument to these commands. This should refresh the Name
      Server with the missing edge devices. If the port for the missing device
      comes online, and it still does not register with the Name Server, then it
      indicates that there is either a timeout or a conflict in communication between


the Name Server and the edge device in question. It is now time to work with
your switch supplier and edge-device supplier to resolve this complex problem.
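
Correlating missing Name Server entries with a switch can be sketched in Python by grouping the missing PIDs by domain ID. The example assumes that the PID lists from a baseline and a current nsAllShow have already been extracted, and that the first two hex digits of a PID are the domain ID as described in Figure 8.15; the function name missing_by_domain is hypothetical.

```python
from collections import defaultdict


def missing_by_domain(baseline_pids, current_pids):
    """Group PIDs present in the baseline but absent now, keyed by domain ID."""
    current = {pid.lower() for pid in current_pids}
    grouped = defaultdict(list)
    for pid in baseline_pids:
        if pid.lower() not in current:
            # The first two hex digits of a PID are the domain ID.
            grouped[int(pid[:2], 16)].append(pid.lower())
    return {domain: sorted(pids) for domain, pids in sorted(grouped.items())}


if __name__ == "__main__":
    # Baseline PIDs taken from the nsAllShow output in Figure 8.17.
    baseline = [
        "641300", "661600", "661b00", "6a1100", "6b1000", "6b1101",
        "6b1600", "6d1100", "6d1200", "6d1300", "7215e1", "761d01",
        "761e00", "771d00", "771f00", "781e00",
    ]
    # Hypothetical current state: both devices on domain 102 are gone.
    current = [pid for pid in baseline if not pid.startswith("66")]

    for domain, pids in missing_by_domain(baseline, current).items():
        print("domain %d is missing %s" % (domain, ", ".join(pids)))
```

If all of the missing PIDs fall under one domain, that switch is the first place to check for segmentation or a disabled state. Missing PIDs scattered across many domains point toward an edge device type or a fabric service issue.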

Now that You Suspect a
SAN Issue: Digging Deeper
Now that you suspect a SAN issue, you will need to investigate further to
identify the root cause. The use and context of each command follows, relative to
troubleshooting a SAN issue. Where possible, workarounds or corrective actions
are identified. Several commands must be run on each switch; this is something
that can be automated. The details for doing so are presented later in the book in
Chapter 9.

Timeout of Edge Devices during Fabric Bring Up
If the problem occurs after a SAN bring up or during reconfiguration, it is
possible that the edge devices came online before the SAN was ready. If this is the
case, you will see flickering green and possibly flickering yellow lights on the ISL
ports as the SAN converges while the edge ports remain steady green. You will
also see messages on the switch console as edge devices attempt to FLOGI and
Port-to-Port Login (PLOGI). Normally this is acceptable; however, if the SAN
requires an extended period of time for bring up, devices might time out. Be
careful to differentiate between an edge device that successfully retries PLOGIs
and FLOGIs while the fabric converges and one that has actually failed; do not
interpret these retries as failures. When the fabric is completely up, most devices
that time out will try again; however, if they do not, a timeout failure is to be expected.
     If you suspect a PLOGI/FLOGI timeout failure during fabric convergence,
you can confirm your suspicions by reviewing the host logs. You can determine
the SAN state by issuing a topologyShow command and verifying that the correct
number of domains are in the fabric. If the edge devices are not tolerant of
the time it takes the SAN to converge, they might time out their FLOGI or not
successfully interact with the Name Server. In either case, that device will be
inaccessible to the fabric. If you suspect this is happening with your SAN,
investigate the edge device logs to conclusively determine that timeouts are
occurring, the type of timeout, and how long these timeouts last. If timeouts are
occurring, one resolution is to increase the timeout values in the fabric (Resource
Allocation Time Out Value [R_A_TOV] or Error-Detect Time Out Value
[E_D_TOV]) or on the edge devices. There might be other timeout values on
the edge device that help prevent this issue; however, changing timeout
values is a complex procedure, and it is suggested that you work with your switch
supplier and edge device supplier at that point.

      Port Configuration Conflict
      or Missing Fabric License
      If your switch is not configured with a fabric license, it cannot join the fabric.
      The port state section of the switchShow output will indicate that the E_Ports
      are unknown. When you issue the command licenseShow, you should see a fabric
      license. If the switchShow data indicates unknown E_Ports and you do not have
      a fabric license installed, you will not be able to join that switch into a fabric
      until you acquire a fabric license from your switch supplier. The SilkWorm 2010
      and 2100 switches are entry-level switches and are not configured with a fabric
      license, but can be upgraded with a simple license key. These switches are designed
      for switched loop connectivity using Brocade QuickLoop. They can have a single
      E_Port for connecting another QuickLoop switch; however, if additional ISLs are
      connected, they will not come online. The SilkWorm 2240 and 2250 are entry
      fabric switches designed for small SANs or for the edge of a larger SAN. They
      also support only a single E_Port unless you upgrade them to a full fabric
      license. Figure 8.21 provides an example of a properly installed fabric license.

      Figure 8.21 Example of a Properly Installed Fabric License
      core1:admin> licenseShow
      SRzy9Sz9zeTS0zAG:
           Web license
      bbSz9eQb9zccT0AQ:
           Zoning license
      RdzdSRcSyzSe0eTn:
           QuickLoop license
      cSczRScd9RdTd0SY:
           Fabric license     <------A Fabric license is properly installed
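
      If you have captured the licenseShow output to a text file, the check described
      above can be scripted. The following Python sketch is illustrative only. It
      assumes the layout shown in Figure 8.21 (a key line ending in a colon, followed
      by a line naming the license), and the function names are hypothetical, not
      part of any Brocade tool.

```python
def parse_license_show(output):
    """Return a dict mapping license key -> license name.

    Assumes the Figure 8.21 layout: a line ending in ':' holds the key and
    the next non-empty line names the license.
    """
    licenses = {}
    current_key = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current_key = line[:-1]
        elif current_key is not None:
            licenses[current_key] = line
            current_key = None
    return licenses


def has_fabric_license(output):
    """True if any installed license name starts with 'Fabric license'."""
    names = parse_license_show(output).values()
    return any(name.lower().startswith("fabric license") for name in names)


# Sample text taken from Figure 8.21.
SAMPLE = """\
core1:admin> licenseShow
SRzy9Sz9zeTS0zAG:
     Web license
bbSz9eQb9zccT0AQ:
     Zoning license
RdzdSRcSyzSe0eTn:
     QuickLoop license
cSczRScd9RdTd0SY:
     Fabric license
"""

print(has_fabric_license(SAMPLE))
```

      A switch whose captured output makes has_fabric_license return False is a
      candidate for the unknown E_Port symptom described above.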


          It is possible to prevent switches from connecting into a fabric by disabling
      E_Port functionality. You might want to do this for security purposes, since
      turning off E_Port capability prevents someone from attaching a switch to your
      fabric without first obtaining your approval. If your E_Ports are unknowingly
      disabled, however, it will not be possible to join the switch into a fabric. To
      verify the status of the switch E_Ports, issue the command portCfgEport, as shown
      in Figure 8.22. Note that E_Port capability is disabled on switch port 0, so you
      cannot use this port as an E_Port. The same portCfgEport command is used to
      disable or enable E_Port support for a port.

Figure 8.22 E_Port Configured as Disabled Example
core1:admin> portcfgEport
Ports:   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
         ---------------------------------------------------------------
         NO  -   -   -   -   -   -   -   -   -   -   -   -   -   -   -



Segmented Fabrics
A fabric can segment for a variety of reasons, including zone conflicts, incompatible
fabric parameters, and domain ID conflicts. This section helps you identify whether
you have fabric segmentation, and what type of fabric segmentation you are
experiencing. A fabric might segment when you add a new switch to the fabric or upon
fabric reconfiguration or bring up. The segmented fabric error message will occur
on any switch to which the new switch is trying to connect. The new switch that
is trying to join the fabric will show its E_Ports as unknown in the output of the
switchShow command. If the fabric segments during a reconfiguration or bring
up, you will have to search for a switch with unknown E_Ports, which can be
determined by examining the switchShow output. You can also compare your
current SAN profile to your baseline SAN profile to identify the missing switch.

Zoning Conflict
A zone conflict and fabric segmentation can occur when introducing a single- or
multiple-switch fabric into an existing fabric. As these conflicts may affect the
connected online devices, the switches segment and await human intervention to
determine the proper resolution. There is no way to identify the correct
configuration without first investigating the nature of the conflict. If there are
conflicts, it may be easier to clear the configuration on the conflicted switch and
then have that switch absorb the zone information when it becomes part of the fabric.
Typically, there are three conditions that will create a zone conflict:
     •    Multiple zoning configurations enabled: Enabling zoning on both
          fabrics when they are connected will create a zone conflict. Only one
          zone configuration can be enabled in a single fabric at a time. An
          example of this is if the Day configuration is enabled on one switch and
          the Night configuration is enabled on the other. The administrator will
          have to decide which one is appropriate and disable the other.
     •    Zone definition type conflict: This occurs when introducing a
          single- or multiple-switch fabric into an existing fabric that has zoning
          definitions already defined, but the definition type (that is, alias or
          zone) is in conflict. An example of this would be Red defined as a zone
          on one fabric and as an alias on another fabric. This is a definition
          conflict and will segment the fabric.
     •    Zone definition content conflict: This occurs when introducing a
          single- or multiple-switch fabric into an existing fabric that has zoning
          definitions (that is, alias or zone) already defined, but the content
          is in conflict. This is where the definition name and type match, but the
          content is different. An example of this would be a Red zone defined on
          both fabrics. On the first fabric, the Red zone is defined with domain
          5, port 4, and on the second fabric the Red zone is defined with domain
          7, port 3. Both have a zone definition of Red, but the content is in
          conflict and will cause the fabric to segment. Again, the administrator
          must determine which Red zone is correct and either update
          the incorrect one or delete it. Once the fabrics merge, the proper Red
          zone will be propagated to all the switches in the fabric.
          The workaround for this situation involves correcting the conflicts or clearing
      the zoning information on either the fabric or new switch, depending on which
      zoning configuration you consider to be correct and want to keep. You can clear
      a zoning configuration by issuing a cfgClear <configuration you want to
      delete> command followed by a cfgDisable <active configuration you
      want to delete> command. You should first save a copy of the zone configuration
      by issuing cfgShow and saving the output to a file (in case you mistakenly
      delete the wrong configurations). The configUpload command is also useful for
      this operation. Figure 8.23 shows a zone conflict error message.

      Figure 8.23 Zone Conflict Error Message
      0x10addf10 (tZone): May 15 09:37:01 (12)
           Error FABRIC-SEGMENTED, 3, port 4, zone conflict
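
      Before connecting two fabrics, it can help to compare their zoning databases
      offline for the three conflict conditions listed earlier. The following Python
      sketch is illustrative only. The dictionary layout ("enabled" and "defs") and
      the function name are hypothetical, and the sketch assumes that two enabled
      configurations conflict when their names differ.

```python
def find_zone_conflicts(fabric_a, fabric_b):
    """Return a list of (kind, detail) tuples describing likely merge conflicts.

    Each fabric is a dict with two hypothetical keys:
      "enabled": name of the enabled configuration, or None
      "defs":    {name: (type, members)}, where type is "cfg", "zone", or "alias"
    """
    conflicts = []

    # Condition 1: a different configuration is enabled on each fabric.
    a_cfg, b_cfg = fabric_a["enabled"], fabric_b["enabled"]
    if a_cfg and b_cfg and a_cfg != b_cfg:
        conflicts.append(("multiple configurations enabled", f"{a_cfg} vs {b_cfg}"))

    # Conditions 2 and 3: same name, but different type or different content.
    shared = set(fabric_a["defs"]) & set(fabric_b["defs"])
    for name in sorted(shared):
        a_type, a_members = fabric_a["defs"][name]
        b_type, b_members = fabric_b["defs"][name]
        if a_type != b_type:
            conflicts.append(("definition type conflict", name))
        elif set(a_members) != set(b_members):
            conflicts.append(("definition content conflict", name))
    return conflicts


# The Day/Night and Red examples from the text, as hypothetical data.
existing = {"enabled": "Day", "defs": {"Red": ("zone", {"5,4"})}}
joining = {"enabled": "Night", "defs": {"Red": ("zone", {"7,3"})}}

for kind, detail in find_zone_conflicts(existing, joining):
    print(kind, "->", detail)
```

      An empty result does not guarantee a clean merge; it only means that none of
      the three conditions described above was found in the data supplied.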

Incompatible Fabric Parameters
Certain system configuration settings are changed by issuing the command
configure. The fabric parameter system configuration settings must be the same for
every switch in the fabric. The fabric will segment if there is a difference between
the parameters that exist in the fabric and the parameters on a switch that is
trying to join the fabric. The following parameters must be consistent between the
joining switch and the fabric:
  BB credit: (1..27) [16]
  R_A_TOV: (4000..120000) [10000]
  E_D_TOV: (1000..5000) [2000]
  Data field size: (256..2112) [2112]
  Sequence Level Switching: (0..1) [0]
  Disable Device Probing: (0..1) [0]
  Suppress Class F Traffic: (0..1) [0]
  SYNC IO mode: (0..1) [0]
  VC Encoded Address Mode: (0..1) [0]
  Core Switch PID Format: (0..1) [0]
  Per-frame Route Priority: (0..1) [0]
  Long Distance Fabric: (0..1) [0]



NOTE
     The range of values and defaults for Fabric OS 2.4.1a are shown in the
     list of parameters in this section. Fabric parameters are subject to
     change, and you should consult the documentation of the Fabric OS ver-
     sion you intend to use.



    Figure 8.24 shows an example of an incompatible fabric parameters error mes-
sage occurring on switch edge1, and the resulting switchShow data from that
switch. It is necessary to track down the switch connected on ports 0 and 2 of
switch edge1 and compare the fabric parameters from that switch to those of
edge1. Once you identify the discrepancy, use the configure command to change
the discrepant fabric parameters of the joining switch to those of switch edge1.
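
    Once you have recorded the fabric parameters of both switches, the comparison
itself is mechanical. The following Python sketch is illustrative only. The
function name is hypothetical, the edge1 values are the Fabric OS 2.4.1a defaults
from the parameter list above, and the joining switch's BB credit value of 8 is
invented for the example.

```python
# Parameter names as they appear in the list above.
FABRIC_PARAMETERS = [
    "BB credit", "R_A_TOV", "E_D_TOV", "Data field size",
    "Sequence Level Switching", "Disable Device Probing",
    "Suppress Class F Traffic", "SYNC IO mode", "VC Encoded Address Mode",
    "Core Switch PID Format", "Per-frame Route Priority", "Long Distance Fabric",
]


def diff_fabric_parameters(reference, joining):
    """Return {parameter: (reference_value, joining_value)} for each mismatch."""
    mismatches = {}
    for name in FABRIC_PARAMETERS:
        ref_value = reference.get(name)
        join_value = joining.get(name)
        if ref_value != join_value:
            mismatches[name] = (ref_value, join_value)
    return mismatches


# Defaults from the parameter list (Fabric OS 2.4.1a).
edge1 = {
    "BB credit": 16, "R_A_TOV": 10000, "E_D_TOV": 2000, "Data field size": 2112,
    "Sequence Level Switching": 0, "Disable Device Probing": 0,
    "Suppress Class F Traffic": 0, "SYNC IO mode": 0,
    "VC Encoded Address Mode": 0, "Core Switch PID Format": 0,
    "Per-frame Route Priority": 0, "Long Distance Fabric": 0,
}

# Hypothetical joining switch with a different BB credit setting.
joining_switch = {**edge1, "BB credit": 8}

print(diff_fabric_parameters(edge1, joining_switch))
```

    Any parameter reported here is one to correct with the configure command on
the joining switch.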

      Figure 8.24 Incompatible Fabric Parameters

            edge1:admin> switchShow

            switchName:    edge1
            switchType:    2.4
            switchState:   Online
            switchRole:    Subordinate
            switchDomain:  2
            switchId:      fffc02
            switchWwn:     10:00:00:60:69:11:f9:f7
            switchBeacon:  OFF
            port 0: sw Online          E-Port (unknown)
            port 1: sw Online          E-Port 10:00:00:60:69:10:9b:5b "core2" (upstream)
            port 2: sw Online          E-Port (unknown)
            port 3: sw Online          E-Port 10:00:00:60:69:10:9b:5b "core2"
            port 4: -- No_Module
            port 5: -- No_Module
            port 6: cu Online          L-Port 7 public
            port 7: -- No_Module
            port 8: -- No_Module
            port 9: -- No_Module
            port 10: -- No_Module
            port 11: -- No_Module
            port 12: -- No_Module
            port 13: -- No_Module
            port 14: -- No_Module
            port 15: -- No_Module



            Error 01

            --------
             0x10f6f4d0 (tTransmit): May 17 18:10:38 (8)
                Error FABRIC-SEGMENTED, 3, port 0, incompatible flow control parameters




            [Diagram: switch Edge1, ports 0 through 15, with a link to an unidentified
            switch ("?") that has incompatible fabric parameters.]
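
      Searching many switches for unknown E_Ports is easier if the switchShow output
      is captured and scanned by a script. The following Python sketch is illustrative
      only. It assumes port lines formatted as in Figure 8.24, and the function name
      is hypothetical.

```python
import re

# Matches lines such as "port 0: sw Online          E-Port (unknown)".
PORT_LINE = re.compile(r"^\s*port\s+(\d+):\s+(\S+)\s+(\S+)\s*(.*)$")


def unknown_e_ports(switchshow_output):
    """Return the port numbers reported as 'E-Port (unknown)'."""
    ports = []
    for line in switchshow_output.splitlines():
        match = PORT_LINE.match(line)
        if not match:
            continue
        number, _media, _state, detail = match.groups()
        if detail.startswith("E-Port") and "(unknown)" in detail:
            ports.append(int(number))
    return ports


# Excerpt of the switchShow output in Figure 8.24.
SAMPLE = """\
switchName:    edge1
switchState:   Online
port 0: sw Online          E-Port (unknown)
port 1: sw Online          E-Port 10:00:00:60:69:10:9b:5b "core2" (upstream)
port 2: sw Online          E-Port (unknown)
port 3: sw Online          E-Port 10:00:00:60:69:10:9b:5b "core2"
port 4: -- No_Module
port 6: cu Online          L-Port 7 public
"""

print(unknown_e_ports(SAMPLE))
```

      For the excerpt above, the function reports ports 0 and 2, the same two ports
      that the text identifies as needing investigation.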




      Domain ID Conflict
      A domain ID conflict can occur if you join a switch that is in the online state
      into a fabric, and the joining switch domain ID conflicts with the domain ID of
      a switch in the fabric. Normally, domain IDs are automatically assigned; however,
      once a switch is online, the domain ID cannot change, as it would change the
      port addressing and potentially disrupt critical I/O. The resolution for this
      problem involves performing a switchDisable followed by a switchEnable on
      the joining switch. This will enable the joining switch to obtain a new domain
      ID as part of the process of coming online. The fabric principal switch will
      allocate the next available domain ID to the new switch during this process.


NOTE
     Changing domain IDs can have an impact on port zoning entries. Be sure
     to check to see if any port zoning entries exist for devices on a switch
     before changing its domain ID, and update any affected zones to reflect
     the change.
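
     The check described in this note can be scripted if you keep a copy of your
zone definitions. The following Python sketch is illustrative only. It assumes
zone members are stored as strings, with port zoning entries written as
domain,port pairs; the data layout, sample values, and function name are
hypothetical.

```python
def port_zone_entries_for_domain(zones, domain_id):
    """Return {zone_name: [members]} for port-zone members on domain_id.

    Members that are not "domain,port" pairs (for example, WWNs) are ignored.
    """
    affected = {}
    for zone_name, members in zones.items():
        hits = []
        for member in members:
            domain, sep, port = member.partition(",")
            is_port_entry = sep and domain.strip().isdigit() and port.strip().isdigit()
            if is_port_entry and int(domain) == domain_id:
                hits.append(member)
        if hits:
            affected[zone_name] = hits
    return affected


# Hypothetical zone database mixing port zoning and WWN zoning.
zones = {
    "red": ["1,14", "0,0"],
    "yellow": ["1,15", "0,1"],
    "wwn_zone": ["10:00:00:00:c9:21:5f:a7"],
}

print(port_zone_entries_for_domain(zones, 1))
```

     Every zone returned for the switch's current domain ID contains entries that
must be updated after the domain ID changes.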




Message Queue Errors
An MQ error is a message queue error. You can identify an MQ error message by
looking for the two letters M and Q in the error message. MQ errors can result
in edge devices dropping from the Name Server or can prevent a switch from
joining the fabric. MQ errors are rare and difficult to troubleshoot, and it is
suggested that you resolve them by working with your switch supplier. When you
encounter MQ errors, execute the supportShow command to capture debug
information about the switch. A switch reboot will likely clear any associated
problems. After capturing the data, forward it to your switch supplier for
further investigation.

Troubleshooting Devices
that Cannot Be Seen
A host that is unable to access a SAN device is one of the more common SAN issues
that can arise. Again, consider the virtual SAN cable analogy to start the
troubleshooting process. We want to determine whether the SAN is the cause of the
problem or whether it is an edge device issue. To do this, you need to work your way
along the virtual SAN cable to the edge device(s) that cannot be seen. Figure 8.25
depicts a flowchart that outlines the process for troubleshooting a missing device.

Figure 8.25 Troubleshooting Devices that Cannot Be Seen

Start: storage device not visible to host.

1. Is it a fabric issue?
      Yes: Follow the fabric troubleshooting procedure.
      No:  Go to step 2.
2. Is the storage device present in switchShow?
      Yes: Go to step 3.
      No:  Is there a port configuration conflict?
              Yes: Problem identified.
              No:  Follow the marginal link troubleshooting procedure.
3. Is the storage device visible in the Name Server?
      Yes: Go to step 4.
      No:  Is there a node configuration issue?
              Yes: Problem identified.
              No:  Name Server conflict. Escalate.
4. Is it a zoning issue?
      Yes: Problem identified.
      No:  Timeout or Name Server conflict. Escalate.
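
The decision flow in Figure 8.25 can also be written as a small function, which is
convenient if you record your findings in a checklist or ticketing script. The
following Python sketch is illustrative only. The function and parameter names are
hypothetical, and each argument is simply your answer to one question in the
flowchart.

```python
def diagnose_missing_device(*, fabric_issue, in_switchshow, port_config_conflict,
                            in_name_server, node_config_issue, zoning_issue):
    """Walk the Figure 8.25 flowchart and return the indicated next step."""
    if fabric_issue:
        return "Follow fabric troubleshooting procedure"
    if not in_switchshow:
        if port_config_conflict:
            return "Problem identified: port configuration conflict"
        return "Follow marginal link troubleshooting procedure"
    if not in_name_server:
        if node_config_issue:
            return "Problem identified: node configuration issue"
        return "Name Server conflict: escalate"
    if zoning_issue:
        return "Problem identified: zoning issue"
    return "Timeout or Name Server conflict: escalate"


# Example: the device is online in switchShow and registered with the
# Name Server, and the zoning review found a mismatch.
print(diagnose_missing_device(
    fabric_issue=False, in_switchshow=True, port_config_conflict=False,
    in_name_server=True, node_config_issue=False, zoning_issue=True,
))
```

Questions that are never reached on a given path do not affect the result, so you
can pass False for anything you have not yet investigated.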

What to Look for in the Fabric
The first step is to determine whether the missing device problem is a fabric issue.
A quick way to determine this is to establish whether the problem is localized to a
single missing device or involves multiple missing devices. You also want to ensure
that all switches are online in the fabric. You can quickly check your fabric status
by issuing the command topologyShow to verify that the correct number of domains
exist in your fabric. You can verify that the missing device is a localized issue by
entering the command nsAllShow to establish the total number of devices in the
fabric. If you suspect a fabric issue, because multiple devices are missing, follow
the fabric troubleshooting process. If you suspect a missing device issue, because
only one or two devices are inaccessible, move on to the next section, “Are the Host
and Storage Visible via switchShow on Their Respective Switches?”

Are the Host and Storage Visible via
switchShow on Their Respective Switches?
Use the command switchShow on the switch to which the subject host is
connected. Verify that the host port and the storage port are online. If both the
storage and the host port are online, move on to the next section, as the virtual
SAN cable is logically connected to both the storage and the host. If the port is
not online, your host or storage might be malfunctioning, you might have a link
initialization issue, or you might have a marginal link. If the edge port is not
online or is a G_Port, this is analogous to having a disconnected cable. A host
malfunction is a very broad term and can include problems such as incorrect or
improperly installed HBA drivers, HBA parameters, or a faulty HBA. A storage
malfunction can include an incorrect or improperly configured storage interface
or a faulty storage interface.


NOTE
     A quick method of identifying the cause of a missing device is to visually
     inspect your switch LEDs. Any steady or flashing yellow lights indicate
     that a port is not online and manual intervention is required.



    Brocade SilkWorm switches by default automatically configure the appropriate
port topology based on the topology of the connecting port, which is either N_Port
or NL_Port, or in the case of a switch, an E_Port. This functionality is invaluable
for SAN management, because it relieves the SAN administrator of managing and
maintaining the configuration for potentially thousands of ports. In some situations,
it is necessary to configure a port for a particular topology by using one or more of
the commands portCfgEport, portcfgFAport, or portcfgLport to lock the
port into a certain state. This may help with an issue where the edge device
supports multiple port topologies and does not initialize in the mode that is desired.
           A switch or port might also be configured for QuickLoop. First, check to see
      that the switch or port in question is configured correctly for the intended
      purpose. For example, if the attaching edge device is configured as an NL_Port and
      the switch port is configured as an F_Port, there is a conflict and the switch
      port for that edge device might initialize as a G_Port. Initializing as a G_Port
      is just as bad as not initializing at all, as the associated device is essentially
      inaccessible. The G_Port, or generic port, is a transitional state defined in the
      standards that a port passes through as it transitions to an F_Port or an E_Port.
      If the port connecting to your edge device is not intended to be a QuickLoop port,
      you will need to reconfigure that port, or the edge device might not initialize
      properly. If there is any conflict, resolve it either on the switch, by
      reconfiguring the port, or on the edge device, and then move on to the next
      section, “Do the Devices Show Up in the Name Server?” If the devices support both
      loop and fabric modes, utilize the fabric setting to get the best performance
      and fault isolation.
           See Figure 8.26 for the usage and examples of various port configuration
      commands. Switch core1 is configured for QuickLoop, as evidenced by the
      Enabled entries in the QuickLoop Mode column. Switch core1 port 8 is configured
      as a loop port, and no ports are configured as Fabric Assist (FA) ports. You
      can also use the command qlShow to determine if the switch is configured for
      QuickLoop. If the switch is in QuickLoop mode and no QuickLoop is required,
      you can issue a qlDisable command to disable QuickLoop for the entire switch.
      If QuickLoop is required, but is not needed for the port in question, use the
      qlPortDisable <port #> command for the port that needs to be changed.

      Figure 8.26 Port Configuration Examples
      core1:admin> qlportshowall


      PortNum  QuickLoop Mode  Port State
       0       Enabled         fabric      E PORT
       1       Enabled         fabric      E PORT
       2       Enabled         fabric      E PORT
       3       Enabled         fabric      E PORT
       4       Enabled         fabric      E PORT
       5       Enabled         fabric      E PORT
       6       Enabled         offline
       7       Enabled         offline
       8       Enabled         fabric
       9       Enabled         offline
      10       Enabled         fabric
      11       Enabled         offline
      12       Enabled         offline
      13       Enabled         offline
      14       Enabled         offline
      15       Enabled         offline
      core1:admin> portcfgLport
      Ports:   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
               ---------------------------------------------------------------
      Lock     -   -   -   -   -   -   -   -   YES -   -   -   -   -   -   -
      Private  -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -

      core1:admin> portcfgFAport
      Ports:   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
               ---------------------------------------------------------------
               -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -


    If the port is not online or initializes as a G_Port, attempt to reinitialize the
port by executing the commands portDisable and portEnable, supplying the
port number in question as an argument to these commands. If this process works,
monitor the situation carefully. If the host port consistently does not come online
or comes up as a G_Port repeatedly, you might have a marginal link, a faulty
HBA or HBA driver, or some type of configuration conflict between the host and
the switch. At this point, you need to follow the process of troubleshooting a
marginal link. If the link is not marginal, contact your switch supplier and HBA
supplier to assist with further troubleshooting.
          Follow a similar process for the storage port. If the storage port is not online
      or is a G_Port, this is analogous to a disconnected cable at the storage end.
      Attempt to reinitialize the port by issuing a portDisable/portEnable. Next,
      rule out a marginal link, faulty storage equipment, and configuration conflict
      between the storage and the switch. If you are still unable to establish the root
      cause, work with your switch supplier and your storage supplier to assist with fur-
      ther troubleshooting.

      Do the Devices Show Up in the Name Server?
      At this point, you have verified that the host and storage are logically connected
      to the virtual SAN cable, and it is now necessary to confirm that the two edge
      ports are able to communicate. Use nsShow on the switch to which the storage
      is connected and the switch to which the host is connected to verify that these
      edge devices are registered with the Name Server. If you intend to verify that an
      Emulex HBA located on switch core1 port 8 is registered with the Name Server,
      the data in Figure 8.27 would confirm this.

      Figure 8.27 nsShow Example—Verifying that an Emulex HBA Is Registered
      with the Name Server
      core1:admin> nsShow
      The Local Name Server has 2 entries {
          Type Pid        COS   PortName              NodeName              TTL(sec)
          N     011800;
              2,3;10:00:00:00:c9:21:5f:a7;20:00:00:00:c9:21:5f:a7; na
              NodeSymb: [35] "Emulex LP8000 FV3.02        DV5-4.52A7 "
              Fabric Port Name: 20:08:00:60:69:10:8d:fd
          N     011a00;
              2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; na
              Fabric Port Name: 20:0a:00:60:69:10:8d:fd
      }
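
          If you need to check many devices, the nsShow output can be captured and
      searched by port World Wide Name. The following Python sketch is illustrative
      only. It assumes entries formatted as in Figure 8.27 (with or without the line
      wrap after the Pid field), and the function names are hypothetical.

```python
import re

WWN = r"(?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2}"

# One Name Server entry: type, Pid, COS, port WWN, node WWN.
# \s* after the Pid allows the entry to wrap onto a second line.
ENTRY = re.compile(
    r"^\s*(?P<type>NL|N)\s+(?P<pid>[0-9a-fA-F]{6});\s*"
    r"(?P<cos>[\d,]+);(?P<port_wwn>" + WWN + r");(?P<node_wwn>" + WWN + r");",
    re.MULTILINE,
)


def parse_ns_show(output):
    """Return a list of dicts, one per Name Server entry found."""
    return [match.groupdict() for match in ENTRY.finditer(output)]


def is_registered(output, port_wwn):
    """True if a device with this port WWN appears in the nsShow output."""
    wanted = port_wwn.lower()
    return any(e["port_wwn"].lower() == wanted for e in parse_ns_show(output))


# Sample text taken from Figure 8.27.
SAMPLE = '''\
core1:admin> nsShow
The Local Name Server has 2 entries {
    Type Pid        COS   PortName              NodeName              TTL(sec)
    N     011800;
        2,3;10:00:00:00:c9:21:5f:a7;20:00:00:00:c9:21:5f:a7; na
        NodeSymb: [35] "Emulex LP8000 FV3.02        DV5-4.52A7 "
        Fabric Port Name: 20:08:00:60:69:10:8d:fd
    N     011a00;
        2,3;20:00:00:e0:69:f0:07:c6;10:00:00:e0:69:f0:07:c6; na
        Fabric Port Name: 20:0a:00:60:69:10:8d:fd
}
'''

print(is_registered(SAMPLE, "10:00:00:00:c9:21:5f:a7"))
```

          Matching on the port WWN avoids any assumption about how the Pid field maps
      to a physical switch port.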


          If the devices in question are registered with the Name Server, it is possible
      that you are experiencing a zoning mismatch or a host/storage issue. If one or
      both devices are not registered with the Name Server, it is possible that there is a
timeout or communication issue between the edge device(s) and the Name
Server. Check with the edge device documentation to determine if there is a
timeout setting or parameter that may help. If this does not work, contact the
support organization for the product that appears to be timing out.

Rule Out Zoning Issues
It is easy to rule out a zoning mismatch if zoning is not enabled. Check to see if
zoning is enabled by issuing the cfgShow command. If the output states that no
configuration is in effect, zoning is not enabled. If zoning is enabled, it is possible
that the two edge devices are unable to communicate with each other due to
zoning conflicts. To confirm whether this is the case, review the active zoning
configuration. You can do this by again issuing the command cfgShow, as shown
in Figure 8.28. In this example, host1 can access disk1, and host2 can access
disk2, but host1 cannot access host2 or disk2, and host2 cannot access host1 or
disk1. Confirm that the specific edge devices that need to communicate with
each other are in the same zone. If they are not, and zoning is active, you need to
update your zoning configuration before the edge devices in question are able to
communicate with each other. For example, if host1 needs to get access to disk2,
it is necessary to update the zoning configuration to enable this access. Once the
zone changes are made via the command line or WEB TOOLS-based GUI, the
devices should be able to access one another; however, some operating systems
might require that you run a disk utility such as format or disk administrator.
It is also possible that some operating systems might require a reboot to allow
discovery of the new devices.

Figure 8.28 Zoning Example
core1:admin> cfgshow
Defined configuration:
 cfg:     colors   red; yellow
 zone:   red        host1; disk1
 zone:   yellow    host2; disk2
 alias: disk1      0,0
 alias: disk2      0,1
 alias: host1      1,14
 alias: host2      1,15


Effective configuration:

       cfg:    colors
       zone:   red       1,14
                          0,0
       zone:   yellow    1,15
                          0,1
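
          The rule that two devices can communicate only when they share a zone is easy
      to check in a script. The following Python sketch is illustrative only. It uses
      the effective configuration from Figure 8.28 as hand-entered data, and the
      function name and data layout are hypothetical.

```python
def can_communicate(effective_zones, member_a, member_b):
    """True if the two members share at least one effective zone.

    An empty effective configuration means zoning is not enabled, in which
    case zoning does not restrict access.
    """
    if not effective_zones:
        return True
    return any(member_a in members and member_b in members
               for members in effective_zones.values())


# Aliases and effective configuration from Figure 8.28.
ALIASES = {"disk1": "0,0", "disk2": "0,1", "host1": "1,14", "host2": "1,15"}
EFFECTIVE = {"red": {"1,14", "0,0"}, "yellow": {"1,15", "0,1"}}

print(can_communicate(EFFECTIVE, ALIASES["host1"], ALIASES["disk1"]))  # True
print(can_communicate(EFFECTIVE, ALIASES["host1"], ALIASES["disk2"]))  # False
```

          The second call returns False, which matches the statement that host1
      cannot access disk2 until the zoning configuration is updated.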




      NOTE
           If zoning is active, any devices that are not explicitly defined in a zone
           together are not able to communicate with each other.



          At this point, if you establish that there is no switch zoning mismatch, then
      you have established that the SAN virtual cable is working and that it is likely a
      host or storage issue. One possible host or storage issue that could be causing the
      “missing” devices is a mismatch with the HBA or storage-based zoning; be sure
      to check this first when troubleshooting the edge devices.


      NOTE
           Incorrect or incomplete zoning is one of the most common causes of
           SAN communication problems. Checking for this is analogous to
           checking to see if a “malfunctioning” computer monitor is plugged in.




      Edge Device Not in the Name Server
      Reaching this point implies that you have verified that the edge devices in
      question are connected to the switch, and that one or more of the edge devices are
      not registered in the Name Server. Attempt to reinitialize the edge device(s) with
      the Name Server by executing the commands portDisable and portEnable,
      supplying the port number(s) in question as an argument to these commands. If,
      after you do this, the devices successfully register with the Name Server, you have
      resolved the problem. However, pay attention to this issue, because if the problem
      recurs, it indicates a complex problem that is best resolved by working with your
      switch and edge device suppliers. You should also seek this type of assistance if
      the devices do not register with the Name Server after you issue a
      portDisable/portEnable. This indicates a complex issue such as a communication
      conflict or timeout condition. Although edge devices should reconnect to the
      fabric and register when the port is disabled and re-enabled, some older devices
      might time out and no longer retry logging in. If this happens, you might need to
      reboot the device to get it to reset and log in to the fabric and Name Server.

Troubleshooting Marginal Links
A marginal switch port is defined as a switch port that either is receiving a marginal
incoming signal or has a receiver that is not functioning properly. A marginal
Nx_Port transmit can be caused by a failing Nx_Port optical component (GBIC
or GLM) or a cable issue. A failing Fx_Port receiver can be caused by a failing
switch optical component or a failing switch port, as depicted in Figure 8.29.

Figure 8.29 Marginal Port Elements

[Diagram: a point-to-point (N_Port) or loop (NL_Port) connection to a switch port.
The corresponding switch port is either an F_Port or an FL_Port.]

Potential faults:
  A marginal Fx_Port (switch port) is termed a marginal Fx_Port receive.
  A marginal cable or Nx_Port GBIC is termed a marginal Nx_Port transmit.




Marginal Point-to-Point/Fabric Device Links
The impact of a marginal port can be significant. For example, a port on a large
storage device such as an HP XP512, an IBM Enterprise Storage Server, or an EMC
Symmetrix might be accessed by potentially dozens of hosts. The marginal
behavior of this storage port has the potential to impact all devices that access
it. Imagine that you are a part of a geographically distributed team
of six workers. The primary communication for this team is via telephone.
Assume that your telephone is functioning marginally (similar to a poor cellular
connection). Anyone who wants to call you will not be able to communicate
effectively with you. Conversely, anyone you call will also be unable to
communicate effectively with you. If you are a team leader for this group, the
impact of your marginal telephone capabilities is significant, since many people
utilize you as a resource. Note that the others in the group are free to communicate
with each other without experiencing any impact from your telephone
problems. The story can have a happy ending if you gain access to two telephones
and, realizing the marginal nature of one telephone line, switch to the
working telephone. Note that many SANs are constructed in a similar fashion to
Figure 8.30, with dual paths between hosts and storage, so that a single failure
does not result in an I/O failure. In applications where availability is key, dual-
or even triple-redundant fabrics are always recommended.
      Figure 8.30 Dual-Fabric SAN Design

      [Figure: four servers across the top, Fabric A and Fabric B in the middle with a
      tape drive beside each fabric, and four data (storage) devices across the bottom.]

Marginal Loop Connections
While a marginal point-to-point link affects only devices that access the
point-to-point device, the ramifications of a malfunctioning loop-connected device
can impact all devices in that loop. Extending the geographically distributed team
analogy further, imagine that the only way the team communicates is via
teleconference. Whenever the team needs to communicate, everyone dials in to a
conference call. Unfortunately, the teleconference is disrupted by your marginal
telephone link. What makes things even worse is that communication between
any other team members is impossible or very difficult. For example, it is very
difficult for one member to speak with another on the teleconference because
your marginal telephone continually creates static on the teleconference.
    Brocade QuickLoop and Fabric Assist are unique Fibre Channel topologies
that combine aspects of arbitrated loop and fabric topologies. They are composed
of multiple private arbitrated loops (looplets) interconnected by a fabric.
QuickLoop can be best described as a Private Loop Fabric Attach, as compared to
Private Loop Direct Attached (PLDA) or Fabric Loop Attachment (FLA). The FL_Port
of each looplet is hidden from the NL_Ports. QuickLoop is a logical PLDA that
complies with the FC-AL standard. Although NL_Port devices are attached to
different arbitrated loops interconnected by a fabric, the fabric and the physical
device locations are transparent. QuickLoop enables switches to be used in place
of hubs in environments where all attached devices are private devices. Fabric
Assist mode allows the configuration of a virtual private loop in which a private
host can see and access public or private targets anywhere on the fabric,
provided they are configured in the same Fabric Assist zone. Such a private loop
is called a QuickLoop Fabric Assist mode zone. A public target accessed by a
private host remains public, with full fabric functionality.
    The nature of loops is such that the behavior of an unhealthy device on the
loop can adversely impact the behavior of the remaining devices on the loop. For
example, a marginal GBIC could degrade the signal to the point where the
connected NL_Port (host or storage) device is no longer able to effectively
communicate. This in turn causes the loop to reset. When a loop resets, so do the
individual hosts or storage devices connected to that loop. Under normal
circumstances, a loop reset does not cause any harm. However, if a device is
constantly resetting, I/O flow can become severely restricted or halted.
    Loop Initialization Primitives (LIPs) are part of a healthy loop and are used
for a variety of purposes—most commonly to signal other devices on the loop
      that a new device has been added, or that an existing device has left the loop.
      When a loop or NL_Port resets, LIPs are generated. However, an excessive
      number of LIPs will make a loop unstable.
           The Fibre Channel standards community is making great strides in further
      enhancing the functionality of loops. However, loops are starting to become a
      legacy issue. It is important to note that Fibre Channel and SilkWorm switches
      also support point-to-point topologies, which are not subject to the same
      disruptive behaviors that loops are. When a public device accesses a private device
      (known as translative mode), the LIP is not propagated to that public device, nor
      is that public device subject to disruption.

      Nx_Port (Host/Storage) Behavior
      with a Marginal Port in the Loop
      When a marginal device disrupts the loop, a variety of symptoms can be present.
      Performance for devices connected to the QuickLoop or devices accessing a
      common device can be described as slow. Host logs (that is, /var/adm/messages,
      eventlog, or syslog) might indicate that I/O is timing out or that the interface is
      being reset. The switch port LEDs should be steady green or blinking green. Green
      lights mixed with yellow lights or flashing yellow lights indicate that the ports are
      resetting themselves. Devices on the affected loop might FLOGI and/or PLOGI
      repeatedly onto the fabric as part of a reset process initiated by the HBA. This
      would show up on the console or telnet management session for the switch to
      which the affected device is attached. N_Port devices are less susceptible to
      disruption for reasons stated earlier.

      Marginal GBIC/Cable
      You can use the er_enc_out statistic to identify a marginal GBIC. Active devices
      (such as disks) normally clean up an encoding error as these errors are
      encountered, and mark the frame as having bad CRC. Any er_enc_out errors are
      encoding errors outside a frame, and do not generate a CRC error. If a high
      count (for example, several thousand) or incrementing counts of er_enc_out
      errors are experienced on a particular port, this indicates that the signal is
      marginal between the connected device’s transmit port and the switch’s receive
      port. Because this situation is being recorded as encoding errors, the implication
      is that there is no active device cleaning up the errors between the switch receive
      and the connected device transmit. The diagnosis: marginal GBIC or cable on the
      connected device.
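
      The rule of thumb above (a high count, or a count that keeps incrementing)
      can be sketched in a few lines of Python. This is only an illustration: it
      assumes you have already collected er_enc_out readings for one port by hand
      or by script, and the threshold of 3000 is an assumed stand-in for "several
      thousand", not a Brocade-defined limit.

```python
from typing import Sequence

# Assumed stand-in for "several thousand"; tune this for your own fabric.
HIGH_COUNT = 3000


def er_enc_out_suspect(samples: Sequence[int], high_count: int = HIGH_COUNT) -> bool:
    """Return True if er_enc_out readings for one port suggest a marginal link.

    samples holds the counter values for a single port, oldest reading first.
    """
    if not samples:
        return False
    # Warning sign 1: the most recent reading is already high.
    if samples[-1] >= high_count:
        return True
    # Warning sign 2: the counter increased between any two consecutive readings.
    return any(later > earlier for earlier, later in zip(samples, samples[1:]))


print(er_enc_out_suspect([12, 12, 12]))    # False: low and stable
print(er_enc_out_suspect([12, 40, 95]))    # True: incrementing
print(er_enc_out_suspect([5000, 5000]))    # True: high count
```

      A port flagged this way is a candidate for the GBIC and cable checks on the
      connected device, not proof of a fault.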
Connected Device
Note that LIPs are normal in a healthy loop. An imbalance where the Lip_in
count is larger than the Lip_out count indicates that the associated connected
device is the originator of LIPs in the loop. A device that generates a large
number of LIPs might be malfunctioning. The switch will propagate LIPs in
accordance with the Fibre Channel specification. Propagated LIPs are recorded as
Lip_out.
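
The Lip_in versus Lip_out comparison can be applied across several ports at
once. The sketch below assumes you have copied the two counters for each port
out of portShow into a dictionary; the port numbers and counts are made-up
illustration values, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def rank_lip_originators(ports: Dict[int, Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Rank ports whose connected device appears to originate LIPs.

    ports maps a port number to its (Lip_in, Lip_out) counters.
    Returns (port, Lip_in - Lip_out) pairs for ports where Lip_in exceeds
    Lip_out, largest imbalance first.
    """
    suspects = [
        (port, lip_in - lip_out)
        for port, (lip_in, lip_out) in ports.items()
        if lip_in > lip_out
    ]
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)


# Illustrative counters only: port -> (Lip_in, Lip_out)
counters = {0: (4, 120), 1: (118, 6), 2: (9, 9)}
print(rank_lip_originators(counters))  # [(1, 112)]
```

Here only port 1 receives more LIPs from its device than it sends, so the device
on port 1 is the first one to examine.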

Fault Isolation
Once a marginal port is identified, it is necessary to identify where the fault
resides. Figure 8.31 depicts a suggested fault isolation process. Fault isolation on a
loop is very difficult, which is one of the reasons why loops had limited success.
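
The flow in Figure 8.31 can be written down as a small decision function. This
is a sketch of the figure's branches only; the function and parameter names are
hypothetical, and each argument is an observation you make yourself after
carrying out the corresponding step.

```python
from typing import Optional


def isolate_marginal_link(
    errors_stop_after_port_move: bool,
    errors_stop_after_new_cable: Optional[bool] = None,
    loopback_fails_after_gbic_swap: Optional[bool] = None,
) -> str:
    """Return the next action or the diagnosis, following Figure 8.31.

    Pass None for an observation that has not been made yet.
    """
    if errors_stop_after_port_move:
        # Moving the cable to another switch port cured the problem, so the
        # original switch port or its GBIC is marginal.
        if loopback_fails_after_gbic_swap is None:
            return "Replace the GBIC on the marginal port, then run portLoopBack on it"
        if loopback_fails_after_gbic_swap:
            return "Bad switch port: replace the motherboard"
        return "The replaced GBIC was bad"
    # The problem followed the cable, so it is a cable or Nx_Port issue.
    if errors_stop_after_new_cable is None:
        return "Try a new cable"
    if errors_stop_after_new_cable:
        return "Bad cable"
    return "Follow Nx_Port (HBA or storage interface) troubleshooting procedures"


print(isolate_marginal_link(False))
print(isolate_marginal_link(False, errors_stop_after_new_cable=True))
print(isolate_marginal_link(True, loopback_fails_after_gbic_swap=False))
```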

How the Switch Can Help: Fabric
Watch and QuickLoop Zoning
By virtue of being positioned between storage and host, the switch is a natural
resource for gathering statistics and troubleshooting. As shown earlier, the switch
can help mitigate the issues that arise when a marginal device disrupts a loop or
other N_Port devices.
    Brocade Fabric Watch allows each switch to continuously monitor fabric
elements for irregular conditions. Fabric Watch can assist in rapidly identifying
and escalating potential problems. This proactive management improves the overall
availability of the SAN. Specific to troubleshooting marginal links, Fabric Watch
can detect such failing port symptoms as excessive CRC errors and proactively
send an SNMP alert. It is also possible to telnet into the switch and quickly
analyze statistics to identify the marginal port.
    To minimize the impact of a marginal device in a loop, you can utilize
QuickLoop zoning or Fabric Assist to compartmentalize various host/storage
pairs. QuickLoop zoning or Fabric Assist prevents LIPs from propagating between
QuickLoop zones. In some respects, QuickLoop zoning turns one loop into
multiple virtual loops. In Figure 8.32, a LIP generated by Host A in zone qlZone1
due to a marginal port does not propagate to qlZone2 or qlZone3. Without
QuickLoop zoning, a marginal port has the potential to limit or halt I/O for all
devices connected to the switch!
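
The containment effect can be modeled in a few lines. This is a toy model, not
switch behavior captured from a real fabric: the zone names follow Figure 8.32
and Host A is placed in qlZone1 as in the text, but the remaining membership and
the device names Disk1 through Disk3 are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def lip_recipients(
    source: str,
    devices: Set[str],
    zones: Optional[Dict[str, Set[str]]] = None,
) -> Set[str]:
    """Return the devices that see a LIP generated by source.

    Without zoning (zones is None) every other loop device is affected.
    With zoning, only devices sharing a zone with source are affected.
    """
    if zones is None:
        return devices - {source}
    reached: Set[str] = set()
    for members in zones.values():
        if source in members:
            reached |= members
    return reached - {source}


devices = {"HostA", "HostB", "HostC", "Disk1", "Disk2", "Disk3"}
zones = {
    "qlZone1": {"HostA", "Disk1"},   # assumed membership
    "qlZone2": {"HostC", "Disk2"},   # assumed membership
    "qlZone3": {"HostB", "Disk3"},   # assumed membership
}

print(sorted(lip_recipients("HostA", devices)))         # all five other devices
print(sorted(lip_recipients("HostA", devices, zones)))  # ['Disk1']
```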

      Figure 8.31 Marginal Link Fault Isolation

      1. START: Move the suspected marginal port cable to another port on the switch.
      2. Do the errors stop or the symptoms go away?
         Yes: The switch port or switch GBIC is marginal. Go to step 3.
         No: Cable or Nx_Port issue. Go to step 5.
      3. Replace the GBIC on the marginal port, and then run the portLoopBack test
         on the marginal port.
      4. Does the portLoopBack test fail?
         Yes: BAD port. Replace the motherboard.
         No: The replaced GBIC is BAD.
      5. Try a new cable.
      6. Do the errors stop or the symptoms go away?
         Yes: BAD cable.
         No: Follow Nx_Port (that is, HBA or storage interface) troubleshooting
         procedures.


Figure 8.32 QuickLoop Zoning Example

[Figure: a switch with ports 0,0 through 0,5; Host A (b8), Host B (b9), and
Host C (ba); devices labeled c0, c1, and e0 through e3; all grouped into the
zones qlZone1, qlZone2, and qlZone3.]

Overview of SilkWorm Port Error Statistics
Additional SilkWorm port statistics can be obtained by executing the following
telnet commands:
     •   portShow <port #>
     •   portStatsShow <port #>
     Use portStatsShow for error statistics (such as CRC, encoding, and bad End of
Frame [EOF] errors), and use portShow for link-level and LIP statistics (such as
link failure, loss of sync, and loss of signal). The portShow command offers
statistics similar to those of portStatsShow. However, the statistics gathered by
portShow are updated in software whenever a port interrupt is received, while the
statistics for portStatsShow are updated in hardware registers as they occur. The
significance of this difference is that many errors, such as CRC errors, could
occur between interrupts. The hardware counters (portStatsShow) will capture
these between-interrupt errors, while the software counters (portShow) might not.
Another difference between the two commands is that portShow provides LIP
statistics and link statistics (link failure, loss of signal, loss of sync), while
portStatsShow does not. A partial listing of relevant portShow statistics follows:
     •   Lip_in: Number of LIPs transmitted from the connected device to the
         switch port. Does not apply to F_Port.
     •   Lip_out: Number of LIPs transmitted from the switch port to the
         connected device. Does not apply to F_Port.
     •   Lip_rx: Type of LIP (F7, F8) last received by the switch from the
         connected device. Does not apply to F_Port.


      Troubleshooting I/O Pauses
      I/O pauses happen, and both the SAN and edge device can and should tolerate
      such events. The term I/O pause is somewhat generic. An I/O pause can be as
      harsh as the powering down of a host or storage device while I/O is in transit,
      which will cause I/O to cease. Alternatively, it can be as lightweight as a
      port-level RSCN, which might be a problem for only the most latency-sensitive of
      applications. Most HBAs currently pause I/O during RSCN processing; however,
      updated drivers are expected to minimize this effect. Relative to the SAN, fabric
      events can also cause a pause in I/O. A fabric event can be broken down into a
      change, such as a switch reboot, and the resultant activity to respond to that
      change. In the case of a switch reboot, not only are the devices connected to that
      switch affected, but also devices connected to the fabric, even if the fabric is
      resilient. This is because the fabric needs to reroute, which takes less than a
      second, and because all devices connected to the SAN that have registered for
      state change notification must process a global RSCN. Edge devices such as
      HBAs and storage devices should be tolerant of such pauses in I/O. It is possible
      to adjust the settings for these devices to accommodate longer or shorter delays
      in I/O when a SAN event occurs. RSCNs are normal and key to SAN operation.
           Several applications, such as video-on-demand and applications that are
      evolving into the SAN model (for example, tape backup), are very sensitive to
      latency and/or RSCNs. High latencies and large numbers of RSCNs can adversely
      affect these applications. Storage vendors, switch vendors, application vendors,
      and HBA vendors are working with the standards bodies (T11) as well as modifying
      their product implementations to handle these types of exceptions. Table 8.6 lists
      common events that cause fabric rerouting and/or fabric RSCNs.

Table 8.6 Fabric Events and Their Impact

                                               Generate            Will Result in
Event                                          Global RSCN?        Reroute?

SwitchDisable                                  Yes                 Yes
  Disabling a switch in the fabric will
  require the fabric to reconfigure and a
  new set of data path routes to be
  established for the resulting downsized
  fabric.

SwitchEnable                                   Yes                 Sometimes
  The corresponding mode to the disable.
  A new switch added to the fabric will
  result in new route calculations to
  allow for the added ports.

E_Port connection/disconnection                Yes                 Sometimes
  Adding or removing an ISL will cause a
  fabric RSCN.

Zone update                                    Yes                 No
  Occurs when you execute a cfgEnable or
  cfgDisable command.

Adding/removing a switch to/from the fabric    Yes                 Sometimes
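
If you script your change planning, Table 8.6 can be kept as a small lookup
structure. The sketch below transcribes the table's rows; the dictionary keys,
class name, and function name are illustrative labels rather than Fabric OS
identifiers.

```python
from typing import Dict, List, NamedTuple


class FabricEventImpact(NamedTuple):
    global_rscn: str  # "yes" or "no"
    reroute: str      # "yes", "sometimes", or "no"


# Rows transcribed from Table 8.6; the keys are illustrative labels only.
FABRIC_EVENTS: Dict[str, FabricEventImpact] = {
    "switchDisable": FabricEventImpact("yes", "yes"),
    "switchEnable": FabricEventImpact("yes", "sometimes"),
    "E_Port connection/disconnection": FabricEventImpact("yes", "sometimes"),
    "zone update": FabricEventImpact("yes", "no"),
    "switch added/removed": FabricEventImpact("yes", "sometimes"),
}


def events_that_may_reroute(events: Dict[str, FabricEventImpact] = FABRIC_EVENTS) -> List[str]:
    """Return the events whose reroute column is anything other than "no"."""
    return [name for name, impact in events.items() if impact.reroute != "no"]


print(events_that_may_reroute())
```

A zone update is the only listed event that generates a global RSCN without a
possible reroute, so it is the only row the function leaves out.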

    Troubleshooting fabric events and their adverse impact on applications and
the SAN is a complex process. If you suspect that a fabric event is adversely
affecting your SAN, work with your switch supplier for resolution.

      Summary
      It can be helpful to think of the SAN as a virtual cable when it comes to
      troubleshooting, approaching the problem by breaking components down to a host,
      the SAN virtual cable, and the storage. To the operating system, the SAN provides
      a link to a disk, just as a traditional SCSI connection would. Troubleshooting a
      SAN is more challenging, but still has many things in common with the
      traditional storage troubleshooting process. Switches are logically positioned in
      the middle of the network between hosts and storage, and have visibility to both
      storage and hosts. This visibility into both sides of the storage network enables
      you to use switches to determine the cause of any malfunction in the SAN.
           SAN troubleshooting should begin in the center of the SAN and proceed
      outward. Once you know where to start troubleshooting, the next question is how
      to proceed. Start the troubleshooting process by gathering a preliminary set of
      data, and then analyze this data to identify where the problem resides: the host,
      the fabric, or the storage. Next, gather additional data from the appropriate area
      and focus in on the cause of the problem. A plethora of data is available from the
      switches, hosts, and storage.
           Many tools are available to the SAN troubleshooter. Several of these tools are
      switch commands. Other tools involve viewing the switch LEDs, host
      information, Fibre Channel analyzers, and diagnostics available on many storage
      arrays. It is rarely possible to use a single tool to successfully troubleshoot a
      problem. It is more common to use several tools in concert.
           A fabric problem is a pervasive issue that can often affect more than one
      device. When a fabric issue is experienced in a resilient SAN, it might have no
      impact on SAN functionality, because the SAN redundancy compensates for the
      marginal situation. However, these “soft” errors can cause degradation in the
      performance of the enterprise application and thus require immediate attention.
      Fabric issues are normally associated with large fabrics, which are defined as
      fabrics consisting of 10 or more switches and 100 or more edge devices.
           A host that is unable to access a SAN device is a more common issue. This
      type of issue is classified as a missing device. Use of the commands switchShow
      and nsShow can quickly reveal the cause of the missing device. Missing device
      issues are normally limited to a few devices. If more devices are involved, it is
      likely a fabric issue.
           The impact of a marginal port can be significant. For example, a large storage
      device might be accessed by potentially dozens of hosts. The marginal behavior of
      this storage device then has the potential to impact all devices that access this
      storage port. A marginal link involves the connection between the switch and the
      edge device. Isolating the exact cause of a marginal link involves analyzing and
      testing many of the components that make up the link: switch port, switch
      GBIC, cable, edge device GBIC, and the edge device.
     I/O pauses do happen, and both the SAN and edge device can and should
tolerate such events. The term I/O pause is somewhat generic. An I/O pause can
be as severe as the powering down of a host or storage device while I/O is in
transit, which will cause I/O to cease. Alternatively, it can be as lightweight as a
port-level RSCN, which might be a problem for only the most latency-sensitive
applications. Relative to the SAN, fabric events can also cause a pause in I/O.
Calibrating your edge devices to handle I/O pauses and troubleshooting I/O
pauses is a complex process.

Solutions Fast Track
The Troubleshooting Approach:
The SAN Is a Virtual Cable
         Use the SAN’s visibility to both storage and hosts to start your trouble-
         shooting process.
         The switchShow, nsShow, nsAllShow, errShow, and topologyShow
         commands are extremely informational and invaluable to the trouble-
         shooting process.
         The UNIX format command or HBA vendor-supplied utilities are also
         helpful in troubleshooting.
         When you start the troubleshooting process, determine if the issue is
         fabric related or device related. A fabric-related issue impacts many
         devices, and a device issue impacts only a few devices.


Troubleshooting the Fabric
         A fabric issue impacts many devices. A logical switch outage, such as
         segmentation or physical switch outage, can cause many devices to drop
         out of the fabric. Problems with ISL initialization are also considered
         fabric issues.
               The quickest way to narrow your search of a fabric problem is to com-
               pare your baseline SAN profile to your current SAN profile and investi-
               gate discrepancies.
               A SAN profile includes the number of devices per switch (nsShow),
               number of devices in the fabric (nsAllShow), and number of switches
               in the fabric (topologyShow). The errShow and switchShow
               commands are also helpful in tracking down fabric issues.
               Some fabric issues are caused by a mismatch in fabric service timeout
               variables and the edge device timeout settings. Careful analysis of both
               the fabric and the edge devices is necessary to resolve this complex issue.
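
The baseline-versus-current comparison can be automated once the counts are
recorded. The sketch below assumes you have already tallied the counts from
nsShow, nsAllShow, and topologyShow yourself; the dictionary key names and the
numbers are hypothetical, and no switch output is parsed here.

```python
from typing import Dict, List


def profile_discrepancies(baseline: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """List every profile item whose count differs between the two profiles.

    An item present in only one profile is reported with None for the other.
    """
    findings = []
    for key in sorted(set(baseline) | set(current)):
        before = baseline.get(key)
        after = current.get(key)
        if before != after:
            findings.append(f"{key}: baseline={before} current={after}")
    return findings


# Hypothetical counts recorded by hand from the switch commands.
baseline = {"switches_in_fabric": 4, "devices_in_fabric": 32, "devices_on_sw1": 8}
current = {"switches_in_fabric": 3, "devices_in_fabric": 24, "devices_on_sw1": 8}

for line in profile_discrepancies(baseline, current):
    print(line)
```

In this made-up case the fabric has lost one switch and eight devices while
switch 1 is unchanged, which points toward a switch outage or segmentation
elsewhere in the fabric.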


      Troubleshooting Devices that Cannot Be Seen
               The first thing to check is that the missing device is logically connected
               to the SAN as indicated by switchShow output.
               Next, check to see that the device is present in the Name Server, using
               the command nsShow. If the device is not in the Name Server, it is
               invisible to the other devices in the fabric.
               Other common causes of missing devices are zone conflicts or
               marginal links.


      Troubleshooting Marginal Links
               Use portErrShow to establish whether there is a relatively high number
               of errors, such as CRC errors. Look for a steadily increasing number of
               errors to confirm a marginal link.
               A marginal link can impact multiple devices. For example, a shared
               storage device with a marginal link can cause communication problems
               with all devices that access that shared storage.
               A marginal link can be caused by any of the components that make up
               the link: switch port, switch GBIC, cable, edge device GBIC, and the
               edge device.


Troubleshooting I/O Pauses
        I/O pauses happen, and both the SAN and edge device can and should
        tolerate such events.
        An I/O pause can be as harsh as the powering down of a host or storage
        device while I/O is in transit, which will cause I/O to cease.
        Alternatively, it might be as lightweight as a port-level RSCN, which
        might be a problem for only the most latency-sensitive applications.
        Relative to the SAN, fabric events can also cause a pause in I/O.
        Several applications, such as video-on-demand and applications that are
        evolving into the SAN model, such as tape backup, are very sensitive to
        latency and/or RSCNs. High latencies and large numbers of RSCNs
        can adversely affect these applications.
        Storage vendors, switch vendors, application vendors, and HBA vendors
        are working with the standards bodies (T11) as well as modifying their
        product implementations to handle these types of exceptions.


Frequently Asked Questions
The following Frequently Asked Questions, answered by the authors of this book,
are designed to both measure your understanding of the concepts presented in
this chapter and to assist you with real-life implementation of these concepts. To
have your questions about this chapter answered by the author, browse to
www.syngress.com/solutions and click on the “Ask the Author” form.


Q: When I activate a zone change (cfgEnable), I notice a pause in I/O and
   several of my hosts log warnings. What causes this?
A: When you issue a zone change, an RSCN is delivered to any host in the
   fabric that registers to receive an RSCN. The pause you notice is the initiator
   responding to the RSCN, which involves the initiator querying the Name
   Server and resolving any changes to the fabric.

Q: If I exhaust my troubleshooting options and cannot resolve an issue after
   reading this chapter, what should my next step be?
      A: Contact your switch supplier and request support. Provide the information
          outlined earlier in this chapter. Of special importance is the supportShow
          output, which is ideally captured while the problem is happening.

      Q: How can I tell if my fabric is segmented?
      A: Normally, a segmented fabric will generate an error message on the switch
          that segments. You can view errors by issuing the command errShow.

      Q: Why does my device inconsistently connect to the switch as either an
          N_Port or an NL_Port?
      A: It is likely that there is a bug in the port initialization of either the edge
          device or the switch. A short-term solution is to configure a port for a spe-
          cific topology. For example, configure a port as an FL_Port by using the
          command portcfgLport. Longer term, you should resolve this behavior by
          escalating the problem to your switch supplier and your edge device supplier.

      Q: What is a quick way to reinitialize to clear a fault or re-enable a link?
      A: The commands portDisable and portEnable will cause a port to reinitialize
          and potentially clear a fault. Doing so will cause the edge device to register
          with the Name Server.
                                       Chapter 9

SAN Implementation,
Maintenance, and
Management




 Solutions in this chapter:

     •   Installation Considerations
     •   Automating Switch
         Administration Activities
     •   Brocade Zoning Considerations
     •   Validating Your Fabric
     •   SAN Maintenance


         Summary

         Solutions Fast Track

         Frequently Asked Questions



350   Chapter 9 • SAN Implementation, Maintenance, and Management



      Introduction
      Once you have completed your SAN design, you can then focus on
      implementation, management, and maintenance. Arriving at a design requires a
      significant data-gathering effort during which you establish the requirements
      that shape your SAN: application drivers, availability, scalability,
      manageability, and performance. With these decisions made, you can then create a
      SAN architecture that meets your needs. The process of deploying a SAN is
      iterative as you build, test, and refine your original design. Once you have a
      SAN design, the next step is to implement, maintain, and manage the SAN. SAN
      implementation is the process of taking your design from paper to physical
      setup. Implementation is an ongoing activity that is very visible during the
      middle stages of your SAN’s lifecycle. In transitioning to a management and
      maintenance mode, you will periodically implement changes such as SAN
      expansion, fabric upgrades, and node movement. SAN management and maintenance
      activities are also reactive, such as replacing a failed switch or Gigabit
      Interface Converter (GBIC) optical module.
          This chapter is organized similarly to how you would set up and run a SAN.
      First, we discuss topics that require thought prior to implementation, such as
      zoning, cabling, and installation decisions. Then we present topics such as how to
      validate your fabric prior to transitioning to production. Finally, once you have
      your fabric up and running, we discuss topics like managing your SAN with
      automation and maintenance topics such as adding devices to your fabric and
      fabric upgrades.
          This chapter provides unique tips and ideas as you deploy and manage your
      SAN, with a focus on practical techniques and tools that require minimal
      dependencies. You can directly apply the processes in this chapter to your SAN
      today. There is a wide variety of SAN management packages that offer varying
      degrees of functionality for SAN management, maintenance, and implementation,
      such as VERITAS (SANPoint Control), Computer Associates (Unicenter TNG), BMC
      (PATROL), Sun (HighGround), Hewlett-Packard (OpenView Storage Area
      Manager), SANavigator (SANavigator), Prisa (Visual SAN), Micromuse (Netcool),
      and IBM Tivoli Storage Network Manager (TSNM).
          There are many choices for developing your own SAN management and
      maintenance infrastructure using a mix of commercial packages and your own
      scripts and processes. Throughout this chapter, we provide examples of how to
      automate certain SAN management activities. The scripts discussed are freely
      downloadable and available for your use from the book’s Web site, found at
      www.syngress.com/solutions. The level at which you use these scripts and the
      process outlined in this chapter depends on your technical skill level, what SAN
      management software you deploy, and your company’s information technology
      practices.

Installation Considerations
Several decisions and considerations regarding your SAN solution are necessary
prior to installation. Upfront planning and review result in significant time
savings. This section identifies the areas of SAN installation that require
planning, and the upfront decisions that you need to make. For example, it might
be difficult to install or maintain an Ethernet connection for a remote SAN.
In-band management via Internet Protocol over Fibre Channel (IPFC) is an option
that addresses this issue. When you install your SAN, you want to be sure that the
switches are running the appropriate level of Brocade Fabric OS, which is not
always the latest release! You need to know which version of Fabric OS to use
prior to installation. Other installation considerations include setting switch
parameters and verifying that you have the necessary licenses to operate your
SAN in accordance with the design requirements.

How to Cable Your SAN for Ease of Operation
Installation time is when you should plan your cabling and implement a cable
layout scheme that is manageable, flexible, and maintainable. An effective cable
management scheme should not only enable ease of maintenance, but also be
aesthetically pleasing. While aesthetics might not seem like a necessary design
principle, cable plans that “look nice” usually turn out to be the ones that
are easier to manage.
      The Inter-Switch Link (ISL) cabling plans in this chapter are optimized to
facilitate clean cable management, and this should be easily achievable as long as
normal cable management practices are followed. In particular, the cables used for
ISLs should be carefully labeled and bundled so that they cannot be mistaken for
host or storage cables. A well-managed ISL cable layout is shown in Figure 9.1.
      While the layout depicted in Figure 9.2 might look fairly clean, it does have
some potential problems. If a switch in the middle of this set were to fail, it
would be difficult to replace it without shutting down the other three. This is
because the ISLs from the top switches run in front of the lower switches. The
configuration in Figure 9.1 does not have this problem. Another problem
with the configuration in Figure 9.2 is that the switches are stacked on a
shelf rather than being rack-mounted. Even if the ISLs were cleaned up, it would
still be difficult to remove a switch from the bottom of the stack.


      Figure 9.1 An ISL Cable Layout That Is Easy to Maintain




      Figure 9.2 An ISL Cable Layout That Is Difficult to Maintain




          Figure 9.3 shows much the same configuration of switches shown in Figure
      9.2, with a recommended cable layout scheme that is easy to maintain. The
      switches in Figure 9.3 are also rack-mounted.


      NOTE
           Ensure that ISLs run in front of only the switches to which they are con-
           nected. This will allow the switches to be removed without downtime for
           the fabric.


Figure 9.3 The Switches From Figure 9.2 Are Recabled to Enable
Ease of Maintenance




    It is not always possible to use cables that are “cut to length” for the ISLs. If
you are using cables that are excessively long, it is desirable to take up the slack in
some manner, as shown in Figure 9.4. Ideally, do this away from the switch to
avoid clutter at the switch itself. However, be sure that you do not exceed the
bend radius specification of the optical cable.
Figure 9.4 Take Up Slack to Avoid Clutter




    Figure 9.5 depicts a high-performance, 32-port configuration that uses six
switches. The switches are mounted with cable management above, below, and on
both sides. Cable management can also be used in between switches, if needed.
Only four of those switches have any ports available for user wiring. The other
two are used exclusively for ISLs. It is desirable to have all available user
ports in one contiguous block to ease cabling of edge devices and simplify
troubleshooting and monitoring. Figure 9.5 shows the ports available to the user
(edge ports).


      Figure 9.5 Six Switches Racked for Edge Wiring and ISL Wiring




      [Figure 9.5 labels: “User wiring goes this way”; “ISL wiring goes this way”; 6' x 19" Rack]



      Clearly, the ISL wiring can be bundled to the top, bottom, and right side of the
      rack and kept completely separate from the user wiring that would run to the left.
          The ISLs within this group should all be formed using 1-meter cables if
      you are using 1.5 U switches (such as the SilkWorm 2250), or 2-meter cables if
      you are using 2 U switches (such as the SilkWorm 2800). The length of the ISLs to
      the other groups of switches will vary greatly depending on rack configuration and
      should therefore be measured beforehand. Note that these cable lengths apply only
      to the group depicted. For different SANs, different cable lengths might be
      required.
          The ISLs used to interconnect the switches in these configurations are
      assumed to be semipermanent. It is useful to have these semipermanent ISLs
      colored differently from the host/storage/other ISL cables. Most multimode
      Fibre Channel cables are orange. Using gray, black, blue, or some other different
      color for the ISLs should help to differentiate between ISL cables and edge
      device cables.

      Racking Considerations
      If you are employing a dual-fabric SAN architecture, it is important that the
      duality be employed throughout the SAN implementation, as shown in the
      left-hand configuration of Figure 9.6. Deploying two fabrics that are part of a
      SAN solution within the same rack makes that rack a single point of failure. The odds
of a rack falling over of its own accord are low. However, it is possible to picture
a contract cable management worker on a ladder falling off and hitting a rack, or
a leak spraying water into a rack. The concept of dual fabrics is to avoid a single
point of failure. For high-availability fabrics, ensure that you have separate power
circuits available, as shown in the right-hand configuration of Figure 9.6. For dual
power supply switches, use separate circuits for the left and right power supplies.
If you are using single power supply switches, it is still important to use separate
circuits if your SAN is configured for high availability. This means connecting half
of the switches in your fabric to one circuit in such a way that if these switches
are powered off, the other half, which are connected to a different circuit, can still
form a working fabric.
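
The power-circuit rule above can be checked on paper before racking. The following is a minimal Python sketch, not part of the book's downloadable scripts. It assumes you can list which circuit feeds each single power supply switch and which switch pairs are joined by ISLs. The function name, switch names, and circuit labels are hypothetical.

```python
from collections import defaultdict


def survives_circuit_loss(circuit_of, isls):
    """For each circuit, report whether the switches that keep power when
    that circuit is lost still form one connected fabric."""
    neighbors = defaultdict(set)
    for a, b in isls:
        neighbors[a].add(b)
        neighbors[b].add(a)

    result = {}
    for lost in set(circuit_of.values()):
        alive = {sw for sw, circuit in circuit_of.items() if circuit != lost}
        if not alive:
            result[lost] = False
            continue
        # Walk the ISL graph using only switches that still have power.
        start = next(iter(alive))
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbors[stack.pop()]:
                if nxt in alive and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result[lost] = seen == alive
    return result


# Hypothetical four-switch full mesh split evenly across two circuits.
circuits = {"sw1": "A", "sw2": "A", "sw3": "B", "sw4": "B"}
links = [("sw1", "sw2"), ("sw1", "sw3"), ("sw1", "sw4"),
         ("sw2", "sw3"), ("sw2", "sw4"), ("sw3", "sw4")]
report = survives_circuit_loss(circuits, links)
```

A False entry in the report means that losing that circuit would partition the remaining switches, so the power or ISL layout should be revisited.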
Figure 9.6 Racking and Powering for High Availability

[Figure 9.6 labels: Rack A, Circuit A; Rack B, Circuit B; Separate Circuits; Hosts; SAN A; SAN B; Storage]


      In-Band or Out-of-Band Management?
      Brocade switches can be managed via direct Ethernet connections (out-of-band)
      or via IPFC (in-band). In some situations, it is not possible or practical to
      dedicate an Ethernet connection to each switch. When using Ethernet connections,
      it is only necessary to configure the switch IP information and attach an
      Ethernet cable to each switch. When using IPFC, you can use a single Ethernet
      connection to bridge to the other switches via IPFC. To configure IPFC, it is
      necessary to configure the switches and, in some cases, the Host Bus Adapters
      (HBAs) to run IPFC. There are advantages and disadvantages to an Ethernet
      bridge to IPFC approach, as presented in Table 9.1. The flexibility does exist
      to do in-band management with a single Ethernet connection. If you do
      out-of-band management with Ethernet connections to each switch, you will need
      to allocate an IP address, Ethernet port, and cable for each switch. There is
      also the option of doing full IPFC-based management, with an IPFC-capable HBA
      connected to a switch that can talk to all other switches in the fabric via IPFC.

      Table 9.1 Advantages and Disadvantages of Using IPFC to Manage Your SAN

      Advantages                     Disadvantages
      Fewer or no Ethernet           Single point of management failure—only one
      connections                    Ethernet path or IPFC path to fabric. If a switch
                                     goes down anywhere in the fabric or gets
                                     “switch disabled,” all management capabilities
                                     stop at that point—no Brocade WEB TOOLS and
                                     no telnet support.
                                     This is no different than management via the
                                     Ethernet ports. In that case, you would have the
                                     management station and all the switches con-
                                     nected to an Ethernet switch or hub. If the
                                     Ethernet cable goes bad, you cannot manage
                                     any switch: single point of failure.
      Fewer Ethernet hubs/power      Static IP addresses—no Dynamic Host
                                     Configuration Protocol (DHCP) support.
      Remote management              No easy gateways exist for routing IPFC like you
                                     have on Ethernet unless you piece one together
                                     using routed devices on a UNIX box or use some
                                     kind of routing software for Windows NT.


IPFC In-Band Guidelines
Figure 9.7 depicts an IPFC in-band configuration. The management station,
where the browser runs, does not need to have a Fibre Channel interface or be
IPFC-capable: it only needs an Ethernet connection. Only one of the switches in
the fabric needs an Ethernet connection, and that connection is normally placed
in the same subnet as the management station. A shared subnet is not strictly
necessary: you can instead configure the default gateway on the management
switch and add an appropriate static route on the management station and on all
routers between it and the management switch. This is a bit complex, and
probably not worthwhile in many cases. With this configuration correctly
implemented, it is also possible to telnet into every switch in the fabric from
the management station.

Figure 9.7 A Five-Switch IPFC In-Band Setup

       Management Station
       IP: 192.168.164.109
       Subnet: 255.255.255.0
       GW:192.168.164.1




                                                     Same Ethernet IP can be used on all in-band switches.
                                                        Note: Ethernet IP cannot be [0.0.0.0] or None.



       SW1                     SW2                     SW3                         SW4                       SW5




       Management Switch       IP: 192.0.0.1          IP: 192.0.0.1                IP: 192.0.0.1             IP: 192.0.0.1
       IP: 192.168.164.28      FC_IP: 172.17.50.2     FC_IP: 172.17.50.3           FC_IP: 172.17.50.4        FC_IP: 172.17.50.5
       FC_IP: 172.17.50.1      GW: 172.17.50.1        GW: 172.17.50.1              GW: 172.17.50.1           GW: 172.17.50.1
       GW: 192.168.164.1
       (Gateway Switch)

                                               Gateway on in-band switches must point to the first switch's FC_IP
                                               address. All switches' FC_IP address must be in the same subnet. A
                                                      Static Route must be entered from the management
                                                               station pointing to the FC_IP Subnet.


          A guide for setting up a five-switch SAN for in-band management follows.
      You can adapt this guide to fit your SAN environment. A summary of the
      configuration is listed in Table 9.2. This summary highlights the relationship
      between the IPFC and the Ethernet IP addresses:
           •   All switches must have their Fibre Channel IP addresses on the
               same subnet.
           •   The management station and switch Ethernet port must have their IP
               addresses on the same subnet.
           •   The management station must have a static route to the IPFC subnet, or
               the default gateway pointing to the Fibre Channel IP address of the switch
               connected to Ethernet. Either solution will work (for example,
               route add IPFC_SUBNET mask IPFC_MASK IPADDR metric 1, where IPADDR is
               the Ethernet IP address of the gateway switch; this is the Windows form
               of the route command, and the syntax on Solaris and other UNIX systems
               differs).
           •   The in-band managed switches not connected to Ethernet must have their
               default gateway set to the IPFC address of the switch that is connected to
               Ethernet. In Figure 9.7, switches 2, 3, 4, and 5 have their default gateway
               set to [172.17.50.1]. These switches must have their Ethernet IP addresses
               set to an address outside the Ethernet IP subnet used by Switch 1 (SW1).
               The Ethernet IP address cannot be [0.0.0.0] or None. The same Ethernet
               IP address can be used on all of these switches, as illustrated in
               Figure 9.7, as long as the switches are not connected to the IP network.
           •   The gateway address on Switch 1 (the gateway switch) should be set to
               the default gateway on the network. However, this is not required if
               Switch 1 and the management station are on the same subnet.
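
The addressing rules in this list are easy to get wrong by hand, so it can help to check a planned layout before typing it into the switches. The following is a minimal Python sketch that uses only the standard ipaddress module. The function name, the dictionary keys, and the /24 prefix for the Fibre Channel IP subnet are assumptions made for illustration; the addresses are those of Figure 9.7.

```python
import ipaddress


def check_ipfc_plan(gateway_switch, inband_switches, fc_prefix=24):
    """Return a list of problems found in an IPFC in-band addressing plan.

    fc_prefix is an assumed prefix length for the Fibre Channel IP subnet.
    """
    problems = []
    mgmt_eth_net = ipaddress.ip_network(
        f"{gateway_switch['eth_ip']}/{gateway_switch['eth_mask']}", strict=False)
    fc_net = ipaddress.ip_network(
        f"{gateway_switch['fc_ip']}/{fc_prefix}", strict=False)

    for sw in inband_switches:
        name = sw["name"]
        # All Fibre Channel IP addresses must share one subnet.
        if ipaddress.ip_address(sw["fc_ip"]) not in fc_net:
            problems.append(f"{name}: FC_IP is outside {fc_net}")
        # In-band switches route through the gateway switch's FC_IP.
        if sw["gateway"] != gateway_switch["fc_ip"]:
            problems.append(f"{name}: gateway must be {gateway_switch['fc_ip']}")
        # Ethernet IP must be set, and must not fall in the management subnet.
        eth = sw.get("eth_ip")
        if eth in (None, "", "0.0.0.0"):
            problems.append(f"{name}: Ethernet IP cannot be 0.0.0.0 or None")
        elif ipaddress.ip_address(eth) in mgmt_eth_net:
            problems.append(f"{name}: Ethernet IP must be outside {mgmt_eth_net}")
    return problems


# Values taken from Figure 9.7.
gw = {"name": "SW1", "eth_ip": "192.168.164.28",
      "eth_mask": "255.255.255.0", "fc_ip": "172.17.50.1"}
inband = [{"name": f"SW{n}", "eth_ip": "192.0.0.1",
           "fc_ip": f"172.17.50.{n}", "gateway": "172.17.50.1"}
          for n in range(2, 6)]
problems = check_ipfc_plan(gw, inband)
```

An empty list means the plan satisfies the rules checked here. The sketch does not verify the static route on the management station.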


      Setting Switch Parameters
      Before the switches are cabled together, certain parameters should be set. These
      include the IP information and the switch name, which should be the same as the
      host name that maps to the switch’s IP address. Give each switch an appropriate
      and unique IP address and switch name. The gateway and subnet mask
      might also need to be set. See your network administrator for appropriate values.
          If possible, have a contiguous block of addresses reserved for all Brocade
      switches. It might also be beneficial to keep the last octet of these addresses
      at or below 239. One popular way to administer Fibre Channel domain IDs is to
      have them match the last octet of the IP address. For example, switch
      192.168.62.100 would get domain ID 100. Since the highest valid domain ID is 239,
      this scheme works only if the last octet of the IP address is 239 or lower.
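
The last-octet convention can be expressed in a few lines. The following Python sketch is illustrative only, and the function name is hypothetical. The upper bound of 239 comes from the text above; the lower bound of 1 is an assumption added here.

```python
import ipaddress

MAX_DOMAIN_ID = 239  # highest valid domain ID, as stated in the text
MIN_DOMAIN_ID = 1    # assumed lower bound, not stated in the text


def domain_id_for(ip):
    """Derive a domain ID from the last octet of a switch's IPv4 address."""
    last_octet = ipaddress.IPv4Address(ip).packed[-1]
    if not MIN_DOMAIN_ID <= last_octet <= MAX_DOMAIN_ID:
        raise ValueError(f"last octet {last_octet} is not usable as a domain ID")
    return last_octet


domain = domain_id_for("192.168.62.100")
```

Rejecting out-of-range octets early makes it obvious when an address block does not fit the scheme.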

      Table 9.2 Five-Switch IPFC Configuration Detail

      Node                 Ethernet IP Address  Subnet Mask    IPFC Address  Default Gateway  Notes
      Management Station   192.168.164.109      255.255.255.0                192.168.164.1    Static route (see below)
      Management Switch 1  192.168.164.28                      172.17.50.1   192.168.164.1    Gateway Switch
      Switch 2             192.0.0.1                           172.17.50.2   172.17.50.1
      Switch 3             192.0.0.1                           172.17.50.3   172.17.50.1
      Switch 4             192.0.0.1                           172.17.50.4   172.17.50.1
      Switch 5             192.0.0.1                           172.17.50.5   172.17.50.1

      Static route on the Management Station:
      route add 172.17.50.0 mask 255.255.255.0 192.168.164.28 metric 1




         Switch Naming Tips
         Having a well thought-out switch-naming convention enables easy iden-
         tification of physical switches if a problem arises. Use a switch-naming
         convention that scales across the organization, keeping in mind that the
         SAN might start small but can be extended enterprise-wide over time. If
         you have to change a switch name, it is very easy to do—just execute the
         command switchName. Changing a switch-naming convention is more
         difficult, as you will most likely have to change all the switch names in
         the SAN affected by the naming convention change. For example, if you
         evolved your SAN from a four-switch mesh to an eight-switch core/edge
         topology, you might want to rename your switches with either the term
         core or edge embedded in the name to reflect the role of the switch.
         Consider using the following items when making up the switch name
         field:
               •   Incorporate an ID for the site or building where the switch is
                   located.
               •   Add a component to identify the floor or room where the
                   switch is located.
               •   Use the switch topology function (such as core or edge).
               •   Add a component that shows to which organization or
                   project the switch belongs.
               •   Include the rack ID in the name to further detail switch
                   location.
               •   Embed the switch type into the switch name (such as the
                   SilkWorm 2800, 2250, or 2400).
               •   If redundant fabrics are being used, select an ID for
                   complementary fabrics.
              Example: CORE1_A_B6_230_R5 = core Switch 1, fabric A, building
         6, room 230, rack 5
              Note that switch names can be up to 19 characters long, must
         begin with a letter or digit, and must consist of letters, digits, and
         underscore characters. Spaces are not allowed.
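
If switch names are generated by a script, the rules in this sidebar can be enforced before a name ever reaches the switchName command. The following Python sketch is one possible way to do that. The function names and the field order are hypothetical, modeled on the CORE1_A_B6_230_R5 example above.

```python
import re

# Up to 19 characters; the first is a letter or digit; the rest are
# letters, digits, or underscores (no spaces).
SWITCH_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_]{0,18}")


def is_valid_switch_name(name):
    """Check a candidate name against the rules quoted in the sidebar."""
    return SWITCH_NAME_RE.fullmatch(name) is not None


def build_switch_name(role, index, fabric, building, room, rack):
    """Assemble a name such as CORE1_A_B6_230_R5 and validate it."""
    name = f"{role.upper()}{index}_{fabric}_B{building}_{room}_R{rack}"
    if not is_valid_switch_name(name):
        raise ValueError(f"{name!r} breaks the switch-name rules")
    return name


name = build_switch_name("core", 1, "A", 6, 230, 5)
```

Validating at build time catches names that grow past 19 characters as more fields are added to the convention.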


   To set these parameters, execute the following steps:
    1. If the switch has a serial port, connect to it with a serial cable and log in
       as the administrator. If the switch has a control panel instead of a serial
       port, use the buttons on the panel according to the Brocade Fabric OS
       documentation to set the IP address, netmask, and gateway, and then
       telnet in and perform the rest of the configuration as documented here.
    2. To set the switchName parameter, use the switchName command:
        switch:admin> switchName "switch1"
        Updating flash ...
        switch1:admin>

    3. Type ipAddrSet. A menu will appear. Answer the questions appropri-
       ately. Note that step 3 is not necessary if you enter the IP address via the
       front panel:
        switch1:admin> ipAddrSet
        Ethernet IP Address []: 192.168.163.110
        Ethernet Subnetmask [255.255.255.0]: 255.255.255.0
        Fibre Channel IP Address [none]:
        Fibre Channel Subnetmask [none]:
        Gateway Address []: 192.168.163.1
        switch1:admin>

    4. Connect the switch to the Ethernet and ping the address to verify that it
       has been set correctly.


What Fabric OS Version Should I Use?
Deciding which version of Fabric OS to use can be a challenging process,
especially if your SAN consists of multiple vendor edge devices or switches. The
most recent version of Fabric OS might not always be the best version to use. In
some cases, you might experience conflicting Fabric OS requirements, with multiple
vendors each specifying a different version of Fabric OS. Many switch suppliers
extensively test their SAN products with Brocade switches in varying
configurations. To support their products and Brocade switches, they require that
you run a specific version of Fabric OS. One suggestion is to work with your switch
supplier and your SAN vendors to identify whether there is an intersection of
supported Fabric
      OS versions. For example, your switch supplier might support Fabric OS versions
      v2.2.2 and v2.4.1. If your storage vendor and HBA vendor support v2.2.2, your
      choice would be to install Fabric OS v2.2.2. In some cases, there might not exist
      an intersection of support requirements, at which point you might want to use the
      version of Fabric OS recommended by your switch supplier or negotiate a sup-
      port agreement with your SAN vendors. Another determining factor for running
      a version of Fabric OS is availability of features or support.
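
Finding the intersection of supported versions is a plain set operation. The following Python sketch mirrors the example in the text; the function name and the dictionary layout are hypothetical.

```python
def common_fabric_os_versions(supported_by):
    """Return the Fabric OS versions that every listed party supports."""
    version_sets = [set(versions) for versions in supported_by.values()]
    return set.intersection(*version_sets) if version_sets else set()


# The example from the text: only v2.2.2 is supported by everyone.
supported = {
    "switch supplier": ["v2.2.2", "v2.4.1"],
    "storage vendor": ["v2.2.2"],
    "HBA vendor": ["v2.2.2"],
}
candidates = common_fabric_os_versions(supported)
```

An empty result corresponds to the case described above in which no intersection exists and a support agreement has to be negotiated.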
          There currently exist two Fabric OS trees: the v1.x tree for the SilkWorm
      1000 series of switches and the v2.x tree for the SilkWorm 2000 series of
      switches. The naming convention for v2.x Fabric OS is formatted as
      dM.m.fp_t...t, with each variable replaced by the information specified in Table
      9.3. The Fabric OS version used for examples in this book is v2.4.1c. The
      software major version is 2, the minor version is 4, the maintenance version is 1,
      and the patch version is c. Many features and enhancements have been added since
      Fabric OS v2.0. Table 9.4 summarizes these feature and enhancement additions.
      The information in Table 9.4 can help you determine which Fabric OS is
      right for you, should you have the option to choose. The key is to establish
      which version of Fabric OS to run as part of the installation process.

      Table 9.3 How to Decode a Fabric OS Version

      Variable  Meaning               Format            Definition
      d         Deployment indicator  Lowercase letter  Indicates the deployment target for the
                                                        release. Does not indicate any functional
                                                        changes. Normally the letter “v”.
      M         Software major        Number            Indicates a release that incorporates
                version                                 significant functional changes to the
                                                        software, as compared to releases with a
                                                        lower software major version. Generally
                                                        follows architectural changes in the core
                                                        operating system or hardware.
      m         Software minor        Number            Indicates a release that incorporates
                version                                 added functionality within a major
                                                        software version.
      f         Software maintenance  Number            Indicates a maintenance release for a
                version                                 minor software version. Usually indicates
                                                        a release of bug fixes only. (Brocade
                                                        attempts to prevent functional changes
                                                        from occurring in software maintenance
                                                        versions.) Any functional changes that do
                                                        occur are clearly documented.
      p         Software patch        Letter            Indicates a release that incorporates a
                version                                 patch within a minor software maintenance
                                                        version; otherwise, functionally identical
                                                        to the maintenance version. Each patch
                                                        incorporates all preceding patches for the
                                                        same maintenance version; for example,
                                                        v2.0.2c would incorporate the patches
                                                        implemented in both v2.0.2a and v2.0.2b.
      t…t       Special type          Letter(s),        Indicates a special nonproduction build
                (nonproduction        possibly          (“N” in the following definitions is the
                release)              followed by       iteration of the build):
                                      number            • An Alpha release is “..._alphaN”
                                                          (abbreviated to “aN” in bug lists).
                                                          For example, “_alpha3” or “a3”.
                                                        • A Beta release is “..._betaN”
                                                          (abbreviated to “bN” in bug lists).
                                                          For example, “_beta1” or “b1”.
                                                        • A release candidate is “..._rcN”
                                                          (abbreviated to “rcN” in bug lists).
                                                          For example, “_rc2” or “rc2”.
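
The fields in Table 9.3 map naturally onto a regular expression. The following Python sketch decodes a version string under two assumptions that the table implies but does not state outright: the patch letter and the special-type suffix are both optional, and the special type is written in its long form with a leading underscore. The function and field names are hypothetical.

```python
import re

# d M . m . f [p] [_t...t]  (patch letter and special type assumed optional)
VERSION_RE = re.compile(
    r"(?P<deployment>[a-z])"
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<maintenance>\d+)"
    r"(?P<patch>[a-z])?"
    r"(?:_(?P<special>[a-z]+\d*))?"
)


def decode_fabric_os_version(text):
    """Split a version string such as v2.4.1c into the Table 9.3 fields."""
    match = VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"{text!r} does not look like a Fabric OS version")
    fields = match.groupdict()
    for key in ("major", "minor", "maintenance"):
        fields[key] = int(fields[key])
    return fields


decoded = decode_fabric_os_version("v2.4.1c")
```

Fields that are absent from the string, such as the special type in v2.4.1c, come back as None.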


Table 9.4 v2.x Fabric OS History with Feature and Enhancement Additions

Fabric OS
Version   New Features                           Enhancements

v2.2.0    • Fabric Watch                         • Fabric Shortest Path First (FSPF)
          • Extended Fabrics                       routing failover enhancements
          • FA Management Information            • WEB TOOLS enhancements
            Base (MIB)                             - Switch Status
          • FC-GS3 Management Server               - Dramatically better Fabric View
                                                 • Switch Beaconing
                                                 • Serial ID Gigabit Interface
                                                   Converters (GBICs)

v2.2.1                                           • E_Port Enable / Disable
                                                 • Simple Network Management Protocol
                                                   (SNMP) Access Control Lists (ACLs)
                                                 • Extended Fabrics
                                                   - Configurable on any switch
                                                   - License on portal switches

v2.2.2    • SilkWorm 6400 Support
            - Group Definition Phase I
            - ISL Topology Check / Monitor
            - SNMP Group Support
            - New GETS for groups
          • Fabric Watch trap enhancements
            - Traps now include thresholds
          • SCSI Enclosure Services (SES)
            enhancements
            - Config File Upload/Download
              via SES
            - Fabric OS Image Upload/
              Download via SES
            - SupportShow via SES

v2.3.0    • Fabric Access API v1.0               • Fabric Watch
            Switch Side                            - Alarm Enable/Disable
          • QuickLoop Fabric Assist Mode           - Threshold Reset to Defaults
            - Reengineering of QuickLoop         • WEB TOOLS
              from Hub Emulation to                - Faster to load and run
              Virtual Loops                        - Support for QLFA zoning
            - Industry-leading LIP isolation       - Support for ED5000 IOP Mode
            - Loop Hosts talk to Fabric            - Many enhancements
              Targets                            • Management Server
          • McData ED5000 switch                   - FC-GS-3 Platform Support
            interoperability with these          • FA MIB v2.2
            constraints                          • FC-GS3 Name Server
            - Limited to 31 switches in            - More GET calls
              fabric                               - More Register calls
            - WWN-based zoning, no
              hardware-enforced zoning
            - SilkWorm 1000 not supported
              in ED5000 mode
            - Zoning management limitations
            - All switches in fabric must
              run 2.3+ED5000 mode
            - No QuickLoop Fabric Assist
              Mode in ED5000 mode gets
            - No Alias Server or
              Management Server

v2.4.0    • New features for SilkWorm 6400
            - Fabric Manager 1.0
            - Group Definition Phase 2
            - Based on Management Server
              technology
            - Permits group management
              operations to be done on one
              switch rather than all
            - Group SupportShow


      Licenses
      All switches must have fabric capability if you want to interconnect them.
      Brocade WEB TOOLS, Fabric Watch, and Zoning licenses are also desirable, but
      not required to build a fabric. You will need a QuickLoop license if you intend
      to integrate private hosts into your fabric. The SilkWorm 20x0 and 22x0
      switches offer varied fabric capabilities:
           •   2010/2210 = No fabric license; loop switches
           •   2040/2240 = Entry-fabric license; minimal switch-to-switch
               connectivity (single ISL support)
           •   2050/2250 = Full-fabric license; unlimited switch-to-switch
               connectivity (multiple ISL support)
      You should check each switch to verify that you have the licenses necessary to
      build your SAN solution. Use the licenseShow command to determine which
      licenses are installed on your switch, as shown in Figure 9.8. Note that a single
      key can enable multiple features, in which case there is no one-to-one mapping
      between features and license keys. If you do not have the appropriate licenses,
      contact your switch supplier to acquire them. When acquiring a new license, you
      must supply the switch World-Wide Name (WWN), which is available from the
      output of the switchShow command, and the switch serial number, which is
      available from the switch chassis.

      Figure 9.8 Use licenseShow to Determine What Licenses Are Installed on
      Your Switch
      core1:admin> licenseShow
      SRzy9Sz9zeTS0zAG:
           Web license
      bbSz9eQb9zccT0AQ:
           Zoning license
      RdzdSRcSyzSe0eTn:
           QuickLoop license
      cSczRScd9RdTd0SY:
           Fabric license
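
      If you capture the licenseShow output in a script, you can check it against
      the licenses your design requires. The following Python sketch is one possible
      way to do that. It assumes the output has the shape shown in Figure 9.8 (a key
      line ending in a colon, followed by indented feature lines); the function names
      are illustrative and are not part of any Brocade tool.

```python
# Sample text copied from Figure 9.8.
SAMPLE_OUTPUT = """core1:admin> licenseShow
SRzy9Sz9zeTS0zAG:
     Web license
bbSz9eQb9zccT0AQ:
     Zoning license
RdzdSRcSyzSe0eTn:
     QuickLoop license
cSczRScd9RdTd0SY:
     Fabric license
"""


def parse_license_show(output):
    """Map each license key to the list of features it enables."""
    licenses = {}
    current_key = None
    for raw in output.splitlines():
        line = raw.strip()
        # Skip blank lines and prompt/echo lines such as "core1:admin> licenseShow".
        if not line or ">" in line:
            continue
        if line.endswith(":"):
            current_key = line[:-1]
            licenses[current_key] = []
        elif current_key is not None:
            # A single key can enable several features, so collect a list.
            licenses[current_key].append(line)
    return licenses


def missing_licenses(licenses, required):
    """Return the required feature names that no installed key provides."""
    installed = {feature for features in licenses.values() for feature in features}
    return sorted(set(required) - installed)
```

      For example, missing_licenses(parse_license_show(SAMPLE_OUTPUT),
      ["Fabric license", "Zoning license"]) returns an empty list for the switch
      shown in Figure 9.8.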

Automating Switch Administration Activities
If you have to perform a SAN administration activity more than once, consider
writing a script. You can use the Tcl/Tk-based Expect scripting language to
interface with the switch. In the future, you will also have the option to use the
Fabric OS Application Programming Interfaces (APIs) for automating switch
management functions. At the time of writing, the Fabric OS APIs are available
only to Brocade partners, with plans to make them available to all switch users
in the future. We discuss the following topics in this section:
     •   Fabric OS APIs
     •   Expect scripting
Using Expect to interface with the switch is not as powerful or effective a
solution as using the APIs. However, if you need to implement a solution now,
Expect is a good choice. Because of the power that the APIs deliver, further
discussion is warranted to assist you with your planning. As we discuss the subjects
in this chapter, we provide examples of how to automate related functions by
using Expect. You can freely download and use the scripts mentioned in this
book by accessing the book’s Web site (www.syngress.com/solutions). Several
examples of the types of switch management functions you might want to
automate follow later in this chapter:
     •   Download new firmware to all of the switches in your fabric.
     •   Reboot all of your switches at once or in a sequence.
     •   Automate zone changes.
     •   Facilitate troubleshooting.


Fabric OS APIs
The Fabric OS API is a programming interface that allows applications to access
fabric information and to perform control operations. Access to the switch
functions is based on IP access, either through out-of-band Ethernet or via IPFC
from a suitable HBA. Host-resident libraries and header files are required. Support
for Solaris, Windows 2000, and Hewlett-Packard HP-UX currently exists.
Application programs are compiled and linked to the library interfaces. The
library uses Remote Procedure Call (RPC) over a TCP/IP connection to a
switch to get information and perform control operations. A Perl interface is
planned. One benefit of using the APIs over scripts is that they reduce
complex tasks to simple commands; an operation that would require many lines
of scripting can potentially be a single call via the API. The API will be
rolled out for end-user support; however, the initial support is for third-party
management applications.
    The application provider typically distributes host libraries and headers. Target
users are SAN management application providers, with availability to all switch
users planned. The following companies provide SAN management applications:
     •   VERITAS (SANPoint Control)
     •   Computer Associates (Unicenter TNG)
     •   BMC Software (PATROL)
     •   Sun Microsystems (HighGround)
     •   Hewlett-Packard (OpenView Storage Area Manager)
     •   SANavigator (SANavigator)
     •   Prisa (Visual SAN)
     •   Micromuse (Netcool)
     •   IBM Tivoli (TSNM)
    The Fabric OS API is intended to provide the following operations:
     •   Discovery: Applications can quickly discover the fabric topology
         (switches, ports, and routes) and devices within the SAN.
     •   Zoning: Provides full access to Brocade Zoning management facilities. A
         transaction model with rollback manages multi-application access to
         safeguard against concurrent changes.
     •   Switch and port management: Provides application control of
         individual switches and ports. Applications have access to Switch, Ports, Port
         Statistics (PortStats), and Port Errors (PortErrors) objects for in-depth
         information about critical SAN components. For example, an application
         can obtain firmware versions from all switches in your company.
     •   Device management: Provides access to node and device objects that
         provide information about the end points within the SAN.
     •   Route management: Provides access to route control information to
         assist users in discovering and managing routes within the fabric.

Expect Scripting
Expect is a powerful tool for managing the switch in an automated fashion using
telnet commands; it not only automates applications such as telnet, ftp, passwd,
fsck, rlogin, and tip, but is also used for testing them. The Expect home page
(http://expect.nist.gov) is an excellent source of information on Expect, the
foundation software required by Expect (Tcl, Tk), and Expect applications.
Another Web resource for Expect is the Tcl Developer Xchange, found at
www.scriptics.com. Expect is available for a variety of UNIX, Microsoft, and
Macintosh environments.

A Switch Management Wrapper Using Expect
As mentioned earlier, the scripts discussed in this book are available on the book’s
Web site (www.syngress.com/solutions). Although these scripts are not coding
works of art, they are a good foundation on which to build utilities for your
switch management.
    A wrapper that allows you to execute a single command on a switch is
provided as an example (Figure 9.9). The name of the script is run_sw_cmd. The
script takes two arguments: the command you wish to execute and the name of
the switch you want to execute the command on. This wrapper enables you to
run a switch command in an automated fashion. The hard part of the script, and
the majority of its lines, are focused on establishing a “connection”
(lines 1 through 62). Once the connection is made, issuing a command to the
switch is easy and takes only two lines (lines 63 through 64). The script is
somewhat primitive, since you need to set the user and password information in
the script itself. However, nothing prevents you from modifying the script to
accept the user and password as arguments.
Because only one telnet session with the switch is permitted at a time, you
cannot run an Expect script against a switch with an active telnet session. If you
do, the Expect script will not be able to gain a connection.

Figure 9.9 An Expect Script Wrapper for the SilkWorm Switch
Usage:   run_sw_cmd <command> <switch ip name>
1   #!/usr/local/bin/expect
2
3   # Author:          Chris Beauchamp, Brocade Communications
4   # Date:            06/01/01
5
6   proc telnetLogin {user passwd prompt} {
7       # the "login:" pattern is the case where we connect with the switch
8       expect {
9           timeout {
10              puts "FAIL\nTelnet attempt for $user timed out\n"
11              return 1
12          }
13          eof {
14              puts "FAIL\nTelnet login prompt for $user never happened\n"
15              return 1
16          }
17          "login:"
18      }
19      send "$user\r"
20      expect "Password:"
21      send "$passwd\r"
22      expect $prompt
23      return 0
24  }
25
26  #
27  # main
28  #
29
30  # bail out if not enough args supplied
31  if {$argc != 2} {
32      puts "\nincorrect number of arguments supplied"
33      puts "\nusage: $argv0 <command> <switch>"
34      puts "\nexiting ..."
35      exit 1
36  }
37
38  set cmd [lindex $argv 0]
39  set switch [lindex $argv 1]
40
41  # change these values if you have different password or user requirements
42  set spasswd password
43  set suser admin
44  set sprompt "admin>"
45
46  set timeout 60
47
48  puts "telnetting to switch"
49  spawn telnet $switch
50  set sw_spid $spawn_id
51
52  # exit if it was not possible to connect to the switch
53  catch {telnetLogin $suser $spasswd $sprompt} code
54  if {$code != 0} {
55      puts "unable to access switch"
56      exit 1
57  }
58
59  puts "switching context to switch telnet"
60  set spawn_id $sw_spid
61
62  # send the command
63  send "$cmd\r"
64  expect $sprompt
65  puts "\n"
66  return 0

          You can also modify this script to read from a file of switches so that
      you can execute the script on multiple switches. Alternatively, as shown in Figure
      9.10, you can call the script from a UNIX shell script or Perl to obtain the
      Fabric OS version from all switches in the fabric.

      Figure 9.10 Integrating run_sw_cmd Expect Script with UNIX Shell Scripts
      sun1# foreach switch ( core1 core2 edge1 edge2 edge3 )
      ? echo $switch
      ? run_sw_cmd version $switch | grep "Fabric OS:"
      ? end
      core1
      Fabric OS:    a2.4.1c
      core2
      Fabric OS:    a2.4.1a
      edge1
      Fabric OS:    a2.4.1a
      edge2
      Fabric OS:    a2.4.1a
      edge3
      Fabric OS:    a2.4.1a
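
      The same loop can be driven from a file of switch names. The Python sketch
      below is one way to do it, assuming the run_sw_cmd wrapper from Figure 9.9 is
      on your PATH and prints the switch output to standard output. The helper
      names and the one-name-per-line file format are assumptions made for
      illustration only.

```python
import subprocess


def run_sw_cmd(command, switch):
    """Invoke the run_sw_cmd Expect wrapper (assumed to be on PATH)."""
    result = subprocess.run(
        ["run_sw_cmd", command, switch],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def read_switch_list(text):
    """Return switch names from text with one name per line.

    Assumed format: blank lines are ignored and '#' starts a comment.
    """
    names = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            names.append(name)
    return names


def fabric_os_versions(switches, runner=run_sw_cmd):
    """Map each switch to its 'Fabric OS:' version string, or None if absent."""
    versions = {}
    for switch in switches:
        versions[switch] = None
        for line in runner("version", switch).splitlines():
            if line.strip().startswith("Fabric OS:"):
                versions[switch] = line.split(":", 1)[1].strip()
                break
    return versions
```

      Passing the runner as a parameter lets you try the parsing logic with canned
      output before pointing it at live switches.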



      Brocade Zoning Considerations
      If you use switch-based zoning, you need to decide whether to use hard
      or soft Brocade Zoning, and how to manage your zones. A related topic
      that you also need to explore is where to zone. This section addresses these
      issues.
           Brocade Zoning, which is an optionally licensed product, enables you to
      logically group devices into virtual SANs. Zoning is used to set up barriers
      between different operating environments, to deploy logical fabric subsets by
      creating defined user groups, or to create test and/or maintenance areas that
      are separate within the fabric. Zoning is an all-or-nothing operation: once
      zoning is enabled, all devices must be defined in a zone; any device that is not
      will exist in a zone consisting of just that device and will be inaccessible to
      other devices in the fabric. In effect, this sets up an access-by-inclusion policy
      such that, by rule, a host or storage device is not permitted to participate in the
      fabric until it is positively included in at least one zone. With Brocade Zoning,
      you can define multiple zoning configurations. However, only one zoning
      configuration is active at one time. You can rapidly change zone configurations
      by issuing the cfgEnable <zone configuration> command.
           One use of this capability is to facilitate policy-based management. For
      example, a policy can be defined to give Windows NT hosts access to the tape
      library during the day for continuous backup, and then move access to the
      UNIX hosts at the end of the day. Alternatively, you might want to zone
      systems based on organizational structure.
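
      A time-of-day policy like the tape library example can be reduced to a small
      lookup that picks the configuration name and builds the cfgEnable command
      for it. The Python sketch below is illustrative only: the configuration names
      tape_to_nt and tape_to_unix and the 06:00 to 18:00 window are hypothetical,
      and the resulting command string would still have to be sent to the switch
      (for example, with the run_sw_cmd wrapper in Figure 9.9).

```python
from datetime import time

# Hypothetical policy table: (start, end, zoning configuration name).
POLICY = [
    (time(6, 0), time(18, 0), "tape_to_nt"),
]
# Hypothetical configuration used outside every listed window.
DEFAULT_CONFIG = "tape_to_unix"


def config_for(now, policy=POLICY, default=DEFAULT_CONFIG):
    """Return the configuration name that the policy selects at time `now`."""
    for start, end, name in policy:
        if start <= now < end:
            return name
    return default


def cfg_enable_command(name):
    """Build the switch command that enables the named configuration."""
    return 'cfgEnable "%s"' % name
```

      Keeping the policy in a table rather than in branching code makes it easy to
      review and to extend with further time windows.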

Where to Zone?
It is possible to zone at various points in the SAN, such as at the HBA or at the
host level, and you might even decide not to use switch-based zoning at all. You
might also want to use switch zoning in combination with other zoning
methods, such as using the HBA or storage controller to accomplish zoning, as
each might have a different level of granularity. Because zoning is a component
of security, a combination of zoning at different locations in the SAN can be
viewed as an additional level of security. Many customers feel that you can never
have enough security. To provide context, see Figure 9.11 for the various zoning
methods and where these methods can be employed.

Figure 9.11 Where Zoning Can Happen in the SAN
[Diagram: hosts (host zoning, HBA zoning) attach to SAN A and SAN B (switch
zoning), which attach to storage (storage zoning).]

     Much discussion surrounds the subject of where to zone. Major
characteristics of zoning solutions include the need or lack of need for host-resident
software, zone configuration control, the ability of zoning to ease SAN management,
the ability to zone at a Logical Unit Number (LUN) level, and security. If you
use HBA zoning or a host-resident zoning package, you need to install and
maintain this software on all hosts that are part of the fabric. If one host is not
running the host-resident software, your fabric is subject to illicit access or data
corruption, as the fabric is unprotected without the resident software installed.
Host-resident or HBA zoning software is also subject to configuration changes at
multiple points, making management a challenge. Storage-based zoning,
host-resident zoning, and HBA zoning normally are capable of LUN-level zoning,
which is a finer level of granularity than SAN switches can currently achieve.
     Because of the inherent risk in zoning at this upper application layer,
it is advisable to supplement this solution with switch-based zoning to
prevent a newly attached device from accessing storage until it is properly
configured. In this mode, the administrator configures zoning at both the host and
the switch. Doing so prevents inappropriate data access if the host is not
configured properly.
     As mentioned earlier, SilkWorm zoning today cannot zone to the LUN
level. To do LUN-level zoning, you will need to choose an additional zoning
method. If you have multiple storage and HBA providers, it might be necessary
to learn, manage, and implement multiple zoning applications. The remainder of
this section focuses on switch-based zoning. At a minimum, you will likely want
to use switch zoning for the following reasons:
     •   SilkWorm switches offer hard zoning, which is the most secure zoning
         available in your SAN.
     •   Switch zoning provides a single point of control: you need to manage
         only one zoning interface, as opposed to multiple HBA, storage, and host
         zoning interfaces.
     •   Switch zoning minimizes the impact devices have on each other by
         limiting broadcast frames and by limiting fabric activity such as
         Registered State Change Notification (RSCN) to only those zone members
         affected by the RSCN.
     •   Some SAN devices can support only a limited number of device
         connections. With zoning, you can limit the number of devices that exist
         in a zone to align with the edge device connection limits.

Hard Zoning or Soft Zoning?
Current Brocade SilkWorm switches support both hardware- and software-based
zoning. Because there is no setting to turn on one or the other, administrators are
often confused about which one is being used. The type of zoning you use
depends on how the zones are defined. If you use a device WWN or an Arbitrated
Loop Physical Address (AL_PA) to define a zone object, you are using soft
zoning. If you use a device physical port number, in the form (domain, port),
you are using hard zoning (Figure 9.12).
Figure 9.12 Hard and Soft Zone Examples
[Figure callouts: "Soft Zone" and "Hard Zone"]
core1:admin> cfgshow
Defined configuration:
 cfg:    hard     green; yellow
 cfg:    soft     red; blue
 zone:   blue     jbod1; jbod2; softhost2
 zone:   green    hardhost1; hardarray1
 zone:   red      softjbod1; softjbod2; softhost1
 zone:   yellow   hardhost2; hardarray2
 alias: hardarray1
                  2,0
 alias: hardarray2
                  3,9
 alias: hardhost1
                  0,8
 alias: hardhost2
                  1,1
 alias: softhost1
                  10:00:00:20:42:d9:78:31
 alias: softhost2
                  20:00:00:50:37:d2:75:50
 alias: softjbod1
                  21:00:00:20:37:d9:77:46
 alias: softjbod2
                  21:00:00:20:37:d9:77:47

Effective configuration:
 cfg:    hard
 zone:   green    0,8
                  2,0
 zone:   yellow   1,1
                  3,9
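
Because the zoning type follows from how members are written, a script can
report it for you. The Python sketch below classifies members by their textual
form, following the rule stated above: (domain, port) pairs are hard, WWNs are
soft. It is a rough aid, not Brocade's logic: AL_PA members are not handled and
are reported as "unknown", and a zone that is not uniformly one kind is
reported as "mixed" without any claim about how the switch enforces it.

```python
import re

# Eight colon-separated hex bytes, for example 21:00:00:20:37:d9:77:46.
WWN_RE = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}$")
# A domain,port pair, for example 2,0.
PORT_RE = re.compile(r"^\d+\s*,\s*\d+$")


def member_kind(member):
    """Classify one zone member as 'hard', 'soft', or 'unknown'."""
    member = member.strip()
    if PORT_RE.match(member):
        return "hard"
    if WWN_RE.match(member):
        return "soft"
    return "unknown"


def zone_kind(members, aliases=None):
    """Classify a zone after expanding alias names into their members.

    `aliases` maps an alias name to a list of members (illustrative structure).
    """
    aliases = aliases or {}
    resolved = []
    for member in members:
        resolved.extend(aliases.get(member, [member]))
    kinds = {member_kind(m) for m in resolved}
    if kinds == {"hard"}:
        return "hard"
    if kinds == {"soft"}:
        return "soft"
    return "mixed"
```

With the aliases from Figure 9.12, the green zone (hardhost1; hardarray1)
classifies as hard and the red zone classifies as soft.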

          The difference between hard and soft zoning is that hard zoning is enforced
      at both the Name Server and the Application-Specific Integrated Circuit (ASIC),
      whereas soft zoning is enforced only at the Name Server. With hard zoning, each
      ASIC maintains a list of source port IDs that have permission to access any of
      the ports on that ASIC, and the ASIC hardware itself blocks inappropriate frames
      from passing through it, dropping them if they attempt to talk outside their
      zones. Your choice of zoning also influences how you maintain and operate your
      SAN.
          When a device requests a list of nodes from the Name Server, this is analogous
      to calling a telephone directory service. When the Name Server responds to a
      request, it returns only the nodes that the requesting device is allowed to access
      based on zoning definitions. When you contact a telephone directory service,
      unlisted telephone numbers are not returned. However, if you know the unlisted
      party’s telephone number or randomly guess it, there is nothing to prevent you
      from calling that number. With hard zoning, even if the device is aware of and
      attempts to use an “unlisted” port ID, the hardware will prevent communication
      from happening. Some edge devices either cache port IDs or bypass the Name
      Server under certain circumstances and will attempt to communicate with
      another device even though that device is not in the Name Server. Normally, this
      type of behavior is inherent in the design of the device. For example, an initiator
      might not respond to an RSCN by design. An RSCN is normally sent to tell the
      initiator that the zones have changed and that some devices that were previously
      being accessed are no longer available. If a device does not respond to this
      RSCN, it will continue to access addresses that have been removed from its view
      of the Name Server. A malicious initiator might start scanning addresses to
      discover “live” ports. Hard zoning prevents initiators from accessing devices
      under such circumstances. If you are using soft zoning, these types of accesses
      are not prevented. Figure 9.13 shows the difference in security between hard
      zoning and soft zoning. Note that with hard zoning, you have protections at the
      Name Server and at the port, as depicted by the padlock icons.
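
      The telephone directory analogy can be expressed as a toy model. The Python
      sketch below is a conceptual illustration only and does not describe how
      Fabric OS or the ASIC is implemented; the class and device names are
      invented. Both modes filter what the Name Server returns, but only the hard
      mode also checks each frame.

```python
class ToyFabric:
    """Toy model of zoning enforcement (conceptual, not Fabric OS internals)."""

    def __init__(self, zones, hard):
        self.zones = [frozenset(zone) for zone in zones]
        self.hard = hard

    def name_server_query(self, requester):
        """Return the devices listed for the requester (the 'directory')."""
        visible = set()
        for zone in self.zones:
            if requester in zone:
                visible |= zone
        visible.discard(requester)
        return sorted(visible)

    def frame_delivered(self, source, destination):
        """Soft zoning never checks frames; hard zoning drops cross-zone frames."""
        if not self.hard:
            return True
        return any(source in zone and destination in zone for zone in self.zones)


zones = [{"host1", "array1"}, {"host2", "array2"}]
soft = ToyFabric(zones, hard=False)
hard = ToyFabric(zones, hard=True)
```

      In this model, host1 sees only array1 in either fabric, yet a frame that host1
      sends to the "unlisted" array2 is delivered by the soft fabric and dropped by
      the hard fabric.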

Figure 9.13 Hard Zoning Is More Secure than Soft Zoning
[Diagram, two panels, each showing the Simple Name Server. Panel 1: "With Hard
Zoning you have zoning enforcement at the hardware level and with the Name
Server." Panel 2: "With Soft Zoning you have zoning enforcement at the Name
Server only."]

      Hard Zoning and Soft Zoning Differences
      When you zone by WWN (soft zoning), you have the flexibility of physically
      moving that device anywhere within the fabric without redefining your zones,
      because the device WWN does not depend on the physical connection.
      With hard zoning, the zone definition is currently based on the physical location
      of the edge device. If you move that edge device, you need to modify your zone
      definitions, since the existing definition is no longer valid. When you replace a
      failed device with a new device, you will need to update your zone data with the
      new device WWN if you are using soft zoning. It is not necessary to modify a
      zone definition when replacing a hard-zoned device, since the device’s physical
      location is not changing.
          Hard zones are easier to implement, since you just need to know the switch
      domain and port number of the device you want to zone. When you use soft
      zoning, however, you need to obtain the device WWN, and it is harder to
      visualize the relationship between a zone definition and a physical device.
          When you use hard zones, it is easier to replicate the zoning environment,
      since the domain and port identifiers do not need to be changed. You might want
      to replicate a zone environment when you implement the second fabric of your
      dual-fabric solution. Replicating SAN environments using soft zones is not as
      easy, since re-entry of the unique WWNs associated with each SAN node is
      required. Because domain IDs are subject to change, hard zoning definitions
      might need to be redefined when a domain ID changes.
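
      If you keep your zone definitions in a script or a text file, a domain ID change
      becomes a mechanical rewrite of the (domain, port) members. The following
      Python sketch shows the idea; the function name is illustrative, and WWN
      members or alias names are passed through unchanged.

```python
def renumber_domain(members, old_domain, new_domain):
    """Rewrite 'domain,port' members from old_domain to new_domain."""
    updated = []
    for member in members:
        parts = [part.strip() for part in member.split(",")]
        is_port_member = len(parts) == 2 and all(part.isdigit() for part in parts)
        if is_port_member and int(parts[0]) == old_domain:
            updated.append("%d,%s" % (new_domain, parts[1]))
        else:
            # WWNs, alias names, and other domains are left as they are.
            updated.append(member)
    return updated
```

      For example, renumbering domain 2 to 5 turns the members 2,0 and 2,1 into
      5,0 and 5,1 while leaving 3,0 untouched.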

      Zone Management
      Zoning is a fabric-wide resource that can be administered from any switch in
      the fabric and is automatically distributed to every switch in the fabric. Zoning
      can be administered via telnet commands, WEB TOOLS, or the Fabric OS API
      on any switch in the fabric. You can use each of these zone management
      interfaces standalone or in combination with each other.
          The fabric provides maximum redundancy and reliability, since each switch
      stores the zoning information locally and can distribute it to any switch added to
      the fabric. For large zoning configurations or frequent zone changes, it might be
      desirable to automate these operations. Saving the zoning configuration to a
      text file for manipulation and maintenance might also be desired. While the
      zoning information is redundantly distributed throughout the fabric, you are
      encouraged to make at least one backup copy of your zoning configuration by
      using the command configUpload. The configUpload command saves not
      only the switch configuration, but also the zoning configuration information, to a
      file located on a specified host. Note that to enable a zone configuration with
      configDownload, you first need to disable the switch (use the command
      switchDisable).

Scripting Zoning Operations
You have the option to use scripting to automate certain zoning operations. For
example, you can create a script to automatically change a zoning configuration by
enabling a predefined zone configuration. You might want to make frequent zone
changes to virtually move a tape drive to different zones in a fabric as you perform
your backup. Scripting is also effective for changing zoning configurations based on
policy. For example, in a disaster recovery scenario, your policy might dictate
disabling noncritical access to the SAN so that production systems can take over the
resources used by noncritical systems. By automating the zone change process, you
speed up the zone changes and minimize the potential for human error.
    With multiple zoning configurations defined in your fabric, it is quite easy to
switch between configurations by issuing the cfgEnable <configuration>
command. If you need to change configurations frequently or based on policy, you
might consider writing a script to cfgEnable the appropriate configuration. The
script would be very similar to the script shown in Figure 9.9 (run_sw_cmd).
    Another way to leverage scripting for your zoning operations is to automate
your zone creation with a script. Such a script would also serve as a backup
of the zone configuration running in your SAN. You can modify this script to
add or delete zone objects. When you need to restore a zone or implement zone
changes, just execute the script. The script flow is as follows:
     1. Log in to the switch.
     2. Clear the existing zone objects.
     3. Create the zone objects.
     4. Enable the desired configuration.
    Figure 9.14 is a code fragment of an Expect script that can be used to create
or modify a zone configuration. This script is called make_zone. You need to
modify the zone entries within the script. The script is based on the script
run_sw_cmd (Figure 9.9) and is also available on the book’s Web site
(www.syngress.com/solutions). The syntax might appear a bit awkward or
confusing, since you need to “escape” the double quotes (") with a backslash (\) so
that the double quotes are passed to the switch and not interpreted by the Expect
script.

      Figure 9.14 A Zone Creation Expect Script
      Usage: make_zone <switch>


      Switch login code
      .
      .
      .


      # clear out the existing configuration
      send "cfgclear\r"
      expect $sprompt


      # create your zoning objects
      send "alicreate \"jbod1\",\"21:00:00:20:37:d9:77:46\"\r"
      expect $sprompt
      send "alicreate \"jbod2\",\"21:00:00:20:37:d9:77:47\"\r"
      expect $sprompt
      send "zonecreate \"red\",\"jbod1;jbod2;0,0;0,1\"\r"
      expect $sprompt
      send "zonecreate \"blue\",\"jbod1;jbod2;1,0;1,1\"\r"
      expect $sprompt
      send "zonecreate \"green\",\"2,0;2,1;3,0;3,1\"\r"
      expect $sprompt
      send "cfgcreate \"colors1\",\"red;green\"\r"
      expect $sprompt
      send "cfgcreate \"colors2\",\"blue;green\"\r"
      expect $sprompt


      # enable the desired configurations
      send "cfgenable \"colors1\"\r"
      expect $sprompt
      puts "\n"
      return 0
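
      One way to avoid hand-escaping every quote is to keep the zone data in a
      structure and generate the command strings from it. The Python sketch below
      follows the same flow as Figure 9.14 (clear, create, enable) and produces the
      same command syntax. The function names and the dictionary layout are
      assumptions for illustration; the generated strings still have to be sent to
      the switch, for example from an Expect session like the one in Figure 9.9.

```python
def quote(value):
    """Wrap a value in the double quotes the switch commands expect."""
    return '"%s"' % value


def zoning_commands(aliases, zones, configs, enable):
    """Return the ordered switch commands that rebuild a zoning setup.

    aliases, zones, and configs map a name to a list of members
    (illustrative layout); `enable` names the configuration to activate.
    """
    if enable not in configs:
        raise ValueError("configuration %r is not defined" % enable)
    commands = ["cfgclear"]  # clear out the existing configuration
    for verb, table in (("alicreate", aliases),
                        ("zonecreate", zones),
                        ("cfgcreate", configs)):
        for name, members in table.items():
            commands.append(
                "%s %s,%s" % (verb, quote(name), quote(";".join(members)))
            )
    commands.append("cfgenable %s" % quote(enable))
    return commands
```

      Because the zone data lives in ordinary dictionaries, the same structure can
      double as the backup copy of your zoning definitions.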

Zoning Tips
The Brocade Zoning manual extensively documents the use of zoning. The
following tips supplement the Brocade manuals and will guide you through your
zoning implementation:
    •   Minimal unwanted interactions: To minimize unwanted interactions
        between devices and to facilitate fault isolation, limit the number of
        HBAs/initiators in a zone to one. The exception is clustering
        applications where HBAs need to communicate with each other.
    •   Heterogeneous environments: To reduce challenges related to
        operating system interoperability, create zones that fence off different
        operating systems. If a shared target such as a tape drive is needed,
        an overlapping zone can be used while still protecting the different
        operating systems from each other.
    •   Aliases: Use aliases to define your zone members. If a zone member
        changes, you need only update the alias rather than potentially changing
        multiple zone definitions. Aliases also give meaningful names to devices,
        much in the same way an IP name gives a meaningful name to an IP
        address. Aliases can be used for either single devices or for a group of
        multiple devices. From a service perspective, aliases are the best method
        for getting a textual list of what is attached to what port.
    •   Addition of a new switch: To avoid zone conflicts and fabric
        segmentation when a new switch joins a fabric, clear and save the zone
        configuration on the new switch prior to that switch joining the fabric. Do
        this with the commands cfgClear, cfgDisable, and cfgSave. A new switch
        added to the fabric automatically inherits the active zoning configuration
        information in the fabric and immediately begins enforcement.
    •   Node and Port WWN: When a zone member is specified by Node
        Name, all ports on that device are in the zone. When a zone
        member is specified by Port Name, only that single device port is in the
        zone. A device has one Node WWN and one or more Port WWNs.
        For flexibility, consider using the Node WWN for your zoning entries if
        you must use soft zoning.
    •   Zone changes: When issuing the configDownload command to
        enable a given zoning configuration, insert the keyword “clear:” into
        the file immediately before the zoning lines. This ensures that the new
        zones take effect and that there is no segmentation.
    •   Backup: When you have finished your zoning implementation, make a
        backup of your zoning data by using the command configUpload.


      Validating Your Fabric
      Prior to transitioning your fabric to production, it is important to validate that the
      SAN you have implemented is ready. The time to identify and correct any
      problems is during the validation of your fabric, before the transition to
      production. Doing so involves establishing your SAN profile, which we discuss in
      Chapter 8, “SAN Troubleshooting.” After baselining your SAN profile, you need to
      inject faults into the fabric to verify that the fabric and the edge devices are
      capable of recovering. The next step involves generating an I/O load in the SAN
      that approximates various application I/O profiles. Finally, you will want to run an
      I/O load on your SAN while also doing fault injection to approximate a
      worst-case scenario: a failure in your SAN while it is in production. After
      completing the validation phase, you can then hand off the SAN to production.
          In the next section, we cover baselining your SAN profile, fault injection,
      running an I/O load, and using I/O generators.

      Baseline Your SAN Profile
      We discuss the SAN profile process in Chapter 8. You need a baseline of your
      SAN so that you can quickly determine if the testing you execute results in any
      discrepancies. To baseline a SAN, you need to bring up the SAN and all edge
      devices and verify that the fabric is stable and all devices are present and
      accounted for. Using the commands nsShow, nsAllShow, topologyShow, and
      switchShow, verify that the number of switches and devices you expect to be
      present are indeed visible to the fabric. If there is a discrepancy, refer to Chapter 8
      for troubleshooting guidance. Once you verify that the correct number of
      switches and devices are present in the fabric, update your SAN profile form. In
      addition to the SAN profile information identified for collection in Chapter 8,
      you also need to identify the ISL ports in your fabric. You can do this by
      reviewing the output of switchShow on each switch in the fabric. Figure 9.15
      provides a graphical depiction of a fabric with a SAN profile example.

Figure 9.15 SAN Profile Example (Profile)

        Total Name       Number
        Server Entries   Switches
        --------------   --------
        9                5

        core1
             name server entries:       2
             isl ports:                         0: 1: 2: 3: 4: 5:
             number of ISL ports:       6
        core2
             name server entries:       0
             isl ports:                         0: 1: 2: 3: 4: 5:
             number of ISL ports:       6
        edge1
             name server entries:       7
             isl ports:                         0: 1: 2: 3:
             number of ISL ports:       4
        edge2
             name server entries:       0
             isl ports:                         0: 1: 2: 3:
             number of ISL ports:       4
        edge3
             name server entries:       0
             isl ports:                         0: 1: 2: 3:
             number of ISL ports:       4




        [Diagram: fabric with core switches core1 and core2 and edge switches
        edge1, edge2, and edge3]

          The data for the SAN shown in Figure 9.15 was extracted via a shell
      script, get_san_profile, which in turn uses the Expect script run_sw_cmd. For
      large fabrics where you need to repeatedly capture a SAN profile, a script like
      get_san_profile is a real time saver. This shell script is available on the book’s
      Web site (www.syngress.com/solutions).
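
          Once a profile has been captured as text, it helps to turn it into a
      structure that a program can compare. The following Python sketch is not the
      book's get_san_profile script; it only illustrates one way to read the per-switch
      blocks in the layout shown in Figure 9.15. The function name
      parse_san_profile and the dictionary keys ns_entries and isl_ports are our
      own choices, and the sketch assumes that each switch name appears alone on
      a line, followed by its "name server entries" and "isl ports" lines.

```python
import re


def parse_san_profile(text):
    """Parse the per-switch blocks of a SAN profile into a dict keyed by switch name."""
    profile = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = re.match(r"name server entries:\s*(\d+)$", line)
        if match and current:
            profile[current]["ns_entries"] = int(match.group(1))
            continue
        match = re.match(r"isl ports:\s*(.*)$", line)
        if match and current:
            # Ports are listed as "0: 1: 2:"; keep the numbers only.
            ports = re.findall(r"(\d+):", match.group(1))
            profile[current]["isl_ports"] = [int(p) for p in ports]
            continue
        if line.startswith("number of ISL ports:"):
            continue  # redundant: it equals the length of the port list
        if re.fullmatch(r"[A-Za-z][\w.-]*", line):
            # A single word on its own line is assumed to be a switch name.
            current = line
            profile[current] = {"ns_entries": 0, "isl_ports": []}
    return profile


SAMPLE_PROFILE = """\
core1
     name server entries:       2
     isl ports:                 0: 1: 2: 3: 4: 5:
     number of ISL ports:       6
edge1
     name server entries:       7
     isl ports:                 0: 1: 2: 3:
     number of ISL ports:       4
"""

profile = parse_san_profile(SAMPLE_PROFILE)
total_entries = sum(sw["ns_entries"] for sw in profile.values())
print(len(profile), "switches,", total_entries, "name server entries")
```

      The summary lines at the top of a real profile (total name server entries
      and number of switches) are skipped by this parser because they can be
      recomputed from the per-switch data.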

      Fault Injection
      Fault injection is the process of creating scenarios in the SAN that mimic poten-
      tial faults. It is effective for uncovering marginal connections and malfunctioning
      devices. Fabric and edge devices should gracefully recover after a fault is injected.
      The process of fault injection and SAN verification is straightforward:
           1. Capture a SAN profile baseline.
           2. Inject a fault.
           3. Compare the SAN profile baseline to a current SAN profile.
           4. Check edge devices to verify that no devices have dropped off (are no
              longer visible to the hosts or switches).
           5. If there are any unexpected discrepancies, go to Chapter 8 for
              troubleshooting guidance.
          Fault injection should involve both the fabric and the edge devices. Power
      cycling and resetting are typical fault injection activities for edge devices. You can
      simulate an edge device going offline and online by doing a
      portDisable/portEnable for a particular edge port. For the fabric, you have
      several fault injection activities from which to choose:
           •   Reboot a switch or power cycle a switch.
           •   Disable and enable a switch (switchDisable/switchEnable).
           •   Disable and enable ISL ports (portDisable/portEnable).
          An exhaustive testing of the fabric and all edge devices is not usually war-
      ranted. However, spot-checking is useful prior to transitioning to production. For
      edge devices, select one or two representative hosts and storage devices for the
      edge device fault injection. Power cycle and/or reset these devices. After the
      device recovers from being power cycled or reset, check the other edge devices
      to verify that no devices, except the device being reset, are dropped. A dropped
      device is a device previously visible to the edge device that is no longer acces-
      sible: for example, a disk device that was visible via the UNIX format command
but is no longer visible via the format command after a fault injection is
considered dropped. A dropped device is considered an error that requires further
troubleshooting.
    For the fabric, select two or three switches for reboot, power cycle, and dis-
able/enable fault injection. If you are using a core/edge architecture, one of the
switches used for fault injection should be a core switch. For the ISL testing,
choose three to five ISLs spread across multiple switches to disable and then
enable. After each fault injection, capture a SAN profile. Compare this SAN pro-
file to the baseline. Also check the edge devices to see that no devices drop out. If
there are no discrepancies, the SAN passes the test.
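
    The comparison of a post-fault SAN profile against the baseline is easy to
automate. The Python sketch below assumes that each profile has already been
reduced to a dictionary that maps a switch name to its name server entry count
and ISL port list; that structure, the key names, and the function name
compare_profiles are illustrative assumptions, not part of any Brocade tool.

```python
def compare_profiles(baseline, current):
    """Return a list of discrepancies between a baseline and a current SAN profile."""
    problems = []
    for name in sorted(set(baseline) | set(current)):
        if name not in current:
            problems.append(f"{name}: switch missing from fabric")
        elif name not in baseline:
            problems.append(f"{name}: unexpected switch in fabric")
        else:
            for field in ("ns_entries", "isl_ports"):
                before = baseline[name][field]
                after = current[name][field]
                if before != after:
                    problems.append(f"{name}: {field} changed from {before} to {after}")
    return problems


baseline = {
    "core1": {"ns_entries": 2, "isl_ports": [0, 1, 2, 3, 4, 5]},
    "edge1": {"ns_entries": 7, "isl_ports": [0, 1, 2, 3]},
}
# Hypothetical result after a fault: one device dropped off edge1 and one ISL is down.
after_fault = {
    "core1": {"ns_entries": 2, "isl_ports": [0, 1, 2, 3, 4, 5]},
    "edge1": {"ns_entries": 6, "isl_ports": [0, 1, 2]},
}

for problem in compare_profiles(baseline, after_fault):
    print(problem)
```

    An empty list means the fabric side of the test passed. The check of the
edge devices themselves (for example, with the UNIX format command) still
has to be done on the hosts.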

Running an I/O Load
It is very important to establish a stable SAN before you test the SAN
with an I/O load. If the SAN is not stable prior to load testing, it becomes
difficult to establish a root cause if problems arise, since these problems can be
stability-related and/or load-related. Some problems only arise under load, such as
marginal links. When you do load testing in your SAN, you should run a variety
of load types, focusing on a load that is most similar to the type of I/O you
expect in your SAN. Once you are able to test the SAN with a variety of loads,
try doing so with fault injection. The level of fault injection during load testing
should be less intensive than the fault injection phase of testing. A suggested level
of fault injection testing to perform while the SAN is under load is as follows:
     •   Reboot a switch.
     •   Reboot/reset one storage device and one host (you can simulate this
         situation by using the portDisable/portEnable command).
     •   Disable and enable two or three ISLs that are each located on a
         different switch.
    This is the final test. If you can run I/O in your SAN while doing fault
injection and you are able to recover after the fault, it is time to move to
production. Some faults might cause some I/O not to recover. This can happen because
the host driver is unable to recover I/O under certain circumstances (for
example, tape I/O), or because timeout values on the edge devices or in the
SAN require tuning. Adjusting timeout settings in the SAN is a complex process
that involves edge device and switch settings. Timeout settings are normally
configured in the HBA or storage device. Refer to the HBA or storage device
      configuration documentation for the specifics of how to make these changes and
      what the impact is of doing so.

      Types of Load
      I/O can be classified by three characteristics: whether it is a read or a write,
      whether it is random or sequential, and its size. A second-order description of
      I/O is whether the I/O is bandwidth-intensive. If the I/O is bandwidth-intensive,
      it is more meaningful to measure this I/O by throughput (in MB/sec). If the I/O is
      not bandwidth-intensive, you normally measure it in I/Os per second (IOPS).
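
          Throughput and IOPS are two views of the same activity, linked by the
      I/O size: throughput equals IOPS multiplied by the I/O size. The short Python
      sketch below shows the conversion in both directions. The function names are
      our own, and the sketch assumes 1 MB = 1,024 KB.

```python
def throughput_mb_per_sec(iops, io_size_kb):
    """Throughput in MB/sec produced by a given IOPS rate at a given I/O size."""
    return iops * io_size_kb / 1024.0


def iops_for_throughput(mb_per_sec, io_size_kb):
    """IOPS needed to reach a given throughput at a given I/O size."""
    return mb_per_sec * 1024.0 / io_size_kb


# A small-block random workload moves little data even at a high I/O rate.
print(throughput_mb_per_sec(2000, 8))   # 2,000 IOPS at 8 KB
# A large-block sequential workload reaches 100 MB/sec at a modest I/O rate.
print(iops_for_throughput(100, 64))     # 64 KB I/Os
```

          This is why small random I/O is usually judged by IOPS and large
      sequential I/O by MB/sec: each workload is limited by a different resource.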
          Usually, I/O is a mix of reads and writes. However, some applications are very
      biased. For example, a video server I/O activity will normally be almost 100
      percent reads.
          I/O can further be classified as random or sequential. Some examples of
      random I/O are an e-mail server or an Online Transaction Processing (OLTP)
      server. Sequential I/O is characteristic of decision support (such as data ware-
      housing) or scientific modeling applications.
          The third characteristic of I/O is the size of the I/O. I/O sizes typically range
      from 2 KB to over 1 MB.
          Table 9.5 lists the application I/O profiles that establish the typical magnitude
      of application bandwidth consumption. For SAN design performance purposes,
      I/O is classified by bandwidth utilization: light, medium, and heavy. It is important
      to ultimately support test assumptions by gathering actual data when possible. You
      can gauge the type of I/O activity in your existing environment by using I/O
      measurement tools such as iostat (UNIX) or diskperf (Microsoft).

      Table 9.5 Application I/O Profiles

      Application              Bandwidth        Read /          Typical      Typical
                               Utilization      Write Mix       Access       I/O Size

      OLTP, e-commerce,        Light            80% read /      Random       8 KB
      e-mail, UNIX File                         20% write
      System (UFS),
      Common Internet
      File System (CIFS)

      OLTP (raw)               Light            80% read /      Random       2 KB–4 KB
                                                20% write

      Customer Relationship    Light            85% read /      Random       2 KB–4 KB
      Management (CRM)                          15% write

      Decision support,        Medium to        90% read /      Sequential   16 KB–128 KB
      high-performance         Heavy            10% write
      computing, seismic,
      imaging, technical
      computing

      Video server             Heavy            98% read /      Sequential   > 64 KB
                                                2% write

      SAN applications:        Heavy            Variable        Sequential   > 64 KB
      serverless backup,
      snapshots, third-
      party copy


I/O Generators
You need an I/O generator to place a load in the SAN. Use the application I/O
profiles outlined in Table 9.5 as a guide for providing input to your I/O
generator. You should run a light-bandwidth and a heavy-bandwidth I/O load in your
SAN for testing, tweaking one of these profiles to match your anticipated load
profile. An even better approach is to use the target applications for load testing.
However, doing so is often difficult or not possible. When deciding on which
tool to use for your testing, focus on the tool’s ability to do the following:
     •   Generate variable I/O sizes
     •   Generate sequential and random I/O
     •   Generate a mix of reads/writes
     •   Generate one or more process(es)/thread(s) per disk or LUN
    For Microsoft environments, Iometer is a popular and robust tool that meets
these requirements. Iometer is available from Intel (http://developer.intel.com/
design/servers/devtools/iometer) and the tool is free. For UNIX environments,
finding a tool for testing I/O is more challenging. There are public domain
tools available, such as IOzone (www.iozone.org). More flexible and powerful
tools are not in the public domain, and you need to obtain these tools directly
from storage and host suppliers. vxbench is a very powerful and flexible tool
available from VERITAS that can generate the loads outlined in Table 9.5. You
will need to contact a VERITAS representative to obtain a copy of vxbench.

         Tips for Generating an I/O Load
               •   Many I/O utilities do both reads and writes. Writes can
                   be destructive and can cause loss of data. Make sure that
                   the storage you are writing to does not contain data that
                   you need.
               •   For UNIX operating systems, use the raw devices to achieve
                   maximum bandwidth. If you use a “cooked” device (a device
                   with a file system), you will incur CPU overhead related to
                   the file system and anomalous results due to buffering.
               •   To achieve maximum bandwidth, use sequential I/O with one
                   or two threads/processes per device if you are using a JBOD.
                   When using a RAID device, you need to use multiple threads,
                   starting with two and doubling your thread/process count
                   until you observe a reduction in bandwidth.
               •   To achieve maximum IOPS, start with two threads/processes
                   per device and double your thread/process count until you
                   see a reduction in IOPS. The switch does not measure IOPS,
                   so you will need to use an external tool such as iostat (UNIX)
                   or diskperf (Microsoft) to establish your IOPS.
               •   To observe the performance for a particular switch, issue the
                   command portPerfShow from a telnet session on that switch.
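
           The "start with two and keep doubling" tip amounts to a simple search
      loop. The Python sketch below shows the logic only. The measure argument is a
      hypothetical callback that you would supply: it should run your I/O generator
      with the given number of threads and return the observed rate (MB/sec or
      IOPS, read from a tool such as portPerfShow or iostat). The max_threads cap
      is likewise our own safety assumption, not a value from the tips above.

```python
def find_peak_thread_count(measure, start=2, max_threads=256):
    """Double the thread count until the measured rate drops.

    Returns (thread_count, rate) for the best measurement seen.
    """
    best_threads = start
    best_rate = measure(start)
    threads = start * 2
    while threads <= max_threads:
        rate = measure(threads)
        if rate < best_rate:
            break  # a reduction was observed; the previous count was the peak
        if rate > best_rate:
            best_threads, best_rate = threads, rate
        threads *= 2
    return best_threads, best_rate


# Made-up measurements for illustration: the rate peaks at 8 threads.
fake_results = {2: 40.0, 4: 70.0, 8: 90.0, 16: 85.0, 32: 80.0}
peak = find_peak_thread_count(lambda n: fake_results[n])
print(peak)
```

           With a real measure function, allow each run to reach a steady state
      before you record the rate, or a short warm-up period can look like a reduction.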


           Many storage and host suppliers have internally developed I/O testing tools
      for UNIX and Microsoft environments that are available if you ask for them. A
      reliable standby in the UNIX environment is the tool dd, which is available on
      many UNIX operating systems. With dd, it is possible to perform variable size
      I/O and to generate one or more processes. However, generating random I/O or
      a mix of reads and writes is difficult. An example of running a heavy-bandwidth
      load using the UNIX utility dd and the Microsoft environment tool Iometer
      follows (Figures 9.16 and 9.17). For dd, you need only one process per
      disk to achieve maximal bandwidth.

Figure 9.16 Generating a Heavy-Bandwidth Load Using a Shell Script and
the UNIX Utility dd
#! /bin/csh -f
# Heavy-bandwidth load: sequential 64 KB I/O, one dd process per raw disk.

# 100% reads from 3 disks
dd if=/dev/rdsk/c1t0d0s2 of=/dev/null bs=64k count=2000 &
dd if=/dev/rdsk/c1t1d0s2 of=/dev/null bs=64k count=2000 &
dd if=/dev/rdsk/c1t2d0s2 of=/dev/null bs=64k count=2000 &

# 100% writes to 4 disks (destructive: overwrites data on the target slices)
dd if=/dev/zero of=/dev/rdsk/c1t3d0s0 bs=64k count=2000 &
dd if=/dev/zero of=/dev/rdsk/c1t4d0s0 bs=64k count=2000 &
dd if=/dev/zero of=/dev/rdsk/c1t5d0s0 bs=64k count=2000 &
dd if=/dev/zero of=/dev/rdsk/c1t6d0s0 bs=64k count=2000 &

# Wait for all background dd processes to finish
wait




Figure 9.17 Generating a Heavy-Bandwidth Load Using the Tool Iometer—All
Seven Disks Are Selected

          To generate maximum Fibre Channel bandwidth of 100 MB/sec you will
      need multiple disks, if using a JBOD, and possibly multiple storage arrays, if the
      array is not capable of sustaining 100 MB/sec (Figures 9.18 and 9.19). For
      example, if you run a single disk in a JBOD, you are never going to hit 100
      MB/sec—you will need to run a number of drives to saturate a link.
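
          A rough estimate of how many drives you need is the link bandwidth
      divided by the sustained rate of one drive, rounded up. The Python sketch
      below does that arithmetic. The per-drive rate of 15 MB/sec used in the
      example is purely an illustrative assumption; measure your own drives before
      relying on any figure.

```python
import math


def disks_to_saturate(link_mb_per_sec, per_disk_mb_per_sec):
    """Smallest number of disks whose combined rate reaches the link bandwidth."""
    if per_disk_mb_per_sec <= 0:
        raise ValueError("per-disk rate must be positive")
    return math.ceil(link_mb_per_sec / per_disk_mb_per_sec)


# Assumed sustained sequential rate for one JBOD drive (illustrative only).
print(disks_to_saturate(100, 15))
```

          The estimate ignores array controller limits and host overhead, so treat
      it as a starting point for the load test, not as a guarantee.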

      Figure 9.18 The Access Pattern Is 100 Percent Sequential Read with an I/O
      Size of 64 KB




      Figure 9.19 The Bandwidth Is 95+ MB/sec—Approaching Fibre Channel
      Maximum

SAN Maintenance
There are multiple maintenance functions that you will need to perform
throughout the life of your SAN. Some of these maintenance activities will be
planned, and others will happen unexpectedly. If your SAN is designed to be
resilient or redundant, unexpected maintenance should have minimal or no
impact on your SAN operations. Failed or malfunctioning devices are normally
the cause of unexpected maintenance. We provide a suggested process in this
section for each maintenance activity discussed, including verification procedures.
Other detailed SAN maintenance procedures are available from Brocade
reference manuals as well as whitepapers. The focus of this section is to present an
overview of the tasks and actions necessary to do SAN maintenance that is
integrated with the methodologies and process defined in this book:
     •   Maintaining a configuration log
     •   Backing up and restoring a switch configuration
     •   Bringing up a fabric
     •   Expanding a fabric: merging fabrics, adding a switch, or replacing
         a switch
     •   Upgrading your fabric
     •   Replacing or adding an edge device in the fabric


The Configuration Log: Key Information to
Gather and Maintain about Your SAN
A configuration log is an up-to-date compilation of information and
configuration details about your SAN. Enough information about your SAN should exist
in your configuration log so you can recreate your SAN based on it. Whenever
you make a change to your SAN, you should also update your configuration log.
The configuration log can exist in hardcopy, softcopy, or both. There are some
aspects of your configuration log that are not printable, such as firmware. If you
maintain your configuration log in softcopy and this softcopy is stored in your
SAN, you should also maintain a hardcopy or disaster backup, in case a disaster
makes your SAN inaccessible. Having a softcopy of your configuration log
enables rapid searches and easy updates of your SAN configuration data. You will
need to access your log for a variety of reasons:
           •   Disaster recovery
           •   Troubleshooting
           •   Recreating a switch whose configuration is destroyed
           •   Planning SAN additions (for example, replacing your core switches with
               large core switches)
           •   Modifying or expanding a SAN design
           •   Recovering accidentally deleted licenses
           •   Recovering or reconfiguring a zoning configuration
         The key to a successful configuration log is diligent updates. Without the
      updates, your configuration log is not very useful. A suggested structure for your
      configuration log resembles the following (Figure 9.20 shows a Microsoft
      Windows Explorer view of an online configuration log):
           1. Detailed diagrams of your SAN:
               A. Switch topology
               B. Host and storage connections
           2. Firmware log of all SAN devices:
               A. A table for all devices, listing the device and related firmware
               B. A directory structure containing an entry for each device’s firmware
           3. A log book where any additions, deletions, or modifications to your
              SAN are logged
           4. A directory structure for the switches:
               A. A copy of each switch’s configuration is maintained. Use the com-
                   mand configUpload to save a switch’s configuration to a host.
               B. supportShow information for each switch, captured after the SAN is
                   tested and verified.
           5. Your SAN profile
           6. A script directory for any scripts you create
           7. A zoning directory for zoning configurations
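
         If you keep the log in softcopy, a small script can lay out the directory
      skeleton consistently. The Python sketch below is one possible layout that
      follows the seven items above. Every directory name in it is our own
      invention, as is the function name create_config_log, and the example writes
      into a temporary directory so that it is safe to run anywhere.

```python
import tempfile
from pathlib import Path

# One top-level directory per item in the suggested structure (names are illustrative).
LOG_LAYOUT = ["diagrams", "firmware", "logbook", "switches", "san_profile", "scripts", "zoning"]


def create_config_log(root, switch_names):
    """Create the configuration log skeleton and return the directories created."""
    root = Path(root)
    for sub in LOG_LAYOUT:
        (root / sub).mkdir(parents=True, exist_ok=True)
    for name in switch_names:
        # Per-switch areas for configUpload/supportShow output and firmware images.
        (root / "switches" / name).mkdir(exist_ok=True)
        (root / "firmware" / name).mkdir(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    created = create_config_log(Path(tmp) / "san_config_log", ["core1", "edge1"])

for directory in created:
    print(directory)
```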

Figure 9.20 Explorer Screenshot of a Configuration Log




Backing Up and Restoring a Switch
Configuration
When you implement a new SAN, change your switch configuration, add a new
switch to your SAN, or replace a switch in your SAN, you should create a
backup of each switch’s configuration on a host. You do so with the command
configUpload, which generates an editable text file. You can then restore a
switch’s configuration with the command configDownload. The direction of
upload and download is relative to the switch, which can make it confusing
whether you are backing up a configuration or restoring one. To back up a
configuration, you upload to a host. To restore a configuration, you download
from a host.
    You can also create a standard configuration profile, suitable for configuring all
switches in your SAN, by stripping out the switch-specific data from the switch
configuration file. A switch configuration profile enables you to perform rapid
configuration of your switch or switches for initial implementation, additions, or
replacements. The alternative is a manual and time-consuming configuration of
fabric parameters, SNMP information, and Fabric Watch information. The switch
configuration file can also be used as a backup for your zoning configuration or
as a zoning reconfiguration tool. If you ever lose a switch’s license information,
you can recover this information from the configuration backup data. When
replacing a switch, you can reference the switch configuration backup for IP
address information.
          The configuration file is written as three sections. The first section contains
      the switch boot parameters (otherwise known as the switch’s identity) and is
      preceded by the heading [Boot Parameters]. It has variables such as the switch’s
      name and IP address. This section corresponds to the first few lines of output of
      the configShow command. The second section contains general switch
      configuration variables, such as diagnostic settings, fabric configuration settings, Fabric
      Watch settings, license key information, and SNMP settings. This section
      corresponds to the output of the configShow command (after the first few lines),
      although there are more lines uploaded than shown by the command. The second
      section is preceded by the heading [Configuration]. The third section contains
      the zoning configuration. It corresponds to the output of the cfgShow
      command and is preceded by the heading [Zoning].
          To create a standard switch configuration profile, strip out the [Boot
      Parameters] and [Zoning] headings and section data, leaving the
      [Configuration] heading and data. You might also want to strip out QuickLoop
      data from the configuration section if you are not running QuickLoop on every
      switch. To restore or load a configuration, it is necessary to disable the switch
      (switchDisable) before downloading the configuration information. To load a
      configuration, use the command configDownload, specifying the standard
      profile or backup configuration file as the configuration file.
          It is also possible to use the configuration file as a zoning backup and as a
      zoning reconfiguration tool. By stripping out the boot parameters and configura-
      tion information from the configuration file and then downloading the resulting
      zoning information, you can restore or change a zone configuration. If you are
      using the configuration file for zoning configuration changes, you need to insert
      the keyword clear: in the configuration file to clear out the existing SAN zone
      configuration and prevent zone conflicts.
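
          Stripping sections out of a configuration file by hand is error-prone, so
      you might prefer to script it. The Python sketch below keeps only the sections
      you name, based on the three headings described above. It assumes that each
      heading sits alone on its own line; the setting lines in the sample are
      placeholders rather than real switch variables, and the function name
      keep_sections is our own. The sketch does not insert the clear: keyword, so
      add that yourself when you prepare a zoning change.

```python
import re


def keep_sections(config_text, wanted):
    """Return only the named [Section] blocks of a configuration file."""
    kept = []
    keep = False
    for line in config_text.splitlines():
        heading = re.fullmatch(r"\[(.+)\]", line.strip())
        if heading:
            # A new heading decides whether the lines that follow are kept.
            keep = heading.group(1) in wanted
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Placeholder content; a real file comes from configUpload.
SAMPLE_CONFIG = """\
[Boot Parameters]
example.boot.name:switch-placeholder
[Configuration]
example.setting.a:1
example.setting.b:2
[Zoning]
example.zone.entry:placeholder
"""

standard_profile = keep_sections(SAMPLE_CONFIG, {"Configuration"})
zoning_only = keep_sections(SAMPLE_CONFIG, {"Zoning"})
print(standard_profile, end="")
```

          Always review the result before you download it to a switch, and keep
      the original backup file unchanged.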

      Bringing Up a Fabric
      There are several instances, such as power failure, initial bring up, or fabric-wide
      firmware upgrade, when you will need to bring up an entire fabric. The ideal
      order of bring up is as follows:
           1. Bring up the fabric.
           2. Bring up the storage devices.
           3. Bring up the hosts.
    This order stems from the fact that the host must have visibility to the
storage, especially during boot when devices are configured. For the host to have
visibility to the storage, the storage and SAN must be online. You can bring up
the storage first and then bring up the SAN. However, it is recommended that
you power up the SAN first. Unfortunately, this order is difficult to implement.
Powering off or disconnecting edge devices is frequently very time consuming or
very difficult to schedule. A more likely scenario involves bringing the SAN
down via a power cycle or a reboot of all switches in the fabric, and then
bringing the fabric back up while edge devices are powered on and connected.
An example of unplanned bring up is a power outage. When a power outage
occurs, the order in which hosts, storage, and switches come online is variable. To
bring up a fabric, use the following steps as a guideline:
     1. Bring up the switches. Either power on the switches or issue the
        command reboot to all of the switches in the fabric.
     2. Verify the fabric. Once the fabric is up, you need to verify that all
        edge devices and switches are present. Use the SAN profile to compare
        the previous baseline of your fabric switch count and device count to a
        current profile. If you see any discrepancies, follow the troubleshooting
        procedures detailed in Chapter 8. Use topologyShow to verify that all
        switches are online and use nsAllShow to verify that the correct
        number of devices are present in the fabric. Even if you are able to exe-
        cute the ideal order of bring up (fabric, storage, host), it is still necessary
        to compare the baseline SAN profile to the current SAN profile, since it
        is possible that all edge devices did not come back online. Note that this
        is becoming less of an issue, especially with newer edge devices.


Expanding a Fabric: Merging Fabrics, Adding
a Switch, or Replacing a Switch
Merging two fabrics, replacing a switch, and adding a switch to a fabric are sim-
ilar processes. It is important that the zoning configurations and fabric configura-
tion parameters are consistent between the new switch or fabric and the existing
fabric. Execute the following steps when adding a switch or switches to
the fabric:
     1. If necessary, update your SAN profile with the current state of the SAN.
     2. Resolve any zone conflicts.
           3. Resolve any switch configuration parameter conflicts and make any nec-
              essary switch-specific configuration changes such as port configuration
              changes, enabling QuickLoop, SNMP, Fabric Watch settings, and other
              configuration changes. If you have a standard switch configuration, you
              can download this configuration with the command configDownload.
           4. Resolve any domain ID conflicts or connect a disabled/powered-down
              switch.
           5. Verify that the new switch or switches are licensed consistently with the
              existing fabric licensing scheme.
           6. Check the new switch’s or switches’ Fabric OS version and, if possible,
              make the Fabric OS version consistent for the whole SAN.
           7. Verify that your SAN devices are minimally impacted by an RSCN. If
              your SAN devices have difficulty handling RSCNs or your applications
              are adversely impacted, consider stopping I/O on those devices.
           8. Connect the new switch or switches to the existing SAN.
           9. Enable or power up the new switches.
         10. Connect your edge devices.
         11. Capture a new SAN profile to verify that the correct number of edge
             devices and switches are present in the fabric. If there are any disparities,
             reference Chapter 8 for guidance on troubleshooting. Once the correct
             number of switches and edge devices are accounted for, create a baseline
             SAN profile for future reference.
         12. Back up the configuration for the added switch or switches with the
             command configUpload.
          To avoid zone conflicts, it is simplest to clear out the zone configurations for
      the new switch or switches by executing the commands cfgClear, cfgDisable,
      and cfgSave on the switch(es) being added. If you are merging multiple fabrics,
      select one of the fabrics as the active fabric; add the zone entries from the nonac-
      tive fabrics to the active fabric zoning configuration; and then clear out the non-
      active fabric switches’ zone information. Once you add the “blank” switches into
      the fabric, these blank switches will absorb the zoning configuration of the
      active fabric.
          Certain configuration parameters in the fabric must be the same. To review
      your switch configuration parameters, issue the command configShow. You
must resolve any conflicts in fabric configuration parameters before adding a new
switch to the fabric. For example, if there is a difference with the variable Error
Detect Timeout Value (E_D_TOV), it is necessary to either change this setting on
the new switches or the existing switches so that the value is consistent on all
switches that are going to be part of the same fabric.
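
     If you save the configShow output from the new switch and from a switch
in the existing fabric, a short script can flag the parameters that differ. The
Python sketch below assumes that each setting appears on its own line as a name
followed by a colon and a value, and that the timeout variables are named as
shown in the sample; check both assumptions against the output of your own
Fabric OS version. The function names and sample values are illustrative.

```python
def parse_params(text):
    """Parse 'name: value' lines into a dict, ignoring lines without a colon."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            params[key.strip()] = value.strip()
    return params


def param_conflicts(existing, new, keys):
    """Return {name: (existing_value, new_value)} for the listed names that differ."""
    return {k: (existing.get(k), new.get(k)) for k in keys if existing.get(k) != new.get(k)}


# Assumed variable names and made-up values, for illustration only.
EXISTING_FABRIC = "fabric.ops.E_D_TOV: 2000\nfabric.ops.R_A_TOV: 10000\n"
NEW_SWITCH = "fabric.ops.E_D_TOV: 5000\nfabric.ops.R_A_TOV: 10000\n"

must_match = ["fabric.ops.E_D_TOV", "fabric.ops.R_A_TOV"]
conflicts = param_conflicts(parse_params(EXISTING_FABRIC), parse_params(NEW_SWITCH), must_match)
print(conflicts)
```

     Only list the parameters that must be identical across the fabric in
must_match. Switch-specific values, such as the domain ID, are expected to
differ and should be left out of the comparison.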
     You can compare the fabric configuration of your new switch with that of the
existing fabric by examining the output from the command configShow. You
can download your standard switch settings from a configuration file to ensure
consistency of your switch configuration throughout the fabric. You can create and
restore switch configurations by using the commands configUpload and
configDownload. If a backup of your switch configuration is not current,
execute a configUpload to capture a current backup of your switch configuration
and license information. When merging fabrics or adding a new switch to a
fabric, you need to check for domain ID conflicts and resolve these conflicts by
changing one of the conflicting domain IDs. If you bring a disabled switch or
powered-down switch into a fabric, you do not need to resolve domain ID
conflicts, since the new switch will negotiate an acceptable domain ID.
     When a disabled or powered-down switch joins the fabric, if there is a
domain ID conflict, the added switch will negotiate a new domain ID. If domain
IDs change, verify your zone definitions to identify and correct any hard zones
affected by the domain ID change. Recall that a domain and a port number
define a hard zone. Also, some edge devices might have dependencies on a device
port ID, which is a function of the domain ID. If the domain ID changes, it
might be necessary to reboot your host or have your host rescan for devices.
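
     To see why a domain ID change matters to an edge device, recall that a
Fibre Channel fabric port ID is a 24-bit address whose top eight bits are the
domain, followed by eight bits of area and eight bits of port. The Python sketch
below splits an address into those three fields and shows what the address
becomes if the switch takes a different domain ID. The function names and the
example address are our own.

```python
def split_port_id(port_id):
    """Split a 24-bit Fibre Channel port ID into (domain, area, port)."""
    if not 0 <= port_id <= 0xFFFFFF:
        raise ValueError("port ID must fit in 24 bits")
    return (port_id >> 16) & 0xFF, (port_id >> 8) & 0xFF, port_id & 0xFF


def with_domain(port_id, new_domain):
    """Return the port ID the same device would have under a different domain ID."""
    if not 1 <= new_domain <= 239:
        raise ValueError("domain ID must be in the range 1 to 239")
    return (new_domain << 16) | (port_id & 0x00FFFF)


old_id = 0x011200                      # example address: domain 1
new_id = with_domain(old_id, 4)        # the switch renegotiates to domain 4
print(split_port_id(old_id), hex(new_id))
```

     A host that has bound a device to the old address will not find it at the
new one, which is why a reboot or rescan might be required.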


NOTE
     What is a domain ID? The Fibre Channel specification Fabric Generic
     Requirements (FC-FG), available from the Technical Committee T11 of the
     National Committee for Information Technology Standards (NCITS) at
     www.t11.org, defines the concept of a domain as “the highest or most
     significant hierarchical level in the three-level addressing hierarchy.” A
     SilkWorm switch is considered a domain. The domain number uniquely
     identifies the switch in a fabric. Within a fabric, a domain is identified by
     an address ranging from 1 to 239 (domain ID). The range of allowed
     values varies depending on the switch model and other system settings.
     SilkWorm switches automatically assign domain IDs as part of the switch
     initialization process.

          To maintain a full feature set across all switches in a single fabric, you should
      run the same version of Fabric OS on all switches in that fabric. Before adding a
      new switch, check the license information with the command licenseShow to
      verify that a consistent license set exists with the new switch and the existing
      fabric. When you add a new switch to a fabric, you should try to do so when
      I/O is quiescent. You can verify if any I/O is occurring in your fabric by issuing
      the command portPerfShow on each switch in your fabric. When you add a
      new switch or switches, there will be a pause in any active I/O as the fabric
      reconfigures and edge devices respond to RSCNs. If you successfully tested your
      fabric with fault injections while generating an I/O load, you will have an
      accurate idea about how your fabric and edge devices will respond to the new
      switch addition(s).

      Upgrading Your Fabric
      You can upgrade the firmware in your fabric with either a hot upgrade or a cold
      upgrade. We describe both processes in the following two sections. A hot upgrade
      (also called a rolling upgrade) requires that your edge devices be configured with
      redundant paths and software capable of managing path failover. With a hot upgrade,
      you reboot one switch at a time for the new firmware to take effect. With a cold
      upgrade, you reboot all of your switches at the same time. You would perform a
      hot upgrade if you were unable to take down an entire fabric. A cold upgrade
      should take your fabric down only for a few minutes as you reboot for the new
      firmware to take effect.

      Issues Applicable to Both
      Hot and Cold Upgrades
      The actual process of downloading firmware (firmwareDownload) does not
      require you to take the switch down. For the new firmware to take effect, you do
      need to reboot your switch. Wait until all switches are running new firmware
      before configuring any new software features or zoning parameters. It is
      recommended that all switches be upgraded to the same firmware level, to
      support all features in the current fabric; however, rolling upgrades are possible
      and supported. When performing a rolling upgrade, note that the new functionality
      might not be available on the switches until all switches are running the new
      version of Fabric OS.




NOTE
   To minimize reboot time, use the fastboot command after the
   firmware download. This skips the Power-On Self-Test (POST) and goes
   straight to loading code and bringing up the switch. A fastboot takes
   approximately 30 seconds, compared with approximately two minutes for
   a normal reboot that runs POST.




Performing a Hot Fabric Upgrade
   1. If necessary, update your SAN profile with the current state of the SAN.
   2. Make sure the switch has redundant paths for devices attached to it. If
      possible, force the I/O path on the devices to fail over to a neighboring
      switch using software provided on those devices.
   3. If manual failover was possible in step 2, verify that there is no
      traffic on the switch, using the portPerfShow telnet command.
   4. Download the new firmware (firmwareDownload) onto the switch.
   5. Reboot the switch for the firmware to take effect. Taking a switch
      down causes the fabric to reconfigure, and you will see a pause in
      any outstanding I/O while it does so. When the switch re-enters the
      fabric, you will see another pause in I/O as the fabric reconfigures
      again.
   6. Capture a new SAN profile to verify that the correct number of edge
      devices and switches are present in the fabric. If there are any disparities,
      reference Chapter 8 for guidance on troubleshooting.
   7. Re-enable the redundant paths for the attached devices.
   8. Repeat steps 2 through 7 until all switches in the fabric have
      been upgraded.
   9. Configure any new software features or zoning parameters.
   10. Capture a new SAN profile to verify that the correct number of edge
       devices and switches are present in the fabric. If there are any disparities,
       reference Chapter 8 for guidance on troubleshooting and resolving
       the discrepancies.
   11. Create a baseline SAN profile for future reference.
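
The ordering of the hot upgrade steps matters more than any individual command, so it can help to write the sequence down before starting. The following Python sketch only generates the ordered action list for a set of switches; it does not talk to any switch, and the function name and switch names are hypothetical.

```python
def plan_hot_upgrade(switches):
    """Return the ordered list of actions for a rolling (hot) upgrade."""
    plan = ["update SAN profile"]
    for sw in switches:
        # Steps 2 through 7 are repeated for one switch at a time.
        plan += [
            f"{sw}: fail device I/O paths over to a neighboring switch",
            f"{sw}: verify no traffic",
            f"{sw}: firmwareDownload",
            f"{sw}: reboot",
            f"{sw}: capture SAN profile and compare device/switch counts",
            f"{sw}: re-enable redundant paths",
        ]
    # New features and zoning changes wait until every switch is upgraded.
    plan += [
        "configure new software features or zoning parameters",
        "capture SAN profile and compare device/switch counts",
        "create baseline SAN profile",
    ]
    return plan


if __name__ == "__main__":
    for number, action in enumerate(plan_hot_upgrade(["sw1", "sw2"]), start=1):
        print(number, action)
```

Printing the plan as a checklist makes it easier to confirm that no switch is rebooted before its devices have failed over, and that feature configuration happens only after the last reboot.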


      Performing a Cold Fabric Upgrade
             1. If necessary, update your SAN profile with the current state of the SAN.
             2. Verify that there is no traffic in the fabric, using the portPerfShow
                telnet command on each switch.
             3. Download the new firmware (firmwareDownload) onto all switches
                in the fabric.
             4. Reboot all switches in the fabric.
             5. Capture a new SAN profile to verify that the correct number of edge
                devices and switches are present in the fabric. If there are any disparities,
                reference Chapter 8 for guidance on troubleshooting and resolving
                these discrepancies.
             6. Configure any new software features or zoning parameters.
             7. Capture a new SAN profile to verify that the correct number of edge
                devices and switches are present in the fabric. If there are any disparities,
                reference Chapter 8 for guidance on troubleshooting.
             8. Create a baseline SAN profile for future reference.


      How to Automate firmwareDownload
      Using the run_sw_cmd script as a base, you can automate the download of
      firmware to a switch. You can then call this script from a “for” loop to download
      to multiple switches. The excerpt in Figure 9.21 is from a script called
      fw_download, which is available on the book’s Web site
      (www.syngress.com/solutions).

      Figure 9.21 Excerpt from a Firmware Download Automation Expect Script
      # Uses the FTP version of firmwareDownload; note that this requires a
      # cleartext password.
      set cmd "firmwareDownload \"192.168.162.102\",\"root\",\"/book/v2.4.1c\",\"foobar\""
      # Send the command, then wait for the switch prompt
      # (sprompt is defined outside this excerpt).
      send "$cmd\r"
      expect $sprompt
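
The “for” loop mentioned above can be sketched in Python as follows. This sketch only builds the command string in the quoted, comma-separated form shown in Figure 9.21 and pairs it with each switch; it does not open a session or send anything. The function names and switch names are hypothetical, and the host, user, path, and password values are simply the ones from the figure.

```python
def firmware_download_cmd(ftp_host, user, path, password):
    """Build the firmwareDownload command line in the form shown in Figure 9.21."""
    args = ",".join(f'"{a}"' for a in (ftp_host, user, path, password))
    return f"firmwareDownload {args}"


def commands_for(switches, ftp_host, user, path, password):
    """Pair each switch name with the command that would be sent to it."""
    cmd = firmware_download_cmd(ftp_host, user, path, password)
    return [(sw, cmd) for sw in switches]


if __name__ == "__main__":
    # Switch names are hypothetical; the other values come from Figure 9.21.
    for sw, cmd in commands_for(["sw1", "sw2"], "192.168.162.102",
                                "root", "/book/v2.4.1c", "foobar"):
        print(f"{sw}: {cmd}")
```

In practice each pair would be handed to a session script such as fw_download; because the command carries a cleartext password, it is worth keeping any such list out of logs.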


Replacing or Adding an Edge
Device in the Fabric
When you add a new device to the fabric, you need to do some work ahead of
time. Ideally, you will connect the new device to the switch with the highest
concentration of devices that the new device is expected to access. Think carefully
before placing new devices on the core of a core/edge fabric, since taking up core
switch ports with edge devices limits your expansion capabilities. See Chapter 7
for more detail on device placement in the SAN. If you are using zoning, you
need to update your zoning configuration with information from the new
device. If you are doing AL_PA zoning or hard zoning and you replace a device
on the same port, you do not need to make any zoning changes. Remember that
an AL_PA is an arbitrated loop physical address: an 8-bit address used to
identify a private loop device (for example, ef). An AL_PA might also be the
last byte of a public loop 24-bit address (for example, 0102ef). AL_PAs are either
“soft,” meaning the address is dynamically assigned, or “hard,” meaning the
address is manually set. There might be some port configuration requirements,
such as QuickLoop, or a port might require “locking” to a G_Port (F_Port or
E_Port) or L_Port. By default, Brocade SilkWorm ports are auto-configuring:
the switch ports match the topology of the edge port, so “locking” a port
should not be necessary. Configuring a switch port for G_Port locks the switch
port topology to a point-to-point connection, and configuring for an L_Port
locks the switch port for a loop connection. If you are adding a new device,
establish whether any port configuration dependencies exist and make any
necessary changes. Adding a new device should have minimal impact on any active
I/O. Devices within the new device’s zone might experience a pause in I/O as
they respond to the RSCN that notifies the SAN of the change when the device
is added. The following steps detail how to add or replace a device in the
fabric:
     1. Choose a location for the device in the SAN when making an addition.
     2. If necessary, update your SAN profile with the current state of the SAN.
     3. Update your zoning configuration to reflect the new edge devices prior
        to connection. This is extremely important as most devices perform
        discovery during login, so the zones need to be in place.
           4. If necessary, make any port changes such as locking a port to specific
              topology or enabling QuickLoop.
           5. Connect your new device or replace the existing device. If you are
              replacing a server, be sure to remove the old server from the zone. If
              you are using software-based zoning, also reboot the old server or
              disconnect it from the SAN, because it might still be able to access
              storage subsystems that it has mounted or already knows about.
           6. Capture a new SAN profile to verify that the correct number of edge
              devices and switches are present in the fabric. If there are any disparities,
              reference Chapter 8 for guidance on troubleshooting. For example, if
              there are 42 devices in the SAN before you connect your F_Port device,
              expect 43 devices to exist after you connect the device.
           7. Create a baseline SAN profile for future reference.
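
The before-and-after check in step 6 can be sketched as a simple set comparison. The following Python example assumes that each SAN profile has already been reduced to a collection of device identifiers; how those identifiers are captured is outside this sketch, and the function name and sample identifiers are hypothetical.

```python
def compare_profiles(before, after, expected_added=0):
    """Compare device identifiers captured before and after a change."""
    before, after = set(before), set(after)
    return {
        "added": sorted(after - before),
        "missing": sorted(before - after),
        "count_ok": len(after) == len(before) + expected_added,
    }


if __name__ == "__main__":
    # Hypothetical profile: 42 devices before, one new device connected.
    before = {f"dev{i:02d}" for i in range(42)}
    after = before | {"new_host"}
    print(compare_profiles(before, after, expected_added=1))
```

Comparing identifiers rather than only counts also catches the case where one device appears and another silently disappears, which would leave the total unchanged.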



Summary
Several decisions and considerations regarding your SAN solution are necessary
prior to installation. Installation is the time to plan your cabling and to
implement a cable layout scheme that is manageable, flexible, and maintainable. An
effective cable management scheme not only eases maintenance, but is also
aesthetically pleasing. Key areas to pay attention to when cabling are ISL
cabling and taking up slack. When racking your switches, make sure to avoid
single points of failure. This means separating redundant fabrics into different
racks and powering resilient fabrics in such a way that a power failure does not
cause the fabric to fail. For some situations, it is not possible or practical to
dedicate an Ethernet connection to each switch. It is possible to manage Brocade
switches via direct Ethernet connections or via IPFC. Using IPFC currently does
have some disadvantages, such as a single point of management failure; however,
this is no different from management via the Ethernet ports. In that case, you
would have the management station and all the switches connected to an
Ethernet switch or hub. If the Ethernet cable goes bad, you cannot manage any
switch. Other areas that require thought prior to installation are a switch-naming
convention, the version of Fabric OS to install, and ensuring that the correct
licenses have been purchased. You should have a well thought-out switch-naming
convention to enable easy identification of a physical switch in case a problem
arises. Choose your Fabric OS version before you implement your SAN. The
Fabric OS selection process might involve several of your SAN device providers,
since Fabric OS support levels vary between SAN device vendors.
     If you have to do a SAN administration activity more than once, consider
writing a script. You can use the Tcl/Tk-based Expect scripting language to
interface with the switch. In the future, you will also have the option to use the
Fabric OS APIs for automating switch management functions. With scripting you
can automate many activities, such as downloading firmware, rebooting switches,
automating zone changes, and facilitating troubleshooting. A wrapper script that
enables the automation of SilkWorm switch commands is provided as an example
and as a tool to use in your SAN. You can use this script as is, or modify it to suit
your needs.
     If you use switch-based zoning, you need to determine whether you want to
use hard or soft Brocade Zoning and how to manage your zones. A related zoning
topic that you also need to explore is where to zone. Zoning enables you to
logically group devices into virtual SANs. Zoning is used to set up barriers between
different operating environments, to deploy logical fabric subsets by creating
defined user groups, or to create test areas or maintenance areas that are separate
within the fabric. Zoning is an all-or-nothing operation: once a zone is defined,
every device must be defined in a zone. A device that is not will exist in a zone
consisting of just that device and will be inaccessible to other devices in
the fabric. It is possible to zone at various points in the SAN, such as the HBA or
at the host level, and you might even decide not to use switch-based zoning at
all. You might also want to use switch zoning in combination with other zoning
methods, such as using the HBA or storage controller to accomplish zoning.
Zoning offers many advantages, such as a single point of management, hard
zoning, and the ability to create virtual SANs. Hard zoning is the most secure
zoning available because it is enforced at the Name Server and at the ASIC. There
are differences between hard zoning and soft zoning, such as flexibility in moving
zoned devices in the fabric and ease of implementation. A domain and a port
number define a hard zone, while a device WWN defines a soft zone. You can
script repetitive zoning operations. Scripting large zoning configurations might
also be easier than using other zoning management interfaces, such as
WEB TOOLS.
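
The maintenance difference between the two zone types can be sketched with a small data structure. In the following Python example a hard zone holds (domain, port) pairs and a soft zone holds WWN strings, as described above; the zone names, port numbers, and WWNs are all hypothetical, and the functions are an illustration rather than part of any Brocade tool.

```python
def zones_needing_update(zones, old_wwn):
    """Zones that reference the WWN of a device being replaced."""
    return sorted(name for name, members in zones.items() if old_wwn in members)


def zones_affected_by_domain_change(zones, old_domain):
    """Zones with (domain, port) members on a switch whose domain ID changed."""
    return sorted(
        name for name, members in zones.items()
        if any(isinstance(m, tuple) and m[0] == old_domain for m in members)
    )


# Hypothetical configuration: one hard zone and one soft zone.
ZONES = {
    "hard_zone": {(1, 4), (1, 7)},  # (domain, port) members
    "soft_zone": {"10:00:00:00:c9:aa:bb:01", "50:06:04:82:bc:01:9a:11"},
}

if __name__ == "__main__":
    # Replacing the device on the same port changes its WWN, not its port.
    print(zones_needing_update(ZONES, "10:00:00:00:c9:aa:bb:01"))
    # A domain ID change invalidates port-based members instead.
    print(zones_affected_by_domain_change(ZONES, 1))
```

The two queries show the trade-off: a device swap touches only WWN-based zones, while a domain ID change (for example, after a fabric merge) touches only port-based zones.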
           Prior to transitioning your fabric to production, it is important to validate
      that the SAN you have implemented is ready. The time to identify and correct
      any problems is during the validation of your fabric and prior to transitioning to
      production. Doing so involves establishing your SAN profile and injecting faults
      into the fabric to verify that the fabric and the edge devices are capable of
      recovering. Generating an I/O load in the SAN that approximates various application
      I/O profiles is also an important part of the SAN validation process. To
      approximate a worst-case scenario (a failure in your SAN while your SAN is in
      production), you will want to run an I/O load on your SAN while also doing fault
      injection. Suggested fault injections involve rebooting or power cycling your
      switches and edge devices. I/O can be classified in three ways: read or write,
      random or sequential, and I/O size. A second-order description of I/O is
      whether the I/O is bandwidth-intensive. When you do load testing in your SAN,
      you should run a variety of load types, focusing on a load that is most similar to
      the type of I/O you expect in your SAN. Once you are able to test the SAN
      with a variety of loads, try doing so with fault injection. The level of fault
      injection during load testing should be less intensive than in the fault injection
      phase of testing. There are several publicly available load generators, such as
      Iometer. You might also want to ask your storage or host provider for tools they
      use in their load testing.
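
The three-way I/O classification above lends itself to a simple test matrix. The following Python sketch enumerates every combination of operation, access pattern, and I/O size so that none is overlooked when planning load runs; the function name and the sample sizes are hypothetical, and the sketch generates no actual I/O.

```python
from itertools import product


def load_profiles(sizes_kb):
    """Enumerate every combination of operation, access pattern, and I/O size."""
    return [
        {"op": op, "pattern": pattern, "size_kb": size}
        for op, pattern, size in product(("read", "write"),
                                         ("random", "sequential"),
                                         sizes_kb)
    ]


if __name__ == "__main__":
    # Hypothetical sizes: a small transfer and a larger, bandwidth-heavy one.
    for profile in load_profiles([4, 64]):
        print(profile)
```

Each entry would then be mapped onto the settings of whichever load generator is in use, with the profile closest to the expected production I/O run most heavily.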


    There are multiple maintenance functions that you will need to perform
throughout the life of your SAN. A suggested process is provided for several
maintenance activities: creating a configuration log, backing up and restoring a
switch configuration, bringing up a fabric, expanding a fabric (merging fabrics,
adding a switch, or replacing a switch), upgrading your fabric, and replacing or
adding an edge device in the fabric. Other detailed SAN maintenance procedures
are available from Brocade reference manuals as well as whitepapers. A diligently
maintained configuration log can help you with disaster recovery,
troubleshooting, recreating a switch whose configuration is destroyed, SAN design
modifications or expansion, recovering accidentally deleted licenses, and
recovering a zoning configuration. When you implement a new SAN, change your
switch configuration, add a new switch to your SAN, or replace a switch in your
SAN, you should create a backup of each switch’s configuration on a host using
the command configUpload.

Solutions Fast Track
Installation Considerations
        Ensure that ISLs run in front of only the switches to which they are
        connected. This will allow the switches to be removed without
        downtime for the fabric.
        When racking your switches, be sure to avoid single points of failure.
        This means separating redundant fabrics into different racks and
        powering resilient fabrics in such a way that a power failure does not
        cause the fabric to fail.
        Think carefully before relying solely on in-band management of your
        switches. Consider using both in-band and out-of-band management.
        Have a well thought-out switch-naming convention to enable easy
        identification of a physical switch in case a problem arises.
        If you intend to implement a large fabric, work closely with your switch
        supplier to identify a version of Fabric OS that supports the size of SAN
        you intend to build.


      Automating Switch Administration Activities
              If you have to do a SAN administration activity more than once,
              consider automating the activity with a script.
              Use Expect for automating your SAN administration activities today and
              consider using Fabric OS APIs when they become available.
              Take the Expect script example (run_sw_cmd) and modify it for your
              SAN administration activities.
              If you use Expect scripting, you need the supporting software. See the
              following URL for Expect installation guidance: http://expect.nist.gov.


      Brocade Zoning Considerations
              Determine whether you want to use hard or soft zoning prior to
              implementing your zone scheme.
              Hard zoning is more secure than soft zoning, as hard zoning is enforced
              at the Name Server and at the hardware level and will actually block
              inappropriate access.
              There are differences between hard zoning and soft zoning from a
              maintenance perspective. For example, you need to update your zone
              information if you replace a device that is part of a soft zone.
              Consider using a script to build and maintain large zoning
              configurations. Scripts can also help you carry out disaster recovery
              policies that are implemented in zoning.


      Validating Your Fabric
              Baseline your fabric first so that you can quickly identify failures when
              you validate your SAN. Use the SAN profile as your baseline.
              You can automate many validation activities, such as taking your SAN
              profile and fault injection.
              The key to fault injection is to establish how your entire system behaves
              when a fault is encountered.
    Key fabric fault injections include: switch power cycle, ISL
    disable/enable, and switchDisable/switchEnable. Key edge device
    fault injections include: reset or power cycle the edge device. You can
    simulate this event by doing a portDisable/portEnable.
    Running a load in your SAN can shake out issues like marginal links.
    The final test is to run a load on your SAN while you do fault
    injections. If your SAN is able to handle this test, you are ready
    for production.
    Pick a minimum of two load types for your SAN I/O testing: one that
    approximates your SAN application load, and a load that is bandwidth-
    intensive.
    Ask your host supplier, HBA supplier, or storage supplier for the tools
    they use for I/O testing in a UNIX environment. If you want to use a
    Microsoft/Intel tool, Iometer is a good choice.


SAN Maintenance
    If possible, use the “cold” upgrade process for fabric upgrades. It takes
    only a few minutes of downtime.
    When adding a switch to the fabric, clear out the zone information on the new switch.
    It is simplest to power down or disable your switch prior to connecting
    that switch to an existing SAN. Doing so will avoid domain ID conflicts.
    A diligently maintained SAN configuration log can help you with
    disaster recovery, troubleshooting, recreating a switch whose configuration
    is destroyed, SAN design modification or expansion, recovering
    accidentally deleted licenses, and recovering a zoning configuration.
    Back up your switch configuration with the command configUpload
    whenever you add or replace a switch.
    Maintaining a baseline SAN profile is essential to many SAN
    maintenance activities. Make sure you know how to create and maintain a
    SAN profile.
    The act of loading firmware does not impact SAN operations. The
    process of activating it does impact SAN operations because a reboot
    is required.



      Frequently Asked Questions
      The following Frequently Asked Questions, answered by the authors of this book,
      are designed to both measure your understanding of the concepts presented in
      this chapter and to assist you with real-life implementation of these concepts. To
      have your questions about this chapter answered by the author, browse to
      www.syngress.com/solutions and click on the “Ask the Author” form.


        Q: Do I have to take my SAN down to perform a fabric upgrade?
        A: No. If you have your SAN configured such that all edge devices have dual
            paths, it is possible to perform a “hot” upgrade.

        Q: When I merged my fabrics, several disks were no longer accessible from the
            hosts. What happened?
        A: If you were using hard zoning, it is possible that the domain ID of one of the
            switches you merged changed. This change in domain ID might have
            invalidated some of your zones.

        Q: Are rogue hosts a real threat to soft zoning?
        A: Hosts that intentionally bypass the Name Server are as likely a threat to
            security as a device that caches a Name Server entry or does not respond
            to RSCNs.

        Q: Why do I have to use a complex scripting language to manage my SAN?
        A: You do not have to write any scripts to manage your SAN. WEB TOOLS
            and other commercially available SAN management software are available
            to perform a variety of SAN management tasks. Scripting is for users who
            prefer the flexibility and power that scripting enables for their SAN
            management tasks.

        Q: When will the Brocade APIs be available for end-user use?
        A: Check with your switch supplier. The current target date is the
            end of 2001.
                                   Appendix

Building SANs with
Brocade Fabric
Switches
Fast Track



 This Appendix will provide you with a quick,
 yet comprehensive, review of the most
 important concepts covered in this book.




410   Appendix • Building SANs with Brocade Fabric Switches Fast Track


      ❖ Chapter 1: Introduction to SANs

      Overview of SANs
           SAN technology evolved from direct-attach interconnects like Small
           Computer Systems Interface (SCSI).
           Fibre Channel supports SCSI, Internet Protocol (IP), and the Fibre Channel
           Virtual Interface (FC-VI) Protocol.
           The distance between Fibre Channel nodes can be as much as 10 km.
           Fibre Channel supports copper, multimode optical, and single-mode
           optical media.
           SAN technology has moved from Fibre Channel Arbitrated Loop to full
           Fibre Channel switch fabric.


      Taming the Storage Monster
           Data storage needs are increasing rapidly.
           Requirements due to databases, e-mail, multimedia, and the Internet have
           dramatically increased the required amount of storage for data.
           Disk farms, storage arrays, and storage consolidation are the keys to solving
           the storage problem.


      Benefits of Building a SAN
           Fibre Channel is ideal for supporting high-availability configurations and
           business-critical back-end operations, due to the ability to set up redundant
           networks and clusters.
           SAN technology allows for storage consolidation and data pooling for more
           efficient use of storage resources.
           Backup windows are shrinking, and backup traffic on the LAN can be easily
           reduced by using a SAN to reduce network congestion due to backup.
           Block-level, high-speed access through SCSI-Fibre Channel Protocol (FCP)
           can accelerate data access between storage and hosts, and can free up host
           resources that would be occupied serving files and data through IP.
    Cluster protocol access through FC-VI frees up CPU cycles in hosts and
    enables clustered database operations.
    One of the major advantages of SAN technology is its long-distance
    capability for disaster tolerance.


When to Deploy a SAN
    The most important part of determining whether to deploy a SAN is to
    focus on the actual business application that will be served with the
    SAN deployment.
    Speed and bandwidth requirements determine if the technology is right for
    the application. Compared with other technologies, such as IP-based file
    sharing and Network Attached Storage (NAS), the Fibre Channel protocol
    provides for more usable bandwidth and faster data transfer.
    A SAN is ideal for block-level access to shared storage.
    Fibre Channel works well for centralized access to storage arrays, redundant
    connections, clustered configurations, and disaster tolerance.


Steps to a Successful SAN Deployment
    Data collection: Evaluate the goals of the deployment to determine
    options in achieving high availability, redundancy, fault tolerance, data
    consolidation, cost reduction, and so forth.
    Data analysis: Investigate the hardware and software options that support
    those goals.
    Architecture development: Design and install a SAN testbed to set up
    configuration and components. Select the software and hardware carefully to
    avoid any interoperability problems.
    Testing the prototype: Test the configuration for interoperability,
    functionality, error handling, and fault tolerance.
    Transition existing hardware in a controlled release to production:
    Stage the deployment by rolling out the setup gradually, making changes on a
    limited basis to minimize risk.


      ❖ Chapter 2: Fibre Channel Basics

      The Architecture of SANs
           A Fibre Channel SAN provides the advantages of increased speed, reliability,
           and scalability.
           Fibre Channel presently transmits at 1.0625 Gbit/sec over single-mode and
           multimode optical cabling and over copper cabling.
           A SAN implemented using the Fibre Channel protocol incorporates the
           benefits of a channeled connection and a network.
           A SAN is constructed from three primary types of elements: initiating
           devices, switches, and target devices.
           A target device is a storage device on a SAN. Device enclosures like tapes,
           JBODs, or RAIDs are the most common type of target device.
           An initiating device is a device that actively seeks out and interacts with
           target devices on the SAN.
           Switches create the foundation of the Fibre Channel SAN. A group of
           interconnected switches is called a fabric.


      Fibre Channel Protocol Basics
           Fibre Channel is primarily used to transport the SCSI and IP protocols.
           Devices are identified by an 8-bit Arbitrated Loop Physical Address
           (AL_PA) in an arbitrated loop topology, and a 24-bit address for switched
           fabric connections.
           Frames start with a primitive Start Of Frame (SOF) and end with an End
           Of Frame (EOF) primitive.
           There are five Fibre Channel layers, designated FC-0 through FC-4.
           The FC-0 layer is the physical media layer and includes the media selection
           and connectors.
           The FC-1 layer is the signal encoding and decoding layer. The FC-1 layer
           uses 8b/10b encoding.
           The FC-2 layer is the Fibre Channel protocol layer.
    The FC-3 layer is the Fibre Channel common services layer. The services
    are servers in a Fibre Channel fabric that manage connections between
    devices connected remotely through the switched fabric.
    The FC-4 layer is the Fibre Channel ULP mappings layer.
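
As a worked example of what 8b/10b encoding at the FC-1 layer means for throughput, the following Python sketch computes the data rate that remains once every 8 data bits are carried as 10 bits on the line. It assumes a line rate of 1062.5 Mbaud (1.0625 Gbaud) and ignores framing and protocol overhead; the function name is hypothetical.

```python
def payload_rate_mbit(line_rate_mbaud):
    """Data rate left after 8b/10b encoding (8 data bits per 10 line bits)."""
    return line_rate_mbaud * 8 / 10


if __name__ == "__main__":
    line_rate = 1062.5  # assumed line rate in Mbaud
    data_mbit = payload_rate_mbit(line_rate)
    # Divide by 8 bits per byte to express the result in MB/sec.
    print(data_mbit, "Mbit/sec =", data_mbit / 8, "MB/sec before frame overhead")
```

Under these assumptions the encoded link carries 850 Mbit/sec of data, or 106.25 MB/sec, which is why a nominally one-gigabit link is commonly described as roughly 100 MB/sec.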


Classes of Service
    Classes of service specify what mechanisms are required for transmission of
    different types of data.
    Class 1—Acknowledged connection-oriented service.
    Class 2—Acknowledged connectionless service.
    Class 3—Unacknowledged connectionless service.
    Class 4—Fractional bandwidth connection-oriented service.
    Class F—Used for inter-switch communication.


Storage Network Topologies
    There are three primary types of topologies in Fibre Channel:
    point-to-point, arbitrated loop, and switched fabric.
    The primary use of the point-to-point topology is to connect devices
    directly to switches or other bridge devices.
    The arbitrated loop topology allows up to 127 devices in a ring formation
    to share the bandwidth of a single line without a switch.
    Fabrics allow thousands of devices to be interconnected.
    Switches have three types of ports. FL_Ports are fabric loop ports that attach
    arbitrated loops to the fabric. F_Ports are fabric ports that connect single
    devices to the fabric in a point-to-point topology. E_Ports connect a switch
    to another switch.
    Fabric-attached devices have a three-part address. The first segment indicates
    the physical switch, the second part indicates the physical port, and the last
    part is the arbitrated loop address in a loop device or 0x00 for a fabric device.
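
The three-part address described above can be taken apart with a few bit operations. The following Python sketch splits a 24-bit address into its switch, port, and loop-address bytes, using the public loop address 0102ef from Chapter 9 as the example; the function name is hypothetical.

```python
def split_fc_address(address):
    """Split a 24-bit fabric address into (switch, port, al_pa) bytes."""
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("fabric addresses are 24 bits wide")
    switch = (address >> 16) & 0xFF  # first part: the physical switch
    port = (address >> 8) & 0xFF     # second part: the physical port
    al_pa = address & 0xFF           # last part: loop address, or 0x00
    return switch, port, al_pa


if __name__ == "__main__":
    switch, port, al_pa = split_fc_address(0x0102EF)
    print(f"switch={switch:#04x} port={port:#04x} al_pa={al_pa:#04x}")
```

A last byte of 0x00 indicates a fabric-attached (non-loop) device, so the same function can be used to tell the two kinds of device apart.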



      Fabric Services
           Switches exchange information in their servers so that all individual switch
           servers contain the same information. This creates distributed servers.
           The fabric port is used to log a device into the fabric. The response frame
           from login assigns the device its 24-bit address.
           The Name Server is used as a database to register and store information
           about all devices on the fabric.
           The Fabric Controller at well-known address 0xFFFFFD provides state
           change notification service to registered nodes. State change notification is a
           service that notifies devices when a change in fabric topology occurs.
           The Management Server provides information about the fabric without
           stipulation as to zone.


      ❖ Chapter 3: SAN Components and Equipment

      Overview of Fibre Channel Equipment
           Understanding the features of your Fibre Channel equipment is key when
           building a robust infrastructure.
           A Fibre Channel network is comprised of cabling, GBICs, hubs, switches,
           HBAs, and routers.
           Fibre Channel shares much of the same terminology as Ethernet
           networking, but the functionality of similarly named equipment is not
           necessarily identical.


      Cabling and GBICs
           Copper cabling is almost always terminated with either an HSSDC or DB-9
           male connector.
           Multimode optical fiber is terminated using a variety of optical connectors,
           including SC, LC, and MT-RJ.
           Single-mode fiber is the most expensive media type, but preferable for
           long distances.
    Single-mode fiber, because of its small diameter (9 µm), has the highest
    transmission speed potential.
    Copper cabling is available in two types: active and passive. Active copper
    lines provide twice the distance of passive copper lines.
    The HSSDC connector was specifically designed as a Gigabit copper con-
    nector, improving density and performance over the DB-9 style connector.
    GBICs are removable transceivers used in all types of Fibre Channel devices,
    including switches, hubs, and HBAs.
    GBICs offer the option of interfacing with almost all types of connectors.
    A Media Interface Adapter (MIA) converts DB-9 copper connectors to
    optical SC connectors.


Using Hubs
    Hubs provide a very basic means of connecting the ports in a
    network together.
    Hubs can connect up to 127 devices together in an FC-AL loop.
    Simple hubs contain no intelligence, just electrical connections.
    Managed hubs provide a level of error tolerance and management features.
    Managed hubs provide LIP isolation, automatic port bypass, signal retiming,
    and management interfaces.
    Fibre Channel LIPs can be a major source of problems in arbitrated loop
    configurations.
    To avoid the problems that earlier-generation loop architectures introduced,
    most people are moving to switched fabric devices instead.


Using Switches and Fibre Channel Fabrics
    Switches are classified into three categories: entry-level, scalable fabric, and
    core fabric switches.
    Entry-level switches are focused on small workgroups of 8 to 16 ports, usu-
    ally are geared toward low cost, and deliver limited scalability and manage-
    ment. Fabric switches provide the capability to cascade switches together to
    create larger fabrics.
           A core fabric switch is designed for interconnecting multiple edge switches
           to form multihundred-port SANs.
           HBAs are used to connect servers to the network. They map SCSI commands
           in the operating system to Fibre Channel frames on the network. HBAs range
           from low-end, loop-only devices to high-end, fabric multipathing adapters.
           Major protocols supported by HBAs are SCSI-FCP for storage, IPFC for
           networking, and VIFC for clustering.
           HBAs support either 1 Gbit/sec or 2 Gbit/sec speeds, with current-generation
           cards supporting 1 Gbit/sec and emerging cards supporting both.
           HBAs can be found in single one-port configurations or dual-port adapters
           for higher density.
           LUN masking enables control of access to devices in the network from
           the HBA.
           Persistent binding is the mapping of a Fibre Channel device into an oper-
           ating system at a specific device location.
           Dynamic discovery is the capability to dynamically add and remove drives
           from your system without reboot.
           HBA API support is an important feature that allows management of your
           HBA by SAN management software.
           Remote booting is the use of an HBA to boot an operating system image
           across the SAN and is used to dynamically change hosts and enable ease of
           disaster recovery.


      Connecting Legacy Devices into Your SAN
           A Fibre Channel router, which is also known as a bridge, allows legacy
           parallel SCSI devices to attach to your Fibre Channel SAN.
           A Fibre Channel router plugs into Fibre Channel on one side and a SCSI
           bus on the other.
           Frames are translated from SCSI-FCP to parallel SCSI bus signals
           through routers.
    Routers provide many different features, including different numbers
    of SCSI buses and different support for parallel SCSI protocols
    and termination.
    Advanced features include selective LUN presentation, extended copy sup-
    port, and various management interfaces.
    Selective LUN presentation is the capability of a router to mask the pres-
    ence of devices to different hosts in the network and allow for better secu-
    rity and control over resources.
    Extended copy support (third-party copy) allows software to directly back
    up data on the SAN, saving CPU and network traffic.
    Available management interfaces include telnet, SNMP, Ethernet, and
    serial ports.


Bridging and Routing to IP Networks and Beyond
    Fibre Channel-to-DWDM technology multiplexes Fibre Channel signals
    onto higher bandwidth fiber for transmission over MAN distances (up to
    100 km).
    Use of DWDM is transparent to Fibre Channel switches, except for
    buffer settings.
    It is necessary to increase buffer credit settings to handle the long distances/
    delays involved in MANs.
    Fibre Channel can also be transported across other network types, such as
    ATM and Gigabit Ethernet.
    FC_IP (not to be confused with IPFC) encapsulates Fibre Channel frames
    in the IP protocol and can be used for remote backup and extending
    SAN distances.
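
    The need for more buffer credits over long links can be estimated with
    simple arithmetic. The sketch below is a rule of thumb, not a Brocade
    formula: it assumes roughly 5 microseconds of one-way delay per kilometer
    of fiber, a full-size frame of 2148 bytes, and a payload rate of about
    100 MB/sec for a 1 Gbit/sec link. All names and constants are assumptions.

```python
import math

FIBER_DELAY_US_PER_KM = 5.0   # assumed one-way propagation delay in optical fiber
FULL_FRAME_BYTES = 2148       # assumed size of a full Fibre Channel frame on the wire


def credits_needed(distance_km: float, mbytes_per_sec: float = 100.0) -> int:
    """Estimate the credits needed to keep a link of this length busy."""
    # A credit is returned only after the frame arrives and the
    # acknowledgment travels back, so the round trip is what matters.
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    # 1 MB/sec equals 1 byte per microsecond.
    frame_time_us = FULL_FRAME_BYTES / mbytes_per_sec
    return max(1, math.ceil(round_trip_us / frame_time_us))


for km in (10, 50, 100):
    print(km, "km:", credits_needed(km), "credits")
```

    Under these assumptions a 100 km link at 1 Gbit/sec needs about 47 credits,
    which is why default buffer settings are not sufficient at MAN distances.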

Fibre Channel Storage
    Fibre Channel storage is important as the core of the data storage on
    your network.
    Fibre Channel storage ranges from simple JBOD devices to multiterabyte
    storage arrays.
           A JBOD is a cabinet of independent disks, all connected into the Fibre
           Channel network in a loop.
           Hosts individually address disks in a JBOD.
           RAID arrays provide additional protection and performance to your storage.
           Different RAID levels are appropriate for different applications.
           High-end storage arrays add support for multiple terabytes of data. Other
           types of connections include parallel SCSI, ESCON, and FICON.
           High-end arrays also generally include a large amount of cache, which is
           used to speed up data access.
           Selective LUN presentation is the ability of high-end storage to control
           access by hosts to data and to ensure data integrity.
           LUN export across multiple ports is used for redundancy and high avail-
           ability, but requires dynamic multipathing software or drivers to work.
           Snapshot backup volumes are used to enable backup on live databases and
           data images.


      ❖ Chapter 4: Overview of Brocade
           SilkWorm Switches and Features
      Selecting the Right Switch
           Identify your requirements for availability, port density, functionality,
           and cost.
           Decide whether you need an arbitrated loop or full-fabric environment.
           Learn which switch functions best satisfy your requirements.
           Consider what strategic direction you want to take, and whether your
           current switches will scale easily to meet your needs.


      Understanding the Brocade Fabric OS
           Fabric OS is the operating system for all Brocade SilkWorm switches.
           Key functions include auto-discovery, in-order frame delivery, zoning,
           and others.
    Fabric OS provides the capability to work with other storage
    management applications.


Using Optional Brocade Features
    You can use Brocade Zoning to isolate devices into separate, virtual SANs.
    Zoning is ideal for multiple customer environments where data security
    is critical.
    Extended Fabrics enables the benefits of Fibre Channel technology at dis-
    tances up to 100 km.
    Fabric Watch tracks switch and fabric events to help you optimize fabric-
    wide performance and proactively identify problems before they happen.
    QuickLoop integrates private loop-based devices into switched fabric
    environments.
    QuickLoop helps support legacy devices to protect existing investments
    while also providing performance and reliability advantages.
    WEB TOOLS is an advanced monitoring tool that sends alerts about fabric
    events to help prevent downtime.
    You can use a Web browser interface and Java plug-in to monitor switched-
    fabric SANs from any workstation in your enterprise.


Future Capabilities in the Brocade
Intelligent Fabric Services Architecture
    The Brocade Intelligent Fabric Services Architecture includes the SilkWorm
    family of fabric switches, advanced fabric services, open fabric management
    tools, and enterprise-class security products.
    ISL Trunking is an optional software product ideal for optimizing perfor-
    mance of Brocade 2 Gbit/sec Fibre Channel fabric switches.
    Frame Filtering enables a variety of new capabilities for monitoring and
    managing SAN fabrics and enhancing both security and reliability.
    Secure Fabric OS is the most comprehensive SAN security architecture
    available, addressing vulnerabilities in the SAN fabric and supporting
    multiple authentication methods.

      ❖ Chapter 5: The SAN Design Process

      Looking at the Overall Lifecycle of a SAN
           The SAN design process is a cycle.
           This process consists of seven phases:
           1. Data Collection
           2. Data Analysis
           3. Architecture Development
           4. Prototype and Test
           5. Transition
           6. Release to Production
           7. Maintenance

           Whenever there is a fundamental change to the SAN, the cycle
           should repeat.


      Conducting Data Collection
           Data collection is the foundation on which a SAN is built.
           You should interview everybody who has an interest in the project.
           During the interview process, create a technical requirements document.


      Analyzing the Collected Data
           There are several things that you need to get out of data analysis:
           —The number of different fabrics that will make up the SAN solution
           —The port count and performance characteristics of each fabric
           —An estimate of the hardware required to meet these requirements

           You might be able to localize traffic for better performance if you can create
           well-defined groups.
           Prepare an ROI proposition to justify your SAN project.

❖ Chapter 6: SAN Applications
   and Configurations
Configuring a High-Availability Cluster
   HA clusters are used for redundant, fail-safe installations of mission-critical
   business applications.
   Clustering provides availability, manageability, and scalability.
   Availability is the capability of a cluster to tolerate hardware, network, or
   software errors.
   The most common use of clustering is two servers configured to share
   storage through Fibre Channel.
   Redundant HBAs and switches should be used to provide fault tolerance.
   The use of dynamic multipathing software, drivers, or HBAs can provide
   higher levels of availability to your cluster.


Using a SAN for Storage Consolidation
   Storage consolidation enables administrators to centralize storage resources.
   Consolidation provides more efficient use of storage, enhances manage-
   ability, and improves accessibility.
   Almost any layout of a storage network can be used for storage consolidation.
   Consolidation requires paying attention to how operating systems treat
   shared volumes.
   In order to properly partition data in a consolidation environment, you need
   to use fabric zoning, LUN masking on storage or the host, or software to
   control permissions.
   It is generally best to use fabric zoning even when also using another access
   control product to achieve a more effective security model, and to provide
   a “broadcast container,” which can increase the scalability and reliability of
   a SAN.
   An example of a typical storage consolidation setup is a shared SAN used to
   provide data storage for a Web farm, where many servers read the same disks
   to present data.
           Storage LUN masking is used to ensure that only specific hosts are allowed
           access to specific logical units of a storage array. The advantage of storage
           LUN masking is that the storage guarantees which host is allowed access to
           any volume.
           HBA LUN masking is also used to limit what storage a host can see,
           and requires that every host in the network participate in the same
           masking scheme.
           Software partitioning provides another type of control over LUN presenta-
           tion, but it generally requires upper-level software and demands that every
           host in the network be loaded with that software.
           Switch zoning, available in Brocade switches, provides a convenient way to
           allocate storage to hosts, and to consolidate different departments into a
           single company network.
           Switch zoning does not currently support control at the LUN level, only at
           the port and WWN levels. Upcoming products will add this capability. For
           now, other access control techniques might need to be used in addition to
           switch zoning to provide access control at the LUN level.
           Storage LUN masking provides another way to control access to volumes in
           a shared SAN.
           High-end storage arrays provide the capability to specify the port or node
           WWN of a host HBA, and specify which volumes in the array will respond
           to requests.
           By using storage LUN masking, you can ensure that only hosts with permis-
           sion can read or write from a specified volume.
           Storage LUN masking requires the participation of the storage only to
           enforce permissions.
           HBAs provide access control to volumes through LUN masking.
           LUN masking controls which volumes an operating system can see through
           a particular HBA.
           HBA LUN masking requires the participation of all of the hosts in the
           network to avoid contention for storage resources.
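
           Storage LUN masking can be pictured as a table, kept by the array,
           of which host WWNs may reach which volumes. The sketch below is a
           toy model under that assumption; the class name, the WWNs, and the
           LUN numbers are hypothetical and do not reflect any vendor's
           interface.

```python
class LunMaskingTable:
    """Toy model of storage-side LUN masking keyed by host WWN."""

    def __init__(self) -> None:
        self._allowed = {}  # host WWN (lowercase) -> set of permitted LUNs

    def grant(self, host_wwn: str, lun: int) -> None:
        """Allow the host with this WWN to access the given LUN."""
        self._allowed.setdefault(host_wwn.lower(), set()).add(lun)

    def is_allowed(self, host_wwn: str, lun: int) -> bool:
        """The storage answers only requests that pass this check."""
        return lun in self._allowed.get(host_wwn.lower(), set())


table = LunMaskingTable()
table.grant("10:00:00:00:C9:12:34:56", 0)   # hypothetical HBA port WWN

print(table.is_allowed("10:00:00:00:c9:12:34:56", 0))   # granted host
print(table.is_allowed("10:00:00:00:c9:aa:bb:cc", 0))   # unknown host
```

           Because the check lives in the storage, only the array has to
           participate; a host that ignores the scheme still cannot reach a
           volume it was not granted.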

LAN-Free Backup Configuration
    Traditional backup systems used SCSI direct-attached tape storage. The
    LAN-based client-server backup model, although an improvement, cannot
    keep pace with the ever-increasing amounts of data moving through the LAN.
    LAN-free backups using storage networks solve LAN-based problems by
    offloading traffic from the LAN and increasing bandwidth.

SAN Server-Free Backup
    Server-free backup is the use of a SAN to remove backup traffic from
    a LAN.
    Backup is done directly on the SAN for each device, rather than each host
    being involved in data transfer.
    Third-party copy provides an even more efficient way to transfer data to
    tape, freeing a backup server from needing to directly access disks and copy
    data to tape.


Making Your Enterprise Disaster Tolerant
    Fibre Channel SANs are ideal for mirroring and accessing data across
    large distances.
    It is now possible to separate critical systems many miles apart.
    Brocade switches provide extended credits on ISLs to enable high perfor-
    mance and reliable long-distance operation.


❖ Chapter 7: Developing a SAN Architecture

Identifying Fabric Topologies and SAN Architectures
    A fabric consists of one or more interconnected Fibre Channel switches.
    A SAN includes one or more related fabrics and everything attached
    to them.
    In a resilient core/edge fabric topology, two or more switches act as a core
    to interconnect multiple edge switches. This is the best “general-use”
    topology available, especially when combined with the dual-fabric approach
    to SAN architecture.
           In order to select the right topology, you must first decide the requirements
           for your SAN architecture. This includes redundancy and scalability in
           addition to port count.
           In general, the cascade, ring, full mesh, and partial mesh are best used in
           architectures where the individual fabrics that comprise the SAN will not
           change much. This could be true in a static, low-growth environment, or in
           a “SAN islands” design.
           The resilient core/edge topology is the best choice for general use and for
           situations where SAN requirements are either unknown or might change
            frequently.
           The resilient core/edge topology can be combined with dual fabrics to
           achieve maximum performance, reliability, and scalability.


      Working with the Core/Edge Topology
           The core/edge topology offers a number of key advantages over other
           topologies. Core/edge fabrics are:
           —Easy to scale without downtime.
           —Capable of scaling to a large number of ports.
           —Flexible in terms of their cost-to-performance ratios. (You can add
             switches to the core with a clear knowledge of how doing so will affect
             both cost and performance.)
           —Easy to understand, manage, and performance-tune.
           —Well-tested and reliable.

           Several core/edge fabrics can be used as “cookie-cutter fabrics” when design
           information is incomplete or might change frequently.
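
           The claim that core/edge fabrics scale predictably can be checked
           with a short calculation. This sketch assumes that every switch has
           the same port count, that core switches carry only ISLs, and that
           each edge switch runs the same number of ISLs to every core. The
           function and parameter names are hypothetical.

```python
def core_edge_capacity(ports_per_switch: int, cores: int, isls_per_core: int) -> dict:
    """Size a core/edge fabric built from identical switches."""
    uplinks_per_edge = cores * isls_per_core
    if uplinks_per_edge >= ports_per_switch:
        raise ValueError("no ports left on the edge switches for devices")
    # Each core port terminates one ISL, so a core can serve this many edges.
    edge_switches = ports_per_switch // isls_per_core
    device_ports_per_edge = ports_per_switch - uplinks_per_edge
    return {
        "edge_switches": edge_switches,
        "device_ports": edge_switches * device_ports_per_edge,
    }


# Hypothetical fabric: 16-port switches, two cores, one ISL from each edge to each core.
print(core_edge_capacity(16, 2, 1))
# Doubling the ISLs trades device ports for inter-switch bandwidth.
print(core_edge_capacity(16, 2, 2))
```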


      Determining Levels of Availability
           There are four levels of availability that a SAN architecture might employ.
           The dual-fabric, resilient approach is the most reliable and the most
           frequently recommended.
    In most cases, this approach is not more expensive to implement than the
    other three approaches, and it might be less expensive in some cases.
    This approach allows for the failure of anything up to and including an
    entire fabric without application downtime.


Configuring Traffic Patterns
    Tiered fabrics allow simplified management and storage resource planning,
    but are the worst-case scenario from the standpoint of locality.
    Locality is the most effective approach to performance tuning in a SAN, but
    it is frequently unattainable.
    You should view locality as a “moving target,” since network complexity
    increases over time. However, it is worth getting as much locality as is prac-
    tical into a SAN, since all SANs benefit in several ways from this technique.


Evaluating Performance Considerations
    Over-subscription is never a bad thing in and of itself. It is only when over-
    subscription becomes congestion that problems might arise.
    Latency is almost never a driving consideration in real-world SAN perfor-
    mance, since fabric latency is at least one order of magnitude lower than
    typical disk subsystem latency. Exceptions to this rule include clustering soft-
    ware and some highly performance-sensitive applications.
    In almost all cases, considerations outside the fabric will limit performance,
    such as CPU speed of hosts or the I/O profile of an application.
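
    The difference between over-subscription and congestion can be shown with
    a small sketch. It assumes that over-subscription is expressed as device
    ports per ISL at equal link speed and that congestion begins when the
    traffic actually offered exceeds ISL capacity. The demand figures and the
    100 MB/sec link capacity are hypothetical.

```python
def isl_report(device_demand_mb, isl_count: int, isl_capacity_mb: float = 100.0) -> dict:
    """Compare the over-subscription ratio with the actual offered load."""
    if isl_count < 1:
        raise ValueError("at least one ISL is required")
    offered = sum(device_demand_mb)
    capacity = isl_count * isl_capacity_mb
    return {
        "oversubscription": len(device_demand_mb) / isl_count,  # device ports per ISL
        "offered_mb": offered,
        "capacity_mb": capacity,
        "congested": offered > capacity,  # the only case that causes problems
    }


# Seven devices share one ISL (7:1), yet the link is not congested.
print(isl_report([10] * 7, isl_count=1))
# Same ratio, heavier hosts: now the ISL is congested.
print(isl_report([20] * 7, isl_count=1))
```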


❖ Chapter 8: SAN Troubleshooting

The Troubleshooting Approach:
The SAN Is a Virtual Cable
    Use the SAN’s visibility to both storage and hosts to start your trouble-
    shooting process.
           The switchShow, nsShow, nsAllShow, errShow, and topologyShow
           commands are highly informative and invaluable to the troubleshooting
           process.
           The UNIX format command or HBA vendor-supplied utilities are also
           helpful in troubleshooting.
           When you start the troubleshooting process, determine if the issue is fabric
           related or device related. A fabric-related issue impacts many devices, and a
           device issue impacts only a few devices.


      Troubleshooting the Fabric
           A fabric issue impacts many devices. A logical switch outage, such as seg-
           mentation or physical switch outage, can cause many devices to drop out of
           the fabric. Problems with ISL initialization are also considered fabric issues.
           The quickest way to narrow your search of a fabric problem is to compare
           your baseline SAN profile to your current SAN profile and investigate
           discrepancies.
           A SAN profile includes the number of devices per switch (nsShow),
           number of devices in the fabric (nsAllShow), and number of switches in
           the fabric (topologyShow). The errShow and switchShow commands are
           also helpful in tracking down fabric issues.
           Some fabric issues are caused by a mismatch in fabric service timeout vari-
           ables and the edge device timeout settings. Careful analysis of both the
           fabric and the edge devices is necessary to resolve this complex issue.
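
           Comparing a baseline SAN profile to the current one can be
           automated. The sketch below assumes the counts have already been
           gathered from nsShow, nsAllShow, and topologyShow by hand or by
           script; it does not parse command output. The dictionary layout
           and switch names are hypothetical.

```python
def profile_discrepancies(baseline: dict, current: dict) -> list:
    """List the differences between two SAN profiles."""
    issues = []
    for key in ("devices_in_fabric", "switches_in_fabric"):
        if baseline[key] != current[key]:
            issues.append(f"{key}: baseline {baseline[key]}, now {current[key]}")
    # Check every switch that appears in either profile.
    switches = sorted(set(baseline["devices_per_switch"]) | set(current["devices_per_switch"]))
    for switch in switches:
        before = baseline["devices_per_switch"].get(switch)
        after = current["devices_per_switch"].get(switch)
        if before != after:
            issues.append(f"devices on {switch}: baseline {before}, now {after}")
    return issues


baseline = {"devices_in_fabric": 12, "switches_in_fabric": 2,
            "devices_per_switch": {"sw1": 6, "sw2": 6}}
current = {"devices_in_fabric": 10, "switches_in_fabric": 2,
           "devices_per_switch": {"sw1": 6, "sw2": 4}}

for issue in profile_discrepancies(baseline, current):
    print(issue)
```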


      Troubleshooting Devices that Cannot Be Seen
           The first thing to check is that the missing device is logically connected to
           the SAN as indicated by switchShow output.
           Next, check to see that the device is present in the Name Server, using the
           command nsShow. If the device is not in the Name Server, it is invisible to
           the other devices in the fabric.
           Other common causes of missing devices are zone conflicts or
           marginal links.

Troubleshooting Marginal Links
    Use portErrShow to establish if there are a relatively high number of
    errors, such as CRC errors. Look for a steadily increasing number of errors
    to confirm a marginal link.
    A marginal link can impact multiple devices. For example, a shared storage
    device with a marginal link can cause communication problems with all
    devices that access that shared storage.
    A marginal link can be caused by any of the components that make up the
    link: switch port, switch GBIC, cable, edge device GBIC, and the edge device.
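
    The "steadily increasing" test for a marginal link can be sketched as
    follows. The sketch assumes you have recorded the same error counter (for
    example, CRC errors from portErrShow) at regular intervals; the function
    name and the threshold of three consecutive rises are hypothetical.

```python
def looks_marginal(error_counts, min_rises: int = 3) -> bool:
    """Return True if the counter rose in each of the last min_rises intervals."""
    rises = [later > earlier for earlier, later in zip(error_counts, error_counts[1:])]
    return len(rises) >= min_rises and all(rises[-min_rises:])


print(looks_marginal([10, 10, 10, 10]))      # high but stable count
print(looks_marginal([0, 50, 50, 50, 50]))   # one-time burst, then quiet
print(looks_marginal([10, 12, 15, 19]))      # keeps climbing
```

    A large but stable count usually points to a past event, whereas a count
    that keeps climbing between samples points to a link that is still failing.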


Troubleshooting I/O Pauses
    I/O pauses happen, and both the SAN and edge device can and should
    tolerate such events.
    An I/O pause can be as harsh as the powering down of a host or storage
    device while I/O is in transit, which will cause I/O to cease. Alternatively, it
    might be as lightweight as a port-level RSCN, which might be a problem
    for only the most latency-sensitive applications. Relative to the SAN, fabric
    events can also cause a pause in I/O.
    Several applications, such as video-on-demand and applications that are
    evolving into the SAN model, such as tape backup, are very sensitive to
    latency and/or RSCNs. High latencies and large numbers of RSCNs can
    adversely affect these applications.
    Storage vendors, switch vendors, application vendors, and HBA vendors are
    working with the standards bodies (T11) as well as modifying their product
    implementations to handle these types of exceptions.


❖ Chapter 9: SAN Implementation,
    Maintenance, and Management
Installation Considerations
    Ensure that ISLs run in front of only the switches to which they are
    connected. This will allow the switches to be removed without downtime for
    the fabric.
           When racking your switches, be sure to avoid single points of failure. This
           means separating redundant fabrics into different racks and powering
           resilient fabrics in such a way that a power failure does not cause the fabric
           to fail.
           Think carefully before relying solely on in-band management of your switch.
           Consider using both in-band and out-of-band management.
           Have a well thought-out switch-naming convention to enable easy identifi-
           cation of a physical switch in case a problem arises.
           If you intend to implement a large fabric, work closely with your switch
           supplier to identify a version of Fabric OS that supports the size of SAN
           you intend to build.


      Automating Switch Administration Activities
           If you have to do a SAN administration activity more than once, consider
           automating the activity with a script.
           Use Expect for automating your SAN administration activities today and
           consider using Fabric OS APIs when they become available.
           Take the Expect script example (run_sw_cmd) and modify it for your
           SAN administration activities.
           If you use Expect scripting, you need the supporting software. See the
           following URL for Expect installation guidance: http://expect.nist.gov.


      Brocade Zoning Considerations
           Determine whether you want to use hard or soft zoning prior to imple-
           menting your zone scheme.
           Hard zoning is more secure than soft zoning, as hard zoning is enforced
           at the Name Server and at the hardware level and will actually block
           inappropriate access.
           There are differences between hard zoning and soft zoning from a mainte-
           nance perspective. For example, you need to update your zone information
           if you replace a device that is part of a soft zone.
    Consider using a script to build and maintain large zoning configurations.
    Scripts can also help you apply disaster recovery policies that are
    enforced through zoning.
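
    A script for building a large zoning configuration might start from a
    plain description of the zones and emit the commands to run on the switch.
    The sketch below only generates text. The zoneCreate, cfgCreate, and
    cfgEnable command names and their quoting are assumed to match your Fabric
    OS version, so verify them against its command reference. The zone names
    and members are hypothetical.

```python
def zoning_commands(cfg_name: str, zones: dict) -> list:
    """Turn a {zone name: [members]} mapping into a list of switch commands."""
    commands = []
    for zone_name, members in zones.items():
        member_list = "; ".join(members)
        commands.append(f'zoneCreate "{zone_name}", "{member_list}"')
    zone_list = "; ".join(zones)
    commands.append(f'cfgCreate "{cfg_name}", "{zone_list}"')
    commands.append(f'cfgEnable "{cfg_name}"')
    return commands


# Hypothetical zones: members given as "domain,port" pairs or WWNs.
zones = {
    "web_zone": ["1,2", "1,5"],
    "backup_zone": ["1,7", "10:00:00:00:c9:12:34:56"],
}
for line in zoning_commands("prod_cfg", zones):
    print(line)
```

    Keeping the zone description in one file and regenerating the commands
    makes it easier to review a change before it reaches the fabric.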


Validating Your Fabric
    Baseline your fabric first so that you can quickly identify failures when you
    validate your SAN. Use the SAN profile as your baseline.
    You can automate many validation activities, such as taking your SAN pro-
    file and fault injection.
    The key to fault injection is to establish how your entire system behaves
    when a fault is encountered.
    Key fabric fault injections include: switch power cycle, ISL disable/enable,
    and switchDisable/switchEnable. Key edge device fault injections
    include: reset or power cycle the edge device. You can simulate this event by
    doing a portDisable/portEnable.
    Running a load in your SAN can shake out issues like marginal links. The
    final test is to run a load on your SAN while you do fault injections. If your
    SAN is able to handle this test, you are ready for production.
    Pick a minimum of two load types for your SAN I/O testing: one that
    approximates your SAN application load, and a load that is bandwidth-
    intensive.
    Ask your host supplier, HBA supplier, or storage supplier for the tools they
    use for I/O testing in a UNIX environment. If you want to use a
    Microsoft/Intel tool, Iometer is a good choice.


SAN Maintenance
    If possible, use the “cold” upgrade process for fabric upgrades. It takes only a
    few minutes of downtime.
    When adding a switch to the fabric, clear out the zone information.
    It is simplest to power down or disable your switch prior to connecting that
    switch to an existing SAN. Doing so will avoid domain ID conflicts.
           A diligently maintained SAN configuration log can help you with disaster
           recovery, troubleshooting, recreating a switch whose configuration is
           destroyed, SAN design modification or expansion, recovering accidentally
           deleted licenses, and recovering a zoning configuration.
           Back up your switch configuration with the command configUpload
           whenever you add or replace a switch.
           Maintaining a baseline SAN profile is essential to many SAN maintenance
           activities. Make sure you know how to create and maintain a SAN mainte-
           nance profile.
           The act of loading firmware does not impact SAN operations. The process
           of activating does impact SAN operations because a reboot is required.
                                                                                       Index
A                                               B
ACK frame, 43                                   backups, configuration
active/active storage controllers, 198, 200       switch configuration file, 393–394
active/passive storage controllers, 198, 200      zoning configuration, 382
Adapters, Media Interface (MIAs). See           backups, network
       Media Interface Adapters (MIAs)            accelerating cycling of, 14
addressing, switched fabric, 48–49, 358, 361      collecting information on in interviews,
administrative activities, automating,                 167
       367–372                                    LAN-free, 212–213
  Expect scripting and, 369–372                   reducing network congestion from, 13
  Fabric OS API and, 367–368                      remote, 218–219
Alias Server, 50, 52, 134                         SAN-based server-free, 213–216
analyzer, Fibre Channel, 287, 314–316           bad_eof error statistic, 297
any-to-any connectivity, 268–269                bandwidth
APIs                                              assessing need for SAN, 17
  Fabric OS API, 367–368, 378                     Fibre Channel hub, 34
  HBA API, 101–103                              baseline SAN profiles, 308–311, 317–318,
application service providers (ASPs), data             382–384
       sharing and, 12                          block-level protocols, data access speed and,
Application-Specific Integrated Circuit                14
       (ASIC), 80                               bonded SC connectors, 72
applications                                    bridges, Fibre Channel, 35, 60, 64–65,
  assessing need for SAN in interviews,                106–109
       16–17                                      extended copy support by, 108–109
  selecting during design phase, 165              management interfaces for, 109
  switch management via, 94                       number of SCSI buses on, 107
arbitrated loop topology, 4, 5, 33, 39, 47–48     SCSI termination type and, 108
architecture, SAN. See SAN architecture           selective LUN presentation by, 108
asynchronous transfer mode (ATM), 31              types of SCSI ports available on, 108
automatic device discovery, 133                 broadcasting, IP over Fibre Channel (IPFC),
  Fabric OS and, 133                                   88, 96–97
  Host Bus Adapters (HBAs) and, 101             Brocade Fabric Access Layer API, 137
automatic path failover, 134                    Brocade Fabric Assist, 79, 138–139, 147, 337
automatic port bypass, managed hubs and,        Brocade Fabric Manager, 131
       78                                       Brocade Fabric OS, 132–135, 228
availability levels, SAN architecture and,        adding switches with, 248
       256–260                                    automatic device discovery, 133
availability, high-availability (HA) cluster,     command-line interface, 135
       197
                                                  continuous port monitoring, 133
                                                  dynamic routing services, 134
                                                                                         431
432    Index


 Fabric Access Layer API, 137                SilkWorm 2050, 126, 128
 Fabric OS API and, 367–368, 378             SilkWorm 2200, 126
 Fibre Channel services provided by,         SilkWorm 2210, 127
      133–134                                SilkWorm 2240, 127
 history of features and enhancements,       SilkWorm 2250, 128
      363–365                                SilkWorm 2400, 126, 128, 129
 in-band interface, 135                      SilkWorm 2800, 126, 128, 129–130
 Management Information Bases (MIB)          SilkWorm 6400 Integrated Fabric, 126,
      provided by, 135                            130–131, 256
 switch beaconing and, 135                   SwitchType values, 302
 syslog daemon interface, 135                zoning of, 373–375
 universal port support by, 133              See also switches, Fibre Channel
 version, selecting which to use, 361–365   Brocade SOLUTIONware guides, 158, 196
Brocade Fabric Watch, 75, 138, 147, 339,    Brocade Web site, 158
      366
                                            Brocade WEB TOOLS, 139–140, 147, 366,
Brocade Intelligent Fabric Services               378, 408
      Architecture, 140–143
                                            Brocade Zoning, 136, 146–147, 372–373
 frame filtering and, 142
                                             licensing for, 366, 372
 hardware-enforced zoning, 142
                                             private loop devices and, 165–166
 ISL Trunking, 140–142
                                            buffer credits, switch port, 86–87
 performance analysis and, 143
                                            buffer-to-buffer flow control, 43
 Secure Fabric OS and, 143
                                            business goals, identifying, 153, 158
Brocade QuickLoop, 79, 80, 138–139
                                            business requirements, identifying, 158–159
 isolating marginal port faults, 339
 private loop devices and, 165–166, 337
Brocade Remote Switch, 218                  C
Brocade Secure Fabric OS, 143               cable testers, 315
Brocade SilkWorm switches, 124–132, 146,    cabling, 61, 65–68, 164
      147                                     copper, 61, 65–66
 entry-level series, 126–128                  layout of, 351–354
 licensing, 136, 322                          multimode optical, 36, 61, 66–67
 Metropolitan Area Networking (MAN)           single-mode optical, 36, 61, 68–69
      and, 219                              camTest command, 289
 port error statistics, 339, 341–342        cascade topology, 236–237
 scalable, 128–131                            compared with other topologies, 247
 selecting most appropriate, 124–126          resiliency of, 257, 258
 SilkWorm 1000, 300, 301                    central memory diagnostics, 289
 SilkWorm 12000 Core Fabric Switch,         centralMemoryTest command, 289
      126, 131–132                          cfgClear command, 324
 SilkWorm 2000, 126, 300, 301               cfgDisable command, 324, 343
 SilkWorm 2010, 126, 127                    cfgEnable command, 343, 347
 SilkWorm 2040, 126, 127                    cfgShow command, 286, 324, 333


channel protocols, 31                        configuration management software, HBA,
channels, 31, 38                                    101
Chaparral products                           congestion, network, 13, 233, 270
  Fibre Channel/SCSI bridges, 215            connectors, Fibre Channel, 61–62, 69–73
  Network Storage, 14                          DB-9 connectors, 69–70
character encoding, Fibre Channel, 36, 41      high-density optical connectors, 72–73
Class F service, 45, 50                        HSSDC connectors, 70–71
classes of service, Fibre Channel, 37, 39,     SC connectors, 71–72
       43–45                                 copper cabling, 61, 65–66
  Class 1, 43–44                             copper connectors, 62, 69–71
  Class 2, 44, 84                            core/edge fabric, 228, 229–230, 242–246
  Class 3, 44, 84                              adding edge switches to, 248–250
  Class 4, 44                                  compared to other topologies, 247
  Class F, 44, 50, 84                          complex core/edge, 244–245
Cluster Server, Microsoft’s, 11, 200–202       resiliency of, 256, 257, 258
clustering techniques, FC-VI standard and,     scaling without downtime, 248
       14–15                                   simple core/edge, 244
clusters, high-availability. See High-         target designs for, 253–256
       Availability (HA) clusters              upgrading core switches, 250–253
cmemRetentionTest command, 290               core switches, 81–82, 242–243, 250–253
CMI bus connection diagnostics, 289            adding new to core/edge fabric, 251
cmiTest command, 289                           configuring new core/edge fabric,
cold fabric upgrades, 398, 400                      251–252
combination adapters, 98                     core team, SAN
command-line interface, Fabric OS, 135         identifying people to include, 156–157,
complex core resilient core/edge topology,          193–194
       244–245, 247                            interview process for, 157–176
components, 61–65                            costs
  attention to those in production during      Brocade SilkWorm switches, 125, 126
       SAN planning, 166                       cabling media, 65, 67, 68
  evaluating pre-existing, 165–166             cascade topology, 236, 237
  redundant HA cluster, 199                    complex core resilient core/edge
  validation of, 166                                topology, 245
  See also specific components                 full-mesh topology, 238
composite resilient core/edge topology,        partial-mesh topology, 242
       245–246                                 ring topology, 238
Computer Associates Unicenter TNG, 350,      CPU speed, pre-SAN performance data
       368                                          and, 170
configDownload command, 393, 394             CRC errors, 295
configShow command, 286, 394                 crc_err error statistic, 296–297
configUpload command, 324, 382, 393          crossPortTest command, 290
configuration logs, 391–393                  Crossroads Systems, 14, 215


D                                              data sharing, 12, 18–19, 203–212
D characters, 41                                 file-level sharing, 19
data access, increasing speed of, 14             LUN masking and, 210–211
data analysis, SAN design process and, 153,      resource sharing, 19
       194                                       software management of, 211–212
  port requirements, establishing, 182–187       switch zoning and, 208–210
  Return On Investment (ROI)                     volume-level sharing, 19
       proposition, 153, 159, 187–189            with Web farms, 206–207
  SAN grouping process, 178–182                  See also switch zoning
data characters, 41                            data storage
data collection, SAN design process and,         consolidating with SANs, 9–10, 11–13,
       153, 156–177, 194                              203–212
  backup information, 167                        increased need for, 8–9
  business problem identification, 158           sharing among multiple hosts, 12,
  business requirement identification,                203–212
       158–159                                 database servers, HA cluster configuration,
  component testing needs, 166–167                    198–200, 226
  components, identifying those in place,      DataCare SANsymphony, 12
       165–166                                 DB-9 connectors, 62, 69–70
  components, identifying those in             DB-9 serial cabling, 69, 70
       production, 166                         dd tool, UNIX, 388–389
  current performance data, 168–172            debuggers
  design interview form, 175–176                 Fibre Channel analyzer, 314–315
  host information, 160–162                      portLog, 314
  initiator-to-target communications matrix,     protocol analyzers, 315–316
       167–168                                 dedicated connection types, 43
  maintenance downtime, 174–175                Dense Wave Division Multiplexing
  node information, 160–161                           (DWDM), 109–110, 217, 219
  performance, determining future needs,       design interview form, 176
       172–174                                 Destination ID (D_ID), switch, 302
  physical assessment of hardware, 176–177     devices
  processing collected data, 177–182             adding to fabrics, 401–402
  SAN-enabled applications needed, 165           automatic discovery of, 133
  SAN implementation downtime, 174               collecting information on in interviews,
  selecting people to interview, 156–157              160–164
  storage device information, 162–163            determining if existing require additional
  storage facility information, 164                   hardware, 165–166
  technical requirement identification,          timeout at bring up, 321–322
       159–160                                   troubleshooting missing, 279–283,
  timeline creation from, 175–176                     327–335
data movers, 14, 214                           DiagErr# message, 293
data replication techniques, 218               diagClearError command, 290


diagDisablePost command, 290                   merging fabrics and, 397
diagEnablePost command, 290                    setting, 358
diagHelp command, 289                        downstream information, 302, 303, 304
diagnostic switch commands, 289–308          downtime
  errShow, 278, 292–295, 318                   determining acceptable for maintenance
  help, 291–292                                     and changes, 174–175
  nsAllShow, 286, 320, 329, 382, 395           determining acceptable for SAN
  nsShow command. See nsShow command                implementation, 174
  portDisable, 249, 302, 318, 330, 334–335     scaling core/edge networks without,
                                                    248–253
  portEnable, 302, 318, 330, 334–335
                                             drivers, determining if full fabric, 165–166
  portErrShow, 286, 295–297
                                             dual-fabric SANs, 248
  portLoopbackTest, 289, 290
                                               HA clusters and, 199, 202
  show vs. dump, 291
                                               nonresilient, 257
  supportShow. See supportShow command
                                               resilient, 257
  switchEnable, 251, 318
                                             dual-port adapters, 98
  switchShow. See switchShow command
                                             DWDM. See Dense Wave Division
  topologyShow, 307–308, 309, 320, 329              Multiplexing (DWDM)
diagnostics, storage array, 287              dynamic discovery
diagShow switch command, 286, 290              Fabric OS and, 133
disabled switches, 300                         Host Bus Adapters (HBAs) and, 101
disaster tolerance, SANs and, 15–16, 31,     dynamic routing services, Fabric OS, 134
       216–221
  data replication and remote backup,
       218–219                               E
  Metropolitan Area Networks (MANs)          e-mail, data storage and, 8
       and, 219–221                          E_Ports (ISLs), 48–49, 87
Disk Administrator, Windows, 312, 317          cable layout and, 351–354
disk farms, 9                                  incomplete initialization of, 318–319
disk I/O performance, 169–172, 385–390         ISL over-subscription ratio, 231–232
disk monitoring tools, 169–170                 load sharing through, 134
disk seek time, 171                            port configuration conflicts and, 322–323
diskmon feature, Windows NT, 169               trunking, 140–142
diskperf -yd command, 172                    edge devices. See nodes
diskperf -yv command, 172                    edge ports, 243
diskperf utility, 172, 386                   edge switches, 243, 248–250
disks                                        18-switch core/edge SAN design, 255
  storage on, 32, 33, 64, 111                Electrically Erasable Programmable Read-
  troubleshooting missing, 279–283                 Only Memory (EEPROM), 74–75
distance requirements, SANs and, 17–18       Emulex HBA configuration utility, 101, 102
distributed fabric services, 41              enc_in error statistic, 296
domain IDs, switch, 48, 300–301, 319         enc_out error statistic, 297
  conflicting, 326–327                       End Of Frame (EOF) primitive, 40


end-to-end flow control, 43                   Login Server, 38, 50
er_disc_c3 error statistic, 297               Management Server, 51, 85, 86, 133
er_enc_out statistic, 338                     Name Server. See Name Server
errDump command, 286                          Registered State Change Notification
error logs, 292–295                                (RSCN), 51, 85–86, 284, 343–344,
error messages, 284, 285, 293, 324, 338            347
error statistics, 296–297, 341–342            switches and, 85–86
errors                                        Time Server, 52, 86
  displaying logged, 292–295                Fabric Shortest Path First (FSPF), 90–91,
                                                   134, 228, 233, 256
  MQ errors, 318, 327
                                            Fabric/Switch Controller, 38, 50
errShow command, 278, 292–295, 318
                                            fabric topologies, 229, 235
Ethernet, 31
                                              cascade topology, 236–237
  in-band/out-of-band switch management,
       356                                    comparison of properties of, 247
  router management by, 109                   complex, 246
Exchange Mail Server, 226                     congestion and, 270
Expect scripting                              core/edge fabric, 242–246, 248–256
  firmwareDownload automation by, 400         distance and, 234
  switch management by, 369–372               factors affecting performance, 270–271
  zoning management by, 379–380               full-mesh topology, 239–240
Extended Copy command, 108–109, 214,          over-subscription and, 270
       215                                    partial-mesh topology, 240–242
Extended Fabrics, 136–137                     resiliency of, 256
Extreme SCSI, 169, 170                        ring topology, 237–238
EZ Fibre, JNI’s, 312                          SAN architecture availability and, 257–260
                                              scalability of, 236
                                              vs. SANs, 275
F                                           fabric troubleshooting, 316–327
F_Ports, 48–49, 87
                                              comparing SAN profiles to identify
Fabric Access Layer API, 137                       problem, 317–318
Fabric Assist, 79, 138–139, 147, 337          domain ID conflicts, 326–327
Fabric Configuration Server, 51               fabric information available from hosts,
fabric licenses, 322, 366                          317
Fabric Login (FLOGI) frame, 50                fabric licenses and, 322
Fabric Loop Attachment (FLA), 337             I/O pauses and, 343–344
Fabric Manager, 131                           incompatible fabric parameters, 325–326
Fabric OS API, 367–368, 378                   MQ errors, 327
fabric port count, 230                        Name Server discrepancies, 320–321
fabric segmentation, 318, 323–327, 348        port configuration conflicts, 322–323, 347
fabric services, 36, 38, 41, 49–52, 85–86     switch LEDs and, 288
  Alias Server, 50, 52, 134                   symptoms indicative of fabric problem,
  Fabric/Switch Controller, 51                     316


  timeout of devices at bring up, 321–322        FCIA. See Fibre Channel Industry
  topologyShow command and, 309, 320,                   Association (FCIA)
       321, 329, 382, 395                        FCP/SCSI protocol, 96
  zoning conflicts, 323–324                      Fiber Distributed Data Interface (FDDI),
  See also SAN troubleshooting                          31, 43
Fabric Watch, 75, 138, 147, 339, 366             fiber-optic cables, 36, 61, 65
Fabric Zone Server, 51                           fiber-optic connectors, 62, 71–73
fabrics, 5, 34, 36, 49–52                          high-density optical connectors, 72–73
  adding edge devices to, 401–402                  SC optical connectors, 71–72
  adding switches to, 395–398                    Fibre Channel, 30–58
  bringing up, 321–322, 394–395                    broadcasting via, 88
  defined, 229                                     character encoding, 36, 41
  licenses for, 322, 366                           classes of service, 37, 39, 43–45
  merging, 395, 408                                cost of, 28
  segmented, 318, 323–327, 348                     CPU requirements, 170
  timeout of devices at bring up, 321–322          disk seek time requirements, 171
  troubleshooting. See fabric troubleshooting      distances supported, 2, 39
  upgrading, 398–400, 408                          HBA speed requirements, 171
  verification of, 382–383, 395                    history of, 2
  zone management and, 378–379                     hot-plug systems and, 104–105
fabricShow command, 250, 251, 286                  interoperability of, 28, 58
fan-in, 234                                        layers, 36–37
fan-out, 234                                       PCI bus speed requirements, 170–171
faShow command, 286                                protocols supported, 2, 31, 39, 42
fastboot command, 399                              RAID speed requirements, 171
fault injection techniques, 384–385                RAM requirements, 171
fault tolerance                                    resources on, 8
  of High-Availability (HA) clusters, 199, 200     routing IP over to Gigabit Ethernet, 65,
  of SANs, 10–11                                        110–111
faultShow command, 286                             routing over IP networks, 110
FC-0 Fibre Channel layer, 36, 40                   SCSI performance vs., 57
FC-1 Fibre Channel layer, 41                       speed of, 14, 31, 38
FC-2 Fibre Channel layer, 36, 41                   standards, 35
FC-3 Fibre Channel layer, 36, 41                   switched fabric installation, 5, 6
FC-4 Fibre Channel layer, 36, 41–42                topologies, 37, 39
FC-AL. See Fibre Channel Arbitrated Loop           transfer rates, 35
       (FC-AL)                                   Fibre Channel analyzers, 287, 314–316
FC-GS-3 standard, 41, 50                         Fibre Channel Arbitrated Loop (FC-AL), 4,
FC-VI standard, 15                                      5, 33, 39, 47–48, 60
FC_IP (Fibre Channel across IP), 110             Fibre Channel common services layer. See
                                                        FC-3 Fibre Channel layer
                                                 Fibre Channel disks, 33


Fibre Channel Industry Association (FCIA),   G
        8, 22, 35, 58
                                             GBICs. See Gigabit Interface Connectors
Fibre Channel layers, 36–37, 40–42                  (GBICs)
   FC-0, 36, 40                              Get All Next (GA_NXT) request, 51
   FC-1, 36, 41                              get_san_profile script, 384
   FC-2, 36, 41                              Get_Time frame, 52
   FC-3, 36, 41                              Get_Time_Response frame, 52
   FC-4, 36, 41–42                           Gigabit Ethernet, 14, 30, 57
Fibre Channel Management MIB, 93               cost of, 28
Fibre Channel Protocol (FCP), 96               routing Fibre Channel to, 110–111
Fibre Channel protocol layer. See FC-2       Gigabit Interface Connectors (GBICs), 60,
        Fibre Channel layer                         61–62, 73–75, 121
Fibre Channel standards projects, 35           Brocade SilkWorm switches and, 126
Fibre Channel Storage Area Network             disadvantages of, 74, 75
        (SAN). See SANs (Storage Area
        Networks)                              GBIC ports on equipment, 74
Fibre Channel-to-Dense Wavelength              serialized, 74–75
        Division Multiplexing (DWDM), 16,
        109–110                              H
Fibre Channel-to-Gigabit Ethernet bridges,   HA clusters. See High-Availability (HA)
        65, 110–111                                 clusters
FibreAlliance Management Information         hard zoning, 83–84, 303, 374, 375–378
        Base (MIB), 92–93
                                             hardware forwarding, 38
FibreAlliance MIB, 109
                                             hardware, SANs, 21, 61–65
fields, frame, 39–40
                                               identifying pieces in place during design
file-level sharing, 19                              phase, 165–166
File Transfer Protocol (FTP), router           initiating devices, 33
        management by, 109
                                               interconnecting devices, 33–34
firmware upgrades, 89, 398–400
                                               physical assessment of in design process,
   cold fabric upgrade, 400                         176–177
   hot fabric upgrade, 399                     selecting, 21–22
   scripting of, 400                           target devices, 32–33
FL_Ports, 48–49, 87                            See also specific pieces
flow control, 43                             HBA. See Host Bus Adapters (HBAs)
format UNIX command, 312, 317                HBA API, 101–103
14-switch core/edge SAN design, 255–256      headers, frame, 37, 39–40
frame filtering, 142                         heartbeat, network, 198, 202
frames, 37, 39–40, 41                        help command, 291–292
   classes of service and, 43–45             Hewlett-Packard
   headers, 39–40                              LUN Manager product, 212
free ports, 243                                OpenView, 12, 350, 368
full-mesh topology, 239–240, 247             high-availability applications, 198–200
fw_download script, 400


high availability, ensuring with SANs,           protocol access permissions, 100
       10–11, 31                                 protocols supported, 95, 96–97
High-Availability (HA) clusters, 196–202,        QuickLoop and, 139
       257                                       remote booting and, 103–104
  active/active model, 198                       speed requirements, 97, 171, 172
  active/passive model, 198                      static discovery by, 101
  advantages of, 197                             storage partitioning by, 210–211
  database servers and, 198–200                  types of, 95–96
  fault tolerance of, 199, 200                   zoning, 373
  high-availability applications and, 198–200   host tier switches, 263
  Microsoft Cluster Server and, 200–202         hosts, 33
  redundancy and, 199                            checking for with switchShow command,
  storage devices for, 200                             329
  zero-downtime failovers, 199                   collecting information on in interviews,
high-availability storage devices, 200                 160–162
high-density optical connectors, 62, 72–73       troubleshooting information available
high-end storage arrays, 113–114                       from, 287, 312–313, 317
  LUN export across multiple ports by,          hot fabric upgrades, 398, 399
       113–114                                  hot-plug systems, 104–106, 126
  selective LUN presentation by, 113            hot-swappable components, 86, 125
  snapshot backup volumes by, 114               HSSDC-2 connectors, 71
High-Performance Parallel Interface             HSSDC connectors, 62, 70–71
       (HiPPI), 31, 38, 42–43                   hubs, Fibre Channel, 4, 34–35, 60, 63, 76–80
HighGround, Sun’s, 350                           LIP process and, 77, 78–79
hop counts, 231, 276                             managed, 35, 60, 76–78
Host Bus Adapters (HBAs), 33, 60, 64, 95         simple electrical, 34, 35, 60, 76
  combination adapters, 98                       vs. switches, 57
  configuration management software for, 101
  connecting hosts to fabric with, 95
  default LUN access permissions and, 100
                                                I
                                                I/O generators, 387–390
  drivers for, 172
                                                I/O load, 385–390
  dual in HA clusters, 199
                                                  generation of, 387–390
  dynamic discovery by, 101
                                                  types of, 386–387
  Fabric Assist and, 139
                                                I/O pauses, 342–343, 347
  fabric-capable, 98–99
                                                IBM DB2, 15, 226
  HBA API and, 101–103
                                                in-band switch management, 356–358
  hot-plug systems and, 104–106
                                                incomplete ISL initialization, 318–319
  LUN mapping (persistent binding) and,
       99–100                                   InfiniBand technologies, 30, 57, 58, 71
  LUN masking and, 99                           initiating devices, 33
  ports available on, 98                        initiator-to-target communications,
                                                       mapping, 167–168
  private (loop-based), 98
                                                installation, SAN. See SAN installation


Integrated Drive Electronics (IDE), 2            performance, determining future needs,
Intelligent Fabric Services Architecture,             172–174
       140–143                                   SAN-enabled applications desired, 165
  frame filtering and, 142                       storage device information, 162–163
  hardware-enforced zoning, 142                  storage facility information, 164
  ISL Trunking, 140–142                          technical requirement identification,
  performance analysis and, 143                       159–160
  Secure Fabric OS and, 143                      timeline creation from, 175–176
Intelligent Peripheral Interface (IPI), 42     Iometer, Intel’s, 169, 170, 387, 389
Inter-Switch Links (ISL; E_ports), 48–49, 87   iostat utility, Sun Solaris, 169, 386
  cable layout and, 351–354                    IOzone, 387
  incomplete initialization of, 318–319        IP addresses, setting switch, 358, 361
  ISL over-subscription ratio, 231–232         IP networks
  load sharing through, 134                      routing Fibre Channel across, 110
  port configuration conflicts and, 322–323      routing over Fibre Channel to Gigabit
  trunking, 140–142                                   Ethernet, 110–111
interconnecting devices, 33–34                 IP protocol, 2, 31, 39, 42, 96–97
Internal Rate of Return (IRR) calculations,    IP targets, 32, 33
       189                                     IPFC protocol, 88, 96, 356–358
Internet, data storage and, 9                  iSCSI, 57–58
Internet Protocol (IP), 2, 31, 39, 42, 96–97   ISLs (E_Ports). See Inter-Switch Links (ISLs)
Internet Service Providers (ISPs), 12–13
interoperability labs, 22–23                   J
interviews, SAN design process and, 150,       Java-based Web pages, switch management
       153                                            by, 93–94
  backup information, identifying required,    JNI EZ Fibre, 312
       167                                     Just a Bunch Of Disks (JBOD), 32, 33, 64,
  business problem identification, 158                111
  business requirements identification,
       158–159
  component testing needs, 166–167             K
  current performance data, 168–172            K characters, 41
  design interview form, 175–176               Key Distribution server, 50
  host information, collecting, 160–162
  identifying people to interview, 156–157,    L
       193–194                                 LAN-based backup configurations,
  implementation, determining acceptable             212–213, 214
       downtime for, 174                       LAN-free backup configurations, 212–216
  initiator-to-target communications matrix,   latency, 231
       167–168                                 LC connectors, 72
  maintenance downtime, determining            LEDs, switch, 287–289, 318, 329
       acceptable, 174–175


legacy devices, connecting to SANs, 106        HBA-based, 99
Legato NetWorker, 13                           hot-plug systems and, 105
license, fabric, 322, 366                      storage partitioning by, 210–211
licenseShow command, 286, 322, 366
LIP (Loop Initialization Primitive), 77–78,   M
        337–338, 339
LIP process, 77, 78–79
LIP storm, 79
Lip_in count, 339, 342
Lip_out count, 339, 342
Lip_rx count, 342
load testing, I/O, 385–390
locality, performance optimization through, 266–268
logical unit number. See LUN (Logical Unit Number)
Login Server, 38, 50
logs, configuration, 391–393
logs, error, 292–295
loop environments, 4, 5, 33, 90
   isolating marginal port faults, 339
   LIP imbalances and, 339
   LIP process and, 78–79
   marginal GBICs and, 338
   marginal loop connections, 337–338
   marginal port behavior on disrupted, 338
   migrating to switched fabrics, 79–80
Loop Initialization Primitive (LIP), 77, 337–338, 339
loop ports. See FL_Ports
loop zoning, 77
loopPortTest command, 290
LUN (Logical Unit Number)
   access permissions, 100
   high-availability (HA) clusters and, 200
   LUN-level zoning, 373, 374
   Microsoft Cluster Server configurations, 200–202
   selective presentation of, 108
LUN Manager, Hewlett-Packard’s, 212
LUN mapping (persistent binding), 99–100
LUN masking, 122, 205, 226
make_zone script, 379–380
MAN technologies. See Metropolitan Area Networking (MAN)
manageability, high-availability cluster, 197
managed Fibre Channel hub, 34–35, 60, 76–78
Management Information Base (MIB), 92–93, 135
Management Server, 38, 85, 86, 133
mapping, LUN, 99–100
marginal GBICs, 338
marginal loop connections, 337–338
marginal point-to-point/fabric device links, 335–336
marginal ports, 335
   disrupted loops and, 338
   fault isolation and, 339, 340
   marginal GBICs, 338
   marginal loop connections, 337–338
   marginal point-to-point/fabric device links, 335–336
   portErrShow command, 295–297
masking, LUN, 122, 205, 226
   HBA-based, 99
   hot-plug systems and, 105
   storage partitioning by, 210–211
media, cabling, 61, 65–68, 164
Media Interface Adapters (MIAs), 75
mesh topologies, 238
   compared to other topologies, 247
   full-mesh topology, 239–240
   partial-mesh topology, 240–242
   resiliency of, 256, 257, 258
metadata servers, 212
Metropolitan Area Networking (MAN), 15–16, 217, 219–221, 238
MIAs. See Media Interface Adapters (MIAs)
Micromuse’s Netcool, 350, 368
Microsoft Cluster Server (MSCS), 11, 200–202
Microsoft Exchange databases, 11
Microsoft SQL Server, 226
Microsoft Windows Hardware Quality Lab (WHQL), 202
migration process, 154
missing devices, troubleshooting, 279–283, 327–335
   locating on Name Server with nsShow, 332–333
   port configuration conflicts, 329–332
   switchShow command and, 329–332
   zoning mismatches and, 333–335
MQ errors, 318, 327
mqShow command, 286
MSCS. See Microsoft Cluster Server (MSCS)
MT-RJ connectors, 72–73
multi-LUN devices, 208
multicast groups, 52
multimode optical cables, 36, 61, 66–67
multipathing software, 199

N
Name Server, 38, 41, 50–51, 85, 133–134
   hard zoning and, 376
   missing devices and, 280–283, 303–307, 332–335
   nsAllShow command, 286, 320, 329, 382, 395
   nsShow command, 281–282, 303–307, 320–321

   accelerating cycling of, 14
   collecting information on in interviews, 167
   LAN-free configurations for, 212–213
   reducing network congestion from, 13
   remote, 218–219
   SAN-based server-free, 213–216
Network Data Management Protocol (NDMP), 214
network heartbeats, 198, 202
network protocols, 31
NetWorker, Legato’s, 13
node count, 230
node WWNs, 306–307
nodes, 230
   adding to fabrics, 401–402
   collecting information on in interviews, 160–164
   missing from Name Server, 334–335
   over-subscription of, 232
   timeout of at bring up, 321–322
   WWNs, 307
nonresilient dual-fabric SANs, 257
nsAllShow command, 286, 320, 329, 395
nsShow command, 286
   fabric validation using, 382
   troubleshooting fabrics with, 320–321
   troubleshooting missing devices with, 281–282, 332–335
   troubleshooting SANs with, 303–307

O
Open Fibre