Service Operation

London: TSO
Published by TSO (The Stationery Office) and available from:

Online
www.tsoshop.co.uk

Mail, Telephone, Fax & E-mail
TSO
PO Box 29, Norwich, NR3 1GN
Telephone orders/General enquiries: 0870 600 5522
Fax orders: 0870 600 5533
E-mail: customer.services@tso.co.uk
Textphone 0870 240 3701

TSO Shops
123 Kingsway, London, WC2B 6PQ
020 7242 6393 Fax 020 7242 6394
16 Arthur Street, Belfast BT1 4GD
028 9023 8451 Fax 028 9023 5401
71 Lothian Road, Edinburgh EH3 9AZ
0870 606 5566 Fax 0870 606 5588

TSO@Blackwell and other Accredited Agents


Published for the Office of Government Commerce under licence from the Controller of Her Majesty’s
Stationery Office.
© Crown Copyright 2007
This is a Crown copyright value added product, reuse of which requires a Click-Use Licence for value
added material issued by OPSI.
Applications to reuse, reproduce or republish material in this publication should be sent to OPSI,
Information Policy Team, St Clements House, 2-16 Colegate, Norwich, NR3 1BQ,
Tel No (01603) 621000 Fax No (01603) 723000, E-mail: hmsolicensing@cabinet-office.x.gsi.gov.uk, or
complete the application form on the OPSI website
http://www.opsi.gov.uk/click-use/value-added-licence-information/index.htm
OPSI, in consultation with the Office of Government Commerce (OGC), may then prepare a Value Added
Licence based on standard terms tailored to your particular requirements, including payment terms.
The OGC logo ® is a Registered Trade Mark of the Office of Government Commerce.
ITIL ® is a Registered Trade Mark, and a Registered Community Trade Mark of the Office of Government
Commerce, and is Registered in the U.S. Patent and Trademark Office.
The Swirl logo ™ is a Trade Mark of the Office of Government Commerce.
First published 2007
ISBN 978 0 11 331046 3
Printed in the United Kingdom for The Stationery Office


Contents
List of figures
List of tables
OGC’s foreword
Chief Architect’s foreword
Preface
Acknowledgements

1   Introduction
    1.1   Overview
    1.2   Context
    1.3   Purpose
    1.4   Usage
    1.5   Chapter overview

2   Service Management as a practice
    2.1   What is Service Management?
    2.2   What are services?
    2.3   Functions and processes across the lifecycle
    2.4   Service Operation fundamentals

3   Service Operation principles
    3.1   Functions, groups, teams, departments and divisions
    3.2   Achieving balance in Service Operation
    3.3   Providing service
    3.4   Operation staff involvement in Service Design and Service Transition
    3.5   Operational Health
    3.6   Communication
    3.7   Documentation

4   Service Operation processes
    4.1   Event Management
    4.2   Incident Management
    4.3   Request Fulfilment
    4.4   Problem Management
    4.5   Access Management
    4.6   Operational activities of processes covered in other lifecycle phases

5   Common Service Operation activities
    5.1   Monitoring and control
    5.2   IT Operations
    5.3   Mainframe Management
    5.4   Server Management and Support
    5.5   Network Management
    5.6   Storage and Archive
    5.7   Database Administration
    5.8   Directory Services Management
    5.9   Desktop Support
    5.10  Middleware Management
    5.11  Internet/Web Management
    5.12  Facilities and Data Centre Management
    5.13  Information Security Management and Service Operation
    5.14  Improvement of operational activities

6   Organizing for Service Operation
    6.1   Functions
    6.2   Service Desk
    6.3   Technical Management
    6.4   IT Operations Management
    6.5   Application Management
    6.6   Service Operation roles and responsibilities
    6.7   Service Operation Organization Structures

7   Technology considerations
    7.1   Generic requirements
    7.2   Event Management
    7.3   Incident Management
    7.4   Request fulfilment
    7.5   Problem Management
    7.6   Access Management
    7.7   Service Desk

8   Implementing Service Operation
    8.1   Managing change in Service Operation
    8.2   Service Operation and Project Management
    8.3   Assessing and managing risk in Service Operation
    8.4   Operational staff in Service Design and Transition
    8.5   Planning and Implementing Service Management technologies

9   Challenges, Critical Success Factors and risks
    9.1   Challenges
    9.2   Critical Success Factors
    9.3   Risks

Afterword

Appendix A: Complementary industry guidance
    A1    COBIT
    A2    ISO/IEC 20000
    A3    CMMI
    A4    Balanced Scorecard
    A5    Quality Management
    A6    ITIL and the OSI Framework

Appendix B: Communication in Service Operation
    B1    Routine operational communication
    B2    Communication between shifts
    B3    Performance Reporting
    B4    Communication in projects
    B5    Communication related to changes
    B6    Communication related to exceptions
    B7    Communication related to emergencies
    B8    Communication with users and customers

Appendix C: Kepner and Tregoe
    C1    Defining the problem
    C2    Describing the problem
    C3    Establishing possible causes
    C4    Testing the most probable cause
    C5    Verifying the true cause

Appendix D: Ishikawa Diagrams

Appendix E: Detailed description of Facilities Management
    E1    Building Management
    E2    Equipment Hosting
    E3    Power Management
    E4    Environmental Conditioning and Alert Systems
    E5    Safety
    E6    Physical Access Control
    E7    Shipping and Receiving
    E8    Involvement in Contract Management
    E9    Maintenance

Appendix F: Physical Access Control

Glossary
    Acronyms list
    Definitions list

Index


List of figures
All diagrams in this publication are intended to provide an
illustration of ITIL Service Management Practice concepts
and guidance. They have been artistically rendered to
visually reinforce key concepts and are not intended to
meet a formal method or standard of technical drawing.
The ITIL Service Management Practices Integrated Service
Model conforms to technical drawing standards and
should be referred to for complete details. Please see
www.best-management-practice.com/itil for details.

Figure 1.1   Source of Service Management Practice
Figure 1.2   ITIL Core
Figure 2.1   A conversation about the definition and meaning of services
Figure 2.2   A basic process
Figure 3.1   Achieving a balance between external and internal focus
Figure 3.2   Achieving a balance between focus on stability and responsiveness
Figure 3.3   Balancing service quality and cost
Figure 3.4   Achieving a balance between focus on cost and quality
Figure 3.5   Achieving a balance between being too reactive or too proactive
Figure 4.1   The Event Management process
Figure 4.2   Incident Management process flow
Figure 4.3   Multi-level incident categorization
Figure 4.4   Problem Management process flow
Figure 4.5   Important versus trivial causes
Figure 4.6   Service Knowledge Management System
Figure 5.1   Achieving maturity in Technology Management
Figure 5.2   The Monitor Control Loop
Figure 5.3   Complex Monitor Control Loop
Figure 5.4   ITSM Monitor Control Loop
Figure 6.1   Service Operation functions
Figure 6.2   Local Service Desk
Figure 6.3   Centralized Service Desk
Figure 6.4   Virtual Service Desk
Figure 6.5   Application Management Lifecycle
Figure 6.6   Role of teams in the Application Management Lifecycle
Figure 6.7   IT Operations organized according to technical specialization (sample)
Figure 6.8   A department based on executing a set of activities
Figure 6.9   IT Operations organized according to geography
Figure 6.10  Centralized IT Operations, Technical and Application Management structure
Figure D.1   Sample of starting an Ishikawa Diagram
Figure D.2   Sample of a completed Ishikawa Diagram




     List of tables
     Table 3.1   Examples of extreme internal and external
                 focus
     Table 3.2   Examples of extreme focus on stability and
                 responsiveness
     Table 3.3   Examples of extreme focus on quality and
                 cost
     Table 3.4   Examples of extremely reactive and proactive
                 behaviour
     Table 4.1   Simple priority coding system
     Table 4.2   Pareto cause ranking chart
     Table 5.1   Active and Passive Reactive and Proactive
                 Monitoring
     Table 6.1   Survey techniques and tools
     Table 6.2   Organizational roles
     Table B.1   Communication requirements in IT services
     Table B.2   Communication requirements between shifts
     Table B.3   Performance Reporting requirements: IT
                 service
     Table B.4   Performance Reporting requirements: Service
                 Operation team or department
     Table B.5   Performance Reporting requirements:
                 infrastructure or process
     Table B.6   Communication within projects
     Table B.7   Communication on handover of projects
     Table B.8   Communication about changes
     Table B.9   Communication during exceptions
     Table B.10 Communication during emergencies
     Table B.11 Communication with users and customers
     Table F.1   Access control devices


OGC’s foreword
Since its creation, ITIL has grown to become the most
widely accepted approach to IT service management in
the world. However, along with this success comes the
responsibility to ensure that the guidance keeps pace with
a changing global business environment. Service
management requirements are inevitably shaped by the
development of technology, revised business models and
increasing customer expectations. Our latest version of ITIL
has been created in response to these developments.
This is one of the five core publications describing the IT
service management practices that make up ITIL. They are
the result of a two-year project to review and update the
guidance. The number of service management
professionals around the world who have helped to
develop the content of these publications is impressive.
Their experience and knowledge have contributed to the
content to bring you a consistent set of high-quality
guidance. This is supported by the ongoing development
of a comprehensive qualifications scheme, along with
accredited training and consultancy.
Whether you are part of a global company, a government
department or a small business, ITIL gives you access to
world-class service management expertise. Essentially, it
puts IT services where they belong – at the heart of
successful business operations.




Peter Fanning
Acting Chief Executive
Office of Government Commerce




  Chief Architect’s foreword
   ITIL Service Management Practice guidance is structured
   around the Service Lifecycle. Common across the lifecycle
   is the overall practice itself, which relies on processes,
   functions, activities, organizational models and
   measurement, which together allow IT Service
   Management (ITSM) to integrate with the business
   processes, provide measurable value and evolve the ITSM
   industry forward in our pursuit of service excellence.
   Nowhere else in the ITIL Service Lifecycle does the effect
   of how we perform as service providers touch the
   customers as intimately as Service Operation. This is
   where the strategy, design, transition and improvements
   are delivered and supported on a day-to-day basis.
   The Service Operation publication brings Service
   Management to life for the business, and the
   accountability for the performance of the services, the
   people who create them and the technology that enables
   them are monitored, controlled and delivered in this stage
   of the Service Lifecycle.
   This publication will help guide us all to achieve service
   excellence and to see the value of ITSM in a broad,
   business-focused view of it. Whether you are new to the
   practice of ITIL or a seasoned practitioner, the guidance in
   this publication will expand your vision and knowledge of
   how to be the best-of-breed service provider through
   implementation of Service Operation.
   There is a saying that hindsight is 20/20. The guidance in
   Service Operation is distilled from over 20 years of
   experience in ITSM by world experts, business people and
   ITSM practitioners and the lessons learned by them about
   what service excellence really is and how to achieve it.
   Anyone involved in operating services will benefit from
   the guidance in the following pages of this publication.
   Service Operation offers the best advice and guidance
   from around the world and a path to what is possible in
   your future.




   Sharon Taylor
   Chief Architect, ITIL Service Management Practices


Preface
This publication encompasses and supersedes the
operational aspects of the ITIL Service Support and Service
Delivery publications and also covers most of the scope of
ICT Infrastructure Management. It also incorporates
operational aspects from the Planning to Implement,
Application Management, Software Asset Management
and Security Management publications.
The basic principles of best practice IT service
management encompassed within earlier versions of
ITIL remain unchanged. Common sense remains
common sense!
However, the technologies, tools and relationships
have changed significantly, even in the relatively short
time since the latest version of ITIL was completed. Whilst
this publication re-uses and updates relevant material
from the earlier versions where appropriate, it also
includes many new concepts and industry practices to
give complete coverage of best-practice guidance for
today’s Service Operation in a single volume, for today’s
business and technological environment.


Contact information
Full details of the range of material published under the
ITIL banner can be found at
www.best-management-practice.com/itil
For further information on qualifications and training
accreditation, please visit www.itil-officialsite.com.
Alternatively, please contact:
APMG Service Desk
Sword House
Totteridge Road
High Wycombe
Buckinghamshire
HP13 6DG
Tel: +44 (0) 1494 452450
E-mail: servicedesk@apmg.co.uk




    Acknowledgements
Chief Architect and authors
Sharon Taylor (Aspect Group Inc)    Chief Architect
David Cannon (HP)                   Author
David Wheeldon (HP)                 Author

ITIL authoring team
The ITIL authoring team contributed to this guide through
commenting on content and alignment across the set. So
thanks are also due to the other ITIL authors, specifically
Jeroen Bronkhorst (HP), Gary Case (Pink Elephant), Ashley
Hanna (HP), Majid Iqbal (Carnegie Mellon University),
Shirley Lacy (ConnectSphere), Vernon Lloyd (Fox IT), Ivor
Macfarlane (Guillemot Rock), Michael Nieves (Accenture),
Stuart Rance (HP), Colin Rudd (ITEMS) and George
Spalding (Pink Elephant).

Mentors
Christian Nissen and Paul Wilkinson.

Further contributions
A number of people generously contributed their time
and expertise to this Service Operation publication. Jim
Clinch, as OGC Project Manager, is grateful for the support
provided by HP to the authoring team on the
development of this publication and particularly the
contribution of Peter Doherty and Robert Stroud, and for
the support of Jenny Dugmore, Convenor of Working
Group ISO/IEC 20000, Janine Eves, Carol Hulm, Aidan
Lawes and Michiel van der Voort.

The authors would also like to thank Stuart Rance and
Ashley Hanna of Hewlett-Packard, Christian F Nissen
(ITILLIGENCE), Maria Vase (Itilligence), Eu Jin Ho (UBS), Jan
Bjerregaard (Sun Microsystems), Jan Øberg (ØBERG
Partners), Lars Zobbe Mortensen (Zobbe Consult &
Zoftware), Mette Nielsen (Carlsberg IT), Michael Imhoff
(IBM), Niels Berner (Novo Nordisk), Nina Schertiger (HP),
Signe-Marie Hernes Bjerke (DNV), Steen Sverker Nilsson
(Westergaard CSM), Ulf Myrberg (BiTa), Russell Jukes,
Debbi Jancaitis, Sheldon Parmer, Ramon Alanis, Tim
Benson and Nenen Ong of Hewlett-Packard IT, Jaye
Thompson, Dee Seymour, Andranik Ziyalyan, Young
Chang, Lauren Abernethy, April McCowan, Becky
Wershbale, Rob Garman, Scott McPherson, Sandra
Breading, Rick Streeter, Leon Gantt, Charlotte Devine, Greg
Algorri, Mary Fischer, Bill Thayer and Diana Osberg of The
Walt Disney Company’s Enterprise IT, Dennis Deane and
John Sowerby of DHL, Richard Fahey and Chris Hughes of
HP Global Delivery Application Services, Cindi Locker and
Dhiraj Gupta of Progressive Casualty Insurance Company,
Peter Doherty and Robert Stroud from Computer
Associates and Paul Tillston from Hewlett-Packard, Brian
Jakubec, Vernon Blakes, Angela Chin, Colin Lovell, Ken
Hamilton, Rose Lariviere, Jenny McPhee, Tom Nielsen, Roc
Paez, Lloyd Robinson, Paul Wilmot, Jeanette Smith and
Ken Wendle of Hewlett-Packard.

In order to develop ITIL Service Management Practices to
reflect current best practice and produce publications of
lasting value, OGC consulted widely with different
stakeholders throughout the world at every stage in the
process. OGC would also like to thank the following
individuals and their organisations for their contributions
to refreshing the ITIL guidance:

The ITIL Advisory Group
Pippa Bass, OGC; Tony Betts, Independent; Signe-Marie
Hernes Bjerke, Det Norske Veritas; Alison Cartlidge, Xansa;
Diane Colbeck, DIYmonde Solutions Inc; Ivor Evans,
DIYmonde Solutions Inc; Karen Ferris, ProActive; Malcolm
Fry, FRY-Consultants; John Gibert, Independent; Colin
Hamilton, RENARD Consulting Ltd; Lex Hendriks, EXIN;
Carol Hulm, British Computer Society-ISEB; Tony Jenkins,
DOMAINetc; Phil Montanaro, EDS; Alan Nance, ITPreneurs;
Christian Nissen, Itilligence; Don Page, Marval Group; Bill
Powell, IBM; Sergio Rubinato Filho, CA; James Siminoski,
SOScorp; Robert E. Stroud, CA; Jan van Bon, Inform-IT; Ken
Wendle, HP; Paul Wilkinson, Getronics PinkRoccade;
Takashi Yagi, Hitachi.

Reviewers
Jorge Acevedo, Computec S.A; Valerie Arraj, InteQ; Colin
Ashcroft, City of London; Martijn Bakker, Getronics
PinkRoccade; Jeff Bartrop, BT & Customer Service Direct;
John Bennett, Centram Ltd; Niels Berner, Novo Nordisk; Ian
Bevan, Fox IT; Signe-Marie Hernes Bjerke, DNV; Jan
Bjerregaard, Sun Microsystems; Enrico Boverino, CA;
Stephen Bull, Sierra Systems; Bradley Busch, InTotality;
Howard Carpenter, IBM; Diane Colbeck, DIYmonde
Solutions Inc; Nicole Conboy, Nicole Conboy & Associates;
Sharon Dale, aQuip International; Sandra Daly, Dawling
Consultancy; Michael Donahue, IBM; Paul Donald, Lucid IT;
Juan Antonio Fernandez, Quint Wellington Redwood; Juan
Jose Figueiras, Globant; Rae Garrett, Pink Elephant; Klaus
Goedel, HP; Detlef Gross, Automation Consulting Group
GmbH; Matthias Hall, University of Dundee; Lex Hendriks,
EXIN; Jabe Hickey, IBM; Kevin Hite, Microsoft; Eu Jin Ho,
UBS; Michael Imhoff, IBM; Scott Jaegar, Plexant; Tony
Jenkins, DOMAINetc; Tony Kelman-Smith, HP; Peter Koepp,
Independent; Joanne Kopcho, Capgemini America; Debbie
Langenfield, IBM; Sarah Lascelles, Interserve Project
Services Ltd; Peter Loos, Accenture Services GmbH;
Emmanuel Marchand, Advens; Jesus Martin, Ibermatica SA;
Phil Montanaro, EDS; Luis Moran, Independent; Lars Zobbe
Mortensen, Zobbe Consult & Zoftware; Ron Morton, HP;
Darren Murtagh, Retravision; Ulf Myrberg, BiTa; Mette
Nielsen, Carlsberg IT; Steen Sverker Nilsson, Westergaard
CSM; Jan Øberg, ØBERG Partners; Eddy Peters, CTG; Poul
Mols Poulsen, Coop Norden IT; Bill D Powell, IBM; Roger
Purdie, The Art of Service; Padmini Ramamurthy, Satyam
Computer Services Ltd; Frances Scarff, OGC; Nina
Schertiger, HP; Markus Schiemer, Unisys; Barbara Schiesser,
Swiss ICT; Klaus Seidel, Microsoft; Gilbert Silva, Techbiz
Informatica Ltd; Joseph Stephen, Department of
Transportation, US Government; Michala Sterling, Mid
Sussex District Council; Rohan Thuraisingham, Friends
Provident Management Services Ltd; Matthew Tolman,
Sandvik; Jan van Bon, Inform-IT; Maria Vase, ITILLIGENCE;
Christoph Wettstein, CLAVIS klw AG; Andi Wijaya, IBM;
Aaron Wolfe, Pink Elephant; Takashi Yagi, Hitachi;
YoungHoon Youn, IBM.
1 Introduction
This publication provides best-practice advice and
guidance on all aspects of managing the day-to-day
operation of an organization’s information technology (IT)
services. It covers issues relating to the people, processes,
infrastructure technology and relationships necessary to
ensure the high-quality, cost-effective provision of IT
service necessary to meet business needs.

The advent of new technology and the now blurred lines
between the traditional technology silos of hardware,
networks, telephony and software applications
management mean that an updated approach to
managing service operations is needed. Organizations are
increasingly likely to consider different ways of providing
their IT at optimum cost and flexibility, with the
introduction of utility IT, pay-per-use IT Services, virtual IT
provision, dynamic capacity and Adaptive Enterprise
computing, as well as task-sourcing and outsourcing
options.

These alternatives have led to a myriad of IT business
relationships, both internally and externally, that have
increased in complexity as much as the technologies
being managed have. Business dependency on these
complex relationships is increasingly critical to survival
and prosperity.

1.1 OVERVIEW
Service Operation is the phase in the ITSM Lifecycle that is
responsible for ‘business-as-usual’ activities.

Service Operation can be viewed as the ‘factory’ of IT.
This implies a closer focus on the day-to-day activities
and infrastructure that are used to deliver services.
However, this publication is based on the understanding
that the overriding purpose of Service Operation is to
deliver and support services. Management of the
infrastructure and the operational activities must
always support this purpose.

Well planned and implemented processes will be to no
avail if the day-to-day operation of those processes is not
properly conducted, controlled and managed. Nor will
service improvements be possible if day-to-day activities
to monitor performance, assess metrics and gather data
are not systematically conducted during Service Operation.

Service Operation staff should have in place processes and
support tools to allow them to have an overall view of
Service Operation and delivery (rather than just the
separate components, such as hardware, software
applications and networks, that make up the end-to-end
service from a business perspective) and to detect any
threats or failures to service quality.

As services may be provided, in whole or in part, by one
or more partner/supplier organizations, the Service
Operation view of end-to-end service must be extended to
encompass external aspects of service provision – and
where necessary shared or interfacing processes and tools
are needed to manage cross-organizational workflows.

Service Operation is neither an organizational unit nor a
single process – but it does include several functions and
many processes and activities, which are described in
Chapters 4, 5 and 6.

1.2 CONTEXT

1.2.1 Service Management
IT is a commonly used term that changes meaning with
context. From the first perspective, IT systems, applications
and infrastructure are components or sub-assemblies of a
larger product. They enable or are embedded in processes
and services. From the second perspective, IT is an
organization with its own set of capabilities and resources.
IT organizations can be of various types such as business
functions, shared services units and enterprise-level core
units.

From the third perspective, IT is a category of services
utilized by business. They are typically IT applications and
infrastructure that are packaged and offered as services by
internal IT organizations or external service providers. IT
costs are treated as business expenses. From the fourth
perspective, IT is a category of business assets that provide
a stream of benefits for their owners, including, but not
limited to, revenue, income and profit. IT costs are treated
as investments.

1.2.2 Good practice in the public domain
Organizations operate in dynamic environments with the
need to learn and adapt. There is a need to improve
performance while managing trade-offs. Under similar
pressure, customers seek advantage from service
providers. They pursue sourcing strategies that best serve
their own business interest. In many countries,
government agencies and non-profit-making enterprises
have a similar propensity to outsource for the sake of




[Figure 1.1 Source of Service Management Practice. Diagram labels:
Sources (Generate): Standards, Industry practices, Academic research,
Training and education, Internal experience.
Enablers (Aggregate): Employees, Customers, Suppliers, Advisors,
Technologies.
Drivers (Filter): Substitutes, Regulators, Customers.
Scenarios (Filter): Competition, Compliance, Commitments.
Output: Knowledge fit for business objectives, context and purpose.]

operational effectiveness. This puts additional pressure on service
providers to maintain a competitive advantage with regard to the
alternatives that customers may have. The increase in outsourcing has
particularly exposed internal service providers to unusual competition.

To cope with the pressure, organizations benchmark themselves against
peers and seek to close gaps in capabilities. One way to close such
gaps is the adoption of good practices across the industry. There are
several sources for good practices, including public frameworks,
standards and the proprietary knowledge of organizations and
individuals (see Figure 1.1).

Public frameworks and standards are attractive when compared with
proprietary knowledge:
■ Proprietary knowledge is deeply embedded in organizations and
  therefore difficult to adopt, replicate or transfer, even with the
  cooperation of the owners. Such knowledge is often in the form of
  tacit knowledge which is inextricable and poorly documented.
■ Proprietary knowledge is customized for the local context and
  specific business needs, to the point of being idiosyncratic. Unless
  the recipients of such knowledge have matching circumstances, the
  knowledge may not be as effective in use.
■ Owners of proprietary knowledge expect to be rewarded for their
  long-term investments. They may make such knowledge available only
  under commercial terms, through purchases and licensing agreements.
■ Publicly available frameworks and standards such as ITIL, Control
  Objectives for IT (COBIT), CMMI, eSCM-SP, PRINCE2, ISO 9000,
  ISO 20000 and ISO 27001 are validated across a diverse set of
  environments and situations rather than the limited experience of a
  single organization. They are subject to broad review across
  multiple organizations and disciplines. They are vetted by diverse
  sets of partners, suppliers and competitors.
■ The knowledge of public frameworks is more likely to be widely
  distributed among a large community of professionals through
  publicly available training and certification. It is easier for
  organizations to acquire such knowledge through the labour market.

Ignoring public frameworks and standards can needlessly place an
organization at a disadvantage. Organizations should cultivate their
own proprietary knowledge on top




[Figure 1.2 ITIL Core. Diagram labels: Service Strategy, Service Design,
Service Transition, Service Operation, Continual Service Improvement.]


of a body of knowledge based on public frameworks and standards.
Collaboration and coordination across organizations are easier on the
basis of shared practices and standards.

1.2.3 ITIL and good practice in Service Management

The context of this publication is the ITIL Framework as a source of
good practice in Service Management. ITIL is used by organizations
worldwide to establish and improve capabilities in Service Management.
ISO/IEC 20000 provides a formal and universal standard for
organizations seeking to have their Service Management capabilities
audited and certified. While ISO/IEC 20000 is a standard to be achieved
and maintained, ITIL offers a body of knowledge useful for achieving
the standard.

The ITIL Library has the following components:
■ ITIL Core: best-practice guidance applicable to all types of
  organizations that provide services to a business
■ ITIL Complementary Guidance: a complementary set of publications
  with guidance specific to industry sectors, organization types,
  operating models and technology architectures.

The ITIL Core consists of five publications (see Figure 1.2). Each
provides the guidance necessary for an integrated approach as required
by the ISO/IEC 20000 standard specification:
■ Service Strategy
■ Service Design
■ Service Transition
■ Service Operation
■ Continual Service Improvement.

Each publication addresses capabilities having direct impact on a
service provider’s performance. The structure of the core is in the
form of a lifecycle. It is iterative and multidimensional. It ensures
that organizations are set up to leverage capabilities in one area for
learning and improvements in others. The Core is expected to provide
structure, stability and strength to Service Management capabilities,
with durable principles, methods and tools. This serves to protect
investments and provide the necessary basis for measurement, learning
and improvement.

The guidance in ITIL can be adapted for changes of use in various
business environments and organizational strategies. The Complementary
Guidance provides flexibility to implement the Core in a diverse range
of environments. Practitioners can select Complementary Guidance as
needed to provide traction for the Core in a given business context,
much as tyres are selected based on the type of automobile, purpose
and road conditions. This is to increase the durability and
portability of knowledge assets and to protect investments in Service
Management capabilities.



1.2.3.1 Service Strategy
The Service Strategy volume provides guidance on how to design, develop
and implement Service Management, not only as an organizational
capability but also as a strategic asset. Guidance is provided on the
principles underpinning the practice of Service Management which are
useful for developing Service Management policies, guidelines and
processes across the ITIL Service Lifecycle. Service Strategy guidance
is useful in the context of Service Design, Service Transition, Service
Operation and Continual Service Improvement. Topics covered in Service
Strategy include the development of markets, internal and external,
service assets, service catalogue and implementation of strategy
through the Service Lifecycle. Financial Management, Service Portfolio
Management, Organizational Development and Strategic Risks are among
other major topics.

Organizations use the guidance to set objectives and expectations of
performance towards serving customers and market spaces and to
identify, select and prioritize opportunities. Service Strategy is
about ensuring that organizations are in a position to handle the costs
and risks associated with their service portfolios and are set up not
just for operational effectiveness but for distinctive performance.
Decisions made with regard to Service Strategy have far-reaching
consequences, including those with delayed effect.

Organizations already practising ITIL use this volume to guide a
strategic review of their ITIL-based Service Management capabilities
and to improve the alignment between those capabilities and their
business strategies. This volume of ITIL encourages readers to stop and
think about why something is to be done before thinking of how. Answers
to the first type of questions are closer to the customer’s business.
Service Strategy expands the scope of the ITIL Framework beyond the
traditional audience of ITSM professionals.

1.2.3.2 Service Design
The Service Design volume provides guidance for the design and
development of services and service management processes. It covers
design principles and methods for converting strategic objectives into
portfolios of services and service assets. The scope of Service Design
is not limited to new services. It includes the changes and
improvements necessary to increase or maintain value to customers over
the lifecycle of services, the continuity of services, achievement of
service levels and conformance to standards and regulations. It guides
organizations on how to develop design capabilities for Service
Management.

1.2.3.3 Service Transition
The Service Transition volume provides guidance for the development and
improvement of capabilities for transitioning new and changed services
into operations. This publication provides guidance on how the
requirements of Service Strategy encoded in Service Design are
effectively realized in Service Operations while controlling the risks
of failure and disruption. The publication combines practices in
Release Management, Programme Management and Risk Management and places
them in the practical context of Service Management. It provides
guidance on managing the complexity related to changes to services and
Service Management processes, preventing undesired consequences while
allowing for innovation. Guidance is provided on transferring the
control of services between customers and service providers.

1.2.3.4 Service Operation
This volume embodies practices in the management of Service Operations.
It includes guidance on achieving effectiveness and efficiency in the
delivery and support of services so as to ensure value for the customer
and the service provider. Strategic objectives are ultimately realized
through Service Operations, therefore making it a critical capability.
Guidance is provided on how to maintain stability in Service
Operations, allowing for changes in design, scale, scope and service
levels. Organizations are provided with detailed process guidelines,
methods and tools for use in two major control perspectives: reactive
and proactive. Managers and practitioners are provided with knowledge
allowing them to make better decisions in areas such as managing the
availability of services, controlling demand, optimizing capacity
utilization, scheduling of operations and fixing problems. Guidance is
provided on supporting operations through new models and architectures
such as shared services, utility computing, web services and mobile
commerce.

1.2.3.5 Continual Service Improvement
This volume provides instrumental guidance in creating and maintaining
value for customers through better design, introduction and operation
of services. It combines principles, practices and methods from Quality
Management, Change Management and Capability Improvement. Organizations
learn to realize incremental and large-scale improvements in service
quality, operational efficiency and business continuity. Guidance is
provided for linking improvement efforts and outcomes with Service
Strategy, Service Design and Service Transition. A closed-loop feedback
system, based on the

Plan, Do, Check, Act (PDCA) model specified in ISO/IEC 20000, is
established and capable of receiving inputs for change from any
planning perspective.

The day-to-day operational management of IT Services is significantly
influenced by how well an organization’s overall IT service strategy
has been defined and how well the ITSM processes have been planned and
implemented. This is the fourth publication in the ITIL Service
Management Practices series and the other publications on Service
Strategy, Service Design and Service Transition should be consulted for
best practice guidance on these important stages prior to Service
Operation.

Service Operation is extremely important, as it is on a day-to-day
operational basis that events occur which can adversely impact service
quality. The way in which an organization’s IT infrastructure and its
supporting ITSM processes are operated will have the most direct and
immediate short-term bearing upon service quality.

1.3 PURPOSE
Service Operation is a critical phase of the ITSM lifecycle.
Well-planned and well-implemented processes will be to no avail if the
day-to-day operation of those processes is not properly conducted,
controlled and managed. Nor will service improvements be possible if
day-to-day activities to monitor performance, assess metrics and gather
data are not systematically conducted during Service Operation.

Service Operation staff should have in place processes and support
tools to allow them to have an overall view of Service Operation and
delivery (rather than just the separate components, such as hardware,
software applications and networks, that make up the end-to-end service
from a business perspective) and to detect any threats or failures to
service quality.

As services may be provided, in whole or in part, by one or more
partner/supplier organizations, the Service Operation view of
end-to-end service must be extended to encompass external aspects of
service provision – and where necessary shared or interfacing processes
and tools are needed to manage cross-organizational workflows.

1.4 USAGE
This publication should be used in conjunction with the other four
publications that make up the ITIL Service Lifecycle.

Readers should be aware that the best-practice guidelines in this and
other volumes are not intended to be prescriptive. Each organization is
unique and must ‘adapt and adopt’ the guidance for its own specific
needs, environment and culture. This will involve taking into account
the organization’s size, skills/resources, culture, funding, priorities
and existing ITSM maturity and modifying the guidance as appropriate to
suit the organization’s needs.

For organizations finding ITIL for the first time, some form of initial
assessment to compare the organization’s current processes and
practices with those recommended by ITIL would be a very valuable
starting point. These assessments are described in more detail in the
ITIL Continual Service Improvement publication.

Where significant gaps exist, it may be necessary to address them in
stages over a period of time to meet the organization’s business
priorities and keep pace with what the organization is able to absorb
and afford.

1.5 CHAPTER OVERVIEW
Chapter 2 introduces the concept of Service Management as a practice.
Here, Service Management is positioned as a strategic and professional
component of any organization. This chapter also provides an overview
of Service Operation as a critical component of the Service Management
Practice.

The key principles of Service Operation are covered in Chapter 3 of
this publication. These principles outline some of the basic concepts
and principles on which the rest of the publication is based.

Chapter 4 covers the processes performed within Service Operation –
most of the Service Operation processes are reactive because of the
nature of the work being performed to maintain IT services in a robust,
stable condition. This chapter also covers proactive processes to
emphasize that the aim of Service Operation is stability – but not
stagnation. Service Operation should be constantly looking at ways of
doing things better and more cost-effectively, and the proactive
processes have an important role to play here.

Chapter 5 covers a number of Common Service Operation activities, which
are groups of activities and procedures performed by Service Operation
Functions. These specialized, and often technical, activities are not
processes in the true sense of the word, but they are all vital for the
ability to deliver quality IT services at optimal cost.

Chapter 6 covers the organizational aspects of Service Operation – the
individuals or groups who carry out Service Operation processes or
activities – and includes



    some guidance on Service Operation organization
    structures.
    Chapter 7 describes the tools and technology that are
    used during Service Operation.
    Chapter 8 covers some aspects of implementation that will
    need to be considered before the operational phase of the
    lifecycle becomes active.
    Chapter 9 highlights the challenges, Critical Success
    Factors and risks faced during Service Operation, while the
    Afterword summarizes and concludes the publication.
    ITIL does not stand alone in providing guidance to IT
    managers and the appendices outline some of the key
    supplementary frameworks, methodologies and
    approaches that are commonly used in conjunction with
    ITIL during Service Operation.


2 Service Management as a practice
2.1 WHAT IS SERVICE MANAGEMENT?

Service Management is a set of specialized organizational capabilities
for providing value to customers in the form of services. The
capabilities take the form of functions and processes for managing
services over a lifecycle, with specializations in strategy, design,
transition, operation and continual improvement. The capabilities
represent a service organization’s capacity, competency and confidence
for action. The act of transforming resources into valuable services is
at the core of Service Management. Without these capabilities, a
service organization is merely a bundle of resources that by itself has
relatively low intrinsic value for customers.

  Definition of Service Management
  Service Management is a set of specialized organizational
  capabilities for providing value to customers in the form of
  services.

Organizational capabilities are shaped by the challenges they are
expected to overcome. An example of this is how in the 1950s Toyota
developed unique capabilities to overcome the challenge of smaller
scale and financial capital compared to its American rivals. Toyota
developed new capabilities in production engineering, operations
management and managing suppliers to compensate for its inability to
afford large inventories, make components, produce raw materials or own
the companies that produced them. [Source: Magretta, Joan 2002. What
Management Is: How it works and why it’s everyone’s business. The Free
Press.] Service Management capabilities are similarly influenced by the
following challenges that distinguish services from other systems of
value-creation, such as manufacturing, mining and agriculture:
■ Intangible nature of the output and intermediate products of service
  processes: Difficult to measure, control and validate (or prove).
■ Demand is tightly coupled with the customer’s assets: Users and other
  customer assets such as processes, applications, documents and
  transactions arrive with demand and stimulate service production.
■ High level of contact for producers and consumers of services: Little
  or no buffer between the customer, the front-office and the
  back-office.
■ The perishable nature of service output and service capacity: There
  is value for the customer from assurance on the continued supply of
  consistent quality. Providers need to secure a steady supply of
  demand from customers.

However, Service Management is more than just a set of capabilities. It
is also a professional practice supported by an extensive body of
knowledge, experience and skills. A global community of individuals and
organizations in the public and private sectors fosters its growth and
maturity. Formal schemes that exist for the education, training and
certification of practising organizations and individuals influence its
quality. Industry best practices, academic research and formal
standards contribute to its intellectual capital and draw from it.

The origins of Service Management are in traditional service businesses
such as airlines, banks, hotels and phone companies. Its practice has
grown with the adoption by IT organizations of a service-oriented
approach to managing IT applications, infrastructure and processes.
Solutions to business problems and support for business models,
strategies and operations are increasingly in the form of services. The
popularity of shared services and outsourcing has contributed to the
increase in the number of organizations that are service providers,
including internal organizational units. This in turn has strengthened
the practice of Service Management and at the same time imposed greater
challenges upon it.

2.2 WHAT ARE SERVICES?

2.2.1 The value proposition

  Definition of service
  A service is a means of delivering value to customers by facilitating
  outcomes customers want to achieve, without the ownership of specific
  costs and risks.

Services are a means of delivering value to customers by facilitating
outcomes customers want to achieve, without the ownership of specific
costs and risks. Services facilitate outcomes by enhancing the
performance of associated tasks and reducing the effect of constraints.
The result is an increase in the probability of desired outcomes.



[Figure 2.1 A conversation about the definition and meaning of services
(a casual conversation at the water-cooler)]

Manager (Operations): I must ask, do you have a definition for
services?
Manager (Strategy): I believe services are a means of delivering value
by facilitating outcomes customers want to achieve without the
ownership of specific costs and risks.
Manager (Operations): What would that mean in operational terms? Give
me a few handles.
Manager (Strategy): Well, services facilitate outcomes by having a
positive effect on activities, objects and tasks, to create conditions
for better performance. As a result, the probability of desired
outcomes is higher.
Manager (Operations): But without the ownership of costs and risks?
Customers cannot wish them away.
Manager (Strategy): No, they cannot but what they can do is let the
provider take ownership. That’s really why it is a service. If
customers manage it all by themselves, they wouldn’t need a service
would they?
Manager (Operations): Aha! Because the provider is specialized with
capabilities for dealing with those costs and risks.
Manager (Strategy): Yes, and also because the customer would rather
specialize in those outcomes.
Manager (Operations): And also because the provider can potentially
spread those costs and risks across more than one customer.
Manager (Strategy): Let’s write a book on service management!

 2.3 FUNCTIONS AND PROCESSES ACROSS THE LIFECYCLE

 2.3.1 Functions
 Functions are units of organizations specialized to perform certain types of work and responsible
 for specific outcomes. They are self-contained, with capabilities and resources necessary for their
 performance and outcomes. Capabilities include work methods internal to the functions. Functions
 have their own body of knowledge, which accumulates from experience. They provide structure and
 stability to organizations.

 Functions are a means of structuring organizations so as to implement the specialization principle.
 Functions typically define roles and the associated authority and responsibility for a specific
 performance and outcomes. Coordination between functions through shared processes is a common
 pattern in organization design. Functions tend to optimize their work methods locally, to focus on
 assigned outcomes. Poor coordination between functions, combined with an inward focus, leads to
 functional silos that hinder alignment and feedback critical to the success of the organization as
 a whole. Process models help avoid this problem with functional hierarchies by improving
 cross-functional coordination and control. Well-defined processes can improve productivity within
 and across functions.

 2.3.2 Processes
 Processes are examples of closed-loop systems because they provide change and transformation
 towards a goal and utilize feedback for self-reinforcing and self-corrective action (see
 Figure 2.2). It is important to consider the entire process or how one process fits into another.
 Process definitions describe actions, dependencies and sequence. Processes have the following
 characteristics:

 ■ Measurable: We are able to measure the process in a relevant manner. It is performance driven.
   Managers want to measure cost, quality and other variables, while practitioners are concerned
   with duration and productivity.
 ■ Specific results: The reason a process exists is to deliver a specific result. This result must
   be individually identifiable and countable. While we can count changes, it is impossible to
   count how many Service Desks were completed.
 ■ Customers: Every process delivers its primary results to a customer or stakeholder. They may be
   internal or external to the organization but the process must meet their expectations.
 ■ Responds to a specific event: While a process may be ongoing or iterative, it should be
   traceable to a specific trigger.
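
 The closed-loop idea in 2.3.2 can be illustrated with a small sketch. It is not part of the ITIL
 guidance: the class name Process, the numeric target and the correction field are hypothetical,
 chosen only to show a trigger passing through activities, a countable result being recorded, and
 feedback being used for corrective action.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Process:
    """Hypothetical model of a process as a closed-loop system."""
    name: str
    activities: List[Callable[[float], float]]  # ordered activities that transform the input
    target: float                               # the goal the process works towards (assumed metric)
    correction: float = 0.0                     # adjustment learned from feedback
    results: List[float] = field(default_factory=list)  # individually countable results

    def run(self, trigger_input: float) -> float:
        """Respond to one trigger and deliver one specific, recorded result."""
        value = trigger_input
        for activity in self.activities:
            value = activity(value)
        value += self.correction
        self.results.append(value)
        return value

    def apply_feedback(self) -> float:
        """Measure the gap to the target and feed it back as a correction."""
        gap = self.target - self.results[-1]
        self.correction += gap
        return gap


if __name__ == "__main__":
    demo = Process("demo", [lambda x: x + 1, lambda x: x * 2], target=10.0)
    first = demo.run(3.0)        # result before any feedback
    gap = demo.apply_feedback()  # measured shortfall against the target
    second = demo.run(3.0)       # same trigger, now corrected
    print(first, gap, second, len(demo.results))
```

 The list of results mirrors the 'specific results' characteristic (each run is countable), and
 apply_feedback stands in for the self-corrective action; a real process would of course measure
 cost, quality or duration rather than a single number.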


[Figure: process diagram. Labelled elements: Suppliers; Data, information and knowledge; Process; Trigger; Activity 1, Activity 2, Activity 3; Service control and quality; Desired Outcome; Customer]

Figure 2.2 A basic process

Functions are often mistaken for processes. For example, there are misconceptions about Capacity
Management being a Service Management process. First, Capacity Management is an organizational
capability with specialized processes and work methods. Whether it is a function or a process
depends entirely on organization design. It is a mistake to assume that Capacity Management can
only be a process. It is possible to measure and control capacity and to determine whether it is
adequate for a given purpose. Assuming that it is always a process, with discrete countable
outcomes, can be an error.

2.3.3 Specialization and coordination across the lifecycle
Specialization and coordination are necessary in the lifecycle approach. Feedback and control
between the functions and processes within and across the elements of the lifecycle make this
possible. The dominant pattern in the lifecycle is the sequential progress starting from SS
through SD-ST-SO and back to SS through CSI. However, that is not the only pattern of action.
Every element of the lifecycle provides points for feedback and control.

The combination of multiple perspectives allows greater flexibility and control across
environments and situations. The lifecycle approach mimics the reality of most organizations
where effective management requires the use of multiple control perspectives. Those responsible
for the design, development and improvement of processes for Service Management can adopt a
process-based control perspective. Those responsible for managing agreements, contracts and
services may be better served by a lifecycle-based control perspective with distinct phases. Both
these control perspectives benefit from systems thinking. Each control perspective can reveal
patterns that may not be apparent from the other.

2.4 SERVICE OPERATION FUNDAMENTALS

2.4.1 Purpose/goal/objective
The purpose of Service Operation is to coordinate and carry out the activities and processes
required to deliver and manage services at agreed levels to business users and customers. Service
Operation is also responsible for the ongoing management of the technology that is used to deliver
and support services.

Well-designed and well-implemented processes will be of little value if the day-to-day operation
of those processes is not properly conducted, controlled and managed. Nor will service
improvements be possible if day-to-day activities to monitor performance, assess metrics and
gather data are not systematically conducted during Service Operation.

2.4.2 Scope
Service Operation includes the execution of all ongoing activities required to deliver and support
services. The scope of Service Operation includes:

■ The services themselves. Any activity that forms part of a service is included in Service
  Operation, whether it is performed by the Service Provider, an external supplier or the user or
  customer of that service.
■ Service Management processes. The ongoing management and execution of many Service Management
  processes are performed in Service Operation. Even though a number of ITIL processes (such as
  Change and Capacity Management) originate at the Service Design or Service Transition stage of
  the Service Lifecycle, they are in use continually in Service Operation. Some processes are not
  included specifically in Service Operation, such as Strategy Definition and the actual design
  process itself. These processes focus more on longer-term planning and improvement activities,
  which are outside the direct scope of Service Operation; however, Service Operation provides
  input and influences these regularly as part of the lifecycle of Service Management.
■ Technology. All services require some form of technology to deliver them. Managing this
  technology is not a separate issue, but an integral part of the management of the services
  themselves. Therefore a large part of this publication is concerned with the management of the
  infrastructure used to deliver services.
■ People. Regardless of what services, processes and technology are managed, they are all about
  people. It is people who drive the demand for the organization’s services and products and it is
  people who decide how this will be done. Ultimately, it is people who manage the technology,
  processes and services. Failure to recognize this will result (and has resulted) in the failure
  of Service Management projects.

2.4.3 Value to business
Each stage in the ITIL Service Lifecycle provides value to business. For example, service value is
modelled in Service Strategy; the cost of the service is designed, predicted and validated in
Service Design and Service Transition; and measures for optimization are identified in Continual
Service Improvement. The operation of service is where these plans, designs and optimizations are
executed and measured. From a customer viewpoint, Service Operation is where actual value is seen.

There is a down side to this, though:

■ Once a service has been designed and tested, it is expected to run within the budgetary and
  Return on Investment targets established earlier in the lifecycle. In reality, however, very few
  organizations plan effectively for the costs of ongoing management of services. It is very easy
  to quantify the costs of a project, but very difficult to quantify what the service will cost
  after three years of operation.
■ It is difficult to obtain funding during the operational phase, to fix design flaws or
  unforeseen requirements – since this was not part of the original value proposition. In many
  cases it is only after some time in operation that these problems surface. Most organizations do
  not have a formal mechanism to review operational services for design and value. This is left to
  Incident and Problem Management to resolve – as if it is purely an operational issue.
■ It is difficult to obtain additional funding for tools or actions (including training) aimed at
  improving the efficiency of Service Operation. This is partly because they are not directly
  linked to the functionality of a specific service and partly because there is an expectation
  from the customer that these costs should have been built into the cost of the service from the
  beginning. Unfortunately, the rate of technology change is very high. Shortly after a solution
  has been deployed that will efficiently manage a set of services, new technology becomes
  available that can do it faster, cheaper and more effectively.
■ Once a service has been operational for some time, it becomes part of the baseline of what the
  business expects from the IT services. Attempts to optimize the service or to use new tools to
  manage it more effectively are seen as successful only if the service has been very problematic
  in the past. In other words, some services are taken for granted and any action to optimize them
  is perceived as ‘fixing services that are not broken’.

This publication suggests a number of processes, functions and measures which are aimed at
addressing these areas.

2.4.4 Optimizing Service Operation performance
Service Operation is optimized in two ways:

■ Long-term incremental improvement. This is based on evaluating the performance and output of
  all Service Operation processes, functions and outputs over time. The reports are analysed and a
  decision made about whether improvement is needed and, if so, how best to implement it through
  Service Design and Transition. Examples include the deployment of a new set of tools, changes to
  process designs, reconfiguration of the infrastructure, etc. This type of improvement is covered
  in detail in the Continual Service Improvement publication.

■ Short-term ongoing improvement of working practices within the Service Operation processes,
  functions and technology itself. These are generally smaller improvements that are implemented
  without any change to the fundamental nature of a process or technology. Examples include
  tuning, workload balancing, personnel redeployment and training, etc.

Although both of these are discussed in some detail within the scope of Service Operation, the
Continual Service Improvement publication will provide a framework and alternatives within which
improvement may be driven as part of the overall support of business objectives.

2.4.5 Processes within Service Operation
There are a number of key Service Operation processes that must link together to provide an
effective overall IT support structure. The overall structure is briefly described here and then
each of the processes is described in more detail in Chapter 4.

2.4.5.1 Event Management
Event Management monitors all events that occur throughout the IT infrastructure, to monitor
normal operation and to detect and escalate exception conditions.

2.4.5.2 Incident and Problem Management
Incident Management concentrates on restoring unexpectedly degraded or disrupted services to users
as quickly as possible, in order to minimize business impact.

Problem Management involves: root-cause analysis to determine and resolve the cause of incidents,
proactive activities to detect and prevent future problems/incidents and a Known Error sub-process
to allow quicker diagnosis and resolution if further incidents do occur.

2.4.5.3 Request Fulfilment
Request Fulfilment is the process for dealing with Service Requests – many of them actually
smaller, lower-risk, changes – initially via the Service Desk, but using a separate process
similar to that of Incident Management but with separate Request Fulfilment records/tables – where
necessary linked to the Incident or Problem Record(s) that initiated the need for the request. To
be a Service Request, it is normal for some prerequisites to be defined and met (e.g. needs to be
proven, repeatable, pre-approved, proceduralized).

In order to resolve one or more incidents, problems or Known Errors, some form of change may be
necessary. Smaller, often standard, changes can be handled through a Request Fulfilment process,
but larger, higher-risk or infrequent changes must go through a formal Change Management process.

2.4.5.4 Access Management
Access Management is the process of granting authorized users the right to use a service, while
restricting access to non-authorized users. It is based on being able to identify authorized users
accurately and then manage their ability to access services as required during different stages of
their Human Resources (HR) or contractual lifecycle. Access Management has also been called
Identity or Rights Management in some organizations.

2.4.6 Functions within Service Operation
Processes alone will not result in effective Service Operation. A stable infrastructure and
appropriately skilled people are needed as well. To achieve this, Service Operation relies on
several groups of skilled people, all focused on using processes to match the capability of the
infrastructure to the needs of the business.

These groups fall into four main functions, listed here and discussed in detail in Chapter 6.

2.4.6.1 Service Desk
The Service Desk is the primary point of contact for users when there is a service disruption, for
Service Requests, or even for some categories of Request for Change. The Service Desk provides a
point of communication to the users and a point of coordination for several IT groups and
processes.

2.4.6.2 Technical Management
Technical Management provides detailed technical skills and resources needed to support the
ongoing operation of the IT Infrastructure. Technical Management also plays an important role in
the design, testing, release and improvement of IT services. In small organizations, it is
possible to manage this expertise in a single department, but larger organizations are typically
split into a number of technically specialized departments.
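
The routing rule described under 2.4.5.3 (standard, lower-risk changes to Request Fulfilment;
larger, higher-risk or infrequent changes to formal Change Management) can be sketched as a small
decision function. This is only an illustration under stated assumptions: the field names, the
Route enumeration and the treatment of the four prerequisites as simple booleans are hypothetical
and are not defined by this publication.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REQUEST_FULFILMENT = "Request Fulfilment"
    CHANGE_MANAGEMENT = "Change Management"


@dataclass
class Request:
    """Hypothetical record for a requested change; all fields are assumptions."""
    summary: str
    proven: bool          # the change has been carried out successfully before
    repeatable: bool      # it can be performed the same way each time
    pre_approved: bool    # approval has been granted in advance
    proceduralized: bool  # a documented procedure exists
    high_risk: bool       # larger, higher-risk or infrequent change


def route(request: Request) -> Route:
    """Send a request to Request Fulfilment only if every prerequisite is met."""
    meets_prerequisites = (
        request.proven
        and request.repeatable
        and request.pre_approved
        and request.proceduralized
    )
    if meets_prerequisites and not request.high_risk:
        return Route.REQUEST_FULFILMENT
    # Anything else goes through the formal Change Management process.
    return Route.CHANGE_MANAGEMENT


if __name__ == "__main__":
    reset = Request("Password reset", True, True, True, True, high_risk=False)
    migration = Request("Database migration", False, False, False, False, high_risk=True)
    print(route(reset).value)
    print(route(migration).value)
```

Defaulting to Change Management when any prerequisite is missing reflects the cautious reading of
the text: a request is only treated as a Service Request once the prerequisites are defined and met.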



 2.4.6.3 IT Operations Management
 IT Operations Management executes the daily operational activities needed to manage the IT
 Infrastructure. This is done according to the Performance Standards defined during Service Design.
 In some organizations this is a single, centralized department, while in others some activities
 and staff are centralized and some are provided by distributed or specialized departments. IT
 Operations Management has two functions that are unique and are generally formal organizational
 structures. These are:

 ■ IT Operations Control, which is generally staffed by shifts of operators and which ensures that
   routine operational tasks are carried out. IT Operations Control will also provide centralized
   monitoring and control activities, usually using an Operations Bridge or Network Operations
   Centre.
 ■ Facilities Management refers to the management of the physical IT environment, usually data
   centres or computer rooms. In many organizations Technical and Application Management are
   co-located with IT Operations in large data centres.

 2.4.6.4 Application Management
 Application Management is responsible for managing Applications throughout their lifecycle. The
 Application Management function supports and maintains operational applications and also plays an
 important role in the design, testing and improvement of applications that form part of IT
 services. Application Management is usually divided into departments based on the application
 portfolio of the organization, thus allowing easier specialization and more focused support.

 2.4.6.5 Interfaces to other Service Management Lifecycle stages
 There are several other processes that will be executed or supported during Service Operation, but
 which are driven during other phases of the Service Management Lifecycle. These will be discussed
 in the final part of Chapter 4 and include:

 ■ Change Management, which is a major process that should be closely linked to Configuration
   Management and Release Management. These topics are primarily covered in the Service Transition
   publication.
 ■ Capacity and Availability Management, which are covered in the Service Design publication.
 ■ Financial Management, which is covered in the Service Strategy publication.
 ■ Knowledge Management, which is covered in the Service Transition publication.
 ■ IT Service Continuity, which is covered in the Service Design publication.
 ■ Service Reporting and Measurement, which are covered in the Continual Service Improvement
   publication.


3 Service Operation principles
When considering Service Operation it is tempting to focus only on managing day-to-day activities
and technology as ends in themselves. However, Service Operation exists within a far greater
context. As part of the Service Management Lifecycle, it is responsible for executing and
performing processes that optimize the cost and quality of services. As part of the organization,
it is responsible for enabling the business to meet its objectives. As part of the world of
technology, it is responsible for the effective functioning of components that support services.
The principles in this chapter are aimed at helping Service Operation practitioners to achieve a
balance between all of these roles and to focus on effectively managing the day-to-day aspects
while maintaining a perspective of the greater context.

3.1 FUNCTIONS, GROUPS, TEAMS, DEPARTMENTS AND DIVISIONS
The Service Operation publication uses several terms to refer to the way in which people are
organized to execute processes or activities. There are several published definitions for each
term and it is not the purpose of this publication to enter the debate about which definition is
best. Please note that the following definitions are generic and not prescriptive. They are
provided simply to define assumptions and to facilitate understanding of the material. The reader
should adapt these principles to the organizational practices used in their own organization.

■ Function: A function is a logical concept that refers to the people and automated measures that
  execute a defined process, an activity or a combination of processes or activities. In larger
  organizations, a function may be broken out and performed by several departments, teams and
  groups, or it may be embodied within a single organizational unit (e.g. Service Desk). In
  smaller organizations, one person or group can perform multiple functions – e.g. a Technical
  Management department could also incorporate the Service Desk function.
■ Group: A group is a number of people who are similar in some way. In this publication, groups
  refer to people who perform similar activities – even though they may work on different
  technology or report into different organizational structures or even in different companies.
  Groups are usually not formal organization structures, but are very useful in defining common
  processes across the organization – e.g. ensuring that all people who resolve incidents complete
  the Incident Record in the same way. In this publication the term ‘group’ does not refer to a
  group of companies that are owned by the same entity.
■ Team: A team is a more formal type of group. These are people who work together to achieve a
  common objective, but not necessarily in the same organization structure. Team members can be
  co-located, or work in multiple different locations and operate virtually. Teams are useful for
  collaboration, or for dealing with a situation of a temporary or transitional nature. Examples
  of teams include project teams, application development teams (often consisting of people from
  several different business units) and incident or problem resolution teams.
■ Department: Departments are formal organization structures which exist to perform a specific set
  of defined activities on an ongoing basis. Departments have a hierarchical reporting structure
  with managers who are usually responsible for the execution of the activities and also for
  day-to-day management of the staff in the department.
■ Division: A division refers to a number of departments that have been grouped together, often by
  geography or product line. A division is normally self-contained and is able to plan and execute
  all activities in a supply chain.
■ Role: A role refers to a set of connected behaviours or actions that are performed by a person,
  team or group in a specific context. For example, a Technical Management department can perform
  the role of Problem Management when diagnosing the root cause of incidents. This same department
  could also be expected to play several other roles at different times, e.g. it may assess the
  impact of changes (Change Management role), manage the performance of devices under their
  control (Capacity Management role), etc. The scope of their role and what triggers them to play
  that role are defined by the relevant process and agreed by their line manager.

3.2 ACHIEVING BALANCE IN SERVICE OPERATION
Service Operation is more than just the repetitive execution of a standard set of procedures or
activities. All functions, processes and activities are designed to deliver a specified and agreed
level of services, but they have to be delivered in an ever-changing environment.

This forms a conflict between maintaining the status quo and adapting to changes in the business
and technological environments. One of Service Operation’s key roles is therefore to deal with
this conflict and to achieve a balance between conflicting sets of priorities.

This section of the publication highlights some of the key tensions and conflicts and identifies
how IT organizations can recognize that they are suffering from an imbalance by tending more
towards one extreme or the other. It also provides some high-level guidelines on how to resolve
the conflict and thus move towards a best-practice approach. Every conflict therefore represents
an opportunity for growth and improvement.

3.2.1 Internal IT view versus external business view
The most fundamental conflict in all phases of the ITSM Lifecycle is between the view of IT as a
set of IT services (the external business view) and the view of IT as a set of technology
components (internal IT view).

■ The external view of IT is the way in which services are experienced by its users and customers.
  They do not always understand, nor do they care about, the details of what technology is used to
  manage those services. All they are concerned about is that the services are delivered as
  required and agreed.
■ The internal view of IT is the way in which IT components and systems are managed to deliver the
  services. Since IT systems are complex and diverse, this often means that the technology is
  managed by several different teams or departments – each of which is focused on achieving good
  performance and availability of ‘its’ systems.

Both views are necessary when delivering services. The organization that focuses only on business
requirements without thinking about how they are going to deliver will end up making promises that
cannot be kept. The organization that focuses only on internal systems without thinking about what
services they support will end up with expensive services that deliver little value.

The potential for role conflict between the external and internal views is the result of many
variables, including the maturity of the organization, its management culture, its history, etc.
This makes a balance difficult to achieve, and most organizations tend more towards one role than
the other. Of course, no organization will be totally internally or externally focused, but will
find itself in a position along a spectrum between the two. This is illustrated in Figure 3.1:

[Figure: a spectrum running from ‘Extreme Focus on Internal’ to ‘Extreme Focus on External’, with
two marked positions labelled ‘An organization here is out of balance and is in danger of not
meeting business requirements’ and ‘An organization here is quite balanced, but tends to
under-deliver on promises to the business’]

Figure 3.1 Achieving a balance between external and internal focus

Table 3.1 outlines some examples of the characteristics of positions at the extreme ends of the
spectrum. The purpose of this table is to assist organizations in identifying to which extreme
they are closer, not to identify real-life positions to which organizations should aspire.
                                                                                              Service Operation principles |       21

Table 3.1 Examples of extreme internal and external focus
                Extreme internal focus                                  Extreme external focus
Primary focus Performance and management of IT Infrastructure           Achieving high levels of IT service performance with
              devices, systems and staff, with little regard to the     little regard to how it is achieved
              end result on the IT service

Metrics         ■   Focus on technical performance without              ■   Focus on External Metrics without showing internal
                    showing what this means for services                    staff how these are derived or how they can be
                ■   Internal metrics (e.g. network uptime) reported         improved
                    to the business instead of service performance      ■   Internal staff are expected to devise their own
                    metrics.                                                metrics to measure internal performance.

Customer/user   ■   High consistency of delivery, but only delivers a   ■   Poor consistency of delivery
experience          percentage of what the business needs.              ■   ‘IT consists of good people with good intentions,
                ■   Uses a ‘push’ approach to delivery, i.e. prefers        but cannot always execute’
                    to have a standard set of services for all          ■   Reactive mode of operation.
                    business units.                                     ■   Uses a ‘pull’ approach to delivery, i.e. prefers to
                                                                            deliver customized services upon request

Operations      ■   Standard operations across the board                ■   Multiple delivery teams and multiple technologies
strategy        ■   All new services need to fit into the current       ■   New technologies require new operations
                    architecture and procedures.                            approaches and often new IT Operations teams.

Procedures      Focus purely on how to manage the technology,           Focuses primarily on what needs to be done and when
and manuals     not on how its performance relates to IT services       and less on how this should be achieved

Cost strategy   ■   Cost reduction achieved purely through              ■   Budget allocated on the basis of which business unit
                    technology consolidation                                is perceived to have the most need
                ■   Optimization of operational procedures and          ■   Less articulate or vocal business units often have
                    resources                                               inferior services as there is not enough funding
                ■   Business impact of cost cutting often only              allocated to their services.
                    understood later
                ■   Return on Investment calculations are focused
                    purely on cost savings or ‘payback periods’.

Training        Training is conducted as an apprenticeship, where       ■   Training is conducted on a project-by-project basis
                new Operations staff have to learn the way things       ■   There are no standard training courses since
                have to be done, not why                                    operational procedures and technology are
                                                                            constantly changing.

Operations      ■   Specialized staff, organized according to           ■   Generalist staff, organized partly according to
staff               technical specialty                                     technical capability and partly according to their
                ■   Staff work on the false assumption that good            relationship with a business unit
                    technical achievement is the same as good           ■   Reliance on ‘heroics’, where staff go out of their
                    customer service.                                       way to resolve problems that could have been
                                                                            prevented by better internal processes.



 This does not mean that the external focus is unimportant.
 The whole point of Service Management is to provide
 services that meet the objectives of the organization as a
 whole. It is critical to structure services around customers.
 At the same time, it is possible to compromise the
 quality of services by not thinking about how they
 will be delivered.

 Building Service Operation with a balance between
 internal and external focus requires a long-term, dedicated
 approach reflected in all phases of the ITSM Service
 Lifecycle. This will require the following:

 ■ An understanding of what services are used by the
   business and why.
 ■ An understanding of the relative importance and
   impact of those services on the business.
 ■ An understanding of how technology is used to
   provide IT services.
 ■ Involvement of Service Operation in Continual Service
   Improvement projects that aim to identify ways of
   delivering more, increasing service quality and lowering
   cost.
 ■ Procedures and manuals that outline the role of IT
   Operations in both the management of technology
   and the delivery of IT services.
 ■ A clearly differentiated set of metrics to report to the
   business on the achievement of service objectives; and
   to report to IT managers on the efficiency and
   effectiveness of Service Operation.
 ■ An understanding by all IT Operations staff of exactly
   how the performance of the technology affects the
   delivery of IT services and in turn how these affect the
   business and the business goals.
 ■ A set of standard services delivered consistently to all
   Business Units and a set of non-standard (sometimes
   customized) services delivered to specific Business
   Units – together with a set of Standard Operating
   Procedures (SOPs) that can meet both sets of
   requirements.
 ■ A cost strategy aimed at balancing the requirements
   of different business units with the cost savings
   available through optimization of existing technology
   or investment in new technology – and an
   understanding of the cost strategy by all involved
   IT resources.
 ■ A value-based, rather than cost-based, Return on
   Investment strategy.
 ■ Involvement of IT Operations staff in the Service
   Design and Service Transition phases of the
   ITSM Lifecycle.
 ■ Input from and feedback to Continual Service
   Improvement to identify areas where there is an
   imbalance and the means to identify and enforce
   improvement.
 ■ A clear communication and training plan for business.
   While many organizations are good at developing
   Communication Plans for projects, this often does not
   extend into their operational phase.

 3.2.2 Stability versus responsiveness
 No matter how good the functionality of an IT service is
 and no matter how well it has been designed, it will be
 worth far less if the service components are not available
 or if they perform inconsistently.

 This means that Service Operation needs to ensure that
 the IT Infrastructure is stable and available as designed. At
 the same time, Service Operation needs to recognize that
 business and IT requirements change.

 Some of these changes are evolutionary. For example, the
 functionality, performance and architecture of a platform
 may change over a number of years. Each change brings
 with it an opportunity to provide better levels of service to
 the business. In evolutionary changes, it is possible to plan
 how to respond to the change and thus maintain stability
 while responding to the changes.

 Many changes, though, happen very quickly and
 sometimes under extreme pressure. For example, a
 Business Unit unexpectedly wins a contract that requires
 additional IT services, more capacity and faster response
 times. The ability to respond to this type of change
 without impacting other services is a significant challenge.

 Many IT organizations are unable to achieve this balance
 and tend to focus on either the stability of the IT
 Infrastructure or the ability to respond to changes quickly.

 Figure 3.2 Achieving a balance between focus on
 stability and responsiveness
 [Figure: a spectrum running from ‘Extreme Focus on Stability’
 to ‘Extreme Focus on Responsiveness’ with two marked
 positions, left to right: ‘An organization here is out of
 balance and is in danger of ignoring changing business
 requirements’; ‘An organization here is quite balanced, but
 may tend to overspend on change’.]

Table 3.2     Examples of extreme focus on stability and responsiveness
                Extreme focus on stability                             Extreme focus on responsiveness
Primary focus   ■   Technology                                          ■   Output to the business
                ■   Developing and refining standard IT management      ■   Agrees to required changes before determining what
                    techniques and processes.                               it will take to deliver them.

Typical         IT can demonstrate that it is complying with SOPs       IT staff are not available to define or execute routine
problems        and Operational Level Agreements (OLAs), even when      tasks because they are busy on projects for new
experienced     there is clear misalignment to business requirements    services

Technology      ■   Growth strategy based on analysing existing         ■   Technology purchased for each new business
growth              demand on existing systems                              requirement
strategy        ■   New services are resisted and Business Units        ■   Using multiple technologies and solutions for similar
                    sometimes take ownership of ‘their own’                 solutions, to meet slightly different business needs.
                    systems to get access to new services.

Technology      Existing or standard technology to be used; services    Over-provisioning. No attempt is made to model the
used to         must be adjusted to work within existing parameters     new service on the existing infrastructure. New,
deliver IT                                                              dedicated technology is purchased for each new project
services

Capacity        ■   Forecasts based on projections of current           ■   Forecasts based on future business activity for each
Management          workloads                                               service individually and do not take into account IT
                ■   System performance is maintained at consistent          activity or other IT services
                    levels through tuning and demand management,        ■   Existing workloads not relevant.
                    not by workload forecasting and management.



Table 3.2 outlines some examples of the characteristics of
positions at extreme ends of the spectrum. The purpose of
this table is to assist organizations in identifying to which
extreme they are closer, not to identify real-life positions
to which organizations should aspire.

Building an IT organization that achieves a balance
between stability and responsiveness in Service Operation
will require the following actions:

■ Ensure investment in technologies and processes that
  are adaptive rather than rigid, e.g. virtual server and
  application technology and the use of Change Models
  (see Service Transition publication).
■ Build a strong Service Level Management (SLM)
  process which is active from the Service Design phase
  to the Continual Service Improvement phase of the
  ITSM Lifecycle.
■ Foster integration between SLM and the other Service
  Design processes to ensure proper mapping of
  business requirements to IT operational activities and
  components of the IT Infrastructure. This makes it
  easier to model the effect of changes and
  improvements.
■ Initiate changes at the earliest appropriate stage in the
  ITSM Lifecycle. This will ensure that both functional
  (business) and manageability (IT operational)
  requirements can be assessed and built or changed
  together.
■ Ensure IT involvement in business changes as early as
  possible in the change process to ensure scalability,
  consistency and achievability of IT services sustaining
  business changes.
■ Service Operation teams should provide input into the
  ongoing design and refinement of the architectures
  and IT services (see Service Design and Service
  Strategy publications).
■ Implement and use SLM to avoid situations where
  business and IT managers and staff negotiate informal
  agreements.

3.2.3 Quality of service versus cost of service
Service Operation is required consistently to deliver the
agreed level of IT service to its customers and users, while
at the same time keeping costs and resource utilization at
an optimal level.




 Figure 3.3 Balancing service quality and cost
 [Figure: a curve plotting Cost of Service (vertical axis)
 against Quality of Service (Performance, Availability,
 Recovery) (horizontal axis), with a marked range of optimal
 balance between Cost and Quality.]

 Figure 3.3 represents the investment made to deliver a
 service at increasing levels of quality.

 In Figure 3.3, an increase in the level of quality usually
 results in an increase in the cost of that service, and vice
 versa. However, the relationship is not always directly
 proportional:

 ■ Early in the service’s lifecycle it is possible to achieve
   significant increases in service quality with a relatively
   small amount of money. For example, improving
   service availability from 55% to 75% is fairly
   straightforward and may not require a huge
   investment.
 ■ Later in the service’s lifecycle, even small
   improvements in quality are very expensive. For
   example, improving the same service’s availability from
   96% to 99.9% may require large investments in
   high-availability technology and support staff and tools.

 While this may seem straightforward, many organizations
 are under severe pressure to increase the quality of service
 while reducing their costs. In Figure 3.3, the relationship
 between cost and quality is sometimes inverse. It is
 possible (usually inside the range of optimization) to
 increase quality while reducing costs. This is normally
 initiated within Service Operation and carried forward by
 Continual Service Improvement. Some costs can be
 reduced incrementally over time, but most cost savings
 can be made only once. For example, once a duplicate
 software tool has been eliminated, it cannot be eliminated
 again for further cost savings.

 Achieving an optimal balance between cost and quality
 (shown between the dotted lines in Figure 3.3) is a key
 role of Service Management. There is no industry standard
 for what this range should be, since each service will have
 a different range of optimization, depending on the nature
 of the service and the type of business objective being
 met. For example, the business may be prepared to spend
 more to achieve high availability on a mission-critical
 service, while it is prepared to live with the lower quality
 of an administrative tool.

 Determining the appropriate balance of cost and quality
 should be done during the Service Strategy and Service
 Design Lifecycle phases, although in many organizations it
 is left to the Service Operation teams – many of whom do
 not generally have all the facts or authority to be able to
 make this type of decision.
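
 The availability examples quoted in this section (55% to 75%, and
 96% to 99.9%) can be made more concrete by converting each level
 into the downtime it permits. The Python sketch below is
 illustrative only: the function name downtime_hours and the
 assumption of a service measured around the clock over a 365-day
 year are not taken from this publication.

```python
# Illustrative sketch: convert an availability percentage into the
# downtime it permits over a measurement period.
# Assumption (not from the publication): a 24x7 service measured
# over a 365-day year.

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (100 - availability_pct) / 100


if __name__ == "__main__":
    # The four availability levels mentioned in the text.
    for level in (55, 75, 96, 99.9):
        print(f"{level:>5}% availability permits "
              f"{downtime_hours(level):8.2f} hours of downtime per year")
```

 Under these assumptions, 96% availability still permits about 350
 hours of downtime a year, while 99.9% permits fewer than 9. This is
 one way of seeing why the later improvement may demand far more
 investment than the earlier one, even though the percentage change
 is smaller.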

Unfortunately, it is also common to find organizations that
are spending vast quantities of money without achieving
any clear improvements in quality. Again, Continual
Service Improvement will be able to identify the cause of
the inefficiency, evaluate the optimal balance for that
service and formulate a corrective plan.

Achieving the correct balance is important. Too much
focus on quality will result in IT services that deliver more
than necessary, at a higher cost, and could lead to a
discussion on reducing the price of services. Too much
focus on cost will result in IT delivering on or under
budget, but putting the business at risk through
sub-standard IT services.

  Special note: just how far is too much?
  Over the past several years, IT organizations have
  been under pressure to cut costs. In many cases this
  resulted in optimized costs and quality. But, in other
  cases, costs were cut to the point where quality
  started to suffer. At first, the signs were subtle – small
  increases in incident resolution times and a slight
  increase in the number of incidents. Over time,
  though, the situation became more serious as staff
  worked long hours to handle multiple workloads and
  services ran on ageing or outdated infrastructure.
  There is no simple calculation to determine when
  costs have been cut too far, but good SLM is crucial
  to making customers aware of the impact of cutting
  too far, so recognizing these warning signs and
  symptoms can greatly enhance an organization’s
  ability to correct this situation.

Service Level Requirements – together with a clear
understanding of the business purpose of the service and
the potential risks – will help to ensure that the service is
delivered at the appropriate cost. They will also help to
avoid ‘over sizing’ of the service just because budget is
available, or ‘under sizing’ because the business does not
understand the manageability requirements of the
solution. Either result will cause customer dissatisfaction
and even more expense when the solution is
re-engineered or retro-fitted to the requirements that should
have been specified during Service Design.

Figure 3.4 Achieving a balance between focus on cost
and quality
[Figure: a spectrum running from ‘Extreme Focus on Cost’ to
‘Extreme Focus on Quality’ with two marked positions, left to
right: ‘An organization here is out of balance and is in
danger of losing service quality because of heavy cost
cutting’; ‘An organization here is quite balanced, but may
tend to overspend to deliver higher levels of service than
are strictly necessary’.]

Table 3.3 outlines some examples of the characteristics of
positions at extreme ends of the cost/quality spectrum.
The purpose of this table is to assist organizations in
identifying to which extreme they are closer, not to
identify real-life positions to which organizations should
aspire.

Table 3.3     Examples of extreme focus on quality and cost
                Extreme focus on quality                                     Extreme focus on cost
Primary focus   Delivering the level of quality demanded by the              Meeting budget and reducing costs
                business regardless of what it takes

Typical         ■   Escalating budgets                                       ■    IT limits the quality of service based on their
problems        ■   IT services generally deliver more than is necessary          budget availability
experienced         for business success                                     ■    Escalations from the business to get more service
                ■   Escalating demands for higher-quality services.               from IT.

Financial       IT usually does not have a method of communicating           Financial reporting is done purely on budgeted
Management      the cost of IT services. Accounting methods are based        amounts. There is no way of linking activities in IT to
                on an aggregated method (e.g. cost of IT per user).          the delivery of IT services.

Achieving a balance will ensure delivery of the level of
service necessary to meet business requirements at an
optimal (as opposed to lowest possible) cost. This will
require the following:

■ A Financial Management process and tools that can
   account for the cost of providing IT services; and
   which model alternative methods of delivering services
   at differing levels of cost. For example, comparing the



   cost of delivering a service at 98% availability or at
   99.9% availability; or the cost of providing a service
   with or without additional functionality.
 ■ Ensuring that decisions around cost versus quality are
   made by the appropriate managers during Service
   Strategy and Service Design. IT operational managers
   are generally not equipped to evaluate business
   opportunities and should only be asked to make
   financial decisions that are related to achieving
   operational efficiencies.

 3.2.4 Reactive versus proactive
 A reactive organization is one which does not act unless it
 is prompted to do so by an external driver, e.g. a new
 business requirement, an application that has been
 developed or escalation in complaints made by users and
 customers. An unfortunate reality in many organizations is
 the mistaken focus on reactive management as the sole
 means to ensure services that are highly consistent and
 stable, actively discouraging proactive behaviour from
 operational staff. The unfortunate irony of this approach is
 that discouraging investment of effort in proactive Service
 Management can ultimately increase the effort and cost of
 reactive activities and further risk stability and consistency
 in services.

 A proactive organization is always looking for ways to
 improve the current situation. It will continually scan the
 internal and external environments, looking for signs of
 potentially impacting changes. Proactive behaviour is
 usually seen as positive, especially since it enables the
 organization to maintain competitive advantage in a
 changing environment. However, being too proactive can
 be expensive and can result in staff being distracted. A
 proper balance between reactive and proactive
 behaviour usually achieves the optimal result.

 Generally, it is better to manage IT services proactively, but
 this is not easily planned or achieved. This is
 because building a proactive IT organization is dependent
 on many variables, including:
 ■ The maturity of the organization. The longer the
   organization has been delivering a consistent set of IT
   services, the more likely it is to understand the
   relationship between IT and the business and the IT
   Infrastructure and IT services.
 ■ The culture of the organization. Some organizations
   have a culture that is focused on innovation and are
   more likely to be proactive. Others are more likely to
   focus on the status quo and as such are likely to resist
   change and have more reactive focus.
 ■ The role that IT plays in the business and the mandate
   that IT has to influence the strategy and tactics of the
   business. For example, a company where the CIO is a
   board member is likely to have an IT organization that
   is far more proactive and responsive than a company
   where IT is seen as an administrative overhead.
 ■ The level of integration of management processes and
   tools. Higher levels of integration will facilitate better
   knowledge of opportunities.
 ■ The maturity and scope of Knowledge Management in
   the organization; this is especially seen in
   organizations which have been able to store and
   organize historical data effectively – especially
   Availability and Problem Management data.

 From a maturity perspective, it is clear that newer
 organizations will have different priorities and experiences
 from a more established organization – what is best
 practice for a mature organization may not suit a younger
 organization. Therefore an imbalance could result from an
 organization being either less or more mature. Consider
 the following:

 ■ Less mature organizations (or organizations with
   newer IT services or technology) will generally be
   more reactive, simply because they do not know all
   the variables involved in running their business and
   providing IT services.
 ■ IT staff in newer organizations tend to be generalists
   because it is unclear exactly what is required to deliver
   stable IT services to the business.
 ■ Incidents and problems in newer organizations are
   fairly unpredictable because the technology is
   relatively new and changes quickly.
 ■ More mature organizations tend to be more proactive,
   simply because they have more data and reporting
   available and know the typical patterns of incidents
   and workflows. Thus, they forecast exceptions far
   more easily.
 ■ Staff working in mature organizations also generally
   tend to have more established relationships between
   IT staff and the business and so can be more proactive
   about meeting changing business requirements – this
   is especially true when IT is seen as a strategic
   component of the business.

[Figure 3.5: a spectrum running from Extremely Reactive (left) to Extremely
Proactive (right). Labels, from left to right: ‘An organization here is out
of balance and is not able to effectively support the business strategy’;
‘An organization here is quite balanced, but tends to fix services that are
not broken, resulting in higher levels of change’.]
Figure 3.5 Achieving a balance between being too reactive or too proactive

Table 3.4 outlines some examples of the characteristics of positions at
extreme ends of the spectrum. The purpose of this table is to assist
organizations in identifying to which extreme they are closer, not to
identify real-life positions to which organizations should aspire.

While proactive behaviour in Service Operation is generally good, there are
also times where reactive behaviour is needed. The role of Service Operation
is therefore to achieve a balance between being reactive and proactive. This
will require:
■ Formal Problem Management and Incident Management processes, integrated
  between Service Operation and Continual Service Improvement.
■ The ability to prioritize technical faults as well as business demands.
  This needs to be done during Service Operation, but the mechanisms need to
  be put in place during Service Strategy and Design. These mechanisms could
  include incident categorization systems, escalation procedures and tools to
  facilitate impact assessment for changes.
■ Data from Configuration and Asset Management to provide data where
  required, saving projects time and making decisions more accurate.
■ Ongoing involvement of SLM in Service Operation.
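As an illustration of one such mechanism, the sketch below shows a very
simple categorization and escalation rule that could be applied to technical
faults and business demands alike. The three-step scale, the rule for
combining impact and urgency, the escalation threshold and all names are
assumptions made for this example; they are not defined in this section.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority code from 1 (most pressing) to 5 (least pressing).

    Assumed rule: add the two levels and subtract 1, so HIGH/HIGH gives 1
    and LOW/LOW gives 5.
    """
    return int(impact) + int(urgency) - 1


def needs_escalation(impact: Level, urgency: Level, threshold: int = 2) -> bool:
    """Flag items whose priority code is at or below the assumed threshold."""
    return priority(impact, urgency) <= threshold
```

Agreeing the scale and the threshold in advance is what allows the rule to be
applied consistently during Service Operation.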

Table 3.4 Examples of extremely reactive and proactive behaviour

Primary focus
  Extremely reactive:
    Responds to business needs and incidents only after they are reported
  Extremely proactive:
    Anticipates business requirements before they are reported and problems
    before they occur

Typical problems experienced
  Extremely reactive:
    ■ Preparing to deliver new services takes a long time because each
      project is dealt with as if it is the first
    ■ Similar incidents occur again and again, as there is no way of
      trending them
    ■ Staff turnover is high and morale is generally low, as IT staff keep
      moving from project to project without achieving a lasting, stable set
      of IT services
  Extremely proactive:
    ■ Money is spent before the requirements are stated. In some cases IT
      purchases items that will never be used because they anticipated the
      wrong requirements or because the project is stopped
    ■ IT staff tend to have been in the organization for a long time and
      tend to assume that they know the business requirements better than
      the business does

Capacity Planning
  Extremely reactive:
    Wait until there are capacity problems and then purchase surplus
    capacity to last until the next capacity-related incident
  Extremely proactive:
    Anticipate capacity problems and spend money on preventing these, even
    when the scenario is unlikely to happen

IT Service Continuity Planning
  Extremely reactive:
    ■ No plans exist until after a major event or disaster
    ■ IT Plans focus on recovering key systems, but without ensuring that
      the business can recover its processes
  Extremely proactive:
    Over-planning (and over-spending) of IT Recovery options. Usually
    immediate recovery is provided for most IT services, regardless of their
    impact or priority

Change Management
  Extremely reactive:
    ■ Changes are often not logged, or logged at the last minute as
      Emergency Changes
    ■ Not enough time for proper impact and cost assessments
    ■ Changes are poorly tested and controlled, resulting in a high number
      of incidents
  Extremely proactive:
    Changes are requested and implemented even when there is no real need,
    i.e. a significant amount of work done to fix items that are not broken



3.3 PROVIDING SERVICE

All Service Operation staff must be fully aware that they are there to
‘provide service’ to the business. They must provide a timely (rapid response
and speedy delivery of requirements), professional and courteous service to
allow the business to conduct its own activities, so that the commercial
customer’s needs are met and the business thrives.

It is important that staff are trained not only in how to deliver and support
IT services, but also in the manner in which that service should be provided.
For example, staff who are capable and deliver service effectively may still
cause significant customer dissatisfaction if they are insensitive or
dismissive. Conversely, no amount of being nice to a customer will help if the
service is not being delivered.

A critical element of being a proficient service provider is placing as much
emphasis on recruiting and training staff to develop competency in dealing
with and managing customer relationships and interactions as on technical
competencies for managing the IT environment.

3.4 OPERATION STAFF INVOLVEMENT IN
SERVICE DESIGN AND SERVICE TRANSITION

It is extremely important that Service Operation staff are involved in Service
Design and Service Transition and potentially also in Service Strategy where
appropriate.

One key to achieving balance in Service Operation is an effective set of
Service Design processes. These will provide IT Operations Management with:
■ Clear definition of IT service objectives and performance criteria
■ Linkage of IT service specifications to the performance of the IT
  Infrastructure
■ Definition of operational performance requirements
■ A mapping of services and technology
■ The ability to model the effect of changes in technology and changes to
  business requirements
■ Appropriate cost models (e.g. customer or service based) to evaluate Return
  on Investment and cost-reduction strategies.

The nature of IT Operations Management involvement should be carefully
positioned. Service Design is a phase in the Service Management Lifecycle
using a set of processes, not a function independent of Service Operation. As
such, many of the people who are involved in Service Design will come from IT
Operations Management.

This should not only be encouraged, but Service Operation staff should be
measured on their involvement in Service Design activities, and such
activities should be included in job descriptions and roles, etc. This will
help to ensure continuity between business requirements and technology design
and operation and it will also help to ensure that what is designed can also
be operated. IT Operations Management staff should also be involved during
Service Transition to ensure consistency and to ensure that both stated
business and manageability requirements are met.

Resources must be made available for these activities and the time required
should be taken into account, as appropriate.

3.5 OPERATIONAL HEALTH

Many organizations find it helpful to compare the monitoring and control of
Service Operation to health monitoring and control.

In this sense, the IT Infrastructure is like an organism that has vital life
signs that can be monitored to check whether it is functioning normally. This
means that it is not necessary to monitor continuously every component of
every IT system to ensure that it is functioning.

Operational Health can be determined by isolating a few important ‘vital
signs’ on devices or services that are defined as critical for the successful
execution of a Vital Business Function. This could be the bandwidth
utilization on a network segment, or memory utilization on a major server. If
these signs are within normal ranges, the system is healthy and does not
require additional attention.

This reduction in the need for extensive monitoring will result in cost
reduction and operational teams and departments that are focused on the
appropriate areas for service success.

However, as with organisms, it is important to check systems more thoroughly
from time to time, to check for problems that do not immediately affect vital
signs. For example, a disk may be functioning perfectly, but it could be
nearing its Mean Time Between Failures (MTBF) threshold. In this case the
system should be taken out of service and given a thorough examination or
‘health check’. At the same time, it should be stressed that the end result
should be the healthy functioning of the service as a whole. This means that
health checks on components should be balanced against checks of the
‘end-to-end’ service. What needs to be monitored, and what counts as healthy
versus unhealthy, is defined during Service Design, especially in Availability
Management and SLM.
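
The ‘vital signs’ idea described above can be sketched in a few lines of
code. This is only an illustrative sketch: the sign names, the normal ranges,
the 90% MTBF margin and the function names are all assumptions made for the
example, since in practice these values would be defined during Service
Design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VitalSign:
    """One monitored measurement with its assumed normal range."""
    name: str
    low: float
    high: float

    def is_normal(self, value: float) -> bool:
        return self.low <= value <= self.high


def check_health(signs, readings):
    """Return the names of vital signs that are missing or out of range.

    An empty list means the system is considered healthy and needs no
    additional attention.
    """
    abnormal = []
    for sign in signs:
        value = readings.get(sign.name)
        if value is None or not sign.is_normal(value):
            abnormal.append(sign.name)
    return abnormal


def due_for_health_check(hours_in_service: float, mtbf_hours: float,
                         margin: float = 0.9) -> bool:
    """Flag a component that has run for at least `margin` of its MTBF."""
    return hours_in_service >= margin * mtbf_hours


# Hypothetical vital signs for one Vital Business Function.
SIGNS = [
    VitalSign("network_bandwidth_utilization_pct", 0.0, 70.0),
    VitalSign("server_memory_utilization_pct", 0.0, 85.0),
]

readings = {
    "network_bandwidth_utilization_pct": 42.0,
    "server_memory_utilization_pct": 93.5,
}
print(check_health(SIGNS, readings))
```

With the sample readings, only the memory sign falls outside its assumed
range, so it is the only name reported. The separate MTBF check reflects the
point that a component can show normal vital signs and still be due for a
thorough examination.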

Operational Health is dependent on the ability to prevent incidents and
problems by investing in reliable and maintainable infrastructure. This is
achieved through good availability design and proactive Problem Management. At
the same time, Operational Health is also dependent on the ability to identify
faults and localize them effectively so that they have minimal impact on the
service. This requires strong (preferably automated) Incident and Problem
Management.

The idea of Operational Health has also led to a specialized area called ‘Self
Healing Systems’. This is an application of Availability, Capacity, Knowledge,
Incident and Problem Management and refers to a system that has been designed
to withstand the most severe operating conditions and to detect, diagnose and
recover from most incidents and Known Errors. Self Healing Systems are known
by different names, for example Autonomic Systems, Adaptive Systems and
Dynamic Systems.

Characteristics of Self Healing Systems include:
■ Resilience is designed and built into the system, for example multiple
  redundant disks or multiple processors. This protects the system against
  hardware failure since it is able to continue operating using the duplicated
  hardware component.
■ Software, data and operating system resilience is also designed into the
  system, for example mirrored databases (where a database is duplicated on a
  backup device) and disk-striping technology with parity (where data is
  distributed across a disk array, so that a disk failure results in the loss
  of only a part of the data, which can be recovered from the parity
  information).
■ The ability to shift processing from one physical device to another without
  any disruption to the service. This could be a response to a failure or
  because the device is reaching high utilization levels (some systems are
  designed to distribute processing workloads continuously, to make optimum
  use of available capacity, which is also known as virtualization).
■ Built-in monitoring utilities which enable the system to detect events and
  to determine whether these represent normal operations or not.
■ A correlation engine (see paragraph 4.1.5.6 on Event Management). This will
  enable the system to determine the significance of each event and also to
  determine whether there is any predefined response to that event.
■ A set of diagnostic tools, such as diagnostic scripts, fault trees and a
  database of Known Errors and common workarounds. These are used as soon as
  an error is detected, to determine the appropriate response.
■ The ability to generate a call for human intervention by raising an alert or
  generating an incident.

While the concept of Operational Health is not a core concept of Service
Operation, it is often a helpful metaphor to assist in determining what needs
to be monitored and how frequently to perform preventive maintenance.

What and when to monitor for operational health should be determined in
Service Design, tested and refined during Service Transition and optimized in
Continual Service Improvement, as necessary.

3.6 COMMUNICATION

Good communication is needed with other IT teams and departments, with users
and internal customers, and between the Service Operation teams and
departments themselves. Issues can often be prevented or mitigated with
appropriate communication.

This section is aimed at summarizing the communication that should take place
in Service Operation. This is not a separate process, but a checklist of the
type of communication that is required for effective Service Operation.

An important principle is that all communication must have an intended purpose
or a resultant action. Information should not be communicated unless there is
a clear audience. In addition, that audience should have been actively
involved in determining the need for that communication and what they will do
with the information.

A detailed description of the types of communication typical in Service
Operation is contained in Appendix B of this publication, together with a
description of the typical audience and the actions that are intended to be
taken as a result of each communication. These include:
■ Routine operational communication
■ Communication between shifts
■ Performance reporting
■ Communication in projects
■ Communication related to changes
■ Communication related to exceptions
■ Communication related to emergencies
■ Training on new or customized processes and service designs



■ Communication of strategy and design to Service Operation teams.

Please note that there is no definitive medium for communication, nor is there
a fixed location or frequency. In some organizations communication has to take
place in meetings. Other organizations prefer to use e-mail or the
communication inherent in their Service Management tools.

There should therefore be a policy around communication within each team or
department and for each process. Although this should be formal, the policy
should not be cumbersome or complex. For example, a manager might require that
all communications regarding changes must be sent by e-mail. As long as this
is specified in the department’s SOPs (in whatever form they exist), there is
no need to create a separate policy for it.

Although the typical content of communication is fairly consistent once
processes have been defined, the means of communication are changing with
every new introduction of technology. The list of alternatives is growing and,
today, includes:
■ E-mail, to traditional clients or mobile devices
■ SMS messages
■ Pagers
■ Instant messaging and web-based ‘chats’
■ Voice over Internet Protocol (VoIP) utilities that can turn any connected
  device into an inexpensive communication medium
■ Teleconference and virtual meeting utilities, which have revolutionized
  meetings by allowing them to be held across long distances
■ Document-sharing utilities.

The means of communication itself is outside the scope of this publication.
However, the following points should be noted:
■ Communication is primary and the means of communication must serve this
  goal. For example, the need for secure communication may eliminate the
  possibility of some of the above means. The need for quality may eliminate
  some VoIP options.
■ It is possible to use any means of communication as long as all stakeholders
  understand how and when the communication will take place.

3.6.1 Meetings

Different organizations communicate in different ways. Where organizations are
distributed, they will tend to rely on e-mail and teleconferencing facilities.
Organizations that have more mature Service Management processes and tools
will tend to rely on the tools and processes for communication (e.g. using an
Incident Management tool to escalate and track incidents, instead of
requesting e-mail or telephone calls for updates).

Other organizations prefer to communicate using meetings. However, it is
important not to get into the mode whereby the only time work is done, or
management is involved, is during a meeting. Also, face-to-face meetings tend
to increase costs (e.g. travel, time spent in informal discussions,
refreshments, etc.), so meeting organizers should balance the value of the
meeting with the number and identity of the attendees and the time they will
spend in, and getting to, the meeting.

The purpose of meetings is to communicate effectively to a group of people
about a common set of objectives or activities. Meetings should be well
controlled and brief, and the focus should be on facilitating action. A good
rule is not to hold a meeting if the information can be communicated
effectively by automated means.

A number of factors are essential for successful meetings. Although these may
seem to be common sense, they are sometimes neglected:
■ Establish and communicate a clear agenda to ensure that the meeting achieves
  its objective and to help the facilitator prevent attendees from ‘hijacking’
  the meeting.
■ Ensure that the rules for participating are understood. Organizations tend
  to have a set of meeting rules, ranging from relatively informal to very
  formal (e.g. Robert’s Rules of Order).
■ Make use of ‘parking lots’ or notes that record issues that are not directly
  relevant to the purpose of the meeting, but which can be called on if the
  need for discussion arises.
■ Minutes of the meeting: rules should be set about when minutes are taken.
  Minutes are used to remind people who are assigned actions and to track the
  progress of delegated actions. They are also useful in ensuring that
  cross-functional decisions and actions are tracked and followed through.
■ Use techniques to encourage the appropriate level of participation. One
  technique when discussing improvements, for example, is the ‘keep, stop,
  start’ technique. Participants are encouraged to list items that they would
  like to keep, things that need to be stopped and initiatives or actions that
  they would like to see started.

Examples of typical meetings are given below:

3.6.1.1 The Operations meeting
Operations meetings are normally held between the managers of the IT
operational departments, teams or groups, at the beginning of each business
day or week. The purpose of this type of meeting is to make staff aware of any
issue relevant to Operations (such as change schedules, business events,
maintenance schedules, etc.) and to provide an opportunity for staff to raise
any issues of which they are aware. This is an opportunity to ensure that all
departments in a data centre are synchronized.

In geographically dispersed organizations it may not be possible to have a
single daily Operations meeting. In these cases it is important to coordinate
the agenda of the meetings and to ensure that each meeting has two components:
1 The first part of the meeting will cover aspects that apply to the
  organization as a whole, e.g. new policies, changes that affect all regions
  and business events that span all regions.
2 The second part of the meeting will cover aspects that apply only to the
  local region, e.g. local operations schedules, changes to local equipment,
  etc.

The Operations meeting is usually chaired by the IT Operations Manager or a
senior Operations Manager and attended by all managers and supervisors (except
those whose shifts are not on duty). It is also helpful to have at least one
representative from the Service Desk at the meeting so that they are aware of
any situations that could give rise to incidents.

Opportunities to improve services or processes should be captured, if raised,
and forwarded to the team responsible for Continual Service Improvement.

3.6.1.2 Department, group or team meetings
These meetings are essentially the same as the Operations meeting, but are
aimed at a single IT department, group or team. Each manager or supervisor
relays the information from the Operations meeting that is relevant to their
team.

Additionally, these meetings will also cover the following:
■ A more detailed discussion of incidents, problems and changes that are still
  being worked on, with information about:
  ● Progress to date
  ● Confirmation of what still needs to be done
  ● Estimated completion times
  ● Request for additional resources, if required
  ● Discussion of potential problems or concerns
■ Confirmation of staff availability for roster duties
■ Confirmation of vacation schedules.

3.6.1.3 Customer meetings
From time to time it will be necessary to hold meetings with customers, apart
from the regular Service Level Review meetings. Examples include:
■ Follow-up after serious incidents. The purpose of these meetings is to
  repair the relationship with the customers, but also to ensure that IT has
  all the information required to prevent recurrence. Customers also have the
  opportunity to provide information about unforeseen business impacts. These
  meetings are helpful in agreeing actions for similar types of incident that
  may occur in future.
■ A customer forum, which can be used for a range of purposes, including
  testing ideas for new services or solutions, or gathering requirements for
  new or revised services or procedures. A customer forum is generally a
  regular meeting with customers to discuss areas of common concern.

3.7 DOCUMENTATION
IT Operations Management and all of the Technical and Application Management
teams and departments are involved in creating and maintaining a range of
documents. These are detailed in Chapters 4, 5 and 6 of this publication and
include the following:
■ Participation in the definition and maintenance of process manuals for all
  processes they are involved in. These will include processes in other phases
  of the IT Service Management Lifecycle (e.g. Capacity Management, Change
  Management, Availability Management) as well as for all processes included
  in the Service Operation phase.
■ Establishing their own technical procedures manuals. These must be kept up
  to date and new material must be added as it becomes relevant, under Change
  Control. It should be remembered that their procedures should always be
  structured to meet the objectives and constraints defined within
  higher-level Service Management processes, such as SLM. For example, a
  technical procedure for managing servers should always ensure that it aims
  at achieving the availability and performance levels agreed to in the
  Operational Level Agreements and Service Level Agreements (SLAs).



 ■ Participation in the creation and maintenance of
   planning documents, e.g. the Capacity and Availability
   Plans and the IT Service Continuity Plans.
 ■ Participation in the creation and maintenance of the
   Service Portfolio. This will include quantifying costs
   and establishing the operational feasibility of each
   proposed service.
 ■ Participation in the definition and maintenance of
   Service Management tool work instructions in order to
   meet reporting requirements.


4 Service Operation processes
The processes listed in paragraph 2.4.5 are discussed in detail in this chapter. As a reference, the overall structure is briefly described here and then each of the processes is described in more detail later in the chapter. Please note that the roles for each process and the tools used for each process are described in Chapters 6 and 7 respectively.

■ Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions.
■ Incident Management concentrates on restoring the service to users as quickly as possible, in order to minimize business impact.
■ Problem Management involves root-cause analysis to determine and resolve the cause of events and incidents, proactive activities to detect and prevent future problems/incidents and a Known Error sub-process to allow quicker diagnosis and resolution if further incidents do occur.
  NOTE: Without this distinction between incidents and problems, and keeping separate Incident and Problem Records, there is a danger that either:
  ● Incidents will be closed too early in the overall support cycle and there will be no actions taken to prevent recurrence – so the same incidents will have to be fixed over and over again, or
  ● Incidents will be kept open so that root cause analysis can be done and visibility will be lost of when the user’s service was actually restored – so SLA targets may not be met even though the service has been restored within users’ expectations. This often results in a large number of open incidents, many of which will never be closed unless a periodic ‘purge’ is undertaken. This can be very demotivating and can prevent effective visibility of current issues.
■ Request Fulfilment involves the management of customer or user requests that are not generated as an incident from an unexpected service delay or disruption. Some organizations may choose to handle such requests as a ‘category’ of incidents and manage the information through an Incident Management system – but others may choose (because of high volumes or business priority of such requests) to facilitate the provision of Request Fulfilment capabilities separately via the Request Fulfilment process. It has become popular practice to use a formal Request Fulfilment process to manage customer and user requests for all types of requests which include facilities, moves and supplies as well as those specific to IT services. These requests are not generally tied to the same SLA measures and separating the records and the process flow is emerging as best practice in many organizations.
■ Access Management: this is the process of granting authorized users the right to use a service, while restricting access to non-authorized users. It is based on being able accurately to identify authorized users and then manage their ability to access services as required during different stages of their human resources (HR) or contractual lifecycle. Access Management has also been called Identity or Rights Management in some organizations.

In addition, there are several other processes that will be executed or supported during Service Operation, but which are driven during other phases of the Service Management Lifecycle. The operational aspects of these processes will be discussed in the final part of this chapter and include:

■ Change Management, a major process which should be closely linked to Configuration Management and Release Management. These topics are primarily covered in the Service Transition publication.
■ Capacity and Availability Management, the operational aspects of which are covered in this publication, but which are covered in more detail in the Service Design publication.
■ Financial Management, which is covered in the Service Strategy publication.
■ Knowledge Management, which is covered in the Service Transition publication.
■ IT Service Continuity, which is covered in the Service Design publication.
■ Service Reporting and Measurement, which are covered in the Continual Service Improvement publication.

4.1 EVENT MANAGEMENT
An event can be defined as any detectable or discernible occurrence that has significance for the management of the IT Infrastructure or the delivery of IT service and evaluation of the impact a deviation might cause to the services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.

Effective Service Operation is dependent on knowing the status of the infrastructure and detecting any deviation from normal or expected operation. This is provided by good monitoring and control systems, which are based on two types of tools:

■ active monitoring tools that poll key CIs to determine their status and availability. Any exceptions will generate an alert that needs to be communicated to the appropriate tool or team for action
■ passive monitoring tools that detect and correlate operational alerts or communications generated by CIs.

4.1.1 Purpose/goal/objective

The ability to detect events, make sense of them and determine the appropriate control action is provided by Event Management. Event Management is therefore the basis for Operational Monitoring and Control (see Appendix B).

In addition, if these events are programmed to communicate operational information as well as warnings and exceptions, they can be used as a basis for automating many routine Operations Management activities, for example executing scripts on remote devices, or submitting jobs for processing, or even dynamically balancing the demand for a service across multiple devices to enhance performance.

Event Management therefore provides the entry point for the execution of many Service Operation processes and activities. In addition, it provides a way of comparing actual performance and behaviour against design standards and SLAs. As such, Event Management also provides a basis for Service Assurance and Reporting; and Service Improvement. This is covered in detail in the Continual Service Improvement publication.

4.1.2 Scope

Event Management can be applied to any aspect of Service Management that needs to be controlled and which can be automated. These include:

■ Configuration Items:
   ● Some CIs will be included because they need to stay in a constant state (e.g. a switch on a network needs to stay on and Event Management tools confirm this by monitoring responses to ‘pings’).
   ● Some CIs will be included because their status needs to change frequently and Event Management can be used to automate this and update the CMS (e.g. the updating of a file server).
■ Environmental conditions (e.g. fire and smoke detection)
■ Software licence monitoring for usage to ensure optimum/legal licence utilization and allocation
■ Security (e.g. intrusion detection)
■ Normal activity (e.g. tracking the use of an application or the performance of a server).

The difference between monitoring and Event Management

These two areas are very closely related, but slightly different in nature. Event Management is focused on generating and detecting meaningful notifications about the status of the IT Infrastructure and services.

While it is true that monitoring is required to detect and track these notifications, monitoring is broader than Event Management. For example, monitoring tools will check the status of a device to ensure that it is operating within acceptable limits, even if that device is not generating events.

Put more simply, Event Management works with occurrences that are specifically generated to be monitored. Monitoring tracks these occurrences, but it will also actively seek out conditions that do not generate events.

4.1.3 Value to business

Event Management’s value to the business is generally indirect; however, it is possible to determine the basis for its value as follows:

■ Event Management provides mechanisms for early detection of incidents. In many cases it is possible for the incident to be detected and assigned to the appropriate group for action before any actual service outage occurs.
■ Event Management makes it possible for some types of automated activity to be monitored by exception – thus removing the need for expensive and resource-intensive real-time monitoring, while reducing downtime.
■ When integrated into other Service Management processes (such as, for example, Availability or Capacity Management), Event Management can signal status changes or exceptions that allow the appropriate person or team to perform early response, thus improving the performance of the process. This, in turn, will allow the business to benefit from more effective and more efficient Service Management overall.
■ Event Management provides a basis for automated operations, thus increasing efficiencies and allowing expensive human resources to be used for more innovative work, such as designing new or improved functionality or defining new ways in which the business can exploit technology for increased competitive advantage.

4.1.4 Policies/principles/basic concepts

There are many different types of events, for example:

■ Events that signify regular operation:
   ● notification that a scheduled workload has completed
   ● a user has logged in to use an application
   ● an e-mail has reached its intended recipient.
■ Events that signify an exception:
   ● a user attempts to log on to an application with the incorrect password
   ● an unusual situation has occurred in a business process that may indicate an exception requiring further business investigation (e.g. a web page alert indicates that a payment authorization site is unavailable – impacting financial approval of business transactions)
   ● a device’s CPU is above the acceptable utilization rate
   ● a PC scan reveals the installation of unauthorized software.
■ Events that signify unusual, but not exceptional, operation. These are an indication that the situation may require closer monitoring. In some cases the condition will resolve itself, for example in the case of an unusual combination of workloads – as they are completed, normal operation is restored. In other cases, operator intervention may be required if the situation is repeated or if it continues for too long. These rules or policies are defined in the Monitoring and Control Objectives for that device or service. Examples of this type of event are:
   ● A server’s memory utilization reaches within 5% of its highest acceptable performance level
   ● The completion time of a transaction is 10% longer than normal.

Two things are significant about the above examples:

■ Exactly what constitutes normal versus unusual operation, versus an exception? There is no definitive rule about this. For example, a manufacturer may provide that a benchmark of 75% memory utilization is optimal for application X. However, it is discovered that, under the specific conditions of our organization, response times begin to degrade above 70% utilization. The next section will explore how these figures are determined.
■ Each relies on the sending and receipt of a message of some type. These are generally referred to as Event notifications and they don’t just happen. The next paragraphs will explore exactly how events are defined, generated and captured.

[Figure 4.1 The Event Management process. Flowchart elements: Event; Event Notification Generated; Event Detected; Event Filtered; Significance? (Informational, Warning, Exception); Event Correlation; Trigger; Event Logged; Auto Response; Alert; Human Intervention; Incident/Problem/Change? (leading to Incident Management, Problem Management or Change Management); Review Actions; Effective? (Yes/No); Close Event; End.]
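
As a rough companion to Figure 4.1, the sketch below strings a few of the activities together in Python: logging, a significance check, a simple correlation on the number of similar events, and the choice of a response. It is a minimal sketch, not the process itself. The class and field names, the repeat threshold of three and the routing rules (informational events are only logged, exceptions and repeated warnings raise an incident, other warnings raise an alert) are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str        # CI that produced the notification (hypothetical field)
    kind: str          # e.g. "memory_high", "job_completed"
    significance: str  # "informational", "warning" or "exception"


class EventPipeline:
    """Illustrative routing of events to response options."""

    def __init__(self, repeat_threshold=3):
        self.repeat_threshold = repeat_threshold  # assumed correlation rule
        self.log = []                             # every event is logged
        self.similar = Counter()                  # (source, kind) -> count

    def handle(self, event):
        self.log.append(event)
        if event.significance == "informational":
            return "logged"
        # Correlation: count similar events from the same CI.
        key = (event.source, event.kind)
        self.similar[key] += 1
        repeated = self.similar[key] >= self.repeat_threshold
        if event.significance == "exception" or repeated:
            return "incident"  # trigger a record in Incident Management
        return "alert"         # notify a person or team


pipeline = EventPipeline()
warning = Event("server-01", "memory_high", "warning")
for _ in range(3):
    print(pipeline.handle(warning))
```

With the default threshold, the first two identical warnings produce an alert and the third is escalated to an incident. A real management tool would apply far richer rules, and those rules would come from Service Design rather than being hard-coded.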

4.1.5 Process activities, methods and techniques

Figure 4.1 is a high-level and generic representation of Event Management. It should be used as a reference and definition point, rather than an actual Event Management flowchart. Each activity in this process is described below.

4.1.5.1 Event occurs

Events occur continuously, but not all of them are detected or registered. It is therefore important that everybody involved in designing, developing, managing and supporting IT services and the IT Infrastructure that they run on understands what types of event need to be detected.

This is discussed in paragraph 4.1.10.1, titled ‘Instrumentation’.

4.1.5.2 Event notification

Most CIs are designed to communicate certain information about themselves in one of two ways:

■ A device is interrogated by a management tool, which collects certain targeted data. This is often referred to as polling.
■ The CI generates a notification when certain conditions are met. The ability to produce these notifications has to be designed and built into the CI, for example a programming hook inserted into an application.

Event notifications can be proprietary, in which case only the manufacturer’s management tools can be used to detect events. Most CIs, however, generate Event notifications using an open standard such as SNMP (Simple Network Management Protocol).

Many CIs are configured to generate a standard set of events, based on the designer’s experience of what is required to operate the CI, with the ability to generate additional types of event by ‘turning on’ the relevant event generation mechanism. For other CI types, some form of ‘agent’ software will have to be installed in order to initiate the monitoring. Often this monitoring feature is free, but sometimes there is a cost to the licensing of the tool.

In an ideal world, the Service Design process should define which events need to be generated and then specify how this can be done for each type of CI. During Service Transition, the event generation options would be set and tested.

In many organizations, however, defining which events to generate is done by trial and error. System managers use the standard set of events as a starting point and then tune the CI over time, to include or exclude events as required. The problem with this approach is that it only takes into account the immediate needs of the staff managing the device and does not facilitate good planning or improvement. In addition, it makes it very difficult to monitor and manage the service over all devices and staff. One approach to combating this problem is to review the set of events as part of continual improvement activities.

A general principle of Event notification is that the more meaningful the data it contains and the more targeted the audience, the easier it is to make decisions about the event. Operators are often confronted by coded error messages and have no idea how to respond to them or what to do with them. Meaningful notification data and clearly defined roles and responsibilities need to be articulated and documented during Service Design and Service Transition (see also paragraph 4.1.10.1 on ‘Instrumentation’). If roles and responsibilities are not clearly defined, then when an alert is sent to a wide audience no one knows who is doing what, and this can lead to things being missed or to duplicated effort.

4.1.5.3 Event detection

Once an Event notification has been generated, it will be detected by an agent running on the same system, or transmitted directly to a management tool specifically designed to read and interpret the meaning of the event.

4.1.5.4 Event filtering

The purpose of filtering is to decide whether to communicate the event to a management tool or to ignore it. If ignored, the event will usually be recorded in a log file on the device, but no further action will be taken.

The reason for filtering is that it is not always possible to turn Event notification off, even though a decision has been made that it is not necessary to generate that type of event. It may also be decided that only the first in a series of repeated Event notifications will be transmitted.

During the filtering step, the first level of correlation is performed, i.e. the determination of whether the event is informational, a warning, or an exception (see next step). This correlation is usually done by an agent that resides on the CI or on a server to which the CI is connected.

The filtering step is not always necessary. For some CIs, every event is significant and moves directly into a management tool’s correlation engine, even if it is duplicated. Also, it may have been possible to turn off all unwanted Event notifications.

4.1.5.5 Significance of events

Every organization will have its own categorization of the significance of an event, but it is suggested that at least these three broad categories be represented:

■ Informational: This refers to an event that does not require any action and does not represent an exception. Such events are typically stored in the system or service log files and kept for a predetermined period. Informational events are typically used to check on the status of a device or service, or to confirm the successful completion of an activity. Informational events can also be used to generate statistics (such as the number of users logged on to an application during a certain period) and as input into investigations (such as which jobs completed successfully before the transaction processing queue hung). Examples of informational events include:
   ● A user logs on to an application
   ● A job in the batch queue completes successfully
   ● A device has come online
   ● A transaction is completed successfully.
■ Warning: A warning is an event that is generated when a service or device is approaching a threshold. Warnings are intended to notify the appropriate person, process or tool so that the situation can be checked and the appropriate action taken to prevent an exception. Warnings are not typically raised for a device failure, although there is some debate about whether the failure of a redundant device is a warning or an exception (since the service is still available). A good rule is that every failure should be treated as an exception, since the risk of an incident impacting the business is much greater. Examples of warnings are:
   ● Memory utilization on a server is currently at 65% and increasing. If it reaches 75%, response times will be unacceptably long and the OLA for that department will be breached.
   ● The collision rate on a network has increased by 15% over the past hour.
■ Exception: An exception means that a service or device is currently operating abnormally (however that has been defined). Typically, this means that an OLA and SLA have been breached and the business is being impacted. Exceptions could represent a total failure, impaired functionality or degraded performance. Please note, though, that an exception does not always represent an incident. For example, an exception could be generated when an unauthorized device is discovered on the network. This can be managed by using either an Incident Record or a Request for Change (or even both), depending on the organization’s Incident and Change Management policies. Examples of exceptions include:
   ● A server is down
   ● Response time of a standard transaction across the network has slowed to more than 15 seconds
   ● More than 150 users have logged on to the General Ledger application concurrently
   ● A segment of the network is not responding to routine requests.

4.1.5.6 Event correlation

If an event is significant, a decision has to be made about exactly what the significance is and what actions need to be taken to deal with it. It is here that the meaning of the event is determined.

Correlation is normally done by a ‘Correlation Engine’, usually part of a management tool that compares the event with a set of criteria and rules in a prescribed order. These criteria are often called Business Rules, although they are generally fairly technical. The idea is that the event may represent some impact on the business and the rules can be used to determine the level and type of business impact.

A Correlation Engine is programmed according to the performance standards created during Service Design and any additional guidance specific to the operating environment.

Examples of what Correlation Engines will take into account include:

■ Number of similar events (e.g. this is the third time that the same user has logged in with the incorrect password, a business application reports that there has been an unusual pattern of usage of a mobile telephone that could indicate that the device has been lost or stolen)
■ Number of CIs generating similar events
■ Whether a specific action is associated with the code or data in the event
■ Whether the event represents an exception
■ A comparison of utilization information in the event with a maximum or minimum standard (e.g. has the device exceeded a threshold?)
■ Whether additional data is required to investigate the event further, and possibly even a collection of that data by polling another system or database

■ Categorization of the event
■ Assigning a priority level to the event.

4.1.5.7 Trigger

If the correlation activity recognizes an event, a response will be required. The mechanism used to initiate that response is called a trigger.

There are many different types of triggers, each designed specifically for the task it has to initiate. Some examples include:

■ Incident Triggers that generate a record in the Incident Management system, thus initiating the Incident Management process
■ Change Triggers that generate a Request for Change (RFC), thus initiating the Change Management process
■ A trigger resulting from an approved RFC that has been implemented but caused the event, or from an unauthorized change that has been detected – in either case this will be referred to Change Management for investigation
■ Scripts that execute specific actions, such as submitting batch jobs or rebooting a device
■ Paging systems that will notify a person or team of the event by mobile phone
■ Database triggers that restrict access of a user to specific records or fields, or that create or delete entries in the database.

4.1.5.8 Response selection

At this point in the process, there are a number of response options available. It is important to note that the response options can be chosen in any combination. For example, it may be necessary to preserve the log entry for future reference, but at the same time escalate the event to an Operations Management staff member for action.

The options in the flowchart are examples. Different

standing order for the appropriate Operations Management staff to check the logs on a regular basis and clear instructions about how to use each log. It should also be remembered that the event information in the logs may not be meaningful until an incident occurs; and where the Technical Management staff use the logs to investigate where the incident originated. This means that the Event Management procedures for each system or team need to define standards about how long events are kept in the logs before being archived and deleted.
■ Auto response: Some events are understood well enough that the appropriate response has already been defined and automated. This is normally as a result of good design or of previous experience (usually Problem Management). The trigger will initiate the action and then evaluate whether it was completed successfully. If not, an Incident or Problem Record will be created. Examples of auto responses include:
   ● Rebooting a device
   ● Restarting a service
   ● Submitting a job into batch
   ● Changing a parameter on a device
   ● Locking a device or application to protect it against unauthorized access.
  Note: locking a device may result in denial of service to authorized users, which could be exploited by a deliberate attacker – so great care should be taken when deciding whether this is an appropriate automated action. Where this response is used it may be prudent to also combine this with a call for human intervention, so that the automated action can be swiftly checked and approved.
■ Alert and human intervention: If the event requires human intervention, it will need to be escalated. The purpose of the alert is to ensure that the person with the skills appropriate to deal with the event is notified.
organizations will have different options, and they are sure
                                                                 The alert will contain all the information necessary for
to be more detailed. For example, there will be a range of
                                                                 that person to determine the appropriate action –
auto responses for each different technology. The process
                                                                 including reference to any documentation required
of determining which one is appropriate and how to
                                                                 (e.g. user manuals). It is important to note that this is
execute it is not represented in this flowchart. Some of
                                                                 not necessarily the same as the functional escalation
the options available are:
                                                                 of an incident, where the emphasis is on restoring
■ Event logged: Regardless of what activity is                   service within an agreed time (which may require a
    performed, it is a good idea to have a record of the         variety of activities). The alert requires a person, or
    event and any subsequent actions. The event can be           team, to perform a specific action, possibly on a
    logged as an Event Record in the Event Management            specific device and possibly at a specific time, e.g.
    tool, or it can simply be left as an entry in the system     changing a toner cartridge in a printer when the level
    log of the device or application that generated the          is low.
    event. If this is the case, though, there needs to be a



■ Incident, problem or change? Some events will
  represent a situation where the appropriate response
  will need to be handled through the Incident, Problem
  or Change Management process. These are discussed
  below, but it is important to note that a single
  incident may initiate any one or a combination of
  these three processes – for example, a non-critical
  server failure is logged as an incident, but as there is
  no workaround, a Problem Record is created to
  determine the root cause and resolution and an RFC is
  logged to relocate the workload onto an alternative
  server while the problem is resolved.
■ Open an RFC: There are two places in the Event
  Management process where an RFC can be created:
  ● When an exception occurs: For example, a scan
    of a network segment reveals that two new
    devices have been added without the necessary
    authorization. A way of dealing with this situation
    is to open an RFC, which can be used as a vehicle
    for the Change Management process to deal with
    the exception (as an alternative to the more
    conventional approach of opening an incident that
    would be routed via the Service Desk to Change
    Management). Investigation by Change
    Management is appropriate here since
    unauthorized changes imply that the Change
    Management process was not effective.
  ● Correlation identifies that a change is needed:
    In this case the event correlation activity
    determines that the appropriate response to an
    event is for something to be changed. For
    example, a performance threshold has been
    reached and a parameter on a major server needs
    to be tuned. How does the correlation activity
    determine this? It was programmed to do so either
    in the Service Design process or because this has
    happened before and Problem Management or
    Operations Management updated the Correlation
    Engine to take this action.
■ Open an Incident Record: As with an RFC, an
  incident can be generated immediately when an
  exception is detected, or when the Correlation Engine
  determines that a specific type or combination of
  events represents an incident. When an Incident
  Record is opened, as much information as possible
  should be included – with links to the events
  concerned and if possible a completed diagnostic
  script.
■ Open or link to a Problem Record: It is rare for a
  Problem Record to be opened without related
  incidents (for example as a result of a Service Failure
  Analysis (see Service Design publication) or maturity
  assessment, or because of a high number of retry
  network errors, even though a failure has not yet
  occurred). In most cases this step refers to linking an
  incident to an existing Problem Record. This will assist
  the Problem Management teams to reassess the
  severity and impact of the problem, and may result in
  a changed priority to an outstanding problem.
  However, it is possible, with some of the more
  sophisticated tools, to evaluate the impact of the
  incidents and also to raise a Problem Record
  automatically, where this is warranted, to allow
  root-cause analysis to commence immediately.
■ Special types of incident: In some cases an event
  will indicate an exception that does not directly
  impact any IT service, for example, a redundant air
  conditioning unit fails, or unauthorized entry to a data
  centre. Guidelines for these events are as follows:
  ● An incident should be logged using an Incident
    Model that is appropriate for that type of
    exception, e.g. an Operations Incident or Security
    Incident (see paragraph 4.2.4.2 for more details of
    Incident Models).
  ● The incident should be escalated to the group that
    manages that type of incident.
  ● As there is no outage, the Incident Model used
    should reflect that this was an operational issue
    rather than a service issue. The statistics would not
    normally be reported to customers or users, unless
    they can be used to demonstrate that the money
    invested in redundancy was a good investment.
  ● These incidents should not be used to calculate
    downtime, and can in fact be used to demonstrate
    how proactive IT has been in making services
    available.

4.1.5.9 Review actions
With thousands of events being generated every day, it is
not possible formally to review every individual event.
However, it is important to check that any significant
events or exceptions have been handled appropriately, or
to track trends or counts of event types, etc. In many cases
this can be done automatically, for example polling a
server that had been rebooted using an automated script
to see that it is functioning correctly.

In the cases where events have initiated an incident,
problem and/or change, the Action Review should not
duplicate any reviews that have been done as part of

those processes. Rather, the intention is to ensure that the
handover between the Event Management process and
other processes took place as designed and that the
expected action did indeed take place. This will ensure
that incidents, problems or changes originating within
Operations Management do not get lost between the
teams or departments.

The Review will also be used as input into continual
improvement and the evaluation and audit of the Event
Management process.

4.1.5.10 Close event
Some events will remain open until a certain action takes
place, for example an event that is linked to an open
incident. However, most events are not ‘opened’
or ‘closed’.

Informational events are simply logged and then used as
input to other processes, such as Backup and Storage
Management. Auto-response events will typically be closed
by the generation of a second event. For example, a
device generates an event and is rebooted through auto
response – as soon as that device is successfully back
online, it generates an event that effectively closes the
loop and clears the first event.

It is sometimes very difficult to relate the open event and
the close notifications as they are in different formats. It is
optimal that devices in the infrastructure produce ‘open’
and ‘close’ events in the same format and specify the
change of status. This allows the correlation step in the
process to easily match open and close notifications.

In the case of events that generated an incident, problem
or change, these should be formally closed with a link to
the appropriate record from the other process.

4.1.6 Triggers, input and output/inter-process interfaces
Event Management can be initiated by any type of
occurrence. The key is to define which of these
occurrences is significant and which need to be acted
upon. Triggers include:

■ Exceptions to any level of CI performance defined in
  the design specifications, OLAs or SOPs
■ Exceptions to an automated procedure or process, e.g.
  a routine change that has been assigned to a build
  team has not been completed in time
■ An exception within a business process that is being
  monitored by Event Management
■ The completion of an automated task or job
■ A status change in a device or database record
■ Access of an application or database by a user or
  automated procedure or job
■ A situation where a device, database or application,
  etc. has reached a predefined threshold of
  performance.

Event Management can interface to any process that
requires monitoring and control, especially those that do
not require real-time monitoring, but which do require
some form of intervention following an event or group of
events. Examples of interfaces with other processes
include:

■ Interface with business applications and/or business
  processes to allow potentially significant business
  events to be detected and acted upon (e.g. a business
  application reports abnormal activity on a customer’s
  account that may indicate some sort of fraud or
  security breach).
■ The primary ITSM relationships are with Incident,
  Problem and Change Management. These interfaces
  are described in some detail in paragraph 4.1.5.8.
■ Capacity and Availability Management are critical in
  defining what events are significant, what appropriate
  thresholds should be and how to respond to them. In
  return, Event Management will improve the
  performance and availability of services by responding
  to events when they occur and by reporting on actual
  events and patterns of events to determine (by
  comparison with SLA targets and KPIs) if there is some
  aspect of the infrastructure design or operation that
  can be improved.
■ Configuration Management is able to use events to
  determine the current status of any CI in the
  infrastructure. Comparing events with the authorized
  baselines in the Configuration Management System
  (CMS) will help to determine whether there is
  unauthorized Change activity taking place in the
  organization (see Service Transition publication).
■ Asset Management (covered in more detail in the
  Service Design and Transition publications) can use
  Event Management to determine the lifecycle status of
  assets. For example, an event could be generated to
  signal that a new asset has been successfully
  configured and is now operational.
■ Events can be a rich source of information that can be
  processed for inclusion in Knowledge Management
  systems. For example, patterns of performance can be
  correlated with business activity and used as input
  into future design and strategy decisions.



■ Event Management can play an important role in
  ensuring that potential impact on SLAs is detected
  early and any failures are rectified as soon as possible
  so that impact on service targets is minimized.

4.1.7 Information Management
Key information involved in Event Management includes
the following:

■ SNMP messages, which are a standard way of
  communicating technical information about the status
  of components of an IT Infrastructure.
■ Management Information Bases (MIBs) of IT devices.
  An MIB is the database on each device that contains
  information about that device, including its operating
  system, BIOS version, configuration of system
  parameters, etc. The ability to interrogate MIBs and
  compare them to a norm is critical to being able to
  generate events.
■ Vendor’s monitoring tools agent software.
■ Correlation Engines contain detailed rules to
  determine the significance and appropriate response
  to events. Details on this are provided in paragraph
  4.1.5.6.
■ There is no standard Event Record for all types of
  event. The exact contents and format of the record
  depend on the tools being used, what is being
  monitored (e.g. a server and the Change Management
  tools will have very different data and probably use a
  different format). However, there is some key data that
  is usually required from each event to be useful in
  analysis. It should typically include the:
  ● Device
  ● Component
  ● Type of failure
  ● Date/time
  ● Parameters in exception
  ● Value.

4.1.8 Metrics
For each measurement period in question, the metrics to
check on the effectiveness and efficiency of the Event
Management process should include the following:
■ Number of events by category
■ Number of events by significance
■ Number and percentage of events that required
  human intervention and whether this was performed
■ Number and percentage of events that resulted in
  incidents or changes
■ Number and percentage of events caused by existing
  problems or Known Errors. This may result in a change
  to the priority of work on that problem or Known
  Error
■ Number and percentage of repeated or duplicated
  events. This will help in the tuning of the Correlation
  Engine to eliminate unnecessary event generation and
  can also be used to assist in the design of better
  event generation functionality in new services
■ Number and percentage of events indicating
  performance issues (for example, growth in the
  number of times an application exceeded its
  transaction thresholds over the past six months)
■ Number and percentage of events indicating potential
  availability issues (e.g. failovers to alternative devices,
  or excessive workload swapping)
■ Number and percentage of each type of event per
  platform or application
■ Number and ratio of events compared with the
  number of incidents.

4.1.9 Challenges, Critical Success Factors and risks

4.1.9.1 Challenges
There are a number of challenges that might be
encountered:
■ An initial challenge may be to obtain funding for the
  necessary tools and effort needed to install and exploit
  the benefits of the tools.
■ One of the greatest challenges is setting the correct
  level of filtering. Setting the level of filtering
  incorrectly can result in either being flooded with
  relatively insignificant events, or not being able to
  detect relatively important events until it is too late.
■ Rolling out of the necessary monitoring agents across
  the entire IT infrastructure may be a difficult and
  time-consuming activity requiring an ongoing commitment
  over quite a long period of time – there is a danger
  that other activities may arise that could divert
  resources and delay the rollout.
■ Acquiring the necessary skills can be time consuming
  and costly.

4.1.9.2 Critical Success Factors
In order to obtain the necessary funding a compelling
Business Case should be prepared showing how the
benefits of effective Event Management can far outweigh
the costs – giving a positive return on investment.
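
As an illustration only, the key Event Record data from 4.1.7, two of the metrics from 4.1.8 and the filtering level discussed in 4.1.9.1 can be sketched in a few lines of Python. Everything named here is an assumption made for the example rather than part of any Event Management tool: the three-level Significance scale, the field names, and the functions filter_events and summarize are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Significance(IntEnum):
    # Assumed three-level scale; real tools define their own levels.
    INFORMATIONAL = 1
    WARNING = 2
    EXCEPTION = 3


@dataclass(frozen=True)
class EventRecord:
    # Key data listed in 4.1.7; the field names are illustrative.
    device: str
    component: str
    failure_type: str
    timestamp: datetime
    parameter: str
    value: float
    significance: Significance
    needed_human: bool = False


def filter_events(events, minimum):
    """Pass on only events at or above the chosen significance level."""
    return [e for e in events if e.significance >= minimum]


def summarize(events):
    """Count events by significance and the share needing human intervention."""
    counts = Counter(e.significance.name for e in events)
    human = sum(1 for e in events if e.needed_human)
    percent = 100.0 * human / len(events) if events else 0.0
    return {
        "by_significance": dict(counts),
        "human_count": human,
        "human_percent": percent,
    }


# Made-up sample data for one measurement period.
now = datetime(2007, 5, 1, 9, 0)
events = [
    EventRecord("srv01", "disk0", "capacity", now, "used_pct", 71.0,
                Significance.INFORMATIONAL),
    EventRecord("srv01", "disk0", "capacity", now, "used_pct", 86.0,
                Significance.WARNING),
    EventRecord("srv02", "cpu", "threshold", now, "load_pct", 99.0,
                Significance.EXCEPTION, needed_human=True),
    EventRecord("sw01", "port7", "link", now, "status", 0.0,
                Significance.INFORMATIONAL),
]

passed = filter_events(events, Significance.WARNING)
print(len(passed), "of", len(events), "events passed the filter")
print(summarize(events))
```

Raising or lowering the minimum level passed to filter_events shows the trade-off described above: a low level floods the operator with informational events, while a high level may hide warnings until an exception occurs.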

One of the most important CSFs is achieving the correct
level of filtering. This is complicated by the fact that the
significance of events changes. For example, a user
logging into a system today is normal, but if that user
leaves the organization and tries to log in it is a security
breach.

There are three keys to the correct level of filtering,
as follows:
■ Integrate Event Management into all Service
  Management processes where feasible. This will ensure
  that only the events significant to these processes
  are reported.
■ Design new services with Event Management in mind
  (this is discussed in detail in paragraph 4.1.10).
■ Trial and error. No matter how thoroughly Event
  Management is prepared, there will be classes of
  events that are not properly filtered. Event
  Management must therefore include a formal process
  to evaluate the effectiveness of filtering.

Proper planning is needed for the rollout of the
monitoring agent software across the entire IT
Infrastructure. This should be regarded as a project with
realistic timescales and adequate resources being allocated
and protected throughout the duration of the project.

4.1.9.3 Risks
The key risks are really those already mentioned above:
failure to obtain adequate funding; ensuring the correct
level of filtering; and failure to maintain momentum in
rolling out the necessary monitoring agents across the IT
Infrastructure. If any of these risks is not addressed it could
adversely impact on the success of Event Management.

4.1.10 Designing for Event Management
Effective Event Management is not designed once a
service has been deployed into Operations. Since Event
Management is the basis for monitoring the performance
and availability of a service, the exact targets and
mechanisms for monitoring should be specified and
agreed during the Availability and Capacity Management
processes (see Service Design publication).

However, this does not mean that Event Management is
designed by a group of remote system developers and
then released to Operations Management together with
the system that has to be managed. Nor does it mean
that, once designed and agreed, Event Management
becomes static – day-to-day operations will define
additional events, priorities, alerts and other improvements
that will feed through the Continual Improvement process
back into Service Strategy, Service Design etc.

Service Operation functions will be expected to participate
in the design of the service and how it is measured (see
section 3.4).

For Event Management, the specific design areas include
the following.

4.1.10.1 Instrumentation
Instrumentation is the definition of what can be monitored
about CIs and the way in which their behaviour can be
affected. In other words, instrumentation is about defining
and designing exactly how to monitor and control the IT
Infrastructure and IT services.

Instrumentation is partly about a set of decisions that
need to be made and partly about designing mechanisms
to execute these decisions.

Decisions that need to be made include:
■ What needs to be monitored?
■ What type of monitoring is required (e.g. active or
  passive; performance or output)?
■ When do we need to generate an event?
■ What type of information needs to be communicated
  in the event?
■ Who are the messages intended for?

Mechanisms that need to be designed include:
■ How will events be generated?
■ Does the CI already have event generation
  mechanisms as a standard feature and, if so, which of
  these will be used? Are they sufficient or does the CI
  need to be customized to include additional
  mechanisms or information?
■ What data will be used to populate the Event Record?
■ Are events generated automatically or does the CI
  have to be polled?
■ Where will events be logged and stored?
■ How will supplementary data be gathered?

Note: A strong interface exists here with the application’s
design. All applications should be coded in such a way
that meaningful and detailed error messages/codes are
generated at the exact point of failure – so that these can
be included in the event and allow swift diagnosis and
resolution of the underlying cause. The need for the
inclusion and testing of such error messaging is covered in
more detail in the Service Transition publication.
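
The point made in 4.1.10.1 about generating meaningful error messages and codes at the exact point of failure can be illustrated with a minimal Python sketch. The names used (AppError, load_order, to_event, the code "ORD-404" and the device "app01") are hypothetical and chosen only for this example; a real application would follow the conventions of its own event tooling.

```python
from datetime import datetime, timezone


class AppError(Exception):
    """Hypothetical application error carrying a code and the failing component."""

    def __init__(self, code, message, component):
        super().__init__(message)
        self.code = code
        self.component = component


def load_order(order_id, store):
    # Raise at the exact point of failure, with a specific code and message.
    if order_id not in store:
        raise AppError("ORD-404", f"order {order_id} not found", "order-store")
    return store[order_id]


def to_event(error, device):
    """Build an event payload that carries the error details unchanged."""
    return {
        "device": device,
        "component": error.component,
        "code": error.code,
        "message": str(error),
        "time": datetime.now(timezone.utc).isoformat(),
    }


try:
    load_order(7, {})
except AppError as err:
    event = to_event(err, "app01")
    print(event["code"], event["message"])
```

Because the code and component are attached where the failure occurs, the resulting event needs no later interpretation of free text before it can be correlated or escalated.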



 4.1.10.2 Error messaging                                        4.1.10.4 Identification of thresholds
 Error messaging is important for all components                 Thresholds themselves are not set and managed through
 (hardware, software, networks, etc.). It is particularly        Event Management. However, unless these are properly
 important that all software applications are designed to        designed and communicated during the instrumentation
 support Event Management. This might include the                process, it will be difficult to determine which level of
 provision of meaningful error messages and/or codes that        performance is appropriate for each CI.
 clearly indicate the specific point of failure and the most
                                                                 Also, most thresholds are not constant. They typically
 likely cause. In such cases the testing of new applications
                                                                 consist of a number of related variables. For example, the
 should include testing of accurate event generation.
                                                                 maximum number of concurrent users before response
 Newer technologies such as Java Management Extensions           time slows will vary depending on what other jobs are
 (JMX) or HawkNL™ provide the tools for building                 active on the server. This knowledge is often only gained
 distributed, web-based, modular and dynamic solutions for       by experience, which means that Correlation Engines have
 managing and monitoring devices, applications and               to be continually tuned and updated through the process
 service-driven networks. These can be used to reduce or         of Continual Service Improvement.
 eliminate the need for programmers to include error
 messaging within the code – allowing a valuable level of
normalization and code-independence.

4.1.10.3 Event Detection and Alert Mechanisms
Good Event Management design will also include the design and population of the tools
used to filter, correlate and escalate Events.

The Correlation Engine specifically will need to be populated with the rules and criteria
that will determine the significance and subsequent action for each type of event.

Thorough design of the event detection and alert mechanisms requires the following:
■ Business knowledge in relation to any business processes being managed via Event
  Management
■ Detailed knowledge of the Service Level Requirements of the service being supported
  by each CI
■ Knowledge of who is going to be supporting the CI
■ Knowledge of what constitutes normal and abnormal operation of the CI
■ Knowledge of the significance of multiple similar events (on the same CI or various
  similar CIs)
■ An understanding of what they need to know to support the CI effectively
■ Information that can help in the diagnosis of problems with the CI
■ Familiarity with incident prioritization and categorization codes so that if it is
  necessary to create an Incident Record, these codes can be provided
■ Knowledge of other CIs that may be dependent on the affected CI, or those CIs on
  which it depends
■ Availability of Known Error information from vendors or from previous experience.

4.2 INCIDENT MANAGEMENT
In ITIL terminology, an ‘incident’ is defined as:

  An unplanned interruption to an IT service or reduction in the quality of an IT
  service. Failure of a configuration item that has not yet impacted service is also an
  incident, for example failure of one disk from a mirror set.

  Incident Management is the process for dealing with all incidents; this can include
  failures, questions or queries reported by the users (usually via a telephone call to
  the Service Desk), by technical staff, or automatically detected and reported by event
  monitoring tools.

4.2.1 Purpose/goal/objective
The primary goal of the Incident Management process is to restore normal service
operation as quickly as possible and minimize the adverse impact on business operations,
thus ensuring that the best possible levels of service quality and availability are
maintained. ‘Normal service operation’ is defined here as service operation within
SLA limits.

4.2.2 Scope
Incident Management includes any event which disrupts, or which could disrupt, a service.
This includes events which are communicated directly by users, either through the Service
Desk or through an interface from Event Management to Incident Management tools.

Incidents can also be reported and/or logged by technical staff (if, for example, they
notice something untoward with a hardware or network component they may report or log
an incident and refer it to the Service Desk). This does not

mean, however, that all events are incidents. Many classes of events are not related to
disruptions at all, but are indicators of normal operation or are simply informational
(see section 4.1).

Although both incidents and service requests are reported to the Service Desk, this does
not mean that they are the same. Service requests do not represent a disruption to agreed
service, but are a way of meeting the customer’s needs and may be addressing an agreed
target in an SLA. Service requests are dealt with by the Request Fulfilment process
(see section 4.3).

4.2.3 Value to business
The value of Incident Management includes:
■ The ability to detect and resolve incidents, which results in lower downtime to the
  business, which in turn means higher availability of the service. This means that the
  business is able to exploit the functionality of the service as designed.
■ The ability to align IT activity to real-time business priorities. This is because
  Incident Management includes the capability to identify business priorities and
  dynamically allocate resources as necessary.
■ The ability to identify potential improvements to services. This happens as a result
  of understanding what constitutes an incident and also from being in contact with the
  activities of business operational staff.
■ The Service Desk can, during its handling of incidents, identify additional service or
  training requirements found in IT or the business.

Incident Management is highly visible to the business, and it is therefore easier to
demonstrate its value than that of most other areas in Service Operation. For this reason,
Incident Management is often one of the first processes to be implemented in Service
Management projects. The added benefit of doing this is that Incident Management can be
used to highlight other areas that need attention – thereby providing a justification for
expenditure on implementing other processes.

4.2.4 Policies/principles/basic concepts
There are some basic things that need to be taken into account and decided when
considering Incident Management. These are covered in this section.

4.2.4.1 Timescales
Timescales must be agreed for all incident-handling stages (these will differ depending
upon the priority level of the incident) – based upon the overall incident response and
resolution targets within SLAs – and captured as targets within OLAs and Underpinning
Contracts (UCs). All support groups should be made fully aware of these timescales.
Service Management tools should be used to automate timescales and escalate the incident
as required based on pre-defined rules.

4.2.4.2 Incident Models
Many incidents are not new – they involve dealing with something that has happened before
and may well happen again. For this reason, many organizations will find it helpful to
pre-define ‘standard’ Incident Models – and apply them to appropriate incidents when
they occur.

An Incident Model is a way of pre-defining the steps that should be taken to handle a
process (in this case a process for dealing with a particular type of incident) in an
agreed way. Support tools can then be used to manage the required process. This will
ensure that ‘standard’ incidents are handled in a pre-defined path and within pre-defined
timescales.

Incidents which would require specialized handling can be treated in this way (for
example, security-related incidents can be routed to Information Security Management and
capacity- or performance-related incidents can be routed to Capacity Management).

The Incident Model should include:
■ The steps that should be taken to handle the incident
■ The chronological order these steps should be taken in, with any dependencies or
  co-processing defined
■ Responsibilities; who should do what
■ Timescales and thresholds for completion of the actions
■ Escalation procedures; who should be contacted and when
■ Any necessary evidence-preservation activities (particularly relevant for security- and
  capacity-related incidents).

The models should be input to the incident-handling support tools in use and the tools
should then automate the handling, management and escalation of the process.

4.2.4.3 Major incidents
A separate procedure, with shorter timescales and greater urgency, must be used for
‘major’ incidents. A definition of what constitutes a major incident must be agreed and
ideally mapped on to the overall incident prioritization system – such that they will be
dealt with through the major incident process.
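
As an illustration only, the elements of an Incident Model listed in section 4.2.4.2 (steps, order, responsibilities, timescales and escalation contacts) could be recorded in a support tool along the following lines. This is a minimal Python sketch: the class names, field names and the example 'password reset' model with its timescales are assumptions invented for the example, not something defined by ITIL or by any particular tool.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class ModelStep:
    """One pre-defined step of a hypothetical Incident Model."""
    name: str
    responsible: str        # who should do what
    time_limit: timedelta   # threshold for completing the step
    escalate_to: str        # who to contact once the threshold is passed
    depends_on: list[str] = field(default_factory=list)


@dataclass
class IncidentModel:
    """Steps are kept in the chronological order in which they should be taken."""
    incident_type: str
    steps: list[ModelStep]

    def overdue(self, elapsed: dict[str, timedelta]) -> list[tuple[str, str]]:
        """Return (step name, escalation contact) for each step over its limit."""
        return [
            (step.name, step.escalate_to)
            for step in self.steps
            if elapsed.get(step.name, timedelta(0)) > step.time_limit
        ]


# Illustrative model only; the names and timescales are invented.
password_reset = IncidentModel(
    incident_type="password reset failure",
    steps=[
        ModelStep("verify user identity", "Service Desk",
                  timedelta(minutes=10), "SD Supervisor"),
        ModelStep("reset account", "Second-level support",
                  timedelta(minutes=30), "Incident Manager",
                  depends_on=["verify user identity"]),
    ],
)

late = password_reset.overdue({
    "verify user identity": timedelta(minutes=5),
    "reset account": timedelta(minutes=45),
})
print(late)  # [('reset account', 'Incident Manager')]
```

Keeping the time limit and escalation contact on each step is one possible design; it lets a tool check thresholds step by step, which matches the idea that tools should automate timescales and escalation based on pre-defined rules.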



[Figure 4.2 Incident Management process flow. Inputs: from Event Management, from web
interface, user phone call, email/technical staff. Steps in order: Incident
Identification; Incident Logging; Incident Categorization; Service Request? (Yes: to
Request Fulfilment); Incident Prioritization; Major Incident? (Yes: Major Incident
Procedure); Initial Diagnosis; Functional Escalation Needed? (Yes: Functional Escalation,
2/3 Level); Hierarchic Escalation Needed? (Yes: Management Escalation); Investigation &
Diagnosis; Resolution and Recovery; Incident Closure; End.]
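
The decision points in Figure 4.2 can be read as a simple routing function. The sketch below is a hedged illustration in Python, not a definitive implementation: the `Incident` flags and the step labels are hypothetical names taken from the figure, the ordering of the two escalation checks is one reading of the figure, and a real tool would derive these decisions from logged data rather than from boolean flags.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical flags standing in for the decisions shown in Figure 4.2."""
    is_service_request: bool = False
    is_major: bool = False
    needs_functional_escalation: bool = False
    needs_hierarchic_escalation: bool = False


def route(incident: Incident) -> list[str]:
    """Return the ordered steps an incident passes through."""
    steps = ["identification", "logging", "categorization"]
    if incident.is_service_request:
        # Service requests leave the incident process at this point.
        return steps + ["to Request Fulfilment"]
    steps.append("prioritization")
    if incident.is_major:
        # Major incidents are handed to the separate major incident procedure.
        return steps + ["major incident procedure"]
    steps.append("initial diagnosis")
    if incident.needs_functional_escalation:
        steps.append("functional escalation (level 2/3)")
    if incident.needs_hierarchic_escalation:
        steps.append("management escalation")
    steps += ["investigation and diagnosis", "resolution and recovery", "closure"]
    return steps


print(route(Incident(needs_functional_escalation=True)))
```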

Note: People sometimes use loose terminology and/or confuse a major incident with a
problem. In reality, an incident remains an incident forever – it may grow in impact or
priority to become a major incident, but an incident never ‘becomes’ a problem. A problem
is the underlying cause of one or more incidents and remains a separate entity always!

Some lower-priority incidents may also have to be handled through this procedure – due to
potential business impact – and some major incidents may not need to be handled in this
way if the cause and resolutions are obvious and the normal incident process can easily
cope within agreed target resolution times – provided the impact remains low!

Where necessary, the major incident procedure should include the dynamic establishment of
a separate major incident team under the direct leadership of the Incident Manager, formed
to concentrate on this incident alone to ensure that adequate resources and focus are
provided to finding a swift resolution. If the Service Desk Manager is also fulfilling the
role of Incident Manager (say in a small organization), then a separate person may need to
be designated to lead the major incident investigation team – so as to avoid conflict of
time or priorities – but should ultimately report back to the Incident Manager.

If the cause of the incident needs to be investigated at the same time, then the Problem
Manager would be involved as well but the Incident Manager must ensure that service
restoration and underlying cause are kept separate. Throughout, the Service Desk would
ensure that all activities are recorded and users are kept fully informed of progress.

4.2.5 Process activities, methods and techniques
The process to be followed during the management of an incident is shown in Figure 4.2.
The process includes the following steps.

4.2.5.1 Incident identification
Work cannot begin on dealing with an incident until it is known that an incident has
occurred. It is usually unacceptable, from a business perspective, to wait until a user is
impacted and contacts the Service Desk. As far as possible, all key components should be
monitored so that failures or potential failures are detected early so that the incident
management process can be started quickly. Ideally, incidents should be resolved before
they have an impact on users!

Please see section 4.1 for further details.

4.2.5.2 Incident logging
All incidents must be fully logged and date/time stamped, regardless of whether they are
raised through a Service Desk telephone call or whether automatically detected via an
event alert.

Note: If Service Desk and/or support staff visit the customers to deal with one incident,
they may be asked to deal with further incidents ‘while they are there’. It is important
that if this is done, a separate Incident Record is logged for each additional incident
handled – to ensure that a historical record is kept and credit is given for the work
undertaken.

All relevant information relating to the nature of the incident must be logged so that a
full historical record is maintained – and so that if the incident has to be referred to
other support group(s), they will have all relevant information to hand to assist them.

The information needed for each incident is likely to include:
■ Unique reference number
■ Incident categorization (often broken down into between two and four levels of
  sub-categories)
■ Incident urgency
■ Incident impact
■ Incident prioritization
■ Date/time recorded
■ Name/ID of the person and/or group recording the incident
■ Method of notification (telephone, automatic, e-mail, in person, etc.)
■ Name/department/phone/location of user
■ Call-back method (telephone, mail, etc.)
■ Description of symptoms
■ Incident status (active, waiting, closed, etc.)
■ Related CI
■ Support group/person to which the incident is allocated
■ Related problem/Known Error
■ Activities undertaken to resolve the incident
■ Resolution date and time
■ Closure category
■ Closure date and time.

Note: If the Service Desk does not work 24/7 and responsibility for first-line incident
logging and handling passes to another group, such as IT Operations or Network



Support, out of Service Desk hours, then these staff need to be equally rigorous about
logging of incident details. Full training and awareness need to be provided to such staff
on this issue.

4.2.5.3 Incident categorization
Part of the initial logging must be to allocate suitable incident categorization coding so
that the exact type of the call is recorded. This will be important later when looking at
incident types/frequencies to establish trends for use in Problem Management, Supplier
Management and other ITSM activities.

Please note that the check for Service Requests in this process does not imply that
Service Requests are incidents. This is simply recognition of the fact that Service
Requests are sometimes incorrectly logged as incidents (e.g. a user incorrectly enters the
request as an incident from the web interface). This check will detect any such requests
and ensure that they are passed to the Request Fulfilment process.

Multi-level categorization is available in most tools – usually to three or four levels of
granularity. For example, an incident may be categorized as shown in Figure 4.3.

[Figure 4.3 Multi-level incident categorization. Example paths: Hardware > Server >
Memory Board > Card failure; or Software > Application > Finance suite > Purchase order
system.]

All organizations are unique and it is therefore difficult to give generic guidance on the
categories an organization should use, particularly at the lower levels. However, there is
a technique that can be used to assist an organization to achieve a correct and complete
set of categories – if they are starting from scratch! The steps involve:

1 Hold a brainstorming session among the relevant support groups, involving the SD
  Supervisor and Incident and Problem Managers.
2 Use this session to decide the ‘best guess’ top-level categories – and include an
  ‘other’ category. Set up the relevant logging tools to use these categories for a trial
  period.
3 Use the categories for a short trial period (long enough for several hundred incidents
  to fall into each category, but not so long that the analysis will take too long to
  perform).
4 Perform an analysis of the incidents logged during the trial period. The number of
  incidents logged in each higher-level category will confirm whether the categories are
  worth having – and a more detailed analysis of the ‘other’ category should allow
  identification of any additional higher-level categories that will be needed.
5 A breakdown analysis of the incidents within each higher-level category should be used
  to decide the lower-level categories that will be required.
6 Review and repeat these activities after a further period – of, say, one to three
  months – and again regularly to ensure that they remain relevant. Be aware that any
  significant changes to categorization may cause some difficulties for incident trending
  or management reporting – so they should be stabilized unless changes are genuinely
  required.

If an existing categorization scheme is in use, but it is not thought to be working
satisfactorily, the basic idea of the technique suggested above can be used to review and
amend the existing scheme.

NOTE: Sometimes the details available at the time an incident is logged may be incomplete,
misleading or incorrect. It is therefore important that the categorization of the incident
is checked, and updated if necessary, at call closure time (in a separate closure
categorization field, so as not to corrupt the original categorization) – please see
paragraph 4.2.5.9.

4.2.5.4 Incident prioritization
Another important aspect of logging every incident is to agree and allocate an appropriate
prioritization code – as this will determine how the incident is handled both by support
tools and support staff.

Prioritization can normally be determined by taking into account both the urgency of the
incident (how quickly the

business needs a resolution) and the level of impact it is causing. An indication of
impact is often (but not always) the number of users being affected. In some cases, and
very importantly, the loss of service to a single user can have a major business impact –
it all depends upon who is trying to do what – so numbers alone are not enough to evaluate
overall priority! Other factors that can also contribute to impact levels are:
■ Risk to life or limb
■ The number of services affected – may be multiple services
■ The level of financial losses
■ Effect on business reputation
■ Regulatory or legislative breaches.

An effective way of calculating these elements and deriving an overall priority level for
each incident is given in Table 4.1:

Table 4.1 Simple priority coding system

                        Impact
                        High      Medium    Low
Urgency     High        1         2         3
            Medium      2         3         4
            Low         3         4         5

Priority code    Description    Target resolution time
1                Critical       1 hour
2                High           8 hours
3                Medium         24 hours
4                Low            48 hours
5                Planning       Planned

In all cases, clear guidance – with practical examples – should be provided for all
support staff to enable them to determine the correct urgency and impact levels, so the
correct priority is allocated. Such guidance should be produced during service level
negotiations.

However, it must be noted that there will be occasions when, because of particular
business expediency or for some other reason, normal priority levels have to be
overridden. When a user is adamant that an incident’s priority level should exceed normal
guidelines, the Service Desk should comply with such a request – and if it subsequently
turns out to be incorrect this can be resolved as an off-line management level issue,
rather than a dispute occurring when the user is on the telephone.

Some organizations may also recognize VIPs (high-ranking executives, officers, diplomats,
politicians, etc.) whose incidents would be handled at a higher priority than normal – but
in such cases this is best catered for and documented within the guidance provided to the
Service Desk staff on how to apply the priority levels, so they are all aware of the
agreed rules for VIPs, and who falls into this category.

It should be noted that an incident’s priority may be dynamic – if circumstances change,
or if an incident is not resolved within SLA target times, then the priority must be
altered to reflect the new situation.

Note: some tools may have constraints that make it difficult automatically to calculate
performance against SLA targets if a priority is changed during the lifetime of an
incident. However, if circumstances do change, the change in priority should be made – and
if necessary manual adjustments made to reporting tools. Ideally, tools with such
constraints should not be selected.

4.2.5.5 Initial diagnosis
If the incident has been routed via the Service Desk, the Service Desk Analyst must carry
out initial diagnosis, typically while the user is still on the telephone – if the call is
raised in this way – to try to discover the full symptoms of the incident and to determine
exactly what has gone wrong and how to correct it. It is at this stage that diagnostic
scripts and known error information can be most valuable in allowing earlier and accurate
diagnosis.

If possible, the Service Desk Analyst will resolve the incident while the user is still on
the telephone – and close the incident if the resolution is successful.

If the Service Desk Analyst cannot resolve the incident while the user is still on the
telephone, but there is a prospect that the Service Desk may be able to do so within the
agreed time limit without assistance from other support groups, the Analyst should inform
the user of their intentions, give the user the incident reference number and attempt to
find a resolution.

4.2.5.6 Incident escalation
■ Functional escalation. As soon as it becomes clear that the Service Desk is unable to
  resolve the incident itself (or when target times for first-point resolution have been
  exceeded – whichever comes first!) the incident must be immediately escalated for
  further support.
  If the organization has a second-level support group and the Service Desk believes that
  the incident can be resolved by that group, it should refer the incident to



   them. If it is obvious that the incident will need          and/or Incident Management staff initially, in conjunction
   deeper technical knowledge, or when the second-level        with managers of the various support groups to which
   group has not been able to resolve the incident within      incidents are escalated, to decide the order in which
   agreed target times (whichever comes first), the            incidents should be picked up and actively worked on.
   incident must be immediately escalated to the               These managers must ensure that incidents are dealt with
   appropriate third-level support group. Note that third-     in true business priority order and that staff are not
   level support groups may be internal – but they may         allowed to ‘cherry-pick’ the incidents they choose!
   also be third parties such as software suppliers or
   hardware manufacturers or maintainers. The rules for        4.2.5.7 Investigation and Diagnosis
   escalation and handling of incidents must be agreed         In the case of incidents where the user is just seeking
   in OLAs and UCs with internal and external support          information, the Service Desk should be able to provide
   groups respectively.                                        this fairly quickly and resolve the service request – but if a
   Note: Incident Ownership remains with the Service           fault is being reported, this is an incident and likely to
   Desk! Regardless of where an incident is referred to        require some degree of investigation and diagnosis.
   during its life, ownership of the incident remains with
                                                               Each of the support groups involved with the incident
   the Service Desk at all times. The Service Desk remains
                                                               handling will investigate and diagnose what has gone
   responsible for tracking progress, keeping users
                                                               wrong – and all such activities (including details of any
   informed and ultimately for Incident Closure.
                                                               actions taken to try to resolve or re-create the incident)
 ■ Hierarchic escalation. If incidents are of a serious
                                                               should be fully documented in the incident record so that
   nature (for example Priority 1 incidents) the
                                                               a complete historical record of all activities is maintained
   appropriate IT managers must be notified, for
                                                               at all times.
   informational purposes at least. Hierarchic escalation is
   also used if the ‘Investigation and Diagnosis’ and          Note: Valuable time can often be lost if investigation and
   ‘Resolution and Recovery’ steps are taking too long or      diagnostic action (or indeed resolution or recovery actions)
   proving too difficult. Hierarchic escalation should         are performed serially. Where possible, such activities
   continue up the management chain so that senior             should be performed in parallel to reduce overall
   managers are aware and can be prepared and take             timescales – and support tools should be designed and/or
   any necessary action, such as allocating additional         selected to allow this. However, care should be taken to
   resources or involving suppliers/maintainers. Hierarchic    coordinate activities, particularly resolution or recovery
   escalation is also used when there is contention about      activities, otherwise the actions of different groups may
   to whom the incident is allocated.                          conflict or further complicate a resolution!
   Hierarchic escalation can, of course, be initiated by the   This investigation is likely to include such actions as:
   affected users or customer management, as they see
                                                               ■ Establishing exactly what has gone wrong or what is being
   fit – that is why it is important that IT managers are
   made aware so that they can anticipate and prepare              sought by the user
   for any such escalation.                                    ■   Understanding the chronological order of events
                                                               ■   Confirming the full impact of the incident, including
 The exact levels and timescales for both functional and
                                                                   the number and range of users affected
 hierarchic escalation need to be agreed, taking into
                                                               ■   Identifying any events that could have triggered the
 account SLA targets, and embedded within support tools
                                                                   incident (e.g. a recent change, some user action?)
 which can then be used to police and control the process
                                                               ■   Knowledge searches looking for previous occurrences
 flow within agreed timescales.
                                                                   by searching previous Incident/Problem Records
 The Service Desk should keep the user informed of any             and/or Known Error Databases or
 relevant escalation that takes place and ensure the               manufacturers’/suppliers’ Error Logs or Knowledge
 Incident Record is updated accordingly to keep a full             Databases.
 history of actions.
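
The escalation levels and timescales discussed above could be embedded in a support tool roughly as follows. This is only a minimal sketch: the priority codes, the minute thresholds, the level names and the function `escalation_due` are illustrative assumptions, not values taken from this publication, and real targets would come from the agreed SLAs, OLAs and UCs.

```python
from datetime import datetime, timedelta

# Hypothetical targets: minutes allowed at each support level before
# functional escalation, keyed by incident priority (assumed values).
FUNCTIONAL_TARGETS = {
    1: {"first": 15, "second": 60},
    2: {"first": 30, "second": 240},
    3: {"first": 60, "second": 480},
}

# Assumed rule: Priority 1 incidents always trigger hierarchic notification.
HIERARCHIC_NOTIFY_PRIORITIES = {1}

NEXT_LEVEL = {"first": "second", "second": "third"}


def escalation_due(priority, level, assigned_at, now):
    """Return the actions a support tool might raise for one incident."""
    actions = []
    # Hierarchic escalation: managers are informed of serious incidents.
    if priority in HIERARCHIC_NOTIFY_PRIORITIES:
        actions.append("notify IT managers")
    # Functional escalation: the current level has exceeded its target time.
    target = FUNCTIONAL_TARGETS.get(priority, {}).get(level)
    if target is not None and now - assigned_at > timedelta(minutes=target):
        actions.append(f"escalate to {NEXT_LEVEL[level]}-level support")
    return actions
```

A tool could run such a check periodically against every open incident, and the Service Desk would record each resulting action in the Incident Record.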

 Note regarding Incident allocation                            4.2.5.8 Resolution and Recovery
 There may be many incidents in a queue with the same          When a potential resolution has been identified, this
 priority level – so it will be the job of the Service Desk    should be applied and tested. The specific actions to be
                                                               undertaken and the people who will be involved in taking

the recovery actions may vary, depending upon the nature          ■ Ongoing or recurring problem? Determine (in
of the fault – but could involve:                                   conjunction with resolver groups) whether it is likely
                                                                    that the incident could recur and decide whether any
■ Asking the user to undertake directed activities on
                                                                    preventive action is necessary to avoid this. In
  their own desk top or remote equipment
                                                                    conjunction with Problem Management, raise a
■ The Service Desk implementing the resolution either
                                                                    Problem Record in all such cases so that preventive
  centrally (say, rebooting a server) or remotely using
                                                                    action is initiated.
  software to take control of the user’s desktop to
                                                                  ■ Formal closure. Formally close the Incident Record.
  diagnose and implement a resolution
■ Specialist support groups being asked to implement              Note: Some organizations may choose to utilize an
  specific recovery actions (e.g. Network Support                 automatic closure period on specific, or even all, incidents
  reconfiguring a router)                                         (e.g. incident will be automatically closed after two
■ A third-party supplier or maintainer being asked to             working days if no further contact is made by the user).
  resolve the fault.                                              Where this approach is to be considered, it must first be
                                                                  fully discussed and agreed with the users – and widely
Even when a resolution has been found, sufficient testing         publicized so that all users and IT staff are aware of this. It
must be performed to ensure that recovery action is               may be inappropriate to use this method for certain types
complete and that the service has been fully restored to          of incidents – such as major incidents or those involving
the user(s).                                                      VIPs, etc.
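
The automatic closure period mentioned in the note on Incident Closure could be checked as sketched below. The two-working-day period is the example given in the note; the function names, the Monday-to-Friday working week and the absence of public holidays are assumptions made purely for illustration.

```python
from datetime import date, timedelta

AUTO_CLOSE_WORKING_DAYS = 2  # example period from the note on closure


def add_working_days(start, days):
    """Advance `start` by `days` working days, skipping Saturday and Sunday."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current


def should_auto_close(resolved_on, last_user_contact, today,
                      major_incident=False, vip_user=False):
    """Decide whether a resolved incident may be closed automatically."""
    if major_incident or vip_user:
        return False  # assumed exclusions: these are closed manually
    # The period restarts whenever the user makes further contact.
    reference = max(resolved_on, last_user_contact)
    return today >= add_working_days(reference, AUTO_CLOSE_WORKING_DAYS)
```

Whatever period and exclusions are chosen, they would need to be agreed with users and publicized before being switched on.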
NOTE: in some cases it may be necessary for two or more
                                                                  Rules for re-opening incidents
groups to take separate, though perhaps coordinated,
recovery actions for an overall resolution to be                  Despite all adequate care, there will be occasions when
implemented. In such cases Incident Management must               incidents recur even though they have been formally
coordinate the activities and liaise with all parties involved.   closed. Because of such cases, it is wise to have pre-
                                                                  defined rules about if and when an incident can be re-
Regardless of the actions taken, or who does them, the            opened. It might make sense, for example, to agree that if
Incident Record must be updated accordingly with all              the incident recurs within one working day then it can be
relevant information and details so that a full history           re-opened – but that beyond this point a new incident
is maintained.                                                    must be raised, but linked to the previous incident(s).
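
The example re-opening rule (re-open within one working day, otherwise raise a new linked incident) might be expressed as in the following sketch. The `Incident` class, the reference format and the use of 24 elapsed hours as a stand-in for "one working day" are all assumptions for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed simplification: one working day is treated as 24 elapsed hours.
REOPEN_WINDOW = timedelta(hours=24)


@dataclass
class Incident:
    ref: str
    status: str = "open"
    closed_at: Optional[datetime] = None
    linked_refs: List[str] = field(default_factory=list)


def handle_recurrence(previous, reported_at, new_ref):
    """Re-open `previous` if inside the window, else raise a linked incident."""
    if (previous.closed_at is not None
            and reported_at - previous.closed_at <= REOPEN_WINDOW):
        previous.status = "open"
        previous.closed_at = None
        return previous
    # Outside the window: a new incident is raised and both are cross-linked.
    new_incident = Incident(ref=new_ref, linked_refs=[previous.ref])
    previous.linked_refs.append(new_ref)
    return new_incident
```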
The resolving group should pass the incident back to the          The exact time threshold/rules may vary between
Service Desk for closure action.                                  individual organizations – but clear rules should be agreed
                                                                  and documented and guidance given to all Service Desk
4.2.5.9 Incident Closure                                          staff so that uniformity is applied.
The Service Desk should check that the incident is fully
resolved and that the users are satisfied and willing to          4.2.6 Triggers, input and output/inter-
agree the incident can be closed. The Service Desk should         process interfaces
also check the following:
                                                                  Incidents can be triggered in many ways. The most
■ Closure categorization. Check and confirm that the              common route is when a user rings the Service Desk or
  initial incident categorization was correct or, where           completes a web-based incident-logging screen, but
  the categorization subsequently turned out to be                increasingly incidents are raised automatically via Event
  incorrect, update the record so that a correct closure          Management tools. Technical staff may notice potential
  categorization is recorded for the incident – seeking           failures and raise an incident, or ask the Service Desk to do
  advice or guidance from the resolving group(s) as               so, so that the fault can be addressed. Some incidents may
  necessary.                                                      also arise at the initiation of suppliers – who may send
■ User satisfaction survey. Carry out a user satisfaction         some form of notification of a potential or actual difficulty
  call-back or e-mail survey for the agreed percentage of         that needs attention.
  incidents.                                                      The interfaces with Incident Management include:
■ Incident documentation. Chase any outstanding
                                                                  ■ Problem Management: Incident Management forms
  details and ensure that the Incident Record is fully
  documented so that a full historic record at a                     part of the overall process of dealing with problems in
  sufficient level of detail is complete.                            the organization. Incidents are often caused by
                                                                     underlying problems, which must be solved to prevent



     the incident from recurring. Incident Management          ■ The Incident Management tools, which contain
     provides a point where these are reported.                  information about:
 ■   Configuration Management provides the data used             ● Incident and problem history
     to identify and progress incidents. One of the uses of      ● Incident categories
     the CMS is to identify faulty equipment and to assess       ● Action taken to resolve incidents
     the impact of an incident. It is also used to identify      ● Diagnostic scripts which can help first-line analysts
     the users affected by potential problems. The CMS               to resolve the incident, or at least gather
     also contains information about which categories of             information that will help second- or third-line
     incident should be assigned to which support group.             analysts resolve it faster.
     In turn, Incident Management can maintain the status
                                                               ■ Incident Records, which include the following data:
     of faulty CIs. It can also assist Configuration
                                                                 ● Unique reference number
     Management to audit the infrastructure when working
                                                                 ● Incident classification
     to resolve an incident.
 ■   Change Management: Where a change is required to            ● Date and time of recording and any subsequent
     implement a workaround or resolution, this will need            activities
     to be logged as an RFC and progressed through               ● Name and identity of the person recording and
     Change Management. In turn, Incident Management is              updating the Incident Record
     able to detect and resolve incidents that arise from        ● Name/organization/contact details of affected
     failed changes.                                                 user(s)
 ■   Capacity Management: Incident Management                    ● Description of the incident symptoms
     provides a trigger for performance monitoring where         ● Details of any actions taken to try to diagnose,
     there appears to be a performance problem. Capacity             resolve or re-create the incident
     Management may develop workarounds for incidents.           ● Incident category, impact, urgency and priority
 ■   Availability Management will use Incident                   ● Relationship with other incidents, problems,
     Management data to determine the availability of IT             changes or Known Errors
     services and look at where the incident lifecycle can       ● Closure details, including time, category, action
     be improved.                                                    taken and identity of person closing the record.
 ■   SLM: The ability to resolve incidents in a specified
                                                               Incident Management also requires access to the CMS.
     time is a key part of delivering an agreed level of
                                                               This will help it to identify the CIs affected by the incident
     service. Incident Management enables SLM to define
                                                               and also to estimate the impact of the incident.
     measurable responses to service disruptions. It also
     provides reports that enable SLM to review SLAs           The Known Error Database provides valuable information
     objectively and regularly. In particular, Incident        about possible resolutions and workarounds. This is
     Management is able to assist in defining where            discussed in detail in paragraph 4.4.7.2.
     services are at their weakest, so that SLM can define
     actions as part of the Service Improvement Plan (SIP) –   4.2.8 Metrics
     please see the Continual Service Improvement              The metrics that should be monitored and reported upon
     publication for more details. SLM defines the             to judge the efficiency and effectiveness of the Incident
     acceptable levels of service within which Incident        Management process, and its operation, will include:
     Management works, including:
                                                               ■ Total numbers of Incidents (as a control measure)
     ● Incident response times
                                                               ■ Breakdown of incidents at each stage (e.g. logged,
     ● Impact definitions
                                                                   work in progress, closed etc)
     ● Target fix times
                                                               ■   Size of current incident backlog
     ● Service definitions, which are mapped to users
                                                               ■   Number and percentage of major incidents
     ● Rules for requesting services
                                                               ■   Mean elapsed time to achieve incident resolution or
     ● Expectations for providing feedback to users.
                                                                   circumvention, broken down by impact code
                                                               ■   Percentage of incidents handled within agreed
 4.2.7 Information Management
                                                                   response time (incident response-time targets may be
 Most information used in Incident Management comes                specified in SLAs, for example, by impact and urgency
 from the following sources:                                       codes)

■ Average cost per incident                                     ■ Integration into the SLM process. This will assist
■ Number of incidents reopened and as a percentage of               Incident Management correctly to assess the impact
    the total                                                       and priority of incidents and assists in defining and
■ Number and percentage of incidents incorrectly                    executing escalation procedures. SLM will also benefit
    assigned                                                        from the information learned during Incident
■ Number and percentage of incidents incorrectly                    Management, for example in determining whether
    categorized                                                     service level performance targets are realistic and
                                                                    achievable.
■   Percentage of Incidents closed by the Service Desk
    without reference to other levels of support (often         4.2.9.2 Critical Success Factors
    referred to as ‘first point of contact’)
                                                                The following factors will be critical for successful Incident
■   Number and percentage of incidents processed                Management:
    per Service Desk agent
■   Number and percentage of incidents resolved                 ■ A good Service Desk is key to successful Incident
    remotely, without the need for a visit                          Management
■   Number of incidents handled by each Incident Model          ■   Clearly defined targets to work to – as defined in SLAs
■   Breakdown of incidents by time of day, to help              ■   Adequate customer-oriented and technically trained
    pinpoint peaks and ensure matching of resources.                support staff with the correct skill levels, at all stages
                                                                    of the process
Reports should be produced under the authority of the           ■   Integrated support tools to drive and control the
Incident Manager, who should draw up a schedule and                 process
distribution list, in collaboration with the Service Desk and
                                                                ■   OLAs and UCs that are capable of influencing and
support groups handling incidents. Distribution lists
                                                                    shaping the correct behaviour of all support staff.
should at least include IT Services Management and
specialist support groups. Consider also making the data
                                                                4.2.9.3 Risks
available to users and customers, for example via SLA
reports.                                                        The risks to successful Incident Management are actually
                                                                similar to some of the challenges and the reverse of some
4.2.9 Challenges, Critical Success Factors                      of the Critical Success Factors mentioned above. They
                                                                include:
and risks
                                                                ■ Being inundated with incidents that cannot be
4.2.9.1 Challenges                                                handled within acceptable timescales due to a lack of
The following challenges will exist for successful Incident       available or properly trained resources
Management:                                                     ■ Incidents being bogged down and not progressed as
■ The ability to detect incidents as early as possible. This
                                                                  intended because of inadequate support tools to raise
  will require education of the users reporting incidents,        alerts and prompt progress
  the use of Super Users (see paragraph 6.2.4.5) and the        ■ Lack of adequate and/or timely information sources
  configuration of Event Management tools.                        because of inadequate tools or lack of integration
■ Convincing all staff (technical teams as well as users)       ■ Mismatches in objectives or actions because of poorly
  that all incidents must be logged, and encouraging              aligned or non-existent OLAs and/or UCs.
  the use of self-help web-based capabilities (which can
  speed up assistance and reduce resource                       4.3 REQUEST FULFILMENT
  requirements).
                                                                The term ‘Service Request’ is used as a generic description
■ Availability of information about problems and Known
                                                                for many varying types of demands that are placed upon
  Errors. This will enable Incident Management staff to
                                                                the IT Department by the users. Many of these are actually
  learn from previous incidents and also to track the
                                                                small changes – low risk, frequently occurring, low cost,
  status of resolutions.
                                                                etc. (e.g. a request to change a password, a request to
■ Integration into the CMS to determine relationships
                                                                install an additional software application onto a particular
  between CIs and to refer to the history of CIs when
                                                                workstation, a request to relocate some items of desktop
  performing first-line support.
                                                                equipment) or maybe just a question requesting
                                                                information – but their scale and frequent, low-risk nature



 means that they are better handled by a separate process,       through the Request Fulfilment process and which others
 rather than being allowed to congest and obstruct the           will have to go through more formal Change
 normal Incident and Change Management processes.                Management. There will always be grey areas which
                                                                 prevent generic guidance from being usefully prescribed.
 4.3.1 Purpose/goal/objective
 Request Fulfilment is the process of dealing with Service       4.3.3 Value to business
 Requests from the users. The objectives of the Request          The value of Request Fulfilment is to provide quick and
 Fulfilment process include:                                     effective access to standard services which business staff
                                                                 can use to improve their productivity or the quality of
 ■ To provide a channel for users to request and receive
                                                                 business services and products.
   standard services for which a pre-defined approval and
   qualification process exists                                  Request Fulfilment effectively reduces the bureaucracy
 ■ To provide information to users and customers about           involved in requesting and receiving access to existing or
   the availability of services and the procedure for            new services, thus also reducing the cost of providing
   obtaining them                                                these services. Centralizing fulfilment also increases the
 ■ To source and deliver the components of requested             level of control over these services. This in turn can help
   standard services (e.g. licences and software media)          reduce costs through centralized negotiation with
 ■ To assist with general information, complaints or             suppliers, and can also help to reduce the cost of support.
   comments.
                                                                 4.3.4 Policies/principles/basic concepts
 4.3.2 Scope                                                     Many Service Requests will be frequently recurring, so a
 The process needed to fulfil a request will vary depending      predefined process flow (a model) can be devised to
 upon exactly what is being requested – but can usually be       include the stages needed to fulfil the request, the
 broken down into a set of activities that have to be            individuals or support groups involved, target timescales
 performed. Some organizations will be comfortable to let        and escalation paths. Service Requests will usually be
 the Service Requests be handled through their Incident          satisfied by implementing a Standard Change (see the
 Management processes (and tools) – with Service Requests        Service Transition publication for further details on
 being handled as a particular type of ‘incident’ (using a       Standard Changes). The ownership of Service Requests
 high-level categorization system to identify those              resides with the Service Desk, which monitors, escalates,
 ‘incidents’ that are in fact Service Requests).                 dispatches and often fulfils the user request.

 Note, however, that there is a significant difference here –    4.3.4.1 Request Models
 an incident is usually an unplanned event whereas a
                                                                 Some Service Requests will occur frequently and will
 Service Request is usually something that can and should
                                                                 require handling in a consistent manner in order to meet
 be planned!
                                                                 agreed service levels. To assist this, many organizations will
 Therefore, in an organization where large numbers of            wish to create pre-defined Request Models (which typically
 Service Requests have to be handled, and where the              include some form of pre-approval by Change
 actions to be taken to fulfil those requests are very varied    Management). This is similar in concept to the idea of
 or specialized, it may be appropriate to handle Service         Incident Models already described in paragraph 4.2.4.2,
 Requests as a completely separate work stream – and to          but applied to Service Requests.
 record and manage them as a separate record type.
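
A Request Model of the kind described in paragraphs 4.3.4 and 4.3.4.1 (stages, the individuals or groups involved, target timescales and escalation paths) could be held as a simple data structure. The sketch below is one possible shape; the class names, field names, group names and hour values are hypothetical and not prescribed by this publication.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    group: str          # individual or support group responsible
    target_hours: int   # target timescale for this stage
    escalate_to: str    # escalation path if the target is missed


@dataclass
class RequestModel:
    request_type: str
    pre_approved: bool  # e.g. pre-approval by Change Management
    stages: List[Stage]

    def total_target_hours(self):
        """Overall target time if the stages run one after another."""
        return sum(stage.target_hours for stage in self.stages)


# Illustrative model for a frequently recurring request.
PASSWORD_RESET = RequestModel(
    request_type="password reset",
    pre_approved=True,
    stages=[
        Stage("verify identity", "Service Desk", 1, "Service Desk Supervisor"),
        Stage("reset password", "Service Desk", 1, "Security Administration"),
        Stage("confirm with user", "Service Desk", 2, "Service Desk Supervisor"),
    ],
)
```

Holding models as data rather than as free-text procedures makes it easier for a tool to apply them consistently each time the same type of request is logged.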
 This may be particularly appropriate if the organization        4.3.5 Process activities, methods and
 has chosen to widen the scope of the Service Desk to            techniques
 expand upon just IT-related issues and use the desk as a
 focal point for other types of request for service – for
                                                                 4.3.5.1 Menu selection
 example, a request to service a photocopier or even going       Request Fulfilment offers great opportunities for self-help
 so far as to include, for example, building management          practices where users can generate a Service Request
 issues, such as a need to replace a light fitment or repair a   using technology that links into Service Management
 leak in the plumbing.                                           tools. Ideally, users should be offered a ‘menu’-type
                                                                 selection via a web interface, so that they can select and
 Note: It will ultimately be up to each organization to          input details of Service Requests from a pre-defined list –
 decide and document which request it will handle

appropriate expectations can be set by giving target            4.3.5.5 Closure
delivery and/or implementation targets/dates (in line with      When the Service Request has been fulfilled it must be
SLA targets). Where organizations are offering a self-help IT   referred back to the Service Desk for closure. The Service
support capability to the users, it would make sense to         Desk should go through the same closure process as
combine this with a Request Fulfilment system as                described earlier in paragraph 4.2.5.9 – checking that the
described.                                                      user is satisfied with the outcome.
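
The 'menu'-type selection with target dates described under Menu selection can be illustrated with a small sketch. The catalogue entries, the day counts, the use of calendar days and the function `submit_request` are assumptions chosen for the example; actual targets would be set in line with the SLAs.

```python
from datetime import date, timedelta

# Hypothetical request menu: item -> target delivery time in calendar days.
CATALOGUE = {
    "password reset": 1,
    "software installation": 3,
    "desktop relocation": 5,
}


def submit_request(item, user, requested_on):
    """Log a Service Request chosen from the menu and set its target date."""
    if item not in CATALOGUE:
        raise ValueError(f"'{item}' is not on the request menu")
    return {
        "item": item,
        "user": user,
        "requested_on": requested_on,
        # Shown to the user straight away so that expectations are set.
        "target_date": requested_on + timedelta(days=CATALOGUE[item]),
        "status": "logged",
    }
```

Restricting input to a pre-defined list means every request arrives with a known type and a known target, which is what allows the expectation to be stated at the moment of logging.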
Specialist web tools to offer this type of ‘shopping basket’
experience can be used together with interfaces directly to     4.3.6 Triggers, input and output/inter-
the back-end integrated ITSM tools, or other more general       process interfaces
business process automation or Enterprise Resource              Most requests will be triggered through either a user
Planning (ERP) tools that may be used for management of         calling the Service Desk or a user completing some form
the Request Fulfilment activities.                              of self-help web-based input screen to make their request.
                                                                The latter will often involve a selection from a portfolio of
4.3.5.2 Financial approval                                      available request types.
One important extra step that is likely to be needed when
                                                                The primary interfaces with Request Fulfilment include:
dealing with a service request is that of financial approval.
Most requests will have some form of financial
implications, regardless of the type of commercial
arrangements in place. The cost of fulfilling the request
must first be established. It may be possible to agree fixed
prices for ‘standard’ requests – and prior approval for such
requests may be given as part of the organization’s overall
annual financial management. In all other cases, an
estimate of the cost must be produced and submitted to
the user for financial approval (the user may need to seek
approval up their management/financial chain). If approval
is given, in addition to fulfilling the request, the process
must also include charging (billing or cross-charging) for
the work done – if charging is in place.

4.3.5.3 Other approval
In some cases further approval may be needed – such as
compliance-related or wider business approval. Request
Fulfilment must have the ability to define and check such
approvals where needed.

4.3.5.4 Fulfilment
The actual fulfilment activity will depend upon the nature
of the Service Request. Some simpler requests may be
completed by the Service Desk, acting as first-line support,
while others will have to be forwarded to specialist groups
and/or suppliers for fulfilment.

Some organizations may have specialist fulfilment groups
(to ‘pick, pack and dispatch’) – or may have outsourced
some fulfilment activities to a third-party supplier(s). The
Service Desk should monitor and chase progress and keep
users informed throughout, regardless of the actual
fulfilment source.

■ Service Desk/Incident Management: Many Service
  Requests may come in via the Service Desk and may
  be initially handled through the Incident Management
  process. Some organizations may choose that all
  requests are handled via this route – but others may
  choose to have a separate process, for reasons already
  discussed earlier in this chapter.
■ A strong link is also needed between Request
  Fulfilment, Release, Asset and Configuration
  Management – as some requests will be for the
  deployment of new or upgraded components that can
  be automatically deployed. In such cases the ‘release’
  can be pre-defined, built and tested but only
  deployed upon request by those who want the
  ‘release’. Upon deployment, the CMS will have to be
  updated to reflect the change. Where appropriate,
  software licence checks/updates will also be necessary.

Where appropriate, it will be necessary to relate IT-related
Service Requests to any incidents or problems that have
initiated the need for the request (as would be the case
for any other type of change).

4.3.7 Information Management
Request Fulfilment is dependent on information from the
following sources:
■ The Service Requests will contain information about:
   ● What service is being requested
   ● Who requested and authorized the service
   ● Which process will be used to fulfil the request
   ● To whom it was assigned and what action
     was taken
   ● The date and time when the request was logged
     as well as the date and time of all actions taken
   ● Closure details.
■ Requests for Change: In some cases the Request
  Fulfilment process will be initiated by an RFC. This is
  typical where the Service Request relates to a CI.
■ The Service Portfolio, to enable the scope of agreed
  Service Requests to be identified.
■ Security Policies will prescribe any controls to be
  executed or adhered to when providing the service,
  e.g. ensuring that the requester is authorized to access
  the service, or that the software is licensed.

4.3.8 Metrics
The metrics needed to judge the effectiveness and
efficiency of Request Fulfilment will include the following
(each metric will need to be broken down by request
type, within the period):
■ The total number of Service Requests (as a control
  measure)
■ Breakdown of service requests at each stage (e.g.
  logged, WIP, closed, etc.)
■ The size of current backlog of outstanding Service
  Requests
■ The mean elapsed time for handling each type of
  Service Request
■ The number and percentage of Service Requests
  completed within agreed target times
■ The average cost per type of Service Request
■ Level of client satisfaction with the handling of Service
  Requests (as measured in some form of satisfaction
  survey).

4.3.9 Challenges, Critical Success Factors and risks

4.3.9.1 Challenges
The following challenges will be faced when introducing
Request Fulfilment:
■ Clearly defining and documenting the type of requests
  that will be handled within the Request Fulfilment
  process (and those that will either go through the
  Service Desk and be handled as incidents or those
  that will need to go through formal Change
  Management) – so that all parties are absolutely clear
  on the scope.
■ Establishing self-help front-end capabilities that allow
  the users to interface successfully with the Request
  Fulfilment process.

4.3.9.2 Critical Success Factors
Request Fulfilment depends on the following Critical
Success Factors:
■ Agreement of what services will be standardized and
  who is authorized to request them. The cost of these
  services must also be agreed. This may be done as
  part of the SLM process. Any variances of the services
  must also be defined.
■ Publication of the services to users as part of the
  Service Catalogue. This part of the
  Service Catalogue must be easily accessed, perhaps on
  the Intranet, and should be recognized as the first
  source of information for users seeking access to a
  service.
■ Definition of a standard fulfilment procedure for each
  of the services being requested. This includes all
  procurement policies and the ability to generate
  purchase orders and work orders.
■ A single point of contact which can be used to
  request the service. This is often provided by the
  Service Desk or through an Intranet request, but could
  be through an automated request directly into the
  Request Fulfilment or procurement system.
■ Self-service tools needed to provide a front-end
  interface to the users. It is essential that these
  integrate with the back-end fulfilment tools, often
  managed through Incident or Change Management.

4.3.9.3 Risks
Risks that may be encountered with Request Fulfilment
include:
■ Poorly defined scope, where people are unclear about
  exactly what the process is expected to handle
■ Poorly designed or implemented user interfaces so
  that users have difficulty raising the requests that
  they need
■ Badly designed or operated back-end fulfilment
  processes that are incapable of dealing with the
  volume or nature of the requests being made
■ Inadequate monitoring capabilities so that accurate
  metrics cannot be gathered.

4.4 PROBLEM MANAGEMENT

ITIL defines a ‘problem’ as the unknown cause of one or
more incidents.

4.4.1 Purpose/goal/objective
Problem Management is the process responsible for
managing the lifecycle of all problems. The primary
objectives of Problem Management are to prevent
problems and resulting incidents from happening, to
eliminate recurring incidents and to minimize the impact
of incidents that cannot be prevented.

4.4.2 Scope
Problem Management includes the activities required to
diagnose the root cause of incidents and to determine the
resolution to those problems. It is also responsible for
ensuring that the resolution is implemented through the
appropriate control procedures, especially Change
Management and Release Management.

Problem Management will also maintain information
about problems and the appropriate workarounds and
resolutions, so that the organization is able to reduce the
number and impact of incidents over time. In this respect,
Problem Management has a strong interface with
Knowledge Management, and tools such as the Known
Error Database will be used for both.

Although Incident and Problem Management are separate
processes, they are closely related and will typically use
the same tools, and may use similar categorization, impact
and priority coding systems. This will ensure effective
communication when dealing with related incidents and
problems.

4.4.3 Value to business
Problem Management works together with Incident
Management and Change Management to ensure that IT
service availability and quality are increased. When
incidents are resolved, information about the resolution is
recorded. Over time, this information is used to speed up
the resolution time and identify permanent solutions,
reducing the number and resolution time of incidents. This
results in less downtime and less disruption to
business-critical systems.

Additional value is derived from the following:
■ Higher availability of IT services
■ Higher productivity of business and IT staff
■ Reduced expenditure on workarounds or fixes that do
  not work
■ Reduction in cost of effort in fire-fighting or resolving
  repeat incidents.

4.4.4 Policies/principles/basic concepts
There are some important concepts of Problem
Management that must be taken into account from the
outset. These include:

4.4.4.1 Problem Models
Many problems will be unique and will require handling in
an individual way – but it is conceivable that some
incidents may recur because of dormant or underlying
problems (for example, where the cost of a permanent
resolution will be high and a decision has been taken not
to go ahead with an expensive solution – but to ‘live with’
the problem).

As well as creating a Known Error Record in the Known
Error Database (see paragraph 4.4.5.7) to ensure quicker
diagnosis, the creation of a Problem Model for handling
such problems in the future may be helpful. This is very
similar in concept to the idea of Incident Models already
described in paragraph 4.2.4.2, but applied to problems as
well as incidents.

4.4.5 Process activities, methods and techniques
Problem Management consists of two major processes:
■ Reactive Problem Management, which is generally
  executed as part of Service Operation – and is
  therefore covered in this publication
■ Proactive Problem Management, which is initiated in
  Service Operation, but generally driven as part of
  Continual Service Improvement (see the Continual
  Service Improvement publication for fuller details).

The reactive Problem Management process is shown in
Figure 4.4. This is a simplified chart to show the normal
process flow, but in reality some of the states may be
iterative or variations may have to be made in order to
handle particular situations.
[Figure 4.4 flowchart. Inputs to Problem Detection: Service Desk, Event
Management, Incident Management, Proactive Problem Management, Supplier or
Contractor. Main flow: Problem Detection, Problem Logging, Categorization,
Prioritization, Investigation & Diagnosis (linked to the CMS), Workaround?,
Create Known Error Record (linked to the Known Error Database), Change
Needed? (Yes: Change Management; No: continue), Resolution, Closure,
Major Problem? (leading to Major Problem Review), End.]

Figure 4.4 Problem Management process flow
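
The sequence of steps in Figure 4.4 can be restated as a short Python sketch. This is an illustration only, not part of the ITIL guidance: the function name and its two boolean parameters are hypothetical, and the sketch assumes that the flow rejoins Resolution after Change Management, which the simplified chart does not spell out. It also ignores the iterations and variations that the text says may occur in practice.

```python
def reactive_problem_flow(change_needed, major_problem):
    """Return the ordered steps of Figure 4.4 for one problem.

    change_needed and major_problem are hypothetical flags standing in
    for the 'Change Needed?' and 'Major Problem?' decisions in the chart.
    """
    steps = [
        "Problem Detection",
        "Problem Logging",
        "Categorization",
        "Prioritization",
        "Investigation & Diagnosis",
        "Workaround?",
        "Create Known Error Record",
        "Change Needed?",
    ]
    if change_needed:
        # Assumption: the flow returns to Resolution after the change.
        steps.append("Change Management")
    steps += ["Resolution", "Closure", "Major Problem?"]
    if major_problem:
        steps.append("Major Problem Review")
    steps.append("End")
    return steps


for step in reactive_problem_flow(change_needed=True, major_problem=False):
    print(step)
```

Keeping the steps as a plain ordered list makes it easy to compare the sketch line by line with the chart.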
4.4.5.1 Problem detection
It is likely that multiple ways of detecting problems will
exist in all organizations. These will include:
■ Suspicion or detection of an unknown cause of one or
  more incidents by the Service Desk, resulting in a
  Problem Record being raised – the desk may have
  resolved the incident but has not determined a
  definitive cause and suspects that it is likely to recur,
  so will raise a Problem Record to allow the underlying
  cause to be resolved. Alternatively, it may be
  immediately obvious from the outset that an incident,
  or incidents, has been caused by a major problem, so
  a Problem Record will be raised without delay.
■ Analysis of an incident by a technical support group
  which reveals that an underlying problem exists, or is
  likely to exist.
■ Automated detection of an infrastructure or
  application fault, using event/alert tools automatically
  to raise an incident which may reveal the need for a
  Problem Record.
■ A notification from a supplier or contractor that a
  problem exists that has to be resolved.
■ Analysis of incidents as part of proactive Problem
  Management – resulting in the need to raise a
  Problem Record so that the underlying fault can be
  investigated further.

Frequent and regular analysis of incident and problem
data must be performed to identify any trends as they
become discernible. This will require meaningful and
detailed categorization of incidents/problems and regular
reporting of patterns and areas of high occurrence. ‘Top
ten’ reporting, with drill-down capabilities to lower levels,
is useful in identifying trends.

Further details of how detected trends should be handled
are included in the Continual Service Improvement
publication.

4.4.5.2 Problem logging
Regardless of the detection method, all the relevant details
of the problem must be recorded so that a full historic
record exists. This must be date and time stamped to
allow suitable control and escalation.

A cross-reference must be made to the incident(s) which
initiated the Problem Record – and all relevant details
must be copied from the Incident Record(s) to the
Problem Record. It is difficult to be exact, as cases may
vary, but typically this will include details such as:
■ User details
■ Service details
■ Equipment details
■ Date/time initially logged
■ Priority and categorization details
■ Incident description
■ Details of all diagnostic or attempted recovery
  actions taken.

4.4.5.3 Problem Categorization
Problems must be categorized in the same way as
incidents (and it is advisable to use the same coding
system) so that the true nature of the problem can be
easily traced in the future and meaningful management
information can be obtained.

4.4.5.4 Problem Prioritization
Problems must be prioritized in the same way and for the
same reasons as incidents – but the frequency and impact
of related incidents must also be taken into account. The
coding system described earlier in Table 4.1 (which
combines impact with urgency to give an overall priority
level) can be used to prioritize problems in the same way
that it might be used for incidents, though the definitions
and guidance to support staff on what constitutes a
problem, and the related service targets at each level,
must obviously be devised separately.

Problem prioritization should also take into account the
severity of the problems. Severity in this context refers to
how serious the problem is from an infrastructure
perspective, for example:
■ Can the system be recovered, or does it need to be
  replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to
  fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are
  affected)?

4.4.5.5 Problem Investigation and Diagnosis
An investigation should be conducted to try to diagnose
the root cause of the problem – the speed and nature of
this investigation will vary depending upon the impact,
severity and urgency of the problem – but the appropriate
level of resources and expertise should be applied to
finding a resolution commensurate with the priority
code allocated and the service target in place for that
priority level.

There are a number of useful problem solving techniques
that can be used to help diagnose and resolve problems –
and these should be used as appropriate. Such techniques
are described in more detail later in this section.

The CMS must be used to help determine the level of
impact and to assist in pinpointing and diagnosing the
exact point of failure. The Known Error Database (KEDB)
should also be accessed and problem-matching
techniques (such as key word searches) should be used to
see if the problem has occurred before and, if so, to find
the resolution.

It is often valuable to try to recreate the failure, so as to
understand what has gone wrong, and then to try various
ways of finding the most appropriate and cost-effective
resolution to the problem. To do this effectively without
causing further disruption to the users, a test system will
be necessary that mirrors the production environment.

There are many problem analysis, diagnosis and solving
techniques available and much research has been done in
this area. Some of the most useful and frequently used
techniques include:
■ Chronological Analysis: When dealing with a difficult
  problem, there are often conflicting reports about
  exactly what has happened and when. It is therefore
  very helpful briefly to document all events in
  chronological order – to provide a timeline of events.
  This often makes it possible to see which events may
  have been triggered by others – or to discount any
  claims that are not supported by the sequence of
  events.
■ Pain Value Analysis: This is where a broader view is
  taken of the impact of an incident or problem, or
  incident/problem type. Instead of just analysing the
  number of incidents/problems of a particular type in a
  particular period, a more in-depth analysis is done to
  determine exactly what level of pain has been caused
  to the organization/business by these
  incidents/problems. A formula can be devised to
  calculate this pain level. Typically this might include
  taking into account:
   ● The number of people affected
   ● The duration of the downtime caused
   ● The cost to the business (if this can be readily
     calculated or estimated).
  By taking all of these factors into account, a much
  more detailed picture of those incidents/problems or
  incident/problem types that are causing most pain can
  be determined – to allow a better focus on those
  things that really matter and deserve highest priority
  in resolving.
■ Kepner and Tregoe: Charles Kepner and Benjamin
  Tregoe developed a useful way of problem analysis
  which can be used formally to investigate
  deeper-rooted problems. They defined the following stages:
   ● defining the problem
   ● describing the problem in terms of identity,
     location, time and size
   ● establishing possible causes
   ● testing the most probable cause
   ● verifying the true cause.
  The method is described in fuller detail in Appendix C.
■ Brainstorming: It can often be valuable to gather
  together the relevant people, either physically or by
  electronic means, and to ‘brainstorm’ the problem –
  with people throwing in ideas on what the potential
  cause may be and potential actions to resolve the
  problem. Brainstorming sessions can be very
  constructive and innovative but it is equally important
  that someone, perhaps the Problem Manager,
  documents the outcome and any agreed actions and
  keeps a degree of control in the session(s).
■ Ishikawa Diagrams: Kaoru Ishikawa (1915–89), a
  leader in Japanese quality control, developed a
  method of documenting causes and effects which can
  be useful in helping identify where something may be
  going wrong, or be improved. Such a diagram is
  typically the outcome of a brainstorming session
  where problem solvers can offer suggestions. The
  main goal is represented by the trunk of the diagram,
  and primary factors are represented as branches.
  Secondary factors are then added as stems, and so on.
  Creating the diagram stimulates discussion and often
  leads to increased understanding of a complex
  problem. An example diagram is given in Appendix D.
■ Pareto Analysis: This is a technique for separating
  important potential causes from more trivial issues.
  The following steps should be taken:
    1     Form a table listing the causes and their
          frequency as a percentage.
    2     Arrange the rows in the decreasing order of
          importance of the causes, i.e. the most important
          cause first.
    3     Add a cumulative percentage column to the
          table. By this step, the chart should look
          something like Table 4.2, which illustrates 10
          causes of network failure in an organization.
    4     Create a bar chart with the causes, in order of
          their percentage of total.

Table 4.2      Pareto cause ranking chart
                              Network failures
Causes                 Percentage of total   Computation   Cumulative %
Network Controller     35                    0+35%         35
File corruption        26                    35%+26%       61
Addressing conflicts   19                    61%+19%       80
Server OS              6                     80%+6%        86
Scripting error        5                     86%+5%        91
Untested change        3                     91%+3%        94
Operator error         2                     94%+2%        96
Backup failure         2                     96%+2%        98
Intrusion attempts     1                     98%+1%        99
Disk failure           1                     99%+1%        100




    5     Superimpose a line chart of the cumulative
          percentages. The completed graph is illustrated
          in Figure 4.5.
    6     Draw a line at 80% on the y-axis parallel to
          the x-axis. Then drop a line from the point
          of intersection with the curve to the x-axis.
          This point on the x-axis separates the important
          causes from the trivial causes. This line is
          represented as a dotted line in Figure 4.5.

From this chart it is clear that there are three
primary causes of network failure in the organization.
These should therefore be targeted first.
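
The tabulation steps of the Pareto Analysis (ordering the causes, adding the cumulative column and applying the 80% cut-off) can be sketched in a few lines of Python. This is a minimal illustration using the figures from Table 4.2, not a prescribed tool: the function names are hypothetical, and the rule of keeping every cause up to and including the one at which the cumulative percentage first reaches the threshold is an assumption about how the cut-off is read from the chart.

```python
from itertools import accumulate

# Figures from Table 4.2: (cause, percentage of total network failures).
CAUSES = [
    ("Network Controller", 35),
    ("File corruption", 26),
    ("Addressing conflicts", 19),
    ("Server OS", 6),
    ("Scripting error", 5),
    ("Untested change", 3),
    ("Operator error", 2),
    ("Backup failure", 2),
    ("Intrusion attempts", 1),
    ("Disk failure", 1),
]


def pareto_table(rows):
    """Order causes by decreasing percentage and add a cumulative column."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    running = accumulate(pct for _, pct in ordered)
    return [(name, pct, cum) for (name, pct), cum in zip(ordered, running)]


def primary_causes(table, threshold=80):
    """Return the causes needed for the cumulative % to reach the threshold.

    Assumption: the cause at which the threshold is first reached is included.
    """
    selected = []
    for name, _pct, cum in table:
        selected.append(name)
        if cum >= threshold:
            break
    return selected


table = pareto_table(CAUSES)
for name, pct, cum in table:
    print(f"{name:<22}{pct:>4}{cum:>6}")
print("Target first:", ", ".join(primary_causes(table)))
```

With the Table 4.2 figures the cumulative column reaches 80 at the third row, so the sketch selects the same three causes that the text identifies.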
[Figure: chart with two vertical axes and the causes on the horizontal
axis, left to right: Network controller, File corruption, Addressing
conflicts, Server OS, Scripting error, Untested change, Operator error,
Backup failure, Intrusion attempts, Disk failure.]

Figure 4.5 Important versus trivial causes
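
The separation of important from trivial causes shown in Figure 4.5 can be sketched in a few lines of code. This is a minimal illustration only: the cause names are taken from the figure, but the incident counts, the function name `pareto` and the 80% cut-off are assumptions made for the example and are not the figure's data.

```python
def pareto(counts, threshold=80.0):
    """Rank causes by frequency and flag those that together reach
    the cumulative percentage threshold (an assumed cut-off)."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for cause, count in ranked:
        running += count
        rows.append((cause, count, round(100.0 * running / total, 1)))
    important = []
    for cause, _count, cumulative in rows:
        important.append(cause)
        if cumulative >= threshold:
            break
    return rows, important


# Hypothetical incident counts per cause (invented for illustration).
counts = {
    "Network controller": 16,
    "File corruption": 10,
    "Addressing conflicts": 8,
    "Server OS": 5,
    "Scripting error": 4,
    "Untested change": 3,
    "Operator error": 1,
    "Backup failure": 1,
    "Intrusion attempts": 1,
    "Disk failure": 1,
}

rows, important = pareto(counts)
for cause, count, cumulative in rows:
    print(f"{cause:22} {count:3} {cumulative:6.1f}%")
print("Important causes:", ", ".join(important))
```

With these invented counts the first five causes account for the threshold, which is the kind of split the chart is meant to make visible.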



4.4.5.6 Workarounds
In some cases it may be possible to find a workaround to the incidents
caused by the problem – a temporary way of overcoming the difficulties.
For example, a manual amendment may be made to an input file to allow a
program to complete its run successfully and allow a billing process to
complete satisfactorily, but it is important that work on a permanent
resolution continues where this is justified – in this example the
reason for the file becoming corrupted in the first place must be found
and corrected to prevent this happening again.

In cases where a workaround is found, it is therefore important that
the Problem Record remains open, and details of the workaround are
always documented within the Problem Record.

4.4.5.7 Raising a Known Error Record
As soon as the diagnosis is complete, and particularly where a
workaround has been found (even though it may not yet be a permanent
resolution), a Known Error Record must be raised and placed in the
Known Error Database – so that if further incidents or problems arise,
they can be identified and the service restored more quickly.

However, in some cases it may be advantageous to raise a Known Error
Record even earlier in the overall process – just for information
purposes, for example – even though the diagnosis may not be complete
or a workaround found, so it is inadvisable to set a concrete
procedural point exactly when a Known Error Record must be raised. It
should be done as soon as it becomes useful to do so! The Known Error
Database and the way it should be used are described in more detail in
paragraph 4.4.7.2.

4.4.5.8 Problem resolution
Ideally, as soon as a solution has been found, it should be applied to
resolve the problem – but in reality safeguards may be needed to ensure
that this does not cause further difficulties. If any change in
functionality is required this will require an RFC to be raised and
approved before the resolution can be applied. If the problem is very
serious and an urgent fix is needed for business reasons, then an
Emergency RFC should be handled by the Change Advisory Board Emergency
Committee (CAB/EC) to facilitate this urgent action. Otherwise, the RFC
should follow the established Change Management process for that type
of change – and the resolution should be applied only when the change
has been approved and scheduled for release. In the meantime, the KEDB
should be used to help resolve quickly any further occurrences of the
incidents/problems.

Note: There may be some problems for which a Business Case for
resolution cannot be justified (e.g. where the impact is limited but
the cost of resolution would be extremely high). In such cases a
decision may be taken to leave the Problem Record open but to use a
workaround description in the Known Error Record to detect and resolve
any recurrences quickly. Care should be taken to use the appropriate
code to flag the open Problem Record so that it does not count against
the performance of the team performing the process and so that
unauthorized rework does not take place.

4.4.5.9 Problem Closure
When any change has been completed (and successfully reviewed), and the
resolution has been applied, the Problem Record should be formally
closed – as should any related Incident Records that are still open. A
check should be performed at this time to ensure that the record
contains a full historical description of all events – and if not, the
record should be updated.

The status of any related Known Error Record should be updated to show
that the resolution has been applied.

4.4.5.10 Major Problem Review
After every major problem (as determined by the organization’s priority
system), while memories are still fresh, a review should be conducted
to learn any lessons for the future. Specifically, the review should
examine:
■ Those things that were done correctly
■ Those things that were done wrong
■ What could be done better in the future
■ How to prevent recurrence
■ Whether there has been any third-party responsibility and whether
  follow-up actions are needed.
Such reviews can be used as part of training and awareness activities
for support staff – and any lessons learned should be documented in
appropriate procedures, work instructions, diagnostic scripts or Known
Error Records. The Problem Manager facilitates the session and
documents any agreed actions.

The knowledge learned from the review should be incorporated into a
service review meeting with the business customer to ensure the
customer is aware of the actions taken and the plans to prevent future
major incidents from occurring. This helps to improve customer
satisfaction and assure the business that Service Operations is
handling major incidents responsibly and actively working to prevent
their future recurrence.
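
The record handling described in 4.4.5.6 to 4.4.5.9 can be sketched as a small data model. This is an illustrative sketch, not a prescribed implementation: the class names, field names, status strings and identifiers such as `PRB-1` are all hypothetical, and the "no RFC" branch is an assumption, since the text only states what happens when a change in functionality is required.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KnownErrorRecord:
    problem_id: str
    workaround: Optional[str] = None
    resolution_applied: bool = False


@dataclass
class ProblemRecord:
    problem_id: str
    status: str = "open"
    workaround: Optional[str] = None
    incident_ids: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


def record_workaround(problem, kedb, text):
    """Document the workaround in the Problem Record, which stays open,
    and raise a Known Error Record in the KEDB."""
    problem.workaround = text
    problem.history.append("workaround documented")
    kedb[problem.problem_id] = KnownErrorRecord(problem.problem_id, text)
    return kedb[problem.problem_id]


def change_route(changes_functionality: bool, urgent: bool) -> str:
    """Choose the change route for a resolution."""
    if not changes_functionality:
        return "no RFC"  # assumption: case not covered by the text
    if urgent:
        return "emergency RFC via CAB/EC"
    return "normal RFC"


def close_problem(problem, kedb, incident_status):
    """Close the problem, any related open incidents, and update the
    related Known Error Record."""
    problem.status = "closed"
    problem.history.append("resolution applied, problem closed")
    for incident_id in problem.incident_ids:
        if incident_status.get(incident_id) == "open":
            incident_status[incident_id] = "closed"
    known_error = kedb.get(problem.problem_id)
    if known_error is not None:
        known_error.resolution_applied = True


# Hypothetical records for illustration.
kedb: Dict[str, KnownErrorRecord] = {}
incidents = {"INC-1": "closed", "INC-2": "open"}
problem = ProblemRecord("PRB-1", incident_ids=["INC-1", "INC-2"])

record_workaround(problem, kedb, "manually amend the input file")
status_after_workaround = problem.status
route = change_route(changes_functionality=True, urgent=True)
close_problem(problem, kedb, incidents)
```

The point of the sketch is the ordering: the Problem Record is still open after the workaround is recorded, and closure touches three things at once (the problem, its open incidents and the Known Error Record).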

4.4.5.11 Errors detected in the development environment

It is rare for any new applications, systems or software releases to be
completely error-free. It is more likely that during testing of such
new applications, systems or releases a prioritization system will be
used to eradicate the more serious faults, but it is possible that
minor faults are not rectified – often because of the balance that has
to be made between delivering new functionality to the business as
quickly as possible and ensuring totally fault-free code or components.

Where a decision is made to release something into the production
environment that includes known deficiencies, these should be logged as
Known Errors in the KEDB, together with details of workarounds or
resolution activities. There should be a formal step in the testing
sign-off that ensures that this handover always takes place (see
Service Transition publication).

Experience has shown that if this does not happen, it will lead to far
higher support costs when the users start to experience the faults and
raise incidents that have to be re-diagnosed and resolved all over
again!

4.4.6 Triggers, input and output/inter-process interfaces
The vast majority of Problem Records will be triggered in reaction to
one or more incidents, and many will be raised or initiated via Service
Desk staff. Other Problem Records, and corresponding Known Error
Records, may be triggered in testing, particularly the latter stages of
testing such as User Acceptance Testing/Trials (UAT), if a decision is
made to go ahead with a release even though some faults are known.
Suppliers may trigger the need for some Problem Records through the
notification of potential faults or known deficiencies in their
products or services (e.g. a warning may be given regarding the use of
a particular CI and a Problem Record may be raised to facilitate the
investigation by technical staff of the condition of such CIs within
the organization’s IT Infrastructure).

The primary relationship between Incident and Problem Management has
been discussed in detail in paragraphs 4.2.6 and 4.4.5.1. Other key
interfaces include the following:

■ Service Transition
   ● Change Management: Problem Management ensures that all
     resolutions or workarounds that require a change to a CI are
     submitted to Change Management through an RFC. Change Management
     will monitor the progress of these changes and keep Problem
     Management advised. Problem Management is also involved in
     rectifying the situation caused by failed changes.
   ● Configuration Management: Problem Management uses the CMS to
     identify faulty CIs and also to determine the impact of problems
     and resolutions. The CMS can also be used to form the basis for
     the KEDB and hold or integrate with the Problem Records.
   ● Release and Deployment Management: Is responsible for rolling
     problem fixes out into the live environment. It also assists in
     ensuring that the associated known errors are transferred from
     the development Known Error Database into the live Known Error
     Database. Problem Management will assist in resolving problems
     caused by faults during the release process.
■ Service Design
   ● Availability Management: Is involved with determining how to
     reduce downtime and increase uptime. As such, it has a close
     relationship with Problem Management, especially the proactive
     areas. Much of the management information available in Problem
     Management will be communicated to Availability Management.
   ● Capacity Management: Some problems will require investigation by
     Capacity Management teams and techniques, e.g. performance
     issues. Capacity Management will also assist in assessing
     proactive measures. Problem Management provides management
     information relative to the quality of decisions made during the
     Capacity Planning process.
   ● IT Service Continuity: Problem Management acts as an entry point
     into IT Service Continuity Management where a significant problem
     is not resolved before it starts to have a major impact on the
     business.
■ Continual Service Improvement
   ● Service Level Management: The occurrence of incidents and
     problems affects the level of service delivery measured by SLM.
     Problem Management contributes to improvements in service levels,
     and its management information is used as the basis of some of
     the SLA review components. SLM also provides parameters within
     which Problem Management works, such as impact information and
     the effect on services of proposed resolutions and proactive
     measures.



■ Service Strategy
   ● Financial Management: Assists in assessing the impact of proposed
     resolutions or workarounds, as well as Pain Value Analysis.
     Problem Management provides management information about the cost
     of resolving and preventing problems, which is used as input into
     the budgeting and accounting systems and Total Cost of Ownership
     calculations.

4.4.7 Information Management

4.4.7.1 CMS
The CMS will hold details of all of the components of the IT
Infrastructure as well as the relationships between these components.
It will act as a valuable source for problem diagnosis and for
evaluating the impact of problems (e.g. if this disk is down, what data
is on that disk; which services use that data; which users use those
services?). As it will also hold details of previous activities, it can
also be used as a valuable source of historical data to help identify
trends or potential weaknesses – a key part of proactive Problem
Management (see Continual Service Improvement publication).

4.4.7.2 Known Error Database
The purpose of a Known Error Database is to allow storage of previous
knowledge of incidents and problems – and how they were overcome – to
allow quicker diagnosis and resolution if they recur.

The Known Error Record should hold exact details of the fault and the
symptoms that occurred, together with precise details of any workaround
or resolution action that can be taken to restore the service and/or
resolve the problem. An incident count will also be useful to determine
the frequency with which incidents are likely to recur and influence
priorities, etc.

It should be noted that a Business Case for a permanent resolution for
some problems may not exist. For example, if a problem does not cause
serious disruption and a workaround exists and/or the cost of resolving
the problem far outweighs the benefits of a permanent resolution – then
a decision may be taken to tolerate the existence of the problem.
However, it will still be desirable to diagnose and implement a
workaround as quickly as possible, which is where the KEDB can be of
assistance.

It is essential that any data put into the database can be quickly and
accurately retrieved. The Problem Manager should be fully trained and
familiar with the search methods/algorithms used by the selected
database and should carefully ensure that when new records are added,
the relevant search key criteria are correctly included.

Care should be taken to avoid duplication of records (i.e. the same
problem described in two or more ways as separate records). To avoid
this, the Problem Manager should be the only person able to enter a new
record. Other support groups should be allowed, indeed encouraged, to
propose new records, but these should be vetted by the Problem Manager
before entry to the KEDB. In large organizations where Problem
Management staff exist in multiple locations but a single KEDB is used
(recommended!), a procedure must be agreed between all Problem
Management staff to ensure that such duplication cannot occur. This may
involve designating just one staff member as the central KEDB Manager.

The KEDB should be used during the Incident and Problem Diagnosis
phases to try to speed up the resolution process – and new records
should be added as quickly as possible when a new problem has been
identified and diagnosed.

All support staff should be fully trained and conversant with the value
that the KEDB can offer and the way it should be used. They should be
able readily to retrieve and use data.

Note: Some tools/implementations may choose to delineate Known Errors
simply by changing a field in the original Problem Record. This is
acceptable provided the same level of functionality is available.

The KEDB, like the CMS, forms part of a larger Service Knowledge
Management System (SKMS) illustrated in Figure 4.6. More information on
the SKMS can be found in the Service Transition publication.
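
The Known Error Record contents and the vetting rule described in 4.4.7.2 can be sketched as follows. This is a minimal sketch under stated assumptions: the class names `KnownError` and `KEDB`, the role string `problem_manager` and the word-set normalization in `search_key` are all hypothetical choices for illustration. A real tool would use whatever search keys and access controls it provides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KnownError:
    fault: str
    symptoms: List[str]
    workaround: str
    incident_count: int = 0  # frequency of recurrence, used for priorities


def search_key(fault: str) -> str:
    """Assumed normalization: lower-case, order-independent word set, so
    that the same fault worded two ways maps to a single key."""
    return " ".join(sorted(set(fault.lower().split())))


class KEDB:
    def __init__(self) -> None:
        self._records: Dict[str, KnownError] = {}
        self._proposals: List[KnownError] = []

    def propose(self, record: KnownError) -> None:
        """Any support group may propose a new record."""
        self._proposals.append(record)

    def vet(self, role: str) -> List[KnownError]:
        """Only the Problem Manager enters records; duplicates are dropped."""
        if role != "problem_manager":
            raise PermissionError("only the Problem Manager may enter records")
        accepted = []
        for record in self._proposals:
            key = search_key(record.fault)
            if key not in self._records:
                self._records[key] = record
                accepted.append(record)
        self._proposals = []
        return accepted

    def search(self, symptom: str) -> List[KnownError]:
        """Return records with a symptom containing the given text."""
        needle = symptom.lower()
        return [
            record
            for record in self._records.values()
            if any(needle in s.lower() for s in record.symptoms)
        ]


# Hypothetical proposals: the same fault described in two ways.
db = KEDB()
db.propose(KnownError("Billing input file corrupted",
                      ["billing run aborts"], "amend the input file"))
db.propose(KnownError("corrupted billing input file",
                      ["batch job fails"], "amend the file by hand"))
accepted = db.vet("problem_manager")
matches = db.search("Billing run")
for match in matches:
    match.incident_count += 1  # count each incident matched to the record
```

Keeping proposals separate from entered records is one way to let every support group contribute while leaving a single point of entry.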


[Figure: layered diagram of the SKMS, top to bottom.
 Presentation Layer (Portal): Change and Release View; Asset Management
 View; Configuration Life Cycle View; Technical Configuration View;
 Quality Management View; Service Desk View. Functions: Search, Browse,
 Store, Retrieve, Update, Publish, Subscribe, Collaborate.
 Knowledge Processing Layer: Query and Analysis; Reporting; Performance
 Management (Forecasting, Planning, Budgeting); Modelling; Monitoring
 (Scorecards, Dashboards, Alerting).
 Information Integration Layer: Business/Customer/Supplier/User –
 Service – Application – Infrastructure mapping; Service Portfolio;
 Service Catalogue; Service Model; Integrated CMDB; Service Release;
 Service Change. Data Integration: Common Process, Data and Information
 Model; Schema Mapping; Meta Data Management; Data reconciliation; Data
 synchronization; Extract, Transform, Load; Mining.
 Data and Information Sources and Tools: Project Document Filestore;
 Structured Project Software; Definitive Media Library; Definitive
 Document Library; Definitive Multimedia Libraries 1 and 2; Physical
 CMDBs (CMDB1, CMDB2, CMDB3); Platform Configuration Tools; Software
 Configuration Management; Discovery, Asset Management and audit tools;
 Enterprise Applications.]

Figure 4.6 Service Knowledge Management System
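
The impact questions posed in 4.4.7.1 (if this disk is down, what data is on that disk, which services use that data, and which users use those services) amount to following CI relationships outward from the failed component. The sketch below illustrates this with a breadth-first traversal. The CI names and the `DEPENDENTS` mapping are hypothetical, and a real CMS would be queried through its own interface rather than an in-memory dictionary.

```python
from collections import deque

# Hypothetical CI relationships: each CI maps to the CIs that depend on it.
DEPENDENTS = {
    "disk-01": ["billing-db"],
    "billing-db": ["billing-service", "reporting-service"],
    "billing-service": ["user:alice", "user:bob"],
    "reporting-service": ["user:bob"],
}


def impacted(ci, dependents):
    """Return every CI reachable from the failed CI, nearest first."""
    visited = {ci}
    result = []
    queue = deque([ci])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in visited:
                visited.add(child)
                result.append(child)
                queue.append(child)
    return result


affected = impacted("disk-01", DEPENDENTS)
print(affected)
```

Breadth-first order lists the data store first, then the services, then the users, which mirrors the order of the questions in the text.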

4.4.8 Metrics
The following metrics should be used to judge the effectiveness and
efficiency of the Problem Management process, or its operation:
■ The total number of problems recorded in the period (as a control
  measure)
■ The percentage of problems resolved within SLA targets (and the
  percentage that are not!)
■ The number and percentage of problems that exceeded their target
  resolution times
■ The backlog of outstanding problems and the trend (static, reducing
  or increasing?)
■ The average cost of handling a problem
■ The number of major problems (opened and closed and backlog)
■ The percentage of Major Problem Reviews successfully performed
■ The number of Known Errors added to the KEDB
■ The percentage accuracy of the KEDB (from audits of the database)
■ The percentage of Major Problem Reviews completed successfully and on
  time.
All metrics should be broken down by category, impact, severity,
urgency and priority level and compared with previous periods.

4.4.9 Challenges, Critical Success Factors and risks
A major dependency for Problem Management is the establishment of an
effective Incident Management process and tools. This will ensure that
problems are identified as soon as possible and that as much work is
done on pre-qualification as possible. However, it is also critical
that the two processes have formal interfaces and common working
practices. This implies the following:
■ Linking Incident and Problem Management tools
■ The ability to relate Incident and Problem Records
■ The second- and third-line staff should have a good working
  relationship with staff on the first line
■ Making sure that business impact is well understood by all staff
  working on problem resolution.
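
Several of the metrics listed in 4.4.8 can be derived directly from Problem Records once target and actual resolution times are held on each record. The following is a sketch only: the `Problem` fields and the `metrics` function are hypothetical, and the percentage within SLA is calculated over resolved problems, which is an assumption because the text does not define the denominator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Problem:
    target_days: int
    resolved_days: Optional[int]  # None while the problem is still open
    cost: float
    major: bool = False


def metrics(problems: List[Problem]) -> dict:
    total = len(problems)
    resolved = [p for p in problems if p.resolved_days is not None]
    within = [p for p in resolved if p.resolved_days <= p.target_days]
    return {
        "total": total,
        # Assumption: percentage is taken over resolved problems only.
        "pct_within_sla": 100.0 * len(within) / len(resolved) if resolved else 0.0,
        "exceeded_target": len(resolved) - len(within),
        "backlog": total - len(resolved),
        "average_cost": sum(p.cost for p in problems) / total if total else 0.0,
        "major_problems": sum(1 for p in problems if p.major),
    }


# Hypothetical records for one reporting period.
period = [
    Problem(target_days=5, resolved_days=3, cost=100.0),
    Problem(target_days=5, resolved_days=8, cost=300.0),
    Problem(target_days=10, resolved_days=None, cost=50.0),
    Problem(target_days=2, resolved_days=2, cost=150.0, major=True),
]
report = metrics(period)
```

Running the same function over each period, and over each category or priority level, gives the breakdown and period-on-period comparison that the section asks for.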



In addition it is important that Problem Management is able to use all
Knowledge and Configuration Management resources available.

Another CSF is the ongoing training of technical staff in both the
technical aspects of their job and the business implications of the
services they support and the processes they use.


4.5 ACCESS MANAGEMENT
Access Management is the process of granting authorized users the right
to use a service, while preventing access to non-authorized users. It
has also been referred to as Rights Management or Identity Management
in different organizations.

4.5.1 Purpose/goal/objective
Access Management provides the right for users to be able to use a
service or group of services. It is therefore the execution of policies
and actions defined in Security and Availability Management.

4.5.2 Scope
Access Management is effectively the execution of both Availability and
Information Security Management, in that it enables the organization to
manage the confidentiality, availability and integrity of the
organization’s data and intellectual property.

Access Management ensures that users are given the right to use a
service, but it does not ensure that this access is available at all
agreed times – this is provided by Availability Management.

Access Management is a process that is executed by all Technical and
Application Management functions and is usually not a separate
function. However, there is likely to be a single control point of
coordination, usually in IT Operations Management or on the Service
Desk.

Access Management can be initiated by a Service Request through the
Service Desk.

4.5.3 Value to business
Access Management provides the following value:
■ Controlled access to services ensures that the organization is able
  to maintain more effectively the confidentiality of its information
■ Employees have the right level of access to execute their jobs
  effectively
■ There is less likelihood of errors being made in data entry or in the
  use of a critical service by an unskilled user (e.g. production
  control systems)
■ The ability to audit use of services and to trace the abuse of
  services
■ The ability more easily to revoke access rights when needed – an
  important security consideration
■ May be needed for regulatory compliance (e.g. SOX, HIPAA, COBIT).

4.5.4 Policies/principles/basic concepts
Access Management is the process that enables users to use the services
that are documented in the Service Catalogue. It comprises the
following basic concepts:
■ Access refers to the level and extent of a service’s functionality or
  data that a user is entitled to use.
■ Identity refers to the information about a user that distinguishes
  them as an individual and which verifies their status within the
  organization. By definition, the Identity of a user is unique to that
  user. (This is covered in more detail in paragraph 4.5.7.1.)
■ Rights (also called privileges) refer to the actual settings whereby
  a user is provided access to a service or group of services. Typical
  rights, or levels of access, include read, write, execute, change,
  delete.
■ Services or service groups. Most users do not use only one service,
  and users performing a similar set of activities will use a similar
  set of services. Instead of providing access to each service for each
  user separately, it is more efficient to be able to grant each user –
  or group of users – access to the whole set of services that they are
  entitled to use at the same time. (This is discussed in more detail
  in paragraph 4.5.7.2.)
■ Directory Services refers to a specific type of tool that is used to
  manage access and rights. These are discussed in section 5.8.

4.5.5 Process activities, methods and techniques

4.5.5.1 Requesting access
Access (or restriction) can be requested using one of any number of
mechanisms, including:
■ A standard request generated by the Human Resource system. This is
  generally done whenever a person is hired, promoted, transferred or
  when they leave the company
■ A Request for Change
■ A Service Request submitted via the Request Fulfilment system
■ By executing a pre-authorized script or option (e.g.
  downloading an application from a staging server as
  and when it is needed).

Rules for requesting access are normally documented as
part of the Service Catalogue.

4.5.5.2 Verification
Access Management needs to verify every request for
access to an IT service from two perspectives:
■ That the user requesting access is who they say
  they are
■ That they have a legitimate requirement for
  that service.

The first category is usually achieved by the user providing
their username and password. Depending on the
organization’s security policies, the use of the username
and password is usually accepted as proof that the
person is a legitimate user. However, for more sensitive
services further identification may be required (biometric,
use of an electronic access key or encryption device, etc.).

The second category will require some independent
verification, other than the user’s request. For example:
■ Notification from Human Resources that the person is
  a new employee and requires both a username and
  access to a standard set of services
■ Notification from Human Resources that the user has
  been promoted and requires access to additional
  resources
■ Authorization from an appropriate (defined in the
  process) manager
■ Submission of a Service Request (with supporting
  evidence) through the Service Desk
■ Submission of an RFC (with supporting evidence)
  through Change Management, or execution of a
  pre-defined Standard Change
■ A policy stating that the user may have access to an
  optional service if they need it.

For new services the Change Record should specify which
users or groups of users will have access to the Service.
Access Management will then check to see that all the
users are still valid and automatically provide access as
specified in the RFC.

4.5.5.3 Providing rights
Access Management does not decide who has access to
which IT services. Rather, Access Management executes
the policies and regulations defined during Service
Strategy and Service Design. Access Management enforces
decisions to restrict or provide access, rather than making
the decision.

As soon as a user has been verified, Access Management
will provide that user with rights to use the requested
service. In most cases this will result in a request to every
team or department involved in supporting that service to
take the necessary action. If possible, these tasks should
be automated.

The more roles and groups that exist, the more likely it is that
Role Conflict will arise. Role Conflict in this context refers
to a situation where two specific roles or groups, if
assigned to a single user, will create issues with separation
of duties or conflict of interest. Examples of this include:
■ One role requires detailed access, while another role
  prevents that access
■ Two roles allow a user to perform two tasks that
  should not be combined (e.g. a contractor can log
  their time sheet for a project and then approve all
  payment on work for the same project).

Role Conflict can be avoided by careful creation of roles
and groups, but more often it is caused by policies
and decisions made outside of Service Operation – either
by the business or by different project teams working
during Service Design. In each case the conflict must be
documented and escalated to the stakeholders to resolve.

Whenever roles and groups are defined, it is possible that
they could be defined too broadly or too narrowly. There
will always be users who need something slightly different
from the pre-defined roles. In these cases, it is possible to
use standard roles and then add or subtract specific rights
as required – similar to the concept of Baselines and
Variants in Configuration Management (see Service
Transition publication). However, the decision to do this is
not in the hands of individual operational staff members.
Each exception should be coordinated by Access
Management and approved through the originating
process.

Access Management should perform a regular review of
the roles and groups that it has created and manages, to
ensure that they are appropriate for the services that IT
delivers and supports – and obsolete or unwanted
roles/groups should be removed.

4.5.5.4 Monitoring identity status
As users work in the organization, their roles change and
so do their needs to access services. Examples of
changes include:
■ Job changes. In this case the user will possibly need
  access to different or additional services.
■ Promotions or demotions. The user will probably use
  the same set of services, but will need access to
  different levels of functionality or data.
■ Transfers. In this situation, the user may need access
  to exactly the same set of services, but in a different
  region with different working practices and different
  sets of data.
■ Resignation or death. Access needs to be completely
  removed to prevent the username being used as a
  security loophole.
■ Retirement. In many organizations, an employee who
  retires may still have access to a limited set of services,
  including benefits systems or systems that allow them
  to purchase company products at a reduced rate.
■ Disciplinary action. In some cases the organization
  will require a temporary restriction to prevent the user
  from accessing some or all of the services that they
  would normally have access to. There should be a
  feature in the process and tools to do this, rather than
  having to delete and reinstate the user’s access rights.
■ Dismissals. Where an employee or contractor is
  dismissed, or where legal action is taken against a
  customer (for example for defaulting on payment for
  products purchased on the Internet), access should be
  revoked immediately. In addition, Access Management,
  working together with Information Security
  Management, should take active measures to prevent
  and detect malicious action against the organization
  from that user.

Access Management should understand and document
the typical User Lifecycle for each type of user and use it
to automate the process. Access Management tools should
provide features that enable a user to be moved from one
state to another, or from one group to another, easily and
with an audit trail.

4.5.5.5 Logging and tracking access
Access Management should not only respond to requests.
It is also responsible for ensuring that the rights that it
has provided are being properly used.

In this respect, Access Monitoring and Control must be
included in the monitoring activities of all Technical and
Application Management functions and all Service
Operation processes.

Exceptions should be handled by Incident Management,
possibly using Incident Models specifically designed to
deal with abuse of access rights. It should be noted that
the visibility of such actions should be restricted. Making
this information available to all who have access to the
Incident Management system will expose vulnerabilities.

Information Security Management plays a vital role in
detecting unauthorized access and comparing it with the
rights that were provided by Access Management. This will
require Access Management involvement in defining the
parameters for use in Intrusion Detection tools.

Access Management may also be required to provide a
record of access for specific Services during forensic
investigations. If a user is suspected of breaches of policy,
inappropriate use of resources, or fraudulent use of data,
Access Management may be required to provide evidence
of dates, times and even content of that user’s access to
specific Services. This is normally provided by the
Operational staff of that service, but working as part of the
Access Management process.

4.5.5.6 Removing or restricting rights
Just as Access Management provides rights to use a
Service, it is also responsible for revoking those rights.
Again, this is not a decision that it makes on its own.
Rather, it will execute the decisions and policies made
during Service Strategy and Design and also decisions
made by managers in the organization.

Removing access is usually done in the following
circumstances:
■ Death
■ Resignation
■ Dismissal
■ When the user has changed roles and no longer
  requires access to the service
■ Transfer or travel to an area where different regional
  access applies.

In other cases it is not necessary to remove access, but
just to provide tighter restrictions. These could include
reducing the level, time or duration of access. Situations
in which access should be restricted include:
■ When the user has changed roles or been demoted
  and no longer requires the same level of access
■ When the user is under investigation, but still requires
  access to basic services, such as e-mail. In this case
  their e-mail may be subject to additional scanning
  (but this would need to be handled very carefully
  and in full accordance with the organization’s
  security policy)
■ When a user is away from the organization on
  temporary assignment and will not require access to
  that service for some time.
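The lifecycle handling in 4.5.5.4 and the remove-or-restrict distinction in 4.5.5.6 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names UserAccess, Action and EVENT_ACTIONS, the event labels and the in-memory storage are hypothetical and are not taken from this publication or from any Directory Services product. The sketch shows three points from the text: the tool executes decisions made elsewhere rather than making them, a restriction suspends rights without deleting them so that they can be reinstated, and every change is written to an audit trail.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Action(Enum):
    RESTRICT = "restrict"  # tighten access but keep the granted rights on record
    REVOKE = "revoke"      # remove access completely


# Hypothetical mapping of lifecycle events to actions. A real mapping would
# come from the policies defined during Service Strategy and Service Design.
EVENT_ACTIONS = {
    "death": Action.REVOKE,
    "resignation": Action.REVOKE,
    "dismissal": Action.REVOKE,
    "disciplinary_action": Action.RESTRICT,
    "temporary_assignment": Action.RESTRICT,
}


@dataclass
class UserAccess:
    username: str
    rights: set = field(default_factory=set)       # rights granted to the user
    suspended: set = field(default_factory=set)    # granted rights currently withheld
    audit_trail: list = field(default_factory=list)

    def _log(self, event, detail):
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((stamp, self.username, event, detail))

    def effective_rights(self):
        """Rights the user can actually exercise at this moment."""
        return self.rights - self.suspended

    def apply_event(self, event, authorized_by, keep=frozenset()):
        """Execute a decision made elsewhere; this code never decides policy."""
        action = EVENT_ACTIONS[event]
        if action is Action.REVOKE:
            self.rights.clear()
            self.suspended.clear()
        else:
            # Suspend rather than delete, so the rights can be reinstated later.
            # Rights listed in `keep` (e.g. basic e-mail) stay usable.
            self.suspended = self.rights - set(keep)
        self._log(event, f"{action.value}, authorized by {authorized_by}")

    def reinstate(self, authorized_by):
        self.suspended.clear()
        self._log("reinstate", f"authorized by {authorized_by}")


user = UserAccess("jsmith", rights={"email", "timecard"})
user.apply_event("disciplinary_action", authorized_by="hr-manager", keep={"email"})
print(user.effective_rights())          # {'email'}
user.reinstate(authorized_by="hr-manager")
print(sorted(user.effective_rights()))  # ['email', 'timecard']
user.apply_event("resignation", authorized_by="hr-system")
print(len(user.audit_trail))            # 3
```

Keeping the granted rights and the suspended rights in separate sets is one possible way to meet the requirement that a temporary restriction should not force staff to delete and later recreate a user's access rights.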


4.5.6 Triggers, input and output/inter-process interfaces
Access Management is triggered by a request for a user or
users to access a service or group of services. This could
originate from any of the following:
■ An RFC. This is most frequently used for large-scale
  service introductions or upgrades where the rights of a
  significant number of users need to be updated as
  part of the project.
■ A Service Request. This is usually initiated through
  the Service Desk, or directly into the Request
  Fulfilment system, and executed by the relevant
  Technical or Application Management teams.
■ A request from the appropriate Human Resources
  Management personnel (which should be channelled
  via the Service Desk). This is usually generated as part
  of the process for hiring, promotion, relocation,
  termination or retirement.
■ A request from the manager of a department, who
  could be performing an HR role, or who could have
  made a decision to start using a service for the first
  time.

Access Management should be linked to the Human
Resource processes to verify the user’s identity as well as
to ensure that they are entitled to the services being
requested.

Information Security Management is a key driver for Access
Management as it will provide the security and data
protection policies and tools needed to execute Access
Management.

Change Management plays an important role as the
means to control the actual requests for access. This is
because any request for access to a service is a change,
although it is usually processed as a Standard Change or
Service Request (possibly using a model) once the criteria
for access have been agreed through SLM.

SLM maintains the agreements for access to each service.
This will include the criteria for who is entitled to access
each service, what the cost of that access will be, if
appropriate, and what level of access will be granted to
different types of user (e.g. managers or staff).

There is also a strong relationship between Access
Management and Configuration Management. The CMS
can be used for data storage and interrogated to
determine current access details.

4.5.7 Information Management

4.5.7.1 Identity
The identity of a user is the information about them that
distinguishes them as an individual and which verifies
their status within the organization. By definition, the
identity of a user is unique to that user. Since there are
cases where two users share a common piece of
information (e.g. they have the same name), identity is
usually established using more than one piece of
information, for example:
■ Name
■ Address
■ Contact details, e.g. telephone, e-mail address, etc.
■ Physical documentation, e.g. driver’s licence, passport,
  marriage certificate, etc.
■ Numbers that refer to a document or an entry in a
  database, e.g. employee number, tax number,
  government identity number, driver’s licence number,
  etc.
■ Biometric information, e.g. fingerprints, retinal images,
  voice recognition patterns, DNA, etc.
■ Expiration date (if relevant).

A user identity is provided to anyone with a legitimate
requirement to access IT services or organizational
information. These could include:
■ Employees
■ Contractors
■ Vendor staff (e.g. account managers, support
  personnel, etc.)
■ Customers (especially when purchasing products or
  services over the Internet).

Most organizations will verify a user’s identity before they
join the organization by requesting a subset of the above
information. The more secure the organization, the more
types of information are required and the more thoroughly
they are checked.

Many organizations will be faced with the need to provide
access rights to temporary or occasional staff or
contractors/suppliers. The management of access to such
personnel often proves problematic – closing access after
use is often as difficult to manage as, or more so than,
providing access initially. Well-defined procedures
between IT and HR should be established that include
fail-safe checks that ensure access rights are removed
immediately they are no longer justified or required.

When a user is granted access to an application, it should
already have been established by the organization (usually
the Human Resources or Security Department) that the
user is who they say they are.

At this point, all that information is filed and the file is
associated with a corporate identity, usually an employee
or contractor number and an identity that can be used to
access corporate resources and information, usually a user
identity or ‘username’ and an associated password.

4.5.7.2 Users, groups, roles and service groups
While each user has an individual identity, and each IT
service can be seen as an entity in its own right, it is often
helpful to group them together so that they can be
managed more easily. Sometimes the terms ‘user profile’
or ‘user template’ or ‘user role’ are used to describe this
type of grouping.

Most organizations have a standard set of services for all
individual users, regardless of their position or job
(excluding customers – who do not have any visibility of
internal services and processes). These will include services
such as messaging, office automation, Desktop Support,
telephony, etc. New users are automatically provided with
rights to use these services.

However, most users also have some specialized role that
they perform. For example, in addition to the standard
services, the user also performs a Marketing Management
role, which requires that they have access to some
specialized marketing and financial modelling tools
and data.

Some groups may have unique requirements – such as
field or home workers who may have to dial in or use
Virtual Private Network (VPN) connections, with security
implications that may have to be more tightly managed.

To make it easier for Access Management to provide the
appropriate rights, it uses a catalogue of all the roles in
the organization and which services support each role.
This catalogue of roles should be compiled and
maintained by Access Management in conjunction with
HR and will often be automated in the Directory Services
tools (see section 5.8).

In addition to playing different roles, users may also
belong to different groups. For example, all contractors are
required to log their timesheets in a dedicated Time Card
System, which is not used by employees. Access
Management will assess all the roles that a user plays as
well as the groups that they belong to and ensure that
it provides rights to use all associated services.

Note: All data held on users will be subject to data
protection legislation (this exists in most geographic
locations in some form or other) so should be handled
and protected as part of the organization’s security
procedures.

4.5.8 Metrics
Metrics that can be used to measure the efficiency and
effectiveness of Access Management include:
■ Number of requests for access (Service Request, RFC,
  etc.)
■ Instances of access granted, by service, user,
  department, etc.
■ Instances of access granted by department or
  individual granting rights
■ Number of incidents requiring a reset of access rights
■ Number of incidents caused by incorrect access
  settings.

4.5.9 Challenges, Critical Success Factors and risks
Conditions for successful Access Management include:
■ The ability to verify the identity of a user (that the
  person is who they say they are)
■ The ability to verify the identity of the approving
  person or body
■ The ability to verify that a user qualifies for access to a
  specific service
■ The ability to link multiple access rights to an
  individual user
■ The ability to determine the status of the user at any
  time (e.g. to determine whether they are still
  employees of the organization when they log on to a
  system)
■ The ability to manage changes to a user’s access
  requirements
■ The ability to restrict access rights to unauthorized
  users
■ A database of all users and the rights that they have
  been granted.

4.6 OPERATIONAL ACTIVITIES OF PROCESSES COVERED IN OTHER LIFECYCLE PHASES

4.6.1 Change Management
Change Management is primarily covered in the Service
Transition publication, but there are some aspects of
Change Management which Service Operation staff will be
involved with on a day-to-day basis. These include:
■ Raising and submitting RFCs as needed to address
  Service Operation issues
■ Participating in CAB or CAB/EC meetings to ensure
  that Service Operation risks, issues and views are taken
  into account
■ Implementing changes as directed by Change
  Management where they involve Service Operation
  components or services
■ Backing out changes as directed by Change
  Management where they involve Service Operation
  components or services
■ Helping define and maintain change models relating
  to Service Operation components or services
■ Receiving change schedules and ensuring that all
  Service Operation staff are made aware of and
  prepared for all relevant changes
■ Using the Change Management process for standard,
  operational-type changes.

4.6.2 Configuration Management
Configuration Management is primarily covered in the
Service Transition publication, but there are some aspects
of Configuration Management which Service Operation
staff will be involved with on a day-to-day basis. These
include:
■ Informing Configuration Management of any
  discrepancies found between any CIs and the CMS
■ Making any amendments necessary to correct any
  discrepancies, under the authority of Configuration
  Management, where they involve any Service
  Operation components or services.

Responsibility for updating the CMS remains with
Configuration Management, but in some cases Operations
staff might be asked, under the direction of Configuration
Management, to update relationships, or even to add new
CIs or mark CIs as ‘disposed’ in the CMS, if these updates
are related to operational activities actually performed by
Operations staff.

4.6.3 Release and Deployment Management
Release and Deployment Management is primarily covered
in the Service Transition publication, but there are some
aspects of this process which Service Operation staff will
be involved with on a day-to-day basis. These may
include:
■ Actual implementation actions regarding the
  deployment of new releases, under the direction of
  Release and Deployment Management, where they
  relate to Service Operation components or services
■ Participation in the planning stages of major new
  releases to advise on Service Operation issues
■ The physical handling of CIs from/to the DML as
  required to fulfil their operational roles – while
  adhering to relevant Release and Deployment
  Management procedures, such as ensuring that all items
  are properly booked out and back in.

4.6.4 Capacity Management
Capacity Management should operate at three levels:
Business Capacity Management, Service Capacity
Management and Component Capacity Management.
■ Business Capacity Management involves working
  with the business to plan and anticipate both
  longer-term strategic issues and shorter-term tactical
  initiatives that are likely to have an impact on IT
  capacity.
■ Service Capacity Management is about
  understanding the characteristics of each of the IT
  services, and then the demands that different types
  of users or transactions have on the underlying
  infrastructure – and how these vary over time and
  might be impacted by business change.
■ Component Capacity Management involves
  understanding the performance characteristics and
  capabilities and current utilization levels of all the
  technical components (CIs) that make up the IT
  Infrastructure, and predicting the impact of any
  changes or trends.

Many of these activities are of a strategic or longer-term
planning nature and are covered in the Service Strategy,
Service Design and Service Transition publications.
However, there are a number of operational Capacity
Management activities that must be performed on a
regular ongoing basis as part of Service Operation. These
include the following.

4.6.4.1 Capacity and Performance Monitoring
All components of the IT Infrastructure should be
continually monitored (in conjunction with Event
Management) so that any potential problems or trends
can be identified before failures or performance
degradation occurs. Ideally, such monitoring should be
automated and thresholds should be set so that exception
alerts are raised in good time to allow appropriate
avoidance or recovery action to be taken before adverse
impact occurs.

The components and elements to be monitored will vary
depending upon the infrastructure in use, but will typically
include:
■ CPU utilization (overall and broken down by system/service usage)
■ Memory utilization
■ IO rates (physical and buffer) and device utilization
■ Queue length (maximum and average)
■ File store utilization (disks, partitions, segments)
■ Applications (throughput rates, failure rates)
■ Databases (utilization, record locks, indexing, contention)
■ Network transaction rates, error and retry rates
■ Transaction response time
■ Batch duration profiles
■ Internet/intranet site/page hit rates
■ Internet response times (external and internal to firewalls)
■ Number of system/application log-ons and concurrent users
■ Number of network nodes in use, and utilization levels.

There are different kinds of monitoring tools needed to collect and
interpret data at each level. For example, some tools will allow
performance of business transactions to be monitored, while others will
monitor CI behaviour.

Capacity Management must set up and calibrate alarm thresholds (where
necessary in conjunction with Event Management, as it is often Event
Monitoring tools that may be used) so that the correct alert levels are set
and that any filtering is established as necessary so that only meaningful
events are raised. Without such filtering it is possible that ‘information
only’ alerts can obscure more significant alerts that require immediate
attention. In addition, it is possible for serious failures to cause ‘alert
storms’ due to very high volumes of repeat alerts, which again must be
filtered so that the most meaningful messages are not obscured.

It may be appropriate to use external, third-party, monitoring capabilities
for some CIs or components of the IT Infrastructure (e.g. key internet
sites/pages). Capacity Management should be involved in helping specify and
select any such monitoring capabilities and in integrating the results or
any alerts with other monitoring and handling systems.

Capacity Management must work with all appropriate support groups to make
decisions on where alarms are routed and on escalation paths and
timescales. Alerts should be logged to the Service Desk as well as to
appropriate support staff, so that appropriate Incident Records can be
raised so a permanent record of the event exists – and Service Desk staff
have a view of how well the support group(s) are dealing with the fault and
can intervene if necessary.

Manufacturers’ claimed performance capabilities and agreed service level
targets, together with actual historical monitored performance and capacity
data, should be used to set alert levels. This may need to be an iterative
process initially, performing some trial-and-error adjustments until the
correct levels are achieved.

Note: Capacity Management may have to become involved in the capacity
requirements and capabilities of IT Service Management. Whether the
organization has enough Service Desk staff to handle the rate of incidents;
whether the CAB structure can handle the number of changes it is being
asked to review and approve; whether support tools can handle the volume of
data being gathered are Capacity Management issues, which the Capacity
Management team may be asked to help investigate and answer.

4.6.4.2 Handling capacity- or performance-related incidents
If an alert is triggered, or an incident is raised at the Service Desk,
caused by a current or ongoing Capacity or Performance Management problem,
Capacity Management must become involved to identify the cause and find a
resolution. Working together with appropriate technical support groups, and
alongside Problem Management, all necessary investigations must be
performed to detect exactly what has gone wrong and what is needed to
correct the situation.

It may be necessary to switch to more detailed monitoring during the
investigation phase to determine the exact cause. Monitoring is often set
at a ‘background’ level during normal circumstances due to the large amount
of data that can be generated and to avoid placing too high a burden on the
IT Infrastructure – but when specific difficulties are being investigated
more detailed monitoring may be needed to pinpoint the exact cause.

When a solution, or potential solution, has been found, any changes
necessary to resolve the problem must be approved via formal Change
Management prior to implementation. If the fault is causing serious
disruption and an urgent resolution is needed, the urgent change process
should be used. It is very important that no ‘tuning’ takes place without
submission through Change Management, as even apparently small adjustments
can often have very large cumulative effects – sometimes across the entire
IT Infrastructure.
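
As an illustration only, the guidance on setting alert levels and on filtering ‘information only’ alerts and ‘alert storms’ can be sketched in Python. Nothing here comes from the publication itself: the Alert record, the function names, the mean-plus-deviation rule and the 300-second repeat window are all assumptions chosen for the example, and a real Event Monitoring tool would supply its own mechanisms.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Alert:
    ci: str           # hypothetical Configuration Item identifier
    metric: str       # e.g. "cpu" or "queue_length"
    severity: str     # "info", "warning" or "critical" (assumed scale)
    timestamp: float  # seconds since an arbitrary start


def suggest_threshold(history, sla_limit, k=2.0):
    """Propose a starting alert level from historical samples.

    Assumed rule: mean plus k standard deviations of the monitored
    history, never higher than the agreed service level limit. The
    result is only a first guess to be adjusted by trial and error.
    """
    baseline = mean(history) + k * pstdev(history)
    return min(baseline, sla_limit)


def filter_alerts(alerts, repeat_window=300.0):
    """Keep only meaningful alerts.

    Drops 'info' alerts, and drops repeats of the same (ci, metric)
    pair raised within repeat_window seconds of the last alert that
    was kept for that pair, which damps an alert storm.
    """
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.severity == "info":
            continue
        key = (alert.ci, alert.metric)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < repeat_window:
            continue
        last_kept[key] = alert.timestamp
        kept.append(alert)
    return kept


if __name__ == "__main__":
    storm = [Alert("srv1", "cpu", "critical", float(t)) for t in range(0, 60, 10)]
    for alert in filter_alerts(storm):
        print(alert)
```

In this sketch the suppression window is measured from the last alert that was actually passed on, so a fault that persists is re-raised once per window rather than silenced indefinitely.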
4.6.4.3 Capacity and performance trends
Capacity Management has a role to play in identifying any capacity or
performance trends as they become discernible. Further details of actions
needed to address such trends are included in the Continual Service
Improvement publication.

4.6.4.4 Storage of Capacity Management data
Large amounts of data are usually generated through capacity and
performance monitoring. Monitoring of meters and tables of just a few
Kbytes each can quickly grow into huge files if many components are being
monitored at relatively short intervals. Another problem with very
short-term monitoring is that it is not possible to gather meaningful
information without looking over a longer period. For example, a single
snapshot of a CPU will show the device to be either ‘busy’ or ‘idle’ – but
a summary over, say, a 5-minute period will show the average utilization
level over that period, which is a much more meaningful measure of whether
the device is able to work comfortably, or whether potential performance
problems are likely to occur.

In any organization it is likely that the monitoring tools used will vary
greatly – with a combination of system-specific tools, many of them part of
the basic operating system, and specialist monitoring tools being used. In
order to coordinate the data being generated and allow the retention of
meaningful data for analysis and trending purposes, some form of central
repository for holding this summary data is needed: a Capacity Management
Information System (CMIS).

The format, location and design of such a database should be planned and
implemented in advance – see the Service Design publication for further
details – but there will be some operational aspects to handle, such as
database housekeeping and backups.

4.6.4.5 Demand Management
Demand Management is the name given to a number of techniques that can be
used to modify demand for a particular resource or service. Some techniques
for Demand Management can be planned in advance – and these are covered in
more detail in the Service Design publication. However, there are other
aspects of Demand Management that are of a more operational nature,
requiring shorter-term action.

If, for example, the performance of a particular service is causing
concern, and short-term restrictions on concurrency of users are needed to
allow performance improvements for a smaller restricted group, then Service
Operation functions will have to take action to implement such restrictions
– usually accompanied by concurrent action to implement the logging-out of
users who have been inactive for an agreed period of time to free up
resources for others.

4.6.4.6 Workload Management
There may be occasions when optimization of infrastructure resources is
needed to maintain or improve performance or throughput. This can often be
done through Workload Management, which is a generic term to cover such
actions as:
■ Rescheduling a particular service or workload to run at a different time
  of day, or day of the week etc. (usually away from peak-times to off-peak
  windows) – which will often mean having to make adjustments to
  job-scheduling software.
■ Moving a service or workload from one location or set of CIs to another –
  often to balance utilization or traffic.
■ Technical Virtualization: setting up and using virtualization systems to
  allow movement of processing around the infrastructure to give better
  performance/resilience in a dynamic fashion.
■ Limiting or moving demand for resources through Demand Management
  techniques (see above and also the Service Design publication).

It will only be possible to manage workloads effectively if a good
understanding exists of which workloads will run at what time and how much
resource utilization each workload places upon the IT Infrastructure.
Diligent monitoring and analysis of workloads is therefore needed on an
ongoing operational basis.

4.6.4.7 Modelling and applications sizing
Modelling and/or sizing of new services and/or applications must, where
appropriate, be done during the planning and transition phases – see the
Service Design and Service Transition publications. However, the Service
Operation functions have a role to play in evaluating the accuracy of the
predictions and feeding back any issues or discrepancies.

4.6.4.8 Capacity Planning
During Service Design and Service Transition, the capacity requirements of
IT services are calculated. A forward-looking capacity plan should be
maintained and regularly updated and Service Operation will have a role to
play in this. Such a plan should look forward up to two years or
more, but should be reviewed regularly every three to 12 months, depending
upon volatility and resources available. The plan should be linked to the
organization’s financial planning cycle, so that any required expenditure
for infrastructure upgrades, enhancements or additions can be included in
budget estimates and approved in advance.

The plan should predict the future but must also examine and report upon
previous predictions, particularly to give some confidence in further
predictions. Where any discrepancies have been encountered, these should be
explained and future remedial action described.

The Capacity Plan might typically cover:
■ Current performance and utilization details, with recent trends for all
  key CIs, including
    ● Backbone networks
    ● LANs
    ● Mainframes (if still used)
    ● Key servers
    ● Main data storage devices
    ● Selected (representative) desktop and laptop equipment
    ● Key websites
    ● Key databases
    ● Key applications
    ● Operational capacity – electricity, floor space, environmental
      capacity (air conditioning), floor weighting, heat generation and
      output, electrical and water demand and supply etc.
    ● Magnetic media.
■ Estimated performance and utilization for all such CIs during the
  planning period (e.g. the next three months)
■ Comparative data with previous estimates – to allow confidence in future
  estimates to be judged
■ Reports on any specific capacity difficulties encountered in the past
  period, with details of recovery and preventive actions taken for the
  future
■ Details of any required upgrades or procurements needed and planned for
  the future, with indicative costs and timescales.
■ Any potential capacity risks that are likely – with suggested
  countermeasures should they arise.

4.6.5 Availability Management
During Service Design and Service Transition, IT services are designed for
availability and recovery. Service Operation is responsible for actually
making the IT service available to the specified users at the required time
and at the agreed levels.

During Service Operation the IT teams and users are in the best position to
detect whether services actually meet the agreed requirements and whether
the design of these services is effective.

What seems like a good idea during the Design phase may not actually be
practical or optimal. The experience of the users and operational functions
makes them a primary input into the ongoing improvement of existing
services and the design.

However, there are a number of challenges with gaining access to this
knowledge:
■ Most of the experiences of the operational teams and users are either
  informal, or spread across multiple sources.
■ The process for collecting and collating this data needs to be
  formalized.
■ Users and operational staff are usually fully occupied with their regular
  activities and tasks and it is very difficult for them to be involved in
  regular planning and design activities. One argument often made here is
  that if design is improved, the operational teams will be less busy
  resolving problems and will therefore have more time to be involved in
  design activities. However, practice shows that as soon as staff are
  freed up, they often become the target of workforce reduction exercises.

Having said this, there are three key opportunities for operational staff
to be involved in Availability Improvement, since these are generally
viewed as part of their ongoing responsibility:
■ Review of maintenance activities. Service Design will define detailed
  maintenance schedules and activities, which are required to keep IT
  services functioning at the required level of performance and
  availability. Regular comparison of actual maintenance activities and
  times with the plans will highlight potential areas for improvement. One
  of the sources of this information is a review of whether Service
  Maintenance Objectives were met and, if not, why not.
■ Major problem reviews. Problems could be the result of any number of
  factors, one of which is poor design. Problem reviews therefore may
  include opportunities to identify improvements to the design of IT
  services, which will include availability and capacity improvement.
■ Involvement in specific initiatives using techniques such as Service
  Failure Analysis (SFA), Component Failure Impact Analysis (CFIA), or
  Fault Tree Analysis (FTA) or as members of Technical Observation (TO)
  activities – either as part of the follow-up to major problems or as part
  of an ongoing Service Improvement Plan, in collaboration with dedicated
  Availability Management staff. These Availability Management techniques
  are explained in more detail in the Service Design publication.

There may be occasions when Operational Staff themselves need downtime of
one or more services to enable them to conduct their operational or
maintenance activities – which may impact on availability if not properly
scheduled and managed. In such cases they must liaise with SLM and
Availability Management staff – who will negotiate with the business/users,
often using the Service Desk to perform this role, to agree and schedule
such activities.

4.6.6 Knowledge Management
It is vitally important that all data and information that can be useful
for future Service Operation activities are properly gathered, stored and
assessed. Relevant data, metrics and information should be passed up the
management chain and to other Service Lifecycle phases so that it can feed
into the knowledge and wisdom layers of the organization’s Service
Knowledge Management System, the structures of which have to be defined in
Service Strategy and Service Design and refined in Continual Service
Improvement (see other ITIL publications in this series).

Key repositories of Service Operation, which have been frequently mentioned
elsewhere, are the CMS and the KEDB, but this must be widened out to
include all of the Service Operation teams’ and departments’ documentation,
such as operations manuals, procedures manuals, work instructions, etc.

4.6.7 Financial Management for IT services
Service Operation staff must participate in and support the overall IT
budgeting and accounting system – and may be actively involved in any
charging system that may be in place.

Proper planning is necessary so that capital expenditure (Capex) and
operational expenditure (Opex) budget estimates can be prepared and agreed
in good time to meet the budgetary cycles.

The Service Operation Manager must also be involved in regular, at least
monthly, reviews of expenditure against budgets – as part of the ongoing IT
budgeting and accounting process. Any discrepancies must be identified and
necessary adjustments made. All committed expenditure must go through the
organization’s purchase order system so that commitments can be accrued and
proper checks must be made on all goods received so that invoices and
payments can be correctly authorized – or discrepancies investigated and
rectified.

It should be noted that some proposed cost reductions by the business may
actually increase IT costs, or at least unit costs. Care should therefore
be taken to ensure that IT is involved in discussing all cost-saving
measures and contributes to overall decisions. Financial Management is
covered in detail in the Service Strategy publication.

4.6.8 IT Service Continuity Management
Service Operation functions are responsible for the testing and execution
of system and service recovery plans as determined in the IT Service
Continuity plans for the organization. In addition, managers of all Service
Operation functions must be on the Business Continuity Central Coordination
team.

This is discussed in detail in Service Strategy and Service Design and will
not be repeated here, except to indicate that it is important that Service
Operation functions are involved in the following areas:
■ Risk assessment, using its knowledge of the infrastructure and techniques
  such as CFIA and access to information in the CMS to identify single
  points of failure or other high-risk situations
■ Execution of any Risk Management measures that are agreed, e.g.
  implementation of countermeasures, or increased resilience to components
  of the infrastructures, etc.
■ Assistance in writing the actual recovery plans for systems and services
  under its control
■ Participation in testing of the plans (such as involvement in off-site
  testing, simulations etc.) on an ongoing basis under the direction of the
  IT Service Continuity Manager (ITSCM)
■ Ongoing maintenance of the plans under the control of ITSCM and Change
  Management
■ Participation in training and awareness campaigns to ensure that they are
  able to execute the plans and understand their roles in a disaster
■ The Service Desk will play a key role in communicating with staff,
  customers and users during an actual disaster.
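
The risk assessment item in the list above mentions using CMS information to identify single points of failure. The following Python sketch shows one simple way such a check might look. It is not the CFIA technique as defined in the Service Design publication: the data layout (each service mapped to ‘redundancy groups’ of interchangeable CIs) and all names are assumptions made for the example.

```python
def single_points_of_failure(service_map):
    """Return {ci: set of services} for CIs with no alternative.

    service_map is an assumed structure: each service maps to a list
    of redundancy groups, and each group is a set of CIs of which any
    one is enough for the service to work. A group containing exactly
    one CI is treated as a single point of failure for that service.
    """
    spofs = {}
    for service, groups in service_map.items():
        for group in groups:
            if len(group) == 1:
                (ci,) = tuple(group)
                spofs.setdefault(ci, set()).add(service)
    return spofs


if __name__ == "__main__":
    # Hypothetical extract of service-to-CI dependencies.
    example = {
        "email": [{"mail-a", "mail-b"}, {"san-1"}],
        "payroll": [{"db-1"}, {"san-1"}],
    }
    for ci, services in sorted(single_points_of_failure(example).items()):
        print(ci, "affects", sorted(services))
```

A CI that appears against several services, such as the shared storage device in this made-up data, is a natural candidate for increased resilience.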
5 Common Service Operation activities
Chapter 4 dealt with the processes required for effective Service Operation
and Chapter 6 will deal with the organizational aspects. This chapter
focuses on a number of operational activities that ensure that technology
is aligned with the overall Service and Process objectives. These
activities are sometimes described as processes, but in reality they are
sets of specialized technical activities all aimed at ensuring that the
technology required to deliver and support services is operating
effectively and efficiently.

These activities will usually be technical in nature – although the exact
technology will vary depending on the type of services being delivered.
This publication will focus on the activities required to manage IT.

  Important note on managing technology
  It is tempting to divorce the concept of Service Management from the
  management of the infrastructure that is used to deliver those services.
  In reality, it is impossible to achieve quality services without aligning
  and ‘gearing’ every level of technology (and the people who manage it) to
  the services being provided. Service Management involves people, process
  and technology.
  In other words, the common Service Operation activities are not about
  managing the technology for the sake of having good technology
  performance. They are about achieving performance that will integrate the
  technology component with the people and process components to achieve
  service and business objectives. See Figure 5.1 for examples of how
  technology is managed in maturing organizations.

Figure 5.1 illustrates the steps involved in maturing from a
technology-centric organization to an organization that harnesses
technology as part of its business strategy. Figure 5.1 further outlines
the role of Technology Managers in organizations of differing maturity. The
diagram is not comprehensive, but it does provide examples of the way in
which technology is managed


(Scale: Technology Centric at the lower levels to Business Centric at the
higher levels)

Level 5 – Strategic Contribution
•   IT is measured in terms of its contribution to the business
•   All services are measured by their ability to add value
•   Technology is subordinate to the business function it enables
•   Service Portfolio drives investment and performance targets
•   Technology expertise is so entrenched in everyday operations it is
    viewed as a utility by the business

Level 4 – Service Provision
•   Services are quantified and initiatives aimed at delivering appropriate
    levels
•   Service requirements and technology constraints drive procurement
•   Service Design specifies performance requirements and operational norms
•   Consolidated systems support multiple services
•   All technology is mapped to services and is managed to service
    requirements
•   Change Management covers both development and operations

Level 3 – Technology Integration
•   Critical services have been identified together with their
    technological dependencies
•   Systems are integrated to provide required performance, availability
    and recovery for those services
•   More focus on measuring performance across multiple devices and even
    platforms
•   Virtual mapping of Configuration and Asset data with single Change
    Management for operations
•   Consolidated Availability and Capacity Planning on some services
•   Integrated Disaster Recovery Planning
•   Systems are consolidated to save cost

Level 2 – Technology Control
•   Initiatives are aimed at achieving control and increasing the stability
    of the infrastructure
•   IT has identified most technology components and understands what each
    is used for
•   Technical management focuses on achieving high performance of each
    component regardless of its function
•   Availability of components is measured and reported
•   Reactive Problem Management and inventory control are performed
•   Change control is performed on ‘mission critical’ components
•   Point solutions are used to automate those processes that are in place,
    usually on a platform-by-platform basis

Level 1 – Technology Driven
•   IT is driven by technology and most initiatives are aimed at trying to
    understand the infrastructure and deal with exceptions
•   Technology management is performed by technical experts, and only they
    understand how to manage each device or platform
•   Most teams are driven by incidents, and most improvements are aimed at
    making management easier – not to improve services
•   Organizations entrench technology specializations and do not encourage
    interaction with other groups
•   Management tools are aimed at managing single technologies, resulting
    in duplication
•   Incident Management processes start being created

Figure 5.1 Achieving maturity in Technology Management
82   | Common Service Operation activities



 in each type of organization. The bold headings indicate the major role played by IT in managing technology. The text in the rows describes the characteristics of an IT department at each level.

 The purpose of this diagram in this chapter is as follows:
 ■ This chapter focuses on Technical Management activities, but there is no single way of representing them. A less mature organization will tend to see these activities as ends in themselves, not a means to an end. A more mature organization will tend to subordinate these activities to higher-level Service Management objectives. For example, the Server Management team will move from an insulated department, focused purely on managing servers, to a team that works closely with other Technology Managers to find ways of increasing their value to the business.
 ■ To make and reinforce the point that there is no ‘right’ way of grouping and organizing the departments that perform these services. Some readers might interpret the headings in this chapter as the names of departments, but this is not the case. The aim of this chapter is to identify the typical technical activities involved in Service Operation. Organizational aspects are discussed in Chapter 6.
 ■ The Service Operation activities described in the rest of this chapter are not typical of any one of the levels of maturity. Rather, the activities are usually all present in some form at all levels. They are just organized and managed differently at each level.

 In some cases a dedicated group may handle all of a process or activity while in other cases processes or activities may be shared or split between groups. However, by way of broad guidance, the following sections list the required activities under the functional groups most likely to be involved in their operation. This does not mean that all organizations have to use these divisions. Smaller organizations will tend to assign groups of these activities (if they are needed at all) to single departments, or even individuals.

 Finally, the purpose of this chapter is not to provide a detailed analysis of all the activities. They are specialized, and detailed guidance is available from the platform vendors and other, more technical, frameworks; new categories will be added continually as technology evolves. This chapter simply aims to highlight the importance and nature of technology management for Service Management in the IT context.

 5.1 MONITORING AND CONTROL

 The measurement and control of services is based on a continual cycle of monitoring, reporting and subsequent action. This cycle is discussed in detail in this section because it is fundamental to the delivery, support and improvement of services.

 It is also important to note that, although this cycle takes place during Service Operation, it provides a basis for setting strategy, designing and testing services and achieving meaningful improvement. It is also the basis for SLM measurement. Therefore, although monitoring is performed by Service Operation functions, it should not be seen as a purely operational matter. All phases of the Service Lifecycle should ensure that measures and controls are clearly defined, executed and acted upon.

 5.1.1 Definitions

     Monitoring refers to the activity of observing a situation to detect changes that happen over time.

 In the context of Service Operation, this implies the following:
 ■ Using tools to monitor the status of key CIs and key operational activities
 ■ Ensuring that specified conditions are met (or not met) and, if not, raising an alert to the appropriate group (e.g. the availability of key network devices)
 ■ Ensuring that the performance or utilization of a component or system is within a specified range (e.g. disk space or memory utilization)
 ■ Detecting abnormal types or levels of activity in the infrastructure (e.g. potential security threats)
 ■ Detecting unauthorized changes (e.g. introduction of software)
 ■ Ensuring compliance with the organization’s policies (e.g. inappropriate use of e-mail)
 ■ Tracking outputs to the business and ensuring that they meet quality and performance requirements
 ■ Tracking any information that is used to measure Key Performance Indicators (KPIs).

     Reporting refers to the analysis, production and distribution of the output of the monitoring activity.

 In the context of Service Operation, this implies the following:
 ■ Using tools to collate the output of monitoring information that can be disseminated to various groups, functions or processes

■ Interpreting the meaning of that information
■ Determining where that information would best be used
■ Ensuring that decision makers have access to the information that will enable them to make decisions
■ Routing the reported information to the appropriate person, group or tool.

  Control refers to the process of managing the utilization or behaviour of a device, system or service. It is important to note, though, that simply manipulating a device is not the same as controlling it. Control requires three conditions:
  ■ The action must ensure that behaviour conforms to a defined standard or norm
  ■ The conditions prompting the action must be defined, understood and confirmed
  ■ The action must be defined, approved and appropriate for these conditions.

In the context of Service Operation, control implies the following:
■ Using tools to define what conditions represent normal operations or abnormal operations
■ Regulating the performance of devices, systems or services
■ Measuring availability
■ Initiating corrective action, which could be automated (e.g. reboot a device remotely or run a script), or manual (e.g. notify operations staff of the status).

5.1.2 Monitor Control Loops

The most common model for defining control is the Monitor Control Loop. Although it is a simple model, it has many complex applications within IT Service Management. This section will define the basic concepts of the Monitor Control Loop Model and subsequent sections will show how important these concepts are for the Service Management Lifecycle.

Figure 5.2 outlines the basic principles of control. A single activity and its output are measured using a predefined norm, or standard, to determine whether it is within an acceptable range of performance or quality. If not, action is taken to rectify the situation or to restore normal performance.

[Figure: diagram with the elements Input, Activity, Output, Monitor, Compare, Norm and Control]
Figure 5.2 The Monitor Control Loop

Typically there are two types of Monitor Control Loops:
■ Open Loop Systems are designed to perform a specific activity regardless of environmental conditions. For example, a backup can be initiated at a given time and frequency, and will run regardless of other conditions.
■ Closed Loop Systems monitor an environment and respond to changes in that environment. For example, in network load balancing a monitor will evaluate the traffic on a circuit. If network traffic exceeds a certain range, the control system will begin to route traffic across a backup circuit. The monitor will continue to provide feedback to the control system, which will continue to regulate the flow of network traffic between the two circuits.

To help clarify the difference, solving Capacity Management through over-provisioning is open loop; a load-balancer that detects congestion/failure and redirects capacity is closed loop.

5.1.2.1 Complex Monitor Control Loop

The Monitor Control Loop in Figure 5.2 is a good basis for defining how Operations Management works, but within the context of ITSM the situation is far more complex. Figure 5.3 illustrates a process consisting of three major activities. Each one has an input and an output, and the output becomes an input for the next activity.

In this diagram, each activity is controlled by its own Monitor Control Loop, using a set of norms for that specific activity. The process as a whole also has its own Monitor Control Loop, which spans all the activities and ensures that all norms are appropriate and are being followed.
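
As an illustration only, the two levels of control described above can be sketched in a few lines of Python. Everything named here (ActivityLoop, process_review, the three activity names, the norms in minutes and the 0.5 breach rate) is a hypothetical assumption made for the example, not something defined by this publication. Each activity compares its measured output with its own norm, while a process-level review looks across all activities to find norms that are breached too often.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ActivityLoop:
    """Activity-level loop: monitor an output, compare it with a norm, control on a breach."""
    name: str
    norm: Tuple[float, float]  # acceptable (low, high) range, here in minutes (assumed unit)
    checks: int = 0
    breaches: int = 0

    def run(self, measured: float, actions: List[str]) -> bool:
        """Return True if the measured output is within the norm; record an action if not."""
        self.checks += 1
        low, high = self.norm
        if low <= measured <= high:
            return True
        self.breaches += 1
        actions.append(f"{self.name}: {measured} outside {self.norm}")  # stand-in for control
        return False


def process_review(loops: List[ActivityLoop], max_breach_rate: float) -> List[str]:
    """Process-level loop: name the activities whose norms are breached too often."""
    return [
        loop.name
        for loop in loops
        if loop.checks and loop.breaches / loop.checks > max_breach_rate
    ]


loops = [
    ActivityLoop("log", norm=(0, 5)),
    ActivityLoop("assign", norm=(0, 15)),
    ActivityLoop("resolve", norm=(0, 240)),
]
actions: List[str] = []

# Three passes through the process; each tuple holds minutes taken per activity.
for sample in [(4, 20, 200), (3, 25, 230), (6, 12, 100)]:
    for loop, minutes in zip(loops, sample):
        loop.run(minutes, actions)

needs_review = process_review(loops, max_breach_rate=0.5)
print(needs_review)  # ['assign']
```

In this sketch the activity-level loops only enforce their norms, while the process-level review is the place where a norm itself, or the way an activity is performed, would be questioned.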




 [Figure: three activities in sequence, each with an Input and an Output and each with its own Norm, Compare, Monitor and Control loop; a higher-level Norm, Compare, Monitor and Control loop spans all three activities]


 Figure 5.3 Complex Monitor Control Loop

 In Figure 5.3 there is a double feedback loop. One loop focuses purely on executing a defined standard, and the second evaluates the performance of the process and also the standards whereby the process is executed. An example of this would be if the first set of feedback loops at the bottom of the diagram represented individual stations on an assembly line and the higher-level loop represented Quality Assurance.

 The Complex Monitor Control Loop is a good organizational learning tool (as defined by Chris Argyris, 1976, Increasing Leadership Effectiveness. New York: Wiley). The first level of feedback at individual activity level is concerned with monitoring and responding to data (single facts, codes or pieces of information). The second level is concerned with monitoring and responding to information (a collection of a number of facts about which a conclusion may be drawn). Refer to the Service Transition publication for a full discussion on Data, Information, Knowledge and Wisdom.

 All of this is interesting theory, but does not explain how the Monitor Control Loop concept can be used to operate IT services. And especially: who defines the norm? Based on what has been described so far, Monitor Control Loops can be used to manage:
 ■ The performance of activities in a process or procedure. Each activity and its related output can potentially be measured to ensure that problems with the process are identified before the process as a whole is completed. For example, in Incident Management, the Service Desk monitors whether a technical team has accepted an incident in a specified time. If not, the incident is escalated. This is done well before the target resolution time for that incident because the aim of escalating that one activity is to ensure that the process as a whole is completed in time.
 ■ The effectiveness of a process or procedure as a whole. In this case the ‘activity’ box represents the entire process as a single entity. For example, Change Management will measure the success of the process by checking whether a change was implemented on time, to specification and within budget.
 ■ The performance of a device. For example, the ‘activity’ box could represent the response time of a server under a given workload.

■ The performance of a series of devices. For example, the end user response time of an application across the network.

To define how to use the concept of Monitor Control Loops in Service Management, the following questions need to be answered:
■ How do we define what needs to be monitored?
■ What are the appropriate thresholds for each of these?
■ How will monitoring be performed (manual or automated)?
■ What represents normal operation?
■ What are the dependencies for normal operation?
■ What happens before we get the input?
■ How frequently should the measurement take place?
■ Do we need to perform active measurement to check whether the item is within the norm or do we wait until an exception is reported (passive measurement)?
■ Is Operations Management the only function that performs monitoring?
■ If not, how are the other instances of monitoring related to Operations Management?
■ If there are multiple loops, which processes are responsible for each loop?

The following sections will expand on the concept of Monitor Control Loops and demonstrate how these questions are answered.

5.1.2.2 The ITSM Monitor Control Loop

In ITSM, the complex Monitor Control Loop can be represented as shown in Figure 5.4.

Figure 5.4 can be used to illustrate the control of a process or of the components used to deliver a service. In this diagram the word ‘activity’ implies that it refers to a process. To apply it to a service, an ‘activity’ could also be a ‘CI’. There are a number of significant features in Figure 5.4, which are described after the figure.
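
The question of active versus passive measurement can be made concrete with a small Python sketch. It reuses the Incident Management example given earlier in this section, in which an incident that has not been accepted within a specified time is escalated. The Incident class, the function names, the event dictionaries and the 15-minute acceptance norm are all assumptions chosen for illustration; a real tool would define these differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Incident:
    ident: str
    assigned_at: datetime
    accepted_at: Optional[datetime] = None
    escalated: bool = False


def active_check(incidents: List[Incident], now: datetime, accept_within: timedelta) -> List[str]:
    """Active measurement: poll every incident and compare it with the norm."""
    newly_escalated = []
    for incident in incidents:
        overdue = incident.accepted_at is None and now - incident.assigned_at > accept_within
        if overdue and not incident.escalated:
            incident.escalated = True  # control action: escalate once
            newly_escalated.append(incident.ident)
    return newly_escalated


def passive_check(events: List[Dict[str, str]]) -> List[str]:
    """Passive measurement: act only on exceptions that something else reports."""
    return [event["source"] for event in events if event.get("type") == "exception"]


start = datetime(2024, 1, 1, 9, 0)  # illustrative timestamps only
incidents = [
    Incident("INC-1", assigned_at=start),
    Incident("INC-2", assigned_at=start, accepted_at=start + timedelta(minutes=5)),
    Incident("INC-3", assigned_at=start + timedelta(minutes=25)),
]
now = start + timedelta(minutes=30)

first_pass = active_check(incidents, now, accept_within=timedelta(minutes=15))
second_pass = active_check(incidents, now, accept_within=timedelta(minutes=15))
reported = passive_check([
    {"type": "exception", "source": "router-7"},
    {"type": "info", "source": "server-2"},
])
print(first_pass, second_pass, reported)  # ['INC-1'] [] ['router-7']
```

The active check costs a poll on every cycle but finds a silent failure, such as an incident nobody has picked up. The passive check is cheap but depends on the monitored item being able to report its own exceptions.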
[Figure: three activities in sequence, each with an Input and an Output and each with its own Norm, Compare, Monitor and Control loop. Above them are Service Strategy (Portfolios, Standards and Policies), Service Design (Technical Architectures and Performance Standards), Service Transition and Continual Service Improvement, with arrows numbered 1, 2 and 3. The surrounding groups are Business Executives and Business Unit Managers; IT Management and Vendor Account Management; Users; and Internal and External Technical Staff and Experts.]

Figure 5.4 ITSM Monitor Control Loop



 ■ Each activity in a Service Management process (or each component used to provide a service) is monitored as part of the Service Operation processes. The operational team or department responsible for each activity or component will apply the Monitor Control Loop as defined in the process, and using the norms that were defined during the Service Design processes. The role of Operational Monitoring and Control is to ensure that the process or service functions exactly as specified, which is why they are primarily concerned with maintaining the status quo.
 ■ The norms and Monitoring and Control mechanisms are defined in Service Design, but they are based on the standards and architectures defined during Service Strategy. Any changes to the organization’s Service Strategy, architecture, service portfolios or Service Level Requirements will precipitate changes to what is monitored and how it is controlled.
 ■ The Monitor Control Loops are placed within the context of the organization. This implies that Service Strategy will primarily be executed by Business and IT Executives with support from vendor account managers. Service Design acts as the bridge between Service Strategy and Service Operation and will typically involve representatives from all groups. The activities and controls will generally be executed by IT staff (sometimes involving users) and supported by IT Managers and the vendors. Service Improvement spans all areas, but primarily represents the interests of the business and its users.
 ■ Notice that the second level of monitoring in this complex Monitor Control Loop is performed by the CSI processes through Service Strategy and Service Design. These relationships are represented by the numbered arrows in Figure 5.4 as follows:
    ● Arrow 1. In this case CSI has recognized that the service will be improved by making a change to the Service Strategy. This could be the result of the business needing a change to the Service Portfolio, or that the architecture does not deliver what was expected.
    ● Arrow 2. In this case the Service Level Requirements need to be adjusted. It could be that the service is too expensive; or that the configuration of the infrastructure needs to be changed to enhance performance; or because Operations Management is unable to maintain service quality in the current architecture.
    ● Arrow 3. In this case the norms specified in Service Design are not being adhered to. This could be because they are not appropriate or executable, or because of a lack of education or a lack of communication. The norms and the lack of compliance need to be investigated and action taken to rectify the situation.

 Service Transition provides a major set of checks and balances in these processes. It does so as follows:
 ■ For new services, Service Transition will ensure that the technical architectures are appropriate; and that the Operational Performance Standards can be executed. This in turn will ensure that the Service Operation teams or departments are able to meet the Service Level Requirements.
 ■ For existing services, Change Management will manage any of the changes that are required as part of a control (e.g. tuning) as well as any changes represented by the arrows labelled 1, 2 and 3. Although Service Transition does not define strategy and design services per se, it provides coordination and assurance that the services are working, and will continue to work, as planned.

   Why is this loop covered under Service Operation?

   Figure 5.4 represents Monitoring and Control for the whole of IT Service Management. Some readers of the Service Operation publication may feel that it should be more suitably covered in the Service Strategy publication.

   However, Monitoring and Control can only effectively be deployed when the service is operational. This means that the quality of the entire set of IT Service Management processes depends on how they are monitored and controlled in Service Operation.

   The implications of this are as follows:
   ■ Service Operation staff are not the only people with an interest in what is monitored and how it is controlled.
   ■ While Service Operation is responsible for monitoring and control of services and components, they are acting as stewards of a very important part of the set of ITSM Monitoring and Control loops.
   ■ If Service Operation staff define and execute Monitoring and Control procedures in isolation, none of the Service Management processes or

                                                                Monitoring, it will understand how poor the service quality
      functions will be fully effective. This is because the
                                                                is, but will have no idea what is causing it or how to
      Service Operation functions will not support the
                                                                change it.
      priorities and information requirements of the
      other processes, e.g. attempting to negotiate an          In reality, most organizations have a combination of
      SLA when the only data available is page-swap             Internal and External Monitoring, but in many cases these
      rates on a server and detailed bandwidth                  are not linked. For example, the Server Management team
      utilization of a network.                                 knows exactly how well the servers are performing and
                                                                the Service Level Manager knows exactly how the users
                                                                perceive the quality of service provided by the servers.
5.1.2.3 Defining what needs to be monitored
                                                                However, neither of them knows how to link these metrics
The definition of what needs to be monitored is based on        to define what level of server performance represents
understanding the desired outcome of a process, device or       good quality service. This becomes even more confusing
system. IT should focus on the service and its impact on        when server performance that is acceptable in the middle
the business, rather than just the individual components of     of the month, is not acceptable at month-end.
technology. The first question that needs to be asked is
‘What are we trying to achieve?’.                               5.1.2.5 Defining objectives for Monitoring and
                                                                Control
5.1.2.4 Internal and External Monitoring and
                                                                Many organizations start by asking the question ‘What are
Control
                                                                we managing?’. This will invariably lead to a strong
At the outset, it will become clear that there are two levels   Internal Monitoring System, with very little linkage to the
of monitoring:                                                  real outcome or service that is required by the business.
■ Internal Monitoring and Control: Most teams or                The more appropriate question is ‘What is the end result
  departments are concerned about being able to                 of the activities and equipment that my team manages?’.
  execute effectively and efficiently the tasks that have       Therefore the best place to start, when defining what to
  been assigned to them. Therefore, they will monitor           monitor, is to determine the required outcome.
  the items and activities that are directly under their
  control. This type of monitoring and control focuses          The definition of Monitoring and Control objectives should
  on activities that are self-contained within that             ideally start with the definition of the Service Level
  team or department. For example, the Service Desk             Requirements documents (see Service Design publication).
  Manager will monitor the volume of calls to determine         These will specify how the customers and users will
  how many staff need to be available to answer                 measure the performance of the service, and are used as
  the telephone.                                                input into the Service Design processes. During Service
                                                                Design, various processes will determine how the service
■ External Monitoring and Control: Although each
                                                                will be delivered and managed. For example, Capacity
  team or department is responsible for managing its
                                                                Management will determine the most appropriate and
  own area, they do not act independently. Every task
                                                                cost-effective way to deliver the levels of performance
  that they perform, or device that they manage, has an
                                                                required. Availability Management will determine how the
  impact on the success of the organization as a whole.
                                                                infrastructure can be configured to provide the fewest
  Each team or department will also be controlling
                                                                points of failure.
  items and activities on behalf of other groups,
  processes or functions. For example, the Server               If there is any doubt about the validity or completeness
  Management team will monitor the CPU performance              of objectives, the COBIT framework provides a
  on key servers and perform workload balancing so              comprehensive, high-level set of objectives as a checklist.
  that a critical application is able to stay within            More information on COBIT is provided in Appendix A of
  performance thresholds set by Application                     this publication.
  Management.
                                                                The Service Design Process will help to identify the
The distinction between Internal and External Monitoring        following sets of inputs for defining Operational
is an important one. If Service Operation focuses only on       Monitoring and Control norms and mechanisms:
Internal Monitoring, it will have very well-managed
                                                                ■ They will work with customers and users to determine
infrastructure, but no way of understanding or influencing
                                                                   how the output of the service will be measured. This
the quality of services. If it focuses only on External
                                                                   will include measurement mechanisms, frequency and



   sampling. This part of Service Design will focus              Active versus Passive Monitoring
   specifically on the Functional Requirements.                  ■ Active Monitoring refers to the ongoing
 ■ They will identify key CIs, how they should be                  ‘interrogation’ of a device or system to determine its
   configured and what level of performance and                    status. This type of monitoring can be resource
   availability is required in order to meet the agreed            intensive and is usually reserved to proactively monitor
   Service Levels.                                                 the availability of critical devices or systems; or as a
 ■ They will work with the developers and vendors of               diagnostic step when attempting to resolve an
   the CIs that make up each service to identify any               Incident or diagnose a problem.
   constraints or limitations in those components.               ■ Passive Monitoring is more common and refers to
 ■ All support and delivery teams and departments will             generating and transmitting events to a ‘listening
   need to identify what information will help them to             device’ or monitoring agent. Passive Monitoring
   execute their role effectively. Part of the Service             depends on successful definition of events and
   Design and development will be to instrument each               instrumentation of the system being monitored (see
   service so that it can be monitored to provide this             section 4.1).
   information, or so that it can generate meaningful
   events.                                                       Reactive versus Proactive
                                                                 ■ Reactive Monitoring is designed to request or trigger
 All of this means that a very important part of defining
                                                                   action following a certain type of event or failure. For
 what Service Operation monitors and how it exercises
                                                                   example, server performance degradation may trigger
 control is to identify the stakeholders of each service.
                                                                   a reboot, or a system failure will generate an incident.
 Stakeholders can be defined as anyone with an interest in         Reactive monitoring is not only used for exceptions. It
 the successful delivery and receipt of IT services. Each          can also be used as part of normal operations
 stakeholder will have a different perspective of what it will     procedures, for example a batch job completes
 take to deliver or receive an IT service. Service Operation       successfully, which prompts the scheduling system to
 will need to understand each of these perspectives in             submit the next batch job.
 order to determine exactly what needs to be monitored           ■ Proactive Monitoring is used to detect patterns of
 and what to do with the output.                                   events which indicate that a system or service may be
 Service Operation will therefore rely on SLM to define            about to fail. Proactive monitoring is generally used in
 exactly who these stakeholders are and how they                   more mature environments where these patterns have
 contribute to or use the service. This is discussed more          been detected previously, often several times.
 fully in the Service Design and Continual Service                 Proactive Monitoring tools are therefore a means of
 Improvement publications.                                         automating the experience of seasoned IT staff and
                                                                   are often created through the Proactive Problem
     Note on Internal and External Monitoring                      Management process (see Continual Service
     Objectives                                                    Improvement publication).

     The required outcome could be internal or external to       Please note that Reactive and Proactive Monitoring could
     the Service Operation functions, although it should         be active or passive, as per Table 5.1 overleaf.
     always be remembered that an internal action will
     often have an external result. For example,
     consolidating servers to make them easier to manage
     may result in a cost saving, which will affect the SLM
     negotiation and review cycle as well as the Financial
     Management processes.


 5.1.2.6 Types of monitoring
 There are many different types of monitoring tool and
 different situations in which each will be used. This section
 focuses on some of the different types of monitoring that
 can be performed and when they would be appropriate.
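
The Active versus Passive distinction described on this page can be illustrated with a minimal Python sketch. The class names, probe functions and device names below are hypothetical and are not taken from this publication or from any monitoring product; the sketch only assumes that an active monitor interrogates devices, while a passive listener waits for events to be sent to it.

```python
from typing import Callable, Dict, List


class ActiveMonitor:
    """Active Monitoring: the monitor interrogates each device for its status."""

    def __init__(self, probes: Dict[str, Callable[[], bool]]) -> None:
        # Hypothetical mapping: device name -> probe returning True if the device responds.
        self.probes = probes

    def poll(self) -> Dict[str, str]:
        # Each poll costs one probe per device, which is why this style is resource intensive.
        return {name: ("up" if probe() else "down") for name, probe in self.probes.items()}


class PassiveListener:
    """Passive Monitoring: devices generate events; the listener only receives them."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def receive(self, source: str, message: str) -> None:
        # Nothing is learned unless the monitored system has been instrumented to send events.
        self.events.append({"source": source, "message": message})


# Illustrative use with stand-in probes instead of real network calls.
active = ActiveMonitor({"db-server": lambda: True, "mail-server": lambda: False})
status = active.poll()

listener = PassiveListener()
listener.receive("mail-server", "service stopped")
```

In practice the probe would be something like a 'ping' or a sample transaction, and the listener would be a monitoring agent receiving events as defined in Event Management.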

Table 5.1   Active and Passive Reactive and Proactive Monitoring
               Active                                                    Passive
Reactive       Used to diagnose which device is causing the failure      Detects and correlates event records to determine the
               and under what conditions (e.g. ‘ping’ a device, or       meaning of the events and the appropriate action (e.g.
               run and track a sample transaction through a series       a user logs in three times with the incorrect password,
               of devices)                                               which represents a security exception and is escalated
                                                                         through Information Security Management procedures)
               Requires knowledge of the infrastructure topography
               and the mapping of services to CIs                        Requires detailed knowledge of the normal operation
                                                                         of the infrastructure and services

Proactive      Used to determine the real-time status of a device,       Event records are correlated over time to build trends
               system or service – usually for critical components       for Proactive Problem Management.
               or following the recovery of a failed device to ensure
               that it is fully recovered (i.e. is not going to cause    Patterns of events are defined and programmed into
               further incidents)                                        correlation tools for future recognition
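
The Reactive/Passive cell of Table 5.1 gives the example of three logins with an incorrect password. A minimal sketch of that kind of event correlation is shown below. The function name, the event format and the rule that a successful login resets the count are assumptions made for illustration; the publication does not prescribe them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

FAILED_LOGIN_THRESHOLD = 3  # from the table's example of three incorrect passwords


def security_exceptions(events: Iterable[Tuple[str, str]]) -> List[str]:
    """Return users whose consecutive failed logins reach the threshold.

    Each event is a (user, outcome) pair, where outcome is 'fail' or 'ok'.
    """
    streak: Dict[str, int] = defaultdict(int)
    flagged: List[str] = []
    for user, outcome in events:
        if outcome == "ok":
            streak[user] = 0  # assumption: a successful login resets the count
            continue
        streak[user] += 1
        if streak[user] == FAILED_LOGIN_THRESHOLD:
            # This is the point at which the event would be escalated
            # through Information Security Management procedures.
            flagged.append(user)
    return flagged


log = [("alice", "fail"), ("bob", "fail"), ("alice", "fail"),
       ("bob", "ok"), ("alice", "fail"), ("bob", "fail")]
exceptions = security_exceptions(log)
```

The monitor never interrogates anything here: it only interprets event records that the login system has already generated, which is what makes the technique passive.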


Continuous Measurement versus Exception-Based                           is physical inspection – often performed by the user
Measurement                                                             rather than IT staff). Where Exception-Based
                                                                        Measurement is used, it is important that both the
■ Continuous Measurement is focused on monitoring
                                                                        OLA and the SLA for that service reflect this, as service
  a system in real time to ensure that it complies with a
                                                                        outages are more likely to occur, and users are often
  performance norm (for example, an application server
                                                                        required to report the exception.
  is available for 99.9% of the agreed service hours). The
  difference between Continuous Measurement and                    Performance versus output
  Active Monitoring is that Active Monitoring does not             There is an important distinction between the reporting
  have to be continuous. However, as with Active                   used to track the performance of components or teams or
  Monitoring, this is resource intensive and is usually            departments used to deliver a service and the reporting
  reserved for critical components or services. In most            used to demonstrate the achievement of service quality
  cases the cost of the additional bandwidth and                   objectives.
  processor power outweighs the benefit of continuous
  measurement. In these cases monitoring will usually              IT managers often confuse these by reporting to the
  be based on sampling and statistical analysis (e.g. the          business on the performance of their teams or
  system performance is reported every 30 seconds and              departments (e.g. number of calls taken per Service Desk
  extrapolated to represent overall performance). In               Analyst), as if that were the same thing as quality of
  these cases, the method of measurement will have to              service (e.g. incidents solved within the agreed time).
  be documented and agreed in the OLAs to ensure                   Performance Monitoring and metrics should be used
  that it is adequate to support the Service Reporting             internally by Service Management to determine
  Requirements (see Continual Service Improvement                  whether people, process and technology are functioning
  publication).                                                    correctly and to standard.
■ Exception-Based Measurement does not measure
                                                                   Users and customers would rather see reporting related to
  the real-time performance of a service or system, but
                                                                   the quality and performance of the service.
  detects and reports against exceptions. For example,
  an event is generated if a transaction does not                  Although Service Operation is concerned with both types
  complete, or if a performance threshold is reached.              of reporting, the primary concern of this publication is
  This is more cost-effective and easier to measure, but           Performance Monitoring, whereas monitoring of Service
  could result in longer service outages. Exception-Based          Quality (or Output-Based Monitoring) will be discussed in
  Measurement is used for less critical systems or on              detail in the Continual Service Improvement publication.
  systems where cost is a major issue. It is also used
  where IT tools are not able to determine the status or
  quality of a service (e.g. if printing quality is part of
  the service specification, the only way to measure this



 5.1.2.7 Monitoring in Test Environments                         The relevant Application Management team should also
 As with any IT Infrastructure, a Test Environment will          have defined the exact steps that it will take when the
 need to define how it will use monitoring and control.          application fails.
 These controls are more fully discussed in the Service          In addition, it should also be recognized that action may
 Transition publication.                                         need to be taken by different people, for example a single
 ■ Monitoring the Test Environment itself: A Test                event (such as an application failure) may trigger action by
   Environment consists of infrastructure, applications and      the Application Management team (to restore service), the
   processes that have to be managed and controlled              users (to initiate manual processing) and management (to
   just as any other environment. It is tempting to think        determine how this event can be prevented in future).
   that the Test Environment does not need rigorous              The implications of this principle are outlined in more
   monitoring and control because it is not a live               detail in relation to Event Management (see section 4.1).
   environment. However, this argument is not valid. If a
   Test Environment is not properly monitored and                5.1.2.9 Service Operation audits
   controlled, there is a danger of running the tests on
                                                                 Regular audits must be performed on the Service
   equipment that deviates from the standards defined in
                                                                 Operation processes and activities to ensure:
   Service Design.
 ■ Monitoring items being tested: The results of testing         ■ They are being performed as intended
   have to be accurately tracked and checked. Also it is         ■ There is no circumvention
   important that any monitoring tools that have been            ■ They are still fit for purpose, or to identify any
   built into new or changed services have to be tested             required changes or improvements.
   as well.
                                                                 Service Operation Managers may choose to perform such
                                                                 audits themselves, but ideally some form of independent
 5.1.2.8 Reporting and action                                    element to the audits is preferable.
           ‘A report alone creates awareness; a report with an
                                                                 The organization’s internal IT audit team or department
           action plan achieves results.’
                                                                 may be asked to be involved or some organizations may
     Reporting and dysfunction                                   choose to engage third-party consultancy/audit/
                                                                 assessment companies so that an entirely independent
     Practical experience has shown that there is more
                                                                 expert view is obtained.
     reporting in dysfunctional organizations than in
     effective organizations. This is because reports are not    Service Operation audits are part of the ongoing
     being used to initiate pre-defined action plans, but        measurement that takes place as part of Continual Service
     rather:                                                     Improvement and are discussed in more detail in that
     ■ to shift the blame for an incident                        publication.

     ■ to try to find out who is responsible for making a
                                                                 5.1.2.10 Measurement, metrics and KPIs
        decision
                                                                 This section has focused primarily on the monitoring and
     ■ as input to creating action plans for future
                                                                 control as a basis for Service Operation. Other sections of
        occurrences.
                                                                 the publication have covered some basic metrics that
     In dysfunctional organizations a lot of reports are         could be used to measure the effectiveness and efficiency
     produced which no one has the time to look at or            of a process.
     query.
                                                                 Although this publication is not primarily about
                                                                 measurement and metrics, it is important that
 Monitoring without control is irrelevant and ineffective.
                                                                 organizations using these guidelines have robust
 Monitoring should always be aimed at ensuring that
                                                                 measurement techniques and metrics that support the
 service and operational objectives are being met. This
                                                                 objectives of their organization. This section is a summary
 means that unless there is a clear purpose for monitoring
                                                                 of these concepts.
 a system or service, it should not be monitored.
 This also means that when monitoring is defined, so too
 should any required actions. For example, being able to
 detect that a major application has failed is not sufficient.

Measurement                                                     A further reason for not including them is the fact that
                                                                similar metrics can be used to achieve very different KPIs.
  Measurement refers to any technique that is used to           For example, one organization used the metric
  evaluate the extent, dimension or capacity of an item         ‘Percentage of Incidents resolved by the Service Desk’
  in relation to a standard or unit.                            to evaluate the performance of the Service Desk.
  ■ Extent refers to the degree of compliance or                This worked effectively for about two years, after which
      completion (e.g. are all changes formally                 the IT manager began to realize that this KPI was being
      authorized by the appropriate authority)                  used to prevent effective Problem Management, i.e. if,
  ■ Dimension refers to the size of an item, e.g. the           after two years, 80% of all incidents are easy enough to be
      number of incidents resolved by the Service Desk          resolved in 10 minutes on the first call, why have we not
                                                                come up with a solution for them? In effect, the KPI now
  ■ Capacity refers to the total capability of an item,
                                                                became a measure for how ineffective the Problem
      for example maximum number of standard
                                                                Management teams were.
      transactions that can be processed by a server per
      minute.
                                                                5.1.2.11 Interfaces to other Service Lifecycle
                                                                practices
Measurement only becomes meaningful when it is
possible to measure the actual output or dimensions of a        Operational Monitoring and Continual Service
system, function or process against a standard or desired       Improvement
level, e.g. the server must be capable of processing a          This section has focused on Operational Monitoring and
minimum of 100 standard transactions per minute. This           Reporting, but monitoring also forms the starting point for
needs to be defined in Service Design, and refined over         Continual Service Improvement. This is covered in the
time through Continual Service Improvement, but the             Continual Service Improvement publication, but key
measurement itself takes place during Service Operation.        differences are outlined here.
Metrics                                                         Quality is the key objective of monitoring for Continual
                                                                Service Improvement (CSI). Monitoring will therefore focus
  Metrics refer to the quantitative, periodic assessment        on the effectiveness of a service, process, tool,
  of a process, system or function, together with the           organization or CI. The emphasis is not on assuring real-
  procedures and tools that will be used to make these
                                                                time service performance; rather it is on identifying where
  assessments and the procedures for interpreting them.
                                                                improvements can be made to the existing level of service,
                                                                or IT performance.
This definition is important because it not only specifies
what needs to be measured, but also how to measure it,          Monitoring for CSI will therefore tend to focus on
what the acceptable range of performance will be and            detecting exceptions and resolutions. For example, CSI is
what action will need to be taken as a result of normal         not as interested in whether an incident was resolved, but
performance or an exception. From this, it is clear that any    whether it was resolved within the agreed time and
metric given in the previous section of this publication is a   whether future incidents can be prevented.
very basic one and will need to be applied and expanded         CSI is not only interested in exceptions, though. If an SLA
within the context of each organization before it can be        is consistently met over time, CSI will also be interested in
effective.                                                      determining whether that level of performance can be
                                                                sustained at a lower cost or whether it needs to be
Key Performance Indicators
                                                                upgraded to an even better level of performance. CSI may
  A KPI refers to a specific, agreed level of performance       therefore also need access to regular performance reports.
  that will be used to measure the effectiveness of an          However, since CSI is unlikely to need, or be able to cope
  organization or process.                                      with, the vast quantities of data that are produced by all
                                                                monitoring activity, they will most likely focus on a
KPIs are unique to each organization and have to be             specific subset of monitoring at any given time. This could
related to specific inputs, outputs and activities. They are    be determined by input from the business or
not generic or universal and thus have not been included        improvements to technology.
in this publication.
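
The definition of a metric on this page covers what is measured, how it is measured, the acceptable range of performance and the action to be taken. A minimal sketch of that structure follows, using the 'Percentage of Incidents resolved by the Service Desk' example. The `Metric` class, the 60% to 80% range and the action wording are hypothetical values chosen for illustration; each organization would need to agree its own.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str           # what is measured
    lower: float        # lower bound of the acceptable range (illustrative)
    upper: float        # upper bound of the acceptable range (illustrative)
    on_normal: str      # action when the value is inside the range
    on_exception: str   # action when the value is outside the range

    def evaluate(self, value: float) -> str:
        """Return the agreed action for a measured value."""
        return self.on_normal if self.lower <= value <= self.upper else self.on_exception


def percent_resolved(resolved_by_desk: int, total_incidents: int) -> float:
    """How the value is measured: share of incidents resolved by the Service Desk."""
    if total_incidents == 0:
        return 0.0
    return 100.0 * resolved_by_desk / total_incidents


first_line = Metric(
    name="Percentage of Incidents resolved by the Service Desk",
    lower=60.0,
    upper=80.0,
    on_normal="include in routine report",
    on_exception="review with Problem Management",
)

value = percent_resolved(resolved_by_desk=850, total_incidents=1000)
action = first_line.evaluate(value)
```

Giving the range an upper bound reflects the point made in the example on this page: a very high first-call resolution rate may indicate recurring incidents that Problem Management has not yet addressed.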



 This has two main implications:                               second-line support groups if they do not work 24/7).
                                                               In some organizations, the Service Desk is part of the
 ■ Monitoring for CSI will change over time. They may be
                                                               Operations Bridge.
   interested in monitoring the e-mail service one quarter
   and then move on to look at HR systems in the next          The physical location and layout of the Operations Bridge
   quarter.                                                    needs to be carefully designed to give the correct
 ■ This means that Service Operation and CSI need to           accessibility and visibility of all relevant screens and
   build a process which will help them to agree on            devices to authorized personnel. However, this will
   what areas need to be monitored and for what                become a very sensitive area where controlled access and
   purpose.                                                    tight security will be essential.
                                                               Smaller organizations may not have a physical Operations
 5.2 IT OPERATIONS                                             Bridge, but there will certainly still be the need for Console
                                                               Management, usually combined with other technical roles.
 5.2.1 Console Management/Operations                           For example, a single team of technical staff will manage
 Bridge                                                        the network, servers and applications. Part of their role will
                                                               be to monitor the consoles for those systems – often
 These provide a central coordination point for managing
                                                               using virtual consoles so that they can perform the activity
 various classes of events, detecting incidents, managing
                                                               from any location. However, it should be noted that these
 routine operational activities and reporting on the status
                                                               virtual consoles are powerful tools and, if used in insecure
 or performance of technology components.
                                                               locations or over unsecured connections, could represent a
 Observation and monitoring of the IT Infrastructure can       significant security threat.
 occur from a centralized console – to which all system
 events are routed. Historically, this involved the            5.2.2 Job Scheduling
 monitoring of the master operations console of one or         IT Operations will perform standard routines, queries or
 more mainframes – but these days is more likely to            reports delegated to it as part of delivering services; or as
 involve monitoring of a server farm(s), storage devices,      part of routine housekeeping delegated by Technical and
 network components, applications, databases, or any other     Application Management teams.
 CIs, including any remaining mainframe(s), from a single
 location, known as the Operations Bridge.                     Job Scheduling involves defining and initiating job-
                                                               scheduling software packages to run batch and real-time
 There are two theories about how the Operations Bridge        work. This will normally involve daily, weekly, monthly,
 was so named. One is that it resembles the bridge of a        annual and ad hoc schedules to meet business needs.
 large, automated ship (such as spaceships commonly seen
 in science fiction movies). The other theory is that the      In addition to the initial design, or periodic redesign, of
 Operations Bridge represents a link between the IT            the schedules, there are likely to be frequent amendments
 Operations teams and the traditional Help Desk. In some       or adjustments to make during which job dependencies
 organizations this means that the functions of Operational    have to be identified and accommodated. There will also
 Control and the Help Desk were merged into the Service        be a role to play in defining alerts and Exception Reports
 Desk, which performed both sets of duties in a single         to be used for monitoring/checking job schedules. Change
 physical location.                                            Management plays an important role in assessing and
                                                               validating major changes to schedules, as well as creating
 Regardless of how it was named, an Operations Bridge will     Standard Change procedures for more routine changes.
 pull together all of the critical observation points within
 the IT Infrastructure so that they can be monitored and       Run-time parameters and/or files have to be received (or
 managed from a centralized location with minimal effort.      expedited if delayed) and input – and all run-time logs
 The devices being monitored are likely to be physically       have to be checked and any failures identified.
 dispersed and may be located in centralized computer          If failures do occur, then re-runs will have to be initiated,
 installations or dispersed within the user community,         under the guidance of the appropriate business units,
 or both.                                                      often with different parameters or amended data/file
 The Operations Bridge will combine many activities, which     versions. This will require careful communications to
 might include Console Management, event handling, first-      ensure correct parameters and files are used.
 line network management, Job Scheduling and out-of-           Many organizations are faced with increasing overnight
 hours support (covering for the Service Desk and/or           batch schedules which can, if they overrun the overnight

batch slot, adversely impact upon the online day services
– so are seeking ways of utilizing maximum overnight
capacity and performance, in conjunction with Capacity
Management. This is where Workload Management
techniques can be useful, such as:

■ Re-scheduling of work to avoid contention on specific
  devices or at specific times and improve overall
  throughput
■ Migration of workloads to alternative
  platforms/environments to gain improved performance
  and/or throughput (virtualization capabilities make this
  far more achievable by allowing dynamic, automated
  migration)
■ Careful timing and ‘interleaving’ of jobs to gain
  maximum utilization of available resources.

  Anecdote

  One large organization, which was faced with batch
  overrun/utilization problems, identified that, due to
  human nature where people were seeking to be
  ‘tidy’, all jobs were being started on the hour or at
  15-minute intervals during the hour (i.e. n o’clock, 15
  minutes past, half past, 15 minutes to, etc.).

  By re-scheduling of work so that it started as soon as
  other work finished, and staggering the start times of
  other work, it was able to gain significant reductions
  in contention and achieve much quicker overall
  processing, which resolved its problems without a
  need for upgrades.

Job Scheduling has become a highly sophisticated activity,
including any number of variables – such as time-
sensitivity, critical and non-critical dependencies, workload
balancing, failure and resubmission, etc. As a result, most
operations rely on Job Scheduling tools that allow IT
Operations to schedule jobs for the optimal use of
technology to achieve Service Level Objectives.

The latest generation of scheduling tools allows for a
single toolset to schedule and automate technical
activities and Service Management process activities (such
as Change Scheduling). While this is a good opportunity
for improving efficiency, it also represents a greater single
point of failure. Organizations using this type of tool
therefore still use point solutions as agents and also as a
backup in case the main toolset fails.

5.2.3 Backup and Restore

Backup and Restore is essentially a component of good IT
Service Continuity Planning. As such, Service Design
should ensure that there are solid backup strategies for
each service and Service Transition should ensure that
these are properly tested.

In addition, regulatory requirements specify that certain
types of organization (such as Financial Services or listed
companies) must have a formal Backup and Restore
strategy in place and that this strategy is executed and
audited. The exact requirements will vary from country to
country and by industry sector. This should be determined
during Service Design and built into the service
functionality and documentation.

The only point of taking backups is that they may need to
be restored at some point. For this reason it is not as
important to define how to back a system up as it is to
define what components are at risk and how to effectively
mitigate that risk.

There are any number of tools available for Backup and
Restore, but it is worth noting that features of storage
technologies used for business data are being used for
backup/restore (e.g. snapshots). There is therefore an
increasing degree of integration between Backup and
Restore activities and those of Storage and Archiving (see
section 5.6).

5.2.3.1 Backup

The organization’s data has to be protected and this will
include backup (copying) and storage of data in remote
locations where it can be protected – and used should it
need to be restored due to loss, corruption or
implementation of IT Service Continuity Plans.

An overall backup strategy must be agreed with the
business, covering:

■ What data has to be backed up and the frequency
  and intervals to be used.
■ How many generations of data have to be retained –
  this may vary by the type of data being backed up, or
  what type of file (e.g. data file or application
  executable).
■ The type of backup (full, partial, incremental) and
  checkpoints to be used.
■ The locations to be used for storage (likely to include
  disaster recovery sites) and rotation schedules.
■ Transportation methods (e.g. file transfer via the
  network, physical transportation on magnetic media).
■ Testing/checks to be performed, such as test-reads,
  test restores, check-sums etc.
■ Recovery Point Objective. This describes the point to
  which data will be restored after recovery of an IT
  Service. This may involve loss of data. For example, a
  Recovery Point Objective of one day may be
  supported by daily backups, and up to 24 hours of
  data may be lost. Recovery Point Objectives for each IT
  service should be negotiated, agreed and documented
  in OLAs, SLAs and UCs.
■ Recovery Time Objective. This describes the
  maximum time allowed for recovery of an IT service
  following an interruption. The Service Level to be
  provided may be less than normal Service Level
  Targets. Recovery Time Objectives for each IT service
  should be negotiated, agreed and documented in
  OLAs, SLAs and UCs.
■ How to verify that the backups will work if they need
  to be restored. Even if there are no error codes
  generated, there may be several reasons why the
  backup cannot be restored. A good backup strategy
  and operations procedures will minimize the risk of
  this happening. Backup procedures should include a
  verification step to ensure that the backups are
  complete and that they will work if a restore is
  needed. Where any backup failures are detected,
  recovery actions must be initiated.

There is also a need to procure and manage the necessary
media (disks, tapes, CDs, etc.) to be used for backups, so
that there is no shortage of supply.

Where automated devices are being used, pre-loading of
the required media will be needed in advance. When
loading and clearing media returned from off-site storage
it is important that there is a procedure for verifying that
these are the right ones. This will prevent the most recent
backup being overwritten with faulty data, and then
having no valid data to restore. After successful backups
have been taken, the media must be removed for storage.

The actual initiation of the backups might be automated,
or carried out from the Operations Bridge.

Some organizations may utilize Operations staff to perform
the physical transportation and racking of backup copies
to/from remote locations, whereas in other cases this may
be handed over to other groups such as internal security
staff or external contractors.

If backups are being automated or performed remotely,
then Event Monitoring capabilities should be considered
so that any failures can be detected early and rectified
before they cause problems. In such cases IT Operations
has a role to play in defining alerts and escalation paths.

In all cases, IT Operations staff must be trained in backup
(and restore) procedures – which must be well
documented in the organization’s IT Operations
Procedures Manual. Any specific requirements or targets
should be referenced in OLAs or UCs where appropriate,
while any user or customer requirements or activity should
be specified in the appropriate SLA.

5.2.3.2 Restore

A restore can be initiated from a number of sources,
ranging from an event that indicates data corruption,
through to a Service Request from a user or customer
logged at the Service Desk. A restore may be needed in
the case of:

■ Corrupt data
■ Lost data
■ Disaster recovery/IT Service Continuity situation
■ Historical data required for forensic investigation.

The steps to be taken will include:

■ Location of the appropriate data/media
■ Transportation or transfer back to the physical recovery
  location
■ Agreement on the checkpoint recovery point and the
  specific location for the recovered data (disk, directory,
  folder etc.)
■ Actual restoration of the file/data (copy-back and any
  roll-back/roll-forward needed to arrive at the agreed
  checkpoint)
■ Checking to ensure successful completion of the
  restore – with further recovery action if needed until
  success has been achieved.
■ User/customer sign-off.

5.2.4 Print and Output

Many services consist of generating and delivering
information in printed or electronic form. Ensuring the
right information gets to the right people, with full
integrity, requires formal control and management.

Print (physical) and Output (electronic) facilities and
services need to be formally managed because:

■ They often represent the tangible output of a service.
  The ability to measure that this output has reached
  the appropriate destination is therefore very important
  (e.g. checking whether files with financial transaction
  data have actually reached a bank through an FTP
  service)
■ Physical and electronic output often contains sensitive
  or confidential information. It is vital that the
  appropriate levels of security are applied to both the
  generation and the delivery of this output.

Many organizations will have centralized bulk printing
requirements which IT Operations must handle.

In addition to the physical loading and re-loading of paper
and the operation and care of the printers, other activities
may be needed, such as:

■ Agreement and setting of pre-notification of large
  print runs and alerts to prevent excessive printing by
  rogue print jobs
■ Physical control of high-value stationery such as
  company cheques or certificates, etc.
■ Management of the physical and electronic storage
  required to generate the output. In many cases IT will
  be expected to provide archives for the printed and
  electronic materials
■ Control of all printed material so as to adhere to data
  protection legislation and regulation e.g. HIPAA (Health
  Insurance Portability and Accountability Act) in the
  USA, or FSA (Financial Services Authority) in the UK.

Where print and output services are delivered directly to
the users, it is important that the responsibility for
maintaining the printers or storage devices is clearly
defined. For example, most users assume that cleaning
and maintenance of printers must be performed by IT. If
this is not the case, this must be clearly stated in the SLA.

5.3 MAINFRAME MANAGEMENT

Mainframes are still widely in use and have well
established and mature practices. Mainframes form the
central component of many services and their performance
will therefore set a baseline for service performance and
user or customer expectations, although users or customers
may never know that they are using the mainframe.

The ways in which mainframe management teams are
organized are quite diverse. In some organizations
Mainframe Management is a single, highly specialized
team that manages all aspects from daily operations
through to system engineering. In other organizations, the
activities are performed by several teams or departments,
with engineering and third-level support being provided
by one team and daily operations being combined with
the rest of IT Operations (and very probably managed
through the Operations Bridge).

Typically, the following activities are likely to be
undertaken:

■ Mainframe operating system maintenance and support
■ Third-level support for any mainframe-related
  incidents/problems
■ Writing job scripts
■ System programming
■ Interfacing to hardware (H/W) support; arranging
  maintenance, agreeing slots, identifying H/W failure,
  liaison with H/W engineering.
■ Provision of information and assistance to Capacity
  Management to help achieve optimum throughput,
  utilization and performance from the mainframe.

5.4 SERVER MANAGEMENT AND SUPPORT

Servers are used in most organizations to provide flexible
and accessible services from hosting applications or
databases, running client/server services, Storage, Print and
File Management. Successful management of servers is
therefore essential for successful Service Operation.

The procedures and activities which must be undertaken
by the Server Team(s) or department(s) – separate teams
may be needed where different server-types are used
(UNIX, Wintel etc.) – include:

■ Operating system support: Support and
  maintenance of the appropriate operating system(s)
  and related utility software (e.g. failover software)
  including patch management and involvement in
  defining backup and restore policies.
■ Licence management for all server CIs, especially
  operating systems, utilities and any application
  software not managed by the Application
  Management teams.
■ Third-level support: Third-level support for all server
  and/or server operating system-related incidents,
  including diagnosis and restoration activities. This will
  also include liaison with third-party hardware support
  contractors and/or manufacturers as needed to
  escalate hardware-related incidents.
■ Procurement advice: Advice and guidance to the
  business on the selection, sizing, procurement and
  usage of servers and related utility software to meet
  business needs.
■ System security: Control and maintenance of the
  access controls and permissions within the relevant
  server environment(s) as well as appropriate system
  and physical security measures. These include
  identification and application of security patches,
  Access Management (see section 4.5) and intrusion
  detection.
■ Definition and management of virtual servers. This
  implies that any server that has been designed and
  built around a common standard can be used to
  process workloads from a range of applications or
  users. Server Management will be required to set these
  standards and then ensure that workloads are
  appropriately balanced and distributed. They are also
  responsible for being able to track which workload is
  being processed by which server so that they are able
  to deal with incidents effectively.
■ Capacity and Performance: Provide information and
  assistance to Capacity Management to help achieve
  optimum throughput, utilization and performance
  from the available servers. This is discussed in more
  detail in Service Design, but includes providing
  guidance on, and installation and operation of,
  virtualization software so as to achieve value for
  money by obtaining the highest levels of performance
  and utilization from the minimal number of servers.
■ Other routine activities include:
  ● Defining standard builds for servers as part of the
    provisioning process. This is covered in more detail
    in Service Design and Service Transition
  ● Building and installing new servers as part of
    ongoing maintenance or for the provision of
    new services. This is discussed in more detail in
    Service Transition
  ● Setting up and managing clusters, which are aimed
    at building redundancy, improving service
    performance and making the infrastructure easier
    to manage.
■ Ongoing maintenance. This typically consists of
  replacing servers or ‘blades’ on a rolling schedule to
  ensure that equipment is replaced before it fails or
  becomes obsolete. This results in servers that are not
  only fully functional, but also capable of supporting
  evolving services.
■ Decommissioning and disposal of old server
  equipment. This is often done in conjunction with the
  organization’s environmental policies for disposal.

5.5 NETWORK MANAGEMENT

As most IT services are dependent on connectivity,
Network Management will be essential to deliver services
and also to enable Service Operation staff to access and
manage key service components.

Network Management will have overall responsibility for
all of the organization’s own Local Area Networks (LANs),
Metropolitan Area Networks (MANs) and Wide Area
Networks (WANs) – and will also be responsible for liaising
with third-party network suppliers.

Their role will include the following activities:

■ Initial planning and installation of new
  networks/network components; maintenance and
  upgrades to the physical network infrastructure. This is
  done through Service Design and Service Transition.
■ Third-level support for all network-related activities,
  including investigation of network issues (e.g. pinging
  or traceroute and/or use of network management
  software tools – although it should be noted that
  pinging a server does not necessarily mean that the
  service is available!) and liaison with third-parties as
  necessary. This also includes the installation and use of
  ‘sniffer’ tools, which analyse network traffic, to assist in
  incident and problem resolution.
■ Maintenance and support of network operating system
  and middleware software including patch
  management, upgrades, etc.
■ Monitoring of network traffic to identify failures or to
  spot potential performance or bottleneck issues.
■ Reconfiguring or rerouting of traffic to achieve
  improved throughput or better balance – definition of
  rules for dynamic balancing/routing.
■ Network security (in liaison with the organization’s
  Information Security Management) including firewall
  management, access rights, password protection etc.
■ Assigning and managing IP addresses, Domain Name
  Systems (DNSs – which convert the name of a service
  to its associated IP address) and Dynamic Host
  Configuration Protocol (DHCP) systems, which enable
  access and use of the DNS.
■ Managing Internet Service Providers (ISPs).
■ Implementing, monitoring and maintaining Intrusion
  Detection Systems on behalf of Information Security
  Management. They will also be responsible for
  ensuring that there is no denial of service to
  legitimate users of the network.
■ Updating Configuration Management as necessary by
  documenting CIs, status, relationships, etc.

Network Management is also responsible, often in
conjunction with Desktop Support, for remote connectivity
issues such as dial-in, dial-back and VPN facilities provided
to home-workers, remote workers or suppliers.

Some Network Management teams or departments will
also have responsibility for voice/telephony, including the
provision and support for exchanges, lines, ACD, statistical
software packages etc. and for Voice over Internet Protocol
(VoIP) and Remote Monitoring (RMon) systems.

At the same time, many organizations see VoIP and
telephony as specialized areas and have teams dedicated
to managing this technology. Their activities will be
similar to those described above.
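The caution in the Network Management activities, that a successful ping does not necessarily mean the service is available, can be illustrated with a check one level deeper than ping. The following is a minimal sketch only, assuming the service listens on a TCP port; the function name and its parameters are hypothetical and are not taken from this publication or from any monitoring product.

```python
import socket


def tcp_service_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    A host may answer ping while the service port is closed, so this
    check tests the service endpoint rather than the machine.
    """
    try:
        # create_connection raises OSError (refused, timed out,
        # unreachable) when the endpoint cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Even an open port only shows that something is listening. A fuller check would, under the same assumptions, exercise the application protocol itself (for example by issuing a test transaction).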

  Note on managing VoIP as a service

  Many organizations have experienced performance
  and availability problems with their VoIP solutions, in
  spite of the fact that there seems to be more than
  adequate bandwidth available. This results in dropped
  calls and poor sound quality. This is usually because
  of variations in bandwidth utilization during the call,
  which is often the result of utilization of the network
  by other users, applications or other web activity. This
  has led to the differentiation between measuring the
  bandwidth available to initiate a call (Service Access
  Bandwidth – or SAB) and the amount of bandwidth
  that must be continuously available during the call
  (Service Utilization Bandwidth – or SUB). Care should
  be taken in differentiating between these when
  designing, managing or measuring VoIP services.

5.6 STORAGE AND ARCHIVE

Many services require the storage of data for a specific
time and also for that data to be available off-line for a
certain period after it is no longer used. This is often due
to regulatory or legislative requirements, but also because
history and audit data are invaluable for a variety of
purposes, including marketing, product development,
forensic investigations, etc.

A separate team or department may be needed to
manage the organization’s data storage technology
such as:

■ Storage devices, such as disks, controllers, tapes, etc.
■ Network Attached Storage (NAS), which is storage
  attached to a network and accessible by several clients
■ Storage Area Networks (SANs) designed to attach
  computer storage devices such as disk array controllers
  and tape libraries. In addition to storage devices, a
  SAN will also require the management of several
  network components, such as hubs, cables, etc.
■ Direct Attached Storage (DAS), which is a storage
  device directly attached to a server
■ Content Addressable Storage (CAS), which is storage
  that retrieves information based on its content
  rather than its location. The focus in this type of
  system is on understanding the nature of the data and
  information stored, rather than on providing specific
  storage locations.

Regardless of what type of storage systems are being
used, Storage and Archiving will require the management
of the infrastructure components as well as the policies
related to where data is stored, for how long, in what form
and who may access it. Specific responsibilities will
include:

■ Definition of data storage policies and procedures
■ File storage naming conventions, hierarchy and
  placement decisions
■ Design, sizing, selection, procurement, configuration
  and operation of all data storage infrastructure
■ Maintenance and support for all utility and
  middleware data-storage software
■ Liaison with Information Lifecycle Management
  team(s) or Governance teams to ensure compliance
  with freedom of information, data protection and IT
  governance regulations
■ Involvement with definition and agreement of
  archiving policy
■ Housekeeping of all data storage facilities
■ Archiving data according to rules and schedules
  defined during Service Design. The Storage teams or
  departments will also provide input into the definition
  of these rules and will provide reports on their
  effectiveness as input into future design
■ Retrieval of archived data as needed (e.g. for audit
  purposes, for forensic evidence, or to meet any other
  business requirements)
■ Third-line support for storage- and archive-related
  incidents.

5.7 DATABASE ADMINISTRATION

Database Administration must work closely with key
Application Management teams or departments – and in
some organizations the functions may be combined or
linked under a single management structure.
Organizational options include:

■ Database administration being performed by each
  Application Management team for all the applications
  under its control
■ A dedicated department, which manages all databases,
  regardless of type or application
■ Several departments, each managing one type of
  database, regardless of what application they are
  part of.

Database Administration works to ensure the optimal
performance, security and functionality of databases that
they manage. Database Administrators typically have the
following responsibilities:

■ Creation and maintenance of database standards
  and policies
■ Initial database design, creation, testing
 ■ Management of the database availability and                   generally kept up to date, it is also a good source of data
     performance; resilience, sizing, capacity                   and verification for the CMS.
     volumetrics etc.
                                                                 Directory Services Management refers to the process that
 ■   Resilience may require database replication, which          is used to manage Directory Services. Its activities include:
     would be the responsibility of Database Administration
 ■   Ongoing administration of database objects: indexes,        ■ Working as part of Service Design and Service
     tables, views, constraints, sequences, snapshots and            Transition to ensure that new services are accessible
     stored procedures; page locks – to achieve                      and controlled when they are deployed
     optimum utilization                                         ■   Locating resources on a network (if these have not
 ■   The definition of triggers that will generate events,           already been defined during Service Design)
     which in turn will alert database administrators            ■   Tracking the status of those resources and providing
     of potential performance or integrity issues with               the ability to manage those resources remotely
     the database                                                ■   Managing the rights of specific users or groups of
 ■   Performing database housekeeping – the routine tasks            users to access resources on a network
     that ensure that the databases are functioning              ■   Defining and maintaining naming conventions to be
     optimally and securely, e.g. tuning, indexing, etc.             used for resources on a network
 ■   Monitoring of usage; transaction volumes, response          ■   Ensuring consistency of naming and access control on
     times, concurrency levels, etc.                                 different networks in the organization
 ■   Generating reports. These could be reports based on         ■   Linking different Directory Services throughout the
     the data in the database, or reports related to the             organization to form a distributed Directory Service,
     performance and integrity of the database                       i.e. users will only see one logical set of network
 ■   Identification, reporting and management of database            resources. This is called Distribution of Directory
     security issues; audit trails and forensics                     Services
 ■   Assistance in designing database backup, archiving          ■   Monitoring Events on the Directory Services, such as
     and storage strategy                                            unsuccessful attempts to access a resource, and taking
 ■   Assistance in designing database alerts and event               the appropriate action where required
     management                                                  ■   Maintaining and updating the tools used to manage
 ■   Provision of third-level support for all database-related       Directory Services.
     incidents.
                                                                 5.9 DESKTOP SUPPORT
 5.8 DIRECTORY SERVICES MANAGEMENT                               As most users access IT services using desktop or laptop
 A Directory Service is a specialized software application       computers, it is key that these are supported to ensure the
 that manages information about the resources available          agreed levels of availability and performance of services.
 on a network and which users have access to. It is the          Desktop Support will have overall responsibility for all of
 basis for providing access to those resources and for           the organization’s desktop and laptop computer hardware,
 ensuring that unauthorized access is detected and               software and peripherals. Specific responsibilities will
 prevented (see section 4.5 for detailed information on          include:
 Access Management).
                                                                 ■ Desktop policies and procedures, for example licensing
 Directory Services views each resource as an object of the        policies, use of laptops or desktops for personal
 Directory Server and assigns it a name. Each name is              purposes, USB lockdown, etc.
 linked to the resource’s network address, so that users         ■ Designing and agreeing standard desktop images
 don’t have to memorize confusing and complex addresses.         ■ Desktop service maintenance including deployment of
 Directory Services is based on the OSI’s X.500 standards          releases, upgrades, patches and hot-fixes (in
 and commonly uses protocols such as Directory Access              conjunction with Release Management (see Service
 Protocol (DAP) or Lightweight Directory Access Protocol           Transition publication for further details))
 (LDAP). LDAP is used to support user credentials for            ■ Design and implementation of desktop
 application login and often includes internal and external        archiving/rebuild policy (including policy relating to
 user/customer data which is especially good for extranet          cookies, favourites, templates, personal data, etc.)
 call logging. Since LDAP is a critical operational tool, and

■ Third-level support of desktop-related incidents, including
  desk-side visits where necessary
■ Support for connectivity issues (in conjunction with Network
  Management) to home-workers, mobile staff, etc.
■ Configuration control and audit of all desktop equipment (in
  conjunction with Configuration Management and IT Audit).

5.10 MIDDLEWARE MANAGEMENT
Middleware is software that connects or integrates software
components across distributed or disparate applications and systems.
Middleware enables the effective transfer of data between
applications, and is therefore key to services that are dependent on
multiple applications or data sources.

A variety of technologies are currently used to support
program-to-program communication, such as object request brokers,
message-oriented middleware, remote procedure calls and
point-to-point web services. Newer technologies are emerging all the
time, for example Enterprise Service Bus (ESB), which enables
programs, systems and services to communicate with each other
regardless of the architecture and origin of the applications. This
is especially being used in the context of deploying Service Oriented
Architectures (SOAs).

Middleware Management can be performed as part of an Application
Management function (where it is dedicated to a specific application)
or as part of a Technical Management function (where it is viewed as
an extension to the Operating System of a specific platform).

Functionality provided by middleware includes:

■ Providing transfer mechanisms for data from various applications or
  data sources
■ Sending work to another application or procedure for processing
■ Transmitting data or information to other systems, such as sourcing
  data for publication on websites (e.g. publishing Incident status
  information)
■ Releasing updated software modules across distributed environments
■ Collation and distribution of system messages and instructions, for
  example Events or operational scripts that need to be run on remote
  devices
■ Multicast setup with networks. Multicast is the delivery of
  information to a group of destinations simultaneously using the
  most efficient delivery route
■ Managing queue sizes.

Middleware Management is the set of activities that are used to
manage middleware. These include:

■ Working as part of Service Design and Transition to ensure that the
  appropriate middleware solutions are chosen and that they can
  perform optimally when they are deployed
■ Ensuring the correct operation of middleware through monitoring and
  control
■ Detecting and resolving Incidents related to middleware
■ Maintaining and updating middleware, including licensing, and
  installing new versions
■ Defining and maintaining information about how applications are
  linked through Middleware. This should be part of the CMS (see
  Service Transition publication).

5.11 INTERNET/WEB MANAGEMENT
Many organizations conduct much of their business through the
Internet and are therefore heavily dependent upon the availability
and performance of their websites. In such cases a separate
Internet/Web Support team or department will be desirable and
justified.

The responsibilities of such a team or department incorporate both
Intranet and Internet and are likely to include:

■ Defining architectures for Internet and web services
■ The specification of standards for development and management of
  web-based applications, content, websites and web pages. This will
  typically be done during Service Design
■ Design, testing, implementation and maintenance of websites. This
  will include the architecture of websites and the mapping of
  content to be made available
■ In many organizations, web management will include the editing of
  content to be posted onto the web
■ Maintenance of all web development and management applications
■ Liaison and advice to web-content teams within the business.
  Content may reside in applications or storage devices, which
  implies close liaison with Application Management and other
  Technical Management teams
■ Liaison with and supplier management of ISPs, hosts, third-party
  monitoring or virtualization organizations etc. In many
  organizations the ISPs are managed as part of Network Management
■ Third-level support for Internet-/web-related incidents



■ Support for interfaces with back-end and legacy systems. This will
  often mean working with members of the Application Development and
  Management teams to ensure secure access and consistency of
  functionality
■ Monitoring and management of website performance, including:
  heartbeat testing, user experience simulation, benchmarking,
  on-demand load balancing, virtualization
■ Website availability, resilience and security. This will form part
  of the overall Information Security Management of the organization.

5.12 FACILITIES AND DATA CENTRE MANAGEMENT
Facilities Management refers to the management of the physical
environment of IT Operations, usually located in Data Centres or
computer rooms. This is a vast and complex area and this publication
will provide an overview of its key role and activities. A more
detailed overview is contained in Appendix E.

In many respects Facilities Management could be viewed as a function
in its own right. However, because this publication is focused on
where IT Operations are housed, it will cover Facilities Management
specifically as it relates to the management of Data Centres and as a
subset of the IT Operations Management function.

The main components of Facilities Management are as follows:

■ Building Management, which refers to the maintenance and upkeep of
  the buildings that house the IT staff and Data Centre. Typical
  activities include cleaning, waste disposal, parking management and
  access control
■ Equipment Hosting, which ensures that all special requirements are
  provided for the physical housing of equipment and the teams that
  support them
■ Power Management, which refers to managing the sourcing and
  utilization of power sources that are used to keep the facility
  functional. This definition of Power Management has a number of
  implications, which are discussed in Appendix E. Note that
  information about power utilization is important for planning the
  capacity of both new services and new buildings
■ Environmental Conditioning and Alert Systems, which include the
  specification, maintenance and monitoring of systems such as smoke
  detection and fire suppression, water, heating and cooling systems,
  etc.
■ Safety is concerned with compliance with all legislation, standards
  and policies relating to the safety of employees
■ Physical Access Control refers to ensuring that the facility is
  only accessed by authorized personnel and that any unauthorized
  access is detected and managed. This is discussed in more detail in
  Appendix F
■ Shipping and Receiving refers to the management of all equipment,
  furniture, mail, etc. that leaves or enters the building. It
  ensures that only appropriate items are entering or leaving the
  building and that they are routed to the correct party
■ Involvement in Contract Management of the various suppliers and
  service providers involved in the facility
■ Maintenance refers to regular, scheduled upkeep of the facility, as
  well as the detection and resolution of problems with the facility.

Important note regarding Data Centres
Data Centres are generally specialized facilities and, while they use
and benefit from generic Facilities Management disciplines, they need
to adapt these. For example, layout, heating and conditioning, power
planning and many other aspects are all managed uniquely in Data
Centres.

This means that, although Data Centres may be facilities owned by an
organization, they are better managed under the authority of IT
Operations, although there may be a functional reporting line between
IT and the department that manages other facilities for the
organization.

5.12.1 Data Centre strategies
Managing a Data Centre is far more than hosting an open space where
technical groups install and manage equipment, using their own
approaches and procedures. It requires an integrated set of processes
and procedures involving all IT groups at every stage of the ITSM
Lifecycle. Data Centre operations are governed by strategic and
design decisions for management and control and are executed by
operators. This requires a number of key factors to be put in place:

■ Data Centre Automation. Specialized automation systems that reduce
  the need for manual operators and which monitor and track the
  status of the facility and all IT operations at all times

■ Policy-based management, where the rules of automation and resource
  allocation are managed by policy, rather than having to go through
  complex change procedures every time processing is moved from one
  resource to another
■ Real-time services 24 hours a day, 7 days a week
■ Standardization of equipment. This provides greater ease of
  management, more consistent levels of performance and a means of
  providing multiple services across similar technology.
  Standardization also reduces the variety of technical expertise
  required to manage equipment in the Data Centre and to provide
  services
■ SOAs, where service components can be reused, interchanged and
  replaced very quickly and with no impact on the business. This will
  make it possible for the Data Centre to be highly responsive in
  meeting changing business demands without having to go through
  lengthy and involved re-engineering and re-architecting
■ Virtualization. This means that IT Services are delivered using an
  ever-changing set of equipment, geared to meet current demand. For
  example, an application may run on a dedicated device together with
  its database during high-demand times, but be shifted to a shared
  device with its database on a remote device during non-peak times –
  all automated and automatic. This will mean even greater savings of
  costs as any equipment can be used at any time, without any human
  intervention, except to perform maintenance and replace failed
  equipment. The IT Infrastructure is more resilient since any
  component is backed up by any number of similar components, any of
  which could take over a failed component’s workload automatically.
  Remote monitoring, control and management equipment and systems
  will be essential to manage a virtualized environment, as many
  services will not be linked to any one specific piece of equipment.
■ Unified management systems have become more important as services
  run across multiple locations and technologies. Today it is
  important to define what actions need to be taken and what systems
  will perform that action. This means investing in solutions that
  will allow Infrastructure managers to simply specify what outcome
  is required, and allowing the management system to calculate the
  best combination of tools and actions to achieve the outcome.

5.13 INFORMATION SECURITY MANAGEMENT AND SERVICE OPERATION
Information Security Management as a process is covered in the ITIL
Service Design publication. Information Security Management has
overall responsibility for setting policies, standards and procedures
to ensure the protection of the organization’s assets, data,
information and IT services. Service Operation teams play a role in
executing these policies, standards and procedures and will work
closely with the teams or departments responsible for Information
Security Management.

Service Operation teams cannot take ownership of Information Security
Management, as this would represent a conflict. There needs to be
segregation of roles between the groups defining and managing the
process and the groups executing specific activities as part of
ongoing operation. This will help protect against breaches to
security measures, as no single individual should have control over
two or more phases of a transaction or operation. Information
Security Management should assign responsibilities to ensure a
cross-check of duties.

The role of Service Operation teams is outlined next.

5.13.1 Policing and reporting
This will involve Operation staff performing specific policing
activities such as the checking of system journals, logs,
event/monitoring alerts, etc., intrusion detection and/or reporting
of actual or potential security breaches. This is done in conjunction
with Information Security Management to provide a check and balance
system to ensure effective detection and management of security
issues.

Service Operation staff are often first to detect security events and
are in the best position to be able to shut down and/or remove access
to compromised systems. Particular attention will be needed in the
case of third-party organizations that require physical access into
the organization. Service Operation staff may be required to escort
visitors into sensitive areas and/or control their access.

They may also have a role to play in controlling network access to
third parties, such as hardware maintainers dialling in for
diagnostic purposes, etc.

5.13.2 Technical assistance
Some technical support may need to be provided to IT Security staff
to assist in investigating security incidents and assist in
production of reports or in



gathering forensic evidence for use in disciplinary action or
criminal prosecutions.

Technical advice and assistance may also be needed regarding
potential security improvements (e.g. setting up appropriate
firewalls or access/password controls).

The use of event, incident, problem and configuration management
information can be relied on to provide accurate chronologies of
security-related investigations.

5.13.3 Operational security control
For operational reasons, technical staff will often need to have
privileged access to key technical areas (e.g. root system passwords,
physical access to Data Centres or communications rooms, etc.). It is
therefore essential that adequate controls and audit trails are kept
of all such privileged activities so as to deter and detect any
security events.

Physical controls need to be in place for all secure areas, with
logging in and out of all staff. Where third-party staff or visitors
need access, it may be Service Operation staff that are responsible
for escorting and managing the movement of such personnel.

In the case of privileged systems access, this needs to be restricted
to only those people whose need to access the system has been
verified – and withdrawn immediately when that need no longer exists.
An audit trail must be maintained of who has had access and when, and
of all activities performed using those access levels.

5.13.4 Screening and vetting
All Service Operation staff should be screened and vetted to a
security level appropriate to the organization in question.

Suppliers and third-party contractors should also be screened and
vetted – both the organizations and the specific personnel involved.
Many organizations have started using police or government agency
background checks, especially where contractors will be working with
classified systems. Where necessary, appropriate non-disclosure and
confidentiality agreements must be agreed.

5.13.5 Training and awareness
All Service Operation staff should be given regular and ongoing
training and awareness of the organization’s security policy and
procedures. This should include details of disciplinary measures in
place. In addition, any security requirements should be specified in
the employee’s contract of employment.

5.13.6 Documented policies and procedures
Service Operation documented procedures must include all relevant
information relating to security issues – extracted from the
organization’s overall security policy documents. Consideration
should be given to the use of handbooks to assist in getting the
security messages out to all relevant staff.

5.14 IMPROVEMENT OF OPERATIONAL ACTIVITIES
All Service Operation staff should be constantly looking for areas in
which process improvements can be made to give higher IT service
quality and/or more cost-effective operation. This might include some
of the following activities.

5.14.1 Automation of manual tasks
Any tasks which have to be carried out manually, particularly those
that have to be regularly repeated, are likely to be more time
consuming, costly and error prone than those that can be systemized
and automated. All tasks should be examined for potential automation
to reduce effort and costs and to minimize potential errors. A
judgement must be made on the costs of the automation and the likely
benefits that will occur.

5.14.2 Reviewing makeshift activities or procedures
Because of the pragmatic nature of Service Operation, it may
sometimes arise that makeshift activities or processes are introduced
to address short-term operational expediencies. There is a danger
that such practices can be continued and become the ‘norm’ – leading
to ongoing inefficiencies. Where any makeshift activities or
procedures do have to be introduced it is important that these are
reviewed as soon as the immediate expediency is overcome – and either
dispensed with or replaced with efficient agreed processes for the
longer term.

5.14.3 Operational Audits
Regular audits should be conducted of all Service Operation processes
to ensure that they are working satisfactorily.

5.14.4 Using Incident and Problem Management
Problem and Incident Management provide a rich source of operational
improvement opportunities. These

processes are discussed in detail in Chapter 4 of this
publication.
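
As a rough illustration of how incident records might be mined for
improvement opportunities, the sketch below counts incidents per
category and Configuration Item and lists the groups that recur. The
record fields (id, category, ci), the sample data and the threshold
are assumptions made for this example only; they are not defined by
this publication or taken from any particular Service Management tool.

```python
from collections import Counter

# Hypothetical incident records; field names and values are
# illustrative assumptions, not a real tool's schema.
incidents = [
    {"id": "INC001", "category": "database", "ci": "db-server-01"},
    {"id": "INC002", "category": "network", "ci": "core-switch-02"},
    {"id": "INC003", "category": "database", "ci": "db-server-01"},
    {"id": "INC004", "category": "desktop", "ci": "laptop-117"},
    {"id": "INC005", "category": "database", "ci": "db-server-01"},
    {"id": "INC006", "category": "network", "ci": "core-switch-02"},
]


def recurring_incident_groups(records, threshold=3):
    """Return ((category, ci), count) pairs seen at least `threshold`
    times, most frequent first."""
    counts = Counter((r["category"], r["ci"]) for r in records)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    for (category, ci), n in recurring_incident_groups(incidents):
        print(f"{n} incidents: category={category}, CI={ci}")
```

Groups that exceed the threshold could then be reviewed as candidates
for a Problem record or for an operational improvement; the threshold
itself is a local judgement.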

5.14.5 Communication
It should go without saying that good communication
about changing requirements, technology and processes
will result in improvement in Service Operation. However,
communication is often neglected. Service Operation
improvement is dependent on formal and regular
communication between teams responsible for design,
support and operation of services.

5.14.6 Education and training
Service Operation teams should understand the
importance of what they do on a daily basis. Education is
required to ensure that staff understand what business
functions or services are supported by their activities. This
will encourage greater care and attention to detail and will
also help Service Operation teams to better identify
business priorities.
Training programmes should ensure that all staff have the
appropriate skills for the technology or applications that
they are managing. Training should always be provided
when new technology is introduced, or when existing
technology is changed.
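
The rule that training should follow the introduction or change of
technology can be checked mechanically. The following is a minimal
sketch, assuming that an organization records which technologies each
person manages and the version they were last trained on; all names,
versions and data structures here are hypothetical.

```python
# Hypothetical data: version of each technology currently in operation.
current_versions = {"mail-server": "2.1", "backup-tool": "5.0"}

# Hypothetical data: technologies each staff member manages.
assignments = {
    "alice": ["mail-server", "backup-tool"],
    "bob": ["backup-tool"],
}

# Hypothetical data: version each person was last trained on.
training_records = {
    ("alice", "mail-server"): "2.1",
    ("alice", "backup-tool"): "4.0",
}


def training_gaps(assignments, current_versions, training_records):
    """List (person, technology, trained_version, current_version) for
    every assignment where training is missing or out of date."""
    gaps = []
    for person in sorted(assignments):
        for tech in assignments[person]:
            trained = training_records.get((person, tech))  # None if never trained
            if trained != current_versions[tech]:
                gaps.append((person, tech, trained, current_versions[tech]))
    return gaps


if __name__ == "__main__":
    for person, tech, trained, current in training_gaps(
        assignments, current_versions, training_records
    ):
        print(f"{person} needs training on {tech} {current} (last trained: {trained})")
```

A check of this kind only flags where training records lag behind the
technology in use; deciding what training is appropriate remains a
management judgement.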


6 Organizing for Service Operation

6.1 FUNCTIONS
A function is a logical concept that refers to the people and
automated measures that execute a defined process, an activity or a
combination of processes or activities. In larger organizations a
function may be broken up and performed by several departments, teams
and groups, or it may be embodied within a single organizational
unit.

The Service Operation functions given in Figure 6.1 are needed to
manage the ‘steady state’ operational IT environment. These are
logical functions and do not necessarily have to be performed by an
equivalent organizational structure. This means that Technical and
Application Management can be organized in any combination and into
any number of departments. The second-level groupings in Figure 6.1
are examples of typical groups of activities performed by Technical
Management (see Chapter 5) and are not a suggested organization
structure.

Figure 6.1 Service Operation functions
[Figure: chart of the Service Operation functions and their example
groupings.
 Service Desk
 Technical Management: Mainframe, Server, Network, Storage, Database,
   Directory Services, Desktop, Middleware, Internet/Web
 IT Operations Management:
   IT Operations Control: Console Management, Job Scheduling, Backup
     and Restore, Print and Output
   Facilities Management: Data Centres, Recovery Sites, Consolidation,
     Contracts
 Application Management: Financial Apps, HR Apps, Business Apps]



 The following is an overview of the Service Operation               routine operational tasks are carried out. IT
 functions in Figure 6.1:                                            Operations Control will also provide centralized
                                                                     monitoring and control activities, usually using an
 ■ The Service Desk is the primary point of contact for
                                                                     Operations Bridge or Network Operations Centre.
   users when there is a service disruption, for service
                                                                 ● Facilities Management refers to the management
   requests or even for some categories of Request for
   Change. The Service Desk provides a point of                      of the physical IT environment, usually Data
   communication to the users and a point of                         Centres or computer rooms. In many organizations
   coordination for several IT groups and processes. To              Technical and Application Management are co-
   enable them to perform these actions effectively the              located with IT Operations in large Data Centres. In
   Service Desk is usually separate from the other Service           some organizations many physical components of
   Operation functions. In some cases, e.g. where                    the IT Infrastructure have been outsourced and
   detailed technical support is offered to users on the             Facilities Management may include the
   first call, it may be necessary for Technical or                  management of the outsourcing contracts.
   Application Management staff to be on the Service           ■ Application Management is responsible for
   Desk. This does not mean that the Service Desk                managing applications throughout their lifecycle. The
   becomes part of the Technical Management function.            Application Management function supports and
   In fact, while they are on the Service Desk, they cease       maintains operational applications and also plays an
   to be a part of the Technical Management or                   important role in the design, testing and improvement
   Application Management functions and become part              of applications that form part of IT services.
   of the Service Desk, even if only temporarily.                Application Management is usually divided into
 ■ Technical Management provides detailed technical              departments based on the application portfolio of the
   skills and resources needed to support the ongoing            organization (see the examples in Figure 6.1), thus
   operation of the IT Infrastructure. Technical                 allowing easier specialization and more focused
   Management also plays an important role in the                support. In many organizations Application
   design, testing, release and improvement of IT                Management departments have staff who perform
   services. In small organizations, it is possible to           daily operations for those applications. As with
   manage this expertise in a single department, but             Technical Management, these staff logically form part
   larger organizations are typically split into a number of     of the IT Operations Management function.
   technically specialized departments (see later in this
   chapter). In many organizations, the Technical                Special note on Information Security
   Management departments are also responsible for the           Management
   daily operation of a subset of the IT Infrastructure.         Although most would agree that Information Security
   Figure 6.1 shows that, although they are part of a            Management is a function, it is highly specialized and
   Technical Management department, staff who perform            spans several phases of the lifecycle. It is also
   these activities are logically part of the IT Operations      responsible for the oversight of many activities within
   Management function.                                          all Service Operation functions. For a more in-depth
                                                                 description of Information Security Management,
 ■ IT Operations Management is the function
                                                                 please refer to the Service Design publication and to
   responsible for the daily operational activities needed       section 5.13 of this publication.
   to manage the IT Infrastructure. This is done according
   to the Performance Standards defined during Service
   Design. In some organizations this is a single,             6.1.1 Functions and activities
   centralized department, while in others some activities     Chapter 5 of this publication introduced a number of
   and staff are centralized and some are provided by          common Service Operation activities. Due to the technical
   distributed or specialized departments. This is             nature and specialization of these activities, the teams,
   illustrated in Figure 6.1 by the overlapping from the       groups or departments that perform them are often given
   Technical and Application Management functions. IT          names that correspond to the particular activities. For
   Operations Management has two functions that are            example, Network Management could be performed by a
   unique and which are generally formal organizational        ‘Network Management Department’. This, however, is by
   structures. These are:                                      no means a rule. There are a number of options available
   ● IT Operations Control, which is generally staffed         in mapping activities to a team or department, for
        by shifts of operators and which ensures that          example:

■ One activity could be performed by several teams or        organizations will tend to combine these activities into
  departments, e.g. if an organization has five major        single departments, or even individuals – if they are even
  Application Support departments, each supporting           needed at all.
  a different set of applications, each of these
  departments could perform Database Administration            Special note on outsourcing
  for ‘its’ applications
                                                               These organizational considerations are likely to be
■ One department could perform several activities, e.g.        most relevant to internal IT organizations. The
  the Network Management Department could be                   situation becomes even more complex when some or
  responsible for managing the network, Directory              all of a particular activity or function is outsourced.
  Services Management and Server Management                    Prime opportunities for outsourcing have been the
■ An activity could be performed by groups, e.g.               Service Desk and Network Operations. This will be
  Security Administration can be performed by any              covered in more detail in ITIL Complementary
  person with responsibility for managing an application,      Guidance, but some of the key points to remember
                                                               are:
  server, middleware or desktop.
                                                               ■ Regardless of who is performing the activity, the
These organizational decisions are influenced by a number
                                                                  company contracting the outsourcer is still
of factors, such as:                                              responsible for ensuring that it is performed to a
■ The size and location of the organization. Smaller, less        standard that will support the delivery of services
    distributed organizations will tend to combine these          to their customers and users.
    functions, whereas large, decentralized organizations      ■ Outsourcing to solve an organization’s problems
    may have several teams or departments performing              or as an alternative to good Service Management
    the same activity (e.g. per region).                          processes rarely works. The best results are
■   The complexity of technology used in the                      obtained if these are in place before outsourcing.
    organization. The higher the number of different           ■ Outsourcing works best when there is active
    technologies used, the more likely there are to be            involvement by both organizations. If the staff and
    several different teams, each doing something similar,        managers of the customer organization
    but in a different context (e.g. UNIX Server                  disengage, the outsourcer is unlikely to be
    Management and Windows Server Management).                    successful, simply because nobody understands
■   The availability of skills. Where technical skills are        the organization better than the people who work
                                                                  there.
    scarce, it is common for organizations to use
    generalists to perform multiple groups of activities –     ■ The outsourcer should not determine their
    although, in some cases, security considerations make         outputs or how they are measured. These are
    this very difficult. For example, an organization             determined by understanding the business
    working on classified or secret projects may have to          requirements of users and customers and ensuring
                                                                  that they can be met by the outsourcer’s
    hire expensive, specialized resources even when that
                                                                  capabilities.
    means relocating them or contracting through
    security-cleared vendors.                                  ■ Although the outsourcer’s services become an
■   The culture of the organization. Some organizations           integral part of the organization, they are still a
                                                                  third-party organization, with a different set of
    prefer to work in highly specialized environments,
                                                                  business objectives, policies and practices. Security
    while others tend to prefer the flexibility of
                                                                  standards must be upheld and both parties must
    generalist staff.                                             clearly understand their respective roles and
■   The financial situation of the organization will              contributions.
    determine how many people, with what type of skill,
    can be employed and how they will be organized.
As a result of these factors, it is impossible for this      6.2 SERVICE DESK
publication to prescribe an appropriate organizational
structure that will fit every situation; however, the        A Service Desk is a functional unit made up of a dedicated
following sections list the required activities under the    number of staff responsible for dealing with a variety of
functional groups most likely to be involved in their        service events, often made via telephone calls, web
operation. Please note that this does not mean that all      interface, or automatically reported infrastructure events.
organizations have to use these divisions. Smaller



 The Service Desk is a vitally important part of an              ■ A reduced negative business impact
 organization’s IT Department and should be the single           ■ Better-managed infrastructure and control
 point of contact for IT users on a day-by-day basis – and       ■ Improved usage of IT Support resources and increased
 will handle all incidents and service requests, usually           productivity of business personnel
 using specialist software tools to log and manage all           ■ More meaningful management information for
 such events.                                                      decision support
 The value of an effective Service Desk should not be            ■ It is common practice that the Service Desk provides
 underrated – a good Service Desk can often compensate             ‘entry-level’ positions for ITSM staff. Working on the
 for deficiencies elsewhere in the IT organization, but a          Service Desk is an excellent ‘grounding’ for anyone
 poor Service Desk (or the lack of a Service Desk) can give        who wishes to pursue a career in Service
 a poor impression of an otherwise very effective IT               Management. However, this could also present
 organization!                                                     challenges with people who do not understand the
                                                                   business or technology. Users calling the Service Desk
 It is therefore very important that the correct calibre of
                                                                   should be able to speak to someone who is able to
 staff is used on the Service Desk and that IT Managers do
                                                                   address their needs, and Service Desk Analysts should
 their best to make the desk an attractive place to work to
                                                                   not be burned out in less than a year because of
 improve staff retention.
                                                                   undue stress. Care should be taken to select
 The exact nature, type, size and location of a Service Desk       appropriately skilled individuals with a good
 will vary, depending upon the type of business, number of         understanding of the business and to provide
 users, geography, complexity of calls, scope of services          adequate training – thus preventing reduction in levels
 and many other factors.                                           of support due to a lack of knowledge at the first line.
 In alignment with customer and business requirements, the
 IT organization’s senior managers should decide the exact       6.2.2 Service Desk objectives
 nature of its required Service Desk (and whether it should      The primary aim of the Service Desk is to restore the
 be internal or outsourced to a third party) as part of its      ‘normal service’ to the users as quickly as possible. In this
 overall ITSM strategy (see Service Strategy publication) –      context ‘restoration of service’ is meant in the widest
 and then subsequent planning must be done to prepare            possible sense. While this could involve fixing a technical
 for and then implement the appropriate Service Desk             fault, it could equally involve fulfilling a service request or
 function (either when implementing a new function, or           answering a query – anything that is needed to allow the
 more likely these days when making necessary                    users to return to working satisfactorily.
 amendments to an existing function – see Service Design         Specific responsibilities will include:
 and Service Transition publications).
                                                                 ■ Logging all relevant incident/service request details,
 6.2.1 Justification and role of the Service                         allocating categorization and prioritization codes
 Desk                                                            ■ Providing first-line investigation and diagnosis
                                                                 ■ Resolving those incidents/service requests they
 Very little justification is needed today for a Service Desk,
 as many organizations have become convinced that this is            are able
 by far the best approach for dealing with first-line IT         ■   Escalating incidents/service requests that they cannot
 support issues. One need only ask the question ‘What is             resolve within agreed timescales
 the alternative?’ to make a compelling case for the Service     ■   Keeping users informed of progress
 Desk concept. Where further justification is needed, the        ■   Closing all resolved incidents, requests and other calls
 following benefits should be considered:                        ■   Conducting customer/user satisfaction call-
 ■ Improved customer service, perception and satisfaction
                                                                     backs/surveys as agreed
                                                                 ■   Communication with users – keeping them informed
 ■ Increased accessibility through a single point of
   contact, communication and information                            of incident progress, notifying them of impending
                                                                     changes or agreed outages, etc.
 ■ Better-quality and faster turnaround of customer or
                                                                 ■   Updating the CMS under the direction and approval of
   user requests
                                                                     Configuration Management if so agreed.
 ■ Improved teamwork and communication
 ■ Enhanced focus and a proactive approach to service
   provision
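The logging and escalation responsibilities listed in section 6.2.2 can be illustrated with a minimal sketch. It assumes a simple incident record carrying a categorization code and a prioritization code, and a table of agreed first-line timescales per priority. The names (`Incident`, `FIRST_LINE_TIME_LIMIT`, `needs_escalation`) and the timescale values are hypothetical and are not taken from this publication; real values would come from the agreed service level targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical agreed timescales for first-line resolution, keyed by priority code.
# These numbers are illustrative assumptions only.
FIRST_LINE_TIME_LIMIT = {
    1: timedelta(minutes=15),
    2: timedelta(hours=1),
    3: timedelta(hours=4),
}


@dataclass
class Incident:
    """Details logged by the Service Desk when an incident or request is raised."""
    reference: str
    category: str       # categorization code, e.g. "hardware"
    priority: int       # prioritization code, 1 = most urgent
    logged_at: datetime
    resolved: bool = False


def needs_escalation(incident: Incident, now: datetime) -> bool:
    """Return True if an unresolved incident has exceeded its agreed first-line timescale."""
    if incident.resolved:
        return False
    return now - incident.logged_at >= FIRST_LINE_TIME_LIMIT[incident.priority]
```

A desk tool built along these lines would evaluate `needs_escalation` periodically for every open record and hand the matching ones to the next support line, while still keeping the user informed of progress.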

Note: these activities are explained and set in context with    ■ Specialized groups of users
the fuller Incident Management and Request Fulfilment           ■ The existence of customized or specialized services
process in sections 4.2 and 4.3 respectively.                          that require specialist knowledge
                                                                ■ VIP/criticality status of users.
6.2.3 Service Desk organizational structure
There are many ways of structuring Service Desks and            6.2.3.2 Centralized Service Desk
locating them – and the correct solution will vary for          It is possible to reduce the number of Service Desks by
different organizations. The primary options are detailed       merging them into a single location (or into a smaller
below, but in reality an organization may need to               number of locations) by drawing the staff into one or
implement a structure that combines a number of these           more centralized Service Desk structures. This can be more
options in order to fully meet the business needs:              efficient and cost-effective, allowing fewer overall staff to
                                                                deal with a higher volume of calls, and can also lead to
6.2.3.1 Local Service Desk                                      higher skill levels through greater familiarization through
This is where a desk is co-located within or physically         more frequent occurrence of events. It might still be
close to the user community it serves. This often aids          necessary to maintain some form of ‘local presence’ to
communication and gives a clearly visible presence, which       handle physical support requirements, but such staff can
some users like, but can often be inefficient and expensive     be controlled and deployed from the central desk.
to resource as staff are tied up waiting to deal with
incidents when the volume and arrival rate of calls may         6.2.3.3 Virtual Service Desk
not justify this.                                               Through the use of technology, particularly the Internet,
There may, however, be some valid reasons for                   and the use of corporate support tools, it is possible to
maintaining a local desk, even where call volumes alone         give the impression of a single, centralized Service Desk
do not justify this. Reasons might include:                     when in fact the personnel may be spread or located in
                                                                any number or type of geographical or structural locations.
■ Language and cultural or political differences
                                                                This brings in the option of ‘home working’, secondary
■ Different time zones                                          support group, off-shoring or outsourcing – or any




[Diagram: four User boxes above a Service Desk box, with Technical Management, Application Management, IT Operations Management, 3rd Party Support and Request Fulfilment below]

Figure 6.2 Local Service Desk




[Diagram: Customer Site 1, Customer Site 2 and Customer Site 3 above a Service Desk box and a Second Line Support box, with Technical Management, Application Management, IT Operations Management, 3rd Party Support and Request Fulfilment below]

 Figure 6.3 Centralized Service Desk


[Diagram: a central Virtual Service Desk surrounded by the San Francisco, Paris, Rio de Janeiro, Sydney, Beijing and London Service Desks and a Service Knowledge Management System]

 Figure 6.4 Virtual Service Desk

combination necessary to meet user demand. It is                 ■ A quiet environment with adequate acoustic control
important to note, however, that safeguards are needed in          so that one telephone conversation is not disrupted
all of these circumstances to ensure consistency and               by another
uniformity in service quality and cultural terms.                ■ Pleasant surroundings and comfortable furniture so as
                                                                   to lighten the mood (the Service Desk can be a very
6.2.3.4 Follow the Sun                                             stressful place to work, so every little helps!)
Some global or international organizations may wish to           ■ A separate rest-room and refreshment area nearby so
combine two or more of their geographically dispersed              that staff can take short breaks as appropriate when
Service Desks to provide a 24-hour follow-the-sun service.         necessary without being away for too long.
For example, a Service Desk in Asia-Pacific may handle
calls during its standard office hours and at the end of this        Anecdote
period it may hand over responsibility for any open                  One company found that there was a ‘them and us’
incidents to a European-based desk. That desk will handle            culture existing between the Service Desk and the
these calls alongside its own incidents during its standard          other support teams. The third-line teams often
day and then hand over to a USA-based desk – which                   believed themselves to be better than the Service
finally hands back responsibility to the Asia-Pacific desk to        Desk. Hiding the Service Desk away in an isolated
complete the cycle.                                                  room helped to reinforce this culture. The company
                                                                     found that creating an open-plan office with the
This can give 24-hour coverage at relatively low cost, as
                                                                     Service Desk in the middle encouraged closer
no desk has to work more than a single shift. However,               working and helped to break down these barriers.
the same safeguards of common processes, tools, shared
database of information and culture must be addressed for
this approach to proceed – and well-controlled escalation        6.2.3.7 Building a single point of contact
and handover processes are needed.
                                                                 Regardless of the combination of options chosen to fulfil
                                                                 an organization’s overall Service Desk structure, individual
6.2.3.5 Specialized Service Desk groups
                                                                 users should be in no doubt about who to contact if they
For some organizations it might be beneficial to create          need assistance. A single telephone number (or a single
‘specialist groups’ within the overall Service Desk structure,   number for each group if separate desks are chosen)
so that incidents relating to a particular IT service can be     should be provided and well publicized – as well as a
routed directly (normally via telephony selection or a web-      single e-mail address and a single web Service Desk
based interface) to the specialist group. This can allow         contact page.
faster resolution of these incidents, through greater
familiarity and specialist training.                             Ideas that can be successfully used to help publicize the
                                                                 Service Desk telephone number and e-mail address, and
The selection would be made using a script along the             to make them available close to hand when users are likely to
lines of ‘If your call is about the X Service, please press 1    need them, are:
now, otherwise please hold for a Service Desk analyst’.
                                                                 ■ Including the Service Desk telephone number on
Care is needed not to overcomplicate the selection, so                hardware CI labels, attached to the components the
specialist groups should only be considered for a very                user is likely to be calling about
small number of key services where these exist, and              ■    Printing Service Desk contact details on telephones
where call rates about that service justify a separate
                                                                 ■    For PCs and laptops, using a customized background
specialist group.
                                                                      or desktop with the Service Desk contact details,
                                                                      together with information read from the system that
6.2.3.6 Environment                                                   will be needed when calling (such as IP address,
The environment where the Service Desk is to be located               OS build number, etc.) in one corner
should be carefully chosen. Where possible, the following        ■    Printing the Service Desk number on ‘freebies’ (pens,
facilities should be provided:                                        pencils, mugs, mouse-mats, etc.)
■ A location where the entire function can be positioned         ■    Prominently placing these details on Service Desk
   with sufficient natural light and overall space – to               Internet/intranet sites
   allow adequate desk and storage-space, and room to
   move around if necessary



 ■ Including them on any calling cards or satisfaction               ● Number of customers and users speaking a
   survey cards left with users when a desk visit has                    different language
   been necessary                                                    ● Skill level
 ■ Repeating the details on all correspondence sent to           ■   Incident and Service Request types (and types of RFC
   the users (together with call reference numbers)                  if appropriate):
 ■ Placing the details on notice boards or physical                  ● Duration of time required for call types (e.g. simple
   locations that users are likely to regularly visit                    queries, specialist application queries, hardware,
   (entrances, canteens, refreshment areas, etc.).                       etc.)
                                                                     ● Local or external expertise required
 6.2.4 Service Desk staffing                                         ● The volume and types of incidents and Service
 The issues involved in, and criteria for, establishing the              Requests
 appropriate staffing model and levels are discussed in this     ■   The period of support cover required, based on:
 section. Details about typical Service Desk roles and               ● Hours covered
 responsibilities can be found in paragraph 6.6.1 below.
                                                                     ● Out-of-hours support requirements
 They include the Service Desk Manager, Supervisor and
                                                                     ● Time zones to be covered
 Analysts; in some organizations, these roles are
                                                                     ● Locations to be supported (particularly if Service
 complemented by business users (‘Super Users’) who
 provide first-line support.                                             Desk staff also conduct desk-side support)
                                                                     ● Travel time between locations
 6.2.4.1 Staffing levels                                             ● Workload pattern of requests (e.g. daily, month
                                                                         end, etc.)
 An organization must ensure that the correct number of
 staff are available at any given time to match the demand           ● The service level targets in place (response levels
 being placed upon the desk by the business. Call rates can              etc.)
 be very volatile and often in the same day the arrival rate     ■   The type of response required:
 may go from very high to very low and back again. An                ● Telephone
 organization planning a new desk should attempt to                  ● E-mail/fax/voicemail/video
 predict the call arrival rate and profile – and to staff            ● Physical attendance
 accordingly. Statistical analysis of call arrival rates under       ● Online access/control
 current support arrangements must be undertaken and             ■   The level of training required
 then closely monitored and adjusted as necessary.
                                                                 ■   The support technologies available (e.g. phone
 Many organizations will find that call rates peak during the        systems, remote support tools, etc.)
 start of the office day and then fall off quickly, perhaps      ■   The existing skill levels of staff
 with another burst in the early part of the afternoon – this    ■   The processes and procedures in use.
 obviously varies depending upon the organization’s
 business but is a frequently occurring pattern for many         All these items should be carefully considered before
 organizations. In such circumstances it may be possible to      making any decision on staffing levels. This should also be
 utilize part-time staff, home-workers, second-line support      reflected in the levels of documentation required.
 staff or third parties to cover the peaks.                      Remember that the better the service, the more the
                                                                 business will use it.
 The following factors should be considered when deciding
 staffing levels:                                                A number of tools are available to help determine the
                                                                 appropriate number of staff for the Service Desk. These
 ■ Customer service expectations                                 workload modelling tools are dependent on detailed ‘local
 ■ Business requirements, such as budget, call response          knowledge’ of the organization such as call volumes and
   times, etc.                                                   patterns, service and user profiles, etc.
 ■ Size, relative age, design and complexity of the IT
   Infrastructure and Service Catalogue – for example, the       6.2.4.2 Skill levels
   number and type of incidents, the extent of                   An organization must decide on the level and range of
   customised versus standard off-the-shelf software             skills it requires of its Service Desk staff – and then ensure
   deployed, etc.                                                that these skills are available at the appropriate times.
 ■ The number of customers and users to support, and
   associated factors such as:
                                                                                     Organizing for Service Operation |       115
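
The staffing considerations above can be turned into a rough first estimate. The sketch below is a simple workload calculation, not one of the workload modelling tools the text mentions: it assumes a hypothetical hourly call arrival profile, an assumed average handling time and an assumed target occupancy (the share of each hour staff spend on calls), and converts them into a head count per hour. The function name, parameters and all figures are illustrative.

```python
import math


def staff_needed(calls_per_hour, avg_handle_minutes, target_occupancy=0.8):
    """Rough head count for one hour (illustrative workload formula)."""
    if not 0 < target_occupancy <= 1:
        raise ValueError("target_occupancy must be in (0, 1]")
    # Hours of call-handling work arriving in this one-hour slot.
    workload_hours = calls_per_hour * avg_handle_minutes / 60.0
    # Staff are assumed to be on calls only target_occupancy of the time.
    return math.ceil(workload_hours / target_occupancy)


# Hypothetical call arrival profile: hour of day -> calls received.
# It mimics the pattern described in the text: a morning peak and a
# smaller burst in the early afternoon.
call_profile = {8: 20, 9: 45, 10: 30, 14: 36, 16: 12}

roster = {
    hour: staff_needed(calls, avg_handle_minutes=6)
    for hour, calls in call_profile.items()
}
peak_hour = max(roster, key=roster.get)

for hour in sorted(roster):
    print(f"{hour:02d}:00  {roster[hour]} staff")
print(f"Peak at {peak_hour:02d}:00")
```

The gap between the peak-hour head count and the quiet-hour head count gives a first indication of how much cover might come from part-time staff, home-workers or second-line support. Real call data should replace the assumed figures before any decision is made.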

A range of skill options are possible, starting from a ‘call-     the service, the more likely specialist knowledge will be
logging’ service only – where staff need only very basic          required on the first call.
technical skills – right through to a ‘technical’ Service Desk
                                                                  Note that first-line resolution rates can be reduced by
where the organization’s most technically skilled staff are
                                                                  effective Problem Management, which will reduce a
used. In the case of the former, there will be a high
                                                                  number of the simpler, repetitive incidents. In such cases,
handling but low resolution rate, while in the latter case
                                                                  although the resolution rates appear to be going down,
this will be reversed.
                                                                  the overall service quality will have improved by the
The decision on the required skills level will often be           complete removal of many incidents. While this is good,
driven by target resolution times (agreed with the business       if Service Desk staff are paid incentives or bonuses for
and captured in service level targets), the complexity of         first-call resolution, it could prove disastrous for morale
the systems supported and ‘what the business is prepared          and process effectiveness unless the bonus threshold
to pay’.                                                          is reviewed.
There is a strong correlation between response and                Improvements in resolution times/rates should not be left
resolution targets and costs – generally speaking, the            to chance, but should instead be part of an ongoing
shorter the target times, the higher the cost because more        Service Improvement Plan (see the Continual Service
resources are required.                                           Improvement publication for fuller details).
While there may be instances when business dependency             Once the required skill levels have been identified, there is
or criticality make a highly technically skilled desk an          an ongoing task to ensure that the Service Desk is
imperative, the optimum and most cost-effective approach          operated in such a way that the necessary staff obtain and
is generally to have a ‘call-logging’ first line of support via   maintain the necessary skills – and that staff with the
the Service Desk, with quick and effective escalations to         correct balance of skills are on duty at appropriate times
more skilled second-line and third-line resolution groups         so that consistency is maintained.
where skilled staff can be concentrated and more
                                                                  This will involve an ongoing training and awareness
effectively utilised (see Incident Management, section 4.2,
                                                                  programme which should cover:
for more details and guidance on end-to-end support
structures). However, this basic starting point can be            ■ Interpersonal skills: such as telephony skills,
improved over time by providing the first-line staff with an          communication skills, active listening and customer-
effective knowledge-base, diagnostic scripts and                      care training.
integrated support tools (including a CMS), as well as            ■   Business awareness: specific knowledge of the
ongoing training and awareness, so that first-line                    organization’s business areas, drivers, structure,
resolution rates can gradually be increased.                          priorities, etc.
                                                                  ■   Service awareness of all the organization’s key IT
This can also be achieved by locating second-level staff on
the Service Desk, effectively creating a two-tier structure.          services for which support is being provided
This has advantages of making second-level staff available        ■   Technical awareness (and deeper technical training to
to help deal with peak call periods and to train more                 the appropriate level, depending upon the resolution
junior personnel, and it will often increase the first-call           rate sought)
resolution rate. However, second-line staff often have            ■   Depending on level of support provided, some
duties outside of the Service Desk – resulting in rosters             diagnosis skills (e.g. Kepner and Tregoe)
having to be managed or second-line staff positions being         ■   Support tools and techniques
duplicated. In addition, having to deal with routine calls        ■   Awareness training and tutorials in new systems and
may be demotivating for more experienced staff. A further             technologies, prior to their introduction
potential drawback is that the Service Desk becomes really        ■   Processes and procedures (most particularly Incident,
good at resolving calls, whereas                                      Change and Configuration Management – but an
second-line staff should be focused on removing the                   overview of all ITSM processes and procedures)
root cause instead.                                               ■   Typing skills to ensure quick and accurate entry of
Another factor to consider when deciding on the skills                incident or Service Request details.
requirements for Service Desk staff is the level of               For such a programme to be effective, skill requirements
customization or specialization of the supported services.        and levels should be evaluated periodically and training
Standardized services require less specific knowledge to          records maintained.
provide quality customer support. The more specialized



 Careful formulation of staffing rotations or schedules           staff. This often leads to innovation in Service Desk
 should be maintained so that a consistent balance of staff       operation (such as specialized services), which in turn drives
 experience and appropriate skill levels is present during        operational efficiencies at all tier levels of support. It helps
 all critical operational periods. It is not sufficient to have   to build skills that can be used in their current role as well
 only the right number of staff on duty – the correct blend       as jump-starting the training for a new role. While it is
 of skills should also be available.                              important to develop their core competencies in their
                                                                  current role, having a clear career path and recognising
 6.2.4.3 Training                                                 future requirements and development needs is also
 It is vital that all Service Desk staff are adequately trained   important.
 before they are called upon to staff the Service Desk. A
 formal induction programme should be undertaken by all           6.2.4.4 Staff retention
 new staff, the exact content of which will vary depending        It is very important that all IT Managers recognize the
 upon the existing skill levels and experience of the new         importance of the Service Desk and the staff who work on
 recruit, but is likely to include many of the required skills    it, and give this special attention. Any significant loss of
 as described above.                                              staff can be disruptive and lead to inconsistency of service
                                                                  – so efforts should be made to make the Service Desk an
 Where possible, a business awareness programme,
                                                                  attractive place to work.
 including short periods of secondment into key business
 areas, should be provided for new staff who do not               Ways in which this can be done include proper
 already have this level of business awareness.                   recognition of the role with reward packages recognizing
                                                                  this, team-building exercises and staff rotation onto other
 When starting on the Service Desk, new staff should
                                                                  activities (projects, second-line support, etc.).
 initially ‘shadow’ experienced staff – sit with them and
 listen in on calls – before starting to take calls themselves    The Service Desk can often be used as a stepping stone
 with a mentor listening in and able to intervene and             into other more technical or supervisory/managerial roles.
 provide support where necessary. The mentor should               If this is done, care is needed to ensure that proper
 initially review each call with the trainee after it concludes   succession planning takes place so that the desk does not
 to learn any lessons. The frequency of such reviews should       lose all of its key expertise in any area at one time. Also,
 be gradually reduced as experience and confidence grow           good documentation and cross-training can mitigate this
 but the mentor should still be available to provide              risk.
 ongoing support even when the trainee has reached the
 stage of going solo.                                             6.2.4.5 Super Users
 Mentors may need to be trained on how to mentor.                 Many organizations find it useful to appoint or designate a
 Service Desk experience and technical skills are not the         number of ‘Super Users’ throughout the user community,
 only requirements for mentoring. Effective knowledge-            to act as liaison points with IT in general and the Service
 transfer skills and the ability to teach without being           Desk in particular.
 condescending or threatening are equally important.              Super Users can be given some additional training and
 A programme will be necessary to keep Service Desk staff’s       awareness and used as a conduit for communications flow
 knowledge up to date – and to make them aware of new             in both directions. They can be asked to filter requests and
 developments, services and technologies. The timing of           issues raised by the user community (in some cases even
 such events is critical so as not to impact upon the normal      going as far as to have incidents or requests raised by the
 duties. Many Service Desks find that it is best to organize      Super User) – this can help prevent ‘incident storms’ when
 short ‘tutorials’ during quiet periods when staff are less       a key service or component fails, affecting many users.
 likely to be needed for call handling.                           They can also be used to cascade information from the
 Note: Investment should also be made in the professional         Service Desk outwards throughout their local user
 development of Service Desk staff. Internal mentoring and        community, which can be very useful in disseminating
 shadowing second- and third-level support staff is a good        service details to all users very quickly.
 start, but best-of-breed Service Desks benefit from a            It is important to note that Super Users should log all calls
 formalized programme of staff development.                       that they deal with, and not just those that they pass on
 Organizational commitment to professional development            to IT. This will mean access to, and training on how to use,
 helps instil a sense of accomplishment and opportunity to        the Incident logging tools. This will help to measure the

activity of the Super User and also to ensure that their       An increase in the number of calls to the Service Desk can
position is not abused. In addition, it will ensure that       indicate less reliable services over that period of time –
valuable history regarding incidents and service quality is    but may also indicate increased user confidence in a
not lost.                                                      Service Desk that is maturing, resulting in a higher
                                                               likelihood that users will seek assistance rather than try to
It may also be possible for Super Users to be involved in:
                                                               cope alone. Before drawing either conclusion from this
■ Staff training for users in their area                       type of metric, compare it with previous periods and take
■ Providing support for minor incidents or simple              account of any Service Desk improvements implemented
   request fulfilment                                          since the last measurement baseline, and of service
■ Involvement with new releases and rollouts.                  reliability changes, problems, etc., to isolate the true
                                                               cause of the increase.
Super Users do not necessarily provide support for the
whole of IT. In many cases a Super User will only provide      Further analysis and more detailed metrics are therefore
support for a specific application, module or business unit    needed and must be examined over a period of time.
area. As a business user the Super User often has in-depth     These will include the call-handling statistics previously
knowledge of how key business processes run and how            mentioned under telephony, and additionally:
services work in practice. This is very useful knowledge to    ■ The first-line resolution rate: the percentage of calls
share with the Service Desk, so that it can provide higher-      resolved at first line, without the need for escalation to
quality services in future.                                      other support groups. This is the figure often quoted
It should be noted that a firm commitment is needed from         by organizations as the primary measure of the Service
potential Super Users, and specifically their management,        Desk’s performance – and used for comparison
that they will have the time and interest to perform this        purposes with the performance of other desks – but
role before selection and training commences.                    care is needed when making any comparisons. For
                                                                 greater accuracy and more valid comparisons this can
A Super User, while a valuable interface to the business
                                                                 be broken down further as follows:
and the Service Desk, must be given proper training,
                                                                 ● The percentage of calls resolved during the first
accountability and expectations. Super Users can be
                                                                      contact with the Service Desk, i.e. while the user is
vulnerable to misuse if their role, responsibilities and
                                                                      still on the telephone to report the call
the process governing these are not clearly communicated
to the users. It is imperative that a Super User is not seen     ● The percentage of calls resolved by the Service
as a replacement for, or a means to circumvent, the                   Desk staff themselves without having to seek
Service Desk.                                                         deeper support from other groups. Note: some
                                                                      desks will choose to co-locate or embed more
6.2.5 Service Desk metrics                                            technically skilled second-line staff with the Service
                                                                      Desk (see Incident Management for further details).
Metrics should be established so that performance of the
                                                                      In such cases it is important when making
Service Desk can be evaluated at regular intervals. This is
                                                                      comparisons to also separate out (i) the percentage
important to assess the health, maturity, efficiency,
                                                                      resolved by the Service Desk staff alone; and
effectiveness and any opportunities to improve Service
                                                                      (ii) the percentage resolved by the first-line Service
Desk operations.
                                                                      Desk staff and second-line support staff combined.
Metrics for Service Desk performance must be realistic and     ■ Average time to resolve an incident (when resolved at
carefully chosen. It is common to select those metrics that      first line)
are easily available and that may seem to be a possible        ■ Average time to escalate an incident (where first-line
indication of performance; however, this can be                  resolution is not possible)
misleading. For example, the total number of calls             ■ Average Service Desk cost of handling an incident.
received by the Service Desk is not in itself an indication      Two metrics should be considered here:
of either good or bad performance and may in fact be
                                                                 ● Total cost of the Service Desk divided by the
caused by events completely outside the control of the
                                                                      number of calls. This will provide an average figure
Service Desk – for example a particularly busy period for
                                                                      which is useful as an index and for planning
the organization, or the release of a new version of a
                                                                      purposes but does not accurately represent the
major corporate system.
                                                                      relative costs of different types of calls



      ● By calculating the percentage of call duration time         courteous and professional, whether they instilled
       on the desk overall and working out a cost per               confidence in the user.
       minute (total costs for the period divided by total
                                                                    This type of measure is best obtained from the users
        call duration minutes) this can be used to
                                                                    themselves. This can be done as part of a wider
       calculate the cost for individual calls and give a
                                                                    customer/user satisfaction survey covering all of IT or can
       more accurate figure.
                                                                    be specifically targeted at Service Desk issues alone.
   By evaluating the types of incidents with call duration,
   a more refined picture of cost per call by types arises          One effective way of achieving the latter is through a call-
   and gives an indication of which incident types tend             back telephone survey, where an independent Service
   to cost more to resolve and possible targets for                 Desk Operator or Supervisor rings back a small percentage
   improvements.                                                    of users shortly after their incident has been resolved, to
 ■ Percentage of customer or user updates conducted                 ask the specific questions needed.
   within target times, as defined in SLA targets                   Care should be taken to keep the number of questions to
 ■ Average time to review and close a resolved call                 a minimum (five to six at the most) so that the users will
 ■ The number of calls broken down by time of day and               have the time to cooperate. Also survey questions should
   day of week, combined with the average call-time                 be designed so that the user or customer knows what area
   metric, is critical in determining the number of staff           or subject questions are about and which incident or
   required.                                                        service they are referring to. The Service Desk must act on
                                                                    low satisfaction levels and any feedback received.
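
Several of the metrics listed in section 6.2.5 are simple ratios, and a short worked example may make the difference between the two cost measures clearer. The sketch below uses a handful of made-up call records and an assumed total Service Desk cost for the period; the record layout, function names and figures are hypothetical and not part of the publication.

```python
from collections import defaultdict

# Hypothetical call records for one period:
# (incident type, minutes spent on the call, resolved at first line?)
calls = [
    ("password reset", 4, True),
    ("password reset", 6, True),
    ("printer fault", 10, False),
    ("application error", 20, False),
]
total_desk_cost = 400.0  # assumed total Service Desk cost for the period


def first_line_resolution_rate(records):
    """Percentage of calls resolved without escalation."""
    resolved = sum(1 for _, _, first_line in records if first_line)
    return 100.0 * resolved / len(records)


def average_cost_per_call(total_cost, records):
    """Total cost divided by the number of calls (a planning index)."""
    return total_cost / len(records)


def cost_per_minute(total_cost, records):
    """Total cost divided by total call duration in minutes."""
    return total_cost / sum(minutes for _, minutes, _ in records)


def average_cost_by_type(total_cost, records):
    """Average cost of a call of each incident type, based on duration."""
    rate = cost_per_minute(total_cost, records)
    minutes_by_type = defaultdict(list)
    for incident_type, minutes, _ in records:
        minutes_by_type[incident_type].append(minutes)
    return {t: rate * sum(m) / len(m) for t, m in minutes_by_type.items()}


print(first_line_resolution_rate(calls))
print(average_cost_per_call(total_desk_cost, calls))
print(average_cost_by_type(total_desk_cost, calls))
```

With these figures the flat average is the same for every call, while the duration-based figure shows that an application error costs several times as much to handle as a password reset. That is the kind of difference the text suggests using to pick targets for improvement.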
 Further general details on metrics and how they should be
 used to drive forward service quality are included in the          To allow adequate comparisons, the same percentage of
 Continual Service Improvement publication.                         calls should be selected in each period and they should be
                                                                    rigorously carried out despite any other time pressures.
 6.2.5.1 Customer/user satisfaction surveys                         Surveys are a complex and specialized area, requiring a
 As well as tracking the ‘hard’ measures of the Service             good understanding of statistics and survey techniques.
 Desk’s performance (via the metrics described above), it is        This publication will not attempt to provide an overview
 also important to assess ‘soft’ measures – such as how             of all of these, but a summary of some of the more widely
 well the customers and users feel their calls have been            used techniques and tools is listed in Table 6.1.
 answered, whether they feel the Service Desk operator was


 Table 6.1    Survey techniques and tools
  Technique/Tool                         Advantages                                  Disadvantages
  After-call survey                      ■   High response rate since the caller     ■   People may feel pressured into taking the
  Callers are asked to remain on the         is already on the phone                     survey, resulting in a negative service
  phone after the call and then asked    ■   Caller is surveyed immediately after        experience
  to rate the service they were              the call so their experience is         ■   The surveyor is seen as part of the Service
  provided                                   recent                                      Desk being surveyed, which may discourage
                                                                                         open answers

  Outbound telephone survey              ■   Higher response rate since the caller   ■   This method could be seen as intrusive, if
  Customers and users who have               is interviewed directly                     the call disrupts the user or customer from
  previously used the Service Desk are   ■   Specific categories of user or              their work
  contacted some time after their            customer can be targeted for            ■   The survey is conducted some time after
  experience with the Service Desk           feedback (e.g. people who                   the user or customer used the Service Desk,
                                             requested a specific service, or            so their perception may have changed
                                             people who experienced a disruption to
                                             a particular service)

 Personal interviews                        ■   The interviewer is able to observe     ■   Interviews are time-consuming for both the
 Customers and users are interviewed            non-verbal signals as well as              interviewer and the respondent
 personally by the person doing the             listening to what the user or          ■   Users and customers could turn the
 survey. This is especially effective for       customer is saying                         interviews into complaint sessions
 customers or users who use the             ■   Users and customers feel a greater
 Service Desk extensively or who have           degree of personal attention and a
 had a very negative experience                 sense that their answers are being
                                                taken seriously
 Group interviews                           ■   A larger number of users and           ■   People may not express themselves freely in
 Customers and users are interviewed in         customers can be interviewed               front of their peers or managers
 small groups. This is good for             ■   Questions are more generic and         ■   People’s opinions can easily be changed by
 gathering general impressions and for          therefore more consistent between          others in the group during the interview
 determining whether there is a need to         interviews
 change certain aspects of the Service
 Desk, e.g. service hours or location

 Postal/e-mail surveys                      ■   Specific or all customers or users     ■   Postal surveys are labour intensive to
 Survey questionnaires are mailed to a          can be targeted                            process
 target set of customers and users.         ■   Postal surveys can be anonymous,       ■   The percentage of people responding to
 They are asked to return their                 allowing people to express                 postal surveys tends to be small
 responses by post or e-mail                    themselves more freely                 ■   Misinterpretation of a question could affect
                                            ■   E-mail surveys are not anonymous,          the result
                                                but can be created using
                                                automated forms that make it
                                                convenient and easy for the user to
                                                reply and increase the likelihood it
                                                will be completed
 Online surveys                             ■   The potential audience of these        ■   The percentage of respondents cannot be
 Questionnaires are posted on a website         surveys is fairly large                    predicted
 and users and customers encouraged         ■   Respondents can complete the
 via e-mail or links from a popular site        questionnaire in their own time
 to participate in the survey               ■   The links on popular websites are
                                                good reminders without being
                                                intrusive
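
For the call-back survey described in section 6.2.5.1, the text asks that the same percentage of calls be selected in each period. A minimal sketch of that selection step follows, assuming the resolved calls for the period are available as a list of identifiers. The function name, the identifiers and the 5 per cent figure are hypothetical, and random sampling is one reasonable way to choose the calls rather than a method the publication prescribes.

```python
import random


def select_for_callback(resolved_call_ids, sample_percentage, seed=None):
    """Randomly pick a fixed percentage of resolved calls for a call-back."""
    if not 0 < sample_percentage <= 100:
        raise ValueError("sample_percentage must be in (0, 100]")
    ids = list(resolved_call_ids)
    if not ids:
        return []
    # Always survey at least one user when any calls were resolved.
    sample_size = max(1, round(len(ids) * sample_percentage / 100))
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(ids, sample_size)


# Hypothetical period with 200 resolved calls, surveying 5 per cent.
this_period = [f"INC{n:05d}" for n in range(1, 201)]
to_call_back = select_for_callback(this_period, sample_percentage=5, seed=42)
print(len(to_call_back))
```

Keeping `sample_percentage` constant from one period to the next is what makes the satisfaction results comparable over time.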

6.2.6 Outsourcing the Service Desk                                     and must therefore determine what service the outsourcer
The decision to outsource is a strategic issue for senior              provides, not the other way round.
managers – and is addressed in detail in the Service                   If the outsourcing route is chosen, there are some
Strategy and Service Design publications. Many of the                  safeguards that are needed to ensure that the outsourced
guidelines in this section are not unique to the Service               Service Desk works effectively and efficiently with the
Desk and can be applied to any function, support area or               organization’s other IT teams and departments and that
service being outsourced (or out-tasked).                              end-to-end Service Management control is maintained
Regardless of the reasons for, or the extent of, the                   (this is particularly important for organizations seeking
outsourcing contract, it is vital that the organization                ISO/IEC 20000 certification as overall management control
retains responsibility for the activities and services                 has to be demonstrated). Some of these safeguards are set
provided by the Service Desk. The organization is                      out below.
ultimately responsible for the outcomes of the decision



 6.2.6.1 Common tools and processes                               statements may indicate that a potential supplier uses the
 The Service Desk does not have responsibility for all the
 processes and procedures that it initiates. For example, a
 Service Request is received by the Service Desk but the
 request is fulfilled by the internal IT Operational team.

 If the Service Desk is outsourced, care must be taken that
 the tools are consistent with those still being used in the
 customer organization. Outsourcing is often seen as an
 opportunity to replace outdated or inadequate tools, only
 to find that there are severe integration problems between
 the new tool and the legacy tools and processes.

 For this reason it is important to ensure that these issues
 are properly researched and the customer’s requirements
 are adequately scoped and specified before the
 outsourcing contract. Service Desk tools must not only
 support the outsourced Service Desk, but they must
 support the customer organization’s processes and
 business requirements as well.

 Ideally the outsourced desk should use the same tools and
 processes (or, as a minimum, interfacing tools and
 processes) to allow smooth process flow between the
 Service Desk and second- and third-line support groups.
 In addition, the outsourced Service Desk should have
 access to:

 ■ All incident records and information
 ■ Problem Records and information
 ■ Known Error Data
 ■ Change Schedule
 ■ Sources of internal knowledge (especially technical or
   application experts)
 ■ SKMS
 ■ CMS
 ■ Alerts from monitoring tools.

 It is often a challenge integrating processes and tools in a
 less mature organization with those in a more mature
 organization. A common but incorrect assumption is that
 the maturity of the one organization will somehow result
 in higher maturity in the other. Active involvement to
 ensure alignment of processes and tools is essential to a
 smooth transition and ongoing management of services
 between the internal and external organizations. In fact, if
 this is not directly addressed, it could result in the failure
 of the contract.

 It is also often incorrectly assumed that the proof of
 Service Management quality and maturity in an external
 outsource partner can be guaranteed by stating
 requirements in the procurement process for ‘ITIL
 conformance’ and/or ‘ISO/IEC 20000 certification’. These
 ITIL Framework in its delivery of services to customers, or
 that they have achieved standards certification for their
 internal practices, but it is equally important to have the
 enabling technology in place and being used that
 demonstrates a service provider’s capability to manage
 services and interface to internal practices harmoniously.
 There is no standard of compliance that ensures this and
 so procurement efforts should include specific queries to
 satisfy this requirement. More information on outsource
 provider acquisition can be found in the Service Design
 publication.

 6.2.6.2 SLA targets
 The SLA targets for overall incident-handling and
 resolution times need to be agreed with the customers
 and between all teams and departments – and OLA/UC
 targets need to be coordinated and agreed with individual
 support groups so that they underpin and support the
 SLA targets.

 Examples of these can be seen in the section on metrics
 above (see section 6.2.5).

 6.2.6.3 Good communications
 The lines of communication between the outsourced
 Service Desk and the other support groups need to work
 very effectively. This can be assisted by some or all of the
 following steps:

 ■ Close physical co-location
 ■ Regular liaison/review meetings
 ■ Cross-training tutorials between the teams and
   departments
 ■ ‘Partnership’ arrangements when staff from both
   organizations are used jointly to staff the desk
 ■ Communication Plans and performance targets are
   documented in a consistent manner in OLAs and UCs.

 In cases where the Service Desk is located off-shore, not all
 of these measures will be possible. However, the need for
 training and communication of the Service Desk staff is
 still critical, even more so in cases where there are
 language and cultural differences.

 This will be covered in more detail in ITIL complementary
 publications, but, as a rule, outsourcing companies who
 offer off-shore Service Desk solutions should take the
 following into account:

 ■ Training programmes focused on cultural
   understanding of the customer market
 ■ Language skills – especially the understanding of
   idiomatic use of the language in the customer market.

  This is not so that the Service Desk staff sound like
  natives of the customer’s country (that type of
  insincerity is very quickly detected by customers), but
  to facilitate better understanding of the customer and
  the better to appreciate their priorities
■ Regular visits by representatives of the customer
  organization to provide training and appropriate
  feedback directly to the Service Desk management
  and staff
■ Training in the use of the customer organization’s tools
  and methods of work. This is especially effective if
  similar training materials are presented by the same
  instructors as those used by the customer
  organization.

6.2.6.4 Ownership of data
Clear ownership of the data collected by the outsourced
Service Desk must be established. Ownership of all data
relative to users, customers, affected CIs, services,
incidents, Service Requests, changes, etc. must remain
with the organization that is outsourcing the activity –
but both organizations will require access to it.

Data that is related specifically to performance of
employees of the outsourcing company will remain the
property of that company, which is often legally prevented
from sharing the data with the customer organization. This
may also be true of other data that is used purely for the
internal management of the Service Desk, such as head
count, optimization activities, Service Desk cost
information, etc.

All reporting requirements and issues around ownership of
data must be specified in the underpinning contract with
the company providing the outsourcing service.

6.3 TECHNICAL MANAGEMENT
Technical Management refers to the groups, departments
or teams that provide technical expertise and overall
management of the IT Infrastructure.

6.3.1 Technical Management role
Technical Management plays a dual role:

■ It is the custodian of technical knowledge and
  expertise related to managing the IT Infrastructure.
  In this role, Technical Management ensures that the
  knowledge required to design, test, manage and
  improve IT services is identified, developed and
  refined.
■ It provides the actual resources to support the ITSM
  Lifecycle. In this role Technical Management ensures
  that resources are effectively trained and deployed to
  design, build, transition, operate and improve the
  technology required to deliver and support IT services.

By performing these two roles, Technical Management is
able to ensure that the organization has access to the
right type and level of human resources to manage
technology and, thus, to meet business objectives.
Defining the requirements for these roles starts in Service
Strategy and is expanded in Service Design, validated in
Service Transition and refined in Continual Service
Improvement (see other ITIL publications in this series).

Part of this role is also to ensure a balance between the
skill level, utilization and the cost of these resources. For
example, hiring a top-level resource at the higher end of
the salary scale and then only using that skill for 10% of
the time is not effective. A better Technical Management
strategy would be to identify the times that the skill is
needed and then hire a contractor for only those tasks.

Another strategy in larger organizations is to leverage
specialist staff out of ‘central’ pools so that specialists can
be well utilized and provide an economy of scale to the
organization and minimize the need to hire in contractors.
Specialized skills should be identified among resources in
the IT organization, then leveraged for specific needs as
they arise, analogous to a special tactical unit, whose
members also perform regular duties but who are
assigned to tasks needing their specialized skills. This type
of resource utilization is particularly useful both for project
teams and problem resolution.

An additional, but very important role played by Technical
Management is to provide guidance to IT Operations
about how best to carry out the ongoing operational
management of technology. This role is partly carried out
during the Service Design process, but it is also a part of
everyday communication with IT Operations Management
as they seek to achieve stability and optimum
performance.

The objectives, activities and structures that enable
Technical Management to perform these roles effectively
are discussed below.

6.3.2 Technical Management objectives
The objectives of Technical Management are to help plan,
implement and maintain a stable technical infrastructure
to support the organization’s business processes through:

■ Well designed and highly resilient, cost-effective
  technical topology



■ The use of adequate technical skills to maintain the
  technical infrastructure in optimum condition
■ Swift use of technical skills to speedily diagnose and
  resolve any technical failures that do occur.

6.3.3 Generic Technical Management activities
Technical Management is involved in two types of activity:

■ Activities that are generic to the Technical
  Management function as a whole are discussed in this
  section as they enable Technical Management as a
  function to execute its role.
■ A set of discrete activities and processes, which are
  performed by all three functions of Technical,
  Application and IT Operations Management, are
  covered in Chapter 5.

Generic Technical Management activities are highlighted
as follows:

■ Identifying the knowledge and expertise required to
  manage and operate the IT Infrastructure and to
  deliver IT services. This process starts during the
  Service Strategy phase, is expanded in detail in Service
  Design and is executed in Service Operation. Ongoing
  assessment and updating of these skills is done during
  Continual Service Improvement.
■ Documentation of the skills that exist in the
  organization, as well as those skills that need to be
  developed. This will include the development of
  Skills Inventories and the performance of Training
  Needs Analyses.
■ Initiating training programmes to develop and refine
  the skills in the appropriate technical resources and
  maintaining training records for all technical resources.
■ Design and delivery of training for users, the Service
  Desk and other groups. Although training
  requirements must be defined in Service Design, they
  are executed in Service Operation. Where Technical
  Management does not deliver training, it is responsible
  for identifying organizations that can provide it.
■ Recruiting or contracting resources with skills that
  cannot be developed internally, or where there are
  insufficient people to perform the required Technical
  Management activities.
■ Procuring skills for specific activities where the
  required skills are not available internally or in
  the open market, or where it is more cost-efficient
  to do so.
■ Definition of standards used in the design of new
  architectures and participation in the definition of
  technology architectures during the Service Strategy
  and Design phases.
■ Research and development of solutions that can help
  expand the Service Portfolio or which can be used to
  simplify or automate IT Operations, reduce costs or
  increase levels of IT service.
■ Involvement in the design and building of new
  services. Technical Management will contribute to the
  design of the Technical Architecture and Performance
  standards for IT services. In addition, it will also be
  responsible for specifying the operational activities
  required to manage the IT Infrastructure on an
  ongoing basis.
■ Involvement in projects, not only during Service
  Design and Service Transition, but also for Continual
  Service Improvement or operational projects, such as
  Operating System upgrades, server consolidation
  projects or physical moves.
■ Availability and Capacity Management are dependent
  on Technical Management for engineering IT services
  to meet the levels of service required by the business.
  This means that modelling and workload forecasting
  are often done with Technical Management resources.
■ Assistance in assessing risk, identifying critical service
  and system dependencies and defining and
  implementing countermeasures.
■ Designing and performing tests for the functionality,
  performance and manageability of IT services.
■ Managing vendors. Many Technical Management
  departments or groups are the only ones who know
  exactly what is required of a vendor and how to
  measure and manage them. For this reason, many
  organizations rely on Technical Management
  departments to manage contracts with vendors of
  specific CIs. If this is the case it is important to ensure
  that these relationships are managed as part of the
  SLM process.
■ Definition and management of Event Management
  standards and tools. Technical Management will also
  monitor and respond to many categories of events.
■ Technical Management departments or groups are
  integral to the performance of Incident Management.
  They receive incidents through Functional Escalation
  and provide second- and higher-level support. They
  are also involved in maintaining categories and
  defining the escalation procedures that are executed
  in Incident Management.
■ Technical Management as a function provides the
  resources that execute the Problem Management
  process. It is its technical expertise and knowledge
  that is used to diagnose and resolve problems. It is

  also its relationship with the vendors that is used to
  escalate and follow up with vendor support teams.
■ Technical Management resources will be involved in
  defining coding systems that are used in Incident and
  Problem Management (e.g. Incident Categories).
■ Technical Management resources are used to support
  Problem Management in validating and maintaining
  the KEDB.
■ Change Management relies on the technical
  knowledge and expertise to evaluate changes, and
  many changes will be built by Technical Management.
■ Releases are frequently deployed using Technical
  Management resources.
■ Technical Management will provide information for,
  and operationally maintain, the Configuration
  Management system and its data. This will be done in
  cooperation with Application Management to ensure
  that the correct CI attributes and relationships are
  created from the deployment of services and the
  ongoing maintenance over the life of CIs.
■ Technical Management is involved in the Continual
  Service Improvement processes, particularly in
  identifying opportunities for improvement and then in
  helping to evaluate alternative solutions.
■ As a custodian of technical knowledge and expertise,
  Technical Management ensures that all system and
  operating documentation is up to date and properly
  utilized. This includes ensuring that all management,
  administration and user manuals are up to date and
  complete and that technical staff are familiar with
  their contents.
■ Updating and maintaining data used for reporting on
  technical and service capabilities, e.g. Capacity and
  Performance Management, Availability Management,
  Problem Management, etc.
■ Assisting IT Financial Management to identify the cost
  of technology and IT human resources used to
  manage IT services.
■ Involvement in defining the operational activities
  performed as part of IT Operations Management. Many
  Technical Management departments, groups or teams
  also perform the operational activities as part of an
  organization’s IT Operations Management function.

6.3.4 Technical Management organization
Technical Management is not normally provided by a
single department or group. One or more Technical
Support teams or departments will be needed to provide
technical management and support for the IT
Infrastructure. In all but the smallest organizations, where
a single combined team or department may suffice,
separate teams or departments will be needed for each
type of infrastructure being used.

IT Operations Management consists of a number of
technological areas. Each of these requires a specific set of
skills to manage and operate it. Some skill sets are related
and can be performed by generalists, whereas others are
specific to a component, system or platform.

The primary criterion of Technical Management
organizational structure is that of specialization or division
of labour. The principle is that people are grouped
according to their technical skill sets, and that these skill
sets are determined by the technology that needs to
be managed.

Sections 6.6 and 6.7 cover the organizational aspects of
Technical Management in detail, but this list provides
some examples of typical Technical Management teams
or departments:

■ Mainframe team or department – if one or more
  mainframe types are still being used by the
  organization
■ Server team or department – often split again by
  technology types (e.g. Unix server, Wintel server)
■ Storage team or department, responsible for the
  management of all data storage devices and media
■ Network Support team or department, looking after
  the organization’s internal WANs/LANs and managing
  any external network suppliers
■ Desktop team or department, responsible for all
  installed desktop equipment
■ Database team or department, responsible for the
  creation, maintenance and support of the
  organization’s databases
■ Middleware team or department, responsible for the
  integration, testing and maintenance of all middleware
  in use in the organization
■ Directory Services team or department, responsible for
  maintaining access and rights to service elements in
  the infrastructure
■ Internet or Web team or department, responsible for
  managing the availability and security of access to
  servers and content by external customers, users and
  partners
■ Messaging team or department, responsible for e-mail
  services
■ IP-based Telephony team or department (e.g. VoIP).



6.3.5 Technical Design and Technical Maintenance and Support
Technical Management consists of specialist technical
architects and designers (who are primarily involved
during Service Design) and specialist maintenance
and support staff (who are primarily involved during
Service Operation).

In this publication, they are viewed as being part of the
same function, but many organizations see them as two
separate teams or even departments. The problem with
this approach is that good design needs input from the
people who are required to manage the solution – and
good operation requires involvement from the people
who designed the solution.

The problems that need to be overcome are similar to
those faced in managing the Application Lifecycle (see
section 6.5 for a more detailed discussion). The solution
will include the following elements:

■ Support staff should be involved during the design or
  architecture of a solution. Design staff should be
  involved in setting maintenance objectives and
  resolving support issues.
■ A change in how both Design and Support staff are
  measured. Designers should be held partly
  accountable for design flaws that create operational
  outages. Support staff should be held partly
  accountable for contribution to the technical
  architecture.

6.3.6 Technical Management metrics
Metrics for Technical Management will largely depend on
which technology is being managed, but some generic
metrics include:

■ Measurement of agreed outputs. These could
  include:
  ● Contribution to achievement of services to the
    business. Although many of the Technical
    Management teams will not be in direct contact
    with the business, the technology they manage
    impacts the business. Metrics should reflect both
    negative (incidents traced to their team) and
    positive (system performance and availability)
    contributions
  ● Transaction rates and availability for critical
    business transactions
  ● Service Desk training
  ● Recording problem resolutions into the KEDB
  ● User measures of the quality of outputs as defined
    in the SLAs
  ● Installation and configuration of components under
    their control.
■ Process metrics. Technical Management teams
  execute many Service Management process activities.
  Their ability to do so will be measured as part of the
  process metrics where appropriate (see section on
  each process for more details). Examples include:
  ● Response time to events and event completion
    rates
  ● Incident resolution times for second- and third-line
    support
  ● Problem resolution statistics
  ● Number of escalations and reason for those
    escalations
  ● Number of changes implemented and backed out
  ● Number of unauthorized changes detected
  ● Number of releases deployed, total and successful
  ● Security issues detected and resolved
  ● Actual system utilization against Capacity Plan
    forecasts (where the team has contributed to the
    development of the plan)
  ● Tracking against SIPs
  ● Expenditure against budget.
■ Technology performance. These metrics are based
  on Service Design specifications and technical
  performance standards set by vendors, and will
  typically be contained in OLAs or Standard Operating
  Procedures. Actual metrics will vary by technology, but
  are likely to include:
  ● Utilization rates (e.g. memory or processor for
    server, bandwidth for networks, etc.)
  ● Availability (of systems, network, devices, etc.),
    which is helpful for measuring team or system
    performance, but is not to be confused with
    Service Availability – which requires the ability to
    measure the overall availability of the service and
    may use the availability figures for a number of
    individual systems or components
  ● Performance (e.g. response times, queuing
    rates, etc.).
■ Mean Time Between Failures of specified
  equipment. This metric is used to ensure that good
  purchasing decisions are being made and, when
  compared with maintenance schedules, whether the
  equipment is being properly maintained
■ Measurement of maintenance activity, including:
  ● Maintenance performed per schedule
  ● Number of maintenance windows exceeded
  ● Maintenance objectives achieved (number and
    percentage).

■ Training and skills development. These metrics                Skills Inventories can also be used as part of the Service
   ensure that staff have the skills and training to            Portfolio to assess whether a new service can be delivered
   manage the technology that is under their control,           with existing staff and skill sets, or whether an investment
   and will also identify areas where training is still         needs to be made in new staff or training. Skills
   required.                                                    Inventories can therefore contribute significantly to
                                                                Capacity Planning.
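
As a rough illustration of how a few of the generic metrics listed in section 6.3.6 might be calculated, the sketch below derives Mean Time Between Failures and availability for individual pieces of equipment, and then combines component figures into an overall service figure. The record layout (`EquipmentLog`), the function names and the sample figures are assumptions made for illustration only and are not defined by this publication. In particular, multiplying component availabilities assumes that the service depends on every component in series and that failures are independent; a service with redundant components would need a different model.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class EquipmentLog:
    """Hypothetical summary of one piece of equipment over a reporting period."""
    operating_hours: float  # hours the equipment was up and in service
    downtime_hours: float   # hours the equipment was unavailable
    failures: int           # number of failures recorded in the period


def mtbf(log: EquipmentLog) -> float:
    """Mean Time Between Failures: operating hours divided by failure count."""
    if log.failures == 0:
        return float("inf")  # no failures observed in the period
    return log.operating_hours / log.failures


def availability(log: EquipmentLog) -> float:
    """Fraction of the reporting period during which the equipment was up."""
    total = log.operating_hours + log.downtime_hours
    return log.operating_hours / total


def serial_service_availability(component_availabilities) -> float:
    """Overall availability of a service that needs ALL components to be up.

    Assumes a simple serial dependency with independent failures.
    """
    return prod(component_availabilities)


# Illustrative figures only.
server = EquipmentLog(operating_hours=990, downtime_hours=10, failures=4)
network = EquipmentLog(operating_hours=995, downtime_hours=5, failures=2)

server_mtbf = mtbf(server)
service = serial_service_availability(
    [availability(server), availability(network)]
)
```

Under these assumptions the service figure is lower than the figure for either component on its own, which is one reason component availability should not be reported as if it were Service Availability.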
6.3.7 Technical Management documentation                        The definition and maintenance of Skills Inventories
Technical Management is involved in drafting and                requires a good interface with Human Resource processes
maintaining several documents as part of other processes        and tools in the organization.
(e.g. Capacity Planning, Change Management, Problem
Management, etc.). These documents are discussed in
some detail in the relevant process descriptions. However,      6.4 IT OPERATIONS MANAGEMENT
there are some documents that are specific to the               In business, the term ‘Operations Management’ is used to
Technical Management groups or teams who will provide           mean the department, group or team of people
document management and control for documents                   responsible for performing the organization’s day-to-day
relating to the technology under their control. Technical       operational activities – such as running the production line
Management documentation includes the following.                in a manufacturing environment or managing the
                                                                distribution centres and fleet movements within a logistics
6.3.7.1 Technical documentation                                 organization.
The sourcing and maintenance of technical                       Operations Management generally has the following
documentation for all CIs is the responsibility of Technical    characteristics:
Management. These include:
                                                                ■ There is work to ensure that a device, system or
■ Technical manuals                                                 process is actually running or working (as opposed to
■ Management and administration manuals                             strategy or planning)
■ User manuals for CIs. These will typically exclude            ■   This is where plans are turned into actions
   application user manuals, which are maintained by            ■   The focus is on daily or shorter-term activities,
   Application Management.                                          although it should be noted that these activities
                                                                    will generally be performed and repeated over a
6.3.7.2 Maintenance Schedules                                       relatively long period (as opposed to one-off project
These schedules are drawn up and agreed during the                  type activities)
Service Design phase related to Availability and Capacity       ■   These activities are executed by specialized technical
Management, but they are essentially the property of the            staff, who often have to undergo technical training to
various Technical Management departments, groups or                 learn how to perform each activity
teams. This is because they have the technical expertise        ■   There is a focus on building repeatable, consistent
for specific technologies and are most likely to know what          actions that – if repeated frequently enough at the
is needed to keep them in working order.                            right level of quality – will ensure the success of
For more details on the definition of Maintenance                   the operation
Schedules and Service Maintenance Objectives, refer to the      ■   This is where the actual value of the organization is
ITIL Service Design publication.                                    delivered and measured
                                                                ■   There is a dependency on investment in equipment
6.3.7.3 Skills Inventory                                            or human resources or both
A Skills Inventory is a system or tool that identifies the      ■   The value generated must exceed the cost of the
skills required to deliver and support IT services and also         investment and all other organizational overheads
the individuals who possess those skills. Skills Inventories        (such as management and marketing costs) if the
are most effective if they are aligned with processes,              business is to succeed.
architectures and performance standards.                        In a similar way, IT Operations Management can be
In addition, Skills Inventories should identify the training    defined as the function responsible for the ongoing
available to cultivate each skill should existing staff leave   management and maintenance of an organization’s IT
the organization.                                               Infrastructure to ensure delivery of the agreed level of IT
                                                                services to the business.



 IT Operations can be defined as the set of activities               infrastructure and consistency of IT Services is a
 involved in the day-to-day running of the IT Infrastructure         primary concern of IT Operations. Even operational
 for the purpose of delivering IT services at agreed levels to       improvements are aimed at finding simpler and better
 meet stated business objectives.                                    ways of doing the same thing.
                                                                   ■ At the same time, IT Operations is part of the process
 6.4.1 IT Operations Management role                                 of adding value to the different lines of business and
 The role of Operations Management is to execute the                 to support the value network (see the ITIL Service
 ongoing activities and procedures required to manage and            Strategy publication). The ability of the business to
 maintain the IT Infrastructure so as to deliver and support         meet its objectives and to remain competitive
 IT Services at the agreed levels. These have already been           depends on the output and reliability of the day-to-
 described in section 5, but are summarized here for                 day operation of IT. As such, IT Operations
 completeness:                                                       Management must be able to continually adapt to
                                                                     business requirements and demand. The Business does
 ■ Operations Control, which oversees the execution
                                                                     not care that IT Operations complied with a standard
   and monitoring of the operational activities and events
                                                                     procedure or that a server performed optimally. As
   in the IT Infrastructure. This can be done with the
                                                                     business demand and requirements change, IT
   assistance of an Operations Bridge or Network
                                                                     Operations Management must be able to keep pace
   Operations Centre. In addition to executing routine
                                                                     with them, often challenging the status quo.
   tasks from all technical areas, Operations Control also
   performs the following specific tasks:                          IT Operations must achieve a balance between these roles,
   ● Console Management, which refers to defining                  which will require the following:
       central observation and monitoring capability and           ■ An understanding of how technology is used to
       then using those consoles to exercise monitoring                provide IT services
       and control activities                                      ■   An understanding of the relative importance and
   ● Job Scheduling, or the management of routine                      impact of those services on the business
       batch jobs or scripts                                       ■   Procedures and manuals that outline the role of IT
   ● Backup and Restore on behalf of all Technical                     Operations in both the management of technology
       and Application Management teams and                            and the delivery of IT services
       departments and often on behalf of users                    ■   A clearly differentiated set of metrics to report to the
   ● Print and Output management for the collation                     business on the achievement of Service objectives; and
       and distribution of all centralized printing or                 to report to IT managers on the efficiency and
       electronic output                                               effectiveness of IT Operations
   ● Performance of maintenance activities on behalf               ■   All IT Operations staff understand exactly how the
       of Technical or Application Management teams or                 performance of the technology affects the delivery of
       departments.                                                    IT services
 ■ Facilities Management, which refers to the                      ■   A cost strategy aimed at balancing the requirements
   management of the physical IT environment, typically                of different business units with the cost savings
   a Data Centre or computer rooms and recovery sites                  available through optimization of existing technology
   together with all the power and cooling equipment.                  or investment in new technology
   Facilities Management also includes the coordination            ■   A value-based, rather than cost-based, Return on
   of large-scale consolidation projects, e.g. Data Centre             Investment strategy.
   consolidation or server consolidation projects. In some
   cases the management of a data centre is outsourced,            6.4.2 IT Operations Management objectives
   in which case Facilities Management refers to the
                                                                   The objectives of IT Operations Management include:
   management of the outsourcing contract.
                                                                   ■ Maintenance of the status quo to achieve stability of
 As with many IT Service Management processes and
                                                                     the organization’s day-to-day processes and activities
 functions, IT Operations Management plays a dual role.
                                                                   ■ Regular scrutiny and improvements to achieve
 ■ IT Operations Management is responsible for executing             improved service at reduced costs, while maintaining
      the activities and performance standards defined               stability
      during Service Design and tested during Service              ■ Swift application of operational skills to diagnose and
      Transition. In this sense IT Operations’ role is primarily     resolve any IT operations failures that occur.
      to maintain the status quo. The stability of the IT


6.4.3 IT Operations Management                                    ● Expenditure against budget.
organization                                                   ■ If maintenance activities have been delegated, then

Figure 6.1 in the introduction to Chapter 6 illustrated that     metrics related to these activities will also be
IT Operations Management is seen as a function in its own        appropriate:
right but that, in many cases, staff from Technical and          ● Maintenance performed per schedule
Application Management groups form part of this                  ● Number of maintenance windows exceeded
function.                                                        ● Maintenance objectives achieved (number and
                                                                    percentage).
This means that some Technical and Application
                                                               ■ Metrics related to Facilities Management are extensive,
Management departments or groups will manage and
execute their own operational activities. Others will            but typically include:
delegate these activities to a dedicated IT Operations           ● Costs versus budget related to maintenance,
department.                                                         construction, security, shipping, etc.
                                                                 ● Incidents related to the building, e.g. repairs
There is no single method for assigning activities, as it
                                                                    needed to the facility
depends on the maturity and stability of the infrastructure
                                                                 ● Reports on access to the facility
being managed. For example, Technical and Application
                                                                 ● Number of security events and Incidents and their
Management areas that are fairly new and unstable tend
to manage their own operations. Groups where the                    resolution
technology or application is stable, mature and well             ● Power usage statistics, especially as related to
understood tend to have standardized their operations               changes in layout and environmental conditioning
more and will therefore feel more comfortable delegating            strategies
these activities.                                                ● Events or incidents related to shipping and
                                                                    distribution.
Some options of how to structure IT Operations are
discussed in detail in section 6.7 of this publication.
                                                               6.4.5 IT Operations Management
6.4.4 IT Operations Management metrics                         documentation
                                                               A number of documents are produced and used during IT
IT Operations Management is measured in terms of its
                                                               Operations Management. This list is a summary of some of
effective execution of specified activities and procedures,
                                                               the most important and does not include reports that are
as well as its execution of process activities. Examples of
                                                               produced by IT Operations Management on behalf of
these are as follows:
                                                               other processes or functions.
■ Successful completion of scheduled jobs
■ Number of exceptions to scheduled activities and jobs        6.4.5.1 Standard Operating Procedures
■ Number of data or system restores required                   The SOPs are a set of documents containing detailed
■ Equipment installation statistics, including number of       instructions and activity schedules for every IT Operations
  items installed by type, successful installations, etc.      Management team, department or group.
■ Process metrics. IT Operations Management executes
                                                               These documents represent the routine work that needs to
  many Service Management process activities. Their            be done for every device, system or procedure. They also
  ability to do so will be measured as part of the             outline the procedures to be followed if an exception is
  process metrics where appropriate (see section on            detected or if a change is required.
  each process for more details). Examples include:
  ● Response time to events                                    SOP documents could also be used to define standard
                                                               levels of performance for devices or procedures. In some
  ● Incident resolution times
                                                               organizations the SOP documents are referred to in the
  ● Number of security-related incidents
                                                               OLA. Instead of listing detailed performance measures in
  ● Number of escalations and reason for those
                                                               the OLA, a clause is inserted to refer to the performance
      escalations                                              standards in the SOP and how these will be measured and
  ● Number of changes implemented and backed out               reported.
  ● Number of unauthorized changes detected
  ● Number of releases deployed, total and successful
  ● Tracking against SIPs
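
Several of the metrics listed in section 6.4.4 are simple counts and percentages. The Python below is a minimal sketch of how they might be derived from operational records. The record types JobRun and MaintenanceActivity, the field names, the use of minutes as the unit and the sample data are all assumptions made for illustration; they do not come from this publication.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    name: str
    succeeded: bool
    exception: bool = False  # True if the run deviated from the schedule


@dataclass
class MaintenanceActivity:
    name: str
    window_minutes: int   # agreed maintenance window (assumed unit)
    actual_minutes: int   # time actually taken
    objective_met: bool


def percentage(part, whole):
    """Return part/whole as a percentage, or 0.0 when there is nothing to count."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def operations_metrics(jobs, maintenance):
    """Summarize scheduled-job and maintenance metrics for one reporting period."""
    completed = sum(1 for j in jobs if j.succeeded)
    objectives = sum(1 for m in maintenance if m.objective_met)
    return {
        "jobs_completed": completed,
        "jobs_completed_pct": percentage(completed, len(jobs)),
        "job_exceptions": sum(1 for j in jobs if j.exception),
        "windows_exceeded": sum(
            1 for m in maintenance if m.actual_minutes > m.window_minutes
        ),
        "objectives_achieved": objectives,
        "objectives_achieved_pct": percentage(objectives, len(maintenance)),
    }


# Sample data (assumed for illustration only)
jobs = [
    JobRun("payroll batch", True),
    JobRun("nightly backup", True),
    JobRun("report print run", False, exception=True),
    JobRun("index rebuild", True),
]
maintenance = [
    MaintenanceActivity("firmware update", 60, 75, False),
    MaintenanceActivity("disk replacement", 30, 20, True),
]

metrics = operations_metrics(jobs, maintenance)
print(metrics)
```

Reporting both the number and the percentage, as the list of metrics suggests, keeps small reporting periods from producing misleading percentages on their own.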



 6.4.5.2 Operations Logs                                         could simply be listed briefly with a reference to the
 Any activity that is conducted as part of IT Operations         section or page in the SOP.
 should be recorded for a number of reasons, including:          Most Shift Schedules take the form of a checklist where
 ■ They can be used to confirm the successful                    operators can check off the item as it is completed,
   completion of specific jobs or activities                     together with the time of completion. This makes it easy
 ■ They can be used to confirm that an IT service was
                                                                 to see the progress of activities and also helps to identify
                                                                 any potential issues where jobs are taking too long.
   delivered as agreed
 ■ They can be used by Problem Management to                     Shift Reports are a form of Operations Log, but also have
   research the root cause of incidents                          the following additional functions:
 ■ They are the basis for reports on the performance of          ■ To record major events and actions that occurred
   the IT Operations Management teams and                           during the shift
   departments.                                                  ■ To form part of the handover between shift leaders
 The format of these logs is as varied as the number             ■ To report any exceptions to Service Maintenance
 of systems and Operations Management teams or                     Objectives
 departments. Examples of Operations Logs include                ■ To identify any uncompleted activity that could result
 the following:                                                    in degraded performance on any service during the
 ■ Operating System Logs stored on each device                     next service hours.
 ■ Application Activity Logs stored in a file on the
      application server                                         6.4.5.4 Operations Schedule
 ■    Event Logs stored on the monitoring tool server            The Operations Schedules are similar to Shift Schedules
 ■    Utilization Logs for key devices                           but cover all aspects of IT Operations at a high level. This
                                                                 schedule will include an overview of all planned changes,
 ■    Physical access logs recording who accessed secure
                                                                 maintenance, routine jobs and additional work, together
      buildings and when
                                                                 with information about upcoming business or vendor
 ■    Handwritten logs of actions performed by operators.
                                                                 events. The Operations Schedule is used as the basis for
      These must be kept in a formal logbook or binder, numbered
                                                                 the Daily Operations Meeting and is the master reference
      and stored in a secure environment. Checks should
                                                                 for all IT Operations managers to track progress and detect
      ensure that pages are not removed.
                                                                 exceptions.
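
The checklist form of a Shift Schedule described in section 6.4.5.3, in which operators check off each item with its time of completion, can be sketched in a few lines of Python. This is an illustrative sketch only: the ScheduleItem class, the shift_report function, the idea of an expected duration per item and the sample activities are all hypothetical and are not taken from this publication.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScheduleItem:
    activity: str
    planned_start: datetime
    expected_duration: timedelta          # assumed to be known per activity
    completed_at: Optional[datetime] = None

    def check_off(self, when):
        """Operator marks the item as done, recording the completion time."""
        self.completed_at = when

    def is_overdue(self, now):
        """True if the item finished late, or is unfinished past its deadline."""
        deadline = self.planned_start + self.expected_duration
        finished = self.completed_at or now
        return finished > deadline


def shift_report(items, now):
    """Summarize progress for handover between shift leaders."""
    return {
        "completed": [i.activity for i in items if i.completed_at],
        "uncompleted": [i.activity for i in items if not i.completed_at],
        "took_too_long": [i.activity for i in items if i.is_overdue(now)],
    }


# Sample shift (assumed for illustration only)
day = datetime(2007, 1, 1)
items = [
    ScheduleItem("nightly backup", day.replace(hour=22), timedelta(minutes=60)),
    ScheduleItem("batch billing run", day.replace(hour=23), timedelta(minutes=30)),
    ScheduleItem("print queue flush", day.replace(hour=23, minute=30),
                 timedelta(minutes=15)),
]
items[0].check_off(day.replace(hour=22, minute=50))   # finished on time
items[1].check_off(day.replace(hour=23, minute=45))   # finished 15 minutes late

handover = shift_report(items, now=day + timedelta(days=1))  # midnight
print(handover)
```

The uncompleted list corresponds to the Shift Report function of identifying activity that could degrade performance during the next service hours.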
 A policy needs to be established as part of the SOPs to
 state how long logs need to be kept, how they are
                                                                 6.5 APPLICATION MANAGEMENT
 archived and when they can be deleted. These policies will
 take into account statutory and compliance requirements.        Application Management is responsible for managing
 Policies should also specify the parameters for adequate        applications throughout their lifecycle. The Application
 storage and backup strategies to store and retrieve             Management function is performed by any department,
 log files.                                                      group or team involved in managing and supporting
                                                                 operational applications. Application Management also
 6.4.5.3 Shift Schedules and Reports                             plays an important role in the design, testing and
 Shift Schedules are documents that outline the exact            improvement of applications that form part of IT services.
 activities that need to be carried out during the shift. They   As such, it may be involved in development projects,
 will also list all dependencies and activity sequences. There   but is not usually the same as the Applications
 will probably be more than one Shift Schedule, where            Development teams.
 each team will have a version for its own systems. It is
 important that all schedules are coordinated before the         6.5.1 Application Management role
 start of the shift. This is usually done by a person who is     Application Management is to applications what Technical
 specialized in Shift Scheduling, with the help of               Management is to the IT Infrastructure. Application
 scheduling tools.                                               Management plays a role in all applications, whether
                                                                 purchased or developed in-house. One of the key
 A Shift Schedule could consist of a number of routine
                                                                 decisions that they contribute to is the decision of
 items that are included in the SOP. In this case the items
                                                                 whether to buy an application or build it (this is discussed
                                                                 in detail in the Service Design publication). Once that

decision is made, Application Management will play           These objectives are achieved through:
a dual role:
                                                             ■ Applications that are well designed, resilient and
■ It is the custodian of technical knowledge and                 cost-effective
  expertise related to managing applications. In this role   ■ Ensuring that the required functionality is available to
  Application Management, working together with                  achieve the required business outcome
  Technical Management, ensures that the knowledge           ■ The organization of adequate technical skills to
  required to design, test, manage and improve IT              maintain operational applications in optimum
  services is identified, developed and refined.               condition
■ It provides the actual resources to support the ITSM       ■ Swift use of technical skills to diagnose and
  Lifecycle. In this role, Application Management ensures      resolve any technical failures that do occur.
  that resources are effectively trained and deployed to
  design, build, transition, operate and improve the         6.5.3 Application Management principles
  technology required to deliver and support IT services.
By performing these two roles, Application Management is     6.5.3.1 Build or buy?
able to ensure that the organization has access to the       One of the key decisions in Application Management is
right type and level of human resources to manage            whether to buy an application that supports the required
applications and thus to meet business objectives. This      functionality, or whether to build the application
starts in Service Strategy and is expanded in Service        specifically for the organization’s requirements. These
Design, tested in Service Transition and refined in          decisions are often made by a Chief Technical Officer
Continual Service Improvement (see other ITIL publications   (CTO) or Steering Committee, but they are dependent
in this series).                                             on information from a number of sources. These are
                                                             discussed in detail in Service Design, but are
Part of this role is to ensure a balance between the skill
                                                             summarized here from an Application Management
level and the cost of these resources.
                                                             function perspective.
In addition to these two high-level roles, Application
                                                             Application Management will assist in this decision during
Management also performs the following two
                                                             Service Design as follows:
specific roles:
                                                             ■ Application sizing and workload forecasts
■ Providing guidance to IT Operations about how best
                                                                 (see section 4.6.4)
  to carry out the ongoing operational management of
                                                             ■   Specification of manageability requirements
  applications. This role is partly carried out during the
                                                             ■   Identification of ongoing operational costs
  Service Design process, but it is also a part of
  everyday communication with IT Operations                  ■   Data access requirements for reporting or integration
  Management as they seek to achieve stability and               into other applications
  optimum performance.                                       ■   Investigating to what extent the required functionality
■ The integration of the Application Management                  can be met by existing tools – and how much
  Lifecycle into the ITSM Lifecycle. This is discussed           customization will be required to achieve this
  below.                                                     ■   Estimating the cost of customization
                                                             ■   Identifying what skills will be required to support the
The objectives, activities and structures that enable
                                                                 solution (e.g. if an application is purchased, will it
Application Management to play these roles effectively are
                                                                 require a new set of employees, or can existing
discussed below.
                                                                 employees be trained to support it?)
                                                             ■   Administration requirements
6.5.2 Application Management objectives
                                                             ■   Security requirements.
The objectives of Application Management are to support
the organization’s business processes by helping to          If the decision is to build the application, a further
identify functional and manageability requirements for       decision needs to be made on whether the development
application software, and then to assist in the design and   will be outsourced or built using employees. This is
deployment of those applications and the ongoing             detailed in the Service Strategy and Service Design
support and improvement of those applications.               publications, but there are some important considerations
                                                             affecting Service Operation, for example:



 ■ How will manageability requirements be specified and         This should not replace the SDLC, which is still a valid
      agreed (e.g. designing application and transaction        approach used by developers, especially by third-party
      monitoring)? These are sometimes forgotten when the       software companies. However, it does mean that there
      operational teams or departments are not represented      should be greater alignment between the development
      in the project                                            view of applications and the ‘live’ management of those
 ■    What are the Acceptance Criteria for operational          applications.
      performance; how and where will the solution be           This is more difficult in large-scale purchased applications,
      tested and who will perform the tests?                    such as e-mail, since the developers do not typically
 ■    Who will own and manage the Definitive Library for        interact individually with their application’s users.
      that application?                                         However, the basic lifecycle still holds true in that the
 ■    Who will design and maintain the operational              application needs requirements, design, customization,
      management and administration scripts for these           operation and deployment. Optimization is achieved
      applications?                                             through better management, improvements to
 ■    Who is responsible for environment set-up and             customization and upgrades.
      owning and maintaining the different infrastructure
                                                                The Application Management Lifecycle is illustrated as
      components?
                                                                follows:
 ■    How will the solution be instrumented so that it is
      capable of generating the required events?
                                                                [Figure: a cycle of six phases: Requirements, Design, Build, Deploy, Operate, Optimize]
 6.5.3.2 Operational Models
 An Operational Model is the specification of the
 operational environment in which the application will
 eventually run when it goes live. This will be used during
 testing and transition phases to simulate and evaluate the
 live environment. This is a way of ensuring that the
 application can be sized correctly and the required
 environmental conditions can be documented and
 understood by all. The Operational Model should be
 defined and used in testing during the Service Design and
 Service Transition phases respectively (see Service Design
 and Service Transition publications).

 6.5.4 Application Management Lifecycle
 The lifecycle followed to develop and manage applications
 has been referred to by many names, including the
 Software Lifecycle (SLC) and Software Development
 Lifecycle (SDLC). These are generally used by Applications
 Development teams and their Project Managers to define         Figure 6.5 Application Management Lifecycle
 their involvement in designing, building, testing,
                                                                ITSM processes and Applications Development processes
 deploying and supporting applications. Examples of these
                                                                have to be aligned as part of the overall strategy of
 approaches are Structured Systems Analysis and Design
                                                                delivering IT services in support of the business.
 Methodology (SSADM), Dynamic Systems Development
 Method (DSDM), Rapid Application Development (RAD),            Applications Development and Operations are part of the
 etc.                                                           same overall lifecycle and both should be involved at all
                                                                stages, although their level of involvement will vary
 ITIL is primarily interested in the overall management
                                                                depending on the stage of the lifecycle.
 of applications as part of IT Services, whether they are
 developed in-house or purchased from a third party.
 For this reason, the term Application Management
 Lifecycle has been used, as it implies a more holistic view.
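As a minimal sketch, the six phases of Figure 6.5 can be modelled as a
circular sequence in which the last phase leads back to the first. The
Python names used here (Phase, next_phase, one_cycle) are illustrative
assumptions and are not part of ITIL.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    """The six phases shown in Figure 6.5, in lifecycle order."""
    REQUIREMENTS = 0
    DESIGN = 1
    BUILD = 2
    DEPLOY = 3
    OPERATE = 4
    OPTIMIZE = 5


def next_phase(phase: Phase) -> Phase:
    """Return the phase that follows `phase`; Optimize wraps to Requirements."""
    return Phase((phase.value + 1) % len(Phase))


def one_cycle(start: Phase) -> List[Phase]:
    """List every phase once, beginning at `start` and following the circle."""
    phases = [start]
    current = next_phase(start)
    while current is not start:
        phases.append(current)
        current = next_phase(current)
    return phases
```

Because the sequence wraps around, an application that has reached
Optimize can re-enter Requirements for its next iteration.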

Relationship between the Application
Management and Service Management Lifecycles

The Application Management Lifecycle should not be
seen as an alternative to the Service Management
Lifecycle. Applications are part of services and have to
be managed as such. Nevertheless, applications are a
unique blend of technology and functionality and this
requires a specialized focus at each stage of the
Service Management Lifecycle.

Each stage of the Application Management Lifecycle
has its own specific set of objectives, activities,
deliverables and dedicated teams. Each stage also has
a clear responsibility to ensure that its outputs
match up to the specific objectives of the Service
Management Lifecycle. Different aspects of
Application Management are covered in detail in each
of the ITIL publications, as follows:
■ Service Strategy: Defines the overall architecture
  of applications and infrastructure. This will include
  defining the criteria for developing in-house,
  outsourcing development, or purchasing and
  customizing applications. Service Strategy will also
  assist in defining the Service Portfolio (including
  applications) which also includes information
  about the Return on Investment of applications
  and the services they support. Thus high-level
  requirements are set during this phase.
■ Service Design: Helps to establish requirements
  for functionality and manageability of applications
  and works with Development teams to ensure
  that they meet these objectives. Service Design
  covers most of the Requirements phase and is
  involved during the Build phase of the Application
  Management Lifecycle.
■ Service Transition: Application Development and
  Management teams are involved in testing and
  validating what has been built and deploying it
  operationally.
■ Service Operation: This covers the Operate phase
  of the Application Management Lifecycle. These
  processes and structures are discussed in detail in
  this publication.
■ Continual Service Improvement: Covers the
  Optimize phase of the Application Management
  Lifecycle. Continual Service Improvement
  measures the quality and relevance of applications
  in operation and provides recommendations on
  how to improve applications if there is a clear
  Return on Investment for doing so.

6.5.4.1 Requirements
This is the phase during which the requirements for a new
application are gathered, based on the business needs of
the organization. This phase is active primarily during the
Service Design phase of the ITSM Lifecycle.

There are six types of requirements for any application,
whether being developed in-house, outsourced or
purchased:
■ Functional requirements are those specifically required
  to support a particular business function
■ Manageability requirements, looked at from a Service
  Management perspective, address the need for a
  responsive, available and secure service, and deal with
  such issues as deployment, operations, system
  management and security
■ Usability requirements are those that address the
  needs of the end user, and result in features of the
  system that facilitate its ease of use
■ Architectural requirements, especially if this requires a
  change to existing architecture standards
■ Interface requirements, where there are dependencies
  between existing applications or tools and the new
  application
■ Service Level Requirements, which specify how the
  service should perform, the quality of its output and
  any other qualitative aspects measured by the user or
  customer.

6.5.4.2 Design
This is the phase during which requirements are translated
into specifications. Design includes the design of the
application itself, and the design of the environment, or
operational model that the application has to run on.
Architectural considerations are the most important aspect
of this phase, since they can impact on the structure and
content of both application and operational model.
Architectural considerations for the application (design of
the application architecture) and architectural
considerations for the operational model (design of the
system architecture) are strongly related and need to be
aligned.

In the case of purchased software, most organizations will
not be allowed direct input to the design of the software
(which has already been built). However, it is important
that Application Management is able to provide feedback
to the software vendor about the functionality,
manageability and performance of the software. This will,
in turn, be taken up by the software vendor as part of the
continual improvement of the software.
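The six requirement types listed in section 6.5.4.1 can serve as a
completeness checklist while requirements are being gathered. The
following Python sketch shows one possible way to do that. The names
RequirementType, Requirement and missing_types are hypothetical, and
ITIL does not prescribe any such tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class RequirementType(Enum):
    """The six requirement types named in section 6.5.4.1."""
    FUNCTIONAL = "Functional"
    MANAGEABILITY = "Manageability"
    USABILITY = "Usability"
    ARCHITECTURAL = "Architectural"
    INTERFACE = "Interface"
    SERVICE_LEVEL = "Service Level"


@dataclass
class Requirement:
    kind: RequirementType
    description: str  # free-text statement of the requirement


def missing_types(requirements: List[Requirement]) -> Set[RequirementType]:
    """Return the requirement types for which nothing has been recorded yet."""
    covered = {r.kind for r in requirements}
    return set(RequirementType) - covered
```

A review of a requirements document could call missing_types to flag,
for example, that manageability requirements have not yet been captured
alongside the functional ones.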



Part of the evaluation process for purchased software
should include an evaluation of whether the vendor is
responsive to such feedback. At the same time, they
should ensure that there is a balance between being
responsive and changing their software so much that it is
disruptive or that it changes some basic functionality.

Design for purchased software will also include the design
of any customization that is required. Of special
importance here is an evaluation of whether future versions
of the software will support the customization.

6.5.4.3 Build
In the Build phase, both the application and the
operational model are made ready for deployment.
Application components are coded or acquired, integrated
and tested.

Please note that Test is not a separate stage in the
lifecycle, even though it is a discrete activity, and even
though tests are conducted independently of both the
development and operational activities. Without the Build
and Deploy phases, there would be nothing to test and,
without testing, there would be no control over what is
developed and deployed.

Testing is an integral component of both the Build and
Deploy phases as a validation of the activity and output of
those phases – even if it uses different environments and
staff. Testing in the Build phase focuses on whether the
application meets its functionality and manageability
specifications. Often the distinction is made between a
development and test environment. The test environment
allows for testing the combination of application and
operational model. Testing is covered in the ITIL Service
Transition publication.

For purchased software, this will involve the actual
purchase of the application, any required middleware and
the related hardware and networking equipment. Any
customization that is required will need to be done here,
as will the creation of tables, categories, etc. that will be
used. This is often done as a pilot implementation by the
relevant Application Management team or department.

6.5.4.4 Deploy
In this phase, both the operational model and the
application are deployed. The operational model is
incorporated in the existing IT environment and the
application is installed on top of the operational model,
using the Release and Deployment Management process
described in the ITIL Service Transition publication.

Testing also takes place during this phase, although here
the emphasis is on ensuring that the deployment process
and mechanisms work effectively, e.g. testing whether the
application still functions to specification after it has been
downloaded and installed. This is known as Early Life
Support and covers a pre-defined guarantee period during
which testing, validation and monitoring of a new
application or service take place. Early Life Support is
covered in detail in the Service Transition publication.

6.5.4.5 Operate
In the Operate phase, the IT services organization operates
the application as part of delivering a service required by
the business. The performance of the application in
relation to the overall service is measured continually
against the Service Levels and key business drivers. It is
important to recognize that applications themselves do
not equate to a service. It is common in many
organizations to refer to applications as ‘services’;
however, applications are but one component of many
needed to provide a business service.

The Operate phase is not exclusive to applications and is
discussed throughout this publication, with a more
detailed list of activities given in section 6.5.5 below.

6.5.4.6 Optimize
In the Optimize phase, the results of the Service Level
performance measurements are analysed and
acted upon. Possible improvements are discussed and
developments initiated if necessary. The two main
strategies in this phase are to maintain and/or improve the
Service Levels and to lower cost. This could lead to
iteration in the lifecycle or to justified retirement of an
application.

One important thing to remember about the Application
Management Lifecycle is that, because it is circular, the
same application can reside in different phases of the
lifecycle at the same time. For example, when the next
version of an application is being designed, and the
current version is being deployed, the previous version
might still be in operation in parts of an organization. This
obviously requires strong version, configuration and
release control.

Particular phases might take longer or seem more
significant than others, but they are all crucial. Every
application must go through all of them at least once and,
because of the circular nature of the lifecycle, will go
through some more than once.

This approach also supports iterative development
approaches, where software is continually being

developed in incremental steps. Each step follows the
lifecycle and the application is built in increments, using
business priorities as a driver.

Good communication is the key as an application works its
way through the phases of the lifecycle. It is critical that
high-quality information is passed along by those handling
the application in one phase of its existence to those
handling it in the next phase. It is also important that an
organization monitors the quality of the Application
Management Lifecycle. Changes in the lifecycle, for
example in the way an organization passes information
between the different phases, will affect its quality.
Understanding the characteristics of every phase in the
Application Management Lifecycle is crucial to improving
the quality of the whole. Methods and tools used in one
phase might have an impact on others, while optimization
of one phase might sub-optimize the whole.

6.5.5 Application Management generic activities
While most Application Management teams or
departments are dedicated to specific applications or sets
of applications, there are a number of activities which they
have in common. These include:

■ Identifying the knowledge and expertise required to
  manage and operate applications in the delivery of IT
  services. This process starts during the Service Strategy
  phase, is expanded in detail in Service Design and is
  executed in Service Operation. Ongoing assessment
  and updating of these skills are done during Continual
  Service Improvement.
■ Initiating training programmes to develop and refine
  the skills in the appropriate Application Management
  resources and maintaining training records for
  these resources.
■ Recruiting or contracting resources with skills that
  cannot be developed internally, or where there are
  insufficient people to perform the required Application
  Management activities.
■ Design and delivery of end-user training. Training may
  be developed and delivered by either the Application
  Development or Application Management groups, or
  by a third party, but Application Management is
  responsible for ensuring that training is conducted
  as appropriate.
■ Insourcing for specific activities where the required
  skills are not available internally or in the open market,
  or where it is more cost-efficient to do so.
■ Definition of standards used in the design of new
  architectures and participation in the definition of
  application architectures during the Service Strategy
  processes.
■ Research and Development of solutions that can help
  expand the Service Portfolio or which can be used to
  simplify or automate IT Operations, reduce costs or
  increase levels of IT service.
■ Involvement in the design and building of new
  services. All Application Management teams or
  departments will contribute to the design of the
  Technical Architecture and Performance standards for
  IT Services. In addition they will also be responsible for
  specifying the operational activities required to
  manage applications on an ongoing basis.
■ Involvement in projects, not only during the Service
  Design process, but also for Continual Service
  Improvement or operational projects, such as
  Operating System upgrades, server consolidation
  projects or physical moves.
■ Designing and performing tests for the functionality,
  performance and manageability of IT Services (bearing
  in mind that testing should be controlled and
  performed by an independent tester – see Service
  Transition publication).
■ Availability and Capacity Management are dependent
  on Application Management for contributing to the
  design of applications to meet the levels of service
  required by the business. This means that modelling
  and workload forecasting are often done together with
  Technical and Application Management resources.
■ Assistance in assessing risk, identifying critical service
  and system dependencies and defining and
  implementing countermeasures.
■ Managing vendors. Many Application Management
  departments or groups are the only ones who know
  exactly what is required of a vendor and how to
  measure and manage them. For this reason, many
  organizations rely on Application Management to
  manage contracts with vendors of specific
  applications. If this is the case it is important to ensure
  that these relationships are managed as part of the
  SLM process.
■ Involvement in definition of Event Management
  standards and especially in the instrumentation of
  applications for the generation of meaningful events.
■ Application Management as a function provides the
  resources that execute the Problem Management
  process. It is their technical expertise and knowledge
  that is used to diagnose and resolve problems. It is
  also their relationship with the vendors that is used to
  escalate and follow up with vendor support teams or
  departments.



■ Application Management resources will be involved in
  defining coding systems that are used in Incident and
  Problem Management (e.g. Incident Categories).
■ Application Management resources are used to
  support Problem Management in validating and
  maintaining the KEDB together with the Application
  Development teams.
■ Change Management relies on the technical
  knowledge and expertise of Application Management
  to evaluate changes, and many changes will be built by
  Application Management teams.
■ Successful Release Management is dependent on
  involvement from Application Management staff. In
  fact they are frequently the drivers of the Release
  Management process for their applications.
■ Application Management will define, manage and
  maintain attributes and relationships of application CIs
  in the CMS.
■ Application Management is involved in the Continual
  Service Improvement processes, particularly in
  identifying opportunities for improvement and then in
  helping to evaluate alternative solutions.
■ Application Management ensures that all system and
  operating documentation is up to date and properly
  utilized. This includes ensuring that all design,
  management and user manuals are up to date and
  complete and that Application Management staff and
  users are familiar with their contents.
■ Collaboration with Technical Management on
  performing Training Needs Analysis and maintaining
  Skills Inventories.
■ Assisting IT Financial Management to identify the cost
  of the ongoing management of applications.
■ Involvement in defining the operational activities
  performed as part of IT Operations Management. Many
  Application Management departments, groups or
  teams also perform the operational activities as part of
  an organization’s IT Operations Management function.
■ Input into, and maintenance of, software configuration
  policies.
■ Together with Software Development teams, the
  definition and maintenance of documentation related
  to applications. These will include user manuals,
  administration and management manuals, as well as
  any SOPs required to manage operational aspects of
  the application.

Application Management teams or departments will be
needed for all key applications. The exact nature of the
role will vary depending upon the applications being
supported, but generic responsibilities are likely to include:
■ Third-level support for incidents related to the
  application(s) covered by that team or department
■ Involvement in operation testing plans and
  deployment issues
■ Application bug tracking and patch management
  (coding fixes for in-house code, transports/patches for
  third-party code)
■ Involvement in application operability and
  supportability issues such as error code design, error
  messaging, event management hooks
■ Application sizing and performance; volume metrics
  and load testing etc. This is in support of Capacity and
  Availability Management processes
■ Involvement in developing Release Policies
■ Identification of enhancements to existing software,
  both from a functionality and manageability
  perspective.

6.5.6 Application Management organization
Although all Application Management departments,
groups or teams perform similar activities, each application
or set of applications has a different set of management
and operational requirements. Examples of these
differences include:

■ The purpose of the application. Each application
  was developed to meet a specific set of objectives,
  usually business objectives. For effective support and
  improvement, the group that manages that
  application needs to have a comprehensive
  understanding of the business context and how the
  application is used to meet its objectives. This is often
  achieved by Business Analysts who are close to the
  business and responsible for ensuring that business
  requirements are effectively translated into application
  specifications. Business Analysts should recognize that
  business requirements must be translated into both
  functional and manageability specifications.
■ The functionality of the application. Each
  application is designed to work in a different way and
  to perform different functions at different times.
■ The platform on which the application runs.
  Although the platform is usually managed by a
  Technical Management team or department, each
  platform affects the way in which an application needs
  to be managed and operated.
■ The type or brand of technology used. Even
  applications that have similar functionality operate
  differently on different databases or platforms. These
  differences have to be understood in order to manage
  the application effectively.

Even though the activities to manage these applications
are generic, the specific schedule of activities and the way
they are performed will be different. For this reason,
Application Management teams and departments tend to
be organized according to the categories of applications
that they support. Typical examples of Application
Management organizations include:
■ Financial applications. In larger organizations where a
  number of different applications are used for different
  aspects of Financial Management, there may be
  several departments, groups or teams managing these
  applications, e.g. Debtors and Creditors, Age Analysis,
  General Ledger, etc.
■ Messaging and collaboration applications
■ HR applications
■ Manufacturing support applications
■ Sales force automation
■ Sales order processing applications
■ Call centre and marketing applications
■ Business-specific applications (e.g. health care,
  insurance, banking, etc.)
■ IT applications, such as Service Desk, Enterprise System
  Management, etc.
■ Web portals
■ Online shopping.

6.5.6.1 Organizational roles
Traditionally, Application Development and Management
teams and departments have been autonomous units.
Each one manages its own environment in its own way
and each has a separate interface to the business. This is
illustrated in Table 6.2.

Table 6.2 Organizational roles
                            Application Development                                 Application Management
 Primary focus              Building functionality for their customer. What the     Focus on what the functionality is as well as
                            application does is more important to them than         how to deliver it.
                            how it is operated
                                                                                    Manageability aspects of the application,
                                                                                    i.e. how to ensure stability and performance
                                                                                    of the application

 Management mode            Most development work is done in projects where         Most work is done as part of repeatable,
                            the focus is on delivering specific units of work to    ongoing processes. A relatively small number
                            specification, on time and within budget.               of people work in projects.
                            This means that it is often difficult for developers    This means that it is very difficult for
                            to understand and build for ongoing operations,         operational staff to get involved in
                            especially since they are not available for support     development projects, as that takes them away
                            of the application once they have moved on to           from their ‘real jobs’
                            the next project

 Measurement                Staff are rewarded for creativity and for completing    Staff are rewarded for consistency and for
                            one project so that they can move on to the next        preventing unexpected events and
                            project                                                 unauthorized functionality (e.g. ‘bells and
                                                                                    whistles’ added by developers)

 Cost                       Development projects are relatively easy to             Ongoing management costs are often mixed in
                            quantify since the resources are known and it is        with the costs of other IT services since
                            easy to link their expenses to a specific application   resources are often shared across multiple IT
                            or IT Service                                           services and applications

 Lifecycles                 Development staff focus on Software                     Staff involved in ongoing management
                            Development Lifecycles, which highlight the             typically only control one or two phases of
                            dependencies for successful operation, but do not       these lifecycles – Operation and Improvement
                            assign accountability for these



Over the last several years, these two worlds have been
brought together by moves to Object Oriented and
SOA approaches, together with growing pressure from the
Business to be more responsive and easy to work with.
This means that Application Development will have
greater accountability for the successful operation of
applications they design, while Application Management
will have greater involvement in the development
of applications.

This does not change the fundamental role of each group,
but it does require a more integrated approach to the SLC.
It will also mean that the output of Application
Development will be more commoditized and that
Application Management will be more involved in
Development projects.

This will require the following changes:
■ A single interface to the business for all stages of the
  lifecycle and a common requirements and
  specification-setting process.
■ A change in how both Development and Management
  staff are measured. Development teams should be
  held partly accountable for design flaws that create
  operational outages. Management staff should be held
  partly accountable for contribution to the technical
  architecture and manageability design of applications.
■ A single Change Management process for both
  groups, with Change Control in each group being
  subordinate to the overall authority of Change
  Management (see Service Transition publication).
■ A clear mapping of Development and Management
  activities in the lifecycle, which is illustrated at a high
  level in Figure 6.5. The exact activities and how they
  interact should be defined in each organization,
  although some generic guidelines are given in each
  of the ITIL publications.
■ Greater focus on integrating functionality and
  manageability requirements early in the project.

Figure 6.6 Role of teams in the Application
Management Lifecycle
[Figure: the phases Requirements, Design, Build and Test, Deploy,
Operate and Optimize arranged in a circle around IT Service
Management (Strategy, Design, Transition and Improvement), with a
legend distinguishing Application Development from Application
Management]

6.5.7 Application Management roles and
responsibilities

6.5.7.1 Applications Managers/Team-leaders
An Applications Manager or Team-leader (depending upon
the size and/or importance of the team or department
and the application they support, and the organization’s
structure and culture) will be needed for each of the
applications teams or departments. The role will:
■ Take overall responsibility for leadership, control and
  decision-making for the applications team or
  department
                                                                    ■   Provide technical knowledge and leadership in the
 Figure 6.6 shows a common Application Management                       specific applications support activities covered by the
 Lifecycle with involvement from both groups. In this                   team or department
 diagram it is clear that Application Development will be           ■   Ensure necessary technical training, awareness and
 driving some phases with input from Application                        experience levels are maintained within the team or
 Management. In other cases Application Management will                 department relevant to the applications being
 be driving the phase with input and support from                       supported and processes being used
 Application Development. Both groups are subordinated              ■   Involve ongoing communication with users and
 to the IT Service Strategy of the organization and their               customers regarding application performance and
 efforts are coordinated through Service Transition                     evolving requirements of the business
 mechanisms and processes.
                                                                    ■   Report to senior management on all issues relevant to
                                                                        the applications being supported
                                                                               Organizing for Service Operation |   137

■ Perform line-management for all team or department
  members.

6.5.7.2 Applications Analyst/Architect
Application Analysts and Architects are responsible for
matching requirements to application specifications.
Specific activities include:
■ Working with users, sponsors and all other
  stakeholders to determine their evolving needs
■ Working with Technical Management to determine the
  highest level of system requirements required to meet
  the business requirements within budget and
  technology constraints
■ Performing cost-benefit analyses to determine the
  most appropriate means to meet the stated
  requirement
■ Developing Operational Models that will ensure
  optimal use of resources and the appropriate level
  of performance
■ Ensuring that applications are designed to be
  effectively managed given the organization’s
  technology architecture, available skills and tools
■ Developing and maintaining standards for application
  sizing, performance modelling, etc.
■ Generating a set of acceptance test requirements,
  together with the designers, test engineers and the
  user, which determine that all of the high-level
  requirements have been met, both functional
  and with regard to manageability
■ Input into the design of configuration data required
  to manage and track the application effectively.

An appropriate number of Application Analysts will be
needed for each of the Application Management teams or
departments to perform the generic activities described in
paragraph 6.5.5.

The ways in which Application Management groups can
be organized, and the options available, are discussed in
some detail in section 6.7 below.

6.5.8 Application Management metrics
Metrics for Application Management will largely depend
on which applications are being managed, but some
generic metrics include:
■ Measurement of agreed outputs. These could
  include:
  ● Ability of users to access the application and its
    functionality
  ● Reports and files are transmitted to the users
  ● Transaction rates and availability for critical
    business transactions
  ● Service Desk training
  ● Recording problem resolutions into the KEDB
  ● User measures of the quality of outputs as defined
    in the SLAs.
■ Process metrics. Technical Management teams
  execute many Service Management process activities.
  Their ability to do so will be measured as part of the
  process metrics where appropriate (see section on
  each process for more details). Examples include:
  ● Response time to events and event completion
    rates
  ● Incident resolution times for second- and third-line
    support
  ● Problem resolution statistics
  ● Number of escalations and reason for those
    escalations
  ● Number of changes implemented and backed out
  ● Number of unauthorized changes detected
  ● Number of releases deployed, total and successful,
    including ensuring adherence to the Release
    Policies of the organization
  ● Security issues detected and resolved
  ● Actual system utilization against Capacity Plan
    forecasts (where the team has contributed to the
    development of the plan)
  ● Tracking against SIPs
  ● Expenditure against budget.
■ Application performance. These metrics are based
  on Service Design specifications and technical
  performance standards set by vendors and will
  typically be contained in OLAs or SOPs. Actual metrics
  will vary by application, but are likely to include:
  ● Response times
  ● Application availability, which is helpful for
    measuring team or application performance but is
    not to be confused with Service Availability –
    which requires the ability to measure the overall
    availability of the service, and may use the
    availability figures for a number of individual
    systems or components
  ● Integrity of data and reporting.
■ Measurement of maintenance activity, including:
  ● Maintenance performed per schedule
  ● Number of maintenance windows exceeded
  ● Maintenance objectives achieved (number and
    percentage).
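The distinction drawn above between application availability and Service Availability, and the "number and percentage" form of the maintenance measure, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the service figure assumes that the service needs every component and that component failures are independent, which the publication does not prescribe.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One application or system a service depends on (hypothetical model)."""
    name: str
    agreed_minutes: int    # agreed service time in the reporting period
    downtime_minutes: int  # unplanned downtime within the agreed time

    def availability(self) -> float:
        """Fraction of agreed time for which the component was available."""
        return (self.agreed_minutes - self.downtime_minutes) / self.agreed_minutes


def service_availability(components: list[Component]) -> float:
    """Estimate availability of a service that needs every component.

    Assumption: failures are independent and the loss of any one
    component takes the service down, so the availabilities multiply.
    """
    result = 1.0
    for component in components:
        result *= component.availability()
    return result


def objectives_achieved(results: list[bool]) -> tuple[int, float]:
    """Return the number and percentage of maintenance objectives met.

    Expects a non-empty list with one True/False entry per objective.
    """
    met = sum(results)
    return met, 100.0 * met / len(results)


app = Component("order-entry", agreed_minutes=10_000, downtime_minutes=100)
db = Component("database", agreed_minutes=10_000, downtime_minutes=200)
print(round(app.availability(), 4))                    # 0.99
print(round(service_availability([app, db]), 4))       # 0.9702
print(objectives_achieved([True, True, False, True]))  # (3, 75.0)
```

In this example the application alone reports 99% availability, while the service that depends on both components is estimated at about 97%, which is why the two figures should not be reported interchangeably.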



■ Application Management teams are likely to work
  closely with Application Development teams on
  projects, and appropriate metrics should be used to
  measure this, including:
  ● Time spent on projects
  ● Customer and user satisfaction with the output of
    the project
  ● Cost of involvement in the project.
■ Training and skills development. These metrics
  ensure that staff have the skills and training to
  manage the technology that is under their control,
  and will also identify areas where training is
  still required.

6.5.9 Application Management documentation
A number of documents are produced and used during
Application Management. This list is a summary of some
of the most important and does not include reports or
documents that are produced by Application Management
on behalf of other processes or functions (e.g. RFC, Known
Error documentation, Release Records, etc.). Note that
documents should be controlled as CIs and related to the
relevant applications or Application Management teams.

6.5.9.1 Application Portfolio
The Application Portfolio is used primarily as part of
Service Strategy, but is referenced here for completeness.
The Application Portfolio is a list (more accurately a system
or database) of all applications in use within the
organization, together with the following information:
■ Key attributes of the application
■ Customers and users
■ Business purpose
■ Level of business criticality
■ Architecture (including the IT Infrastructure
  dependencies)
■ Developers, support groups, suppliers or vendors
■ The investment made in the application to date. In
  this respect the Application Portfolio can be used as
  an asset register for applications.

The purpose of the Application Portfolio is to analyse the
need for and use of applications in the organization. It can
be used to link functionality and investment to business
activity and is therefore an important part of ongoing IT
planning and control. Another benefit of the Application
Portfolio is that it can be used to identify duplication and
excessive licensing of applications.

The Application Portfolio forms part of the overall IT
Service Portfolio, which is discussed in detail in the Service
Strategy publication.

      The Application Portfolio and the Service Catalogue
      The Application Portfolio should not be mistaken for
      the Service Catalogue and should not be advertised
      as a list of services to customers or users. Applications
      are one of the components used to provide IT
      services, usually not the service itself.

      The Application Portfolio should therefore be used as
      a planning document only by those managers and
      staff who are involved with the development and
      management of the organization’s IT Strategy, as well
      as IT staff who are tasked with managing the
      applications or the platforms on which the
      applications run.

      The Service Catalogue should focus on listing the
      services that are available, rather than simply listing
      applications and assuming that users and customers
      can make the link. Having said that, there are times
      when the application is synonymous with the service,
      e.g. word-processing applications are typically known
      by their name; an application hosting service will
      mention the names of the applications hosted, etc.

6.5.9.2 Application Requirements
There are two sets of documents containing requirements
for applications:
■ Business Requirements outline the Business Case for
  the required application, in other words what the
  business will do with the application. This will include
  the Return on Investment for the application as well as
  all related improvements to the business. Business
  requirements will also include the Service Level
  Requirements as defined by the service customers and
  users.
■ Application Requirements documents are based on
  the Business Requirements and specify exactly how
  the application will meet those requirements. In short,
  Application Requirements documents gather
  information that will be used to commission new
  applications or changes to existing applications, for
  example:
  ● To design the architecture of the application
    (specification of the different components of the
    system, how they relate to one another and how
    they will be managed)

  ● To specify a Request for Proposal (RFP) for a
    Commercial Off-the-Shelf (COTS) application
  ● To initiate the design and building of an
    application in-house.

Requirements documents are normally owned by a project
leader, either of a development project team, or for a
team drawing up specifications for an RFP. Requirements
documents are subject to document control for the
project as they form part of the overall scope of the
project.

Four different types of Application Requirements need to
be defined (for more detailed information, please refer to
the ITIL Service Design and Service Transition publications):
■ Functional Requirements describe the things an
  application is intended to do, and can be expressed as
  services, tasks or functions the application is required
  to perform.
■ Manageability Requirements are used to define
  what is needed to manage the application or to
  ensure that it performs the required functions
  consistently and at the right level. Manageability
  requirements also identify constraints on the IT system.
  These requirements serve as a basis for early system
  sizing and estimates of cost, and can support the
  assessment of the viability of the proposed IT system.
  Most importantly, they drive design of the operational
  models and performance standards used in IT
  Operations Management.
■ Usability Requirements are normally specified by the
  users of the application and refer to its ease of use.
  Any special requirements for handicapped users also
  need to be specified here.
■ Test Requirements specify what is required to ensure
  that the test environment is representative of the
  operational environment and that the test is valid (i.e.
  that it actually tests what it is supposed to).

6.5.9.3 Use and Change Cases
Use and Change Cases are managed as part of the Service
Design and Continual Service Improvement processes, but
are maintained by Application Management. For
purchased software, it is common for the team that
develops the functional specifications to maintain the Use
Case for that application.
■ Use Cases document the intended use of the
  application with real-life scenarios to demonstrate its
  boundaries and its full functionality. Use Cases can
  also be used as modelling and sizing scenarios and for
  facilitating communication between users, Developers
  and Application Management staff.
■ Change Cases use scenarios to predict the impact of
  potential changes to utilization, architecture or
  functionality, and project the impact of specific
  change scenarios. Change Cases are used to clarify
  scope and direction with the sponsor. Extra
  architecture and design work will be needed at this
  point to ensure the Change Cases can be met in the
  future at reasonable cost. The sponsor must be
  prepared to pay the extra cost. If not, the Change
  Cases should be reduced to what the sponsor is
  prepared to pay for. Change Cases are also used to
  evaluate the architecture. They influence the
  development process, enabling the design of
  appropriate architectural features to minimize the
  impact of future changes.

For more information, refer to the ITIL Service Design and
Continual Service Improvement publications.

6.5.9.4 Design documentation
This is not one specific document, but refers to any
document produced by Application Development or
Management staff that specifies how an application will be
built. As these documents are generally owned and
managed by the Development teams, this publication will
not cover them in detail. However, to ensure successful
operation, Application Management must ensure that
design documentation contains:
■ Sizing specifications
■ Workload profiles and utilization forecasts
■ Technical Architecture
■ Data models
■ Coding standards
■ Performance standards
■ Software Configuration Management definitions
■ Environment definitions and building considerations (if
  appropriate).

For COTS applications, these documents take the form of
Application Specifications that are used as input into the
writing of RFPs. In these cases the documents are owned
and managed by Application Management.

For more information on Design Documentation, refer to
the ITIL Service Design publication.

6.5.9.5 Manuals
Application Management is responsible for the
management of manuals for all applications. Although
these are normally developed by the Application



Development teams or third party suppliers, Application
Management is responsible for ensuring that the manuals
are relevant to the operational versions of the applications.
Three types of manuals are generally maintained by
Application Management:
■ Design manuals contain information about the
  structure and architecture of the application. These are
  helpful for creating reports or defining event
  correlation rules. They could also help in diagnosing
  problems.
■ Administration or management manuals describe
  the activities required to maintain and operate the
  application at the levels of performance specified in
  the Design phase. These manuals will also provide
  detailed troubleshooting, Known Error and Fault
  descriptions, and step-by-step instructions for common
  maintenance tasks.
■ User manuals describe the application functionality as
  it is used by an end-user. These manuals contain
  step-by-step instructions on how to use the application, as
  well as descriptions of what should typically be
  entered into certain fields, or what to do if there is an
  error.

      Manuals and Standard Operating Procedures
      Manuals should not be seen as a replacement for
      SOPs, but as input into the SOPs.

      SOPs should contain all aspects of applications that
      need to be managed as part of standard operations.
      If they are not extracted from the manuals, there is a
      high likelihood that they will be ignored or
      performed in a non-standard manner. Application
      Management should ensure that any such
      instructions are extracted from the manuals and
      inserted into separate SOP documentation for
      Operations. It is also responsible for ensuring that
      these instructions are updated with every change or
      new release of the software.

6.6 SERVICE OPERATION ROLES AND RESPONSIBILITIES
The key to effective ITSM is ensuring that there is clear
accountability and roles defined to carry out the practice
of Service Operation. A role is often tied to a job
description or work group description but does not
necessarily need to be filled by one individual. The size of
an organization, how it is structured, the existence of
external partners and other factors will influence how roles
are assigned. Whether a particular role is filled by a single
individual or shared between two or more, what matters
is the consistency of accountability and execution, along
with the interaction with other roles in the organization.

6.6.1 Service Desk roles
The following roles are needed for the Service Desk.

6.6.1.1 Service Desk Manager
In larger organizations where the Service Desk is of a
significant size, a Service Desk Manager role may be
justified with the Service Desk Supervisor(s) reporting to
him or her. In such cases this role may take responsibility
for some of the activities listed above and may
additionally perform the following activities:
■ Manage the overall desk activities, including the
  supervisors
■ Act as a further escalation point for the supervisor(s)
■ Take on a wider customer-services role
■ Report to senior managers on any issue that could
  significantly impact the business
■ Attend Change Advisory Board meetings
■ Take overall responsibility for incident and Service
  Request handling on the Service Desk. This could also
  be expanded to any other activity taken on by the
  Service Desk – e.g. monitoring certain classes of event.

Note: In all cases, clearly defined job descriptions should
be drafted and agreed so that specific responsibilities are
known.

6.6.1.2 Service Desk Supervisor
In very small desks it is possible that the senior Service
Desk Analyst will also act as the Supervisor – but in larger
desks it is likely that a dedicated Service Desk Supervisor
role will be needed. Where shift hours dictate it, there may
be two or more post-holders who fulfil the role, usually on
an overlapping basis. The Supervisor’s role is likely to
include:
■ Ensuring that staffing and skill levels are maintained
  throughout operational hours by managing shift
  staffing schedules, etc.
■ Undertaking HR activities as needed
■ Acting as an escalation point where difficult or
  controversial calls are received
■ Producing statistics and management reports
■ Representing the Service Desk at meetings
■ Arranging staff training and awareness sessions
■ Liaising with senior management
■ Liaising with Change Management

■ Performing briefings to Service Desk staff on changes         ■ Perform line-management for all team or department
  or deployments that may affect volumes at the Service             members.
  Desk
■ Assisting analysts in providing first-line support when       6.6.2.2 Technical Analysts/Architects
  workloads are high, or where additional experience is         This term refers to any staff member in Technical
  required.                                                     Management who performs the activities listed in
                                                                paragraph 6.3.3, excluding the daily operational actions,
6.6.1.3 Service Desk Analysts                                   which are performed by Operators in either Technical or IT
The primary Service Desk Analyst role is that of providing      Operations Management. Based on the list of generic
first-level support through taking calls and handling the       activities in paragraph 6.3.3, the role of Technical Analysts
resulting incidents or Service Requests using the Incident      and Architects includes:
Reporting and Request Fulfilment processes, in line with        ■ Working with users, sponsors, Application
the objectives described earlier. The exact number of staff         Management and all other stakeholders to determine
required is discussed in paragraph 6.2.4.1.                         their evolving needs
                                                                ■   Working with Application Management and other
6.6.1.4 Super Users                                                 areas in Technical Management to determine the
Super Users are discussed in detail in the section on               highest level of system requirements required to meet
Service Desk staffing in paragraph 6.2.4. In summary, this          the requirements within budget and technology
role will consist of business users who act as liaison points       constraints
with IT in general and the Service Desk in particular. The      ■   Defining and maintaining knowledge about how
role of the Super User can be summarized as follows:                systems are related and ensuring that dependencies
■ To facilitate communication between IT and the                    are understood and managed accordingly
   business at an operational level                             ■   Performing cost-benefit analyses to determine the
■ To reinforce expectations of users regarding what                 most appropriate means to meet the stated
   Service Levels have been agreed                                  requirements
■ Staff training for users in their area                        ■   Developing Operational Models that will ensure
■ Providing support for minor incidents or simple                   optimal use of resources and the appropriate level
  request fulfilment                                                of performance
■ Involvement with new releases and rollouts.                   ■   Ensuring that the infrastructure is configured to be
                                                                    effectively managed given the organization’s
6.6.2 Technical Management roles                                    technology architecture, available skills and tools
                                                                ■   Ensuring the consistent and reliable performance
The following roles are needed in the Technical
Management areas                                                    of the infrastructure to deliver the required level
                                                                    of service to the business
6.6.2.1 Technical Managers/Team-leaders                         ■   Defining all tasks required to manage the
                                                                    infrastructure and ensuring that these tasks are
A Technical Manager or Team-leader (depending upon the
                                                                    performed appropriately
size and/or importance of the team and the organization’s
                                                                ■   Input into the design of configuration data required
structure and culture) may be needed for each of the
                                                                    to manage and track the application effectively.
technical teams or departments. The role will:
                                                                The ways in which Technical Management can be
■ Take overall responsibility for leadership, control and
                                                                organized, and the options available, are discussed in
   decision-making for the technical team or department
                                                                some detail in section 6.7.
■ Provide technical knowledge and leadership in the
  specific technical areas covered by the team or
                                                                6.6.2.3 Technical Operator
  department
■ Ensure necessary technical training, awareness and
                                                                This term is used to refer to any staff who performs day-
  experience levels are maintained within the team or           to-day operational tasks in Technical Management. Usually,
  department                                                    these tasks are delegated to a dedicated IT Operations
                                                                team, and this role is therefore discussed in paragraph
■ Report to senior management on all technical issues
                                                                6.6.3.4 on IT Operators.
  relevant to their area of responsibility

6.6.3 IT Operations Management roles
The following roles are needed in the IT Operations Management area:

6.6.3.1 IT Operations Manager
An IT Operations Manager will be needed to take overall responsibility for all of the IT Operations Management activities, which include:
■ Operations Control, which oversees the execution and monitoring of the operational activities in the IT Infrastructure. This can be done with the assistance of an Operations Bridge or Network Operations Centre. In addition to executing routine tasks from all technical areas, Operations Control also performs the following specific tasks:
   ● Console Management, which refers to defining central observation and monitoring capability and then using those consoles to exercise monitoring and control activities
   ● Job Scheduling, or the management of routine batch jobs or scripts
   ● Backup and Restore on behalf of all Technical and Application Management teams or departments and often on behalf of users
   ● Print and Output management for the collation and distribution of all centralized printing or electronic output.
■ Facilities Management, which refers to the management of the physical IT environment, typically a Data Centre or computer rooms and recovery sites together with all the power and cooling equipment. Facilities Management also includes the coordination of large-scale consolidation projects, e.g. data centre consolidation or server consolidation projects. In some cases the management of a Data Centre is outsourced, in which case Facilities Management refers to the management of the outsourcing contract.

The role of the IT Operations Manager is to:
■ Provide overall leadership, control and decision-making and take responsibility for the IT Operations Management teams and departments
■ Report to senior management on all IT Operations issues
■ Perform line-management for all IT Operations team or department managers/supervisors.

6.6.3.2 Shift Leaders
Many IT Operations areas will work extended hours – on either a two- or three-shift basis. In such cases a shift leader will be needed on each of the shifts, to perform the following activities:
■ Take overall responsibility for leadership, control and decision-making during the shift period
■ Ensure that all operational activities are satisfactorily performed within agreed timescales and in accordance with company policies and procedures
■ Liaise with the other shift leader(s) to ensure handover, continuity and consistency between the shifts
■ Act as line-manager for all Operations Analysts on his/her shift
■ Assume overall health and safety, and security responsibility for the shift (unless specifically designated to other staff members).

6.6.3.3 IT Operations Analysts
IT Operations Analysts are senior IT Operations staff who are able to determine the most effective and efficient way to conduct a series of operations, usually in high-volume, diverse environments.

This role is normally performed as part of Technical Management, but large organizations may find that the volume and diversity of operational activities require some more in-depth planning and execution. Examples include Job Scheduling and the definition of a Backup strategy and schedule.

6.6.3.4 IT Operators
IT Operators are the staff who perform the day-to-day operational activities that are defined by Technical or Application Management and, in some cases, by IT Operations Analysts. Typical Operator roles include:
■ Performing backups
■ Console operations, i.e. monitoring the status of specific systems, job queues, etc. and providing first-level intervention if appropriate
■ Managing print devices, restocking with paper, toner, etc.
■ Ensuring that batch jobs, archiving, etc. are performed
■ Running scheduled housekeeping jobs, such as database maintenance, file clean-up, etc.
■ Burning images for distribution and installation on new servers, desktops or laptops
■ Physical installation of standard equipment in the Data Centre.
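
As an illustration of the scheduled housekeeping jobs listed among the Operator roles above, the following is a minimal sketch of a file clean-up job in Python. It is not part of the ITIL guidance: the function names, the seven-day retention period and the dry-run flag are all assumptions made for this example, and a real job would follow the organization's own SOPs and retention policies.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(directory, max_age_days, now=None):
    """Return files directly under `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def clean_up(directory, max_age_days, dry_run=True, now=None):
    """Return the names of stale files; delete them only when dry_run is False."""
    stale = find_stale_files(directory, max_age_days, now)
    if not dry_run:
        for path in stale:
            path.unlink()
    return [p.name for p in stale]


def demo():
    """Create a throwaway directory with one old and one fresh file, then clean it."""
    with tempfile.TemporaryDirectory() as tmp:
        old, new = Path(tmp, "old.log"), Path(tmp, "new.log")
        old.write_text("x")
        new.write_text("y")
        # Backdate the first file so it falls outside the (assumed) 7-day retention.
        ten_days_ago = time.time() - 10 * SECONDS_PER_DAY
        os.utime(old, (ten_days_ago, ten_days_ago))

        reported = clean_up(tmp, max_age_days=7, dry_run=True)
        still_there = old.exists()
        removed = clean_up(tmp, max_age_days=7, dry_run=False)
        remaining = sorted(p.name for p in Path(tmp).iterdir())
        return reported, still_there, removed, remaining
```

Running the job with dry_run=True first lets an operator review what would be removed before anything is actually deleted.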

6.6.4 Application Management roles

6.6.4.1 Applications Managers/Team-leaders
An Applications Manager or Team-leader should be considered for each of the applications teams or departments. The role will:
■ Take overall responsibility for leadership, control and decision-making for the applications team or department
■ Provide technical knowledge and leadership in the specific applications support activities covered by the team or department
■ Ensure necessary technical training, awareness and experience levels are maintained within the team or department relevant to the applications being supported and processes being used
■ Maintain ongoing communication with users and customers regarding application performance and evolving requirements of the business
■ Report to senior management on all issues relevant to the applications being supported
■ Perform line-management for all team or department members.

6.6.4.2 Applications Analyst/Architect
Application Analysts and Architects are responsible for matching requirements to application specifications. Specific activities include:
■ Working with users, sponsors and all other stakeholders to determine their evolving needs
■ Working with Technical Management to determine the highest level of system requirements needed to meet the requirements within budget and technology constraints
■ Performing cost-benefit analyses to determine the most appropriate means to meet the stated requirement
■ Developing Operational Models that will ensure optimal use of resources and the appropriate level of performance
■ Ensuring that applications are designed to be effectively managed given the organization’s technology architecture, available skills and tools
■ Developing and maintaining standards for application sizing, performance modelling, etc.
■ Generating a set of acceptance test requirements, together with the designers, test engineers and the user, which determine that all of the high-level requirements have been met, both functional and with regard to manageability
■ Input into the design of configuration data required to manage and track the application effectively.

An appropriate number of Application Analysts will be needed for each of the Application Management teams or departments to perform the activities described elsewhere in this publication, primarily in paragraph 6.5.5.

The ways in which Application Management groups can be organized, and the options available, are discussed in some detail in section 6.7.

6.6.5 Event Management roles
It is unusual for an organization to appoint an ‘Event Manager’, as events tend to occur in multiple contexts and for many different reasons. However, it is important that Event Management procedures are coordinated to prevent duplication of effort and tools. The roles of the Service Operation functions in Event Management are as follows.

6.6.5.1 The role of the Service Desk
The Service Desk is not typically involved in Event Management as such, unless an event requires some response that is within the scope of the Service Desk’s defined activity, for example notifying a user that a report is ready. Generally, though, this type of activity is performed by the Operations Bridge, unless the Service Desk and Operations Bridge have been combined.

The investigation and resolution of events that have been identified as being Incidents will initially be undertaken by the Service Desk and then escalated to the appropriate Service Operation team(s).

The Service Desk is also responsible for communicating information about this type of incident to the relevant Technical or Application Management team and, where appropriate, the user.

6.6.5.2 The role of Technical and Application Management
Technical and Application Management play several important roles as follows:
■ During Service Design, they will participate in the instrumentation of the service, classify events, update correlation engines and ensure that any auto responses are defined
■ During Service Transition they will test the service to ensure that events are properly generated and that the defined responses are appropriate
■ During Service Operation these teams will typically perform Event Management for the systems under their control. It is unusual for teams to have a dedicated person to manage Event Management, but each manager or team leader will ensure that the appropriate procedures are defined and executed according to the process and policy requirements
■ Technical and Application Management will also be involved in dealing with incidents and problems related to events
■ If Event Management activities are delegated to the Service Desk or IT Operations Management, Technical and Application Management must ensure that the staff are adequately trained and that they have access to the appropriate tools to enable them to perform these tasks.

6.6.5.3 The role of IT Operations Management
Where IT Operations is separated from Technical or Application Management, it is common for Event Monitoring and first-line response to be delegated to IT Operations Management. Operators for each area will be tasked with monitoring events, responding as required, or ensuring that Incidents are created as appropriate. The instructions for how to do so must be included in the SOPs for those teams.

Event Monitoring is commonly delegated to the Operations Bridge where it exists. The Operations Bridge can initiate and coordinate, or even perform, the responses required by the service, or provide first-level support for those events which generate an incident.

6.6.6 Incident Management roles
The following roles are needed for the Incident Management process.

6.6.6.1 Incident Manager
An Incident Manager has the responsibility for:
■ Driving the efficiency and effectiveness of the Incident Management process
■ Producing management information
■ Managing the work of incident support staff (first- and second-line)
■ Monitoring the effectiveness of Incident Management and making recommendations for improvement
■ Developing and maintaining the Incident Management systems
■ Managing Major Incidents
■ Developing and maintaining the Incident Management process and procedures.

In many organizations the role of Incident Manager is assigned to the Service Desk Supervisor – though in larger organizations with high volumes a separate role may be necessary. In either case it is important that the Incident Manager is given the authority to manage incidents effectively through first, second and third line.

6.6.6.2 First line
This is covered in detail under the Service Desk (section 6.1) and will not be repeated here.

6.6.6.3 Second line
Many organizations will choose to have a second-line support group, made up of staff with greater (though still general) technical skills than the Service Desk – and with additional time to devote to incident diagnosis and resolution without interference from telephone interruptions.

Such a group can handle many of the less complicated incidents, leaving more specialist (third-line) support groups to concentrate on dealing with more deep-rooted incidents and/or new developments etc.

Where a second-line group is used, there are often advantages in locating this group close to the Service Desk to aid with good communications and to ease movement of staff between the groups, which may be helpful for training/awareness and during busy periods or staff shortages. A second-line support manager (or supervisor if just a small group) will normally head this group.

It is conceivable that this group may be outsourced – and this is more likely and practical if the Service Desk itself has been outsourced.

6.6.6.4 Third line
Third-line support will be provided by a number of internal technical groups and/or third-party suppliers/maintainers. The list will vary from organization to organization but is likely to include:
■ Network Support
■ Voice Support (if separate)
■ Server Support
■ Desktop Support
■ Application Management – there are likely to be separate teams for different applications or application types – some of which may be external
  supplier/maintainers. In many cases the same team will be responsible for Application Developments as well as support – and it is therefore important that resources are prioritized so that support is given adequate prominence
■ Database Support
■ Hardware Maintenance Engineers
■ Environmental Equipment Maintainers/Suppliers.

Note: Depending upon where an organization decides to source its support services, any of the above groups could be internal or external groups.

6.6.7 Request Fulfilment roles
Initial handling of Service Requests will be undertaken by the Service Desk and Incident Management staff.

Eventual fulfilment of the request will be undertaken by the appropriate Service Operation team(s) or departments and/or by external suppliers, as appropriate. Often, Facilities Management, Procurement and other business areas aid in the fulfilment of the Service Request. In most cases there will be no need for additional roles or posts to be created.

In exceptional cases where a very high number of Service Requests are handled, or where the requests are of critical importance to the organization, it may be appropriate to have one or more of the Incident Management team dedicated to handling and managing Service Requests.

6.6.8 Problem Management roles
The following roles are needed for the Problem Management process.

6.6.8.1 Problem Manager
There should be a designated person (or, in larger organizations, a team) responsible for Problem Management. Smaller organizations may not be able to justify a full-time resource for this role, and it can be combined with other roles in such cases, but it is essential that it is not just left to technical resources to perform. There needs to be a single point of coordination and an owner of the Problem Management process. This role will coordinate all Problem Management activities and will have specific responsibility for:
■ Liaison with all problem resolution groups to ensure swift resolution of problems within SLA targets
■ Ownership and protection of the KEDB
■ Gatekeeper for the inclusion of all Known Errors and management of search algorithms
■ Formal closure of all Problem Records
■ Liaison with suppliers, contractors, etc. to ensure that third parties fulfil their contractual obligations, especially with regard to resolving problems and providing problem-related information and data
■ Arranging, running and documenting Major Problem Reviews, and all related follow-up activities.

6.6.8.2 Problem-Solving Groups
The actual solving of problems is likely to be undertaken by one or more technical support groups and/or suppliers or support contractors – under the coordination of the Problem Manager.

Where an individual problem is serious enough to warrant it, a dedicated problem management team should be formed to work together in overcoming that particular problem. The Problem Manager has a role to play in making sure that the correct number and level of resources is available in the team, and in escalation and communication up the management chain of all organizations concerned.

6.6.9 Access Management roles
Since Access Management is an execution of Security and Availability Management, these two areas will be responsible for defining the appropriate roles. It is unusual for an organization to appoint an ‘Access Manager’, although it is important that there is a single Access Management process and a single set of policies related to managing rights and access. This process and the related policies are likely to be defined and maintained by Information Security Management and executed by the various Service Operation functions. Their activities can be summarized as follows.

6.6.9.1 The role of the Service Desk
The Service Desk is typically used as a means to request access to a service. This is normally done using a Service Request. The Service Desk will validate the request by checking that the request has been approved at the appropriate level of authority, that the user is a legitimate employee, contractor or customer and that they qualify for access.

Once it has performed these checks (usually by accessing the relevant databases and Service Level Management documents) it will pass the request to the appropriate team to provide access. It is quite common for the Service Desk to be delegated responsibility for providing access for simple services during the call.

The Service Desk will also be responsible for communicating with the user to ensure that they know
when access has been granted and to ensure that they receive any other required support.

The Service Desk is also well situated to detect and report incidents related to access. For example, users attempting to access services without authority; or users reporting incidents that indicate that a system or service has been used inappropriately, e.g. by a former employee who used an old username to gain access and make unauthorized changes.

6.6.9.2 The role of Technical and Application Management
Technical and Application Management play several important roles as follows:
■ During Service Design, they will ensure that mechanisms are created to simplify and control Access Management on each service that is designed. They will also specify ways in which abuse of rights can be detected and stopped
■ During Service Transition they will test the service to ensure that access can be granted, controlled and prevented as designed
■ During Service Operation these teams will typically perform Access Management for the systems under their control. It is unusual for teams to have a dedicated person to manage Access Management, but each manager or team leader will ensure that the appropriate procedures are defined and executed according to the process and policy requirements
■ Technical and Application Management will also be involved in dealing with Incidents and Problems related to Access Management
■ If Access Management activities are delegated to the Service Desk or IT Operations Management, Technical and Application Management must ensure that the staff are adequately trained and that they have access to the appropriate tools to enable them to perform these tasks.

6.6.9.3 The role of IT Operations Management
Where IT Operations is separated from Technical or Application Management, it is common for operational Access Management tasks to be delegated to IT Operations Management. Operators for each area will be tasked with providing or revoking access to key systems or resources. The circumstances under which they may do so, and the instructions for how to do so, must be included in the SOPs for those teams.

The Operations Bridge, if it exists, can be used to monitor events related to Access Management and can even provide first-line support and coordination in the resolution of those events where appropriate.

6.7 SERVICE OPERATION ORGANIZATION STRUCTURES

Some general information has already been provided about organizational considerations for each function (see paragraphs 6.2.3, 6.3.4 and 6.5.6). This section considers some specific organizational structures for all functions.

There are a number of ways of organizing Service Operation functions, and each organization will have to make its own decisions, based upon its scale, geography, culture and business environment. Some options are discussed in the rest of this section.

6.7.1 Organization by technical specialization
In this type of organization, departments are created according to technology and the skills and activities needed to manage that technology. IT Operations will follow the structure of the Technical and Application Management departments. The implication of this is that IT Operations is geared toward the operational agendas of the Technical and Application Management departments.

This structure can work well, provided that these groups are fully represented in the Service Design, Testing and Improvement processes, which will ensure that their agendas are aligned with the requirements of the business.

This structure also assumes that all Technical and Application Management departments have clearly distinguished between their Management activity and operations activity. It also requires that they have standardized these operational activities so that they can be effectively managed by the IT Operations Manager without undue interference from the Technical and Application Management teams or departments.

An example of an IT Operations organization structure based on technical expertise is given in Figure 6.7.

The advantages of this type of organizational structure include:
■ It is easier to set internal performance objectives since all staff in a single department have a similar set of tasks on a similar technology
■ Individual devices, systems or platforms can be managed more effectively since people with the appropriate skills are dedicated to manage these and measured according to their performance
■ Managing training programmes is easier since skill sets are clearly defined and separated into specific groups.

The disadvantages of this type of organizational structure include the following:
■ When people are divided into separate departments the priorities of their own group tend to override the priorities of other departments. An example of this is when departments refuse to accept ownership of an incident, each one blaming the other while the business continues to be disrupted.



IT Operations Manager
   IT Operations Control
   Infrastructure Operations
      Mainframe Operations
      Server Operations
      Storage Operations
      Network Operations
      Desktop Operations
      Database Operations
      Directory Service Operations
      Middleware Operations
      Internet/Web Operations
   Application Operations
      Financial Apps Operations
      HR Apps Operations
      Business Apps Operations
   Facilities Management

Figure 6.7 IT Operations organized according to technical specialization (sample)

■ Knowledge about the infrastructure and relationships
  between components is difficult to collect and
  fragmented. Individual groups tend to collect and
  maintain only the data that is required to support
  their own function, and do not give access to it
  very easily.
■ Each technology managed by a group is seen as a
  separate entity. This becomes a problem on systems
  that consist of components managed by different
  teams, e.g. an application, managed by the
  Application Management team, runs on a server
  managed by the Server Management department,
  using a network segment managed by the Local Area
  Networking department. If a change is made by one
  team or department without consulting the others,
  this could be disastrous for the service.
■ It is more difficult to understand the impact of a
  single department’s poor performance on the IT
  Service since there are many different groups
  contributing to the same service, each with its own set
  of performance objectives.
■ It is more difficult to track overall IT Service
  performance since each group is being measured on
  an individual basis.
■ Coordinating Change Assessments and Schedules is
  more difficult since many different departments have
  to provide input for each change.
■ Work requiring knowledge of multiple technologies is
  difficult since most resources are only trained for and
  concerned with the management of a single
  technology. Projects therefore have to include
  cross-training, which is time-consuming and expensive.

6.7.2 Organization by activity

This type of organization structure focuses on the fact that
similar activities have to be performed on all technologies
in the organization. This means that people who perform
similar activities, regardless of the technology, should be
grouped together, although within each department there
may be teams focusing on a specific technology,
application, etc.

In this type of organization, there is no clear differentiation
between the different Technical and Application
Management areas. Similar activities from many different
areas can be grouped into a single department.

Examples of departments that have been set up to
perform a specific set of activities across multiple
technologies include:

■ Maintenance (this implies that one team will
  coordinate and perform all maintenance across
  all technologies)
■ Contract Management or Third Party Management
■ Monitoring and Control
■ Operations Bridge
■ Network Operations Centre
■ Operations Strategy and Planning (which, as part of
  the Service Design processes, normally defines the
  standards to be used in IT Operations) – this
  department can set strategy or standards for every
  type of Technical and Application Management area.

The Operations Strategy and Planning department is used
to illustrate this type of structure in Figure 6.8.

The advantages of this type of organizational structure
include the following:

■ It is easier to manage groups of related activities since
  all the people involved in these activities report to the
  same manager
■ Measurement of teams or departments is based more
  on output than on isolated activities. This helps to
  build higher levels of assurance that a service can
  be delivered.

The disadvantages of this type of organizational structure
include the following:

■ Resources with similar skills may be duplicated across
  different functions, which results in higher costs
■ Although measurement is more output-based, it is
  still focused on the performance of internal activities
  rather than driven by the experience of the customer
  or end user.

6.7.3 Organizing to manage processes

It is not a good idea to structure the whole organization
according to processes. Processes are used to overcome
the ‘silo effect’ of departments, not to create silos.

However, there are a number of processes that will need a
dedicated organization structure to support and manage
them. For example, it will be very difficult for Financial
Management to be successful without a dedicated Finance
department – even if that department consists of a small
number of staff.

In process-based organizations, people are organized into
groups or departments that perform or manage a specific
process. This is similar to the activity-based structure,
except that its departments focus on end-to-end sets of
activities rather than on one individual type of activity.


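[Organization chart titled ‘Organization by Activity’: an IT Strategy and
Planning Manager heads four departments: Architecture and Standards,
Capacity Planning, Service Portfolio and New Technology Research.
Applications and Infrastructure appear beneath Architecture and Standards;
Mainframe, Servers, Storage, Network and Web-based appear beneath
Capacity Planning.]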
Figure 6.8 A department based on executing a set of activities


It should be noted that this type of organization structure
should only be used if IT Operations Management is
responsible for more than just IT Operations. In some
organizations, for example, IT Operations is responsible for
defining SLAs and negotiating UCs.

In addition, processes specifically exist to link the activities
of different groups to achieve a specific outcome. Using
processes as the basis to create departments can defeat
the purpose of having processes in the first place.
Process-based departments are really only effective when they are
able to coordinate the execution of the process through
the entire organization.

This means that process-based departments should only
be considered if IT Operations Management is to play the
role of Process Owner for a specific process.

Examples of process-based groups or departments include:

■ Capacity Operations
■ Availability Monitoring and Control
■ IT Financial Management
■ Security Administration
■ Asset and Configuration Management (including
  equipment installation and deployment).

The advantages of this type of organizational structure
include the following:

■ Processes are easier to define
■ There is less role conflict as job descriptions and
  process role descriptions are the same. In other
  structures a single job description will typically include
  activities for several roles



■ Metrics of team or department performance and
  process performance are the same, effectively aligning
  ‘internal’ and ‘external’ metrics.

The disadvantages of this type of organizational structure
include the following:

■ A basic principle of processes is that they are a means
  of linking the activities of various departments and
  groups. By using processes as a basis for
  organizational design, additional processes need to be
  defined to ensure that the departments work together.
■ Even if a department is responsible for executing a
  process, there will still be external dependencies.
  Groups may not view process activities outside of their
  own process as being important, resulting in processes
  that cannot be fully executed because dependencies
  cannot be met.
■ While some aspects of a process can be centralized,
  there will always be a number of activities that will
  have to be performed by other groups. The
  relationship between the dedicated team or
  department and the people performing the
  decentralized activities is often difficult to define and
  manage.

6.7.4 Organizing IT Operations by geography

IT Operations can be physically distributed and in some
cases each location needs to be organized according to its
own particular context.

This structure is typically used in the following
circumstances:

■ Data Centres are geographically distributed
■ Different regions or countries have different
  technologies or provide a different set of services
■ There are different business models or organizational
  structures in the different regions, i.e. the business is
  decentralized by geography and each Business Unit is
  fairly autonomous
■ Different legislation applies to different countries
  or regions (e.g. safety regulations)
■ Different standards apply to different countries
  or regions
■ Cultural or language differences exist between staff
  managing IT.

An example of this type of structure is given in Figure 6.9.
Note that in this example each geographical department is
structured internally using Technical Specialization. This
could be different in each region. For example, one region
may be structured in this way, while another region uses a
process- or activity-based structure.

Figure 6.9 also illustrates that one location could perform
centralized operations for all regions if they are similar
enough. In this example, the American Server Operations
Department manages all server operations in all locations,
Brussels manages all database operations and Singapore
manages all storage operations.

The advantages of this type of organizational structure
include the following:

■ Organization structure can be customized to meet
  local conditions
■ IT Operations can be customized to meet differing
  levels of IT service from region to region.

The disadvantages of this type of organizational structure
include the following:

■ Reporting lines and authority structures can be
  confusing. For example, does Network Operations
  report to the local Data Centre Manager or to a
  centralized Network Operations Manager?
■ Operational standards are difficult to impose, resulting
  in inconsistent and duplicated activities and tools and
  in reduced economies of scale, which in turn
  increases the overall cost of operations.
■ Duplication of roles, activities, tools and facilities
  across multiple locations could be very costly.
■ Shared services, such as e-mail, are more difficult to
  deliver as each regional organization operates
  differently.
■ Communication with customers and inside IT will be
  more difficult as they are not co-located and it may be
  difficult for staff in one location to understand the
  priorities of customers or staff in another location.

6.7.5 Hybrid organization structures

It is unlikely that IT Operations Management will be
structured using only one type of organization structure.
Most organizations use a technical specialization, with
some additional activity- or process-based departments.

The type of structure used and the exact combination of
technical specialization, activity-based and process-based
departments will depend on a number of organizational
variables.


[Organization chart: the IT Operations Manager heads four regional
departments: American IT Operations – Miami, European IT Operations –
Brussels, African IT Operations – Johannesburg and Asia Pacific IT
Operations – Singapore. Every region has Mainframe, Network, Desktop and
Internet/Web Operations. Server Operations appears only under Miami,
Database Operations only under Brussels and Storage Operations only
under Singapore.]
Figure 6.9 IT Operations organized according to geography

  Organizational structure variables

  The exact criteria chosen and the resulting
  organizational structure will depend on a number of
  variables, which may include:

  ■ The nature of the business
  ■ Business requirements and expectations
  ■ The technological and technical architecture
  ■ The stability of the current IT Infrastructure and
    the availability of skills to manage it
  ■ The governance of the organization (i.e. the way
    in which authority is assigned and decisions are
    made – as well as any formal governance
    framework that is used, such as COBIT or SOX)
  ■ The legislative, political and socio-economic
    environment of the organization
  ■ The type and level of skills available to the
    organization
  ■ The size, age and maturity of the organization
  ■ The management style of the organization
  ■ Dependence on IT for business-critical activities,
    processes and functions
  ■ The way in which IT participates in the value
    network (i.e. the way IT interacts with the business
    and its partners, suppliers and customers)
  ■ The relationship between IT and its vendors.

  For a more complete description of how these factors
  influence organizational design, please refer to the
  ‘Organizational Development’ section of the Service
  Strategy publication.




[Organization chart: the IT Operations Manager heads four departments: IT
Operations Control, Infrastructure Management, Facilities Management and
Application Management. The chart pairs a Management box with an
Operations box for each technical area (Server, Network, Database,
Mainframe, Storage, Desktop, Internet/Web) and for each application area
(HR Apps, Financial Apps, Business Apps).]
 Figure 6.10 Centralized IT Operations, Technical and Application Management structure

6.7.5.1 Combined functions

One last type of organization should be discussed. This
structure incorporates IT Operations, Technical and
Application Management departments into a single
structure. This is sometimes the case where all groups are
co-located in a single data centre. Here, the Data Centre
Manager takes responsibility for all Technical, Application
and IT Operations Management.

This type of organization structure is illustrated
in Figure 6.10.

In this structure, IT Operations Management is responsible
for the Technical and Application Management functions,
which in turn are responsible for managing their own
operational activities. Each department is able to delegate
some of these activities to the Operations Control
department.

The advantages of this organization structure are:

■ There is greater consistency and control between the
  more tactical and more operational Technical
  Management activities

■ It is easier to enforce the performance standards and
  technical architectures that are created in Service
  Design, since the people who were involved in design
  are managing the activities of the people who are
  executing those activities
■ As there is no duplication between location or activity,
  this structure is often more cost-effective.

The disadvantage of this organization structure is:

■ The scope of this structure makes it very difficult to
  manage effectively in large organizations or in
  organizations with multiple Data Centres.

6.7.5.2 Organizing Application and Technical Management

Technical and Application Management organizations tend
to be fairly straightforward. As stated in paragraphs 6.3.4
and 6.5.6, Technical Management departments are usually
based on the technology they manage and Application
Management departments on the applications and sets of
applications they manage.

However, there are some alternative organization
structures and variations, which are discussed in this
section.

6.7.5.3 Geography

In organizations with multiple locations, it is common for
the Technical and Application Management departments
to be represented in each physical location. However, this
does not mean that each location will have all the same
departments, or that they are all responsible for the same
actions.

As support and management tools mature, more and more
IT Infrastructure and application CIs can be managed
remotely. This means that each department will have a
strong, centralized Technical or Application Management
team, with local members to provide specialized, on-site
activities or support.

For example, in Server Management, the central team will
help to create standards for server configuration,
monitor and control remote devices, perform backups,
perform Operating System upgrades, etc. The local teams
will provide basic on-site support, hardware maintenance
and repair and configuration and installation of new
servers.

In Application Management, the central team could
participate in ongoing design and testing of the
application, monitoring and control; perform backups,
data integrity checks, etc. The local team could provide
on-site support and education to end users and work with
the local Technical Management team to resolve more
complex problems involving local equipment.

There is one potential issue that needs to be resolved,
however, and that is who the local team reports to. In
some organizations they report to the manager of the
centralized team. This has the added advantage of
consistent performance and management across the
whole enterprise.

In other organizations the local teams report to the most
senior IT Manager at that site. This has the added
advantage that IT Services can be customized to meet
local conditions, but it creates a lot of confusion about
who the local teams should take direction from.

The advantages of this type of organizational structure
include the following:

■ Organization structure can be customized to meet
  local conditions
■ Technical and Application Management can be
  customized to meet differing levels of IT service from
  region to region.

The disadvantages of this type of organizational structure
include the following:

■ Reporting lines and authority structures can be
  confusing
■ Standards are difficult to impose, resulting in
  inconsistent and duplicated activities and tools and
  in reduced economies of scale, which in turn
  increases the overall cost of operations
■ Duplication of roles, activities, tools and facilities
  across multiple locations could be very costly.

6.7.5.4 Combined Technical and Application Management structure

Some organizations organize their Technical and
Application Management functions according to systems.
This means that each department will consist of
application specialists and IT Infrastructure technical
specialists, all geared towards managing the services
based on that set of systems. Components that are shared
across all these systems, such as the network, will be
managed by dedicated Technical Management
departments.



 The advantage of this organization structure is:
 ■ It is easier to produce high-quality output to the end
      user because all department members are focused on
      the success of the system as a whole, rather than the
      performance of an individual technology component
      or application.
 The disadvantages of this organization structure are:
 ■ Duplication of skills and resources across several
   departments will increase the cost of the organization.
   For example, each group is likely to have an individual
   or team dedicated to managing servers – each of
   which will be doing very similar tasks.
 ■ Communication between staff who are managing
   similar technology is reduced. This reduces the
   amount of learning by experience and increases
   reliance on collaborative knowledge management
   tools.
 ■ When people with similar skills are in the same
   department, the department will compensate for
   members with lower skill and competency levels.
   When there is only one person with Server
   Management skills in a system-based department,
   and their competency is minimal, it will affect the
   performance of the entire department.


7 Technology considerations
Each function and process is defined in the relevant            and linked to Incident, Problem, Known Error and Change
section in Chapters 4 and 6. This chapter brings all            Records as appropriate.
technology requirements together to define the overall
requirement of an integrated set of Service Management          7.1.4 Discovery/Deployment/Licensing
technology for Service Operation.                               technology
The same technology, with some possible additions,              In order to populate or verify the CMS data and to assist in
should be used for the other phases of ITSM – Service           Licence Management, discovery or automated audit tools
Strategy, Service Design, Service Transition and Continual      will be required. Such tools should be capable of being
Service Improvement – to give consistency and allow an          run from any location on the network and allow
effective ITSM Lifecycle to be properly managed.                interrogation and recovery of information relating to all
                                                                components that make up, or are connected to, the IT
The main requirements for Service Operation are as set out
                                                                Infrastructure.
in this chapter.
                                                                Such technology should allow ‘filtering’ so that the data
                                                                being carried forward can be vetted and only required
7.1 GENERIC REQUIREMENTS
                                                                data extracted. It is also very helpful if ‘changes only’ since
An integrated ITSM technology (or toolset, as some              the last audit can be extracted and reported upon.
suppliers sell their technology as ‘modules’ whereas some
                                                                The same technology can often be used to deploy new
organizations may choose to integrate products from
                                                                software to target locations – this is an essential
alternative suppliers) is needed that includes the following
                                                                requirement for all Service Operation teams or
core functionality.
                                                                departments, to allow patches, transports etc. to be
                                                                distributed to the correct users.
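The ‘filtering’ and ‘changes only’ capabilities described for discovery tools in 7.1.4 can be sketched as follows. This is a minimal illustration only, assuming that each audit run yields a mapping from a CI identifier to its attributes; the function names, CI identifiers and attribute names are hypothetical and are not taken from any specific discovery product.

```python
from typing import Any, Dict, Set

# One audit snapshot: CI identifier -> attribute name -> value (assumed shape).
Snapshot = Dict[str, Dict[str, Any]]


def filter_snapshot(snapshot: Snapshot, wanted: Set[str]) -> Snapshot:
    """Keep only the attributes of interest before data is carried forward."""
    return {
        ci: {name: value for name, value in attrs.items() if name in wanted}
        for ci, attrs in snapshot.items()
    }


def audit_delta(previous: Snapshot, current: Snapshot) -> Dict[str, Any]:
    """Report only what changed between the last audit and this one."""
    added = {ci: current[ci] for ci in current.keys() - previous.keys()}
    removed = sorted(previous.keys() - current.keys())
    changed = {}
    for ci in current.keys() & previous.keys():
        names = previous[ci].keys() | current[ci].keys()
        diffs = {
            name: (previous[ci].get(name), current[ci].get(name))
            for name in names
            if previous[ci].get(name) != current[ci].get(name)
        }
        if diffs:
            changed[ci] = diffs  # attribute -> (old value, new value)
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    last_audit = {"srv01": {"os": "10.1", "ram_gb": 16}, "srv02": {"os": "10.1"}}
    this_audit = {"srv01": {"os": "10.2", "ram_gb": 16}, "pc07": {"os": "11"}}
    print(audit_delta(last_audit, this_audit))
```

Filtering before comparison keeps volatile attributes that nobody needs to control out of the change report.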
7.1.1 Self-Help
Many organizations find it beneficial to offer ‘Self-Help’      An interface to ‘Self-Help’ capabilities is desirable to allow
capabilities to their users. The technology should therefore    approved software downloads to be requested in this way
support this capability with some form of web front-end         but automatically handled by the deployment software.
allowing web pages to be defined offering a menu-driven         Tools that allow automatic comparison of software
range of Self-Help and Service Requests – with a direct         licences’ details held (in the CMS, ideally) and actual
interface into the back-end process-handling software.          licence numbers deployed – with reporting of any
                                                                discrepancies – are extremely desirable.
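
The automatic licence comparison described above can be sketched as follows. This is a minimal Python illustration, assuming the licences held are available as a count per product (for example from the CMS) and the deployed software as a flat list of discovered installations. The function name `licence_discrepancies` and the product names are hypothetical.

```python
from collections import Counter


def licence_discrepancies(licences_held, deployed_installs):
    """Report every product whose deployed count differs from the licences held."""
    deployed = Counter(deployed_installs)
    report = {}
    for product in sorted(set(licences_held) | set(deployed)):
        held = licences_held.get(product, 0)
        in_use = deployed.get(product, 0)
        if held != in_use:
            # A positive difference means more installations than licences.
            report[product] = {
                "held": held,
                "deployed": in_use,
                "difference": in_use - held,
            }
    return report


# Hypothetical data: licence counts held and installations found by discovery.
held = {"office-suite": 100, "cad-tool": 5}
installs = ["office-suite"] * 98 + ["cad-tool"] * 7 + ["pdf-editor"] * 2

report = licence_discrepancies(held, installs)
for product, row in report.items():
    print(product, row)
```

Products found on the network with no licence record at all (here `pdf-editor`) appear with zero licences held.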
7.1.2 Workflow or process engine
A workflow or process control engine is needed to allow         7.1.5 Remote control
the pre-definition and control of defined processes such as     It is often helpful for the Service Desk Analysts and other
an Incident Lifecycle, Request Fulfilment Lifecycle, Problem    support groups to be able to take control of the user’s
Lifecycle, Change Model, etc.                                   desktop (under properly controlled security conditions) so
This should allow responsibilities, activities, timescales,     as to allow them to conduct investigations or correct
escalation paths and alerting to be pre-defined and then        settings, etc. Facilities to allow this level of remote control
automatically managed.                                          will be needed.
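
The idea of a pre-defined process that is then automatically controlled (7.1.2) can be illustrated with a small state machine. This is a sketch only: the state names, the owning groups and the `Workflow` class are assumptions made for the example and are not taken from this publication.

```python
class Workflow:
    """A pre-defined process: the allowed steps and the group owning each step."""

    def __init__(self, transitions, owners):
        self.transitions = transitions
        self.owners = owners

    def advance(self, record, new_state):
        """Move a record to new_state only if the workflow defines that step."""
        allowed = self.transitions.get(record["state"], ())
        if new_state not in allowed:
            raise ValueError(
                f"{record['state']} -> {new_state} is not defined in the workflow"
            )
        record["state"] = new_state
        record["owner"] = self.owners[new_state]
        record["history"].append(new_state)
        return record


# Hypothetical incident lifecycle; real models are defined by the organization.
incident_flow = Workflow(
    transitions={
        "logged": ("categorized",),
        "categorized": ("in_diagnosis",),
        "in_diagnosis": ("resolved", "categorized"),
        "resolved": ("closed", "in_diagnosis"),
        "closed": (),
    },
    owners={
        "logged": "Service Desk",
        "categorized": "Service Desk",
        "in_diagnosis": "Second-line support",
        "resolved": "Service Desk",
        "closed": "Service Desk",
    },
)

record = {"id": "INC-1", "state": "logged", "owner": "Service Desk", "history": ["logged"]}
incident_flow.advance(record, "categorized")
incident_flow.advance(record, "in_diagnosis")
print(record["state"], record["owner"])
```

Because the transitions are data rather than code, a different lifecycle (a Request Fulfilment or Change Model, say) could be supplied to the same engine.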


7.1.3 Integrated CMS                                            7.1.6 Diagnostic utilities
The tool should have an integrated CMS to allow the             It could be extremely useful for the Service Desk and other
organization’s IT infrastructure assets, components,            support groups if the technology incorporated the
services and any ancillary CIs (such as contracts, locations,   capability to create and use diagnostic scripts and other
licences, suppliers etc. – anything that the IT organization    diagnostic utilities (such as, for example, case-based
wishes to control) to be held, together with all relevant       reasoning tools) to assist with earlier diagnosis of
attributes, in a centralized location – and to allow            incidents. Ideally, these should be ‘context sensitive’ and
relationships between each to be stored and maintained,         presentation of the scripts automated so far as possible.
158    | Technology considerations



7.1.7 Reporting
There is no use in storing data unless it can be easily
retrieved and used to meet the organization’s purposes.
The technology should therefore incorporate good
reporting capabilities, as well as allow standard interfaces
which can be used to input data to industry-standard
reporting packages, dashboards, etc. Ideally, instant,
on-screen as well as printed reporting can be provided
through the use of context-sensitive ‘top ten’ reports.

7.1.8 Dashboards
Dashboard-type technology is useful to allow ‘see at a
glance’ visibility of overall IT service performance and
availability levels. Such displays can be included in
management-level reports to users and customers – but
can also give real-time information for inclusion in IT web
pages to give dynamic reporting, and can be used for
support and investigation purposes. Capabilities to support
customized views of information to meet specific levels of
interest can be particularly useful.
However, dashboards sometimes represent a technical rather
than a service view of the infrastructure and in such cases
they may be of less interest to customers and users.

7.1.9 Integration with Business Service Management
There is a trend within the IT industry to try to bring
together business-related IT with the processes and
disciplines of IT Service Management – some call this
Business Service Management. To facilitate this, business
applications and tools need to be interfaced with ITSM
support tools to give the required functionality. This can
be illustrated by this example:

     An Eastern European telecoms company was able to
     interface its telephone cell-net monitoring and billing
     system to its Event Management, Incident
     Management and Configuration Management
     processes. In this way it was able to detect any
     unusual usage/billing patterns and interpret these
     such that it could identify, with a high degree of
     certainty, that a telephone had been stolen and was
     being used to make illicit calls.
     It was able to raise events for such patterns and
     automate actions to suspend usage of the mobile
     phone devices and, in parallel, identify the exact
     location of the illicit user (using GPRS technology)
     and raise incidents so that the police had the
     capability of finding the suspected thief and
     recovering the device.

More advanced tool integration capabilities are needed to
allow greater exploitation of this sort of business and IT
integration.

7.2 EVENT MANAGEMENT
The following features are desirable for any Event
Management technology:
■ Multi-environmental, open interface to allow
  monitoring and alerting across heterogeneous services
  and an organization’s entire IT Infrastructure.
■ Easy to deploy, with minimal set-up costs.
■ ‘Standard’ agents to monitor most common
  environments/components/systems.
■ Open interfaces to accept any standard (e.g. SNMP)
  event input and to generate multiple alerts.
■ Centralized routing of all events to a single location,
  programmable to allow different location(s) at various
  times.
■ Support for design/test phases – so that new
  applications/services can be monitored during
  design/test phases and results fed back into the
  design and transition.
■ Programmable assessment and handling of alerts
  depending upon symptoms and impact.
■ The ability to allow an operator to acknowledge an
  alert, and if no response is entered within a defined
  timeframe, to escalate the alert.
■ Good reporting functionality to allow feedback into
  design and transition phases as well as meaningful
  management information and business user
  ‘dashboard’.
Such technology should allow a direct interface into the
organization’s Incident Management processes (via entry
into the Incident Log), as well as the capability to escalate
to support staff, third-party suppliers, engineers etc. via
e-mail, SMS messaging, etc.

Specialist facilities, or perhaps separate specialist tools, will
be required for website monitoring. Such facilities must be
able to simulate customer traffic onto the website and to
report on availability and performance in relation to the
‘customer experience’.
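
The Event Management feature of escalating an alert that no operator has acknowledged within a defined timeframe can be sketched as a periodic check. The following Python is a minimal illustration under stated assumptions: the `Alert` fields, the function name `escalate_unacknowledged` and the 15-minute window are all hypothetical, and a real tool would also notify someone rather than only set a flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    event_id: str
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None
    escalated: bool = False


def escalate_unacknowledged(alerts, now, ack_window):
    """Flag alerts that are still unacknowledged after ack_window; return them."""
    newly_escalated = []
    for alert in alerts:
        overdue = now - alert.raised_at > ack_window
        if alert.acknowledged_at is None and overdue and not alert.escalated:
            alert.escalated = True
            newly_escalated.append(alert)
    return newly_escalated


# Hypothetical alerts, checked at 12:00 with a 15-minute acknowledgement window.
now = datetime(2007, 1, 1, 12, 0)
alerts = [
    Alert("evt-1", raised_at=datetime(2007, 1, 1, 11, 40)),  # overdue
    Alert("evt-2", raised_at=datetime(2007, 1, 1, 11, 55)),  # still within window
    Alert(
        "evt-3",
        raised_at=datetime(2007, 1, 1, 11, 30),
        acknowledged_at=datetime(2007, 1, 1, 11, 35),        # already acknowledged
    ),
]

escalated = escalate_unacknowledged(alerts, now, ack_window=timedelta(minutes=15))
print([a.event_id for a in escalated])
```

Running the check repeatedly does not escalate the same alert twice, because the `escalated` flag is recorded on the alert.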


7.3 INCIDENT MANAGEMENT

7.3.1 Integrated ITSM technology
Integrated ITSM technology is required that has the
following functionality:
■ An integral CMS to allow automated relationships to
  be made and maintained between incidents, service
  requests, problems, Known Errors and all other
  configuration items.
■ A CMS that can be used to assist in determining
  priority and to aid investigation and diagnosis.
■ A process flow engine to allow processes to be
  pre-defined (including pre-defined incident models, see
  paragraph 3.2.1.5) and automatically controlled – with
  flexible internal routing to all relevant support groups
  and external e-mail/SMS interfaces.
■ Automated alerting and escalation capabilities to
  prevent an incident being overlooked or delayed.
■ Open interfacing to Event Management tools, so that
  any failures can be automatically raised as incidents.
■ A web interface to allow self-help and service requests
  to be input via Internet/Intranet screens.
■ An integrated KEDB so that diagnosed and/or resolved
  incidents/problems can be recorded and searched to
  help in speeding future incident resolution.
■ Easy-to-use reporting facilities to allow incident
  metrics to be produced and to facilitate incident
  analysis for Problem Management and Availability
  Management purposes.
■ Diagnostic tools (either integrated or interfaces to
  separate products), as already mentioned under
  Service Desk.

7.3.2 Workflow and automated escalation
The target times should be included in support tools,
which should be used to automate the workflow control
and escalation paths.

If, for example, a second-line support group has not
resolved an incident within a 60-minute agreed target, the
incident must be automatically routed to the appropriate
(determined by incident categorization) third-line support
group – and any necessary hierarchic escalation should be
automatically undertaken (e.g. SMS message to the Service
Desk Manager, Incident Manager and/or IT Services
Manager and perhaps to the user, if appropriate). The
second-line support group must be informed of the
escalation action as part of the automated process.

7.4 REQUEST FULFILMENT
Integrated ITSM technology is needed so that Service
Requests can be linked to incidents or events that have
initiated them (and been stored in the same CMS, which
can be interrogated to report against SLAs). Some
organizations will be content to use the Incident
Management element of such tools and to treat Service
Requests as a subset and defined category of incidents.
Where an organization chooses to raise separate Service
Requests, it will require a tool which allows this capability.
Front-end Self-Help capabilities will be needed to allow
users to submit requests via some form of web-based,
menu-driven selection process.

In all other respects the facilities needed to manage
Service Requests are very similar to those for managing
incidents: pre-defined workflow control of Request
Models, priority levels, automated escalation, effective
reporting, etc.

7.5 PROBLEM MANAGEMENT

7.5.1 Integrated Service Management Technology
An integrated ITSM tool is needed that differentiates
between incidents and problems – so that separate
Problem Records can be raised to deal with the underlying
causes of incidents, but linked to the related incidents. The
functionality of Problem Records should be similar to
those needed for Incident Records and also allow for
multiple incident matching against Problem Records.

7.5.2 Change Management
Integration with Change Management is very important,
so that Request, Event, Incident and Problem Records can
be related to RFCs that have caused problems. This is to
evaluate the success of the Change Management process
– as well as Incident and Known Error Records – and so
that RFCs can be readily raised to control the activities
needed to overcome problems that have been identified
through Root-Cause Analysis or Proactive Trend Analysis.

7.5.3 Integrated CMS
It is also important to have an integrated CMS which
allows Problem Records to be linked to the components
affected and the services impacted – and to any other
relevant CIs.

Configuration Management forms part of a larger SKMS
which includes linkages to many of the data repositories
used in Service Operation. The process and practices of
Configuration Management and its underlying
technology requirements are included in the Service
Transition publication.

7.5.4 Known Error Database
An effective KEDB will be an essential requirement,
which should allow easy storage and retrieval of Known
Error data.

Good reporting facilities are needed to ease the
production of management reports, allowing the data to
be incorporated automatically without the need for
re-keying of data – and to allow drill-down capabilities for
Incident and Problem Analysis.

Note: In some cases, components or systems being
investigated by Problem Management may be provided
by third-party vendors or manufacturers. To address this,
vendors’ support tools and/or KEDBs may also need
to be used.

7.6 ACCESS MANAGEMENT
Access Management uses a variety of technologies, mainly:
■ Human Resource Management technology, to validate
  the identity of users and to track their status
■ Directory Services Technology (see section 5.8 for a
  description of Directory Services). This technology
  enables technology managers to assign names to
  resources on a network and then provide access to
  those resources based on the profile of the user.
  Directory Services tools also enable Access
  Management to create roles and groups and to link
  these to both users and resources
■ Access Management features in Applications,
  Middleware, Operating Systems and Network
  Operating Systems
■ Change Management systems
■ Request Fulfilment technology (see section 7.4).

7.7 SERVICE DESK
Adequate tools and technology support should be
provided to enable Service Desk staff to perform their
roles as efficiently and effectively as possible. This will
include the following.

7.7.1 Telephony
Because a high percentage of incidents are likely to be
raised by telephone calls from users, the Service Desk
should be provided with good, modern telephony
services. This should include:
■ An automated call distribution (ACD) system to allow a
  single telephone number (or numbers if a distributed
  or segmented Service Desk is the preferred option)
  and group pick-up capabilities. Warning: If options are
  offered via the ACD, via keyboard or Interactive Voice
  Recognition (IVR) selection, do not use too many
  levels of options or offer ambiguous options. Also do
  not include any ‘dead ends’ or options which, once
  chosen, do not allow the caller to go back to previous
  menus.
■ Computer Telephony Interface (CTI) software to allow
  caller recognition (via the linked ACD) and automated
  population of the users’ details into the incident
  record from the CMS.
■ VoIP – use of this technology can significantly reduce
  telephony costs when dealing with remote and
  international users.
■ Statistical software to allow telephony statistics to be
  gathered and easily interrogated/printed for analysis –
  this should allow the following information to be
  obtained for any selected period:
  ● Number of calls received, in total and broken
    down by any ‘splits’ – where any call-routing has
    been chosen and is being provided by an IVR
    system/keypad response
  ● Call arrival profiles and answer times
  ● Call abandon rates
  ● Call handling rates by individual Service Desk
    call handlers
  ● Average call durations
■ Hands-free headsets, with dual-user access capabilities
  (on at least some of the headsets) for use during
  training of new staff, etc.

7.7.2 Support tools
There is a range of free-standing Service Desk support
tools available in the marketplace – and some
organizations may choose to produce their own simple
incident logging/management systems. If an organization
seriously intends to implement ITSM then a fully
integrated ITSM toolset will be required that has a CMS at
the centre and provides integrated support for all the
ITIL-defined processes.

Specific elements of such a tool that will be particularly
beneficial for the Service Desk include the following.

7.7.2.1 Known Error Database
An integrated KEDB should be used to store details of
previous incidents/problems and their resolutions – so that
any recurrences can be more quickly diagnosed and fixed.

To facilitate this, functionality is needed to categorize and
quickly retrieve previous Known Errors, using pattern
matching and keyword searching against symptoms.
Management of the KEDB is the responsibility of Problem
Management, but the Service Desk will use it to help speed
incident handling.

7.7.2.2 Diagnostic scripts
Multi-level diagnostic scripts should be developed, stored
and managed to allow Service Desk staff to pinpoint the
cause of failures. Specialist support groups and suppliers
should be asked to provide details of the likely failures and
the key questions to be asked to identify exactly what has
gone wrong – and for details of the resolution actions to
be taken.

These details should then be included in context-sensitive
scripts that should appear on-screen, dependent upon the
multi-level categorization of the incident, and should be
driven by the user’s answers to diagnostic questions.

7.7.2.3 Self-Help web interface
It is often cost-effective and expedient to provide some
form of automated ‘Self-Help’ functionality, so users can
seek and obtain assistance which will enable them to
resolve their own difficulties. Ideally this should be via a
24/7 web interface that is driven by menu selection and
might include, as appropriate:
■ Frequently asked questions (FAQs) and solutions.
■ ‘How to do’ search capabilities – to guide users
  through a context-sensitive list of tasks or activities.
■ A bulletin-type service containing details of
  outstanding service issues/problems together with
  anticipated restoration times.
■ Password change capabilities – using secure password
  protection software to check identities, perform
  authorization and change passwords without the need
  for Service Desk intervention.
■ Software fix downloads (patches, service packs, bug
  fixes etc. where it is determined that the user has the
  wrong version or a fix is needed) – tools are available
  to automate the checking process, to compare the
  actual desktop image with the agreed ‘standard’ builds
  and to allow upgrades to be offered and accepted
  where necessary.
■ Software repairs – where it is detected that a
  corruption may have occurred, to allow software fixes,
  removal and/or re-installation.
■ Software removal requests – automatically completed
  with any licence being returned to the pool.
■ Downloads of additional software packages – tools are
  available to check a pre-defined software policy and to
  allow the download of additional software packages, if
  covered by the policy. This can include automated
  software licence checks and financial approvals as well
  as CMS updating.
■ Advance notice of any planned downtime or service
  outages or degradations.
The self-help solution should include the capability for
users to log incidents themselves, which can be used
during periods that the Service Desk is closed (if not
operating 24/7) and attended to by Service Desk staff at
the start of the next shift.

Some care has to be exercised to ensure that the Self-Help
activities selected for inclusion are not too advanced for
the average user, and that safeguards are included to
prevent a ‘little knowledge being a dangerous thing’! It
may be possible to offer slightly more advanced Self-Help
facilities to ‘Super Users’ who have had extra training. It is
also necessary to be very careful about assumptions made
when staffing a Service Desk about the amount of use that
users will make of Self-Help facilities.

Note: As already covered in the list above, it is possible to
combine some simpler Request Fulfilment activities as part
of an overall Self-Help system – which can also be of
significant benefit in reducing calls to the Service Desk
(see paragraph 7.1.1 for further details).

7.7.2.4 Remote control
As already stated, but repeated here for completeness, it is
often helpful for the Service Desk Analysts to be able to
take control of the user’s desktop so as to allow them to
conduct investigations or correct settings, etc. Facilities to
allow this level of remote control will be needed.

7.7.3 IT Service Continuity Planning for ITSM support tools
Organizations are likely to become quickly dependent
upon their ITSM tools and will find it difficult to work
without them. A full Business Impact Analysis and
Risk Analysis should be performed and plans then
developed to ensure appropriate IT Service Continuity
and resilience levels.
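
Returning to the Known Error retrieval described in 7.7.2.1, keyword searching against symptoms can be illustrated with a very small ranking function. This Python sketch is an assumption-laden illustration rather than a description of any KEDB product: the record layout (`id`, `category`, `symptoms`), the function names and the sample entries are all hypothetical, and real tools are likely to use more sophisticated pattern matching.

```python
import re


def tokenize(text):
    """Lower-case a piece of text and split it into a set of keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_kedb(kedb, symptom_text, category=None):
    """Rank Known Errors by the number of keywords shared with the symptoms."""
    query = tokenize(symptom_text)
    matches = []
    for entry in kedb:
        if category is not None and entry["category"] != category:
            continue
        score = len(query & tokenize(entry["symptoms"]))
        if score > 0:
            matches.append((score, entry["id"]))
    # Best score first; ties are ordered by identifier for a stable result.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [entry_id for _, entry_id in matches]


# Hypothetical Known Error records.
kedb = [
    {"id": "KE-1", "category": "email",
     "symptoms": "mail client crashes when opening attachment"},
    {"id": "KE-2", "category": "email",
     "symptoms": "cannot send mail, outbox stuck"},
    {"id": "KE-3", "category": "printing",
     "symptoms": "printer queue stuck, jobs not printing"},
]

results = search_kedb(kedb, "Mail client crashes on attachment", category="email")
print(results)
```

Restricting the search by category first reflects the requirement to categorize Known Errors as well as to search them by keyword.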
Implementing Service
          Operation    8


8 Implementing Service Operation
It should be noted that Service Operation is a phase in a
lifecycle and not an entity in its own right. By the time a
service, process, organization structure or technology is
operating, it has already been implemented. However,
there are a number of processes and functions described
in this publication, and it is therefore important to address
the implementation considerations which should have
been addressed by the time they come into operation.
A number of these have been covered in the relevant
sections – for example, guidance is given about
organization structures and roles in Chapter 6. This will
not be repeated here. Rather, this section will focus on
some generic implementation guidance for Service
Operation as a whole.

8.1 MANAGING CHANGE IN SERVICE OPERATION
Service Operation should strive to achieve stability – but
not stagnation! There are many valid and advantageous
reasons why ‘change is a good thing’ – but Service
Operation staff must ensure that any changes are
absorbed without adverse impact upon the stability of the
IT services being offered.

8.1.1 Change triggers
There are many things that may trigger a change in the
Service Operation environment. These include:
■ New or upgraded hardware or network components
■ New or upgraded applications software
■ New or upgraded system software (operating systems,
  utilities, middleware etc. including patches and
  bug fixes)
■ Legislative, conformance or governance changes
■ Obsolescence – some components may become
  obsolete and require replacement or cease to be
  supported by the supplier/maintainer
■ Business imperative – you have to be flexible to work
  in ITSM, particularly during Service Operation, and
  there will be many occasions when the business needs
  IT changes to meet dynamic business requirements
■ Enhancements to processes, procedures and/or
  underpinning tools to improve IT delivery or reduce
  financial costs
■ Changes of management or personnel (ranging from
  loss or transfer of individuals right through to major
  take-overs or acquisitions)
■ Change of service levels or in service provision –
  outsourcing, in-sourcing, partnerships, etc.

8.1.2 Change assessment
Service Operation staff must be involved in the assessment
of all changes to ensure that operational issues are fully
taken into account. This involvement should commence as
soon as possible (see paragraph 4.6.1), not just at the later
stages of change – i.e. CAB and ECAB membership – by
which time many fundamental decisions will have been
made and influence is likely to be very limited. The
Change Manager should inform all affected parties of the
change being assessed so input can be prepared and
available prior to CAB meetings.

However, it is important that Service Operation staff are
involved at these later stages as they may be involved in
the actual implementation and they will wish to ensure
that careful scheduling takes place to avoid potential
contentions or particularly sensitive periods.

8.1.3 Measurement of successful change
The ultimate measure of success in respect of changes
made to Service Operation is that customers and users do
not experience any variation or outage of service. So far as
possible, the effects of changes should be invisible, apart
from any enhanced functionality, quality or financial
savings resulting from the change.

8.2 SERVICE OPERATION AND PROJECT MANAGEMENT
Because Service Operation is generally viewed as ‘business
as usual’ and often focused on executing defined
procedures in a standard way, there is a tendency not to
use Project Management processes when they would in
fact be appropriate. For example, major infrastructure
upgrades, or the deployment of new or changed
procedures, are significant tasks where formal Project
Management can be used to improve control and manage
costs/resources.

Using Project Management to manage these types of
activity would have the following benefits:
■ The project benefits are clearly stated and agreed
■ There is more visibility of what is being done and how
  it is being managed, which makes it easier for other IT
  groups and the business to quantify the contributions
  made by operational teams
■ This in turn makes it easier to obtain funding for
  projects that have traditionally been difficult to cost
  justify
■ Greater consistency and improved quality
■ Achievement of objectives results in higher credibility
  for operational groups.

8.3 ASSESSING AND MANAGING RISK IN SERVICE OPERATION

There will be a number of occasions where it is imperative
that a risk assessment for Service Operation is quickly
undertaken and acted upon.

The most obvious area is in assessing the risk of potential
changes or Known Errors (already covered elsewhere) but
in addition Service Operation staff may need to be
involved in assessing the risk and impact of:

■ Failures, or potential failures – either reported by
  Event Management or Incident/Problem Management,
  or warnings raised by manufacturers, suppliers or
  contractors
■ New projects that will ultimately result in delivery into
  the live environment
■ Environmental risk (encompassing IT Service
  Continuity-type risks to the physical environment and
  locale as well as political, commercial or
  industrial-relations related risks)
■ Suppliers, particularly where new suppliers are
  involved or where key service components are under
  the control of third parties
■ Security risks – both theoretical and actual – arising
  from security-related incidents or events
■ New customers/services to be supported.

8.4 OPERATIONAL STAFF IN SERVICE DESIGN AND TRANSITION

All IT groups will be involved during Service Design and
Service Transition to ensure that new components or
services are designed, tested and implemented to provide
the correct levels of functionality, usability, availability,
capacity, etc.

Additionally, Service Operation staff must be involved
during the early stages of Service Design and Service
Transition to ensure that when new services reach the live
environment they are fit for purpose, from a Service
Operation perspective, and are ‘supportable’ in the future.
In this context, ‘supportable’ means:

■ Capable of being supported from a technical and
  operational viewpoint from within existing, or
  pre-agreed additional, resources and skill levels
■ Without adverse impact on other existing technical or
  operational working practices, processes or schedules
■ Without any unexpected operational costs or ongoing
  or escalating support expenditure
■ Without any unexpected contractual or legal
  complications
■ No complex support paths between multiple support
  departments of third-party organizations.

Note: Change is not just about technology. It also requires
training, awareness, cultural change, motivational issues
and a lot more. Further details regarding wider
management of change are covered in the Service
Transition publication.

8.5 PLANNING AND IMPLEMENTING SERVICE MANAGEMENT TECHNOLOGIES

There are a number of factors that organizations need to
plan for in readiness for, and during deployment and
implementation of, ITSM support tools. These include the
following.

8.5.1 Licences

The overall cost of ITSM tools, particularly the integrated
tool that will form the heart of the required toolset, is
usually determined by the number and type of user
licences that the organization needs.

Such tools are often sold in modular format, so the exact
functionality of each module needs to be well understood
and some initial sizing must be conducted to determine
how many – and what type – of users will need access to
each module.

Licences are often available in the following types (the
exact terminology may vary depending upon the software
supplier).

8.5.1.1 Dedicated licences

For use by those staff who require frequent and
prolonged use of the module (e.g. Service Desk staff
would need a dedicated licence to use an Incident
Management module).

8.5.1.2 Shared licences

For staff who make fairly regular use of the module, but
with significant intervals in between, so can usually
manage with a shared licence (e.g. third-line support staff
may need regular access to an Incident Management
module – but only at times when they are actively
updating an incident record). The ratio of required licences
to users needs to be estimated, so the correct number of
licences can be purchased – this will depend upon the
number of potential users, the length of periods of use
and the expected interval between uses, which together
give an estimated concurrency level.

A shared licence usually costs more than a dedicated
licence – but the overall cost is less as users are sharing
and fewer licences are therefore needed in total.

8.5.1.3 Web licences

Usually allowing some form of ‘light interface’ via web
access to the tool’s capabilities, this is usually suitable for
staff requiring remote access, only occasional access, or
usage of just a small subset of the functionality (e.g.
engineering staff wishing to log details of actions taken on
incidents or users just wanting to log an incident directly).
Web licences usually cost a lot less than other licences
(may even be free with other licences!) and the ratio
of use is also often lower – so overall costs are
reduced further.

Note that some staff may require access to multiple
licences (e.g. support staff may require a dedicated or
shared licence when in the office during the day, but may
require a web licence when providing out of hours
support from home). Keep in mind that licences may be
required for customers/users/suppliers using the same tool
to input, view or update records or reports.

Note: Some licence agreements (of any of the types
mentioned above) may restrict the usage of the software
to an individual device or CPU!

8.5.1.4 Service on demand

There has been a trend within the IT industry for suppliers
to offer IT applications ‘on demand’, where access is given
to the application for a period of demand and then
severed when it is no longer needed – and charged on
the basis of the time spent using the application. This type
of service may be offered by some ITSM tool suppliers –
which could be attractive to smaller organizations or if the
tools in question are very specialized and used relatively
infrequently.

An alternative to this is where the use of a tool is offered
as part of a specific consultancy assignment (e.g. a
specialist Capacity Management consultancy, say, who
may offer a regular but relatively infrequent Capacity
Planning consultancy package and provide use of the
tools for the duration of the assignment). In such cases the
licence fees are likely to be included as part of, or as an
addendum to, the consultancy fee.

A further variation is where software is licensed and
charged on an agent/activity basis. An example of this is
interrogation/monitoring and/or simulation software (e.g.
agent software that can simulate pre-defined customer
paths through an organization’s website, to assess and
report upon performance and availability). Such software is
typically charged on the basis of the number of agents,
their location and/or the amount of activity generated.

In all cases, the licensing structure must be fully
investigated and well understood during procurement
and well before tools are deployed – so that the ultimate
costs do not come as any sort of surprise.

8.5.2 Deployment

Many ITSM tools, particularly Discovery and Event
Monitoring tools, will require some client/agent software
to be deployed to all target locations before they can be
used. This will need careful planning and execution – and
should be handled through formal Release and
Deployment Management (see Service Transition
publication).

Even where network deployment is possible, this needs
careful scheduling and testing – and records must be
maintained throughout the rollout so that support staff
have knowledge of who has been upgraded and who has
not. Some form of interim Change Management may be
necessary and the CMS should be updated as the rollout
progresses.

It is often necessary for a reboot of the devices for the
client software to be recognized – and this needs to be
arranged in advance, otherwise long delays can occur if
staff do not generally switch off their desktops overnight.

There may be particular problems deploying to laptops
and other portable equipment and special arrangements
may be necessary for staff to log on and receive the
new software.

8.5.3 Capacity checks

Some Capacity Management may be necessary in advance
to ensure that all of the target locations have sufficient
storage and processing capacity to host and run the new
software – any that cannot will need upgrading or
replacing, and lead times for these actions need to be
included in the plans.

The capacity of the network should also be checked to
establish whether it can handle the transmission of
management information, the transmission of log files and
the distribution of clients and also possibly software and
configuration files.

8.5.4 Timing of technology deployment

Care is needed to ensure that tools are deployed at the
appropriate time in relation to the organization’s level of
ITSM sophistication and knowledge. If tools are deployed
too soon, they may be seen as an immediate panacea and
any necessary action to change processes, working
practices or attitudes may be hindered or overlooked.

A tool alone is usually not enough to make things work
better. There is an old adage: ‘A fool with a tool is
still a fool!’

The organization must first examine the processes that the
tool is seeking to address and also ensure that staff are
‘bought in’ to the new processes and way of working and
have adopted a ‘service culture’.

However, tools can and often do make things a reality for
many people – they are tangible and technical staff can
immediately see how the new processes can work and
how they may improve their way of working.

Some processes just cannot be done without adequate
tooling, so there is a careful balance to be made to ensure
tools are introduced when they are needed – but not
before!

Similarly, care is needed to ensure that training in any
tools is provided at the correct point – not too early or
knowledge will diminish or be lost, but early enough so
that staff can be formally trained and fully familiarize
themselves with the operation of the tools well in advance
of live deployment. Additional training should be planned
for an additional period when the tools go live and into
the future, as needed.

8.5.5 Type of introduction

A decision is needed on what type of introduction is
needed – whether to go for a ‘Big Bang’ introduction or
some sort of phased approach. As most organizations will
not start from a ‘green field’ situation, and will have live
services to keep running during the introduction, a phased
approach is more likely to be necessary.

In many cases a new tool will be replacing an older,
probably less sophisticated, tool and the switchover
between the two is another factor to be planned.

This will often involve deciding what data needs to be
carried forward from the old tool to the new one – and
this may require significant reformatting to achieve the
required results. Ideally this transfer should be done
electronically – but in some cases a small amount of
re-keying of live data may be inevitable and should be
factored into the plans.

Caution: older tools generally relied on more manual entry
and maintenance of data so if electronic data migration is
being used, an audit should be performed to verify data
quality.

Where data transfer is complicated or time consuming to
achieve, an alternative might be to allow a period of
parallel running – with the old tool being available for an
initial period alongside the new one, so that historical data
can be referenced if needed. In such cases it will be
prudent to make the old tool ‘read-only’ so that no
mistakes can be made by logging new data in the old
tool.

Complete details on the Release and Deployment
Management process can be found in the Service
Transition publication.
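
As an illustration of the shared-licence sizing described in
8.5.1.2, the three factors named there (number of potential
users, length of periods of use and the interval between
uses) can be combined into a rough concurrency estimate.
The sketch below is not an ITIL formula: the function names,
the 25 per cent headroom and all prices and user figures are
hypothetical, and it assumes usage is spread evenly through
the day, so real peaks may need more licences.

```python
import math


def estimate_shared_licences(potential_users, avg_session_minutes,
                             avg_gap_minutes, headroom=0.25):
    """Estimate how many shared licences cover typical concurrent use.

    Assumption: each user alternates between a session of
    avg_session_minutes and an idle gap of avg_gap_minutes, so the
    fraction of time a user holds a licence is session / (session + gap).
    """
    if potential_users < 0 or avg_session_minutes <= 0 or avg_gap_minutes < 0:
        raise ValueError("users and gap must be >= 0, session length > 0")
    busy_fraction = avg_session_minutes / (avg_session_minutes + avg_gap_minutes)
    expected_concurrent = potential_users * busy_fraction
    # Headroom is an illustrative safety margin for peaks, not an ITIL figure.
    licences = math.ceil(expected_concurrent * (1 + headroom))
    # Never buy more shared licences than there are users.
    return min(licences, potential_users)


def compare_costs(potential_users, shared_licences,
                  dedicated_price, shared_price):
    """Return (dedicated_total, shared_total) for the same user group."""
    return potential_users * dedicated_price, shared_licences * shared_price


# Hypothetical figures: 40 third-line support staff, 15-minute updates,
# roughly 105 minutes between updates, made-up unit prices.
needed = estimate_shared_licences(40, 15, 105)
dedicated_total, shared_total = compare_costs(40, needed, 1000, 1800)
print(needed, dedicated_total, shared_total)  # prints: 7 40000 12600
```

With these made-up numbers a shared licence costs more per
unit than a dedicated one, yet the total is lower because
only seven licences are needed for forty users, which is the
trade-off described in 8.5.1.2.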

9 Challenges, Critical Success Factors and risks

9.1 CHALLENGES

There are a number of challenges faced within Service
Operation that need to be overcome. These include those
set out in this section.

9.1.1 Lack of engagement with development and project staff

Traditionally, there has been a separation between Service
Operation staff and those staff involved in developing new
applications or running projects that will eventually deliver
new functionality into the operational environment.

This separation was originally deliberate and driven by
the desire to prevent collusion and avoid potential
security risks (in some organizations it is still a
legislative requirement). However, instead of using
this separation of duties to create positive contributions,
in many organizations it is a source of rivalry and
political manoeuvring.

All too often, ITSM is seen as something that has been
initiated in the operational areas and has nothing to do
with development or projects.

This view is very damaging as the appropriate time to be
thinking of Service Operation issues is at the outset of new
developments or projects – when there is still time to
include these factors in the planning stages.

The Service Design and Service Transition publications
describe the steps needed to ensure that IT Operations
issues are considered from the outset of new
developments and projects.

  Anecdotes

  One organization uses an ‘Operation Transition-In
  Policy’ to ensure that services being deployed have
  had the appropriate level of input from the
  operational teams. This is basically a policy that
  clearly shows under what circumstances an
  application is ‘ready’ to transition into Operations.
  This helped with communication to development and
  project teams and also provided a clear set of
  guidelines on how to work with the operational teams.

  Another organization uses Operations Use Cases to
  get development teams to include requirements that
  should be fulfilled by the application to be run in
  production under the control of Operations personnel.

9.1.2 Justifying funding

It is often difficult to justify expenditure in the area of
Service Operation, as money spent in this sphere is often
regarded as ‘infrastructure costs’ – with nothing new to
show for the investment.

The Service Strategy publication discusses how to ensure a
Return on Investment and eliminate the perception of
investment as a purely Infrastructure ‘overhead’. Good
guidance is offered on how to justify investment.

In reality, many investments in ITSM, particularly in the
Service Operation areas, can save money and show a
positive Return on Investment – as well as resulting in
improved service quality. Some examples of
potential areas of savings include:

■ Reduced software licence costs through the better
  management of licences and deployed copies
■ Reduced support costs due to fewer incidents and
  problems and reduced resolution times
■ Reduced headcount through workforce rationalization,
  supporting roles and accountability structures
■ Less ‘lost business’ due to poor IT service quality
■ Better utilization of existing infrastructure equipment
  and deferral of further expenditure due to better
  capacity management
■ Better-aligned processes, leading to less duplication of
  activities and better usage of existing resources.

9.1.3 Challenges for Service Operation Managers

The following is a list of some of the challenges that
Managers in Service Operation should expect to face.
There is no easy solution to these challenges, mainly
because they are by-products of the organization’s culture
and the decisions made during the process of deciding
the organizational structure. The purpose of including the
list is to ensure that Service Operation Managers are
conscious of them and can create a plan to deal with
them.

The differences between Design activities and Operational
activities will continue to present challenges. This is for a
number of reasons, including the following:
■ Service Design may tend to focus on an individual
  service at a time, whereas Service Operation tends to
  focus on delivering and supporting all services at the
  same time. Operation Managers should work closely
  with Service Design and Service Transition to provide
  the Operation perspective to ensure that design
  and transition outcomes support the overall
  operational needs.
■ Service Design will often be conducted in projects,
  while Service Operation focuses on ongoing,
  repeatable management processes and activities. The
  result of this is that operational staff are often not
  available to participate in Service Design project
  activities, which in turn results in IT services that are
  difficult to operate, or which do not include adequate
  manageability design elements. In addition, once
  project staff have finished the design of one IT Service
  they could move on to the next project and not be
  available to support difficulties in the operational
  environment. Overcoming this challenge requires
  Service Operation to plan for its staff to be actively
  involved in design projects, to resource the transition
  activities and participate in Early Life Support of
  services introduced in the operational environment.
■ The two stages in the lifecycle have different metrics,
  which encourages Service Design to complete the
  project on time, to specification and in budget. In
  many cases it is difficult to forecast what the service
  will look like and how much it will cost after it has
  been deployed and operated for some time. When it
  does not run as expected, IT Operations Management
  is held responsible. While this challenge will always be
  a reality in Service Management, this can be addressed
  by active involvement in the Service Transition stage
  of the lifecycle. The objective of Service Transition is to
  ensure that designed services will operate as expected
  and the Operations Manager can provide the
  knowledge that Service Transition needs to assess, and
  remedy, problems before they become issues in the
  operational environment.
■ Service Transition that is not used effectively to
  manage the transition between the Design and
  Operation phases. For example, some organizations
  may only use Change Management to schedule the
  deployment of changes that have already been made
  – rather than testing to see whether the change will
  successfully make the transition between Design and
  Operation. It is imperative that the practices of Service
  Transition are followed and organization policies to
  prevent poorly managed Change practices are in
  place. Operation, Change and Transition Managers
  must have the authority to deny any changes into the
  operational environment, without exception, that are
  not thoroughly tested.

These challenges can only be dealt with if Service
Operation staff are involved in Service Design and
Transition, and this will require that they are formally
tasked and measured to do this. Roles identified in the
Service Design processes should be included in Technical
and IT Application Management staff job descriptions and
their time allocated on a project-by-project basis.

Another set of challenges relates to measurement. Each
alternative structure will introduce different combinations
of items that are easy or difficult to measure. For example,
measuring the performance of a device or team could be
relatively easy, but determining whether that performance
is good or bad for the overall IT Service is another matter
altogether. A good Service Level Management process will
help to resolve this, but this means that Service Operation
teams must be an integral part of that process (see
Continual Service Improvement publication).

A third set of challenges relates to the use of Virtual
Teams. Traditional, hierarchical management structures are
becoming inadequate because of the complexity and
diversity of most organizations. A management paradigm
(Matrix Management) has emerged where employees
report to different sources for different tasks. This has
resulted in a complex web of accountability and an
increased risk of activities falling through the cracks. On
the other hand, it also enables the organization to make
skills and knowledge available where they are most
needed to support the business. Knowledge Management
and the mapping of authority structures will become
increasingly important as organizations expand and
diversify. This is discussed in the ITIL Service Strategy
publication.

One of the most significant challenges faced by Service
Operation Managers is the balancing of many internal and
external relationships. Most IT organizations today are
complex and as services become more commoditized
there is an increased use of value networks, partnerships
and shared services models. While this is a significant
advantage in meeting dynamically evolving business needs,
it increases the complexity of managing services cohesively
and efficiently, and of providing the invisible seam between
the customer and the intricate web of how services are
actually delivered. A Service Operation Manager should
invest in relationship management knowledge and skills to
help deal with the complexity of this challenge.

9.2 CRITICAL SUCCESS FACTORS

9.2.1 Management support

Senior and Middle Management support is needed for all
ITSM activities and processes, particularly in Service
Operation.

Senior Management support is critical for obtaining and
maintaining adequate funding and resourcing. Rather than
seeing Service Operation as a ‘black hole’ for investment,
Senior Management should quantify and champion the
benefits of good Service Operation. They should also be
fully informed of the dire results that can occur because of
poor Service Operation.

Senior Management must provide visible support during
the launch of new Service Operation initiatives (for
example through appearances at seminars and as signatories
to memos and announcements) and their ongoing support
must be equally well demonstrated. Entirely the wrong
message can be given if a senior manager fails to turn
up to an important project meeting or launch seminar.
Even worse are senior managers who support the initiative
verbally, but abuse their authority to encourage
circumvention of the Service Operation practice.

Senior Managers should also empower the Middle
Managers who will be directly responsible for Service
Operation. Supporting the initiative publicly, but then
overriding budget requirements or necessary changes, will
harm both the implementation and ongoing Service
Operation initiative.

Middle Managers must also provide the necessary support
– and in particular this should be demonstrated by their
actions. If a Middle Manager is seen to be circumventing
or overriding an agreed procedure (e.g. implementing a
change that has not been through the Change
Management process) then this gives the clear message
that others can do the same – and that the procedure is
worthless and can be ignored by all. Middle Managers
should go out of their way to make their support known,
not just by their words but also by their actions and
adherence to the organization’s agreed processes
and procedures.

Middle Managers should also give their full support to
hiring staff to support the process, instead of accepting
the need for formalized Service Operation and then simply
increasing the workload of existing staff to get it done.

9.2.2 Business support

It is important that the Business Units also support Service
Operation. This level of support can be far better achieved
if the Service Operation staff involve the business in all of
their activities and are open in their reporting of both
successes and failures – and their efforts to improve.

It is equally important that the Business Units understand,
accept and carry out the role they play in Service
Operation. Good service requires good customers!
Adhering to the policies, processes and procedures, such
as using the Service Desk for logging all requests, is a
direct responsibility of the customer to support and
promote within the business.

Regular communications with the business to understand
their concerns and aspirations and to give feedback on
efforts to meet their needs are essential in building the
correct relationships and ensuring ongoing support.

The business should also agree to the costs for
implementing Service Operation and understand the
return on the investment, unless this has already been
agreed as part of the Design process.

9.2.3 Champions

ITSM projects and the resulting ongoing practice
(performed by Service Operation staff) are often more
successful if one or more ‘champions’ are forthcoming
who can lead others through their enthusiasm and
commitment to ITSM.

In some cases these champions may be senior managers
who are leading from the top. But champions can also be
successful if they come from other tiers of the
organization. One or two junior staff can still have a
significant beneficial influence on a successful conclusion.

Champions are often created or heavily influenced
through formal Service Management training, particularly
at more advanced levels where the potential benefits to
an organization, and to the individuals who make a career
path in Service Management, can be fully explored.

 It should be noted that champions emerge over time.            organization – and all must be instilled with a ‘Service
 They cannot be created or appointed. Often it is users or      Management culture’.
 customers who provide the most help in creating good
                                                                It is possible to have the finest Service Operation practice
 Service Management processes as they are acutely aware
                                                                and tools in the world – but Service Management will not
 of needed improvements from a business perspective. It is
                                                                be successful unless the people are also attuned to the
 important to recognize that these are usually highly
                                                                overall Service Management objectives. Buy-in and
 motivated staff who often voluntarily take on the greatest
                                                                support of all staff are therefore very important – and the
 workloads. If their input is to be most effective they must
                                                                role of training and awareness, and even formal
 be given time to work as the champion.
                                                                qualifications that benefit the individual, should not be
                                                                underestimated.
 9.2.4 Staffing and retention
 Having the appropriate number of staff with the                Training required for successful Service Management