Service Operation
London: TSO

Published by TSO (The Stationery Office) and available from:
Online www.tsoshop.co.uk
Mail, Telephone, Fax & E-mail
TSO PO Box 29, Norwich, NR3 1GN
Telephone orders/General enquiries: 0870 600 5522
Fax orders: 0870 600 5533
E-mail: email@example.com
Textphone 0870 240 3701
TSO Shops
123 Kingsway, London, WC2B 6PQ 020 7242 6393 Fax 020 7242 6394
16 Arthur Street, Belfast BT1 4GD 028 9023 8451 Fax 028 9023 5401
71 Lothian Road, Edinburgh EH3 9AZ 0870 606 5566 Fax 0870 606 5588
TSO@Blackwell and other Accredited Agents

Published for the Office of Government Commerce under licence from the Controller of Her Majesty’s Stationery Office.
© Crown Copyright 2007
This is a Crown copyright value added product, reuse of which requires a Click-Use Licence for value added material issued by OPSI.
Applications to reuse, reproduce or republish material in this publication should be sent to OPSI, Information Policy Team, St Clements House, 2-16 Colegate, Norwich, NR3 1BQ, Tel No (01603) 621000 Fax No (01603) 723000, E-mail: firstname.lastname@example.org, or complete the application form on the OPSI website http://www.opsi.gov.uk/click-use/value-added-licence-information/index.htm
OPSI, in consultation with Office of Government Commerce (OGC), may then prepare a Value Added Licence based on standard terms tailored to your particular requirements including payment terms
The OGC logo ® is a Registered Trade Mark of the Office of Government Commerce
ITIL ® is a Registered Trade Mark, and a Registered Community Trade Mark of the Office of Government Commerce, and is Registered in the U.S.
Patent and Trademark Office
The Swirl logo ™ is a Trade Mark of the Office of Government Commerce
First published 2007
ISBN 978 0 11 331046 3
Printed in the United Kingdom for The Stationery Office

Contents
List of figures
List of tables
OGC’s foreword
Chief Architect’s foreword
Preface
Acknowledgements
1 Introduction
  1.1 Overview
  1.2 Context
  1.3 Purpose
  1.4 Usage
  1.5 Chapter overview
2 Service Management as a practice
  2.1 What is Service Management?
  2.2 What are services?
  2.3 Functions and processes across the lifecycle
  2.4 Service Operation fundamentals
3 Service Operation principles
  3.1 Functions, groups, teams, departments and divisions
  3.2 Achieving balance in Service Operation
  3.3 Providing service
  3.4 Operation staff involvement in Service Design and Service Transition
  3.5 Operational Health
  3.6 Communication
  3.7 Documentation
4 Service Operation processes
  4.1 Event Management
  4.2 Incident Management
  4.3 Request Fulfilment
  4.4 Problem Management
  4.5 Access Management
  4.6 Operational activities of processes covered in other lifecycle phases
5 Common Service Operation activities
  5.1 Monitoring and control
  5.2 IT Operations
  5.3 Mainframe Management
  5.4 Server Management and Support
  5.5 Network Management
  5.6 Storage and Archive
  5.7 Database Administration
  5.8 Directory Services Management
  5.9 Desktop Support
  5.10 Middleware Management
  5.11 Internet/Web Management
  5.12 Facilities and Data Centre Management
  5.13 Information Security Management and Service Operation
  5.14 Improvement of operational activities
6 Organizing for Service Operation
  6.1 Functions
  6.2 Service Desk
  6.3 Technical Management
  6.4 IT Operations Management
  6.5 Application Management
  6.6 Service Operation roles and responsibilities
  6.7 Service Operation Organization Structures
7 Technology considerations
  7.1 Generic requirements
  7.2 Event Management
  7.3 Incident Management
  7.4 Request fulfilment
  7.5 Problem Management
  7.6 Access Management
  7.7 Service Desk
8 Implementing Service Operation
  8.1 Managing change in Service Operation
  8.2 Service Operation and Project Management
  8.3 Assessing and managing risk in Service Operation
  8.4 Operational staff in Service Design and Transition
  8.5 Planning and Implementing Service Management technologies
9 Challenges, Critical Success Factors and risks
  9.1 Challenges
  9.2 Critical Success Factors
  9.3 Risks
Afterword
Appendix A: Complementary industry guidance
  A1 COBIT
  A2 ISO/IEC 20000
  A3 CMMI
  A4 Balanced Scorecard
  A5 Quality Management
  A6 ITIL and the OSI Framework
Appendix B: Communication in Service Operation
  B1 Routine operational communication
  B2 Communication between shifts
  B3 Performance Reporting
  B4 Communication in projects
  B5 Communication related to changes
  B6 Communication related to exceptions
  B7 Communication related to emergencies
  B8 Communication with users and customers
Appendix C: Kepner and Tregoe
  C1 Defining the problem
  C2 Describing the problem
  C3 Establishing possible causes
  C4 Testing the most probable cause
  C5 Verifying the true cause
Appendix D: Ishikawa Diagrams
Appendix E: Detailed description of Facilities Management
  E1 Building Management
  E2 Equipment Hosting
  E3 Power Management
  E4 Environmental Conditioning and Alert Systems
  E5 Safety
  E6 Physical Access Control
  E7 Shipping and Receiving
  E8 Involvement in Contract Management
  E9 Maintenance
Appendix F: Physical Access Control
Glossary
  Acronyms list
  Definitions list
Index

List of figures
All diagrams in this publication are intended to provide an illustration of ITIL Service Management Practice concepts and guidance. They have been artistically rendered to visually reinforce key concepts and are not intended to meet a formal method or standard of technical drawing. The ITIL Service Management Practices Integrated Service Model conforms to technical drawing standards and should be referred to for complete details. Please see www.best-management-practice.com/itil for details.
Figure 1.1 Source of Service Management Practice
Figure 1.2 ITIL Core
Figure 2.1 A conversation about the definition and meaning of services
Figure 2.2 A basic process
Figure 3.1 Achieving a balance between external and internal focus
Figure 3.2 Achieving a balance between focus on stability and responsiveness
Figure 3.3 Balancing service quality and cost
Figure 3.4 Achieving a balance between focus on cost and quality
Figure 3.5 Achieving a balance between being too reactive or too proactive
Figure 4.1 The Event Management process
Figure 4.2 Incident Management process flow
Figure 4.3 Multi-level incident categorization
Figure 4.4 Problem Management process flow
Figure 4.5 Important versus trivial causes
Figure 4.6 Service Knowledge Management System
Figure 5.1 Achieving maturity in Technology Management
Figure 5.2 The Monitor Control Loop
Figure 5.3 Complex Monitor Control Loop
Figure 5.4 ITSM Monitor Control Loop
Figure 6.1 Service Operation functions
Figure 6.2 Local Service Desk
Figure 6.3 Centralized Service Desk
Figure 6.4 Virtual Service Desk
Figure 6.5 Application Management Lifecycle
Figure 6.6 Role of teams in the Application Management Lifecycle
Figure 6.7 IT Operations organized according to technical specialization (sample)
Figure 6.8 A department based on executing a set of activities
Figure 6.9 IT Operations organized according to geography
Figure 6.10 Centralized IT Operations, Technical and Application Management structure
Figure D.1 Sample of starting an Ishikawa Diagram
Figure D.2 Sample of a completed Ishikawa Diagram

List of tables
Table 3.1 Examples of extreme internal and external focus
Table 3.2 Examples of extreme focus on stability and responsiveness
Table 3.3 Examples of extreme focus on quality and cost
Table 3.4 Examples of extremely reactive and proactive behaviour
Table 4.1 Simple priority coding system
Table 4.2 Pareto cause ranking chart
Table 5.1 Active and Passive Reactive and Proactive Monitoring
Table 6.1 Survey techniques and tools
Table 6.2 Organizational roles
Table B.1 Communication requirements in IT services
Table B.2 Communication requirements between shifts
Table B.3 Performance Reporting requirements: IT service
Table B.4 Performance Reporting requirements: Service Operation team or department
Table B.5 Performance Reporting requirements: infrastructure or process
Table B.6 Communication within projects
Table B.7 Communication on handover of projects
Table B.8 Communication about changes
Table B.9 Communication during exceptions
Table B.10 Communication during emergencies
Table B.11 Communication with users and customers
Table F.1 Access control devices

OGC’s foreword
Since its creation, ITIL has grown to become the most widely accepted approach to IT service management in the world. However, along with this success comes the responsibility to ensure that the guidance keeps pace with a changing global business environment. Service management requirements are inevitably shaped by the development of technology, revised business models and increasing customer expectations. Our latest version of ITIL has been created in response to these developments.
This is one of the five core publications describing the IT service management practices that make up ITIL. They are the result of a two-year project to review and update the guidance. The number of service management professionals around the world who have helped to develop the content of these publications is impressive. Their experience and knowledge have contributed to the content to bring you a consistent set of high-quality guidance. This is supported by the ongoing development of a comprehensive qualifications scheme, along with accredited training and consultancy.
Whether you are part of a global company, a government department or a small business, ITIL gives you access to world-class service management expertise. Essentially, it puts IT services where they belong – at the heart of successful business operations.
Peter Fanning
Acting Chief Executive
Office of Government Commerce

Chief Architect’s foreword
ITIL Service Management Practice guidance is structured around the Service Lifecycle.
Common across the lifecycle is the overall practice itself, which relies on processes, functions, activities, organizational models and measurement, which together allow IT Service Management (ITSM) to integrate with the business processes, provide measurable value and evolve the ITSM industry forward in our pursuit of service excellence. Nowhere else in the ITIL Service Lifecycle does the effect of how we perform as service providers touch the customers as intimately as Service Operations. This is where the strategy, design, transition and improvements are delivered and supported on a day-to-day basis. The Service Operation publication brings Service Management to life for the business, and the accountability for the performance of the services, the people who create them and the technology that enables them are monitored, controlled and delivered in this stage of the Service Lifecycle. This publication will help guide us all to achieve service excellence and to see the value of ITSM in a broad, business-focused view of it. Whether you are new to the practice of ITIL or a seasoned practitioner, the guidance in this publication will expand your vision and knowledge of how to be the best-of-breed service provider through implementation of Service Operation. There is a saying that hindsight is 20/20. The guidance in Service Operation is distilled from over 20 years of experience in ITSM by world experts, business people and ITSM practitioners and the lessons learned by them about what service excellence really is and how to achieve it. Anyone involved in operating services will benefit from the guidance in the following pages of this publication. Service Operation offers the best advice and guidance from around the world and a path to what is possible in your future. 
Sharon Taylor
Chief Architect, ITIL Service Management Practices

Preface
This publication encompasses and supersedes the operational aspects of the ITIL Service Support and Service Delivery publications and also covers most of the scope of ICT Infrastructure Management. It also incorporates operational aspects from the Planning to Implement, Application Management, Software Asset Management and Security Management publications.
The basic principles of best practice IT service management encompassed within earlier versions of ITIL remain unchanged. Common sense remains common sense! However, the technologies, tools and relationships have changed significantly, even in the relatively short time since the latest version of ITIL was completed.
Whilst this publication re-uses and updates relevant material from the earlier versions where appropriate, it also includes many new concepts and industry practices to give complete coverage of best-practice guidance for today’s Service Operation in a single volume, for today’s business and technological environment.

Contact information
Full details of the range of material published under the ITIL banner can be found at www.best-management-practice.com/itil
For further information on qualifications and training accreditation, please visit www.itil-officialsite.com.
Alternatively, please contact:
APMG Service Desk
Sword House
Totteridge Road
High Wycombe
Buckinghamshire HP13 6DG
Tel: +44 (0) 1494 452450
E-mail: email@example.com

Acknowledgements
Chief Architect and authors
Sharon Taylor (Aspect Group Inc) Chief Architect
David Cannon (HP) Author
David Wheeldon (HP) Author

ITIL authoring team
The ITIL authoring team contributed to this guide through commenting on content and alignment across the set. So thanks are also due to the other ITIL authors, specifically Jeroen Bronkhorst (HP), Gary Case (Pink Elephant), Ashley Hannah (HP), Majid Iqbal (Carnegie Mellon University), Shirley Lacy (ConnectSphere), Vernon Lloyd (Fox IT), Ivor Macfarlane (Guillemot Rock), Michael Nieves (Accenture), Stuart Rance (HP), Colin Rudd (ITEMS) and George Spalding (Pink Elephant).

Mentors
Christian Nissen and Paul Wilkinson.

Further contributions
A number of people generously contributed their time and expertise to this Service Operation publication. Jim Clinch, as OGC Project Manager, is grateful for the support provided by HP to the authoring team on the development of this publication and particularly the contribution of Peter Doherty and Robert Stroud, and for the support of Jenny Dugmore, Convenor of Working Group ISO/IEC 20000, Janine Eves, Carol Hulm, Aidan Lawes and Michiel van der Voort.
The authors would also like to thank Stuart Rance and Ashley Hanna of Hewlett-Packard, Christian F Nissen (ITILLIGENCE), Maria Vase (Itilligence), Eu Jin Ho (UBS), Jan Bjerregaard, (Sun Microsystems), Jan Øberg (ØBERG Partners), Lars Zobbe Mortensen (Zobbe Consult & Zoftware), Mette Nielsen (Carlsberg IT), Michael Imhoff (IBM), Niels Berner (Novo Nordisk), Nina Schertiger (HP), Signe-Marie Hernes Bjerke (DNV), Steen Sverker Nilsson (Westergaard CSM), Ulf Myrberg (BiTa), Russell Jukes, Debbi Jancaitis, Sheldon Parmer, Ramon Alanis, Tim Benson and Nenen Ong of Hewlett-Packard IT, Jaye Thompson, Dee Seymour, Andranik Ziyalyan, Young Chang, Lauren Abernethy, April McCowan, Becky Wershbale, Rob Garman, Scott McPherson, Sandra Breading, Rick Streeter, Leon Gantt, Charlotte Devine, Greg Algorri, Mary Fischer, Bill Thayer and Diana Osberg of The Walt Disney Company’s Enterprise IT, Dennis Deane and John Sowerby of DHL, Richard Fahey and Chris Hughes of HP Global Delivery Application Services, Cindi Locker and Dhiraj Gupta of Progressive Casualty Insurance Company, Peter Doherty and Robert Stroud from Computer Associates and Paul Tillston from Hewlett-Packard, Brian Jakubec, Vernon Blakes, Angela Chin, Colin Lovell, Ken Hamilton, Rose Lariviere, Jenny McPhee, Tom Nielsen, Roc Paez, Lloyd Robinson, Paul Wilmot, Jeanette Smith and Ken Wendle of Hewlett-Packard.
In order to develop ITIL Service Management Practices to reflect current best practice and produce publications of lasting value, OGC consulted widely with different stakeholders throughout the world at every stage in the process. OGC would also like to thank the following individuals and their organisations for their contributions to refreshing the ITIL guidance:

The ITIL Advisory Group
Pippa Bass, OGC; Tony Betts, Independent; Signe-Marie Hernes Bjerke, Det Norske Veritas; Alison Cartlidge, Xansa; Diane Colbeck, DIYmonde Solutions Inc; Ivor Evans, DIYmonde Solutions Inc; Karen Ferris, ProActive; Malcolm Fry, FRY-Consultants; John Gibert, Independent; Colin Hamilton, RENARD Consulting Ltd; Lex Hendriks, EXIN; Carol Hulm, British Computer Society-ISEB; Tony Jenkins, DOMAINetc; Phil Montanaro, EDS; Alan Nance, ITPreneurs; Christian Nissen, Itilligence; Don Page, Marval Group; Bill Powell, IBM; Sergio Rubinato Filho, CA; James Siminoski, SOScorp; Robert E. Stroud, CA; Jan van Bon, Inform-IT; Ken Wendle, HP; Paul Wilkinson, Getronics PinkRoccade; Takashi Yagi, Hitachi.

Reviewers
Jorge Acevedo, Computec S.A; Valerie Arraj, InteQ; Colin Ashcroft, City of London; Martijn Bakker, Getronics PinkRoccade; Jeff Bartrop, BT & Customer Service Direct; John Bennett, Centram Ltd; Niels Berner, Novo Nordisk; Ian Bevan, Fox IT; Signe-Marie Hernes Bjerke, DNV; Jan Bjerregaard, Sun Microsystems; Enrico Boverino, CA; Stephen Bull, Sierra Systems; Bradley Busch, InTotality; Howard Carpenter, IBM; Diane Colbeck, DIYmonde Solutions Inc; Nicole Conboy, Nicole Conboy & Associates; Sharon Dale, aQuip International; Sandra Daly, Dawling Consultancy; Michael Donahue, IBM; Paul Donald, Lucid IT; Juan Antonio Fernandez, Quint Wellington Redrood; Juan Jose Figueiras, Globant; Rae Garrett, Pink Elephant; Klaus Goedel, HP; Detlef Gross, Automation Consulting Group GmbH; Matthias Hall, University of Dundee; Lex Hendriks, EXIN; Jabe Hickey, IBM; Kevin Hite, Microsoft; Eu Jin Ho, UBS; Michael Imhoff, IBM; Scott Jaegar, Plexant; Tony Jenkins, DOMAINetc; Tony Kelman-Smith, HP; Peter Koepp, Independent; Joanne Kopcho, Capgemini America; Debbie Langenfield, IBM; Sarah Lascelles, Interserve Project Services Ltd; Peter Loos, Accenture Services GmbH; Emmanuel Marchand, Advens; Jesus Martin, Ibermatica SA; Phil Montanaro, EDS; Luis Moran,
Independent; Lars Zobbe Mortensen, Zobbe Consult & Zoftware; Ron Morton, HP; Darren Murtagh, Retravision; Ulf Myrberg, BiTa; Mette Nielsen, Carlsberg IT; Steen Sverker Nilsson, Westergaard CSM; Jan Øberg, ØBERG Partners; Eddy Peters, CTG; Poul Mols Poulsen, Coop Norden IT; Bill D Powell, IBM; Roger Purdie, The Art of Service; Padmini Ramamurthy, Satyam Computer Services Ltd; Frances Scarff, OGC; Nina Schertiger, HP; Markus Schiemer, Unisys; Barbara Schiesser, Swiss ICT; Klaus Seidel, Microsoft; Gilbert Silva, Techbiz Informatica Ltd; Joseph Stephen, Department of Transportation, US Government; Michala Sterling, Mid Sussex District Council; Rohan Thuraisingham, Friends Provident Management Services Ltd; Matthew Tolman, Sandvik; Jan van Bon, Inform-IT; Maria Vase, ITILLIGENCE; Christoph Wettstein, CLAVIS klw AG; Andi Wijaya, IBM; Aaron Wolfe, Pink Elephant; Takashi Yagi, Hitachi; YoungHoon Youn, IBM.

1 Introduction
This publication provides best-practice advice and guidance on all aspects of managing the day-to-day operation of an organization’s information technology (IT) services. It covers issues relating to the people, processes, infrastructure technology and relationships necessary to ensure the high-quality, cost-effective provision of IT service necessary to meet business needs.
The advent of new technology and the now blurred lines between the traditional technology silos of hardware, networks, telephony and software applications management mean that an updated approach to managing service operations is needed. Organizations are increasingly likely to consider different ways of providing their IT at optimum cost and flexibility, with the introduction of utility IT, pay-per-use IT Services, virtual IT provision, dynamic capacity and Adaptive Enterprise computing, as well as task-sourcing and outsourcing options.
These alternatives have led to a myriad of IT business relationships, both internally and externally, that have increased in complexity as much as the technologies being managed have. Business dependency on these complex relationships is increasingly critical to survival and prosperity.

1.1 OVERVIEW
Service Operation is the phase in the ITSM Lifecycle that is responsible for ‘business-as-usual’ activities.
Service Operation can be viewed as the ‘factory’ of IT. This implies a closer focus on the day-to-day activities and infrastructure that are used to deliver services. However, this publication is based on the understanding that the overriding purpose of Service Operation is to deliver and support services. Management of the infrastructure and the operational activities must always support this purpose.
Well planned and implemented processes will be to no avail if the day-to-day operation of those processes is not properly conducted, controlled and managed. Nor will service improvements be possible if day-to-day activities to monitor performance, assess metrics and gather data are not systematically conducted during Service Operation.
Service Operation staff should have in place processes and support tools to allow them to have an overall view of Service Operation and delivery (rather than just the separate components, such as hardware, software applications and networks, that make up the end-to-end service from a business perspective) and to detect any threats or failures to service quality.
As services may be provided, in whole or in part, by one or more partner/supplier organizations, the Service Operation view of end-to-end service must be extended to encompass external aspects of service provision – and where necessary shared or interfacing processes and tools are needed to manage cross-organizational workflows.
Service Operation is neither an organizational unit nor a single process – but it does include several functions and many processes and activities, which are described in Chapters 4, 5 and 6.

1.2 CONTEXT
1.2.1 Service Management
IT is a commonly used term that changes meaning with context. From the first perspective, IT systems, applications and infrastructure are components or sub-assemblies of a larger product. They enable or are embedded in processes and services. From the second perspective, IT is an organization with its own set of capabilities and resources. IT organizations can be of various types such as business functions, shared services units and enterprise-level core units.
From the third perspective, IT is a category of services utilized by business. They are typically IT applications and infrastructure that are packaged and offered as services by internal IT organizations or external service providers. IT costs are treated as business expenses. From the fourth perspective, IT is a category of business assets that provide a stream of benefits for their owners, including, but not limited to, revenue, income and profit. IT costs are treated as investments.

1.2.2 Good practice in the public domain
Organizations operate in dynamic environments with the need to learn and adapt. There is a need to improve performance while managing trade-offs. Under similar pressure, customers seek advantage from service providers. They pursue sourcing strategies that best serve their own business interest. In many countries, government agencies and non-profit-making enterprises have a similar propensity to outsource for the sake of operational effectiveness. This puts additional pressure on service providers to maintain a competitive advantage with regard to the alternatives that customers may have. The increase in outsourcing has particularly exposed internal service providers to unusual competition.
To cope with the pressure, organizations benchmark themselves against peers and seek to close gaps in capabilities. One way to close such gaps is the adoption of good practices across the industry. There are several sources for good practices, including public frameworks, standards and the proprietary knowledge of organizations and individuals (see Figure 1.1).

Figure 1.1 Source of Service Management Practice [diagram: Sources (Generate): standards, industry practices, academic research, training and education, internal experience; Enablers (Aggregate): employees, customers, suppliers, advisors, technologies; Drivers (Filter): substitutes, regulators, customers; Scenarios (Filter): competition, compliance, commitments; output: knowledge fit for business objectives, context and purpose]

Public frameworks and standards are attractive when compared with proprietary knowledge:
■ Proprietary knowledge is deeply embedded in organizations and therefore difficult to adopt, replicate or transfer, even with the cooperation of the owners. Such knowledge is often in the form of tacit knowledge which is inextricable and poorly documented.
■ Proprietary knowledge is customized for the local context and specific business needs, to the point of being idiosyncratic. Unless the recipients of such knowledge have matching circumstances, the knowledge may not be as effective in use.
■ Owners of proprietary knowledge expect to be rewarded for their long-term investments. They may make such knowledge available only under commercial terms, through purchases and licensing agreements.
■ Publicly available frameworks and standards such as ITIL, Control Objectives for IT (COBIT), CMMI, eSCM-SP, PRINCE2, ISO 9000, ISO 20000 and ISO 27001 are validated across a diverse set of environments and situations rather than the limited experience of a single organization. They are subject to broad review across multiple organizations and disciplines. They are vetted by diverse sets of partners, suppliers and competitors.
■ The knowledge of public frameworks is more likely to be widely distributed among a large community of professionals through publicly available training and certification. It is easier for organizations to acquire such knowledge through the labour market.
Ignoring public frameworks and standards can needlessly place an organization at a disadvantage. Organizations should cultivate their own proprietary knowledge on top of a body of knowledge based on public frameworks and standards. Collaboration and coordination across organizations are easier on the basis of shared practices and standards.

1.2.3 ITIL and good practice in Service Management
The context of this publication is the ITIL Framework as a source of good practice in Service Management. ITIL is used by organizations worldwide to establish and improve capabilities in Service Management. ISO/IEC 20000 provides a formal and universal standard for organizations seeking to have their Service Management capabilities audited and certified. While ISO/IEC 20000 is a standard to be achieved and maintained, ITIL offers a body of knowledge useful for achieving the standard.
The ITIL Library has the following components:
■ ITIL Core: best-practice guidance applicable to all types of organizations that provide services to a business
■ ITIL Complementary Guidance: a complementary set of publications with guidance specific to industry sectors, organization types, operating models and technology architectures.
The ITIL Core consists of five publications (see Figure 1.2). Each provides the guidance necessary for an integrated approach as required by the ISO/IEC 20000 standard specification:
■ Service Strategy
■ Service Design
■ Service Transition
■ Service Operation
■ Continual Service Improvement.

Figure 1.2 ITIL Core [diagram labels: Service Strategy, Service Design, Service Transition, Service Operation, Continual Service Improvement]

Each publication addresses capabilities having direct impact on a service provider’s performance. The structure of the core is in the form of a lifecycle. It is iterative and multidimensional. It ensures that organizations are set up to leverage capabilities in one area for learning and improvements in others. The Core is expected to provide structure, stability and strength to Service Management capabilities, with durable principles, methods and tools. This serves to protect investments and provide the necessary basis for measurement, learning and improvement.
The guidance in ITIL can be adapted for changes of use in various business environments and organizational strategies. The Complementary Guidance provides flexibility to implement the Core in a diverse range of environments. Practitioners can select Complementary Guidance as needed to provide traction for the Core in a given business context, much as tyres are selected based on the type of automobile, purpose and road conditions. This is to increase the durability and portability of knowledge assets and to protect investments in Service Management capabilities.

1.2.3.1 Service Strategy
The Service Strategy volume provides guidance on how to design, develop and implement Service Management, not only as an organizational capability but also as a strategic asset. Guidance is provided on the principles underpinning the practice of Service Management which are useful for developing Service Management policies, guidelines and processes across the ITIL Service Lifecycle. Service Strategy guidance is useful in the context of Service Design, Service Transition, Service Operation and Continual Service Improvement. Topics covered in Service Strategy include the development of markets, internal and external, service assets, service catalogue and implementation of strategy through the Service Lifecycle. Financial Management, Service Portfolio Management, Organizational Development and Strategic Risks are among other major topics.
Organizations use the guidance to set objectives and expectations of performance towards serving customers and market spaces and to identify, select and prioritize opportunities. Service Strategy is about ensuring that organizations are in a position to handle the costs and risks associated with their service portfolios and are set up not just for operational effectiveness but for distinctive performance. Decisions made with regard to Service Strategy have far-reaching consequences, including those with delayed effect.
Organizations already practising ITIL use this volume to guide a strategic review of their ITIL-based Service Management capabilities and to improve the alignment between those capabilities and their business strategies. This volume of ITIL encourages readers to stop and think about why something is to be done before thinking of how. Answers to the first type of questions are closer to the customer’s business. Service Strategy expands the scope of the ITIL Framework beyond the traditional audience of ITSM professionals.

1.2.3.2 Service Design
The Service Design volume provides guidance for the design and development of services and service management processes. It covers design principles and methods for converting strategic objectives into portfolios of services and service assets. The scope of Service Design is not limited to new services. It includes the changes and improvements necessary to increase or maintain value to customers over the lifecycle of services, the continuity of services, achievement of service levels and conformance to standards and regulations. It guides organizations on how to develop design capabilities for Service Management.

1.2.3.3 Service Transition
The Service Transition volume provides guidance for the development and improvement of capabilities for transitioning new and changed services into operations. This publication provides guidance on how the requirements of Service Strategy encoded in Service Design are effectively realized in Service Operations while controlling the risks of failure and disruption. The publication combines practices in Release Management, Programme Management and Risk Management and places them in the practical context of Service Management. It provides guidance on managing the complexity related to changes to services and Service Management processes, preventing undesired consequences while allowing for innovation. Guidance is provided on transferring the control of services between customers and service providers.

1.2.3.4 Service Operation
This volume embodies practices in the management of Service Operations. It includes guidance on achieving effectiveness and efficiency in the delivery and support of services so as to ensure value for the customer and the service provider. Strategic objectives are ultimately realized through Service Operations, therefore making it a critical capability. Guidance is provided on how to maintain stability in Service Operations, allowing for changes in design, scale, scope and service levels. Organizations are provided with detailed process guidelines, methods and tools for use in two major control perspectives: reactive and proactive. Managers and practitioners are provided with knowledge allowing them to make better decisions in areas such as managing the availability of services, controlling demand, optimizing capacity utilization, scheduling of operations and fixing problems. Guidance is provided on supporting operations through new models and architectures such as shared services, utility computing, web services and mobile commerce.

1.2.3.5 Continual Service Improvement
This volume provides instrumental guidance in creating and maintaining value for customers through better design, introduction and operation of services. It combines principles, practices and methods from Quality Management, Change Management and Capability Improvement. Organizations learn to realize incremental and large-scale improvements in service quality, operational efficiency and business continuity. Guidance is provided for linking improvement efforts and outcomes with Service Strategy, Service Design and Service Transition. A closed-loop feedback system, based on the Plan, Do, Check, Act (PDCA) model specified in ISO/IEC 20000, is established and capable of receiving inputs for change from any planning perspective.

The day-to-day operational management of IT Services is significantly influenced by how well an organization’s overall IT service strategy has been defined and how well the ITSM processes have been planned and implemented. This is the fourth publication in the ITIL Service Management Practices series and the other publications on Service Strategy, Service Design and Service Transition should be consulted for best practice guidance on these important stages prior to Service Operation.
Service Operation is extremely important, as it is on a day-to-day operational basis that events occur which can adversely impact service quality. The way in which an organization’s IT infrastructure and its supporting ITSM processes are operated will have the most direct and immediate short-term bearing upon service quality.

1.3 PURPOSE
Service Operation is a critical phase of the ITSM lifecycle.

and adopt’ the guidance for its own specific needs, environment and culture. This will involve taking into account the organization’s size, skills/resources, culture, funding, priorities and existing ITSM maturity and modifying the guidance as appropriate to suit the organization’s needs.
For organizations finding ITIL for the first time, some form of initial assessment to compare the organization’s current processes and practices with those recommended by ITIL would be a very valuable starting point. These assessments are described in more detail in the ITIL Continual Service Improvement publication.
Where significant gaps exist, it may be necessary to address them in stages over a period of time to meet the organization’s business priorities and keep pace with what the organization is able to absorb and afford.

1.5 CHAPTER OVERVIEW
Chapter 2 introduces the concept of Service Management as a practice. Here, Service Management is positioned as a strategic and professional component of any organization.
Well-planned and well-implemented processes will be to This chapter also provides an overview of Service no avail if the day-to-day operation of those processes is Operation as a critical component of the Service not properly conducted, controlled and managed. Nor will Management Practice. service improvements be possible if day-to-day activities The key principles of Service Operation are covered in to monitor performance, assess metrics and gather data Chapter 3 of this publication. These principles outline are not systematically conducted during Service Operation. some of the basic concepts and principles on which the Service Operation staff should have in place processes and rest of the publication is based. support tools to allow them to have an overall view of Chapter 4 covers the processes performed within Service Service Operation and delivery (rather than just the Operation – most of the Service Operation processes are separate components, such as hardware, software reactive because of the nature of the work being applications and networks, that make up the end-to-end performed to maintain IT services in a robust, stable service from a business perspective) and to detect any condition. This chapter also covers proactive processes to threats or failures to service quality. emphasize that the aim of Service Operation is stability – As services may be provided, in whole or in part, by one but not stagnation. Service Operation should be constantly or more partner/supplier organizations, the Service looking at ways of doing things better and more cost- Operation view of end-to-end service must be extended to effectively, and the proactive processes have an important encompass external aspects of service provision – and role to play here. where necessary shared or interfacing processes and tools Chapter 5 covers a number of Common Service Operation are needed to manage cross-organizational workflows. 
activities, which are groups of activities and procedures performed by Service Operation Functions. These 1.4 USAGE specialized, and often technical, activities are not processes in the true sense of the word, but they are all This publication should be used in conjunction with the vital for the ability to deliver quality IT services at optimal other four publications that make up the ITIL Service cost. Lifecycle. Chapter 6 covers the organizational aspects of Service Readers should be aware that the best-practice guidelines Operation – the individuals or groups who carry out in this and other volumes are not intended to be Service Operation processes or activities – and includes prescriptive. Each organization is unique and must ‘adapt 8 | Introduction some guidance on Service Operation organization structures. Chapter 7 describes the tools and technology that are used during Service Operation. Chapter 8 covers some aspects of implementation that will need to be considered before the operational phase of the lifecycle becomes active. Chapter 9 highlights the challenges, Critical Success Factors and risks faced during Service Operation, while the Afterword summarizes and concludes the publication. ITIL does not stand alone in providing guidance to IT managers and the appendices outline some of the key supplementary frameworks, methodologies and approaches that are commonly used in conjunction with ITIL during Service Operation. Service Management as a practice 2 | 11 2 Service Management as a practice 2.1 WHAT IS SERVICE MANAGEMENT? ■ The perishable nature of service output and service capacity: There is value for the customer from Service Management is a set of specialized organizational assurance on the continued supply of consistent capabilities for providing value to customers in the form of quality. Providers need to secure a steady supply services. The capabilities take the form of functions and of demand from customers. 
processes for managing services over a lifecycle, with specializations in strategy, design, transition, operation and However, Service Management is more than just a set of continual improvement. The capabilities represent a capabilities. It is also a professional practice supported by service organization’s capacity, competency and an extensive body of knowledge, experience and skills. A confidence for action. The act of transforming resources global community of individuals and organizations in the into valuable services is at the core of Service public and private sectors fosters its growth and maturity. Management. Without these capabilities, a service Formal schemes exist for the education, training and organization is merely a bundle of resources that by itself certification of practising organizations and individuals has relatively low intrinsic value for customers. influence its quality. Industry best practices, academic research and formal standards contribute to its intellectual Definition of Service Management capital and draw from it. Service Management is a set of specialized The origins of Service Management are in traditional organizational capabilities for providing value to service businesses such as airlines, banks, hotels and customers in the form of services. phone companies. Its practice has grown with the adoption by IT organizations of a service-oriented Organizational capabilities are shaped by the challenges approach to managing IT applications, infrastructure and they are expected to overcome. An example of this is how processes. Solutions to business problems and support for in the 1950s Toyota developed unique capabilities to business models, strategies and operations are increasingly overcome the challenge of smaller scale and financial in the form of services. The popularity of shared services capital compared to its American rivals. 
Toyota developed and outsourcing has contributed to the increase in the new capabilities in production engineering, operations number of organizations that are service providers, management and managing suppliers to compensate for including internal organizational units. This in turn has its inability to afford large inventories, make components, strengthened the practice of Service Management and at produce raw materials or own the companies that the same time imposed greater challenges upon it. produced them. [Source: Magretta, Joan 2002. What Management Is: How it works and why it’s everyone’s business. The Free Press.] Service Management capabilities 2.2 WHAT ARE SERVICES? are similarly influenced by the following challenges that distinguish services from other systems of value-creation, 2.2.1 The value proposition such as manufacturing, mining and agriculture: Definition of service ■ Intangible nature of the output and intermediate A service is a means of delivering value to customers products of service processes: Difficult to measure, by facilitating outcomes customers want to achieve, control and validate (or prove). without the ownership of specific costs and risks. ■ Demand is tightly coupled with the customer’s assets: Users and other customer assets such as processes, Services are a means of delivering value to customers by applications, documents and transactions arrive with facilitating outcomes customers want to achieve, without demand and stimulate service production. the ownership of specific costs and risks. Services facilitate ■ High level of contact for producers and consumers of outcomes by enhancing the performance of associated services: Little or no buffer between the customer, the tasks and reducing the effect of constraints. The result is front-office and the back-office. an increase in the probability of desired outcomes. 
Figure 2.1 A conversation about the definition and meaning of services
(A casual conversation at the water-cooler between Manager (Operations) and Manager (Strategy); speech bubbles in reading order)

‘I must ask, do you have a definition for services?’
‘I believe services are a means of delivering value by facilitating outcomes customers want to achieve without the ownership of specific costs and risks.’
‘What would that mean in operational terms? Give me a few handles.’
‘Well, services facilitate outcomes by having a positive effect on activities, objects and tasks, to create conditions for better performance. As a result, the probability of desired outcomes is higher.’
‘But without the ownership of costs and risks? Customers cannot wish them away.’
‘No, they cannot but what they can do is let the provider take ownership. That’s really why it is a service. If customers manage it all by themselves, they wouldn’t need a service would they?’
‘Aha! Because the provider is specialized with capabilities for dealing with those costs and risks.’
‘Yes, and also because the customer would rather specialize in those outcomes.’
‘And also because the provider can potentially spread those costs and risks across more than one customer.’
‘Let’s write a book on service management!’

2.3 FUNCTIONS AND PROCESSES ACROSS THE LIFECYCLE

2.3.1 Functions

Functions are units of organizations specialized to perform certain types of work and responsible for specific outcomes. They are self-contained, with capabilities and resources necessary for their performance and outcomes. Capabilities include work methods internal to the functions. Functions have their own body of knowledge, which accumulates from experience. They provide structure and stability to organizations.

Functions are a means of structuring organizations so as to implement the specialization principle. Functions typically define roles and the associated authority and responsibility for a specific performance and outcomes. Coordination between functions through shared processes is a common pattern in organization design. Functions tend to optimize their work methods locally, to focus on assigned outcomes. Poor coordination between functions, combined with an inward focus, leads to functional silos that hinder alignment and feedback critical to the success of the organization as a whole. Process models help avoid this problem with functional hierarchies by improving cross-functional coordination and control. Well-defined processes can improve productivity within and across functions.

2.3.2 Processes

Processes are examples of closed-loop systems because they provide change and transformation towards a goal and utilize feedback for self-reinforcing and self-corrective action (see Figure 2.2). It is important to consider the entire process or how one process fits into another.

Process definitions describe actions, dependencies and sequence. Processes have the following characteristics:

■ Measurable: We are able to measure the process in a relevant manner. It is performance driven. Managers want to measure cost, quality and other variables, while practitioners are concerned with duration and productivity.
■ Specific results: The reason a process exists is to deliver a specific result. This result must be individually identifiable and countable. While we can count changes, it is impossible to count how many Service Desks were completed.
■ Customers: Every process delivers its primary results to a customer or stakeholder. They may be internal or external to the organization but the process must meet their expectations.
■ Responds to a specific event: While a process may be ongoing or iterative, it should be traceable to a specific trigger.
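
The closed-loop idea described above (a specific trigger, an individually countable result, a customer, and feedback that corrects towards a goal) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the fields and the simple proportional correction are hypothetical and are not taken from ITIL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessRun:
    """One countable, individually identifiable result of the process."""
    trigger: str      # the specific event that started this run
    measured: float   # the performance level observed after the run


@dataclass
class ClosedLoopProcess:
    """Illustrative model only; names and the correction rule are assumptions."""
    customer: str     # who receives the primary result
    target: float     # the goal the process works towards
    level: float      # current performance level
    gain: float = 0.5  # how strongly feedback corrects the gap
    runs: List[ProcessRun] = field(default_factory=list)

    def handle(self, trigger: str) -> ProcessRun:
        gap = self.target - self.level   # measure against the goal
        self.level += self.gain * gap    # self-corrective action
        run = ProcessRun(trigger=trigger, measured=self.level)
        self.runs.append(run)            # results remain countable
        return run


process = ClosedLoopProcess(customer="business users", target=100.0, level=60.0)
for event in ("request-1", "request-2", "request-3"):
    process.handle(event)

print(len(process.runs))                   # number of results delivered
print([r.measured for r in process.runs])  # levels moving towards the target
```

Each call to `handle` is traceable to one trigger and leaves one record, so results can be counted, while the gap to the target shrinks on every run. A real process would replace the numeric correction with whatever review and control activities the organization actually uses.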
Service Management as a practice | 13 Data, Process information and knowledge Suppliers Desired Outcome Activity 1 Activity 2 Activity 3 Customer Service control and quality Trigger Figure 2.2 A basic process Functions are often mistaken for processes. For example, systems thinking. Each control perspective can reveal there are misconceptions about Capacity Management patterns that may not be apparent from the other. being a Service Management process. First, Capacity Management is an organizational capability with 2.4 SERVICE OPERATION FUNDAMENTALS specialized processes and work methods. Whether it is a function or a process depends entirely on organization 2.4.1 Purpose/goal/objective design. It is a mistake to assume that Capacity Management can only be a process. It is possible to The purpose of Service Operation is to coordinate and measure and control capacity and to determine whether it carry out the activities and processes required to deliver is adequate for a given purpose. Assuming that it is always and manage services at agreed levels to business users a process, with discrete countable outcomes, can be an and customers. Service Operation is also responsible for error. the ongoing management of the technology that is used to deliver and support services. 2.3.3 Specialization and coordination across Well-designed and well-implemented processes will be of the lifecycle little value if the day-to-day operation of those processes Specialization and coordination are necessary in the is not properly conducted, controlled and managed. Nor lifecycle approach. Feedback and control between the will service improvements be possible if day-to-day functions and processes within and across the elements of activities to monitor performance, assess metrics and the lifecycle make this possible. The dominant pattern in gather data are not systematically conducted during the lifecycle is the sequential progress starting from SS Service Operation. 
through SD-ST-SO and back to SS through CSI. However, that is not the only pattern of action. Every element of the 2.4.2 Scope lifecycle provides points for feedback and control. Service Operation includes the execution of all ongoing activities required to deliver and support services. The The combination of multiple perspectives allows greater scope of Service Operation includes: flexibility and control across environments and situations. The lifecycle approach mimics the reality of most ■ The services themselves. Any activity that forms part organizations where effective management requires the of a service is included in Service Operation, whether use of multiple control perspectives. Those responsible for it is performed by the Service Provider, an external the design, development and improvement of processes supplier or the user or customer of that service for Service Management can adopt a process-based ■ Service Management processes. The ongoing control perspective. Those responsible for managing management and execution of many Service agreements, contracts and services may be better served Management processes are performed in Service by a lifecycle-based control perspective with distinct Operation, even though a number of ITIL processes phases. Both these control perspectives benefit from 14 | Service Management as a practice (such as Change and Capacity Management) originate ■ It is difficult to obtain funding during the operational at the Service Design or Service Transition stage phase, to fix design flaws or unforeseen requirements of the Service Lifecycle, they are in use continually – since this was not part of the original value in Service Operation. Some processes are not proposition. In many cases it is only after some time in included specifically in Service Operation, such as operation that these problems surface. Most Strategy Definition, the actual design process itself. 
organizations do not have a formal mechanism to These processes focus more on longer-term planning review operational services for design and value. This and improvement activities, which are outside the is left to Incident and Problem Management to resolve direct scope of Service Operation; however, – as if it is purely an operational issue. Service Operation provides input and influences ■ It is difficult to obtain additional funding for tools or these regularly as part of the lifecycle of actions (including training) aimed at improving the Service Management. efficiency of Service Operation. This is partly because ■ Technology. All services require some form of they are not directly linked to the functionality of a technology to deliver them. Managing this technology specific service and partly because there is an is not a separate issue, but an integral part of the expectation from the customer that these costs should management of the services themselves. Therefore a have been built into the cost of the service from the large part of this publication is concerned with the beginning. Unfortunately, the rate of technology management of the infrastructure used to deliver change is very high. Shortly after a solution has been services. deployed that will efficiently manage a set of services, ■ People. Regardless of what services, processes and new technology becomes available that can do it technology are managed, they are all about people. It faster, cheaper and more effectively. is people who drive the demand for the organization’s ■ Once a service has been operational for some time, it services and products and it is people who decide becomes part of the baseline of what the business how this will be done. Ultimately, it is people who expects from the IT services. Attempts to optimize the manage the technology, processes and services. 
service or to use new tools to manage it more Failure to recognize this will result (and has resulted) effectively are seen as successful only if the service has in the failure of Service Management projects been very problematic in the past. In other words, some services are taken for granted and any action to 2.4.3 Value to business optimize them is perceived as ‘fixing services that are Each stage in the ITIL Service Lifecycle provides value to not broken’. business. For example, service value is modelled in Service This publication suggests a number of processes, functions Strategy; the cost of the service is designed, predicted and and measures which are aimed at addressing these areas. validated in Service Design and Service Transition; and measures for optimization are identified in Continual 2.4.4 Optimizing Service Operation Service Improvement. The operation of service is where performance these plans, designs and optimizations are executed and Service Operation is optimized in two ways: measured. From a customer viewpoint, Service Operation is where actual value is seen. ■ Long-term incremental improvement. This is based on evaluating the performance and output of all There is a down side to this, though: Service Operation processes, functions and outputs ■ Once a service has been designed and tested, it is over time. The reports are analysed and a decision expected to run within the budgetary and Return on made about whether improvement is needed and, if Investment targets established earlier in the lifecycle. so, how best to implement it through Service Design In reality, however, very few organizations plan and Transition. Examples include the deployment of a effectively for the costs of ongoing management of new set of tools, changes to process designs, services. It is very easy to quantify the costs of a reconfiguration of the infrastructure, etc. 
This type of project, but very difficult to quantify what the service improvement is covered in detail in the Continual will cost after three years of operation. Service Improvement publication. Service Management as a practice | 15 ■ Short-term ongoing improvement of working In order to resolve one or more incidents, problems or practices within the Service Operation processes, Known Errors, some form of change may be necessary. functions and technology itself. These are generally Smaller, often standard, changes can be handled through smaller improvements that are implemented without a Request Fulfilment process, but larger, higher-risk or any change to the fundamental nature of a process or infrequent changes must go through a formal Change technology. Examples include tuning, workload Management process. balancing, personnel redeployment and training, etc. Although both of these are discussed in some detail within 184.108.40.206 Access Management the scope of Service Operation, the Continual Service Access Management is the process of granting authorized Improvement publication will provide a framework and users the right to use a service, while restricting access to alternatives within which improvement may be driven as non-authorized users. It is based on being able accurately part of the overall support of business objectives. to identify authorized users and then manage their ability to access services as required during different stages of 2.4.5 Processes within Service Operation their Human Resources (HR) or contractual lifecycle. Access Management has also been called Identity or Rights There are a number of key Service Operation processes Management in some organizations. that must link together to provide an effective overall IT support structure. The overall structure is briefly described here and then each of the processes is described in more 2.4.6 Functions within Service Operation detail in Chapter 4. 
Processes alone will not result in effective Service Operation. A stable infrastructure and appropriately skilled 220.127.116.11 Event Management people are needed as well. To achieve this, Service Operation relies on several groups of skilled people, all Event Management monitors all events that occur focused on using processes to match the capability of the throughout the IT infrastructure, to monitor normal infrastructure to the needs of the business. operation and to detect and escalate exception conditions. These groups fall into four main functions, listed here and 18.104.22.168 Incident and Problem Management discussed in detail in Chapter 6. Incident Management concentrates on restoring unexpectedly degraded or disrupted services to users as 22.214.171.124 Service Desk quickly as possible, in order to minimize business impact. The Service Desk is the primary point of contact for users when there is a service disruption, for Service Requests, or Problem Management involves: root-cause analysis to even for some categories of Request for Change. The determine and resolve the cause of incidents, proactive Service Desk provides a point of communication to the activities to detect and prevent future problems/incidents users and a point of coordination for several IT groups and a Known Error sub-process to allow quicker diagnosis and processes and resolution if further incidents do occur. 126.96.36.199 Technical Management 188.8.131.52 Request Fulfilment Technical Management provides detailed technical skills Request Fulfilment is the process for dealing with Service and resources needed to support the ongoing operation Requests – many of them actually smaller, lower-risk, of the IT Infrastructure. Technical Management also plays changes – initially via the Service Desk, but using a an important role in the design, testing, release and separate process similar to that of Incident Management improvement of IT services. 
In small organizations, it is but with separate Request Fulfilment records/tables – possible to manage this expertise in a single department, where necessary linked to the Incident or Problem but larger organizations are typically split into a number Record(s) that initiated the need for the request. To be a of technically specialized departments. Service Request, it is normal for some prerequisites to be defined and met (e.g. needs to be proven, repeatable, pre- approved, proceduralized). 16 | Service Management as a practice 184.108.40.206 IT Operations Management ■ Financial Management, which is covered in the Service IT Operations Management executes the daily operational Strategy publication. activities needed to manage the IT Infrastructure. This is ■ Knowledge Management, which is covered in the done according to the Performance Standards defined Service Transition publication. during Service Design. In some organizations this is a ■ IT Service Continuity, which is covered in the Service single, centralized department, while in others some Design publication. activities and staff are centralized and some are provided ■ Service Reporting and Measurement, which are by distributed or specialized departments. IT Operations covered in the Continual Service Improvement Management has two functions that are unique and are publication. generally formal organizational structures. These are: ■ IT Operations Control, which is generally staffed by shifts of operators and which ensures that routine operational tasks are carried out. IT Operations Control will also provide centralized monitoring and control activities, usually using an Operations Bridge or Network Operations Centre. ■ Facilities Management refers to the management of the physical IT environment, usually data centres or computer rooms. In many organizations Technical and Application Management are co-located with IT Operations in large data centres. 
Application Management

Application Management is responsible for managing Applications throughout their lifecycle. The Application Management function supports and maintains operational applications and also plays an important role in the design, testing and improvement of applications that form part of IT services. Application Management is usually divided into departments based on the application portfolio of the organization, thus allowing easier specialization and more focused support.

Interfaces to other Service Management Lifecycle stages

There are several other processes that will be executed or supported during Service Operation, but which are driven during other phases of the Service Management Lifecycle. These will be discussed in the final part of Chapter 4 and include:

■ Change Management, which is a major process that should be closely linked to Configuration Management and Release Management. These topics are primarily covered in the Service Transition publication.
■ Capacity and Availability Management, which are covered in the Service Design publication.

3 Service Operation principles

When considering Service Operation it is tempting to focus only on managing day-to-day activities and technology as ends in themselves. However, Service Operation exists within a far greater context. As part of the Service Management Lifecycle, it is responsible for executing and performing processes that optimize the cost and quality of services. As part of the organization, it is responsible for enabling the business to meet its objectives. As part of the world of technology, it is responsible for the effective functioning of components that support services. The principles in this chapter are aimed at helping Service Operation practitioners to achieve a balance between all of these roles and to focus on effectively managing the day-to-day aspects while maintaining a perspective of the greater context.

3.1 FUNCTIONS, GROUPS, TEAMS, DEPARTMENTS AND DIVISIONS

The Service Operation publication uses several terms to refer to the way in which people are organized to execute processes or activities. There are several published definitions for each term and it is not the purpose of this publication to enter the debate about which definition is best. Please note that the following definitions are generic and not prescriptive. They are provided simply to define assumptions and to facilitate understanding of the material. The reader should adapt these principles to the organizational practices used in their own organization.

■ Function: A function is a logical concept that refers to the people and automated measures that execute a defined process, an activity or a combination of processes or activities. In larger organizations, a function may be broken out and performed by several departments, teams and groups, or it may be embodied within a single organizational unit (e.g. Service Desk). In smaller organizations, one person or group can perform multiple functions – e.g. a Technical Management department could also incorporate the Service Desk function.
■ Group: A group is a number of people who are similar in some way. In this publication, groups refer to people who perform similar activities – even though they may work on different technology or report into different organizational structures or even in different companies. Groups are usually not formal organization structures, but are very useful in defining common processes across the organization – e.g. ensuring that all people who resolve incidents complete the Incident Record in the same way. In this publication the term ‘group’ does not refer to a group of companies that are owned by the same entity.
■ Team: A team is a more formal type of group. These are people who work together to achieve a common objective, but not necessarily in the same organization structure. Team members can be co-located, or work in multiple different locations and operate virtually. Teams are useful for collaboration, or for dealing with a situation of a temporary or transitional nature. Examples of teams include project teams, application development teams (often consisting of people from several different business units) and incident or problem resolution teams.
■ Department: Departments are formal organization structures which exist to perform a specific set of defined activities on an ongoing basis. Departments have a hierarchical reporting structure with managers who are usually responsible for the execution of the activities and also for day-to-day management of the staff in the department.
■ Division: A division refers to a number of departments that have been grouped together, often by geography or product line. A division is normally self-contained and is able to plan and execute all activities in a supply chain.
■ Role: A role refers to a set of connected behaviours or actions that are performed by a person, team or group in a specific context. For example, a Technical Management department can perform the role of Problem Management when diagnosing the root cause of incidents. This same department could also be expected to play several other roles at different times, e.g. it may assess the impact of changes (Change Management role), manage the performance of devices under their control (Capacity Management role), etc. The scope of their role and what triggers them to play that role are defined by the relevant process and agreed by their line manager.

3.2 ACHIEVING BALANCE IN SERVICE OPERATION

Service Operation is more than just the repetitive execution of a standard set of procedures or activities.
All functions, processes and activities are designed to deliver a specified and agreed level of services, but they have to be delivered in an ever-changing environment.

This forms a conflict between maintaining the status quo and adapting to changes in the business and technological environments. One of Service Operation’s key roles is therefore to deal with this conflict and to achieve a balance between conflicting sets of priorities.

This section of the publication highlights some of the key tensions and conflicts and identifies how IT organizations can recognize that they are suffering from an imbalance by tending more towards one extreme or the other. It also provides some high-level guidelines on how to resolve the conflict and thus move towards a best-practice approach. Every conflict therefore represents an opportunity for growth and improvement.

3.2.1 Internal IT view versus external business view

The most fundamental conflict in all phases of the ITSM Lifecycle is between the view of IT as a set of IT services (the external business view) and the view of IT as a set of technology components (internal IT view).

■ The external view of IT is the way in which services are experienced by its users and customers. They do not always understand, nor do they care about, the details of what technology is used to manage those services. All they are concerned about is that the services are delivered as required and agreed.
■ The internal view of IT is the way in which IT components and systems are managed to deliver the services. Since IT systems are complex and diverse, this often means that the technology is managed by several different teams or departments – each of which is focused on achieving good performance and availability of ‘its’ systems.

Both views are necessary when delivering services. The organization that focuses only on business requirements without thinking about how they are going to deliver will end up making promises that cannot be kept. The organization that focuses only on internal systems without thinking about what services they support will end up with expensive services that deliver little value.

The potential for role conflict between the external and internal views is the result of many variables, including the maturity of the organization, its management culture, its history, etc. This makes a balance difficult to achieve, and most organizations tend more towards one role than the other. Of course, no organization will be totally internally or externally focused, but will find itself in a position along a spectrum between the two. This is illustrated in Figure 3.1:

[Figure 3.1: a spectrum running from ‘Extreme Focus on Internal’ to ‘Extreme Focus on External’, with two callouts: ‘An organization here is out of balance and is in danger of not meeting business requirements’ and ‘An organization here is quite balanced, but tends to under-deliver on promises to the business’]

Figure 3.1 Achieving a balance between external and internal focus

Table 3.1 outlines some examples of the characteristics of positions at the extreme ends of the spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.
Table 3.1 Examples of extreme internal and external focus

Primary focus
Extreme internal focus: Performance and management of IT Infrastructure devices, systems and staff, with little regard to the end result on the IT service
Extreme external focus: Achieving high levels of IT service performance with little regard to how it is achieved

Metrics
Extreme internal focus:
■ Focus on technical performance without showing what this means for services
■ Internal metrics (e.g. network uptime) reported to the business instead of service performance metrics.
Extreme external focus:
■ Focus on External Metrics without showing internal staff how these are derived or how they can be improved
■ Internal staff are expected to devise their own metrics to measure internal performance.

Customer/user experience
Extreme internal focus:
■ High consistency of delivery, but only delivers a percentage of what the business needs.
■ Uses a ‘push’ approach to delivery, i.e. prefers to have a standard set of services for all business units.
Extreme external focus:
■ Poor consistency of delivery
■ ‘IT consists of good people with good intentions, but cannot always execute’
■ Reactive mode of operation.
■ Uses a ‘pull’ approach to delivery, i.e. prefers to deliver customized services upon request

Operations strategy
Extreme internal focus:
■ Standard operations across the board
■ All new services need to fit into the current architecture and procedures.
Extreme external focus:
■ Multiple delivery teams and multiple technologies
■ New technologies require new operations approaches and often new IT Operations teams.

Procedures and manuals
Extreme internal focus: Focus purely on how to manage the technology, not on how its performance relates to IT services
Extreme external focus: Focuses primarily on what needs to be done and when and less on how this should be achieved

Cost strategy
Extreme internal focus:
■ Cost reduction achieved purely through technology consolidation
■ Optimization of operational procedures and resources
■ Business impact of cost cutting often only understood later
■ Return on Investment calculations are focused purely on cost savings or ‘payback periods’.
Extreme external focus:
■ Budget allocated on the basis of which business unit is perceived to have the most need
■ Less articulate or vocal business units often have inferior services as there is not enough funding allocated to their services.

Training
Extreme internal focus: Training is conducted as an apprenticeship, where new Operations staff have to learn the way things have to be done, not why
Extreme external focus:
■ Training is conducted on a project-by-project basis
■ There are no standard training courses since operational procedures and technology are constantly changing.

Operations staff
Extreme internal focus:
■ Specialized staff, organized according to technical specialty
■ Staff work on the false assumption that good technical achievement is the same as good customer service.
Extreme external focus:
■ Generalist staff, organized partly according to technical capability and partly according to their relationship with a business unit
■ Reliance on ‘heroics’, where staff go out of their way to resolve problems that could have been prevented by better internal processes.

This does not mean that the external focus is unimportant. The whole point of Service Management is to provide services that meet the objectives of the organization as a whole. It is critical to structure services around customers. At the same time, it is possible to compromise the quality of services by not thinking about how they will be delivered.

Building Service Operation with a balance between internal and external focus requires a long-term, dedicated approach reflected in all phases of the ITSM Service Lifecycle. This will require the following:

■ An understanding of what services are used by the business and why.
■ An understanding of the relative importance and impact of those services on the business.
■ An understanding of how technology is used to provide IT services.
■ Involvement of Service Operation in Continual Service Improvement projects that aim to identify ways of delivering more, increase service quality and lower cost.
■ Procedures and manuals that outline the role of IT Operations in both the management of technology and the delivery of IT services.
■ A clearly differentiated set of metrics to report to the business on the achievement of service objectives; and to report to IT managers on the efficiency and effectiveness of Service Operation.
■ All IT Operations staff understand exactly how the performance of the technology affects the delivery of IT services and in turn how these affect the business and the business goals.
■ A set of standard services delivered consistently to all Business Units and a set of non-standard (sometimes customized) services delivered to specific Business Units – together with a set of Standard Operating Procedures (SOPs) that can meet both sets of requirements.
■ A cost strategy aimed at balancing the requirements of different business units with the cost savings available through optimization of existing technology or investment in new technology – and an understanding of the cost strategy by all involved IT resources.
■ A value-based, rather than cost-based, Return on Investment strategy.
■ Involvement of IT Operations staff in the Service Design and Service Transition phases of the ITSM Lifecycle.
■ Input from and feedback to Continual Service Improvement to identify areas where there is an imbalance and the means to identify and enforce improvement.
■ A clear communication and training plan for business. While many organizations are good at developing Communication Plans for projects, this often does not extend into their operational phase.

3.2.2 Stability versus responsiveness

No matter how good the functionality is of an IT service and no matter how well it has been designed, it will be worth far less if the service components are not available or if they perform inconsistently.

This means that Service Operation needs to ensure that the IT Infrastructure is stable and available as designed. At the same time, Service Operation needs to recognize that business and IT requirements change.

Some of these changes are evolutionary. For example, the functionality, performance and architecture of a platform may change over a number of years. Each change brings with it an opportunity to provide better levels of service to the business. In evolutionary changes, it is possible to plan how to respond to the change and thus maintain stability while responding to the changes.

Many changes, though, happen very quickly and sometimes under extreme pressure. For example, a Business Unit unexpectedly wins a contract that requires additional IT services, more capacity and faster response times. The ability to respond to this type of change without impacting other services is a significant challenge.

Many IT organizations are unable to achieve this balance and tend to focus on either the stability of the IT Infrastructure or the ability to respond to changes quickly.

[Figure 3.2: a spectrum running from ‘Extreme Focus on Stability’ to ‘Extreme Focus on Responsiveness’, with two callouts: ‘An organization here is out of balance and is in danger of ignoring changing business requirements’ and ‘An organization here is quite balanced, but may tend to overspend on change’]

Figure 3.2 Achieving a balance between focus on stability and responsiveness
Table 3.2 Examples of extreme focus on stability and responsiveness

Primary focus
Extreme focus on stability:
■ Technology
■ Developing and refining standard IT management techniques and processes.
Extreme focus on responsiveness:
■ Output to the business
■ Agrees to required changes before determining what it will take to deliver them.

Typical problems experienced
Extreme focus on stability: IT can demonstrate that it is complying with SOPs and Operational Level Agreements (OLAs), even when there is clear misalignment to business requirements
Extreme focus on responsiveness: IT staff are not available to define or execute routine tasks because they are busy on projects for new services

Technology growth strategy
Extreme focus on stability:
■ Growth strategy based on analysing existing demand on existing systems
■ New services are resisted and Business Units sometimes take ownership of ‘their own’ systems to get access to new services.
Extreme focus on responsiveness:
■ Technology purchased for each new business requirement
■ Using multiple technologies and solutions for similar solutions, to meet slightly different business needs.

Technology used to deliver IT services
Extreme focus on stability: Existing or standard technology to be used; services must be adjusted to work within existing parameters
Extreme focus on responsiveness: Over-provisioning. No attempt is made to model the new service on the existing infrastructure. New, dedicated technology is purchased for each new project

Capacity Management
Extreme focus on stability:
■ Forecasts based on projections of current workloads
■ System performance is maintained at consistent levels through tuning and demand management, not by workload forecasting and management.
Extreme focus on responsiveness:
■ Forecasts based on future business activity for each service individually and do not take into account IT activity or other IT services
■ Existing workloads not relevant.

Table 3.2 outlines some examples of the characteristics of positions at extreme ends of the spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Building an IT organization that achieves a balance between stability and responsiveness in Service Operation will require the following actions:

■ Ensure investment in technologies and processes that are adaptive rather than rigid, e.g. virtual server and application technology and the use of Change Models (see Service Transition publication).
■ Build a strong Service Level Management (SLM) process which is active from the Service Design phase to the Continual Service Improvement phase of the ITSM Lifecycle.
■ Foster integration between SLM and the other Service Design processes to ensure proper mapping of business requirements to IT operational activities and components of the IT Infrastructure. This makes it easier to model the effect of changes and improvements.
■ Initiate changes at the earliest appropriate stage in the ITSM Lifecycle. This will ensure that both functional (business) and manageability (IT operational) requirements can be assessed and built or changed together.
■ Ensure IT involvement in business changes as early as possible in the change process to ensure scalability, consistency and achievability of IT services sustaining business changes.
■ Service Operation teams should provide input into the ongoing design and refinement of the architectures and IT services (see Service Design and Service Strategy publications).
■ Implement and use SLM to avoid situations where business and IT managers and staff negotiate informal agreements.

3.2.3 Quality of service versus cost of service

Service Operation is required consistently to deliver the agreed level of IT service to its customers and users, while at the same time keeping costs and resource utilization at an optimal level.
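Availability is one common measure of service quality, and small percentage differences can hide large differences in permitted downtime. As a rough illustration only, the sketch below converts an availability percentage into hours of downtime per year. The function name, the 24x7 service window and the 365-day year are assumptions made for this example, and the percentages passed in are illustrative rather than recommended targets.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, service expected 24x7


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Illustrative comparison of availability levels (hypothetical values).
for level in (96.0, 98.0, 99.9):
    print(f"{level}% availability allows about "
          f"{annual_downtime_hours(level):.1f} hours of downtime per year")
```

A calculation of this kind says nothing about cost by itself; it only makes the quality side of the comparison easier to discuss with the business.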
[Figure 3.3: Cost of Service plotted against Quality of Service (Performance, Availability, Recovery), with a marked range of optimal balance between Cost and Quality]

Figure 3.3 Balancing service quality and cost

Figure 3.3 represents the investment made to deliver a service at increasing levels of quality.

In Figure 3.3, an increase in the level of quality usually results in an increase in the cost of that service, and vice versa. However, the relationship is not always directly proportional:

■ Early in the service’s lifecycle it is possible to achieve significant increases in service quality with a relatively small amount of money. For example, improving service availability from 55% to 75% is fairly straightforward and may not require a huge investment.
■ Later in the service’s lifecycle, even small improvements in quality are very expensive. For example, improving the same service’s availability from 96% to 99.9% may require large investments in high-availability technology and support staff and tools.

While this may seem straightforward, many organizations are under severe pressure to increase the quality of service while reducing their costs. In Figure 3.3, the relationship between cost and quality is sometimes inverse. It is possible (usually inside the range of optimization) to increase quality while reducing costs. This is normally initiated within Service Operation and carried forward by Continual Service Improvement. Some costs can be reduced incrementally over time, but most cost savings can be made only once. For example, once a duplicate software tool has been eliminated, it cannot be eliminated again for further cost savings.

Achieving an optimal balance between cost and quality (shown between the dotted lines in Figure 3.3) is a key role of Service Management. There is no industry standard for what this range should be, since each service will have a different range of optimization, depending on the nature of the service and the type of business objective being met. For example, the business may be prepared to spend more to achieve high availability on a mission-critical service, while it is prepared to live with the lower quality of an administrative tool.

Determining the appropriate balance of cost and quality should be done during the Service Strategy and Service Design Lifecycle phases, although in many organizations it is left to the Service Operation teams – many of whom do not generally have all the facts or authority to be able to make this type of decision.

Unfortunately, it is also common to find organizations that are spending vast quantities of money without achieving any clear improvements in quality. Again, Continual Service Improvement will be able to identify the cause of the inefficiency, evaluate the optimal balance for that service and formulate a corrective plan.

Achieving the correct balance is important. Too much focus on quality will result in IT services that deliver more than necessary, at a higher cost, and could lead to a discussion on reducing the price of services. Too much focus on cost will result in IT delivering on or under budget, but putting the business at risk through sub-standard IT services.

Special note: just how far is too much?

Over the past several years, IT organizations have been under pressure to cut costs. In many cases this resulted in optimized costs and quality. But, in other cases, costs were cut to the point where quality started to suffer. At first, the signs were subtle – small increases in incident resolution times and a slight increase in the number of incidents. Over time, though, the situation became more serious as staff worked long hours to handle multiple workloads and services ran on ageing or outdated infrastructure.

There is no simple calculation to determine when costs have been cut too far, but good SLM is crucial to making customers aware of the impact of cutting too far, so recognizing these warning signs and symptoms can greatly enhance an organization’s ability to correct this situation.

Service Level Requirements – together with a clear understanding of the business purpose of the service and the potential risks – will help to ensure that the service is delivered at the appropriate cost. They will also help to avoid ‘over sizing’ of the service just because budget is available, or ‘under sizing’ because the business does not understand the manageability requirements of the solution. Either result will cause customer dissatisfaction and even more expense when the solution is re-engineered or retro-fitted to the requirements that should have been specified during Service Design.

[Figure 3.4: a spectrum running from ‘Extreme Focus on Cost’ to ‘Extreme Focus on Quality’, with two callouts: ‘An organization here is out of balance and is in danger of losing service quality because of heavy cost cutting’ and ‘An organization here is quite balanced, but may tend to overspend to deliver higher levels of service than are strictly necessary’]

Figure 3.4 Achieving a balance between focus on cost and quality

Table 3.3 outlines some examples of the characteristics of positions at extreme ends of the cost/quality spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Table 3.3 Examples of extreme focus on quality and cost

Primary focus
Extreme focus on quality: Delivering the level of quality demanded by the business regardless of what it takes
Extreme focus on cost: Meeting budget and reducing costs

Typical problems experienced
Extreme focus on quality:
■ Escalating budgets
■ IT services generally deliver more than is necessary for business success
■ Escalating demands for higher-quality services.
Extreme focus on cost:
■ IT limits the quality of service based on their budget availability
■ Escalations from the business to get more service from IT.

Financial Management
Extreme focus on quality: IT usually does not have a method of communicating the cost of IT services. Accounting methods are based on an aggregated method (e.g. cost of IT per user).
Extreme focus on cost: Financial reporting is done purely on budgeted amounts. There is no way of linking activities in IT to the delivery of IT services.

Achieving a balance will ensure delivery of the level of service necessary to meet business requirements at an optimal (as opposed to lowest possible) cost. This will require the following:

■ A Financial Management process and tools that can account for the cost of providing IT services; and which model alternative methods of delivering services at differing levels of cost. For example, comparing the cost of delivering a service at 98% availability or at 99.9% availability; or the cost of providing a service with or without additional functionality.
■ Ensuring that decisions around cost versus quality are made by the appropriate managers during Service Strategy and Service Design. IT operational managers are generally not equipped to evaluate business opportunities and should only be asked to make financial decisions that are related to achieving operational efficiencies.

3.2.4 Reactive versus proactive

A reactive organization is one which does not act unless it is prompted to do so by an external driver, e.g. a new business requirement, an application that has been developed or escalation in complaints made by users and customers. An unfortunate reality in many organizations is the focus on reactive management mistakenly as the sole means to ensure services that are highly consistent and stable, actively discouraging proactive behaviour from operational staff. The unfortunate irony of this approach is that discouraging effort investment in proactive Service Management can ultimately increase the effort and cost of reactive activities and further risk stability and consistency in services.

A proactive organization is always looking for ways to improve the current situation. It will continually scan the internal and external environments, looking for signs of potentially impacting changes. Proactive behaviour is usually seen as positive, especially since it enables the organization to maintain competitive advantage in a changing environment. However, being too proactive can be expensive and can result in staff being distracted. The need for proper balance in reactive and proactive behaviour often achieves the optimal result.

Generally, it is better to manage IT services proactively, but achieving this is not easily planned or achieved. This is because building a proactive IT organization is dependent on many variables, including:

■ The maturity of the organization. The longer the organization has been delivering a consistent set of IT services, the more likely it is to understand the relationship between IT and the business and the IT Infrastructure and IT services.
■ The culture of the organization. Some organizations have a culture that is focused on innovation and are more likely to be proactive. Others are more likely to focus on the status quo and as such are likely to resist change and have more reactive focus.
■ The role that IT plays in the business and the mandate that IT has to influence the strategy and tactics of the business. For example, a company where the CIO is a board member is likely to have an IT organization that is far more proactive and responsive than a company where IT is seen as an administrative overhead.
■ The level of integration of management processes and tools. Higher levels of integration will facilitate better knowledge of opportunities.
■ The maturity and scope of Knowledge Management in the organization; this is especially seen in organizations which have been able to store and organize historical data effectively – especially Availability and Problem Management data.

From a maturity perspective, it is clear that newer organizations will have different priorities and experiences from a more established organization – what is best practice for a mature organization may not suit a younger organization. Therefore an imbalance could result from an organization being either less or more mature. Consider the following:

■ Less mature organizations (or organizations with newer IT services or technology) will generally be more reactive, simply because they do not know all the variables involved in running their business and providing IT services.
■ IT staff in newer organizations tend to be generalists because it is unclear exactly what is required to deliver stable IT services to the business.
■ Incidents and problems in newer organizations are fairly unpredictable because the technology is relatively new and changes quickly.
■ More mature organizations tend to be more proactive, simply because they have more data and reporting available and know the typical patterns of incidents and workflows. Thus, they forecast exceptions far more easily.
■ Staff working in mature organizations also generally tend to have more established relationships between IT staff and the business and so can be more proactive about meeting changing business requirements – this is especially true when IT is seen as a strategic component of the business.
[Figure 3.5: a spectrum running from ‘Extremely Reactive’ to ‘Extremely Proactive’, with two callouts: ‘An organization here is out of balance and is not able to effectively support the business strategy’ and ‘An organization here is quite balanced, but tends to fix services that are not broken, resulting in higher levels of change’]

Figure 3.5 Achieving a balance between being too reactive or too proactive

Table 3.4 outlines some examples of the characteristics of positions at extreme ends of the spectrum. The purpose of this table is to assist organizations in identifying to which extreme they are closer, not to identify real-life positions to which organizations should aspire.

Table 3.4 Examples of extremely reactive and proactive behaviour

Primary focus
Extremely reactive: Responds to business needs and incidents only after they are reported
Extremely proactive: Anticipates business requirements before they are reported and problems before they occur

Typical problems experienced
Extremely reactive:
■ Preparing to deliver new services takes a long time because each project is dealt with as if it is the first
■ Similar incidents occur again and again, as there is no way of trending them
■ Staff turnover is high and morale is generally low, as IT staff keep moving from project to project without achieving a lasting, stable set of IT services
Extremely proactive:
■ Money is spent before the requirements are stated. In some cases IT purchases items that will never be used because they anticipated the wrong requirements or because the project is stopped
■ IT staff tend to have been in the organization for a long time and tend to assume that they know the business requirements better than the business does

Capacity Planning
Extremely reactive: Wait until there are capacity problems and then purchase surplus capacity to last until the next capacity-related incident
Extremely proactive: Anticipate capacity problems and spend money on preventing these – even when the scenario is unlikely to happen

IT Service Continuity Planning
Extremely reactive:
■ No plans exist until after a major event or disaster
■ IT Plans focus on recovering key systems, but without ensuring that the business can recover its processes
Extremely proactive: Over-planning (and over-spending) of IT Recovery options. Usually immediate recovery is provided for most IT services, regardless of their impact or priority

Change Management
Extremely reactive:
■ Changes are often not logged, or logged at the last minute as Emergency Changes
■ Not enough time for proper impact and cost assessments
■ Changes are poorly tested and controlled, resulting in a high number of incidents
Extremely proactive: Changes are requested and implemented even when there is no real need, i.e. a significant amount of work done to fix items that are not broken

While proactive behaviour in Service Operation is generally good, there are also times where reactive behaviour is needed. The role of Service Operation is therefore to achieve a balance between being reactive and proactive. This will require:

■ Formal Problem Management and Incident Management processes, integrated between Service Operation and Continual Service Improvement.
■ The ability to be able to prioritize technical faults as well as business demands. This needs to be done during Service Operation, but the mechanisms need to be put in place during Service Strategy and Design. These mechanisms could include incident categorization systems, escalation procedures and tools to facilitate impact assessment for changes.
■ Data from Configuration and Asset Management to provide data where required, saving projects time and making decisions more accurate.
■ Ongoing involvement of SLM in Service Operation.

3.3 PROVIDING SERVICE

All Service Operation staff must be fully aware that they are there to ‘provide service’ to the business. They must provide a timely (rapid response and speedy delivery of requirements), professional and courteous service to allow the business to conduct its own activities – so that the commercial customer’s needs are met and the business thrives.

It is important that staff are trained not only in how to deliver and support IT services, but also in the manner in which that service should be provided. For example, staff that are capable and deliver service effectively may still cause significant customer dissatisfaction if they are insensitive or dismissive. Conversely, no amount of being nice to a customer will help if the service is not being delivered.

A critical element of being a proficient service provider is placing as much emphasis on recruiting and training staff to develop competency in dealing with and managing customer relationships and interactions as they do on technical competencies for managing the IT environment.

3.4 OPERATION STAFF INVOLVEMENT IN SERVICE DESIGN AND SERVICE TRANSITION

It is extremely important that Service Operation staff are involved in Service Design and Service Transition and potentially also in Service Strategy where appropriate.

One key to achieving balance in Service Operation is an effective set of Service Design processes. These will provide IT Operations Management with:

■ Clear definition of IT service objectives and performance criteria
■ Linkage of IT service specifications to the performance of the IT Infrastructure
■ Definition of operational performance requirements
■ A mapping of services and technology
■ The ability to model the effect of changes in technology and changes to business requirements
■ Appropriate cost models (e.g. customer or service based) to evaluate Return on Investment and cost-reduction strategies.

The nature of IT Operations Management involvement should be carefully positioned. Service Design is a phase in the Service Management Lifecycle using a set of processes, not a function independent of Service Operation. As such, many of the people who are involved in Service Design will come from IT Operations Management.

This should not only be encouraged, but Service Operation staff should be measured on their involvement in Service Design activities – and such activities should be included in job descriptions and roles, etc. This will help to ensure continuity between business requirements and technology design and operation and it will also help to ensure that what is designed can also be operated. IT Operations Management staff should also be involved during Service Transition to ensure consistency and to ensure that both stated business and manageability requirements are met.

Resources must be made available for these activities and the time required should be taken into account, as appropriate.

3.5 OPERATIONAL HEALTH

Many organizations find it helpful to compare the monitoring and control of Service Operation to health monitoring and control.

In this sense, the IT Infrastructure is like an organism that has vital life signs that can be monitored to check whether it is functioning normally. This means that it is not necessary to monitor continuously every component of every IT system to ensure that it is functioning.

Operational Health can be determined by isolating a few important ‘vital signs’ on devices or services that are defined as critical for the successful execution of a Vital Business Function. This could be the bandwidth utilization on a network segment, or memory utilization on a major server. If these signs are within normal ranges, the system is healthy and does not require additional attention.

This reduction in the need for extensive monitoring will result in cost reduction and operational teams and departments that are focused on the appropriate areas for service success.

However, as with organisms, it is important to check systems more thoroughly from time to time, to check for problems that do not immediately affect vital signs. For example a disk may be functioning perfectly, but it could be nearing its Mean Time Between Failures (MTBF) threshold. In this case the system should be taken out of service and given a thorough examination or ‘health check’. At the same time, it should be stressed that the end result should be the healthy functioning of the service as a whole. This means that health checks on components should be balanced against checks of the ‘end-to-end’ service. The definition of what needs to be monitored and what is healthy versus unhealthy is defined during Service Design, especially Availability Management and SLM.

Operational Health is dependent on the ability to prevent incidents and problems by investing in reliable and maintainable infrastructure. This is achieved through good availability design and proactive Problem Management. At the same time, Operational Health is also dependent on the ability to identify faults and localize them effectively so that they have minimal impact on the service. This requires strong (preferably automated) Incident and Problem Management.

The idea of Operational Health has also led to a specialized area called ‘Self Healing Systems’. This is an application of Availability, Capacity, Knowledge, Incident and Problem Management and refers to a system that has been designed to withstand the most severe operating conditions and to detect, diagnose and recover from most incidents and Known Errors.

[…] common workarounds. These are used as soon as an error is detected, to determine the appropriate response.

■ The ability to generate a call for human intervention by raising an alert or generating an incident.

While the concept of Operational Health is not a core concept of Service Operation, it is often a helpful metaphor to assist in determining what needs to be monitored and how frequently to perform preventive maintenance.

What and when to monitor for operational health should be determined in Service Design, tested and refined during Service Transition and optimized in Continual Service Improvement, as necessary.
Self Healing Systems are known by different names, for example Autonomic 3.6 COMMUNICATION Systems, Adaptive Systems and Dynamic Systems. Good communication is needed with other IT teams and Characteristics of Self Healing Systems include: departments, with users and internal customers, and ■ Resilience is designed and built into the system, for between the Service Operation teams and departments example multiple redundant disks or multiple themselves. Issues can often be prevented or mitigated processors. This protects the system against hardware with appropriate communication. failure since it is able to continue operating using the This section is aimed at summarizing the communication duplicated hardware component. that should take place in Service Operation. This is not a ■ Software, data and operating system resilience is also separate process, but a checklist of the type of designed into the system, for example mirrored communication that is required for effective Service databases (where a database is duplicated on a Operation. backup device) and disk-striping technology (where individual bits of data are distributed across a disk An important principle is that all communication must array – so that a disk failure results in the loss of only have an intended purpose or a resultant action. a part of data, which can be easily recovered using Information should not be communicated unless there is a algorithms). clear audience. In addition, that audience should have ■ The ability to shift processing from one physical been actively involved in determining the need for that device to another without any disruption to the communication and what they will do with the service. This could be a response to a failure or information. 
because the device is reaching high utilization levels A detailed description of the types of communication (some systems are designed to distribute processing typical in Service Operation is contained in Appendix B of workloads continuously, to make optimum use of this publication, together with a description of the typical available capacity, which is also known as audience and the actions that are intended to be taken as virtualization). a result of each communication. These include: ■ Built-in monitoring utilities which enable the system to ■ Routine operational communication detect events and to determine whether these ■ Communication between shifts represent normal operations or not. ■ Performance reporting ■ A correlation engine (see paragraph 22.214.171.124 on Event ■ Communication in projects Management). This will enable the system to determine the significance of each event and also to ■ Communication related to changes determine whether there is any predefined response ■ Communication related to exceptions to that event. ■ Communication related to emergencies ■ A set of diagnostic tools, such as diagnostic scripts, ■ Training on new or customized processes and fault trees and a database of Known Errors and service designs 30 | Service Operation principles ■ Communication of strategy and design to Service that have more mature Service Management processes Operation teams. and tools will tend to rely on the tools and processes for communication (e.g. using an Incident Management tool Please note that there is no definitive medium for to escalate and track incidents, instead of requesting e- communication, nor is there a fixed location or frequency. mail or telephone calls for updates). In some organizations communication has to take place in meetings. Other organizations prefer to use e-mail or the Other organizations prefer to communicate using communication inherent in their Service Management meetings. 
However, it is important not to get into the tools. mode whereby the only time work is done, or management is involved, is during a meeting. Also, face- There should therefore be a policy around communication to-face meetings tend to increase costs (e.g. travel, time within each team or department and for each process. spent in informal discussions, refreshments, etc.), so Although this should be formal, the policy should not be meeting organizers should balance the value of the cumbersome or complex. For example, a manager might meeting with the number and identity of the attendees require that all communications regarding changes must and the time they will spend in, and getting to, the be sent by e-mail. As long as this is specified in the meeting. department’s SOPs (in whatever form they exist), there is no need to create a separate policy for it. The purpose of meetings is to communicate effectively to a group of people about a common set of objectives or Although the typical content of communication is fairly activities. Meetings should be well controlled and brief, consistent once processes have been defined, the means and the focus should be on facilitating action. A good rule of communication are changing with every new is not to hold a meeting if the information can be introduction of technology. The list of alternatives is communicated effectively by automated means. growing and, today, includes: A number of factors are essential for successful meetings. ■ E-mail, to traditional clients or mobile devices Although these may seem to be common sense, they are ■ SMS messages sometimes neglected: ■ Pagers ■ Instant messaging and web-based ‘chats’ ■ Establish and communicate a clear agenda to ensure that the meeting achieves its objective and to help the ■ Voice over Internet Protocol (VoIP) utilities that can facilitator prevent attendees from ‘hijacking’ the turn any connected device to an inexpensive meeting. 
communication medium ■ Ensure that the rules for participating are understood. ■ Teleconference and virtual meeting utilities, which Organizations tend to have a formal set of meeting have revolutionized meetings, which are now held rules, ranging from relatively informal to very formal across long distances (e.g. Roberts Rules of Order). ■ Document-sharing utilities. ■ Make use of ‘parking lots’ or notes that record issues The means of communication itself is outside the scope of that are not directly relevant to the purpose of the this publication. However, the following points should be meeting, but which can be called on if the need for noted: discussion arises. ■ Communication is primary and the means of ■ Minutes of the meeting: rules should be set about communication must ensure that they serve this goal. when minutes are taken. Minutes are used to remind For example, the need for secure communication may people who are assigned actions and to track the eliminate the possibility of some of the above means. progress of delegated actions. They are also useful in The need for quality may eliminate some VoIP options. ensuring that cross-functional decisions and actions ■ It is possible to use any means of communication as are tracked and followed through. long as all stakeholders understand how and when ■ Use techniques to encourage the appropriate level of the communication will take place. participation. One technique when discussing improvements, for example, is the ‘keep, stop, start’ 3.6.1 Meetings technique. Participants are encouraged to list items that they would like to keep, things that need to be Different organizations communicate in different ways. stopped and initiatives or actions that they would like Where organizations are distributed, they will tend to rely to see started. on e-mail and teleconferencing facilities. 
Examples of typical meetings are given below:

3.6.1.1 The Operations meeting

Operations meetings are normally held between the managers of the IT operational departments, teams or groups, at the beginning of each business day or week. The purpose of this type of meeting is to make staff aware of any issue relevant to Operations (such as change schedules, business events, maintenance schedules, etc.) and to provide an opportunity for staff to raise any issues of which they are aware. This is an opportunity to ensure that all departments in a data centre are synchronized.

In geographically dispersed organizations it may not be possible to have a single daily Operations meeting. In these cases it is important to coordinate the agenda of the meetings and to ensure that each meeting has two components:

1 The first part of the meeting will cover aspects that apply to the organization as a whole, e.g. new policies, changes that affect all regions and business events that span all regions.
2 The second part of the meeting will cover aspects that apply only to the local region, e.g. local operations schedules, changes to local equipment, etc.

The Operations meeting is usually chaired by the IT Operations Manager or a senior Operations Manager and attended by all managers and supervisors (except those whose shifts are not on duty). It is also helpful to have at least one representative from the Service Desk at the meeting so that they are aware of any situations that could give rise to incidents.

Opportunities to improve services or processes should be captured, if raised, and forwarded to the team responsible for Continual Service Improvement.

3.6.1.2 Department, group or team meetings

These meetings are essentially the same as the Operations meeting, but are aimed at a single IT department, group or team. Each manager or supervisor relays the information from the Operations meeting that is relevant to their team.

Additionally, these meetings will also cover the following:

■ A more detailed discussion of incidents, problems and changes that are still being worked on, with information about:
● Progress to date
● Confirmation of what still needs to be done
● Estimated completion times
● Request for additional resources, if required
● Discussion of potential problems or concerns
■ Confirmation of staff availability for roster duties
■ Confirmation of vacation schedules.

3.6.1.3 Customer meetings

From time to time it will be necessary to hold meetings with customers, apart from the regular Service Level Review meetings. Examples include:

■ Follow-up after serious incidents. The purpose of these meetings is to repair the relationship with the customers, but also to ensure that IT has all the information required to prevent recurrence. Customers also have the opportunity to provide information about unforeseen business impacts. These meetings are helpful in agreeing actions for similar types of incident that may occur in future.
■ A customer forum, which can be used for a range of purposes, including testing ideas for new services or solutions, or gathering requirements for new or revised services or procedures. A customer forum is generally a regular meeting with customers to discuss areas of common concern.

3.7 DOCUMENTATION

IT Operations Management and all of the Technical and Application Management teams and departments are involved in creating and maintaining a range of documents. These are detailed in Chapters 4, 5 and 6 of this publication and include the following:

■ Participation in the definition and maintenance of process manuals for all processes they are involved in. These will include processes in other phases of the IT Service Management Lifecycle (e.g. Capacity Management, Change Management, Availability Management) as well as for all processes included in the Service Operation phase.
■ Establishing their own technical procedures manuals. These must be kept up to date and new material must be added as it becomes relevant, under Change Control. It should be remembered that their procedures should always be structured to meet the objectives and constraints defined within higher-level Service Management processes, such as SLM. For example, a technical procedure for managing servers should always ensure that it aims at achieving the availability and performance levels agreed to in the Operational Level Agreements and Service Level Agreements (SLAs).
■ Participation in the creation and maintenance of planning documents, e.g. the Capacity and Availability Plans and the IT Service Continuity Plans.
■ Participation in the creation and maintenance of the Service Portfolio. This will include quantifying costs and establishing the operational feasibility of each proposed service.
■ Participation in the definition and maintenance of Service Management tool work instructions in order to meet reporting requirements.

4 Service Operation processes

The processes listed in paragraph 2.4.5 are discussed in detail in this chapter. As a reference, the overall structure is briefly described here and then each of the processes is described in more detail later in the chapter. Please note that the roles for each process and the tools used for each process are described in Chapters 6 and 7 respectively.

■ Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions.
■ Incident Management concentrates on restoring the service to users as quickly as possible, in order to minimize business impact.
■ Problem Management involves root-cause analysis to determine and resolve the cause of events and incidents, proactive activities to detect and prevent future problems/incidents and a Known Error sub-process to allow quicker diagnosis and resolution if further incidents do occur.

NOTE: Without this distinction between incidents and problems, and keeping separate Incident and Problem Records, there is a danger that either:
● Incidents will be closed too early in the overall support cycle and there will be no actions taken to prevent recurrence – so the same incidents will have to be fixed over and over again, or
● Incidents will be kept open so that root cause analysis can be done and visibility will be lost of when the user’s service was actually restored – so SLA targets may not be met even though the service has been restored within users’ expectations. This often results in a large number of open incidents, many of which will never be closed unless a periodic ‘purge’ is undertaken. This can be very demotivating and can prevent effective visibility of current issues.

■ Request Fulfilment involves the management of customer or user requests that are not generated as an incident from an unexpected service delay or disruption. Some organizations may choose to handle such requests as a ‘category’ of incidents and manage the information through an Incident Management system – but others may choose (because of high volumes or business priority of such requests) to facilitate the provision of Request Fulfilment capabilities separately via the Request Fulfilment process. It has become popular practice to use a formal Request Fulfilment process to manage customer and user requests for all types of requests which include facilities, moves and supplies as well as those specific to IT services. These requests are not generally tied to the same SLA measures and separating the records and the process flow is emerging as best practice in many organizations.
■ Access Management: this is the process of granting authorized users the right to use a service, while restricting access to non-authorized users. It is based on being able accurately to identify authorized users and then manage their ability to access services as required during different stages of their human resources (HR) or contractual lifecycle. Access Management has also been called Identity or Rights Management in some organizations.

In addition, there are several other processes that will be executed or supported during Service Operation, but which are driven during other phases of the Service Management Lifecycle. The operational aspects of these processes will be discussed in the final part of this chapter and include:

■ Change Management, a major process which should be closely linked to Configuration Management and Release Management. These topics are primarily covered in the Service Transition publication.
■ Capacity and Availability Management, the operational aspects of which are covered in this publication, but which are covered in more detail in the Service Design publication.
■ Financial Management, which is covered in the Service Strategy publication.
■ Knowledge Management, which is covered in the Service Transition publication.
■ IT Service Continuity, which is covered in the Service Design publication.
■ Service Reporting and Measurement, which are covered in the Continual Service Improvement publication.

4.1 EVENT MANAGEMENT

An event can be defined as any detectable or discernible occurrence that has significance for the management of the IT Infrastructure or the delivery of IT service and evaluation of the impact a deviation might cause to the services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.
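As an illustration only, the definition of an event given in 4.1 can be pictured as a small record: something detectable happened, it was reported by an IT service, a CI or a monitoring tool, and it carries a message. The Python sketch below rests on assumptions; the class name `EventNotification`, its fields, the `SourceType` values and the CI name `switch-01` are hypothetical and are not defined by this publication.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class SourceType(Enum):
    """The three kinds of originator named in the text."""
    IT_SERVICE = "it_service"
    CONFIGURATION_ITEM = "configuration_item"
    MONITORING_TOOL = "monitoring_tool"


@dataclass(frozen=True)
class EventNotification:
    """Hypothetical record describing one detectable occurrence."""
    source_type: SourceType
    source_id: str  # e.g. a CI name; the naming scheme is an assumption
    message: str
    # Time of detection, recorded in UTC when the record is created.
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def describe(self) -> str:
        """Return a one-line, human-readable summary of the notification."""
        return f"[{self.source_type.value}] {self.source_id}: {self.message}"


example = EventNotification(SourceType.CONFIGURATION_ITEM, "switch-01", "link up")
print(example.describe())
```

A real management tool would carry far more detail (severity, correlation identifiers, affected service); the point here is only that an event is a structured notification rather than free text.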
Effective Service Operation is dependent on knowing the status of the infrastructure and detecting any deviation from normal or expected operation. This is provided by good monitoring and control systems, which are based on two types of tools:

■ active monitoring tools that poll key CIs to determine their status and availability. Any exceptions will generate an alert that needs to be communicated to the appropriate tool or team for action
■ passive monitoring tools that detect and correlate operational alerts or communications generated by CIs.

4.1.1 Purpose/goal/objective

The ability to detect events, make sense of them and determine the appropriate control action is provided by Event Management. Event Management is therefore the basis for Operational Monitoring and Control (see Appendix B).

In addition, if these events are programmed to communicate operational information as well as warnings and exceptions, they can be used as a basis for automating many routine Operations Management activities, for example executing scripts on remote devices, or submitting jobs for processing, or even dynamically balancing the demand for a service across multiple devices to enhance performance.

Event Management therefore provides the entry point for the execution of many Service Operation processes and activities. In addition, it provides a way of comparing actual performance and behaviour against design standards and SLAs. As such, Event Management also provides a basis for Service Assurance and Reporting; and Service Improvement. This is covered in detail in the Continual Service Improvement publication.

4.1.2 Scope

Event Management can be applied to any aspect of Service Management that needs to be controlled and which can be automated. These include:

■ Configuration Items:
● Some CIs will be included because they need to stay in a constant state (e.g. a switch on a network needs to stay on and Event Management tools confirm this by monitoring responses to ‘pings’).
● Some CIs will be included because their status needs to change frequently and Event Management can be used to automate this and update the CMS (e.g. the updating of a file server).
■ Environmental conditions (e.g. fire and smoke detection)
■ Software licence monitoring for usage to ensure optimum/legal licence utilization and allocation
■ Security (e.g. intrusion detection)
■ Normal activity (e.g. tracking the use of an application or the performance of a server).

The difference between monitoring and Event Management

These two areas are very closely related, but slightly different in nature. Event Management is focused on generating and detecting meaningful notifications about the status of the IT Infrastructure and services. While it is true that monitoring is required to detect and track these notifications, monitoring is broader than Event Management. For example, monitoring tools will check the status of a device to ensure that it is operating within acceptable limits, even if that device is not generating events.

Put more simply, Event Management works with occurrences that are specifically generated to be monitored. Monitoring tracks these occurrences, but it will also actively seek out conditions that do not generate events.

4.1.3 Value to business

Event Management’s value to the business is generally indirect; however, it is possible to determine the basis for its value as follows:

■ Event Management provides mechanisms for early detection of incidents. In many cases it is possible for the incident to be detected and assigned to the appropriate group for action before any actual service outage occurs.
■ Event Management makes it possible for some types of automated activity to be monitored by exception – thus removing the need for expensive and resource intensive real-time monitoring, while reducing downtime.
■ When integrated into other Service Management processes (such as, for example, Availability or Capacity Management), Event Management can signal status changes or exceptions that allow the appropriate person or team to perform early response, thus improving the performance of the process. This, in turn, will allow the business to benefit from more effective and more efficient Service Management overall.
■ Event Management provides a basis for automated operations, thus increasing efficiencies and allowing expensive human resources to be used for more innovative work, such as designing new or improved functionality or defining new ways in which the business can exploit technology for increased competitive advantage.

4.1.4 Policies/principles/basic concepts

There are many different types of events, for example:

■ Events that signify regular operation:
● notification that a scheduled workload has completed
● a user has logged in to use an application
● an e-mail has reached its intended recipient.
■ Events that signify an exception:
● a user attempts to log on to an application with the incorrect password
● an unusual situation has occurred in a business process that may indicate an exception requiring further business investigation (e.g. a web page alert indicates that a payment authorization site is unavailable – impacting financial approval of business transactions)
● a device’s CPU is above the acceptable utilization rate
● a PC scan reveals the installation of unauthorized software.
■ Events that signify unusual, but not exceptional, operation. These are an indication that the situation may require closer monitoring. In some cases the condition will resolve itself, for example in the case of an unusual combination of workloads – as they are completed, normal operation is restored. In other cases, operator intervention may be required if the situation is repeated or if it continues for too long. These rules or policies are defined in the Monitoring and Control Objectives for that device or service. Examples of this type of event are:
● A server’s memory utilization reaches within 5% of its highest acceptable performance level
● The completion time of a transaction is 10% longer than normal.

Two things are significant about the above examples:

■ Exactly what constitutes normal versus unusual operation, versus an exception? There is no definitive rule about this. For example, a manufacturer may provide that a benchmark of 75% memory utilization is optimal for application X. However, it is discovered that, under the specific conditions of our organization, response times begin to degrade above 70% utilization. The next section will explore how these figures are determined.
■ Each relies on the sending and receipt of a message of some type. These are generally referred to as Event notifications and they don’t just happen. The next paragraphs will explore exactly how events are defined, generated and captured.

4.1.5 Process activities, methods and techniques

[Figure 4.1 The Event Management process. Flowchart boxes: Event; Event Notification Generated; Event Detected; Event Filtered; Significance? (Informational, Warning, Exception); Event Logged; Event Correlation; Trigger; Auto Response; Alert; Human Intervention; Incident/Problem/Change?; Incident Management; Problem Management; Change Management; Review Actions; Effective? (No/Yes); Close Event; End]
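The activities named in Figure 4.1 can be sketched in code as a rough aid to reading the section. The Python below is a simplified sketch under stated assumptions, not the process itself: the function `route_event` and the step names are hypothetical, and the routing (informational events are only logged, warnings and exceptions go on to correlation and a triggered response) is an assumption based on the box labels, not on the exact arrows of the figure.

```python
from enum import Enum
from typing import List


class Significance(Enum):
    """The three broad categories of significance used in Event Management."""
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def route_event(significance: Significance) -> List[str]:
    """List the activities an event passes through (simplified).

    The step names are illustrative labels chosen for this sketch.
    """
    # Every notified event is first generated, detected and filtered.
    path = ["notification generated", "detected", "filtered"]
    if significance is Significance.INFORMATIONAL:
        # Assumption: informational events need no action beyond logging.
        path.append("logged")
    else:
        # Assumption: warnings and exceptions are correlated, a response is
        # triggered (auto response, alert, or incident/problem/change) and
        # the actions taken are reviewed before the event is closed.
        path += ["correlated", "response triggered", "actions reviewed"]
    path.append("closed")
    return path


for level in Significance:
    print(level.value, "->", " > ".join(route_event(level)))
```

The sketch leaves out the ‘Effective?’ review loop and the choice between Incident, Problem and Change Management, both of which depend on local policy.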
Figure 4.1 is a high-level and generic representation of Event Management. It should be used as a reference and definition point, rather than an actual Event Management flowchart. Each activity in this process is described below.

4.1.5.1 Event occurs

Events occur continuously, but not all of them are detected or registered. It is therefore important that everybody involved in designing, developing, managing and supporting IT services and the IT Infrastructure that they run on understands what types of event need to be detected.

This is discussed in the paragraph titled ‘Instrumentation’.

4.1.5.2 Event notification

Most CIs are designed to communicate certain information about themselves in one of two ways:

■ A device is interrogated by a management tool, which collects certain targeted data. This is often referred to as polling.
■ The CI generates a notification when certain conditions are met. The ability to produce these notifications has to be designed and built into the CI, for example a programming hook inserted into an application.

Event notifications can be proprietary, in which case only the manufacturer’s management tools can be used to detect events. Most CIs, however, generate Event notifications using an open standard such as SNMP (Simple Network Management Protocol).

Many CIs are configured to generate a standard set of events, based on the designer’s experience of what is required to operate the CI, with the ability to generate additional types of event by ‘turning on’ the relevant event generation mechanism. For other CI types, some form of ‘agent’ software will have to be installed in order to initiate the monitoring. Often this monitoring feature is free, but sometimes there is a cost to the licensing of the tool.

In an ideal world, the Service Design process should define which events need to be generated and then specify how this can be done for each type of CI. During Service Transition, the event generation options would be set and tested.

In many organizations, however, defining which events to generate is done by trial and error. System managers use the standard set of events as a starting point and then tune the CI over time, to include or exclude events as required. The problem with this approach is that it only takes into account the immediate needs of the staff managing the device and does not facilitate good planning or improvement. In addition, it makes it very difficult to monitor and manage the service over all devices and staff. One approach to combating this problem is to review the set of events as part of continual improvement activities.

A general principle of Event notification is that the more meaningful the data it contains and the more targeted the audience, the easier it is to make decisions about the event. Operators are often confronted by coded error messages and have no idea how to respond to them or what to do with them. Meaningful notification data and clearly defined roles and responsibilities need to be articulated and documented during Service Design and Service Transition (see also the paragraph on ‘Instrumentation’). If roles and responsibilities are not clearly defined, in a wide alert, no one knows who is doing what and this can lead to things being missed or duplicated efforts.

4.1.5.3 Event detection

Once an Event notification has been generated, it will be detected by an agent running on the same system, or transmitted directly to a management tool specifically designed to read and interpret the meaning of the event.

4.1.5.4 Event filtering

The purpose of filtering is to decide whether to communicate the event to a management tool or to ignore it. If ignored, the event will usually be recorded in a log file on the device, but no further action will be taken.

The reason for filtering is that it is not always possible to turn Event notification off, even though a decision has been made that it is not necessary to generate that type of event. It may also be decided that only the first in a series of repeated Event notifications will be transmitted.

During the filtering step, the first level of correlation is performed, i.e. the determination of whether the event is informational, a warning, or an exception (see next step). This correlation is usually done by an agent that resides on the CI or on a server to which the CI is connected.

The filtering step is not always necessary. For some CIs, every event is significant and moves directly into a management tool’s correlation engine, even if it is duplicated. Also, it may have been possible to turn off all unwanted Event notifications.

4.1.5.5 Significance of events

Every organization will have its own categorization of the significance of an event, but it is suggested that at least these three broad categories be represented:

■ Informational: This refers to an event that does not require any action and does not represent an exception. They are typically stored in the system or service log files and kept for a predetermined period. Informational events are typically used to check on the status of a device or service, or to confirm the successful completion of an activity. Informational events can also be used to generate statistics (such as the number of users logged on to an application during a certain period) and as input into investigations (such as which jobs completed successfully before the transaction processing queue hung). Examples of informational events include:
● A user logs onto an application
● A job in the batch queue completes successfully
● A device has come online
● A transaction is completed successfully.
■ Warning: A warning is an event that is generated when a service or device is approaching a threshold. Warnings are intended to notify the appropriate person, process or tool so that the situation can be checked and the appropriate action taken to prevent an exception. Warnings are not typically raised for a device failure, although there is some debate about whether the failure of a redundant device is a warning or an exception (since the service is still available). A good rule is that every failure should be treated as an exception, since the risk of an incident impacting the

an exception could be generated when an unauthorized device is discovered on the network. This can be managed by using either an Incident Record or a Request for Change (or even both), depending on the organization’s Incident and Change Management policies. Examples of exceptions include:
● A server is down
● Response time of a standard transaction across the network has slowed to more than 15 seconds
● More than 150 users have logged on to the General Ledger application concurrently
● A segment of the network is not responding to routine requests.

4.1.5.6 Event correlation

If an event is significant, a decision has to be made about exactly what the significance is and what actions need to be taken to deal with it. It is here that the meaning of the event is determined.

Correlation is normally done by a ‘Correlation Engine’, usually part of a management tool that compares the event with a set of criteria and rules in a prescribed order. These criteria are often called Business Rules, although they are generally fairly technical. The idea is that the event may represent some impact on the business and the rules can be used to determine the level and type of business impact.

A Correlation Engine is programmed according to the performance standards created during Service Design and any additional guidance specific to the operating environment.

Examples of what Correlation Engines will take into account include:

■ Number of similar events (e.g.
this is the third time business is much greater. Examples of warnings are: that the same user has logged in with the incorrect ● Memory utilization on a server is currently at 65% password, a business application reports that there has and increasing. If it reaches 75%, response times been an unusual pattern of usage of a mobile will be unacceptably long and the OLA for that telephone that could indicate that the device has department will be breached. been lost or stolen) ● The collision rate on a network has increased by ■ Number of CIs generating similar events 15% over the past hour. ■ Whether a specific action is associated with the code ■ Exception: An exception means that a service or or data in the event device is currently operating abnormally (however that ■ Whether the event represents an exception has been defined). Typically, this means that an OLA ■ A comparison of utilization information in the event and SLA have been breached and the business is with a maximum or minimum standard (e.g. has the being impacted. Exceptions could represent a total device exceeded a threshold?) failure, impaired functionality or degraded ■ Whether additional data is required to investigate the performance. Please note, though, that an exception event further, and possibly even a collection of that does not always represent an incident. For example, data by polling another system or database Service Operation processes | 41 ■ Categorization of the event standing order for the appropriate Operations ■ Assigning a priority level to the event. Management staff to check the logs on a regular basis and clear instructions about how to use each log. It 22.214.171.124 Trigger should also be remembered that the event If the correlation activity recognizes an event, a response information in the logs may not be meaningful until will be required. The mechanism used to initiate that an incident occurs; and where the Technical response is called a trigger. 
Management staff use the logs to investigate where the incident originated. This means that the Event There are many different types of triggers, each designed Management procedures for each system or team specifically for the task it has to initiate. Some examples need to define standards about how long events are include: kept in the logs before being archived and deleted. ■ Incident Triggers that generate a record in the Incident ■ Auto response: Some events are understood well Management system, thus initiating the Incident enough that the appropriate response has already Management process been defined and automated. This is normally as a ■ Change Triggers that generate a Request for Change result of good design or of previous experience (RFC), thus initiating the Change Management process (usually Problem Management). The trigger will initiate ■ A trigger resulting from a approved RFC that has been the action and then evaluate whether it was implemented but caused the event, or from an completed successfully. If not, an Incident or unauthorised change that has been detected – in Problem Record will be created. Examples of auto either case this will be referred to Change responses include: Management for investigation ● Rebooting a device ■ Scripts that execute specific actions, such as ● Restarting a service submitting batch jobs or rebooting a device ● Submitting a job into batch ■ Paging systems that will notify a person or team of ● Changing a parameter on a device the event by mobile phone ● Locking a device or application to protect it ■ Database triggers that restrict access of a user to against unauthorized access. specific records or fields, or that create or delete Note: locking a device may result in denial of service entries in the database. 
to authorized users, which could be exploited by a deliberate attacker – so great care should be taken 126.96.36.199 Response selection when deciding whether this is an appropriate At this point in the process, there are a number of automated action. Where this response is used it may response options available. It is important to note that the be prudent to also combine this with a call for human response options can be chosen in any combination. For intervention, so that the automated action can be example, it may be necessary to preserve the log entry for swiftly checked and approved. future reference, but at the same time escalate the event ■ Alert and human intervention: If the event requires to an Operations Management staff member for action. human intervention, it will need to be escalated. The purpose of the alert is to ensure that the person with The options in the flowchart are examples. Different the skills appropriate to deal with the event is notified. organizations will have different options, and they are sure The alert will contain all the information necessary for to be more detailed. For example, there will be a range of that person to determine the appropriate action – auto responses for each different technology. The process including reference to any documentation required of determining which one is appropriate and how to (e.g. user manuals). It is important to note that this is execute it are not represented in this flowchart. Some of not necessarily the same as the functional escalation the options available are: of an incident, where the emphasis is on restoring ■ Event logged: Regardless of what activity is service within an agreed time (which may require a performed, it is a good idea to have a record of the variety of activities). The alert requires a person, or event and any subsequent actions. 
The event can be team, to perform a specific action, possibly on a logged as an Event Record in the Event Management specific device and possibly at a specific time, e.g. tool, or it can simply be left as an entry in the system changing a toner cartridge in a printer when the level log of the device or application that generated the is low. event. If this is the case, though, there needs to be a 42 | Service Operation processes ■ Incident, problem or change? Some events will ■ Open or link to a Problem Record: It is rare for a represent a situation where the appropriate response Problem Record to be opened without related will need to be handled through the Incident, Problem incidents (for example as a result of a Service Failure or Change Management process. These are discussed Analysis (see Service Design publication) or maturity below, but it is important to note that a single assessment, or because of a high number of retry incident may initiate any one or a combination of network errors, even though a failure has not yet these three processes – for example, a non-critical occurred). In most cases this step refers to linking an server failure is logged as an incident, but as there is incident to an existing Problem Record. This will assist no workaround, a Problem Record is created to the Problem Management teams to reassess the determine the root cause and resolution and an RFC is severity and impact of the problem, and may result in logged to relocate the workload onto an alternative a changed priority to an outstanding problem. server while the problem is resolved. 
However, it is possible, with some of the more ■ Open an RFC: There are two places in the Event sophisticated tools, to evaluate the impact of the Management process where an RFC can be created: incidents and also to raise a Problem Record ● When an exception occurs: For example, a scan automatically, where this is warranted, to allow root- of a network segment reveals that two new cause analysis to commence immediately. devices have been added without the necessary ■ Special types of incident: In some cases an event authorization. A way of dealing with this situation will indicate an exception that does not directly is to open an RFC, which can be used as a vehicle impact any IT service, for example, a redundant air for the Change Management process to deal with conditioning unit fails, or unauthorized entry to a data the exception (as an alternative to the more centre. Guidelines for these events are as follows: conventional approach of opening an incident that ● An incident should be logged using an Incident would be routed via the Service Desk to Change Model that is appropriate for that type of Management). Investigation by Change exception, e.g. an Operations Incident or Security Management is appropriate here since Incident (see paragraph 188.8.131.52 for more details of unauthorized changes imply that the Change Incident Models). Management process was not effective. ● The incident should be escalated to the group that ● Correlation identifies that a change is needed: manages that type of incident. In this case the event correlation activity ● As there is no outage, the Incident Model used determines that the appropriate response to an should reflect that this was an operational issue event is for something to be changed. For rather than a service issue. 
The statistics would not example, a performance threshold has been normally be reported to customers or users, unless reached and a parameter on a major server needs they can be used to demonstrate that the money to be tuned. How does the correlation activity invested in redundancy was a good investment. determine this? It was programmed to do so either ● These incidents should not be used to calculate in the Service Design process or because this has downtime, and can in fact be used to demonstrate happened before and Problem Management or how proactive IT has been in making services Operations Management updated the Correlation available. Engine to take this action. ■ Open an Incident Record: As with an RFC, an 184.108.40.206 Review actions incident can be generated immediately when an With thousands of events being generated every day, it is exception is detected, or when the Correlation Engine not possible formally to review every individual event. determines that a specific type or combination of However, it is important to check that any significant events represents an incident. When an Incident events or exceptions have been handled appropriately, or Record is opened, as much information as possible to track trends or counts of event types, etc. In many cases should be included – with links to the events this can be done automatically, for example polling a concerned and if possible a completed diagnostic server that had been rebooted using an automated script script. to see that it is functioning correctly. In the cases where events have initiated an incident, problem and/or change, the Action Review should not duplicate any reviews that have been done as part of Service Operation processes | 43 those processes. 
Rather, the intention is to ensure that the ■ Access of an application or database by a user or handover between the Event Management process and automated procedure or job other processes took place as designed and that the ■ A situation where a device, database or application, expected action did indeed take place. This will ensure etc. has reached a predefined threshold of that incidents, problems or changes originating within performance. Operations Management do not get lost between the Event Management can interface to any process that teams or departments. requires monitoring and control, especially those that do The Review will also be used as input into continual not require real-time monitoring, but which do require improvement and the evaluation and audit of the Event some form of intervention following an event or group of Management process. events. Examples of interfaces with other processes include: 220.127.116.11 Close event ■ Interface with business applications and/or business Some events will remain open until a certain action takes processes to allow potentially significant business place, for example an event that is linked to an open events to be detected and acted upon (e.g. a business incident. However, most events are not ‘opened’ application reports abnormal activity on a customer’s or ‘closed’. account that may indicate some sort of fraud or Informational events are simply logged and then used as security breach). input to other processes, such as Backup and Storage ■ The primary ITSM relationships are with Incident, Management. Auto-response events will typically be closed Problem and Change Management. These interfaces by the generation of a second event. For example, a are described in some detail in paragraph 18.104.22.168. 
device generates an event and is rebooted through auto ■ Capacity and Availability Management are critical in response – as soon as that device is successfully back defining what events are significant, what appropriate online, it generates an event that effectively closes the thresholds should be and how to respond to them. In loop and clears the first event. return, Event Management will improve the It is sometimes very difficult to relate the open event and performance and availability of services by responding the close notifications as they are in different formats. It is to events when they occur and by reporting on actual optimal that devices in the infrastructure produce ‘open’ events and patterns of events to determine (by and ‘close’ events in the same format and specify the comparison with SLA targets and KPIs) if there is some change of status. This allows the correlation step in the aspect of the infrastructure design or operation that process to easily match open and close notifications. can be improved. ■ Configuration Management is able to use events to In the case of events that generated an incident, problem determine the current status of any CI in the or change, these should be formally closed with a link to infrastructure. Comparing events with the authorized the appropriate record from the other process. baselines in the Configuration Management System (CMS) will help to determine whether there is 4.1.6 Triggers, input and output/inter- unauthorized Change activity taking place in the process interfaces organization (see Service Transition publication). Event Management can be initiated by any type of ■ Asset Management (covered in more detail in the occurrence. The key is to define which of these Service Design and Transition publications) can use occurrences is significant and which need to be acted Event Management to determine the lifecycle status of upon. Triggers include: assets. 
For example, an event could be generated to signal that a new asset has been successfully ■ Exceptions to any level of CI performance defined in configured and is now operational. the design specifications, OLAs or SOPs ■ Events can be a rich source of information that can be ■ Exceptions to an automated procedure or process, e.g. processed for inclusion in Knowledge Management a routine change that has been assigned to a build systems. For example, patterns of performance can be team has not been completed in time correlated with business activity and used as input ■ An exception within a business process that is being into future design and strategy decisions. monitored by Event Management ■ The completion of an automated task or job ■ A status change in a device or database record 44 | Service Operation processes ■ Event Management can play an important role in ■ Number and percentage of events caused by existing ensuring that potential impact on SLAs is detected problems or Known Errors. This may result in a change early and any failures are rectified as soon as possible to the priority of work on that problem or Known so that impact on service targets is minimized. Error ■ Number and percentage of repeated or duplicated 4.1.7 Information Management events. This will help in the tuning of the Correlation Key information involved in Event Management includes Engine to eliminate unnecessary event generation and the following: can also be used to assist in the design of better event generation functionality in new services ■ SNMP messages, which are a standard way of ■ Number and percentage of events indicating communicating technical information about the status performance issues (for example, growth in the of components of an IT Infrastructure. number of times an application exceeded its ■ Management Information Bases (MIBs) of IT devices. 
transaction thresholds over the past six months) An MIB is the database on each device that contains ■ Number and percentage of events indicating potential information about that device, including its operating availability issues (e.g. failovers to alternative devices, system, BIOS version, configuration of system or excessive workload swapping) parameters, etc. The ability to interrogate MIBs and ■ Number and percentage of each type of event per compare them to a norm is critical to being able to generate events. platform or application ■ Number and ratio of events compared with the ■ Vendor’s monitoring tools agent software. number of incidents. ■ Correlation Engines contain detailed rules to determine the significance and appropriate response to events. Details on this are provided in paragraph 4.1.9 Challenges, Critical Success Factors 22.214.171.124. and risks ■ There is no standard Event Record for all types of 126.96.36.199 Challenges event. The exact contents and format of the record depend on the tools being used, what is being There are a number of challenges that might be monitored (e.g. a server and the Change Management encountered: tools will have very different data and probably use a ■ An initial challenge may be to obtain funding for the different format). However, there is some key data that necessary tools and effort needed to install and exploit is usually required from each event to be useful in the benefits of the tools. analysis. It should typically include the: ■ One of the greatest challenges is setting the correct ● Device level of filtering. Setting the level of filtering ● Component incorrectly can result in either being flooded with ● Type of failure relatively insignificant events, or not being able to ● Date/time detect relatively important events until it is too late. ● Parameters in exception ■ Rolling out of the necessary monitoring agents across ● Value. 
the entire IT infrastructure may be a difficult and time- consuming activity requiring an ongoing commitment over quite a long period of time – there is a danger 4.1.8 Metrics that other activities may arise that could divert For each measurement period in question, the metrics to resources and delay the rollout. check on the effectiveness and efficiency of the Event ■ Acquiring the necessary skills can be time consuming Management process should include the following: and costly. ■ Number of events by category ■ Number of events by significance 188.8.131.52 Critical Success Factors ■ Number and percentage of events that required In order to obtain the necessary funding a compelling human intervention and whether this was performed Business Case should be prepared showing how the ■ Number and percentage of events that resulted in benefits of effective Event Management can far outweigh incidents or changes the costs – giving a positive return on investment. Service Operation processes | 45 One of the most important CSFs is achieving the correct that will feed through the Continual Improvement process level of filtering. This is complicated by the fact that the back into Service Strategy, Service Design etc. significance of events changes. For example, a user Service Operation functions will be expected to participate logging into a system today is normal, but if that user in the design of the service and how it is measured (see leaves the organization and tries to log in it is a security section 3.4). breach. For Event Management, the specific design areas include There are three keys to the correct level of filtering, the following. as follows: ■ Integrate Event Management into all Service 184.108.40.206 Instrumentation Management processes where feasible. This will ensure Instrumentation is the definition of what can be monitored that only the events significant to these processes about CIs and the way in which their behaviour can be are reported. affected. 
In other words, instrumentation is about defining ■ Design new services with Event Management in mind and designing exactly how to monitor and control the IT (this is discussed in detail in paragraph 4.1.10). Infrastructure and IT services. ■ Trial and error. No matter how thoroughly Event Instrumentation is partly about a set of decisions that Management is prepared, there will be classes of need to be made and partly about designing mechanisms events that are not properly filtered. Event to execute these decisions. Management must therefore include a formal process to evaluate the effectiveness of filtering. Decisions that need to be made include: Proper planning is needed for the rollout of the ■ What needs to be monitored? monitoring agent software across the entire IT ■ What type of monitoring is required (e.g. active or Infrastructure. This should be regarded as a project with passive; performance or output)? realistic timescales and adequate resources being allocated ■ When do we need to generate an event? and protected throughout the duration of the project. ■ What type of information needs to be communicated in the event? 220.127.116.11 Risks ■ Who are the messages intended for? The key risks are really those already mentioned above: Mechanisms that need to be designed include: failure to obtain adequate funding; ensuring the correct level of filtering; and failure to maintain momentum in ■ How will events be generated? rolling out the necessary monitoring agents across the IT ■ Does the CI already have event generation Infrastructure. If any of these risks is not addressed it could mechanisms as a standard feature and, if so, which of adversely impact on the success of Event Management. these will be used? Are they sufficient or does the CI need to be customized to include additional 4.1.10 Designing for Event Management mechanisms or information? Effective Event Management is not designed once a ■ What data will be used to populate the Event Record? 
service has been deployed into Operations. Since Event ■ Are events generated automatically or does the CI Management is the basis for monitoring the performance have to be polled? and availability of a service, the exact targets and ■ Where will events be logged and stored? mechanisms for monitoring should be specified and ■ How will supplementary data be gathered? agreed during the Availability and Capacity Management processes (see Service Design publication). Note: A strong interface exists here with the application’s design. All applications should be coded in such a way However, this does not mean that Event Management is that meaningful and detailed error messages/codes are designed by a group of remote system developers and generated at the exact point of failure – so that these can then released to Operations Management together with be included in the event and allow swift diagnosis and the system that has to be managed. Nor does it mean resolution of the underlying cause. The need for the that, once designed and agreed, Event Management inclusion and testing of such error messaging is covered in becomes static – day-to-day operations will define more detail in the Service Transition publication. additional events, priorities, alerts and other improvements 46 | Service Operation processes 18.104.22.168 Error messaging 22.214.171.124 Identification of thresholds Error messaging is important for all components Thresholds themselves are not set and managed through (hardware, software, networks, etc.). It is particularly Event Management. However, unless these are properly important that all software applications are designed to designed and communicated during the instrumentation support Event Management. This might include the process, it will be difficult to determine which level of provision of meaningful error messages and/or codes that performance is appropriate for each CI. 
clearly indicate the specific point of failure and the most Also, most thresholds are not constant. They typically likely cause. In such cases the testing of new applications consist of a number of related variables. For example, the should include testing of accurate event generation. maximum number of concurrent users before response Newer technologies such as Java Management Extensions time slows will vary depending on what other jobs are (JMX) or HawkNL™ provide the tools for building active on the server. This knowledge is often only gained distributed, web-based, modular and dynamic solutions for by experience, which means that Correlation Engines have managing and monitoring devices, applications and to be continually tuned and updated through the process service-driven networks. These can be used to reduce or of Continual Service Improvement. eliminate the need for programmers to include error messaging within the code – allowing a valuable level of 4.2 INCIDENT MANAGEMENT normalization and code-independence. In ITIL terminology, an ‘incident’ is defined as: 126.96.36.199 Event Detection and Alert Mechanisms An unplanned interruption to an IT service or Good Event Management design will also include the reduction in the quality of an IT service. Failure of a design and population of the tools used to filter, correlate configuration item that has not yet impacted service and escalate Events. is also an incident, for example failure of one disk The Correlation Engine specifically will need to be from a mirror set. populated with the rules and criteria that will determine Incident Management is the process for dealing with the significance and subsequent action for each type all incidents; this can include failures, questions or of event. 
queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or Thorough design of the event detection and alert automatically detected and reported by event mechanisms requires the following: monitoring tools. ■ Business knowledge in relationship to any business processes being managed via Event Management ■ Detailed knowledge of the Service Level Requirements 4.2.1 Purpose/goal/objective of the service being supported by each CI The primary goal of the Incident Management process is ■ Knowledge of who is going to be supporting the CI to restore normal service operation as quickly as possible ■ Knowledge of what constitutes normal and abnormal and minimize the adverse impact on business operations, operation of the CI thus ensuring that the best possible levels of service quality and availability are maintained. ‘Normal service ■ Knowledge of the significance of multiple similar operation’ is defined here as service operation within events (on the same CI or various similar CIs SLA limits. ■ An understanding of what they need to know to support the CI effectively 4.2.2 Scope ■ Information that can help in the diagnosis of problems Incident Management includes any event which disrupts, with the CI or which could disrupt, a service. This includes events ■ Familiarity with incident prioritization and which are communicated directly by users, either through categorization codes so that if it is necessary to create the Service Desk or through an interface from Event an Incident Record, these codes can be provided Management to Incident Management tools. ■ Knowledge of other CIs that may be dependent on the affected CI, or those CIs on which it depends Incidents can also be reported and/or logged by technical ■ Availability of Known Error information from vendors staff (if, for example, they notice something untoward with or from previous experience. 
a hardware or network component they may report or log an incident and refer it to the Service Desk). This does not Service Operation processes | 47 mean, however, that all events are incidents. Many classes resolution targets within SLAs – and captured as targets of events are not related to disruptions at all, but are within OLAs and Underpinning Contracts (UCs). All support indicators of normal operation or are simply informational groups should be made fully aware of these timescales. (see section 4.1). Service Management tools should be used to automate timescales and escalate the incident as required based on Although both incidents and service requests are reported pre-defined rules. to the Service Desk, this does not mean that they are the same. Service requests do not represent a disruption to agreed service, but are a way of meeting the customer’s 188.8.131.52 Incident Models needs and may be addressing an agreed target in an SLA. Many incidents are not new – they involve dealing with Service requests are dealt with by the Request Fulfilment something that has happened before and may well process (see section 4.3). happen again. For this reason, many organizations will find it helpful to pre-define ‘standard’ Incident Models – 4.2.3 Value to business and apply them to appropriate incidents when they occur. The value of Incident Management includes: An Incident Model is a way of pre-defining the steps that should be taken to handle a process (in this case a process ■ The ability to detect and resolve incidents, which for dealing with a particular type of incident) in an agreed results in lower downtime to the business, which in way. Support tools can then be used to manage the turn means higher availability of the service. This required process. This will ensure that ‘standard’ incidents means that the business is able to exploit the are handled in a pre-defined path and within pre-defined functionality of the service as designed. timescales. 
■ The ability to align IT activity to real-time business priorities. This is because Incident Management includes the capability to identify business priorities and dynamically allocate resources as necessary.
■ The ability to identify potential improvements to services. This happens as a result of understanding what constitutes an incident and also from being in contact with the activities of business operational staff.
■ The Service Desk can, during its handling of incidents, identify additional service or training requirements found in IT or the business.

Incident Management is highly visible to the business, and it is therefore easier to demonstrate its value than most areas in Service Operation. For this reason, Incident Management is often one of the first processes to be implemented in Service Management projects. The added benefit of doing this is that Incident Management can be used to highlight other areas that need attention – thereby providing a justification for expenditure on implementing other processes.

4.2.4 Policies/principles/basic concepts

Incidents which would require specialized handling can be treated in this way (for example, security-related incidents can be routed to Information Security Management and capacity- or performance-related incidents can be routed to Capacity Management).

The Incident Model should include:
■ The steps that should be taken to handle the incident
■ The chronological order these steps should be taken in, with any dependencies or co-processing defined
■ Responsibilities; who should do what
■ Timescales and thresholds for completion of the actions
■ Escalation procedures; who should be contacted and when
■ Any necessary evidence-preservation activities (particularly relevant for security- and capacity-related incidents).

The models should be input to the incident-handling support tools in use and the tools should then automate the handling, management and escalation of the process.
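As a rough illustration of how the elements of an Incident Model listed above might be held in a support tool, the following Python sketch may help. The class names, field names and example values (ModelStep, IncidentModel, the step texts and time limits) are assumptions made for illustration only; they are not taken from this publication or from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List, Optional


@dataclass
class ModelStep:
    """One step of an Incident Model (illustrative structure only)."""
    order: int                           # chronological position of the step
    action: str                          # what should be done
    responsible: str                     # who should do it
    time_limit: timedelta                # threshold for completing the action
    escalate_to: Optional[str] = None    # who to contact if the threshold is exceeded
    depends_on: List[int] = field(default_factory=list)  # steps that must finish first


@dataclass
class IncidentModel:
    """A pre-defined way of handling one type of incident."""
    name: str
    steps: List[ModelStep]
    preserve_evidence: bool = False      # e.g. for security-related incidents

    def ordered_steps(self) -> List[ModelStep]:
        """Return the steps in the order they should be taken."""
        return sorted(self.steps, key=lambda step: step.order)

    def escalation_contact(self, order: int, elapsed: timedelta) -> Optional[str]:
        """Return who to contact if the given step has overrun, else None."""
        step = next(s for s in self.steps if s.order == order)
        return step.escalate_to if elapsed > step.time_limit else None


# Hypothetical example values, not taken from any real model.
security_model = IncidentModel(
    name="Suspected security breach",
    preserve_evidence=True,
    steps=[
        ModelStep(2, "Capture logs from the affected CI", "Second-line support",
                  timedelta(hours=1), "Incident Manager", depends_on=[1]),
        ModelStep(1, "Route to Information Security Management", "Service Desk",
                  timedelta(minutes=15), "Service Desk Supervisor"),
    ],
)
```

Keeping responsibilities, thresholds and escalation contacts on each step means a tool could check every open step against its own time limit, rather than relying on one overall timer for the incident.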
There are some basic things that need to be taken into account and decided when considering Incident Management. These are covered in this section.

4.2.4.1 Timescales
Timescales must be agreed for all incident-handling stages (these will differ depending upon the priority level of the incident) – based upon the overall incident response and

4.2.4.3 Major incidents
A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.

[Figure 4.2 Incident Management process flow. Inputs: Event Management, web interface, user phone call, e-mail from technical staff. Steps: Incident Identification; Incident Logging; Incident Categorization (Service Requests go to Request Fulfilment); Incident Prioritization (major incidents follow the Major Incident Procedure); Initial Diagnosis; functional escalation to level 2/3 if needed; hierarchic (management) escalation if needed; Investigation and Diagnosis; Resolution and Recovery; Incident Closure.]

Note: People sometimes use loose terminology and/or confuse a major incident with a problem. In reality, an incident remains an incident forever – it may grow in impact or priority to become a major incident, but an incident never ‘becomes’ a problem. A problem is the underlying cause of one or more incidents and remains a separate entity always!

Please see section 4.1 for further details.

4.2.5.2 Incident logging
All incidents must be fully logged and date/time stamped, regardless of whether they are raised through a Service Desk telephone call or whether automatically detected via an event alert.
Some lower-priority incidents may also have to be handled through this procedure – due to potential business impact – and some major incidents may not need to be handled in this way if the cause and resolutions are obvious and the normal incident process can easily cope within agreed target resolution times – provided the impact remains low!

Where necessary, the major incident procedure should include the dynamic establishment of a separate major incident team under the direct leadership of the Incident Manager, formulated to concentrate on this incident alone to ensure that adequate resources and focus are provided to finding a swift resolution. If the Service Desk Manager is also fulfilling the role of Incident Manager (say in a small organization), then a separate person may need to be designated to lead the major incident investigation team – so as to avoid conflict of time or priorities – but should ultimately report back to the Incident Manager.

Note: If Service Desk and/or support staff visit the customers to deal with one incident, they may be asked to deal with further incidents ‘while they are there’. It is important that if this is done, a separate Incident Record is logged for each additional incident handled – to ensure that a historical record is kept and credit is given for the work undertaken.

All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained – and so that if the incident has to be referred to other support group(s), they will have all relevant information to hand to assist them.

The information needed for each incident is likely to include:
■ Unique reference number
■ Incident categorization (often broken down into
between two and four levels of sub-categories)
■ Incident urgency
■ Incident impact
■ Incident prioritization
■ Date/time recorded
■ Name/ID of the person and/or group recording the incident
■ Method of notification (telephone, automatic, e-mail, in person, etc.)
■ Name/department/phone/location of user
■ Call-back method (telephone, mail, etc.)
■ Description of symptoms
■ Incident status (active, waiting, closed, etc.)
■ Related CI
■ Support group/person to which the incident is allocated
■ Related problem/Known Error
■ Activities undertaken to resolve the incident
■ Resolution date and time
■ Closure category
■ Closure date and time.

If the cause of the incident needs to be investigated at the same time, then the Problem Manager would be involved as well but the Incident Manager must ensure that service restoration and underlying cause are kept separate. Throughout, the Service Desk would ensure that all activities are recorded and users are kept fully informed of progress.

4.2.5 Process activities, methods and techniques
The process to be followed during the management of an incident is shown in Figure 4.2. The process includes the following steps.

4.2.5.1 Incident identification
Work cannot begin on dealing with an incident until it is known that an incident has occurred. It is usually unacceptable, from a business perspective, to wait until a user is impacted and contacts the Service Desk. As far as possible, all key components should be monitored so that failures or potential failures are detected early so that the incident management process can be started quickly. Ideally, incidents should be resolved before they have an impact on users!

Note: If the Service Desk does not work 24/7 and responsibility for first-line incident logging and handling
passes to another group, such as IT Operations or Network Support, out of Service Desk hours, then these staff need to be equally rigorous about logging of incident details. Full training and awareness need to be provided to such staff on this issue.

4.2.5.3 Incident categorization
Part of the initial logging must be to allocate suitable incident categorization coding so that the exact type of the call is recorded. This will be important later when looking at incident types/frequencies to establish trends for use in Problem Management, Supplier Management and other ITSM activities.

Please note that the check for Service Requests in this process does not imply that Service Requests are incidents. This is simply recognition of the fact that Service Requests are sometimes incorrectly logged as incidents (e.g. a user incorrectly enters the request as an incident from the web interface). This check will detect any such requests and ensure that they are passed to the Request Fulfilment process.

to achieve a correct and complete set of categories – if they are starting from scratch! The steps involve:
1 Hold a brainstorming session among the relevant support groups, involving the SD Supervisor and Incident and Problem Managers.
2 Use this session to decide the ‘best guess’ top-level categories – and include an ‘other’ category. Set up the relevant logging tools to use these categories for a trial period.
3 Use the categories for a short trial period (long enough for several hundred incidents to fall into each category, but not so long that an analysis will take too long to perform).
4 Perform an analysis of the incidents logged during the trial period. The number of incidents logged in each higher-level category will confirm whether the categories are worth having – and a more detailed analysis of the ‘other’ category should allow identification of any additional higher-level categories that will be needed.
5 A breakdown analysis of the incidents within each higher-level category should be used to decide the lower-level categories that will be required.
6 Review and repeat these activities after a further period – of, say, one to three months – and again regularly to ensure that they remain relevant. Be aware that any significant changes to categorization may cause some difficulties for incident trending or management reporting – so they should be stabilized unless changes are genuinely required.

If an existing categorization scheme is in use, but it is not thought to be working satisfactorily, the basic idea of the technique suggested above can be used to review and amend the existing scheme.

NOTE: Sometimes the details available at the time an incident is logged may be incomplete, misleading or incorrect. It is therefore important that the categorization of the incident is checked, and updated if necessary, at call closure time (in a separate closure categorization field, so as not to corrupt the original categorization) – please see paragraph 4.2.5.9.

Multi-level categorization is available in most tools – usually to three or four levels of granularity. For example, an incident may be categorized as shown in Figure 4.3.

[Figure 4.3 Multi-level incident categorization. Example paths: Hardware > Server > Memory Board > Card failure; or Software > Application > Finance suite > Purchase order system.]

All organizations are unique and it is therefore difficult to give generic guidance on the categories an organization should use, particularly at the lower levels.

4.2.5.4 Incident prioritization
Another important aspect of logging every incident is to agree and allocate an appropriate prioritization code – as this will determine how the incident is handled both by support tools and support staff.
However, there is a technique that can be used to assist an organization

Prioritization can normally be determined by taking into account both the urgency of the incident (how quickly the business needs a resolution) and the level of impact it is causing. An indication of impact is often (but not always) the number of users being affected. In some cases, and very importantly, the loss of service to a single user can have a major business impact – it all depends upon who is trying to do what – so numbers alone are not enough to evaluate overall priority! Other factors that can also contribute to impact levels are:
■ Risk to life or limb
■ The number of services affected – may be multiple services
■ The level of financial losses
■ Effect on business reputation
■ Regulatory or legislative breaches.

An effective way of calculating these elements and deriving an overall priority level for each incident is given in Table 4.1:

Some organizations may also recognize VIPs (high-ranking executives, officers, diplomats, politicians, etc.) whose incidents would be handled on a higher priority than normal – but in such cases this is best catered for and documented within the guidance provided to the Service Desk staff on how to apply the priority levels, so they are all aware of the agreed rules for VIPs, and who falls into this category.

It should be noted that an incident’s priority may be dynamic – if circumstances change, or if an incident is not resolved within SLA target times, then the priority must be altered to reflect the new situation.

Note: some tools may have constraints that make it difficult automatically to calculate performance against SLA targets if a priority is changed during the lifetime of an incident. However, if circumstances do change, the change in priority should be made – and if necessary manual adjustments made to reporting tools.
Ideally, tools with such constraints should not be selected.

4.2.5.5 Initial diagnosis
If the incident has been routed via the Service Desk, the Service Desk Analyst must carry out initial diagnosis, typically while the user is still on the telephone – if the call is raised in this way – to try to discover the full symptoms of the incident and to determine exactly what has gone wrong and how to correct it. It is at this stage that diagnostic scripts and known error information can be most valuable in allowing earlier and accurate diagnosis.

If possible, the Service Desk Analyst will resolve the incident while the user is still on the telephone – and close the incident if the resolution is successful.

If the Service Desk Analyst cannot resolve the incident while the user is still on the telephone, but there is a prospect that the Service Desk may be able to do so within the agreed time limit without assistance from other support groups, the Analyst should inform the user of their intentions, give the user the incident reference number and attempt to find a resolution.

4.2.5.6 Incident escalation
■ Functional escalation. As soon as it becomes clear that the Service Desk is unable to resolve the incident

Table 4.1 Simple priority coding system

                     Impact
                     High    Medium  Low
Urgency   High       1       2       3
          Medium     2       3       4
          Low        3       4       5

Priority code   Description   Target resolution time
1               Critical      1 hour
2               High          8 hours
3               Medium        24 hours
4               Low           48 hours
5               Planning      Planned

In all cases, clear guidance – with practical examples – should be provided for all support staff to enable them to determine the correct urgency and impact levels, so the correct priority is allocated. Such guidance should be produced during service level negotiations.

However, it must be noted that there will be occasions when, because of particular business expediency or whatever, normal priority levels have to be overridden.
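The priority coding of Table 4.1 can be expressed as a simple lookup. The following Python sketch is one possible encoding; the function and variable names are assumptions for illustration, and the ‘Planned’ target for priority 5 is represented as having no fixed resolution time.

```python
from datetime import timedelta
from typing import Optional, Tuple

# Table 4.1: (urgency, impact) -> priority code.
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}

# Table 4.1: priority code -> (description, target resolution time).
# Priority 5 is "Planned", so no fixed target is given here.
PRIORITY_TARGETS = {
    1: ("Critical", timedelta(hours=1)),
    2: ("High", timedelta(hours=8)),
    3: ("Medium", timedelta(hours=24)),
    4: ("Low", timedelta(hours=48)),
    5: ("Planning", None),
}


def prioritize(urgency: str, impact: str) -> Tuple[int, str, Optional[timedelta]]:
    """Return (priority code, description, target resolution time)."""
    code = PRIORITY_MATRIX[(urgency.strip().lower(), impact.strip().lower())]
    description, target = PRIORITY_TARGETS[code]
    return code, description, target


code, description, target = prioritize("High", "Medium")
print(code, description, target)
```

An unknown urgency or impact value raises a KeyError in this sketch, which forces the caller to use the agreed levels. A manual override of the calculated priority, as the text allows, would be applied outside this function.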
When a user is adamant that an incident’s priority level should exceed normal guidelines, the Service Desk should comply with such a request – and if it subsequently turns out to be incorrect this can be resolved as an off-line management level issue, rather than a dispute occurring when the user is on the telephone.

itself (or when target times for first-point resolution have been exceeded – whichever comes first!) the incident must be immediately escalated for further support.

If the organization has a second-level support group and the Service Desk believes that the incident can be resolved by that group, it should refer the incident to them. If it is obvious that the incident will need deeper technical knowledge, or when the second-level group has not been able to resolve the incident within agreed target times (whichever comes first), the incident must be immediately escalated to the appropriate third-level support group. Note that third-level support groups may be internal – but they may also be third parties such as software suppliers or hardware manufacturers or maintainers. The rules for escalation and handling of incidents must be agreed in OLAs and UCs with internal and external support groups respectively.

Note: Incident Ownership remains with the Service Desk!

and/or Incident Management staff initially, in conjunction with managers of the various support groups to which incidents are escalated, to decide the order in which incidents should be picked up and actively worked on. These managers must ensure that incidents are dealt with in true business priority order and that staff are not allowed to ‘cherry-pick’ the incidents they choose!

4.2.5.7 Investigation and Diagnosis
In the case of incidents where the user is just seeking information, the Service Desk should be able to provide this fairly quickly and resolve the service request – but if a fault is being reported, this is an incident and likely to
require some degree of investigation and diagnosis.

Each of the support groups involved with the incident handling will investigate and diagnose what has gone wrong – and all such activities (including details of any actions taken to try to resolve or re-create the incident) should be fully documented in the incident record so that a complete historical record of all activities is maintained at all times.

Note: Valuable time can often be lost if investigation and diagnostic action (or indeed resolution or recovery actions) are performed serially. Where possible, such activities should be performed in parallel to reduce overall timescales – and support tools should be designed and/or selected to allow this. However, care should be taken to coordinate activities, particularly resolution or recovery activities, otherwise the actions of different groups may conflict or further complicate a resolution!

Regardless of where an incident is referred to during its life, ownership of the incident remains with the Service Desk at all times. The Service Desk remains responsible for tracking progress, keeping users informed and ultimately for Incident Closure.

■ Hierarchic escalation. If incidents are of a serious nature (for example Priority 1 incidents) the appropriate IT managers must be notified, for informational purposes at least. Hierarchic escalation is also used if the ‘Investigation and Diagnosis’ and ‘Resolution and Recovery’ steps are taking too long or proving too difficult. Hierarchic escalation should continue up the management chain so that senior managers are aware and can be prepared and take any necessary action, such as allocating additional resources or involving suppliers/maintainers. Hierarchic escalation is also used when there is contention about to whom the incident is allocated.
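The functional and hierarchic escalation rules described in this section are the kind of logic that support tools are expected to automate. The Python sketch below shows one way such rules might be checked. The time limits, the number of support levels and all names are hypothetical values chosen for illustration; real figures would be agreed in SLAs, OLAs and UCs.

```python
from datetime import timedelta
from typing import Dict, List

# Hypothetical time limits per priority code and support level
# (level 1 = Service Desk). Priorities without an entry have no limits here.
LEVEL_LIMITS: Dict[int, Dict[int, timedelta]] = {
    1: {1: timedelta(minutes=15), 2: timedelta(minutes=30), 3: timedelta(hours=1)},
    2: {1: timedelta(hours=1), 2: timedelta(hours=4), 3: timedelta(hours=8)},
}
TOP_LEVEL = 3  # assumed number of support levels


def escalations_due(priority: int, level: int, elapsed_at_level: timedelta) -> List[str]:
    """Return the escalations that should be triggered for an open incident."""
    due: List[str] = []
    limit = LEVEL_LIMITS.get(priority, {}).get(level)
    overrun = limit is not None and elapsed_at_level > limit
    # Functional escalation: pass the incident to the next support level.
    if overrun and level < TOP_LEVEL:
        due.append(f"functional: refer to level {level + 1} support")
    # Hierarchic escalation: serious incidents, or overruns at the last level.
    if priority == 1 or (overrun and level == TOP_LEVEL):
        due.append("hierarchic: notify IT management")
    return due


print(escalations_due(priority=2, level=1, elapsed_at_level=timedelta(hours=2)))
```

In this sketch the escalation only changes who works on the incident or who is informed. Ownership stays with the Service Desk throughout, so a real tool would keep the owner field unchanged when an escalation is recorded.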
Hierarchic escalation can, of course, be initiated by the affected users or customer management, as they see fit – that is why it is important that IT managers are made aware so that they can anticipate and prepare for any such escalation.

The exact levels and timescales for both functional and hierarchic escalation need to be agreed, taking into account SLA targets, and embedded within support tools which can then be used to police and control the process flow within agreed timescales.

The Service Desk should keep the user informed of any escalation that takes place and ensure the Incident Record is updated accordingly to keep a full history of actions.

Note regarding Incident allocation
There may be many incidents in a queue with the same priority level – so it will be the job of the Service Desk

■ Ongoing or recurring problem? Determine (in conjunction with resolver groups) whether it is likely that the incident could recur and decide whether any preventive action is necessary to avoid this.

This investigation is likely to include such actions as:
■ Establishing exactly what has gone wrong or being sought by the user
■ Understanding the chronological order of events
■ Confirming the full impact of the incident, including the number and range of users affected
■ Identifying any events that could have triggered the incident (e.g. a recent change, some user action?)
■ Knowledge searches looking for previous occurrences by searching previous Incident/Problem Records and/or Known Error Databases or relevant manufacturers’/suppliers’ Error Logs or Knowledge Databases.

4.2.5.8 Resolution and Recovery
When a potential resolution has been identified, this should be applied and tested. The specific actions to be undertaken and the people who will be involved in taking the recovery actions may vary, depending upon the nature of the fault – but could involve:
■ Asking the user to undertake directed activities on
their own desk top or remote equipment
■ The Service Desk implementing the resolution either centrally (say, rebooting a server) or remotely using software to take control of the user’s desktop to diagnose and implement a resolution
■ Specialist support groups being asked to implement specific recovery actions (e.g. Network Support reconfiguring a router)
■ A third-party supplier or maintainer being asked to resolve the fault.

Even when a resolution has been found, sufficient testing must be performed to ensure that recovery action is complete and that the service has been fully restored to the user(s).

NOTE: in some cases it may be necessary for two or more groups to take separate, though perhaps coordinated, recovery actions for an overall resolution to be implemented. In such cases Incident Management must coordinate the activities and liaise with all parties involved.

Regardless of the actions taken, or who does them, the

In conjunction with Problem Management, raise a Problem Record in all such cases so that preventive action is initiated.
■ Formal closure. Formally close the Incident Record.

Note: Some organizations may choose to utilize an automatic closure period on specific, or even all, incidents (e.g. incident will be automatically closed after two working days if no further contact is made by the user). Where this approach is to be considered, it must first be fully discussed and agreed with the users – and widely publicized so that all users and IT staff are aware of this. It may be inappropriate to use this method for certain types of incidents – such as major incidents or those involving VIPs, etc.

Rules for re-opening incidents
Despite all adequate care, there will be occasions when incidents recur even though they have been formally closed. Because of such cases, it is wise to have pre-defined rules about if and when an incident can be re-opened.
It might make sense, for example, to agree that if the incident recurs within one working day then it can be re-opened – but that beyond this point a new incident must be raised, but linked to the previous incident(s).

The exact time threshold/rules may vary between individual organizations – but clear rules should be agreed and documented and guidance given to all Service Desk staff so that uniformity is applied.

4.2.6 Triggers, input and output/inter-process interfaces
Incidents can be triggered in many ways. The most common route is when a user rings the Service Desk or completes a web-based incident-logging screen, but increasingly incidents are raised automatically via Event Management tools. Technical staff may notice potential failures and raise an incident, or ask the Service Desk to do so, so that the fault can be addressed. Some incidents may also arise at the initiation of suppliers – who may send some form of notification of a potential or actual difficulty that needs attention.

The interfaces with Incident Management include:

Incident Record must be updated accordingly with all relevant information and details so that a full history is maintained.

The resolving group should pass the incident back to the Service Desk for closure action.

4.2.5.9 Incident Closure
The Service Desk should check that the incident is fully resolved and that the users are satisfied and willing to agree the incident can be closed. The Service Desk should also check the following:
■ Closure categorization. Check and confirm that the initial incident categorization was correct or, where the categorization subsequently turned out to be incorrect, update the record so that a correct closure categorization is recorded for the incident – seeking advice or guidance from the resolving group(s) as necessary.
■ User satisfaction survey. Carry out a user satisfaction call-back or e-mail survey for the agreed percentage of incidents.
■ Incident documentation.
Chase any outstanding details and ensure that the Incident Record is fully documented so that a full historic record at a sufficient level of detail is complete.

■ Problem Management: Incident Management forms part of the overall process of dealing with problems in the organization. Incidents are often caused by underlying problems, which must be solved to prevent the incident from recurring. Incident Management provides a point where these are reported.
■ Configuration Management provides the data used to identify and progress incidents. One of the uses of the CMS is to identify faulty equipment and to assess the impact of an incident. It is also used to identify the users affected by potential problems. The CMS also contains information about which categories of incident should be assigned to which support group. In turn, Incident Management can maintain the status of faulty CIs. It can also assist Configuration Management to audit the infrastructure when working to resolve an incident.
■ Change Management: Where a change is required to implement a workaround or resolution, this will need to be logged as an RFC and progressed through Change Management. In turn, Incident Management is able to detect and resolve incidents that arise from failed changes.

■ The Incident Management tools, which contain information about:
● Incident and problem history
● Incident categories
● Action taken to resolve incidents
● Diagnostic scripts which can help first-line analysts to resolve the incident, or at least gather information that will help second- or third-line analysts resolve it faster.
■ Incident Records, which include the following data:
● Unique reference number
● Incident classification
● Date and time of recording and any subsequent activities
● Name and identity of the person recording and updating the Incident Record
● Name/organization/contact details of affected
user(s)
● Description of the incident symptoms
● Details of any actions taken to try to diagnose, resolve or re-create the incident
● Incident category, impact, urgency and priority
● Relationship with other incidents, problems, changes or Known Errors
● Closure details, including time, category, action taken and identity of person closing the record.

Incident Management also requires access to the CMS. This will help it to identify the CIs affected by the incident and also to estimate the impact of the incident.

The Known Error Database provides valuable information about possible resolutions and workarounds. This is discussed in detail in paragraph 184.108.40.206.

■ Integration into the SLM process. This will assist Incident Management correctly to assess the impact and priority of incidents and assists in defining and executing escalation procedures. SLM will also benefit from the information learned during Incident Management, for example in determining whether service level performance targets are realistic and achievable.

■ Capacity Management: Incident Management provides a trigger for performance monitoring where there appears to be a performance problem. Capacity Management may develop workarounds for incidents.
■ Availability Management will use Incident Management data to determine the availability of IT services and look at where the incident lifecycle can be improved.
■ SLM: The ability to resolve incidents in a specified time is a key part of delivering an agreed level of service. Incident Management enables SLM to define measurable responses to service disruptions. It also provides reports that enable SLM to review SLAs objectively and regularly. In particular, Incident Management is able to assist in defining where services are at their weakest, so that SLM can define actions as part of the Service Improvement Plan (SIP) – please see the Continual Service Improvement publication for more details. SLM defines the acceptable levels of service within which Incident Management works, including:
● Incident response times
● Impact definitions
● Target fix times
● Service definitions, which are mapped to users
● Rules for requesting services
● Expectations for providing feedback to users.

4.2.7 Information Management
Most information used in Incident Management comes from the following sources:

4.2.8 Metrics
The metrics that should be monitored and reported upon to judge the efficiency and effectiveness of the Incident Management process, and its operation, will include:
■ Total numbers of Incidents (as a control measure)
■ Breakdown of incidents at each stage (e.g. logged, work in progress, closed etc)
■ Size of current incident backlog
■ Number and percentage of major incidents
■ Mean elapsed time to achieve incident resolution or circumvention, broken down by impact code
■ Percentage of incidents handled within agreed response time (incident response-time targets may be specified in SLAs, for example, by impact and urgency codes)
■ Average cost per incident
■ Number of incidents reopened and as a percentage of the total
■ Number and percentage of incidents incorrectly assigned
■ Number and percentage of incidents incorrectly categorized
■ Percentage of Incidents closed by the Service Desk without reference to other levels of support (often referred to as ‘first point of contact’)
■ Number and percentage of incidents processed per Service Desk agent
■ Number and percentage of incidents resolved remotely, without the need for a visit
■ Number of incidents handled by each Incident Model
■ Breakdown of incidents by time of day, to help pinpoint peaks and ensure matching of resources.

Reports should be produced under the authority of the Incident Manager, who should draw up a schedule and distribution list, in collaboration with the Service Desk and support groups handling incidents. Distribution lists should at least include IT Services Management and specialist support groups. Consider also making the data available to users and customers, for example via SLA reports.

4.2.9 Challenges, Critical Success Factors and risks

4.2.9.1 Challenges
The following challenges will exist for successful Incident Management:
■ The ability to detect incidents as early as possible.

4.2.9.2 Critical Success Factors
The following factors will be critical for successful Incident Management:
■ A good Service Desk is key to successful Incident Management
■ Clearly defined targets to work to – as defined in SLAs
■ Adequate customer-oriented and technically trained support staff with the correct skill levels, at all stages of the process
■ Integrated support tools to drive and control the process
■ OLAs and UCs that are capable of influencing and shaping the correct behaviour of all support staff.

4.2.9.3 Risks
The risks to successful Incident Management are actually similar to some of the challenges and the reverse of some of the Critical Success Factors mentioned above. They include:
■ Being inundated with incidents that cannot be handled within acceptable timescales due to a lack of available or properly trained resources
■ Incidents being bogged down and not progressed as
intended because of inadequate support tools to raise alerts and prompt progress
■ Lack of adequate and/or timely information sources because of inadequate tools or lack of integration
■ Mismatches in objectives or actions because of poorly aligned or non-existent OLAs and/or UCs.

This will require education of the users reporting incidents, the use of Super Users (see paragraph 126.96.36.199) and the configuration of Event Management tools.
■ Convincing all staff (technical teams as well as users) that all incidents must be logged, and encouraging the use of self-help web-based capabilities (which can speed up assistance and reduce resource requirements).
■ Availability of information about problems and Known Errors. This will enable Incident Management staff to learn from previous incidents and also to track the status of resolutions.
■ Integration into the CMS to determine relationships between CIs and to refer to the history of CIs when performing first-line support.

through the Request Fulfilment process and which others will have to go through more formal Change Management. There will always be grey areas which prevent generic guidance from being usefully prescribed.

4.3 REQUEST FULFILMENT
The term ‘Service Request’ is used as a generic description for many varying types of demands that are placed upon the IT Department by the users. Many of these are actually small changes – low risk, frequently occurring, low cost, etc. (e.g. a request to change a password, a request to install an additional software application onto a particular workstation, a request to relocate some items of desktop equipment) or maybe just a question requesting information – but their scale and frequent, low-risk nature means that they are better handled by a separate process, rather than being allowed to congest and obstruct the normal Incident and Change Management processes.
4.3.1 Purpose/goal/objective Request Fulfilment is the processes of dealing with Service 4.3.3 Value to business Requests from the users. The objectives of the Request The value of Request Fulfilment is to provide quick and Fulfilment process include: effective access to standard services which business staff can use to improve their productivity or the quality of ■ To provide a channel for users to request and receive business services and products. standard services for which a pre-defined approval and qualification process exists Request Fulfilment effectively reduces the bureaucracy ■ To provide information to users and customers about involved in requesting and receiving access to existing or the availability of services and the procedure for new services, thus also reducing the cost of providing obtaining them these services. Centralizing fulfilment also increases the ■ To source and deliver the components of requested level of control over these services. This in turn can help standard services (e.g. licences and software media) reduce costs through centralized negotiation with ■ To assist with general information, complaints or suppliers, and can also help to reduce the cost of support. comments. 4.3.4 Policies/principles/basic concepts 4.3.2 Scope Many Service Requests will be frequently recurring, so a The process needed to fulfil a request will vary depending predefined process flow (a model) can be devised to upon exactly what is being requested – but can usually be include the stages needed to fulfil the request, the broken down into a set of activities that have to be individuals or support groups involved, target timescales performed. Some organizations will be comfortable to let and escalation paths. 
Service Requests will usually be the Service Requests be handled through their Incident satisfied by implementing a Standard Change (see the Management processes (and tools) – with Service Requests Service Transition publication for further details on being handled as a particular type of ‘incident’ (using a Standard Changes). The ownership of Service Requests high-level categorization system to identify those resides with the Service Desk, which monitors, escalates, ‘incidents’ that are in fact Service Requests). dispatches and often fulfils the user request. Note, however, that there is a significant difference here – 188.8.131.52 Request Models an incident is usually an unplanned event whereas a Some Service Requests will occur frequently and will Service Request is usually something that can and should require handling in a consistent manner in order to meet be planned! agreed service levels. To assist this, many organizations will Therefore, in an organization where large numbers of wish to create pre-defined Request Models (which typically Service Requests have to be handled, and where the include some form of pre-approval by Change actions to be taken to fulfil those requests are very varied Management). This is similar in concept to the idea of or specialized, it may be appropriate to handle Service Incident Models already described in paragraph 184.108.40.206, Requests as a completely separate work stream – and to but applied to Service Requests. record and manage them as a separate record type. 
This may be particularly appropriate if the organization 4.3.5 Process activities, methods and has chosen to widen the scope of the Service Desk to techniques expand upon just IT-related issues and use the desk as a focal point for other types or request for service – for 220.127.116.11 Menu selection example, a request to service a photocopier or even going Request Fulfilment offers great opportunities for self-help so far as to include, for example, building management practices where users can generate a Service Request issues, such as a need to replace a light fitment or repair a using technology that links into Service Management leak in the plumbing. tools. Ideally, users should be offered a ‘menu’-type selection via a web interface, so that they can select and Note: It will ultimately be up to each organization to input details of Service Requests from a pre-defined list – decide and document which request it will handle Service Operation processes | 57 appropriate expectations can be set by giving target 18.104.22.168 Closure delivery and/or implementation targets/dates (in line with When the Service Request has been fulfilled it must be SLA targets). Where organizations are offering a self-help IT referred back to the Service Desk for closure. The Service support capability to the users, it would make sense to Desk should go through the same closure process as combine this with a Request Fulfilment system as described earlier in paragraph 22.214.171.124 – checking that the described. user is satisfied with the outcome. 
Specialist web tools to offer this type of ‘shopping basket’ experience can be used together with interfaces directly to 4.3.6 Triggers, input and output/inter- the back-end integrated ITSM tools, or other more general process interfaces business process automation or Enterprise Resource Most requests will be triggered through either a user Planning (ERP) tools that may be used for management of calling the Service Desk or a user completing some form the Request Fulfilment activities. of self-help web-based input screen to make their request. The latter will often involve a selection from a portfolio of 126.96.36.199 Financial approval available request types. One important extra step that is likely to be needed when The primary interfaces with Request Fulfilment include: dealing with a service request is that of financial approval. ■ Service Desk/Incident Management: Many Service Most requests will have some form of financial Requests may come in via the Service Desk and may implications, regardless of the type of commercial be initially handled through the Incident Management arrangements in place. The cost of fulfilling the request process. Some organizations may choose that all must first be established. It may be possible to agree fixed requests are handled via this route – but others may prices for ‘standard’ requests – and prior approval for such choose to have a separate process, for reasons already requests may be given as part of the organization’s overall discussed earlier in this chapter. annual financial management. In all other cases, an ■ A strong link is also needed between Request estimate of the cost must be produced and submitted to Fulfilment, Release, Asset and Configuration the user for financial approval (the user may need to seek Management – as some requests will be for the approval up their management/financial chain). 
If approval deployment of new or upgraded components that can is given, in addition to fulfilling the request, the process be automatically deployed. In such cases the ‘release’ must also include charging (billing or cross-charging) for can be pre-defined, built and tested but only the work done – if charging is in place. deployed upon request by those who want the ‘release’. Upon deployment, the CMS will have to be 188.8.131.52 Other approval updated to reflect the change. Where appropriate, In some cases further approval may be needed – such as software licence checks/updates will also be necessary. compliance-related or wider business approval. Request Fulfilment must have the ability to define and check such Where appropriate, it will be necessary to relate IT-related approvals where needed. Service Requests to any incidents or problems that have initiated the need for the request (as would be the case 184.108.40.206 Fulfilment for any other type of change). The actual fulfilment activity will depend upon the nature 4.3.7 Information Management of the Service Request. Some simpler requests may be completed by the Service Desk, acting as first-line support, Request Fulfilment is dependent on information from the while others will have to be forwarded to specialist groups following sources: and/or suppliers for fulfilment. ■ The Service Requests will contain information about: Some organizations may have specialist fulfilment groups ● What service is being requested (to ‘pick, pack and dispatch’) – or may have outsourced ● Who requested and authorized the service some fulfilment activities to a third-party supplier(s). The ● Which process will be used to fulfil the request Service Desk should monitor and chase progress and keep ● To whom it was assigned to and what action users informed throughout, regardless of the actual was taken fulfilment source. 
  ● The date and time when the request was logged as well as the date and time of all actions taken
  ● Closure details.
■ Requests for Change: In some cases the Request Fulfilment process will be initiated by an RFC. This is typical where the Service Request relates to a CI
■ The Service Portfolio, to enable the scope of agreed Service Requests to be identified
■ Security Policies will prescribe any controls to be executed or adhered to when providing the service, e.g. ensuring that the requester is authorized to access the service, or that the software is licensed.

4.3.8 Metrics

The metrics needed to judge the effectiveness and efficiency of Request Fulfilment will include the following (each metric will need to be broken down by request type, within the period):

■ The total number of Service Requests (as a control measure)
■ Breakdown of service requests at each stage (e.g. logged, WIP, closed, etc.)
■ The size of current backlog of outstanding Service Requests
■ The mean elapsed time for handling each type of Service Request
■ The number and percentage of Service Requests completed within agreed target times
■ The average cost per type of Service Request
■ Level of client satisfaction with the handling of Service Requests (as measured in some form of satisfaction survey).

4.3.9 Challenges, Critical Success Factors and risks

22.214.171.124 Challenges

The following challenges will be faced when introducing Request Fulfilment:

■ Clearly defining and documenting the type of requests that will be handled within the Request Fulfilment process (and those that will either go through the Service Desk and be handled as incidents or those that will need to go through formal Change Management) – so that all parties are absolutely clear on the scope.
■ Establishing self-help front-end capabilities that allow the users to interface successfully with the Request Fulfilment process.

220.127.116.11 Critical Success Factors

Request Fulfilment depends on the following Critical Success Factors:

■ Agreement of what services will be standardized and who is authorized to request them. The cost of these services must also be agreed. This may be done as part of the SLM process. Any variances of the services must also be defined.
■ Publication of the services to users as part of the Service Catalogue. It is important that this part of the Service Catalogue is easily accessed, perhaps on the Intranet, and is recognized as the first source of information for users seeking access to a service.
■ Definition of a standard fulfilment procedure for each of the services being requested. This includes all procurement policies and the ability to generate purchase orders and work orders
■ A single point of contact which can be used to request the service. This is often provided by the Service Desk or through an Intranet request, but could be through an automated request directly into the Request Fulfilment or procurement system.
■ Self-service tools needed to provide a front-end interface to the users. It is essential that these integrate with the back-end fulfilment tools, often managed through Incident or Change Management.

18.104.22.168 Risks

Risks that may be encountered with Request Fulfilment include:

■ Poorly defined scope, where people are unclear about exactly what the process is expected to handle
■ Poorly designed or implemented user interfaces so that users have difficulty raising the requests that they need
■ Badly designed or operated back-end fulfilment processes that are incapable of dealing with the volume or nature of the requests being made
■ Inadequate monitoring capabilities so that accurate metrics cannot be gathered.

4.4 PROBLEM MANAGEMENT

ITIL defines a ‘problem’ as the unknown cause of one or more incidents.

4.4.1 Purpose/goal/objective

Problem Management is the process responsible for managing the lifecycle of all problems. The primary objectives of Problem Management are to prevent problems and resulting incidents from happening, to eliminate recurring incidents and to minimize the impact of incidents that cannot be prevented.

4.4.2 Scope

Problem Management includes the activities required to diagnose the root cause of incidents and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially Change Management and Release Management.

Problem Management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, Problem Management has a strong interface with Knowledge Management, and tools such as the Known Error Database will be used for both.

Although Incident and Problem Management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.

4.4.3 Value to business

Problem Management works together with Incident Management and Change Management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up the resolution time and identify permanent solutions, reducing the number and resolution time of incidents. This results in less downtime and less disruption to business critical systems.

Additional value is derived from the following:

■ Higher availability of IT services
■ Higher productivity of business and IT staff
■ Reduced expenditure on workarounds or fixes that do not work
■ Reduction in cost of effort in fire-fighting or resolving repeat incidents.

4.4.4 Policies/principles/basic concepts

There are some important concepts of Problem Management that must be taken into account from the outset. These include:

126.96.36.199 Problem Models

Many problems will be unique and will require handling in an individual way – but it is conceivable that some incidents may recur because of dormant or underlying problems (for example, where the cost of a permanent resolution will be high and a decision has been taken not to go ahead with an expensive solution – but to ‘live with’ the problem).

As well as creating a Known Error Record in the Known Error Database (see paragraph 188.8.131.52) to ensure quicker diagnosis, the creation of a Problem Model for handling such problems in the future may be helpful. This is very similar in concept to the idea of Incident Models already described in paragraph 184.108.40.206, but applied to problems as well as incidents.

4.4.5 Process activities, methods and techniques

Problem Management consists of two major processes:

■ Reactive Problem Management, which is generally executed as part of Service Operation – and is therefore covered in this publication
■ Proactive Problem Management which is initiated in Service Operation, but generally driven as part of Continual Service Improvement (see that publication for fuller details).

The reactive Problem Management process is shown in Figure 4.4. This is a simplified chart to show the normal process flow, but in reality some of the states may be iterative or variations may have to be made in order to handle particular situations.

Flow chart labels: Service Desk, Event Management, Incident Management, Proactive Problem Management, Supplier or Contractor; Problem Detection; Problem Logging; Categorization; Prioritization; Investigation & Diagnosis (CMS); Workaround?; Create Known Error Record (Known Error Database); Change Needed? (Yes: Change Management)
No: Resolution; Closure; Major Problem? (Major Problem Review); End

Figure 4.4 Problem Management process flow

220.127.116.11 Problem detection

It is likely that multiple ways of detecting problems will exist in all organizations. These will include:

■ Suspicion or detection of an unknown cause of one or more incidents by the Service Desk, resulting in a Problem Record being raised – the desk may have resolved the incident but has not determined a definitive cause and suspects that it is likely to recur, so will raise a Problem Record to allow the underlying cause to be resolved. Alternatively, it may be immediately obvious from the outset that an incident, or incidents, has been caused by a major problem, so a Problem Record will be raised without delay.
■ Analysis of an incident by a technical support group which reveals that an underlying problem exists, or is likely to exist.
■ Automated detection of an infrastructure or application fault, using event/alert tools automatically to raise an incident which may reveal the need for a Problem Record.
■ A notification from a supplier or contractor that a problem exists that has to be resolved.
■ Analysis of incidents as part of proactive Problem Management – resulting in the need to raise a Problem Record so that the underlying fault can be investigated further.

Frequent and regular analysis of incident and problem data must be performed to identify any trends as they become discernible. This will require meaningful and detailed categorization of incidents/problems and regular reporting of patterns and areas of high occurrence. ‘Top ten’ reporting, with drill-down capabilities to lower levels, is useful in identifying trends.

Further details of how detected trends should be handled are included in the Continual Service Improvement publication.

126.96.36.199 Problem logging

Regardless of the detection method, all the relevant details of the problem must be recorded so that a full historic record exists. This must be date and time stamped to allow suitable control and escalation.

A cross-reference must be made to the incident(s) which initiated the Problem Record – and all relevant details must be copied from the Incident Record(s) to the Problem Record. It is difficult to be exact, as cases may vary, but typically this will include details such as:

■ User details
■ Service details
■ Equipment details
■ Date/time initially logged
■ Priority and categorization details
■ Incident description
■ Details of all diagnostic or attempted recovery actions taken.

18.104.22.168 Problem Categorization

Problems must be categorized in the same way as incidents (and it is advisable to use the same coding system) so that the true nature of the problem can be easily traced in the future and meaningful management information can be obtained.

22.214.171.124 Problem Prioritization

Problems must be prioritized in the same way and for the same reasons as incidents – but the frequency and impact of related incidents must also be taken into account. The coding system described earlier in Table 4.1 (which combines impact with urgency to give an overall priority level) can be used to prioritize problems in the same way that it might be used for incidents, though the definitions and guidance to support staff on what constitutes a problem, and the related service targets at each level, must obviously be devised separately.

Problem prioritization should also take into account the severity of the problems. Severity in this context refers to how serious the problem is from an infrastructure perspective, for example:

■ Can the system be recovered, or does it need to be replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are affected)?

188.8.131.52 Problem Investigation and Diagnosis

An investigation should be conducted to try to diagnose the root cause of the problem – the speed and nature of this investigation will vary depending upon the impact, severity and urgency of the problem – but the appropriate level of resources and expertise should be applied to finding a resolution commensurate with the priority code allocated and the service target in place for that priority level.

There are a number of useful problem solving techniques that can be used to help diagnose and resolve problems – and these should be used as appropriate. Such techniques are described in more detail later in this section.

The CMS must be used to help determine the level of impact and to assist in pinpointing and diagnosing the exact point of failure. The Known Error Database (KEDB) should also be accessed and problem-matching techniques (such as key word searches) should be used to see if the problem has occurred before and, if so, to find the resolution.

It is often valuable to try to recreate the failure, so as to understand what has gone wrong, and then to try various ways of finding the most appropriate and cost-effective resolution to the problem. To do this effectively without causing further disruption to the users, a test system will be necessary that mirrors the production environment.

There are many problem analysis, diagnosis and solving techniques available and much research has been done in this area. Some of the most useful and frequently used techniques include:

■ Chronological Analysis: When dealing with a difficult problem, there are often conflicting reports about exactly what has happened and when. It is therefore very helpful briefly to document all events in chronological order – to provide a timeline of events. This often makes it possible to see which events may have been triggered by others – or to discount any claims that are not supported by the sequence of events.
■ Pain Value Analysis: This is where a broader view is taken of the impact of an incident or problem, or incident/problem type. Instead of just analysing the number of incidents/problems of a particular type in a particular period, a more in-depth analysis is done to determine exactly what level of pain has been caused to the organization/business by these incidents/problems. A formula can be devised to calculate this pain level. Typically this might include taking into account:
  ● The number of people affected
  ● The duration of the downtime caused
  ● The cost to the business (if this can be readily calculated or estimated).
  By taking all of these factors into account, a much more detailed picture of those incidents/problems or incident/problem types that are causing most pain can be determined – to allow a better focus on those things that really matter and deserve highest priority in resolving.
■ Kepner and Tregoe: Charles Kepner and Benjamin Tregoe developed a useful way of problem analysis which can be used formally to investigate deeper-rooted problems. They defined the following stages:
  ● defining the problem
  ● describing the problem in terms of identity, location, time and size
  ● establishing possible causes
  ● testing the most probable cause
  ● verifying the true cause.
  The method is described in fuller detail in Appendix C.
■ Brainstorming: It can often be valuable to gather together the relevant people, either physically or by electronic means, and to ‘brainstorm’ the problem – with people throwing in ideas on what the potential cause may be and potential actions to resolve the problem. Brainstorming sessions can be very constructive and innovative but it is equally important that someone, perhaps the Problem Manager, documents the outcome and any agreed actions and keeps a degree of control in the session(s).
■ Ishikawa Diagrams: Kaoru Ishikawa (1915–89), a leader in Japanese quality control, developed a method of documenting causes and effects which can be useful in helping identify where something may be going wrong, or be improved. Such a diagram is typically the outcome of a brainstorming session where problem solvers can offer suggestions. The main goal is represented by the trunk of the diagram, and primary factors are represented as branches. Secondary factors are then added as stems, and so on. Creating the diagram stimulates discussion and often leads to increased understanding of a complex problem. An example diagram is given in Appendix D.
■ Pareto Analysis: This is a technique for separating important potential causes from more trivial issues. The following steps should be taken:
  1 Form a table listing the causes and their frequency as a percentage.
  2 Arrange the rows in the decreasing order of importance of the causes, i.e. the most important cause first.
  3 Add a cumulative percentage column to the table. By this step, the chart should look something like Table 4.2, which illustrates 10 causes of network failure in an organization.
  4 Create a bar chart with the causes, in order of their percentage of total.

Table 4.2 Pareto cause ranking chart

Network failures
Causes                 Percentage of total   Computation   Cumulative %
Network Controller     35                    0+35%         35
File corruption        26                    35%+26%       61
Addressing conflicts   19                    61%+19%       80
Server OS              6                     80%+6%        86
Scripting error        5                     86%+5%        91
Untested change        3                     91%+3%        94
Operator error         2                     94%+2%        96
Backup failure         2                     96%+2%        98
Intrusion attempts     1                     98%+1%        99
Disk failure           1                     99%+1%        100

  5 Superimpose a line chart of the cumulative percentages.
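The tabulation behind Table 4.2 (rank the causes, then accumulate their percentages) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `pareto_table` and `important_causes` are hypothetical, the percentages are copied from Table 4.2, and the 80% cutoff is the threshold line drawn in Figure 4.5. A cause is treated as important here when its cumulative percentage does not exceed the cutoff, which is one possible reading of the chart rather than a rule stated in the text.

```python
from itertools import accumulate

# Percentages copied from Table 4.2 (cause -> percentage of total failures).
FAILURE_SHARE = {
    "Network Controller": 35,
    "File corruption": 26,
    "Addressing conflicts": 19,
    "Server OS": 6,
    "Scripting error": 5,
    "Untested change": 3,
    "Operator error": 2,
    "Backup failure": 2,
    "Intrusion attempts": 1,
    "Disk failure": 1,
}


def pareto_table(shares):
    """Return rows of (cause, share, cumulative share), largest share first."""
    # sorted() is stable, so causes with equal shares keep their input order.
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    running = accumulate(share for _, share in ranked)
    return [(cause, share, total) for (cause, share), total in zip(ranked, running)]


def important_causes(shares, cutoff=80):
    """Hypothetical helper: causes whose cumulative share stays within the cutoff."""
    return [cause for cause, _, total in pareto_table(shares) if total <= cutoff]


if __name__ == "__main__":
    for cause, share, total in pareto_table(FAILURE_SHARE):
        print(f"{cause:<22}{share:>4}{total:>5}")
    print(important_causes(FAILURE_SHARE))
```

With the Table 4.2 figures, the cumulative column reaches 80 at the third row, so the helper returns Network Controller, File corruption and Addressing conflicts. If the cumulative total jumps over the cutoff instead of landing on it, an organization would need to decide whether the crossing cause counts as important; the sketch excludes it.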
The completed graph is illustrated in Figure 4.5.
  6 Draw line at 80% on the y-axis parallel to the x-axis. Then drop the line at the point of intersection with the curve on the x-axis. This point on the x-axis separates the important causes and trivial causes. This line is represented as a dotted line in Figure 4.5.

From this chart it is clear to see that there are three primary causes for network failure in the organization. These should therefore be targeted first.

Figure 4.5 Important versus trivial causes (chart titled ‘Network Failures’: bars per cause on a left axis from 0 to 40, cumulative line on a right axis from 0 to 120; x-axis categories: Network controller, File corruption, Addressing conflicts, Server OS, Scripting error, Untested change, Operator error, Backup failure, Intrusion attempts, Disk failure)

184.108.40.206 Workarounds

In some cases it may be possible to find a workaround to the incidents caused by the problem – a temporary way of overcoming the difficulties. For example, a manual amendment may be made to an input file to allow a program to complete its run successfully and allow a billing process to complete satisfactorily, but it is important that work on a permanent resolution continues where this is justified – in this example the reason for the file becoming corrupted in the first place must be found and corrected to prevent this happening again.

In cases where a workaround is found, it is therefore important that the problem record remains open, and details of the workaround are always documented within the Problem Record.

18.104.22.168 Raising a Known Error Record

As soon as the diagnosis is complete, and particularly where a workaround has been found (even though it may not yet be a permanent resolution), a Known Error Record must be raised and placed in the Known Error Database – so that if further incidents or problems arise, they can be identified and the service restored more quickly.

However, in some cases it may be advantageous to raise a Known Error Record even earlier in the overall process – just for information purposes, for example – even though the diagnosis may not be complete or a workaround found, so it is inadvisable to set a concrete procedural point exactly when a Known Error Record must be raised. It should be done as soon as it becomes useful to do so!

The Known Error Database and the way it should be used are described in more detail in paragraph 126.96.36.199.

188.8.131.52 Problem resolution

Ideally, as soon as a solution has been found, it should be applied to resolve the problem – but in reality safeguards may be needed to ensure that this does not cause further difficulties. If any change in functionality is required this will require an RFC to be raised and approved before the resolution can be applied. If the problem is very serious and an urgent fix is needed for business reasons, then an Emergency RFC should be handled by the Change Advisory Board Emergency Committee (CAB/EC) to facilitate this urgent action. Otherwise, the RFC should follow the established Change Management process for that type of change – and the resolution should be applied only when the change has been approved and scheduled for release. In the meantime, the KEDB should be used to help resolve quickly any further occurrences of the incidents/problems.

Note: There may be some problems for which a Business Case for resolution cannot be justified (e.g. where the impact is limited but the cost of resolution would be extremely high). In such cases a decision may be taken to leave the Problem Record open but to use a workaround description in the Known Error Record to detect and resolve any recurrences quickly. Care should be taken to use the appropriate code to flag the open Problem Record so that it does not count against the performance of the team performing the process and so that unauthorized rework does not take place.

220.127.116.11 Problem Closure

When any change has been completed (and successfully reviewed), and the resolution has been applied, the Problem Record should be formally closed – as should any related Incident Records that are still open. A check should be performed at this time to ensure that the record contains a full historical description of all events – and if not, the record should be updated.

The status of any related Known Error Record should be updated to show that the resolution has been applied.

22.214.171.124 Major Problem Review

After every major problem (as determined by the organization’s priority system), while memories are still fresh a review should be conducted to learn any lessons for the future. Specifically, the review should examine:

■ Those things that were done correctly
■ Those things that were done wrong
■ What could be done better in the future
■ How to prevent recurrence
■ Whether there has been any third-party responsibility and whether follow-up actions are needed.

Such reviews can be used as part of training and awareness activities for support staff – and any lessons learned should be documented in appropriate procedures, work instructions, diagnostic scripts or Known Error Records. The Problem Manager facilitates the session and documents any agreed actions.

The knowledge learned from the review should be incorporated into a service review meeting with the business customer to ensure the customer is aware of the actions taken and the plans to prevent future major incidents from occurring. This helps to improve customer satisfaction and assure the business that Service Operations is handling major incidents responsibly and actively working to prevent their future recurrence.

184.108.40.206 Errors detected in the development environment

It is rare for any new applications, systems or software releases to be completely error-free. It is more likely that during testing of such new applications, systems or releases a prioritization system will be used to eradicate the more serious faults, but it is possible that minor faults are not rectified – often because of the balance that has to be made between delivering new functionality to the business as quickly as possible and ensuring totally fault-free code or components.

Where a decision is made to release something into the production environment that includes known deficiencies, these should be logged as Known Errors in the KEDB, together with details of workarounds or resolution activities. There should be a formal step in the testing sign-off that ensures that this handover always takes place (see Service Transition publication).

Experience has shown that if this does not happen, it will lead to far higher support costs when the users start to experience the faults and raise incidents that have to be re-diagnosed and resolved all over again!

4.4.6 Triggers, input and output/inter-process interfaces

The vast majority of Problem Records will be triggered in reaction to one or more incidents, and many will be raised or initiated via Service Desk staff. Other Problem Records, and corresponding Known Error Records, may be triggered in testing, particularly the latter stages of testing such as User Acceptance Testing/Trials (UAT), if a decision is made to go ahead with a release even though some faults are known. Suppliers may trigger the need for some Problem Records through the notification of potential faults or known deficiencies in their products or services (e.g. a warning may be given regarding the use of a particular CI and a Problem Record may be raised to facilitate the investigation by technical staff of the condition of such CIs within the organization’s IT Infrastructure).

The primary relationship between Incident and Problem Management has been discussed in detail in paragraphs 4.2.6 and 220.127.116.11. Other key interfaces include the following:

■ Service Transition
  ● Change Management: Problem Management ensures that all resolutions or workarounds that require a change to a CI are submitted to Change Management through an RFC. Change Management will monitor the progress of these changes and keep Problem Management advised. Problem Management is also involved in rectifying the situation caused by failed changes.
  ● Configuration Management: Problem Management uses the CMS to identify faulty CIs and also to determine the impact of problems and resolutions. The CMS can also be used to form the basis for the KEDB and hold or integrate with the Problem Records.
  ● Release and Deployment Management: Is responsible for rolling problem fixes out into the live environment. It also assists in ensuring that the associated known errors are transferred from the development Known Error Database into the live Known Error Database. Problem Management will assist in resolving problems caused by faults during the release process.
■ Service Design
  ● Availability Management: Is involved with determining how to reduce downtime and increase uptime. As such, it has a close relationship with Problem Management, especially the proactive areas. Much of the management information available in Problem Management will be communicated to Availability Management.
  ● Capacity Management: Some problems will require investigation by Capacity Management teams and techniques, e.g. performance issues. Capacity Management will also assist in assessing proactive measures. Problem Management provides management information relative to the quality of decisions made during the Capacity Planning process.
  ● IT Service Continuity: Problem Management acts as an entry point into IT Service Continuity Management where a significant problem is not resolved before it starts to have a major impact on the business.
■ Continual Service Improvement
  ● Service Level Management: The occurrence of incidents and problems affects the level of service delivery measured by SLM. Problem Management contributes to improvements in service levels, and its management information is used as the basis of some of the SLA review components. SLM also provides parameters within which Problem Management works, such as impact information and the effect on services of proposed resolutions and proactive measures.
■ Service Strategy
  ● Financial Management: Assists in assessing the impact of proposed resolutions or workarounds, as well as Pain Value Analysis. Problem Management provides management information about the cost of resolving and preventing problems, which is used as input into the budgeting and accounting systems and Total Cost of Ownership calculations.

4.4.7 Information Management

18.104.22.168 CMS

The CMS will hold details of all of the components of the IT Infrastructure as well as the relationships between these components.

to diagnose and implement a workaround as quickly as possible, which is where the KEDB can be of assistance.

It is essential that any data put into the database can be quickly and accurately retrieved. The Problem Manager should be fully trained and familiar with the search methods/algorithms used by the selected database and should carefully ensure that when new records are added, the relevant search key criteria are correctly included.

Care should be taken to avoid duplication of records (i.e. the same problem described in two or more ways as separate records). To avoid this, the Problem Manager should be the only person able to enter a new record. Other support groups should be allowed, indeed
It will act as a valuable source for problem encouraged, to propose new records, but these should be diagnosis and for evaluating the impact of problems (e.g. vetted by the Problem Manager before entry to the KEDB. if this disk is down, what data is on that disk; which In large organizations where Problem Management staff services use that data; which users use those services?). exist in multiple locations but a single KEDB is used As it will also hold details of previous activities, it can also (recommended!), a procedure must be agreed between all be used as a valuable source of historical data to help Problem Management staff to ensure that such duplication identify trends or potential weaknesses – a key part of cannot occur. This may involve designating just one staff proactive Problem Management (see Continual Service member as the central KEDB Manager. Improvement publication). The KEDB should be used during the Incident and Problem Diagnosis phases to try to speed up the 22.214.171.124 Known Error Database resolution process – and new records should be added as The purpose of a Known Error Database is to allow storage quickly as possible when a new problem has been of previous knowledge of incidents and problems – and identified and diagnosed. how they were overcome – to allow quicker diagnosis and All support staff should be fully trained and conversant resolution if they recur. with the value that the KEDB can offer and the way it The Known Error Record should hold exact details of the should be used. They should be able readily to retrieve fault and the symptoms that occurred, together with and use data. precise details of any workaround or resolution action that Note: Some tools/implementations may choose to can be taken to restore the service and/or resolve the delineate Known Errors simply by changing a field in the problem. An incident count will also be useful to original Problem Record. 
This is acceptable provided the determine the frequency with which incidents are likely to same level of functionality is available. recur and influence priorities, etc. The KEDB, like the CMS, forms part of a larger Service It should be noted that a Business Case for a permanent Knowledge Management System (SKMS) illustrated in resolution for some problems may not exist. For example, Figure 4.6. More information on the SKMS can be found in if a problem does not cause serious disruption and a the Service Transition publication. workaround exists and/or the cost of resolving the problem far outweighs the benefits of a permanent resolution – then a decision may be taken to tolerate the existence of the problem. However, it will still be desirable Service Operation processes | 67 Change and Release Asset Management Configuration Life Technical Quality Service Desk View Presentation View View Cycle View Configuration View Management View User assets Layer Schedules/plans Financial Asset Asset Project configurations Service Applications Asset and User configuration, Change Request Status Status Reports Asset Service Strategy, Application Configuration Changes, Releases, Portal Change Advisory Board Statements and Bills Design, Transition, Environment Management Policies, Asset and Configuration agenda and Licence Management Operations Test Environment Processes, Procedures, item and related minutes Asset performance configuration Infrastructure forms, templates, incidents, problems, baselines and checklists workarounds, changes changes Search, Browse, Store, Retrieve, Update, Publish, Subscribe, Collaborate Knowledge Performance Management Monitoring Processing Query and Analysis Reporting Modelling Scorecards, Dashboards Forecasting, Planning, Budgeting Layer Alerting Business/Customer/Supplier/User – Service – Application – Infrastructure mapping Information Integration Service Portfolio Service Change Layer Service Catalogue Service Integrated CMDB Service Release 
Model Common Process, Data Data Schema Meta Data Extract, Transform, Data and reconciliation synchronization Mining Mapping Management Load Information Model Data Integration Definitive Media Physical CMDBs Platform Software Discovery, Project Document Enterprise Library Configuration Tools Configuration Asset Filestore Applications Definitive E.g. Storage Database Management Management Access Management Data and CMDB1 Middleware Network and audit Document Library Human Resources Information Mainframe tools Sources Structured Supply Chain Definitive Distributed Desktop Management and Tools CMDB2 Mobile Multimedia Library 1 Customer Relationship Management Project Definitive Software CMDB3 Multimedia Library 2 Figure 4.6 Service Knowledge Management System 4.4.8 Metrics ■ The percentage of Major Problem Reviews completed The following metrics should be used to judge the successfully and on time. effectiveness and efficiency of the Problem Management All metrics should be broken down by category, impact, process, or its operation: severity, urgency and priority level and compared with ■ The total number of problems recorded in the period previous periods. (as a control measure) ■ The percentage of problems resolved within SLA 4.4.9 Challenges, Critical Success Factors targets (and the percentage that are not!) and risks ■ The number and percentage of problems that A major dependency for Problem Management is the exceeded their target resolution times establishment of an effective Incident Management ■ The backlog of outstanding problems and the trend process and tools. This will ensure that problems are (static, reducing or increasing?) identified as soon as possible and that as much work is ■ The average cost of handling a problem done on pre-qualification as possible. However, it is also critical that the two processes have formal interfaces and ■ The number of major problems (opened and closed common working practices. 
This implies the following: and backlog) ■ The percentage of Major Problem Reviews successfully ■ Linking Incident and Problem Management tools performed ■ The ability to relate Incident and Problem Records ■ The number of Known Errors added to the KEDB ■ The second- and third-line staff should have a good ■ The percentage accuracy of the KEDB (from audits of working relationship with staff on the first line the database) ■ Making sure that business impact is well understood by all staff working on problem resolution. 68 | Service Operation processes In addition it is important that Problem Management is ■ There is less likelihood of errors being made in data able to use all Knowledge and Configuration Management entry or in the use of a critical service by an unskilled resources available. user (e.g. production control systems) ■ The ability to audit use of services and to trace the Another CSF is the ongoing training of technical staff in both technical aspects of their job as well as the business abuse of services implications of the services they support and the ■ The ability more easily to revoke access rights when processes they use. needed – an important security consideration ■ May be needed for regulatory compliance (e.g. SOX, HIPAA, COBIT). 4.5 ACCESS MANAGEMENT Access Management is the process of granting authorized 4.5.4 Policies/principles/basic concepts users the right to use a service, while preventing access to Access Management is the process that enables users to non-authorized users. It has also been referred to as Rights use the services that are documented in the Service Management or Identity Management in different Catalogue. It comprises the following basic concepts: organizations. ■ Access refers to the level and extent of a service’s 4.5.1 Purpose/goal/objective functionality or data that a user is entitled to use. 
■ Identity refers to the information about them that Access Management provides the right for users to be able distinguishes them as an individual and which verifies to use a service or group of services. It is therefore the their status within the organization. By definition, the execution of policies and actions defined in Security and Identity of a user is unique to that user. (This is Availability Management. covered in more detail in paragraph 126.96.36.199.) ■ Rights (also called privileges) refer to the actual 4.5.2 Scope settings whereby a user is provided access to a service Access Management is effectively the execution of both or group of services. Typical rights, or levels of access, Availability and Information Security Management, in that include read, write, execute, change, delete. it enables the organization to manage the confidentiality, ■ Services or service groups. Most users do not use availability and integrity of the organization’s data and only one service, and users performing a similar set of intellectual property. activities will use a similar set of services. Instead of Access Management ensures that users are given the right providing access to each service for each user to use a service, but it does not ensure that this access is separately, it is more efficient to be able to grant each available at all agreed times – this is provided by user – or group of users – access to the whole set of Availability Management. services that they are entitled to use at the same time. Access Management is a process that is executed by all (This is discussed in more detail in paragraph 188.8.131.52.) Technical and Application Management functions and is ■ Directory Services refers to a specific type of tool usually not a separate function. However, there is likely to that is used to manage access and rights. These are be a single control point of coordination, usually in IT discussed in section 5.8. Operations Management or on the Service Desk. 
4.5.5 Process activities, methods and Access Management can be initiated by a Service Request techniques through the Service Desk. 184.108.40.206 Requesting access 4.5.3 Value to business Access (or restriction) can be requested using one of any Access Management provides the following value: number of mechanisms, including: ■ Controlled access to services ensures that the ■ A standard request generated by the Human Resource organization is able to maintain more effectively the system. This is generally done whenever a person is confidentiality of its information hired, promoted, transferred or when they leave the ■ Employees have the right level of access to execute company their jobs effectively ■ A Request for Change ■ A Service Request submitted via the Request Fulfilment system Service Operation processes | 69 ■ By executing a pre-authorized script or option (e.g. decisions to restrict or provide access, rather than making downloading an application from a staging server as the decision. and when it is needed). As soon as a user has been verified, Access Management Rules for requesting access are normally documented as will provide that user with rights to use the requested part of the Service Catalogue. service. In most cases this will result in a request to every team or department involved in supporting that service to 220.127.116.11 Verification take the necessary action. If possible, these tasks should Access Management needs to verify every request for be automated. access to an IT service from two perspectives: The more roles and groups that exist, the more likely that ■ That the user requesting access is who they say Role Conflict will arise. Role Conflict in this context refers they are to a situation where two specific roles or groups, if ■ That they have a legitimate requirement for assigned to a single user, will create issues with separation that service. of duties or conflict of interest. 
Examples of this include: ■ One role requires detailed access, while another role The first category is usually achieved by the user providing their username and password. Depending on the prevents that access organization’s security policies, the use of the username ■ Two roles allow a user to perform two tasks that and password are usually accepted as proof that the should not be combined (e.g. a contractor can log person is a legitimate user. However, for more sensitive their time sheet for a project and then approve all services further identification may be required (biometric, payment on work for the same project). use of an electronic access key or encryption device, etc.). Role Conflict can be avoided by careful creation of roles The second category will require some independent and groups, but more often they are caused by policies verification, other than the user’s request. For example: and decisions made outside of Service Operation – either by the business or by different project teams working ■ Notification from Human Resources that the person is during Service Design. In each case the conflict must be a new employee and requires both a username and documented and escalated to the stakeholders to resolve. access to a standard set of services ■ Notification from Human Resources that the user has Whenever roles and groups are defined, it is possible that been promoted and requires access to additional they could be defined too broadly or too narrowly. There resources will always be users who need something slightly different from the pre-defined roles. In these cases, it is possible to ■ Authorization from an appropriate (defined in the use standard roles and then add or subtract specific rights process) manager as required – similar to the concept of Baselines and ■ Submission of a Service Request (with supporting Variants in Configuration Management (see Service evidence) through the Service Desk Transition publication). 
However, the decision to do this is ■ Submission of an RFC (with supporting evidence) not in the hands of individual operational staff members. through Change Management, or execution of a Each exception should be coordinated by Access pre-defined Standard Change Management and approved through the originating ■ A policy stating that the user may have access to an process. optional service if they need it. Access Management should perform a regular review of For new services the Change Record should specify which the roles and groups that it has created and manage to users or groups of users will have access to the Service. ensure that they are appropriate for the services that IT Access Management will then check to see that all the delivers and supports – and obsolete or unwanted users are still valid and automatically provide access as roles/groups should be removed. specified in the RFC. 18.104.22.168 Monitoring identity status 22.214.171.124 Providing rights As users work in the organization, their roles change and Access Management does not decide who has access to so also do their needs to access services. Examples of which IT services. Rather, Access Management executes changes include: the policies and regulations defined during Service Strategy and Service Design. Access Management enforces 70 | Service Operation processes ■ Job changes. In this case the user will possibly need this information available to all who have access to the access to different or additional services. Incident Management system will expose vulnerabilities. ■ Promotions or demotions. The user will probably use Information Security Management plays a vital role in the same set of services, but will need access to detecting unauthorized access and comparing it with the different levels of functionality or data. rights that were provided by Access Management. This will ■ Transfers. 
In this situation, the user may need access require Access Management involvement in defining the to exactly the same set of services, but in a different parameters for use in Intrusion Detection tools. region with different working practices and different sets of data. Access Management may also be required to provide a record of access for specific Services during forensic ■ Resignation or death. Access needs to be completely investigations. If a user is suspected of breaches of policy, removed to prevent the username being used as a inappropriate use of resources, or fraudulent use of data, security loophole. Access Management may be required to provide evidence ■ Retirement. In many organizations, an employee who of dates, times and even content of that user’s access to retires may still have access to a limited set of services, specific Services. This is normally provided by the including benefits systems or systems that allow them Operational staff of that service, but working as part of the to purchase company products at a reduced rate. Access Management process. ■ Disciplinary action. In some cases the organization will require a temporary restriction to prevent the user 126.96.36.199 Removing or restricting rights from accessing some or all of the services that they Just as Access Management provides rights to use a would normally have access to. There should be a Service, it is also responsible for revoking those rights. feature in the process and tools to do this, rather than Again, this is not a decision that it makes on its own. having to delete and reinstate the user’s access rights. Rather, it will execute the decisions and policies made ■ Dismissals. Where an employee or contractor is during Service Strategy and Design and also decisions dismissed, or where legal action is taken against a made by managers in the organization. 
customer (for example for defaulting on payment for products purchased on the Internet), access should be Removing access is usually done in the following revoked immediately. In addition, Access Management, circumstances: working together with Information Security ■ Death Management, should take active measures to prevent ■ Resignation and detect malicious action against the organization ■ Dismissal from that user. ■ When the user has changed roles and no longer Access Management should understand and document requires access to the service the typical User Lifecycle for each type of user and use it ■ Transfer or travel to an area where different regional to automate the process. Access Management tools should access applies. provide features that enable a user to be moved from one state to another, or from one group to another, easily and In other cases it is not necessary to remove access, but with an audit trail. just to provide tighter restrictions. These could include reducing the level, time or duration of access. Situations 188.8.131.52 Logging and tracking access in which access should be restricted include: Access Management should not only respond to requests. ■ When the user has changed roles or been demoted It is also responsible for ensuring that the rights that they and no longer requires the same level of access have provided are being properly used. ■ When the user is under investigation, but still requires In this respect, Access Monitoring and Control must be access to basic services, such as e-mail. In this case included in the monitoring activities of all Technical and their e-mail may be subject to additional scanning Application Management functions and all Service (but this would need to be handled very carefully Operation processes. 
and in full accordance with the organization’s security policy) Exceptions should be handled by Incident Management, ■ When a user is away from the organization on possibly using Incident Models specifically designed to temporary assignment and will not require access to deal with abuse of access rights. It should be noted that that service for some time. the visibility of such actions should be restricted. Making Service Operation processes | 71 4.5.6 Triggers, input and output/inter- 4.5.7 Information Management process interfaces 184.108.40.206 Identity Access Management is triggered by a request for a user or users to access a service or group of services. This could The identity of a user is the information about them that originate from any of the following: distinguishes them as an individual and which verifies their status within the organization. By definition, the ■ An RFC. This is most frequently used for large-scale identity of a user is unique to that user. Since there are service introductions or upgrades where the rights of a cases where two users share a common piece of significant number of users need to be updated as information (e.g. they have the same name), identity is part of the project. usually established using more than one piece of ■ A Service Request. This is usually initiated through information, for example: the Service Desk, or directly into the Request ■ Name Fulfilment system, and executed by the relevant ■ Address Technical or Application Management teams. ■ A request from the appropriate Human Resources ■ Contact details, e.g. telephone, e-mail address, etc. Management personnel (which should be channelled ■ Physical documentation, e.g. driver’s licence, passport, via the Service Desk). This is usually generated as part marriage certificate, etc. of the process for hiring, promoting, relocating and ■ Numbers that refer to a document or an entry in a termination or retirement. database, e.g. 
employee number, tax number, ■ A request from the manager of a department, who government identity number, driver’s licence number, could be performing an HR role, or who could have etc. made a decision to start using a service for the first ■ Biometric information, e.g. fingerprints, retinal images, time. voice recognition patterns, DNA, etc. ■ Expiration date (if relevant). Access Management should be linked to the Human Resource processes to verify the user’s identify as well as A user identity is provided to anyone with a legitimate to ensure that they are entitled to the services being requirement to access IT services or organizational requested. information. These could include: Information Security Management is a key driver for Access ■ Employees Management as it will provide the security and data ■ Contractors protection policies and tools needed to execute Access ■ Vendor staff (e.g. account managers, support Management. personnel, etc.) Change Management plays an important role as the ■ Customers (especially when purchasing products or means to control the actual requests for access. This is services over the Internet). because any request for access to a service is a change, Most organizations will verify a user’s identity before they although it is usually processed as a Standard Change or join the organization by requesting a subset of the above Service Request (possibly using a model) once the criteria information. The more secure the organization, the more for access have been agreed through SLM. types of information are required and the more thoroughly SLM maintains the agreements for access to each service. they are checked. This will include the criteria for who is entitled to access Many organizations will be faced with the need to provide each service, what the cost of that access will be, if access rights to temporary or occasional staff or appropriate and what level of access will be granted to contractors/suppliers. 
The management of access to such different types of user (e.g. managers or staff). personnel often proves problematic – closing access after There is also a strong relationship between Access use is often as difficult to manage, or more so, than Management and Configuration Management. The CMS providing access initially. Well-defined procedures can be used for data storage and interrogated to between IT and HR should be established that include fail- determine current access details. safe checks that ensure access rights are removed immediately they are no longer justified or required. When a user is granted access to an application, it should already have been established by the organization (usually 72 | Service Operation processes the Human Resources or Security Department) that the and protected as part of the organization’s security user is who they say they are. procedures. At this point, all that information is filed and the file is associated with a corporate identity, usually an employee 4.5.8 Metrics or contractor number and an identity that can be used to Metrics that can be used to measure the efficiency and access corporate resources and information, usually a user effectiveness of Access Management include: identity or ‘username’ and an associated password. ■ Number of requests for access (Service Request, RFC, etc.) 220.127.116.11 Users, groups, roles and service groups ■ Instances of access granted, by service, user, While each user has an individual identity, and each IT department, etc. service can be seen as an entity in its own right, it is often ■ Instances of access granted by department or helpful to group them together so that they can be individual granting rights managed more easily. Sometimes the terms ‘user profile’ ■ Number of incidents requiring a reset of access rights or ‘user template’ or ‘user role’ are used to describe this ■ Number of incidents caused by incorrect access type of grouping. settings. 
Most organizations have a standard set of services for all individual users, regardless of their position or job 4.5.9 Challenges, Critical Success Factors (excluding customers – who do not have any visibility to and risks internal services and processes). These will include services Conditions for successful Access Management include: such as messaging, office automation, Desktop Support, telephony, etc. New users are automatically provided with ■ The ability to verify the identity of a user (that the rights to use these services. person is who they say they are) ■ The ability to verify the identity of the approving However, most users also have some specialized role that person or body they perform. For example, in addition to the standard ■ The ability to verify that a user qualifies for access to a services, the user also performs a Marketing Management role, which requires that they have access to some specific service specialized marketing and financial modelling tools ■ The ability to link multiple access rights to an and data. individual user ■ The ability to determine the status of the user at any Some groups may have unique requirements – such as time (e.g. to determine whether they are still field or home workers who may have to dial in or use employees of the organization when they log on to a Virtual Private Network (VPN) connections, with security system) implications that may have to be more tightly managed. ■ The ability to manage changes to a user’s access To make it easier for Access Management to provide the requirements appropriate rights, it uses a catalogue of all the roles in ■ The ability to restrict access rights to unauthorized the organization and which services support each role. users This catalogue of roles should be compiled and ■ A database of all users and the rights that they have maintained by Access Management in conjunction with been granted. HR and will often be automated in the Directory Services tools (see section 5.8). 
4.6 OPERATIONAL ACTIVITIES OF In addition to playing different roles, users may also PROCESSES COVERED IN OTHER LIFECYCLE belong to different groups. For example, all contractors are required to log their timesheets in a dedicated Time Card PHASES System, which is not used by employees. Access Management will assess all the roles that a user plays as 4.6.1 Change Management well as the groups that they belong to and ensure that Change Management is primarily covered in the Service they provide rights to use all associated services. Transition publication, but there are some aspects of Change Management which Service Operation staff will be Note: All data held on users will be subject to data involved with on a day-to-day basis. These include: protection legislation (this exists in most geographic locations in some form or other) so should be handled Service Operation processes | 73 ■ Raising and submitting RFCs as needed to address ■ Participation in the planning stages of major new Service Operation issues releases to advise on Service Operation issues ■ Participating in CAB or CAB/EC meetings to ensure ■ The physical handling of CIs from/to the DML as that Service Operation risks, issues and views are taken required to fulfil their operational roles – while into account adhering to relevant Release and Deployment ■ Implementing changes as directed by Change Management procedures, such as ensure that all items Management where they involve Service Operation are properly booked out and back in. component or services ■ Backing out changes as directed by Change 4.6.4 Capacity Management Management where they involve Service Operation Capacity Management should operate at three levels: component or services Business Capacity Management, Service Capacity ■ Helping define and maintain change models relating Management and Component Capacity Management. 
to Service Operation components or services
■ Receiving change schedules and ensuring that all Service Operation staff are made aware of and prepared for all relevant changes
■ Using the Change Management process for standard, operational-type changes.

4.6.2 Configuration Management
Configuration Management is primarily covered in the Service Transition publication, but there are some aspects of Configuration Management which Service Operation staff will be involved with on a day-to-day basis. These include:
■ Informing Configuration Management of any discrepancies found between any CIs and the CMS
■ Making any amendments necessary to correct any discrepancies, under the authority of Configuration Management, where they involve any Service Operation components or services.

■ Business Capacity Management involves working with the business to plan and anticipate both longer-term strategic issues and shorter-term tactical initiatives that are likely to have an impact on IT capacity.
■ Service Capacity Management is about understanding the characteristics of each of the IT services, and then the demands that different types of users or transactions have on the underlying infrastructure – and how these vary over time and might be impacted by business change.
■ Component Capacity Management involves understanding the performance characteristics and capabilities and current utilization levels of all the technical components (CIs) that make up the IT Infrastructure, and predicting the impact of any changes or trends.

Many of these activities are of a strategic or longer-term planning nature and are covered in the Service Strategy, Service Design and Service Transition publications.
support group(s) are dealing with the fault and can intervene if necessary.

Responsibility for updating the CMS remains with Configuration Management, but in some cases Operations staff might be asked, under the direction of Configuration Management, to update relationships, or even to add new CIs or mark CIs as ‘disposed’ in the CMS, if these updates are related to operational activities actually performed by Operations staff.

4.6.3 Release and Deployment Management
Release and Deployment Management is primarily covered in the Service Transition publication, but there are some aspects of this process which Service Operation staff will be involved with on a day-to-day basis. These may include:
■ Actual implementation actions regarding the deployment of new releases, under the direction of Release and Deployment Management, where they relate to Service Operation components or services

However, there are a number of operational Capacity Management activities that must be performed on a regular ongoing basis as part of Service Operation. These include the following.

4.6.4.1 Capacity and Performance Monitoring
All components of the IT Infrastructure should be continually monitored (in conjunction with Event Management) so that any potential problems or trends can be identified before failures or performance degradation occurs. Ideally, such monitoring should be automated and thresholds should be set so that exception alerts are raised in good time to allow appropriate avoiding or recovery action to be taken before adverse impact occurs.

The components and elements to be monitored will vary depending upon the infrastructure in use, but will typically include:
■ CPU utilization (overall and broken down by system/service usage)
■ Memory utilization
■ IO rates (physical and buffer) and device utilization
■ Queue length (maximum and average)
■ File store utilization (disks, partitions, segments)
■ Applications (throughput rates, failure rates)
■ Databases (utilization, record locks, indexing, contention)
■ Network transaction rates, error and retry rates
■ Transaction response time
■ Batch duration profiles
■ Internet/intranet site/page hit rates
■ Internet response times (external and internal to firewalls)
■ Number of system/application log-ons and concurrent users
■ Number of network nodes in use, and utilization levels.

There are different kinds of monitoring tools needed to collect and interpret data at each level. For example, some tools will allow performance of business transactions to be monitored, while others will monitor CI behaviour.

Manufacturers’ claimed performance capabilities and agreed service level targets, together with actual historical monitored performance and capacity data, should be used to set alert levels. This may need to be an iterative process initially, performing some trial-and-error adjustments until the correct levels are achieved.

Note: Capacity Management may have to become involved in the capacity requirements and capabilities of IT Service Management. Whether the organization has enough Service Desk staff to handle the rate of incidents; whether the CAB structure can handle the number of changes it is being asked to review and approve; whether support tools can handle the volume of data being gathered are Capacity Management issues, which the Capacity Management team may be asked to help investigate and answer.

4.6.4.2 Handling capacity- or performance-related incidents
If an alert is triggered, or an incident is raised at the Service Desk, caused by a current or ongoing Capacity or
Performance Management problem, Capacity Management must become involved to identify the cause and find a resolution. Working together with appropriate technical support groups, and alongside Problem Management, all necessary investigations must be performed to detect exactly what has gone wrong and what is needed to correct the situation.

It may be necessary to switch to more detailed monitoring during the investigation phase to determine the exact cause. Monitoring is often set at a ‘background’ level during normal circumstances due to the large amount of data that can be generated and to avoid placing too high a burden on the IT Infrastructure – but when specific difficulties are being investigated more detailed monitoring may be needed to pinpoint the exact cause.

When a solution, or potential solution, has been found, any changes necessary to resolve the problem must be approved via formal Change Management prior to implementation. If the fault is causing serious disruption

Capacity Management must set up and calibrate alarm thresholds (where necessary in conjunction with Event Management, as it is often Event Monitoring tools that may be used) so that the correct alert levels are set and that any filtering is established as necessary so that only meaningful events are raised. Without such filtering it is possible that ‘information only’ alerts can obscure more significant alerts that require immediate attention. In addition, it is possible for serious failures to cause ‘alert storms’ due to very high volumes of repeat alerts, which again must be filtered so that the most meaningful messages are not obscured.

It may be appropriate to use external, third-party, monitoring capabilities for some CIs or components of the IT Infrastructure (e.g. key internet sites/pages). Capacity Management should be involved in helping specify and select any such monitoring capabilities and in integrating the results or any alerts with other monitoring and handling systems.
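The filtering of ‘information only’ alerts and of repeat alerts during an ‘alert storm’ might look something like the following Python sketch. It is a minimal illustration under stated assumptions: the `Alert` fields, the severity names and the 300-second suppression window are hypothetical, and real Event Monitoring tools provide their own filtering and correlation rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # CI that raised the alert (hypothetical identifier)
    metric: str       # e.g. "cpu" or "disk"
    severity: str     # assumed levels: "info", "warning", "critical"
    timestamp: float  # seconds


def filter_alerts(alerts, suppress_window=300.0, drop_severities=("info",)):
    """Drop 'information only' alerts and repeats of an alert already passed on."""
    last_passed = {}  # (source, metric, severity) -> time the alert last got through
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.severity in drop_severities:
            continue  # 'information only' alerts are not raised
        key = (alert.source, alert.metric, alert.severity)
        previous = last_passed.get(key)
        if previous is not None and alert.timestamp - previous < suppress_window:
            continue  # repeat within the window: treated as part of a storm
        last_passed[key] = alert.timestamp
        kept.append(alert)
    return kept
```

The window is measured from the last alert that was passed on, so a continuing fault is re-raised once per window rather than being silenced indefinitely.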
and an urgent resolution is needed, the urgent change process should be used. It is very important that no ‘tuning’ takes place without submission through Change Management, as even apparently small adjustments can often have very large cumulative effects – sometimes across the entire IT Infrastructure.

Capacity Management must work with all appropriate support groups to make decisions on where alarms are routed and on escalation paths and timescales. Alerts should be logged to the Service Desk as well as to appropriate support staff, so that appropriate Incident Records can be raised so a permanent record of the event exists – and Service Desk staff have a view of how well the

Operation functions will have to take action to implement such restrictions – usually accompanied by concurrent action to implement the logging-out of users who have been inactive for an agreed period of time to free up resources for others.

4.6.4.6 Workload Management
There may be occasions when optimization of infrastructure resources is needed to maintain or improve performance or throughput. This can often be done through Workload Management, which is a generic term to cover such actions as:
■ Rescheduling a particular service or workload to run at a different time of day, or day of the week etc.

4.6.4.3 Capacity and performance trends
Capacity Management has a role to play in identifying any capacity or performance trends as they become discernible. Further details of actions needed to address such trends are included in the Continual Service Improvement publication.

4.6.4.4 Storage of Capacity Management data
Large amounts of data are usually generated through capacity and performance monitoring. Monitoring of meters and tables of just a few Kbytes each can quickly grow into huge files if many components are being monitored at relatively short intervals. Another problem with very short-term monitoring is that it is not possible to
gather meaningful information without looking over a longer period. For example, a single snapshot of a CPU will show the device to be either ‘busy’ or ‘idle’ – but a summary over, say, a 5-minute period will show the average utilization level over that period, which is a much more meaningful measure of whether the device is able to work comfortably, or whether potential performance problems are likely to occur.

In any organization it is likely that the monitoring tools used will vary greatly – with a combination of system-specific tools, many of them part of the basic operating system, and specialist monitoring tools being used. In order to coordinate the data being generated and allow the retention of meaningful data for analysis and trending purposes, some form of central repository for holding this summary data is needed: a Capacity Management Information System (CMIS).

The format, location and design of such a database should be planned and implemented in advance – see the Service

(usually away from peak-times to off-peak windows) – which will often mean having to make adjustments to job-scheduling software.
■ Moving a service or workload from one location or set of CIs to another – often to balance utilization or traffic.
■ Technical Virtualization: setting up and using virtualization systems to allow movement of processing around the infrastructure to give better performance/resilience in a dynamic fashion.
■ Limiting or moving demand for resources through Demand Management techniques (see above and also the Service Design publication).

It will only be possible to manage workloads effectively if a good understanding exists of which workloads will run at what time and how much resource utilization each workload places upon the IT Infrastructure. Diligent monitoring and analysis of workloads is therefore needed on an ongoing operational basis.
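The point made in this section about single ‘busy’/‘idle’ snapshots versus a summary over, say, a 5-minute period can be shown with a short Python sketch. The function name, the sample format and the choice to key each summary by the start of its interval are assumptions made for illustration; they do not describe any particular monitoring tool or CMIS design.

```python
def summarize_utilization(samples, interval=300):
    """Average busy/idle snapshots into a utilization figure per interval.

    samples:  iterable of (timestamp_seconds, busy) pairs, busy being True or False.
    interval: summary period in seconds (300 = 5 minutes).
    Returns a dict mapping each interval's start time to the fraction of
    snapshots in that interval that found the device busy.
    """
    totals = {}  # interval start -> (busy snapshots, all snapshots)
    for timestamp, busy in samples:
        start = int(timestamp // interval) * interval
        busy_count, count = totals.get(start, (0, 0))
        totals[start] = (busy_count + (1 if busy else 0), count + 1)
    return {start: b / n for start, (b, n) in totals.items()}
```

Only the per-interval summaries would need to be retained for trending, which is far less data than the raw snapshots.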
Design publication for further details – but there will be some operational aspects to handle, such as database housekeeping and backups.

4.6.4.5 Demand Management
Demand Management is the name given to a number of techniques that can be used to modify demand for a particular resource or service. Some techniques for Demand Management can be planned in advance – and these are covered in more detail in the Service Design publication. However, there are other aspects of Demand Management that are of a more operational nature, requiring shorter-term action.

If, for example, the performance of a particular service is causing concern, and short-term restrictions on concurrency of users are needed to allow performance improvements for a smaller restricted group, then Service

available to the specified users at the required time and at the agreed levels.

4.6.4.7 Modelling and applications sizing
Modelling and/or sizing of new services and/or applications must, where appropriate, be done during the planning and transition phases – see the Service Design and Service Transition publications. However, the Service Operation functions have a role to play in evaluating the accuracy of the predictions and feeding back any issues or discrepancies.

4.6.4.8 Capacity Planning
During Service Design and Service Transition, the capacity requirements of IT services are calculated. A forward-looking capacity plan should be maintained and regularly updated and Service Operation will have a role to play in this. Such a plan should look forward up to two years or more, but should be reviewed regularly every three to 12 months, depending upon volatility and resources available.
The plan should be linked to the organization’s financial planning cycle, so that any required expenditure for infrastructure upgrades, enhancements or additions can be included in budget estimates and approved in advance.

The plan should predict the future but must also examine and report upon previous predictions, particularly to give some confidence in further predictions. Where any discrepancies have been encountered, these should be explained and future remedial action described.

During Service Operation the IT teams and users are in the best position to detect whether services actually meet the agreed requirements and whether the design of these services is effective.

What seems like a good idea during the Design phase may not actually be practical or optimal. The experience of the users and operational functions makes them a primary input into the ongoing improvement of existing services and the design.

However, there are a number of challenges with gaining access to this knowledge:
■ Most of the experiences of the operational teams and users are either informal, or spread across multiple sources.
■ The process for collecting and collating this data needs to be formalized.
■ Users and operational staff are usually fully occupied with their regular activities and tasks and it is very difficult for them to be involved in regular planning and design activities. One argument often made here is that if design is improved, the operational teams will be less busy resolving problems and will therefore have more time to be involved in design activities.

The Capacity Plan might typically cover:
■ Current performance and utilization details, with recent trends for all key CIs, including
● Backbone networks
● LANs
● Mainframes (if still used)
● Key servers
● Main data storage devices
● Selected (representative) desktop and laptop equipment
● Key websites
● Key databases
● Key applications
● Operational capacity – electricity, floor space, environmental capacity (air conditioning), floor weighting, heat generation and output, electrical and water demand and supply etc.
● Magnetic media.
■ Estimated performance and utilization for all such CIs during the planning period (e.g. the next three months)
■ Comparative data with previous estimates – to allow confidence in future estimates to be judged
■ Reports on any specific capacity difficulties encountered in the past period, with details of recovery and preventive actions taken for the future
■ Details of any required upgrades or procurements needed and planned for the future, with indicative costs and timescales.
■ Any potential capacity risks that are likely – with suggested countermeasures should they arise.

However, practice shows that as soon as staff are freed up, they often become the target of workforce reduction exercises.

Having said this, there are three key opportunities for operational staff to be involved in Availability Improvement, since these are generally viewed as part of their ongoing responsibility:
■ Review of maintenance activities. Service Design will define detailed maintenance schedules and activities, which are required to keep IT services functioning at the required level of performance and availability. Regular comparison of actual maintenance activities and times with the plans will highlight potential areas for improvement. One of the sources of this information is a review of whether Service Maintenance Objectives were met and, if not, why not.
■ Major problem reviews. Problems could be the result of any number of factors, one of which is poor design. Problem reviews therefore may include opportunities to identify improvements to the design of IT services, which will include availability and capacity improvement.

4.6.5 Availability Management
During Service Design and Service Transition, IT services are designed for availability and recovery. Service
Operation is responsible for actually making the IT service

■ Involvement in specific initiatives using techniques such as Service Failure Analysis (SFA), Component Failure Impact Analysis (CFIA), or Fault Tree Analysis (FTA) or as members of Technical Observation (TO) activities – either as part of the follow-up to major problems or as part of an ongoing Service Improvement Plan, in collaboration with dedicated Availability Management staff. These Availability Management techniques are explained in more detail in the Service Design publication.

There may be occasions when Operational Staff themselves need downtime of one or more services to enable them to conduct their operational or maintenance activities – which may impact on availability if not properly scheduled and managed. In such cases they must liaise with SLM and Availability Management staff – who will negotiate with the business/users, often using the Service Desk to perform this role, to agree and schedule such activities.

The Service Operation Manager must also be involved in regular, at least monthly, reviews of expenditure against budgets – as part of the ongoing IT budgeting and accounting process. Any discrepancies must be identified and necessary adjustments made. All committed expenditure must go through the organization’s purchase order system so that commitments can be accrued and proper checks must be made on all goods received so that invoices and payments can be correctly authorized – or discrepancies investigated and rectified.

It should be noted that some proposed cost reductions by the business may actually increase IT costs, or at least unit costs. Care should therefore be taken to ensure that IT is involved in discussing all cost-saving measures and contribute to overall decisions. Financial Management is covered in detail in the Service Strategy publication.

4.6.8 IT Service Continuity Management
Service Operation functions are responsible for the testing and execution of system and service recovery plans as determined in the IT Service Continuity plans for the organization. In addition, managers of all Service Operation functions must be on the Business Continuity Central Coordination team.

4.6.6 Knowledge Management
It is vitally important that all data and information that can be useful for future Service Operation activities are properly gathered, stored and assessed. Relevant data, metrics and information should be passed up on the management chain and to other Service Lifecycle phases so that it can feed into the knowledge and wisdom layers of the organization’s Service Knowledge Management System, the structures of which have to be defined in Service Strategy and Service Design and refined in Continual Service Improvement (see other ITIL publications in this series).

Key repositories of Service Operation, which have been frequently mentioned elsewhere, are the CMS and the KEDB, but this must be widened out to include all of the Service Operation teams’ and departments’ documentation, such as operations manuals, procedures manuals, work instructions, etc.

This is discussed in detail in Service Strategy and Service Design and will not be repeated here, except to indicate that it is important that Service Operation functions must be involved in the following areas:
■ Risk assessment, using its knowledge of the infrastructure and techniques such as CFIA and access to information in the CMS to identify single points of failure or other high-risk situations
■ Execution of any Risk Management measures that are agreed, e.g. implementation of countermeasures, or increased resilience to components of the infrastructures, etc.
■ Assistance in writing the actual recovery plans for
systems and services under its control
■ Participation in testing of the plans (such as involvement in off-site testing, simulations etc) on an ongoing basis under the direction of the IT Service Continuity Manager (ITSCM)
■ Ongoing maintenance of the plans under the control of ITSCM and Change Management
■ Participation in training and awareness campaigns to ensure that they are able to execute the plans and understand their roles in a disaster
■ The Service Desk will play a key role in communicating with staff, customers and users during an actual disaster.

4.6.7 Financial Management for IT services
Service Operation staff must participate in and support the overall IT budgeting and accounting system – and may be actively involved in any charging system that may be in place.

Proper planning is necessary so that capital expenditure (Capex) and operational expenditure (Opex) budget estimates can be prepared and agreed in good time to meet the budgetary cycles.

5 Common Service Operation activities

Chapter 4 dealt with the processes required for effective Service Operation and Chapter 6 will deal with the organizational aspects. This chapter focuses on a number of operational activities that ensure that technology is aligned with the overall Service and Process objectives. These activities are sometimes described as processes, but in reality they are sets of specialized technical activities all aimed at ensuring that the technology required to deliver and support services is operating effectively and efficiently.

In reality, it is impossible to achieve quality services without aligning and ‘gearing’ every level of technology (and the people who manage it) to the services being provided. Service Management involves people, process and technology.

In other words, the common Service Operation activities are not about managing the technology for the sake of having good technology performance. They
are about achieving performance that will integrate the technology component with the people and process components to achieve service and business objectives. See Figure 5.1 for examples of how technology is managed in maturing organizations.

These activities will usually be technical in nature – although the exact technology will vary depending on the type of services being delivered. This publication will focus on the activities required to manage IT.

Important note on managing technology
It is tempting to divorce the concept of Service Management from the management of the infrastructure that is used to deliver those services.

Figure 5.1 illustrates the steps involved in maturing from a technology-centric organization to an organization that harnesses technology as part of its business strategy. Figure 5.1 further outlines the role of Technology Managers in organizations of differing maturity. The diagram is not comprehensive, but it does provide examples of the way in which technology is managed

(Axis label for the upper levels: Business Centric)

Level 5: Strategic Contribution
• IT is measured in terms of its contribution to the business
• All services are measured by their ability to add value
• Technology is subordinate to the business function it enables
• Service Portfolio drives investment and performance targets
• Technology expertise is so entrenched in everyday operations it is viewed as a utility by the business

Level 4: Service Provision
• Services are quantified and initiatives aimed at delivering appropriate levels
• Service requirements and technology constraints drive procurement
• Service Design specifies performance requirements and operational norms
• Consolidated systems support multiple services
• All technology is mapped to services and is managed to service requirements
• Change Management covers both development and operations

Level 3
• Critical services have been identified together with their technological dependencies
• Systems are integrated to provide required
performance, availability and recovery for those services
• More focus on measuring performance across multiple devices and even platforms
• Virtual mapping of Configuration and Asset data with single Change Management for operations
• Consolidated Availability and Capacity Planning on some services
• Integrated Disaster Recovery Planning
• Systems are consolidated to save cost
(Level 3 label: Technology Integration. Axis label for the lower levels: Technology Centric)

Level 2: Technology Control
• Initiatives are aimed at achieving control and increasing the stability of the infrastructure
• IT has identified most technology components and understands what each is used for
• Technical management focuses on achieving high performance of each component regardless of its function
• Availability of components is measured and reported
• Reactive Problem Management and inventory control are performed
• Change control is performed on ‘mission critical’ components
• Point solutions are used to automate those processes that are in place, usually on a platform-by-platform basis

Level 1: Technology Driven
• IT is driven by technology and most initiatives are aimed at trying to understand the infrastructure and deal with exceptions
• Technology management is performed by technical experts, and only they understand how to manage each device or platform
• Most teams are driven by incidents, and most improvements are aimed at making management easier – not to improve services
• Organizations entrench technology specializations and do not encourage interaction with other groups
• Management tools are aimed at managing single technologies, resulting in duplication
• Incident Management processes start being created

Figure 5.1 Achieving maturity in Technology Management

in each type of organization. The bold headings indicate the major role played by IT in managing technology.

5.1 MONITORING AND CONTROL
The text in the rows describes the characteristics of an IT department at each level.

The purpose of this diagram in this chapter is as follows:
■ This chapter focuses on Technical Management activities, but there is no single way of representing them. A less mature organization will tend to see these activities as ends in themselves, not a means to an end. A more mature organization will tend to subordinate these activities to higher-level Service Management objectives. For example, the Server Management team will move from an insulated department, focused purely on managing servers, to a team that works closely with other Technology Managers to find ways of increasing their value to the business.
■ To make and reinforce the point that there is no ‘right’ way of grouping and organizing the departments that perform these services. Some readers might interpret the headings in this chapter as the names of departments, but this is not the case. The aim of this chapter is to identify the typical technical activities involved in Service Operation.

The measurement and control of services is based on a continual cycle of monitoring, reporting and subsequent action. This cycle is discussed in detail in this section because it is fundamental to the delivery, support and improvement of services.

It is also important to note that, although this cycle takes place during Service Operation, it provides a basis for setting strategy, designing and testing services and achieving meaningful improvement. It is also the basis for SLM measurement. Therefore, although monitoring is performed by Service Operation functions, it should not be seen as a purely operational matter. All phases of the Service Lifecycle should ensure that measures and controls are clearly defined, executed and acted upon.

5.1.1 Definitions
Monitoring refers to the activity of observing a situation to detect changes that happen over time.

In the context of Service Operation, this implies the following:
■ Using tools to monitor the status of key CIs and key operational activities
■ Ensuring that specified conditions are met (or not met) and, if not, to raise an alert to the appropriate group (e.g. the availability of key network devices)
■ Ensuring that the performance or utilization of a component or system is within a specified range (e.g. disk space or memory utilization)
■ To detect abnormal types or levels of activity in the infrastructure (e.g. potential security threats)
■ To detect unauthorized changes (e.g. introduction of software)
■ To ensure compliance with the organization’s policies (e.g. inappropriate use of e-mail)
■ To track outputs to the business and ensure that they meet quality and performance requirements
■ To track any information that is used to measure Key Performance Indicators (KPIs).

Reporting refers to the analysis, production and distribution of the output of the monitoring activity.

Organizational aspects are discussed in Chapter 6.
■ The Service Operation activities described in the rest of this chapter are not typical of any one of the levels of maturity. Rather, the activities are usually all present in some form at all levels. They are just organized and managed differently at each level.

In some cases a dedicated group may handle all of a process or activity while in other cases processes or activities may be shared or split between groups. However, by way of broad guidance, the following sections list the required activities under the functional groups most likely to be involved in their operation. This does not mean that all organizations have to use these divisions. Smaller organizations will tend to assign groups of these activities (if they are needed at all) to single departments, or even individuals.

Finally, the purpose of this chapter is not to provide a detailed analysis of all the activities. They are specialized, and detailed guidance is available from the platform vendors and other, more technical, frameworks; new
categories will be added continually as technology evolves. This chapter simply aims to highlight the importance and nature of technology management for Service Management in the IT context.

Figure 5.2 The Monitor Control Loop (elements shown: Norm, Control, Compare, Monitor; Input, Activity, Output)

and frequency – and will run regardless of other conditions.
■ Closed Loop Systems monitor an environment and respond to changes in that environment. For example, in network load balancing a monitor will evaluate the traffic on a circuit. If network traffic exceeds a certain range, the control system will begin to route traffic across a backup circuit.

In the context of Service Operation, this implies the following:
■ Using tools to collate the output of monitoring information that can be disseminated to various groups, functions or processes
■ Interpreting the meaning of that information
■ Determining where that information would best be used
■ Ensuring that decision makers have access to the information that will enable them to make decisions
■ Routing the reported information to the appropriate person, group or tool.

Control refers to the process of managing the utilization or behaviour of a device, system or service. It is important to note, though, that simply manipulating a device is not the same as controlling it. Control requires three conditions:
■ The action must ensure that behaviour conforms to a defined standard or norm
■ The conditions prompting the action must be defined, understood and confirmed
■ The action must be defined, approved and appropriate for these conditions.

In the context of Service Operation, control implies the following:
■ Using tools to define what conditions represent normal operations or abnormal operations
■ Regulate performance of devices, systems or services
■ Measure availability
■ Initiate corrective action, which could be automated (e.g. reboot a device remotely or run a script), or
The monitor will continue to manual (e.g. notify operations staff of the status). provide feedback to the control system, which will continue to regulate the flow of network traffic 5.1.2 Monitor Control Loops between the two circuits. The most common model for defining control is the To help clarify the difference, solving Capacity Monitor Control Loop. Although it is a simple model, it has Management through over-provisioning is open loop; a many complex applications within IT Service Management. load-balancer that detects congestion/failure and redirects This section will define the basic concepts of the Monitor capacity is closed loop. Control Loop Model and subsequent sections will show how important these concepts are for the Service 126.96.36.199 Complex Monitor Control Loop Management Lifecycle. The Monitor Control Loop in Figure 5.2 is a good basis for Figure 5.2 outlines the basic principles of control. A single defining how Operations Management works, but within activity and its output are measured using a predefined the context of ITSM the situation is far more complex. norm, or standard, to determine whether it is within an Figure 5.3 illustrates a process consisting of three major acceptable range of performance or quality. If not, action activities. Each one has an input and an output, and the is taken to rectify the situation or to restore normal output becomes an input for the next activity. performance. In this diagram, each activity is controlled by its own Typically there are two types of Monitor Control Loops: Monitor Control Loop, using a set of norms for that specific activity. The process as a whole also has its ■ Open Loop Systems are designed to perform a own Monitor Control Loop, which spans all the activities specific activity regardless of environmental conditions. and ensures that all norms are appropriate and are For example, a backup can be initiated at a given time being followed. 
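The open-loop and closed-loop behaviour described above, together with the process-level loop that checks whether the norms themselves are still appropriate, can be sketched in a few lines of Python. This is an illustrative sketch only: the names Norm, open_loop, closed_loop, reroute and norm_needs_review, the 80% utilisation ceiling and the 25% exception rate are assumptions made for the example and are not taken from this publication or from any monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Norm:
    """Acceptable range for a monitored value (hypothetical structure)."""
    low: float
    high: float

    def is_met(self, value: float) -> bool:
        return self.low <= value <= self.high


def open_loop(schedule: Iterable[str]) -> List[str]:
    """Open loop: run the activity at every scheduled slot, ignoring conditions."""
    return [f"backup started at {slot}" for slot in schedule]


def closed_loop(samples: Iterable[float], norm: Norm,
                control: Callable[[float], str]) -> List[Tuple[float, str]]:
    """Closed loop: monitor each sample, compare it with the norm, control on deviation."""
    log = []
    for value in samples:                         # Monitor
        if norm.is_met(value):                    # Compare
            log.append((value, "within norm"))
        else:
            log.append((value, control(value)))   # Control
    return log


def reroute(utilisation: float) -> str:
    """Assumed control action for the load-balancing example."""
    return "route excess traffic to backup circuit"


def norm_needs_review(log: List[Tuple[float, str]],
                      max_exception_rate: float) -> bool:
    """Process-level loop: flag the norm for review if exceptions are too frequent."""
    if not log:
        return False
    exceptions = sum(1 for _, outcome in log if outcome != "within norm")
    return exceptions / len(log) > max_exception_rate


if __name__ == "__main__":
    circuit_norm = Norm(low=0.0, high=80.0)       # percent utilisation, assumed
    log = closed_loop([42.0, 91.5, 60.0, 88.0], circuit_norm, reroute)
    for entry in log:
        print(entry)
    print(open_loop(["01:00", "13:00"]))
    print(norm_needs_review(log, max_exception_rate=0.25))
```

In this sketch the activity-level loop reacts to individual readings, while the review function stands in for the higher-level loop that looks at the collected outcomes and asks whether the norm is still appropriate.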
Figure 5.3 Complex Monitor Control Loop (diagram: three Activities in sequence, each with Input, Output and its own Monitor, Compare, Norm and Control; one further Monitor, Compare, Norm and Control loop spans all three)

In Figure 5.3 there is a double feedback loop. One loop focuses purely on executing a defined standard, and the second evaluates the performance of the process and also the standards whereby the process is executed. An example of this would be if the first set of feedback loops at the bottom of the diagram represented individual stations on an assembly line and the higher-level loop represented Quality Assurance.

The Complex Monitor Control Loop is a good organizational learning tool (as defined by Chris Argyris, 1976, Increasing Leadership Effectiveness. New York: Wiley). The first level of feedback at individual activity level is concerned with monitoring and responding to data (single facts, codes or pieces of information). The second level is concerned with monitoring and responding to information (a collection of a number of facts about which a conclusion may be drawn). Refer to the Service Transition publication for a full discussion on Data, Information, Knowledge and Wisdom.

All of this is interesting theory, but does not explain how the Monitor Control Loop concept can be used to operate IT services. And especially – who defines the norm? Based on what has been described so far, Monitor Control Loops can be used to manage:
■ The performance of activities in a process or procedure. Each activity and its related output can potentially be measured to ensure that problems with the process are identified before the process as a whole is completed. For example, in Incident Management, the Service Desk monitors whether a technical team has accepted an incident in a specified time. If not, the incident is escalated. This is done well before the target resolution time for that incident because the aim of escalating that one activity is to ensure that the process as a whole is completed in time.
■ The effectiveness of a process or procedure as a whole. In this case the ‘activity’ box represents the entire process as a single entity. For example, Change Management will measure the success of the process by checking whether a change was implemented on time, to specification and within budget.
■ The performance of a device. For example, the ‘activity’ box could represent the response time of a server under a given workload.
■ The performance of a series of devices. For example, the end user response time of an application across the network.

To define how to use the concept of Monitor Control Loops in Service Management, the following questions need to be answered:
■ How do we define what needs to be monitored?
■ What are the appropriate thresholds for each of these?
■ How will monitoring be performed (manual or automated)?
■ What represents normal operation?
■ What are the dependencies for normal operation?
■ What happens before we get the input?
■ How frequently should the measurement take place?
■ Do we need to perform active measurement to check whether the item is within the norm or do we wait until an exception is reported (passive measurement)?
■ Is Operations Management the only function that performs monitoring?
■ If not, how are the other instances of monitoring related to Operations Management?
■ If there are multiple loops, which processes are responsible for each loop?

The following sections will expand on the concept of Monitor Control Loops and demonstrate how these questions are answered.

5.1.2.2 The ITSM Monitor Control Loop

In ITSM, the complex Monitor Control Loop can be represented as shown in Figure 5.4.

Figure 5.4 can be used to illustrate the control of a process or of the components used to deliver a service. In this diagram the word ‘activity’ implies that it refers to a process. To apply it to a service, an ‘activity’ could also be a ‘CI’. There are a number of significant features in Figure 5.4, as given below.

Figure 5.4 ITSM Monitor Control Loop (diagram labels: Business Executives and Business Unit Managers; Service Strategy; Continual Service Improvement; Service Design; Service Transition; IT Management and Vendor Account Management; Portfolios, Standards and Policies; Technical Architectures and Performance Standards; Users; Internal and External Technical Staff and Experts; arrows numbered 1, 2 and 3; three Activities, each with Input, Output, Monitor, Compare, Norm and Control)

■ Each activity in a Service Management process (or each component used to provide a service) is monitored as part of the Service Operation processes. The operational team or department responsible for each activity or component will apply the Monitor Control Loop as defined in the process, and using the norms that were defined during the Service Design processes. The role of Operational Monitoring and Control is to ensure that the process or service functions exactly as specified, which is why they are primarily concerned with maintaining the status quo.
■ The norms and Monitoring and Control mechanisms are defined in Service Design, but they are based on the standards and architectures defined during Service Strategy. Any changes to the organization’s Service Strategy, architecture, service portfolios or Service Level Requirements will precipitate changes to what is monitored and how it is controlled.
■ The Monitor Control Loops are placed within the context of the organization. This implies that Service Strategy will primarily be executed by Business and IT Executives with support from vendor account managers. Service Design acts as the bridge between Service Strategy and Service Operation and will typically involve representatives from all groups. The activities and controls will generally be executed by IT staff (sometimes involving users) and supported by IT Managers and the vendors. Service Improvement spans all areas, but primarily represents the interests of the business and its users.
■ Notice that the second level of monitoring in this complex Monitor Control Loop is performed by the CSI processes through Service Strategy and Service Design. These relationships are represented by the numbered arrows in Figure 5.4 as follows:
● Arrow 1. In this case CSI has recognized that the service will be improved by making a change to the Service Strategy. This could be the result of the business needing a change to the Service Portfolio, or that the architecture does not deliver what was expected.
● Arrow 2. In this case the Service Level Requirements need to be adjusted. It could be that the service is too expensive; or that the configuration of the infrastructure needs to be changed to enhance performance; or because Operations Management is unable to maintain service quality in the current architecture.
● Arrow 3. In this case the norms specified in Service Design are not being adhered to. This could be because they are not appropriate or executable, or because of a lack of education or a lack of communication. The norms and the lack of compliance need to be investigated and action taken to rectify the situation.

Service Transition provides a major set of checks and balances in these processes. It does so as follows:
■ For new services, Service Transition will ensure that the technical architectures are appropriate; and that the Operational Performance Standards can be executed. This in turn will ensure that the Service Operation teams or departments are able to meet the Service Level Requirements.
■ For existing services, Change Management will manage any of the changes that are required as part of a control (e.g. tuning) as well as any changes represented by the arrows labelled 1, 2 and 3.

Although Service Transition does not define strategy and design services per se, it provides coordination and assurance that the services are working, and will continue to work, as planned.

Why is this loop covered under Service Operation?

Figure 5.4 represents Monitoring and Control for the whole of IT Service Management. Some readers of the Service Operation publication may feel that it should be more suitably covered in the Service Strategy publication.

However, Monitoring and Control can only effectively be deployed when the service is operational. This means that the quality of the entire set of IT Service Management processes depends on how they are monitored and controlled in Service Operation.

The implications of this are as follows:
■ Service Operation staff are not the only people with an interest in what is monitored and how they are controlled.
■ While Service Operation is responsible for monitoring and control of services and components, they are acting as stewards of a very important part of the set of ITSM Monitoring and Control loops.
■ If Service Operation staff define and execute Monitoring and Control procedures in isolation, none of the Service Management processes or functions will be fully effective. This is because the Service Operation functions will not support the priorities and information requirements of the other processes, e.g. attempting to negotiate an SLA when the only data available is page-swap rates on a server and detailed bandwidth utilization of a network.

5.1.2.3 Defining what needs to be monitored

The definition of what needs to be monitored is based on understanding the desired outcome of a process, device or system. IT should focus on the service and its impact on the business, rather than just the individual components of technology. The first question that needs to be asked is ‘What are we trying to achieve?’.

5.1.2.4 Internal and External Monitoring and Control

At the outset, it will become clear that there are two levels of monitoring:
■ Internal Monitoring and Control: Most teams or departments are concerned about being able to execute effectively and efficiently the tasks that have been assigned to them. Therefore, they will monitor the items and activities that are directly under their control. This type of monitoring and control focuses on activities that are self-contained within that team or department. For example, the Service Desk Manager will monitor the volume of calls to determine how many staff need to be available to answer the telephone.
■ External Monitoring and Control: Although each team or department is responsible for managing its own area, they do not act independently. Every task that they perform, or device that they manage, has an impact on the success of the organization as a whole. Each team or department will also be controlling items and activities on behalf of other groups, processes or functions. For example, the Server Management team will monitor the CPU performance on key servers and perform workload balancing so that a critical application is able to stay within performance thresholds set by Application Management.

The distinction between Internal and External Monitoring is an important one. If Service Operation focuses only on Internal Monitoring, it will have very well-managed infrastructure, but no way of understanding or influencing the quality of services. If it focuses only on External Monitoring, it will understand how poor the service quality is, but will have no idea what is causing it or how to change it.

In reality, most organizations have a combination of Internal and External Monitoring, but in many cases these are not linked. For example, the Server Management team knows exactly how well the servers are performing and the Service Level Manager knows exactly how the users perceive the quality of service provided by the servers. However, neither of them knows how to link these metrics to define what level of server performance represents good quality service. This becomes even more confusing when server performance that is acceptable in the middle of the month is not acceptable at month-end.

5.1.2.5 Defining objectives for Monitoring and Control

Many organizations start by asking the question ‘What are we managing?’. This will invariably lead to a strong Internal Monitoring System, with very little linkage to the real outcome or service that is required by the business.

The more appropriate question is ‘What is the end result of the activities and equipment that my team manages?’. Therefore the best place to start, when defining what to monitor, is to determine the required outcome.

The definition of Monitoring and Control objectives should ideally start with the definition of the Service Level Requirements documents (see Service Design publication). These will specify how the customers and users will measure the performance of the service, and are used as input into the Service Design processes. During Service Design, various processes will determine how the service will be delivered and managed. For example, Capacity Management will determine the most appropriate and cost-effective way to deliver the levels of performance required. Availability Management will determine how the infrastructure can be configured to provide the fewest points of failure.

If there is any doubt about the validity or completeness of objectives, the COBIT framework provides a comprehensive, high-level set of objectives as a checklist. More information on COBIT is provided in Appendix A of this publication.

The Service Design Process will help to identify the following sets of inputs for defining Operational Monitoring and Control norms and mechanisms:
■ They will work with customers and users to determine how the output of the service will be measured. This will include measurement mechanisms, frequency and sampling. This part of Service Design will focus specifically on the Functional Requirements.
■ They will identify key CIs, how they should be configured and what level of performance and availability is required in order to meet the agreed Service Levels.
■ They will work with the developers and vendors of the CIs that make up each service to identify any constraints or limitations in those components.
■ All support and delivery teams and departments will need to identify what information will help them to execute their role effectively. Part of the Service Design and development will be to instrument each service so that it can be monitored to provide this information, or so that it can generate meaningful events.

All of this means that a very important part of defining what Service Operation monitors and how it exercises control is to identify the stakeholders of each service. Stakeholders can be defined as anyone with an interest in the successful delivery and receipt of IT services. Each stakeholder will have a different perspective of what it will take to deliver or receive an IT service. Service Operation will need to understand each of these perspectives in order to determine exactly what needs to be monitored and what to do with the output.

Service Operation will therefore rely on SLM to define exactly who these stakeholders are and how they contribute to or use the service. This is discussed more fully in the Service Design and Continual Service Improvement publications.

Note on Internal and External Monitoring Objectives

The required outcome could be internal or external to the Service Operation functions, although it should always be remembered that an internal action will often have an external result. For example, consolidating servers to make them easier to manage may result in a cost saving, which will affect the SLM negotiation and review cycle as well as the Financial Management processes.

5.1.2.6 Types of monitoring

There are many different types of monitoring tool and different situations in which each will be used. This section focuses on some of the different types of monitoring that can be performed and when they would be appropriate.

Active versus Passive Monitoring
■ Active Monitoring refers to the ongoing ‘interrogation’ of a device or system to determine its status. This type of monitoring can be resource intensive and is usually reserved to proactively monitor the availability of critical devices or systems; or as a diagnostic step when attempting to resolve an Incident or diagnose a problem.
■ Passive Monitoring is more common and refers to generating and transmitting events to a ‘listening device’ or monitoring agent. Passive Monitoring depends on successful definition of events and instrumentation of the system being monitored (see section 4.1).

Reactive versus Proactive
■ Reactive Monitoring is designed to request or trigger action following a certain type of event or failure. For example, server performance degradation may trigger a reboot, or a system failure will generate an incident. Reactive monitoring is not only used for exceptions. It can also be used as part of normal operations procedures, for example a batch job completes successfully, which prompts the scheduling system to submit the next batch job.
■ Proactive Monitoring is used to detect patterns of events which indicate that a system or service may be about to fail. Proactive monitoring is generally used in more mature environments where these patterns have been detected previously, often several times. Proactive Monitoring tools are therefore a means of automating the experience of seasoned IT staff and are often created through the Proactive Problem Management process (see Continual Service Improvement publication).

Please note that Reactive and Proactive Monitoring could be active or passive, as per Table 5.1.

Table 5.1 Active and Passive Reactive and Proactive Monitoring

Reactive, Active: Used to diagnose which device is causing the failure and under what conditions (e.g. ‘ping’ a device, or run and track a sample transaction through a series of devices). Requires knowledge of the infrastructure topography and the mapping of services to CIs.

Reactive, Passive: Detects and correlates event records to determine the meaning of the events and the appropriate action (e.g. a user logs in three times with the incorrect password, which represents a security exception and is escalated through Information Security Management procedures). Requires detailed knowledge of the normal operation of the infrastructure and services.

Proactive, Active: Used to determine the real-time status of a device, system or service – usually for critical components or following the recovery of a failed device to ensure that it is fully recovered (i.e. is not going to cause further incidents).

Proactive, Passive: Event records are correlated over time to build trends for Proactive Problem Management. Patterns of events are defined and programmed into correlation tools for future recognition.

Continuous Measurement versus Exception-Based Measurement
■ Continuous Measurement is focused on monitoring a system in real time to ensure that it complies with a performance norm (for example, an application server is available for 99.9% of the agreed service hours). The difference between Continuous Measurement and Active Monitoring is that Active Monitoring does not have to be continuous. However, as with Active Monitoring, this is resource intensive and is usually reserved for critical components or services. In most cases the cost of the additional bandwidth and processor power outweighs the benefit of continuous measurement. In these cases monitoring will usually be based on sampling and statistical analysis (e.g. the system performance is reported every 30 seconds and extrapolated to represent overall performance). In these cases, the method of measurement will have to be documented and agreed in the OLAs to ensure that it is adequate to support the Service Reporting Requirements (see Continual Service Improvement publication).
■ Exception-Based Measurement does not measure the real-time performance of a service or system, but detects and reports against exceptions. For example, an event is generated if a transaction does not complete, or if a performance threshold is reached. This is more cost-effective and easier to measure, but could result in longer service outages. Exception-Based Measurement is used for less critical systems or on systems where cost is a major issue. It is also used where IT tools are not able to determine the status or quality of a service (e.g. if printing quality is part of the service specification, the only way to measure this is physical inspection – often performed by the user rather than IT staff). Where Exception-Based Measurement is used, it is important that both the OLA and the SLA for that service reflect this, as service outages are more likely to occur, and users are often required to report the exception.

Performance versus output

There is an important distinction between the reporting used to track the performance of the components, teams or departments used to deliver a service and the reporting used to demonstrate the achievement of service quality objectives.

IT managers often confuse these by reporting to the business on the performance of their teams or departments (e.g. number of calls taken per Service Desk Analyst), as if that were the same thing as quality of service (e.g. incidents solved within the agreed time).

Performance Monitoring and metrics should be used internally by Service Management to determine whether people, process and technology are functioning correctly and to standard.

Users and customers would rather see reporting related to the quality and performance of the service.

Although Service Operation is concerned with both types of reporting, the primary concern of this publication is Performance Monitoring, whereas monitoring of Service Quality (or Output-Based Monitoring) will be discussed in detail in the Continual Service Improvement publication.

5.1.2.7 Monitoring in Test Environments

As with any IT Infrastructure, a Test Environment will need to define how it will use monitoring and control. These controls are more fully discussed in the Service Transition publication.
■ Monitoring the Test Environment itself: A Test Environment consists of infrastructure, applications and processes that have to be managed and controlled just as any other environment. It is tempting to think that the Test Environment does not need rigorous monitoring and control because it is not a live environment. However, this argument is not valid. If a Test Environment is not properly monitored and controlled, there is a danger of running the tests on equipment that deviates from the standards defined in Service Design.
■ Monitoring items being tested: The results of testing have to be accurately tracked and checked. It is also important that any monitoring tools that have been built into new or changed services are tested as well.

5.1.2.8 Reporting and action

‘A report alone creates awareness; a report with an action plan achieves results.’

Reporting and dysfunction

Practical experience has shown that there is more reporting in dysfunctional organizations than in effective organizations. This is because reports are not being used to initiate pre-defined action plans, but rather:
■ to shift the blame for an incident
■ to try to find out who is responsible for making a decision
■ as input to creating action plans for future occurrences.

In dysfunctional organizations a lot of reports are produced which no one has the time to look at or query.

Monitoring without control is irrelevant and ineffective. Monitoring should always be aimed at ensuring that service and operational objectives are being met. This means that unless there is a clear purpose for monitoring a system or service, it should not be monitored.

This also means that when monitoring is defined, so too should any required actions. For example, being able to detect that a major application has failed is not sufficient. The relevant Application Management team should also have defined the exact steps that it will take when the application fails.

In addition, it should also be recognized that action may need to be taken by different people, for example a single event (such as an application failure) may trigger action by the Application Management team (to restore service), the users (to initiate manual processing) and management (to determine how this event can be prevented in future).

The implications of this principle are outlined in more detail in relation to Event Management (see section 4.1).

5.1.2.9 Service Operation audits

Regular audits must be performed on the Service Operation processes and activities to ensure:
■ They are being performed as intended
■ There is no circumvention
■ They are still fit for purpose, or to identify any required changes or improvements.

Service Operation Managers may choose to perform such audits themselves, but ideally some form of independent element to the audits is preferable.

The organization’s internal IT audit team or department may be asked to be involved or some organizations may choose to engage third-party consultancy/audit/assessment companies so that an entirely independent expert view is obtained.

Service Operation audits are part of the ongoing measurement that takes place as part of Continual Service Improvement and are discussed in more detail in that publication.

5.1.2.10 Measurement, metrics and KPIs

This section has focused primarily on monitoring and control as a basis for Service Operation. Other sections of the publication have covered some basic metrics that could be used to measure the effectiveness and efficiency of a process.

Although this publication is not primarily about measurement and metrics, it is important that organizations using these guidelines have robust measurement techniques and metrics that support the objectives of their organization. This section is a summary of these concepts.

Measurement

Measurement refers to any technique that is used to evaluate the extent, dimension or capacity of an item in relation to a standard or unit.
■ Extent refers to the degree of compliance or completion (e.g. are all changes formally authorized by the appropriate authority)
■ Dimension refers to the size of an item, e.g. the number of incidents resolved by the Service Desk
■ Capacity refers to the total capability of an item, for example maximum number of standard transactions that can be processed by a server per minute.

Measurement only becomes meaningful when it is possible to measure the actual output or dimensions of a system, function or process against a standard or desired level, e.g. the server must be capable of processing a minimum of 100 standard transactions per minute. This needs to be defined in Service Design, and refined over time through Continual Service Improvement, but the measurement itself takes place during Service Operation.

Metrics

Metrics refer to the quantitative, periodic assessment of a process, system or function, together with the procedures and tools that will be used to make these assessments and the procedures for interpreting them.

This definition is important because it not only specifies what needs to be measured, but also how to measure it, what the acceptable range of performance will be and what action will need to be taken as a result of normal performance or an exception. From this, it is clear that any metric given in the previous section of this publication is a very basic one and will need to be applied and expanded within the context of each organization before it can be effective.

Key Performance Indicators

A KPI refers to a specific, agreed level of performance that will be used to measure the effectiveness of an organization or process.

KPIs are unique to each organization and have to be related to specific inputs, outputs and activities. They are not generic or universal and thus have not been included in this publication.

A further reason for not including them is the fact that similar metrics can be used to achieve very different KPIs. For example, one organization used the metric ‘Percentage of Incidents resolved by the Service Desk’ to evaluate the performance of the Service Desk. This worked effectively for about two years, after which the IT manager began to realize that this KPI was being used to prevent effective Problem Management, i.e. if, after two years, 80% of all incidents are easy enough to be resolved in 10 minutes on the first call, why have we not come up with a solution for them? In effect, the KPI now became a measure for how ineffective the Problem Management teams were.

5.1.2.11 Interfaces to other Service Lifecycle practices

Operational Monitoring and Continual Service Improvement

This section has focused on Operational Monitoring and Reporting, but monitoring also forms the starting point for Continual Service Improvement. This is covered in the Continual Service Improvement publication, but key differences are outlined here.

Quality is the key objective of monitoring for Continual Service Improvement (CSI). Monitoring will therefore focus on the effectiveness of a service, process, tool, organization or CI. The emphasis is not on assuring real-time service performance; rather it is on identifying where improvements can be made to the existing level of service, or IT performance.

Monitoring for CSI will therefore tend to focus on detecting exceptions and resolutions. For example, CSI is not as interested in whether an incident was resolved, but whether it was resolved within the agreed time and whether future incidents can be prevented.

CSI is not only interested in exceptions, though. If an SLA is consistently met over time, CSI will also be interested in determining whether that level of performance can be sustained at a lower cost or whether it needs to be upgraded to an even better level of performance. CSI may therefore also need access to regular performance reports.

However, since CSI is unlikely to need, or be able to cope with, the vast quantities of data that are produced by all monitoring activity, they will most likely focus on a specific subset of monitoring at any given time. This could be determined by input from the business or improvements to technology.

This has two main implications:
■ Monitoring for CSI will change over time. They may be interested in monitoring the e-mail service one quarter and then move on to look at HR systems in the next quarter.
■ This means that Service Operation and CSI need to build a process which will help them to agree on what areas need to be monitored and for what purpose.

5.2 IT OPERATIONS

5.2.1 Console Management/Operations Bridge

These provide a central coordination point for managing various classes of events, detecting incidents, managing routine operational activities and reporting on the status or performance of technology components.

Observation and monitoring of the IT Infrastructure can occur from a centralized console – to which all system events are routed. Historically, this involved the monitoring of the master operations console of one or more mainframes – but these days is more likely to involve monitoring of a server farm(s), storage devices, network components, applications, databases, or any other CIs, including any remaining mainframe(s), from a single location, known as the Operations Bridge.

There are two theories about how the Operations Bridge was so named. One is that it resembles the bridge of a large, automated ship (such as spaceships commonly seen in science fiction movies). The other theory is that the Operations Bridge represents a link between the IT Operations teams and the traditional Help Desk. In some organizations this means that the functions of Operational Control and the Help Desk were merged into the Service Desk, which performed both sets of duties in a single physical location.

Regardless of how it was named, an Operations Bridge will pull together all of the critical observation points within the IT Infrastructure so that they can be monitored and managed from a centralised location with minimal effort. The devices being monitored are likely to be physically dispersed and may be located in centralized computer installations or dispersed within the user community, or both.

The Operations Bridge will combine many activities, which might include Console Management, event handling, first-line network management, Job Scheduling and out-of-hours support (covering for the Service Desk and/or second-line support groups if they do not work 24/7). In some organizations, the Service Desk is part of the Operations Bridge.

The physical location and layout of the Operations Bridge needs to be carefully designed to give the correct accessibility and visibility of all relevant screens and devices to authorised personnel. However, this will become a very sensitive area where controlled access and tight security will be essential.

Smaller organizations may not have a physical Operations Bridge, but there will certainly still be the need for Console Management, usually combined with other technical roles. For example, a single team of technical staff will manage the network, servers and applications. Part of their role will be to monitor the consoles for those systems – often using virtual consoles so that they can perform the activity from any location. However, it should be noted that these virtual consoles are powerful tools and, if used in insecure locations or over unsecured connections, could represent a significant security threat.

5.2.2 Job Scheduling

IT Operations will perform standard routines, queries or reports delegated to it as part of delivering services; or as part of routine housekeeping delegated by Technical and Application Management teams.

Job Scheduling involves defining and initiating job-scheduling software packages to run batch and real-time work. This will normally involve daily, weekly, monthly, annual and ad hoc schedules to meet business needs.

In addition to the initial design, or periodic redesign, of the schedules, there are likely to be frequent amendments or adjustments to make during which job dependencies have to be identified and accommodated. There will also be a role to play in defining alerts and Exception Reports to be used for monitoring/checking job schedules. Change Management plays an important role in assessing and validating major changes to schedules, as well as creating Standard Change procedures for more routine changes.

Run-time parameters and/or files have to be received (or expedited if delayed) and input – and all run-time logs have to be checked and any failures identified.

If failures do occur, then re-runs will have to be initiated, under the guidance of the appropriate business units, often with different parameters or amended data/file versions. This will require careful communications to ensure correct parameters and files are used.

Many organizations are faced with increasing overnight batch schedules which can, if they overrun the overnight batch slot, adversely impact upon the online day services – so are seeking ways of utilizing maximum overnight
each service and Service Transition should ensure that these are properly tested.
capacity and performance, in conjunction with Capacity In addition, regulatory requirements specify that certain Management. This is where Workload Management types of organization (such as Financial Services or listed techniques can be useful, such as: companies) must have a formal Backup and Restore ■ Re-scheduling of work to avoid contention on specific strategy in place and that this strategy is executed and devices or at specific times and improve overall audited. The exact requirements will vary from country to throughput country and by industry sector. This should be determined ■ Migration of workloads to alternative during Service Design and built into the service platforms/environments to gain improved performance functionality and documentation. and/or throughput (virtualization capabilities make this The only point of taking backups is that they may need to far more achievable by allowing dynamic, automated be restored at some point. For this reason it is not as migration) important to define how to back a system up as it is to ■ Careful timing and ‘interleaving’ of jobs to gain define what components are at risk and how to effectively maximum utilization of available resources. mitigate that risk. Anecdote There are any number of tools available for Backup and Restore, but it is worth noting that features of storage One large organization, which was faced with batch technologies used for business data are being used for overrun/utilization problems, identified that, due to backup/restore (e.g. snapshots). There is therefore an human nature where people were seeking to be increasing degree of integration between Backup and ‘tidy’, all jobs were being started on the hour or at Restore activities and those of Storage and Archiving (see 15-minute intervals during the hour (i.e. n o’clock, 15 minutes past, half past, 15 minutes to, etc.). section 5.6). 
By re-scheduling of work so that it started as soon as 22.214.171.124 Backup other work finished, and staggering the start times of other work, it was able to gain significant reductions The organization’s data has to be protected and this will in contention and achieve much quicker overall include backup (copying) and storage of data in remote processing, which resolved its problems without a locations where it can be protected – and used should it need for upgrades. need to be restored due to loss, corruption or implementation of IT Service Continuity Plans. Job Scheduling has become a highly sophisticated activity, An overall backup strategy must be agreed with the including any number of variables – such as time- business, covering: sensitivity, critical and non-critical dependencies, workload ■ What data has to be backed up and the frequency balancing, failure and resubmission, etc. As a result, most and intervals to be used. operations rely on Job Scheduling tools that allow IT ■ How many generations of data have to be retained – Operations to schedule jobs for the optimal use of this may vary by the type of data being backed up, or technology to achieve Service Level Objectives. what type of file (e.g. data file or application The latest generation of scheduling tools allows for a executable). single toolset to schedule and automate technical ■ The type of backup (full, partial, incremental) and activities and Service Management process activities (such checkpoints to be used. as Change Scheduling). While this is a good opportunity ■ The locations to be used for storage (likely to include for improving efficiency, it also represents a greater single disaster recovery sites) and rotation schedules. point of failure. Organizations using this type of tool ■ Transportation methods (e.g. file transfer via the therefore still use point solutions as agents and also as a network, physical transportation on magnetic media). backup in case the main toolset fails. 
■ Testing/checks to be performed, such as test-reads, test restores, check-sums etc. 5.2.3 Backup and Restore ■ Recovery Point Objective. This describes the point to Backup and Restore is essentially a component of good IT which data will be restored after recovery of an IT Service Continuity Planning. As such, Service Design Service. This may involve loss of data. For example, a should ensure that there are solid backup strategies for Recovery Point Objective of one day may be 94 | Common Service Operation activities supported by daily backups, and up to 24 hours of while any user or customer requirements or activity should data may be lost. Recovery Point Objectives for each IT be specified in the appropriate SLA. service should be negotiated, agreed and documented in OLAs, SLAs and UCs. 126.96.36.199 Restore ■ Recovery Time Objective. This describes the A restore can be initiated from a number of sources, maximum time allowed for recovery of an IT service ranging from an event that indicates data corruption, following an interruption. The Service Level to be through to a Service Request from a user or customer provided may be less than normal Service Level logged at the Service Desk. A restore may be needed in Targets. Recovery Time Objectives for each IT service the case of: should be negotiated, agreed and documented in ■ Corrupt data OLAs, SLAs and UCs. ■ Lost data ■ How to verify that the backups will work if they need to be restored. Even if there are no error codes ■ Disaster recovery/IT Service Continuity situation generated, there may be several reasons why the ■ Historical data required for forensic investigation. backup cannot be restored. A good backup strategy The steps to be taken will include: and operations procedures will minimize the risk of ■ Location of the appropriate data/media this happening. 
Backup procedures should include a verification step to ensure that the backups are ■ Transportation or transfer back to the physical recovery complete and that they will work if a restore is location needed. Where any backup failures are detected, ■ Agreement on the checkpoint recovery point and the recovery actions must be initiated. specific location for the recovered data (disk, directory, folder etc) There is also a need to procure and manage the necessary ■ Actual restoration of the file/data (copy-back and any media (disks, tapes, CDs, etc.) to be used for backups, so roll-back/roll-forward needed to arrive at the agreed that there is no shortage of supply. checkpoint Where automated devices are being used, pre-loading of ■ Checking to ensure successful completion of the the required media will be needed in advance. When restore – with further recovery action if needed until loading and clearing media returned from off-site storage success has been achieved. it is important that there is a procedure for verifying that ■ User/customer sign-off. these are the right ones. This will prevent the most recent backup being overwritten with faulty data, and then 5.2.4 Print and Output having no valid data to restore. After successful backups Many services consist of generating and delivering have been taken, the media must be removed for storage. information in printed or electronic form. Ensuring the The actual initiation of the backups might be automated, right information gets to the right people, with full or carried out from the Operations Bridge. integrity, requires formal control and management. Some organizations may utilize Operations staff to perform Print (physical) and Output (electronic) facilities and the physical transportation and racking of backup copies services need to be formally managed because: to/from remote locations, where in other cases this may ■ They often represent the tangible output of a service. 
be handed over to other groups such as internal security The ability to measure that this output has reached staff or external contractors. the appropriate destination is therefore very important If backups are being automated or performed remotely, (e.g. checking whether files with financial transaction then Event Monitoring capabilities should be considered data have actually reached a bank through an FTP so that any failures can be detected early and rectified service) before they cause problems. In such cases IT Operations ■ Physical and electronic output often contains sensitive has a role to play in defining alerts and escalation paths. or confidential information. It is vital that the In all cases, IT Operations staff must be trained in backup appropriate levels of security are applied to both the (and restore) procedures – which must be well generation and the delivery of this output. documented in the organization’s IT Operations Many organizations will have centralised bulk printing Procedures Manual. Any specific requirements or targets requirements which IT Operations must handle. should be referenced in OLAs or UCs where appropriate, Common Service Operation activities | 95 In addition to the physical loading and re-loading of paper ■ Interfacing to hardware (H/W) support; arranging and the operation and care of the printers, other activities maintenance, agreeing slots, identifying H/W failure, may be needed, such as: liaison with H/W engineering. ■ Agreement and setting of pre-notification of large ■ Provision of information and assistance to Capacity print runs and alerts to prevent excessive printing by Management to help achieve optimum throughput, rogue print jobs utilization and performance from the mainframe. ■ Physical control of high-value stationery such as company cheques or certificates, etc. 
5.4 SERVER MANAGEMENT AND SUPPORT ■ Management of the physical and electronic storage Servers are used in most organizations to provide flexible required to generate the output. In many cases IT will and accessible services from hosting applications or be expected to provide archives for the printed and databases, running client/server services, Storage, Print and electronic materials File Management. Successful management of servers is ■ Control of all printed material so as to adhere to data therefore essential for successful Service Operation. protection legislation and regulation e.g. HIPAA (Health Insurance Portability and Accountability Act) in the The procedures and activities which must be undertaken USA, or FSA (Financial Services Authority) in the UK. by the Server Team(s) or department(s) – separate teams may be needed where different server-types are used Where print and output services are delivered directly to (UNIX, Wintel etc) – include: the users, it is important that the responsibility for maintaining the printers or storage devices is clearly ■ Operating system support: Support and defined. For example, most users assume that cleaning maintenance of the appropriate operating system(s) and maintenance of printers must be performed by IT. If and related utility software (e.g. failover software) this is not the case, this must be clearly stated in the SLA. including patch management and involvement in defining backup and restore policies. ■ Licence management for all server CIs, especially 5.3 MAINFRAME MANAGEMENT operating systems, utilities and any application Mainframes are still widely in use and have well software not managed by the Application established and mature practices. Mainframes form the Management teams. 
central component of many services and its performance ■ Third-level support: Third-level support for all server will therefore set a baseline for service performance and and/or server operating system-related incidents, user or customer expectations, although they may never including diagnosis and restoration activities. This will know that they are using the mainframe. also include liaison with third-party hardware support contractors and/or manufacturers as needed to The ways in which mainframe management teams are escalate hardware-related incidents. organized are quite diverse. In some organizations ■ Procurement advice: Advice and guidance to the Mainframe Management is a single, highly specialized team that manages all aspects from daily operations business on the selection, sizing, procurement and through to system engineering. In other organizations, the usage of servers and related utility software to meet activities are performed by several teams or departments, business needs. with engineering and third-level support being provided ■ System security: Control and maintenance of the by one team and daily operations being combined with access controls and permissions within the relevant the rest of IT Operations (and very probably managed server environment(s) as well as appropriate system through the Operations Bridge). and physical security measures. These include identification and application of security patches, Typically, the following activities are likely to be Access Management (see section 4.5) and intrusion undertaken: detection. ■ Mainframe operating system maintenance and support ■ Definition and management of virtual servers. This ■ Third-level support for any mainframe-related implies that any server that has been designed and incidents/problems built around a common standard can be used to ■ Writing job scripts process workloads from a range of applications or ■ System programming users. 
Server Management will be required to set these standards and then ensure that workloads are 96 | Common Service Operation activities appropriately balanced and distributed. They are also upgrades to the physical network infrastructure. This is responsible for being able to track which workload is done through Service Design and Service Transition. being processed by which server so that they are able ■ Third-level support for all network related activities, to deal with incidents effectively. including investigation of network issues (e.g. pinging ■ Capacity and Performance: Provide information and or trace route and/or use of network management assistance to Capacity Management to help achieve software tools – although it should be noted that optimum throughput, utilization and performance pinging a server does not necessarily mean that the from the available servers. This is discussed in more service is available!) and liaison with third-parties as detail in Service Design, but includes providing necessary. This also includes the installation and use of guidance on, and installation and operation of, ‘sniffer’ tools, which analyse network traffic, to assist in virtualization software so as to achieve value for incident and problem resolution. money by obtaining the highest levels of performance ■ Maintenance and support of network operating system and utilization from the minimal number of servers. and middleware software including patch ■ Other routine activities include: management, upgrades, etc. ● Defining standard builds for servers as part of the ■ Monitoring of network traffic to identify failures or to provisioning process. This is covered in more detail spot potential performance or bottleneck issues. 
in Service Design and Service Transition ■ Reconfiguring or rerouting of traffic to achieve ● Building and installing new servers as part of improved throughput or batter balance – definition of ongoing maintenance or for the provision of rules for dynamic balancing/routing. new services. This is discussed in more detail in ■ Network security (in liaison with the organization’s Service Transition Information Security Management) including firewall ● Setting up and managing clusters, which are aimed management, access rights, password protection etc. at building redundancy, improving service ■ Assigning and managing IP addresses, Domain Name performance and making the infrastructure easier Systems (DNSs – which convert the name of a service to manage. to its associated IP address) and Dynamic Host ■ Ongoing maintenance. This typically consists of Configuration Protocol (DHCP) systems, which enable replacing servers or ‘blades’ on a rolling schedule to access and use of the DNS. ensure that equipment is replaced before it fails or ■ Managing Internet Service Providers (ISPs). becomes obsolete. This results in servers that are not ■ Implementing, monitoring and maintaining Intrusion only fully functional, but also capable of supporting Detection Systems on behalf of Information Security evolving services. Management. They will also be responsible for ■ Decommissioning and disposal of old server ensuring that there is no denial of service to equipment. This is often done in conjunction with the legitimate users of the network. organization’s environmental policies for disposal. ■ Updating Configuration Management as necessary by documenting CIs, status, relationships, etc. 
5.5 NETWORK MANAGEMENT Network Management is also often responsible, often in As most IT services are dependent on connectivity, conjunction with Desktop Support, for remote connectivity Network Management will be essential to deliver services issues such as dial-in, dial-back and VPN facilities provided and also to enable Service Operation staff to access and to home-workers, remote workers or suppliers. manage key service components. Some Network Management teams or departments will Network Management will have overall responsibility for also have responsibility for voice/telephony, including the all of the organization’s own Local Area Networks (LANs), provision and support for exchanges, lines, ACD, statistical Metropolitan Area Networks (MANs) and Wide Area software packages etc. and for Voice over Internet Protocol Networks (WANs) – and will also be responsible for liaising (VoIP) and Remote Monitoring (RMon) systems. with third-party network suppliers. At the same time, many organizations see VoIP and Their role will include the following activities: telephony as specialized areas and have teams dedicated to managing this technology. Their activities will be ■ Initial planning and installation of new similar to those described above. networks/network components; maintenance and Common Service Operation activities | 97 and who may access it. Specific responsibilities will Note on managing VoIP as a service include: Many organizations have experienced performance ■ Definition of data storage policies and procedures and availability problems with their VoIP solutions, in spite of the fact that there seems to be more than ■ File storage naming conventions, hierarchy and adequate bandwidth available. This results in dropped placement decisions calls and poor sound quality. 
This is usually because ■ Design, sizing, selection, procurement, configuration of variations in bandwidth utilization during the call, and operation of all data storage infrastructure which is often the result of utilization of the network ■ Maintenance and support for all utility and by other users, applications or other web activity. This middleware data-storage software has led to the differentiation between measuring the ■ Liaison with Information Lifecycle Management bandwidth available to initiate a call (Service Access Bandwidth – or SAB) and the amount of bandwidth team(s) or Governance teams to ensure compliance that must be continuously available during the call with freedom of information, data protection and IT (Service Utilization Bandwidth – or SUB). Care should governance regulations be taken in differentiating between these when ■ Involvement with definition and agreement of designing, managing or measuring VoIP services. archiving policy ■ Housekeeping of all data storage facilities ■ Archiving data according to rules and schedules 5.6 STORAGE AND ARCHIVE defined during Service Design. The Storage teams or departments will also provide input into the definition Many services require the storage of data for a specific of these rules and will provide reports on their time and also for that data to be available off-line for a effectiveness as input into future design certain period after it is no longer used. This is often due to regulatory or legislative requirements, but also because ■ Retrieval of archived data as needed (e.g. for audit history and audit data are invaluable for a variety of purposes, for forensic evidence, or to meet any other purposes, including marketing, product development, business requirements) forensic investigations, etc. ■ Third-line support for storage- and archive-related incidents. 
A separate team or department may be needed to manage the organization’s data storage technology such as: 5.7 DATABASE ADMINISTRATION ■ Storage devices, such as disks, controllers, tapes, etc. Database Administration must work closely with key ■ Network Attached Storage (NAS), which is storage Application Management teams or departments – and in attached to a network and accessible by several clients some organizations the functions may be combined or ■ Storage Area Networks (SANs) designed to attach linked under a single management structure. computer storage devices such as disk array controllers Organizational options include: and tape libraries. In addition to storage devices, a ■ Database administration being performed by each SAN will also require the management of several Application Management team for all the applications network components, such as hubs, cables, etc. under its control ■ Direct Attached Storage (DAS), which is a storage ■ A dedicated department, which manages all databases, device directly attached to a server regardless of type or application ■ Content Addressable Storage (CAS) which is storage ■ Several departments, each managing one type of that is based on retrieving information based on its database, regardless of what application they are content rather than location. The focus in this type of part of. system is on understanding the nature of the data and Database Administration works to ensure the optimal information stored, rather than on providing specific performance, security and functionality of databases that storage locations. they manage. 
Database Administrators typically have the Regardless of what type of storage systems are being following responsibilities: used, Storage and Archiving will require the management ■ Creation and maintenance of database standards of the infrastructure components as well as the policies and policies related to where data is stored, for how long, in what form ■ Initial database design, creation, testing 98 | Common Service Operation activities ■ Management of the database availability and generally kept up to date, it is also a good source of data performance; resilience, sizing, capacity and verification for the CMS. volumetrics etc. Directory Services Management refers to the process that ■ Resilience may require database replication, which is used to manage Directory Services. Its activities include: would be the responsibility of Database Administration ■ ■ Working as part of Service Design and Service Ongoing administration of database objects: indexes, tables, views, constraints, sequences snapshots and Transition to ensure that new services are accessible stored procedures; page locks – to achieve and controlled when they are deployed optimum utilization ■ Locating resources on a network (if these have not ■ The definition of triggers that will generate events, already been defined during Service Design) which in turn will alert database administrators ■ Tracking the status of those resources and providing of potential performance or integrity issues with the ability to manage those resources remotely the database ■ Managing the rights of specific users or groups of ■ Performing database housekeeping – the routine tasks users to access resources on a network that ensure that the databases are functioning ■ Defining and maintaining naming conventions to be optimally and securely, e.g. tuning, indexing, etc. 
used for resources on a network ■ Monitoring of usage; transaction volumes, response ■ Ensuring consistency of naming and access control on times, concurrency levels, etc. different networks in the organization ■ Generating reports. These could be reports based on ■ Linking different Directory Services throughout the the data in the database, or reports related to the organization to form a distributed Directory Service, performance and integrity of the database i.e. users will only see one logical set of network ■ Identification, reporting and management of database resources. This is called Distribution of Directory security issues; audit trails and forensics Services ■ Assistance in designing database backup, archiving ■ Monitoring Events on the Directory Services, such as and storage strategy unsuccessful attempts to access a resource, and taking ■ Assistance in designing database alerts and event the appropriate action where required management ■ Maintaining and updating the tools used to manage ■ Provision of third-level support for all database-related Directory Services. incidents. 5.9 DESKTOP SUPPORT 5.8 DIRECTORY SERVICES MANAGEMENT As most users access IT services using desktop or laptop A Directory Service is a specialized software application computers, it is key that these are supported to ensure the that manages information about the resources available agreed levels of availability and performance of services. on a network and which users have access to. It is the Desktop Support will have overall responsibility for all of basis for providing access to those resources and for the organization’s desktop and laptop computer hardware, ensuring that unauthorized access is detected and software and peripherals. Specific responsibilities will prevented (see section 4.5 for detailed information on include: Access Management). 
Directory Services views each resource as an object of the Directory Server and assigns it a name. Each name is linked to the resource’s network address, so that users don’t have to memorize confusing and complex addresses. Directory Services is based on the OSI’s X.500 standards and commonly uses protocols such as Directory Access Protocol (DAP) or Lightweight Directory Access Protocol (LDAP). LDAP is used to support user credentials for application login and often includes internal and external user/customer data which is especially good for extranet call logging. Since LDAP is a critical operational tool, and

■ Desktop policies and procedures, for example licensing policies, use of laptops or desktops for personal purposes, USB lockdown, etc.
■ Designing and agreeing standard desktop images
■ Desktop service maintenance including deployment of releases, upgrades, patches and hot-fixes (in conjunction with Release Management (see Service Transition publication for further details))
■ Design and implementation of desktop archiving/rebuild policy (including policy relating to cookies, favourites, templates, personal data, etc.)
■ Third-level support of desktop-related incidents, including desk-side visits where necessary
■ Support for connectivity issues (in conjunction with Network Management) to home-workers, mobile staff, etc.
■ Configuration control and audit of all desktop equipment (in conjunction with Configuration Management and IT Audit).

5.10 MIDDLEWARE MANAGEMENT

Middleware is software that connects or integrates software components across distributed or disparate applications and systems. Middleware enables the effective transfer of data between applications, and is therefore key to services that are dependent on multiple applications or data sources.

A variety of technologies are currently used to support program-to-program communication, such as object request brokers, message-oriented middleware, remote procedure calls and point-to-point web services. Newer technologies are emerging all the time, for example Enterprise Service Bus (ESB), which enables programs, systems and services to communicate with each other regardless of the architecture and origin of the applications. This is especially being used in the context of deploying Service Oriented Architectures (SOAs).

Middleware Management can be performed as part of an Application Management function (where it is dedicated to a specific application) or as part of a Technical Management function (where it is viewed as an extension to the Operating System of a specific platform).

Functionality provided by middleware includes:

■ Providing transfer mechanisms for data from various applications or data sources
■ Sending work to another application or procedure for processing
■ Transmitting data or information to other systems, such as sourcing data for publication on websites (e.g. publishing Incident status information)
■ Releasing updated software modules across distributed environments
■ Collation and distribution of system messages and instructions, for example Events or operational scripts that need to be run on remote devices
■ Multicast setup with networks. Multicast is the delivery of information to a group of destinations simultaneously using the most efficient delivery route
■ Managing queue sizes.

Middleware Management is the set of activities that are used to manage middleware. These include:

■ Working as part of Service Design and Transition to ensure that the appropriate middleware solutions are chosen and that they can perform optimally when they are deployed
■ Ensuring the correct operation of middleware through monitoring and control
■ Detecting and resolving Incidents related to middleware
■ Maintaining and updating middleware, including licensing, and installing new versions
■ Defining and maintaining information about how applications are linked through Middleware. This should be part of the CMS (see Service Transition publication).

5.11 INTERNET/WEB MANAGEMENT

Many organizations conduct much of their business through the Internet and are therefore heavily dependent upon the availability and performance of their websites. In such cases a separate Internet/Web Support team or department will be desirable and justified.

The responsibilities of such a team or department incorporate both Intranet and Internet and are likely to include:

■ Defining architectures for Internet and web services
■ The specification of standards for development and management of web-based applications, content, websites and web pages. This will typically be done during Service Design
■ Design, testing, implementation and maintenance of websites. This will include the architecture of websites and the mapping of content to be made available
■ In many organizations, web management will include the editing of content to be posted onto the web
■ Maintenance of all web development and management applications
■ Liaison and advice to web-content teams within the business. Content may reside in applications or storage devices, which implies close liaison with Application Management and other Technical Management teams
■ Liaison with and supplier management of ISPs, hosts, third-party monitoring or virtualization organizations etc. In many organizations the ISPs are managed as part of Network Management
■ Third-level support for Internet-/web-related incidents
■ Support for interfaces with back-end and legacy systems. This will often mean working with members of the Application Development and Management teams to ensure secure access and consistency of functionality
■ Monitoring and management of website performance and including: heartbeat testing, user experience simulation, benchmarking, on-demand load balancing, virtualization
■ Website availability, resilience and security. This will form part of the overall Information Security Management of the organization.

5.12 FACILITIES AND DATA CENTRE MANAGEMENT

Facilities Management refers to the management of the physical environment of IT Operations, usually located in Data Centres or computer rooms. This is a vast and complex area and this publication will provide an overview of its key role and activities. A more detailed overview is contained in Appendix E.

In many respects Facilities Management could be viewed as a function in its own right. However, because this publication is focused on where IT Operations are housed, it will cover Facilities Management specifically as it relates to the management of Data Centres and as a subset of the IT Operations Management function.

The main components of Facilities Management are as follows:

■ Building Management, which refers to the maintenance and upkeep of the buildings that house the IT staff and Data Centre. Typical activities include cleaning, waste disposal, parking management and access control
■ Equipment Hosting, which ensures that all special requirements are provided for the physical housing of equipment and the teams that support them
■ Power Management, which refers to managing the sourcing and utilization of power sources that are used to keep the facility functional. This definition of Power Management has a number of implications, which are discussed in Appendix E. Note that information about power utilization is important for planning the capacity of both new services and new buildings
■ Environmental Conditioning and Alert Systems, which include the specification, maintenance and monitoring of systems such as smoke detection and fire suppression, water, heating and cooling systems, etc.
■ Safety is concerned with compliance to all legislation, standards and policies relative to the safety of employees
■ Physical Access Control refers to ensuring that the facility is only accessed by authorized personnel and that any unauthorized access is detected and managed. This is discussed in more detail in Appendix F
■ Shipping and Receiving refers to the management of all equipment, furniture, mail, etc. that leaves or enters the building. It ensures that only appropriate items are entering or leaving the building and that they are routed to the correct party
■ Involvement in Contract Management of the various suppliers and service providers involved in the facility
■ Maintenance refers to regular, scheduled upkeep of the facility, as well as the detection and resolution of problems with the facility.

Important note regarding Data Centres

Data Centres are generally specialized facilities and, while they use and benefit from generic Facilities Management disciplines, they need to adapt these. For example layout, heating and conditioning, power planning and many other aspects are all managed uniquely in Data Centres.

This means that, although Data Centres may be facilities owned by an organization, they are better managed under the authority of IT Operations, although there may be a functional reporting line between IT and the department that manages other facilities for the organization.

5.12.1 Data Centre strategies

Managing a Data Centre is far more than hosting an open space where technical groups install and manage equipment, using their own approaches and procedures. It requires an integrated set of processes and procedures involving all IT groups at every stage of the ITSM Lifecycle. Data Centre operations are governed by strategic and design decisions for management and control and are executed by operators. This requires a number of key factors to be put in place:

■ Data Centre Automation. Specialized automation systems that reduce the need for manual operators and which monitor and track the status of the facility and all IT operations at all times
■ Policy-based management, where the rules of automation and resource allocation are managed by policy, rather than having to go through complex change procedures every time processing is moved from one resource to another
■ Real time services 24 hours a day, 7 days a week
■ Standardization of equipment. This provides greater ease of management, more consistent levels of performance and a means of providing multiple services across similar technology. Standardization also reduces the variety of technical expertise required to manage equipment in the Data Centre and to provide services
■ SOAs, where service components can be reused, interchanged and replaced very quickly and with no impact on the business. This will make it possible for the Data Centre to be highly responsive in meeting changing business demands without having to go through lengthy and involved re-engineering and re-architecting
■ Virtualization. This means that IT Services are delivered using an ever-changing set of equipment, geared to meet current demand. For example, an application may run on a dedicated device together with its database during high-demand times, but shifted to a shared device with its database on a remote device during non-peak times – all automated and automatic. This will mean even greater savings of costs as any equipment can be used at any time, without any human intervention, except to perform maintenance and replace failed equipment. The IT Infrastructure is more resilient since any component is backed up by any number of similar components, any of which could take over a failed component’s workload automatically. Remote monitoring, control and management equipment and systems will be essential to manage a virtualized environment, as many services will not be linked to any one specific piece of equipment.
■ Unified management systems have become more important as services run across multiple locations and technologies. Today it is important to define what actions need to be taken and what systems will perform that action. This means investing in solutions that will allow Infrastructure managers to simply specify what outcome is required, and allowing the management system to calculate the best combination of tools and actions to achieve the outcome.

5.13 INFORMATION SECURITY MANAGEMENT AND SERVICE OPERATION

Information Security Management as a process is covered in the ITIL Service Design publication. Information Security Management has overall responsibility for setting policies, standards and procedures to ensure the protection of the organization’s assets, data, information and IT services. Service Operation teams play a role in executing these policies, standards and procedures and will work closely with the teams or departments responsible for Information Security Management.

Service Operation teams cannot take ownership of Information Security Management, as this would represent a conflict. There needs to be segregation of roles between the groups defining and managing the process and the groups executing specific activities as part of ongoing operation. This will help protect against breaches to security measures, as no single individual should have control over two or more phases of a transaction or operation. Information Security Management should assign responsibilities to ensure a cross-check of duties.

The role of Service Operation teams is outlined next.

5.13.1 Policing and reporting

This will involve Operation staff performing specific policing activities such as the checking of system journals, logs, event/monitoring alerts etc, intrusion detection and/or reporting of actual or potential security breaches. This is done in conjunction with Information Security Management to provide a check and balance system to ensure effective detection and management of security issues.

Service Operation staff are often first to detect security events and are in the best position to be able to shut down and/or remove access to compromised systems.

Particular attention will be needed in the case of third-party organizations that require physical access into the organization. Service Operation staff may be required to escort visitors into sensitive areas and/or control their access.

They may also have a role to play in controlling network access to third parties, such as hardware maintainers dialling in for diagnostic purposes, etc.

5.13.2 Technical assistance

Some technical support may need to be provided to IT Security staff to assist in investigating security incidents and assist in production of reports or in gathering forensic evidence for use in disciplinary action or criminal prosecutions.

Technical advice and assistance may also be needed regarding potential security improvements (e.g. setting up appropriate firewalls or access/password controls).

The use of event, incident, problem and configuration management information can be relied on to provide accurate chronologies of security-related investigations.

5.13.3 Operational security control

For operational reasons, technical staff will often need to have privileged access to key technical areas (e.g. root system passwords, physical access to Data Centres or communications rooms etc). It is therefore essential that adequate controls and audit trails are kept of all such privileged activities so as to deter and detect any security events.

Physical controls need to be in place for all secure areas with logging in-out of all staff. Where third-party staff or visitors need access, it may be Service Operation staff that are responsible for escorting and managing the movement of such personnel.

In the case of privileged systems access, this needs to be restricted to only those people whose need to access the system has been verified – and withdrawn immediately when that need no longer exists. An audit trail must be maintained of who has had access and when, and of all activities performed using those access levels.

5.13.4 Screening and vetting

All Service Operation staff should be screened and vetted to a security level appropriate to the organization in question.

Suppliers and third-party contractors should also be screened and vetted – both the organizations and the specific personnel involved. Many organizations have started using police or government agency background checks, especially where contractors will be working with classified systems. Where necessary, appropriate non-disclosure and confidentiality agreements must be agreed.

5.13.5 Training and awareness

All Service Operation staff should be given regular and ongoing training and awareness of the organization’s security policy and procedures. This should include details of disciplinary measures in place. In addition, any security requirements should be specified in the employee’s contract of employment.

5.13.6 Documented policies and procedures

Service Operation documented procedures must include all relevant information relating to security issues – extracted from the organization’s overall security policy documents. Consideration should be given to the use of handbooks to assist in getting the security messages out to all relevant staff.

5.14 IMPROVEMENT OF OPERATIONAL ACTIVITIES

All Service Operation staff should be constantly looking for areas in which process improvements can be made to give higher IT service quality and/or performed in a more cost-effective way. This might include some of the following activities.

5.14.1 Automation of manual tasks

Any tasks which have to be carried out manually, particularly those that have to be regularly repeated, are likely to be more time consuming, costly and error prone than those that can be systemised and automated. All tasks should be examined for potential automation to reduce effort and costs and to minimize potential errors. A judgement must be made on the costs of the automation and the likely benefits that will occur.

5.14.2 Reviewing makeshift activities or procedures

Because of the pragmatic nature of Service Operation, it may sometimes arise that makeshift activities or processes are introduced to address short-term operational expediencies. There is a danger that such practices can be continued and become the ‘norm’ – leading to ongoing inefficiencies. Where any makeshift activities or procedures do have to be introduced it is important that these are reviewed as soon as the immediate expediency is overcome – and either dispensed with or replaced with efficient agreed processes for the longer term.

5.14.3 Operational Audits

Regular audits should be conducted of all Service Operation processes to ensure that they are working satisfactorily.

5.14.4 Using Incident and Problem Management

Problem and Incident Management provide a rich source of operational improvement opportunities. These processes are discussed in detail in Chapter 4 of this publication.
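As an illustration of how a routine audit check might be automated (sections 5.14.1 and 5.14.3), the following minimal sketch reviews a privileged-access audit trail of the kind section 5.13.3 calls for. It flags any recorded activity that was not covered by a verified access grant in force at the time. The record layouts (`AccessGrant`, `AuditEntry`), the function name and the sample data are hypothetical and chosen only for illustration; a real organization would draw this information from its own access management and logging tools.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class AccessGrant:
    """A verified privileged-access grant (hypothetical record layout)."""
    user: str
    system: str
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None  # None means the grant is still active

    def covers(self, when: datetime) -> bool:
        """Return True if the grant was in force at the given moment."""
        if when < self.granted_at:
            return False
        return self.withdrawn_at is None or when < self.withdrawn_at


@dataclass(frozen=True)
class AuditEntry:
    """One audit-trail record: who did what, on which system, and when."""
    user: str
    system: str
    action: str
    at: datetime


def unauthorized_entries(grants: List[AccessGrant],
                         trail: List[AuditEntry]) -> List[AuditEntry]:
    """Return the audit entries not covered by any grant in force at the time."""
    flagged = []
    for entry in trail:
        covered = any(
            g.user == entry.user and g.system == entry.system and g.covers(entry.at)
            for g in grants
        )
        if not covered:
            flagged.append(entry)
    return flagged


if __name__ == "__main__":
    # Illustrative data only: one grant that was withdrawn on 1 March 2007.
    grants = [AccessGrant("alice", "db01", datetime(2007, 1, 1), datetime(2007, 3, 1))]
    trail = [
        AuditEntry("alice", "db01", "root login", datetime(2007, 2, 1)),
        AuditEntry("alice", "db01", "root login", datetime(2007, 3, 5)),
        AuditEntry("bob", "db01", "root login", datetime(2007, 2, 1)),
    ]
    for entry in unauthorized_entries(grants, trail):
        print(entry.user, entry.system, entry.action, entry.at.date())
```

In this sketch an entry is flagged either because the access had already been withdrawn or because no grant was ever recorded for that person and system. Whether such findings are raised as incidents or passed to Information Security Management is a matter for the organization’s own policy.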
5.14.5 Communication

It should go without saying that good communication about changing requirements, technology and processes will result in improvement in Service Operation. However, communication is often neglected. Service Operation improvement is dependent on formal and regular communication between teams responsible for design, support and operation of services.

5.14.6 Education and training

Service Operation teams should understand the importance of what they do on a daily basis. Education is required to ensure that staff understand what business functions or services are supported by their activities. This will encourage greater care and attention to detail and will also help Service Operation teams to better identify business priorities.

Training programmes should ensure that all staff have the appropriate skills for the technology or applications that they are managing. Training should always be provided when new technology is introduced, or when existing technology is changed.

6 Organizing for Service Operation

6.1 FUNCTIONS

A function is a logical concept that refers to the people and automated measures that execute a defined process, an activity or a combination of processes or activities. In larger organizations a function may be broken up and performed by several departments, teams and groups, or it may be embodied within a single organizational unit.

The Service Operation functions given in Figure 6.1 are needed to manage the ‘steady state’ operational IT environment. These are logical functions and do not necessarily have to be performed by an equivalent organizational structure. This means that Technical and Application Management can be organized in any combination and into any number of departments. The second-level groupings in Figure 6.1 are examples of typical groups of activities performed by Technical Management (see Chapter 5) and are not a suggested organization structure.

Figure 6.1 Service Operation functions (diagram labels: Service Desk; Technical Management: Mainframe, Server, Network, Storage, Database, Directory Services, Desktop, Middleware, Internet/Web; IT Operations Management: IT Operations Control with Console Management, Job Scheduling, Backup and Restore, Print and Output, and Facilities Management with Data Centres, Recovery Sites, Consolidation, Contracts; Application Management: Financial Apps, HR Apps, Business Apps)

The following is an overview of the Service Operation functions in Figure 6.1:

■ The Service Desk is the primary point of contact for users when there is a service disruption, for service requests or even for some categories of Request for Change. The Service Desk provides a point of communication to the users and a point of coordination for several IT groups and processes. To enable them to perform these actions effectively the Service Desk is usually separate from the other Service Operation functions. In some cases, e.g. where detailed technical support is offered to users on the first call, it may be necessary for Technical or Application Management staff to be on the Service Desk. This does not mean that the Service Desk becomes part of the Technical Management function. In fact, while they are on the Service Desk, they cease to be a part of the Technical Management or Application Management functions and become part of the Service Desk, even if only temporarily.
■ Technical Management provides detailed technical skills and resources needed to support the ongoing operation of the IT Infrastructure. Technical Management also plays an important role in the design, testing, release and improvement of IT services. In small organizations, it is possible to manage this expertise in a single department, but larger organizations are typically split into a number of technically specialized departments (see later in this chapter). In many organizations, the Technical Management departments are also responsible for the daily operation of a subset of the IT Infrastructure. Figure 6.1 shows that, although they are part of a Technical Management department, staff who perform these activities are logically part of the IT Operations Management function.
■ IT Operations Management is the function responsible for the daily operational activities needed to manage the IT Infrastructure. This is done according to the Performance Standards defined during Service Design. In some organizations this is a single, centralized department, while in others some activities and staff are centralized and some are provided by distributed or specialized departments. This is illustrated in Figure 6.1 by the overlapping from the Technical and Application Management functions. IT Operations Management has two functions that are unique and which are generally formal organizational structures. These are:
  ● IT Operations Control, which is generally staffed by shifts of operators and which ensures that routine operational tasks are carried out. IT Operations Control will also provide centralized monitoring and control activities, usually using an Operations Bridge or Network Operations Centre.
  ● Facilities Management refers to the management of the physical IT environment, usually Data Centres or computer rooms. In many organizations Technical and Application Management are co-located with IT Operations in large Data Centres. In some organizations many physical components of the IT Infrastructure have been outsourced and Facilities Management may include the management of the outsourcing contracts.
■ Application Management is responsible for managing applications throughout their lifecycle. The Application Management function supports and maintains operational applications and also plays an important role in the design, testing and improvement of applications that form part of IT services. Application Management is usually divided into departments based on the application portfolio of the organization (see the examples in Figure 6.1), thus allowing easier specialization and more focused support. In many organizations Application Management departments have staff who perform daily operations for those applications. As with Technical Management, these staff logically form part of the IT Operations Management function.

Special note on Information Security Management

Although most would agree that Information Security Management is a function, it is highly specialized and spans several phases of the lifecycle. It is also responsible for the oversight of many activities within all Service Operation functions. For a more in-depth description of Information Security Management, please refer to the Service Design publication and to section 5.13 of this publication.

6.1.1 Functions and activities

Chapter 5 of this publication introduced a number of common Service Operation activities. Due to the technical nature and specialization of these activities, the teams, groups or departments that perform them are often given names that correspond to the particular activities. For example, Network Management could be performed by a ‘Network Management Department’. This, however, is by no means a rule. There are a number of options available in mapping activities to a team or department, for example:

■ One activity could be performed by several teams or departments, e.g. if an organization has five major Application Support departments, each supporting a different set of applications, each of these departments could perform Database Administration for ‘its’ applications
■ One department could perform several activities, e.g. the Network Management Department could be responsible for managing the network, Directory Services Management and Server Management
■ An activity could be performed by groups, e.g. Security Administration can be performed by any person with responsibility for managing an application, server, middleware or desktop.

These organizational decisions are influenced by a number of factors, such as:

■ The size and location of the organization. Smaller, less distributed organizations will tend to combine these functions, whereas large, decentralized organizations may have several teams or departments performing the same activity (e.g. per region).
■ The complexity of technology used in the organization. The higher the number of different technologies used, the more likely there are to be several different teams, each doing something similar, but in a different context (e.g. UNIX Server Management and Windows Server Management).
■ The availability of skills. Where technical skills are scarce, it is common for organizations to use generalists to perform multiple groups of activities – although, in some cases, security considerations make this very difficult. For example, an organization working on classified or secret projects may have to hire expensive, specialized resources even when that means relocating them or contracting through security-cleared vendors.
■ The culture of the organization. Some organizations prefer to work in highly specialized environments, while others tend to prefer the flexibility of generalist staff.
■ The financial situation of the organization will determine how many people, with what type of skill, can be employed and how they will be organized.

As a result of these factors, it is impossible for this publication to prescribe an appropriate organizational structure that will fit every situation, however, the following sections list the required activities under the functional groups most likely to be involved in their operation. Please note that this does not mean that all organizations have to use these divisions. Smaller organizations will tend to combine these activities into single departments, or even individuals – if they are even needed at all.

Special note on outsourcing

These organizational considerations are likely to be most relevant to internal IT organizations. The situation becomes even more complex when some or all of a particular activity or function are outsourced. Prime opportunities for outsourcing have been the Service Desk and Network Operations. This will be covered in more detail in ITIL Complementary Guidance, but some of the key points to remember are:

■ Regardless of who is performing the activity, the company contracting the outsourcer is still responsible for ensuring that it is performed to a standard that will support the delivery of services to their customers and users.
■ Outsourcing to solve an organization’s problems or as an alternative to good Service Management processes rarely works. The best results are obtained if these are in place before outsourcing.
■ Outsourcing works best when there is active involvement by both organizations. If the staff and managers of the customer organization disengage, the outsourcer is unlikely to be successful, simply because nobody understands the organization better than the people who work there.
■ The outsourcer should not determine their outputs or how they are measured. These are determined by understanding the business requirements of users and customers and ensuring that they can be met by the outsourcer’s capabilities.
■ Although the outsourcer’s services become an integral part of the organization, they are still a third-party organization, with a different set of business objectives, policies and practices. Security standards must be upheld and both parties must clearly understand their respective roles and contributions.

6.2 SERVICE DESK

A Service Desk is a functional unit made up of a dedicated number of staff responsible for dealing with a variety of service events, often made via telephone calls, web interface, or automatically reported infrastructure events.

The Service Desk is a vitally important part of an organization’s IT Department and should be the single point of contact for IT users on a day-by-day basis – and will handle all incidents and service requests, usually using specialist software tools to log and manage all such events.

The value of an effective Service Desk should not be underrated – a good Service Desk can often compensate for deficiencies elsewhere in the IT organization, but a poor Service Desk (or the lack of a Service Desk) can give a poor impression of an otherwise very effective IT organization!

It is therefore very important that the correct calibre of staff is used on the Service Desk and that IT Managers do their best to make the desk an attractive place to work to improve staff retention.

The exact nature, type, size and location of a Service Desk will vary, depending upon the type of business, number of users, geography, complexity of calls, scope of services and many other factors.

In alignment to customer and business requirements, the IT organization’s senior managers should decide the exact nature of its required Service Desk (and whether it should be internal or outsourced to a third party) as part of its overall ITSM strategy (see Service Strategy publication) – and then subsequent planning must be done to prepare for and then implement the appropriate Service Desk function (either when implementing a new function, or more likely these days when making necessary amendments to an existing function – see Service Design and Service Transition publications).

6.2.1 Justification and role of the Service Desk

Very little justification is needed today for a Service Desk, as many organizations have become convinced that this is by far the best approach for dealing with first-line IT support issues. One only needs ask the question ‘What is the alternative?’ to make a compelling case for the Service Desk concept. Where further justification is needed, the following benefits should be considered:

■ Improved customer service, perception and satisfaction
■ Increased accessibility through a single point of contact, communication and information
■ Better-quality and faster turnaround of customer or user requests
■ Improved teamwork and communication
■ Enhanced focus and a proactive approach to service provision
■ A reduced negative business impact
■ Better-managed infrastructure and control
■ Improved usage of IT Support resources and increased productivity of business personnel
■ More meaningful management information for decision support
■ It is common practice that the Service Desk provides ‘entry-level’ positions for ITSM staff. Working on the Service Desk is an excellent ‘grounding’ for anyone who wishes to pursue a career in Service Management. However, this could also present challenges with people who do not understand the business or technology. Users calling the Service Desk should be able to speak to someone who is able to address their needs, and Service Desk Analysts should not be burned out in less than a year because of undue stress. Care should be taken to select appropriately skilled individuals with a good understanding of the business and to provide adequate training – thus preventing reduction in levels of support due to a lack of knowledge at the first line.

6.2.2 Service Desk objectives

The primary aim of the Service Desk is to restore the ‘normal service’ to the users as quickly as possible. In this context ‘restoration of service’ is meant in the widest possible sense. While this could involve fixing a technical fault, it could equally involve fulfilling a service request or answering a query – anything that is needed to allow the users to return to working satisfactorily.

Specific responsibilities will include:

■ Logging all relevant incident/service request details, allocating categorization and prioritization codes
■ Providing first-line investigation and diagnosis
■ Resolving those incidents/service requests they are able
■ Escalating incidents/service requests that they cannot resolve within agreed timescales
■ Keeping users informed of progress
■ Closing all resolved incidents, requests and other calls
■ Conducting customer/user satisfaction call-backs/surveys as agreed
■ Communication with users – keeping them informed of incident progress, notifying them of impending changes or agreed outages, etc.
■ Updating the CMS under the direction and approval of Configuration Management if so agreed.

Note: these activities are explained and set in context with the fuller Incident Management and Request Fulfilment process in sections 4.2 and 4.3 respectively.

6.2.3 Service Desk organizational structure

There are many ways of structuring Service Desks and locating them – and the correct solution will vary for different organizations. The primary options are detailed below, but in reality an organization may need to implement a structure that combines a number of these options in order to fully meet the business needs:

6.2.3.1 Local Service Desk

This is where a desk is co-located within or physically close to the user community it serves. This often aids communication and gives a clearly visible presence, which some users like, but can often be inefficient and expensive to resource as staff are tied up waiting to deal with incidents when the volume and arrival rate of calls may not justify this.

There may, however, be some valid reasons for maintaining a local desk, even where call volumes alone do not justify this. Reasons might include:

■ Language and cultural or political differences
■ Different time zones
■ Specialized groups of users
■ The existence of customized or specialized services that require specialist knowledge
■ VIP/criticality status of users.

Figure 6.2 Local Service Desk (diagram labels: User (four instances), Service Desk, Technical Management, Application Management, IT Operations Management, 3rd Party Support, Request Fulfilment)

6.2.3.2 Centralized Service Desk

It is possible to reduce the number of Service Desks by merging them into a single location (or into a smaller number of locations) by drawing the staff into one or more centralized Service Desk structures. This can be more efficient and cost-effective, allowing fewer overall staff to deal with a higher volume of calls, and can also lead to higher skill levels through great familiarization through more frequent occurrence of events. It might still be necessary to maintain some form of ‘local presence’ to handle physical support requirements, but such staff can be controlled and deployed from the central desk.

Figure 6.3 Centralized Service Desk (diagram labels: Customer Site 1, Customer Site 2, Customer Site 3, Service Desk, Second Line Support, Technical Management, Application Management, IT Operations Management, 3rd Party Support, Request Fulfilment)

6.2.3.3 Virtual Service Desk

Through the use of technology, particularly the Internet, and the use of corporate support tools, it is possible to give the impression of a single, centralized Service Desk when in fact the personnel may be spread or located in any number or type of geographical or structural locations. This brings in the option of ‘home working’, secondary support group, off-shoring or outsourcing – or any combination necessary to meet user demand. It is important to note, however, that safeguards are needed in all of these circumstances to ensure consistency and uniformity in service quality and cultural terms.

Figure 6.4 Virtual Service Desk (diagram labels: Virtual Service Desk, San Francisco Service Desk, Paris Service Desk, Rio de Janeiro Service Desk, Sydney Service Desk, Beijing Service Desk, London Service Desk, Service Knowledge Management System)

6.2.3.4 Follow the Sun

Some global or international organizations may wish to combine two or more of their geographically dispersed Service Desks to provide a 24-hour follow-the-sun service. For example, a Service Desk in Asia-Pacific may handle calls during its standard office hours and at the end of this period it may hand over responsibility for any open incidents to a European-based desk.

■ A quiet environment with adequate acoustic control so that one telephone conversation is not disrupted by another
■ Pleasant surroundings and comfortable furniture so as to lighten the mood (the Service Desk can be a very stressful place to work, so every little helps!)
■ A separate rest-room and refreshment area nearby so that staff can take short breaks as appropriate when necessary without being away for too long.

Anecdote

One company found that there was a ‘them and us’
That desk will handle culture existing between the Service Desk and the these calls alongside its own incidents during its standard other support teams. The third-line teams often day and then hand over to a USA-based desk – which believed themselves to be better than the Service finally hands back responsibility to the Asia-Pacific desk to Desk. Hiding the Service Desk away in an isolated complete the cycle. room helped to reinforce this culture. The company found that creating an open-plan office with the This can give 24-hour coverage at relatively low cost, as Service Desk in the middle encouraged closer no desk has to work more than a single shift. However, working and helped to break down these barriers. the same safeguards of common processes, tools, shared database of information and culture must be addressed for this approach to proceed – and well-controlled escalation 22.214.171.124 Building a single point of contact and handover processes are needed. Regardless of the combination of options chosen to fulfil an organization’s overall Service Desk structure, individual 126.96.36.199 Specialized Service Desk groups users should be in no doubt about who to contact if they For some organizations it might be beneficial to create need assistance. A single telephone number (or a single ‘specialist groups’ within the overall Service Desk structure, number for each group if separate desks are chosen) so that incidents relating to a particular IT service can be should be provided and well publicized – as well as a routed directly (normally via telephony selection or a web- single e-mail address and a single web Service Desk based interface) to the specialist group. This can allow contact page. faster resolution of these incidents, through greater familiarity and specialist training. 
Ideas that can be successfully used to help publicize the Service Desk telephone number and e-mail address, and The selection would be made using a script along the making it available close to hand when users are likely to lines of ‘If your call is about the X Service, please press 1 need them, are: now, otherwise please hold for a Service Desk analyst’. ■ Including the Service Desk telephone number on Care is needed not to over complicate the selection, so hardware CI labels, attached to the components the specialist groups should only be considered for a very user is likely to be calling about small number of key services where these exist, and ■ Printing Service Desk contact details on telephones where call rates about that service justify a separate ■ For PCs and laptops, using a customized background specialist group. or desktop with the Service Desk contact details, together with information read from the system that 188.8.131.52 Environment will be needed when calling (such as IP address, The environment where the Service Desk is to be located OS build number, etc.) in one corner should be carefully chosen. Where possible, the following ■ Printing the Service Desk number on ‘freebies’ (pens, facilities should be provided: pencils, mugs, mouse-mats, etc.) 
■ A location where the entire function can be positioned ■ Prominently placing these details on Service Desk with sufficient natural light and overall space – to Internet/intranet sites allow adequate desk and storage-space, and room to move around if necessary 114 | Organizing for Service Operation ■ Including them on any calling cards or satisfaction ● Number of customers and users speaking a survey cards left with users when a desk visit has different language been necessary ● Skill level ■ Repeating the details on all correspondence sent to ■ Incident and Service Request types (and types of RFC the users (together with call reference numbers) if appropriate): ■ Placing the details on notice boards or physical ● Duration of time required for call types (e.g. simple locations that users are likely to regularly visit queries, specialist application queries, hardware, (entrances, canteens, refreshment areas, etc.). etc.) ● Local or external expertise required 6.2.4 Service Desk staffing ● The volume and types of incidents and Service The issues involved in, and criteria for, establishing the Requests appropriate staffing model and levels are discussed in this ■ The period of support cover required, based on: section. Details about typical Service Desk roles and ● Hours covered responsibilities can be found in paragraph 6.6.1 below. ● Out-of-hours support requirements They include the Service Desk Manager, Supervisor, ● Time zones to be covered Analysts and, in some organizations, these roles are ● Locations to be supported (particularly if Service complemented by business users (‘Super Users’) who provide first-line support. Desk staff also conduct desk-side support) ● Travel time between locations 184.108.40.206 Staffing levels ● Workload pattern of requests (e.g. daily, month end, etc.) 
An organization must ensure that the correct number of staff are available at any given time to match the demand ● The service level targets in place (response levels being placed upon the desk by the business. Call rates can etc.) be very volatile and often in the same day the arrival rate ■ The type of response required: may go from very high to very low and back again. An ● Telephone organization planning a new desk should attempt to ● E-mail/fax/voicemail/video predict the call arrival rate and profile – and to staff ● Physical attendance accordingly. Statistical analysis of call arrival rates under ● Online access/control current support arrangements must be undertaken and ■ The level of training required then closely monitored and adjusted as necessary. ■ The support technologies available (e.g. phone Many organizations will find that call rates peak during the systems, remote support tools, etc.) start of the office day and then fall off quickly, perhaps ■ The existing skill levels of staff with another burst in the early part of the afternoon – this ■ The processes and procedures in use. obviously varies depending upon the organization’s business but is an often occurring pattern for many All these items should be carefully considered before organizations. In such circumstances it may be possible to making any decision on staffing levels. This should also be utilize part-time staff, home-workers, second-line support reflected in the levels of documentation required. staff or third parties to cover the peaks. Remember that the better the service, the more the business will use it. The following factors should be considered when deciding staffing levels: A number of tools are available to help determine the appropriate number of staff for the Service Desk. 
These ■ Customer service expectations workload modelling tools are dependent on detailed ‘local ■ Business requirements, such as budget, call response knowledge’ of the organization such as call volumes and times, etc. patterns, service and user profiles, etc. ■ Size, relative age, design and complexity of the IT Infrastructure and Service Catalogue – for example, the 220.127.116.11 Skill levels number and type of incidents, the extent of An organization must decide on the level and range of customised versus standard off-the-shelf software skills it requires of its Service Desk staff – and then ensure deployed, etc. that these skills are available at the appropriate times. ■ The number of customers and users to support, and associated factors such as: Organizing for Service Operation | 115 A range of skill options are possible, starting from a ‘call- the service, the more likely specialist knowledge will be logging’ service only – where staff need only very basic required on the first call. technical skills – right through to a ‘technical’ Service Desk Note that first-line resolution rates can be reduced by where the organization’s most technically skilled staff are effective Problem Management, which will reduce a used. In the case of the former, there will be a high number of the simpler, repetitive incidents. In such cases, handling but low resolution rate, while in the latter case although the resolution rates appear to be going down, this will be reversed. the overall service quality will have improved by the The decision on the required skills level will often be complete removal of many incidents. 
While this is good, driven by target resolution times (agreed with the business if Service Desk staff are paid incentives or bonuses for and captured in service level targets), the complexity of first-call resolution, it could prove disastrous for morale the systems supported and ‘what the business is prepared and process effectiveness unless the bonus threshold to pay’. is reviewed. There is a strong correlation between response and Improvements in resolution times/rates should not be left resolution targets and costs – generally speaking, the to chance, but should instead be part of an ongoing shorter the target times, the higher the cost because more Service Improvement Plan (see the Continual Service resources are required. Improvement publication for fuller details). While there may be instances when business dependency Once the required skill levels have been identified, there is or criticality make a highly technically skilled desk an an ongoing task to ensure that the Service Desk is imperative, the optimum and most cost-effective approach operated in such a way that the necessary staff obtain and is generally to have a ‘call-logging’ first line of support via maintain the necessary skills – and that staff with the the Service Desk, with quick and effective escalations to correct balance of skills are on duty at appropriate times more skilled second-line and third-line resolution groups so that consistency is maintained. where skilled staff can be concentrated and more This will involve an ongoing training and awareness effectively utilised (see Incident Management, section 4.2, programme which should cover: for more details and guidance on end-to-end support structures). However, this basic starting point can be ■ Interpersonal skills: such as telephony skills, improved over time by providing the first-line staff with an communication skills, active listening and customer- effective knowledge-base, diagnostic scripts and care training. 
integrated support tools (including a CMS), as well as ■ Business awareness: specific knowledge of the ongoing training and awareness, so that first-line organization’s business areas, drivers, structure, resolution rates can gradually be increased. priorities, etc. ■ Service awareness of all the organization’s key IT This can also be achieved by locating second-level staff on the Service Desk, effectively creating a two-tier structure. services for which support is being provided This has advantages of making second-level staff available ■ Technical awareness (and deeper technical training to to help deal with peak call periods and to train more the appropriate level, depending upon the resolution junior personnel, and it will often increase the first-call rate sought) resolution rate. However, second-line staff often have ■ Depending on level of support provided, some duties outside of the Service Desk – resulting in rosters diagnosis skills (e.g. Kepner and Tregoe) having to be managed or second-line staff positions being ■ Support tools and techniques duplicated. In addition, having to deal with routine calls ■ Awareness training and tutorials in new systems and may be demotivating for more experienced staff. A further technologies, prior to their introduction potential drawback is that the Service Desk becomes really ■ Processes and procedures (most particularly Incident, good at resolving calls, whereas Change and Configuration Management – but an second-line staff should be focused on removing the overview of all ITSM processes and procedures) root cause instead. ■ Typing skills to ensure quick and accurate entry of Another factor to consider when deciding on the skills incident or Service Request details. requirements for Service Desk staff is the level of For such a programme to be effective, skill requirements customization or specialization of the supported services. 
and levels should be evaluated periodically and training Standardized services require less specific knowledge to records maintained. provide quality customer support. The more specialized 116 | Organizing for Service Operation Careful formulation of staffing rotations or schedules staff. This often leads to innovation in Service Desk should be maintained so that a consistent balance of staff operation (such as specialized services) which in turn drive experience and appropriate skill levels are present during operational efficiencies at all tier levels of support. It helps all critical operational periods. It is not sufficient to have to build skills that can be used in their current role as well only the right number of staff on duty – the correct blend as it jump-starts the training for a new role. While it is of skills should also be available. important to develop their core competencies in their current role, having a clear career path and recognising 18.104.22.168 Training future requirement and development needs is also It is vital that all Service Desk staff are adequately trained important. before they are called upon to staff the Service Desk. A formal induction programme should be undertaken by all 22.214.171.124 Staff retention new staff, the exact content of which will vary depending It is very important that all IT Managers recognize the upon the existing skill levels and experience of the new importance of the Service Desk and the staff who work on recruit, but is likely to include many of the required skills it, and give this special attention. Any significant loss of as described above. staff can be disruptive and lead to inconsistency of service – so efforts should be made to make the Service Desk an Where possible, a business awareness programme, attractive place to work. 
including short periods of secondment into key business areas, should be provided for new staff who do not Ways in which this can be done include proper already have this level of business awareness. recognition of the role with reward packages recognizing this, team-building exercises, staff rotation onto other When starting on the Service Desk, new staff should activities (projects, second-line support, etc.). initially ‘shadow’ experienced staff – sit with them and listen in on calls – before starting to take calls themselves The Service Desk can often be used as a stepping stone with a mentor listening in and able to intervene and into other more technical or supervisory/managerial roles. provide support where necessary. The mentor should If this is done, care is needed to ensure that proper initially review each call with the trainee after it concludes succession planning takes place so that the desk does not to learn any lessons. The frequency of such reviews should lose all of its key expertise in any area at one time. Also, be gradually reduced as experience and confidence grows good documentation and cross-training can mitigate this but the mentor should still be available to provide risk. ongoing support even when the trainee has reached the stage of going solo. 126.96.36.199 Super Users Mentors may need to be trained on how to mentor. Many organizations find it useful to appoint or designate a Service Desk experience and technical skills are not the number of ‘Super Users’ throughout the user community, only requirements for mentoring. Effective knowledge- to act as liaison points with IT in general and the Service transfer skills and the ability to teach without being Desk in particular. condescending or threatening are equally important. 
Super Users can be given some additional training and A programme will be necessary to keep Service Desk staff’s awareness and used as a conduit for communications flow knowledge up to date – and to make them aware of new in both directions. They can be asked to filter requests and developments, services and technologies. The timing of issues raised by the user community (in some cases even such events is critical so as not to impact upon the normal going as far as to have incidents or requests raised by the duties. Many Service Desks find that it is best to organize Super User) – this can help prevent ‘incident storms’ when short ‘tutorials’ during quiet periods when staff are less a key service or component fails, affecting many users. likely to be needed for call handling. They can also be used to cascade information from the Note: Investment should also be made in the professional Service Desk outwards throughout their local user development of Service Desk staff. Internal mentoring and community, which can be very useful in disseminating shadowing second- and third-level support staff is a good service details to all users very quickly. start, but best-of-breed Service Desks benefit from a It is important to note that Super Users should log all calls formalized programme of staff development. that they deal with, and not just those that they pass on Organizational commitment to professional development to IT. This will mean access to, and training on how to use, helps instil a sense of accomplishment and opportunity to the Incident logging tools. This will help to measure the Organizing for Service Operation | 117 activity of the Super User and also to ensure that their An increase in the number of calls to the Service Desk can position is not abused. 
In addition, it will ensure that indicate less reliable services over that period of time – valuable history regarding incidents and service quality are but may also indicate increased user confidence in a not lost. Service Desk that is maturing, resulting in a higher likelihood that users will seek assistance rather than try to It may also be possible for Super Users to be involved in: cope alone. For this type of metric to be reliable for ■ Staff training for users in their area reaching either conclusion, further comparison of previous ■ Providing support for minor incidents or simple periods for any Service Desk improvements implemented request fulfilment since the last measurement baseline, or service reliability ■ Involvement with new releases and rollouts. changes, problems, etc. to isolate the true cause for the increase is needed. Super Users do not necessarily provide support for the whole of IT. In many cases a Super User will only provide Further analysis and more detailed metrics are therefore support for a specific application, module or business unit needed and must be examined over a period of time. area. As a business user the Super User often has in-depth These will include the call-handling statistics previously knowledge of how key business processes run and how mentioned under telephony, and additionally: services work in practice. This is very useful knowledge to ■ The first-line resolution rate: the percentage of calls share with the Service Desk, so that it can provide higher- resolved at first line, without the need for escalation to quality services in future. other support groups. 
This is the figure often quoted It should be noted that a firm commitment is needed from by organizations as the primary measure of the Service potential Super Users, and specifically their management, Desks performance – and used for comparison that they will have the time and interest to perform this purposes with the performance of other desks – but role before selection and training commences. care is needed when making any comparisons. For greater accuracy and more valid comparisons this can A Super User, while a valuable interface to the business be broken down further as follows: and the Service Desk, must be given proper training, ● The percentage of calls resolved during the first accountability and expectation. Super Users can be contact with the Service Desk, i.e. while the user is vulnerable to misuse if their role, responsibilities and still on the telephone to report the call the process governing these are not clearly communicated to the users. It is imperative that a Super User is not seen ● The percentage of calls resolved by the Service as a replacement for, or a means to circumvent, the Desk staff themselves without having to seek Service Desk. deeper support from other groups. Note: some desks will choose to co-locate or embed more 6.2.5 Service Desk metrics technically skilled second-line staff with the Service Desk (see Incident Management for further details). Metrics should be established so that performance of the In such cases it is important when making Service Desk can be evaluated at regular intervals. This is comparisons to also separate out (i) the percentage important to assess the health, maturity, efficiency, resolved by the Service Desk staff alone; and effectiveness and any opportunities to improve Service (ii) the percentage resolved by the first-line Service Desk operations. Desk staff and second-line support staff combined. 
Metrics for Service Desk performance must be realistic and ■ Average time to resolve an incident (when resolved at carefully chosen. It is common to select those metrics that first line) are easily available and that may seem to be a possible ■ Average time to escalate an incident (where first-line indication of performance; however, this can be resolution is not possible) misleading. For example, the total number of calls ■ Average Service Desk cost of handling an incident. received by the Service Desk is not in itself an indication Two metrics should be considered here: of either good or bad performance and may in fact be ● Total cost of the Service Desk divided by the caused by events completely outside the control of the number of calls. This will provide an average figure Service Desk – for example a particularly busy period for which is useful as an index and for planning the organization, or the release of a new version of a purposes but does not accurately represent the major corporate system. relative costs of different types of calls 118 | Organizing for Service Operation ● By calculating the percentage of call duration time courteous and professional, whether they instilled on the desk overall and working out a cost per confidence in the user. minute (total costs for the period divided by total This type of measure is best obtained from the users call duration minutes’) this can be used to themselves. This can be done as part of a wider calculate the cost for individual calls and give a customer/user satisfaction survey covering all of IT or can more accurate figure. be specifically targeted at Service Desk issues alone. 
By evaluating the types of incidents with call duration, a more refined picture of cost per call by types arises One effective way of achieving the latter is through a call- and gives an indication of which incident types tend back telephone survey, where an independent Service to cost more to resolve and possible targets for Desk Operator or Supervisor rings back a small percentage improvements. of users shortly after their incident has been resolved, to ■ Percentage of customer or user updates conducted ask the specific questions needed. within target times, as defined in SLA targets Care should be taken to keep the number of questions to ■ Average time to review and close a resolved call a minimum (five to six at the most) so that the users will ■ The number of calls broken down by time of day and have the time to cooperate. Also survey questions should day of week, combined with the average call-time be designed so that the user or customer knows what area metric, is critical in determining the number of staff or subject questions are about and which incident or required. service they are referring to. The Service Desk must act on low satisfaction levels and any feedback received. Further general details on metrics and how they should be used to drive forward service quality is included in the To allow adequate comparisons, the same percentage of Continual Service Improvement publication. calls should be selected in each period and they should be rigorously carried out despite any other time pressures. 188.8.131.52 Customer/user satisfaction surveys Surveys are a complex and specialized area, requiring a As well as tracking the ‘hard’ measures of the Service good understanding of statistics and survey techniques. 
Desk’s performance (via the metrics described above), it is This publication will not attempt to provide an overview also important to assess ‘soft’ measures – such as how of all of these, but a summary of some of the more widely well the customers and users feel their calls have been used techniques and tools is listed in Table 6.1. answered, whether they feel the Service Desk operator was Table 6.1 Survey techniques and tools Technique/Tool Advantages Disadvantages After-call survey ■ High response rate since the caller ■ People may feel pressured into taking the Callers are asked to remain on the is already on the phone survey, resulting in a negative service phone after the call and then asked ■ Caller is surveyed immediately after experience to rate the service they were the call so their experience is ■ The surveyor is seen as part of the Service provided recent Desk being surveyed, which may discourage open answers Outbound telephone survey ■ Higher response rate since the caller ■ This method could be seen as intrusive, if Customers and users who have is interviewed directly the call disrupts the user or customer from previously used the Service Desk are ■ Specific categories of user or their work contacted some time after their customer can be targeted for ■ The survey is conducted some time after experience with the Service Desk feedback (e.g. 
people who the user or customer used the Service Desk, requested a specific service, or so their perception may have changed people experienced a disruption to a particular service)

Table 6.1 Survey techniques and tools (continued)

Technique/Tool: Personal interviews
Customers and users are interviewed personally by the person doing the survey. This is especially effective for customers or users who use the Service Desk extensively or who have had a very negative experience
Advantages:
■ The interviewer is able to observe non-verbal signals as well as listening to what the user or customer is saying
■ Users and customers feel a greater degree of personal attention and a sense that their answers are being taken seriously
Disadvantages:
■ Interviews are time-consuming for both the interviewer and the respondent
■ Users and customers could turn the interviews into complaint sessions

Technique/Tool: Group interviews
Customers and users are interviewed in small groups. This is good for gathering general impressions and for determining whether there is a need to change certain aspects of the Service Desk, e.g. service hours or location
Advantages:
■ A larger number of users and customers can be interviewed
■ Questions are more generic and therefore more consistent between interviews
Disadvantages:
■ People may not express themselves freely in front of their peers or managers
■ People’s opinions can easily be changed by others in the group during the interview

Technique/Tool: Postal/e-mail surveys
Survey questionnaires are mailed to a target set of customers and users. They are asked to return their responses by e-mail
Advantages:
■ Specific or all customers or users can be targeted
■ Postal surveys can be anonymous, allowing people to express themselves more freely
■ E-mail surveys are not anonymous, but can be created using automated forms that make it convenient and easy for the user to reply and increase the likelihood it will be completed
Disadvantages:
■ Postal surveys are labour intensive to process
■ The percentage of people responding to postal surveys tends to be small
■ Misinterpretation of a question could affect the result

Technique/Tool: Online surveys
Questionnaires are posted on a website and users and customers encouraged via e-mail or links from a popular site to participate in the survey
Advantages:
■ The potential audience of these surveys is fairly large
■ Respondents can complete the questionnaire in their own time
■ The links on popular websites are good reminders without being intrusive
Disadvantages:
■ The percentage of respondents cannot be predicted

6.2.6 Outsourcing the Service Desk

The decision to outsource is a strategic issue for senior managers – and is addressed in detail in the Service Strategy and Service Design publications. Many of the guidelines in this section are not unique to the Service Desk and can be applied to any function, support area or service being outsourced (or out-tasked).

Regardless of the reasons for, or the extent of, the outsourcing contract, it is vital that the organization retains responsibility for the activities and services provided by the Service Desk. The organization is ultimately responsible for the outcomes of the decision and must therefore determine what service the outsourcer provides, not the other way round.

If the outsourcing route is chosen, there are some safeguards that are needed to ensure that the outsourced Service Desk works effectively and efficiently with the organization’s other IT teams and departments and that end-to-end Service Management control is maintained (this is particularly important for organizations seeking ISO/IEC 20000 certification as overall management control has to be demonstrated). Some of these safeguards are set out below.

6.2.6.1 Common tools and processes

The Service Desk does not have responsibility for all the processes and procedures that it initiates. For example, a Service Request is received by the Service Desk but the request is fulfilled by the internal IT Operational team.

If the Service Desk is outsourced, care must be taken that the tools are consistent with those still being used in the customer organization. Outsourcing is often seen as an opportunity to replace outdated or inadequate tools, only to find that there are severe integration problems between the new tool and the legacy tools and processes.

For this reason it is important to ensure that these issues are properly researched and the customer’s requirements are adequately scoped and specified before the outsourcing contract. Service Desk tools must not only support the outsourced Service Desk, but they must support the customer organization’s processes and business requirements as well.

Ideally the outsourced desk should use the same tools and processes (or, as a minimum, interfacing tools and processes) to allow smooth process flow between the Service Desk and second- and third-line support groups.

In addition, the outsourced Service Desk should have access to:
■ All incident records and information
■ Problem Records and information
■ Known Error Data
■ Change Schedule
■ Sources of internal knowledge (especially technical or application experts)
■ SKMS
■ CMS
■ Alerts from monitoring tools.

It is often a challenge integrating processes and tools in a less mature organization with those in a more mature organization. A common but incorrect assumption is that the maturity of the one organization will somehow result in higher maturity in the other. Active involvement to ensure alignment of processes and tools is essential to a smooth transition and ongoing management of services between the internal and external organizations. In fact, if this is not directly addressed, it could result in the failure of the contract.

It is also often incorrectly assumed that the proof of Service Management quality and maturity in an external outsource partner can be guaranteed by stating requirements in the procurement process for ‘ITIL conformance’ and/or ‘ISO/IEC 20000 certification’. These statements may indicate that a potential supplier uses the ITIL Framework in its delivery of services to customers, or that they have achieved standards certification for their internal practices, but it is equally important to have the enabling technology in place and being used that demonstrates a service provider’s capability to manage services and interface to internal practices harmoniously. There is no standard of compliance that ensures this and so procurement efforts should include specific queries to satisfy this requirement. More information on outsource provider acquisition can be found in the Service Design publication.

6.2.6.2 SLA targets

The SLA targets for overall incident-handling and resolution times need to be agreed with the customers and between all teams and departments – and OLA/UC targets need to be coordinated and agreed with individual support groups so that they underpin and support the SLA targets.

Examples of these can be seen in the section on metrics above (see section 6.2.5).

6.2.6.3 Good communications

The lines of communication between the outsourced Service Desk and the other support groups need to work very effectively. This can be assisted by some or all of the following steps:
■ Close physical co-location
■ Regular liaison/review meetings
■ Cross-training tutorials between the teams and departments
■ ‘Partnership’ arrangements when staff from both organizations are used jointly to staff the desk
■ Communication Plans and performance targets are documented in a consistent manner in OLAs and UCs.

In cases where the Service Desk is located off-shore, not all of these measures will be possible. However, the need for training and communication of the Service Desk staff is still critical, even more so in cases where there are language and cultural differences.

This will be covered in more detail in ITIL complementary publications, but, as a rule, outsourcing companies who offer off-shore Service Desk solutions should take the following into account:
■ Training programmes focused on cultural understanding of the customer market
■ Language skills – especially the understanding of idiomatic use of the language in the customer market. This is not so that the Service Desk staff sound like natives of the customer’s country (that type of insincerity is very quickly detected by customers), but to facilitate better understanding of the customer and the better to appreciate their priorities
■ Regular visits by representatives of the customer organization to provide training and appropriate feedback directly to the Service Desk management and staff
■ Training in the use of the customer organization’s tools and methods of work. This is especially effective if similar training materials are presented by the same instructors as those used by the customer organization.

6.2.6.4 Ownership of data

Clear ownership of the data collected by the outsourced Service Desk must be established. Ownership of all data relative to users, customers, affected CIs, services, incidents, Service Requests, changes, etc. must remain with the organization that is outsourcing the activity – but both organizations will require access to it.

Data that is related specifically to performance of employees of the outsourcing company will remain the property of that company, which is often legally prevented from sharing the data with the customer organization. This may also be true of other data that is used purely for the internal management of the Service Desk, such as head count, optimization activities, Service Desk cost information, etc.

All reporting requirements and issues around ownership of data must be specified in the underpinning contract with the company providing the outsourcing service.

6.3 TECHNICAL MANAGEMENT

Technical Management refers to the groups, departments or teams that provide technical expertise and overall management of the IT Infrastructure.

6.3.1 Technical Management role

Technical Management plays a dual role:
■ It is the custodian of technical knowledge and expertise related to managing the IT Infrastructure. In this role, Technical Management ensures that the knowledge required to design, test, manage and improve IT services is identified, developed and refined.
■ It provides the actual resources to support the ITSM Lifecycle. In this role Technical Management ensures that resources are effectively trained and deployed to design, build, transition, operate and improve the technology required to deliver and support IT services.

By performing these two roles, Technical Management is able to ensure that the organization has access to the right type and level of human resources to manage technology and, thus, to meet business objectives. Defining the requirements for these roles starts in Service Strategy and is expanded in Service Design, validated in Service Transition and refined in Continual Service Improvement (see other ITIL publications in this series).

Part of this role is also to ensure a balance between the skill level, utilization and the cost of these resources. For example, hiring a top-level resource at the higher end of the salary scale and then only using that skill for 10% of the time is not effective. A better Technical Management strategy would be to identify the times that the skill is needed and then hire a contractor for only those tasks.

Another strategy in larger organizations is to leverage specialist staff out of ‘central’ pools so that specialists can be well utilized and provide an economy of scale to the organization and minimize the need to hire in contractors. Specialized skills should be identified among resources in the IT organization, then leveraged for specific needs as they arise, analogous to a special tactical unit, whose members also perform regular duties but who are assigned to tasks needing their specialized skills. This type of resource utilization is particularly useful both for project teams and problem resolution.

An additional, but very important role played by Technical Management is to provide guidance to IT Operations about how best to carry out the ongoing operational management of technology. This role is partly carried out during the Service Design process, but it is also a part of everyday communication with IT Operations Management as they seek to achieve stability and optimum performance.

The objectives, activities and structures that enable Technical Management to perform these roles effectively are discussed below.

6.3.2 Technical Management objectives

The objectives of Technical Management are to help plan, implement and maintain a stable technical infrastructure to support the organization’s business processes through:
■ Well designed and highly resilient, cost-effective technical topology
■ The use of adequate technical skills to maintain the technical infrastructure in optimum condition
■ Swift use of technical skills to speedily diagnose and resolve any technical failures that do occur.

6.3.3 Generic Technical Management activities

Technical Management is involved in two types of activity:
■ Activities that are generic to the Technical Management function as a whole are discussed in this section as they enable Technical Management as a function to execute its role.
■ A set of discrete activities and processes, which are performed by all three functions of Technical, Application and IT Operations Management, are covered in Chapter 5.

Generic Technical Management activities are highlighted as follows:
■ Identifying the knowledge and expertise required to manage and operate the IT Infrastructure and to deliver IT services. This process starts during the Service Strategy phase, is expanded in detail in Service Design and is executed in Service Operation. Ongoing assessment and updating of these skills is done during Continual Service Improvement.
■ Documentation of the skills that exist in the organization, as well as those skills that need to be developed. This will include the development of Skills Inventories and the performance of Training Needs Analyses.
■ Initiating training programmes to develop and refine the skills in the appropriate technical resources and maintaining training records for all technical resources.
■ Design and delivery of training for users, the Service Desk and other groups. Although training requirements must be defined in Service Design, they are executed in Service Operation. Where Technical Management does not deliver training, it is responsible for identifying organizations that can provide it.
■ Recruiting or contracting resources with skills that cannot be developed internally, or where there are insufficient people to perform the required Technical Management activities.
■ Procuring skills for specific activities where the required skills are not available internally or in the open market, or where it is more cost-efficient to do so.
■ Definition of standards used in the design of new architectures and participation in the definition of technology architectures during the Service Strategy and Design phases.
■ Research and development of solutions that can help expand the Service Portfolio or which can be used to simplify or automate IT Operations, reduce costs or increase levels of IT service.
■ Involvement in the design and building of new services. Technical Management will contribute to the design of the Technical Architecture and Performance standards for IT services. In addition, it will also be responsible for specifying the operational activities required to manage the IT Infrastructure on an ongoing basis.
■ Involvement in projects, not only during Service Design and Service Transition, but also for Continual Service Improvement or operational projects, such as Operating System upgrades, server consolidation projects or physical moves.
■ Availability and Capacity Management are dependent on Technical Management for engineering IT services to meet the levels of service required by the business. This means that modelling and workload forecasting are often done with Technical Management resources.
■ Assistance in assessing risk, identifying critical service and system dependencies and defining and implementing countermeasures.
■ Designing and performing tests for the functionality, performance and manageability of IT services.
■ Managing vendors. Many Technical Management departments or groups are the only ones who know exactly what is required of a vendor and how to measure and manage them. For this reason, many organizations rely on Technical Management departments to manage contracts with vendors of specific CIs. If this is the case it is important to ensure that these relationships are managed as part of the SLM process.
■ Definition and management of Event Management standards and tools. Technical Management will also monitor and respond to many categories of events.
■ Technical Management departments or groups are integral to the performance of Incident Management. They receive incidents through Functional Escalation and provide second- and higher-level support. They are also involved in maintaining categories and defining the escalation procedures that are executed in Incident Management.
■ Technical Management as a function provides the resources that execute the Problem Management process. It is its technical expertise and knowledge that is used to diagnose and resolve problems. It is also its relationship with the vendors that is used to escalate and follow up with vendor support teams.
■ Technical Management resources will be involved in defining coding systems that are used in Incident and Problem Management (e.g. Incident Categories).
■ Technical Management resources are used to support Problem Management in validating and maintaining the KEDB.
■ Change Management relies on the technical knowledge and expertise to evaluate changes, and many changes will be built by Technical Management.
■ Releases are frequently deployed using Technical Management resources.
■ Technical Management will provide information for, and operationally maintain, the Configuration Management system and its data. This will be done in cooperation with Application Management to ensure that the correct CI attributes and relationships are created from the deployment of services and the ongoing maintenance over the life of CIs.
■ Technical Management is involved in the Continual Service Improvement processes, particularly in identifying opportunities for improvement and then in helping to evaluate alternative solutions.
■ As a custodian of technical knowledge and expertise, Technical Management ensures that all system and operating documentation is up to date and properly utilized. This includes ensuring that all management, administration and user manuals are up to date and complete and that technical staff are familiar with their contents.
■ Updating and maintaining data used for reporting on technical and service capabilities, e.g. Capacity and Performance Management, Availability Management, Problem Management, etc.
■ Assisting IT Financial Management to identify the cost of technology and IT human resources used to manage IT services.
■ Involvement in defining the operational activities performed as part of IT Operations Management. Many Technical Management departments, groups or teams also perform the operational activities as part of an organization’s IT Operations Management function.

6.3.4 Technical Management organization

Technical Management is not normally provided by a single department or group. One or more Technical Support teams or departments will be needed to provide technical management and support for the IT Infrastructure. In all but the smallest organizations, where a single combined team or department may suffice, separate teams or departments will be needed for each type of infrastructure being used.

IT Operations Management consists of a number of technological areas. Each of these requires a specific set of skills to manage and operate it. Some skill sets are related and can be performed by generalists, whereas others are specific to a component, system or platform.

The primary criterion of Technical Management organizational structure is that of specialization or division of labour. The principle is that people are grouped according to their technical skill sets, and that these skill sets are determined by the technology that needs to be managed.

Sections 6.6 and 6.7 cover the organizational aspects of Technical Management in detail, but this list provides some examples of typical Technical Management teams or departments:
■ Mainframe team or department – if one or more mainframe types are still being used by the organization
■ Server team or department – often split again by technology types (e.g. Unix server, Wintel server)
■ Storage team or department, responsible for the management of all data storage devices and media
■ Network Support team or department, looking after the organization’s internal WANs/LANs and managing any external network suppliers
■ Desktop team or department, responsible for all installed desktop equipment
■ Database team or department, responsible for the creation, maintenance and support of the organization’s databases
■ Middleware team or department, responsible for the integration, testing and maintenance of all middleware in use in the organization
■ Directory Services team or department, responsible for maintaining access and rights to service elements in the infrastructure
■ Internet or Web team or department, responsible for managing the availability and security of access to servers and content by external customers, users and partners
■ Messaging team or department, responsible for e-mail services
■ IP-based Telephony team or department (e.g. VoIP).

6.3.5 Technical Design and Technical Maintenance and Support

Technical Management consists of specialist technical architects and designers (who are primarily involved during Service Design) and specialist maintenance and support staff (who are primarily involved during Service Operation).

In this publication, they are viewed as being part of the same function, but many organizations see them as two separate teams or even departments. The problem with this approach is that good design needs input from the people who are required to manage the solution – and good operation requires involvement from the people who designed the solution.

The problems that need to be overcome are similar to those faced in managing the Application Lifecycle (see section 6.5 for a more detailed discussion). The solution will include the following elements:
■ Support staff should be involved during the design or architecture of a solution. Design staff should be involved in setting maintenance objectives and resolving support issues.
■ A change in how both Design and Support staff are measured. Designers should be held partly accountable for design flaws that create operational outages. Support staff should be held partly accountable for contribution to the technical architecture.

6.3.6 Technical Management metrics

Metrics for Technical Management will largely depend on which technology is being managed, but some generic metrics include:
■ Measurement of agreed outputs. These could include:
  ● Contribution to achievement of services to the business. Although many of the Technical Management teams will not be in direct contact with the business, the technology they manage impacts the business. Metrics should reflect both negative (incidents traced to their team) and positive (system performance and availability) contributions
  ● Transaction rates and availability for critical business transactions
  ● Service Desk training
  ● Recording problem resolutions into the KEDB
  ● User measures of the quality of outputs as defined in the SLAs
  ● Installation and configuration of components under their control.
■ Process metrics. Technical Management teams execute many Service Management process activities. Their ability to do so will be measured as part of the process metrics where appropriate (see section on each process for more details). Examples include:
  ● Response time to events and event completion rates
  ● Incident resolution times for second- and third-line support
  ● Problem resolution statistics
  ● Number of escalations and reason for those escalations
  ● Number of changes implemented and backed out
  ● Number of unauthorized changes detected
  ● Number of releases deployed, total and successful
  ● Security issues detected and resolved
  ● Actual system utilization against Capacity Plan forecasts (where the team has contributed to the development of the plan)
  ● Tracking against SIPs
  ● Expenditure against budget.
■ Technology performance. These metrics are based on Service Design specifications and technical performance standards set by vendors, and will typically be contained in OLAs or Standard Operating Procedures. Actual metrics will vary by technology, but are likely to include:
  ● Utilization rates (e.g. memory or processor for server, bandwidth for networks, etc.)
  ● Availability (of systems, network, devices, etc.), which is helpful for measuring team or system performance, but is not to be confused with Service Availability – which requires the ability to measure the overall availability of the service and may use the availability figures for a number of individual systems or components
  ● Performance (e.g. response times, queuing rates, etc.).
■ Mean Time Between Failures of specified equipment. This metric is used to ensure that good purchasing decisions are being made and, when compared with maintenance schedules, whether the equipment is being properly maintained
■ Measurement of maintenance activity, including:
  ● Maintenance performed per schedule
  ● Number of maintenance windows exceeded
  ● Maintenance objectives achieved (number and percentage).
■ Training and skills development. These metrics ensure that staff have the skills and training to manage the technology that is under their control, and will also identify areas where training is still required.

6.3.7 Technical Management documentation

Technical Management is involved in drafting and maintaining several documents as part of other processes (e.g. Capacity Planning, Change Management, Problem Management, etc.). These documents are discussed in some detail in the relevant process descriptions. However, there are some documents that are specific to the Technical Management groups or teams who will provide document management and control for documents relating to the technology under their control. Technical Management documentation includes the following.

6.3.7.1 Technical documentation

The sourcing and maintenance of technical documentation for all CIs is the responsibility of Technical Management. These include:
■ Technical manuals
■ Management and administration manuals
■ User manuals for CIs. These will typically exclude application user manuals, which are maintained by Application Management.

6.3.7.2 Maintenance Schedules

These schedules are drawn up and agreed during the Service Design phase related to Availability and Capacity Management, but they are essentially the property of the various Technical Management departments, groups or teams. This is because they have the technical expertise for specific technologies and are most likely to know what is needed to keep them in working order.

For more details on the definition of Maintenance Schedules and Service Maintenance Objectives, refer to the ITIL Service Design publication.

6.3.7.3 Skills Inventory

A Skills Inventory is a system or tool that identifies the skills required to deliver and support IT services and also the individuals who possess those skills. Skills Inventories are most effective if they are aligned with processes, architectures and performance standards.

In addition, Skills Inventories should identify the training available to cultivate each skill should existing staff leave the organization.

Skills Inventories can also be used as part of the Service Portfolio to assess whether a new service can be delivered with existing staff and skill sets, or whether an investment needs to be made in new staff or training. Skills Inventories can therefore contribute significantly to Capacity Planning.

The definition and maintenance of Skills Inventories requires a good interface with Human Resource processes and tools in the organization.

6.4 IT OPERATIONS MANAGEMENT

In business, the term ‘Operations Management’ is used to mean the department, group or team of people responsible for performing the organization’s day-to-day operational activities – such as running the production line in a manufacturing environment or managing the distribution centres and fleet movements within a logistics organization.

Operations Management generally has the following characteristics:
■ There is work to ensure that a device, system or process is actually running or working (as opposed to strategy or planning)
■ This is where plans are turned into actions
■ The focus is on daily or shorter-term activities, although it should be noted that these activities will generally be performed and repeated over a relatively long period (as opposed to one-off project type activities)
■ These activities are executed by specialized technical staff, who often have to undergo technical training to learn how to perform each activity
■ There is a focus on building repeatable, consistent actions that – if repeated frequently enough at the right level of quality – will ensure the success of the operation
■ This is where the actual value of the organization is delivered and measured
■ There is a dependency on investment in equipment or human resources or both
■ The value generated must exceed the cost of the investment and all other organizational overheads (such as management and marketing costs) if the business is to succeed.

In a similar way, IT Operations Management can be defined as the function responsible for the ongoing management and maintenance of an organization’s IT
Infrastructure to ensure delivery of the agreed level of IT services to the business.

IT Operations can be defined as the set of activities involved in the day-to-day running of the IT Infrastructure for the purpose of delivering IT services at agreed levels to meet stated business objectives.

6.4.1 IT Operations Management role

The role of Operations Management is to execute the ongoing activities and procedures required to manage and maintain the IT Infrastructure so as to deliver and support IT Services at the agreed levels. These have already been described in section 5, but are summarized here for completeness:
■ Operations Control, which oversees the execution and monitoring of the operational activities and events in the IT Infrastructure. This can be done with the assistance of an Operations Bridge or Network Operations Centre. In addition to executing routine tasks from all technical areas, Operations Control also performs the following specific tasks:
  ● Console Management, which refers to defining central observation and monitoring capability and then using those consoles to exercise monitoring and control activities
  ● Job Scheduling, or the management of routine batch jobs or scripts
  ● Backup and Restore on behalf of all Technical and Application Management teams and departments and often on behalf of users
  ● Print and Output management for the collation and distribution of all centralized printing or electronic output
  ● Performance of maintenance activities on behalf of Technical or Application Management teams or departments.
■ Facilities Management, which refers to the management of the physical IT environment, typically a Data Centre or computer rooms and recovery sites together with all the power and cooling equipment. Facilities Management also includes the coordination of large-scale consolidation projects, e.g. Data Centre consolidation or server consolidation projects. In some cases the management of a data centre is outsourced, in which case Facilities Management refers to the management of the outsourcing contract.

As with many IT Service Management processes and functions, IT Operations Management plays a dual role.
■ IT Operations Management is responsible for executing the activities and performance standards defined during Service Design and tested during Service Transition. In this sense IT Operations’ role is primarily to maintain the status quo. The stability of the IT infrastructure and consistency of IT Services is a primary concern of IT Operations. Even operational improvements are aimed at finding simpler and better ways of doing the same thing.
■ At the same time, IT Operations is part of the process of adding value to the different lines of business and to support the value network (see the ITIL Service Strategy publication). The ability of the business to meet its objectives and to remain competitive depends on the output and reliability of the day-to-day operation of IT. As such, IT Operations Management must be able to continually adapt to business requirements and demand. The Business does not care that IT Operations complied with a standard procedure or that a server performed optimally. As business demand and requirements change, IT Operations Management must be able to keep pace with them, often challenging the status quo.

IT Operations must achieve a balance between these roles, which will require the following:
■ An understanding of how technology is used to provide IT services
■ An understanding of the relative importance and impact of those services on the business
■ Procedures and manuals that outline the role of IT Operations in both the management of technology and the delivery of IT services
■ A clearly differentiated set of metrics to report to the business on the achievement of Service objectives; and to report to IT managers on the efficiency and effectiveness of IT Operations
■ All IT Operations staff understand exactly how the performance of the technology affects the delivery of IT services
■ A cost strategy aimed at balancing the requirements of different business units with the cost savings available through optimization of existing technology or investment in new technology
■ A value, rather than cost, based Return on Investment strategy.

6.4.2 IT Operations Management objectives

The objectives of IT Operations Management include:
■ Maintenance of the status quo to achieve stability of the organization’s day-to-day processes and activities
■ Regular scrutiny and improvements to achieve improved service at reduced costs, while maintaining stability
■ Swift application of operational skills to diagnose and resolve any IT operations failures that occur.

6.4.3 IT Operations Management organization

Figure 6.1 in the introduction to Chapter 6 illustrated that IT Operations Management is seen as a function in its own right but that, in many cases, staff from Technical and Application Management groups form part of this function.

This means that some Technical and Application Management departments or groups will manage and execute their own operational activities. Others will delegate these activities to a dedicated IT Operations department.

There is no single method for assigning activities, as it depends on the maturity and stability of the infrastructure being managed. For example, Technical and Application Management areas that are fairly new and unstable tend to manage their own operations. Groups where the technology or application is stable, mature and well understood tend to have standardized their operations more and will therefore feel more comfortable delegating these activities.

Some options of how to structure IT Operations are discussed in detail in section 6.7 of this publication.

6.4.4 IT Operations Management metrics

IT Operations Management is measured in terms of its effective execution of specified activities and procedures, as well as its execution of process activities. Examples of these are as follows:
■ Successful completion of scheduled jobs
■ Number of exceptions to scheduled activities and jobs
■ Number of data or system restores required
■ Equipment installation statistics, including number of items installed by type, successful installations, etc.
■ Process metrics. IT Operations Management executes many Service Management process activities. Their ability to do so will be measured as part of the process metrics where appropriate (see section on each process for more details). Examples include:
  ● Response time to events
  ● Incident resolution times for incidents
  ● Number of security-related incidents
  ● Number of escalations and reason for those escalations
  ● Number of changes implemented and backed out
  ● Number of unauthorized changes detected
  ● Number of releases deployed, total and successful
  ● Tracking against SIPs
  ● Expenditure against budget.
■ If maintenance activities have been delegated, then metrics related to these activities will also be appropriate:
  ● Maintenance performed per schedule
  ● Number of maintenance windows exceeded
  ● Maintenance objectives achieved (number and percentage).
■ Metrics related to Facilities Management are extensive, but typically include:
  ● Costs versus budget related to maintenance, construction, security, shipping, etc.
  ● Incidents related to the building, e.g. repairs needed to the facility
  ● Reports on access to the facility
  ● Number of security events and Incidents and their resolution
  ● Power usage statistics, especially as related to changes in layout and environmental conditioning strategies
  ● Events or incidents related to shipping and distribution.

6.4.5 IT Operations Management documentation

A number of documents are produced and used during IT Operations Management. This list is a summary of some of the most important and does not include reports that are produced by IT Operations Management on behalf of other processes or functions.

6.4.5.1 Standard Operating Procedures

The SOPs are a set of documents containing detailed instructions and activity schedules for every IT Operations Management team, department or group.

These documents represent the routine work that needs to be done for every device, system or procedure. They also outline the procedures to be followed if an exception is detected or if a change is required.

SOP documents could also be used to define standard levels of performance for devices or procedures. In some organizations the SOP documents are referred to in the OLA. Instead of listing detailed performance measures in the OLA, a clause is inserted to refer to the performance standards in the SOP and how these will be measured and reported.

6.4.5.2 Operations Logs

Any activity that is conducted as part of IT Operations should be recorded for a number of reasons, including:
■ They can be used to confirm the successful completion of specific jobs or activities
■ They can be used to confirm that an IT service was

could simply be listed briefly with a reference to the section or page in the SOP.

Most Shift Schedules take the form of a checklist where operators can check off the item as it is completed, together with the time of completion. This makes it easy to see the progress of activities and also helps to identify any potential issues where jobs are taking too long.
delivered as agreed ■ They can be used by Problem Management to Shift Reports are a form of Operations Log, but have the research the root cause of incidents additional functions as follows: ■ They are the basis for reports on the performance of ■ To record major events and actions that occurred the IT Operations Management teams and during the shift departments. ■ To form part of the handover between shift leaders The format of these logs is as varied as the number ■ To report any exceptions to Service Maintenance of systems and Operations Management teams or Objectives departments. Examples of Operations Logs include ■ To identify any uncompleted activity that could result the following: in degraded performance on any service during the ■ Operating System Logs stored on each device next service hours. ■ Application Activity Logs stored in a file on the application server 22.214.171.124 Operations Schedule ■ Event Logs stored on the monitoring tool server The Operations Schedules are similar to Shift Schedules ■ Utilization Logs for key devices but cover all aspects of IT Operations at a high level. This schedule will include an overview of all planned changes, ■ Physical access logs recording who accessed secure maintenance, routine jobs and additional work, together buildings and when with information about upcoming business or vendor ■ Handwritten logs of actions performed by operators. events. The Operations Schedule is used as the basis for This must be in a formal logbook or binder, numbered the Daily Operations Meeting and is the master reference and stored in a secure environment. Checks should for all IT Operations managers to track progress and detect ensure that pages are not removed. exceptions. A policy needs to be established as part of the SOPs to state how long logs need to be kept, how they are 6.5 APPLICATION MANAGEMENT archived and when they can be deleted. These policies will take into account statutory and compliance requirements. 
Application Management is responsible for managing Policies should also specify the parameters for adequate applications throughout their lifecycle. The Application storage and backup strategies to store and retrieve Management function is performed by any department, log files. group or team involved in managing and supporting operational applications. Application Management also 126.96.36.199 Shift Schedules and Reports plays an important role in the design, testing and Shift Schedules are documents that outline the exact improvement of applications that form part of IT services. activities that need to be carried out during the shift. They As such, it may be involved in development projects, will also list all dependencies and activity sequences. There but is not usually the same as the Applications will probably be more than one Shift Schedule, where Development teams. each team will have a version for its own systems. It is important that all schedules are coordinated before the 6.5.1 Application Management role start of the shift. This is usually done by a person who is Application Management is to applications what Technical specialized in Shift Scheduling, with the help of Management is to the IT Infrastructure. Application scheduling tools. Management plays a role in all applications, whether purchased or developed in-house. One of the key A Shift Schedule could consist of a number of routine decisions that they contribute to is the decision of items that are included in the SOP. In this case the items whether to buy an application or build it (this is discussed in detail in the Service Design publication). Once that Organizing for Service Operation | 129 decision is made, Application Management will play These objectives are achieved through: a dual role: ■ Applications that are well designed, resilient and ■ It is the custodian of technical knowledge and cost-effective expertise related to managing applications. 
In this role ■ Ensuring that the required functionality is available to Application Management, working together with achieve the required business outcome Technical Management, ensures that the knowledge ■ The organization of adequate technical skills to required to design, test, manage and improve IT maintain operational applications in optimum services is identified, developed and refined. condition ■ It provides the actual resources to support the ITSM ■ Swift use of technical skills to speedily diagnose and Lifecycle. In this role, Application Management ensures resolve any technical failures that do occur. that resources are effectively trained and deployed to design, build, transition, operate and improve the 6.5.3 Application Management principles technology required to deliver and support IT services. By performing these two roles, Application Management is 188.8.131.52 Build or buy? able to ensure that the organization has access to the One of the key decisions in Application Management is right type and level of human resources to manage whether to buy an application that supports the required applications and thus to meet business objectives. This functionality, or whether to build the application starts in Service Strategy and is expanded in Service specifically for the organization’s requirements. These Design, tested in Service Transition and refined in decisions are often made by a Chief Technical Officer Continual Service Improvement (see other ITIL publications (CTO) or Steering Committee, but they are dependent in this series). on information from a number of sources. These are discussed in detail in Service Design, but are Part of this role is to ensure a balance between the skill summarized here from an Application Management level and the cost of these resources. function perspective. 
In additional to these two high-level roles, Application Application Management will assist in this decision during Management also performs the following two Service Design as follows: specific roles: ■ Application sizing and workload forecasts ■ Providing guidance to IT Operations about how best (see section 4.6.4) to carry out the ongoing operational management of ■ Specification of manageability requirements applications. This role is partly carried out during the ■ Identification of ongoing operational costs Service Design process, but it is also a part of everyday communication with IT Operations ■ Data access requirements for reporting or integration Management as they seek to achieve stability and into other applications optimum performance. ■ Investigating to what extent the required functionality ■ The integration of the Application Management can be met by existing tools – and how much Lifecycle into the ITSM Lifecycle. This is discussed customization will be required to achieve this below. ■ Estimating the cost of customization ■ Identifying what skills will be required to support the The objectives, activities and structures that enable solution (e.g. if an application is purchased, will it Application Management to play these roles effectively are require a new set of employees, or can existing discussed below. employees be trained to support it?) ■ Administration requirements 6.5.2 Application Management objectives ■ Security requirements. The objectives of Application Management are to support the organization’s business processes by helping to If the decision is to build the application, a further identify functional and manageability requirements for decision needs to be made on whether the development application software, and then to assist in the design and will be outsourced or built using employees. 
This is deployment of those applications and the ongoing detailed in the Service Strategy and Service Design support and improvement of those applications. publications, but there are some important considerations affecting Service Operation, for example: 130 | Organizing for Service Operation ■ How will manageability requirements be specified and This should not replace the SDLC, which is still a valid agreed (e.g. designing application and transaction approach used by developers, especially by third-party monitoring)? These are sometimes forgotten when the software companies. However, it does mean that there operational teams or departments are not represented should be greater alignment between the development in the project view of applications and the ‘live’ management of those ■ What are the Acceptance Criteria for operational applications. performance; how and where will the solution be This is more difficult in large-scale purchased applications, tested and who will perform the tests? such as e-mail, since the developers do not typically ■ Who will own and manage the Definitive Library for interact individually with their application’s users. that application? However, the basic lifecycle still holds true in that the ■ Who will design and maintain the operational application needs requirements, design, customization, management and administration scripts for these operation and deployment. Optimization is achieved applications? through better management, improvements to ■ Who is responsible for environment set-up and customization and upgrades. owning and maintaining the different infrastructure The Application Management Lifecycle is illustrated as components? follows: ■ How will the solution be instrumented so that it is capable of generating the required events? Requirements 184.108.40.206 Operational Models An Operational Model is the specification of the operational environment in which the application will eventually run when it goes live. 
This will be used during testing and transition phases to simulate and evaluate the Optimize Design live environment. This is a way of ensuring that the application can be sized correctly and the required environmental conditions can be documented and understood by all. The Operational Model should be defined and used in testing during the Service Design and Service Transition phases respectively (see Service Design Operate Build and Service Transition publications). 6.5.4 Application Management Lifecycle The lifecycle followed to develop and manage applications has been referred to by many names, including the Deploy Software Lifecycle (SLC) and Software Development Lifecycle (SDLC). These are generally used by Applications Development teams and their Project Managers to define Figure 6.5 Application Management Lifecycle their involvement in designing, building, testing, ITSM processes and Applications Development processes deploying and supporting applications. Examples of these have to be aligned as part of the overall strategy of approaches are Structured Systems Analysis and Design delivering IT services in support of the business. Methodology (SSADM), Dynamic Systems Development Method (DSDM), Rapid Application Development (RAD), Applications Development and Operations are part of the etc. same overall lifecycle and both should be involved at all stages, although their level of involvement will vary ITIL is primarily interested in the overall management depending on the stage of the lifecycle. of applications as part of IT Services, whether they are developed in-house or purchased from a third party. For this reason, the term Application Management Lifecycle has been used, as it implies a more holistic view. 
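As an illustration only, the circular lifecycle of Figure 6.5 can be modelled in a few lines of Python. This is a minimal sketch, not part of the ITIL guidance: the names `Phase`, `next_phase` and `walk` are hypothetical, and the phase order (Requirements, Design, Build, Deploy, Operate, Optimize, then back to Requirements) is assumed from the labels in the figure.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    """Lifecycle phases, in the order assumed from Figure 6.5 (hypothetical model)."""
    REQUIREMENTS = 0
    DESIGN = 1
    BUILD = 2
    DEPLOY = 3
    OPERATE = 4
    OPTIMIZE = 5

    def next_phase(self) -> "Phase":
        # The lifecycle is circular: Optimize leads back to Requirements.
        phases = list(Phase)
        return phases[(self.value + 1) % len(phases)]


def walk(start: Phase, steps: int) -> List[Phase]:
    """Return the phases visited when advancing `steps` times from `start`."""
    visited = [start]
    for _ in range(steps):
        visited.append(visited[-1].next_phase())
    return visited


if __name__ == "__main__":
    # Follow an application from Operate through three further phases.
    for phase in walk(Phase.OPERATE, 3):
        print(phase.name.title())
```

Because the sequence wraps around, an application in the Optimize phase moves on to a new round of Requirements rather than reaching an end state, which is the sense in which the lifecycle is circular.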
Relationship between the Application Management and Service Management Lifecycles
The Application Management Lifecycle should not be seen as an alternative to the Service Management Lifecycle. Applications are part of services and have to be managed as such. Nevertheless, applications are a unique blend of technology and functionality and this requires a specialized focus at each stage of the Service Management Lifecycle.

Each stage of the Application Management Lifecycle has its own specific set of objectives, activities, deliverables and dedicated teams. Each stage also has a clear responsibility to ensure that its outputs match up to the specific objectives of the Service Management Lifecycle. Different aspects of Application Management are covered in detail in each of the ITIL publications, as follows:
■ Service Strategy: Defines the overall architecture of applications and infrastructure. This will include defining the criteria for developing in-house, outsourcing development, or purchasing and customizing applications. Service Strategy will also assist in defining the Service Portfolio (including applications) which also includes information about the Return on Investment of applications and the services they support. Thus high-level requirements are set during this phase.
■ Service Design: Helps to establish requirements for functionality and manageability of applications and works with Development teams to ensure that they meet these objectives. Service Design covers most of the Requirements phase and is involved during the Build phase of the Application Management Lifecycle.
■ Service Transition: Application Development and Management teams are involved in testing and validating what has been built and deploying it operationally.
■ Service Operation: This covers the Operate phase of the Application Management Lifecycle. These processes and structures are discussed in detail in this publication.
■ Continual Service Improvement: Covers the Optimize phase of the Application Management Lifecycle. Continual Service Improvement measures the quality and relevance of applications in operation and provides recommendations on how to improve applications if there is a clear Return on Investment for doing so.

6.5.4.1 Requirements
This is the phase during which the requirements for a new application are gathered, based on the business needs of the organization. This phase is active primarily during the Service Design phase of the ITSM Lifecycle.

There are six types of requirements for any application, whether being developed in-house, outsourced or purchased:
■ Functional requirements are those specifically required to support a particular business function
■ Manageability requirements, looked at from a Service Management perspective, address the need for a responsive, available and secure service, and deal with such issues as deployment, operations, system management and security
■ Usability requirements are those that address the needs of the end user, and result in features of the system that facilitate its ease of use
■ Architectural requirements, especially if this requires a change to existing architecture standards
■ Interface requirements, where there are dependencies between existing applications or tools and the new application
■ Service Level Requirements, which specify how the service should perform, the quality of its output and any other qualitative aspects measured by the user or customer.

6.5.4.2 Design
This is the phase during which requirements are translated into specifications. Design includes the design of the application itself, and the design of the environment, or operational model, that the application has to run on. Architectural considerations are the most important aspect of this phase, since they can impact on the structure and content of both application and operational model. Architectural considerations for the application (design of the application architecture) and architectural considerations for the operational model (design of the system architecture) are strongly related and need to be aligned.

In the case of purchased software, most organizations will not be allowed direct input to the design of the software (which has already been built). However, it is important that Application Management is able to provide feedback to the software vendor about the functionality, manageability and performance of the software. This will, in turn, be taken up by the software vendor as part of the continual improvement of the software.

Part of the evaluation process for purchased software should include an evaluation of whether the vendor is responsive to such feedback. At the same time, they should ensure that there is a balance between being responsive and changing their software so much that it is disruptive or that it changes some basic functionality.

Design for purchased software will also include the design of any customization that is required. Of special importance here is an evaluation of whether future versions of the software will support the customization.

6.5.4.3 Build
In the Build phase, both the application and the operational model are made ready for deployment. Application components are coded or acquired, integrated and tested.

Please note that Test is not a separate stage in the lifecycle, even though it is a discrete activity, and even though tests are conducted independently of both the development and operational activities. Without the Build and Deploy phases, there would be nothing to test and, without testing, there would be no control over what is developed and deployed.

Testing is an integral component of both the Build and Deploy phases as a validation of the activity and output of those phases – even if it uses different environments and staff. Testing in the Build phase focuses on whether the application meets its functionality and manageability specifications. Often the distinction is made between a development and test environment. The test environment allows for testing the combination of application and operational model. Testing is covered in the ITIL Service Transition publication.

For purchased software, this will involve the actual purchase of the application, any required middleware and the related hardware and networking equipment. Any customization that is required will need to be done here, as will the creation of tables, categories, etc. that will be used. This is often done as a pilot implementation by the relevant Application Management team or department.

6.5.4.4 Deploy
In this phase, both the operational model and the application are deployed. The operational model is incorporated in the existing IT environment and the application is installed on top of the operational model, using the Release and Deployment Management process described in the ITIL Service Transition publication.

Testing also takes place during this phase, although here the emphasis is on ensuring that the deployment process and mechanisms work effectively, e.g. testing whether the application still functions to specification after it has been downloaded and installed. This is known as Early Life Support and covers a pre-defined guarantee period during which testing, validation and monitoring of a new application or service take place. Early Life Support is covered in detail in the Service Transition publication.

6.5.4.5 Operate
In the Operate phase, the IT services organization operates the application as part of delivering a service required by the business. The performance of the application in relation to the overall service is measured continually against the Service Levels and key business drivers. It is important to note that applications themselves do not equate to a service. It is common in many organizations to refer to applications as ‘services’; however, applications are but one component of many needed to provide a business service.

The Operate phase is not exclusive to applications and is discussed throughout this publication, with a more detailed list of activities given in section 6.5.5 below.

6.5.4.6 Optimize
In the Optimize phase, the results of the Service Level performance measurements are analysed and acted upon. Possible improvements are discussed and developments initiated if necessary. The two main strategies in this phase are to maintain and/or improve the Service Levels and to lower cost. This could lead to iteration in the lifecycle or to justified retirement of an application.

One important thing to remember about the Application Management Lifecycle is that, because it is circular, the same application can reside in different phases of the lifecycle at the same time. For example, when the next version of an application is being designed, and the current version is being deployed, the previous version might still be in operation in parts of an organization. This obviously requires strong version, configuration and release control.

Particular phases might take longer or seem more significant than others, but they are all crucial. Every application must go through all of them at least once and, because of the circular nature of the lifecycle, will go through some more than once.

This approach also supports iterative development approaches, where software is continually being developed in incremental steps.
Each step follows the lifecycle and the application is built in increments, using business priorities as a driver.

Good communication is the key as an application works its way through the phases of the lifecycle. It is critical that high-quality information is passed along by those handling the application in one phase of its existence to those handling it in the next phase. It is also important that an organization monitors the quality of the Application Management Lifecycle. Changes in the lifecycle, for example in the way an organization passes information between the different phases, will affect its quality. Understanding the characteristics of every phase in the Application Management Lifecycle is crucial to improving the quality of the whole. Methods and tools used in one phase might have an impact on others, while optimization of one phase might sub-optimize the whole.

6.5.5 Application Management generic activities
While most Application Management teams or departments are dedicated to specific applications or sets of applications, there are a number of activities which they have in common. These include:
■ Identifying the knowledge and expertise required to manage and operate applications in the delivery of IT services. This process starts during the Service Strategy phase, is expanded in detail in Service Design and is executed in Service Operation. Ongoing assessment and updating of these skills are done during Continual Service Improvement.
■ Initiating training programmes to develop and refine the skills in the appropriate Application Management resources and maintaining training records for these resources.
■ Recruiting or contracting resources with skills that cannot be developed internally, or where there are insufficient people to perform the required Application Management activities.
■ Design and delivery of end-user training. Training may be developed and delivered by either the Application Development or Application Management groups, or by a third party, but Application Management is responsible for ensuring that training is conducted as appropriate.
■ Insourcing for specific activities where the required skills are not available internally or in the open market, or where it is more cost-efficient to do so.
■ Definition of standards used in the design of new architectures and participation in the definition of application architectures during the Service Strategy processes.
■ Research and Development of solutions that can help expand the Service Portfolio or which can be used to simplify or automate IT Operations, reduce costs or increase levels of IT service.
■ Involvement in the design and building of new services. All Application Management teams or departments will contribute to the design of the Technical Architecture and Performance standards for IT Services. In addition they will also be responsible for specifying the operational activities required to manage applications on an ongoing basis.
■ Involvement in projects, not only during the Service Design process, but also for Continual Service Improvement or operational projects, such as Operating System upgrades, server consolidation projects or physical moves.
■ Designing and performing tests for the functionality, performance and manageability of IT Services (bearing in mind that testing should be controlled and performed by an independent tester – see Service Transition publication).
■ Availability and Capacity Management are dependent on Application Management for contributing to the design of applications to meet the levels of service required by the business. This means that modelling and workload forecasting are often done together with Technical and Application Management resources.
■ Assistance in assessing risk, identifying critical service and system dependencies and defining and implementing countermeasures.
■ Managing vendors. Many Application Management departments or groups are the only ones who know exactly what is required of a vendor and how to measure and manage them. For this reason, many organizations rely on Application Management to manage contracts with vendors of specific applications. If this is the case it is important to ensure that these relationships are managed as part of the SLM process.
■ Involvement in definition of Event Management standards and especially in the instrumentation of applications for the generation of meaningful events.
■ Application Management as a function provides the resources that execute the Problem Management process. It is their technical expertise and knowledge that is used to diagnose and resolve problems. It is also their relationship with the vendors that is used to escalate and follow up with vendor support teams or departments.
■ Application Management resources will be involved in defining coding systems that are used in Incident and Problem Management (e.g. Incident Categories).
■ Application Management resources are used to support Problem Management in validating and maintaining the KEDB together with the Application Development teams.
■ Change Management relies on the technical knowledge and expertise to evaluate changes and many changes will be built by Application Management teams.
■ Successful Release Management is dependent on involvement from Application Management staff. In fact they are frequently the drivers of the Release Management process for their applications.
■ Application Management will define, manage and maintain attributes and relationships of application CIs in the CMS.
■ Application Management is involved in the Continual Service Improvement processes, particularly in identifying opportunities for improvement and then in helping to evaluate alternative solutions.
■ Application Management ensures that all system and operating documentation is up to date and properly utilized. This includes ensuring that all design, management and user manuals are up to date and complete and that Application Management staff and users are familiar with their contents.
■ Collaboration with Technical Management on performing Training Needs Analysis and maintaining Skills Inventories.
■ Assisting IT Financial Management to identify the cost of the ongoing management of applications.
■ Involvement in defining the operational activities performed as part of IT Operations Management. Many Application Management departments, groups or teams also perform the operational activities as part of an organization’s IT Operations Management function.
■ Input into, and maintenance of, software configuration policies.
■ Together with Software Development teams, the definition and maintenance of documentation related to applications. These will include user manuals, administration and management manuals, as well as any SOPs required to manage operational aspects of the application.

Application Management teams or departments will be needed for all key applications. The exact nature of the role will vary depending upon the applications being supported, but generic responsibilities are likely to include:
■ Third-level support for incidents related to the application(s) covered by that team or department
■ Involvement in operation testing plans and deployment issues
■ Application bug tracking and patch management (coding fixes for in-house code, transports/patches for third-party code)
■ Involvement in application operability and supportability issues such as error code design, error messaging, event management hooks
■ Application sizing and performance; volume metrics and load testing etc. This is in support of Capacity and Availability Management processes
■ Involvement in developing Release Policies
■ Identification of enhancements to existing software, both from a functionality and manageability perspective.

6.5.6 Application Management organization
Although all Application Management departments, groups or teams perform similar activities, each application or set of applications has a different set of management and operational requirements. Examples of these differences include:
■ The purpose of the application. Each application was developed to meet a specific set of objectives, usually business objectives. For effective support and improvement, the group that manages that application needs to have a comprehensive understanding of the business context and how the application is used to meet its objectives. This is often achieved by Business Analysts who are close to the business and responsible for ensuring that business requirements are effectively translated into application specifications. Business Analysts should recognize that business requirements must be translated into both functional and manageability specifications.
■ The functionality of the application. Each application is designed to work in a different way and to perform different functions at different times.
■ The platform on which the application runs. Although the platform is usually managed by a Technical Management team or department, each of them affects the way in which an application needs to be managed and operated.
■ The type or brand of technology used. Even applications that have similar functionality operate differently on different databases or platforms. These differences have to be understood in order to manage the application effectively.

Even though the activities to manage these applications are generic, the specific schedule of activities and the way they are performed will be different. For this reason, Application Management teams and departments tend to be organized according to the categories of applications that they support. Typical examples of Application Management organizations include:
■ Financial applications. In larger organizations where a number of different applications are used for different aspects of Financial Management, there may be several departments, groups or teams managing these applications, e.g. Debtors and Creditors, Age Analysis, General Ledger, etc.
■ Messaging and collaboration applications
■ HR applications
■ Manufacturing support applications
■ Sales force automation
■ Sales order processing applications
■ Call centre and marketing applications
■ Business-specific applications (e.g. health care, insurance, banking, etc.)
■ IT applications, such as Service Desk, Enterprise System Management, etc.
■ Web portals
■ Online shopping.

6.5.6.1 Organizational roles
Traditionally, Application Development and Management teams and departments have been autonomous units. Each one manages its own environment in its own way and each has a separate interface to the business. This is illustrated in Table 6.2.

Table 6.2 Organizational roles

Primary focus
  Application Development: Building functionality for their customer. What the application does is more important to them than how it is operated.
  Application Management: Focus on what the functionality is as well as how to deliver it. Manageability aspects of the application, i.e. how to ensure stability and performance of the application.

Management mode
  Application Development: Most development work is done in projects where the focus is on delivering specific units of work to specification, on time and within budget. This means that it is often difficult for developers to understand and build for ongoing operations, especially since they are not available for support of the application once they have moved on to the next project.
  Application Management: Most work is done as part of repeatable, ongoing processes. A relatively small number of people work in projects. This means that it is very difficult for operational staff to get involved in development projects, as that takes them away from their ‘real jobs’.

Measurement
  Application Development: Staff are rewarded for creativity and for completing one project so that they can move on to the next project.
  Application Management: Staff are rewarded for consistency and for preventing unexpected events and unauthorized functionality (e.g. ‘bells and whistles’ added by developers).

Cost
  Application Development: Development projects are relatively easy to quantify since the resources are known and it is easy to link their expenses to a specific application or IT Service.
  Application Management: Ongoing management costs are often mixed in with the costs of other IT services since resources are often shared across multiple IT services and applications.

Lifecycles
  Application Development: Development staff focus on Software Development Lifecycles, which highlight the dependencies for successful operation, but do not assign accountability for these.
  Application Management: Staff involved in ongoing management typically only control one or two phases of these lifecycles – Operation and Improvement.

Over the last several years, these two worlds have been brought together by moves to Object Oriented and SOA approaches, together with growing pressure from the Business to be more responsive and easy to work with.
This means that Application Development will have greater accountability for the successful operation of applications they design, while Application Management will have greater involvement in the development of applications.

This does not change the fundamental role of each group, but it does require a more integrated approach to the SLC. It will also mean that the output of Application Development will be more commoditized and that Application Management will be more involved in Development projects. This will require the following changes:

■ A single interface to the business for all stages of the lifecycle and a common requirements and specification-setting process.
■ A change in how both Development and Management staff are measured. Development teams should be held partly accountable for design flaws that create operational outages. Management staff should be held partly accountable for contribution to the technical architecture and manageability design of applications.
■ A single Change Management process for both groups, with Change Control in each group being subordinate to the overall authority of Change Management (see Service Transition publication).
■ A clear mapping of Development and Management activities in the lifecycle, which is illustrated at a high level in Figure 6.5. The exact activities and how they interact should be defined in each organization, although some generic guidelines are given in each of the ITIL publications.
■ Greater focus on integrating functionality and manageability requirements early in the project.

Figure 6.6 shows a common Application Management Lifecycle with involvement from both groups. In this diagram it is clear that Application Development will be driving some phases with input from Application Management. In other cases Application Management will be driving the phase with input and support from Application Development. Both groups are subordinated to the IT Service Strategy of the organization and their efforts are coordinated through Service Transition mechanisms and processes.

[Figure 6.6 Role of teams in the Application Management Lifecycle. Phase labels: Requirements, Design, Build and Test, Deploy, Operate, Optimize. Centre label: IT Service Management Strategy, Design, Transition and Improvement. Legend: Application Development, Application Management.]

6.5.7 Application Management roles and responsibilities

6.5.7.1 Applications Managers/Team-leaders

An Applications Manager or Team-leader (depending upon the size and/or importance of the team or department and the application they support, and the organization’s structure and culture) will be needed for each of the applications teams or departments. The role will:

■ Take overall responsibility for leadership, control and decision-making for the applications team or department
■ Provide technical knowledge and leadership in the specific applications support activities covered by the team or department
■ Ensure necessary technical training, awareness and experience levels are maintained within the team or department relevant to the applications being supported and processes being used
■ Involve ongoing communication with users and customers regarding application performance and evolving requirements of the business
■ Report to senior management on all issues relevant to the applications being supported
■ Perform line-management for all team or department members.

6.5.7.2 Applications Analyst/Architect

Application Analysts and Architects are responsible for matching requirements to application specifications. Specific activities include:

■ Working with users, sponsors and all other stakeholders to determine their evolving needs
■ Working with Technical Management to determine the highest level of system requirements required to meet the business requirements within budget and technology constraints
■ Performing cost-benefit analyses to determine the most appropriate means to meet the stated requirement
■ Developing Operational Models that will ensure optimal use of resources and the appropriate level of performance
■ Ensuring that applications are designed to be effectively managed given the organization’s technology architecture, available skills and tools
■ Developing and maintaining standards for application sizing, performance modelling, etc.
■ Generating a set of acceptance test requirements, together with the designers, test engineers and the user, which determine that all of the high-level requirements have been met, both functional and with regard to manageability
■ Input into the design of configuration data required to manage and track the application effectively.

An appropriate number of Application Analysts will be needed for each of the Application Management teams or departments to perform the generic activities described in paragraph 6.5.5.

The ways in which Application Management groups can be organized, and the options available, are discussed in some detail in section 6.7 below.

6.5.8 Application Management metrics

Metrics for Application Management will largely depend on which applications are being managed, but some generic metrics include:

■ Measurement of agreed outputs. These could include:
  ● Ability of users to access the application and its functionality
  ● Reports and files are transmitted to the users
  ● Transaction rates and availability for critical business transactions
  ● Service Desk training
  ● Recording problem resolutions into the KEDB
  ● User measures of the quality of outputs as defined in the SLAs.
■ Process metrics. Technical Management teams execute many Service Management process activities. Their ability to do so will be measured as part of the process metrics where appropriate (see section on each process for more details). Examples include:
  ● Response time to events and event completion rates
  ● Incident resolution times for second- and third-line support
  ● Problem resolution statistics
  ● Number of escalations and reason for those escalations
  ● Number of changes implemented and backed out
  ● Number of unauthorized changes detected
  ● Number of releases deployed, total and successful, including ensuring adherence to the Release Policies of the organization
  ● Security issues detected and resolved
  ● Actual system utilization against Capacity Plan forecasts (where the team has contributed to the development of the plan)
  ● Tracking against SIPs
  ● Expenditure against budget.
■ Application performance. These metrics are based on Service Design specifications and technical performance standards set by vendors and will typically be contained in OLAs or SOPs. Actual metrics will vary by application, but are likely to include:
  ● Response times
  ● Application availability, which is helpful for measuring team or application performance but is not to be confused with Service Availability, which requires the ability to measure the overall availability of the service, and may use the availability figures for a number of individual systems or components
  ● Integrity of data and reporting.
■ Measurement of maintenance activity, including:
  ● Maintenance performed per schedule
  ● Number of maintenance windows exceeded
  ● Maintenance objectives achieved (number and percentage).
■ Application Management teams are likely to work closely with Application Development teams on projects, and appropriate metrics should be used to measure this, including:
  ● Time spent on projects
  ● Customer and user satisfaction with the output of the project
  ● Cost of involvement in the project.
■ Training and skills development. These metrics ensure that staff have the skills and training to manage the technology that is under their control, and will also identify areas where training is still required.

6.5.9 Application Management documentation

A number of documents are produced and used during Application Management. This list is a summary of some of the most important and does not include reports or documents that are produced by Application Management on behalf of other processes or functions (e.g. RFC, Known Error documentation, Release Records, etc.). Note that documents should be controlled as CIs and related to the relevant applications or Application Management teams.

6.5.9.1 Application Portfolio

The Application Portfolio is used primarily as part of Service Strategy, but is referenced here for completeness.

The Application Portfolio forms part of the overall IT Service Portfolio, which is discussed in detail in the Service Strategy publication.

The Application Portfolio and the Service Catalogue

The Application Portfolio should not be mistaken for the Service Catalogue and should not be advertised as a list of services to customers or users. Applications are one of the components used to provide IT services, usually not the service itself.

The Application Portfolio should therefore be used as a planning document only by those managers and staff who are involved with the development and management of the organization’s IT Strategy, as well as IT staff who are tasked with managing the applications or the platforms on which the applications run.

The Service Catalogue should focus on listing the services that are available, rather than simply listing applications and assuming that users and customers can make the link. Having said that, there are times when the application is synonymous with the service, e.g. word-processing applications are typically known by their name; an application hosting service will mention the names of the application hosted, etc.
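The distinction between the Application Portfolio and the Service Catalogue can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, sample applications (`LedgerPro`, `MailHub`, `OldReports`) and the helper `apps_without_service` are all hypothetical and are not defined by ITIL. The sketch keeps portfolio entries as internal planning records, exposes only service-level information to customers, and shows one possible planning query over the two lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PortfolioEntry:
    """One application in a hypothetical Application Portfolio (internal planning data)."""
    name: str
    business_purpose: str
    criticality: str            # illustrative scale: "high", "medium", "low"
    support_group: str
    investment_to_date: float


@dataclass
class CatalogueService:
    """One customer-facing service; applications are components, not the service itself."""
    service_name: str
    description: str
    component_apps: List[str] = field(default_factory=list)  # internal links only

    def customer_view(self) -> Dict[str, str]:
        # Customers see the service, not the underlying applications.
        return {"service": self.service_name, "description": self.description}


def apps_without_service(portfolio: List[PortfolioEntry],
                         catalogue: List[CatalogueService]) -> List[str]:
    """Hypothetical planning query: applications that no catalogue service uses."""
    used = {app for service in catalogue for app in service.component_apps}
    return [entry.name for entry in portfolio if entry.name not in used]


# Made-up sample data, for illustration only.
portfolio = [
    PortfolioEntry("LedgerPro", "General Ledger", "high", "Finance Apps team", 250000.0),
    PortfolioEntry("MailHub", "Messaging", "medium", "Collaboration team", 80000.0),
    PortfolioEntry("OldReports", "Legacy reporting", "low", "Finance Apps team", 15000.0),
]
catalogue = [
    CatalogueService("Financial reporting", "Month-end and statutory reports", ["LedgerPro"]),
    CatalogueService("E-mail", "Corporate e-mail and calendar", ["MailHub"]),
]

if __name__ == "__main__":
    print(catalogue[0].customer_view())
    print(apps_without_service(portfolio, catalogue))
```

In this sketch the customer-facing view never lists application names, while planners can still query the portfolio, for example to find applications that do not support any listed service.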
The Application Portfolio is a list (more accurately a system or database) of all applications in use within the organization, together with the following information:

■ Key attributes of the application
■ Customers and users
■ Business purpose
■ Level of business criticality
■ Architecture (including the IT Infrastructure dependencies)
■ Developers, support groups, suppliers or vendors
■ The investment made in the application to date. In this respect the Application Portfolio can be used as an asset register for applications.

The purpose of the Application Portfolio is to analyse the need for and use of applications in the organization. It can be used to link functionality and investment to business activity and is therefore an important part of ongoing IT planning and control. Another benefit of the Application Portfolio is that it can be used to identify duplication and excessive licensing of applications.

6.5.9.2 Application Requirements

There are two sets of documents containing requirements for applications:

■ Business Requirements outline the Business Case for the required application, in other words what the business will do with the application. This will include the Return on Investment for the application as well as all related improvements to the business. Business requirements will also include the Service Level Requirements as defined by the service customers and users.
■ Application Requirements documents are based on the Business Requirements and specify exactly how the application will meet those requirements. In short, Application Requirements documents gather information that will be used to commission new applications or changes to existing applications, for example:
  ● To design the architecture of the application (specification of the different components of the system, how they relate to one another and how they will be managed)
  ● To specify a Request for Proposal (RFP) for a Commercial, Off the Shelf (COTS) application
  ● To initiate the design and building of an application in-house.

Requirements documents are normally owned by a project leader, either of a development project team, or for a team drawing up specifications for an RFP. Requirements documents are subject to document control for the project as they form part of the overall scope of the project.

Four different types of Application Requirements need to be defined (for more detailed information, please refer to the ITIL Service Design and Service Transition publications):

■ Functional Requirements describe the things an application is intended to do, and can be expressed as services, tasks or functions the application is required to perform.
■ Manageability Requirements are used to define what is needed to manage the application or to ensure that it performs the required functions consistently and at the right level. Manageability requirements also identify constraints on the IT system. These requirements serve as a basis for early system sizing and estimates of cost, and can support the assessment of the viability of the proposed IT system. Most importantly, they drive design of the operational models and performance standards used in IT Operations Management.
■ Usability Requirements are normally specified by the users of the application and refer to its ease of use. Any special requirements for handicapped users also need to be specified here.
■ Test Requirements specify what is required to ensure that the test environment is representative of the operational environment and that the test is valid (i.e. that it actually tests what it is supposed to).

6.5.9.3 Use and Change Cases

Use and Change Cases are managed as part of the Service Design and Continual Service Improvement processes, but are maintained by Application Management. For purchased software, it is common for the team that develops the functional specifications to maintain the Use Case for that application.

■ Use Cases document the intended use of the application with real-life scenarios to demonstrate its boundaries and its full functionality. Use Cases can also be used as modelling and sizing scenarios and for facilitating communication between users, Developers and Application Management staff.
■ Change Cases use scenarios to predict the impact of potential changes to utilization, architecture or functionality, and project the impact of specific change scenarios. Change Cases are used to clarify scope and direction with the sponsor. Extra architecture and design work will be needed at this point to ensure the Change Cases can be met in the future at reasonable cost. The sponsor must be prepared to pay the extra cost. If not, the Change Cases should be reduced to what the sponsor is prepared to pay for. Change Cases are also used to evaluate the architecture. They influence the development process enabling the design of appropriate architectural features to minimize the impact of future changes.

For more information, refer to the ITIL Service Design and Continual Service Improvement publications.

6.5.9.4 Design documentation

This is not one specific document, but refers to any document produced by Application Development or Management staff that specifies how an application will be built. As these documents are generally owned and managed by the Development teams, this publication will not cover them in detail. However, to ensure successful operation, Application Management must ensure that design documentation contains:

■ Sizing specifications
■ Workload profiles and utilization forecasts
■ Technical Architecture
■ Data models
■ Coding standards
■ Performance standards
■ Software Configuration Management definitions
■ Environment definitions and building considerations (if appropriate).

For COTS applications, these documents take the form of Application Specifications that are used as input into the writing of RFPs. In these cases the documents are owned and managed by Application Management.

For more information on Design Documentation, refer to the ITIL Service Design publication.

6.5.9.5 Manuals

Application Management is responsible for the management of manuals for all applications. Although these are normally developed by the Application Development teams or third party suppliers, Application Management is responsible for ensuring that the manuals are relevant to the operational versions of the applications.

Three types of manuals are generally maintained by Application Management:

■ Design manuals contain information about the structure and architecture of the application. These are helpful for creating reports or defining event correlation rules. They could also help in diagnosing problems.
■ Administration or management manuals describe the activities required to maintain and operate the application at the levels of performance specified in the Design phase. These manuals will also provide detailed troubleshooting, Known Error and Fault descriptions, and step-by-step instructions for common maintenance tasks.
■ User manuals describe the application functionality as it is used by an end-user. These manuals contain step-by-step instructions on how to use the application, as well as descriptions of what should typically be entered into certain fields, or what to do if there is an error.

Manuals and Standard Operating Procedures

Manuals should not be seen as a replacement for SOPs, but as input into the SOPs.

SOPs should contain all aspects of applications that need to be managed as part of standard operations. If they are not extracted from the manuals, there is a high likelihood that they will be ignored or performed in a non-standard manner. Application Management should ensure that any such instructions are extracted from the manuals and inserted into separate SOP documentation for Operations. It is also responsible for ensuring that these instructions are updated with every change or new release of the software.

6.6 SERVICE OPERATION ROLES AND RESPONSIBILITIES

The key to effective ITSM is ensuring that there is clear accountability and roles defined to carry out the practice of Service Operation. A role is often tied to a job description or work group description but does not necessarily need to be filled by one individual. The size of an organization, how it is structured, the existence of external partners and other factors will influence how roles are assigned. Whether a particular role is filled by a single individual or shared between two or more, what is important is the consistency of accountability and execution, along with the interaction with other roles in the organization.

6.6.1 Service Desk roles

The following roles are needed for the Service Desk.

6.6.1.1 Service Desk Manager

In larger organizations where the Service Desk is of a significant size, a Service Desk Manager role may be justified with the Service Desk Supervisor(s) reporting to him or her. In such cases this role may take responsibility for some of the activities listed above and may additionally perform the following activities:

■ Manage the overall desk activities, including the supervisors
■ Act as a further escalation point for the supervisor(s)
■ Take on a wider customer-services role
■ Report to senior managers on any issue that could significantly impact the business
■ Attend Change Advisory Board meetings
■ Take overall responsibility for incident and Service Request handling on the Service Desk. This could also be expanded to any other activity taken on by the Service Desk, e.g. monitoring certain classes of event.

Note: In all cases, clearly defined job descriptions should be drafted and agreed so that specific responsibilities are known.

6.6.1.2 Service Desk Supervisor

In very small desks it is possible that the senior Service Desk Analyst will also act as the Supervisor, but in larger desks it is likely that a dedicated Service Desk Supervisor role will be needed. Where shift hours dictate it, there may be two or more post-holders who fulfil the role, usually on an overlapping basis. The Supervisor’s role is likely to include:

■ Ensuring that staffing and skill levels are maintained throughout operational hours by managing shift staffing schedules, etc.
■ Undertaking HR activities as needed
■ Acting as an escalation point where difficult or controversial calls are received
■ Production of statistics and management reports
■ Representing the Service Desk at meetings
■ Arranging staff training and awareness sessions
■ Liaising with senior management
■ Liaising with Change Management
■ Performing briefings to Service Desk staff on changes or deployments that may affect volumes at the Service Desk
■ Assisting analysts in providing first-line support when workloads are high, or where additional experience is required.

6.6.1.3 Service Desk Analysts

The primary Service Desk Analyst role is that of providing first-level support through taking calls and handling the resulting incidents or Service Requests using the Incident Reporting and Request Fulfilment processes, in line with the objectives described earlier. The exact number of staff required is discussed in the section on Service Desk staffing (paragraph 6.2.4).

6.6.1.4 Super Users

Super Users are discussed in detail in the section on Service Desk staffing in paragraph 6.2.4. In summary, this role will consist of business users who act as liaison points with IT in general and the Service Desk in particular. The role of the Super User can be summarized as follows:

■ To facilitate communication between IT and the business at an operational level
■ To reinforce expectations of users regarding what Service Levels have been agreed
■ Staff training for users in their area
■ Providing support for minor incidents or simple request fulfilment
■ Involvement with new releases and rollouts.

6.6.2 Technical Management roles

The following roles are needed in the Technical Management areas.

6.6.2.1 Technical Managers/Team-leaders

A Technical Manager or Team-leader (depending upon the size and/or importance of the team and the organization’s structure and culture) may be needed for each of the technical teams or departments. The role will:

■ Take overall responsibility for leadership, control and decision-making for the technical team or department
■ Provide technical knowledge and leadership in the specific technical areas covered by the team or department
■ Ensure necessary technical training, awareness and experience levels are maintained within the team or department
■ Report to senior management on all technical issues relevant to their area of responsibility
■ Perform line-management for all team or department members.

6.6.2.2 Technical Analysts/Architects

This term refers to any staff member in Technical Management who performs the activities listed in paragraph 6.3.3, excluding the daily operational actions, which are performed by Operators in either Technical or IT Operations Management. Based on the list of generic activities in paragraph 6.3.3, the role of Technical Analysts and Architects includes:

■ Working with users, sponsors, Application Management and all other stakeholders to determine their evolving needs
■ Working with Application Management and other areas in Technical Management to determine the highest level of system requirements required to meet the requirements within budget and technology constraints
■ Defining and maintaining knowledge about how systems are related and ensuring that dependencies are understood and managed accordingly
■ Performing cost-benefit analyses to determine the most appropriate means to meet the stated requirements
■ Developing Operational Models that will ensure optimal use of resources and the appropriate level of performance
■ Ensuring that the infrastructure is configured to be effectively managed given the organization’s technology architecture, available skills and tools
■ Ensuring the consistent and reliable performance of the infrastructure to deliver the required level of service to the business
■ Defining all tasks required to manage the infrastructure and ensuring that these tasks are performed appropriately
■ Input into the design of configuration data required to manage and track the application effectively.

The ways in which Technical Management can be organized, and the options available, are discussed in some detail in section 6.7.

6.6.2.3 Technical Operator

This term is used to refer to any staff member who performs day-to-day operational tasks in Technical Management. Usually, these tasks are delegated to a dedicated IT Operations team, and this role is therefore discussed in paragraph 6.6.3.4 on IT Operators.

6.6.3 IT Operations Management roles

The following roles are needed in the IT Operations Management area:

6.6.3.1 IT Operations Manager

An IT Operations Manager will be needed to take overall responsibility for all of the IT Operations Management activities, which include:

■ Operations Control, which oversees the execution and monitoring of the operational activities in the IT Infrastructure. This can be done with the assistance of an Operations Bridge or Network Operations Centre. In addition to executing routine tasks from all technical areas, Operations Control also performs the following specific tasks:
  ● Console Management, which refers to defining central observation and monitoring capability and then using those consoles to exercise monitoring and control activities
  ● Job Scheduling, or the management of routine batch jobs or scripts
  ● Backup and Restore on behalf of all Technical and Application Management teams or departments and often on behalf of users
  ● Print and Output management for the collation and distribution of all centralized printing or electronic output.
■ Facilities Management, which refers to the management of the physical IT environment, typically a Data Centre or computer rooms and recovery sites together with all the power and cooling equipment. Facilities Management also includes the coordination of large-scale consolidation projects, e.g. data centre consolidation or server consolidation projects. In some cases the management of a Data Centre is outsourced, in which case Facilities Management refers to the management of the outsourcing contract.

The role of the IT Operations Manager is to:

■ Provide overall leadership, control and decision-making and take responsibility for the IT Operations Management teams and department
■ Report to senior management on all IT Operations issues
■ Perform line-management for all IT Operations team or department managers/supervisors.

6.6.3.2 Shift Leaders

Many IT Operations areas will work extended hours, on either a two- or three-shift basis. In such cases a shift leader will be needed on each of the shifts, to perform the following activities:

■ Take overall responsibility for leadership, control and decision-making during the shift period
■ Ensure that all operational activities are satisfactorily performed within agreed timescales and in accordance with company policies and procedures
■ Liaise with the other shift leader(s) to ensure handover, continuity and consistency between the shifts
■ Act as line-manager for all Operations Analysts on his/her shift
■ Assume overall health and safety, and security responsibility for the shift (unless specifically designated to other staff members).

6.6.3.3 IT Operations Analysts

IT Operations Analysts are senior IT Operations staff who are able to determine the most effective and efficient way to conduct a series of operations, usually in high-volume, diverse environments.

This role is normally performed as part of Technical Management, but large organizations may find that the volume and diversity of operational activities requires some more in-depth planning and execution. Examples include Job Scheduling and the definition of a Backup strategy and schedule.

6.6.3.4 IT Operators

IT Operators are the staff who perform the day-to-day operational activities that are defined by Technical or Application Management and, in some cases, by IT Operations Analysts. Typical Operator roles include:

■ Performing backups
■ Console operations, i.e. monitoring the status of specific systems, job queues, etc. and providing first-level intervention if appropriate
■ Managing print devices, restocking with paper, toner, etc.
■ Ensuring that batch jobs, archiving, etc. are performed
■ Running scheduled housekeeping jobs, such as database maintenance, file clean-up, etc.
■ Burning images for distribution and installation on new servers, desktops or laptops
■ Physical installation of standard equipment in the Data Centre.

6.6.4 Application Management roles

6.6.4.1 Applications Managers/Team-leaders

An Applications Manager or Team-leader should be considered for each of the applications teams or departments. The role will:

■ Take overall responsibility for leadership, control and decision-making for the applications team or department
■ Provide technical knowledge and leadership in the specific applications support activities covered by the team or department
■ Ensure necessary technical training, awareness and experience levels are maintained within the team or department relevant to the applications being supported and processes being used
■ Involve ongoing communication with users and customers regarding application performance and evolving requirements of the business
■ Report to senior management on all issues relevant to the applications being supported
■ Perform line-management for all team or department members.

6.6.4.2 Applications Analyst/Architect

Application Analysts and Architects are responsible for matching requirements to application specifications. Specific activities include:

■ Working with users, sponsors and all other stakeholders to determine their evolving needs
■ Working with Technical Management to determine the highest level of system requirements required to meet the requirements within budget and technology constraints
■ Performing cost-benefit analyses to determine the most appropriate means to meet the stated requirement
■ Developing Operational Models that will ensure optimal use of resources and the appropriate level of performance
■ Ensuring that applications are designed to be effectively managed given the organization’s technology architecture, available skills and tools
■ Developing and maintaining standards for application sizing, performance modelling, etc.
■ Generating a set of acceptance test requirements, together with the designers, test engineers and the user, which determine that all of the high-level requirements have been met, both functional and with regard to manageability
■ Input into the design of configuration data required to manage and track the application effectively.

An appropriate number of Application Analysts will be needed for each of the Application Management teams or departments to perform the activities described elsewhere in this publication, primarily in paragraph 6.5.5.

The ways in which Application Management groups can be organized, and the options available, are discussed in some detail in section 6.7.

6.6.5 Event Management roles

It is unusual for an organization to appoint an ‘Event Manager’, as events tend to occur in multiple contexts and for many different reasons. However, it is important that Event Management procedures are coordinated to prevent duplication of effort and tools. The roles of the Service Operation functions in Event Management are as follows.

6.6.5.1 The role of the Service Desk

The Service Desk is not typically involved in Event Management as such, unless an event requires some response that is within the scope of the Service Desk’s defined activity, for example notifying a user that a report is ready. Generally, though, this type of activity is performed by the Operations Bridge, unless the Service Desk and Operations Bridge have been combined.

The investigation and resolution of events that have been identified as being Incidents will initially be undertaken by the Service Desk and then escalated to the appropriate Service Operation team(s).

The Service Desk is also responsible for communicating information about this type of incident to the relevant Technical or Application Management team and, where appropriate, the user.

6.6.5.2 The role of Technical and Application Management

Technical and Application Management plays several important roles as follows:

■ During Service Design, they will participate in the instrumentation of the service, classify events, update correlation engines and ensure that any auto responses are defined
■ During Service Transition they will test the service to ensure that events are properly generated and that the defined responses are appropriate
■ During Service Operation these teams will typically perform Event Management for the systems under their control. It is unusual for teams to have a dedicated person to manage Event Management, but each manager or team leader will ensure that the appropriate procedures are defined and executed according to the process and policy requirements

■ Developing and maintaining the Incident Management process and procedures.

In many organizations the role of Incident Manager is assigned to the Service Desk Supervisor, though in larger organizations with high volumes a separate role may be necessary. In either case it is important that the Incident Manager is given the authority to manage incidents effectively through first, second and third line.

■ Technical and Application Management will also be
involved in dealing with incidents and problems related to events 18.104.22.168 First line ■ If Event Management activities are delegated to the This is covered in detail under the Service Desk (section Service Desk or IT Operations Management, Technical 6.1) and will not be repeated here. and Application Management must ensure that the staff are adequately trained and that they have access 22.214.171.124 Second line to the appropriate tools to enable them to perform these tasks. Many organizations will choose to have a second-line support group, made up of staff with greater (though still 126.96.36.199 The role of IT Operations Management general) technical skills than the Service Desk – and with additional time to devote to incident diagnosis and Where IT Operations is separated from Technical or resolution without interference from telephone Application Management, it is common for Event interruptions. Monitoring and first-line response to be delegated to IT Operations Management. Operators for each area will be Such a group can handle many of the less complicated tasked with monitoring events, responding as required, or incidents, leaving more specialist (third-line) support ensuring that Incidents are created as appropriate. The groups to concentrate on dealing with more deep-rooted instructions for how to do so must be included in the incidents and/or new developments etc. SOPs for those teams. Where a second-line group is used, there are often Event Monitoring is commonly delegated to the advantages of locating this group close to the Service Operations Bridge where it exists. The Operations Bridge Desk to aid with good communications and to ease can initiate and coordinate, or even perform, the movement of staff between the groups, which may be responses required by the service, or provide first-level helpful for training/awareness and during busy periods support for those events which generate an incident. or staff shortages. 
A second-line support manager (or supervisor if just a small group) will normally head 6.6.6 Incident Management roles this group. The following roles are needed for the Incident It is conceivable that this group may be outsourced – and Management process. this is more likely and practical if the Service Desk itself has been outsourced. 188.8.131.52 Incident Manager An Incident Manager has the responsibility for: 184.108.40.206 Third line ■ Driving the efficiency and effectiveness of the Incident Third-line support will be provided by a number of Management process internal technical groups and/or third-party suppliers/maintainers. The list will vary from organization ■ Producing management information to organization but is likely to include: ■ Managing the work of incident support staff (first- and second-line) ■ Network Support ■ Monitoring the effectiveness of Incident Management ■ Voice Support (if separate) and making recommendations for improvement ■ Server Support ■ Developing and maintaining the Incident Management ■ Desktop Support systems ■ Application Management – likely that there may be ■ Managing Major Incidents separate teams for different applications or application types – some of which may be external Organizing for Service Operation | 145 supplier/maintainers. In many cases the same team ■ Liaison with suppliers, contractors, etc. to ensure that will be responsible for Application Developments as third parties fulfil their contractual obligations, well as support – and it is therefore important that especially with regard to resolving problems and resources are prioritized so that support is given providing problem-related information and data adequate prominence ■ Arranging, running, documenting and all follow-up ■ Database Support activities relating to Major Problem Reviews. ■ Hardware Maintenance Engineers ■ Environmental Equipment Maintainers/Suppliers. 
220.127.116.11 Problem-Solving Groups The actual solving of problems is likely to be undertaken Note: Depending upon where an organization decides to by one or more technical support groups and/or suppliers source its support services, any of the above groups could or support contractors – under the coordination of the be internal or external groups. Problem Manager. 6.6.7 Request Fulfilment roles Where an individual problem is serious enough to warrant Initial handling of Service Requests will be undertaken by it, a dedicated problem management team should be the Service Desk and Incident Management staff. formulated to work together in overcoming that particular problem. The Problem Manager has a role to play in Eventual fulfilment of the request will be undertaken by making sure that the correct number and level of the appropriate Service Operation team(s) or departments resources is available in the team and for escalation and and/or by external suppliers, as appropriate. Often, communication up the management chain of all Facilities Management, Procurement and other business organizations concerned. areas aid in the fulfilment of the Service Request. In most cases there will be no need for additional roles or posts to 6.6.9 Access Management roles be created. Since Access Management is an execution of Security and In exceptional cases where a very high number of Service Availability Management, these two areas will be Requests are handled, or where the requests are of critical responsible for defining the appropriate roles. It is unusual importance to the organization, it may be appropriate to for an organization to appoint an ‘Access Manager’, have one or more of the Incident Management team although it is important that there is a single Access dedicated to handling and managing Service Requests. Management process and a single set of policies related to managing rights and access. 
This process and the related 6.6.8 Problem Management roles policies are likely to be defined and maintained by The following roles are needed for the Problem Information Security Management and executed by the Management process. various Service Operation functions. Their activities can be summarized as follows. 18.104.22.168 Problem Manager There should be a designated person (or, in larger 22.214.171.124 The role of the Service Desk organizations, a team) responsible for Problem The Service Desk is typically used as a means to request Management. Smaller organizations may not be able to access to a service. This is normally done using a Service justify a full-time resource for this role, and it can be Request. The Service Desk will validate the request by combined with other roles in such cases, but it is essential checking that the request has been approved at the that it not just left to technical resources to perform. There appropriate level of authority, that the user is a legitimate needs to be a single point of coordination and an owner employee, contractor or customer and that they qualify for of the Problem Management process. This role will access. coordinate all Problem Management activities and will Once it has performed these checks (usually by accessing have specific responsibility for: the relevant databases and Service Level Management ■ Liaison with all problem resolution groups to ensure documents) it will pass the request to the appropriate swift resolution of problems within SLA targets team to provide access. It is quite common for the Service ■ Ownership and protection of the KEDB Desk to be delegated responsibility for providing access ■ Gatekeeper for the inclusion of all Known Errors and for simple services during the call. 
management of search algorithms The Service Desk will also be responsible for ■ Formal closure of all Problem Records communicating with the user to ensure that they know 146 | Organizing for Service Operation when access has been granted and to ensure that they The Operations Bridge, if it exists, can be used to monitor receive any other required support. events related to Access Management and can even provide first-line support and coordination in the The Service Desk is also well situated to detect and report resolution of those events where appropriate. incidents related to access. For example, users attempting to access services without authority; or users reporting incidents that indicate that a system or service has been 6.7 SERVICE OPERATION ORGANIZATION used inappropriately, i.e. by a former employee who STRUCTURES used an old username to gain access and make unauthorized changes. Some general information has already been provided about organizational considerations for each function (see paragraphs 6.2.3, 6.3.4 and 6.5.6.). This section considers 126.96.36.199 The role of Technical and Application some specific organizational structures for all functions. Management There are a number of ways of organizing Service Technical and Application Management play several Operation functions, and each organization will have to important roles as follows: make it own decisions, based upon its scale, geography, ■ During Service Design, they will ensure that culture and business environment. Some options are mechanisms are created to simplify and control Access discussed in the rest of this section. Management on each service that is designed. 
They will also specify ways in which abuse of rights can be 6.7.1 Organization by technical detected and stopped specialization ■ During Service Transition they will test the service to In this type of organization, departments are created ensure that access can be granted, controlled and according to technology and the skills and activities prevented as designed needed to manage that technology. IT Operations will ■ During Service Operation these teams will typically follow the structure of the Technical and Application perform Access Management for the systems under Management departments. The implication of this is that their control. It is unusual for teams to have a IT Operations is geared toward the operational agendas of dedicated person to manage Access Management, but the Technical and Application Management departments. each manager or team leader will ensure that the This structure can work well, provided that these appropriate procedures are defined and executed groups are fully represented in the Service Design, according to the process and policy requirements Testing and Improvement processes, which will ensure ■ Technical and Application Management will also be that their agendas are aligned with the requirements involved in dealing with Incidents and Problems of the business. related to Access Management ■ If Access Management activities are delegated to the This structure also assumes that all Technical and Service Desk or IT Operations Management, Technical Application Management departments have clearly and Application Management must ensure that the distinguished between their Management activity and staff are adequately trained and that they have access operations activity. It also requires that they have to the appropriate tools to enable them to perform standardized these operational activities so that they can these tasks. 
be effectively managed by the IT Operations Manager without undue interference from the Technical and Application Management teams or departments. 188.8.131.52 The role of IT Operations Management Where IT Operations is separated from Technical or An example of an IT Operations organization structure Application Management, it is common for operational based on technical expertise is given in Figure 6.7 Access Management tasks to be delegated to IT The advantages of this type of organizational structure Operations Management. Operators for each area will be include: tasked with providing or revoking access to key systems or resources. The circumstances under which they may do so, ■ It is easier to set internal performance objectives since and the instructions for how to do so, must be included in all staff in a single department have a similar set of the SOPs for those teams. tasks on a similar technology Organizing for Service Operation | 147 ■ Individual devices, systems or platforms can be The disadvantages of this type of organizational structure managed more effectively since people with the include the following: appropriate skills are dedicated to manage these and ■ When people are divided into separate departments measured according to their performance the priorities of their own group tend to override the ■ Managing training programmes is easier since skill sets priorities of other departments. An example of this is are clearly defined and separated into specific groups. when departments refuse to accept ownership of an incident, each one blaming the other while the business continues to be disrupted. 
Figure 6.7 IT Operations organized according to technical specialization (sample)
[Organization chart: the IT Operations Manager heads four departments: IT Operations Control; Infrastructure Operations (Mainframe, Server, Storage, Network, Desktop, Database, Directory Service, Middleware and Internet/Web Operations); Application Operations (Financial Apps, HR Apps and Business Apps Operations); and Facilities Management.]

■ Knowledge about the infrastructure and relationships between components is difficult to collect and fragmented. Individual groups tend to collect and maintain only the data that is required to support their own function, and do not give access to it very easily.
■ Each technology managed by a group is seen as a separate entity. This becomes a problem on systems that consist of components managed by different teams, e.g. an application, managed by the Application Management team, runs on a server managed by the Server Management department, using a network segment managed by the Local Area Networking department. If a change is made by one team or department without consulting the others, this could be disastrous for the service.
■ It is more difficult to understand the impact of a single department’s poor performance on the IT Service since there are many different groups contributing to the same service, each with its own set of performance objectives.
■ It is more difficult to track overall IT Service performance since each group is being measured on an individual basis.
■ Coordinating Change Assessments and Schedules is more difficult since many different departments have to provide input for each change.
■ Work requiring knowledge of multiple technologies is difficult since most resources are only trained for and concerned with the management of a single technology. Projects therefore have to include cross-training, which is time-consuming and expensive.

6.7.2 Organization by activity
This type of organization structure focuses on the fact that similar activities have to be performed on all technologies in the organization. This means that people who perform similar activities, regardless of the technology, should be grouped together, although within each department there may be teams focusing on a specific technology, application, etc.

In this type of organization, there is no clear differentiation between the different Technical and Application Management areas. Similar activities from many different areas can be grouped into a single department.

Examples of departments that have been set up to perform a specific set of activities across multiple technologies include:
■ Maintenance (this implies that one team will coordinate and perform all maintenance across all technologies)
■ Contract Management or Third Party Management
■ Monitoring and Control
■ Operations Bridge
■ Network Operations Centre
■ Operations Strategy and Planning (which, as part of the Service Design processes, normally defines the standards to be used in IT Operations) – this department can set strategy or standards for every type of Technical and Application Management area.

The Operations Strategy and Planning department is used to illustrate this type of structure in Figure 6.8.

Figure 6.8 A department based on executing a set of activities
[Organization chart titled ‘Organization by Activity’, headed by the IT Strategy and Planning Manager. Box labels: Architecture and Standards; New Technology Research; Capacity Planning; Service Portfolio; Applications; Infrastructure; Mainframe; Servers; Storage; Network; Web-based.]

The advantages of this type of organizational structure include the following:
■ It is easier to manage groups of related activities since all the people involved in these activities report to the same manager
■ Measurement of teams or departments is based more on output than on isolated activities. This helps to build higher levels of assurance that a service can be delivered.

The disadvantages of this type of organizational structure include the following:
■ Resources with similar skills may be duplicated across different functions, which results in higher costs
■ Although measurement is more output-based, it is still focused on the performance of internal activities rather than driven by the experience of the customer or end user.

6.7.3 Organizing to manage processes
It is not a good idea to structure the whole organization according to processes. Processes are used to overcome the ‘silo effect’ of departments, not to create silos.

However, there are a number of processes that will need a dedicated organization structure to support and manage it. For example, it will be very difficult for Financial Management to be successful without a dedicated Finance department – even if that department consists of a small number of staff.

In process-based organizations people are organized into groups or departments that perform or manage a specific process. This is similar to the activity-based structure, except that its departments focus on end-to-end sets of activities rather than on one individual type of activity.

It should be noted that this type of organization structure should only be used if IT Operations Management is responsible for more than just IT Operations. In some organizations, for example, IT Operations is responsible for defining SLAs and negotiating UCs.

In addition, processes specifically exist to link the activities of different groups to achieve a specific outcome. Using processes as the basis to create departments can defeat the purpose of having processes in the first place. Process-based departments are really only effective when they are able to coordinate the execution of the process through the entire organization.

This means that process-based departments should only be considered if IT Operations Management is to play the role of Process Owner for a specific process.

Examples of process-based groups or departments include:
■ Capacity Operations
■ Availability Monitoring and Control
■ IT Financial Management
■ Security Administration
■ Asset and Configuration Management (including equipment installation and deployment).

The advantages of this type of organizational structure include the following:
■ Processes are easier to define
■ There is less role conflict as job descriptions and process role descriptions are the same. In other structures a single job description will typically include activities for several roles
■ Metrics of team or department performance and process performance are the same, effectively aligning ‘internal’ and ‘external’ metrics.

The disadvantages of this type of organizational structure include the following:
■ A basic principle of processes is that they are a means of linking the activities of various departments and groups. By using processes as a basis for organizational design, additional processes need to be defined to ensure that the departments work together.
■ Even if a department is responsible for executing a process, there will still be external dependencies. Groups may not view process activities outside of their own process as being important, resulting in processes that cannot be fully executed because dependencies cannot be met.
■ While some aspects of a process can be centralized, there will always be a number of activities that will have to be performed by other groups. The relationship between the dedicated team or department and the people performing the decentralized activities is often difficult to define and manage.

6.7.4 Organizing IT Operations by geography
IT Operations can be physically distributed and in some cases each location needs to be organized according to its own particular context.

This structure is typically used in the following circumstances:
■ Data Centres are geographically distributed
■ Different regions or countries have different technologies or provide a different set of services
■ There are different business models or organizational structures in the different regions, i.e. the business is decentralized by geography and each Business Unit is fairly autonomous
■ Different legislation applies to different countries or regions (e.g. safety regulations)
■ Different standards apply to different countries or regions
■ Cultural or language differences exist between staff managing IT.

An example of this type of structure is given in Figure 6.9. Note that in this example each geographical department is structured internally using Technical Specialization. This could be different in each region. For example one region may be structured in this way, while another region uses a process- or activity-based structure.

Figure 6.9 also illustrates that one location could perform centralized operations for all regions if they are similar enough. In this example, the American Server Operations Department manages all server operations in all locations, Brussels manages all database operations and Singapore manages all storage operations.

Figure 6.9 IT Operations organized according to geography
[Organization chart: the IT Operations Manager heads four regional departments: American IT Operations – Miami; European IT Operations – Brussels; African IT Operations – Johannesburg; and Asia Pacific IT Operations – Singapore. Each region has Mainframe, Network, Desktop and Internet/Web Operations. Server Operations, Database Operations and Storage Operations each appear once, in the locations named in the text above.]

The advantages of this type of organizational structure include the following:
■ Organization structure can be customized to meet local conditions
■ IT Operations can be customized to meet differing levels of IT service from region to region.

The disadvantages of this type of organizational structure include the following:
■ Reporting lines and authority structures can be confusing. For example, does Network Operations report into the local Data Centre Manager or to a centralized Network Operations Manager?
■ Operational standards are difficult to impose, resulting in inconsistent and duplicated activities and tools, resulting in reduced economies of scale, which in turn increases the overall cost of operations.
■ Duplication of roles, activities, tools and facilities across multiple locations could be very costly.
■ Shared services, such as e-mail, are more difficult to deliver as each regional organization operates differently.
■ Communication with customers and inside IT will be more difficult as they are not co-located and it may be difficult for staff in one location to understand the priorities of customers or staff in another location.

6.7.5 Hybrid organization structures
It is unlikely that IT Operations Management will be structured using only one type of organization structure. Most organizations use a technical specialization, with some additional activity- or process-based departments.

The type of structure used and the exact combination of technical specialization, activity-based and process-based departments will depend on a number of organizational variables.

Organizational structure variables
The exact criteria chosen and the resulting organizational structure will depend on a number of variables, which may include:
■ The nature of the business
■ Business requirements and expectations
■ The technological and technical architecture
■ The stability of the current IT Infrastructure and the availability of skills to manage it
■ The governance of the organization (i.e. the way in which authority is assigned and decisions are made – as well as any formal governance framework that is used, such as COBIT or SOX)
■ The legislative, political and socio-economic environment of the organization
■ The type and level of skills available to the organization
■ The size, age and maturity of the organization
■ The management style of the organization
■ Dependence on IT for business-critical activities, processes and functions
■ The way in which IT participates in the value network (i.e. the way IT interacts with the business and its partners, suppliers and customers)
■ The relationship between IT and its vendors.

For a more complete description of how these factors influence organizational design, please refer to the ‘Organizational Development’ section of the Service Strategy publication.

Combined functions
One last type of organization should be discussed. This structure incorporates IT Operations, Technical and Application Management departments into a single structure. This is sometimes the case where all groups are co-located in a single data centre. Here, the Data Centre Manager takes responsibility for all Technical, Application and IT Operations Management.

This type of organization structure is illustrated in Figure 6.10.

Figure 6.10 Centralized IT Operations, Technical and Application Management structure
[Organization chart: the IT Operations Manager heads IT Operations Control, Infrastructure Management, Facilities Management and Application Management. Infrastructure Management contains Server, Mainframe, Network, Storage, Database, Desktop and Internet/Web Management, each with a matching Operations team. Application Management contains Financial Apps, HR Apps and Business Apps Management, each with a matching Operations team.]

In this structure, IT Operations Management is responsible for the Technical and Application Management functions, which in turn are responsible for managing their own operational activities. Each department is able to delegate some of these activities to the Operations Control department.

The advantages of this organization structure are:
■ There is greater consistency and control between the more tactical and more operational Technical Management activities
■ It is easier to enforce the performance standards and technical architectures that are created in Service Design, since the people who were involved in design are managing the activities of the people who are executing those activities
■ As there is no duplication between location or activity, this structure is often more cost-effective.

The disadvantage of this organization structure is:
■ The scope of this structure makes it very difficult to manage effectively in large organizations or in organizations with multiple Data Centres.

Organizing Application and Technical Management
Technical and Application Management organizations tend to be fairly straightforward. As stated in paragraphs 6.3.4 and 6.5.6, Technical Management departments are usually based on the technology they manage and Application Management departments on the applications and sets of applications they manage.

However, there are some alternative organization structures and variations, which are discussed in this section.

Geography
In organizations with multiple locations, it is common for the Technical and Application Management departments to be represented in each physical location. However, this does not mean that each location will have all the same departments, or that they are all responsible for the same actions.

As support and management tools mature, more and more IT Infrastructure and application CIs can be managed remotely. This means that each department will have a strong, centralized Technical or Application Management team, with local members to provide specialized, on-site activities or support.

For example, in Server Management, the central team will help to create standards for server configuration, they will monitor and control remote devices, perform backups, perform Operating System upgrades, etc. The local teams will provide basic on-site support, hardware maintenance and repair and configuration and installation of new servers.

In Application Management, the central team could participate in ongoing design and testing of the application, monitoring and control; perform backups, data integrity checks, etc. The local team could provide on-site support and education to end users and work with the local Technical Management team to resolve more complex problems involving local equipment.

There is one potential issue that needs to be resolved however, and that is who the local team reports to. In some organizations they report to the manager of the centralized team. This has the added advantage of consistent performance and management across the whole enterprise.

In other organizations the local teams report to the most senior IT Manager at that site. This has the added advantage that IT Services can be customized to meet local conditions, but it creates a lot of confusion about who the local teams should take direction from.

The advantages of this type of organizational structure include the following:
■ Organization structure can be customized to meet local conditions
■ Technical and Application Management can be customized to meet differing levels of IT service from region to region.

The disadvantages of this type of organizational structure include the following:
■ Reporting lines and authority structures can be confusing
■ Standards are difficult to impose, resulting in inconsistent and duplicated activities and tools, resulting in reduced economies of scale, which in turn increases the overall cost of operations
■ Duplication of roles, activities, tools and facilities across multiple locations could be very costly.

Combined Technical and Application Management structure
Some organizations organize their Technical and Application Management functions according to systems. This means that each department will consist of application specialists and IT Infrastructure technical specialists, all geared towards managing the services based on that set of systems. Components that are shared across all these systems, such as the network, will be managed by dedicated Technical Management departments.

The advantage of this organization structure is:
■ It is easier to produce high-quality output to the end user because all department members are focused on the success of the system as a whole, rather than the performance of an individual technology component or application.

The disadvantages of this organization structure are:
■ Duplication of skills and resources across several departments will increase the cost of the organization. For example, each group is likely to have an individual or team dedicated to managing servers – each of which will be doing very similar tasks.
■ Communication between staff who are managing similar technology is reduced. This reduces the amount of learning by experience and increases reliance on collaborative knowledge management tools.
■ When people with similar skills are in the same department, the department will compensate for members with lower skill and competency levels. When there is only one person with Server Management skills on a system-based department, and their competency is minimal, it will affect the performance of the entire department.

7 Technology considerations

Each function and process is defined in the relevant section in Chapters 4 and 6. This chapter brings all

and linked to Incident, Problem, Known Error and Change Records as appropriate.
technology requirements together to define the overall requirement of an integrated set of Service Management 7.1.4 Discovery/Deployment/Licensing technology for Service Operation. technology The same technology, with some possible additions, In order to populate or verify the CMS data and to assist in should be used for the other phases of ITSM – Service Licence Management, discovery or automated audit tools Strategy, Service Design, Service Transition and Continual will be required. Such tools should be capable of being Service Improvement – to give consistency and allow an run from any location on the network and allow effective ITSM Lifecycle to be properly managed. interrogation and recovery of information relating to all components that make up, or are connected to, the IT The main requirements for Service Operation are as set out Infrastructure. in this chapter. Such technology should allow ‘filtering’ so that the data being carried forward can be vetted and only required 7.1 GENERIC REQUIREMENTS data extracted. It is also very helpful if ‘changes only’ since An integrated ITSM technology (or toolset, as some the last audit can be extracted and reported upon. suppliers sell their technology as ‘modules’ whereas some The same technology can often be used to deploy new organizations may choose to integrate products from software to target locations – this is an essential alternative suppliers) is needed that includes the following requirement for all Service Operation teams or core functionality. departments, to allow patches, transports etc. to be distributed to the correct users. 7.1.1 Self-Help Many organizations find it beneficial to offer ‘Self-Help’ An interface to ‘Self Help’ capabilities is desirable to allow capabilities to their users. The technology should therefore approved software downloads to be requested in this way support this capability with some form of web front-end but automatically handled by the deployment software. 
allowing web pages to be defined offering a menu-driven Tools that allow automatic comparison of software range of Self-Help and Service Requests – with a direct licences’ details held (in the CMS, ideally) and actual interface into the back-end process-handling software. licence numbers deployed – with reporting of any discrepancies – are extremely desirable. 7.1.2 Workflow or process engine A workflow or process control engine is needed to allow 7.1.5 Remote control the pre-definition and control of defined processes such as It is often helpful for the Service Desk Analysts and other an Incident Lifecycle, Request Fulfilment Lifecycle, Problem support groups to be able to take control of the user’s Lifecycle, Change Model, etc. desk-top (under properly controlled security conditions) so This should allow responsibilities, activities, timescales, as to allow them to conduct investigations or correct escalation paths and alerting to be pre-defined and then settings, etc. Facilities to allow this level of remote control automatically managed. will be needed. 7.1.3 Integrated CMS 7.1.6 Diagnostic utilities The tool should have an integrated CMS to allow the It could be extremely useful for the Service Desk and other organization’s IT infrastructure assets, components, support groups if the technology incorporated the services and any ancillary CIs (such as contracts, locations, capability to create and use diagnostic scripts and other licences, suppliers etc. – anything that the IT organization diagnostic utilities (such as, for example, case-based wishes to control) to be held, together with all relevant reasoning tools) to assist with earlier diagnosis of attributes, in a centralised location – and to allow incidents. Ideally, these should be ‘context sensitive’ and relationships between each to be stored and maintained, presentation of the scripts automated so far as possible. 
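As an illustration only, a context-sensitive diagnostic script of the kind described under Diagnostic utilities could be modelled as a small decision tree that is selected by incident category and driven by the answers given. The sketch below is not taken from any ITSM product: the category name, the questions, the resolutions and the names `Question`, `SCRIPTS` and `run_script` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Question:
    """One step of a diagnostic script (hypothetical structure)."""
    text: str
    # Maps an answer such as "yes"/"no" to the next Question,
    # or to a plain string describing the resolution action.
    branches: Dict[str, Union["Question", str]]


# Hypothetical script library, keyed by incident category so that the
# script shown is 'context sensitive'.
SCRIPTS: Dict[str, Question] = {
    "email": Question(
        "Can the user open the webmail page?",
        {
            "yes": Question(
                "Does the desktop mail client show 'offline'?",
                {
                    "yes": "Switch the mail client back to online mode.",
                    "no": "Escalate to second-line messaging support.",
                },
            ),
            "no": "Check network connectivity for the user's device.",
        },
    ),
}


def run_script(category: str, answers: List[str]) -> Optional[Tuple[str, str]]:
    """Walk the script for a category using the answers collected so far.

    Returns ("ask", question) if another question is needed,
    ("resolve", action) once a resolution is reached,
    or None if no script exists for the category.
    """
    node = SCRIPTS.get(category)
    if node is None:
        return None
    for answer in answers:
        nxt = node.branches.get(answer)
        if nxt is None:
            raise ValueError(f"Unexpected answer {answer!r} to: {node.text}")
        if isinstance(nxt, str):
            return ("resolve", nxt)
        node = nxt
    return ("ask", node.text)
```

Called with no answers, `run_script("email", [])` returns the first question to put to the user; each further answer either yields the next question or a resolution action. A real tool would hold such scripts in its own database and link them to the incident categorization scheme.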
7.1.7 Reporting
There is no use in storing data unless it can be easily retrieved and used to meet the organization’s purposes. The technology should therefore incorporate good reporting capabilities, as well as allow standard interfaces which can be used to input data to industry-standard reporting packages, dashboards, etc. Ideally, instant, on-screen as well as printed reporting can be provided through the use of context-sensitive ‘top ten’ reports.

7.1.8 Dashboards
Dashboard-type technology is useful to allow ‘see at a glance’ visibility of overall IT service performance and availability levels. Such displays can be included in management-level reports to users and customers – but can also give real-time information for inclusion in IT web pages to give dynamic reporting, and can be used for support and investigation purposes. Capabilities to support customized views of information to meet specific levels of interest can be particularly useful.

However, they sometimes represent a technical rather than a service view of the infrastructure and in such cases they may be of less interest to customers and users.

7.1.9 Integration with Business Service Management
There is a trend within the IT industry to try to bring together business-related IT with the processes and disciplines of IT Service Management – some call this Business Service Management. To facilitate this, business applications and tools need to be interfaced with ITSM support tools to give the required functionality. This can be illustrated by this example:

An Eastern European telecoms company was able to interface its telephone cell-net monitoring and billing system to its Event Management, Incident Management and Configuration Management processes. In this way it was able to detect any unusual usage/billing patterns and interpret these such that it could identify, with a high degree of certainty, that a telephone had been stolen and was being used to make illicit calls.

It was able to raise events for such patterns and automate actions to suspend usage of the mobile phone devices and, in parallel, identify the exact location of the illicit user (using GPRS technology) and raise incidents so that the police had the capability of finding the suspected thief and recovering the device.

More advanced tool integration capabilities are needed to allow greater exploitation of this sort of business and IT integration.

7.2 EVENT MANAGEMENT
The following features are desirable for any Event Management technology:
■ Multi-environmental, open interface to allow monitoring and alerting across heterogeneous services and an organization’s entire IT Infrastructure.
■ Easy to deploy, with minimal set-up costs.
■ ‘Standard’ agents to monitor most common environments/components/systems.
■ Open interfaces to accept any standard (e.g. SNMP) event input and generation of multiple alerts.
■ Centralized routing of all events to a single location, programmable to allow different location(s) at various times.
■ Support for design/test phases – so that new applications/services can be monitored during design/test phases and results fed back into the design and transition.
■ Programmable assessment and handling of alerts depending upon symptoms and impact.
■ The ability to allow an operator to acknowledge an alert, and if no response is entered within a defined timeframe, to escalate the alert.
■ Good reporting functionality to allow feedback into design and transition phases as well as meaningful management information and business user ‘dashboard’.

Such technology should allow a direct interface into the organization’s Incident Management processes (via entry into the Incident Log), as well as the capability to escalate to support staff, third-party suppliers, engineers etc. via e-mail, SMS messaging, etc.

Specialist facilities, or perhaps separate specialist tools, will be required for website monitoring. Such facilities must be able to simulate customer traffic onto the website and to report on availability and performance in relation to the ‘customer experience’.

7.3 INCIDENT MANAGEMENT

7.3.1 Integrated ITSM technology
Integrated ITSM technology is required that has the following functionality:
■ An integral CMS to allow automated relationships to be made and maintained between incidents, service requests, problems, Known Errors and all other configuration items.
■ A CMS that can be used to assist in determining priority and aid in investigation and diagnosis.
■ A process flow engine to allow processes to be pre-defined (including pre-defined incident models, see the earlier paragraph on incident models) and automatically controlled – with flexible internal routing to all relevant support groups and external e-mail/SMS interfaces.
■ Automated alerting and escalation capabilities to prevent an incident being overlooked or delayed.
■ Open interfacing to Event Management tools, so that any failures can be automatically raised as incidents.
■ A web interface to allow self-help and service requests to be input via Internet/Intranet screens.
■ An integrated KEDB so that diagnosed and/or resolved incidents/problems can be recorded and searched to help in speeding future incident resolution.
■ Easy-to-use reporting facilities to allow incident metrics to be produced and to facilitate incident analysis for Problem Management and Availability Management purposes.
■ Diagnostic tools (either integrated or interfaces to separate products), as already mentioned under Service Desk.

7.3.2 Workflow and automated escalation
The target times should be included in support tools, which should be used to automate the workflow control and escalation paths.

If for example a second-line support group has not resolved an incident within a 60-minute agreed target, the incident must be automatically routed to the appropriate (determined by incident categorization) third-line support group – and any necessary hierarchic escalation should be automatically undertaken (e.g. SMS message to the Service Desk Manager, Incident Manager and/or IT Services Manager and perhaps to the user, if appropriate). The second-line support group must be informed of the escalation action as part of the automated process.

7.4 REQUEST FULFILMENT
Integrated ITSM technology is needed so that Service Requests can be linked to incidents or events that have initiated them (and been stored in the same CMS, which can be interrogated to report against SLAs). Some organizations will be content to use the Incident Management element of such tools and to treat Service Requests as a subset and defined category of incidents. Where an organization chooses to raise separate Service Requests, it will require a tool which allows this capability.

Front-end Self-Help capabilities will be needed to allow users to submit requests via some form of web-based, menu-driven selection process.

In all other respects the facilities needed to manage Service Requests are very similar to those for managing incidents: pre-defined workflow control of Request Models, priority levels, automated escalation, effective reporting, etc.

7.5 PROBLEM MANAGEMENT

7.5.1 Integrated Service Management Technology
An integrated ITSM tool is needed that differentiates between incidents and problems – so that separate Problem Records can be raised to deal with the underlying causes of incidents, but linked to the related incidents. The functionality of Problem Records should be similar to that needed for Incident Records and also allow for multiple incident matching against Problem Records.

7.5.2 Change Management
Integration with Change Management is very important, so that Request, Event, Incident and Problem Records can be related to RFCs that have caused problems. This is to evaluate the success of the Change Management process – as well as Incident and Known Error Records – and so that RFCs can be readily raised to control the activities needed to overcome problems that have been identified through Root-Cause Analysis or Proactive Trend Analysis.

7.5.3 Integrated CMS
It is also important to have an integrated CMS which allows Problem Records to be linked to the components affected and the services impacted – and to any other relevant CIs.

Configuration Management forms part of a larger SKMS which includes linkages to many of the data repositories used in Service Operation. The process and practices of Configuration Management and its underlying technology requirements are included in the Service Transition publication.

7.5.4 Known Error Database
An effective KEDB will be an essential requirement, which should allow easy storage and retrieval of Known Error data.

Good reporting facilities are needed to ease the production of management reports, allowing the data to be incorporated automatically without the need for re-keying of data – and to allow drill-down capabilities for Incident and Problem Analysis.

Note: In some cases, components or systems being investigated by Problem Management may be provided by third-party vendors or manufacturers. To address this, vendors’ support tools and/or KEDBs may also need to be used.

7.6 ACCESS MANAGEMENT
Access Management uses a variety of technologies, mainly:
■ Human Resource Management technology, to validate the identity of users and to track their status
■ Directory Services Technology (see section 5.8 for a description of Directory Services). This technology enables technology managers to assign names to resources on a network and then provide access to those resources based on the profile of the user. Directory Services tools also enable Access Management to create roles and groups and to link these to both users and resources
■ Access Management features in Applications, Middleware, Operating Systems and Network Operating Systems
■ Change Management systems
■ Request Fulfilment technology (see section 7.4).

7.7 SERVICE DESK
Adequate tools and technology support should be provided to enable Service Desk staff to perform their roles as efficiently and effectively as possible. This will include the following.

7.7.1 Telephony
Because a high percentage of incidents are likely to be raised by telephone calls from users, the Service Desk should be provided with good, modern telephony services. This should include:
■ An automated call distribution (ACD) system to allow a single telephone number (or numbers if a distributed or segmented Service Desk is the preferred option) and group pick-up capabilities. Warning: If options are offered via the ACD, via keyboard or Interactive Voice Recognition (IVR) selection, do not use too many levels of options or offer ambiguous options. Also do not include any ‘dead ends’ or options which, once chosen, do not allow the caller to go back to previous menus.
■ Computer Telephony Interface (CTI) software to allow caller recognition (via the linked ACD) and automated population of the users’ details into the incident record from the CMS.
■ VoIP – use of this technology can significantly reduce telephony costs when dealing with remote and international users
■ Statistical software to allow telephony statistics to be gathered and easily interrogated/printed for analysis – this should allow the following information to be obtained for any selected period:
● Number of calls received, in total and broken down by any ‘splits’ – where any call-routing has been chosen and being provided by an IVR system/keypad response
● Call arrival profiles and answer times
● Call abandon rates
● Call handling rates by individual Service Desk call handlers
● Average call durations
■ Hands-free headsets, with dual-user access capabilities (on at least some of the headsets) for use during training of new staff, etc.

7.7.2 Support tools
There is a range of free-standing Service Desk support tools available in the marketplace – and some organizations may choose to produce their own simple incident logging/management systems. If an organization seriously intends to implement ITSM then a fully integrated ITSM toolset will be required that has a CMS at the centre and provides integrated support for all the ITIL-defined processes.

Specific elements of such a tool that will be particularly beneficial for the Service Desk include the following.

Known Error Database
An integrated KEDB should be used to store details of previous incidents/problems and their resolutions – so that any recurrences can be more quickly diagnosed and fixed.

To facilitate this, functionality is needed to categorize and quickly retrieve previous Known Errors, using pattern matching and key word searching against symptoms.

Management of the KEDB is the responsibility of Problem Management, but the Service Desk will use it to help speed incident handling.

Diagnostic scripts
Multi-level diagnostic scripts should be developed, stored and managed to allow Service Desk staff to pinpoint the cause of failures. Specialist support groups and suppliers should be asked to provide details of the likely failures and the key questions to be asked to identify exactly what has gone wrong – and for details of the resolution actions to be taken.

These details should then be included in context-sensitive scripts that should appear on-screen, dependent upon the multi-level categorization of the incident, and should be driven by the user’s answers to diagnostic questions.

Self-Help web interface
It is often cost effective and expedient to provide some form of automated ‘Self-Help’ functionality, so users can seek and obtain assistance which will enable them to resolve their own difficulties. Ideally this should be via a 24/7 web interface that is driven by menu selection and might include, as appropriate:
■ Frequently asked questions (FAQs) and solutions.
■ ‘How to do’ search capabilities – to guide users through a context-sensitive list of tasks or activities.
■ A bulletin-type service containing details of outstanding service issues/problems together with anticipated restoration times.
■ Password change capabilities – using secure password protection software to check identities, perform authorization and change passwords without the need for Service Desk intervention.
■ Software fix downloads (patches, service packs, bug fixes etc. where it is determined that the user has the wrong version or a fix is needed) – tools are available to automate the checking process, to compare the actual desktop image with the agreed ‘standard’ builds and to allow upgrades to be offered and accepted where necessary.
■ Software repairs – where it is detected that a corruption may have occurred, to allow software fixes, removal and/or re-installation.
■ Software removal requests – automatically completed with any licence being returned to the pool.
■ Downloads of additional software packages – tools are available to check a pre-defined software policy and to allow the download of additional software packages, if covered by the policy. This can include automated software licence checks and financial approvals as well as CMS updating.
■ Advance notice of any planned downtime or service outages or degradations.

The Self-Help solution should include the capability for users to log incidents themselves, which can be used during periods that the Service Desk is closed (if not operating 24/7) and attended to by Service Desk staff at the start of the next shift.

Some care has to be exercised to ensure that the Self-Help activities selected for inclusion are not too advanced for the average user, and that safeguards are included to prevent a ‘little knowledge being a dangerous thing’! It may be possible to offer slightly more advanced Self-Help facilities to ‘Super Users’ who have had extra training. It is also necessary to be very careful about assumptions made when staffing a Service Desk about the amount of use that users will make of Self-Help facilities.

Note: As already covered in the list above, it is possible to combine some simpler Request Fulfilment activities as part of an overall Self-Help system – which can also be of significant benefit in reducing calls to the Service Desk (see paragraph 7.1.1 for further details).

Remote control
As already stated, but repeated here for completeness, it is often helpful for the Service Desk Analysts to be able to take control of the user’s desktop so as to allow them to conduct investigations or correct settings, etc. Facilities to allow this level of remote control will be needed.

7.7.3 IT Service Continuity Planning for ITSM support tools
Organizations are likely to become quickly dependent upon their ITSM tools and will find it difficult to work without them. A full Business Impact Analysis and Risk Analysis should be performed and plans then developed to ensure appropriate IT Service Continuity and resilience levels.

8 Implementing Service Operation

It should be noted that Service Operation is a phase in a lifecycle and not an entity in its own right. By the time a service, process, organization structure or technology is operating, it has already been implemented. However, there are a number of processes and functions described in this publication, and it is therefore important to address the implementation considerations which should have been addressed by the time they come into operation.

A number of these have been covered in the relevant section – for example guidance is given about organization structures and roles in Chapter 6. This will not be repeated here. Rather, this section will focus on some generic implementation guidance for Service Operation as a whole.

8.1 MANAGING CHANGE IN SERVICE OPERATION
Service Operation should strive to achieve stability – but not stagnation! There are many valid and advantageous reasons why ‘change is a good thing’ – but Service Operation staff must ensure that any changes are absorbed without adverse impact upon the stability of the IT services being offered.

8.1.1 Change triggers
There are many things that may trigger a change in the Service Operation environment. These include:
■ New or upgraded hardware or network components
■ New or upgraded applications software
■ New or upgraded system software (operating systems, utilities, middleware etc. including patches and bug fixes)
■ Legislative, conformance or governance changes
■ Obsolescence – some components may become obsolete and require replacement or cease to be supported by the supplier/maintainer
■ Business imperative – you have to be flexible to work in ITSM, particularly during Service Operation, and there will be many occasions when the business needs IT changes to meet dynamic business requirements
■ Enhancements to processes, procedures and/or underpinning tools to improve IT delivery or reduce financial costs
■ Changes of management or personnel (ranging from loss or transfer of individuals right through to major take-overs or acquisitions)
■ Change of service levels or in service provision – outsourcing, in-sourcing, partnerships, etc.

8.1.2 Change assessment
Service Operation staff must be involved in the assessment of all changes to ensure that operational issues are fully taken into account. This involvement should commence as soon as possible (see paragraph 4.6.1) not just at the later stages of change – i.e. CAB and ECAB membership – by which time many fundamental decisions will have been made and influence is likely to be very limited. The Change Manager should inform all affected parties of the change being assessed so input can be prepared and available prior to CAB meetings.

However, it is important that Service Operation staff are involved at these latter stages as they may be involved in the actual implementation and they will wish to ensure that careful scheduling takes place to avoid potential contentions or particularly sensitive periods.

8.1.3 Measurement of successful change
The ultimate measure of success in respect of changes made to Service Operation is that customers and users do not experience any variation or outage of service. So far as possible, the effects of changes should be invisible, apart from any enhanced functionality, quality or financial savings resulting from the change.

8.2 SERVICE OPERATION AND PROJECT MANAGEMENT
Because Service Operation is generally viewed as ‘business as usual’ and often focused on executing defined procedures in a standard way, there is a tendency not to use Project Management processes when they would in fact be appropriate. For example, major infrastructure upgrades, or the deployment of new or changed procedures, are significant tasks where formal Project Management can be used to improve control and manage costs/resources.

Using Project Management to manage these types of activity would have the following benefits:
■ The project benefits are clearly stated and agreed
■ There is more visibility of what is being done and how it is being managed, which makes it easier for other IT groups and the business to quantify the contributions made by operational teams
■ This in turn makes it easier to obtain funding for projects that have traditionally been difficult to cost justify
■ Greater consistency and improved quality
■ Achievement of objectives results in higher credibility for operational groups.

8.3 ASSESSING AND MANAGING RISK IN SERVICE OPERATION
There will be a number of occasions where it is imperative that risk assessment to Service Operation is quickly undertaken and acted upon.

The most obvious area is in assessing the risk of potential changes or Known Errors (already covered elsewhere) but in addition Service Operation staff may need to be involved in assessing the risk and impact of:
■ Failures, or potential failures – either reported by Event Management or Incident/Problem Management, or warnings raised by manufacturers, suppliers or contractors
■ New projects that will ultimately result in delivery into the live environment
■ Environmental risk (encompassing IT Service Continuity-type risks to the physical environment and locale as well as political, commercial or industrial-relations related risks)
■ Suppliers, particularly where new suppliers are involved or where key service components are under the control of third parties
■ Security risks – both theoretical and actual, arising from security-related incidents or events
■ New customers/services to be supported.

8.4 OPERATIONAL STAFF IN SERVICE DESIGN AND TRANSITION
All IT groups will be involved during Service Design and Service Transition to ensure that new components or services are designed, tested and implemented to provide the correct levels of functionality, usability, availability, capacity, etc.

Additionally, Service Operation staff must be involved during the early stages of Service Design and Service Transition to ensure that when new services reach the live environment they are fit for purpose, from a Service Operation perspective, and are ‘supportable’ in the future.

In this context, ‘supportable’ means:
■ Capable of being supported from a technical and operational viewpoint from within existing, or pre-agreed additional resources and skills levels
■ Without adverse impact on other existing technical or operational working practices, processes or schedules
■ Without any unexpected operational costs or ongoing or escalating support expenditure
■ Without any unexpected contractual or legal complications
■ No complex support paths between multiple support departments or third-party organizations.

Note: Change is not just about technology. It also requires training, awareness, cultural change, motivational issues and a lot more. Further details regarding wider management of change are covered in the Service Transition publication.

8.5 PLANNING AND IMPLEMENTING SERVICE MANAGEMENT TECHNOLOGIES
There are a number of factors that organizations need to plan for in readiness for, and during deployment and implementation of, ITSM support tools. These include the following.

8.5.1 Licences
The overall cost of ITSM tools, particularly the integrated tool that will form the heart of the required toolset, is usually determined by the number and type of user licences that the organization needs.

Such tools are often sold in modular format, so the exact functionality of each module needs to be well understood and some initial sizing must be conducted to determine how many – and what type – of users will need access to each module.

Licences are often available in the following types (the exact terminology may vary depending upon the software supplier).

Dedicated licences
For use by those staff who require frequent and prolonged use of the module (e.g. Service Desk staff would need a dedicated licence to use an Incident Management module).

Shared licences
For staff who make fairly regular use of the module, but with significant intervals in between, so can usually manage with a shared licence (e.g. third-line support staff may need regular access to an Incident Management module – but only at times when they are actively updating an incident record). The ratio of required licences to users needs to be estimated, so the correct number of licences can be purchased – this will depend upon the number of potential users, the length of periods of use and the expected frequency between usages to give an estimated concurrency level.

A shared licence usually costs more than a dedicated licence – but the overall cost is less as users are sharing and fewer licences are therefore needed in total.

Web licences
Usually allowing some form of ‘light interface’ via web access to the tool’s capabilities, this is usually suitable for staff requiring remote access, only occasional access, or usage of just a small subset of the functionality (e.g. engineering staff wishing to log details of actions taken on incidents or users just wanting to log an incident directly).

Web licences usually cost a lot less than other licences (may even be free with other licences!) and the ratio of use is also often lower – so overall costs are reduced further.

Note that some staff may require access to multiple licences (e.g. support staff may require a dedicated or shared licence when in the office during the day, but may require a web licence when providing out of hours support from home). Keep in mind that licences may be required for customers/users/suppliers using the same tool to input, view or update records or reports.

Note: Some licence agreements (of any of the types mentioned above) may restrict the usage of the software to an individual device or CPU!

Service on demand
There has been a trend within the IT industry for suppliers to offer IT applications ‘on demand’, where access is given to the application for a period of demand and then severed when it is no longer needed – and charged on the basis of the time spent using the application. This type of service may be offered by some ITSM tool suppliers – which could be attractive to smaller organizations or if the tools in question are very specialized and used relatively infrequently.

An alternative to this is where the use of a tool is offered as part of a specific consultancy assignment (e.g. a specialist Capacity Management consultancy, say, who may offer a regular but relatively infrequent Capacity Planning consultancy package and provide use of the tools for the duration of the assignment). In such cases the licence fees are likely to be included as part of, or as an addendum to, the consultancy fee.

A further variation is where software is licensed and charged on an agent/activity basis. An example of this is interrogation/monitoring and/or simulation software (e.g. agent software that can simulate pre-defined customer paths through an organization’s website, to assess and report upon performance and availability). Such software is typically charged on the basis of the number of agents, their location and/or the amount of activity generated.

In all cases, the licensing structure must be fully investigated and well understood during the procurement investigations and well before tools are deployed – so that the ultimate costs do not come as any sort of surprise.

8.5.2 Deployment
Many ITSM tools, particularly Discovery and Event Monitoring tools, will require some client/agent software to be deployed to all target locations before they can be used. This will need careful planning and execution – and should be handled through formal Release and Deployment Management (see Service Transition publication).

Even where network deployment is possible, this needs careful scheduling and testing – and records must be maintained throughout the rollout so that support staff have knowledge of who has been upgraded and who has not. Some form of interim Change Management may be necessary and the CMS should be updated as the rollout progresses.

It is often necessary for a reboot of the devices for the client software to be recognized – and this needs to be arranged in advance, otherwise long delays can occur if staff do not generally switch off their desktops overnight.

There may be particular problems deploying to laptops and other portable equipment and special arrangements may be necessary for staff to log on and receive the new software.

8.5.3 Capacity checks
Some Capacity Management may be necessary in advance to ensure that all of the target locations have sufficient storage and processing capacity to host and run the new software – any that cannot will need upgrading or replacing, and lead times for these actions need to be included in the plans.

for an additional period when the tools go live and into the future, as needed.
8.5.5 Type of introduction The capacity of the network should also be checked to A decision is needed on what type of introduction is establish whether it can handle the transmission of needed – whether to go for a ‘Big Bang’ introduction or management information, the transmission of log files and some sort of phased approach. As most organizations will the distribution of clients’ and also possibly software and not start from a ‘green field’ situation, and will have live configuration files. services to keep running during the introduction, a phased approach is more likely to be necessary. 8.5.4 Timing of technology deployment In many cases a new tool will be replacing an older, Care is needed to ensure that tools are deployed at the probably less sophisticated, tool and the switchover appropriate time in relation to the organization’s level of between the two is another factor to be planned. ITSM sophistication and knowledge. If tools are deployed This will often involve deciding what data needs to be too soon, they may be seen as an immediate panacea and carried forward from the old tool to the new one – and any necessary action to change processes, working this may require significant reformatting to achieve the practices or attitudes may be hindered or overlooked. required results. Ideally this transfer should be done A tool alone is usually not enough to make things work electronically – but in some cases a small amount of re- better. There is an old adage: ‘A fool with a tool is keying of live data may be inevitable and should be still a fool!’ factored into the plans. The organization must first examine the processes that the Caution: older tools generally relied on more manual entry tool is seeking to address and also ensure that staff are and maintenance of data so if electronic data migration is ‘bought in’ to the new processes and way of working and being used, an audit should be performed to verify data have a adopted a ‘service culture’. quality. 
However, tools can and often do make things a reality for Where data transfer is complicated or time consuming to many people – they are tangible and technical staff can achieve, an alternative might be to allow a period of immediately see how the new processes can work and parallel running – with the old tool being available for an how they may improve their way of working. initial period alongside the new one, so that historical data can be referenced if needed. In such cases it will be Some processes just cannot be done without adequate prudent to make the old tool ‘read-only’ so that no tooling, so there is a careful balance to be made to ensure mistakes can be made by logging new data in the old tools are introduced when they are needed – but not tool. before! Complete details on the Release and Deployment Similarly, care is needed to ensure that training in any Management process can be found in the Service tools is provided at the correct point – not too early or Transition publication. knowledge will diminish or be lost, but early enough so that staff can be formally trained and fully familiarize themselves with the operation of the tools well in advance of live deployment. Additional training should be planned Challenges, Critical Success Factors and risks 9 | 171 9 Challenges, Critical Success Factors and risks 9.1 CHALLENGES 9.1.2 Justifying funding There are a number of challenges faced within Service It is often difficult to justify expenditure in the area of Operation that need to be overcome. These include those Service Operation, as money spent in this sphere is often set out in this section. regarded as ‘infrastructure costs’ – with nothing new to show for the investment. 9.1.1 Lack of engagement with The Service Strategy publication discusses how to ensure a development and project staff Return on Investment and eliminate the perception of Traditionally, there has been a separation between Service investment as a purely Infrastructure ‘overhead’. 
Good Operation staff and those staff involved in developing new guidance is offered on how to justify investment. applications or running projects that will eventually deliver In reality, many investments in ITSM, particularly in the new functionality into the operational environment. Service Operation areas, can save money and show a This separation was originally deliberate and driven by positive Return on Investment – as well as resulting the desire to prevent collusion and avoid potential improvement in service quality. Some examples of security risks (in some organizations it is still a potential areas of savings include: legislative requirement). However, instead of using ■ Reduced software licence costs through the better this separation of duties to create positive contributions, management of licences and deployed copies in many organizations it is a source of rivalry and ■ Reduced support costs due to fewer incidents and political manoeuvring. problems and reduced resolution times All too often, ITSM is seen as something that has been ■ Reduced headcount through workforce rationalization, initiated in the operational areas and is nothing to do with supporting roles and accountability structures development or projects. ■ Less ‘lost business’ due to poor IT service quality This view is very damaging as the appropriate time to be ■ Better utilization of existing infrastructure equipment thinking of Service Operation issues is at the outset of new and deferral of further expenditure due to better developments or projects – when there is still time to capacity management include these factors in the planning stages. ■ Better-aligned processes, leading to less duplication of activities and better usage of existing resources. The Service Design and Service Transition publications describe the steps needed to ensure that IT Operations 9.1.3 Challenges for Service Operation issues are considered from the outset of new developments and projects. 
Managers The following is a list of some of the challenges that Anecdotes Managers in Service Operation should expect to face. There is no easy solution to these challenges, mainly One organization uses an ‘Operation Transition-In because they are by-products of the organization culture Policy’ to ensure that services being deployed have had the appropriate level of input from the and the decisions made during the process of deciding operational teams. This is basically a policy that the organizational structure. The purpose of including the clearly shows under what circumstances an list is to ensure that Service Operation Managers are application is ‘ready’ to transition into Operations. conscious of them and can create a plan to deal with This helped with communication to development and them. project teams and also provided a clear set of The differences between Design activities and Operational guidelines on how to work with the operational teams. activities will continue to present challenges. This is for a Another organization uses Operations Use Cases to number of reasons, including the following: get development teams to include requirements that should be fulfilled by the application to be run in production under the control of Operations personnel. 172 | Challenges, Critical Success Factors and risks ■ Service Design may tend to focus on an individual ■ Service Transition that is not used effectively to service at a time, whereas Service Operation tends to manage the transition between the Design and focus on delivering and supporting all services at the Operation phases. For example, some organizations same time. 
Operation Managers should work closely may only use Change Management to schedule the with Service Design and Service Transition to provide deployment of changes that have already been made the Operation perspective to ensure that design – rather than testing to see whether the change will and transition outcomes support the overall successfully make the transition between Design and operational needs. Operation. It is imperative that the practices of Service ■ Service Design will often be conducted in projects, Transition are followed and organization policies to while Service Operation focuses on ongoing, prevent poorly managed Change practices are in repeatable management processes and activities. The place. Operation, Change and Transition Managers result of this is that operational staff are often not must have the authority to deny any changes into the available to participate in Service Design project operational environment, without exception, that are activities, which in turn results in IT services that are not thoroughly tested. difficult to operate, or which do not include adequate These challenges can only be dealt with if Service manageability design elements. In addition, once Operation staff are involved in Service Design and project staff have finished the design of one IT Service Transition, and this will require that they are formally they could move onto the next project and not be tasked and measured to do this. Roles identified in the available to support difficulties in the operational Service Design processes should be included in Technical environment. Overcoming this challenge requires and IT Application Management staff job descriptions and Service Operation to plan for its staff to be actively their time allocated on a project-by-project basis. involved in design projects, to resource the transition activities and participate in Early Life Support of Another set of challenges relates to measurement. 
Each services introduced in the operational environment. alternative structure will introduce different combinations ■ The two stages in the lifecycle have different metrics, of items that are easy or difficult to measure. For example which encourages Service Design to complete the measuring the performance of a device or team could be project on time, to specification and in budget. In relatively easy, but determining whether that performance many cases it is difficult to forecast what the service is good or bad for the overall IT Service is another matter will look like and how much it will cost after it has altogether. A good Service Level Management process will been deployed and operated for some time. When it help to resolve this, but this means that Service Operation does not run as expected, IT Operations Management teams must be an integral part of that process (see is held responsible. While this challenge will always be Continual Service Improvement publication). a reality in Service Management, this can be addressed A third set of challenges relates to the use of Virtual by active involvement in the Service Transition stage Teams. Traditional, hierarchical management structures are of the lifecycle. The objective of Service Transition is to becoming inadequate because of the complexity and ensure that designed services will operate as expected diversity of most organizations. A management paradigm and the Operations Manager can provide the (Matrix Management) has emerged where employees knowledge needed to Service Transition to assess, and report to different sources for different tasks. This has remedy, issues before they become issues in the resulted in a complex web of accountability and an operational environment. increased risk of activities falling through the cracks. On the other hand, it also enables the organization to make skills and knowledge available where they are most needed to support the business. 
Knowledge Management and the mapping of authority structures will become increasingly important as organizations expand and diversify. This is discussed in the ITIL Service Strategy publication. One of the most significant challenges faced by Service Operation Managers is the balancing of many internal and external relationships. Most IT organizations today are complex and as services become more commoditized Challenges, Critical Success Factors and risks | 173 there is an increased use of value networks, partnerships should go out of their way to make their support known, and shared services models. While a significant advantage not just by their words but also by their actions and to dynamically evolving business needs, this increases the adherence to the organization’s agreed processes complexity of managing services cohesively, efficiently and and procedures. providing the invisible seam between the customer and Middle Managers should also give their full support to the intricate web of how services are actually delivered. A hiring staff to support the process, instead of accepting Service Operation Manager should invest in relationship the need for formalized Service Operation and then simply management knowledge and skills to help deal with the increasing the workload of existing staff to get it done. complexity of this challenge. 9.2.2 Business support 9.2 CRITICAL SUCCESS FACTORS It is important that the Business Units also support Service Operation. This level of support can be far better achieved 9.2.1 Management support if the Service Operation staff involve the business in all of Senior and Middle Management support is needed for all their activities and are open in their reporting of both ITSM activities and processes, particularly in Service successes and failures – and their efforts to improve. Operation. 
It is equally important that the Business Units understand, Senior Management support is critical for obtaining and accept and carry out the role they play in Service maintaining adequate funding and resourcing. Rather than Operation. Good service requires good customers! seeing Service Operation as a ‘black hole’ for investment, Adhering to the policies, processes and procedures, such Senior Management should quantify and champion the as using the Service Desk for logging all requests, is a benefits of good Service Operation. They should also be direct responsibility of the customer to support and fully informed of the dire results that can occur because of promote within the business. poor Service Operation. Regular communications with the business to understand Senior Management must provide visible support during their concerns and aspirations and to give feedback on the launch of new Service Operation initiatives (such as efforts to meet their needs are essential in building the through appearances at seminars, signatories to memos correct relationships and ensuring ongoing support. and announcements, etc.) and their ongoing support must Also the business should agree to the costs for be equally well demonstrated. Entirely the wrong implementing Service Operation and understand the messaging can be given if a senior manager fails to turn return on the investment, unless this has already been up to an important project meeting or launch seminar. agreed as part of the Design process. Even worse are senior managers who support the initiative verbally, but abuse their authority to encourage 9.2.3 Champions circumvention of the Service Operation practice. ITSM projects and the resulting ongoing practice Senior Managers should also empower the Middle (performed by Service Operation staff) are often more Managers who will be directly responsible for Service successful if one or more ‘champions’ are forthcoming Operation. 
Supporting the initiative publicly, but then who can lead others through their enthusiasm and overriding budget requirements or necessary changes, will commitment for ITSM. harm both the implementation and ongoing Service Operation initiative. In some cases these champions may be senior managers who are leading from the top. But champions can also be Middle Managers must also provide the necessary support successful if they come from other tiers of the – and in particular this should be demonstrated by their organization. One or two junior staff can still have a actions. If a Middle Manager is seen to be circumventing significant beneficial influence on a successful conclusion. or overriding an agreed procedure (e.g. implementing a change that has not been through the Change Champions are often created or heavily influenced Management process) then this gives the clear message through formal Service Management training, particularly that others can do the same – and that the procedure is at more advanced levels where the potential benefits to worthless and can be ignored by all. Middle Managers an organization, and to the individuals who make a career path in Service Management, can be fully explored. 174 | Challenges, Critical Success Factors and risks It should be noted that champions emerge over time. organization – and all must be instilled with a ‘Service They cannot be created or appointed. Often it is users or Management culture’. customers who provide the most help in creating good It is possible to have the finest Service Operation practice Service Management processes as they are acutely aware and tools in the world – but Service Management will not of needed improvements from a business perspective. It is be successful unless the people are also attuned to the important to recognize that these are usually highly overall Service Management objectives. 
Buy-in and motivated staff who often voluntarily take on the greatest support of all staff are therefore very important – and the workloads. If their input is to be most effective they must role of training and awareness, and even formal be given time to work as the champion. qualifications that benefit the individual, should not be underestimated. 9.2.4 Staffing and retention Having the appropriate number of staff with the Training required for successful Service Management appropriate skills is critical to the success of Service includes: Operation. Some challenges that need to be ■ Training IT staff on the processes that have been overcome include the following: implemented. This will include generic training so that ■ Projects for new services are usually quite good about they understand the concepts fully, as well as training specifying required new skills, but often underestimate specially targeted at the organization’s own processes the number of staff required and how to retain the ■ Training on ‘soft’ or ‘people’ skills, especially for those new skills. See paragraph 9.2.1 for some ideas on how staff in customer-facing positions to facilitate better communication about requirements. ■ Training about understanding the business, and the ■ Scarcity of resources who have a good understanding importance of achieving a service culture of Service Management. Having good technical people ■ Where tools have been implemented, training on how is necessary, but there needs to be a number of key to use and manage those tools people who are able to move between technology ■ Also, customers and users need appropriate training issues and service issues. on how to work with IT – access services, request ■ Since these resources are fairly scarce it is quite changes, submit requests, use tools, etc. common to train them, only to have them resign and join another company for a better salary. 
Clear career 9.2.6 Suitable tools paths and good incentives should be part of every Many Service Operation processes and activities cannot be Service Management initiative. performed effectively without adequate support tools (as ■ Attempting to assign too much, too soon, to existing outlined in Chapter 7). Senior management must ensure staff. Achieving efficient Service Operation will take that funding for such tools is included in ongoing budgets time, but if approached correctly it will be achieved. and support their procurement, deployment and ongoing Unfortunately, some managers try to expedite the maintenance. savings by assigning the interim work of implementing the new processes and tools to existing, 9.2.7 Validity of testing very busy, staff. Invariably either the project fails, or The quality of IT services that can be provided in Service service suffers and sometimes valuable staff will leave. Operation is dependent upon the quality of systems and Successful Service Management projects often require components delivered into the operational environment. a short-term investment in either temporary staff or contractors, and this should not be underestimated. The quality level will be significantly enhanced if adequate and complete testing of new components and releases is 9.2.5 Service Management training carried out in good time. Documentation should also be tested for completeness and quality. Adequate training and awareness can have much wider overall benefits. As well as creating champions of a few, it This requires a comprehensive and realistic testing can be used to win the ‘hearts and minds’ of many. environment to be in place for all systems/components – Service Operation staff must all be aware of the which mirrors the operational environment in terms of consequences of their actions, both good and bad, on the volume as well as characteristics. 
There should be Challenges, Critical Success Factors and risks | 175 independent testers wherever possible. Funding for such ■ Loss of key personnel: Sometimes the loss of one or testing environments is essential if high-quality services two key personnel can have a severe impact: to try to are to be achieved.