29 September 2015 blogs Steven Dwyer 8 min read
We’ve been writing a series of posts under the broad theme of “Jump-starting ITSM in your organization.” Throughout the series, we’re providing advice on how you can start applying the principles and practices of ITSM with tools your organization already uses, and on the benefits of doing so.
In part one of the series, “ITSM and System Center Operations Manager,” we presented an introduction to ITSM, discussed the different ITSM frameworks that are available, and discussed how Microsoft’s System Center Operations Manager provides support for some of the pillars and objectives of ITSM.
In part two, “Building Your Service Catalog and Service Maps,” we discussed the importance of cataloging and mapping out services (including their dependencies), and talked about how you can use the distributed application feature in System Center Operations Manager to relate services to the IT components that underpin them.
In this part we’ll be looking at incident, problem and change management. We’ll talk about the differences between problems and incidents, and how a properly updated configuration management database (or CMDB for short) can quantify the business impact of changes to your systems.
Each of the various ITSM frameworks has its own definition of an incident, but the common thread among them is that an incident is an unplanned interruption or a reduction in quality of a service. Note that an incident is an event (the service is not available) rather than a cause (a disk drive failure). If you have service level agreements (SLAs) in place for your services, it is very likely that an incident is going to impact your SLA and may even cause a breach.
An IT organization that follows an ITSM framework should have an incident management process in place. While it is understood that incidents should be discovered and remedied as quickly as possible, IT organizations should also track and analyze incidents as part of a continual improvement process. By understanding which incidents occur most frequently and which cost the business the most, resources – including IT, development, and vendor compliance – can be deployed to have the maximum impact.
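As a rough illustration of the analysis described above, the following Python sketch ranks services by the total cost of their incidents while also counting how often each one was affected. The `Incident` record, its fields, and the sample figures are all hypothetical; a real incident management tool would supply its own data and its own way of estimating business cost.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    service: str
    downtime_minutes: int
    cost_per_minute: float  # assumed business cost of the outage


def summarize(incidents):
    """Return (service, incident_count, total_cost) rows, costliest first."""
    totals = defaultdict(lambda: [0, 0.0])
    for inc in incidents:
        totals[inc.service][0] += 1
        totals[inc.service][1] += inc.downtime_minutes * inc.cost_per_minute
    return sorted(
        ((svc, count, cost) for svc, (count, cost) in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )


# Made-up sample data for illustration only.
incidents = [
    Incident("email", 30, 10.0),
    Incident("web shop", 20, 50.0),
    Incident("email", 45, 10.0),
]
summary = summarize(incidents)
for service, count, cost in summary:
    print(f"{service}: {count} incident(s), total cost {cost:.2f}")
```

In this made-up data set the web shop has fewer incidents than email but a higher total cost, which is the kind of distinction that helps decide where to deploy resources first.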
An incident will typically be caused by one or more problems. Again, various ITSM frameworks have different definitions of problems, but a very nice one comes from Rob England in The ITSM Review where he states, “a problem is in fact the cause of zero or more incidents.” We can certainly understand that a problem can be the cause of one or more incidents: perhaps a piece of hardware failed, which generated a service outage. However, England’s definition also allows for latent problems that have not yet caused an incident. For example, if a printer ceases to operate outside of business hours, it is a problem but not yet an incident, since no one is around to attempt to use the printer. Further, the code your developers write may have undiscovered bugs that haven’t yet been triggered. These constitute a problem that no one is aware of, and that may or may not lead to an incident in the future.
We should further differentiate root cause from problem. All problems will have a root cause, but the root cause will not always be known. And IT can certainly resolve an incident without discovering the root cause of the problem behind it; more often than not the advice to “reboot your PC” will cause a problem to disappear without anyone being the wiser as to its cause.
An IT organization following an ITSM framework should have a problem management process in place. This process will include the discovery of root causes of problems as well as mitigation of those causes. As with incident management, problems should be tracked and analyzed so that commonalities can be discovered. Perhaps a certain brand or model of disk drive has a higher failure rate than others, or a particular IaaS or PaaS vendor is discovered to have difficulty meeting their SLAs on a consistent basis.
Problem management can be either reactive or proactive. Reactive management occurs when problems have already caused incidents and steps must be taken to resolve the current incident and prevent future incidents. Proactive management includes solving problems before they are noticed by service users (i.e. before they cause incidents), as well as activities such as auditing code to find bugs.
As we alluded to before, incident and problem management do not need to take place at the same time. For critical services, it is most important that the incident be resolved as quickly as possible. It is certainly possible to resolve an incident without resolving the underlying problem. A service can be switched over to a back-up site, work-arounds can be deployed, or automated processes can be replaced with manual intervention. Once the incident is resolved, the IT organization can concentrate on the task of resolving the underlying problems.
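The relationship between problems and incidents described above can be sketched as a small data model. This is only an illustration in Python with hypothetical class and field names, not the schema of any particular ITSM tool: a problem holds zero or more incidents, its root cause may remain unknown, and an incident can be closed while the problem behind it stays open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    summary: str
    resolved: bool = False


@dataclass
class Problem:
    summary: str
    root_cause: Optional[str] = None  # may never be discovered
    resolved: bool = False
    incidents: List[Incident] = field(default_factory=list)  # zero or more

    def is_latent(self) -> bool:
        """A problem that has not (yet) caused any incident."""
        return not self.incidents


# A printer that fails outside business hours: a problem, but no incident yet.
printer = Problem("Printer offline outside business hours")

# A failing disk that has already caused a service outage.
disk = Problem("Failing disk in storage array")
outage = Incident("File share unavailable")
disk.incidents.append(outage)

# Switching to a back-up site resolves the incident, not the problem.
outage.resolved = True
```

Keeping the two records separate is what lets the incident be closed quickly while problem management continues at its own pace.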
Change management is the process of making changes to the IT infrastructure in a standardized and systematic manner. Changes can include replacing or upgrading the capacity of hardware, upgrading to a new version or rolling back to an old version of software, or switching to new vendors of IaaS and PaaS solutions. Changes can be both a response to problems and incidents and a cause of them.
Many organizations will have a change advisory board that is required to sign off on all changes before they take place. This board will carefully examine the impact of any changes so as to prevent incidents stemming from them. The board may either veto a proposed change or require mitigation measures, such as standing up a fail-over site, to be put in place before the change.
The most valuable tool that a change advisory board can have at its fingertips is an up-to-date configuration management database (CMDB). The CMDB will inventory all IT assets (known as configuration items or CIs) and relate them to one another. CIs can include physical hardware, software, SaaS subscriptions, or even people such as business service owners. With reference to the CMDB, the change advisory board can determine what services will be impacted by any particular set of changes.
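To make the idea of impact analysis concrete, here is a minimal Python sketch that treats a CMDB as a dependency graph and walks it upward from the CIs being changed to the services that rely on them. The CI names and the dictionary layout are hypothetical and chosen for illustration; real CMDB products store and query these relationships in their own ways.

```python
from collections import deque

# Hypothetical CMDB extract: each CI maps to the CIs it depends on.
depends_on = {
    "web shop": ["web server", "order database"],
    "email": ["mail server"],
    "web server": ["host-01"],
    "order database": ["host-02", "san-01"],
    "mail server": ["host-02"],
}
services = {"web shop", "email"}


def impacted_services(changed_cis, depends_on, services):
    """Return the services that directly or indirectly depend on changed CIs."""
    # Invert the graph so we can walk from a component to its dependents.
    dependents = {}
    for ci, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(ci)

    # Breadth-first search upward from the changed CIs.
    seen = set(changed_cis)
    queue = deque(changed_cis)
    while queue:
        ci = queue.popleft()
        for parent in dependents.get(ci, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen & services


print(sorted(impacted_services(["host-02"], depends_on, services)))
```

In this sample graph, a change to `host-02` reaches both the web shop (through the order database) and email (through the mail server), while a change to `host-01` reaches only the web shop.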
If your organization uses System Center Operations Manager, you already have the beginnings of a CMDB. Operations Manager is able to examine your IT environment and discover a wide variety of IT assets. With the use of distributed applications, whether created natively or with tools such as Savision’s Live Maps, you can model your services and relate them to the IT assets they are composed of.
You can synchronize configuration items and the relations between them from Operations Manager into both Microsoft’s System Center Service Manager and ServiceNow using Savision’s Live Maps. Both of these products offer CMDB functionality as well as incident and problem management.
By synchronizing from Operations Manager and using Live Maps’ ability to dynamically associate components to services, you can avoid the tedious work involved in manually keeping your CMDB up to date. Rather, as your IT infrastructure evolves, Operations Manager will update its inventory, Live Maps will update your service dependencies, and Live Maps will then synchronize the new relationships into Service Manager and ServiceNow.
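Conceptually, keeping a CMDB current comes down to comparing the relationships that monitoring has discovered with the ones the CMDB already holds. The Python sketch below illustrates that comparison with plain sets. It is not how Live Maps, Service Manager, or ServiceNow implement synchronization, and every name in it is hypothetical.

```python
def diff_relationships(discovered, cmdb):
    """Compare two collections of (service, component) pairs.

    Returns (to_add, to_remove): relationships missing from the CMDB and
    relationships the CMDB still holds but discovery no longer reports.
    """
    discovered, cmdb = set(discovered), set(cmdb)
    to_add = sorted(discovered - cmdb)
    to_remove = sorted(cmdb - discovered)
    return to_add, to_remove


# Made-up example: a web server was added and an old mail server retired.
discovered = {
    ("web shop", "web-01"),
    ("web shop", "web-02"),
    ("email", "mail-01"),
}
cmdb = {
    ("web shop", "web-01"),
    ("email", "mail-01"),
    ("email", "mail-old"),
}
to_add, to_remove = diff_relationships(discovered, cmdb)
print("add:", to_add)
print("remove:", to_remove)
```

Doing this comparison by hand for every infrastructure change is the tedious work that automated synchronization is meant to remove.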
Once you have all the mappings between components and services in your CMDB, incident, problem, and change management become much easier. Root cause analysis of problems becomes faster, and your change advisory board can immediately see the impact of any changes on your services. You can also practice proactive problem management by determining where incidents could occur in the future based on problems that have arisen today. (We provided a sobering example of an organization that allowed a preventable incident to occur due to improper problem prioritization in our blog post “Savision: a must-have for SCOM users”.)
It can be a challenge to implement an ITSM framework in any IT organization. But, by leveraging investments that you’ve already made in tools and practices, you can simplify the process and make it easier to gain buy-in from critical stakeholders. We’ve provided some advice on how to overcome objections to deploying the ITIL ITSM framework in our blog post “Overcoming Obstacles in Adopting ITIL”, and our technical sales staff are happy to work with you to develop a proof of concept and gain senior management approval.
About: Steven Dwyer
Steven was the Vice President of Research and Development at Savision where he was responsible for delivering excellence in System Center solutions. He has led development teams at Sitebrand and Shout Research, and has developed System Center Operations Manager management packs for applications like Apache, Tomcat, and Oracle WebLogic.
Steven has a Master’s degree in Theoretical Physics from the University of Waterloo.