Tuesday, 17 November 2015

Add value, stop protecting people from the truth

Today there is a set of ITIL processes that are treated almost as a commodity. I’m talking about the Service Operation lifecycle processes. How they are used varies, but they are usually there in one way or another. Pretty much every service management tool vendor covers them and has done so for years. Still, in many situations the outcome of these processes does not cut it, and there is a struggle to keep up.


I’m going to go out on a limb here and say: I bet you it’s not due to people and it’s not due to tools. 


So what’s left? What is there to blame? Given my bet, the remaining options are processes and information. I blame both. The good thing is that they are pretty easy to fix. So let’s fix them, shall we? What should we fix? What is broken or lacking?


Let’s look at an example. When an incident occurs, the purpose of incident management is to minimize the adverse impact on the business and fix the incident as soon as possible. Incident management does not do this alone; several processes within operations interact to make it happen. "Shift left" is a term that illustrates this interaction. Event management, problem management and knowledge management (knowledge management is not part of the operation lifecycle, but still) all contribute to "shift left". "Shift left" basically means shortening the incident timespan by applying a resolution or workaround as early as possible. This is done by structuring knowledge so that it is available and applicable in the incident process. The information needed for this is usually stored and structured with the help of event, problem and knowledge management, but it is created from knowledge throughout the organization.


This is healthy, effective and efficient. There is still nothing to fix! We have good people involved, a good tool to support our way of working, the "shift left" information is flowing, and we have documentation that describes each individual process and how the processes interact.


With this in place, operations reporting consists of decreasing average resolution times, increasing first line fix rates, high utilization of workarounds, etc. Development reporting consists of shorter time to market, shorter average release cycles, a shrinking backlog, etc. This reporting is usually done in silos due to a lack of collaboration and information transparency.
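
To make the operations figures above concrete, here is a minimal sketch of how they might be computed. The `Incident` record and its field names are assumptions made for illustration; they are not taken from any particular service management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; all field names are assumptions for illustration.
    opened: datetime
    resolved: datetime
    resolved_at_level: int  # 1, 2 or 3 (support level that closed the incident)
    used_workaround: bool


def operations_report(incidents):
    """Return the operations-side figures as a plain dict."""
    total = len(incidents)
    if total == 0:
        return {
            "avg_resolution_hours": 0.0,
            "first_line_fix_rate": 0.0,
            "workaround_rate": 0.0,
        }
    # Sum the time each incident stayed open.
    total_time = sum((i.resolved - i.opened for i in incidents), timedelta())
    return {
        "avg_resolution_hours": total_time.total_seconds() / 3600 / total,
        "first_line_fix_rate": sum(i.resolved_at_level == 1 for i in incidents) / total,
        "workaround_rate": sum(i.used_workaround for i in incidents) / total,
    }
```

Note that nothing in such a report says why the incidents happened. Every number can improve while the same fault keeps recurring.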


Each area focuses on the issues and performance closest to its own daily work, the things its people see and do. So still, what is there to fix? Both operations and development are reporting positive progress, increasing productivity and decreasing lead times. The nature of this setup is that both departments will, over time, change the content of their reporting to address more and more specific subjects within their own area of expertise. The aim is of course to improve further, but it is important to remember that operations reporting is not only for operations.


This is where the processes and the availability of information might start to wear down. There is a risk in "shift left" that can be very hard to detect when everything seems to improve. In this vast flow of information (shift left) and the very effective use of it (early resolutions), we become heroes. We tend to forget that, at the end of the day, there was still an unwanted interruption. It’s unwanted by all parties, but we start treating it as a fact instead of a fault.


If we treat an incident as a fact, we tend to focus on damage control, and we might even be very good at it for a very long time. Everyone becomes a hero. Heroes of Operations apply quick solutions and workarounds to incidents. Heroes of Development provide the knowledge to do so.


If we instead treat an incident as a fault, we have the possibility to focus on prevention, but only if we stop hiding the truth from those it concerns. If an incident is the result of a fault, and the incident can be mitigated by a solution early on, there is still the question of why the incident happened in the first place.


That incidents will occur is a fact. But why a given incident occurred is still a question that needs attention. Handling incidents effectively is not enough. The risk of "shift left", if not managed wisely, is that we build information walls between the truth (the actual business impact) and the root cause of the fault (usually human error). These walls arise unintentionally when support focuses only on the closest thing at hand, to "minimize the adverse impact on business and fix the incident as soon as possible", when the guiding star should be that an incident should never happen at all, or at least not repeatedly.


A common support setup is 1st, 2nd and 3rd level support within incident management. Each level of support is eager to use the available knowledge to minimize incident impact, and as soon as that is achieved, the incident is closed and statistics are reported. If it is not achieved, the incident is escalated to the next level of support, where the procedure is repeated. If we are good at "shift left", incidents will be solved early on and the truth might never become visible at the source of the fault. The weakness here is that if the "why" is not addressed, the source will be blind to the truth. If the source is crappy developer code or bad infrastructure design, it will continue to generate faults.


The effective handling of incidents is an operations concern. Why an incident occurred is a design concern and should be addressed as such. This painful truth must be available at the source of the fault in a valuable and actionable way. This does not mean that all incidents should be subject to problem management, oh no. But the incident reporting and statistics are equally important for both operations and development. How the incident statistics should be used differs, though. Operations uses them to improve handling; Development should use them to improve quality and prevent incidents. A working feedback loop is the key to making people aware of this truth.
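
One way to picture such a feedback loop is a report that groups closed incidents by where the fault originated, regardless of which support level resolved them. The sketch below is only an illustration under stated assumptions: the `ClosedIncident` record, its fields and the `min_repeats` threshold are all hypothetical.

```python
from collections import Counter, namedtuple

# Hypothetical closed-incident record; the field names are assumptions.
ClosedIncident = namedtuple(
    "ClosedIncident", "component resolved_at_level business_minutes_lost"
)


def feedback_for_development(incidents, min_repeats=2):
    """Summarize recurring incidents per source component.

    Incidents fixed quickly at first line are counted too, so that early
    resolution does not hide a recurring fault from its source.
    """
    counts = Counter(i.component for i in incidents)
    impact = Counter()
    for i in incidents:
        impact[i.component] += i.business_minutes_lost
    report = [
        {"component": c, "incidents": n, "business_minutes_lost": impact[c]}
        for c, n in counts.items()
        if n >= min_repeats
    ]
    # Put the source with the largest accumulated business impact first.
    return sorted(report, key=lambda r: r["business_minutes_lost"], reverse=True)
```

The design choice here is that the resolution level plays no part in the filter. An incident closed in minutes at first line still counts against the component that caused it.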


Done right, there is nothing wrong with 1st, 2nd and 3rd level support within incident management. There is nothing wrong with "shift left" either. The missing piece is to ensure that the truth is transparent all the way to the source. And of course, at the source it needs to be treated as valuable knowledge for improvement.


We need processes that are designed to provide this valuable feedback, and we need the information (the truth) to be transparently available all the way to the source. Basically, stop protecting people from the truth, or you could say, stop hiding it.