The move to a service-focused approach to IT service delivery brings with it a greater focus on the application layer. When you’re faced with establishing monitoring for a mission-critical enterprise line-of-business (LOB) application, you may feel overwhelmed and unsure where to begin. That’s a situation I face together with customers, often with reputations and big dollars at stake. This is something you won’t find a blueprint for in any product manual.
In this installment of Private Cloud in the Real World (PCITRW), I’m going to share with you a lightweight (quick-and-dirty) version of a 5-step strategy I developed years ago for establishing an effective LOB application monitoring game plan when you’re short on time and resources. So let’s get started.
Step 1. Identify the Service Chain
If you are not familiar with the application, you may have to identify one subject matter expert (SME) to help you document the service chain: the chain of systems, devices, and software components that play a part in a complete transaction. (Identifying the SMEs for each component comes in step 2.) However, I don’t like to bring all the SMEs in at this point, because it can slow things down. Knowledgeable people all have opinions, and some will conflict with others; too much talking doesn’t serve you well early in the process. I like to pick up with the SMEs one by one later in the process, when I’m ready for them, so I can better control the pace and flow of the information gathering.
NOTE: I like to put this information into a simple Visio diagram that identifies the service components and, potentially, the high-level steps in the transaction, although we likely won’t have the details yet.
Step 2. Identify the subject matter experts (SMEs) for each component in the service chain
To establish an in-depth monitoring strategy, you’re going to need to know how the application works at a pretty fine level of detail. In the largest applications, it could take 5-10 SMEs to tell the whole story (yes, really). Find out early in the game who knows what about the application, as you can’t build an effective monitoring strategy without deep knowledge of the application – and you can’t be an expert on every app out there.
IMPORTANT: Put the name of the SME by each component in the service chain so you know who to talk to when you’re ready.
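If you prefer to keep the service chain in something more structured than a diagram, here is a minimal Python sketch of the idea. The component names, kinds, and SME names are all hypothetical, invented purely for illustration; the point is simply that each component carries the name of its SME, and that you can quickly list the components that still have no one assigned.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str      # component in the service chain
    kind: str      # rough category, e.g. "web", "database", "network"
    sme: str = ""  # subject matter expert; empty until one is identified


# Hypothetical service chain for an illustrative order-entry application.
service_chain = [
    Component("Load balancer", "network", "Pat"),
    Component("Web front end", "web", "Alex"),
    Component("Order service", "middleware"),
    Component("SQL database", "database", "Sam"),
]


def components_without_sme(chain):
    """Return the names of components that still have no SME assigned."""
    return [c.name for c in chain if not c.sme]


print(components_without_sme(service_chain))
```

Anything this function returns is a component you cannot yet ask questions about, which is worth knowing before you start step 3.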
Step 3. Map the dependencies between application components
Now it’s time to begin talking to the SMEs. Understanding which components in the service chain depend on which other components helps us establish the workflow for troubleshooting application issues. Just as we were all taught to start at layer 1 (the physical layer) when troubleshooting network problems, understanding dependencies between components in the service chain will help you establish a “bottom-to-top” troubleshooting strategy for identifying root cause. Perhaps the easiest way to identify dependencies between application components is to simply map the steps and communication between components in the course of a single transaction.
IMPORTANT: Write this down. This information will come in handy when you want to design the service and health models for a management pack down the road.
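One way to turn a written-down dependency map into a “bottom-to-top” checking order is a topological sort. The sketch below uses Python’s standard-library graphlib module (Python 3.9 or later); the component names and the dependencies between them are hypothetical and stand in for whatever your SMEs tell you.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what it depends on.
depends_on = {
    "Web front end": {"Order service", "Load balancer"},
    "Order service": {"SQL database", "Message queue"},
    "SQL database": {"Storage"},
    "Message queue": set(),
    "Load balancer": set(),
    "Storage": set(),
}


def troubleshooting_order(deps):
    """List components so each appears after everything it depends on."""
    return list(TopologicalSorter(deps).static_order())


for step, component in enumerate(troubleshooting_order(depends_on), start=1):
    print(step, component)
```

Components with no dependencies come out first, so checking health in this order means you never examine a component before the things underneath it. If the map contains a circular dependency, graphlib raises a CycleError, which is itself a useful finding to take back to the SMEs.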
Step 4. Identify the fastest, easiest method to identify application component health
While you’re talking to the SMEs, there is another bit of very useful info you can gather. In the long run you want to establish a comprehensive monitoring strategy, but in the short term, the goal is often to minimize the mean-time-to-recovery (MTTR). If you can stop the bleeding, you’ll have some breathing room to develop the solution for your comprehensive long term plan. So, when you are talking to your SMEs about each component in the service chain, ask them how they determine if the component is healthy or unhealthy. I’ve frequently been pleasantly surprised to learn that the litmus test to determine the health of a component within a complex distributed application was as simple as checking for files in a folder or retrieving the date and time from a web service.
Ultimately you want to build a comprehensive monitoring strategy, but identifying these “quick hits” can allow you to create a simplified first version to assist in quick identification of a root cause, lessening the impact of the problem in the short term.
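To show how small these “quick hits” can be, here is a rough Python sketch of the two litmus tests mentioned above. It rests on assumptions of mine, not on any particular application: that files piling up in a folder mean a component has stopped processing, and that a web service counts as healthy when it answers with HTTP 200. The function names, the threshold, and the timeout are all illustrative.

```python
import os
import urllib.request


def count_files(folder):
    """Count regular files directly inside the folder."""
    with os.scandir(folder) as entries:
        return sum(1 for entry in entries if entry.is_file())


def folder_is_healthy(folder, max_files=0):
    """Healthy when no more than max_files are waiting in the folder.

    Assumption: a backlog of files means the component stopped processing.
    """
    return count_files(folder) <= max_files


def web_service_is_healthy(url, timeout=5):
    """Healthy when the service answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Your SMEs may tell you the opposite rule applies to a given folder (for example, that a heartbeat file must be present), in which case the comparison simply flips. Either way, a check like this is a candidate for a simple script-based monitor later on.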
Step 5. Determine what monitoring is covered by OpsMgr off-the-shelf…and what is not
Operations Manager 2012 delivers a lot of off-the-shelf monitoring functionality, so it’s important to take a moment to figure out what’s available off-the-shelf and what you have to build to bridge the gaps. This type of gap analysis is important, as you’ll want to begin lobbying for funds to build out your long-term strategy early in the game. If you can identify what you don’t have today in terms of monitoring logic, you can talk to vendors about what it will take to buy or build the monitoring logic (management packs in OpsMgr terms) you need to deliver the desired comprehensive end result.
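At its simplest, the gap analysis is a set difference: everything in the service chain, minus whatever already has monitoring. The Python sketch below assumes you have both lists written down; the component names and the “covered” set are hypothetical and say nothing about what any actual management pack covers.

```python
# Hypothetical: every component identified in the service chain.
required = {
    "Load balancer",
    "Web front end",
    "Order service",
    "SQL database",
    "Message queue",
}

# Hypothetical: components you have confirmed are monitored off-the-shelf.
covered_off_the_shelf = {"Web front end", "SQL database"}


def monitoring_gaps(required_components, covered_components):
    """Return, in sorted order, the components with no monitoring today."""
    return sorted(set(required_components) - set(covered_components))


for component in monitoring_gaps(required, covered_off_the_shelf):
    print("Gap:", component)
```

The resulting list is the starting point for the buy-or-build conversation with vendors and for the funding request.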
I hope you found this installment useful. Please don’t hesitate to let me know if there are elements within this conversation you’d like to discuss at greater depth.
On a loosely related note, over the past few months, I have written many articles on System Center 2012 and MS private cloud solutions. Links to a few of these resources are shown below.