Problem Management

My last workshop of the day was called "Improving the 'Best' of Problem Management," by Christopher Jones from MeadWestvaco. This was the second year he gave this talk--it was apparently very popular last year.

The two goals of problem management are to prevent incidents and to minimize the impact of incidents that can't be prevented. For them, problem management means:

* find the "true" root cause--for them, the point at which you can no longer ask "why?"
* publish known errors
* select and implement solutions to problems (in conjunction with change management)

All of their "major incidents" (major production-down issues) automatically get problem tickets created for them. Other problem tickets only get "approved" if there is a certain business value to exploring them.

For each problem, they assemble an ad-hoc problem team of subject matter experts. Everyone who is assigned action items must attend a weekly 8:30 AM meeting or send a delegate; otherwise their items get escalated to their managers.

They also require their outsourcer to perform problem management.

For major incidents, he recommended gathering as much data as possible at the time (e.g. copying logs), so you have more data for problem management. Another attendee recommended that, for recurring major incidents, you arrange with the business to get more time to diagnose the issue, depending on when the incident recurs.
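
As a rough sketch of the "copy the logs" advice (my own illustration, not something shown in the talk), a small script can snapshot log files into a per-incident folder before anyone restarts anything. The function name `snapshot_logs`, the incident ID format, and the paths are all hypothetical.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_logs(log_paths, archive_root, incident_id):
    """Copy each existing log file into a timestamped per-incident directory.

    Returns the destination directory and the list of copied files.
    Files with the same base name would overwrite each other; a real
    version would need to handle that.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(archive_root) / f"{incident_id}-{stamp}"
    dest.mkdir(parents=True, exist_ok=False)

    copied = []
    for src in map(Path, log_paths):
        if src.is_file():
            # copy2 keeps modification times, which helps when
            # reconstructing a timeline during problem management.
            target = dest / src.name
            shutil.copy2(src, target)
            copied.append(target)
    return dest, copied


if __name__ == "__main__":
    # Demo against a throwaway directory; the paths are made up.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "app.log"
        log.write_text("example log line\n")
        dest, copied = snapshot_logs([log], Path(tmp) / "archive", "INC-0001")
        print(f"copied {len(copied)} file(s) to {dest}")
```

Missing files are skipped rather than treated as errors, on the assumption that during an outage a partial snapshot is better than none.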

They have 13 "major incident managers"--leaders who can make the call about when to reboot a server. These people are trained in problem management because they have to decide how much data is practical to gather before working around the incident.
