Troubleshoot something you don't know!

One day, three years ago, I found myself supporting users who worked on Avaya Elite, a call center solution. Somehow, I was expected not only to support the system operationally but also to build web and desktop applications that integrate with it.

The idea was simple. We had a solution that controls the company's telephone transactions. One of its tools is a desktop application linked to an application server that controls the inbound and outbound call centers. We needed to link that client desktop application with another application that pulls information from the company's ERP: about the caller (for inbound calls) or about the person to be called (for outbound calls).

That was the overview. I won't go further into the details, but here is the situation I faced:
The supplier provided no documentation on how to integrate the applications I would build with Avaya Elite, because they delivered the solution as is, without customizations or third-party integrations. And that wasn't all! There was no documentation for the database structure either, which I needed in order to push the data of customers who had to be called under certain criteria and to build custom reports beyond the default ones. On top of that, the client application kept stopping or making phantom calls with no apparent logic! Still, I came up with a simple approach that helped me solve these problems faster:

1. Understand the Vocabulary and terminologies of the system you need to troubleshoot:

In my case, I had to understand what "Program" means in the Avaya system, how it relates to "Skill", and how the two are used to generate a calling queue.
By understanding, I don't mean full, deep knowledge of the terminology. I mean a high-level view of how a term differs in this system or solution and how it affects other solutions or system components.

2. Log files and viewer: 

Knowing where the log files are, or whether the solution has a log viewer, helps you start investigating an issue. Most of the time, log files report system exceptions and warnings as they are, full of technical terminology. Even when they contain a human-readable message, it is usually phrased in the solution's own vocabulary. That's why the first point is important.
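As a rough illustration of that first pass over a log file, here is a minimal Python sketch that pulls out only the lines mentioning an error or warning. The keyword list and the sample lines are my own assumptions, not real Avaya output, so adjust them to whatever your solution actually writes.

```python
import re
from typing import Iterable, List, Tuple

# Assumed severity keywords; change them to match your solution's logs.
PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?|EXCEPTION|FATAL)\b", re.IGNORECASE)


def find_problem_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing a severity keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample lines, only for demonstration.
    sample = [
        "10:00 INFO agent logged in\n",
        "10:01 ERROR media server unreachable\n",
        "10:02 WARNING queue is empty\n",
    ]
    for number, text in find_problem_lines(sample):
        print(f"{number}: {text}")
```

To use it on a real file, pass an open file object instead of the sample list, since a file is also an iterable of lines.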

3. Look at the Database:

If you have access to the database, it can give you a deep understanding of the business logic behind the solution, how the data flows, and at which point transactions get stuck or aborted.
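When there is no documentation for the database structure, a simple first step is to list every table with its columns. The sketch below does that in Python against SQLite, which I use here only as a stand-in because it ships with Python. The table names are hypothetical, and on another database engine you would query its own catalog instead.

```python
import sqlite3
from typing import Dict, List


def describe_schema(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Map each user table name to the list of its column names."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        quoted = table.replace('"', '""')
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


if __name__ == "__main__":
    # Hypothetical tables, created in memory just for the demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE call_queue (id INTEGER, customer_id INTEGER, status TEXT)")
    conn.execute("CREATE TABLE customer (id INTEGER, phone TEXT)")
    for table, columns in describe_schema(conn).items():
        print(table, columns)
```

Column names alone often tell you how the data flows, for example which table holds the status of a transaction.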

4. Check the solution's services if you are in a Windows environment, or its processes on other platforms:

Knowing which services the solution runs and what exactly each one handles helps a lot. For example, if the log files indicate that something has stopped or that some servers can't be reached, you will know which services and processes to look at.
As you know, most servers (DB servers, web servers, ...) are virtual instances installed on the same machine and controlled through services that start and stop them.
Moreover, as an extension of the first point, you should know which servers the system interacts with and what each one does. In our example, we had an XML server and a Media server, neither of which is common in other solutions (for me, at least!), and they were the root cause of most of the issues I faced. Most of the log records came from their exception messages, those servers interacted directly with the database, and each had a list of services to check in case it was unreachable or stopped! See? All the points make sense now.
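When a log says a server can't be reached, one quick check besides the services list is whether anything is listening on that server's TCP port. Here is a minimal Python sketch of that check. The component names, host, and port numbers in the demo are placeholders I made up, not the real ports of any Avaya component.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder components and ports; replace them with your own.
    components = {"xml server": 8080, "media server": 9000}
    for name, port in components.items():
        state = "reachable" if is_port_open("127.0.0.1", port) else "NOT reachable"
        print(f"{name} on port {port}: {state}")
```

If the port is closed while the service shows as running, the service itself is the next thing to restart or inspect.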

5. Don't ping only the IP, ping the server name also:

In one case, users' client applications couldn't reach the application server! I checked the server and its services, and I pinged its IP from a user's machine and it responded. Another group of users on the same application server was working fine! By coincidence, I pinged the server by name and it didn't respond. It turned out that the infrastructure team had changed the company's IP ranges, which led to a DNS problem. What I had missed was that the client application talks to the application server by its name, not by its IP, and that was the root cause of the issue.
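The same check can be scripted: resolve the server name and compare the result with the IP you expect. This is a minimal Python sketch. The host name and addresses are made up, and the `resolve` parameter is my own addition so the function can be tried without touching a real DNS server.

```python
import socket
from typing import Callable


def check_name_resolution(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Report whether hostname resolves, and whether it resolves to expected_ip."""
    try:
        resolved = resolve(hostname)
    except OSError:
        # socket.gaierror (raised when a lookup fails) is a subclass of OSError.
        return "name does not resolve"
    if resolved != expected_ip:
        return f"name resolves to {resolved}, expected {expected_ip}"
    return "ok"


if __name__ == "__main__":
    # "appserver" and 10.0.0.5 are placeholders for your own server name and IP.
    print(check_name_resolution("appserver", "10.0.0.5"))
```

Run it from a user's machine, not only from the server, because the DNS answer can differ between network segments.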

I came up with these five steps three years ago, and I still use them to this day. For most people in operations, they may feel like things that come with gut instinct, but I believe that for most new developers they are a good start for understanding other solutions and legacy systems they have to deal with.



