Exchange 2013: When your Load Balancer marks your Client Access Server as offline.. and all services are up and running

Just recently, I was asked to help with an issue in an Exchange 2013 Organization. The problem was that the KEMP Load Balancer, that balances requests for multiple Exchange protocols, between several Exchange 2013 servers, was marking one of the CAS servers as being offline.

The server was not being marked as offline for all services, it was just for one, Outlook Web Access (OWA).

Before I go into server details, let me give you a visual representation of what the above means. In my case we had a KEMP Load Balancer, but this can apply to any load balancer.

mon2

What you can see above is the KEMP admin portal. On the left hand side if you go to Virtual Services and click “View/Modify Services”, you’ll see a table (that I conveniently don’t show, just so I don’t have to cover all information in it anyway). In that table you will have, per server, one or more “Real Servers”, and when that real server health cannot be verified, it will show in red.

What that means from an operational perspective, is that no request from that protocol (https/smtp/sip etc) will be sent to that server, meaning you can be in a situation of single point of failure and you won’t know it.

Now lets have a look at how those real servers health is checked.

mon3

If you click to modify the virtual service, in the “Real Servers” section you will see the check parameters. Basically what KEMP does is to get an https request to https://server/owa/healthcheck.htm

If that request fails, you should see a page not found and the server is marked as being offline. The below is the result you should expect from a healthy server.

mon4

In the URL above, you should see the server name. Below the 200 OK you should see the server FQDN.

Now that I explained, very high level, how the process works, lets drill down to the problem that I had.

Before detailing the problem that I had, I just want to point out that there are many reasons for a server to be marked as offline by a load balancer, and that you should check the obvious ones first, such as if all services are running, if Exchange is coming back as healthy when using tools such as the Get-ServerHealth or Exchange cmdlet, System Center Operations Manager (SCOM) if you have it, etc.

In my case there were no obvious reasons, plus I found it strange that only the https protocol was having issues, so this is what I did:

1 – I browsed to the https website that KEMP uses and noticed that I was getting page not found, which explains KEMP marking the server as not available

2- I then used another very handy Exchange PowerShell cmdlet, the Get-ServerComponent, to check which components were in an Active or Inactive state

See below an example of a good output for the server components, with all of the relevant ones being active.

mon5

But what I was getting was not what you see above. I was seeing the “OwaProxy” component as inactive, and that was a problem.

3- So what I did next was to look at the Exchange 2013 Health Mailboxes, searching for inconsistencies

And I found them…

mon1

When I ran a Get-Mailbox -Monitoring, to list all Health mailboxes, I found that some were corrupt, due to being associated with missing databases or missing mailbox servers.

This Exchange Organization has been downsized for the past year and during that cleanup those mailboxes were forgotten. If you have the same issue, don’t worry. Although it’s important to keep those mailboxes in a good state, you can easily recreate them, and that leads me to my next step.

4- Recreate the health mailboxes

So instead of me doing a lot of copy past into my blog post, check this Exchange 2013/2016 Monitor mailboxes article, from the product group blog, that not only has a lot of information about those mailboxes, but also has a perfect step by step at the bottom, on how exactly you recreate them.

And that’s what I did, recreate them, re-tested the URL https://server/owa/healthcheck.htm, which now worked perfectly and my load balancer was no longer marking my CAS server as unavailable in my https virtual service.

Hopefully this article can help you resolve your issue!

 

Advertisements