Wednesday, March 30, 2011

ESXi 4.1 hosts in Standby Mode cause faults in Cisco UCS Manager for B230 M1 blades

For those who have seen my previous post about testing standby mode, I noticed something odd happening in the Cisco UCS B-Series environment I used to generate the screenshots, and I thought it might be worthwhile to write a post about it.  Right after I successfully turned on DPM for the cluster and servers were put into Standby Mode, UCS Manager began indicating faults for some of the servers, and as you can probably already guess, the servers with faults were the ones currently in Standby Mode.

Navigating to the Faults tab for the servers in UCS Manager would show the following:

Affected object: sys/chassis-1/blade-1/fabric-A/path-1/vc-1450

Description: fc VIF 1 / 1 A-1450 down, reason: None

ID: 285353

Cause: link-down

Code: F0283

Original severity: major
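Faults like this are easy to eyeball one at a time, but if you need to sift through a batch of them, the key/value layout above lends itself to a quick script.  A minimal sketch in Python, assuming the faults have been exported in the same "Key: value" text form shown above:

```python
def parse_fault(text):
    """Parse a UCS Manager fault exported as "Key: value" lines into a dict."""
    fault = {}
    for line in text.strip().splitlines():
        if ":" in line:
            # Split on the first colon only; values such as the
            # description may themselves contain colons.
            key, _, value = line.partition(":")
            fault[key.strip().lower().replace(" ", "_")] = value.strip()
    return fault

sample = """\
Affected object: sys/chassis-1/blade-1/fabric-A/path-1/vc-1450
Description: fc VIF 1 / 1 A-1450 down, reason: None
ID: 285353
Cause: link-down
Code: F0283
Original severity: major
"""

fault = parse_fault(sample)
print(fault["cause"], fault["code"])
```

From here it's trivial to filter a dump of faults down to just the link-down ones (cause equal to link-down) and check whether they line up with the blades currently in Standby Mode.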

I’m not sure if this is by design, but once I took the B230 M1 blades out of Standby Mode, the faults went away.  The firmware this infrastructure was running was version 1.4(1j).

Testing standby mode before enabling VMware DPM (Distributed Power Management)

I was recently asked by an ex-colleague of mine about a warning he received when trying to enable DPM, where he was presented with the following message:

One or more hosts in the cluster have never exited from standby state, or failed in the last attempt.

It is recommended that you manually test the enter and exit standby commands of these hosts before enabling power management.

Click Cancel and go to the Host Options page to see which hosts need to be tested (recommended).

Click OK to enable power management.

Those who are not familiar with VMware vSphere might not know what to do, and this was the case for my ex-colleague.  Here’s what you’re supposed to do as per the vSphere Resource Management Guide:

http://www.vmware.com/pdf/vsphere4/r40/vsp_40_resource_mgmt.pdf

… you should go into the cluster’s settings and navigate to VMware DRS –> Power Management –> Host Options.

From the Host Options window, you’ll be able to see when each host last exited standby mode.  For the hosts that have a status of Never, right-click them in the Hosts and Clusters view and select Enter Standby Mode.
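The selection logic here is simple enough to sketch out.  This is purely illustrative (the host names and the Never status string are made up for the example), not a call into the vSphere API:

```python
def hosts_needing_standby_test(host_options):
    """Return the hosts whose Last Time Exited Standby shows "Never",
    i.e. the ones that should be manually tested before enabling DPM."""
    return [host for host, last_exit in host_options.items()
            if last_exit == "Never"]

# Hypothetical contents of the Host Options page
host_options = {
    "esx01.lab.local": "3/30/2011 10:15 AM",
    "esx02.lab.local": "Never",
    "esx03.lab.local": "Never",
}
print(hosts_needing_standby_test(host_options))
```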

A host that has successfully entered standby will appear in a standby state in the inventory.

From here, right-click the host again and select Power On.

As the host begins to power up, you should see the task’s progress in the Recent Tasks list.

Once the server finishes booting up, the host should revert back to its previous state.

Navigating back to Host Options, you should now see a date and time in the Last Time Exited Standby field.

Once you’ve completed this for all the hosts in the cluster that you want to enable for DPM, you should be able to turn on DPM without receiving the warning message shown earlier in this post.

Problems moving an ESX/ESXi host into EVC cluster? Try putting them in maintenance mode first

I’ve recently been asked about issues people encounter when trying to enable Enhanced VMotion Compatibility (EVC) and move ESX/ESXi hosts into those clusters.  While I’m not going to go into the details of EVC, what’s important to understand is that because you are setting the cluster to the lowest common denominator of the CPU features across the hosts in the cluster:

… any host that is to be moved into the cluster, whether it hosts virtual machines or not, will need to be put into maintenance mode first.  Failing to do so will throw the following error when you try to move it into an EVC-enabled cluster:

The operation is not allowed in the current state.

What you need to do is put the host into maintenance mode first.

Then move the hosts into the new EVC-enabled cluster.
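To illustrate the lowest common denominator point, here’s a small sketch.  The mode names and their ordering are illustrative assumptions (loosely modeled on Intel EVC baselines), not values read from vCenter:

```python
# Hypothetical EVC baselines, ordered oldest to newest.
EVC_MODES = ["merom", "penryn", "nehalem", "westmere"]

def cluster_evc_baseline(host_modes):
    """The cluster baseline is the lowest common denominator: the oldest
    CPU generation present across the hosts in the cluster."""
    return min(host_modes, key=EVC_MODES.index)

print(cluster_evc_baseline(["westmere", "nehalem", "penryn"]))
```

Any host joining the cluster gets masked down to that baseline, which is why vCenter is strict about what state hosts are in when they are moved.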

How long should you perform memory testing on vSphere 4.x ESX/ESXi hosts?

This post will not be the usual type of post I write, where I state a problem and then a solution, but rather a question I have about the best practice for memory testing prior to deploying an ESX or ESXi host to production.  As per the Performance Best Practices for VMware vSphere 4.0 guide: http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf, VMware recommends that we:

Test system memory for 72 hours, checking for hardware errors.

I’ve recently been a part of a Cisco UCS deployment for multiple datacenters and had to run memtest on some B230 M1 blades that had 128GB of memory, and I noticed that even after 40 hours, memtest had only completed 2 passes.

What’s also interesting is that even on a blade that only had 32GB, 40 hours yielded only 3 passes on the memory.

I’ve asked my practice lead about this, and he tells me that some of these guidelines may be out of date, so adjust them accordingly.  This leads me to wonder what other professionals out there are doing, so if you’re one of them, feel free to comment.
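For rough planning purposes you can extrapolate from an observed pass rate.  A quick back-of-the-envelope calculation using the numbers above (this assumes the pass rate stays constant, which memtest doesn’t guarantee):

```python
def passes_in_window(observed_hours, observed_passes, window_hours):
    """Estimate how many full memtest passes fit in a test window,
    assuming the observed pass rate holds."""
    hours_per_pass = observed_hours / observed_passes
    return int(window_hours // hours_per_pass)

# 128GB blade: 2 passes in 40 hours, so roughly 20 hours per pass
print(passes_in_window(40, 2, 72))
# 32GB blade: 3 passes in 40 hours
print(passes_in_window(40, 3, 72))
```

In other words, even the full 72-hour window VMware recommends only buys you about 3 complete passes on a 128GB blade, which is part of why I wonder whether the guideline still maps cleanly onto today’s memory sizes.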

Event ID: 20209, 20000, 20590, 20154 warnings logged in event viewer on a BES 5.x server

Problem

For those who have come across my previous post on an Exchange issue, after getting mail flow going and the Exchange server’s event logs cleared of warnings and errors, the next issue I had to tackle was the BES server.  Even though the storage administrator restored the BES server from a snapshot that was a month old and rejoined it to the domain, none of the users with BlackBerrys were able to send or receive mail messages.  Reviewing the application logs in the event viewer, I could see event ID 20209, 20000 and 20590 warnings consistently logged.

The warnings show the following messages:

Event ID: 20154

User someFirstName someLastName not started.

Event ID: 20000

[DEVICE_SRP:somePIN:0x003B4B90] Receive_UNKNOWN, VERSION=2, CMD=241

Event ID: 20000

[DEVICE_SESSION:somePIN:0x00B391F0] Timer Event. Exceeded service authentication timeout. No authenticated services. Releasing session.

Event ID: 20590

{someUser} BBR Authentication failed! Error=1

Event ID: 20000

[DEVICE_SESSION:somePIN:0x00AFA528] Service authentication verify failed. Error: 1 Service: S55070939

Event ID: 20209

{SomeFirstName someLastName} DecryptDecompress() failed, Tag=7976793, Error=604

The error above repeats for a few users until an informational event gets logged indicating:

Event ID: 50079

8 user(s) failed to initialize

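When you’re staring at hundreds of these, it helps to group them by event ID first.  A minimal sketch, assuming the log has been exported as (event ID, message) pairs; the entries below are based on the redacted examples above:

```python
from collections import defaultdict

def group_by_event_id(entries):
    """Group exported (event_id, message) log entries by event ID."""
    groups = defaultdict(list)
    for event_id, message in entries:
        groups[event_id].append(message)
    return dict(groups)

log = [
    (20154, "User someFirstName someLastName not started."),
    (20590, "{someUser} BBR Authentication failed! Error=1"),
    (20209, "{SomeFirstName someLastName} DecryptDecompress() failed, Tag=7976793, Error=604"),
    (20590, "{anotherUser} BBR Authentication failed! Error=1"),
]

groups = group_by_event_id(log)
print(sorted(groups))
```

The 20590 entries name the affected users directly, which conveniently gives you the list of devices to fix.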
Solution

So how did I fix this?  The solution was actually quite simple, although there was some manual labour involved.  What needed to be done was to log into the BAS (BlackBerry Administration Service) and choose Resend service books to device for each affected user.

Once that’s done, take the user’s BlackBerry and regenerate the encryption key.  With the new BlackBerry Torch OS available, the menus may differ, so I’ll include the menu paths for the two OS versions I had to work with:

Blackberry OS 5:

Options –> Security Options –> Information –> Desktop –> Regenerate Encryption Key.

Once you select the Regenerate Encryption Key option, navigate to the Enterprise Activation screen and you will see the synchronization process begin.

Blackberry Torch OS:

Options –> Security –> Security Status Information –> Desktop –> Regenerate Encryption Key.

Hope this helps anyone that may run into a similar problem.

Event ID: 9176 error logged on Exchange 2007 server

We had a client last week who suffered a SAN failure, which required them to restore SAN snapshots that were a month old.  I won’t go into the details of the situation, but I was brought in to bring their directory services back up and determine what course of action we could take.  To make a long story short, they had most of their infrastructure virtualized, with only 1 physical domain controller, named DC2.  The virtualized environment, which included DC1 and their Exchange server, was restored, but since DC1 was a snapshot, the event logs were littered with errors about USN rollback.  From there, I had the choice of either fixing the USN rollback to get DC1 operational or using DC2, which had a more recent directory services database, and fixing the rest of the infrastructure servers.  I personally did not want to lose all the changes they had made to their Active Directory and therefore opted to use DC2.

Problem

After I completed seizing the FSMO roles from DC1 and rejoined the Exchange server to the domain, I began to notice that the Exchange server continuously logged the following error:

NSPI Proxy can contact Global Catalog DC1.domain.local but it does not support the NSPI service. After a Domain Controller is promoted to a Global Catalog, the Global Catalog must be rebooted to support MAPI Clients. Reboot DC1.domain.local as soon as possible.

Solution

What threw me off was that DC1 was no longer on the network, and I had completely removed the domain controller’s metadata with NTDSUtil along with all of its records in DNS.  After reviewing the event logs a few more times, I remembered that I had come across an article a few years ago (don’t ask me how I remember) that had a fix for an Outlook Anywhere issue.  That article required registry changes on the Exchange server which referenced an NSPI Target Server.  From there, I did a search on Google and finally found the article here:

http://messagexchange.blogspot.com/2008/12/outlook-anywhere-failing-rpc-end-points.html

Not surprisingly, I reviewed the registry for those keys and found references to DC1 hardcoded in them.  What I ended up doing was updating the registry values to point to the appropriate DC, DC2, and the event errors went away.
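If you have to do the same fix, the substitution itself is mechanical.  A sketch of the idea in Python; the value name below is the NSPI Target Server reference mentioned above, but the exact keys and paths come from the linked article, so treat this as illustrative rather than a map of the registry:

```python
def retarget_dc_references(values, old_dc, new_dc):
    """Given exported registry value data (name -> data), replace any
    hardcoded reference to the old domain controller with the new one."""
    return {name: data.replace(old_dc, new_dc) for name, data in values.items()}

# Hypothetical export of the affected values
exported = {"NSPI Target Server": "DC1.domain.local"}
updated = retarget_dc_references(exported, "DC1.domain.local", "DC2.domain.local")
print(updated["NSPI Target Server"])
```

The actual edit still has to be made with regedit (or a script) on the Exchange server itself, and values that don’t reference the old DC pass through unchanged.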

Hope this helps anyone that may come across this problem.