Hello,
I'm not sure exactly where to post this, but we're having a strange issue with our Cisco UCS chassis with B200 M5 blades running ESXi 6.5 hosts for our Horizon environment.
We have two data centers running the same hardware configuration and cluster size to support our Horizon desktops (firmware and ESXi builds differ slightly, as listed below). On 10/30 we lost power to a Cisco UCS chassis at one of our data centers, and 8 of our blades went down with it. We saw no indication of resource spikes, panel or circuit problems, or anything similar. The weird part is that the following day we lost a chassis at the other data center (also with 8 M5 blades).
Cisco told us this was caused by our ESXi hosts over-consuming resources, which triggered a power supply failure. VMware analyzed the logs and determined that it was indeed a power supply issue, but not one caused by the hosts.
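To sanity-check the over-consumption theory, here is a rough back-of-the-envelope power budget in Python. The blade count, CPU TDP, and PSU ratings come from the hardware list below. The 150 W per-blade allowance for memory, adapters, and the board is purely an assumed placeholder, and CPU TDP is a thermal design figure rather than a measured electrical draw. Chassis fans and IO modules are not counted. The redundancy policy names follow the UCS power policy options, but which one our chassis uses is not stated here, so all three are shown. Treat this as a sketch, not a measurement.

```python
# Figures taken from the hardware list in this post.
BLADES_PER_CHASSIS = 8
CPUS_PER_BLADE = 2
CPU_TDP_W = 205        # Intel 8168 TDP (thermal figure, not measured draw)
PSU_COUNT = 4
PSU_RATED_W = 2500

# ASSUMPTION: placeholder allowance per blade for DIMMs, adapters, and board.
ASSUMED_OTHER_W_PER_BLADE = 150


def usable_capacity_w(policy: str) -> int:
    """Return PSU capacity that remains available under a redundancy policy."""
    active_psus = {
        "non-redundant": PSU_COUNT,   # all PSUs carry load
        "n+1": PSU_COUNT - 1,         # one PSU held in reserve
        "grid": PSU_COUNT // 2,       # half the PSUs on each power grid
    }[policy]
    return active_psus * PSU_RATED_W


def estimated_blade_draw_w() -> int:
    """Estimate worst-case blade draw from CPU TDP plus the assumed allowance."""
    per_blade = CPUS_PER_BLADE * CPU_TDP_W + ASSUMED_OTHER_W_PER_BLADE
    return BLADES_PER_CHASSIS * per_blade


if __name__ == "__main__":
    draw = estimated_blade_draw_w()
    print(f"Estimated blade draw: {draw} W")
    for policy in ("non-redundant", "n+1", "grid"):
        capacity = usable_capacity_w(policy)
        print(f"{policy:>13}: capacity {capacity} W, headroom {capacity - draw} W")
```

Under these assumptions the estimate stays below the usable capacity for every policy, although the margin under grid redundancy is small. Real numbers from the chassis power statistics would be needed to confirm or rule anything out.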
I know it's a shot in the dark, but I'm curious whether anyone else has run into this issue or has any advice for troubleshooting, as it has now become a finger-pointing game between vendors. I'm not sure whether it may be a bug in the UCS chassis or perhaps even in the ESXi hosts.
Appreciate any insight you may be able to give.
Thanks!
Datacenter 1
vCenter 6.5.0.22000 Build Number 9451637
ESXi 6.5.0 Update 2 (Build 9298722)
Horizon 7.4
Cisco UCS 5108 AC2 Chassis 3.2(3h)B
Power Supply: 4ea - 2500W PSU DV
IO Module: 2ea – 2304
Cisco UCS B200 M5 2 Socket Blade Server 3.2(3f)
Processor: 2ea - Intel 8168 2.7 GHz 205W 24C/33.00MB Cache/DDR4 2666MHz
Memory: 12ea- 64GB DDR4-2666-MHz RDIMM/PC4-23100/quad rank/x4/1.2v
32GB DDR4-2666-MHz RDIMM/PC4-23100/dual rank/x4/1.2v
Datacenter 2
vCenter 6.5.0.22000 Build Number 9451637
ESXi 6.5.0 Update 2 (Build 10390116)
Horizon 7.4
Cisco UCS 5108 AC2 Chassis 3.2(1b)
Power Supply: 4ea - 2500W PSU DV
IO Module: 2ea – 2304
Cisco UCS B200 M5 2 Socket Blade Server 3.2(3c)
Processor: 2ea - Intel 8168 2.7 GHz 205W 24C/33.00MB Cache/DDR4 2666MHz
Memory: 12ea- 64GB DDR4-2666-MHz RDIMM/PC4-23100/quad rank/x4/1.2v
32GB DDR4-2666-MHz RDIMM/PC4-23100/dual rank/x4/1.2v
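
Since both sites failed a day apart, it may help to see exactly where the two environments differ. The short Python sketch below just compares the version strings listed above; the dictionary keys are labels made up for illustration, and the values are copied from the two lists.

```python
# Version strings copied from the Datacenter 1 and Datacenter 2 lists above.
# The key names are illustrative labels, not fields from any tool.
datacenter_1 = {
    "vcenter_build": "9451637",
    "esxi_build": "9298722",
    "horizon": "7.4",
    "chassis_firmware": "3.2(3h)B",
    "blade_firmware": "3.2(3f)",
}
datacenter_2 = {
    "vcenter_build": "9451637",
    "esxi_build": "10390116",
    "horizon": "7.4",
    "chassis_firmware": "3.2(1b)",
    "blade_firmware": "3.2(3c)",
}


def diff_versions(a: dict, b: dict) -> dict:
    """Return {component: (value_in_a, value_in_b)} for every mismatch."""
    return {key: (a[key], b.get(key)) for key in a if a[key] != b.get(key)}


if __name__ == "__main__":
    for component, (v1, v2) in diff_versions(datacenter_1, datacenter_2).items():
        print(f"{component}: DC1={v1}  DC2={v2}")
```

The ESXi build, chassis firmware, and blade firmware differ between the sites while vCenter and Horizon match, so whatever is common to both failures is probably not tied to one specific firmware or ESXi build.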