Over the past couple of months we have had problems with our ESX server crashing with a PSOD (purple screen of death). Every time we looked at the dump file, all we could find was that the system had lost heartbeat and gone to a PSOD.
I came across this VMware KB article that talked about the vmksummary file. It showed that just before each PSOD the swap memory was exhausted. I started tracking this file and watching for high swap usage on our hosts. Finally, I came across a server with high swap usage and found the culprit.
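Watching for high swap usage can be scripted. The following is a minimal sketch, assuming the swap in question is the service console's Linux swap as reported in /proc/meminfo (the SwapTotal and SwapFree lines). The function name swap_used_pct is made up for illustration and is not part of any VMware or VMLogix tool.

```shell
#!/bin/sh
# swap_used_pct [FILE]: print the percentage of swap in use as an integer.
# Reads SwapTotal/SwapFree (in kB) from a meminfo-style file;
# FILE defaults to /proc/meminfo. Hypothetical helper, not a vendor tool.
swap_used_pct() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END {
             if (total > 0) printf "%d\n", (total - free) * 100 / total
             else print 0
         }' "${1:-/proc/meminfo}"
}
```

A cron job could call swap_used_pct and send a warning when the value crosses a threshold of your choosing, which leaves time to investigate before the swap is exhausted.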
VMLogix LabManager 3.7.0 and below uses a host agent on the ESX server. The agent is written in Python and normally works flawlessly except for this little problem. To identify the problem on the server, I looked at the memory usage.
# ps -eo comm,rss --sort rss
...
vpxa 38724
webAccess 66076
vmware-hostd 68092
python2.2 310101
python2.2 311201
This showed me that the top memory users were two python2.2 processes. I then took a look at ps to see what they were running.
# ps -ef | grep python
....
user 2701 1 0 Sep07 ? 00:00:00 python2.2 esx_service.pyc
user 2869 2701 0 Sep07 ? 00:00:00 python2.2 esx_service.pyc
I located esx_service.pyc inside the install directory for the VMLogix agent, so I tried to restart the agent via the web GUI. The service stopped and restarted, but these two processes still hung around. I stopped the VMLogix guest agent, then killed the remaining processes. Once I did this, my swap memory was freed. I then restarted the VMLogix agent.
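The check for leftover processes can be sketched as a small filter. This is only an illustration under a few assumptions: the function name leftover_agent_pids is made up, the input is expected to come from `ps -eo pid,comm,args`, and the script name esx_service.pyc is taken from the listing above. Matching on the command name in the second column keeps the filter from reporting its own awk process.

```shell
#!/bin/sh
# leftover_agent_pids [PATTERN]: read `ps -eo pid,comm,args` output on stdin
# and print the PID of every python process whose command line contains
# PATTERN (default: esx_service.pyc). Hypothetical helper for illustration.
leftover_agent_pids() {
    awk -v pat="${1:-esx_service.pyc}" \
        '$1 ~ /^[0-9]+$/ && $2 ~ /^python/ && index($0, pat) { print $1 }'
}

# Example: with the agent stopped, list anything that is still running.
#   ps -eo pid,comm,args | leftover_agent_pids
```

It only prints PIDs and kills nothing on its own. Run it after stopping the agent, review the list, and then pass the PIDs to kill by hand.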
Version: VMLogix 3.7.0
Version: VMware ESX 3.5 Update 4
This will not be a problem in VMLogix 3.8.1 and later when using vCenter mode.