VMFS Recovery on a Broken Array

[Originally posted August 2009]

Yesterday, one of my colleagues swapped in a replacement disk we received from IBM for an x3650.  It was part of a RAID 5 array on one of our development ESX boxes.  Unfortunately, ESXi 3.5 didn’t like that too much, and the whole volume was dropped.  After some effort, rebuilding the array, and other jiggery-pokery, we had no luck getting ESX to see the disks.  The development team was now out for half a day.

Today, we called one of our partners, who has a swag of VMware certified engineers, and they couldn’t offer much help beyond suggesting a few warm and cold restarts; they suspected the array was at fault.  We had the option of re-adding the array to ESX to see if it would work, but everything pointed to the LUN being wiped, which was not what we wanted.  We needed that data!  At this point I started to look for some options:

1. Use an Ubuntu Live CD (Jaunty) to try to see the volume – no joy; as I found out, VMFS is a proprietary file system.
2. Download Ubuntu Karmic Koala Alpha 3, which looked to have some VMFS tools available – started the download, then continued looking for options.
3. Came across what appears to be some good work by one individual who has created a live CD toolkit just for this purpose.  Unfortunately I had to wait for administrative approval before I could download it, so I kept this as a backup.
4. While Karmic was downloading, I came across Open VMFS and a comment in the VMware forums that provided some hope.
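For anyone heading down the vmfs-tools route (option two), the steps look roughly like this – a sketch only, with made-up device and mount-point names, assuming a Karmic-era vmfs-tools package:

```shell
# Sketch only: /dev/sdb and /mnt/vmfs are example names, not from this
# incident.  The vmfs-tools package provides vmfs-fuse, a read-only
# FUSE mount for VMFS3 volumes.

# fdisk marks VMFS partitions with type fb ("VMware VMFS"); this helper
# filters `fdisk -l` output down to those partition device names.
find_vmfs_parts() {
  grep -i 'VMware VMFS' | awk '{print $1}'
}

# Typical usage, run as root:
#   fdisk -l /dev/sdb | find_vmfs_parts   # identify the VMFS partition
#   apt-get install vmfs-tools
#   mkdir -p /mnt/vmfs
#   vmfs-fuse /dev/sdb1 /mnt/vmfs         # mount read-only
#   ls /mnt/vmfs                          # VM folders should be visible
```

From there the VM folders can be copied off with ordinary tools, since the mount is read-only and can't make the situation worse.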

So I proceeded to try option four on an old bit of gear with ESXi 4 installed.  After some fiddling I had booted Ubuntu Jaunty and tried to load the Open VMFS driver, but it did not recognize the VMFS partitions.  GParted confirmed this, showing ‘unknown’ partitions matching the size of the data store.  Not to be deterred, I tried again on the same kit with ESXi 3.5, and I could read my VMFS store!

With a 1TB USB disk connected, the VMs were extracted with one exception – a virtual disk for one of the servers was giving I/O errors and I couldn’t get it off.  I had to use the filecopy parameter for large files (>4GB), and successfully retrieved a 70GB virtual disk.
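For reference, the large-file copy above went through the Open VMFS command-line tool rather than a plain cp.  A hedged sketch of the command shape, assuming the tool was the Java-based Open VMFS driver invoked as a jar – the jar name, casing of fileCopy, volume, and file paths here are all my placeholders, not confirmed syntax:

```shell
# Sketch only: fvmfs.jar, the device, and the vmdk path are assumptions.
# The fileCopy command streams a file off the read-only VMFS volume and,
# per the account above, was what worked for files larger than 4GB.

# Build the invocation as a string so the command shape is visible
# (and testable) without the actual hardware attached.
fvmfs_filecopy_cmd() {
  local volume="$1" file="$2"
  echo "java -jar fvmfs.jar $volume fileCopy $file"
}

# Typical usage (placeholders throughout), run from the USB disk:
#   cd /media/usbdisk
#   eval "$(fvmfs_filecopy_cmd /dev/sdb1 myserver/myserver-flat.vmdk)"
```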

Our dev team was now out for 1.5 days, but with a public holiday tomorrow, everything should be fully operational by the next business day.