Monitoring your VMs using screenshots

(This is part 1 of 3 in a series of articles on how we at Unity started using “screenshots” and machine learning in our monitoring system)

Part 1: Getting screenshots from VMware machines (this post)
Part 2: Using image analysis to detect machines in a broken state (not published yet)
Part 3: Using Clarifai “machine learning” to further narrow down our results (not published yet)

In my new job as a “DevOps” engineer (I know, I know, there is no such thing as a DevOps engineer), I am working with Unity’s build farm. We build the Unity engine and a lot of artifacts thousands of times a day, so our build farm is relatively big and runs several operating systems: Windows, Mac and Linux.

We are continuously trying to improve our monitoring of the platform, both to detect failed machines and to gather information about what has gone wrong. We can then give feedback to the relevant teams and prevent the same issues from arising in the future.

We had a period with some storage-related issues, which caused Mac and Linux machines in particular to crash and hang. We had no trouble detecting that the machines had gone offline, but since we could not connect to them, we could not tell what “state” they were in. So in order to document what had happened, we had to look at the console of each machine.
So I started wondering whether there was an automated way to check for this, and it hit me that I had read somewhere that it is possible in VMware to take a “screenshot” of a running machine.

So I wrote a PowerShell script that takes a screenshot of every running VM in our build farm. Then at least I had some documentation of what “state” each machine was in.
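As a rough sketch of the approach (not the exact script), the example below assumes that VMware PowerCLI is installed and that Connect-VIServer has already been run. It also assumes that the server exposes a console image at /screen?id=<managed object id>, which may not hold for every vSphere version. The function names Get-VMScreenshotUri and Save-VMScreenshot are made up for this illustration.

```powershell
# Sketch only. Assumptions: PowerCLI is loaded, Connect-VIServer has been run,
# and the server answers https://<server>/screen?id=<moref> with a PNG image.

function Get-VMScreenshotUri {
    param(
        [Parameter(Mandatory)] [string] $Server,   # vCenter host name
        [Parameter(Mandatory)] [string] $MoRefId   # managed object id, e.g. "vm-1234"
    )
    # Escape the id so it is safe to use in a query string.
    $escaped = [uri]::EscapeDataString($MoRefId)
    "https://$Server/screen?id=$escaped"
}

function Save-VMScreenshot {
    param(
        [Parameter(Mandatory)] [string] $Server,
        [Parameter(Mandatory)] [pscredential] $Credential,
        [Parameter(Mandatory)] [string] $OutputDirectory
    )

    New-Item -ItemType Directory -Path $OutputDirectory -Force | Out-Null

    # One web session object, reused for every request below.
    $session = [Microsoft.PowerShell.Commands.WebRequestSession]::new()
    $session.Credentials = $Credential.GetNetworkCredential()

    # Only powered-on VMs have a console to capture.
    $vms = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }

    foreach ($vm in $vms) {
        $moRef = $vm.ExtensionData.MoRef.Value
        $uri   = Get-VMScreenshotUri -Server $Server -MoRefId $moRef
        # Name the file after the managed object id, which is unique and
        # contains no characters that are invalid in a file name.
        $file  = Join-Path $OutputDirectory "$moRef.png"

        Invoke-WebRequest -Uri $uri -WebSession $session -OutFile $file
    }
}
```

A call could look like `Save-VMScreenshot -Server 'vcenter.example.com' -Credential (Get-Credential) -OutputDirectory 'C:\screenshots'`, where the server name and the path are placeholders.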

The script creates a “session” that is reused for all the calls against vCenter, so we do not see hundreds of connections in vCenter.

In the next part of this series, I will cover some PowerShell functions I wrote to wrap ImageMagick and make some initial comparisons of the screenshots, sorting them into known good, known bad and “what the h**l is going on here” 😉