My research group has been trying to address this problem by using user interactions with a website to create this "ground truth". One assumption is that you have to look where you click. But there are more subtle user interactions that could be helpful.
Depends on which aspect you mean. E.g. a proper infrared tracker with high framerate, something like >120Hz, can be used to detect microsaccdes. I don't know much about webcam tech, but I doubt it can reach those framerates and at the same time low noise needed to distinguish those microsaccades. Then again, the research of the meaning of those and whether they're predictive or not isn't conclusive afaik so I wouldn't know any use for VR anyway.
Video and demos and source code: http://webgazer.cs.brown.edu/