I’m fine with the random part. What I don’t get is why two works just as well as four, or square root of n. It seems like three should do much, much better, and it doesn’t.
It’s one of those things I put in the “unreasonably effective” category.
It’s because going from one choice to two changes the expected worst-case load from asymptotically log n / log log n to asymptotically log log n. Further increases to d choices only divide that by log d, i.e. they just change a constant.
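If you want to see that gap empirically, here’s a quick balls-into-bins sketch (my own toy simulation, names are made up): throw n balls into n bins, where each ball samples d bins and lands in the least-loaded one.

```python
import random

def max_load(n, d, seed=0):
    """Throw n balls into n bins. Each ball samples d bins uniformly
    at random and lands in the least-loaded of them ("best-of-d").
    Returns the maximum bin load at the end."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        # d random candidate bins, d-1 comparisons to pick the emptiest
        candidates = [rng.randrange(n) for _ in range(d)]
        best = min(candidates, key=lambda i: bins[i])
        bins[best] += 1
    return max(bins)

for d in (1, 2, 3):
    print(f"best-of-{d}: max load = {max_load(100_000, d)}")
```

Running it, the drop from d=1 to d=2 is dramatic, while d=2 to d=3 typically shaves off at most one.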
In code the semantic difference between “select one at random” and “select two at random and perform one trivial comparison” is pretty small - about the same as the difference to “select three at random and perform two trivial comparisons”. That is, they are all just specific instances of the “best-of-x” algorithm: “select x at random and perform x-1 comparisons”. So it’s natural to wonder why going from “best-of-1” to “best-of-2” makes such a big difference, but going from “best-of-2” to “best-of-3” doesn’t.
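To make that concrete, the whole family is one tiny function parameterized by x (my own sketch; `best_of_x` and its arguments are made-up names):

```python
import random

def best_of_x(items, key, x, rng=random):
    """'Best-of-x': sample x items uniformly at random, keep the best.
    min() over x samples is exactly x-1 comparisons.
    x=1 is plain random selection; x=2 is power-of-two-choices."""
    return min(rng.sample(items, x), key=key)
```

Going from x=1 to x=2 changes almost nothing in this code - it just makes `min` do one comparison instead of zero - which is what makes the jump in behavior surprising.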
In complexity analysis, however, it is the presence or absence of comparisons that makes all the difference. “Best-of-1” does no comparisons, while “best-of-2”, “best-of-3”, etc., do. There’s a weaker “selections” class, and a more powerful “selections+comparisons” class. Doing more comparisons might move you around within the internal rankings of the “selections+comparisons” class, but the differences within the class are small compared to the differences between the classes.
An alternative, less rigorous intuition: behind door number 1 is a Lamborghini, behind door number 2 is a Toyota, and behind door number 3 is cancer. Upgrading to “best of 2” ensures you will never get cancer, while upgrading again to “best of 3” merely gets you a sweeter ride.