With my background including pharmaceutical study statistics (obliquely), I just did an interesting thought experiment. Theorising an equal distribution of black/red cards in a (maybe) well-shuffled sample-hat, we draw out two reds (frexample) as our first two cards. What is the likelihood that this was an aberration rather than representative[1]?
Interestingly, with a large potential sample, comparing to an assumption of 50:50 pure chance, it's 0.5 * ~0.5 (the latter being slightly less than 0.5, but quite close), amounting to ~25%. A sample of four cards (the baseline assumption being 2 of each) provides a chance of 0.5 * 0.3˙, or 16.6˙%. Compare this to the counter-extreme (not possible with four cards, but you can have "all but two are non-red" in sets of 5+), where the result being down to chance gets increasingly less plausible as the cohort size increases; and of course to the instinctive conclusion ("all are red"), where the smaller-population version is less wrong than the larger-population one.
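To make that arithmetic concrete, here's a minimal Python sketch (the deck sizes are purely illustrative):

    from fractions import Fraction

    def p_two_reds(n, reds):
        # chance that the first two cards drawn are both red,
        # drawing without replacement from n cards containing 'reds' reds
        return Fraction(reds, n) * Fraction(reds - 1, n - 1)

    print(p_two_reds(4, 2))                # 1/6, i.e. 16.6˙% for the four-card deck
    print(float(p_two_reds(10000, 5000)))  # ~0.24997 for a large 50:50 deck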
Of course, you need to colour these resulting proportions by the actual chance one might have plucked two reds out of an {entirely red | half-n-half | those-are-the-only-reds} set, in order to get this situation in the first place. (Respectively 100%; something like "(n/2)! * (n-2)! / (n! * (n/2 - 2)!)" (equivalently (n/2)(n/2 - 1) / (n(n - 1))) for n cards with n/2 reds; and "2/(n² - n)" for n cards with only 2 reds... if I've not messed up those two mental rearrangements at all...)
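Sanity-checking those three weightings (re-using p_two_reds from the sketch above; n = 10 is arbitrary):

    n = 10
    print(p_two_reds(n, n))       # entirely red: 1, i.e. 100%
    print(p_two_reds(n, n // 2))  # half-n-half: (n/2)!(n-2)!/(n!(n/2-2)!) = 2/9
    print(p_two_reds(n, 2))       # only two reds: 2/(n² - n) = 1/45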
Of course, those represent situations in which we know what we should be looking at and are just being picky as to the observations. Quite apart from any point where I've accidentally dropped a clanger, I'm just saying that you could try (albeit fail, against anyone knowledgeable enough to pick apart one's hypothesis) to prove anything... Lies, damn lies, etc.
But the closer the selected cohort is to the size of the population it is supposed to represent, the better (though not in a linear fashion!). I'm sure that's pretty much a truth, and orders of magnitude of difference probably equate to a corresponding gain/loss of assumable accuracy. There's probably some Poisson stuff to be done in there to decide whether doubling your sub-selection is worth any given amount of extra confidence or not, though. It's a long time since I've dabbled with statistical methodology, so I'd have to go from first principles again (better than I did above) in order to be confident about my exact confidences!
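For instance (a crude sketch rather than any proper Poisson treatment; n = 1000 is arbitrary), here's how the chance of a 50:50 deck yielding an all-red run shrinks as you double the sub-selection:

    from fractions import Fraction

    def p_all_red(n, k):
        # chance that the first k draws are ALL red from a 50:50 deck of n cards
        p = Fraction(1)
        for i in range(k):
            p *= Fraction(n // 2 - i, n - i)
        return p

    for k in (2, 4, 8, 16):  # doubling the sub-selection each time
        print(k, float(p_all_red(1000, k)))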
[1] This estimation being more important in 'proving' clinical studies, i.e. if you can show that the observations would be unlikely under non-correlation (beyond a standard limit), then actual correlation can be accepted as a supportable hypothesis.
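In those terms, the toy example is a (very loose, hypothetical) significance test: the 1/6 above is the chance of the observation under a "no imbalance" null, to be set against a conventional limit such as 0.05.

    # the four-card case as a toy significance test (re-using p_two_reds)
    p_value = float(p_two_reds(4, 2))  # ~0.167: chance of two reds if truly 50:50
    print(p_value < 0.05)              # False -- cannot reject the 50:50 hypothesis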