Two Wrongs

Inventing Fisher’s Exact Test

Inventing Fisher’s Exact Test

In a previous article, we needed a way to measure how well readers could tell apart the logistic distribution from the normal distribution, to have something to sort the scoreboard by. In coming up with such a metric, I accidentally re-invented Fisher’s exact test. Here’s how it happened.

Distilling the problem to its essence

The situation is that we have some pictures of random distributions, which we can label 1 to 18.

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18

The reader guesses which of these are logistic (L) or normal (N). Presumably, they will guess nine logistic and nine normal, since I revealed that half of the pictures would be from each.

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
L  N  L  N  N  N  N  L  L   L   L   N   N   N   L   L   L   N

The order in which these are displayed really does not matter at all. We can rearrange them so that the reader’s guesses for the logistic regression come first, for example.

1  3  8  9  10  11  15  16  17  |  2  4  5  6  7  12  13  14  18
L  L  L  L   L   L   L   L   L  |  N  N  N  N  N   N   N   N   N

At this point, we reveal what the true answers hidden behind the numbers are.

N  L  L  L   N   N   L   L   N  |  L  N  L  N  L   L   N   N   N
L  L  L  L   L   L   L   L   L  |  N  N  N  N  N   N   N   N   N

If we assume the reader has no idea which picture is from which distribution – i.e. we assume the null hypothesis – then the top row will contain 18 letters, 9 of which are L, but the locations of the L’s will be completely random. This is analogous to having a bag of 18 letters of which 9 are L, and drawing (without replacement) one at a time. It is also analogous to looking back at a sequence of coin flips where we know nine of them turned up heads.

Coin flips as a random walk

A series of coin flips can be represented as a random walk, were we go forward each toss, and also up on tosses where we get heads. In this case, we know that in 18 coin tosses we will get 9 heads. That corresponds to knowing that the random walk will end up on the coordinate (18, 9). It might look like this:

                ___o
               /
             _/
           _/
         _/
        /
     __/
    /
   /
o_/

Or this:

                   o
                  /
                 /
                /
             __/
          __/
         /
     ___/
   _/
o_/

Or this:

             ______o
            /
          _/
         /
        /
       /
    __/
   /
  /
o/

All of these paths contain 18 steps forward, 9 of which are also a step upward. The number of possible paths starting from (0,0) and ending up on (18, 9) are given by the binomial coefficient \(\binom{18}{9}\). Believe it or not, there are roughly 50,000 distinct ways to go between these two points. These correspond to all possible locations of the L’s in the sequence.

Passing through points

But we also know something else!

Since we ordered the sequence by the reader’s guesses, and the reader guessed five L’s correct, we know that we have seen five heads in the first half of the sequence of coin flips, or that the first half of the random walk contains five steps that also go upward.

This means the path that ends up on (18, 9) must also have passed through (9, 5). The first example path did that, here indicated by the tiny dot:

                 ___o
                /
              _/
            _/
         _./
        /
     __/
    /
   /
o_/

The other two did not pass through that point, instead going narrowly by either side of it.

                   o
                  /
                 /
                /
          .  __/
          __/
         /
     ___/
   _/
o_/

The second one going under, and the third one passing over.

             ______o
            /
          _/
         /
        / .
       /
    __/
   /
  /
o/

We are starting to get a sense for how the shape of the path is an indicator of the reader’s guessing performance. This is what the path would look like for a reader that got all their guesses correct:

          _________o
         /
        /
       /
      /   .
     /
    /
   /
  /
o/

It shoots straight up for the first nine steps, and then levels off. This is because the first half of the path contains an upward step for each correctly guessed letter L, and then the second half of the path contains an upward step for whatever letters L were not correctly guessed.

If the path passes over the (9, 5) point, that corresponds to someone getting more than 5 correct. If it passes through the (9, 5) point, it means getting exactly 5 correct, and if it passes under the (9, 5) point, that’s fewer than five correct.

How many paths are there that pass through the (9, 5) point and then go on to the (18, 9) point? The number of paths that do that is the number of paths that first go to (9, 5), multiplied by the number of paths that go from (9, 5) to (18, 9). In other words,

\[\binom{9}{5} \binom{18-9}{9-5}.\]

This is about 16,000. We can compare that to the total number of paths that end up at (18, 9), which was 50,000, and conclude that about

\[\frac{16\,000}{50\,000} \approx 30\; \%\]

of the paths that eventually end up at (18, 9) first go through (9, 5).

This is almost the answer we are looking for! Assuming the reader has no idea which distribution is which, there will be around a 30 % chance they get five correct purely by chance. The visual intuition for it is that 30 % of the random walks that eventually end up on (18, 9) will – by pure chance – pass through (9, 5).

Fisher’s exact test

In the first version of the previous article, I actually stopped there. I used that number as the quality metric of the guess, because I forgot a critical component: to become a proper hypothesis test, we need to check not just for the probability of getting exactly five correct, but getting five correct or better. In other words, what is the chance of this – or a more extreme – result, assuming the null hypothesis?

If we do that1 In practise, that means computing the number for 5 correct, as we did. Then doing the same for 6 correct, then 7, 8, and finally 9. The p-value of getting five or better under the null hypothesis will be the sum of all these probabilities., we have actually performed what’s known as Fisher’s exact test which is a real test really used for real when faced with these sorts of problems.

Inventing through intuition

The crazy thing for me is that I never learned to perform Fisher’s exact test. I didn’t even know it existed2 I’m sure I have read the name of it at some point, but I’ve grouped it together with all these other unintuitive, black-box procedures that lie around in the frequentist’s magical toolbox of statistics – things I don’t bother learning because I’d never understand them anyway. until a few hours ago, when I accidentally stumbled over the Wikipedia page for it and realised that it sounded an awful lot like what I had just done for the distribution high score table.

Despite my lack of frequentist sophistication, when faced with a problem that called for Fisher’s exact test, I managed to invent it3 Barring the “this good or better” mistake. from an intuition around random walks alone!

This is not to extol my brilliance, but rather a long letter of gratitude to the technology that let me build this intuition: spaced repetition. I have used spaced repetition seriously for technical reading for a year and a half now, and while I’m not always as disciplined at practising as I would like to be, I always catch up eventually. I have always read a lot, but the amount I retain is through the roof. If you do not practise spaced repetition, I recommend learning how to do it well and picking it up for yourself.