Two Wrongs

(don't make a right)

How Laziness Works

by ~kqr

So yesterday I saw someone ask about how many times the variable m gets evaluated in the following Haskell snippet:

maxOccurs :: [Int] -> (Int, Int)
maxOccurs a =
    (m, n)
  where
    m = maximum a
    n = length (filter (==m) a)

I knew just based on common sense that m only gets evaluated once (when it is first needed, no sooner, no later) but I realised I didn't know the exact mechanics behind that. I grabbed pencil and paper, I printed 88 pages of computer science writing, some various intermediary stages of code and sat down at my desk to get to the bottom of how it works.

A night without sleep later I emerged with some additional knowledge, though probably not nearly as much as I'd like. What I'm about to describe is completely useless to learn how to write Haskell, but if you're like me and like poking at things under the hood, by all means join in.

At least a superficial familiarity with the Haskell syntax is probably good to be able to follow along. The examples of functional code I write will use a Haskell-like syntax. Some knowledge of a low-level language (like assembly, Ada, C or Java) is useful for the last part on C--.

Intermediary Languages

First off, you should know that when you compile Haskell code with GHC, it does not get translated directly into machine code. In fact, there are as many as five intermediary languages involved. They are, in the following order.

  1. Haskell is the language your source code is written in. Haskell is a very high-level functional language, yada yada, if you read this you probably know all that already.

  2. The first translation GHC does is from Haskell to Core. You can view Core sort of like a minimal subset of Haskell. It lacks many of the convenient shortcuts Haskell has, but you'd still vaguely recognise it as Haskell.

    Any Haskell programmer worth their salt should, in my opinion, at least have a passing familiarity with Core and be able to read it in a pinch. Reading the Core GHC generates can be a great help to try to figure out which optimisations get triggered and where you have space leaks due to laziness.

  3. The Core code eventually gets translated into STG. STG is both a language and a mathematical model for how functional code can get executed by an imperative machine. Having a good model for imperative execution is really important, because after all, pretty much every computer you are using on a daily basis is built on an imperative architecture.

    You can think of STG as an extremely low-level Haskell, but don't be fooled – despite its appearance it is still most decidedly a functional language!

  4. At this point however, we leave the realm of functional languages. The STG code gets translated to C-- (pronounced "C minus minus", a pun on C++). C-- is a low-level imperative language, similar to C, but designed for compilers. C-- is not meant for human consumption.

  5. As the final step, the C-- code gets translated to assembly/machine code. This machine code is collected in object files, which are linked together to create a binary, which is the final executable file you can run and see your program in action.

(By the way, GHC applies optimisations at every step along the way here, because some optimisations are easier in some stages and harder in others.)

Even though I've just said Core is like the most important step in this pipeline to be familiar with, I'm not going to talk about it in this article. The reason is that it's fairly uninteresting in relation to the question. It's close enough to Haskell to not tell you anything about how things actually get evaluated in this case.

My target here is the C-- code, but to be able to understand that, we need to start at the STG level, so that's where we'll jump in.

Intermediary Code, GHC pls

I'm going to paste loose snippets of pseudo-language syntax in this article. The real code in these intermediary languages is very loaded with information that's irrelevant for answering the question, so I'm going to cut irrelevancies with extreme prejudice.

Again, please note that the examples I post look nothing like the real code generated by GHC. I've simplified a lot to make the relevant parts clear. If you wish to dig deeper, there are other texts that go through the basics better than I ever could.

If you want to follow along with real code, here's how you get GHC to print it out. Start by creating the very small Haskell source file

module MaxOccurs where

maxOccurs :: [Int] -> (Int, Int)
maxOccurs a =
    (m, n)
  where
    m = maximum a
    n = length (filter (==m) a)

and then compile it with

  1. ghc -fforce-recomp -O0 MaxOccurs.hs -ddump-simpl to get the Core code,
  2. ghc -fforce-recomp -O0 MaxOccurs.hs -ddump-stg to get the STG code,
  3. ghc -fforce-recomp -O0 MaxOccurs.hs -ddump-cmm to get the C-- code, and
  4. ghc -fforce-recomp -O0 MaxOccurs.hs -ddump-asm to get the assembly code,

-O0 ("dash-oh-zero") specifies that I don't want GHC to perform any significant optimisations. We could look at the code with optimisations on, but that would be less educational because even small changes in the code can result in completely different optimisations firing.

STG

Anyway, where were we? Ah, right, the STG code. I've taken the output GHC gave me, invented some syntax to make it look more like Haskell code, and cleaned up a bunch of things we don't need for understanding. What remains is this:

maxOccurs :: [Int] -> (Int, Int) = \r a_s18l ->
    let
        m_s18m   ::        Int   = \u         -> maximum a_s18l
        sat_s18o :: Int -> Bool  = \r ds_s18n -> (==) ds_s18n m_s18m
        sat_s18p ::        [Int] = \u         -> filter sat_s18o a_s18l
        sat_s18q ::        Int   = \u         -> length sat_s18p
    in
        (,) m_s18m sat_s18q

We see that maxOccurs is still a function [Int] -> (Int, Int), and it takes one argument, a_s18l. Eventually, that function is going to return the tuple (m_s18m, sat_s18q). The first part of that tuple, m_s18m, has to correspond to the m variable in the code in the question.

The second part of the tuple corresponds to the n in the question. This means that when the Haskell code says

n = length (filter (==m) a)

the STG says just

sat_s18q = length sat_s18p

In other words, the whole filter parenthesis has been stuffed away in the sat_s18p variable. This is an illustration of one of the interesting properties of STG: it doesn't have parentheses. In STG, you can't put an argument to a function in parentheses, you have to put it in a variable and then use that variable where you would put the parenthesised expression.

So if we used Haskell syntax, the three declarations related to the n variable would be

sat_s18o = \x -> x == m
sat_s18p = filter sat_s18o a
sat_s18q = length sat_s18p

We can sort of see this in the STG code, but the STG code also has some extra function weirdness. That's another property of STG: thunks are explicitly encoded.

Thunks?

As you know, Haskell is lazy. That means that every value is stored as a piece of unexecuted code. When that value is needed, the code is executed.

To begin with, I'll wave my hands and say that "thunk" is the name for one of those unexecuted pieces of code that will generate a value. We'll get a more firm understanding of what a thunk is shortly.

As an example of this thunk encoding, let's use the binding that corresponds to the m variable in the Haskell source. In the STG code, it looks like

m_s18m :: Int = \u -> maximum a_s18l

This looks almost like an anonymous function, and that's almost correct!

In fact, at this level, there is no practical difference between a thunk and a function. A thunk is simply a function that takes no arguments. Or, if you're bent that way, a function is simply a thunk that takes arguments.

It looks like the above thunk takes an argument named u, but that's not the case. In STG code, there are three kinds of thunks: \u-thunks, \r-thunks (both of which exist in the code for this module) and \s-thunks. So m_s18m is simply a thunk of the \u variety, that happens to take 0 arguments.

The \u is actually important for the original question. \u stands for updateable. That a thunk is updateable means that once it has been evaluated the first time (in this case, once maximum a_s18l has produced an Int), the thunk will be updated to return the Int directly instead of calculate it again.

I'll repeat that.

Somewhere in memory there exists a function called m_s18m that will, when accessed, compute maximum a_s18l. The first time that function returns, it will replace itself by the value it computed. The next time someone calls that function, they just see the value that got computed the first time.

So after the thunk has been evaluated once, the code to produce the value (i.e. the maximum a_s18l line) will be gone. Poof. It's been replaced by the result of that computation.

This is the central idea for how laziness works.

(By the way, the name "updateable" is really a misnomer. All thunks are capable of "updating" themselves. That we only perform this updating on some thunks is a low-hanging performance optimisation. For some thunks, performing the updating operation would simply result in replacing the thunk with the value it already has. We don't mark these as updateable to avoid the expensive no-op.)

Now what?

There are two ways to go from here. There is a formal set of rules for how STG code should be executed. That set is very small, so you could look at your STG code, follow the rules diligently, and you'd be able to prove to yourself that I'm not lying about the updateable thunks.

Or... you could look at the C-- code that GHC generates from the STG code.

I don't know about you, but that sounds a lot more fun to me.

C--

The C-- code unfortunately references standard library functions quite a bit, which means we can't easily see everything that goes on in the program. In a way, that might also be fortunate, because Haskell is a very high level language, and if you want to see everything that goes into running the program, you'll be staring at a lot of code.

What I'll do is try to share the most relevant bits of the code that we do get to see, though. GHC is great at dumping a lot of C--, but there are three things that interest me the most. We'll start with five lines in the MaxOccurs.maxOccurs_entry() procedure.

_s18m::P64 = R2;
Hp = Hp + 80;
I64[Hp - 72] = m_s18n_info;
P64[Hp - 56] = _s18m::P64;
_c18w::P64 = Hp - 72;

Line by line:

  1. _s18m::P64 = R2 means "set _s18m (of type "64-bit pointer") to refer to the thunk pointed to by special register #2". The contents of that register is a reference to the thunk that contains a, the argument to maxOccurs.

  2. Make space on the heap.

  3. Store the address of the code for the thunk that was called m_s18m in the STG code.

  4. Next to that code address, store the address to the thunk we talked about in step 1. Basically, the code in step 3 needs to have access to the a variable, so we store a reference to its thunk next to the reference to its code.

    In step 3 and 4, we have actually packaged together the code and data required to produce a value. This is what a thunk essentially is. It's a pointer to some code followed by pointers to the data that code uses. (Function arguments are passed separately.)

  5. Set _c18w to refer to the thunk pointed to by Hp - 72. This is the thunk we just set up in steps 2–4.

After this, we can see that _c18w is used in two places:

I64[Hp - 48] = sat_s18r_info;
P64[Hp - 32] = _s18m::P64;
P64[Hp - 24] = _c18w::P64;     // here

//...

I64[Hp - 16] = (,)_con_info;
P64[Hp - 8] = _c18w::P64;      // and here
P64[Hp - 0] = _c18D::P64;

These two occurrences correspond to the two places in your Haskell code where m is used! You might have picked up on the pattern already, but in case you haven't: this is the set-up of two new thunks.

The top three lines are how the thunk that represents n is constructed, and the bottom three lines creates the thunk that represents the return value of (m, n). (You can spot the latter quite easily, because the thunk points to code referenced by (,)_con which presumably stands for "(,) constructor".)

Something that might surprise if you're not used to lazy evaluation is the way your maxOccurs function ends. It ends not with a bang, but with a whimper:

call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

This means, "return the tuple thunk". That's it. Nothing has been computed. All the function does is set up a bunch of thunks and then return the outermost of them. We've just painstakingly set up an elaborate contraption, and we have no guarantee that it will ever be sprung. If the caller doesn't do something with the tuple, none of the code we set up will get executed.

However, if the calling code uses the tuple such that it forces either side of it, eventually, the code for the m thunk will be called. That is, this code, except the real code has more error handling and meta data:

m_s18n_entry() { //  [R1]
    _s18n::P64 = R1;
    I64[Sp - 16] = stg_upd_frame_info;
    P64[Sp - 8] = _s18n::P64;
    _s18m::P64 = P64[_s18n::P64 + 16];
    R2 = Data.Foldable.$fFoldable[]_closure;
    I64[Sp - 40] = stg_ap_pp_info;
    P64[Sp - 32] = GHC.Classes.$fOrdInt_closure;
    P64[Sp - 24] = _s18m::P64;
    Sp = Sp - 40;
    call Data.Foldable.maximum_info(R2) args: 48, res: 0, upd: 24;
}

This procedure sets up two further procedure calls, of which one is exciting and the other is the regular maximum call.

One thing should be said about how STG is translated to imperative code: an STG function never really returns back to its caller. The caller decides where the function should return when it's done, by setting up the "return stack".

Here's how the return stack gets set up in our m thunk:

I64[Sp - 16] = stg_upd_frame_info;
P64[Sp - 8] = _s18n::P64;

This means that when maximum has computed an Int, it will not return back to the m thunk – it will "return" into the stg_upd_frame procedure, which is also informed of the address of the current thunk (_s18n). The stg_upd_frame procedure has the responsibility of taking the computed value and putting it into the specified address – effectively overwriting the thunk code that's already there. This ensures the value will never be computed again, but instead fetched directly from where the thunk used to be.

Final Words

Well... that's that. The C-- language that's currently in use in GHC seems very different from the C-- language the report specifies, which means I've been guessing at a bunch of the things I've talked about in this comment. If someone knows of an up-to-date documentation for C--, please send me an email.

I also should say as a disclaimer that I haven't finished reading the STG paper yet, but I'm probably doing that later this week!

If you enjoyed this article, you might like others tagged with