At Cloudflare we’re heavy users of LuaJIT and in the past have sponsored many improvements to its performance.

LuaJIT is a powerful piece of software, maybe the highest performing JIT in the industry. But it’s not always easy to get the most out of it, and sometimes a small change in one part of your code can negatively impact other, already optimized, parts.

One of the first pieces of advice anyone receives when writing Lua code to run quickly using LuaJIT is “avoid the NYIs”: the language or library features that can’t be compiled because they’re NYI (not yet implemented). And that means they run in the interpreter.

CC BY-SA 2.0 image by Dwayne Bent

Another very attractive feature of LuaJIT is the FFI library, which allows Lua code to directly interface with C code and memory structures. The JIT compiler weaves these memory operations in line with the generated machine language, making it much more efficient than using the traditional Lua C API.

Unfortunately, if for any reason the Lua code using the FFI library has to run under the interpreter, it takes a very heavy performance hit. As it happens, under the interpreter the FFI is usually much slower than the Lua C API or the basic operations. For many people, this means either avoiding the FFI or committing to a permanent vigilance to maintain the code from falling back to the interpreter.

Optimizing LuaJIT Code

Before optimizing any code, it’s important to identify which parts are actually important. It’s useless to discuss what’s the fastest way to add a few numbers before sending some data, if the send operation will take a million times longer than that addition. Likewise, there’s no benefit avoiding NYI features in code like initialization routines that might run only a few times, as it’s unlikely that the JIT would even try to optimize them, so they would always run in the interpreter. Which, by the way, is also very fast; even faster than the first version of LuaJIT itself.

But optimizing the core parts of a Lua program, like any deep inner loops, can yield huge improvements in the overall performance. In similar situations, experienced developers using other languages are used to inspecting the assembly language generated by the compiler, to see if there’s some change to the source code that can make the result better.

The command line LuaJIT executable provides a bytecode list when running with the -jbc option, a statistical profiler, activated with the -jp option, a trace list with -jv, and finally a detailed dump of all the JIT operations with -jdump.

The last two provide lots of information very useful for understanding what actually happens with the Lua code while executing, but it can be a lot of work to read the huge lists generated by -jdump. Also, some messages are hard to understand without a fairly complete understanding of how the tracing compiler in LuaJIT actually works.

One very nice feature is that all these JIT options are implemented in Lua. To accomplish this the JIT provides ‘hooks’ that can execute a Lua function at important moments with the relevant information. Sometimes the best way to understand what some -jdump output actually means is to read the code that generated that specific part of the output.

CC BY 2.0 image by Kevan

Introducing Loom

After several rounds there, and being frustrated by the limitations of the sequentially-generated dump, I decided to write a different version of -jdump, one that gathered more information to process and add cross-references to help see how things are related before displaying. The result is loom, which shows roughly the same information as -jdump, but with more resolved references and formatted in HTML with tables, columns, links and colors. It has helped me a lot to understand my own code and the workings of LuaJIT itself.

For example, let’s consider the following code in a file called twoloops.lua:

for i=1,1000 do
    for j=1,1000 do
    end
end

With the -jv option:

$ luajit -jv twoloops.lua
[TRACE   1 twoloops.lua:2 loop]
[TRACE   2 (1/3) twoloops.lua:1 -> 1]

This tells us that there were two traces, the first one contains a loop, and the second one spawns from exit #3 of the other (the “(1/3)” part) and it’s endpoint returns to the start of trace #1.

Ok, let’s get more detail with -jdump:

$ luajit -jdump twoloops.lua
---- TRACE 1 start twoloops.lua:2
0009  FORL     4 => 0009
---- TRACE 1 IR
0001    int SLOAD  #5    CI
0002  + int ADD    0001  +1  
0003 >  int LE     0002  +1000
0004 ------ LOOP ------------
0005  + int ADD    0002  +1  
0006 >  int LE     0005  +1000
0007    int PHI    0002  0005
---- TRACE 1 mcode 47
0bcbffd1  mov dword [0x40db1410], 0x1
0bcbffdc  cvttsd2si ebp, [rdx+0x20]
0bcbffe1  add ebp, +0x01
0bcbffe4  cmp ebp, 0x3e8
0bcbffea  jg 0x0bcb0014    ->1
->LOOP:
0bcbfff0  add ebp, +0x01
0bcbfff3  cmp ebp, 0x3e8
0bcbfff9  jle 0x0bcbfff0    ->LOOP
0bcbfffb  jmp 0x0bcb001c    ->3
---- TRACE 1 stop -> loop

---- TRACE 2 start 1/3 twoloops.lua:1
0010  FORL     0 => 0005
0005  KSHORT   4   1
0006  KSHORT   5 1000
0007  KSHORT   6   1
0008  JFORI    4 => 0010
---- TRACE 2 IR
0001    num SLOAD  #1    I
0002    num ADD    0001  +1  
0003 >  num LE     0002  +1000
---- TRACE 2 mcode 81
0bcbff79  mov dword [0x40db1410], 0x2
0bcbff84  movsd xmm6, [0x41704068]
0bcbff8d  movsd xmm5, [0x41704078]
0bcbff96  movsd xmm7, [rdx]
0bcbff9a  addsd xmm7, xmm6
0bcbff9e  ucomisd xmm5, xmm7
0bcbffa2  jb 0x0bcb0014    ->1
0bcbffa8  movsd [rdx+0x38], xmm6
0bcbffad  movsd [rdx+0x30], xmm6
0bcbffb2  movsd [rdx+0x28], xmm5
0bcbffb7  movsd [rdx+0x20], xmm6
0bcbffbc  movsd [rdx+0x18], xmm7
0bcbffc1  movsd [rdx], xmm7
0bcbffc5  jmp 0x0bcbffd1
---- TRACE 2 stop -> 1

This tells us… well, a lot of things. If you look closely, you’ll see the same two traces, one is a loop, the second starts at 1/3 and returns to trace #1. Each one shows some bytecode instructions, an IR listing, and the final mcode. There are several options to turn on and off each listing, and more info like the registers allocated to some IR instructions, the “snapshot” structures that allow the interpreter to continue when a compiled trace exits, etc.

Now using loom:

There’s the source code, with the corresponding bytecodes, and the same two traces, with IR and mcode listings. The bytecode lines on the traces and on the top listings are linked, hovering on some arguments on the IR listing highlights the source and use of each value, the jumps between traces are correctly labeled (and colored), finally, clicking on the bytecode or IR column headers reveals more information: excerpts from the source code and snapshot formats, respectively.

Writing it was a great learning experience, I had to read the dump script’s Lua sources and went much deeper in the LuaJIT sources than ever before. And then, I was able to use loom not only to analyze and optimize Cloudflare’s Lua code, but also to watch the steps the compiler goes through to make it run fast, and also what happens when it’s not happy.

The code is the code is the code is the code

LuaJIT handles up to four different representation of a program’s code:

First comes the source code, what the developer writes.

The parser analyzes the source code and produces the Bytecode, which is what the interpreter actually executes. It has the same flow of the source code, grouped in functions, with all the calls, iterators, operations, etc. Of course, there’s no nice formatting, comments, the local variable names are replaced by indices, and all constants (other than small numbers) are stored in a separate area.

When the interpreter finds that a given point of the bytecode has been repeated several times, it’s considered a “hot” part of the code, and interprets it once again but this time it records each bytecode it encounters, generating a “code trace” or just “a trace”. At the same time, it generates an “intermediate representation”, or IR, of the code as it’s executed. The IR doesn’t represent the whole of the function or code portion, just the actual options it actually takes.

A trace is finished when it hits a loop or a recursion, returns to a lower level than when started, hits a NYI operation, or simply becomes too long. At this point, it can be either compiled into machine language, or aborted if it has reached some code that can’t be correctly translated. If successful, the bytecode is patched with an entry to the machine code, or “mcode”. If aborted, the initial trace point is “penalized” or even “blacklisted” to avoid wasting time trying to compile it again.

What’s next()?

One of the most visible characteristics of the Lua language is the heavy use of dictionary objects called tables. From the Lua manual:

“Tables are the sole data structuring mechanism in Lua; they can be used to represent ordinary arrays, symbol tables, sets, records, graphs, trees, etc.”

To iterate over all the elements in a table, the idiomatic way is to use the standard library function pairs() like this:

for k, v in pairs(t) do
    -- use the key in ‘k' and the value in ‘v'
end

In the standard Lua manual, pairs() is defined as “Returns three values: the next function, the table t, and nil”, so the previous code is the same as:

for k, v in next, t, nil do
    -- use the key in ‘k' and the value in ‘v'
end

But unfortunately, both the next() and pairs() functions are listed as “not compiled” in the feared NYI list. That means that any such code runs on the interpreter and is not compiled, unless the code inside is complex enough, and has other inner loops (loops that doesn’t use next() or pairs(), of course). Even in that case, the code would have to fall back to the interpreter at each loop end.

This sad news creates a tradeoff: for performance sensitive parts of the code, don’t use the most Lua-like code style. That motivates people to come up with several contortions to be able to use numerical iteration (which is compiled, and very efficient), like replacing any key with a number, storing all the keys in a numbered array, or store both keys and values at even/odd numeric indices.

Getting next() out of the NYI list

So, I finally have a non-NYI next() function! I’d like to say “a fully JITtable next() function”, but it wouldn’t be totally true; as it happens, there’s no way to avoid some annoying trace exits on table iteration.

The purpose of the IR is to provide a representation of the execution path so it can be quickly optimized to generate the final mcode. For that, the IR traces are linear and type-specific; creating some interesting challenges for iteration on a generic container.

Traces are linear

Being linear means that each trace captures a single execution path, it can’t contain conditional code or internal jumps. The only conditional branches are the “guards” that make sure that the code to be executed is the appropriate one. If a condition changes and it must now do something different, the trace must be exited. If it happens several times, it will spawn a side trace and the exit will be patched into a conditional branch. Very nice, but this still means that there can be at most one loop on each trace.

The implementation of next() has to internally skip over empty slots in the table to only return valid key/value pairs. If we try to express this in IR code, this would be the “inner” loop and the original loop would be an “outer” one, which doesn’t have as much optimization opportunities. In particular, it can’t hoist invariable code out of the loop.

The solution is to do that slot skipping in C. Not using the Lua C API, of course, but the inner IR CALL instruction that is compiled into a “fast” call, using CPU registers for arguments as much as possible.

The IR is in Type-specific SSA form

The SSA form (Static Single Assignment) is key for many data flow analysis heuristics that allow quick optimizations like dead code removal, allocation sinking, type narrowing, strength reduction, etc. In LuaJIT’s IR it means every instruction is usable as a value for subsequent instructions and has a declared type, fixed at the moment when the trace recorder emits this particular IR instruction. In addition, every instruction can be a type guard, if the arguments are not of the expected type the trace will be exited.

Lua is dynamically typed, every value is tagged with type information so the bytecode interpreter can apply the correct operations on it. This allows us to have variables and tables that can contain and pass around any kind of object without changing the source code. Of course, this requires the interpreter to be coded very “defensively”, to consider all valid ramifications of every instruction, limiting the possibility of optimizations. The IR traces, on the other hand, are optimized for a single variation of the code, and deal with only the value types that are actually observed while executing.

For example, this simple code creates a 1,000 element array and then copies to another table:

local t,t2 = {},{}
for i=1,1000 do
    t[i] = i
end
for i,v in ipairs(t) do
    t2[i]=v
end

resulting in this IR for the second loop, the one that does the copy:

0023 ------------ LOOP ------------
0024          num CONV   0017  num.int
0025       >  int ABC    0005  0017
0026          p32 AREF   0007  0017
0027          num ASTORE 0026  0022
0028 rbp    + int ADD    0017  +1
0029       >  int ABC    0018  0028
0030          p32 AREF   0020  0028
0031 xmm7  >+ num ALOAD  0030
0032 xmm7     num PHI    0022  0031
0033 rbp      int PHI    0017  0028
0034 rbx      nil RENAME 0017  #3
0035 xmm6     nil RENAME 0022  #2

Here we see the ALOAD in instruction 0031 assures that the value loaded from the table is in effect a number. If it happens to be any other value, the guard fails and the trace is exited.

But if we do an array of strings instead of numbers?

a small change:

local t,t2 = {},{}
for i=1,1000 do
    t[i] = 's'..i
end
for i,v in ipairs(t) do
    t2[i]=v
end

gives us this:

0024 ------------ LOOP ------------
0025          num CONV   0018  num.int
0026       >  int ABC    0005  0018
0027          p32 AREF   0007  0018
0028          str ASTORE 0027  0023
0029 rbp    + int ADD    0018  +1
0030       >  int ABC    0019  0029
0031          p32 AREF   0021  0029
0032 rbx   >+ str ALOAD  0031
0033 rbx      str PHI    0023  0032
0034 rbp      int PHI    0018  0029
0035 r15      nil RENAME 0018  #3
0036 r14      nil RENAME 0023  #2

It’s the same code, but the type that ALOAD is guarding is now a string (and it now uses a different register, I guess a vector register isn’t appropriate for a string pointer).

And if the table has a values of a mix of types?

local t,t2={},{}
for i=1,1000,2 do
    t[i], t[i+1] = i, 's'..i
end
for i,v in ipairs(t)
    do t2[i]=v
end

0031 ------------ LOOP ------------
0032          num CONV   0027  num.int
0033       >  int ABC    0005  0027
0034          p32 AREF   0007  0027
0035          str ASTORE 0034  0030
0036 r15      int ADD    0027  +1
0037       >  int ABC    0019  0036
0038          p32 AREF   0021  0036
0039 xmm7  >  num ALOAD  0038
0040       >  int ABC    0005  0036
0041          p32 AREF   0007  0036
0042          num ASTORE 0041  0039
0043 rbp    + int ADD    0027  +2
0044       >  int ABC    0019  0043
0045          p32 AREF   0021  0043
0046 rbx   >+ str ALOAD  0045
0047 rbx      str PHI    0030  0046
0048 rbp      int PHI    0027  0043

Now there are two ALOADs, (and two ASTOREs), one for ‘num’ and one for ‘str’. In other words, the JIT unrolled the loop and found that that made the types constant. =8-O

Of course, this would happen only on very simple and regular patterns. In general, it’s wiser to avoid unpredictable type mixing; but polymorphic code will be optimized for each type that it’s actually used with.

Back to next()

First let’s see the current implementation of next() as used by the interpreter:

lj_tab.c
/* Advance to the next step in a table traversal. */
int lj_tab_next(lua_State *L, GCtab *t, TValue *key)
{
  uint32_t i = keyindex(L, t, key);  /* Find predecessor key index. */
  for (i++; i < t->asize; i++)  /* First traverse the array keys. */
    if (!tvisnil(arrayslot(t, i))) {
      setintV(key, i);
      copyTV(L, key+1, arrayslot(t, i));
      return 1;
    }
  for (i -= t->asize; i <= t->hmask; i++) {  /* Then traverse the hash keys. */
    Node *n = &noderef(t->node)[i];
    if (!tvisnil(&n->val)) {
      copyTV(L, key, &n->key);
      copyTV(L, key+1, &n->val);
      return 1;
    }
  }
  return 0;  /* End of traversal. */
}

It takes the input key as a TValue pointer and calls keyindex(). This helper function searches for the key in the table and returns an index; if the key is an integer in the range of the array part, the index is the key itself. If not, it performs a hash query and returns the index of the Node, offset by the array size, if successful, or signals an error if not found (it’s an error to give a nonexistent key to next()).

Back at lj_tab_next(), the index is first incremented, and if it’s still within the array, it’s iterated over any hole until a non-nil value is found. If it wasn’t in the array (or there’s no next value there), it performs a similar “skip the nils” on the Node table.

The new lj_record_next() function in lj_record.c, like some other record functions there, first checks not only the input parameters, but also the return values to generate the most appropriate code for this specific iteration, assuming that it will likely be optimal for subsequent iterations. Of course, any such assumption must be backed by the appropriate guard.

For next(), we choose between two different forms, if the return key is in the array part, then it uses lj_tab_nexta(), which takes the input key as an integer and returns the next key, also as an integer, in the rax register. We don’t do the equivalent to the keyindex() function, just check (with a guard) that the key is within the bounds of the array:

lj_tab.c
/* Get the next array index */
MSize LJ_FASTCALL lj_tab_nexta(GCtab *t, MSize k)
{
  for (k++; k < t->asize; k++)
    if (!tvisnil(arrayslot(t, k)))
      break;
  return k;
}

The IR code looks like this:

0014 r13      int FLOAD  0011  tab.asize
0015 rsi   >  int CONV   0012  int.num
0017 rax    + int CALLL  lj_tab_nexta  (0011 0015)
0018       >  int ABC    0014  0017
0019 r12      p32 FLOAD  0011  tab.array
0020          p32 AREF   0019  0017
0021 [8]   >+ num ALOAD  0020

Clearly, the CALL itself (at 0017) is typed as ‘int‘, as natural for an array key; and the ALOAD (0021) is ‘num‘, because that’s what the first few values happened to be.

When we finish with the array part, the bounds check (instruction ABC on 0018) would fail and soon new IR would be generated. This time we use the lj_tab_nexth() function.

lj_tab.c
LJ_FUNCA const Node *LJ_FASTCALL lj_tab_nexth(lua_State *L, GCtab *t, const Node *n)
{
  const Node *nodeend = noderef(t->node)+t->hmask;
  for (n++; n <= nodeend; n++) {
    if (!tvisnil(&n->val)) {
      return n;
    }
  }
  return &G(L)->nilnode;
}

But before doing the “skip the nils”, we need to do a hash query to find the initial Node entry. Fortunately, the HREF IR instruction does that: This is the IR:

0014 rdx      p32 HREF   0011  0012
0016 r12      p32 CALLL  lj_tab_nexth  (0011 0014)
0017 rax   >+ str HKLOAD 0016
0018 [8]   >+ num HLOAD  0016

There’s a funny thing here: HREF is supposed to return a reference to a value in the hash table, and the last argument in lj_tab_nexth() is a Node pointer. Let’s see the Node definition:

lj_obj.h
/* Hash node. */
typedef struct Node {
  TValue val;       /* Value object. Must be first field. */
  TValue key;       /* Key object. */
  MRef next;        /* Hash chain. */
#if !LJ_GC64
  MRef freetop;     /* Top of free elements (stored in t->node[0]). */
#endif
} Node;

Ok… the value is the first field, and it says right there “Must be first field”. Looks like it’s not the first place with some hand-wavy pointer casts.

The return value of lj_tab_next() is a Node pointer, which can likewise be implicitly cast by HLOAD to get the value. To get the key, I added the HKLOAD instruction. Both are guarding for the expected types of the value and key, respectively.

Let’s take it for a spin

So, how does it perform? These tests do a thousand loops over a 10,000 element table, first using next() and then pairs(), with a simple addition in the inner loop. To get pairs() compiled, I just disabled the ISNEXT/ITERN optimization, so it actually uses next(). In the third test the variable in the addition is initialized to 0ULL instead of just 0, triggering the use of FFI.

First test is with all 10,000 elements on sequential integers, making the table a valid sequence, so ipairs() (which is already compiled) can be used just as well:

So, compiled next() is quite a lot faster, but the pairs() optimization in the interpreter is very fast. On the other hand, the smallest smell of FFI completely trashes interpreter performance, while making compiled code slightly tighter. Finally, ipairs() is faster, but a big part of it is because it stops on the first nil, while next() has to skip over every nil at the end of the array, which by default can be up to twice as big as the sequence itself.

Now with 5,000 (sequential) integer keys and 5,000 string keys. Of course, we can’t use ipairs() here:

Roughly the same pattern: the compiled next() performance is very much the same on the three forms (used directly, under pairs() and with FFI code), while the interpreter benefits from the pairs() optimization and almost dies with FFI. In this case, the interpreted pairs() actually surpasses the compiled next() performance, hinting that separately optimizing pairs() is still desirable.

A big factor in the interpreter pairs() is that it doesn’t use next(); instead it directly drives the loop with a hidden variable to iterate in the Node table without having to perform a hash lookup on every step.

Repeating that in a compiled pairs() would be equally beneficial; but has to be done carefully to maintain compatibility with the interpreter. On any trace exit the interpreter would kick in and must be able to seamlessly continue iterating. For that, the rest of the system has to be aware of that hidden variable.

The best part of this is that we have lots of very challenging, yet deeply rewarding, work ahead of us! Come work for us on making LuaJIT faster and more.

Read more here:: CloudFlare

LuaJIT Hacking: Getting next() out of the NYI list