Saturday, February 14, 2015

IonPower now entering low Earth orbit

Happy Valentine's Day. I'm spending mine in front of the computer with a temporary filling on my back top left molar. But I'm happy, because IonPower, the new PowerPC JIT I plan to release with TenFourFox 38, is now to the point where it can compile and execute simple scripts in Baseline mode (i.e., phase 3, where 1 was authoring, 2 was compiling, 3 is basic operations, 4 is pass tests in Baseline, 5 is basic operations in full Ion and 6 is pass tests in Ion). For example, tonight's test was debugging and fixing this fun script:

function ok(){print("ok");}function ok3(){ok(ok(ok()));}var i=0;for(i=0;i<12;i++){ok3();}

That's an awful lot of affirmation right there. Once I'm confident I can scale it up a bit more, then we'll try to get the test suite to pass.

What does IonPower bring to TenFourFox, besides of course full Ion JIT support for the first time? Well, the next main thing it does is to pay back our substantial accrued technical debt. PPCBC, the current Baseline-only implementation, is still at its core using the same code generator and assumptions about macroassembler structure from JaegerMonkey and, ultimately, TenFourFox 10.x; I got PPCBC working by gluing the new instruction generator to the old one and writing just enough of the new macroassembler to enable it to build and generate code.

JaegerMonkey was much friendlier to us because our implementation was always generating (from the browser's view) ABI-compliant code; JavaScript frames existed on a separate interpreter stack, so we could be assured that the C stack was sane. Both major PowerPC ABIs (PowerOpen, which OS X and AIX use, and SysV, which Linux, *BSD, et al. use) have very specific requirements for how stack frames are formatted, but by TenFourFox 17 I had all the edge cases worked out and we knew that the stack would always be in a compliant state at any crossing from generated code back to the browser or OS. Ben Stuhl had done some hard work on branch and call stanza optimization and this mostly "just worked" because we had full control to that level.

Ion (and Baseline, its simpleton sibling) destroyed all that. Now, script frames are built on the C stack instead, and Ion has its own stack frame format which it always expects to be on top, meaning when you enter Ion generated code you can't directly return to ABI-compliant code without some sort of thunk and vice versa. This frame has a descriptor and a return address, which is tricky on PowerPC because we don't store return addresses on the stack like x86, the link register we do have (the register storing where a subroutine call should return to) is not a general purpose register and has specialized means for controlling it -- part of the general class of PowerPC SPRs, or special purpose registers -- and the program counter (the internal register storing the location of the current instruction) is not directly accessible in any fashion. Compare this with MIPS, where the link register is a GPR ($ra), and ARM, where both the LR and PC are. Those CPUs can just sling them to and from the stack at will.

That return address is problematic in another respect. We use constructs called branch stanzas in the PowerPC port, because the PPC branch instructions have a limited signed displacement of (at most, for b) 26 bits, and branches may exceed that with the size of scripts these days; these stanzas are padded with nop instructions so that we have something to patch with long calls (usually lis ori mtctr bctr) later. In JagerMonkey (and PPCBC), if we were making a subroutine call and the call was "short," the bl instruction that performed it was at the top of the branch stanza with some trailing nops; JaegerMonkey didn't care how we managed the return as long as we did. Because we had multiple branch variants for G3/G4 and G5, the LR which bl set could be any number of instructions after the branch within the stanza, but it didn't matter because the code would continue regardless. To Ion, though, it matters: it expects that return address on the stack to always correlate to where a translated JSOP (JavaScript interpreter opcode) begins -- no falling between the cracks.

IonPower deals with the problem in two ways. First, we now use a unified stanza for all architectures, so it always looks the same and is predictable, and short branches are now at the end of the stanza so that we always return to the next JSOP. Second, we don't actually use the link register for calling generated code in most cases. In both PowerPC ABIs, because LR is only set when it gets to the callee (the routine being called) the callee is expected to save it in its function prologue, but we don't have a prologue for Baseline code. PPCBC gets around this by using the link register within Baseline code only; for calls out to other generated code it would clobber the link register with a bl .+4 to get the PC, compute an offset and store that as a return address. However, that offset was sometimes wrong (see above), requiring hacks in the Ion core to do secondary verification. Instead, since we know the address of the instruction for the trailing JSOP at link time, we just generate code to push a 32-bit literal and we patch those push instructions when we link to memory to the right location. This executes faster, too, because we don't have all that mucking about with SPR operations. We only use the link register for toggled calls (because we control the code it calls, so we can capture LR there) and for bailout tables, and of course for the thunk code that calls ABI-compliant library routines.

This means that a whole class of bugs, like branching, repatching and stack frame glitches, just disappear in IonPower, whereas with PPCBC there seemed to be no good way to get it to work with the much more sophisticated Ion JIT and its expectations on branching and patching. For that reason, I simply concentrated on optimizing PPCBC as much as possible with high-performance PowerPC-specific parallel type guards and arithmetic routines in Baseline inline caches, and that's what you're using now.

IonPower's other big step forward is improved debugging abilities, and chief amongst them is no compiler warnings -- at all. Yep. No warnings, at least not from our headers, so that subtle bugs that the compiler might warn about can now be detected. We also have a better code generation display that's easier to visually parse, and annoying pitfalls in PowerPC instructions where our typical temporary register r0 does something completely different than other registers now assert at code generation time, making it easier to find these bugs during codegen instead of labouriously going instruction by instruction through the debugger at runtime.

Incidentally, this brought up an interesting discussion -- how does, say, MIPS deal with the return address problem, since it has an LR but no GPR PC? The MIPS Ion backend, for lack of a better word, cheats. MIPS has a single instruction branch delay slot, i.e., the instruction following any branch is (typically) executed before the branch actually takes place. If this sounds screwy, you're right; ARM and PowerPC don't have them, but some other RISC designs like MIPS, SuperH and SPARC still do as a historical holdover (in older RISC implementations, the delay slot was present to allow the processor enough time to fetch the branch target, and it remains for compatibility). With that in mind, and the knowledge that $ra is the MIPS link register, what do you think this does?

addiu $sp,$sp,-8 ; subtract 8 from the current stack pointer,
                 ; reserving two words on the stack
jalr $a0         ; jump to address in register $a0 and set $ra
sw $ra, 0($sp)   ; (DELAY SLOT) store $ra to the top of the stack

What's the value of $ra in the delay slot? This isn't well-documented apparently, but it's actually already set to the correct return address before the branch actually takes place, so MIPS can save the right address to the stack right here. (Thanks to Kyle Isom, who confirmed this on his own MIPS system.) It must be relatively standard between implementations, but I'd never seen a trick like that before. I guess there has to be at least some advantage to still having branch delay slots in this day and age.

One final note: even though necessarily I had to write some of it for purposes of compilation, please note that there will be no asm.js for PowerPC. Almost all the existing asm.js code out there assumes little endian byte ordering, meaning it will either not run or worse run incorrectly on our big endian machines. It's just not worth the headache, so they will run in the usual JIT.

On the stable branch side, 31.5 will be going to build next week. Dan DeVoto commented that TenFourFox does not colour manage untagged images (unless you force it to). I'd argue this is correct behaviour, but I can see where it would be unexpected, and while I don't mind changing the default I really need to do more testing on it. This might be the standard setting for 31.6, however. Other than the usual ESR fixes there are no major changes in 31.5 because I'm spending most of my time (that my Master's degree isn't soaking up) on IonPower -- I'm still hoping we can get it in for 38 because I really do believe it will be a major leap forward for our vintage Power Macs.

No comments:

Post a Comment

Due to an increased frequency of spam, comments are now subject to moderation.