The goal is to enable JIT codegen without sacrificing too much performance or adding too much maintenance burden, and a functional JIT implementation needs a few more components beyond that---most notably a facility to monitor and trace function calls for the eventual JIT compilation. Consider the OP to be one of the intermediate goals, not the eventual goal.
I don't think we disagree that the long-term goal is to _eventually_ make it faster :) I rather meant to temper the enthusiasm that some could have upon seeing "JIT" and immediately trying to compare with, say, PyPy.
> enable JIT codegen without sacrificing too much performance
This is the part I don't buy. The main point of a JIT is performance, so by definition I don't see it being enabled unless it improves performance across the board.
What I wonder is whether the current approach, stated as "copy-and-patch auto-generated code for each opcode", can ever reach that point without being replaced by a completely different design along the way. AFAIK, as is, the main difference between running the interpreter loop composed of normally compiled opcodes and JIT copy-and-patching those same opcodes is the absence of the opcode dispatch logic running between each op---which is good, but also countered by the slightly worse quality of the copied code.
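To make the dispatch-elimination point concrete, here's a toy sketch (all names here---`interp`, `HANDLERS`, `jitted_1_plus_2`---are made up for illustration, not CPython internals): a classic interpreter loop pays a table lookup and a loop-back branch on every opcode, while a copy-and-patch JIT conceptually stitches the same handler bodies back to back so that overhead disappears.

```python
def op_push(stack, arg):
    stack.append(arg)

def op_add(stack, _):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

HANDLERS = {"PUSH": op_push, "ADD": op_add}

def interp(code, stack):
    """Classic dispatch loop: every opcode pays an indexed lookup
    and a branch before its handler body even starts."""
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        HANDLERS[op](stack, arg)  # dispatch overhead on every single opcode
        pc += 1
    return stack[-1]

def jitted_1_plus_2(stack):
    """What copy-and-patch conceptually emits: the same handler bodies
    laid out back to back, arguments patched in, no loop and no lookup."""
    op_push(stack, 1)
    op_push(stack, 2)
    op_add(stack, None)
    return stack[-1]

print(interp([("PUSH", 1), ("PUSH", 2), ("ADD", None)], []))  # 3
print(jitted_1_plus_2([]))                                    # 3
```

Both paths run the exact same handler code, which is why the win is limited to removing the dispatch, not improving the handlers themselves.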
> What I wonder is if the current approach, stated as "copy-and-patch auto-generated code for each opcode", can ever reach that point without being replaced by a completely different design along the way.
Of course this approach produces worse code than a full compiler by definition---stencils are too rigid to be further optimized. A stencil conceptually maps to a single opcode, so the only way to break out of this restriction is to add more opcodes, and there are only so many opcodes and stencils you can prepare. But I think you are thinking too much about the possibility of making Python as fast as, say, C for at least some cases. I believe that won't happen at all, and the current approach clearly shows why.
Let's consider a simple CPython opcode named `BINARY_ADD`, which has a stack effect of `(a b -- sum)`. Ideally it should eventually compile down to fully specialized machine code, something like `add rax, r12` plus some guards. But the actual implementation (`PyNumber_Add` [1]) is far more complex: it may make up to three "slot" calls that add or concatenate the arguments, some of which may call back into Python code.
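A loose Python rendering of that slot resolution (heavily simplified; the real `PyNumber_Add` C code also handles subclass priority between the two slots and falls back to sequence concatenation via `sq_concat`, none of which is shown here):

```python
def binary_add(a, b):
    """Simplified sketch of the slot dance behind a single `a + b`."""
    # First try the left operand's __add__ slot, if it has one.
    result = NotImplemented
    if hasattr(type(a), "__add__"):
        result = type(a).__add__(a, b)  # may itself call back into Python
    # On failure, try the right operand's reflected __radd__ slot
    # (skipped when both operands share a type, as in CPython).
    if result is NotImplemented and type(a) is not type(b):
        if hasattr(type(b), "__radd__"):
            result = type(b).__radd__(b, a)
    if result is NotImplemented:
        raise TypeError(f"unsupported operand types: {type(a)} and {type(b)}")
    return result

print(binary_add(1, 2))      # 3
print(binary_add("a", "b"))  # 'ab'
print(binary_add(1, 2.5))    # 3.5, via float.__radd__
```

Every one of those branches is something a specializing JIT has to either guard against or prove away before it can emit a bare `add`.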
So let's assume that we have done type specialization and both arguments are known to be integers. That results in a single slot call to `PyLong_Add` [2], which is still complex because CPython has two integer representations. Even when both operands are "compact", i.e. at most 31/63 bits long, it may still have to switch to the other representation when the resulting sum is no longer compact. So fully specialized machine code would only be possible when both arguments are known to be integers, compact, and have one spare bit to rule out an overflow. That is way more restrictive.
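The guard that a fully specialized add would need can be sketched like this (assuming a 63-bit compact representation; the exact width is an internal CPython detail, and `can_use_fast_add` is a made-up name for illustration):

```python
COMPACT_BITS = 63  # assumed width of a "compact" int on 64-bit builds

def can_use_fast_add(a: int, b: int) -> bool:
    """True only when both operands leave one spare bit below the compact
    limit, so a + b provably fits and the JIT could emit a plain machine
    add with no overflow check or representation switch."""
    limit = 1 << (COMPACT_BITS - 1)
    return -limit <= a < limit and -limit <= b < limit

print(can_use_fast_add(2**61, 2**61))  # True: sum still fits
print(can_use_fast_add(2**62, 0))      # False: too close to the limit
```

Everything outside this guard has to fall back to the general `PyLong_Add` path, which is exactly the restrictiveness being described.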
An uncomfortable truth is that all of this applies almost perfectly to JavaScript as well---the slot resolution would be the `[[ToNumber]]` internal function, and the multiple representations would be something like V8's Smi. Modern JS engines do exploit most of these opportunities, but at the expense of extremely large codebases with tons of potential attack surface. They are really expensive to maintain, and people don't realize that no performant JS engine was ever developed by a small group of developers. You have to cut some corners.
In comparison, CPython's approach is essentially inside out. Any JIT implementation requires you to split all those subtasks into small bits that can be either optimized out or baked into the generated machine code. So what if we start with the subtasks without thinking about a JIT in the first place? This is what the specializing adaptive interpreter [3] did. The current CPython already has two tiers of interpreters, and micro-opcodes can only appear in the second tier. With them we can split larger opcodes into smaller ones, possibly with optimizations, but their performance is limited by the dispatch logic. The copy-and-patch JIT is not as powerful as a full compiler, but it does eliminate the dispatch logic without large design changes, which makes it a good choice for this purpose.
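You can actually watch the specializing adaptive interpreter work from pure Python (requires CPython 3.11+; the specialized opcode names like `BINARY_OP_ADD_INT` are internal details that vary by version, so this only inspects the output loosely):

```python
import dis
import io
import sys

def add(a, b):
    return a + b

# Warm the function: after enough calls the adaptive interpreter may
# rewrite the generic BINARY_OP into a specialized variant in place.
for _ in range(1000):
    add(1, 2)

buf = io.StringIO()
if sys.version_info >= (3, 11):
    # adaptive=True shows the quickened (specialized) bytecode, if any.
    dis.dis(add, adaptive=True, file=buf)
else:
    dis.dis(add, file=buf)  # older versions have no adaptive tier
print(buf.getvalue())
```

On 3.11+ the add instruction may show up as a specialized variant after warming; the copy-and-patch JIT then stitches exactly these micro-level instructions together without the dispatch in between.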
In the best scenario, it will eventually hit the limit of what's possible with copy-and-patch and a full compiler will be required at that point. But until that point (which may never come as well), this approach allows for a long time of incremental improvements without disruption.
I think there was some misunderstanding; you're arguing different points than the ones I made.
> Of course this approach produces a worse code than a full compiler by definition---stencils would be too rigid to be further optimized.
Yeah, but that's not what I meant by "worse code". I just meant that, even knowing this is a naive copy-and-patch JIT, my first impression was that the code was slightly worse than I expected. I don't expect the compiler to do any magic on a small code slice; I only claimed that there's "room to improve" in the currently generated code, though I may be totally wrong about whether that's achievable by "just convincing clang" without manually messing with the asm.
> But I think you are thinking too much about a possibility to make Python as fast as, say, C for at least some cases.
I never said this about CPython, quite the opposite.
> I believe that it won't happen at all
(FWIW, if we're talking long-term and about Python in general, it already did happen: PyPy, and modern JS runtimes, are good examples of this being possible in principle. But being able to make a language orders of magnitude faster (with some major asterisks) doesn't mean I expect the same from the CPython implementation.)
As for your example with integer adding, I totally agree with all you said, and that's exactly what I meant by "there’s only so much one can do without touching the data model".
> In the best scenario, it will eventually hit the limit of what's possible with copy-and-patch and a full compiler will be required at that point. But until that point (which may never come as well), this approach allows for a long time of incremental improvements without disruption.
That's why in my initial message I said I wonder about the expected peak improvement. I won't be surprised if it (together with the theorized uop optimizations) barely exceeds single-digit percent perf gains, which would of course still be totally worth it. And if it's more, well, even better :) And in the worst case - which I hope won't happen - the point you mentioned is today, and copy-and-patch would never be worth enabling by itself.
> I just meant that even being aware this is a naive copy-and-patch JIT, my first impression was that the code was slightly worse than I expected.
> "there’s only so much one can do without touching the data model"
You probably want to look at the other link in that PR, which demonstrated how well copy-and-patch can do for another dynamic language (Lua): [1]
Of course, whether or not CPython could eventually make it to that point (or even further) is a different story: they are under way tighter constraints than a project built for academia. But copy-and-patch can do a lot even for dynamic languages :)
> That's why in my initial message I said I wonder about the expected peak improvement. I won't be surprised if it (together with the theorized uop optimizations) barely exceeds single-digit percent perf gains, which would of course still be totally worth it. And if it's more, well, even better :) And in the worst case - which I hope won't happen - the point you mentioned is today, and copy-and-patch would never be worth enabling by itself.
Ah, so you meant that even all of them combined, including the specializing interpreter and the copy-and-patch JIT, may not give a reasonable speedup. But I think you have missed the fact that the specializing interpreter already landed in 3.11 and provided a 10--60% speedup. So specialization really works, and the copy-and-patch JIT should allow finer-grained uops, which can have an enormous impact on performance.
On the other hand, it is possible that the copy-and-patch JIT itself turns out to be useless even after all the work. In that case there is no other known viable way to enable a JIT without disruption, so a JIT shouldn't be added to CPython. I should have stressed this point more, but "incremental" improvements are really important---this was a primary reason that CPython didn't even try to implement JIT compilation for decades. CPython could give them up, but then there would be one less reason to use (C)Python, so CPython never did. (The GIL is the same story, by the way: the current nogil effort is only possible with other performance improvements that outweigh the potential overhead in the single-threaded setting.)
> As for your example with integer adding, I totally agree with all you said, and that's exactly what I meant by "there’s only so much one can do without touching the data model".
If the data model refers to the publicly visible portion of the interface, I don't think so. Even JS runtimes didn't require any change to their public interfaces, and CPython itself already caches lots of the data model for the sake of performance. I'm not aware of attempts at shape-style optimizations, but it might be possible to extend the current `__slots__` implementation to allow an adaptive memory layout.
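`__slots__` already gives CPython instances a fixed attribute layout, which is exactly the kind of invariant that shape-style optimizations rely on. A minimal illustration of the existing mechanism (nothing here is new CPython machinery):

```python
class Point:
    # Attributes live at fixed offsets in the instance; there is no
    # per-instance __dict__ to search at attribute access time.
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(p.x, p.y)                      # 1 2
print(hasattr(p, "__dict__"))        # False: no dict lookup needed

# The layout is closed: attribute names outside __slots__ are rejected,
# which is the invariant a JIT could exploit to compile attribute access
# into a fixed-offset load.
try:
    p.z = 3
except AttributeError as e:
    print("rejected:", e)
```

Extending this toward adaptive layouts (as the comment speculates) would be new work; the sketch only shows the fixed-layout guarantee CPython has today.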
> Ah, so you meant that even all of them including specializing interpreter and copy-and-patch JIT may not give a reasonable speedup. But I think you have missed the fact that specializing interpreter has already landed on 3.11 and provided 10--60% speedup
No, I'm talking compared to the current default production state---exactly what Brandt said in his talk at around 23:30, and what I observed when building his branch.
Then I'm not sure why that would refute the intermediate goal to "enable JIT codegen without sacrificing too much performance" as stated in my initial comment, since the proposed copy-and-patch JIT compiler can't make that impact by itself.
> The goal is to enable JIT codegen without sacrificing too much performance and adding too much maintenance burden, and a functional JIT implementation needs a few more components other than that---most notably a facility to monitor and trace function calls for the eventual JIT compilation. Consider the OP to be one of intermediate goals, not the eventual goal.
It seems like the copy-and-patch approach sits somewhere in between an interpreter and a traditional JIT, and the authors of the original copy-and-patch paper seem to be trying to use it to replace things like the baseline compiler in the two-tier baseline/optimizing compiler strategy used for things like WebAssembly.
Because of this, is it really necessary to add tracing and use a two-tier interpreter/copy-and-patch JIT approach for this Python JIT? Wouldn't it make more sense to try to get it fast enough that the JIT can be used alone?
See my other comment for details, but in short, this strategy uses a single code base for both the interpreter and the JIT, so any further performance improvement benefits both without additional work. The traditional JIT-only approach is costly to maintain in comparison.