The PHP internals community is quite a poisonous place, IMO. It's a community that destroyed itself by being hostile to everyone who wasn't one of them.


I think that's a function of their inability to let go of their own work. The internals of PHP need a _complete_ overhaul to make PHP competitive on a speed/memory basis with the Pythons and Rubys of the world, but that would mean throwing away a lot of work by the current contributors and probably re-architecting things like zvals and the internal opcode representation. That's something I don't think they are willing to do, and I don't know if there is enough community around the Zend core to really get it done even if they wanted to.


I wouldn't know if that's the reason, but I totally agree that the internals need a total overhaul. The entire codebase was a complete mess - hacks built on hacks built on hacks. Each variable takes what - 96 bytes, I think, on 64-bit machines? The opcodes are terrible. The interpreter dispatch is switch-based, and funnily enough that doesn't even matter, because the rest of the engine (function dispatch in particular) is so slow.

Disclaimer: I don't think I've looked at the PHP source in 3 years; it might have changed significantly, but I suspect it has not. I'd love to be corrected with some detail!


What's wrong with switch-based dispatch? Lua's dispatch is switch-based, and Lua is the fastest mainstream bytecode-based VM out there, last I checked.


Lua's dispatch wasn't chosen because it was the fastest. If you read the Lua 5 paper, they discuss how they prioritized portable ANSI C over speed. Indirect threading is the fastest simple technique, but you can do even better. See section 3.3 of http://www.cs.tcd.ie/publications/tech-reports/reports.07/TC... for a discussion.


If you're giving up portability, you might as well go all the way and create a JIT. Techniques like what the paper calls "inline-threaded dispatch" seem of limited usefulness, since they give neither the speed of a JIT nor the portability of a bytecode VM.

My main point was that switch()-based dispatch isn't that bad. I'm surprised that this recent literature distinguishes between switch dispatch and token-threaded dispatch, since as Mike Pall notes, "Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point," so the goto-spaghetti of token-threaded dispatch is likely not even worth it. (http://article.gmane.org/gmane.comp.lang.lua.general/75426)

> Indirect threading is the fastest simple technique

I'm confused; the paper itself says: "However, because of a level of indirection, indirect-threaded dispatch is not be as efficient as direct-threaded dispatch."

Even direct-threaded dispatch still results in an indirect branch for every VM instruction, it just tries to save a table lookup over token-threading.

Ultimately the most common and practical dispatch techniques seem to boil down to either a single indirect branch or replicated indirect branches.


You're not really "giving up" portability, only ANSI C compliance. On every architecture where GCC is available, its labels-as-values support will happily compile, and hence give you a fully portable instruction dispatch.

Regarding the threaded code: the literature seems to be horribly inconsistent in this regard. Token threaded code might actually refer to indirect-threaded code depending on what the author has read. I was actually quite surprised to read what "indirect threaded code" originally meant when I read Debaere and van Campenhout's 1990 book "Interpretation and Instruction Path co-processing".

Regarding the effect of CSE & tail-merging: I have only seen GCC do this once, after disabling basic block reordering (-fno-reorder-blocks, due to a so-called "software trace cache"); otherwise this has never happened to me. Since I have nothing but the highest respect for Mike Pall, I guess he might have experienced this with Lua, which has--to the best of my knowledge--fewer than 40 instructions. I will try to verify this on the weekend.


> You're not really "giving up" portability

I should have been more clear; that comment was in reference to some of the techniques the paper discusses which do give up portability, namely "inline-threaded dispatch" and "context-threaded dispatch."

My overall points are: switch() isn't that bad, and most practical dispatch techniques ultimately boil down to either a single indirect branch or replicated indirect branches.

> Regarding the effect of CSE & tail-merging: I have only seen GCC to do this once

I've never seen it happen with my own eyes, but it appears the Python guys experienced this also at one point and had to tweak the flags -fno-crossjumping and -fno-gcse. http://bugs.python.org/issue4753

> I will try to verify this on the weekend.

Do you have a blog? I'd love to hear what you discover.


> Do you have a blog? I'd love to hear what you discover.

I second this. I'd be very interested to know who you (sb) are in real life.


I'm not sure where you're going with this, but you seem to have an emotional attachment to switch statements, which is otherwise unsupported by reality :)

> If you're giving up portability, you might as well go all the way and create a JIT.

Well, no. A JIT is a massive, massive amount of work; an interpreter is not. Don't throw the baby out with the bathwater. Besides, you can make a damn portable token-threaded dispatch for supported platforms (which means anything using GCC, LLVM or Visual Studio, which means basically anything that exists today), falling back to switch statements for unsupported platforms.

> Techniques like what the paper calls "inline-threaded dispatch" seem of limited usefulness, since they give neither the speed of a JIT nor the portability of a bytecode VM.

I forget the numbers, but let's say it's 20%. It gives a 20% speed boost for a few dozen lines of code. Hardly "limited usefulness".

> My main point was that switch()-based dispatch isn't that bad.

It's portable and easy to implement. It is not the fastest dispatch mechanism. It will generally be the bottleneck in an otherwise-efficient interpreter. It does not compare well to most other dispatch techniques, and is only useful if you demand that absolutely only ANSI C will do.

> I'm surprised that this recent literature distinguishes between switch dispatch and token-threaded dispatch, since as Mike Pall notes, "Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point," so the goto-spaghetti of token-threaded dispatch is likely not even worth it. (http://article.gmane.org/gmane.comp.lang.lua.general/75426)

Firstly, the work on interpreters was done in the 80s and 90s. Really good optimizing compilers post-date that considerably.

Secondly, Pall does not say those dispatching techniques are equivalent. What he says is that interpreters in C are difficult, and you have to fight the optimizer. If you wrote them in assembly, as Pall did, you would not find that the techniques were equivalent.

> Ultimately the most common and practical dispatch techniques seem to boil down to either a single indirect branch or replicated indirect branches.

It's certainly common to use a single indirect branch. However, it is suboptimal. People like speed.

Finally, note that my original point was that switch-based dispatch is slow, but that better techniques in PHP are not worth it, because the rest of the interpreter is so much slower.


Wow, you read a way more argumentative tone into my comment than I intended. It's all good, I'm trying to expand my understanding. :) Though I'm not sure I agree with your characterization of my points as "nonsense."

> [Inline-threaded dispatch] gives a 20% speed boost, for a few dozen lines of code. Hardly "limited usefulness".

Unless I'm missing something, I think you would be very hard-pressed to generate basic blocks of native code in only a few dozen lines of platform-dependent code. I'm not seeing how you could do it without implementing your own code generator (which would take far more than a dozen lines). Is there some clever way of leveraging the C compiler's output that I'm missing?

Also, once you are generating basic blocks of native code, I would consider it a JIT at that point, even if it is not as sophisticated as a JIT that's doing register allocation and optimization. Writing this sort of JIT that uses a static register allocation and a fixed instruction sequence for each bytecode op doesn't seem that much more work than writing an interpreter in assembly language (especially if you use Mike Pall's excellent framework DynASM: http://luajit.org/dynasm.html).

> Its certainly common to use a single indirect branch. However, it is suboptimal.

I don't believe that a single indirect branch is a priori slower; while replicated dispatch makes better use of the branch predictor, it also results in less optimal i-cache usage (the LuaJIT code notes this trade-off, but observes that replicated dispatch is 20-30% faster in practice: http://repo.or.cz/w/luajit-2.0.git/blob/2ad9834df6fe47c7150d...).

I'm not saying switch is better, what I'm saying is that you can't reliably replicate dispatch in C (at least according to Mike's message that I cited earlier), which makes switch() a totally reasonable choice for interpreters written in C.

By the way, I'm not just making stuff up, I have written a JIT of my own: https://github.com/haberman/upb/blob/master/upb/pb/decoder_x...


> Wow, you read a way more argumentative tone into my comment than I intended. It's all good, I'm trying to expand my understanding. :) Though I'm not sure I agree with your characterization of my points as "nonsense."

Indeed, that was much too harsh and I retracted it. I thought I might have retracted it before anyone noticed, but I guess not. My bad, sorry about that.

> Unless I'm missing something, I think you would be very hard-pressed to generate basic blocks of native code in only a few dozen lines of platform-dependent code.

Whoops, I thought we were still talking about token-threading. I believe people implement token threading by fiddling with the compiler output. John Aycock had a paper about doing this for Perl or Python as I recall (also, since you're into JITs, you might enjoy his "a brief history of Just-in-time").

> Also, once you are generating basic blocks of native code, I would consider it a JIT at that point,

Yes, people often consider this to be a JIT. However, it's much, much easier to write than a "real" JIT. The complexity of writing a method JIT or trace JIT is very, very high. This could probably be done in an afternoon by a great hacker.

Don't underestimate the work of writing the assembly language interpreter like Mike Pall did. Most of the incredibly hardcore compiler guys with whom I have discussed it were simply blown away by it. It might be one of the most hardcore feats in VM implementation history. He manually register-allocated across dynamic control-flow boundaries, for god's sake!!

> reasonable choice for interpreters written in C

But not a fast one.


> also, since you're into JITs, you might enjoy his "a brief history of Just-in-time"

Indeed, I've bookmarked it, thanks! I couldn't find the other paper you mentioned about fiddling with compiler output; would be very interested in seeing that.

LuaJIT2 was the first interpreter and JIT that I ever studied deeply, so while I am very impressed by it, I don't have a great understanding of how it differs from previous work. How is its interpreter substantially different or more impressive than previous interpreters written in assembly language? Don't all assembly language interpreters need a fixed register allocation?

I've been a big fan of Mike's work for a long time and helped him get some Google funding: http://google-opensource.blogspot.com/2010/01/love-for-luaji...

By the way, when I was searching for the John Aycock paper I came across your dissertation which I've also bookmarked. I see that you went to Trinity College Dublin -- I visited Dublin a few years ago and what a beautiful university that is! Looks like there's a lot of interesting JIT-related research coming from there lately.


> How is its interpreter substantially different or more impressive than previous interpreters written in assembly language?

Traditionally, there were two implementation strategies: compilers and interpreters. JITs didn't become mainstream until Sun bought HotSpot and put it into the second version of Java.

So you had to choose between compilers and interpreters. Compilers led to optimization, fast object code, but were complex to implement (especially multiplatform). Interpreters were simple to implement, very portable, but were very very slow.

Obviously there is more to both, but until recently, basically this is how people thought about language implementation. Considerations like a REPL, slow compilation speed, ease of prototyping, etc, were a complete side show (perhaps people cared, but you'd rarely see it discussed).

When all of the dynamic languages were implemented, they all used interpreters, written in C. They all used a simple dispatch loop over the opcode implementations, and let the C compiler do the heavy lifting. All the research into fast interpreters (the David Gregg/Anton Ertl stuff, for example) looked at instruction set (register/stack) and dispatch type. So when making interpreters fast, there were only 4 strategies:

- make a compiler,

- use better dispatch,

- rewrite using a register instruction set,

- make a JIT.

Making a JIT is lunacy of course, because JITs are ridiculously hard, and they're not portable. So the fact that Pall was making a 1-man JIT (LuaJIT1) was incredible.

But that he made an interpreter that was as fast as a JIT was even more insane. In Trinity, all of us language/JIT/scripting language people were in one room, and when we heard about this we were just amazed. Nobody had even thought about the stuff - it was all brand new and novel in a field that barely had anything novel in decades! Until that point, basically all interpreters were one big while loop.

> How is its interpreter substantially different or more impressive than previous interpreters written in assembly language?

I wouldn't know, since I've not heard of any mainstream interpreters written in assembly. I can only imagine that they were exactly the same as C interpreters: essentially a while loop with a switch statement, just written in assembly.

I find it amusing that you started at LuaJIT2. I would liken it to studying modern war, then wondering "why didn't they just use drone strikes at the Somme" :) Looking back from LuaJIT2, interpreters must seem really, really primitive.


I don't think interpreters written in assembly were that bad. LuaJIT 2 uses direct threading (not new at all), register-based bytecode (relatively new), and manually optimised register assignment (perhaps new). AFAICT, the key innovations are that he did not use Lua 5.1's register-based bytecode format, but simplified it even further so it can be decoded efficiently on x86. The second key component is that he pre-decodes the following instruction in order to take advantage of out-of-order execution. This technique also required fixing some registers' roles.

Don't get me wrong, I think LuaJIT2's interpreter is great, but interpreters before LuaJIT2 weren't complete crap, either. Many emulators, for example, have very good interpreters written in assembly (some aim to be cycle-accurate).


I was trying to describe how it looked from an academic standpoint. Direct threading and register bytecode was well known (register stuff is actually very old, but the jury was out until about 2003), but everything else Pall did was basically new to programming language researchers and implementers.


Aycock paper is available here: http://pages.cpsc.ucalgary.ca/~aycock/

title: "Converting Python Virtual Machine Code to C."


I don't know about the exact definition of "mainstream bytecode-based VM", but just for the record: both the OCaml interpreter and the GForth interpreter usually score pretty well, too.

Probably the de facto fastest interpreter is the one in Sun's HotSpot Java virtual machine. There is a paper by Robert Griesemer detailing some of it, but AFAIR it is a hand-coded, optimized assembly interpreter that does not do too badly. LuaJIT's interpreter is doing quite well, too. (Mike Pall said in the LtU thread about trace-based compilation that the LuaJIT2 interpreter is mostly on par with the LuaJIT1 compiler, or at least not much slower.)

EDIT: replaced "LuaJIT1 interpreter" with more accurate compiler as pointed out by haberman.


Ah, sorry I should have been more specific and said "for dynamic languages."

The LuaJIT2 interpreter is on par with the LuaJIT1 JIT compiler. I don't believe LuaJIT1 had a separate interpreter -- it was just a JIT compiler that used the main Lua interpreter.


Ok, well even for dynamic languages, threaded code is likely to give you a noticeable speedup. The only negative example I'm aware of was adding threaded code (+ superinstructions) to the Tcl interpreter, which sometimes resulted in slowdowns.

Even though the potential is not going to be as big as for Java, Forth, and OCaml interpreters (where people frequently report 2x speedups), for example Python gains between 20 and 45%. But somebody already replied to a similar inquiry and said that ANSI C compatibility is more important than the increase in performance. (Python uses conditional compilation to achieve both.)


Genuine question: Couldn't we all just move to HipHop if the PHP internals are so bad?


We're getting there, that's why this is so exciting!


Uhh, to the extent that it makes sense to compare speeds of programming languages, PHP is faster and uses less memory than both Ruby and Python in many or most cases:

Python: http://shootout.alioth.debian.org/u64q/benchmark.php?test=al...

Ruby: http://shootout.alioth.debian.org/u64q/benchmark.php?test=al...


Did you look at the source?

For example for the pidigits bench:

The PHP version uses the GMP extension: http://shootout.alioth.debian.org/u64q/program.php?test=pidi...

And the Ruby version does it in pure Ruby: http://shootout.alioth.debian.org/u64q/program.php?test=pidi...

So basically it's a C vs. Ruby benchmark. Many people complained on the forum; take a look.

The Debian shootout is interesting, but don't take it as an absolute truth.


The Debian shootout has made some very questionable decisions, and the author has defended them by saying that you can't rely on the results. Which is definitely true.


>> the author has defended them by saying that <<

Please show where I said that, or admit that you are putting words into my mouth.


Do you believe that the shootout results reliably show that one language is faster than another?


You claimed I said "that you can't rely on the results".

Please show where I said that, or admit that you are putting words into my mouth.


This is a discussion forum, not a court of law. I'm not about to trawl through 5 or 6 years of Reddit and Hacker News comments to find an exact quote.

If you believe the shootout results can reliably show that one language is faster than another, say so. Otherwise, de facto, you are saying you can't rely on the results.


People here are smart enough to understand that you want to claim "you can't rely on the results" and, to make your claim seem stronger, you put your words in my mouth.


Who's putting words in whose mouth now?


You told us "you can't rely on the results" and gave emphatic support to that claim -- "Which is definitely true."

http://news.ycombinator.com/item?id=4851984

Your words -- not mine.


You're going a long way to avoid saying that you can rely on the results. If you felt we could, this discussion would have been over a long time ago.

I'm stopping this dance here. Feel free to make a claim about the reliability of the results, and whether you can use them to say that one language is faster than another. Then we can dance some more, but otherwise I'm out.


You can rely on the results, the results are what they claim to be on the home page -- "the time taken for this task, by this program, when compiled with this compiler, with these options, on this machine, with these workloads."

> Feel free to make a claim...

The only point of "this dance" was to help people understand that you are putting words into my mouth.


Exactly. You've said it yourself: you cannot rely on the results to tell whether one language is faster than another. That was the original context of my claim, and you are admitting it yourself.


I said - "You can rely on the results."

Please say what you mean by "a language is faster than another".

Please say what in The Java™ Language Specification tells us that "language is faster than another."

http://docs.oracle.com/javase/specs/jls/se7/html/index.html


Exactly. You can't use the results to say that one language is faster than another. It doesn't even make sense.

(That's the context of the conversation, see http://news.ycombinator.com/item?id=4851781)


>> You can't use the results to say that... <<

That's a different claim.

>> It doesn't even make sense <<

That depends on what `notJim` meant by "to compare speeds of programming languages".

`notJim` might have thought that everyone would understand he was talking about particular language implementations and particular tasks and particular programs.


The benchmarks game is not "The Debian ..." anything -- just one of 960 hosted projects.

The pi-digits page also shows a PHP program which does not use the GMP extension.

Why do you think no one has contributed a Ruby program that uses GMP? Is that too hard to do using Ruby?

The benchmarks game is the absolute truth about the reported measurements.



