Does Int32Array look horrible only to me? What if I want an array of my own custom types - say vec2f. Do I have to use 2 Float32Array's? Javascript is a horrible language where you have the tradeoff between performance and expressibility. Seriously, this whole web development thing needs to be fixed.
I am not sure though where you tradeoffs between expressibility and performance. Typed arrays came to our world only recently, when needs of WebGL gave them birth. So they ended up being bolted on top of an existing language.
"You can use one Float32Array: x component gets even indices, y gets odd ones."
That is exactly my point of choosing the performance vs expressibility. But then how do I write algorithms in a generic fashion for them? In C++ I have templates and template functions, so I can write algorithm that will work for any type of the array. And I'm not even talking about STL which decouples algorithms from containers, by making the latter provide a hierarchy of iterators. Other languages like Haskell also provide compile-time polymorphism. And here we are in 2012 - there is an ad-hoc solution that only works in few browsers, and a proper one is only down the road.
A few things to add to this discussion about typed arrays:
1. The list of support statuses does not seem to reflect that Float64Array was missing from Safari until recently, and I think might still be missing from mobile Safari.
2. Typed array performance of Uint32Arrays can differ from the other integer array types: unsigned int values do not fit in 31-bit signed integers, which is what most JS engines optimize ints for. This is most significant in V8 which does not use NaNboxing, so when it sees a big integer it makes it a boxed double, leading to bad performance, see for example http://code.google.com/p/v8/issues/detail?id=2097 Looks like the benchmarks done here did not check those values, but they happen a lot in practice with things like compiled C code.
3. Aside from points 1 and 2 overall typed array support and performance are predictable and good across browsers (sans IE, but hopefully with IE10). Aside from random access there is also the .set() method which lets you copy large amounts of typed data efficiently as well (which was not benchmarked here, but should basically be a memcpy so likely consistent across browsers).
> This is most significant in V8 which does not use NaNboxing, so when it sees a big integer it makes it a boxed double
NaN-tagging has nothing to do with the issue you are linking to e.g. in V8 does not allocate a boxed double for every floating point value it reads from Float64Array.
Let me expand my own comments [vegorov@chromium.org] from the issue, so that the problem V8 has becomes more understandable.
The reason for the Issue 2097 is somewhat "patchy" support V8 has for uint32 values in the optimized code. uint32 value in JavaScript has somewhat a dual nature: it can be treated bitwise as an signed int32 as long as it participates in truncating operations[1]; but once it escapes into unsafe place where its "unsigned-ness" can be observed it might have to become a full double if it does not fit into positive part of int32 range.
V8's optimizing pipeline does not have a proper analysis to determine where uint32 value can efficiently be represented as int32 without affecting semantics. It also does not want to pessimistically represent all values read from Uint32Array as doubles so it just chooses to assume non-truncating int32 representation for them and as a result code deoptimizes when it sees a value that does not fit [another bit that is missing here is type-feedback on keyed load IC that would tell hydrogen that he is too optimistic in its assumption].
V8 actually has the same problem with x >>> 0.
[1] - examples of truncating operations are: bitwise ops, stores into integer typed arrays and arithmetic operations when they do not overflow beyond 53bits and their results are used only in truncating operations;
Interesting, thanks. I am not sure I fully follow, though: Why is v8 so much slower on this than other engines that also do not have special code to handle uint32s? We are talking slowdown factors of 10x-15x, so it isn't a minor optimization here or there - what is the cause?
Optimized code can't deal with uint32 that is out of range so it deoptimizes and effectively you end up running in an entirely unoptimized code which is obviously much slower than optimized one... and of course this penalizes V8 even more because unboxed doubles and int32s are possible for V8 only in optimized code.
It is true that if V8 were using NaN-tagging it would suffer less from running entirely unoptimized code. But my point is: it doesn't have to use NaN-tagging to run this code efficiently, it just needs to ensure it doesn't deoptimize for nothing.
So, out of the 10-15x slowdown, how much is due to deoptimizing and how much is due to not NaNboxing, if it's possible to estimate that? The distinction should be important in the case of a very large codebase whose performance is not focused in a few small loops, so presumably most of the time you will be running unoptimized code (and then if you NaNbox or not gets important).
I am not sure I entirely understand. If you are running in a cold code then performance does not matter and you can tolerate quickly allocating a small amount of boxes which will be as quickly reclaimed by scavenger once you are done with them. If you are running in an hot code --- then it should be optimized in a way that minimizes the number of boxes produced.
In other words: ideally application should be running unoptimized code if and only if it is either cold or cannot be improved by optimization; all other cases are bugs.
I can't split 10-15x between deoptimization and boxing because for V8 cost of "erroneous" deoptimization includes the cost of boxing as you can't have unboxed numbers in unoptimized code.
As I said earlier it is true that non-optimized code heavily manipulating doubles could become faster if V8 used NaN-tagging (or another technique that would allow it to maintain unboxed doubles on unoptimized frames). But speed of unoptimized code should not matter (see above).
Another thing to keep in mind is that for NaN-tagging on ia32 you pay with memory overhead: every object slot that can contain primitive number becomes twice as large on ia32. This is not nice if you don't have a lot of number floating around.
Overall, let me reiterate it, I am not arguing against NaN-tagging. I am just clarifying that the Issue 2097 is caused by the wrong decision in the hydrogen pipeline not by the fact that V8 does not use NaN-tagging.
I see now what you are saying about that issue, NaNboxing makes it worse but at core it is a deoptimization issue. Which is good, I hope this is fixed soon (so emscripten-compiled code runs more consistently across browsers).
> I am not sure I entirely understand. If you are running in a cold code then performance does not matter and you can tolerate quickly allocating a small amount of boxes which will be as quickly reclaimed by scavenger once you are done with them. If you are running in an hot code --- then it should be optimized in a way that minimizes the number of boxes produced.
Let's say that performance matters in the application, but it is huge in code size and all the code matters, not a few small parts. Would you call all the code hot, and would v8 optimize the entire application? (i.e., how is 'hot' defined in v8?)
Hot currently is defined as "function called more than X times" and "function contains a loop that took a backedge more than Y times". So it is defined per-function basis.
I can hardly speculate how V8 will behave on some abstract application. That is really highly dependent on how code looks like. But ultimately V8 will try to optimize everything that falls under criteria outlined above.
I ask because performance of small benchmarks has been quite good, except for the uint32 issue mentioned above, often around 3x slower than native code - but on very large codebases it is often much slower, and I do not know why.
Unfortunately (as you probably know yourself) I don't have a magical answer that would speed everything up. As for V8 there is quite a number of limits that you might be hitting with generated code (e.g. number of locals, size of the function etc) and there can be some bugs or non-done-yet thingies affecting performance. Profiling and looking at the generated code (and filing bugs) is the only suggesting I can give here.
Couple of things to be aware of from a V8 perspective:
- to make "creation" test more fair for normal arrays they should be preallocated with new Array(arraySize), if arraySize does not exceed 90000. This will ensure that you are not wasting time reallocating backing store as it grows.
- It has been pointed to the test author a year ago that having a single test_SMTH function and calling it with different array types causes it to become polymorphic --- which affects the generated code. V8 became much-much better in handling polymorphism of this sort, but if you'll create a test_SMTH_ARRAYTYPE for each combination of test and array type you'll see results not distorted by the polymorphism.
I am not very familiar with internals of other engines. AFAIK both SpiderMonkey and Safari have polymorphic inline caches for a.foo property access sites. I can't say if the leverage PICs for a[i] access sites.
But from what I see from a quick glance over JavaScriptCore sources they do not seem to handle any kind of polymorphism for a[i] kind of sites in their new optimizing compiler (aka DFG). Additionally they do seem to handle polymorphism for a.foo sites in DFG only if foo always has the same offset in all structures this site have seen. I might be wrong about it though, it was just a quick flight over the source, without even checking it out to the disk.
A quick look with the JIT inspector addon (https://addons.mozilla.org/en-US/firefox/addon/jit-inspector...) seems to tell me that a lot of performance is gated on the array[i] index access, which you already mentioned. Indeed the polymorphic lookups are hurting us very badly and I think we don't inline cache anything and always take stub call. (Screenshot: http://i.imgur.com/UKL0t.png) All the red sections are the very hot stubcalls.
We only IC properties of regular objects (but also with different shapes and offsets) and indexes into some optimized array kind. But _not_ typed arrays. We used to do this, but Type Inference is usually very good with "normal" usages of typed arrays.
Yes, other engines do not seem to handle polymorphism well. I just amended the code to use separate test_something_arraytype for each array type. And, it turns out that the performance of other browsers have improved significantly and they look closer to Chrome now. Check out the updated charts in the blog post.
I suggest you add a note to the text hinting that you are not using original tests anymore. It would be even cooler if you publish both results of original (polymorphic) and your (non-polymorphic) runs.
I find results for "random read" particularly interesting. I need to take a look at the generated code. It might be that there is something in the way V8 compiles integer modulo operation that makes it slower than SpiderMonkey.
[though for this test to be comparable across browsers you should really fix function init_randlist not to use Math.random but use a pseudo-random generator with a _fixed_ seed. Uncontrolled sources of randomness should be discouraged in microbenchmarks and they lead to flaky unreproducible benchmarks. Though I don't think that it's the reason here].
I have added an update note to the post. Also, I have added a table comparison between polymorphic & non-polymorphic runs.
I will fix the code to use pseudo-random function to make the test results more meaningful.Also, I will try to send a pull request to the original test author.
This test should also add strings where the least significant byte of each character is used a byte (i.e. '\x00' = 0 up through '\xff' = 0xff). For many operations, these end up being faster than arrays of numbers.
Jdataview handles this already. Except for Internet Explorer of course, which can't handle arrays with null characters in it. No matter what you do, IE won't let you see beyond the null char. No other browser has this issue.
I once pointed this out, but got a "holy edge case, Batman!" [1], and got voted to -1...