I should clarify slightly. There are multiple definitions of "data race" in play. In the sense of the C/C++ standard, where a data race is essentially used to describe behaviors that are undefined, relaxed atomics are free of data races. However, from the perspective of a data-race-free model where data races are things that observably violate sequential consistency, relaxed atomics are absolutely data races.
In fact, the raison d'être for relaxed atomics is to permit applications to create "benign data races" [in the second sense] that aren't undefined behavior. As it turns out, though, actually specifying what semantics such a "benign data race" has is complicated, especially when you get into the realm of avoiding things like out-of-thin-air behavior (or, to reference a paper that crossed my desk a month ago, undefined behavior executed only if out-of-thin-air happens).
> There are linker flags that allow some deduplication but those have their own drawbacks
As long as you use --icf=safe I don't see any drawback, and most of the time it results in almost identical reductions to --icf=all since not many real programs compare addresses of functions.
I, along with everyone in the embedded space, have been using separate function sections forever for --gc-sections and I would be very surprised if they really cause any bloat and duplication at runtime. Do you mean bloat for intermediate files?
It may be limited to intermediate files. I assumed the downside was bigger, since it isn't the default and the description mentioned that some things may not merge as well.
You cannot overload operators on built-in integral types in C++, if you mean redefining `int operator+(int, int)` to be saturating. You can easily create a `saturating_int` kind of type, though.
Despite its C underpinnings, C++ does provide the tooling for type-driven development in the style of the ML languages. I think it's a hard sell in a performance-driven culture to introduce little types for those purposes, which is why the technique isn't used as much as it should be.
Can you actually connect HDMI input to a thunderbolt only display? I have the LG 5K thunderbolt-only display and cannot use it with a desktop PC due to this issue.
Isn't this supposed to be a strength of Framework? Their machines are pretty modular. Users should get to pick and choose AMD vs. Intel, keyboard layouts, and wifi chipsets without Framework having to design different machines for each configuration.
Keyboards and wifi are one thing (both already modular) but the CPU is integrated into the mainboard and an AMD-based board would presumably need different AMD chipsets and so on beyond simply the CPU itself. That's "modular" in the sense that the chassis is designed for swappable mainboards, but it's much more involved to support.
More or less, yes. The different keyboard layouts are not that difficult to do. A different wifi chipset is trivial; customers can source their own from wherever they want. CPUs are a bit different, as Intel and AMD will require completely different mainboard designs. It's doable, certainly, but requires quite a bit of engineering effort, especially if you want to be able to place components and connectors (etc.) in the same places on the different boards, which is necessary for something like the Framework laptop, where they'd want to be able to allow you to put either mainboard in the same chassis.
Speaking of chassis, that makes offering different screen aspect ratios really hard, as you'll usually have a different sized/shaped chassis for a different aspect ratio. That might mean a different mainboard layout, different keyboard, different touchpad, and different battery, at least. That would vastly complicate Framework's offering, something I'm sure they're in no position to do as such a young company.
You can have your choice of 3 Intel CPUs, and you can bring your own NVMe SSD, RAM, and wifi card. Of course the motherboard chipset determines which of those will work.
You can't change the CPU brand. Moving from Intel to AMD (or from the current Intel CPUs to a newer generation of Intel CPUs) would require an entirely different motherboard.
This ends up with a laptop where nothing really works well, because everyone has a completely different setup and issues come up with particular configs. The reason the MacBook works so well right now is that every single part was very carefully supported and designed to work together.
I used to be all in on the FOSS hardware train but at some point you just want to get some work done rather than debugging wifi drivers.
You can only get so modular until you run into the fact that a hybrid Intel/AMD/etc motherboard chipset does not exist. Creating one would cost more money than they have raised.
I've been using `expected`, i.e. value-or-error type, for a while in C++ and it works just fine, but the article shows it has some noticeable overhead for the `fib` workload for instance. Not sure if the Rust implementation has a different design to make it perform better though.
> Not sure if the Rust implementation has a different design to make it perform better though.
Prolly not, I expect the issue comes from the increase in branches, since value-based error reporting has to branch on every function return. Even if the branch is predictable, it's not free.
And fib() would be a worst-case scenario as it does very little per-call, the constant per-call overhead would be rather major.
It's also worth noting that Rust does also have stack-unwinding error propagation, in the form of `panic`/`catch_unwind`, which can be used as a less-ergonomic optimization in situations like this. Result types like this also don't color the function, since you can just explicitly panic, which would be inlined at the call site and show similar performance to C++ exceptions.
This is very much non-idiomatic. Panics are not intended for “application” error reporting, but rather for “programming” errors.
The intended use case of catch_unwind is to protect the wider program (e.g. avoid breaking a threadpool worker or a scheduler on panic), or to transmit the information across threads or to collation tools like Sentry.
Using up scarce branch-prediction slots is a good way to make your program unoptimizable. Time wasted because you ran out will not show up anywhere localized on your profile. (Likewise blowing any other cache.)
Using up BTB slots is an interesting problem but in practice doesn't seem to be a big issue. If it was, ISAs would use things like hinted branches but instead they've been taking them away. Code size is more important but hot/cold splitting can help there.
A problem with using exceptions instead is they defeat the return address prediction by unwinding the stack.
That hinted branches are not useful tells us nothing about the importance of branch predictor footprint. When hinting branches gives you a bigger L1 cache footprint, it has a high cost. Compilers nowadays use code motion to implement branch hinting, which does not burn L1 cache. (Maybe code motion is what you mean by "hot/cold splitting"?)
Anyway the hint we really need, no ISA has: "do not predict this branch". (We approximate that with constructs that generate a "cmov" instruction, which anyway is not going away.)
How does using exceptions defeat return address prediction? You are explicitly not returning, so any prediction would be wrong anyway. In the common case, you do return, and the predictor works fine.
> When hinting branches gives you a bigger L1 cache footprint, it has a high cost.
It was the same size on PPC, and on x86 using recommended branch directions (but not prefixes).
> Compilers nowadays use code motion to implement branch hinting, which does not burn L1 cache. (Maybe code motion is what you mean by "hot/cold splitting"?)
Hot/cold splitting is not just sinking unlikely basic blocks, it's when you move them to the end of the program entirely.
That doesn't hint branches anymore, though; Intel hasn't recommended any particular branch layout since 2006.
> How does using exceptions defeat return address prediction? You are explicitly not returning, so any prediction would be wrong anyway.
Anything that never returns is a mispredict there; most things return. What it does instead (read the DWARF tables, find the catch block, indirect jump) is harder to predict too since it has a lot of dependent memory reads.
It suffices, for cache footprint, for cold code to be on a different cache line, maybe 64 bytes away. For virtual memory footprint, being on another page suffices, ~4k away. Nothing benefits from being at the "end of the program".
Machines do still charge an extra cycle for branches taken vs. not, so it matters whether you expect to take it.
Negligibly few things never return; most of those abort. Performance of those absolutely does not matter.
Why should anyone care about predicting the catch block a throw will land in after all the right destructor calls have finished? We have already established that throwing costs multiple L3 cache misses, if not actual page faults.
> Nothing benefits from being at the "end of the program".
It's not about the benefit, that's just the easiest way to implement it - put it in a different TEXT section and let the linker move it.
Although, there is a popular desktop ARM CPU with 16KB pages.
> Machines do still charge an extra cycle for branches taken vs. not, so it matters whether you expect to take it.
Current generation CPUs can issue one taken branch or 2 not-taken branches in ~1 cycle (although strangely Zen2 couldn't), but yes it is better to be not taken iff not mispredicted. (https://www.agner.org/optimize/instruction_tables.pdf)
> Negligibly few things never return; most of those abort. Performance of those absolutely does not matter.
Throwing an exception isn't a return, nor longjmp/green threads/whatever. Sometimes they're called abnormal or non-local returns, but according to your C++ compiler your throwing function can be `noreturn`.
Error path performance is important since there are situations like network I/O where errors aren't at all unexpected. If you're writing a program you can just special case your hotter error paths, but if you're designing the language/OS/CPU under it then you have to make harder decisions.
> Why should anyone care about predicting the catch block a throw will land in after all the right destructor calls have finished? We have already established that throwing costs multiple L3 cache misses, if not actual page faults.
More prediction is always better. The earlier you can issue a cache miss the earlier you get it back.
For instance, that popular desktop ARM CPU can keep 600+ instructions in flight at once (according to AnandTech). That's a lot of unnecessary stalls if you mispredict.
And so its vendor has their own language, presumably compatible with it, which doesn't support exceptions.
Specifically, a prediction is not better when it makes no difference. Then, it is worse, because it consumes a resource that would better be applied in a different place where it could make a difference.
Exceptions are the perfect example of a case where any prediction expenditure would be wasted; except predicting that no exception will be thrown. It is always better to predict no exception is thrown, because only the non-throwing case can benefit.
Running ahead and pre-computing results that would be thrown away if an exception is thrown is a pure win: with no exception, you are far ahead; with an exception, there is no useful work to be sped up, so it is all just overhead anyway.
This is similar to a busy-wait: you are better off having the exit from the wait predicted, because that reduces your response latency, even though history predicts that you will continue waiting. This is why there is now a special busy-wait instruction that does not consume a branch-prediction slot. (It also consumes no power, because it just sleeps until the cache line being watched shows an update.)