DRAGONERO's comments | Hacker News

I’d expect most vendors do, at least in their closed-source drivers. You could also check the Mesa project to see whether this is implemented there, but it’s definitely possible to do.


Shader compilers tend to be very latency sensitive, so “it takes too long to run” would be a valid reason for skipping it, if it is indeed skipped.


Shader compilers mostly use LLVM even though runtime is a constraint. If the pattern is common enough, it’s easy to match (it’s just two intrinsics, after all), so you can do it cheaply in instcombine, which you’re going to be running anyway.


For some reason, I feel like this is harder to implement than you expect. The way to find out would be to get a bunch of examples of people doing this optimization in shader code, look at the IR generated compared to the optimal version, and figure out a set of rules to detect the bad versions and transform them into good ones. Keep in mind that in the example, the addition operators could be replaced with logical ORs, so there are definitely multiple variations that need to be detected and corrected.
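As a toy version of what that rule set would have to recognize, here is the mix/step select idiom emulated in scalar Python (these are hypothetical stand-ins for the GLSL built-ins, not real shader code):

```python
def step(edge, x):
    # GLSL-style step(): 0.0 below the edge, 1.0 at or above it
    return 0.0 if x < edge else 1.0

def mix(a, b, t):
    # GLSL-style mix(): linear blend, a when t == 0, b when t == 1
    return a * (1.0 - t) + b * t

def branchless(a, b, edge, x):
    # the hand-"optimized" idiom a compiler pass would need to spot
    return mix(a, b, step(edge, x))

def branchy(a, b, edge, x):
    # the straightforward select it could be rewritten into
    return b if x >= edge else a

print(branchless(1.0, 2.0, 0.5, 0.7), branchy(1.0, 2.0, 0.5, 0.7))
```

Every algebraic variant (e.g. summing or OR-ing several step() results) is one more pattern such a pass would have to normalize before matching.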


I've checked and on "certain vendors" the mix + step is actually (slightly) better: same temp usage, lower instructions/cycles.


“Additionally, participants can easily contribute photos and videos to a dedicated Shared Album within each invite to help preserve memories and relive the event.”

This sounds like a great feature. Post-event photo sharing is always a bit of a mess.


Yes, I've had tons of invites to share on google photos and it has always been a cluster in some way.


Massive if it truly works cross-platform.

Has anyone tried from android yet?


Seriously, this is a killer feature.


This will be quite nice for Apple Funerals.


Might also be worth reporting directly through Cloudflare https://radar.cloudflare.com/domains/feedback/notepad.plus


Cloudflare typically doesn't do anything unless you get lawyers involved, in my experience.


...unless the website happens to become a pet peeve of the higher ups at Cloudflare


Another way to explore floating-point representation is to compare rounding modes, since these can change a lot of what is generated for floating-point operations. Most systems use round to nearest even, but round to zero, round towards positive, and round towards negative also exist. Bonus points if you also check results for x87 floats.


For Mandelbrot, presumably you can do exact rational arithmetic (all it needs is multiplications and additions, and you presumably start with a floating-point coordinate, which is eminently rational). It will be very slow with many iterations, of course, since the fractions will rapidly spin out of control.

Edit: I see the supplementary material addresses this: “One could attempt to avoid the problem by using arbitrary-precision floating point representation, or rational representation (using two integers, for numerator and denominator, i.e., keep everything as fractions), but that quickly leads to so many significant digits to multiply in every iteration that calculations become enormously slow. For many points in a Mandelbrot Set Image, the number of significant digits in a trajectory tends to double with each iteration, leading to O(n^2) time complexity for evaluating one point with n iterations. Iterations must then be kept very low (e.g., 20-30 on a modern PC!), which means that many higher-iteration points cannot be evaluated. So accumulated round-off error is hard to avoid.” 20-30 sounds crazy low even for O(n²), though.

Edit 2: This part seems wrong: “Cycle detection would work if floating point quantities could be compared exactly, but they cannot.” They certainly can. If you detect a cycle, you will never escape _given your working precision_. Floats can cycle just as well as fixed-point numbers can. But I suppose true cycles (those that would be cycles even in rationals) are extremely unlikely.
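The digit growth is easy to watch with exact rationals in Python (the sample point c is an arbitrary choice of mine, not one from the article):

```python
from fractions import Fraction

# z -> z^2 + c over exact rationals, tracking denominator size
cr, ci = Fraction(-7, 10), Fraction(1, 4)  # arbitrary sample point
zr, zi = Fraction(0), Fraction(0)
bits = []
for _ in range(8):
    zr, zi = zr * zr - zi * zi + cr, 2 * zr * zi + ci
    bits.append(zr.denominator.bit_length())
print(bits)  # the bit count roughly doubles every iteration
```

After only eight iterations the denominators are already hundreds of bits wide, which is the per-point blowup the supplementary material describes.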


“One could” indeed:

https://www.fractint.org/

From wikipedia:

> The name is a portmanteau of fractal and integer, since the first versions of Fractint used only integer arithmetic (also known as fixed-point arithmetic), for faster rendering on computers without math coprocessors. Since then, floating-point arithmetic and arbitrary-precision arithmetic modes have been added.


I spent a lot of time with fractint in the late 1990s. Great memories!


Fractint is fixed-point, not exact rationals.


Hm, what about intervals of rationals, and then approximating the intervals with a slightly wider interval in order to make the denominators smaller? I guess the intervals for the real and imaginary components might grow too quickly?

if 0 < a_L < a < a_R , 0 < b_L < b < b_R , then (a + b i)^2 = (a^2 - b^2) + 2 a b i , and a_L^2 - b_R^2 < a^2 - b^2 < a_R^2 - b_L^2 and 2 a_L b_L < 2 a b < 2 a_R b_R ,

uh,

(a_R^2 - b_L^2) - (a_L^2 - b_R^2) = (a_R^2 - a_L^2) + (b_R^2 - b_L^2) = 2 (a_R - a_L) ((a_L + a_R)/2) + 2 (b_R - b_L) ((b_L + b_R)/2)

And, a_R b_R - a_L b_L = a_R b_R - a_R b_L + a_R b_L - a_L b_L = a_R (b_R - b_L) + (a_R - a_L) b_L

So, looks like the size of the intervals are like, (size of the actual coordinate) * (size of interval in previous step), times like, 2?
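A sketch of that idea with Python fractions (the interval bounds, max_den, and the outward-nudge trick are all my own assumptions, not a worked-out algorithm):

```python
from fractions import Fraction as F

def square_interval(aL, aR, bL, bR):
    # bounds for (a + b i)^2 given 0 < aL <= a <= aR, 0 < bL <= b <= bR
    return (aL * aL - bR * bR, aR * aR - bL * bL), (2 * aL * bL, 2 * aR * bR)

def widen(lo, hi, max_den=10**6):
    # approximate with smaller denominators, nudging outward so the
    # original interval stays contained (limit_denominator() alone
    # returns the *nearest* fraction, which may fall inside)
    lo2 = lo.limit_denominator(max_den)
    if lo2 > lo:
        lo2 -= F(1, max_den)
    hi2 = hi.limit_denominator(max_den)
    if hi2 < hi:
        hi2 += F(1, max_den)
    return lo2, hi2

aL, aR = F(1, 3), F(1, 3) + F(1, 997)
bL, bR = F(1, 4), F(1, 4) + F(1, 997)
(rlo, rhi), (ilo, ihi) = square_interval(aL, aR, bL, bR)
rlo2, rhi2 = widen(rlo, rhi)
print(rhi - rlo, rhi2 - rlo2)  # slightly wider, but denominators stay bounded
```

The interval still grows by roughly (size of the coordinate) times (previous width) per step, as in the expansion above, so the real question is how many iterations it survives before the interval swallows the escape radius.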


"20-30 sounds crazy low even for O(n²), though."

I mean, 30 iterations means ~2^30 sig digits, since the digit count doubles each step. Any operation on numbers that size is expensive in pure memory throughput alone. For the squaring, let's use an FFT, so each multiply is only O(n log n) with n the number of digits: 30 * 2^30 ops, call it 2^35, so roughly 32 Gop for a single dot. Even a modest 1280*768 image has about 10^6 dots, so that's on the order of 3*10^16 ops.

Even at an optimistic 64 Gop/s with perfectly sequential memory access (it's not, it's worse), that's days of compute for a single image.

And each additional iteration more than doubles the time spent, so a handful more iterations turns days into months. Nobody waits that long for anything. (Unless it's a web page downloading JS)

Now add the fact that I lied: memory access can't be fully sequential here. Even if it were, we're at least spilling into L3 cache, and that means some 40 cycles of penalty every time we do. We'll do that a lot. So multiply the time by 10 or so (I'm still hopelessly optimistic that we have some efficiencies).

And so, yes, it's really expensive even for modern PCs.

If you have a GPU, you can shuffle it off to the GPU, they're pretty good at FFTs - but even that's only going to buy you, if we're really really optimistic, a handful of extra iterations before it bogs down. Doubling the digits every iteration will make any hardware melt quickly.


Ah, of course: not only does each multiply require O(n²) work, but the number is squared every iteration, so the digit count doubles (n bits -> 2n bits). So that's why you get O(2^n) _total_ time.


AFAIK all IEEE floating point systems allow configuring the rounding mode, for example with fesetround or equivalent: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/fe...
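There's no portable stdlib control for binary-float rounding in Python, but the decimal module exposes the same family of modes, which makes the differences easy to see (prec=4 is just an illustrative choice):

```python
from decimal import Decimal, localcontext, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING

def div(a, b, rounding, prec=4):
    # divide under an explicit rounding mode and working precision
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = rounding
        return Decimal(a) / Decimal(b)

print(div(1, 3, ROUND_HALF_EVEN))  # 0.3333 (round to nearest)
print(div(1, 3, ROUND_CEILING))    # 0.3334 (round towards positive)
print(div(2, 3, ROUND_DOWN))       # 0.6666 (round towards zero)
```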


I had never realized the necessity of nearest even.

I thought that 0.5 ought to be rounded to 1, mirroring how 0.0 rounds to 0!

But I can now see how rounding every tie upward introduces a small systematic bias, and over many operations those errors compound; ties-to-even averages them out.
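Python's built-in round() happens to use ties-to-even, which makes the bias easy to demonstrate:

```python
ties = [0.5, 1.5, 2.5, 3.5]  # all exactly representable as floats
print([round(t) for t in ties])     # [0, 2, 2, 4]
print(sum(round(t) for t in ties))  # 8, same as sum(ties)
# always rounding .5 up would give [1, 2, 3, 4] -> 10, a +0.5-per-tie bias
```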


I've heard it described as banker's rounding.


I think you mean “best shoe”


You could wear the best most advanced shoes, and I bet you still couldn’t beat Usain Bolt wearing $10 sneakers.


Two years post-retirement, wearing sweatpants and sneakers, he (albeit unofficially) tied the fastest 40-yard dash time at the NFL Combine: https://bleacherreport.com/articles/2818947-watch-usain-bolt...


That's probably a good amount of battery life that can be claimed back once this is fixed


The fix will be ready for the next macOS Apple Event so that they can claim battery performance improvements


Meanwhile Apple is the company most actively working on better battery performance, less memory utilization (e.g. compressed memory) and other such changes...


> compressed memory

doesn't solve much, and it's been available in other OSes for a long time.

Many have reported that compressed memory can negatively impact performance in certain scenarios.

For example

https://github.com/microsoft/Windows-Dev-Performance/issues/...


I don't know how compressed memory works in Windows, but on macOS its role is equivalent to swapping: no memory is compressed unless pages are evicted because they're not being used. If memory compression weren't present, those pages would have been swapped to disk.


Sounds exactly like zswap. Are there differences that make it better?


Not that I know of. What might make a difference, though I don't know, is the eviction path: when evicting to disk, zswap decompresses the page and swaps it out uncompressed† as if zswap weren't there, while macOS might (or might not) write it out still compressed (a bit-for-bit copy), minimising CPU, IO, and size (and wear for SSDs) at the cost of decompressing when paging it in again.

† Unless the underlying block device backing the swap device or file does transparent compression, in which case it gets decompressed by zswap and then compressed by whatever is underneath (e.g. LVM compression).


> No memory is compressed unless pages are evicted due to not being used

It's the same in Windows.

You can't enable memory compression in Windows unless a page file is present (I haven't used Windows in 10 years, but that's what I gathered from researching the topic).


macOS has had compressed memory for at least a decade.


I know. Your point being?

Did I say they only recently started optimizing things?


CPU usage isn't equivalent to power usage, so this should have no practical effect on battery life.


Fixed? This is modern UX design. (Redrawing the whole screen when the cursor blinks.)


Whether to redraw a larger area than what actually changed is not a UX issue.


Is the Matrix resemblance on purpose?


Is the benchmark suite available somewhere?


(Author here) See https://github.com/clamchowder/Microbenchmarks/tree/master/G...

It's very much a work in progress, as noted in the article. And some of the stuff that worked reasonably well on my cards, like the instruction rate test when trying to measure throughput across the entire card, went down the drain when run on Arc.


Have you tried reducing the register count in your FP32 FMA test by increasing the iteration count and reducing the number of values computed per loop?

Instead of computing 8 independent values, compute one with 8x more iterations:

    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;  // one dependent FMA per iteration
    }
That, plus inlining the iteration count so the compiler can unroll the loop, might help get closer to SOL (theoretical peak).


The problem is that loop overhead matters on AMD, because AMD's compiler doesn't unroll the loop. Nvidia's does, so it doesn't matter for them.


unroll with #pragma unroll?


The article lacks a lot of information, unfortunately, but it makes it sound like the website (a distribution channel) was the only part they were concerned about, which I wouldn't class as major.

What I'd class as major would be some third party gaining access to NVIDIA's RTL designs and source code for their drivers for current and unreleased GPUs, but this hack doesn't sound remotely close to that. Luckily.


> the website (distribution channel) was the only part they are concerned about, which wouldn't be classed as major

By whom? I'd certainly class it as major if their website could distribute malware instead of the real drivers, as that impacts everyone. Stealing nvidia's proprietary designs impacts only them.

I visited that page a few days ago to setup a new system which is, at the same time, supposed to be very secure (the proprietary drivers being one of the weak points indeed, but can't quite get around that if the GPU is to be fully functional). If this was compromised then I can start over and have a bunch of passwords and private keys to rotate.


Maybe you should consider doing the rotation already... Better safe than sorry in such cases.


> What I'd class as major would be some third party gaining access to NVIDIA's RTL designs and source code for their drivers

Ransomware operators are not that clever; they go for low-hanging fruit. I mean, yeah, by all means, do recon on a system you just pwned and try to do a supply chain attack, but it's outside the range of these operators. They only have a hammer, and everything just looks like a nail.


Even if they get the RTL I'm not sure how useful those would be. While Russia does have semiconductor fabs, apparently their smallest node is around 65nm, completely useless for the large designs current NVidia GPUs use. At best they could have them made at a fab in mainland China, but even there the smallest node is only 14nm.


A thief wouldn't be using the RTL to make a clone of an NVIDIA graphics card; they'd be using the IP cores as modules in their own designs. With some minor adjustment it shouldn't be too difficult to get at least most of the RTL working on a different node (maybe at a lower clock speed).


That's not how VLSI chip design works. You can't just take the RTL designed for 5 - 8 nm, zoom it up to 65nm and expect it to still work.

When you design a CPU or GPU, the RTL (the core pipelines, schedulers, and various buses) is designed from the start for a certain manufacturing process, where it's expected to work correctly at the specific frequencies needed to feed the pipelines with the right timings and hit the expected top performance. Failing to meet the fabrication process assumptions means the RTL design will perform much worse than expected in practice.

That's why many of Intel's past designs did so badly on performance and efficiency: as their 10nm manufacturing process fell behind, they had to scale their newer designs back onto the aging 14+++++ process, which caused those CPUs to flop big time.


>maybe lower clock speed

That is an understatement. 65nm is ten times larger than what NVidia is currently using. That means the area would be 100 times larger and any signal distances 10 times longer. And keep in mind that NVidia GPU designs already take up quite a bit of area on modern nodes.

So you'd likely have to cut it down to a hundredth of the modules, running at a tenth of the speed.


Is signal propagation actually close to being a limiting factor in clock speeds for most designs? I thought it's pretty much always thermals.


That is not the point; the signal propagation times in the VLSI blocks are engineered to work properly at the specific physical size. If the structures are scaled to a larger node size, the timing variances increase. If you want to do this, you can either 1) reengineer all the VLSI blocks to meet timing requirements at the larger node size (maybe impossible) or 2) slow the clock speed to loosen the requirements.


Isn't it exactly the point? If, provided sufficient cooling, you could double the clock speed without running into clock skew or other timing issues, then timing issues shouldn't be a problem if you want to scale things up physically by 50% without touching the clocks. I don't think you'd have to lower clocks by 90% to increase size of most designs tenfold. Or rather, that the reason you'd have to if you did wouldn't be due to signal propagation time.


There are EUV machines in China. Not sure why people keep perpetuating the myth that there aren't.


Does Apple do this with iMessage? I don't think they can, even.


They could. Correct me if I'm wrong, but users don't see which public keys have been used to encrypt the message's symmetric key, so theoretically Apple could easily and invisibly include themselves as a recipient.


You're exactly right and I'm not sure why this is being downvoted. Apple can add additional keys to iMessage messages and thus view them in transit - they say this themselves in their own security white paper[0].

[0]: https://www.apple.com/business/docs/iOS_Security_Guide.pdf


> Apple can add additional keys to iMessage messages and thus view them in transit - they say this themselves in their own security white paper

I just read the section on iMessage (from around page 49) and I can’t see where this is written. Can you point to the part where they say this?


Page 51:

> The private keys for both key pairs are saved in the device’s Keychain and the public keys are sent to Apple’s directory service (IDS), where they are associated with the user’s phone number or email address, along with the device’s APNs address.


I'm no security engineer but wouldn't that require Apple to have access to the private key, whereas the whitepaper says they only have access to the public key?


They can't mess with the private key (at least that's what they say, and we can't verify it since their software is closed source). But they're free to manipulate the public key list used during the encryption phase.

For Apple as a company, not having access to iMessages is the safest thing to do, and I believe them when they say they can't access them in the current setup and aren't willing to change that. This is because doing so would change their status from hardware/software vendor to telecommunications provider, with all the related problems and costs; they don't need any of that, so the best option is to shield themselves from any user-to-user communication.


I don't think so but I'm not a security expert either so I might have this wrong.

If you send a group message, Apple provides your messaging client with all of the recipients' public keys that are used to encrypt the symmetric key that actually protects the message. They could slip their key into that list and I don't think you would be able to easily tell if they did that.

If you send a message to a single person, then that's just a group of one.

The interesting question to me is whether Apple could be compelled to write code to do this if they haven't already done so (and I don't think they have). I wouldn't think they could be forced to, but like Microsoft did with Skype, they might do it anyway.
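As a toy structural model of the concern (NOT real cryptography, and not Apple's actual protocol; the XOR "wrap" just stands in for per-recipient public-key encryption):

```python
import hashlib
import secrets

def wrap(msg_key, recipient_key):
    # stand-in for encrypting the message key to one recipient
    pad = hashlib.sha256(recipient_key).digest()
    return bytes(a ^ b for a, b in zip(msg_key, pad))

unwrap = wrap  # XOR is its own inverse in this toy model

def send(directory_keys):
    # the sender wraps the message key once per key the directory returns
    msg_key = secrets.token_bytes(32)
    return msg_key, [wrap(msg_key, k) for k in directory_keys]

device_keys = [b"alice-phone", b"alice-mac"]
injected = b"extra-directory-key"  # invisible to the sender
msg_key, wrapped = send(device_keys + [injected])
print(unwrap(wrapped[-1], injected) == msg_key)  # True
```

The structural point: the sender only sees "a list of keys for this phone number", so one extra entry is indistinguishable from a newly enrolled device.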


Apple can associate their own public key with the user's phone number, and then they will be able to read messages sent to that user.

