Hacker News | hasheddan's comments

Yes, thank you for calling that out. It should read "more important". This has been corrected.


Author here -- thanks for engaging in the discussion! You won't find any pushback from us on using Zephyr -- we are contributors, the firmware example in the post is using it (or Nordic's NCS distribution of it), and we offer free Zephyr training [0] every month :)

[0]: https://training.golioth.io/


Author of this post here -- thanks for sharing your experience! One thing I'll agree with immediately: if you can afford to power down hardware, that is almost always going to be your best option (see a previous post on this topic [0]). The NAT post also calls this out, though I could have gone further to disambiguate "sleeping" and "turning off":

> This doesn’t solve the issue of cloud to device traffic being dropped after NAT timeout (check back for another post on that topic), but for many low power use cases, being able to sleep for an extended period of time is more important than being able to immediately push data to devices.

(edit: there was originally an unfortunate typo here where the paragraph read "less important" rather than "more important")

Depending on the device and the server, powering down the modem does not necessarily mean that a session has to be started from scratch when it is powered on again. In fact, this is one of the benefits of the DTLS Connection ID strategy. A cellular device, for example, could wake up the next time in a completely different location, connect to a new base station, be assigned a fresh IP address, and continue communication with the server without having to perform a full handshake.
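To make the idea concrete, here is a toy sketch (not real DTLS -- class and field names are invented for illustration) of why a Connection ID decouples a session from the client's address: the server keys session state by the CID carried in each record rather than by the (IP, port) tuple, so a device can wake up behind a new address and keep going without a second handshake.

```python
# Toy model of the DTLS Connection ID idea: session state is looked up
# by CID, not by source address, so an address change after sleep does
# not force a new handshake.
class Server:
    def __init__(self):
        self.sessions = {}  # cid -> session state

    def handshake(self, addr):
        cid = f"cid-{len(self.sessions)}"
        self.sessions[cid] = {"handshakes": 1, "records": 0}
        return cid

    def receive(self, addr, cid, payload):
        # addr is ignored for session lookup -- only the CID matters
        session = self.sessions[cid]
        session["records"] += 1
        return session

server = Server()
cid = server.handshake(("10.0.0.1", 5684))       # initial full handshake
server.receive(("10.0.0.1", 5684), cid, b"hi")   # record from original address
s = server.receive(("172.16.4.9", 40001), cid, b"hi again")  # fresh IP after sleep
assert s["handshakes"] == 1                      # no second handshake needed
```

A real implementation negotiates the CID during the handshake and authenticates records cryptographically; the dictionary lookup above only illustrates the addressing property.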

In reality, there is a spectrum of low power options with modems. We have written about many of them, including a post [1] that followed this one and describes using extended discontinuous reception (eDRX) [2] with DTLS Connection IDs and analyzing power consumption.

[0]: https://blog.golioth.io/power-optimization-recommendations/

[1]: https://blog.golioth.io/turn-off-subsystems-remotely-to-redu...

[2]: https://www.everythingrf.com/community/what-is-edrx


Author here. I’ve got a few more posts on VPR coming in the next couple of weeks. If you have any requests for deep dives into specific aspects of the architecture, feel free to drop them here!


Thank you so much for asking, I have oh so many requests...

Personally, I'm mostly interested in the ARM vs RISC-V compare and contrast.

- I'd be very interested in comparing flash and RAM requirements for programs that are as similar as you can make them at the C level, using whatever toolchain Nordic wants you to use.

- Since you're looking to do deep dives, I think looking into differences in the interrupt architecture and any implications for stack memory requirements and/or latency would be interesting, especially as VPR is a "peripheral processor".

- It would be interesting to get cycle counts for similar programs between ARM and RISC-V. This might not be very comparable though, as the ARM architectures are more complex and thus we expect a lower CPI from them. Anyway, I think CPI numbers would be interesting.

I could go on but I don't want to be greedy. :)


The Raspberry Pi Pico 2 of course also uses the Cortex M33, along with a self-developed (in his spare time!) RISC-V core that has very similar performance, other than not having an FPU.

It's pretty easy to compare the same C code on both CPUs on a Pico 2, where you have equal RAM, equal peripherals etc.


Why did they go with the 64 bit Arm core instead of an RV64 core? (Or an alternative question: why go with the 32 bit RISC-V core instead of an Arm M0?)

Does having mixed architectures cause any issues, for example in developer tools or build systems? (I guess not, since already having 32 vs 64 bit cores means you have effectively a "mixed architecture" even if they were both Arm or RISC-V)

What's the RISC-V core derived from (eg. Rocket Chip? Pico?) or is it their own design?


They haven't gone with a 64-bit ARM core. ARMv8-M isn't 64-bit, unlike ARMv8-R and ARMv8-A (the nomenclature can get confusing). The differences between ARMv7-M (especially with the optional DSP and FPU extensions) and ARMv8-M mainline are fairly minor unless you go bigger with an M55 or M85, which (optionally) add the Helium SIMD extension. At the low end, ARMv8-M baseline adds a few quality-of-life features over ARMv6-M (e.g. the ability to load large immediate values reasonably efficiently without resorting to a constant pool). Also the MPU got cleaned up to make it a little less annoying to configure.


ARMv8A and ARMv8R can both be pure 32 bit as well, incidentally -- e.g. Cortex-A32 and Cortex-R52. v8A added 64 bit, but it didn't take away 32 bit. It's not until v9A that 32 bit at the OS level was removed, and even there it's still allowed to implement 32 bit support for userspace.


Thanks for the clarification. Confusing terminology!


Not 64-bit; it's the 32-bit Cortex-M33.

M33 has, among other things, TrustZone. So there's the feature set, along with developer familiarity and tooling, that makes ARM desirable for an application processor.

Mixed architecture doesn’t really cause any significant problems.

The design is a fully custom in-house design.


> Why did they go with the 64 bit Arm core

ARM Cortex-M33 is a 32-bit core, not 64-bit.


Will open-source developers unable or unwilling to sign an NDA get access to a toolchain to run their own code on the RISC-V co-processors? Is the bus matrix documented somewhere? Does the fast co-processor have access to DMA engines and interrupts?


FYI Nordic said on their YouTube channel that the RISC-V toolchain that already ships with Zephyr's SDK will support the cores. See around 00:56:32.520 [1]

[1] https://www.youtube.com/watch?v=ef87Gym_D5c


Indeed. It is used in this post to compile the Zephyr Hello World example for the PPR.


All of this reeks of a complexity crisis to me. You need to know so much and do so much work -- just in order to do the work you want to do.

Explain why I’m wrong, please.


You are wrong.

When more general-purpose hardware (i.e. CPU cores) is added to chips like this, it is to replace the need for single-purpose devices. True nightmarish complexity comes from enormous numbers of highly specific single-purpose devices which all have their own particular oddities.

There was a chip a while back which took this to a crazy extreme but threw out the whole universe in the process https://www.greenarraychips.com/


Not wrong, especially for microcontrollers where micro/nanosecond determinism may be important - software running on general purpose cores is not suitable for that. They can also be orders of magnitude more energy efficient than running a full core just to twiddle some pins.

I’ve got a project that uses 4 hardware serial modules, timers, ADC, event system etc all dedicated function. Sure, they have their quirks but once you’ve learnt them you can reuse a lot of the drivers across multiple products, especially for a given vendor.

Of course there is some cost, but it’s finding the balance for your product that is important.


> They can also be orders of magnitude more energy efficient than running a full core just to twiddle some pins.

This used to be true, but as fabrication shrinks, first you move to quasi-FSMs (like the PIO blocks) and eventually to mini processors, since those are smaller than the dedicated units of the previous generation. When you get the design a bit wrong you end up with the ESP32, where the lack of general computation in peripherals radically bumps memory requirements and thus power usage.

This trend also occurs in GPUs where functionality eventually gets merged into more uniform blocks to make room for newly conceived specialist units that have become viable.


No, still true - you’re never going to beat the determinism, size, and power of a few flops and some logic to drive a common interface directly compared to a full core with architectural state and memory. E.g., just to enter an interrupt is 10-15 odd cycles, a memory access or two to set a pin, and then 10-15 cycles again to restore and exit.
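A quick back-of-envelope check of the figures above: taking the worst-case ends of those estimates, interrupt entry plus a couple of memory accesses plus exit comes to roughly 32 cycles. The 64 MHz clock is an assumption for illustration (it is a common speed for Cortex-M33 parts), not a figure from the thread.

```python
# Rough interrupt-response cost from the estimates above:
# ~10-15 cycles entry, a memory access or two to set the pin,
# and ~10-15 cycles to restore and exit.
clk_hz = 64_000_000                       # assumed core clock
entry, pin_access, exit_ = 15, 2, 15      # worst-case cycle estimates
total_cycles = entry + pin_access + exit_
latency_ns = total_cycles / clk_hz * 1e9
print(f"{total_cycles} cycles ~= {latency_ns:.0f} ns at {clk_hz // 1_000_000} MHz")
# -> 32 cycles ~= 500 ns at 64 MHz
```

Half a microsecond just to reach the pin is exactly the kind of overhead a few flops of dedicated logic avoids entirely.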

Additionally, micros have to be much more robust electrically than a cutting-edge (or even 14 nm) CPU/GPU, and available for extended (decade-long) timespans, so the economics driving the shrink are different.

Small, fast cores have eaten the lunch of e.g. large dedicated DSP blocks for sure, but those are niche cases where the volume is low, so eventually the hardware cost and the cost to develop on weird architectures exceed the cost of running a general purpose core.


> No, still true - you’re never going to beat the determinism, size, and power of a few flops and some logic to drive a common interface directly compared to a full core with architectural state and memory.

But you must know what you intend to do when designing the MCU, and history shows (and some of the questioning here also shows) that this isn’t the case. As you point out expected lifespans are long, so what is a designer to do?

The ESP32 case is interesting because it comes so close, to the point I believe the RMT peripheral probably partly inspired the PIO, thanks to how widely it has been used for other things and how it breaks.

The key weakness of the RMT is that it expects the data structures used to control it to be prepared in memory already, almost certainly by the CPU. This means that altering the data being sent out requires the main app processor, the DMA, and the peripheral all to be involved, and we are hammering the memory bus while doing this.

A similar thing occurs with almost any non trivial SPI usage where a lot of people end up building “big” (relatively) buffers in memory in advance.

Both of those situations are very common and bad. Assuming the tiny cores can have their own program memory they will be no less deterministic than any other sort of peripheral while radically freeing up the central part of the system.

One of the main things I have learned over the years is people wildly overstate the cost of computation and understate the cost of moving data around. If you can reduce the data a lot at the cost of a bit more computation that is a big win.
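As a purely illustrative example of that trade (run-length encoding is my stand-in here, not something from the thread): a tiny bit of per-byte computation at the peripheral can collapse a mostly-idle trace before it ever touches the memory bus or radio.

```python
# Trading a little computation for much less data movement:
# run-length encode a sparse trace before shipping it anywhere.
def rle(data: bytes) -> list[tuple[int, int]]:
    out = []
    for b in data:
        if out and out[-1][0] == b:
            out[-1] = (b, out[-1][1] + 1)   # extend the current run
        else:
            out.append((b, 1))              # start a new run
    return out

trace = bytes([0] * 1000 + [7] * 4 + [0] * 1000)  # mostly-idle pin trace
encoded = rle(trace)
print(len(trace), "bytes ->", len(encoded), "runs")
# -> 2004 bytes -> 3 runs
```

A few thousand cycles of trivial comparisons in exchange for moving three (value, count) pairs instead of two kilobytes is a very cheap win.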


> But you must know what you intend to do when designing the MCU, and history shows (and some of the questioning here also shows) that this isn’t the case. As you point out expected lifespans are long, so what is a designer to do?

Designers do know that UARTs, SPIs, I2C, timers etc will be around essentially forever. Anything new has to be so much faster/better, the competition being the status quo and its long tail, that you would lay down a dedicated block anyway.

I think we'll disagree, but I'm not convinced by many of the cases given here (usually DVI on an RP2040...) as you would just buy a slightly higher spec and better optimised system that has the interface already built in. Personal opinion: great fun to play with and definitely good to have a couple to handle niche interfaces (e.g. OneWire), but not for majority of use cases.

> A similar thing occurs with almost any non trivial SPI usage where a lot of people end up building “big” (relatively) buffers in memory in advance.

This is neither here nor there for a "PIO" or a fixed function -- there has to be state and data somewhere. I would rather allocate just what is needed for e.g. a UART (on my weapon of choice, that amounts to a heady 40 bits local to the peripheral written once to configure it, overloaded with SPI and I2C functionality) and not trouble the memory bus other than for data (well said on data movement, burns a lot and it's harder to capture).

> Assuming the tiny cores can have their own program memory they will be no less deterministic than any other sort of peripheral while radically freeing up the central part of the system.

Agreed, but only if it's dedicated to a single function, of course; otherwise you have access contention. And, of course, we already have radically freed up the central part of the system :P

Regardless, enjoyed the conversation, thank you!


> Regardless, enjoyed the conversation, thank you!

Likewise, very much so!


If you have a programmable state machine that's waiting for a pin transition, it can easily do the thing it's waiting to do in the clock cycle after that transition. It doesn't have to enter an interrupt handler. That's how the GA144 and the RP2350 do their I/O. Padauk chips have a second hardware thread and deterministically context switch every cycle, so the response latency is still less than 10–15 cycles, like 1–2. I think old ARM FIQ state also effectively works this way, switching register banks on interrupt so no time is needed to save registers on interrupt entry, and I think the original Z80 (RIP this year) also has this feature. Some RISC-V cores (CH32V003?) also have it.

An alternate register bank for the main CPU is bigger than a PWM timer peripheral or an SPI peripheral, sure, but you can program it to do things you didn't think of before tapeout.


Making I/O properly programmable actually reduces complexity overall, because you can put more of the customizability on the other side of the interface, making things much simpler. I2C, for example, is a terrible interface in many ways, but one of the biggest is that it creates a very complex and low-latency interface between the hardware and the software, and it's often easier to bitbang than to use dedicated peripherals, especially the buggier ones. Running it on a small dedicated core means you can deal with it much more sensibly than trying to wrangle a hard peripheral which can't make enough assumptions about your use-case to give a good interface.
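To show how little code the bitbang path really is, here is a minimal sketch of clocking one I2C byte out, the kind of loop a small dedicated core could run. The `set_scl`/`set_sda` helpers are stand-ins for real GPIO writes (here they just record the waveform so it can be inspected), and timing delays are omitted for brevity.

```python
# Bit-banged I2C data phase: shift one byte out MSB-first, changing SDA
# while SCL is low, then release SDA on the 9th clock for the slave's ACK.
waveform = []      # SDA level sampled at each SCL rising edge
scl_level = 1
sda_level = 1

def set_scl(v):
    global scl_level
    if v and not scl_level:          # sample SDA on SCL rising edge
        waveform.append(sda_level)
    scl_level = v

def set_sda(v):
    global sda_level
    sda_level = v

def i2c_write_byte(byte):
    for i in range(7, -1, -1):       # MSB first
        set_scl(0)
        set_sda((byte >> i) & 1)     # SDA may only change while SCL is low
        set_scl(1)
    set_scl(0)
    set_sda(1)                       # release SDA; slave would pull low to ACK
    set_scl(1)                       # 9th clock

i2c_write_byte(0xA5)                 # 0b10100101
bits = waveform[:8]
print(bits)                          # -> [1, 0, 1, 0, 0, 1, 0, 1]
```

On a dedicated core this loop is perfectly deterministic; the same code on the application core gets preempted by every interrupt, which is exactly the wrangling problem described above.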


The article goes into more detail than it strictly needs to because the purpose is educational. However, a lot of what it's presenting is simplified interfaces and relevant details rather than the true complexity of the whole.

Modern hardware is just fundamentally complex, especially if you want to make full use of the particular features of each platform.


awesome! keep up the great work!


There is a page[0] in the Bluesky docs on this, though it effectively boils down to “fetch each user repository”.

[0] https://docs.bsky.app/docs/advanced-guides/backfill


Good suggestion. I think that my pre-race fueling strategy was better than previous races, but certainly lots of room for further improvement!


I think your insecurity about your race plan stems from a lack of confidence in your ability to run the full 26.2 miles at a given sub-3-hour pace.

Looking at your Strava workouts, your long runs were much slower than marathon pace, and very few of them were over 16 miles.

If you want to run the full 26.2-mile marathon at, say, 6:45/mile pace, then you should be able to run 20 or 22 miles averaging 6:45/mile.

You can start at a shorter distance, say 12-16mi at a slower pace, and work up to 20mi+ at or close to race pace.

You can play around with a race plan using these race simulation workouts, e.g. all miles at race pace, or the first half at race pace + 5 sec/mile and the last half at race pace - 5 sec/mile, etc.
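For reference, the arithmetic behind that target pace is a one-liner -- 6:45/mile over the full distance lands comfortably under three hours:

```python
# Finish time at 6:45/mile over the full marathon distance.
pace_s = 6 * 60 + 45            # 6:45/mile in seconds per mile
total_s = round(pace_s * 26.2)  # marathon distance in miles
h, rem = divmod(total_s, 3600)
m, s = divmod(rem, 60)
print(f"{h}:{m:02d}:{s:02d}")   # -> 2:56:51
```

That ~3-minute buffer under 3:00:00 is why being able to hold the pace for 20+ miles in training matters more than any race-day tactic.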

Also look into marathon training plans for ideas on how to progress and schedule workouts until race day.


I appreciate the kind words! Best of luck on your leap of faith -- it takes courage to attempt it in the first place!


Luke (author of Hazard3) provided some context regarding including the Hazard3 cores alongside the M33's:

> I can't compare the sizes of the two cores. The final die size would likely have been exactly the same with the Hazard3 removed, as std cell logic is compressible, and there is some rounding on the die dimensions due to constraints on the pad ring design. I can say that we taped out at a very high std cell utilisation and we might have saved a few grey hairs during final layout and STA by deleting the RISC-V cores.

https://x.com/wren6991/status/1821582405188350417


I actually recently interviewed one of the folks (Philip Freidin) who worked on the 29k.

Clip of Philip describing AMD 29k: https://youtu.be/I5cYxLg7Vfc

Full Episode: https://microarch.club/episodes/1/

Previous HN Post: https://news.ycombinator.com/item?id=39452960


I'll grab that one too, thanks!

