I represent Sweden in ISO WG14, and I voted for the inclusion of #embed into C23. It's a good feature. But it's not a necessary feature, and I think JeanHeyd is wrong in his criticism of the pace of WG14's work. I have found everyone in WG14 to be very hardworking and serious about their work.
C's main strengths are its portability and simplicity. Therefore we should be very conservative and not add anything quickly. There are plenty of languages to choose from if you want a "modern" language with lots of conveniences. If you want a truly portable language there is really only C. And when I say truly, I mean for platforms without file systems or operating systems, where bytes aren't 8 bits, that don't use ASCII or Unicode, where NULL isn't at address 0, and so on.
We are the stewards of this, and the work we put in, while large, is tiny compared to the impact we have. Any change we make needs to be addressed by every compiler maintainer. There are millions of lines of code that depend on every part of the standard. A 1% performance loss is millions of tons of CO2 released, and billions in added hardware and energy costs.
In this privileged position, we have to be very mindful of the concerns of our users, and take the time to look at every corner case in detail before adding any new features. If we add something, then people will depend on its behavior, no matter how bad, and we will therefore have great difficulty fixing it in the future without breaking our users' work, so we have to get it right the first time.
> for platforms without file systems or operating systems, where bytes aren't 8 bits, that don't use ASCII or Unicode, where NULL isn't at address 0, and so on.
This seems totally misconceived to me as a basis for standardizing a language in 2022. You are optimizing for the few at the expense of the many.
I get that these strange architectures need a language. Why does it have to be C or C++? They can use a nonstandardized variant of C, but why hobble the language that is 99% used on normal hardware with misfeatures justified by truly obscure platforms?
It doesn't have to be C, but as of today there is no other option. No one is coming up with new languages with these kinds of features so C it is. People should, but language designers today are more interested in memory safety and clever syntax, than portability.
I would like to caution you against thinking that these weird platforms are old machines from the 60s that only run in museums. For instance, many DSPs have 32-bit bytes (the smallest memory unit that can be individually addressed), so if you have a pair of new fancy noise-cancelling headphones, it's not unlikely you are wearing a platform like that on your head every day.
Unusual platforms like DSPs usually have specific (usually proprietary) toolchains. Why can't those platforms implement extensions to support 32-bit bytes? Why must everyone else support them? In practice ~no C code is portable to machines with 32-bit bytes. That's okay! You don't choose a DSP to run general purpose code. You choose it to run DSP code, usually written for a specific purpose, often in assembly.
"Weird" platforms often do have their own tool-chains but they do have the ability to leverage LLVM, MISRA, and an array of common tools and analyses that exists for C. One of the reason we got new platforms like RISC-V is that today its possible to use existing OSS software to build a platform with a working OS and Development environment, that common basic libraries can be built for is that all this software is written in C and can be targeted towards a new platform.
Ok here’s a concrete one, ARM is experimenting with a brand new CHERI implementation that throws out a lot of “obvious” things that are supposed to be true for pointers and integers. The only way this has been able to work is that the standard is flexible enough to let C(++) work here. Rust is getting breaking changes to support the platform.
Unsurprisingly Rust can target Morello already but it comes at a performance penalty. Whereas C has myriad integer types for special purposes, Rust only has a pair - a signed and unsigned integer of the same size as pointers. Since on CHERI a pointer is a different size from an address, Rust pays the price of using a 128-bit integer (to store pointers) as its array index.
The potential change would make these types address-sized, so you can't fit a whole pointer in them on Morello where the pointer isn't just an address. On other platforms this change makes no practical difference.
The point is that new exploration of the design space only works when there’s a familiar environment to build on. The old days of each architecture being its own hermetic environment are gone.
Because C already does this, and has from the beginning. C was designed to be portable in an era where there were significant differences in fundamental CPU design decisions between platforms. C is widely used to write software for all kinds of weird platforms. Changing that would be far more work than just making a new language.
They also tend to be non-standard for a variety of reasons anyway! C bends backwards to support odd architectures but is often also insufficient (or at least the vendors cannot justify the effort to achieve full compliance and their customers don't care significantly anyway).
Perhaps Carbon is the first in a series of new low level languages that free us from the impossible tensions of C/C++ having to be all things to all (low level) programmers.
I would love a new language for implementing high level languages. I've worked on several of these projects and we use mostly unstandardized dialects of C++ and it's really not fit for purpose.
> AFAICT, Zig is halfway step between C and C++ for desktop and mobile developers.
What does this even mean? Zig very much has embedded use as a target as well, with freestanding use being a first-class citizen. The majority of the standard library works without an OS (and if you wish, you can provide your own OS-like interfaces quite easily, and use the OS-dependent parts in a freestanding environment). I've written a UEFI bootloader in Zig, and right now I'm using it on an RP2040 as well, cross compiling from an M1 Mac without the need to install any additional cross compilers.
I'd argue that it might eventually be even better for embedded than C/C++, as unlike them, allocation is much more tightly controlled, with a convention that only functions (or data structures) that take an allocator as an argument will allocate memory on the heap. Future versions may even restrict recursion in certain contexts to guarantee max stack depths.
Man, it hasn’t even fucking reached 1.0 yet. The fact that the default install can target multiple architectures (including C dependencies) is nothing short of impressive.
Does GCC target any of those in its standard build? Oh. Wait. It doesn’t. It builds for only one architecture at a time and if you want anything else you’ll need to recompile the compiler.
LLVM is even intended to be optional in future, and the Zig team is well on their way to making that a reality with the work on stage2.
What backend would you suggest for embedded support?
Zig is 6 years old, and years-old issues exist in their task tracker requesting support for some of these targets.
You _cannot_ simply recompile LLVM to support the targets I listed without using unsupported, risky patches from community sources; if any such patches exist. The multiple architecture support in one binary is a pointless feature when installing multiple GCC builds only takes a tiny, insignificant fraction of my development machine's disk space. Never in my life have I thought "Wow, I'd really speed up development if GCC was one enormous binary holding all possible targets."
Zig's best path forward is to support targeting C as a first-tier target, which they seem to be interested in doing.
We can! Many of us still use C89. (C99 has problems, like variable-length arrays.)
The reality, however, is that you can't escape newer versions entirely. Not all code you interact with was written in the subset you want, so when your favorite OS or library starts using header files with newer features, you need to run that version of the language too.
Another less appreciated detail is that a lot of WG14 work is not about adding new features but about clarifying how existing features are meant to work. When the text is clarified, the clarification gets back-ported to all previous versions of C in major compilers. An example of this is "provenance", a concept that has implicitly been part of the standard since the first ISO standard but is only now being formalized. This means that if you want to adhere to the C89 standard, you will find a lot of clarifications about how things should work in the C23 standard.
They are one of the few things added that require the target platform, rather than the compiler, to do something different. (And the other big things like that, like atomics, are similarly optional.)
If it were to focus on stability, it would probably be LLVM IR. That said, there's plenty of C++ being written for these applications. And Ada.
> so if you have a pair of new fancy noise-cancelling headphones, it's not unlikely you are wearing a platform like that on your head every day.
Chip shortage aside, the likelihood of these devices using obscure hardware like discrete DSPs is going down as cheaper low power architectures are becoming commoditized.
And pay the price: there are tons of places that are stuck with some ossified "LLVM 3.2" toolchain or similar because they built their stuff on an unstable IR.
Yea and again, I said "if". We live on the else branch of that if statement.
My point is that a lot of this sentiment is treating C as the portable backend of a compiler because there is no portable front end. It holds us back in a lot of ways from iterating on systems languages in ways that are interesting and valuable.
Ideally there would be an IR with a stable textual and binary format that can be compiled into machine code for various ISAs and supports the extensions silicon manufacturers need for the exotic bits. I've used exotic toolchains for weird ISAs where the exotic bits are sugar added to a GCC front end, and it always feels wrong and limiting.
> This seems totally misconceived to me as a basis for standardizing a language in 2022. You are optimizing for the few at the expense of the many.
Sure, but it's the same line of reasoning that made C relevant in the first place, and keeps it relevant today - some library your dad wrote for a PDP-whatever is still usable today on your laptop running Windows 10.
Because it's antiquated, it's also extremely easy to support, and to port to new and/or exotic platforms.
The library my dad wrote (lol) for the PDP-11 is probably full of undefined behaviour and won't work now that optimizers are using any gap in the standard to miscompile code.
> using any gap in the standard to miscompile code
For code to be miscompiled, there has to be a definition of what correctly compiling it would mean, and if there were, it would not be undefined behavior.
We are talking about code written before the standard, so every bit of UB in the standard is in play here.
Eg the fact that overflowing a signed int can cause the compiler to run amok would certainly be a surprise to the person who wrote code for the PDP-11.
-Wall -Werror is mostly designed to catch dangerous but totally well-defined idioms, not UB. It doesn't warn on every signed arithmetic operation or unchecked array access, for example.
See "useful" — it may not be not quite as strong as you're thinking. It may be possible to write a minimal C program without UB, but I'm thinking of larger programs, more than a few hundred lines. Common UB includes: array access out of bounds, dereferencing a null pointer, use after free, use of uninitialized variables. -Wall -Werror can catch some instances of some types of UB, and runtime libraries like UBSan can catch more. But they're not exhaustive.
When I first learned C, it was K&R, pre-ANSI with old style function parameters. It is trivial to convert to ANSI C. The truth is C has barely changed in decades.
You should take a look at plan9port - a bunch of userspace tools from Plan 9, carefully ported to Linux with few changes. Maybe that's not PDP-whatever, but it is Sun-whatever. Either way, it's code that was written decades ago for a dead architecture.
C is pretty much the only language in common use for programming microcontrollers. Microcontrollers seldom have filesystems. To break the language on systems without filesystems or terminals would be to break the software of pretty much every electronics manufacturer out there.
Of course, I can't see a reason why #embed wouldn't be useful for microcontrollers. In fact, I imagine they're a key target market for a feature like that; resource managers are complex, and tools like bin2c have always felt like a terrible hack.
I was solely replying to the commenter who said that all reasonable modern systems have filesystems so I put one in for the embedded software developers.
Yeah, if the number of platforms using an obscure corner of the spec were exactly zero, we wouldn't be talking about whether the niche size is worth the hassle.
My point was that this niche doesn't cover the entirety of the embedded / IoT space.
This feature is precisely most useful when you don't have a filesystem to read data from at runtime. Embedding the data in the binary you flash is so much simpler with this.
I'll flip that around: if you want to serve on a language standards committee, there are a lot of other languages to choose from. Why be on the C standards committee with the express purpose of blocking progress?
I would say that one should be pretty cautious when baking assumptions about such a fleeting thing as hardware into such a lasting thing as a language.
C itself carries a lot of assumptions about computer architecture from the PDP-7 / PDP-11 era, and this does hold current hardware back a bit: see how well the cool, nonstandard, and fast Cell CPU fared.
A language standard should assume as little about the hardware as possible, while also, ideally, allowing one to describe properties of the hardware somehow. C tries hard, but the problem is not easy at all.
All memory is uniform, for instance. There is one scalar data-processing unit that finishes a previous operation and then issues the next: there is no way to naturally describe SIMD, for instance, and no way to speak about the asynchronous things that happen on a Cell CPU all the time, as far as I can judge. (I never programmed it, but I remember that people who did said they had to use assembly extensively.)
OTOH you can write stuff like `*dst++ = *src++`, and it would neatly compile into something like `movb (R1)+, (R2)+`, a single opcode on a PDP-11.
It’s worse: almost all of them already use a nonstandard variant of C. The committee is bending over backwards to accommodate them, but they literally _do not care what the standard says_, so this doesn’t even benefit them. Most will keep using a busted C89 toolchain with a haphazard mix of extensions no matter what the standard does.
This reasoning has always rung mostly hollow for compiler features (#embed, typeof) rather than true language features (VLAs, closures).
Modern toolchains must exist for marginal systems. It's understandable to want to write code for a machine from 1975, or a bespoke MCU, on a modern Thinkpad. It is not necessary to support a modern compiler running on the machine from 1975 / bespoke MCU. You might as well argue against readable diagnostic messages because some system out there might not be able to print them!
I could also see this, though perhaps it's a step too far for C, applying to Unicode encoding of source files.
The 1970s mainframe this program will run on has no idea that Unicode exists. Fine. But, the compiler I'm using, which must have been written in the future after this was standardised, definitely does know that Unicode exists. So let's just agree that the program's source code is always UTF-8 and have done with it.
Jason Turner has a talk where the big reveal is, the reason the slides were all retro-looking was that they were rendered in real time on a Commodore 64. The program to do that was written in modern C++ and obviously can't be compiled on a Commodore 64 but it doesn't need to be, the C64 just needs to run the program.
This seems a step too far for me. Compatibility with existing source files which may not be trivial to migrate does also matter. (Well, except for `auto`, C23 was right to fuck with that.) At the very least you'll need flags that mean "do whatever you did before".
Sure, I don't seriously expect C to embrace that, even though I think it'd be worth the effort I'm sure plenty of their users don't.
For auto I think the argument is that if you poke around in real software the storage specifier was basically never used because it's redundant. That's the rationale WG21 had to abolish its earlier meaning in C++ before adding type deducing auto.
As I read it, N2368 (which I think is what they took?) gives C something more similar to the type inference found in many languages today (which gives you a diagnostic if it can't infer a unique type from available information) whereas C++ got deduction which will choose a type when ambiguous, increasing the chance that a maintenance programmer misunderstands the type of the auto variable.
However it got inference from return, which I think is a misfeature (although I think I can see why they took it, to make generics nicer). With inference from return, to figure out what foo(bar)'s type is, I need to read the implementation of foo because I have to find out what the return statements look like. It's more common today to decide we should know from the function's signature.
This is somewhat mitigated by the fact that N2368 says auto won't work in extern context, so we can't just blithely say "This object file totally has a function which returns something and you should figure out what type that is" because that's clearly nonsense. You will have the source code with the return statements in it.
Ah, great. I don't write very much C any more, but the auto described in N3007 (well, the skim of N3007 I just did) feels very much like what I'd want from this feature in C and perhaps more importantly, what I'd assume auto does if I see it in a snippet of somebody else's code I'm trying to understand.
Or how to make everyone's life worse for the few weirdos that don't use an LSP/IDE. The type of an auto function return can be, and is, automatically shown, e.g. https://resources.jetbrains.com/help/img/rider/2022.1/inlay_...
With this kind of non-consideration of developer comfort, it is clear C is an obsolete language.
No, I'm mainly talking about targeting. My point is not so much about #embed, but rather that almost anything you think you know about how computers work isn't necessarily true, because C targets such a wide group of platforms. Almost always when someone raises a question along the lines of "No platform has ever done that, right?", someone knows of a platform that has done that, and it turns out to have very good reasons for doing so.
For this reason, everything is much more complicated than you first think. For me, joining WG14 has been an amazing opportunity to learn the depths of the language. C is not big, but it is incredibly deep. The answer to "Why does C not just do X?" is almost always far more complicated and thought through than one thinks.
Everyone in WG14 who has been around for a while knows this, and therefore assumes that even the simplest addition will cause problems, even if they can't come up with a reason why.
Yeah, but then I have to side with the author: how could a compile-time-only feature which doesn't even introduce new language semantics possibly be affected by the multitude of build targets?
Unless "it's more complicated than you think" is the catchall answer to any and all proposals for new language features. In which case, how to make progress at all?
Also, I find the point about the language being "truly portable" a bit ironic, considering the whole rationale of #embed was that the use case of "embed large chunks of binary data in the executable" was completely non-portable and required adding significant complexity to the build scripts if you were targeting multiple platforms.
It's easy to make a language portable on paper if you simply declare the non-portable parts to not be your responsibility.
> Everyone in WG14 who has been around for a while knows this, and therefore assumes that even the simplest addition will cause problems, even if they can't come up with a reason why.
Look at #embed as an example. Look how complex it is: dealing with empty files, different ways of opening files, files without lengths, null termination... the list goes on. This is typical of a proposal for C: it starts out simple ("why can't I just embed a file into my code?") and then it gets complicated, because the world is complicated.
I worry a lot about people loading in text files and forgetting to add null termination to embeds. I would not be surprised if in a few years that produces a big headline on Hacker News, about how that shot someone in the foot and how C isn't to be trusted. The details matter.
Null termination is not to be added by #embed. #embed produces a const-sized buffer of unsigned bytes, not a string. Files are not strings; files can contain \0.
And I still don't get why #embed is so much better than xxd-generated buffers. It's more convenient, sure, but 10x faster?
#embed is faster because it skips the parsing step. xxd generates source-code tokens, which the parser must then decode back into bytes; #embed just reads the bytes directly.
> I worry a lot about people loading in text files and forgetting to add null termination to embeds. I would not be surprised if in a few years that produces a big headline on Hacker News, about how that shot someone in the foot and how C isn't to be trusted. The details matter.
The compiler should insert the null terminator if it's not in the embedded file.
This is another issue here. If loads of compilers start doing this, then programs start relying on it, and then it becomes a de-facto undocumented feature. That means if you move compilers/platforms you get new issues. A lot of what the C standard does is mopping up these kinds of issues.
Then require in the standard that compilers implement it. I think it's really backwards to ignore the toolchain and its ability to prevent bugs from entering software.
It's stuff like this that leaves us writing C that relies on implementation-defined behavior. Under-specification that leaves easy holes will be filled by the compiler, and we will come to rely on them. Just like type punning.
This is the problem. Things get complicated fast. If we mandate null termination, then it's impossible to have multiple embeds in a row to concatenate files, or we somehow need rules for when to add null termination and when not. These rules in turn are not going to be read by all users, so some people will just assume that #embed always adds null termination when it doesn't, and then we are back to square one. The more we add, the more corner cases there are.
I don't see how that would prevent sequential embeds, unless you have defined the semantics of #embed to be a textual include; if that's the case, then the mistake is the semantics, not the constraints on compilers or embedded files.
edit: why is it desirable to concatenate files with #embed? Does that not seem out of scope if not contrived?
This is literally the premise of this comment chain. Can you see how the standard can get dragged around by various competing needs? People want to embed strings, with null termination, some just want to embed binary data. Neither is particularly “wrong”. And in this thread everyone is presenting “obvious” solutions that are one or the other.
No, at least in this case I think the feature chooses the correct path on this. If you want to embed a string that's null terminated, then just save the string into a file as a null-terminated string, or copy it from the array to a proper string type. It looks like you may even be able to use this with a struct, which might let you add a null terminator after the array.
Though yes I agree lots of features get bogged down trying to handle everything. But in this case not adding it creates even more complexity. You can store strings natively in C. You can’t include binary blobs in C in a platform independent way so you have hacks that explode compile times for everyone using that software.
The ability to add vendor specific attributes also allows for those use cases to evolve naturally while still solving the core problem of embedding binary data.
I, FWIW, agree with the implementation. I'm just saying that this thread has someone saying "the compiler should add null termination" in it so the choice to make is not obvious.
I don't think adding a null terminator is useful for binary files which are not null-terminated strings, and may even have embedded 0 bytes in the middle.
It doesn't have to, you just add a zero byte at the end of the embedded byte sequence in the object file. It's up to the programmer to make the choice how to interpret that.
That said, for binary embeds you almost always need the length embedded as well, which has been the case for every tool I've used to embed files in object code. You usually get something like:
> It doesn't have to, you just add a zero byte at the end of the embedded byte sequence in the object file. It's up to the programmer to make the choice how to interpret that.
Like, are you proposing that the ideal semantics for `#embed "foo"` if foo contained 0x10 0x20 0x30 0x40 to be to expand to `16, 32, 48, 64, 0`? That seems more annoying than the opposite, given that:
> That said for binary embeds you almost always need the length embedded as well, which has been the case for every tool I've used to embed files in object code. You usually get something like
TFA demonstrated it by relying on sizeof() of arrays doing the right thing:
static_assert((sizeof(sound_signature) / sizeof(*sound_signature)) >= 4,
"There should be at least 4 elements in this array.");
You'd need to change this to subtract one every time, which sounds more annoying when embedding binary resources than adding the zero to strings would be.
I was on X3J11, the ANSI committee that created the original C standard and my experience was similar. It was a great opportunity to learn C at depth and get an understanding of many of the subtle details. We rejected a great many suggestions because our mandate was to standardize existing practice, address some problem areas, and not get too creative. (We occasionally did get too creative. The less said about noalias the better.)
Maybe you can answer a question I have: what companies are still supporting C compilers for sign-magnitude and 1s-complement machines today? I've been programming for almost 40 years now, and I have never come across any machine that is sign-magnitude or 1s-complement (I have encountered real analog computers---a decent sized one too---about 9' (3m) long, 6' (2m) high, and about 3' (1m) deep, requiring hundreds of patch cables to program).
"""Codify existing practice to address evident deficiencies. Only those concepts that have some prior art should be accepted. (Prior art may come from implementations of languages other than C.) Unless some proposed new feature addresses an evident deficiency that is actually felt by more than a few C programmers, no new inventions should be entertained."""
well, basic string support would be fine, wouldn't it? the C standard still having no proper string library for decades didn't harm its popularity, but still.
You cannot find non-normalized substrings (strings are Unicode nowadays), and UTF-8 is unsupported. coreutils and almost all tools lack proper string (= Unicode) support.
Isn't there literally a single GPU for which that is true?
Asking because every time this surfaces, someone inevitably asks for an example, and the only example I've seen over the years was of one specific (Nvidia?) GPU that uses a NULL of 0xFFFFFFFA (or something similar).
That is, do you know how common it is for NULL to not be 0?
There are a lot of platforms where you might want to do this. If you're programming bare-metal, "address 0" might be a physical address that you expect stuff to exist at, so it might be relevant to use the bit pattern 0xffffffff instead. If you're targeting a blockchain or WASM VM, you may also not have memory protection to work with, just a linear array of memory. And some machines don't even have bit patterns for pointers, like, say, a Lisp machine.
For an older CPU example, x86 (in 16-bit mode) maps the interrupt vector table at physical address 0. So to tell the CPU what handler to use for the 0th interrupt, you have to do:
*(u16 *)0 = offset;  // dereference of null!
*(u16 *)2 = segment;
Granted, most 16-bit OSes were written in assembly, not C, but if you were to write one in C, you’d have this problem.
IIRC, the M68k (which was a popular C target with official Linux support) did the same thing.
For a more recent example, AVR (popularized by Arduino) maps its registers into RAM starting at address 0. So, if you wanted to write to r0, you could write to NULL instead. Although one would be using assembly for this, not C.
People who call C simple have some weird definition of simple. How many C programs contain UB or are pure UB? Probably over 95%. The language is not simple at all.
A straight razor is simple and that's why it's the easiest to cut yourself with. An electric razor is much safer precisely because much engineering went into its creation.
It's also worth remembering that a lot of higher-level languages have runtimes/VMs implemented in C. Web applications rely heavily on databases, JavaScript VMs, network stacks, system calls, and operating system features, all of which are implemented in C.
If you are a software developer and want to do something about climate change, consider becoming a compiler engineer. If you manage to get a couple of tenths of a percent performance increase in one of the big compilers during your career, you will have materially impacted global warming. Compiler engineers are the unsung heroes of software engineering.
> If you manage to get a couple of tenths of a percent performance increase in one of the big compilers during your career, you will have materially impacted global warming.
I've heard this kind of claim a number of times and I think it's more complicated than the crude statistical measurement makes it sound. Personally, I think that most programs are not run frequently enough to matter from an emissions perspective. For programs that are, like ML training programs, users will just train more data if the algorithms are faster so most energy efficiencies will get wiped out by the increased usage.
Even if that theory is wrong, what if there is a language that is 10% better than C for 95% of common C use cases? Wouldn't it be better for compiler engineers to focus on developing that language than micro-optimizing C?
No JavaScript VM is implemented in C. They are all written in a language that's a bit like C++ but has no exceptions and relies on lots of compiler behaviour that is not defined by the C++ standard.
Hm? I can think of two pure C JS engines off the top of my head: Duktape and Elk. I believe Samsung or another vendor also has their own; they’re all somewhat common in the embedded space.
Fair. I'm not familiar with the tiny JS VMs. But really the main point stands: It's not possible to build a decent GC without violating strict aliasing so C and C++ as standardized are not suitable for this task.
This is incredibly pedantic, even for Hacker News. I suspect you do not wish to be responded to like this in general, so I struggle to see why it is appropriate here.
The point is that if you are programming in a C dialect or a C++ dialect you will be told this again and again.
Ask a question on Stackoverflow about code that requires nonstandard flags or uses UB. You will be told your program is not really in C/C++ and that your question makes no sense. You will be lectured on nasal demons for the nth time.
File a bug against a compiler asking the optimizers to back off using UB to subvert the intentions of the programmer and you will be told there's no way to even know the intentions of the programmer given that the standards don't apply.
So we can all move to a nomenclature where C is used loosely to indicate a family of languages, one of which happens to be standardized. I'd be happy to do that.
But let's not bait and switch, using standards pedantry to dismiss feature requests and bug reports, but then turning around and saying C/C++ is suitable for implementing runtimes of other languages when nothing can really be achieved in that space without going beyond the language spec.
And really in 2022 runtimes, JITs, GCs should be the primary use of C/C++. Many other uses (systems software, compilers, desktop apps, phone apps) are not suited for C/C++ due to security, stability, ease of development and newer languages that make more sense for these domains.
Well…the answer is that the response you get will depend. If you are familiar with the standard and how it is implemented in compilers, you will get a sense for which UBs are implicitly blessed by compilers and which you can file bugs against. "I double freed a pointer and expected something reasonable to happen" is never going to get you a serious response. But if you ask something like "I tagged this pointer's bit pattern and untagged it and the pointer I got back isn't valid" you will generally be heard out. FWIW, I would recommend picking a language other than C to write your language runtime in 2022; using C to support your new language is often a good way to invite memory-unsafety bugs into your code.
It’s a valuable distinction! This is a thread about the C standard. It’s misleading to talk about projects that use a specific C compiler on a specific platform but plainly need implementation details past what’s in the standard, or even straight up violating them.
Well, yes and no. I do actually agree with your claim that Linux uses a nonstandard C, because it builds with special flags and without them it would not work. But for code that just happens to have unintentional bugs that result in UB, I definitely consider that C. Code that has a gentleman's agreement with the compiler on certain UBs is straddling the line, but I would still usually call it C, unless someone tried to claim that because GCC accepted something it must be part of the standard. My point really isn't that you're wrong or that this isn't useful in the right context (that's why I called it pedantic), but the difference between "we build our code with special flags and this doesn't work with any other code that is ostensibly in this language because of that" and "this works on major compilers but is not standards compliant" is kind of large, and I don't think you really needed to go where you did. Sort of like if you made a typo in a comment about the English language, I don't go "you must be using a language other than English, because clearly this is not valid"; I say you have a typo. But if you're one of the people who makes everything lowercase and doesn't use punctuation, I might reasonably say "it's mostly English but I'm not sure I can really call it that without qualification".
How would such a platform without file systems handle #include?
Reading further, I don't think this was ever addressed when someone else brought it up. I cannot for the life of me imagine a system where #include works but #embed doesn't. Again, it's fine if some systems implement non-standard subsets of the C standard... but why hobble the actual standard for code compiled on systems that do have a filesystem (which, by the way, is what handles #include) for the sake of systems without filesystems?
> How would such a platform without file systems handle #include?
I don't think it would; you'd cross-compile for it on a platform with a file system. I think the parent poster's point was that C is the only option for some ultra-low-resource platforms and that a conservative approach should be taken to adding new features in general. I don't think they were saying specifically that not having a filesystem is problematic for this particular inclusion.
include is with regards to the source platform, not the target platform
i.e., you (generally) need a filesystem to compile, but you don't need a filesystem to run what you compiled
If I may gripe about C for a bit though. I do truly appreciate C's portability. It's possible to target a very diverse set of architectures and operating systems with a single source. Still, I do wish it would actually embrace each architecture, rather than try to mediate between them. A lot of my gripes with C are due to undefined behaviour which is left as such because of platform differences. I've never seen my program become faster if I remove `-fwrapv -fno-strict-aliasing`, but removing them has resulted in bugs due to compiler optimisations. I really wish that by default "undefined behaviour" would become "platform-specific behaviour", with an officially blessed way to tell the compiler it can perform further optimisations based on data guarantees.
C occupies a very pleasant niche where it lets you write software for the actual hardware, rather than for a VM, while still being high level enough to allow for expressiveness in algorithms and program organisation. I just wish by default every syntactically valid program would also be a well-defined program, because the alternative we have now makes it really hard to reason about and prove program correctness (i.e. that it does what you think it does).
I'm curious what you think of UB from a standard perspective --- were things left undefined and not just implementation-defined because there was simply so much diversity in existing and possibly future implementations that specifying any requirements would be unnecessarily constraining? I can hardly believe that it was done to encourage compiler writers to do crazy nonsensical things without regard for behaving "in a documented manner characteristic of the environment" which seems like the original intent, yet that's what seems to have actually happened.
>I'm curious what you think of UB from a standard perspective
I think a lot about that! I'm a member of the UB study group and the lead author of a Technical Report we hope to release on UB.
In short, "Undefined behavior" is poorly named. It should have been called "things compilers can assume the program won't do". With what we call "assumed absence of UB", compilers can and do do a lot of clever things.
Until we get the official TR out, you may find a video I made on the subject interesting:
> And when I say truly, I mean for platforms without file systems, or operating systems or where bytes aren't 8 bits, that doesn't use ASCI or Unicode, where NULL isn't on address 0 and so on.
Genuine question: why do we want these platforms to live, rather than to be forced to die? They sound awful.
I understand retrocomputing, legacy mainframes, etc; but 99% of that work is done in non-portable assembler and/or some flavor of BASIC; not in C.
Many of these platforms are microcontrollers, DSPs or other programmable hardware that are in every device nowadays, so it's not retro; it's very much current technology.
Once again — I can understand wanting to program this hardware, but who's programming it in C, rather than writing directly to the metal in order to squeeze every cycle out of these?
Enjoy the naysayers if you like! I'm glad someone spent the time and effort to push past them. Bit too late for me - I have moved on to Rust which had support for this from version 1.0.0.
> There's also the standard *nix/BSD utility "xxd".
> Seems like the niche is filled. Or, at least, if you want to claim that
>...do NOT completely fill this evolutionary niche
> This ultimately would encourage a weird sort of resource management philosophy that I think might be damaging in the long run.
> Speaking from experience, it is a tremendously bad idea to bake any resource into a binary.
> I'll point out that this is a non-issue for Qt
applications that can simply use Qt's resources for this sort of business.
(Though credit to Matthew Woehlke, he did point out a solution which is basically identical to #embed)
> I find this useless specially in embedded environments since there should be some processing of the binary data anyway, either before building the application
In fairness there was a decent amount of support. But given the insane amount of negativity around an obviously useful feature I gave up.
I wonder if there was a similar response to the proposal to include `string::starts_with()`...
> > Speaking from experience, it is a tremendously bad idea to bake any resource into a binary.
What a pompous douche whoever wrote that was.
> > This ultimately would encourage a weird sort of resource management philosophy that I think might be damaging in the long run.
So, this might be a valid point, although not enough to reject the feature over. It's true that it's a feature that could potentially see overuse and abuse. But then, so did templates :-P
> told me this form was non-ideal and it was worth voting against (and that they’d want the pure, beautiful C++ version only[1])
I heard about #embed, but I didn't hear about std::embed before. After looking at the proposal, to me it does look a lot better than #embed, because reading binary data and converting it to text, only to then convert it to binary again seems needlessly complex and wasteful. I also don't like that it extends the preprocessor, when IMHO the preprocessor should at worst be left as is, and at best be slowly deprecated in favour of features which compose well with C proper.
Going beyond the gut reaction and moving on to hard data, as you can expect from this design, std::embed of course is faster during compilation than #embed for bigger files (comparable for moderately-sized files, and a bit slower for tiny files).
I'm not a huge fan of C++, but the fact that C++ removed trigraphs in C++17 and that it's generally adding features replacing the preprocessor scores a point with me.
Compilers follow the "as if" principle, they don't have to literally follow the formal rules given by the standard. They could implement #embed by doing as you say, pretty printing out numbers and then parsing them back in again. But that would be an extremely roundabout way to do it, so I doubt anyone will actually do it that way. Unless you're running the compiler in some kind of debugging mode like GCC's -E.
I don’t think the implication is that the C compiler must encode the binary file as a comma-separated integer list and then re-parse it, only act as if it did so.
How would that work? It would need to depend on the grammar of surrounding C code. This directive isn't limited to variable initialisers. You can use it anywhere. So e.g. you can use it inside structure declaration, or between "int main()" and "{". etc. etc. Those will generate errors in subsequent phases, but during preprocessing the compiler doesn't know about it. Then there is also just that:
int main () {
    return
#embed "file.bin"
    ;
}
There are plenty of cases, where it will all behave differently. And if you're going to pretend even more that the preprocessor understands C syntax, then why not just give this job to compiler proper, which actually understands it?
Preprocessing produces a series of tokens, so you would implement it as a new type of token. If you're using something like `-E` you would just pretty-print it as a comma-delimited list of integers. If you're moving on to translation phase 7, you'd have some sort of rules in your parser about where those kinds of tokens can appear. Just like you can't have a return token in an initializer, you wouldn't be allowed to have an embed token outside of one (or whatever the rules are). And you can directly instantiate some kind of node that contains the binary data.
People don't dislike it because they are unaware how helpful it can be. They dislike it because they are aware how hacky, fragile and error-prone it is. They want something more robust than text substitution.
People that don't like it have generally used macros that are more sophisticated than just blindly copy-pasting text into your source files, and have become aware of how absurd that is.
which obliterate tooling such as IDEs. Of course, this is a contrived example, but the preprocessor is just one big footgun, which offers no benefits over other ways of solving the problems you mentioned, such as constexpr and perhaps additional, currently unimplemented solutions.
A tool being very useful doesn’t mean that it is a very good tool.
There are better tools for the functionality the C preprocessor attempts to provide. Other languages have module inclusion systems and very powerful macros that don’t have the enormous footguns of the C preprocessor.
Edit: to be clear, I think #embed is a fine idea; I’d use it and it would make my sourcebase cleaner in some places.
My carpenter has a lot of tools that can be dangerous if misused. Of course better tools can be devised, but useful things have been done with them (and he still has all his fingers)
Yes, but we’re in a thread about ways to improve the language, not about how to make the best with what’s there. This type of argument holds back improvement.
This serves the same use as Rust's `include_bytes!` macro, right? Presumably most people just use this feature as a way to avoid having to stuff binary data into a massive array literal, but in our case it's essential because we're actually using it to stuff binaries from earlier in our build step into a binary built later in the build step. Not something you often need, but very handy when you do.
This has different affordances than std::include_bytes! but I agree that if you were writing Rust and had this problem you'd reach for std::include_bytes! and probably not instead think "We should have an equivalent of #embed".
include_bytes! gives you a &'static [u8; N] which for non-Rust programmers means we're making a fixed size array (the size of your file) full of unsigned 8-bit integers (ie bytes) which lives for the life of the program, and we get an immutable reference to it. Rust's arrays know how big they are (so we can ask, now or later) but cannot grow.
#embed gets you a bunch of integers. The as-if rule means your compiler is likely to notice if what you're actually doing is putting those integers into an array of unsigned 8-bit integers and just stick all the file bytes in the array, short cutting what you wrote, but you could reasonably do other things, especially with smaller files.
As the article quotes, in C the lack of standardisation makes this tricky when you want to support more than one compiler, or even when you want to support just one compiler (cf email about the hacks to make it work on GCC with PIE).
> Even among people who control all the cards, they are in many respects fundamentally incapable of imagining a better world or seizing on that opportunity to try and create one, let alone doing so in a timely fashion.
That does sound soul-crushing. Congrats on this achievement!
This is simply wrong. We (the ISO WG14) don't hold the cards; compilers are free to implement whatever they want, and users are free to use whatever tools or languages they want.
We exist only as long as we are trusted to be good stewards, and only go forward with the consensus of the wider community.
It's amazing that you and the ISO team are good stewards of the C standard. Thank you for being part of that.
And it can also be true that it was "hell" and "hardly worth it" for the OP to get a new feature added to the language. I believe it was a miserable experience that has him questioning how he spends his time.
Both can be true. Thank you for your efforts. And thank the OP for his efforts too.
> > Even among people who control all the cards, they are in many respects fundamentally incapable of imagining a better world or seizing on that opportunity to try and create one, let alone doing so in a timely fashion.
> This is simply wrong. We (the ISO WG14) don't hold the cards; compilers are free to implement whatever they want, and users are free to use whatever tools or languages they want.
This is an incredibly oblivious realization of JeanHeyd's point.
I think in our reality the prerequisite for holding all the cards is the lack of competence in knowing how to improve the world. We've gotten where we are now through sheer force of will of those that are empty handed.
The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.
This reminds me, I'd argue that the explosion of JS frameworks can be mainly blamed on one thing: the lack of an <include src="somemodule.html"> tag. If you have that you basically have 80% of vue.js already natively supported. No clue why this was never added in any fashion. Change my mind.
HTML imports were part of the original concept of Web Components, and I think they were supported in Chrome. If you look up examples of things built with Polymer 1.x, it was used extensively.
It was actually pretty neat, because you could have an HTML file with a template, style, and script section.
Safari rejected the proposal, so it had to get dropped.
But ESM makes it a bit redundant anyway. The end-goal is to allow you to import any kind of asset, not just JS. There have been demos and examples of tools supporting this going back over half a decade at this point.
Not the parent comment, but my personal use case is for rendering a selectable list. The server side would render a static list with fragment links (ex. `#item-10`) and include elements with corresponding IDs, and a `:target` css rule to unhide the element. This would hopefully be paired with lazy loading the include elements.
edit:
My goal is to avoid reloading the page for each selection and rendering all items eagerly. JS frameworks are the only ones that really allow this behavior.
Honestly I'm usually very wary of additions to C, as one of its greatest strengths (to me) is how rather straightforward it is as a language in terms of conceptual simplicity. There just aren't that many big concepts to understand in the language. (On the other hand there's _many_ footguns but that's another issue.)
That said, to me this seems like a great addition to the language. It's very single-purpose in its usage (so it doesn't seem to add much conceptual complexity to the language) and it replaces something genuinely painful (arcane linker hacks). I'm very much looking forward to using this as I often make single-executable programs in C. The only thing that's unfortunate is I'm sure it'll take decades before proprietary embedded toolchains add support for this.
The first commandment of C is: 'writing a naive C compiler should be "reasonable" for a small team or even one individual'. That's getting harder and harder, longer and longer.
I did move from C being "the best compromise" to "the less worse compromise".
I wish we had a "C-like" language, a kind of high-level assembler which: has no integer promotion or implicit casts, has compile-time/runtime casts (without the horrible C++ syntax), has sized primitive types (u64/s64, f32/f64, etc.) at its core, has sized literals (42b, 12w, 123dw, 2qw, etc.), has no typedef/generic/volatile/restrict or that sort of horrible thing, has compile-time and runtime "const"s, and I am forgetting a lot.
Among the main issues: the kernel gcc C dialect (roughly speaking, each Linux release uses more gcc extensions), and aggressive optimizations that can break some code (while programming some hardware, for instance).
Maybe I should write assembly, expect RISC-V to be a success, and forget about all of this.
I wish we had something like typed Lua without Lua’s weird quirks (e.g. indexing from 1), designed with performance and safety in mind, and with the features you mention.
But like Lua, the base compiler is really small and simple and can be embedded. And it’s “pseudo-interpreted”: ultimately it’s an ahead-of-time language to support things like function declarations after references and proper type checking, but compiling unoptimized is practically instant and you can load new sources at runtime, start a REPL, and do everything else you can with an interpreted language. Now, having a simple compiler with all these features may be impossible, so worst case there is just a simple interpreter, a separate type-checker, and a separate performance-optimized JIT compiler (like Lua and LuaJIT).
Also like Lua and high-level assembly, debugging unoptimized is also really simple and direct. By default, there aren’t optimizations which elide variables, move instructions around, and otherwise clobber the data so the debugger loses information, not even tail-call optimization. Execution is so simple someone will create a reliable record-replay, time-travel debugger which is fast enough you could run it in production, and we can have true in-depth debugging.
Now that I’ve written all that, I realize this is basically ML. But OCaml still has weird quirks (the object system), SML too honestly, and I doubt their compilers are small and simple enough to be embedded. So maybe a modern ML dialect with a few new features and none of the more confusing things which are in Standard ML.
Check out Nim! It does much of what you describe, and it's great. The core language is fairly small (not quite Lua-simple but probably ML-comparable). It compiles fast enough that a Nim REPL like `inim` is usable to check features and for basic maths. It requires a C compiler, but TCC [4] works perfectly. Essentially Nim + TCC is pretty close to your description, IMHO. Though I'm not sure TCC supports non-x86 targets.
I've never used it, but Nim does support some hot reloading as well [3]. It also has a real VM if you want to run user scripts, and has a nice library for it [1]. It's not quite Lua-flexible, but for a generally compiled language it's impressive.
Recently I made a wrapper to embed access to the Nim compiler's macros at runtime [2]. It took 3-4 hours probably and still compiles in tens of seconds despite building in a fair bit of the compiler! It was useful for making a code generator for a serializer format. Though I'm not sure it's small enough to live on even beefy M4/M7 microcontrollers. Though I'm tempted to try.
GCC or Clang with all warnings turned on will give you almost what you want: -Wconversion, -Wdouble-promotion and hundreds of others. A good way to learn about warning flags (apart from reading the docs) is Clang's -Weverything, which will give you many, many warnings.
I agree (with a lot of caveats), but a key value of C is that we do not break people's code, and that means that we can't easily remove things. If we do, we create a lot of problems. This makes it very difficult to keep the language as easy to implement as we would like. As a member of WG14, I intend to propose that we make this our prime priority going forward.
> I wish we had a "C-like" language, which would kind of be a high-level assembler which: has no integer promotion or implicit casts, has compile-time/runtime casts (without the horrible c++ syntax), has sized primitive types (u64/s64,f32/f64,etc) at its core, has sized literals (42b,12w,123dw,2qw,etc), has no typedef/generic/volatile/restrict/etc well that sort of horrible things, has compile-time and runtime "const"s, and I am forgetting a lot.
Unsafe Rust code I think fits this model better than C does: it relies on sized primitive types, it has support for both wrapping and non-wrapping arithmetic rather than C's quite frankly odd rules here, it has no automatic implicit casts, it has no strict aliasing rules.
> The first commandment of C is: 'writing a naive C compiler should be "reasonable" for a small team or even one individual'. That's getting harder and harder, longer and longer.
100% agreed. I've always viewed C as a "bootstrappable" language, in which it is relatively straightforward to write a working compiler (in a lower level language, likely Asm) which can then be used to bring up the rest of an environment. The preprocessor is actually a little more difficult in some respects to get completely correct, and arguably #embed belongs there, so it's debatable whether this feature is actually adding complexity to the core language.
Your wish for a "C-like" language sounds very much like B.
There is so much more to remove: 1 loop statement is enough, loop {}, enum should go away with the likes of typeof, etc.
I wonder if all that makes writing a naive "B+" compiler easier (time/complexity/size) than a plain C compiler. I stay humble since I know removing does not mean easier and faster all the time, the real complexity may be hidden somewhere else.
The question here is why this did not already exist as an extension in some compilers. Getting something standardized that exists already in compilers and is used is far easier.
The people on the C standards committee are gold bricks that say things like we won't add anything to the standard that isn't implemented in two or more compilers.
The compiler writers program in C++. And say things like if you want that feature that's a good reason to use C++ and stop using C. But if the standards committee adds it of course we will.
Also, embed is exactly the feature that end users would find very useful and compiler writers would not care about at all.
I’m really amazed at how divisive this one is, and the number of comments here questioning what seems to me a really useful and well-thought-out feature, something I’d have loved to have used many many times over the years.
I guess the heated arguments here help me understand how it could have taken so long to get this standardised, though, so that’s something!
Congratulations and thank you to the OP for doing this, and thanks also for this really interesting (if depressing) view of the process.
This is a really, really good feature and I am so glad it is finally getting standardized. C23 is shaping up to be a very good revision to the C standard. I’m hoping the proposal to allow redeclaration of identical structs gets in as well, as you would finally be able to write code using common types without having to coordinate, which would allow interoperability between independently written libraries.
Congratulations to the author. Things like this are why I hope Carbon succeeds. Evolving C++ seems like a dumpster fire, despite whatever compelling arguments about compatibility you are going to drop on me.
The issue is that a lot of people just think about languages in a wrong way, which is the whole reason for pointless things like C++ expansions, Carbon, Rust, and stuff like this.
One of the fundamental ideas that people run with in language creation/expansion is "programmer is stupid and/or make mistakes" -> "lets add language features that intercept and control his stupidity/mistakes".
And there is a very valid reason for this: it allows programmers of lesser skill and knowledge to pick up codebases and develop safe software, which has economic advantages in being able to hire less experienced devs to write software at lower salary points and spend next to no time fixing segfault issues due to complex memory management. The whole reason Java got so popular over C++ was its GC: both C++ and Java supported fairly strong typing with classes, but C++ still had a lot of semantics around memory management that had to be taken care of, whereas with Java you simply don't do anything.
However, people are applying this idea towards lower level languages, because they want the high performance of a compiled language with a whole bunch of features that make writing code as mistake free as possible.
And my challenge to that is this - why not spend the time making just smarter compilers/tooling?
Think about a hypothetical case where Rust gets all the features added to it that people want, and is widely used as the main language over all others. Looking at all the code bases, there will be a lot of common use patterns, a lot of the safety code duplicated over and over in predictable patterns, etc. And you will see these common things added to Rust. Just like with Java, a lot of the predictable use patterns got abstracted into widely used libraries like Lombok, Spring, etc., where you don't have to worry about correctness in lieu of using a library. And you essentially start to move towards more and more stuff being handled for you automagically, which is all part of the compiler/toolchain.
In the same way, #embed can be solved by a smart compiler. Have a constant string that opens a file, and read the contents into a buffer that doesn't change? Auto-include that file in the binary if you want to target performance rather than executable size. No need for a special directive; just be smart about how you handle an open call, and leave the fine tuning of this to specific compiler options.
And from the economic perspective of ease of use from above, you would have a language like Python which is super easy to pick up and program in, except instead of the interpreter, you would have a compiler that spits out binaries. Python is already widely adopted primarily because of how easy it is to set up and use. Now imagine if you had the option to run a super smart compiler that highlights any potential issues that come with dynamic typing because it understands what you are trying to do, fixes any that it can, and once everything is addressed, spits out an optimized memory-safe executable. With Rust, you code, compile, see you made a mistake somewhere with a reference, fix it, repeat. With this, you would code, compile, fix the mistake somewhere that the compiler warns you about, repeat. No difference.
Focusing on the toolchain also lets you think about integrating features from languages like Coq with provability, where you can focus not only on correctness processing/memory-wise, but also on "is the output actually correct". I.e., any piece of code can be specified to have a guaranteed bounded output set for all given input, which you can integrate into IDE tools to provide real-time feedback for you to design the code in a way that avoids things like URL-parsing mistakes, which all the safety features of Rust won't catch.
As for C, you leave it a version that has a stable, robust ABI, and then anything that you need to support will be delegated to custom tools. That way, in the future where compute will likely be full of specialized ML chips, instead of worrying about writing the frontend to support every feature, you quickly get a notional tool chain made and are able to run existing C code.
I would expect that to produce an error, though. If I had a regular file that was not infinite in size, and I specified the wrong length for the array, I would find it more useful to have the compiler inform me as to the discrepancy rather than truncate my file.
The preprocessor needs to run before the compiler, though, and isn't complex enough to understand the context of the code that it's in. That would be a substantially complex thing to implement.
This will indeed require delaying population of the array to the compilation stage. However it's worth the convenience and the succinctness of the syntax, and it's not that substantially complex to implement.
Interesting. I look forward to this. What I've been doing now to embed a source.png file is something like this, where I generate source code from a file's data:
What about creating object files from raw binary files and then linking against them? That's what I (and of course many others) do for linking textures and shaders into the program. It's a bit ugly though that with this approach you can't generate custom symbol names, at least with the GNU linker.
This #embed feature might be a nice alternative for small files. For large files you usually don't even want to store them inside the binary anyway, so the compilation overhead might be minuscule, since the embedded files are, by intention, small.
When I read the introduction of the article - about allowing us to cram anything we want into the binary - I was hoping to see a standard way to disable optimizations (when the compiler deletes your code and you don't even notice).
You reminded me of Bethesda Softworks games, which always seem to have 1GB+ executables for some reason. I hope it isn't all code. Maybe they embed the most important assets that will always need to be loaded.
My guess is that the files are not truly embedded, as that would require loading the entire file into memory before running the application, which seems wasteful.
More likely, the actual executable is only a small part of the file which accesses the rest of the file as an archive, like a self-extracting zip. There may also be some DRM trickery going on.
Off the top of my head, I think there's some niche use in embedding shaders so that they don't need to be stored as strings (no IDE support) or read at runtime (slower performance).
There are a lot of use cases for baking binary data directly into the program, especially in embedded applications. For instance, if you are writing a bootloader for a device that has some kind of display you might want to include a splash screen, or a font to be able to show error messages before a filesystem or an external storage medium is initialized. Similarly, on a microcontroller with no external storage at all you need to embed all your assets into the binary; the current way to do that is to either use whatever non-standard tools the manufacturer's proprietary toolchain provides, or to use xxd to (inefficiently) generate a huge C source file from the contents of the binary file. Both require custom build steps and neither is ideal.
You can get some IDE support with a simple preprocessor macro[1].
It's a crutch, but at least you don't need to stuff the shader into multiple "strings" or have string continuations (\) at the end of every line. Plus you get some syntax highlighting from the embedding language. I.e. the shader is highlighted as C code, which for the most part seems to be close enough.
nullptr, since we have type inference now, and NULL need not be a pointer. auto, because otherwise everybody would create their own hacky auto using the new typeof.
That’s the same that `xxd` does, for instance, and it works with C89/C99/C11.
The advantage is that since it uses the same syntax as C23, it is easier to switch to using the compiler: just remove the Cedro pragma `#pragma Cedro 1.0 #embed` and it will compile as C23.
Yes, but it's linker-specific and non-portable. It can also come with some annoying limitations, like having to separately provide the data size of each symbol. In some cases this might be introspectable, but again comes at the expense of portability.
ELF-based variants of the IAR toolchain, for example, provide a means of directly embedding a file as an ELF symbol, but without the size information being directly accessible.
GNU ld and LLVM lld do not provide any embedding functionality at all (as far as I can see). You would have to generate a custom object file with some generated C or ASM encoding the binary content.
MSVC link.exe doesn't support this either, but there is the "resource compiler" to embed binary bits and link them in so they can be retrieved at runtime.
Having a universal and portable mechanism which works everywhere will be a great benefit. I'll be using it for compiled or text shaders, compiled or text lua scripts, small graphics, fonts and all sorts.
This article[1] shows how you can use GCC toolchain along with objcopy to create an object file from a binary blob, link it, and use the data within in your own code.
The article addresses this directly. If you're only targeting one platform then this is reasonably easy (albeit still not as easy as #embed), but if you need to be portable then it becomes a nightmare of multiple proprietary methods.
Sure, but to add binary data to any executable on any platform is more involved.
As an example, see [1]. That will turn any file into a C file with a C array, and I use it to embed a math library ([2]) into the executable so that the executable does not have to depend on an external file.
5 years ago I wrote a small python script [1] to help me solve "the same problem".
It reads the files in a folder and generates a header file containing the files' data and filenames.
It's very simple and was written to help me on a job. It has limitations, so don't be too hard on me :)
This is a cool feature and I'll likely be using it in the years to come. However, the widely available xxd command and its -i option can achieve this capability portably today.
It will be useful to have it directly in the preprocessor, however. I wonder how quickly it can be added to cpp?
> It's also only suitable for tiny files: compile time and RAM requirements will blow up once you go beyond a couple of megabytes.
Do you know what makes it so? Is there a technical argument why the compiler could do better, except maybe for xxd not being specifically optimized for this use case?
The compiler is allowed to do whatever it wants so long as the results are as if it did what the standard says. So even though the standard says this makes a big list of integers like your xxd command, the compiler won't actually do that, because (as a C compiler) it knows perfectly well it would just parse those integers back into bytes again, just like the ones it got out of the binary file. It knows the integers would all be valid (it made them) and fit in a byte (duh), so it can skip the entire back-and-forth.
If I understand correctly what you're saying, it's not a problem with xxd itself, but with the format of the data, which in xxd's case is a pair made out of an array of bytes and its length, while the compiler is free to include the binary verbatim and just provide access to it as if it were an array/length pair. Am I right about that?
The article spends a fair bit of time discussing the build speed and memory use problems with that approach. Like, the benchmark results [0] linked to from this post literally have xxd as one of the rows. It's not a viable option for embedding megabytes of data.
Scary, it's as if the preprocessor has become type-aware. I guess I'd better not imagine the result of the preprocessing as looking similar to, and following the same rules as, something I would have written by hand.
This might make manual inspection of the preprocessed file a bit painful.
It's not really a preprocessor stage. Probably better to think of it more like a pointer cast to some binary blob. Though it'd be interesting to see what `gcc -E` would produce.
One particular scenario that people have highlighted is developing for an embedded system that doesn't have any storage except flash memory, and no filesystem. In this kind of system, embedding static resources in the executable is the only reasonable option you have.
> “Touch grass”, some people liked to tell me. “Go outside”, they said (like I would in the middle of Yet Another Gotdang COVID-19 Spike). My dude, I’ve literally gotten company snail mail, and it wasn’t a legal notice or some shenanigans like that! Holy cow, real paper in a real envelope, shipped through German Speed Mail!! This letter alone probably increases my Boomer Cred™ by at least 50; who needs Outside anymore after something like this?
Touch grass indeed. Sure, #embed is a nice feature, but this self-indulgent writing style I can’t stand.
That’s not the case, if you take a bit of care. Look at STB for example (https://github.com/nothings/stb) -- I’ve successfully used STB functions on a bunch of different platforms. In both C and C++, even!
C89 is where C should've stayed at. If you need to convert a file to a buffer and stick that somewhere in your translation unit, use a build system. Don't fuck with C.
> "Did you read the snail mail letter from someone who does just that?"
I did. The author struggled embedding files into their executables with makefiles. We don't know anything else beyond that. So what?
People also struggle with memory management in C, an arguably much more difficult and widespread problem. Should we introduce a garbage collector into the C spec? How about we just pull in libsodium into the C standard library because people struggle with getting cryptography right?
OP mentions #embed was a multi-year long uphill battle, with a lot of convincing needed at every turn. That in itself is enough proof that people aren't in clear agreement over there being a single "right" solution. Hence, leave this task to bespoke build systems and be done with it. Let different build systems offer different solutions. Allow for different syntaxes, etc. Leave the core language lean.
Don't the people making the standards have other things to do, like integrating useful features, instead of duplicating incbin.h [0] years after that feature already worked?
> The directive is well-specified, currently, in all cases to generate a comma-delimited list of integers.
While a noble act, this is nearly as inefficient as using a code generator tool to convert binary data into intermediate C source. Other routes to embed binary data don't force the compiler to churn through text bloat.
It would be much better if a new keyword were introduced that could let the backend fill in the data at link time.
You should read or re-read the article and references. There are multiple benchmarks showing this not to be the case. Actually, half the article is a (well deserved) rant about how wrong compiler devs were in thinking that parsing intermediate C sources could ever match the new directive. The compiler's internal representation of an array of integers also doesn't require a big pile of integer ASTs.
According to the benchmarking data this extension is even 2x faster than using the linker `objcopy` to insert a binary at link time as you suggest.
The article definitely isn’t a glowing praise of C/C++. In fact, including this simple, useful feature that rust has had for a decade now has taken an immense amount of effort and received so much pushback from various parties, in part due to the strangled mess of various compiler limitations and in part because of design-by-committee stupidity.
> Surprisingly, despite this journey starting with C++ and WG21, the C Committee is the one that managed to get there first
Later it mentions presenting their first formal attempt at this at Belfast 2019. That's a C++ meeting; it was too late for this to go into C++20 at that point, but it easily could have been in C++23 (it is not).
Officially, Rust's R-cog logo is the symbol of Rust. It is a registered trademark of the Rust Foundation.
But it's a bit boring. Unofficially, Rust has a mascot, in the form of a crab named "Ferris". The crab mascot appears in lots of places, and the Unicode crab emoji U+1F980 is often used by Rust programmers to indicate Rust in text. Unlike the trademarked logo, you can have a bit of fun with such an unofficial symbol, for example Jon Gjengset's "Rust for Rustaceans" book cover has a stylised crab wearing glasses with a laptop apparently addressing a large number of other crabs.
> vendor extensions ... were now a legal part of the syntax. If your implementation does not support a thing, it can issue a diagnostic for the parameters it does not understand. This was great stuff.
I can’t be the only one who thinks magic comment is already an ugly escape hatch, adding a mini DSL to it that can mean anything to anyone just makes it ten times worse. It’s neither beautiful nor great.
> do extensions on #embed to support different file modes, potentially reading from the network (with a timeout), and other shenanigans.
To be completely honest, I find the fact that this was raised by the committee to be really obtuse and unnecessary. The same "complaint" could be raised about #include as well.
If you want to include data from a continuous stream from a device node, then you could just as easily have the data piped into a temporary file of defined size and then #embed that. No need to have the compiler cater for a problem of your own making.
As for the custom data types: it's a byte array. Why not leave any structure you wish to impose on the byte array up to the user? They can cast it to whatever they like. I'm not sure why that's anything to do with the #embed functionality.
Both these things seem to be massive overthinking on the part of the committee members. I'm glad I'm not participating, and I really do thank the author for their efforts there. We've needed this for decades, and I'm glad it's got in even if those ridiculous extensions were the compromise needed to get it there.
I’ve known C for close to two decades, thank you. I’m using the not at all well defined term “magic comment” to loosely refer to everything that’s not strictly speaking code but has special meaning, which include pre-processor directives.
> I’ve known C for close to two decades, thank you. I’m using the not at all well defined term “magic comment”
Please forgive those of us who've been using C since the 80s, or earlier, for assuming you don't know C when you invent your own terminology for preprocessor directives.
This is not a preprocessor directive though, from reading the post I don’t think cpp is expanding the #embed into an array initializer, otherwise there’s no performance benefit at all.
The deliberate wording expands #embed to a C initializer list.
The major performance benefit is from the as-if rule. The compiler is entitled to do whatever it wants so long as the result is as-if it worked the way the standard describes.
So a decent compiler is going to notice that you're #embed-ing this in a byte array, and conclude it should just shovel the whole file into the array here. If it actually made the bytes into integers, and then parsed them, they would of course still fit in exactly one byte, because they're bytes, so it needn't worry about actually doing that which is expensive.
Does it work if you try to #embed a small file as parameters to a printf() ? Yeah, probably, go try it on Godbolt (this is an option there) but for the small file where that's viable we don't care about the performance benefit. It's just a nice trick.
The problem with the preprocessor is often that it is a language inside another language, and the preprocessor is designed to be almost completely agnostic of the language it is embedded in. So there might be subtle ways to use the preprocessor such that implementing as-if becomes very unintuitive. I don't have a good intuition about whether this case is 100% designed in a way that can never provoke such subtle side effects. Basically, what might end up happening is that the preprocessor has to learn some part of the C language to decide whether such an as-if transformation is possible, and then branch to either do it or not.
This isn't adding a new feature. This is replacing a feature that everyone already implemented independently in every other project - some with xxd, some with special embedders such as rcc or windres or whatever, some through CMake directly (like this: https://gist.github.com/sivachandran/3a0de157dccef822a230 for instance) - in a standard and more performant way. Instead of being paid per-project, this cost will now be paid per-compiler-implementation, which is unambiguously good, as there are far fewer compilers than C projects.
You replied to my quote about pulling network resources and “other shenanigans”, which certainly isn’t what “everyone already implemented independently”. Plus that’s a potential vendor extension, i.e., some may implement it independently, some may not, implementations will likely differ.
> You replied to my quote about pulling network resources and “other shenanigans”, which certainly isn’t what “everyone already implemented independently”.
?? Pulling network resources works, today, with what people are using. There's zero difference between "cat foo.txt | xxd -i" and "curl https://my.api/foo.json | xxd -i".
Personally I feel the C committee should've disbanded after the first standard (and the C++ one after the 2003 technical corrigendum). I didn't mind C99 much, but it looks like C(++)reeping featuritis is a nasty habit.
These gratuitous standards prompt newbies to use the new features (it's "modern") and puzzled veterans to keep up and reinternalize understanding of new variants of languages they've been using for decades. There's no real improvement, just churn. Possibly it's one of the instruments of ageism. More incompatibility with existing software and existing programmers.
One problem of this new millennium is that the field has developed a tendency to do old things with new languages instead of doing new things with old languages.
Reinventing another polygonal wheel approximation while constantly tweaking the theory used serves to segregate the experienced (who may have trouble accepting such tweaks or their necessity - they already know about the existence of wheels anyway) from the newbies (who have no previous intuitions and no taste for legitimate and spurious novelty).
Newbies are cheap, and new ideas are hard. Let's do some mental rent seeking.
This is a cool feature, but the author doesn't do himself any favors with his style of writing, which greatly overestimates the importance of his own feature in the grand scheme of things. Remarks like "an extension that should’ve existed 40-50 years ago" make me wonder whether we really should have bothered all compiler vendors with implementing this 40-50 years ago. After all, you can already a) directly put your binary data in the source file, like what's shown after the preprocessor step, and b) read a file at runtime. I'm not saying this isn't useful, but it's a niche performance improvement rather than a core language feature.