
Calling the AISLE experiment a "benchmark" is generous. They tested three code snippets on each model.

It maps (-inf, inf) to (0, inf) in about as nice a way as you could expect (addition turns into multiplication). When you want to constrain a value to be positive, parameterizing it with exp is usually a good option.

And importantly it's got nice properties like being differentiable and monotonic, unlike e.g. taking |x|.
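A toy sketch of that pattern in PyTorch (the names and the quadratic loss are just for illustration): optimise an unconstrained log-parameter and exponentiate wherever the positive quantity is needed.

    import torch

    # Unconstrained parameter lives on (-inf, inf); exp maps it to (0, inf).
    log_sigma = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_sigma], lr=0.1)

    target = torch.tensor([2.5])
    for _ in range(200):
        sigma = log_sigma.exp()               # always positive, smooth gradient
        loss = (sigma - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(log_sigma.exp().item())  # approaches 2.5, no clipping or projection needed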

The "model card" concept actually comes from a pre-LLM Google paper (https://arxiv.org/abs/1810.03993), where the example cards did fit on a single page. The concept quickly became a standard component of AI governance frameworks, and Hugging Face adopted it as a reasonable standard format for a model README. As LLMs emerged and became more capable at broader ranges of tasks, model cards expanded to the sizes we see today.

That makes sense. I recall a "battle card" ("concise, easy-to-scan document that helps [sales] reps handle competitive conversations, respond to objections, and highlight key differentiators" per HubSpot) as about a half-sheet document, which is congruent.

There's some validity to these criticisms, but it would be a lot more credible to cite someone whose job isn't "loudly promote any claim that sounds negative for AI, regardless of how well-founded it is."

It would be twice that, since Nvidia always lists the "with sparsity" FLOPS as the headline number. But I bet they got a bunch of research credits to do this.

There's a similar but unreleased project here: https://github.com/DGoettlich/history-llms

I've been waiting for them to publish the 4B model for a while so I'm glad to have something similar to play with. I think I trust the Ranke-4B process a bit more, but that's partly because there aren't a lot of details in this report. And actually releasing a model counts for a whole lot.

One thing that I think will be a challenge for these models is achieving any sort of definite temporal setting. Unless the conversation establishes a clear timeframe, the model may end up picking a more or less arbitrary context, or worse, averaging over many different time periods. I think this problem is mostly handled by post-training in modern LLMs (plus the fact that most of their training data comes from a much narrower time range), but that is probably harder to accomplish while trying to avoid bias in the SFT and RL process.


I wonder if it would be possible to do something simple like prepending a sentinel token encoding the year. Or, since they're training a model from scratch anyway, tweak the architecture to condition on a temporal embedding. That opens the door to cool stuff like: generate a response from 2050.
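Rough sketch of the sentinel-token idea (the token format is my own invention, not anything from the report):

    # Hypothetical: prefix each training document with a year sentinel so the
    # model learns to condition on a temporal context it can later be prompted with.
    def add_year_sentinel(text: str, year: int) -> str:
        return f"<|year={year}|> {text}"

    train_example = add_year_sentinel("The King opened Parliament today.", 1895)
    # -> "<|year=1895|> The King opened Parliament today."

    # At inference time the same sentinel steers the temporal setting:
    prompt = add_year_sentinel("The monarch addressed the nation and", 2050)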

But of course the monarch was a queen for the majority of the 19th century. While there's definitely post-1930 information that made it into the training data, I suspect the reason this happened is that the model is not very sure what year it actually is, and based on various subtle cues can generate text that seems to be situated in a wide range of time periods.

The same This American Life episode raised serious doubts about Dr. Steel's claims, as mentioned in the article you link:

> When reporters tried to corroborate Dr. Steel’s claims, however, holes started appearing, according to the This American Life episode. Chief among them: There actually was a real Dr. Robert Ho Man Kwok, and his biographical details seemed to match those provided in the letter, like his professional title, the name of his research institute, and the date of his move to the US.

> While both Dr. Steel and Dr. Ho Man Kwok had died by the time the digging began in earnest, their surviving family members were able to shed some light on the situation. Dr. Ho Man Kwok’s children and former colleagues were adamant that Dr. Ho Man Kwok had in fact written the letter. Meanwhile, Dr. Steel’s daughter said her father was a lifelong prankster who loved pulling one over on people. With this testimony in mind, the reporters came to the conclusion that Dr. Ho Man Kwok was most likely the true author and Dr. Steel had taken credit for years as an elaborate practical joke.


Oh damn I just linked the first article I could find without fully reading it, my bad. That's crazy haha wow

I'm not totally convinced by this:

> It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.

One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality, and architectures that let you train on more data are better because they learn better underlying world models. Knowledge transfers: LLMs are good at writing code partly because they've seen a lot of code, but also because they understand (at least to some extent) the relationship between that code and the rest of the world. Constraining a model's output structure also constrains the data that is available to train it. So the big question is whether you can actually meaningfully scale training with these kinds of strictly structured outputs.


At the same time, treating everything as tokens and next-word prediction will never produce any real understanding like what humans do when they learn how to program. The bitter lesson is an admission that we still have no clue what is at the core of human learning and reasoning, so we have to brute-force it with tons of human-generated data. I also don't know whether expert systems and ML techniques like feature extraction are really any worse in practice, or whether we just never had the engineering resources or a proper way to organize and scale their development. They seemed to work quite well in a lot of cases, with more predictable results and several orders of magnitude less compute. And LLMs still suffer from the long-tail problem despite their insane amounts of data.

If we're at the end of the data and most new data is now produced by LLMs with little human oversight, where do we go? Seems like figuring out ways to mix LLMs with more structured models that can reliably handle important classes of problems is the next logical step. In a way, that is what programming languages and frameworks/libraries are doing, but work on those has been massively disincentivized by claiming that LLMs will do everything.

The chess example is a good one: it's effectively solved, so why shouldn't an LLM have a submodule it can call to play chess and save some energy?
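Something like this, just as a sketch: let the model hand positions off to a real engine (python-chess plus a local Stockfish binary here, purely illustrative) instead of reasoning about moves token by token.

    import chess
    import chess.engine

    def chess_tool(fen: str) -> str:
        """Delegate move selection to a dedicated engine; the LLM only calls this."""
        board = chess.Board(fen)
        # Assumes a Stockfish binary is available on PATH.
        with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
            result = engine.play(board, chess.engine.Limit(time=0.1))
        return result.move.uci()

    # The LLM's job shrinks to recognising "this is a chess position" and
    # passing along the FEN, rather than burning tokens computing the move.
    print(chess_tool(chess.STARTING_FEN))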


Author here - thanks for engaging.

> One way to think of the bitter lesson as it applies to generative models is that ~all data carries some information about the structure of reality

Completely agree. It might not have come across, but what I'm pointing out in the post is that the data as it is currently encoded in the models is needlessly lossy. Tokens do not reveal all the information we have at our disposal. In natural language, that's fine, because it's quite loose in structure.

But if our domain is heavily structured (like modern programming languages are), why reveal only snippets of linearised syntax of that structure to the model? Why not reveal the full structure we have at our disposal?
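Just to make the contrast concrete (this is not the encoding the post proposes, only an illustration with Python's stdlib): the same snippet as a flat token stream versus the parse tree we already have for free.

    import ast
    import io
    import tokenize

    src = "total = price * (1 + tax_rate)"

    # What the model usually sees: a linearised stream of tokens.
    flat = [tok.string for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
    print(flat)   # ['total', '=', 'price', '*', '(', '1', '+', 'tax_rate', ')', ...]

    # The structure we already have at our disposal: the full syntax tree.
    print(ast.dump(ast.parse(src), indent=2))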

> and architectures that let you train on more data are better because they learn better underlying world models.

By this argument, wouldn't we conclude that training on chess using the game structure wouldn't work either, since that'd be a model that uses less data?

Less data is the point, isn't it?


I notice the experiments are all run with Gaussian token embeddings and weight matrices, which is a very different scenario from what you would get in a real model. It shouldn't be much more difficult to try this with an actual model and data and get a much better sense of how well it compresses.
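Pulling a real embedding matrix in is only a few lines (the model choice here is arbitrary, just the smallest convenient thing on Hugging Face):

    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    # (vocab_size, d_model) embedding table from a trained model,
    # to use in place of the Gaussian synthetic embeddings.
    emb = model.get_input_embeddings().weight.detach()

    ids = tok("some real text instead of Gaussian noise", return_tensors="pt").input_ids
    real_token_embeddings = emb[ids[0]]
    print(real_token_embeddings.shape)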


I completely agree. Right now this is all on a synthetic setup to isolate the behavior and understand the reconstruction-vs-memory tradeoff. Real models will definitely behave differently.

I've started trying this out with actual models, but I'm currently running everything on CPU, so it's pretty slow. Would ideally want to try this properly on a GPU, but that gets expensive quickly.

So yeah, still very much a research prototype — but validating this on real models/data is definitely the next step.

