Hacker News | jaen's comments

Huh, interesting... why have both React components and Mustache-style templates in the same framework? Don't they perform the same function?

What's the use case for mixing them?


React components might eventually be removed in favor of making the templating system as fast and as elegant as possible, but for the time being they provide flexibility.

You can read https://lukeb42.github.io/vertex-interop.html for more info.


That's just because true statements are more likely to occur in their training corpus.

The overwhelming majority of true statements isn't in the training corpus, due to combinatorial explosion. What does it mean that they are more likely to occur there?

The training set is far too small for that to explain it.

Try to explain why one-shotting works.


Uh, to explain what? You probably read something into what I said while I was being very literal.

If you train an LLM on mostly false statements, it will generate both known and novel falsehoods. Same for truth.

An LLM has no intrinsic concept of true or false; everything is a function of the training set. It just generates statements similar to what it has seen, and higher-dimensional analogies of those.


Reasoning makes it possible to produce statements that are more likely to be true, based on statements that are known to be true. You'd need to structure your "falsehood training data" in a specific way to allow an LLM to generalize as well as with the regular data (instead of memorizing noise). And then you'll get a reasoning model that remembers false premises.

You generate your text based on a "stochastic parrot" hypothesis with no post-validation, it seems.


The paper does make this distinction under the "Concurrent Versions" property.

Allowing concurrent versions, though, opens you up to either really insidious runtime bugs or impossible-to-solve static type errors.

This happens eg. when you receive a package.SomeType@v1, and then try to call some other package with it that expects a package.SomeType@v2. At that point you get undefined runtime behavior (JavaScript), or a static type error that can only be solved by allowing you to import two versions of the same package at the same time (and this gets real hairy real fast).
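A minimal Python sketch of that failure mode (the `SomeType_v1`/`SomeType_v2` classes here are hypothetical stand-ins for two concurrently loaded copies of `package.SomeType`):

```python
# Two concurrently loaded versions of the same package each carry their
# own copy of SomeType; the classes are distinct objects even though
# they share a name.
class SomeType_v1:              # stands in for package.SomeType@v1
    def __init__(self, value):
        self.value = value

class SomeType_v2:              # stands in for package.SomeType@v2
    def __init__(self, value):
        self.value = value

def consumer_expecting_v2(obj):
    # A library written against v2 checks for *its* SomeType.
    if not isinstance(obj, SomeType_v2):
        raise TypeError(
            "expected package.SomeType (v2), got %s" % type(obj).__name__)
    return obj.value

obj = SomeType_v1(42)           # produced by code that imported v1
try:
    consumer_expecting_v2(obj)
except TypeError as e:
    print(e)                    # the "same" type fails the check
```

In a dynamically typed language this surfaces as a runtime error (or silent misbehavior if nobody checks); in a statically typed one it becomes the two-imports-of-one-package type puzzle described above.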

Also, global state (if there is any) will be duplicated for the same package, which generally also leads to very hard-to-discover bugs and undefined behavior.


Good points. Practically speaking, though, global state is rarely an issue unless it's the underlying framework (hence peer deps).

Modern languages are mostly lexically scoped, and relying primarily on global variables for state (aside from singletons) has fallen out of favor outside of embedded development, unless it's a one-off script.


There isn't any attempt to falsify the "clean room" claim in the article - a rational approach would be to not provide any documents about the Z80 and the Spectrum, and just ask it to one-shot an emulator and compare the outputs...

If the one-shot output resembles anything working (and I am betting it will), then obviously this isn't clean room at all.


The author just trusts the agent not to use the internet because he told it so in the instructions, which should tell you all you need to know. It's great that he managed to prompt it with the right specification for writing yet another emulator, but I don't think he understands how LLMs actually work, so most of the commentary on what's going on with the "psychology" of the LLM should be ignored.

Even without internet access, probably everything there is to say about Z80/Speccy emulators was already in its training set.

You didn't read the full article. The last paragraph talks about this specifically.

In the last paragraph you handwave that all the Z80 and ZX Spectrum documentation is likely already in the model anyway... Choosing not to provide the documents/websites might then require more prompting to finish the emulator, but the knowledge is there. You can't clean-room with a large LLM. That's a delusion!

Counterpoint: in December, a Polish MP [0] vibe-coded an interpreter [1] for a 1959 Polish programming language, feeding it the available documentation. _That,_ at least, is unlikely to have appeared in the model’s training data.

[0]: https://en.wikipedia.org/wiki/Adrian_Zandberg [1]: https://sako-zam41.netlify.app/


Not exactly a counterpoint, since nobody argued that LLMs cannot produce "original" code from specs at all - just that this particular exercise was not clean room.

(although for SAKO [1], it's an average 1960 programming language, just with keywords in Polish, so it's certainly almost trivial for an LLM to produce an interpreter, since construction via analogy is the bread and butter of LLMs. Also, such interpreters tend to have an order of magnitude less complexity than emulators.)

[1]: https://en.wikipedia.org/wiki/SAKO_(programming_language)


I mean, for an article that's titled "clean room", that would be the first thing to do, not as a "maybe follow up in the future"...

(I do think the article could have stood on its own without mentioning anything about "clean room", which is a very high standard.)

For the handwavy point about the x86 assembler, I am quite sure that the LLM will remember the entirety of the x86 instruction set without any reference, it's more of a problem of having a very well-tuned agentic loop with no context pollution to extract it. (which you won't get by YOLOing Claude, because LLMs aren't that meta-RLed yet to be able to correct their own context/prompt-engineering problems)

Or alternatively, to exploit context pollution, take half of an open-source project and let the LLM fill in the rest (try to imagine the synthetic "prompt" it was given when training on this repo) and see how far it is from the actual version.


This is roughly the same problem as syntactical macros in non-Lisp syntax languages.

There needs to be a way to indicate a "hole" (metavariable/unquote) in the syntax tree, and depending on the complexity of the language's grammar, that might be somewhat difficult, eg. in C++ having a hole for a type declaration runs into the common ambiguity between declarations (constructor calls) and expressions (regular calls). This needs to be worked around by eg. having multiple types of holes to disambiguate...

For the article, the idea of using an "any" prefix on identifiers instead of special operators such as the backquote and comma of Lisp macros is an interesting solution, as it does not require extending the grammar of the language... although it's not applicable in all situations (eg. for matching grammar elements where identifiers are not allowed). For a very regular language like Smalltalk, though, it's pretty good.
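To show how cheap the "any"-prefix approach is to implement, here's a rough sketch over Python's own `ast` module (the `any_` convention and the `matches` helper are made up for illustration, not taken from the article):

```python
import ast

def matches(pattern: ast.AST, node: ast.AST) -> bool:
    """Structural match where identifiers prefixed `any_` act as holes."""
    if isinstance(pattern, ast.Name) and pattern.id.startswith("any_"):
        return True                      # hole: matches any subtree
    if type(pattern) is not type(node):
        return False
    for field in pattern._fields:
        p = getattr(pattern, field, None)
        n = getattr(node, field, None)
        if isinstance(p, ast.AST):
            if not isinstance(n, ast.AST) or not matches(p, n):
                return False
        elif isinstance(p, list):
            if not isinstance(n, list) or len(p) != len(n):
                return False
            if not all(matches(pi, ni) for pi, ni in zip(p, n)):
                return False
        elif p != n:
            return False
    return True

pat  = ast.parse("print(any_arg)", mode="eval").body
hit  = ast.parse("print(x + 1)",   mode="eval").body
miss = ast.parse("log(x + 1)",     mode="eval").body
print(matches(pat, hit), matches(pat, miss))   # True False
```

Note that the holes here only fit where an identifier can appear, which is exactly the limitation mentioned above; matching a type declaration or a statement list would need a different encoding.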

Grammar-based rewriting has a long line of history and research, so there's a deep well of knowledge to be mined if you feel like hitting up Google Scholar...

For modern implementations, there's eg. ast-grep and semgrep: https://ast-grep.github.io/ https://semgrep.dev/docs/writing-rules/pattern-syntax


You used LLMs to generate code to beat ARC-AGI "without using LLMs"... Uhh, okay then.

LLMs generating code to solve ARC-AGI is literally what they do these days, so as far as I see, basically this entire exercise is equivalent to just running "Deep Think" test-time compute type models and committing their output to Github?

What exactly was the novel, un-LLMable human input here?


I understand the skepticism—the line between "AI-generated" and "AI-assisted" has become incredibly blurry. Let me clarify the architectural distinction.

1. The Inference Engine is 100% Deterministic: The "solver" is a standalone Python program (26K lines + NumPy). At runtime, it has zero neural dependencies. It doesn't call an LLM, it doesn't load weights, and it doesn't "hallucinate." It performs a combinatorial search over a formal Domain Specific Language (DSL). You could run this on a legacy machine with no internet connection. This is fundamentally different from o1/o3 or Grok-Thinking, where the model is the solver at test-time.

2. The "Novel Human Input" is the DSL Design: Using an LLM to help write Python boilerplate is trivial. Using an LLM to design a 7-phase symbolic pipeline that solves ARC is currently impossible. My core contributions that an LLM could not "reason" out are:

The Cross DSL: The insight that ~57% of ARC transforms can be modeled by local 5-cell Von Neumann neighborhoods.

Iterative Residual Learning: A gradient-free strategy where the system synthesizes a transform, calculates the residual error on the grid, and iteratively synthesizes "correction" programs.

Pruning & Verification: Implementing a formal verification loop where every candidate solution is checked against the 3-5 training examples before being proposed.

3. Scaling through Logic, not Compute: While the industry spends millions on "Test-time Compute" (GPU-heavy CoT), Verantyx achieves 18.1% (and now 20% in v6) using Symbolic Synthesis on a single CPU. The 208 commits in the repo represent 208 iterations of staring at grid failures and manually expanding the primitive vocabulary to cover topological edge cases that LLMs consistently miss.

If using Copilot to speed up the implementation of a deterministic search algorithm invalidates the algorithm, then we’d have to invalidate most modern OS kernels or compilers written today. The "intelligence" isn't in the typing; it's in the program synthesis architecture that does what pure LLM inference cannot.

I'd encourage you to check the source—it's just pure, brute-force symbolic logic: https://github.com/Ag3497120/verantyx-v6
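To make the distinction concrete, here is a deliberately toy sketch of what enumerative search over a closed DSL looks like (the primitives and the `synthesize` function are invented for illustration and bear no relation to the actual Verantyx code):

```python
from itertools import product

# A toy closed vocabulary of grid primitives -- nothing like the real
# DSL, just enough to show the shape of the search.
PRIMITIVES = {
    "flip_h":   lambda g: [row[::-1] for row in g],
    "flip_v":   lambda g: g[::-1],
    "rot90":    lambda g: [list(r) for r in zip(*g[::-1])],
    "identity": lambda g: [row[:] for row in g],
}

def synthesize(train_pairs, max_depth=3):
    """Enumerate compositions of primitives; return the first program
    that maps every training input to its training output."""
    names = list(PRIMITIVES)
    for depth in range(1, max_depth + 1):
        for prog in product(names, repeat=depth):
            ok = True
            for inp, out in train_pairs:
                g = inp
                for name in prog:
                    g = PRIMITIVES[name](g)
                if g != out:
                    ok = False
                    break
            if ok:
                return prog     # verified against all training pairs
    return None                 # vocabulary can't express this task

train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(synthesize(train))        # ('flip_h',)
```

The search space is finite by construction, every candidate is verified against all training pairs before being returned, and a task the vocabulary can't express simply fails - which is why progress comes from adding primitives, not from more compute.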


Did you even read my comment with any thought?! This is like an AI-generated response that didn't understand what I was actually saying.

> I'd encourage you to check the source

I couldn't have written my comment without reading the source, obviously!

> o1/o3 or Grok-Thinking, where the model is the solver at test-time.

What? I said that the SotA is the model generating code at test-time, not solving it directly via CoT/ToT etc.


Title: Clarification on my development workflow (re: jaen)

You’re right to be skeptical of the speed, and I realize I was incomplete in describing my process. I should have been more transparent: I am using Claude Code as a "pair-programmer" to implement and review the logic I design.

While the Verantyx engine itself remains a 100% static, symbolic solver at test-time (no LLM calls during inference), the rapid score jumps from 20.1% to 22.4% are indeed accelerated by an AI-assisted workflow.

My role is to identify the geometric pattern in the failed tasks and design the DSL primitive (the "what"). I then use Claude Code to scaffold the implementation, check for regressions across the 1,000 tasks, and refine the code (the "how").

This is why I can commit 30-80 lines of verified geometric logic in minutes rather than hours. The "thinking" and the "logic design" are human-led, but the "implementation" is AI-augmented.

My apologies if my previous comments made it sound like I was manually typing every single one of those 26K lines without help. In 2026, I believe this "Human-Architect / AI-Builder" model is the most effective way to tackle benchmarks like ARC.

I’d love to hear your thoughts on this hybrid approach to symbolic AI development.


You're right—I should have engaged with your actual point more carefully. Let me address it directly.

You said the SotA is models generating code at test-time, and you're correct. Systems like o3 synthesize Python programs per-task, execute them, and check outputs. That's a legitimate program synthesis approach.

Here's where Verantyx differs structurally:

*The DSL is fixed before test-time.* When Verantyx encounters a new task, it doesn't generate arbitrary Python. It searches over a closed vocabulary of ~60 typed primitives (`apply_symmetrize_4fold`, `self_tile_uniform`, `midpoint_cross`, etc.) and composes them. The search space is finite and enumerable. An LLM generating code has access to the full expressiveness of Python—mine doesn't.

*Here's the concrete proof that this isn't prompt-engineering:*

While we've been having this discussion, the solver went from 20.1% to *22.2%* (222/1000 tasks). That's +21 tasks in under 48 hours. Each new task required identifying a specific geometric pattern in the failure set, designing a new primitive function, implementing it, verifying it produces zero regressions on all 1,000 tasks, and committing. The commit log tells this story:

- `v55`: `panel_compact` — compress grid panels along separator lines

- `v56`: `invert_recolor` — swap foreground/background with learned color mapping

- `v57`: `midpoint_cross` + `symmetrize_4fold` + `1x1_feature_rule` (+5 tasks)

- `v58`: `binary_shape` lookup + `odd_one_out` extraction (+2)

- `v59`: `self_tile_uniform` + `self_tile_min_color` + `color_count_upscale` (+4)

Each of these is a 30-80 line Python function with explicit geometric semantics. You can read any one of them in `arc/cross_universe_3d.py` and immediately understand what spatial transformation it encodes. An LLM prompt-tuning loop cannot produce this kind of monotonic, regression-free score progression on a combinatorial benchmark—you'd see random fluctuations and regressions, not a clean staircase.

*The uncomfortable reality for "just use an LLM" approaches:*

My remaining ~778 unsolved tasks each require a new primitive that encodes a geometric insight no existing primitive covers. Each one I add solves 1-3 tasks. This is the grind of actual program synthesis research—expanding a formal language one operator at a time. It's closer to compiler design than machine learning.

I'd genuinely welcome a technical critique of the architecture. The code is right there: [cross_universe_3d.py](https://github.com/Ag3497120/verantyx-v6/blob/main/arc/cross...) — 1,200 lines, zero imports from any ML library.


Don't bother engaging with grandparent, LLM-generated comment which is a regurgitation of what bigyabai said upthread.

With the small graph in the post, finding the solution by searching backwards from "finished" graphs (ie. single-city) using dynamic programming should be simpler than beam search and guaranteed optimal.

First thing that came to my mind too.

I think it would also be easier to add some meaningful variation to the resulting graphs by building them up instead of removing edges while trying to retain properties. The proposed algorithms are perhaps too predictable for the player, depending on how the game is played.


See, that's why I have to post these things. Someone will inevitably reply with something more clever.

Complexity-wise, this version is more complicated (mixing different styles and paradigms), and it's barely fewer tokens. Lines of code don't matter anyway; cognitive load does.

Even though I barely know Raku (though I do have experience with FP), it took far less time to intuitively grasp what the Raku was doing than either of the Python versions. If you're only used to imperative code, then yeah, maybe the Python looks more familiar, but then... how about riding some new bicycles for the mind?


> Complexity-wise, this version is more complicated (mixing different styles and paradigms)

Really? In the other Python version the author went out of his way to keep two variables, and shit out intermediate results as you went. The raku version generates a sequence that doesn't even actually get output if you're executing inside a program, but that can be used later as a sequence, if you bind it to something.

I kept my version to the same behavior as that Python version, but that's different than the raku version, and not in a good way.

You should actually ignore the print in the python, since the raku wasn't doing it anyway. So how is "create a sequence, then while it is not as long as you like, append the sum of the last two elements" a terrible mix of styles and paradigms, anyway? Where do you get off writing that?

> Lines of code don't matter anyway, cognitive load does.

I agree, and the raku line of code imposes a fairly large cognitive load.

If you prefer "for" to "while" for whatever reason, here's a similar Python to the raku.

  seq = [0,1]
  seq.extend(sum(seq[-2:]) for _ in range(18))
The differences are that it's a named sequence, and it doesn't go on forever and then take a slice. No asterisks that don't mean multiply, no carets that don't mean bitwise exclusive or.

> If you're only used to imperative code, then yeah, maybe the Python looks more familiar, though then... how about riding some new bicycles for the mind.

It's not (in my case, anyway) actually about imperative vs functional. It's about twisty stupid special symbol meanings.

Raku is perl 6 and it shows. Some people like it and that's fine. Some people don't and that's fine, too. What's not fine is to make up bogus comparisons and bogus implications about the people who don't like it.


Reminds me a bit of the fish anecdote told by DFW... they've only swum in water their entire lives, so they don't even understand what water is.

Here are the mixed paradigms/styles in these Python snippets:

- Statements vs. expressions

- Eager list comprehensions vs. lazy generator expressions

- Mutable vs. immutable data structures / imperative reference vs. functional semantics

(note that the Raku version only picks _one_ side of those)

> seq.extend(sum(seq[-2:]) for _ in range(18))

I mean, this is the worst Python code yet. Explaining what this does to a beginner, or even an intermediate programmer... oooooh boy.

You have the hidden inner iteration loop inside the `.extend` standard library method driving the lazy generator expression with _unspecified_ one-step-at-a-time semantics, which causes `seq[-2:]` to be evaluated at exactly the right time, and then `seq` is extended even _before_ the `.extend` finishes (which is very surprising!), causing the next generator iteration to read a _partially_ updated `seq`...

This is almost all the footguns of standard imperative programming condensed into a single expression. Like ~half of the "programming"-type bugs I see in code reviews are related to tricky temporal (execution order) logic, combined with mutability, that depend on unclearly specified semantics.
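To make the footgun concrete (at least under CPython, where `extend` consumes the generator one item at a time): swap the generator expression for an eager list comprehension and the result silently changes, because the comprehension no longer sees its own appends:

```python
# Lazy: list.extend pulls items from the generator one at a time, so
# each sum(seq[-2:]) sees the elements appended by the previous step.
seq = [0, 1]
seq.extend(sum(seq[-2:]) for _ in range(5))
print(seq)     # [0, 1, 1, 2, 3, 5, 8]

# Eager: the list comprehension is fully evaluated *before* extend
# runs, so sum(seq2[-2:]) is sum([0, 1]) == 1 every single time.
seq2 = [0, 1]
seq2.extend([sum(seq2[-2:]) for _ in range(5)])
print(seq2)    # [0, 1, 1, 1, 1, 1, 1]
```

Two visually near-identical one-liners, two different sequences; the correctness of the "clever" version hinges entirely on that interleaving.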

> It's about twisty stupid special symbol meanings.

Some people program in APL/J/K/Q just fine, and they prefer their symbols. Calling it "stupid" is showing your prejudice. (I don't and can't write APL but still respect it)

> What's not fine is to make up bogus comparisons and bogus implications about the people who don't like it.

That's a quite irrational take. I didn't make any bogus comparisons. I justified or can justify all my points. I did not imply anything about people who don't like Raku. I don't even use Raku myself...


> You have the hidden inner iteration loop inside the `.extend` standard library method driving the lazy generator expression with _unspecified_ one-step-at-a-time semantics

That's why it wasn't the first thing I wrote.

> To explain what this does to a beginner, or even intermediate programmer.... oooooh boy.

As if the raku were better in that respect, lol.

> Some people program in APL/J/K/Q just fine, and they prefer their symbols.

APL originally had a lot of its own symbols with very little reuse, and clear rules. Learning the symbols was one thing, but the usage rules were minimal and simple. I'm not a major fan of too many different symbols, but I really hate reuse in any context where how things will be parsed is unclear. In the raku example, what if the elements were to be multiplied?

> Calling it "stupid" is showing your prejudice. (I don't and can't write APL but still respect it) > Reminds me a bit of the fish anecdote told by DFW...

Yeah, for some reason, it's not OK for me to insult a language, but it's OK for you to insult a person.

But you apparently missed that the "twisty" part was about the multiple meanings. Because both those symbols are used in Python (the * in multiple contexts even) but the rules on parsing them are very simple.

perl and its successor raku are not about simple parsing. You are right to worry about the semantics of execution, but that starts with the semantics of how the language is parsed.

In any case, sure, if you want to be anal about paradigm purity, take my first example, and (1) ignore the print statement because the raku version wasn't doing that anyway, although the OP's python version was, and (2) change the accumulation.

  seq = [0,1]
  while len(seq) < 20:
    seq = seq + [seq[-2] + seq[-1]]
But that won't get you very far in a shop that cares about pythonicity and coding standards.

And...

You can claim all you want that the original was "pure" but that's literally because it did nothing. Not only did it have no side effects, but, unless it was assigned or had something else done with it, the result was null and void.

Purity only gets you so far.


You're getting more and more irrational.

> it's OK for you to insult a person.

I made an analogy which just means that it's hard to understand what the different styles and paradigms are when those are the things you constantly use.

You're apparently taking that as an insult...

> But you apparently missed that the "twisty" part

I didn't miss anything. You just didn't explain it. "twisty" does not mean "ambiguous" or "hard to parse". Can't miss what you don't write.


> In the raku example, what if the elements were to be multiplied?

  $ raku -e 'say (0, 1, 2, * × * ... *)[^10]'   # for readability
  (0 1 2 2 4 8 32 256 8192 2097152)

  $ raku -e 'say (0, 1, 2, * * ... *)[^10]'     # for typeability
  (0 1 2 2 4 8 32 256 8192 2097152)


Yeah, no thanks.

My instincts about raku were always that perl was too fiddly, so why would I want perl 6, and this isn't doing anything to dissuade me from that position.


but...

1. Synchronizing on trivial properties (otherwise you couldn't use `public` anyway!) is an anti-pattern, as it's too fine-grained a unit of concurrency and invites race conditions.

2. Can't proxy without rewriting byte code, you mean.

3. Of course you can evolve, it's just a breaking ABI change so it requires a trivial code migration on the side of the callee. If the cost of that migration is too high, something else is wrong.


> a trivial code migration on the side of the callee

If your library is used by multiple consumers, forcing all of them to migrate is not trivial, no matter how simple the change is.

If your income comes from these customers, it is not a good idea to put every one of them in the situation of having to choose between updating their code or stopping being your customer.

