Typo: Webiste URL
The site is unusable on mobile, even with desktop mode.
In desktop mode, the instructions to use the arrows appeared only after I zoomed out; they would benefit from being part of the top banner.
Would be nice to have the final listing:
1st place, 2nd place, joint third
I used ChatGPT often but switched to Lumo a few days ago. I like Lumo a lot. It almost never ends with a follow-up question, and if it does, it's a sensible/useful one. It readily searches the web if it's not quite sure what the correct answer is. Also, it's privacy-first. It's based on a Mistral model.
Oh my god. I hate this so much. Gemini’s Voice mode is trained to do this so hard that it can’t even really be prompted away. It completely derails my thought process and made me stop using it altogether.
Part of what makes it so infuriating is that it uses the same patterns so often; the other part is that it's not very good at using them: the revelation that it's Y and not X is typically incredibly banal, not some profound observation.
But it was always going to attempt some things it's not good at too often. It's these things in particular because skilled human writers do use similar flourishes quite a lot, so imitating them allows the model to superficially appear to be a good writer, which is worse than actually being a good writer but better than superficially appearing to be a bad writer.
A different training process might try to limit the model to only attempt things it can do 100% perfectly, but then there wouldn't be a lot it could do at all.
I tried ChatGPT over the holidays (paid) vs. claude.ai (paid).
After trying some prompts that worked well on Claude in ChatGPT, I understand why people are so annoyed about AI slop. The speech patterns in text output for ChatGPT are both obvious and annoying, and impossible to unsee when people use them in written communication.
Claude isn't without problems ("You're absolutely right"), but I feel that some of the perception there is around the limited set of phrases the coding agent uses regularly, and comes less from the multi-paragraph responses from the chatbot.
We're talking about a codebase here. How does "lack of curiosity" about LLMs "make a mess"?
> "probably know enough" <- that's exactly the point of the question, is the candidate clueless about AI/LLM.
Probably knows enough about what's a good vs bad change. If you're "clueless about AI/LLM" but know a bad change when you see one, how do you "make a mess?"
It's 2026, even a developer who's never touched an LLM before has heard about LLM hallucinations. If you've got programming knowledge, you should know how to make changes (e.g. you're not going to commit 200 files for a tiny change, because you know that doesn't smell right), which should guard against "making a mess."
My point is that it doesn't seem reasonable to assume symmetry here, i.e. that if you don't know both things, you'll make a mess. That would also imply everything built before 2022 was a mess, because those developers knew programming but not LLMs, which is an unreasonable claim to make.
I was too cute in trying to be terse, but I meant a mess while using AI:
> [Employers], above, are more focused on the opposite side, though: engineers who try AI once, see a mess or hallucinations, and decide it's useless. There is some learning to figure out how to wield it.
It does not matter that he vibe-coded it. It does not matter if any stars/Twitter posts were bought. He generated hype, and that's what a big AI company needs at the moment. They hire him, and they get a cut of that hype. If he's no good at generating hype in the coming months, he'll be gone. It's hype all the way down.
"[a photoshopped picture of a dog with 5 legs]...please count the legs"
Meanwhile, you could benchmark something actually useful. If you're about to say, "But that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on non-productive tasks with no real-world application. Just like "how many r's in strawberry?": "uh uh uh, it says two, urh urh"... OK, but so what? What good is a benchmark that is so far from a real use case?
The point of benchmarking that is to check for hallucinations and overfitting. Does the model actually examine the picture to count the legs, or does it just see that it's a dog and answer four because it knows dogs usually have four legs?
It's a perfectly valid benchmark and very telling.
Telling of where the boundary of competence is for these models, and showing that these models aren't doing what most people expect them to be doing, i.e. they're not counting legs, but instead inferring information from the overall image (dogs usually have 4 legs), to the detriment of fine-grained or out-of-distribution tasks.