Hacker News | daquisu's comments

It was a recent edit though. Yesterday's snapshot: https://web.archive.org/web/20260613072958/https://huggingfa...

How does that contradict that they uploaded the wrong model?

"I thought it was interesting and a bit underappreciated that the fraction of gold medalists at the 2025 IMO (72/630 = 11.4%) is the highest it’s been since 1981.

Crudely, IMO gold medals are awarded to the highest-scoring 1/12 of contestants. However, because scores are integers up to 42 and there’s no provision for tiebreaking, it’s possible for a lot of contestants to be tied around the threshold. In that case, either all of them get a gold medal or none do, and the fraction of gold medalists might deviate substantially from 1/12. That’s what happened this year: 46 contestants all won a gold medal by scoring exactly 35 points.

In fact, bizarrely, 35 is the mode of the scores this year; the last time the modal score was a gold medal score was in 1994. And, of course, 35 is the same score claimed by AI systems from Google, OpenAI, and others."

From https://blog.vero.site/post/imo-2025
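
To make the tie effect concrete, here is a rough sketch using only the figures quoted above (630 contestants, 72 golds, 46 tied at exactly 35). The variable names are mine, and the "1/12 target" is the crude rule from the quote, not the official medal regulation.

```python
from fractions import Fraction

# Figures quoted above for IMO 2025.
contestants = 630
gold_medalists = 72
tied_at_cutoff = 46  # contestants who scored exactly 35

# Contestants who scored strictly more than 35.
above_cutoff = gold_medalists - tied_at_cutoff

# The "crude" gold share from the quote (an approximation, not the official rule).
target = Fraction(1, 12)

# The whole tied group gets gold, or nobody in it does.
with_tie = Fraction(gold_medalists, contestants)
without_tie = Fraction(above_cutoff, contestants)

for label, frac in [
    ("target share", target),
    ("cutoff at 35 (tie included)", with_tie),
    ("cutoff at 36 (tie excluded)", without_tie),
]:
    print(f"{label}: {float(frac):.1%}")
```

With a 46-way tie sitting right at the threshold, the only two choices land well above or well below 1/12, which is why the gold fraction came out so high.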


I was under the impression that the IMO is conducted in an official "exam" capacity, on site and in a very formal setting. So I find it hard to believe _direct_ LLM usage would be a factor. Then again, it very well could be a factor in the training and preparation? I imagine "Write me a prep document for the IMO" will surface all kinds of interesting things from the training set.


> And, of course, 35 is the same score claimed by AI systems from Google, OpenAI, and others.

This is the part of the quote you're replying to.

You seemed to take "of course" as an implication that the contestants used LLMs, and that's why they got the same score as the LLMs.

I took it to mean: since this was the modal score, there seemed to be 35 points worth of significantly easier answers (relatively speaking) than the remaining points, so it's not a surprise that LLMs got the same easier bits right. (Though I doubt all contestants got their points on exactly the same answers.)

But it's certainly unclear what exactly the author meant.


Later in the same blog post, the author says:

> We can also consider the IMO 2025 problems individually. In the Epoch AI newsletter, Greg Burnham combines a subjective analysis with Evan Chen’s MOHS ratings to argue that the first five problems at IMO 2025 were unusually easy and the sixth was unusually hard, so it’s not surprising that the first five problems were exactly the ones solved by these AIs. Though I’m not sure the MOHS scale is rigorous enough to make sense as the x-axis of a bar chart, it’s easy to corroborate the high-level story with the official IMO statistics. Based on average scores, this year’s Problem 6 was the fourth hardest and its Problem 3 was by far the easiest of all Problem 3s and 6s since 2000.

In the linked MaxProof paper, the section "6.3.1. Per-Problem Analysis" shows the same behavior: 7/7 in the first 5 problems, 0/7 in the last problem.


This is not bizarre. It's a reflection of how the IMO is scored: 6 questions, each scored from 0 to 7, but partial credit is rare. A 35 is really a score of 5/6.


> Partial credit is rare

It's not rare at all. I can't find the 2026 results, but here are the 2025 ones.

https://www.imo-official.org/results/individual/year/2025/

The top of the table is full of 7s and the bottom is full of 0s, but in the middle there are a lot of intermediate scores. It's not uncommon to see "7 ? 0 7 ? 0", because the 1st and 4th problems are usually the easiest and the 3rd and 6th the hardest. But there are a lot of other combinations due to stupid mistakes, lucky solutions, and different personal styles/preferences that make some problems easier or harder for each contestant.


You're right, it's less uncommon than I recalled, and thanks for the source. But I don't think that 35/42 is suspicious/bizarre, and it does look like the 5x7 scores make up the bulk of the 35s.


I agree. If 6 was too hard and 3 "easy", then it would be common to get 777770.

I just tried that since I read your comment and it is really helping me. Thanks!


Now it is even easier. Cloudflare has a beta product called AI Search that implements most of these 160 lines of code.


12.5 million a year for a hundred people seems reasonable? That's 125k per person per year. GP still said "a few hundred", though: two hundred would drop that value to 62.5k per person.
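
The arithmetic as a quick sanity check. The function name is just for illustration, and the headcounts are the two cases mentioned above.

```python
def per_person(yearly_total: float, headcount: int) -> float:
    """Split a yearly total evenly across a headcount."""
    return yearly_total / headcount

yearly_total = 12_500_000  # 12.5 million a year

for headcount in (100, 200):
    print(f"{headcount} people: {per_person(yearly_total, headcount):,.0f} per person per year")
```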


Firefox on mobile works with ublock. It can also play videos even with the screen locked, although you do have to unpause it after locking the screen.


> The "You are an expert software engineer" really helps?

Anecdata, but it weirdly helped me. It seemed like BS to me until I tried it.

Maybe because good code is contextual? Sample code written to explain a concept may be simpler than production-ready code. The model may be capable of both but can't properly distinguish which one is the correct thing to do.

I don't know.


Maybe it's not the "expert" part but the "software engineer" part that works? Essentially it's given a role. This constrains it a bit; e.g. it's not going to question the overall plan. Maybe this helps it take a subordinate position rather than that of an advisor or teacher, which may help when there is a clear objective with clear boundaries laid out? Anyway, I will try it myself and simply observe if it makes a difference.


That is a common narrative but Google had LaMDA as an LLM with over 100B parameters before the ChatGPT release. There was even a Xoogler that claimed it was alive.

From my POV Google could have released a good B2C LLM before OpenAI, but it would compete with their own Ads business.


True, actually people forget that quite good LLMs existed 2-3 years before ChatGPT, from Google, Microsoft, Facebook… OpenAI itself open-sourced GPT-2 all the way back in 2019 and had a GPT-3 API service for years before ChatGPT.

The breakthrough that ChatGPT brought was not technical, but the foresight to bet on laborious human-feedback fine-tuning to make LLMs somewhat controllable and practical. All those previous LLMs were mostly as “intelligent” as the GPT-3.5 that ChatGPT was built on, but they hallucinated so much, and it was so easy to manipulate them into being horribly racist and such. They remained niche tech demos until OpenAI trained them, not with new tech really, just the right vision and lots of expensive experimentation.


Which better measurement do you propose?


It is done by the extension without any fancy stuff. Extensions can load static js / css and bypass CSP with it, if it is declared in their manifest.json. Grammarly's manifest.json is here: https://gist.github.com/Daquisu/11eb1a7000b4141c4404edcc6e16...

For a more advanced CSP bypass with an extension, you can:

1. Inject JS code into any webpage with a CSP.

2. Create an event listener in your content script and react to the events it receives.

3. Use your content script to communicate with the background script.

4. Use the background script to communicate with any website, including websites blocked by the CSP.

Basically, any website <-> extension content script <-> background script <-> any website.
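
A toy model of that chain, in plain Python rather than real extension code. It only simulates who is allowed to talk to whom. Every class, method, and origin below is hypothetical, and the real browser extension APIs look different.

```python
class Page:
    """A web page whose CSP only allows connections to listed origins."""

    def __init__(self, origin, allowed_origins):
        self.origin = origin
        self.allowed_origins = set(allowed_origins)

    def fetch(self, target_origin):
        # A direct request from page code is checked against the page's CSP.
        if target_origin not in self.allowed_origins:
            raise PermissionError(f"CSP blocks {target_origin}")
        return f"response from {target_origin}"


class BackgroundScript:
    """Runs in the extension's own context, so the page's CSP is not consulted."""

    def on_message(self, message):
        # Step 4: talk to any website on behalf of the page.
        return f"response from {message['target']}"


class ContentScript:
    """Sits between the page and the background script."""

    def __init__(self, background):
        self.background = background

    def on_page_event(self, event):
        # Steps 2 and 3: listen for the page's event, forward it to the background.
        return self.background.on_message({"target": event["target"]})


page = Page("https://site.example", allowed_origins=["https://site.example"])
content_script = ContentScript(BackgroundScript())

try:
    page.fetch("https://blocked.example")
    direct_result = "allowed"
except PermissionError:
    direct_result = "blocked"

# Step 1: injected page code raises an event instead of fetching directly.
relayed_result = content_script.on_page_event({"target": "https://blocked.example"})

print(direct_result)
print(relayed_result)
```

The point of the model is only that the CSP check lives on the page, while the request that actually goes out is made from a context the page's policy does not cover.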

