
I maintain a private evaluation set of what many call "misguided attention" questions.

In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can't reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions.

A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely.

If you systematically expand the prompt space around such questions (adding or removing minor contextual cues), you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, adding ambiguity, ...), not necessarily absence of reasoning capability.
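A minimal sketch of that prompt-space sweep, assuming you just want to enumerate variants before scoring a model on each one. The base question and the cues below are made-up placeholders, not items from the actual eval set:

```python
from itertools import combinations

# Hypothetical base question and optional contextual cues.
BASE = "I want to wash my car at the car wash 50 meters away. Do I walk or drive?"
CUES = [
    "My car is at home with me.",
    "Walking is healthier.",
    "Note: the car itself must end up at the car wash.",
]

def prompt_variants(base, cues):
    """Yield the base prompt combined with every subset of contextual cues."""
    for r in range(len(cues) + 1):
        for subset in combinations(cues, r):
            yield " ".join((base,) + subset)

variants = list(prompt_variants(BASE, CUES))
# 3 optional cues -> 2**3 = 8 variants to sweep over
print(len(variants))  # → 8
```

Scoring each variant against a model and looking for flips between "walk" and "drive" is the part left out here; the point is only that the variant space grows fast enough to expose framing sensitivity.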


Some might argue "sensitivity to framing and distributional priors" is a fancy way to say "absence of reasoning capability".


Actually, no: those are separate things, and you can see for yourself with this simple experiment. Just adding "Check your assumptions" to the prompt -- note, without specifying what assumptions! -- leads them to reason to the right answer.

You can try it with the free version of ChatGPT yourself (remember to ask the original question in a separate session to verify it hasn't been "patched" yet.)


That's what the cultivators of these examples are preying on. But in practice what people care about is "can I get it to do <X>", not "is it a decider on every possible token sequence that humans perceive to be about <X>".


But what is being pitched as "AGI" hype is the latter.


none of what we are using today is even remotely being pitched as AGI. if anything, the foundation model makers go out of their way to pitch the opposite. this is a thing made up entirely in your head, and then you put it on others and then claim it was their doing.

Fair, but that's just what hype is. Overpromise, underdeliver. Most of us recognize its limits and take advantage of its strengths. This post (and many in it) seem to be inferring that AI is useless because it isn't AGI, answered a simple question wrong, was tricked, or didn't answer perfectly. This is cherry-picking at best, disingenuous at worst.


Well, that means the AI is garbage. They'll eventually train it to answer this specific question, and then it will perform worse in some other aspect. Wash, rinse, repeat, and eventually they'll claim the new frontier model is the best yet on carwash tests.


> They'll eventually train it to answer this specific question, and then it will perform worse in some other aspect.

Not necessarily. Simply asking models to "check your assumptions" -- note, without specifying what assumptions! -- overcomes a lot of these gotcha questions. The reason it's not in their system prompts by default is, I think, just a cost optimization: https://news.ycombinator.com/item?id=47040530


Crazy how five years ago this level of AI would be seen as scifi, and now there are people out there who think it's trash because we can trick it if we ask questions in weird ways.


I think the level of AI we have is amazing.

> there are people out there who think it's trash because we can trick it if we ask questions in weird ways.

Some of this sentiment comes from wanting AI to be predictable, and for me, stumbling into questions that the current models interpret oddly is not uncommon. There are a bunch of rules of thumb that can be used to help when you run into cases like this, but no guarantee that they will work, or that the problem will remain solved after a model update, or across models.


There are a lot of rules of thumb you can follow to avoid getting bitten by a rattlesnake, but the easiest way is to just not pick up random snakes. I don't know where I'm going with this, but I am going for a walk.


When did Microsoft release that chat bot that went full nazi in a couple of hours?


2016 for those keeping score


An issue in the chat format is that all these models seem bad at recognizing when they have extraneous information from the user that can be ignored, or insufficient information from the user to answer the question fully.

This issue is compounded by the lack of probabilities in the answers, despite the machines ultimately being probabilistic.

Notice a human in a real conversation will politely ignore extra info (the distance to car wash) or ask clarifying questions (where is the car?).

Even non-STEM people answer using probabilistic terms casually (almost certainly / most likely / probably / possibly / unlikely).

I suspect some of this is to minimize token usage in the fixed-monthly-price chat models, because back-and-forth would cost more tokens... but maybe I'm too cynical.


The systems recognized that this looks like a generic internet article asking whether someone should walk or drive, and answered it exactly as expected based on their training data. None of this should be surprising.

We are the ones fooling ourselves into believing there's more intelligence in these systems than they really have. At the end of the day, it's just an impressive parlor trick.


In that sense the Google AI summary search results are a better UX for this type of experience


The better UX is that the Google AI search summary is easy to ignore.


Sounds interesting, would be nice to see the questions if you're open to sharing?


Why would you ask that? The whole point of making it private is to avoid it leaking into the training data.


Because it's interesting to me. It doesn't mean they have to share them publicly, btw.


You should publish your evaluation set, that seems pretty interesting!

What’s your favourite one?


Why would you ask that? The whole point of making it private is to avoid it leaking into the training data.


I thought open benchmarks helped, sorry, guess I was being naive.


Ha, sorry, I was a bit brusque there.

Open benchmarks do help, but they mostly help the vendors, not we the users!


Keeping tests private is the only way to keep them valid.


Absolutely. If you give enough disambiguation, then even the lowest-scoring model in the test gets it right on the first shot:

Minimax:

Question: I Want to Wash My Car at the Car Wash. The Car Wash Is 50 Meters Away from my home. I am at home. My car is at home. Do I walk or drive My Car to the car wash?

Answer:

Drive your car to the car wash.

Since your goal is to wash your car at the car wash facility, you need to bring the car there. Walking would leave your car at home, which defeats the purpose!

That said, 50 meters is such a short distance that:

    The drive takes less than a minute

    You might consider walking to the car wash first to check if there's space/availability, then drive your car over when ready

But fundamentally, if you want to wash your car at the car wash, you need to bring the car there — so drive.

Bonus, its thinking output:

Thinking: The user is asking a simple question about whether they should walk or drive their car to a car wash that is 50 meters away from their home.

This is a straightforward practical question. Let me analyze:

    1. The car wash is 50 meters away
    2. The user is at home
    3. The car is at home
    4. The goal is to wash the car

The practical considerations are:

    Driving 50 meters is a very short distance - likely less than a minute
    Starting a car, driving, and parking for such a short distance uses fuel and creates wear and tear
    Walking 50 meters would take about 30-40 seconds
    The car needs to be driven to the car wash to use the car wash facilities
    However, for just 50 meters, it might be more practical to walk if the car wash is within walking distance and you can simply walk there and use the facility

Wait, I need to think about this more carefully. The question is: "Do I walk or drive My Car to the car wash?"

If the car wash is a facility where you wash your car (like a self-service car wash or automatic car wash), then you NEED to drive the car there to use it. You can't wash a car that's at home from 50 meters away - unless you're using a mobile car wash service.

But if there's a car wash facility 50 meters away, and you want to use it, you would typically drive your car there. However, for just 50 meters, walking might seem more efficient, but then you wouldn't have your car at the car wash to wash it!

Actually, re-reading the question: "I Want to Wash My Car at the Car Wash" - this implies they want to use the car wash facility. To wash the car AT the car wash, they need to bring the car there, which means driving.

But wait - maybe they're asking about the mode of transportation to GET to the car wash (not to bring the car). Let me re-read:

    "I Want to Wash My Car at the Car Wash" - they want to wash the car at the car wash facility
    "The Car Wash Is 50 Meters Away from my home" - it's very close
    "I am at home" - they're currently at home
    "My car is at home" - the car is also at home
    "Do I walk or drive My Car to the car wash?" - should they walk or drive to get there

I think the answer is: You should drive your car to the car wash because:

    1. The purpose is to wash the car AT the car wash
    2. You need the car to be at the car wash to wash it
    3. Walking would leave the car at home, defeating the purpose

However, for such a short distance, I could also mention that walking might be more practical if they just want to use the facility quickly, or if there are parking/maneuvering concerns at the car wash.

Actually, I'll keep it simple and direct.



The article shared here seems to be a 2022 follow-up of sorts to the original blog post from 2013 that your links reference.


The (pun ahead) peak of this method imho is implemented in "PeakFinder". Afaik it uses a low-res nationwide (Switzerland) height map; after an initial GPS fix it downloads a local high-res height map, calculates peak contours based on the current location AND height, and overlays that grid, including the peak names, onto the camera feed using the gyro and compass.

It's quite easy to build accurate geo-related applications in Switzerland due to the excellent work of the government office "Swiss Topo" that maps every tree, every house, every road in the whole country. Trees in cities have metadata such as: year planted, type etc. :)

Johnny Harris, the map aficionado, mentioned Swiss maps and Swiss Topo's dedication multiple times in his videos.
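The geometry such an overlay needs is pleasantly simple. A minimal sketch, assuming a flat-earth (equirectangular) approximation that holds at typical alpine viewing distances; the function and coordinates are illustrative, not PeakFinder's actual code:

```python
import math

def bearing_and_elevation(obs_lat, obs_lon, obs_h, pk_lat, pk_lon, pk_h):
    """Approximate compass bearing (deg), elevation angle (deg), and distance
    (m) from an observer to a peak, using an equirectangular projection."""
    R = 6371000.0  # mean Earth radius in meters
    north = math.radians(pk_lat - obs_lat) * R
    east = math.radians(pk_lon - obs_lon) * math.cos(math.radians(obs_lat)) * R
    dist = math.hypot(north, east)
    bearing = math.degrees(math.atan2(east, north)) % 360.0
    elevation = math.degrees(math.atan2(pk_h - obs_h, dist))
    return bearing, elevation, dist

# Hypothetical observer and peak positions (lat/lon in degrees, height in m):
brg, elev, dist = bearing_and_elevation(46.95, 7.45, 550.0, 46.90, 7.55, 2100.0)
```

Given the phone's compass heading and gyro pitch, a peak is drawn wherever its (bearing, elevation) pair falls inside the camera's field of view; the height map is what lets the app compute which peaks are actually visible rather than occluded.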


PeakFinder is indeed an excellent app, worth every cent.

It's also available as a website! https://www.peakfinder.com/


PeakFinder works worldwide, not just in Switzerland.


The official Swiss Topo app has this AR feature built-in, by the way. Or at least on iOS. Pretty magical when it works, like most (real) AR applications. (It does lose alignment from time to time.)


Tool stuffing degrades LLM tool-use quality. 100+ tools is crazy. We probably need a tool that does relevant tool retrieval and reranking lol
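A toy sketch of what such tool retrieval could look like, using bag-of-words cosine similarity as a stand-in for real embeddings; the tool names and descriptions are made up:

```python
import math
from collections import Counter

# Hypothetical tool registry: name -> short description.
TOOLS = {
    "create_invoice": "create a new invoice for a customer order",
    "search_docs": "search the framework documentation for a topic",
    "resize_image": "resize or crop an image file",
    "send_email": "send an email message to a recipient",
}

def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query, k=2):
    """Rank tools by similarity to the query and expose only the top-k,
    keeping the model's tool list (and context window) small."""
    q = _vec(query)
    ranked = sorted(TOOLS, key=lambda n: _cosine(q, _vec(TOOLS[n])), reverse=True)
    return ranked[:k]

print(retrieve_tools("look up a topic in the documentation"))
# → ['search_docs', 'create_invoice']
```

In a real setup you would swap the word-count vectors for an embedding model and possibly add a cross-encoder reranking pass, but the shape of the idea is the same: the model only ever sees the k tools most relevant to the current request.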


Oh god, this scales badly and bloats your context window!

Just create an MCP server that does embedding retrieval or agentic retrieval with a sub agent on your framework docs.

Finally, add an instruction to AGENT.md to look up stuff using that MCP.


I'm also looking for a sophisticated self-hosted, open-source transcoding solution as a web app, but in the meantime, here's the complete opposite: no bells and whistles, no config, no control except size: https://github.com/JMS1717/8mb.local

or do you mean a web based file manager / video gallery with transcoding capabilities?


RAG will be pronounced dead again and again. It has its use cases. We moved to agentic search, keeping RAG as a tool, while other retrieval strategies we added use real-time search in the sources, often skipping ingested and chunked sources. Large context windows now allow for putting almost whole documents into one request.


Even though I work as an IT professional, I was almost always the only person not self-hosting anything at home and not having a NAS.

I took the plunge and bought a Ugreen NAS with 4 bays. The first thing I did was install TrueNAS CE on it, then use ChatGPT with highly customized prompts and the right context (my current docker-compose files).

Without much previous knowledge of docker, networking etc. except what I remembered from my IT vocational education from 15 years ago, I now have:

- Dockerized Apps

- App-Stacks in their own App-Network

- Apps that expose web UI not via ports, but via Traefik + Docker labels

- Only Traefik's port 443 reachable from WAN, plus optional port forwarding for non-HTTP services

- Optional Cloudflare Tunnel

- Automatic Traefik TLS termination for LAN and WAN for my domain

- Split-DNS to get hostnames routed properly on LAN and WAN

- CrowdSec for all exposed containers

- Optional MFA via Cloudflare for exposed services

- Local DHCP/DNS via Technitium

- Automatic ZFS snapshots and remote backups

- Separation between ephemeral App data (DBs, Logs) on SSD and large files on HDD
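For anyone curious what the Traefik-labels pattern from that list looks like, here is a hypothetical docker-compose fragment; the service names, image, and domain are placeholders, not my actual config:

```yaml
# Hypothetical fragment: expose a container through Traefik via labels
# instead of publishing its ports.
services:
  traefik:
    image: traefik:v3.0
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.websecure.address=:443
    ports:
      - "443:443"  # the only port published to the WAN
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks: [app-net]

  myapp:
    image: nginx:alpine
    networks: [app-net]
    labels:  # no ports: section; Traefik routes to the container directly
      - traefik.enable=true
      - traefik.http.routers.myapp.rule=Host(`app.example.lan`)
      - traefik.http.routers.myapp.entrypoints=websecure
      - traefik.http.routers.myapp.tls=true
      - traefik.http.services.myapp.loadbalancer.server.port=80

networks:
  app-net:
```

The nice property of this pattern is that each app stack declares its own routing via labels, so adding a service never means touching the Traefik config or opening another port.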


Reading the title, I thought: finally someone rooted/jailbroke Sony cameras.

On Canon you can run Magic Lantern, an extensive mod that adds many features to Canon cameras.

Even the Samsung NX1 had SD-card-loadable mods before they moved away from the camera market.

Rooting Sony seems impossible; I never saw anyone working on it since their full-frame lineup launched.


The last two generations of Samsung NX cameras were built around Tizen Linux, and it was (and still is) easy to get a root shell on them. They still take great photos and you can still buy them used for a good price.

NX300/NX30/NX2000 had a read-only rootfs, but for NX500 and NX1 there was a persistent mod that extended camera functionality with a menu, and you can actually SSH into them and rsync your photos... while shooting!

Background: I've recently taken over maintenance of the NX-KS mod at https://github.com/ge0rg/nx-ks-mod


Great to see another Samsung NX hacker in the wild! I'm in the process of developing a mod for my NX300 and NX30 (with the NX2000 likely compatible). It doesn't do anything yet, but I've got a lot of work done on hooking ARM code [0] and compiling modern C++ for the cameras.

Personally I think the NX300/30/2000 are the most hackable cameras ever made, even compared to the NX1/500. The read-only rootfs isn't really a barrier, since the software runs a shell script from the SD card on boot (or rather resume from hibernation; it's a pretty clever system). And unlike the newer models, they don't have an RTOS coprocessor, so everything is handled in the easier-to-modify Linux code. It's not a design decision I would have made, but it makes in-depth mods easier.

The older cameras are also easy to unbrick, since the bootloader files used for firmware flashing without a working OS were released in the FLOSS code dump. The availability of some C headers in that dump is the cherry on top.

I'll admit I'd still rather have an NX500, I just bought the NX300 because I'm cheap :)

[0]: https://gitlab.com/dvdkon/pahil


Yeah, I've documented a thing or two about the NX series on https://op-co.de/blog/tags/samsung-nx/

Regarding the RTOS, I took my NX300 from the shelf some weeks ago to make a few shots for the live demo at https://programm.froscon.org/froscon2025/talk/fc37ae17-9264-... and OH MY FSCKING GOD IT'S SLOW! I made a burst shot of a model train approaching, and the camera was busy processing it for multiple minutes. The NX500 is lightning fast in comparison, and the NX1 is even snappier.

So what do you plan to do with the ARM hook? I've poked at different places of di-camera-binary, but never at the processing pipeline, and there are soooo many things to reverse-engineer, and I'm but one person!


The possibilities are endless, so I need to make sure not to get lost in them and actually get something done :) I have a shortlist of changes to make, from surface-level to harder things:

- Allow configuring the controls. For example, the multi-purpose "focus" ring is great, but is severely hampered by having to press the "iFn" button every time.

- Add bulk upload of photos to Immich (though that could just as easily be an external script).

- Write custom widgets for the LV view, like a RAW histogram or time display. Also hide the bottom buttons that have already burned into my screen.

- Allow full electronic shutter (I already had to change this camera's shutter once).

- Add highlight metering, or rewrite the autoexposure entirely.

- Support outputting raw video.

- Tone down the denoising on out-of-camera JPEGs.

- Play with custom sensor crops, line skipping and other things to get zoomed in videos.


I had the NX1 with all the premium lenses and some photos still seem to be better than what my Sony A7-M4 shoots. But no 10bit 4:2:2 for video and no real flat profile was a bummer. I loved the persistent mod though. Sold all NX1 gear years ago, moved to a Sony A7-M3 and then A7-M4. Full Frame has some great benefits.


> Rooting Sony seems impossible; I never saw anyone working on it since their full-frame lineup launched.

On some cameras, including the older firmwares for the current cameras, https://github.com/ma1co/Sony-PMCA-RE gives you a root shell.


Yes, I'm aware of that, and nothing recent works with it; the last progress sadly was years ago.

I guess DMCA/Sony lawyers and the relatively low market share of expensive cameras are the main reasons why a PlayStation, an iPhone, or a Nintendo jailbreak is more appealing to reverse engineers than a Sony camera jailbreak.


Actually, half of the problem is vertical integration inside Sony cameras. It's all Sony from sensors to DSPs, and everything is designed and built by them.

The current firmware looks like an embedded Linux system designed for fast boot and is largely immutable, so the thing is pretty tightly locked down. You can put the board into flash mode and update the firmware, but that's all, apparently.

Someone over at DPReview was taking deltas of the file trees between firmware update packages to guess what had been updated, but going one step further was nigh impossible.

Sony doesn't even bin the DSPs from model to model, but creates model-specific ones with different model numbers, and solders DRAM on top of them for latency and signal quality, so the cameras are complete black boxes.

The only missing thing is a complete epoxy pour over the board, but that thing gets hot and needs the case as a heat-sink, so it's not possible at this stage.


The other half of the problem is what to gain from a root shell. You can't influence the stages of the image processing without a PhD in Sony DSP Reverse Engineering, and so what remains is probably hooking into the camera controls and injecting key events to re-invent time-lapse timers or bulb exposures, and removing the 30min video recording limit.

This is where the NX mod project arrived - additional hooks into the camera controls and a few changes to internal registers left over by Samsung engineers for debugging, like silent shutter or the 30min limit.


Sony's full-frame machines are so customizable out of the box already that you don't need anything to begin with, at least for normal photography needs. Maybe focus stacking, but that's an entirely new procedure.

The 30-minute recording limit has already been lifted, and advanced time-lapse was introduced alongside mammal eye tracking with a firmware update by Sony; you can customize anything except the preliminary image processing steps, and by customization I mean the parameters of the algorithms or the algorithms themselves.

Moreover, Sony's full-frame systems are already distributed systems to begin with. There are at least two DSPs running asynchronously during focus and image capture, so there may be no pathways to influence the DSPs significantly after they boot up, even.

Personally, I wouldn't muck with a system designed to do image processing at 30 FPS even before taking a photo (A7-III), incl. face and eye tracking and focus correction, without any hiccups to this day.

From what I understand, these cameras perform a nigh-impossible dance within the limits of their hardware to maximize speed and reduce energy consumption. Who am I to disrespect them for doing that? Instead, I prefer to use the thing and get nice photos out of it.


I want to write my own metering algorithms in pursuit of ETTR instead of using the current garbage left over from film cameras
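The core of expose-to-the-right metering fits in a few lines. A toy sketch, assuming linear luminance samples from the sensor; the percentile and headroom values are arbitrary illustrations, not any real camera's algorithm:

```python
import math

def ettr_compensation(samples, clip=1.0, headroom_stops=0.1, percentile=0.99):
    """Return the EV shift that places the chosen highlight percentile of
    linear luminance samples (0..clip) just below clipping, minus a small
    safety headroom. Positive means 'expose more'."""
    ordered = sorted(samples)
    hi = ordered[min(len(ordered) - 1, int(percentile * len(ordered)))]
    if hi <= 0:
        return 0.0
    target = clip * 2 ** (-headroom_stops)
    return math.log2(target / hi)

# A scene whose brightest metered values sit two stops below clipping
# should ask for roughly +2 EV (minus the 0.1-stop headroom):
print(round(ettr_compensation([0.01, 0.05, 0.1, 0.2, 0.25]), 2))  # → 1.9
```

The hard part on a real camera isn't this arithmetic; it's getting at RAW-domain (pre-tone-curve) data fast enough to meter from, which is exactly what the stock firmware doesn't expose.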


It works on the stock firmware of the FX30, which is relatively recent.


That's a feature of GitHub when renaming repos.

