Honestly, that's the only way I've ever been able to trust the output. Once you go beyond the scope of one file it really degrades. But within a single file I've seen amazing results.
Are you not supposed to include as many _preconditions_ (in the form of test cases or function constraints like "assert" macro in C) as you can into your prompt describing an input for a particular program file before asking AI to analyze the file?
Please, read my reply to one of the authors of Angr, a binary analysis tool. Here is an excerpt:
> A "brute-force" algorithm (an exhaustive search, in other words) is the easiest way to find an answer to almost any engineering problem. But it often must be optimized before being computed. The optimization may be done by an AI agent based on neural nets, or a learning Mealy machine.
> Isn't it interesting what is more efficient: neural nets or a learning Mealy machine?
...Then I describe what is a learning Mealy machine. And then:
> Some interesting engineering (and scientific) problems are: - finding an input for a program that hacks it; - finding a machine code for a controller of a bipedal robot, which makes it able to work in factories;