The main problem with regular (forward-only) debugging is that the state -- memory, CPU registers, caches, etc. -- that contributed to the bug is completely lost. With time travel debugging that state can be saved, which is great, but now you have a pile of data to sift through as you trace the bug. AI seems like the right tool to save you that drudgery and get to the root cause sooner (or to let the AI work on it while you do other things in parallel).
This is new. Something that wouldn't have been possible without both time travel debugging and the latest AI tech (MCP, code LLMs).
It will be interesting to know what challenges came up in nudging the model to work better with time travel debug data, since this data is novel and today's models might not be well trained to make use of it.
> It will be interesting to know what challenges came up in nudging the model to work better with time travel debug data, since this data is novel and today's models might not be well trained to make use of it.
This is actually quite interesting - it's something I'm planning to write a future post about.
But basically the LLM turned out to be fairly good at using this interface effectively, so long as we tuned the set of tools we provided quite carefully:
* Where we would want the LLM to use a tool sparingly, it was better not to provide it at all. When you have time travel debugging it's usually better to work backwards, since that tells you the causality of the bug. If we gave Claude the ability to step forward it tended to use it for everything, even when inappropriate.
* LLMs weren't great at managing state they'd set up. Allowing the LLM to set breakpoints just confused it later, when it forgot they were there.
* Open-ended commands were a bad fit. For example, a time travel debugger can usually jump around in time according to an internal timebase. If the LLM was given unconstrained access to that, it tended to waste lots of effort guessing timebases and looking to see what was there.
* Sometimes the LLM just wants to hold something the wrong way and you have to let it. It was almost impossible to get the AI to understand that it could step back into a function from the previous line. It would always try going to the line, then stepping back, resulting in an overshoot. We just had to adapt the tool so that it worked the way the model thought it should.
The overall result is actually quite satisfactory, but it was a bit of a journey to understand how to give the LLM enough flexibility to generate insights without letting it get itself into trouble.
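To give a flavour of where that tuning ends up, here's a hypothetical sketch (all names invented, not our actual code) of the kind of narrowed tool surface this leads to. The interesting part is what's deliberately missing: no forward step, no breakpoints, no raw timebase jump.

```c
#include <stdio.h>

/* Hypothetical sketch of a constrained tool surface for an LLM driving a
 * time travel debugger. Only bounded, backwards-oriented operations are
 * registered; anything we'd want used "sparingly" simply doesn't exist
 * as far as the model is concerned. */

typedef struct session session;   /* opaque handle to a recording */

typedef struct {
    const char *name;
    const char *description;
    int (*invoke)(session *s, const char *args);
} llm_tool;

/* Stub backends; a real server would drive the debugger here. */
static int reverse_step(session *s, const char *a)      { (void)s; (void)a; return 0; }
static int reverse_step_into(session *s, const char *a) { (void)s; (void)a; return 0; }
static int last_write_to(session *s, const char *a)     { (void)s; (void)a; return 0; }
static int read_state(session *s, const char *a)        { (void)s; (void)a; return 0; }

static const llm_tool tools[] = {
    { "reverse_step",      "Step back one source line.",                       reverse_step },
    /* One tool per way we *want* the model to move, including the
     * "hold it wrong" case: stepping back into a call in one action. */
    { "reverse_step_into", "Step back into the function called on the previous line.", reverse_step_into },
    { "last_write_to",     "Jump back to the most recent write of a variable/address.", last_write_to },
    { "read_state",        "Inspect registers and memory at the current time.", read_state },
};

int main(void)
{
    /* Roughly what the model would see when listing the tools. */
    for (size_t i = 0; i < sizeof(tools) / sizeof(tools[0]); i++)
        printf("%-18s %s\n", tools[i].name, tools[i].description);
    return 0;
}
```

The design choice is subtractive: rather than teaching the model restraint with prompting, you remove the temptation from the tool list entirely.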
good point. maybe the central idea of how it's implemented isn't too bad: i see a hypervisor as a sort of OS kernel for VMs, and the transitions from VM to hypervisor - VM exits - akin to syscalls. of course there is more, but the above analogy is the basic idea and other things get added along the way
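the analogy is pretty literal in the Linux KVM API, where the "syscall" back into the hypervisor is an ioctl returning with an exit reason. a minimal sketch of the run loop (Linux-only, needs access to /dev/kvm, error handling elided):

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    /* Guest code, 16-bit real mode: print 'A' via a port write, then halt.
     * Both the OUT and the HLT force a VM exit back to us. */
    const uint8_t code[] = {
        0xb0, 'A',   /* mov al, 'A'  */
        0xe6, 0xf8,  /* out 0xf8, al -> KVM_EXIT_IO  */
        0xf4,        /* hlt          -> KVM_EXIT_HLT */
    };

    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* One page of guest memory at guest-physical address 0. */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, code, sizeof(code));
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0, .memory_size = 0x1000,
        .userspace_addr = (uint64_t)(uintptr_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int sz   = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, (size_t)sz, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;                      /* real mode, start at 0 */
    ioctl(vcpu, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rip = 0, .rflags = 0x2 };
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* the "kernel" loop: run the guest until it exits, service the exit
     * like a syscall, resume. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:                       /* guest did an OUT/IN */
            if (run->io.direction == KVM_EXIT_IO_OUT)
                putchar(*((const char *)run + run->io.data_offset));
            break;
        case KVM_EXIT_HLT:                      /* guest is done */
            return 0;
        default:
            fprintf(stderr, "unhandled exit %u\n", run->exit_reason);
            return 1;
        }
    }
}
```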
hi userbinator :) isn't the purpose of virtual 8086 mode somewhat different? i.e. to run real mode applications while the cpu is in protected mode? or did you mean that virtual 8086 could be generalised into a wider virtualisation system?
> or did you mean that virtual 8086 could be generalised into a wider virtualisation system?
Yes, if you look at the way V86 is implemented, it wouldn't be too hard to extend it to full virtualisation --- something like a "VMX mode task" would've been ideal.
The bit about TLBs is a bit confusing; it seems like you're talking about a software TLB, but EPT is just a second layer of address translation.
Also, after moving a VMCS from one physical CPU to another you have to do VMLAUNCH the first time you start the guest on the new CPU, because you had VMCLEARed it on the old CPU. That's it. :-)
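In sketch form (the vmx_* functions are invented stand-ins for the actual VMX instructions, stubbed so this compiles and runs outside ring 0):

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the VMCS migration dance. Real code would execute the
 * VMCLEAR/VMPTRLD/VMLAUNCH/VMRESUME instructions in ring 0. */

typedef uint64_t phys_addr;

static void vmx_vmclear(phys_addr v) { printf("VMCLEAR  %#" PRIx64 "\n", v); }
static void vmx_vmptrld(phys_addr v) { printf("VMPTRLD  %#" PRIx64 "\n", v); }
static void vmx_vmlaunch(void)       { printf("VMLAUNCH\n"); }
static void vmx_vmresume(void)       { printf("VMRESUME\n"); }

struct vcpu {
    phys_addr vmcs_pa;
    bool      launched;  /* VMLAUNCHed since the last VMCLEAR? */
};

/* On the old CPU: VMCLEAR flushes the CPU's cached copy of the VMCS and
 * puts it in the "clear" state, which is exactly why the next entry on
 * the new CPU must be VMLAUNCH rather than VMRESUME. */
static void migrate_away(struct vcpu *v)
{
    vmx_vmclear(v->vmcs_pa);
    v->launched = false;
}

/* On whichever CPU now owns the vCPU: load the VMCS and enter the guest. */
static void enter_guest(struct vcpu *v)
{
    vmx_vmptrld(v->vmcs_pa);
    if (!v->launched) {
        vmx_vmlaunch();   /* required first entry after a VMCLEAR */
        v->launched = true;
    } else {
        vmx_vmresume();   /* required while the VMCS stays launched */
    }
}

int main(void)
{
    struct vcpu v = { .vmcs_pa = 0x1000, .launched = false };
    enter_guest(&v);   /* first run: VMPTRLD + VMLAUNCH */
    enter_guest(&v);   /* same CPU:  VMPTRLD + VMRESUME */
    migrate_away(&v);  /* old CPU:   VMCLEAR            */
    enter_guest(&v);   /* new CPU:   VMPTRLD + VMLAUNCH */
    return 0;
}
```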
no, you didn't overlook anything. the article doesn't discuss the actual mechanics of DRAM init, so thank you for adding this info :) i know there is a process of memory training whose aim is to arrive at the right parameters for that DRAM. the way i see it, it's a sort of in-field calibration. boot firmware can then store those parameters inside the BIOS chip and on the next reboot just reuse them, because memory training is a time-consuming process.
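a sketch of that flow (every name here is invented for illustration; real firmware, e.g. coreboot's "MRC cache", follows the same shape but talks to SPI flash, the SPD EEPROMs on the DIMMs, and the memory controller):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct dram_params {
    uint32_t checksum;     /* integrity check over the fields below */
    uint32_t dimm_id;      /* identifies the populated DIMMs (from SPD) */
    uint8_t  timings[64];  /* delays/voltages discovered by training */
};

/* stand-in checksum so the sketch is self-contained (not a real CRC) */
static uint32_t checksum_of(const struct dram_params *p)
{
    const uint8_t *b = (const uint8_t *)&p->dimm_id;
    uint32_t c = 0;
    for (size_t i = 0; i < sizeof(*p) - sizeof(p->checksum); i++)
        c = c * 131 + b[i];
    return c;
}

/* stubs standing in for hardware access */
static struct dram_params flash_copy;
static bool flash_valid = false;
static bool flash_read(struct dram_params *out)
    { if (flash_valid) *out = flash_copy; return flash_valid; }
static void flash_write(const struct dram_params *in)
    { flash_copy = *in; flash_valid = true; }
static uint32_t read_spd_dimm_id(void) { return 0xd1a4; }
static void train_memory(struct dram_params *p)        /* the slow part */
    { memset(p->timings, 0x42, sizeof(p->timings)); }
static void apply_params(const struct dram_params *p)
    { (void)p; /* program the memory controller */ }

void dram_init(void)
{
    struct dram_params p;
    uint32_t dimms = read_spd_dimm_id();

    /* fast path: reuse the previous training results if the saved copy
     * is intact and the DIMM population hasn't changed since last boot */
    if (flash_read(&p) && p.checksum == checksum_of(&p) && p.dimm_id == dimms) {
        apply_params(&p);
        return;
    }

    /* slow path: run full training (the in-field calibration), then
     * save the result so subsequent boots can skip it */
    train_memory(&p);
    p.dimm_id = dimms;
    p.checksum = checksum_of(&p);
    flash_write(&p);
    apply_params(&p);
}

int main(void) { dram_init(); dram_init(); return 0; } /* 2nd call: fast path */
```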
Like others here, I strongly recommend reading the IA manuals on this subject, as well as the equivalent AMD doco. Most of the processor part (but not the firmware part) of this subject is covered in the manufacturer doco.
And yes, one has to be careful about outdated information.