Hacker News | cahaya's comments

Building Cursor IDE, but for knowledge workers. The domain will go live soon at https://document.bot

Hoping OpenAI/Codex will launch this soon too.


I run Codex using Termius, iOS, and a VPS.


Hoping Cursor will also adopt this.


Cursor already has agents accessible from web/mobile https://cursor.com/blog/agent-web


But can those web/mobile-accessible agents be on your own hardware, e.g. your desktop at home?


Those are cloud agents.


Nice. Curious about the 5.3-codex-high results.


  Codex app-server is the interface Codex uses to power rich clients (for example, the Codex VS Code extension). Use it when you want a deep integration inside your own product.
It mentions 'inside your own product', but I'm not sure whether that also covers your own commercial application.


I think it's permissible. Zed uses it to power their Codex integration. OpenAI has been quite vocal about it.


Same question here. A while ago I read rumors that OpenAI might build a "Login with OpenAI" (comparable to Login with Apple, Facebook, or Google) so people can also use their existing subscription in commercial apps. I hope it's true.


looks nice!


Lots of OCR/LLM models (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables aren't understood correctly. LlamaIndex also fails miserably at these things.

Curious to hear which OCR/LLM excels at these specific issues. Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only get this table parsed correctly by first converting the table headers manually into HTML as example output. Even then, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist


> Lots of OCR/LLM models (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML:

But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "Can take letters in images and make into digital text" to "Can replicate anything seen on a screen", the problem-space gets too big.

For those images you have, I'd use something like Magistral + structured outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
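A minimal sketch of that two-pass idea, assuming a generic `call_llm` callable standing in for whatever model client you use (the prompts, `ColumnSpec` schema, and field names here are illustrative, not Magistral's actual API):

```python
import json
from dataclasses import dataclass

@dataclass
class ColumnSpec:
    name: str
    kind: str  # e.g. "text" or "checkbox"

def infer_schema(call_llm, page):
    """Pass 1: ask the model only for the table's column structure."""
    raw = call_llm(f"Return this table's columns as JSON [{{name, kind}}]: {page}")
    return [ColumnSpec(**c) for c in json.loads(raw)]

def extract_rows(call_llm, page, schema):
    """Pass 2: extract row data constrained to the known columns."""
    cols = [c.name for c in schema]
    raw = call_llm(f"Extract each row as JSON with keys {cols}: {page}")
    # Drop rows whose keys don't match the schema instead of
    # silently accepting mixed-up columns.
    return [r for r in json.loads(raw) if set(r) == set(cols)]
```

The point of splitting the passes is that merged headers get resolved once, up front, instead of being re-guessed (and mixed up) row by row.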


> But that's something else, that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.


Personally I think there's a meaningful distinction between "can extract text" vs "can extract text and structure". It's true that some OCR systems do try to replicate structure, but even today I think that's the exception, not the norm.

Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.





I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.


Nice. Seems like I can't run this on my Apple silicon M chips, right?


If you have 64 GB of RAM you should be able to run the 4-bit quantized mlx models, which are specifically for the Apple silicon M chips. https://huggingface.co/collections/mlx-community/qwen3-next-...


Got 32 GB, so I was hoping I could use ollm to offload it to my SSD. Slower, but it makes it possible to run bigger models (in emergencies).


I can host it on my M3 laptop at around 30-40 tokens per second using mlx_lm's server command:

  mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit --trust-remote-code --port 4444

I'm not sure if there's support for Qwen3-Next in any releases yet; when I set up the Python environment I had to install mlx_lm from source.
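Once that server is up, it should answer OpenAI-style chat requests over plain HTTP. A small sketch of a client, assuming the `/v1/chat/completions` path and port 4444 from the command above (check your mlx_lm version's docs if the endpoint differs):

```python
import json
import urllib.request

def build_chat_request(prompt, port=4444,
                       model="mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit"):
    """Build an OpenAI-style chat completion request for a local mlx_lm server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Only works while mlx_lm.server is actually running.
    with urllib.request.urlopen(build_chat_request("Say hello")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```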


This particular one may not work on M chips, but the model itself does. I just tested a different sized version of the same model in LM Studio on a Macbook Pro, 64GB M2 Max with 12 cores, just to see.

Prompt: Create a solar system simulation in a single self-contained HTML file.

qwen3-next-80b, 4bit (MLX format, 44.86 GB): 42.56 tok/sec, 2523 tokens, 12.79s to first token

- note: looked like ass, simulation broken, didn't work at all.

Then as a comparison for a model with a similar size, I tried GLM.

GLM-4-32B-0414-8bit (MLX format, 36.66 GB): 9.31 tok/sec, 2936 tokens, 4.77s to first token

- note: looked fantastic for a first try, everything worked as expected.

Not a fair comparison (4-bit vs 8-bit), but it's some data. The tok/sec on a Mac is pretty good depending on the models you use.


Depends how much RAM yours has. Get a 4-bit quant and it'll fit in ~40-50 GB depending on context window.

And it'll run at around 40 t/s depending on which one you have.


I haven't tested on Apple machines yet, but gpt-oss and qwen3-next should work, I assume. The Llama3 versions use CUDA-specific loading logic for a speed boost, so those definitely won't work.


Nice, but can somebody tell me if this performs better than my simple Postgres MCP using npx? My current setup uses the LLM to search through my local Postgres in multiple steps. I guess pgmcp does the multiple steps in the background and returns the final result to the LLM calling the MCP tool?

Codex:

  [mcp_servers.postgresMCP]
  command = "npx"
  args = ["-y", "@modelcontextprotocol/server-postgres", "postgresql://user:password@localhost:5432/db"]

Cursor:

  "postgresMCP": {
    "command": "npx",
    "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://user:password@localhost:5432/db"]
  },

With my setup I can easily switch between LLMs.


Nice! Is there a way for the agent to know about its own queries / resource usage?

E.g. the agent could actively monitor memory/CPU/time usage of a query and cancel it if it's taking too long?
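Plain Postgres already has the hooks for that: `statement_timeout` caps any single query, and `pg_stat_activity` plus `pg_cancel_backend` let a watchdog cancel stragglers. A sketch of the SQL such an agent could run (generic Postgres facilities, not something pgmcp exposes as far as I know):

```python
def watchdog_sql(max_seconds):
    """SQL that cancels any query running longer than max_seconds.

    pg_stat_activity and pg_cancel_backend are standard Postgres; an agent
    could run this periodically, or once before giving up on a tool call.
    """
    return f"""
        SELECT pg_cancel_backend(pid)
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '{int(max_seconds)} seconds'
          AND pid <> pg_backend_pid();
    """

def timeout_sql(max_ms):
    """Alternatively, cap every query up front for the session."""
    return f"SET statement_timeout = {int(max_ms)};"
```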


Are there any existing scripts/tools for using these evolutionary algorithms at home with e.g. Codex/GPT-5/Claude Code?


The DSPy approach seems rather similar to that: https://dspy.ai/tutorials/gepa_ai_program/

