Hacker News
Soft Contamination Means Benchmarks Test Shallow Generalization (arxiv.org)
2 points by cjbarber 14 days ago | past | 1 comment
SkillsBench: Benchmarking how well agent skills work across diverse tasks (arxiv.org)
364 points by mustaphah 14 days ago | past | 171 comments
Virtual Width Networks (VWN) (arxiv.org)
9 points by tesserato 14 days ago | past
CodeLogician: Neuro-symbolic reasoning for precise software analysis (arxiv.org)
2 points by NTCTech 14 days ago | past | 1 comment
Intelligent AI Delegation (2026) (arxiv.org)
1 point by Nydhal 14 days ago | past
Delegated Agent Authorization Constrained to Semantic Task-to-Scope Matching (arxiv.org)
1 point by mooreds 14 days ago | past
Evaluating AGENTS.md: are they helpful for coding agents? (arxiv.org)
232 points by mustaphah 14 days ago | past | 161 comments
Multi-Agent Teams Hold Experts Back (arxiv.org)
1 point by fauigerzigerk 15 days ago | past
Large Language Model Reasoning Failures (arxiv.org)
1 point by kawera 15 days ago | past
Towards Autonomous Mathematics Research (arxiv.org)
107 points by gmays 15 days ago | past | 53 comments
Retrieval-Aware Distillation for Transformer-SSM Hybrids (arxiv.org)
2 points by readitalready 15 days ago | past
Biases in the Blind Spot: Detecting What LLMs Fail to Mention (arxiv.org)
2 points by mpweiher 16 days ago | past
A Framework for Time-Updating Probabilistic Forecasts (arxiv.org)
6 points by Luc 16 days ago | past
Towards Autonomous Mathematics Research (Google DeepMind) (arxiv.org)
1 point by u1hcw9nx 16 days ago | past
Remote Labor Index: Measuring AI Automation of Remote Work (arxiv.org)
2 points by Leynos 17 days ago | past
Generalized on-policy distillation with reward extrapolation (arxiv.org)
3 points by fzliu 17 days ago | past
OpenAI model proposes and proves Physics result (arxiv.org)
1 point by KothuRoti 17 days ago | past
An API for Biological Neural Networks (arxiv.org)
1 point by bwjx 17 days ago | past
Adversarial Patch: images that make classifiers ignore other items in a scene (arxiv.org)
1 point by felineflock 17 days ago | past
Maximum Agreement Linear Predictor (MALP) (arxiv.org)
1 point by tesserato 17 days ago | past | 1 comment
Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators (arxiv.org)
1 point by PaulHoule 17 days ago | past
Fine-Tuning GPT-5 for GPU Kernel Generation (arxiv.org)
4 points by matt_d 17 days ago | past
SWE-ContextBench: context learning benchmark in coding (arxiv.org)
1 point by mustaphah 17 days ago | past
LLMs exceed physicians on complex text-based differential diagnosis (arxiv.org)
3 points by rippeltippel 17 days ago | past | 2 comments
Horus: A Protocol For Trustless Verification Under Uncertainty (arxiv.org)
1 point by optimalsolver 17 days ago | past
Learning to Reason in 13 Parameters (arxiv.org)
2 points by stared 17 days ago | past
LLM Reasoning Failures (arxiv.org)
1 point by gradus_ad 18 days ago | past
Defining causal mechanism in dual process theory and 2 types of feedback control (arxiv.org)
1 point by s6i 18 days ago | past
Routing LLM queries using internal success predictions (70% cost reduction) (arxiv.org)
1 point by stansApprentice 18 days ago | past | 3 comments
SWE-AGI: benchmarking spec-driven software construction (arxiv.org)
1 point by mustaphah 18 days ago | past | 1 comment
