Submissions from arxiv.org

		Soft Contamination Means Benchmarks Test Shallow Generalization (arxiv.org)
		2 points by cjbarber 14 days ago \| past \| 1 comment
		SkillsBench: Benchmarking how well agent skills work across diverse tasks (arxiv.org)
		364 points by mustaphah 14 days ago \| past \| 171 comments
		Virtual Width Networks (VWN) (arxiv.org)
		9 points by tesserato 14 days ago \| past
		CodeLogician: Neuro-symbolic reasoning for precise software analysis (arxiv.org)
		2 points by NTCTech 14 days ago \| past \| 1 comment
		Intelligent AI Delegation (2026) (arxiv.org)
		1 point by Nydhal 14 days ago \| past
		Delegated Agent Authorization Constrained to Semantic Task-to-Scope Matching (arxiv.org)
		1 point by mooreds 14 days ago \| past
		Evaluating AGENTS.md: are they helpful for coding agents? (arxiv.org)
		232 points by mustaphah 14 days ago \| past \| 161 comments
		Multi-Agent Teams Hold Experts Back (arxiv.org)
		1 point by fauigerzigerk 15 days ago \| past
		Large Language Model Reasoning Failures (arxiv.org)
		1 point by kawera 15 days ago \| past
		Towards Autonomous Mathematics Research (arxiv.org)
		107 points by gmays 15 days ago \| past \| 53 comments
		Retrieval-Aware Distillation for Transformer-SSM Hybrids (arxiv.org)
		2 points by readitalready 15 days ago \| past
		Biases in the Blind Spot: Detecting What LLMs Fail to Mention (arxiv.org)
		2 points by mpweiher 16 days ago \| past
		A Framework for Time-Updating Probabilistic Forecasts (arxiv.org)
		6 points by Luc 16 days ago \| past
		Towards Autonomous Mathematics Research (Google DeepMind) (arxiv.org)
		1 point by u1hcw9nx 16 days ago \| past
		Remote Labor Index: Measuring AI Automation of Remote Work (arxiv.org)
		2 points by Leynos 17 days ago \| past
		Generalized on-policy distillation with reward extrapolation (arxiv.org)
		3 points by fzliu 17 days ago \| past
		OpenAI model proposes and proves Physics result (arxiv.org)
		1 point by KothuRoti 17 days ago \| past
		An API for Biological Neural Networks (arxiv.org)
		1 point by bwjx 17 days ago \| past
		Adversarial Patch: images that make classifiers ignore other items in a scene (arxiv.org)
		1 point by felineflock 17 days ago \| past
		Maximum Agreement Linear Predictor (MALP) (arxiv.org)
		1 point by tesserato 17 days ago \| past \| 1 comment
		Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators (arxiv.org)
		1 point by PaulHoule 17 days ago \| past
		Fine-Tuning GPT-5 for GPU Kernel Generation (arxiv.org)
		4 points by matt_d 17 days ago \| past
		SWE-ContextBench: context learning benchmark in coding (arxiv.org)
		1 point by mustaphah 17 days ago \| past
		LLMs exceed physicians on complex text-based differential diagnosis (arxiv.org)
		3 points by rippeltippel 17 days ago \| past \| 2 comments
		Horus: A Protocol For Trustless Verification Under Uncertainty (arxiv.org)
		1 point by optimalsolver 17 days ago \| past
		Learning to Reason in 13 Parameters (arxiv.org)
		2 points by stared 17 days ago \| past
		LLM Reasoning Failures (arxiv.org)
		1 point by gradus_ad 18 days ago \| past
		Defining causal mechanism in dual process theory and 2 types of feedback control (arxiv.org)
		1 point by s6i 18 days ago \| past
		Routing LLM queries using internal success predictions (70% cost reduction) (arxiv.org)
		1 point by stansApprentice 18 days ago \| past \| 3 comments
		SWE-AGI: benchmarking spec-driven software construction (arxiv.org)
		1 point by mustaphah 18 days ago \| past \| 1 comment