Recent AI research has advanced formal verification, multimodal knowledge editing, and self-evolving agent skills. Inductive Deductive Synthesis (IDS) addresses the gap in formal guarantees of full coverage, achieving 7/7 in about 6.8 hours and $106 per spec on average. Agentic Proving for Program Verification reports that Claude generates arguably valid specifications for 98.8% of problems and certifies implementations against correct ground-truth specifications for 87.5% of problems. SkillOpt is a systematic, controllable text-space optimizer for agent skills that achieves best or tied results on all 52 evaluated cells. Energy per Successful Goal (EpG) is a cross-layer measurement framework that redefines the unit of AI energy accounting; it shows that agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines.
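The goal-level accounting behind EpG can be illustrated with a minimal sketch. It assumes EpG is the total energy spent across all attempts, failed ones included, divided by the number of goals that succeeded; the paper's exact definition may differ. The `Run` type, its field names, and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One attempt at a goal (hypothetical record, not the paper's schema)."""
    energy_joules: float  # energy measured for this attempt
    succeeded: bool       # whether the attempt reached its goal


def energy_per_successful_goal(runs):
    """Assumed definition: total energy of all attempts / successful goals."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        # No goal was reached, so the cost per success is unbounded.
        return float("inf")
    total_energy = sum(r.energy_joules for r in runs)
    return total_energy / successes


# Made-up measurements for illustration only.
linear = [Run(100.0, True), Run(100.0, True), Run(100.0, False), Run(100.0, True)]
agentic = [Run(400.0, True), Run(500.0, True), Run(450.0, False), Run(450.0, True)]

ratio = energy_per_successful_goal(agentic) / energy_per_successful_goal(linear)
print(f"agentic/linear EpG ratio: {ratio:.2f}")
```

Under this assumed definition, energy spent on failed attempts is charged to the goals that did succeed, so a workflow with many retries looks more expensive than a per-request average would suggest.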
Large language models (LLMs) have spurred a growing body of work on knowledge-work AI in areas such as coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. A proposed three-step approach makes explicit how benchmarked tasks represent the work claims attached to their scores, covering task mapping, tested settings, and scoring. The approach is demonstrated through three benchmark case analyses: GDPval, OfficeQA Pro, and APEX-SWE. Separately, the Foundation Protocol (FP) is a graph-first coordination layer for an emerging human-AI society that unifies heterogeneous entities and supports native multi-party organization and event-based collaboration.
Strategic reasoning in large language models has also seen progress. GENSTRAT is a procedurally generated strategic environment that evaluates model competence across six axes: state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness. Its capability profile and jaggedness measure offer a deployment-relevant diagnostic that the overall ranking alone cannot provide. In addition, a theory of accountability boundaries in agentic ecosystems introduces accountability assets and three boundary strategies: component, integrated, and dual-track.
Key Takeaways
- Inductive Deductive Synthesis (IDS) achieves 7/7 in about 6.8 hours and $106 per spec on average.
- Agentic Proving for Program Verification shows Claude generates arguably valid specifications for 98.8% of problems.
- SkillOpt is a systematic, controllable text-space optimizer for agent skills, achieving best or tied results on all 52 evaluated cells.
- Energy per Successful Goal (EpG) redefines the unit of AI energy accounting, showing agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines.
- The Foundation Protocol (FP) unifies heterogeneous entities and supports native multi-party organization and event-based collaboration.
- GENSTRAT evaluates model competence across six axes: state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness.
- The capability profile and jaggedness measure provide a deployment-relevant diagnostic that the overall ranking alone cannot provide.
- The theory of accountability boundaries in agentic ecosystems introduces accountability assets and three boundary strategies: component, integrated, and dual-track.
- BOHM extracts a hierarchical attribution tree directly from the routing weights of compound AI systems, providing multi-resolution attribution at every level simultaneously.
- NeuroNL2LTL is a neurosymbolic architecture that unifies learned translation with formal verification, achieving 28% semantic equivalence with reference specifications.
Sources
- Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
- Agentic Proving for Program Verification
- Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
- SkillOpt: Executive Strategy for Self-Evolving Agent Skills
- Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems
- From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
- ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
- SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
- RMA: an Agentic System for Research-Level Mathematical Problems
- AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
- The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
- PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning
- EVE-Agent: Evidence-Verifiable Self-Evolving Agents
- Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems
- Parallel Context Compaction for Long-Horizon LLM Agent Serving
- DART: Semantic Recoverability for Structured Tool Agents
- Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning
- When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
- EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
- MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
- One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
- Solving the Aircraft Disassembly Scheduling Problem
- CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem
- Design and Report Benchmarks for Knowledge Work
- Foundation Protocol: A Coordination Layer for Agentic Society
- GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
- Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems
- Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions
- SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
- NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic
- BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems