New Research Shows AI Performance Gains as TempoBench Creates Metrics

Research Brief

Key Takeaways

• Key findings from research papers

Sources

LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
PLACO: A Multi-Stage Framework for Cost-Effective Performance in Human-AI Teams
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models
Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare
AI-Care: A Conversational Agentic System for Task Coordination in Alzheimer's Disease Care
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
Biological Plausibility and Representational Alignment of Feedback Alignment in Convolutional Networks
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
Reconciling Consistency-Based Diagnosis with Actual-Causality-Based Explanations
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking
Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
MDGYM: Benchmarking AI Agents on Molecular Simulations
Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Containment Verification: AI Safety Guarantees Independent of Alignment
When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning
FORTIS: Benchmarking Over-Privilege in Agent Skills
CIVeX: Causal Intervention Verification for Language Agents
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Evidence Over Plans: Online Trajectory Verification for Skill Distillation
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
Beyond ESG Scores: Learning Dynamic Constraints for Sequential Portfolio Optimization
SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
Explainable Knowledge Tracing via Probabilistic Embeddings and Pattern-based Reasoning
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection
Functional Stable Model Semantics and Answer Set Programming Modulo Theories
Weighted Rules under the Stable Model Semantics
CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents
Primal-Dual Guided Decoding for Constrained Discrete Diffusion
Attribution-based Explanations for Markov Decision Processes
Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning
Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
Optimizer-Induced Mode Connectivity: From AdamW to Muon
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver
Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
Verifiable Process Rewards for Agentic Reasoning
Positive Alignment: Artificial Intelligence for Human Flourishing
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings
How Mobile World Model Guides GUI Agents?
GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic
Agent-X: Full Pipeline Acceleration of On-device AI Agents
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
SkillEvolver: Skill Learning as a Meta-Skill
PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
Bridging Sequence and Graph Structure for Epigenetic Age Prediction
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
Budget-Efficient Automatic Algorithm Design via Code Graph
PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies
diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing
MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study
New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability
Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem
SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
From Single-Step Edit Response to Multi-Step Molecular Optimization
Prospective Compression in Human Abstraction Learning
expo: Exploration-prioritized policy optimization via adaptive KL regulation and Gaussian curriculum sampling
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
Medical Model Synthesis Architectures: A Case Study
Unpredictability dissociates from structured control in language agents
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
Cplus2ASP: Computing Action Language C+ in Answer Set Programming
WindINR: Latent-State INR for Fast Local Wind Query and Correction in Complex Terrain
Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay
CHAINTRIX: A multi-pipeline LLM-augmented framework for automated smart-contract security auditing
Dsat: A Native SAT Solver for Discrete Logic
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
CLEF: EEG Foundation Model for Learning Clinical Semantics
PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering
Probing Cross-modal Information Hubs in Audio-Visual LLMs
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge
Deep Arguing
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
SLASH the Sink: Sharpening Structural Attention Inside LLMs
EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach
Automated Approach for Solving Infinite-state Polynomial Reachability Games
M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Workspace Optimization: How to Train Your Agent
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
EpiGraph: A Knowledge Graph and Benchmark for Evidence-Intensive Reasoning in Epilepsy
Position: Avoid Overstretching LLMs for every Enterprise Task
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
Towards Conversational Medical AI with Eyes, Ears and a Voice
Agentic MIP Research: Accelerated Constraint Handler Generation
From Holo Pockets to Electron Density: GPT-style Drug Design with Density
MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
What Will Happen Next: Large Models-Driven Deduction for Emergency Instances
Evaluating Developmental Cognition Capabilities of LLMs
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective
Playing games with knowledge: AI-Induced delusions need game theoretic interventions
Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
ASIA: an Autonomous System Identification Agent
Agentic Performance at the Edge: Insights from Benchmarking
How Much is Brain Data Worth for Machine Learning?
Learning the Preferences of a Learning Agent
Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting
Internalizing Safety Understanding in Large Reasoning Models via Verification
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
C2L-Net: A Data-Driven Model for State-of-Charge Estimation of Lithium-Ion Batteries During Discharge
TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
Alignment as Jurisprudence
Constant-Target Energy Matching: A Unified Framework for Continuous and Discrete Density Estimation
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
Generalization Bounds of Emergent Communications for Agentic AI Networking
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments
MaD Physics: Evaluating information seeking under constraints in physical environments
The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
CATO: Charted Attention for Neural PDE Operators
Sufficient conditions for a Heuristic Rating Estimation Method application
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery
FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences
Mental Health AI Safety Claims Must Preserve Temporal Evidence
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
RewardHarness: Self-Evolving Agentic Post-Training
MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
Embeddings for Preferences, Not Semantics
Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction
Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring
Active Testing of Large Language Models via Approximate Neyman Allocation
Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Strategic commitments shape collective cybersecurity under AI inequality
Do Linear Probes Generalize Better in Persona Coordinates?
NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
Log analysis is necessary for credible evaluation of AI agents
Human-Inspired Memory Architecture for LLM Agents
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Route by State, Recover from Trace: STAR with Failure-Aware Markov Routing for Multi-Agent Spatiotemporal Reasoning
Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models
Emergent Semantic Role Understanding in Language Models
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
M$^3$: Reframing Training Measures for Discretized Physical Simulations
Reasoning Compression with Mixed-Policy Distillation
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
The Generalized Turing Test: A Foundation for Comparing Intelligence
Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs
Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation
LLM Jaggedness Unlocks Scientific Creativity
A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web
Why Retrying Fails: Context Contamination in LLM Agent Pipelines
The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs
Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
Data-driven Circuit Discovery for Interpretability of Language Models
