Recent work spans multimodal datasets, large language models, and agentic systems across a wide range of applications. GroupAffect-4 is a multimodal dataset of four-person collaborative interaction, while AgentAtlas extends the line of work on diagnosing the limitations of direct accuracy columns for deployable agents. Interaction Locality in Hierarchical Recursive Reasoning proposes a task-geometry-aware framework for measuring information flow in hierarchical and recursive reasoning models. VBFDD-Agent is a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems, while Teaching AI Through Benchmark Construction introduces QuestBench, a course-based practice that teaches AI through benchmark construction. Conditional Equivalence of DPO and RLHF proves that the equivalence between the two is conditional rather than universal, depending on an implicit assumption that is frequently violated in practice. Declarative Data Services proposes an architecture for structured agentic discovery of data-system compositions from declarative user intent. Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines evaluates both techniques for latency-sensitive workflows in industrial asset operations. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination identifies two groups of attention heads with opposing causal roles in modality-conflict hallucination. Not all uncertainty is alike: volatility, stochasticity, and exploration studies the asymmetry between volatility and stochasticity in driving optimal exploration. Evaluating the Utility of Personal Health Records in Personalized Health AI assesses whether large language models can give helpful answers to user health queries when given clinical data from personal health records (PHRs) as context. Progressive Autonomy as Preference Learning formalizes trust calibration for agentic tool use as a preference-learning problem.
DecisionBench introduces a benchmark substrate for emergent delegation in long-horizon agentic workflows. Embedding by Elicitation proposes dynamic representations for Bayesian optimization of system prompts. Learning to Hand Off proposes a framework for workflow learning in a setting where specialized agents hand off control through a shared artifact. Interference-Aware Multi-Task Unlearning introduces a framework for multi-task unlearning with two settings: full-task unlearning and partial-task unlearning. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses tests that question with experiments on disaster preparedness responses. Hallucination as Exploit proposes evidence-carrying multimodal agents that treat free-form model text as inadmissible evidence. Discoverable Agent Knowledge proposes a formal framework for agentic knowledge graph affordances. SimGym proposes a framework for simulating A/B tests on e-commerce storefronts using vision-language model agents. MOCHA proposes multi-objective Chebyshev annealing for agent skill optimization, and PlanningBench generates scalable and verifiable planning data for evaluating and training large language models. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling uses signed graph modeling to make multi-agent reasoning resilient to conflict. Generative-Evaluative Agreement proposes a validity criterion for LLM-enabled adaptive assessment. BLINKG proposes a benchmark for LLM-integrated knowledge graph generation. Efficient Elicitation of Collective Disagreements proposes a stratified framework for identifying the minimal aggregated preference information needed to compute a number of disagreement measures. Towards Multi-Model LLM Schedulers studies how different LLMs behave across hardware platforms, with empirical insights into offloading and preemption. Library Drift proposes a framework for diagnosing and fixing a silent failure mode in self-evolving LLM skill libraries.
EMO-BOOST proposes a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with an emotion-based deepfake detector. When Tabular Foundation Models Meet Strategic Tabular Data proposes a prior alignment approach for adapting tabular foundation models to strategic tabular data. Transforming Constraint Programs to Input for Local Search describes a link between symmetry properties of constraint optimization problems and local search neighborhoods. EngiAI proposes a benchmark suite with three evaluation dimensions for LLM-driven engineering design. Pseudocode-Guided Structured Reasoning uses pseudocode to guide adaptive structured reasoning in vision-language models. Projecting Latent RL Actions projects latent RL actions with the aim of generalizable and scalable graph combinatorial optimization. CogScale proposes a benchmark of 14 scalable synthetic tasks for evaluating sequence processing. Distribution-Free Uncertainty Quantification applies distribution-free uncertainty quantification to continuous AI agent evaluation. Explainable Wastewater Digital Twins proposes explainable digital twins for wastewater treatment plants. PEEK proposes a system that caches and maintains reusable orientation knowledge about recurring external contexts. From Prompts to Pavement Through Time proposes a framework for temporal grounding in agentic scene-to-plan reasoning. Robotics-Inspired Guardrails proposes a framework for runtime behavioral control over interaction trajectories in socially sensitive domains. AutoResearchClaw proposes a multi-agent autonomous research pipeline for scientific discovery. GeoX proposes a self-play framework for acquiring spatial logic through executable programs. Probabilistic Tiny Recursive Model introduces a probabilistic tiny recursive model.
When Skills Don't Help reports a negative result on procedural knowledge for tool-grounded agents in offensive cybersecurity. A Methodology for Selecting and Composing Runtime Architecture Patterns presents such a methodology for production LLM agents. Using Aristotle API for AI-Assisted Theorem Proving presents a case study of formalising the Grasshopper Problem in Lean 4 with the Aristotle API. Neurosymbolic Learning for Inference-Time Argumentation applies neurosymbolic learning to inference-time argumentation. HaorFloodAlert proposes a deseasonalized ML ensemble for 72-hour flood prediction in Bangladesh Haor wetlands. Not Every Rubric Teaches Equally proposes policy-aware rubric rewards for RLVR. Operationalizing Document AI proposes a microservice architecture for OCR and LLM pipelines in production. Learn-by-Wire Training Control Governance proposes a governance framework for bounded autonomous training control. KAN-MLP-Mixer investigates Kolmogorov-Arnold Networks (KANs) for improving IMU-based human activity recognition. Probing Embodied LLMs uses a sequential mechanical puzzle to probe embodied LLMs. Prior Knowledge or Search studies LLM agents in hardware-aware code optimization. From SGD to Muon proposes adaptive optimization via Schatten-p norms. OpenComputer proposes a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. What Really Improves Mathematical Reasoning studies which structured reasoning signals, beyond pure code, improve mathematical reasoning. Memory-Augmented Reinforcement Learning Agent proposes a memory-augmented reinforcement learning agent for CAD generation. Formal Skill proposes programmable runtime skills for efficient and accurate LLM agents.
SceneCode proposes executable world programs for editable indoor scenes with articulated objects. Beyond Mode Collapse proposes distribution matching for diverse reasoning. Agentic Trading studies LLM agents in financial markets. Streamlined Constraint Reasoning proposes streamlining constraint reasoning via CNN pattern recognition on enumerated solutions. Beyond Rational Illusion proposes a framework for behaviorally realistic strategic classification. A position paper argues that the Turing-completeness of real-world autoregressive Transformers relies heavily on context management. PRISM proposes a benchmark for programmatic spatial-temporal reasoning. Generative Recursive Reasoning proposes a framework for generative recursive reasoning. Attention-Guided Reward proposes an attention-guided reward for reinforcement learning-based jailbreaks against large reasoning models. How Far Are We From True Auto-Research? examines how close current systems are to true auto-research. POLAR-Bench proposes a diagnostic benchmark for privacy-utility trade-offs in LLM agents. Trustworthy Agent Network argues that trust in agent networks must be baked in, not bolted on. AgentNLQ proposes a general-purpose agent for natural language to SQL. Minimax Optimal Variance-Aware Regret Bounds derives minimax optimal variance-aware regret bounds for multinomial logistic MDPs. SOLAR proposes a self-optimizing open-ended autonomous agent for lifelong learning and continual adaptation. Tool-Augmented Agent proposes a tool-augmented agent for closed-loop optimization, simulation, and modeling orchestration. OSCToM proposes a framework for observer-self conflict theory of mind. Open-World Evaluations studies open-world evaluations for measuring frontier AI capabilities. ECUAS_n proposes a family of metrics for principled evaluation of uncertainty-augmented systems. COAgents proposes a multi-agent framework to learn and navigate the search space of routing problems.
From Automated to Autonomous proposes a hierarchical agent-native network architecture (HANA). Mahjax proposes a GPU-accelerated Mahjong simulator for reinforcement learning in JAX. Conflict-Aware Additive Guidance proposes conflict-aware additive guidance for flow models under compositional rewards. Playing Devil's Advocate studies off-the-shelf persona vectors for sycophancy and finds that they rival targeted steering. For How Long Should We Be Punching? studies learning action duration in fighting games. Governance by Construction proposes governance by construction for generalist agents. Insights Generator proposes systematic corpus-level trace diagnostics for LLM agents. ScenePilot proposes controllable boundary-driven critical scenario generation for autonomous driving. AutoRPA proposes a framework for efficient GUI automation through LLM-driven code synthesis from interactions. AiraXiv proposes an AI-driven open-access platform for human and AI scientists. PALS proposes power-aware LLM serving for mixture-of-experts models. High Quality Embeddings proposes high-quality embeddings for Horn logic reasoning. AgentCo-op proposes retrieval-based synthesis of interoperable multi-agent workflows. DeepWeb-Bench proposes a deep research benchmark demanding massive cross-source evidence and long-horizon derivation. Personality Engineering proposes a methodology for negotiation research using AI agents. Mind the Sim-to-Real Gap & Think Like a Scientist studies the sim-to-real gap and the importance of thinking like a scientist.
A recent study has shown that large language models (LLMs) can be used to generate high-quality embeddings for Horn logic reasoning. The study proposes several approaches to creating embeddings that result in better downstream results, including generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. The study also conducts several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.
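The balancing and periodic hard-example ideas described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's method: the per-example difficulty score in [0, 1], the bucket thresholds, the `hard_every` schedule, and the function names `bucket_by_difficulty` and `sample_batch` are all assumptions introduced here.

```python
import random


def bucket_by_difficulty(examples, easy_max=1 / 3, hard_min=2 / 3):
    """Split (item, difficulty) pairs into easy / medium / hard buckets.

    The difficulty score and both thresholds are hypothetical.
    """
    buckets = {"easy": [], "medium": [], "hard": []}
    for item, difficulty in examples:
        if difficulty < easy_max:
            buckets["easy"].append(item)
        elif difficulty < hard_min:
            buckets["medium"].append(item)
        else:
            buckets["hard"].append(item)
    return buckets


def sample_batch(buckets, batch_size, step, hard_every=5, rng=random):
    """Draw a balanced batch, or a hard-only batch every `hard_every` steps."""
    names = [name for name in ("easy", "medium", "hard") if buckets[name]]
    if not names:
        raise ValueError("no examples to sample from")
    # Periodic emphasis: every hard_every-th step uses only the hard bucket.
    if (step + 1) % hard_every == 0 and buckets["hard"]:
        return [rng.choice(buckets["hard"]) for _ in range(batch_size)]
    # Otherwise cycle through the non-empty buckets so each contributes
    # roughly the same number of examples.
    return [rng.choice(buckets[names[i % len(names)]]) for i in range(batch_size)]
```

With three non-empty buckets and a batch size of 6, an ordinary step draws two examples from each bucket, while steps 4, 9, 14, and so on draw only hard examples.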
Researchers have proposed a framework for efficient GUI automation through LLM-driven code synthesis from interactions. The framework, called AutoRPA, introduces two core innovations: a translator-builder pipeline and a hybrid repair strategy during code verification. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.
Sources
- GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction
- Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
- Interaction Locality in Hierarchical Recursive Reasoning
- VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals
- Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
- Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G
- AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
- PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
- Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
- Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines
- Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination
- Not all uncertainty is alike: volatility, stochasticity, and exploration
- Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance
- Evaluating the Utility of Personal Health Records in Personalized Health AI
- Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use
- DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
- Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
- Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
- Interference-Aware Multi-Task Unlearning
- Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
- Hallucination as Exploit: Evidence-Carrying Multimodal Agents
- Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)
- SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
- MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
- Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance
- AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
- What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
- Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling
- Generative Auto-Bidding with Unified Modeling and Exploration
- Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
- BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation
- Efficient Elicitation of Collective Disagreements
- Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
- Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
- EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
- When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
- Transforming Constraint Programs to Input for Local Search
- EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
- Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
- Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization
- CogScale: Scalable Benchmark for Sequence Processing
- Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation
- Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support
- PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
- From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
- Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
- AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
- GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards
- Probabilistic Tiny Recursive Model
- When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity
- A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
- Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem
- Neurosymbolic Learning for Inference-Time Argumentation
- HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands
- Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
- Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
- Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
- KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition
- Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving
- Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
- From SGD to Muon: Adaptive Optimization via Schatten-p Norms
- OpenComputer: Verifiable Software Worlds for Computer-Use Agents
- What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
- Memory-Augmented Reinforcement Learning Agent for CAD Generation
- Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
- SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
- Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
- Agentic Trading: When LLM Agents Meet Financial Markets
- Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions
- Beyond Rational Illusion: Behaviorally Realistic Strategic Classification
- Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
- PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
- Generative Recursive Reasoning
- Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
- How Far Are We From True Auto-Research?
- POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents
- Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On
- AgentNLQ: A General-Purpose Agent for Natural Language to SQL
- Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
- SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
- Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration
- OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind
- Open-World Evaluations for Measuring Frontier AI Capabilities
- $ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
- COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
- From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)
- Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
- Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
- Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
- For How Long Should We Be Punching? Learning Action Duration in Fighting Games
- Governance by Construction for Generalist Agents
- Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
- ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
- AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
- AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists
- PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
- High Quality Embeddings for Horn Logic Reasoning
- AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
- DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
- Mind the Sim-to-Real Gap & Think Like a Scientist
- Personality Engineering with AI Agents: A New Methodology for Negotiation Research