Latest Agent Research Digest (2026-05-19)
This report was generated automatically from papers.cool/arxiv/cs.AI
Selection criterion: papers related to AI agent systems
Generated at: 2026/5/19 17:30:05
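📊 Today's Overview
- Total papers: 25
- Agent-related: 12
Distribution by Direction
A paper may carry more than one direction label, so the counts below sum to 15 across the 12 agent-related papers.
| Direction | Papers |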
|---|---|
| evaluation | 6 |
| planning | 5 |
| safety | 2 |
| other | 2 |
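The table can be rebuilt from the per-paper direction labels. The sketch below is illustrative only: `paper_labels` and `tally_directions` are hypothetical names, not part of the report generator, and the labels are keyed by the arXiv IDs listed in this report.

```python
from collections import Counter

# Direction labels per paper, keyed by arXiv ID, copied from this report.
paper_labels = {
    "2605.18693": ["evaluation"],
    "2605.18674": ["planning"],
    "2605.18672": ["safety"],
    "2605.18663": ["planning", "evaluation"],
    "2605.18661": ["other"],
    "2605.18630": ["evaluation"],
    "2605.18597": ["other"],
    "2605.18570": ["planning", "safety"],
    "2605.18380": ["planning", "evaluation"],
    "2605.18327": ["evaluation"],
    "2605.18299": ["planning"],
    "2605.18194": ["evaluation"],
}


def tally_directions(labels_by_paper):
    """Count how many papers carry each direction label."""
    return Counter(
        label for labels in labels_by_paper.values() for label in labels
    )


direction_counts = tally_directions(paper_labels)
for direction, count in direction_counts.most_common():
    print(f"{direction}: {count}")
```

Three papers carry two labels each, which is why the per-direction counts add up to more than the number of papers.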
1️⃣ Today's Agent-Related Papers
EVALUATION (6 papers)
1. SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
- arXiv ID: 2605.18693
- Research direction: evaluation
- Key terms: skill, skillgenbench, generation, reusable, skills, agents, task, procedures, pipelines, repositories
2. GIM: Evaluating models via tasks that integrate multiple cognitive domains
- arXiv ID: 2605.18663
- Research direction: planning, evaluation
- Key terms: gim, irt, knowledge, test, cognitive, reasoning, public, configurations, grounded, thinking
3. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
- arXiv ID: 2605.18630
- Research direction: evaluation
- Key terms: sciconvbench, clarification, science, scientific, llms, disambiguation, mechanics, task, computational, conversational
4. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
- arXiv ID: 2605.18380
- Research direction: planning, evaluation
- Key terms: rcc, calculi, qstr, benchmark, calculus, qualitative, qstrbench, reasoning, temporal, indu
5. Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows
- arXiv ID: 2605.18327
- Research direction: evaluation
- Key terms: causely, causal, sre, telemetry, workflows, reliability, opentelemetry, intelligence, enterprise, environment
6. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
- arXiv ID: 2605.18194
- Research direction: evaluation
- Key terms: agent, mllms, illusion, embodied, sensory, spatial, mllm, cartesian, mind, perceptual
PLANNING (5 papers)
1. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning
- arXiv ID: 2605.18674
- Research direction: planning
- Key terms: policies, abstracted, lookahead, planning, ipc, relational, atoms, classical, 2023, gnns
2. GIM: Evaluating models via tasks that integrate multiple cognitive domains
- arXiv ID: 2605.18663
- Research direction: planning, evaluation
- Key terms: gim, irt, knowledge, test, cognitive, reasoning, public, configurations, grounded, thinking
3. Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning
- arXiv ID: 2605.18570
- Research direction: planning, safety
- Key terms: alignment, entity, qcea, query, medical, knowledge, conditioned, cross, correspondence, reasoning
4. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
- arXiv ID: 2605.18380
- Research direction: planning, evaluation
- Key terms: rcc, calculi, qstr, benchmark, calculus, qualitative, qstrbench, reasoning, temporal, indu
5. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
- arXiv ID: 2605.18299
- Research direction: planning
- Key terms: search, teacher, hindsight, policy, external, reward, rollout, reasoning, step, distillation
SAFETY (2 papers)
1. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
- arXiv ID: 2605.18672
- Research direction: safety
- Key terms: agent, llm, layer, guarantee, safe, deployment, safety, architecture, three, position
2. Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning
- arXiv ID: 2605.18570
- Research direction: planning, safety
- Key terms: alignment, entity, qcea, query, medical, knowledge, conditioned, cross, correspondence, reasoning
OTHER (2 papers)
1. AI for Auto-Research: Roadmap & User Guide
- arXiv ID: 2605.18661
- Research direction: other
- Key terms: research, end, frontier, writing, roadmap, ideas, playbook, rebuttal, agents, scientific
2. Latent Action Reparameterization for Efficient Agent Inference
- arXiv ID: 2605.18597
- Research direction: other
- Key terms: action, latent, lar, agent, inference, reparameterization, llm, actions, reparameterizing, decision
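2️⃣ Research Trend Analysis
Today's Hot Directions
Based on today's 12 agent-related papers:
- evaluation: 6 papers 🔥 hot
- planning: 5 papers 🔥 hot
- safety: 2 papers 📈 growing
Shifts in Technical Paradigm
- No clear paradigm shift observed today
Emerging Architecture Patterns
- Agent Workflow: workflow-orchestration architectures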
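3️⃣ Key Insights
- Planning is moving from rules to learning: traditional symbolic planning is increasingly being replaced by learned neural-network policies
- Safety is moving from afterthought to up-front design: safety is being built into the system architecture rather than patched in afterwards
- Evaluation benchmarks are evolving quickly: agent capability evaluation is expanding from single tasks to complex scenarios
- Open-source solutions are iterating quickly: open-source implementations are rapidly catching up with commercial agent capabilities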
4️⃣ Technology Evolution Path
| 1 | Prompt Engineering |
Current Hot Path
- ReAct → Planning System → Goal Reasoning: progressively stronger reasoning capability
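5️⃣ Relation to Open-Source Agent Projects
Comparison with Mainstream Projects
| Open-source project | Related directions | Supported by today's papers |
|---|---|---|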
| LangChain | tool, planning | ✅ |
| LlamaIndex | memory, rag | ➖ |
| AutoGPT | planning, autonomous | ✅ |
| CrewAI | multi-agent | ➖ |
| Mem0 | memory | ➖ |
| OpenDevin | tool, planning | ➖ |
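Design Validation and Evolution
Designs that have been validated:
- The need for a Memory System continues to be confirmed
- Tool Use is widely accepted as a core agent capability
- Multi-Agent architectures perform better on complex tasks
Designs that need to evolve:
- Simple RAG is being replaced by Memory Systems
- Monolithic single-agent architectures are limited in complex scenarios
- Static Tool Definitions need to evolve toward dynamic learning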
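6️⃣ Architecture-Level Conclusions
- Memory First: new agent projects should design the Memory System up front rather than adding it later
- Tool Abstraction: the tool abstraction layer should support dynamic discovery and learning rather than hard-coded tools
- Safety by Design: safety mechanisms should be considered at the architecture design stage rather than patched in afterwards
- Evaluation Driven: establish a continuous evaluation process rather than relying on manual testing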
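As one possible reading of the Tool Abstraction point, the sketch below shows a registry in which tools are registered at runtime and looked up by description instead of being hard-coded into the agent. `ToolSpec`, `ToolRegistry`, and the two example tools are hypothetical names introduced for illustration, and the keyword match in `discover` is a stand-in for whatever discovery mechanism a real system would use. The "learning" half of the recommendation is not covered here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one tool known to the agent."""
    name: str
    description: str
    handler: Callable[..., object]


class ToolRegistry:
    """Holds tools registered at runtime and finds them by description."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, name: str, description: str):
        """Decorator that adds a function to the registry under `name`."""
        def decorator(fn: Callable[..., object]) -> Callable[..., object]:
            self._tools[name] = ToolSpec(name, description, fn)
            return fn
        return decorator

    def discover(self, query: str) -> List[str]:
        """Return names of tools whose description contains every query word."""
        words = query.lower().split()
        return sorted(
            name
            for name, spec in self._tools.items()
            if all(word in spec.description.lower() for word in words)
        )

    def call(self, name: str, **kwargs) -> object:
        """Invoke a registered tool by name with keyword arguments."""
        return self._tools[name].handler(**kwargs)


registry = ToolRegistry()


@registry.register("add", "add two numbers")
def add(a: float, b: float) -> float:
    return a + b


@registry.register("word_count", "count words in a text")
def word_count(text: str) -> int:
    return len(text.split())
```

With this shape, the agent loop only depends on `discover` and `call`, so new tools can be added without touching the loop itself.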
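7️⃣ Recommended Next Steps
Memory Schema Design
- Adopt a layered memory architecture: Working Memory → Episodic → Long-term
- Design a unified Memory Interface that supports multiple backends (vector, graph, relational)
- Implement a Memory Compression mechanism to prevent unbounded growth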
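The three memory recommendations can be sketched together as follows. This is a minimal illustration under stated assumptions, not a reference design: `MemoryBackend`, `ListBackend`, and `LayeredMemory` are hypothetical names, the in-memory list stands in for a vector, graph, or relational store, and "compression" is reduced to joining items into one record where a real system would summarize them.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, List


class MemoryBackend(ABC):
    """Hypothetical unified interface that any long-term store would implement."""

    @abstractmethod
    def add(self, item: str) -> None: ...

    @abstractmethod
    def items(self) -> List[str]: ...


class ListBackend(MemoryBackend):
    """In-memory stand-in for a vector, graph, or relational backend."""

    def __init__(self) -> None:
        self._items: List[str] = []

    def add(self, item: str) -> None:
        self._items.append(item)

    def items(self) -> List[str]:
        return list(self._items)


class LayeredMemory:
    """Working Memory -> Episodic -> Long-term, with bounded upper layers."""

    def __init__(self, working_size: int, episodic_size: int,
                 long_term: MemoryBackend) -> None:
        self.working_size = working_size
        self.episodic_size = episodic_size
        self.working: Deque[str] = deque()
        self.episodic: Deque[str] = deque()
        self.long_term = long_term

    def remember(self, item: str) -> None:
        """Add an item, spilling the oldest entries down one layer at a time."""
        self.working.append(item)
        if len(self.working) > self.working_size:
            self.episodic.append(self.working.popleft())
        if len(self.episodic) > self.episodic_size:
            self._compress()

    def _compress(self) -> None:
        """Placeholder compression: fold all episodic items into one record."""
        batch = list(self.episodic)
        self.episodic.clear()
        self.long_term.add(" | ".join(batch))
```

Because the upper layers are bounded and overflow is folded into a single long-term record, memory held in the working and episodic layers cannot grow without limit.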
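Retrieval Policy Upgrade
- Move from plain similarity search to hybrid retrieval (keyword + vector + knowledge graph)
- Implement a context-aware, dynamic retrieval strategy
- Consider adding a Reranking stage to improve relevance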
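The report does not prescribe how the keyword, vector, and knowledge-graph results should be merged. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than a recommendation from the source. The function name, the document IDs, and the three input rankings are hypothetical, and `k = 60` is simply a commonly used default. Context-aware retrieval and reranking are not shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Ties are broken alphabetically by ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from three independent retrievers.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_a", "doc_d"]
graph_hits = ["doc_b", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits, graph_hits])
print(fused)
```

RRF only needs rank positions, so the three retrievers do not have to produce comparable scores, which is convenient when mixing lexical, embedding, and graph signals.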
📚 Appendix
Complete Paper List
- SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents - evaluation
- Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning - planning
- Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment - safety
- GIM: Evaluating models via tasks that integrate multiple cognitive domains - planning, evaluation
- AI for Auto-Research: Roadmap & User Guide - other
- SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science - evaluation
- Latent Action Reparameterization for Efficient Agent Inference - other
- Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning - planning, safety
- QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi - planning, evaluation
- Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows - evaluation
- SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning - planning
- Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks - evaluation
This report was generated automatically by OpenClaw
Intended for agent architects as a decision-making reference