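Survey of the Latest Agent Research (2026-05-12)
This report was generated automatically from papers.cool/arxiv/cs.AI
Selection criterion: papers related to AI agent systems
Generated at: 2026/5/12 17:43:21
📊 Today's overview
- Total papers: 25
- Agent-related: 14
Distribution by direction (a paper tagged with two directions is counted under each, so the rows sum to 16 rather than 14)
| Direction | Papers |
|---|---|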
| other | 4 |
| memory | 2 |
| evaluation | 5 |
| planning | 2 |
| safety | 2 |
| multi_agent | 1 |
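The relationship between the 14 agent-related papers and the per-direction counts can be checked with a short tally. The sketch below is illustrative only: the `PAPERS` list and the `direction_counts` helper are names introduced here, and the IDs and labels are copied from the listing in this report, not from the generator's actual code.

```python
from collections import Counter

# (arXiv ID, directions) pairs copied from the paper listing in this report.
PAPERS = [
    ("2605.10913", ["other"]),
    ("2605.10820", ["other"]),
    ("2605.10754", ["other"]),
    ("2605.10663", ["other"]),
    ("2605.10870", ["memory"]),
    ("2605.10782", ["memory", "evaluation"]),
    ("2605.10865", ["evaluation"]),
    ("2605.10834", ["evaluation"]),
    ("2605.10787", ["evaluation"]),
    ("2605.10639", ["evaluation"]),
    ("2605.10828", ["planning"]),
    ("2605.10805", ["planning"]),
    ("2605.10763", ["safety"]),
    ("2605.10614", ["multi_agent", "safety"]),
]


def direction_counts(papers):
    """Count each paper once for every direction it is tagged with."""
    return Counter(d for _, directions in papers for d in directions)


counts = direction_counts(PAPERS)
unique_papers = len({paper_id for paper_id, _ in PAPERS})  # distinct papers
label_total = sum(counts.values())                         # sum of table rows

for direction, n in counts.most_common():
    print(f"{direction}: {n}")
print(f"unique papers: {unique_papers}, label total: {label_total}")
```

The two multi-label papers (2605.10782 and 2605.10614) account for the difference between the label total and the number of distinct papers.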
1️⃣ Today's agent-related papers
OTHER (4 papers)
1. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
   - arXiv ID: 2605.10913
   - Direction: other
   - Key terms:
     - shepherd, meta, agents, runtime, execution, agent, trace, empowering, formalized, forked
2. MaD Physics: Evaluating information seeking under constraints in physical environments
   - arXiv ID: 2605.10820
   - Direction: other
   - Key terms:
     - mad, physics, scientific, measurements, constraints, physical, flash, agents, capabilities, evaluating
3. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
   - arXiv ID: 2605.10754
   - Direction: other
   - Key terms:
     - cybernetics, agent, foundation, agents, engineering, principles, beings, missing, science, steps
4. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
   - arXiv ID: 2605.10663
   - Direction: other
   - Key terms:
     - experience, evolving, mind2web, alfworld, self, utilization, gains, end, reusable, tasks
MEMORY (2 papers)
1. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
   - arXiv ID: 2605.10870
   - Direction: memory
   - Key terms:
     - memory, decision, demem, distortion, budget, quality, distinctions, remember, agent, runtime
2. TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
   - arXiv ID: 2605.10782
   - Direction: memory, evaluation
   - Key terms:
     - trajectory, trajprism, language, urban, trajectories, retrieval, task, instruction, benchmark, travel
EVALUATION (5 papers)
1. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
   - arXiv ID: 2605.10865
   - Direction: evaluation
   - Key terms:
     - cad, benchcad, industrial, programs, code, executable, part, parametric, benchmark, lofts
2. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
   - arXiv ID: 2605.10834
   - Direction: evaluation
   - Key terms:
     - pentesting, evaluation, agents, protocol, truth, realistic, vulnerability, targets, ground, wild
3. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
   - arXiv ID: 2605.10787
   - Direction: evaluation
   - Key terms:
     - complexmcp, interdependent, agents, llm, sandbox, sandboxes, evaluation, dynamic, tool
4. TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
   - arXiv ID: 2605.10782
   - Direction: memory, evaluation
   - Key terms:
     - trajectory, trajprism, language, urban, trajectories, retrieval, task, instruction, benchmark, travel
5. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
   - arXiv ID: 2605.10639
   - Direction: evaluation
   - Key terms:
     - benchmarks, evaluation, toxicity, setups, navigating, biases, sea, unrecognized, llm, investigating
PLANNING (2 papers)
1. The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
   - arXiv ID: 2605.10828
   - Direction: planning
   - Key terms:
     - proportion, ink, distractor, distractors, drop, misleading, hard, context, performance, marginal
2. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
   - arXiv ID: 2605.10805
   - Direction: planning
   - Key terms:
     - reasoning, racer, judges, routing, judge, cost, llm, shift, robust, efficient
SAFETY (2 papers)
1. MATRA: Modeling the Attack Surface of Agentic AI Systems – OpenClaw Case Study
   - arXiv ID: 2605.10763
   - Direction: safety
   - Key terms:
     - matra, agentic, openclaw, deployment, threat, translate, attack, risks, sandboxing, modeling
2. PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
   - arXiv ID: 2605.10614
   - Direction: multi_agent, safety
   - Key terms:
     - leakage, prism, agent, risk, generation, llm, credential, secret, leak
MULTI_AGENT (1 paper)
1. PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
   - arXiv ID: 2605.10614
   - Direction: multi_agent, safety
   - Key terms:
     - leakage, prism, agent, risk, generation, llm, credential, secret, leak
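2️⃣ Research trend analysis
Today's hot directions
Based on today's 14 agent-related papers:
- evaluation: 5 papers 🔥 hot
- other: 4 papers 🔥 hot
- memory: 2 papers 📈 growing
Paradigm shifts
- RAG → Memory System: retrieval augmentation is evolving toward systematic memory architectures
- Tool Calling → Tool Learning: from simple tool calls to autonomous tool learning
Emerging architecture patterns
- No clear new architecture pattern today
3️⃣ Key insights
- Memory is becoming infrastructure: a growing number of systems treat memory as a standard component rather than an optional feature
- Planning is shifting from rules to learning: traditional symbolic planning is being replaced by learned neural approaches
- Multi-Agent collaboration is standardizing: consensus is forming around multi-agent communication protocols and coordination mechanisms
- Safety is moving from after-the-fact to up-front: safety is being designed into the system architecture rather than patched in afterwards
- Evaluation benchmarks are evolving quickly: agent evaluation is expanding from single tasks to complex scenarios
- Open-source solutions iterate quickly: open-source implementations are rapidly catching up with commercial agent capabilities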
4️⃣ Technology evolution path
1. Prompt Engineering
Currently active paths
- RAG → Memory System → World Model: memory architectures keep deepening
- ReAct → Planning System → Goal Reasoning: reasoning capability is strengthening
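5️⃣ Relation to open-source agent projects
Comparison with mainstream projects
| Open-source project | Related directions | Validated by today's papers |
|---|---|---|
| LangChain | tool, planning | ✅ |
| LlamaIndex | memory, rag | ✅ |
| AutoGPT | planning, autonomous | ✅ |
| CrewAI | multi-agent | ✅ |
| Mem0 | memory | ✅ |
| OpenDevin | tool, planning | ➖ |
Design validation and evolution
Validated designs:
- The need for a Memory System continues to be validated
- Tool Use is now widely accepted as a core agent capability
- Multi-Agent architectures perform better on complex tasks
Designs that need to evolve:
- Simple RAG is being replaced by Memory Systems
- Monolithic single-agent architectures are limited in complex scenarios
- Static Tool Definitions need to evolve toward dynamic learning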
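6️⃣ Architecture-level conclusions
- Memory First: new agent projects should design the Memory System up front rather than adding it later
- Tool Abstraction: the tool abstraction layer should support dynamic discovery and learning rather than hard-coding
- Multi-Agent Ready: even a single-agent system should leave room in its architecture for multi-agent extension
- Safety by Design: safety mechanisms should be considered at the architecture design stage rather than patched in afterwards
- Evaluation Driven: establish continuous evaluation rather than relying on manual testing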
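As one possible reading of the Tool Abstraction point, tools can be registered at runtime and looked up by description instead of being hard-coded into the agent loop. The sketch below is a minimal illustration under that assumption: `ToolRegistry`, `register`, `discover`, `call`, and the two sample tools are hypothetical names invented here, not the API of any project listed above, and tool learning is out of scope.

```python
import inspect


class ToolRegistry:
    """Hypothetical registry: tools are added at runtime and found by description."""

    def __init__(self):
        self._tools = {}

    def register(self, func):
        """Register a callable under its own name; usable as a decorator."""
        self._tools[func.__name__] = func
        return func

    def discover(self, keyword):
        """Return sorted names of tools whose docstring mentions the keyword."""
        needle = keyword.lower()
        return sorted(
            name
            for name, func in self._tools.items()
            if needle in (inspect.getdoc(func) or "").lower()
        )

    def call(self, name, **kwargs):
        """Invoke a registered tool by name with keyword arguments."""
        return self._tools[name](**kwargs)


registry = ToolRegistry()


@registry.register
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


@registry.register
def word_count(text: str) -> int:
    """Count whitespace-separated words in a text."""
    return len(text.split())


# The caller finds a tool by what it does, then invokes it by name.
matches = registry.discover("count")
print(matches, registry.call(matches[0], text="memory first design"))
```

Matching on docstrings is only a stand-in; a real system might match on embeddings or a declared schema.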
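7️⃣ Recommended next steps
Memory Schema design
- Adopt a layered memory architecture: Working Memory → Episodic → Long-term
- Design a unified Memory Interface that supports multiple backends (vector, graph, relational)
- Implement a Memory Compression mechanism to avoid unbounded growth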
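The three memory recommendations above can be sketched together. Everything in this block is an assumption made for illustration: the `MemoryBackend` protocol, the `DictBackend` stand-in, the `LayeredMemory` class, the capacities, and especially the compression step, which simply joins old episodes into one string where a real system might summarize them with a model.

```python
from collections import deque
from typing import Optional, Protocol


class MemoryBackend(Protocol):
    """Hypothetical unified interface; a vector, graph, or SQL store could implement it."""

    def put(self, key: str, text: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...
    def size(self) -> int: ...


class DictBackend:
    """In-memory stand-in for a real long-term store."""

    def __init__(self):
        self._items = {}

    def put(self, key, text):
        self._items[key] = text

    def get(self, key):
        return self._items.get(key)

    def size(self):
        return len(self._items)


class LayeredMemory:
    """Working Memory -> Episodic -> Long-term, with a bounded episodic layer."""

    def __init__(self, long_term, working_capacity=3, episodic_budget=4):
        self.long_term = long_term
        self.working = deque(maxlen=working_capacity)
        self.episodic = []
        self.episodic_budget = episodic_budget
        self._summaries = 0

    def observe(self, text):
        # Move the oldest working item to episodic before the deque evicts it.
        if len(self.working) == self.working.maxlen:
            self.episodic.append(self.working[0])
        self.working.append(text)
        if len(self.episodic) > self.episodic_budget:
            self._compress()

    def _compress(self):
        # Placeholder compression: fold the oldest half into one long-term record.
        half = len(self.episodic) // 2
        old, self.episodic = self.episodic[:half], self.episodic[half:]
        self.long_term.put(f"summary-{self._summaries}", " | ".join(old))
        self._summaries += 1


backend = DictBackend()
memory = LayeredMemory(backend, working_capacity=2, episodic_budget=2)
for event in ["e1", "e2", "e3", "e4", "e5", "e6"]:
    memory.observe(event)
print(list(memory.working), memory.episodic, backend.size())
```

Because `LayeredMemory` only depends on the three-method interface, swapping `DictBackend` for another store would not change the layering logic.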
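Retrieval Policy upgrade
- Upgrade from plain similarity search to hybrid retrieval (keyword + vector + knowledge graph)
- Implement a context-aware dynamic retrieval strategy
- Consider adding a Reranking step to improve relevance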
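A hybrid retrieval step of the kind recommended above might look like the following sketch. It is not taken from any paper in this report: the keyword ranker is a plain token-overlap stand-in for BM25, the vector ranker uses term-count cosine similarity as a stand-in for embeddings, the knowledge-graph channel is omitted, and the two rankings are merged with reciprocal rank fusion, a fusion method chosen here for simplicity. `DOCS`, `hybrid_search`, and the optional `rerank` hook are hypothetical names.

```python
import math
from collections import Counter

# Toy corpus invented for this example.
DOCS = {
    "d1": "agent memory compression under a budget",
    "d2": "hybrid retrieval with keyword and vector search",
    "d3": "multi agent orchestration protocol",
}


def tokenize(text):
    return text.lower().split()


def keyword_rank(query, docs):
    """Rank by number of shared tokens (stand-in for BM25)."""
    q = set(tokenize(query))
    scores = {d: len(q & set(tokenize(t))) for d, t in docs.items()}
    return sorted(docs, key=lambda d: (-scores[d], d))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_rank(query, docs):
    """Rank by cosine similarity of term-count vectors (stand-in for embeddings)."""
    qv = Counter(tokenize(query))
    scores = {d: cosine(qv, Counter(tokenize(t))) for d, t in docs.items()}
    return sorted(docs, key=lambda d: (-scores[d], d))


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


def hybrid_search(query, docs, top_k=2, rerank=None):
    """Fuse both rankings, keep top_k, then apply an optional reranker."""
    candidates = rrf_fuse([keyword_rank(query, docs), vector_rank(query, docs)])[:top_k]
    return rerank(query, candidates) if rerank else candidates


results = hybrid_search("keyword vector retrieval", DOCS)
print(results)
```

The `rerank` argument is where a second-stage model could reorder the short candidate list; it is left as a caller-supplied function because the report does not name a specific reranker.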
Agent Orchestration adjustments
- Design a standardized agent communication protocol
- Implement a dynamic task assignment mechanism
- Consider introducing an Orchestrator role
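To make the orchestration recommendations concrete, the sketch below combines a fixed message shape, an Orchestrator role, and load-based task assignment. All names (`Message`, `Worker`, `Orchestrator`, `dispatch`) and the least-loaded assignment rule are assumptions chosen for illustration; the report does not prescribe a particular protocol or scheduling policy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    """Hypothetical standard envelope exchanged between agents."""
    sender: str
    recipient: str
    kind: str       # "task" or "result"
    payload: str


@dataclass
class Worker:
    name: str
    skills: set
    handler: Callable[[str], str]
    load: int = 0   # number of tasks handled so far

    def handle(self, msg: Message) -> Message:
        self.load += 1
        return Message(self.name, msg.sender, "result", self.handler(msg.payload))


class Orchestrator:
    """Routes each task to the least-loaded worker that has the required skill."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.log = []

    def dispatch(self, skill: str, payload: str) -> Message:
        capable = [w for w in self.workers if skill in w.skills]
        if not capable:
            raise LookupError(f"no worker offers skill {skill!r}")
        worker = min(capable, key=lambda w: w.load)  # ties go to the first listed
        task = Message("orchestrator", worker.name, "task", payload)
        reply = worker.handle(task)
        self.log.extend([task, reply])
        return reply


workers = [
    Worker("a", {"search", "summarize"}, str.upper),
    Worker("b", {"search"}, str.title),
]
orchestrator = Orchestrator(workers)
first = orchestrator.dispatch("search", "agent memory")
second = orchestrator.dispatch("search", "tool learning")
third = orchestrator.dispatch("summarize", "long report")
print(first.sender, second.sender, third.sender)
```

Keeping every exchange in `log` as `Message` objects gives a simple trace of who was asked to do what, which is one reason to standardize the envelope first.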
📚 Appendix
Full paper list
- Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace - other
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory - memory
- BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD - evaluation
- From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World - evaluation
- The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning - planning
- MaD Physics: Evaluating information seeking under constraints in physical environments - other
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge - planning
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox - evaluation
- TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding - memory, evaluation
- MATRA: Modeling the Attack Surface of Agentic AI Systems – OpenClaw Case Study - safety
- The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents - other
- Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents - other
- Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks - evaluation
- PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines - multi_agent, safety
This report was generated automatically by OpenClaw
Intended for agent architects as a decision reference