LDR: Learning Discrete Representation to Improve Noise Robustness in Multi-Agent Tasks

Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao

06 May 2023 (modified: 03 Nov 2023) | Submitted to NeurIPS 2023 | Conference, Senior Area Chairs, Area Chairs, Reviewers, Authors
Keywords: multi-agent reinforcement learning, robustness, discrete representation
TL;DR: This paper proposes a policy gradient method that uses discretization and combination to improve noise robustness in multi-agent tasks.
Abstract:

In real-world applications of multi-agent reinforcement learning (MARL), agents often face inaccurate environments due to unavoidable noise, presenting a challenge to their robustness. However, limited prior work focuses on addressing such noise in observations, hindering the deployment of multi-agent systems. In this paper, we propose Learning Discrete Representations (LDR) to improve robustness against noise in multi-agent tasks. Specifically, LDR employs a quantization module with a segment mechanism to encode observations and teammate actions, generating discrete representations from learnable codebooks. These representations are subsequently processed via a combiner for decision-making. To enhance the learning efficiency, we incorporate a set-input block that treats the joint observations of agents as a permutation-invariant set, thereby reducing the joint observation space. Additionally, we theoretically analyze the expressiveness of discrete representation and the boundedness of discrete distortion. We evaluate the proposed method on StarCraft II micromanagement tasks and Multi-Agent MuJoCo with noisy observations. Empirical results demonstrate that LDR outperforms existing algorithms, improving robustness in noisy cooperative MARL tasks while maintaining superior performance in clean observations.

Supplementary Material: pdf
Corresponding Author: yuanheng.zhu@ia.ac.cn
Reviewer Nomination: chaijiajun2020@ia.ac.cn
Primary Area: Reinforcement learning
Claims: yes
Code Of Ethics: yes
Broader Impacts: n/a. Although we haven’t evaluated our method on real applications, we believe our work doesn’t have potential negative societal impacts.
Limitations: yes. See Sec. 6.
Theory: yes. See Appendix.
Experiments: yes
Training Details: yes. See Appendix.
Error Bars: yes
Compute: yes
Reproducibility: yes. See Appendix.
Safeguards: n/a
Licenses: yes. See Appendix.
Assets: n/a
Human Subjects: n/a
IRB Approvals: n/a
Submission Number: 2315

Opt in or Veto Public Release by Yuqian Fu

Opt in or Veto Public Release | Yuqian Fu | 01 Oct 2023, 22:34 | Program Chairs, Authors
Make Public: No, keep it confidential

Paper Decision

Decision | Program Chairs | 22 Sept 2023, 01:42 (modified: 22 Sept 2023, 03:36) | Program Chairs, Authors
Decision: Reject
Comment:

This paper introduces a quantisation process to make learning in Dec-POMDPs more noise robust. While the authors managed to address some of the concerns in the rebuttal phase, I believe that none of the reviewers were particularly excited about the paper, which makes this a typical borderline submission.

Upon reading the paper myself, I found a major conceptual issue which tips the balance towards a clear rejection: the authors frame the method as addressing the Dec-POMDP setting. However, two parts of the method are in clear violation of this setting. First, the method requires each agent to have access to the actions and observations of their teammates. This is also evident in the auto-regressive policy example in Section 3.2. Second, the method is formulated through policies that act on the observation at time t, rather than on action-observation histories as is required in Dec-POMDPs in general. In addition, the ON-Dec-POMDP is an unnecessary formalism, since the "standard" Dec-POMDP contains the observation function, which is flexible enough to account for noise.

Given all of these issues I recommend rejection of the paper and resubmission to a future venue.

Reviewers E6Ba and m33J: please engage

Official Comment | Area Chair 7GJP | 18 Aug 2023, 17:00 | Reviewer E6Ba, Reviewer m33J, Area Chairs, Reviewers, Program Chairs, Authors, Senior Area Chairs
Comment:

The discussion phase is now coming to an end -- please engage urgently by reading the rebuttal and responding to the authors.

Many thanks!

AC

Author Rebuttal by Authors

Author Rebuttal | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 09 Aug 2023, 12:25 (modified: 24 Aug 2023, 00:32) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Ethics Reviewers Submitted, Authors
Rebuttal:

We sincerely thank the reviewers for their valuable and positive feedback on our paper. We are delighted that they recognize the novelty of our approach to enhancing noise robustness through a discretization module (E6Ba, ypgJ), the superior performance compared to baselines (m33J, Sma1, E6Ba, bUKo, ypgJ), and the clarity of our writing (E6Ba, bUKo, ypgJ).

One primary concern was the lack of sufficient ablation experiments on the different components. In response, we conducted additional experiments to demonstrate the effectiveness of each part of LDR. We also conducted further experiments covering another discretization method (Gumbel-softmax), an additional baseline (HAPPO), the relationship between codebook size and performance in different scenarios (SMAC and MAMuJoCo), and performance under a different noise distribution (uniform noise). We are pleased to report that the results align with the statements made in our paper, which strengthens its overall soundness.


We kindly ask the reviewers to refer to the accompanying PDF file submitted, which contains the detailed experimental results. The corresponding figures are provided as follows:

  • Figure 1: Component ablation and performance comparison of Gumbel-softmax and VQ.
  • Figure 2: Performance comparisons between LDR and HAPPO.
  • Figure 3: Performance in different scenarios with different codebook sizes.
  • Figure 4: Performance comparisons under uniform noise.

Once again, we extend our heartfelt gratitude to the reviewers for their valuable feedback, which has significantly contributed to improving the quality and comprehensiveness of our paper.

PDF: pdf

Concerns Regarding the Inaccuracies and Quality of Reviewer Sma1's Review

Official Comment | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 08 Aug 2023, 21:02 (modified: 18 Aug 2023, 15:33) | Area Chairs, Program Chairs, Senior Area Chairs, Authors
[Deleted]

Official Review of Submission2315 by Reviewer m33J

Official Review | Reviewer m33J | 11 Jul 2023, 00:33 (modified: 01 Sept 2023, 10:50) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Reviewer m33J
Summary:

In practical MARL applications, observation noise can significantly impact performance and safety. Previous robust methods cannot scale to MARL due to their computational complexity and the interaction among agents. This work proposes Learning Discrete Representations (LDR) to improve robustness against noise in multi-agent reinforcement learning tasks. LDR uses a quantization module with a segment mechanism to discretize the noisy observations. The discrete representations of the observations and actions are combined via a cross-attention block. A set-input block is further proposed to reduce the joint observation space and thus improve sample efficiency. Empirical results on StarCraft II and Multi-Agent MuJoCo show that LDR improves performance under noisy observations.

Soundness: 3 good
Presentation: 3 good
Contribution: 3 good
Strengths:
  1. The proposed LDR incorporates the quantization mechanism into MARL to improve the robustness against the observation noise. LDR consists of (1) discrete representation learning and action termination via the quantization module, (2) representation combination for decision-making, and (3) use of the set-input block to process the noisy observation as a permutation-invariant set. The set-input block treats the joint observations of agents as a permutation-invariant set, thereby reducing the joint observation space.
  2. This work provides a theoretical analysis to understand the expressiveness of discrete representation and the boundedness of discrete distortion.
  3. The proposed method has been evaluated on diverse tasks, including StarCraft II micromanagement tasks and Multi-Agent MuJoCo with noisy observations. The results show that LDR outperforms two baselines under noisy observations as the noise level increases. An ablation study on the effect of different codebook sizes is provided.
Weaknesses:
  1. During discretization learning, the quantization module is from VQ-VAE. No analysis or comparison shows why this specific quantization module was chosen in this work.
  2. Based on my understanding, the attention weight in the combiner and the permutation invariance contribute to MARL's robustness with noisy observations. However, there is no discussion on which part contributes the most.
Questions:
  1. Is there any specific reason for choosing the quantization module from VQ-VAE instead of using another quantization module?
  2. Which component contributes to the robustness against noisy observation, the attention weight in the combiner or the permutation invariance from the set-input block?
Limitations:
  1. One limitation of this work is that the performance of LDR can suffer when an excessively large codebook size is chosen, depending on the nature of tasks.
Flag For Ethics Review: No ethics review needed.
Rating: 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Code Of Conduct: Yes

Rebuttal by Authors

Rebuttal | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 04 Aug 2023, 21:16 (modified: 23 Aug 2023, 23:41) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Ethics Reviewers Submitted, Authors
Rebuttal:

We thank reviewer m33J for the constructive comments, which will surely help us improve our paper.

Q1: Is there any specific reason for choosing the quantization module from VQ-VAE instead of using another quantization module?

A1: Thank you for your valuable feedback and raising your concern regarding the choice of the quantization module in our paper. We sincerely appreciate the opportunity to clarify this matter. Quantization is the process of mapping input values from a large set (often a continuous set) to output values in a smaller (discrete) set, including scalar quantization (SQ) and vector quantization (VQ). By discretizing noisy information, we effectively fit the data into bins and reduce the impact of small perturbations, which are often considered noise. The discretization process ensures that small fluctuations are absorbed and represented as discrete states or actions, thereby reducing the impact of noise on the overall learning process [1, 2]. We chose VQ instead of SQ motivated by the fundamental result of Shannon’s rate-distortion theory [3, 4]: better performance can always be achieved by coding vectors instead of scalars, even if the sequence of source symbols are independent random variables.

In our paper, we referenced the quantization module from VQ-VAE, as it has been widely recognized for its success in various applications and has paved the way for subsequent research [5, 6]. However, it is important to note that the quantization module in VQ-VAE is not inherently different from other vector quantization methods. We acknowledge that explicitly mentioning that the quantization module is from VQ-VAE might create misinterpretation in our paper. We assure you that we will address this concern in the revised version by minimizing the reference to the specific origin of the quantization module. Instead, we will focus on the attributes and capabilities of the chosen quantization module to ensure clarity and avoid any unnecessary ambiguity.
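
The noise-absorbing effect of quantization described in A1 can be illustrated with a minimal nearest-neighbour vector quantizer. This is only a sketch under stated assumptions: the 2-D codebook values, the `quantize` helper, and the perturbations are hypothetical and are not taken from the LDR implementation, whose codebooks are learned.

```python
import numpy as np

def quantize(x, codebook):
    """Return the index and vector of the nearest codebook entry (Euclidean)."""
    distances = np.linalg.norm(codebook - x, axis=1)
    index = int(np.argmin(distances))
    return index, codebook[index]

# Hypothetical fixed 2-D codebook with four entries (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

clean = np.array([0.9, 0.1])
noisy = clean + np.array([0.05, -0.08])   # small perturbation
shifted = clean + np.array([-0.6, 0.0])   # large perturbation

clean_index, clean_code = quantize(clean, codebook)
noisy_index, noisy_code = quantize(noisy, codebook)
shifted_index, _ = quantize(shifted, codebook)

print(clean_index, noisy_index, shifted_index)
```

The small perturbation leaves the selected code unchanged, so downstream layers see an identical input, whereas a perturbation large enough to cross a cell boundary changes the code.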

Q2: Which component contributes to the robustness against noisy observation?

A2: We conduct three more ablations of LDR, named LDR_unquantized, LDR_permutation_variant, and LDR_wo_attention. LDR_unquantized is LDR without the quantization module. LDR_permutation_variant is LDR without the set-input block. LDR_wo_attention is LDR whose combiner has no attention mechanism. We compare these three ablations with LDR on the SMAC hard map 8m_vs_9m (10% noise) and the MAMuJoCo scenario Hopper (3x1) (20% noise). We report the mean and standard deviation of the test win rate (%) and episode return across three random seeds as follows:

Table 1: Ablation studies on 8m_vs_9m (test win rate, %).

| Timesteps | LDR | LDR_unquantized | LDR_permutation_variant | LDR_wo_attention |
|---|---|---|---|---|
| 1M | 31.3±10.7 | 14.6±6.5 | 22.9±14.4 | 19.8±9.1 |
| 2M | 63.8±7.0 | 55.2±9.5 | 48.9±16.0 | 63.5±4.8 |
| 3M | 77.0±9.6 | 57.2±11.7 | 67.7±9.3 | 72.9±9.5 |
| 4M | 79.3±4.7 | 54.1±13.2 | 73.9±12.6 | 70.8±12.6 |
| 5M | 81.3±3.1 | 53.1±9.4 | 75.0±14.3 | 76.1±9.0 |

Table 2: Ablation studies on Hopper (3x1) (episode return).

| Timesteps | LDR | LDR_unquantized | LDR_permutation_variant | LDR_wo_attention |
|---|---|---|---|---|
| 2M | 1680.3±188.8 | 1279.8±241.2 | 1292.5±149.7 | 1415.3±222.9 |
| 4M | 1675.9±323.5 | 1504.0±79.6 | 1574.8±407.5 | 1508.7±156.3 |
| 6M | 1875.2±207.4 | 1586.0±164.2 | 1468.2±458.7 | 1531.9±221.0 |
| 8M | 1971.6±255.8 | 1888.1±281.7 | 1540.7±228.6 | 1960.1±262.4 |
| 10M | 2004.3±362.9 | 1379.3±130.9 | 1620.2±372.3 | 1684.3±242.1 |

The detailed curves can be found in the global rebuttal Figure 1. Our experimental results indicated that the quantization module played the most significant role in enhancing the noise robustness, thus validating our ideas. Additionally, we observed that the set-input block and combiner module also contributed to the overall performance improvement of the algorithm. These combined contributions from multiple components enabled our proposed algorithm, LDR, to achieve satisfactory performance in noisy environments.
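
The claim that the quantization module matters most can be checked against the final-timestep means in the two ablation tables of this rebuttal. The short script below only subtracts the reported means (the dictionary layout and the `drops_from_full` helper are illustrative); it ignores the reported standard deviations, which are large relative to some of the gaps with only three seeds.

```python
# Final-timestep means copied from the ablation tables above
# (8m_vs_9m at 5M steps, win rate %; Hopper (3x1) at 10M steps, episode return).
final_means = {
    "8m_vs_9m": {
        "LDR": 81.3,
        "LDR_unquantized": 53.1,
        "LDR_permutation_variant": 75.0,
        "LDR_wo_attention": 76.1,
    },
    "Hopper (3x1)": {
        "LDR": 2004.3,
        "LDR_unquantized": 1379.3,
        "LDR_permutation_variant": 1620.2,
        "LDR_wo_attention": 1684.3,
    },
}

def drops_from_full(means):
    """Gap between full LDR and each ablation, rounded to one decimal."""
    full = means["LDR"]
    return {name: round(full - value, 1)
            for name, value in means.items() if name != "LDR"}

drops = {task: drops_from_full(means) for task, means in final_means.items()}
largest_drop = {task: max(d, key=d.get) for task, d in drops.items()}

for task, d in drops.items():
    print(task, d, "largest:", largest_drop[task])
```

On both tasks the largest gap in means comes from removing the quantization module (28.2 points of win rate and 625.0 return), followed by the set-input block and then the attention mechanism.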

Finally, thank you again for your thoughtful comments. We will incorporate your suggestions into our next revision.

Citations:

[1] Chen, Jiefeng, et al. "Improving adversarial robustness by data-specific discretization." arXiv preprint (2018).

[2] Liu, Qiang, et al. "An empirical study on feature discretization." arXiv preprint (2020).

[3] Gersho, Allen, and Robert M. Gray. Vector quantization and signal compression. Vol. 159. Springer Science & Business Media, 2012.

[4] Cover, Thomas M. Elements of information theory. John Wiley & Sons, 1999.

[5] Ramesh, et al. "Zero-shot text-to-image generation." ICML (2021).

[6] Esser, et al. "Taming transformers for high-resolution image synthesis." CVPR (2021).

Replying to Rebuttal by Authors

Reply to author

Official Comment | Reviewer m33J | 18 Aug 2023, 14:46 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
Comment:

Thank you for the rebuttal. I will leave my score as is.

Replying to Reply to author

Official Comment by Authors

Official Comment | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 18 Aug 2023, 16:28 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
Comment:

We are glad that all the comments of the reviewer have been addressed. Thanks to the reviewer for acknowledging the paper's contribution.

Official Review of Submission2315 by Reviewer Sma1

Official Review | Reviewer Sma1 | 10 Jul 2023, 12:22 (modified: 01 Sept 2023, 10:50) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Reviewer Sma1
Summary:

This paper introduces a novel framework called LDR, which improves the robustness of multi-agent tasks against noise. The framework utilizes a quantization module with a segment mechanism to encode observations and actions, generating discrete representations from learnable codebooks. These representations are then processed through a combiner for decision-making.

Soundness: 2 fair
Presentation: 2 fair
Contribution: 2 fair
Strengths:
  1. The motivation for using a segment mechanism against noise in the observation is clear and reasonable.
  2. Empirical results demonstrate that LDR outperforms the compared baselines.
Weaknesses:
  1. Lack of ablation studies. The paper proposes many components, such as the quantization module, the combiner with cross-attention, and the set-input block, but there is no ablation study showing the effectiveness of each part. How much gain does each component bring? Also, how would the hyperparameters λ and β influence the performance of LDR?
  2. Only MAT and MAPPO are compared; the evaluation could include more baselines such as HAPPO [1].

[1] Kuba, J. G., Chen, R., Wen, M., Wen, Y., Sun, F., Wang, J., & Yang, Y. (2021, October). Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning. In International Conference on Learning Representations.

Questions:
  1. Can you conduct additional experiments for varying L in different environments? Does the conclusion you found hold for all environments? Why do you choose L=256 for all experiments?
  2. Can you explain how exactly the noise is applied to the observation (L257)? How do you calculate normalized observations?
  3. How realistic is your Gaussian assumption for the noise? Can you list some real-world cases? How would the noise affect agents' cooperation? Can you try other distributions of noise?
  4. How do you implement LDR? Which algorithm is the base method? What's the performance of that method?

Minor:

  1. Can you check whether the environment name and the curves are matched in Figure 3?
Limitations:
  1. The assumption for the noise might not be realistic. See Q3 for details.

Post rebuttal: raised score from 3 to 5.

Flag For Ethics Review: No ethics review needed.
Rating: 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Code Of Conduct: Yes

Rebuttal by Authors

Rebuttal | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 07 Aug 2023, 12:09 (modified: 23 Aug 2023, 23:41) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Ethics Reviewers Submitted, Authors
Rebuttal:

We thank reviewer Sma1 for the constructive comments, which will surely help us improve our paper.

Q1: How realistic is your Gaussian assumption for the noise? Can you list some real-world cases? How would the noise affect agents' cooperation? Can you try other distributions of noise?

A1: Gaussian noise is very common in nature [1, 2, 3]. By the central limit theorem, the normalized sum of a large number of independent noise sources with finite variance tends to a Gaussian distribution, so the Gaussian assumption for the noise is plausible. As highlighted in the paper (L27), noise affects the input to the agents' policy networks, and its impact is amplified by the collaboration among multiple agents. We conduct additional experiments under uniform noise:

Table 1: Performance comparisons under uniform noise.

| Scenario | LDR | MAT | MAPPO |
|---|---|---|---|
| 8m_vs_9m | 95.8±1.8 | 91.6±6.5 | 26.0±12.6 |
| Hopper (3x1) | 2236.3±361.3 | 1737.9±157.9 | 1491.8±436.0 |

The detailed curves can be found in the global rebuttal Figure 4. The results showed that LDR still outperformed the baselines, indicating the robustness of LDR to different types of noise.
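
The central-limit argument used in A1 to motivate Gaussian noise can be illustrated with a small simulation. It is a sketch under assumptions chosen here for illustration: each noise source is uniform on [-1, 1], the sources are independent, and excess kurtosis (0 for a Gaussian, -1.2 for a uniform distribution) is used as a simple measure of distance from Gaussianity.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    centered = samples - samples.mean()
    return float((centered**4).mean() / (centered**2).mean() ** 2 - 3.0)

def summed_noise(num_sources, num_samples=100_000):
    """Sum `num_sources` independent uniform noise sources, scaled by sqrt(n)."""
    sources = rng.uniform(-1.0, 1.0, size=(num_samples, num_sources))
    return sources.sum(axis=1) / np.sqrt(num_sources)

single_source = excess_kurtosis(summed_noise(1))   # one uniform source
many_sources = excess_kurtosis(summed_noise(30))   # sum of 30 sources

print(f"1 source: {single_source:.3f}, 30 sources: {many_sources:.3f}")
```

With a single source the excess kurtosis stays near the uniform value, while the sum of thirty sources is already close to the Gaussian value of zero.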

Q2: Lack of ablation studies. How would the hyperparameters λ and β influence the performance of LDR?

A2: We conduct more ablations of LDR, named LDR_unquantized (without the quantization module), LDR_permutation_variant (without the set-input block), and LDR_wo_attention (without an attention mechanism). We compare these ablations with LDR on the SMAC map 8m_vs_9m and the MAMuJoCo scenario Hopper (3x1):

Table 2: Ablation studies regarding components of LDR in different scenarios (8m_vs_9m: test win rate, %; Hopper (3x1): episode return).

| Scenario | LDR | LDR_unquantized | LDR_permutation_variant | LDR_wo_attention |
|---|---|---|---|---|
| 8m_vs_9m | 81.3±3.1 | 53.1±9.4 | 75.0±14.3 | 76.1±9.0 |
| Hopper (3x1) | 2004.3±362.9 | 1379.3±130.9 | 1620.2±372.3 | 1684.3±242.1 |

The detailed curves can be found in the global rebuttal Figure 1. Our experimental results indicated that the quantization module played the most significant role in enhancing the noise robustness, thus validating our ideas. Additionally, we observed that the set-input block and combiner module also contributed to the overall performance improvement of LDR.

β weighs how strongly the codebook aligns with input vectors (L167). Theoretically, increasing β may improve learning speed, but it can also lead to overfitting. In practice, [4] found that the algorithm is quite robust to β. This is because the primary goal of the training is to update the codebook, so fine-tuning β is not as crucial.
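
For reference, the role of β in the VQ-VAE objective of [4] can be sketched numerically. This follows the standard formulation from [4], in which β multiplies the commitment term; whether LDR uses exactly this form is an assumption, and the vectors, the value β = 0.25, and the `vq_terms` helper are illustrative only. Because NumPy has no stop-gradient, the gradients are written out analytically.

```python
import numpy as np

def vq_terms(z_e, e, beta):
    """Loss terms and analytic gradients of the VQ-VAE quantization objective.

    codebook term:   ||sg[z_e] - e||^2        (gradient reaches only e)
    commitment term: beta * ||z_e - sg[e]||^2 (gradient reaches only z_e)
    """
    diff = z_e - e
    squared_error = float(np.sum(diff**2))
    codebook_loss = squared_error
    commitment_loss = beta * squared_error
    grad_codebook = -2.0 * diff        # d(codebook_loss) / d e
    grad_encoder = 2.0 * beta * diff   # d(commitment_loss) / d z_e
    return codebook_loss, commitment_loss, grad_codebook, grad_encoder

z_e = np.array([0.8, 0.3])  # hypothetical encoder output
e = np.array([1.0, 0.0])    # hypothetical selected codebook vector

codebook_loss, commitment_loss, grad_codebook, grad_encoder = vq_terms(z_e, e, beta=0.25)
print(codebook_loss, commitment_loss, grad_codebook, grad_encoder)
```

In this formulation β rescales only the pull on the encoder output, while the codebook update is unaffected, which is one way to read the observation in [4] that results are not very sensitive to β.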

Q3: Only MAT and MAPPO are compared, could compare with some more baselines like HAPPO.

A3: We conduct additional experiments to compare LDR with HAPPO in different scenarios:

  • 8m_vs_9m: 15.6±10.8
  • Hopper (3x1): 1969.8±379.2

The detailed curves can be found in the global rebuttal Figure 2. The additional baseline experiments enhance the persuasiveness of LDR.

Q4: Can you conduct additional experiments for varying L in different environments? Does the conclusion you found hold for all environments? Why do you choose L=256 for all experiments?

A4: We conduct additional experiments with different sizes of codebooks in MAMuJoCo scenario:

Table 3: Performance with different codebook sizes.

| L | 8 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| Episode return | 1434.7±376.6 | 1620.9±459.7 | 1827.2±271.0 | 2004.3±362.9 | 1781.9±253.4 |

The detailed bar graphs can be found in the global rebuttal Figure 3. The results align with the findings stated in the paper, showing that noise robustness initially improves and then declines with an increase in L. We believe that the conclusion should hold for most environments. This is because L needs to trade off noise robustness and expressiveness. To demonstrate the robustness of L selection, we chose the same L across all the environments. However, we acknowledge that there may be room for tuning L in specific environments.
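
The trade-off between expressiveness and noise robustness mentioned in A4 can be seen in a toy scalar setting. The sketch below is not the LDR codebook: it assumes a fixed, evenly spaced codebook on [0, 1], inputs drawn uniformly from [0, 1], and additive Gaussian noise with standard deviation 0.02, all chosen purely for illustration. It measures quantization distortion (lower means more expressive) and the fraction of inputs whose code changes under noise (lower means more robust).

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_codebook(size):
    """Centers of `size` equal-width bins on [0, 1]."""
    return (np.arange(size) + 0.5) / size

def assign(x, codebook):
    """Index of the nearest codebook entry for every input value."""
    return np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)

x = rng.uniform(0.0, 1.0, size=5_000)
noisy_x = x + rng.normal(0.0, 0.02, size=x.shape)

results = {}
for size in (8, 64, 512):
    codebook = uniform_codebook(size)
    codes = assign(x, codebook)
    distortion = float(np.mean((x - codebook[codes]) ** 2))
    flip_rate = float(np.mean(codes != assign(noisy_x, codebook)))
    results[size] = (distortion, flip_rate)
    print(f"L={size}: distortion={distortion:.2e}, flip rate={flip_rate:.2f}")
```

Larger codebooks reduce distortion but make the selected code more sensitive to the same noise, which is consistent with the qualitative explanation above; the toy setting does not by itself reproduce the returns in Table 3.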

Q5: Can you explain how exactly the noise is applied to the observation (L257)? How do you calculate normalized observations?

A5: We apply additive Gaussian noise to the observations to introduce noise. This noise is then added element-wise to the raw observations. We normalize observations based on the range of possible values given by the environmental setting.
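
The procedure in A5 can be sketched as follows. Several details are assumptions made here for illustration rather than facts from the paper: the bounds `low` and `high`, the interpretation of a noise level such as 10% as a standard deviation equal to that fraction of each dimension's range, and the helper names.

```python
import numpy as np

def normalize(obs, low, high):
    """Scale each observation dimension to [0, 1] using the environment's bounds."""
    return (obs - low) / (high - low)

def add_scaled_gaussian_noise(raw_obs, low, high, noise_level, rng):
    """Add element-wise Gaussian noise to raw observations.

    Assumption: the standard deviation is `noise_level` times each
    dimension's range, so the noise is comparable across dimensions.
    """
    unit_noise = rng.standard_normal(raw_obs.shape)
    noisy_raw = raw_obs + noise_level * (high - low) * unit_noise
    return noisy_raw, unit_noise

rng = np.random.default_rng(0)
low = np.array([-10.0, 0.0, -1.0])    # hypothetical per-dimension bounds
high = np.array([10.0, 100.0, 1.0])
raw_obs = np.array([5.0, 25.0, 0.0])
noise_level = 0.1

noisy_raw, unit_noise = add_scaled_gaussian_noise(raw_obs, low, high, noise_level, rng)
clean_normalized = normalize(raw_obs, low, high)
noisy_normalized = normalize(noisy_raw, low, high)
print(clean_normalized, noisy_normalized)
```

Under this assumption, adding range-scaled noise to the raw observation is equivalent to adding Gaussian noise with standard deviation `noise_level` to the normalized observation.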

Q6: How do you implement LDR? Which algorithm is the base method? What's the performance of that method?

A6: We implement LDR based on an autoregressive mechanism and MAPPO, which is one of the baselines in our paper.

Q7: Can you check whether the environment name and the curves are matched in Figure 3?

A7: We have checked the environment name and the curves in Figure 3 and can confirm that they are indeed matched.

Citations:

[1] Cattin, Dr Philippe. "Image restoration: Introduction to signal and image processing." MIAC. 2013.

[2] Singh, Rahul, et al. "Improving robustness via risk averse distributional reinforcement learning." Learning for Dynamics and Control. 2020.

[3] Everett, Michael, et al. "Certifiable robustness to adversarial state uncertainty in deep reinforcement learning." IEEE TNNLS. 2021.

[4] Van Den Oord, et al. "Neural discrete representation learning." NeurIPS. 2017.

Replying to Rebuttal by Authors
回复作者的反驳

Reply to Authors 回复作者

Official Comment | Reviewer Sma1 | 13 Aug 2023, 05:32 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论审稿人 Sma113 Aug 2023, 05:32 (修改:29 Aug 2023, 03:14)项目主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修订
Comment: 评论:

Thanks for your response, which addresses most of my concerns. I would like to raise my score from 3 to 5.
感谢您的回复,解决了我的大部分担忧,我想将我的分数从 3 提高到 5。

Replying to Reply to Authors
回复回复作者

Thank you for updating the rating
感谢您更新评级

Official Comment | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 13 Aug 2023, 12:20 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论作者(Yuqian Fu、Yuanheng Zhu、Jiajun Chai、Dongbin Zhao)2023年8月13日,12:20(修改:2023年8月29日,03:14)程序主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修改
Comment: 评论:

We are happy that we could address your concerns. We feel very grateful that you can raise the score.
我们很高兴能够解决您的疑虑。我们非常感谢您能够提高分数。

Official Review of Submission2315 by Reviewer E6Ba
审稿人E6Ba对Submission2315的正式审稿

Official Review | Reviewer E6Ba | 07 Jul 2023, 04:27 (modified: 01 Sept 2023, 10:50) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Reviewer E6Ba
官方审稿人 E6Ba07 Jul 2023, 04:27 (modified: 01 Sept 2023, 10:50) 项目主席、高级区域主席、区域主席、提交的审稿人、作者、审稿人 E6Ba 修订
Summary: 概括:

The paper introduces a method called Learning Discrete Representations (LDR) to improve the robustness of multi-agent reinforcement learning (MARL) against noise in observations. LDR uses a quantization module with a segment mechanism to generate discrete representations from learnable codebooks. These representations are then processed via a combiner for decision-making. The authors evaluate their method on StarCraft II micromanagement tasks and Multi-Agent MuJoCo with noisy observations. The results show that LDR outperforms existing algorithms in improving robustness while maintaining superior performance in clean observations.
该论文引入了一种称为学习离散表示(LDR)的方法,以提高多智能体强化学习(MARL)针对观察中的噪声的鲁棒性。 LDR 使用具有分段机制的量化模块从可学习的码本生成离散表示。然后通过组合器处理这些表示以进行决策。作者在《星际争霸 II》微管理任务和多智能体 MuJoCo 上通过噪声观测评估了他们的方法。结果表明,LDR 在提高鲁棒性方面优于现有算法,同时在干净观测中保持卓越性能。

Soundness: 3 good 健全性:3 良好
Presentation: 3 good 介绍:3 好
Contribution: 3 good 贡献:3 好
Strengths: 优势:

Originality: The paper has a unique approach to studying the robustness of multi-agent reinforcement learning (MARL) against noise in observations. This originality in tackling a significant issue in MARL contributes to the paper's strengths.
原创性这篇论文采用了一种独特的方法来研究多智能体强化学习(MARL)针对观察中的噪声的鲁棒性。这种解决 MARL 中重大问题的独创性增强了本文的优势。

Quality: See Weaknesses Section.
质量 请参阅弱点部分。

Clarity: The paper offers a concise presentation of the proposed method. The authors provide a thorough explanation of the LDR method, making it easy for readers to understand the concept and its application.
清晰度本文对所提出的方法进行了简洁的介绍。作者对 LDR 方法进行了详尽的解释,使读者能够轻松理解该概念及其应用。

Significance: LDR not only outperforms other baselines but also highlights the importance of learning representations to improve robustness to noise.
意义 LDR 不仅优于其他基线,而且还强调了学习表示对于提高对噪声的鲁棒性的重要性。

Weaknesses: 弱点:

Performance with Respect to Time: One of the limitations of the paper is the lack of information on the performance of the Learning Discrete Representations (LDR) method with respect to time. Including this data would provide more insight into the efficiency of LDR and its practicality for real-world applications.
相对于时间的性能 本文的局限性之一是缺乏有关学习离散表示 (LDR) 方法相对于时间的性能的信息。包含这些数据将有助于更深入地了解 LDR 的效率及其在实际应用中的实用性。

Ablation Study: The paper could benefit from an ablation study of the components of LDR. This would provide more detailed information on which component contributes most to noise robustness. By isolating and analyzing the impact of each component, readers could gain a deeper understanding of how LDR works and which aspects are most critical for its performance. This could also help identify potential areas for further improvement or refinement in the method.
消融研究 本文可以从 LDR 组成部分的消融研究中受益。这将提供关于哪个组件对噪声鲁棒性贡献最大的更详细信息。通过隔离和分析每个组件的影响,读者可以更深入地了解 LDR 的工作原理以及哪些方面对其性能最关键。这也有助于确定该方法需要进一步改进或完善的潜在领域。

Questions: 问题:
  1. Across how many seeds were the experiments on SMAC and Multi-agent MuJoCo run?
    SMAC 和多智能体 MuJoCo 上的实验运行了多少种子?
Limitations: 限制:

The authors adequately addressed the limitations.
作者充分解决了这些局限性。

Flag For Ethics Review: No ethics review needed.
道德审查标志:无需道德审查。
Rating: 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
评级:6:弱接受:技术可靠,影响力中等至高的论文,在评估、资源、再现性、道德考虑方面没有重大问题。
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
信心:4:您对自己的评估有信心,但不是绝对确定。您不太可能(但并非不可能)不理解提交内容的某些部分,或者您不熟悉某些相关工作。
Code Of Conduct: Yes 行为准则:是

Rebuttal by Authors 作者反驳

Rebuttal | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 04 Aug 2023, 21:18 (modified: 23 Aug 2023, 23:41) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Ethics Reviewers Submitted, Authors
反驳作者 ( Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao)04 Aug 2023, 21:18 (modified: 23 Aug 2023, 23:41) 项目主席、高级区域主席、区域主席、提交审稿人、提交道德审稿人、作者修订
Rebuttal: 反驳:

We thank Reviewer E6Ba for the constructive comments, which will surely help us improve our paper.
我们感谢审稿人 E6Ba 的建设性意见,这些意见必将使我们的论文变得更好。

Q1: Performance with Respect to Time.
Q1:相对于时间的表现。

A1: Thanks so much for the suggestion. We conducted experiments to record and compare the wall-clock time of LDR, MAT, and MAPPO in two different scenarios with 10M environment steps, as follows:
A1:非常感谢您的建议。我们进行了实验,记录并比较了 LDR、MAT 和 MAPPO 在两种不同场景下 10M 环境步骤的挂钟时间,如下所示:

Table 1: Wall clock time comparison on 6h_vs_8z and Hopper (3x1) with 10M environment steps.
表 1:6h_vs_8z 和 Hopper (3x1) 上 10M 环境步骤的挂钟时间比较。

| Scenario | LDR (h) | MAT (h) | MAPPO (h) |
|---|---|---|---|
| 6h_vs_8z | 5.6±0.3 | 6.1±0.4 | 5.7±0.0 |
| Hopper (3x1) | 5.3±0.5 | 7.0±0.2 | 7.1±1.2 |

The experimental results clearly demonstrate that LDR exhibits superior training efficiency compared to others. This finding highlights the feasibility of LDR for real-world applications. Also, we will add this comparison to our revised Appendix to better demonstrate the advantages.
实验结果清楚地表明,与其他方法相比,LDR 表现出优越的训练效率。这一发现凸显了 LDR 在实际应用中的可行性。此外,我们还将将此比较添加到修订后的附录中,以更好地展示优势。

Q2: Ablation Study. The paper could benefit from an ablation study of the components of LDR.
Q2:消融研究。这篇论文可以从 LDR 组成部分的消融研究中受益。

A2: We greatly appreciate your suggestion regarding the inclusion of more ablation experiments, as this will enhance the completeness and refinement of our study. LDR consists of three parts: quantization modules, a combiner, and a set-input block. We conduct three more ablations of LDR, named LDR_unquantized, LDR_permutation_variant, and LDR_wo_attention. LDR_unquantized is LDR without the quantization module. LDR_permutation_variant is LDR without the set-input block. LDR_wo_attention is LDR whose combiner has no attention mechanism. We compare these three ablations with LDR on the SMAC hard map 8m_vs_9m (10% noise) and the MAMuJoCo scenario Hopper (3x1) (20% noise). We report the mean and standard deviation of the test win rate (%) on 8m_vs_9m and of the episode return on Hopper (3x1) across three random seeds as follows:
A2:我们非常感谢您关于纳入更多消融实验的建议,因为这将提高我们研究的完整性和完善性。 LDR由三部分组成:量化模块、组合器和设置输入块。我们对 LDR 进行了另外三种消融,分别命名为 LDR_unquantized、LDR_permutation_variant 和 LDR_wo_attention。 LDR_unquantized 表示没有量化模块的 LDR。 LDR_permutation_variant 表示没有设置输入块的 LDR。 LDR_wo_attention 表示没有注意力机制的组合器。我们将这三种消融与 SMAC 硬图 8m_vs_9m(10% 噪声)和 MAMuJoCo 场景 Hopper (3x1)(20% 噪声)上的 LDR 进行比较。我们显示了三个随机种子的测试获胜率 (%) 和剧集回报的平均值和标准差,如下所示:

Table 2: Ablation studies regarding components of LDR on 8m_vs_9m.
表 2:8m_vs_9m 上 LDR 成分的消融研究。

| Timesteps | LDR | LDR_unquantized | LDR_permutation_variant | LDR_wo_attention |
|---|---|---|---|---|
| 1M | 31.3±10.7 | 14.6±6.5 | 22.9±14.4 | 19.8±9.1 |
| 2M | 63.8±7.0 | 55.2±9.5 | 48.9±16.0 | 63.5±4.8 |
| 3M | 77.0±9.6 | 57.2±11.7 | 67.7±9.3 | 72.9±9.5 |
| 4M | 79.3±4.7 | 54.1±13.2 | 73.9±12.6 | 70.8±12.6 |
| 5M | 81.3±3.1 | 53.1±9.4 | 75.0±14.3 | 76.1±9 |

Table 3: Ablation studies regarding components of LDR on Hopper (3x1).
表 3:关于 Hopper (3x1) 上 LDR 成分的消融研究。

| Timesteps | LDR | LDR_unquantized | LDR_permutation_variant | LDR_wo_attention |
|---|---|---|---|---|
| 2M | 1680.3±188.8 | 1279.8±241.2 | 1292.5±149.7 | 1415.3±222.9 |
| 4M | 1675.9±323.5 | 1504.0±79.6 | 1574.8±407.5 | 1508.7±156.3 |
| 6M | 1875.2±207.4 | 1586.0±164.2 | 1468.2±458.7 | 1531.9±221.0 |
| 8M | 1971.6±255.8 | 1888.1±281.7 | 1540.7±228.6 | 1960.1±262.4 |
| 10M | 2004.3±362.9 | 1379.3±130.9 | 1620.2±372.3 | 1684.3±242.1 |

The detailed curves can be found in the global rebuttal Figure 1. Our experimental results indicated that the quantization module played the most significant role in enhancing the noise robustness, thus validating our ideas. Additionally, we observed that the set-input block and combiner module also contributed to the overall performance improvement of the algorithm. These combined contributions from multiple components enabled LDR to achieve satisfactory performance in noisy environments.
详细的曲线可以在全局反驳图1中找到。我们的实验结果表明量化模块在增强噪声鲁棒性方面发挥了最显着的作用,从而验证了我们的想法。此外,我们观察到设置输入块和组合器模块也有助于算法整体性能的提高。多个组件的综合贡献使 LDR 在嘈杂的环境中实现了令人满意的性能。

Q3: Across how many seeds were the experiments on SMAC and Multi-agent MuJoCo run?
Q3:SMAC 和 Multi-agent MuJoCo 上的实验运行了多少种子?

A3: We run three seeds in each experiment, similar to works [1-5].
A3:我们在每个实验中运行三个种子,类似于工作[1-5]。

Finally, thank you again for your thoughtful comments. We will incorporate your suggestions into our next revision.
最后,再次感谢您的深思熟虑的评论。我们会将您的建议纳入我们的下一次修订中。

Citations: 引用:

[1] Cui, Brandon, et al. "Adversarial Diversity in Hanabi." ICLR. 2023.
[1] 崔布兰登等人。 “花火中的对抗性多样性。” ICLR 2023年

[2] Qiu, Wei, et al. "RPM: Generalizable Multi-Agent Policies for Multi-Agent Reinforcement Learning." ICLR. 2023.
[2] 邱伟,等。 “RPM:用于多智能体强化学习的通用多智能体策略。” ICLR。 2023 年。

[3] Omidshafiei, Shayegan, et al. "Beyond rewards: a hierarchical perspective on offline multiagent behavioral analysis." NeurIPS. 2022.
[3] Omidshafiei、Shayegan 等人。 “超越奖励:离线多智能体行为分析的分层视角。”神经信息处理系统。 2022 年。

[4] Shao, Jianzhun, et al. "Self-Organized Group for Cooperative Multi-agent Reinforcement Learning." NeurIPS. 2022.
[4] 邵建准,等. “自组织合作多智能体强化学习小组。”神经信息处理系统。 2022 年。

[5] Chen, Jiayu, et al. "Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems." NeurIPS. 2021.
[5] 陈家宇, 等. “稀疏奖励合作多智能体问题的变分自动课程学习。”神经信息处理系统。 2021 年。

Replying to Rebuttal by Authors
回复作者的反驳

Official Comment by Reviewer E6Ba
审稿人 E6Ba 的官方评论

Official Comment | Reviewer E6Ba | 19 Aug 2023, 16:54 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论审稿人 E6Ba19 Aug 2023, 16:54 (modified: 29 Aug 2023, 03:14)项目主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修订
Comment: 评论:

Thank you for addressing my concerns in your rebuttal. I will leave my score as is.
感谢您在反驳中解决了我的担忧。我将保持我的分数不变。

Replying to Official Comment by Reviewer E6Ba
回复审稿人E6Ba的官方评论

Official Comment by Authors
作者的官方评论

Official Comment | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 21 Aug 2023, 16:03 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论作者(Yuqian Fu、Yuanheng Zhu、Jiajun Chai、Dongbin Zhao)2023年8月21日,16:03(修改:2023年8月29日,03:14)程序主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修改
Comment: 评论:

We are glad that all the comments of the reviewer have been addressed. Thanks to the reviewer for acknowledging the paper's contribution.
我们很高兴审稿人的所有意见都得到了解决。感谢审稿人对本文贡献的认可。

Official Review of Submission2315 by Reviewer bUKo
审稿人 bUKo 对提交 2315 的正式审核

Official Review | Reviewer bUKo | 05 Jul 2023, 09:28 (modified: 01 Sept 2023, 10:50) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Reviewer bUKo
官方审稿人 bUKo05 Jul 2023, 09:28 (modified: 01 Sept 2023, 10:50) 项目主席、高级区域主席、区域主席、提交的审稿人、作者、审稿人 bUKo 修订
Summary: 概括:

This paper proposes Learning Discrete Representations (LDR) to improve robustness against noise in multi-agent tasks. Specifically, LDR employs a quantization module with a segment mechanism to encode observations and teammate actions, generating discrete representations from learnable codebooks. These representations are subsequently processed via a combiner for decision-making. Additionally, the expressiveness of discrete representation and the boundedness of discrete distortion are analyzed theoretically. Empirical results demonstrate that LDR outperforms existing algorithms, improving robustness in noisy cooperative MARL tasks while maintaining superior performance in clean observations.
本文提出学习离散表示(LDR)来提高多智能体任务中对抗噪声的鲁棒性。具体来说,LDR 采用具有分段机制的量化模块来对观察结果和队友动作进行编码,从可学习的密码本生成离散表示。这些表示随后通过组合器进行处理以进行决策。此外,还从理论上分析了离散表示的表现力和离散失真的有界性。实证结果表明,LDR 优于现有算法,提高了嘈杂的协作 MARL 任务中的鲁棒性,同时保持了干净观测中的卓越性能。

Soundness: 3 good 健全性:3 良好
Presentation: 2 fair 展示:2场
Contribution: 2 fair 贡献:2公平
Strengths: 优势:

(1) There is a certain degree of application innovation in this work. It is claimed that this study is the first effort to leverage the discretization idea in MARL problems.
(1)本工作具有一定的应用创新。据称,这项研究是首次在 MARL 问题中利用离散化思想。

(2) The research on related works is relatively comprehensive. The theoretical analysis about the expressiveness of discrete representations and the boundedness of discrete distortion overall seems correct and rigorous.
(2)相关著作的研究比较全面。关于离散表示的表现力和离散失真的有界性的理论分析总体上看来是正确和严谨的。

(3) The main body of the paper is well-written and relatively easy to follow.
(3)论文的主体部分写得很好,比较容易理解。

(4) This work improved robustness against noise in multi-agent tasks, which is of great significance for the deployment of multi-agent systems.
(4)这项工作提高了多智能体任务中抗噪声的鲁棒性,对于多智能体系统的部署具有重要意义。

Weaknesses: 弱点:

(1) While there is a certain degree of application innovation, LDR is achieved by combining VQ-VAE, cross-attention and permutation invariance, and thus lacks methodological innovation.
(1)虽然有一定的应用创新,但LDR是结合VQ-VAE、交叉注意力和排列不变性来实现的,缺乏方法上的创新。

(2) In the experiments, the baselines include only two methods (MAT and MAPPO), which seems unpersuasive. As is known to all, there are many other MARL algorithms that address robustness against noise.
(2) 实验中,基线仅包括两种方法(MAT 和 MAPPO)。似乎缺乏说服力。众所周知,还有许多其他的MARL算法来解决针对噪声的鲁棒性问题。

(3) While the main body of the paper is well-written, there is space for improvement. I defer some of my issues in the appendix to "Questions".
(3)论文主体部分写得不错,但还有改进的空间。我将我的一些问题推迟到“问题”的附录中。

(4) Plots can be improved by reducing the saturation of the background (blue) in Figure 2 (a)-(c) and increasing the color difference between the ‘MLP’, ‘Self-Input block’ and ‘Masked Self-Attention’ modules.
(4) 可以通过以下方式改进绘图: 降低图 2 (a)-(c) 中背景(蓝色)的饱和度,并增加模块“MLP”、模块“自输入块”和模块“的色差”蒙面自我关注”。

Questions: 问题:
  1. Why can discretization improve robustness against noise? In the meeting example in line 55, the noisy information itself is inherently discrete (a large number of words). The essence of discretizing noisy information into discrete words is extracting critical information (features). So, is the essence of discretization feature engineering?
    为什么离散化可以达到抗噪声鲁棒性的提高?在第 55 行会议的示例中,噪声信息本身本质上是离散的(大量单词)。将噪声信息离散化为离散词的本质是提取关键信息(特征)。那么,离散化的本质是特征工程吗?

  2. In Equ.(1) and (2), why is ‘s_0=s’ instead of ‘s_0=b’? What is the difference of V(b) and general V(s)? Is the input of pi?
    等式(1)和(2)中,为什么是‘s_0=s’而不是‘s_0=b’? V(b)和一般V(s)有什么区别?输入的是pi吗?

  3. In Equ.(3), is the dim of ‘a_1:i-1’ dynamic for different i? How do the neural networks deal with the dynamic input of policy pi?
    在式(3)中,‘a_1:i-1’的暗淡对于不同的i是动态的吗?神经网络如何处理策略 pi 的动态输入?

  4. Why not describe the discretization process specifically in terms of b or a_i:i-1? As written, the description seems no different from VQ-VAE.
    为什么不描述绑定 b 或 a_i:i-1 的离散化过程?写了这么多,看起来和VQ-VAE没什么区别。

  5. In Equ.(7), does ‘b_ia_j’ mean product or concatenation? If product, what is the meaning of ‘state product action’?
    在方程(7)中,剂量“b_ia_j”是指乘积还是串联?如果是产品,“状态产品行动”是什么意思?

  6. In Equ.(7), do agents make decisions sequentially? If not, does a_j mean the action of agent j at t-1, assuming the current time is t?
    在式(7)中,智能体是否按顺序做出决策?如果不是,a_j 表示 j 在 t-1 时的动作,假设当前时间为 t?

  7. The observation encoder (set-input block) is the first step in the LDR framework. Why not describe the ‘Set-input Block’ first, matching Fig.2 (a)?
    观察编码器(设置输入块)是 LDR 框架的第一步。为什么不首先描述“设置输入块”,匹配图2(a)?

  8. In Fig.2, what is the specific equation at each addition symbol? Alternatively, what are the inputs and outputs of each module, expressed with the symbols in Equ.(4-7)?
    图2中每个加法符号的具体方程是什么?或者说各个模块的输入输出是什么,用式(4-7)中的符号来描述?

  9. Have comparison experiments with other MARL algorithms been conducted? If so, how do the results compare?
    是否与其他MARL算法进行过对比实验?多好?

Limitations: 限制:

Limitations are explicitly discussed in the paper, and the authors have partially addressed them. As far as I can see, the work has no potential negative societal impact.
论文明确讨论了局限性,并且作者已经部分解决了这些局限性。据我所知,没有潜在的负面社会影响。

Flag For Ethics Review: No ethics review needed.
道德审查标志:无需道德审查。
Rating: 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
评级:6:弱接受:技术可靠,影响力中等至高的论文,在评估、资源、再现性、道德考虑方面没有重大问题。
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
信心:4:您对自己的评估有信心,但不是绝对确定。您不太可能(但并非不可能)不理解提交内容的某些部分,或者您不熟悉某些相关工作。
Code Of Conduct: Yes 行为准则:是

Rebuttal by Authors 作者反驳

Rebuttal | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 06 Aug 2023, 17:53 (modified: 23 Aug 2023, 23:41) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Ethics Reviewers Submitted, Authors
反驳作者 ( Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao)06 Aug 2023, 17:53 (modified: 23 Aug 2023, 23:41) 项目主席、高级区域主席、区域主席、提交审稿人、提交道德审稿人、作者修订
Rebuttal: 反驳:

We thank Reviewer bUKo for the constructive comments, which will surely help us improve our paper.
我们感谢审稿人 bUKo 的建设性意见,这些意见必将使我们的论文变得更好。

Q1: While there is a certain degree of application innovation, LDR lacks methodological innovation.
Q1:LDR虽然有一定的应用创新,但缺乏方法创新。

A1: We agree that the primary contribution of our method lies in its novel application. Nevertheless, we would like to respectfully point out that the simplicity of LDR is an advantage rather than a disadvantage. While these techniques individually may not be novel, their integration and application within the context of noise robustness in MARL represent a significant contribution. We demonstrate through theoretical analysis and experimental results that our proposed approach improves noise robustness, thereby providing a valuable contribution to the field of MARL.
A1:我们同意我们的方法的主要贡献在于其新颖的应用。尽管如此,我们还是想温和地呼吁一下,LDR 的简单性是一个优点而不是缺点。虽然这些技术单独来看可能并不新颖,但它们在 MARL 噪声鲁棒性背景下的集成和应用代表了重大贡献。我们通过理论分析和实验结果证明,我们提出的方法提高了噪声鲁棒性,从而为 MARL 领域做出了宝贵的贡献。

Q2: Have comparison experiments with other MARL algorithms been conducted? As is known to all, there are many other MARL algorithms that address robustness against noise.
Q2:是否与其他MARL算法进行过对比实验?众所周知,还有许多其他的MARL算法来解决针对噪声的鲁棒性问题。

A2: We conduct additional experiments to compare LDR with HAPPO [1] in different scenarios:
A2:我们进行了额外的实验来比较不同场景下的 LDR 和 HAPPO [1]:

  • 8m_vs_9m: 15.6±10.8
    8m_vs_9m:15.6 ± 10.8
  • Hopper (3x1): 1969.8±379.2
    料斗 (3x1):1969.8 ± 379.2

The detailed curves can be found in the global rebuttal Figure 2. As mentioned in related work (L90), current MARL algorithms either focus on mitigating noise in communication between agents or assume that only a single agent is affected by noisy observations. LDR relaxes these limitations by considering that all agents are confronted with noisy observations. We believe that there is a lack of methods to tackle this problem. If there are relevant works that we have missed, we would greatly appreciate it if you could kindly point them out.
详细的曲线可以在全局反驳图2中找到。正如相关工作(L90)中提到的,当前的MARL算法要么专注于减轻智能体之间通信中的噪声,要么假设只有单个智能体受到噪声观测的影响。 LDR 考虑到所有智能体都面临着噪声观测,从而放宽了这些限制。我们认为目前缺乏解决这个问题的方法。如果有我们遗漏的相关作品,请您指出,我们将不胜感激。

Q3: Plots can be improved.
Q3:情节可以改进。

A3: We will distinguish different components more clearly in the next revision, thank you!
A3:我们会在下次版本中更清晰地区分不同的组件,谢谢!

Q4: Is the essence of discretization feature engineering?
Q4:离散化的本质是特征工程吗?

A4: We agree with you that the essence of discretization is feature engineering [2]. Feature engineering is the process of using domain knowledge to extract features from raw data. By discretizing noisy information, we effectively fit the data into bins and reduce the impact of small perturbations, which are often considered noise. The discretization process ensures that small fluctuations are absorbed and represented as discrete states or actions, thereby reducing the impact of noise on the overall learning process [3, 4].
A4:我们同意你的观点,离散化的本质是特征工程[2]。特征工程是利用领域知识从原始数据中提取特征的过程。通过离散化噪声信息,我们可以有效地将数据放入箱中,并减少小扰动(通常被认为是噪声)的影响。离散化过程确保小的波动被吸收并表示为离散状态或动作,从而减少噪声对整个学习过程的影响[3​​, 4]。
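To make the binning argument in A4 concrete, here is a minimal sketch of uniform scalar discretization. It is only an illustration of the general idea that small perturbations tend to land in the same bin; it is not the learned codebook quantization used by LDR, and the function name and bin count are hypothetical.

```python
def discretize(x, num_bins, low=0.0, high=1.0):
    """Map x to the index of its uniform bin over [low, high], clamping at the ends."""
    width = (high - low) / num_bins
    idx = int((x - low) / width)
    return min(max(idx, 0), num_bins - 1)

clean = 0.42
perturbed = clean + 0.03  # small perturbation treated as noise

same_bin = discretize(clean, 10) == discretize(perturbed, 10)
```

Here both values fall in the same bin, so the downstream input is unchanged. A value close to a bin boundary can still flip to a neighbouring bin under the same perturbation, which is one reason bin width (or codebook size) involves a trade-off.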

Q5: In Equ.(1) and (2), why is ‘s_0=s’ instead of ‘s_0=b’? What is the difference of V(b) and general V(s)? Is the input of pi?
Q5:式(1)和式(2)中为什么是‘s_0=s’而不是‘s_0=b’? V(b)和一般V(s)有什么区别?输入的是pi吗?

A5: We apologize for the confusion. In the LDR setting, noise affects only the agent's observations, not the environment's transitions (L123). In Equ.(1) and (2), the notation s_0=s represents the true state corresponding to the noisy observation b. To make this clearer, we will revise it to b_0=b in the next version. Since a state s can give rise to multiple noisy observations b, several V(b) can share the same value as V(s). The inputs of Equ.(1) and (2) are the current joint noisy observation and joint action.
A5:对于造成的混乱,我们深表歉意。在 LDR 的设置中,噪声仅影响智能体的观察,而不影响环境的转换(L123)。在方程(1)和(2)中,符号 s0=s 表示与噪声观测值b相对应的真实状态。为了更清楚地说明这一点,我们将在下一个版本中将其修改为 b0=b 。由于 s 可以有多个可能的噪声观测值 b,因此 V(s) 的值可以与多个 V(b) 相同。式(1)和式(2)的输入是当前的联合噪声观测值和联合动作。

Q6: In Equ.(3), is the dim of ‘a_1:i-1’ dynamic for different i? How do the neural networks deal with the dynamic input of policy pi?
Q6:在式(3)中,‘a_1:i-1’的暗度对于不同的 i 是动态的吗?神经网络如何处理策略 pi 的动态输入?

A6: Yes. We adopt an implementation similar to the one used in Multi-agent Transformer [5] that addresses the dynamic input through attention mechanisms. (The attention mechanism in Transformer allows it to handle input sequences of different lengths.)
答6:是的。我们采用了一种类似于 Multi-agent Transformer [5] 中使用的实现,通过注意力机制处理动态输入。 (Transformer 中的注意力机制允许它处理不同长度的输入序列。)

Q7: Why not describe the discretization process binding b or a_i:i-1?
Q7:为什么不描述绑定b或a_i:i-1的离散化过程?

A7: We would like to clarify that both the discretization processes for b and a_i:i-1 are implemented using the same quantization module but with separate codebooks. This general description of the discretization process was intended to highlight its universality. Unlike VQ-VAE, LDR uses segmentation to enhance the expressiveness of codebooks. In the revised version, we will present a specific explanation of the discretization process, incorporating b.
A7:我们想澄清一下,b 和 a_i:i-1 的离散化过程是使用相同的量化模块但使用单独的码本实现的。对离散化过程的一般描述旨在强调其普遍性。与VQ-VAE不同,LDR使用分段来增强码本的表达能力。在修订版中,我们将具体解释离散化过程,并结合b。
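The segmentation idea in A7 can be sketched as follows: a vector is split into equal-length segments, and each segment is replaced by its nearest codeword. This is a simplified illustration under stated assumptions, namely a single codebook shared by all segments, squared Euclidean distance for the lookup, and a vector length divisible by the number of segments; the names `nearest_code` and `quantize` are hypothetical and the authors' module may differ.

```python
def nearest_code(segment, codebook):
    """Return the index of the codeword closest to `segment` (squared Euclidean distance)."""
    def dist2(code):
        return sum((s - c) ** 2 for s, c in zip(segment, code))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))

def quantize(vector, codebook, num_segments):
    """Split `vector` into equal segments and replace each by its nearest codeword."""
    seg_len = len(vector) // num_segments
    indices, quantized = [], []
    for s in range(num_segments):
        segment = vector[s * seg_len:(s + 1) * seg_len]
        k = nearest_code(segment, codebook)
        indices.append(k)
        quantized.extend(codebook[k])
    return indices, quantized

# Hypothetical codebook with three 2-dimensional codewords.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
indices, quantized = quantize([0.1, -0.1, 0.9, 1.2], codebook, num_segments=2)
```

In this sketch, K codewords and L segments allow up to K^L distinct quantized vectors, compared with K when the whole vector is matched to a single codeword, which gives some intuition for why segmentation can increase expressiveness.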

Q8: In Equ.(7), does ‘b_ia_j’ mean product or catenation?
Q8:式(7)中,‘b_ia_j’是指乘积还是串联?

A8: $\hat{b}_i \hat{a}_j$ denotes a product. Taking the product of the quantized noisy observation (query) and the quantized teammate action (key) captures the relevance between them. By calculating the attention weights, we can effectively weigh the teammate's action (value), thus facilitating the decision-making process of the agent.
A8: bi^a^j 表示产品。状态积动作使我们能够捕获量化的噪声观察(查询)和量化的队友的动作(关键)之间的相关性。通过计算注意力权重,我们可以有效地权衡队友的行动(价值),从而促进智能体的决策过程。
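The query/key/value reading in A8 can be sketched with a plain dot-product attention step. The scaling by the square root of the dimension and the softmax follow the standard Transformer formulation and are assumed here rather than taken from the paper; the sketch also assumes at least one preceding teammate action, and `attend` is a hypothetical name.

```python
import math

def attend(query, keys, values):
    """Weigh `values` by the softmax of scaled dot products between `query` and each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

query = [1.0, 0.0]  # stands in for a quantized observation of agent i
# One preceding teammate action, then three: the output size stays the same.
one = attend(query, [[1.0, 0.0]], [[2.0, 3.0]])
three = attend(
    query,
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 3.0], [0.0, 1.0], [4.0, 5.0]],
)
```

Because the weighted sum collapses any number of keys into a fixed-size vector, the same function handles a varying number of preceding actions, which is the property A6 relies on for the dynamic input a_1:i-1.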

Q9: In Equ.(7), do agents make decisions sequentially?
Q9:在式(7)中,智能体是顺序做出决策的吗?

A9: Yes, we use auto-regressive mechanisms (L131) to make decisions.
A9:是的,我们使用自回归机制(L131)来做出决策。

Q10: Why not describe the ‘Set-input Block’ firstly, matching the Fig.2 (a)?
Q10:为什么不首先描述“设置输入块”,匹配图2(a)?

A10: Thanks for your feedback. We introduce the quantization module first as we believe it is the central contribution of our work. However, we understand the importance of maintaining a logical flow that aligns with the figures presented. In the revised version, we will first describe the set-input block, as illustrated in Figure 2(a). Then, we will describe the quantization module and the combiner.
A10:感谢您的反馈。我们首先介绍量化模块,因为我们相信它是我们工作的核心贡献。然而,我们了解保持与所呈现的数字一致的逻辑流程的重要性。在修订版本中,我们将首先描述设置输入块,如图2(a)所示。然后,我们将描述量化模块和组合器。

Q11: In Fig.2, what is the specific equation at each addition symbol?
Q11:图2中每个加法符号的具体方程式是什么?

A11: The addition symbol represents vector addition (similar to residual blocks). This addition operation is used to address the issue of gradient vanishing to some extent. To clarify this, we will update it in the revised version.
A11:加法符号表示向量加法(类似于残差块)。这种加法操作在一定程度上解决了梯度消失的问题。为了澄清这一点,我们将在修订版中进行更新。

Citations: 引用:

[1] Kuba, J. G., et al. "Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning." ICLR. 2022.
[1] 库巴,J.G.,等人。 “多代理强化学习中的信任区域策略优化。” ICLR。 2022 年。

[2] Dong, Guozhu, et al. "Feature engineering for machine learning and data analytics". CRC press, 2018.
[2] 董国柱,等. “机器学习和数据分析的特征工程”。 CRC出版社,2018年。

[3] Chen, Jiefeng, et al. "Improving adversarial robustness by data-specific discretization." arXiv preprint (2018).
[3] 陈杰峰,等. “通过特定于数据的离散化来提高对抗鲁棒性。” arXiv 预印本(2018)。

[4] Liu, Qiang, et al. "An empirical study on feature discretization." arXiv preprint (2020).
[4] 刘强,等。 “特征离散化的实证研究。” arXiv 预印本(2020)。

[5] Wen, Muning, et al. "Multi-agent reinforcement learning is a sequence modeling problem." NeurIPS. 2022.
[5] 温穆宁,等. “多智能体强化学习是一个序列建模问题。”神经信息处理系统。 2022 年。

Replying to Rebuttal by Authors
回复作者的反驳

Official Comment by Reviewer bUKo
审稿人 bUKo 的官方评论

Official Comment | Reviewer bUKo | 17 Aug 2023, 15:23 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论审稿人 bUKo17 Aug 2023, 15:23(修改:29 Aug 2023, 03:14)项目主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修订
Comment: 评论:

In this second round of review, I am grateful for the changes and improvements the authors have made in response to my feedback. The authors have provided thorough explanations to the concerns I raised, and they have incorporated necessary revisions to make the paper more comprehensible and coherent. Particularly commendable is the authors' expansion of the experimental section. The additional details provided in this section enhance the reader's understanding of the methodology and results. While the authors have made improvements in this revision, they have not yet demonstrated substantial advancements over existing methods. To conclude, the authors have responded positively to the issues raised in this review, resulting in an improved paper. However, I still suggest that they invest further effort into fostering innovation, which will make the paper more compelling and competitive. I agree to increase the score of this article by 2 points.
在第二轮评审中,我感谢作者根据我的反馈所做的更改和改进。作者对我提出的问题提供了详尽的解释,并进行了必要的修改,使论文更加易于理解和连贯。特别值得称赞的是作者对实验部分的扩展。本节提供的其他详细信息可增强读者对方法和结果的理解。虽然作者在这次修订中做出了改进,但他们尚未展示出相对于现有方法的实质性进步。总而言之,作者对本次综述中提出的问题做出了积极回应,从而改进了论文。不过,我仍然建议他们进一步努力促进创新,这将使论文更具吸引力和竞争力。我同意将这篇文章的分数提高2分。

Replying to Official Comment by Reviewer bUKo
回复审稿人 bUKo 的官方评论

Thank you for updating the rating
感谢您更新评级

Official Comment | Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 17 Aug 2023, 16:31 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论作者(Yuqian Fu、Yuanheng Zhu、Jiajun Chai、Dongbin Zhao)2023年8月17日,16:31(修改:2023年8月29日,03:14)程序主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修改
Comment: 评论:

We are pleased to hear that we could address your concerns and you can raise the score. Your constructive feedback makes our paper better.
我们很高兴得知我们可以解决您的疑虑并且您可以提高分数。您的建设性反馈使我们的论文变得更好。

Replying to Rebuttal by Authors
回复作者的反驳

Official Comment by Reviewer bUKo
审稿人 bUKo 的官方评论

Official Comment | Reviewer bUKo | 17 Aug 2023, 16:37 (modified: 29 Aug 2023, 03:14) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
官方评论审稿人 bUKo17 Aug 2023, 16:37(修改:29 Aug 2023, 03:14)项目主席、高级区域主席、区域主席、提交的审稿人、作者、道德审稿人提交的修订
Comment: 评论:

In this second round of review, I am satisfied with the changes and improvements the authors have made in response to my feedback. The authors have provided thorough explanations to the concerns I raised, and they have incorporated necessary revisions to make the paper more comprehensible and coherent. Particularly commendable is the authors' expansion of the experimental section. The additional details provided in this section enhance the reader's understanding of the methodology and results. While the authors have made improvements in this revision, they have not yet demonstrated substantial advancements over existing methods. To conclude, the authors have responded positively to the issues raised in this review, resulting in an improved paper. However, I still suggest that they invest further effort into fostering innovation, which will make the paper more compelling and competitive. I agree to increase the score of this article by 2 points.
在第二轮评审中,我对作者根据我的反馈所做的更改和改进感到满意。作者对我提出的问题提供了详尽的解释,并进行了必要的修改,使论文更加易于理解和连贯。特别值得称赞的是作者对实验部分的扩展。本节提供的其他详细信息可增强读者对方法和结果的理解。虽然作者在这次修订中做出了改进,但他们尚未展示出相对于现有方法的实质性进步。总而言之,作者对本次综述中提出的问题做出了积极回应,从而改进了论文。不过,我仍然建议他们进一步努力促进创新,这将使论文更具吸引力和竞争力。我同意将这篇文章的分数提高2分。

Official Review of Submission2315 by Reviewer ypgJ
审稿人 ypgJ 对 Submission2315 的正式审稿

Official Review | Reviewer ypgJ | 15 Jun 2023, 03:53 (modified: 01 Sept 2023, 10:50) | Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Reviewer ypgJ
官方审稿人 ypgJ15 Jun 2023, 03:53 (modified: 01 Sept 2023, 10:50) 项目主席、高级区域主席、区域主席、提交的审稿人、作者、审稿人 ypgJ 修订
Summary: 概括:

The authors propose LDR, a method for addressing noisy observations in MARL frameworks via learnable discretizations.
作者提出了 LDR,一种通过可学习的离散化来解决 MARL 框架中的噪声观测的方法。

Motivated by realistic settings, wherein agents might receive noisy or corrupted observations, instead of perfectly "clean" observations, the authors seek to develop a method that improves robustness in the face of noise. Vector quantization methods, which generate discrete representations embedded in a continuous space, are proposed to mitigate such noise issues. Some theoretical analysis shows how a large enough codebook will not distort representations more than a small bound.
受现实环境的启发,代理可能会收到噪声或损坏的观察结果,而不是完全“干净”的观察结果,作者试图开发一种方法来提高面对噪声时的鲁棒性。提出了矢量量化方法来减轻此类噪声问题,该方法生成嵌入连续空间中的离散表示。一些理论分析表明,足够大的码本不会使表示失真超过一个小范围。

In experiments, the authors train agents in noiseless settings and evaluate agents (both LDR and MAT/MAPPO baselines) when observations are corrupted by varying levels of Gaussian noise. In all settings with noise, LDR agents perform better than baselines, although all methods suffer with high enough noise levels.
在实验中,作者在无噪声环境中训练智能体,并在观察结果被不同级别的高斯噪声破坏时评估智能体(LDR 和 SMAC/MAPPO 基线)。在所有有噪声的环境中,LDR 代理的表现都优于基线,尽管所有方法都受到足够高的噪声水平的影响。

Updates during rebuttal period:
反驳期间更新:

I have increased my overall score from 3 to 5. The authors included further results to address my questions about discretization in general vs. the specific discretization method they originally used.

Soundness: 2 fair
Presentation: 4 excellent
Contribution: 3 good
Strengths:

The paper has many strengths including clear writing and strong results. Overall, I like both the motivation (robustness to noise) and many of the underlying techniques (vector quantization).

Originality

This work proposes a somewhat novel problem, considering robustness to observational noise in MARL settings. According to the authors, prior work has only considered noise in communication, actions, rewards, or policies of other agents, whereas their work, "concentrates on the noise in the observation." I have some questions about noise in communication vs. observations, which I raise later in the review.

Quality

This paper is well-written, and the experiments are very well done. I have some concerns about the problem formulation, which I discuss in the Weaknesses section. Again, I emphasize that the experimental setup and analysis is quite nice. The core experiments demonstrating robustness were well done, and followup work on the codebook size was quite nice. Lastly, the website, with anonymized code, is very good too.

Clarity

I found this paper quite clear and easy to understand. From the start of the paper, I felt like I knew what to expect, and the authors worked through the development and experiments in a clear way.

Significance

I have conflicted views about the significance of this work. On the one hand, the authors have proposed a relatively simple (in a good way!) change to improve robustness of MARL agents. On the other hand, as I discuss later, I have concerns that the real contribution of this work, and the reason the method achieves good results, is not actually what this paper focuses on. This mis-attribution would limit the significance of this work.

Weaknesses:

My main concerns with this work relate to theory and framing of this work. As discussed in the strengths section, I think the experiments are well done and point to solid empirical benefits to this method. However, some of the claims in this work appear wrong or not quite supported by the experiments.

  1. Relation to prior work

I think this work is proposing something novel, and is by no means subsumed by prior work, but I might recommend that the authors scale back some of their claims of being "the first effort to leverage the discretization idea in MARL problems." Any number of discrete emergent communication works can be viewed as using discretizations in MARL frameworks.

The authors also state that their work "concentrates on the noise in the observation..." as opposed to prior methods, which focus on, among other things, noise in communications. I view communication as, among other things, an observation. Therefore, if other works consider noisy communication channels, how is the authors' framing of considering noisy observations distinct? (I suspect that the authors can make a distinction by stating that they consider more general noise or something.) The authors should also cite [1, 2] for work on noisy communication.

[1] Catalytic Role Of Noise And Necessity Of Inductive Biases In The Emergence Of Compositional Communication. Kuciński et al. 2021

[2] Emergent Discrete Communication in Semantic Spaces. Tucker et al. 2021

  2. Problem formulation

I am somewhat confused by the problem formulation in Section 3.1 and think it could be significantly simplified. In my reading from lines 112 to 130, the authors discuss both standard Dec-POMDPs and their novel "ON-Dec-POMDPs". As I understand it, the authors are trying to highlight how, in ON-Dec-POMDPs, agent observations may be noisy.

I am confused because I believe that is already covered by standard Dec-POMDPs. Even in just a (single-agent) POMDP formulation, the observation function is typically a stochastic function. For example, the Wikipedia article for POMDPs writes that "O is a set of conditional observation probabilities." Thus, noisy observations are already naturally captured by normal Dec-POMDP formulations. I do not understand why the authors have proposed a different framework.

If I have misunderstood this section, could the authors please clarify in the rebuttal? Otherwise, the very simple fix would be to just remove the ON-Dec-POMDPs and properly define a Dec-POMDP with a stochastic observation function.

  3. Theoretical calculations of model capacity.

The authors include a discussion of the model capacity (or, using their phrasing, the mutual information). Equation 8 (and Appendix A) state that I(e; h) >= log(L).

This is provably false. Using notation from the paper, e is the continuous embedding, h is the discrete representation, and there are L elements in the codebook. Let us set G = 1 (the number of times to segment h).

For an example, let us say that we have a really bad encoder, where there is one codebook element at the origin in the latent space, and all other codebook elements are (nearly) infinitely far from the origin. Lastly, in this example, let us say that the encoder always outputs continuous embeddings, e, somewhat spread out but close to the origin.

In this example, every continuous embedding will be discretized to the h at the origin. Thus, I(e; h) = 0. Clearly, this is less than log(L). This disproves the statement in Equation 8.

I suspect that the authors messed up the direction of an inequality. At least intuitively, for G = 1, I(e; h) should be upper-bounded (rather than lower-bounded) by the entropy of the categorical distribution of the codebook. This follows from I(X; Y) = H(X) - H(X|Y) <= H(X).
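
The direction of the inequality can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes scalar inputs on a finite grid so that mutual information can be computed by counting, and the helpers `mutual_information` and `nearest_code` are hypothetical names. It compares the degenerate codebook from the example above with an evenly spread one.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in nats) of a list of (input, code) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    p_in = Counter(x for x, _ in pairs)
    p_code = Counter(k for _, k in pairs)
    mi = 0.0
    for (x, k), c in joint.items():
        # p(x, k) / (p(x) p(k)) = c * n / (count(x) * count(k))
        mi += (c / n) * math.log(c * n / (p_in[x] * p_code[k]))
    return mi

def nearest_code(x, codebook):
    """Index of the codebook entry closest to the scalar x."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

# Assumed inputs: scalars spread out near the origin.
inputs = [i / 100 for i in range(-50, 51)]
L = 8

# Degenerate codebook: one code at the origin, all others very far away.
degenerate = [0.0] + [1e9 * (i + 1) for i in range(L - 1)]
mi_bad = mutual_information([(x, nearest_code(x, degenerate)) for x in inputs])

# Evenly spread codebook covering the input range.
spread = [-0.5 + (i + 0.5) / L for i in range(L)]
mi_good = mutual_information([(x, nearest_code(x, spread)) for x in inputs])

print(mi_bad, mi_good, math.log(L))
```

In this sketch the degenerate codebook yields a mutual information of 0, while the spread codebook yields a positive value that stays at or below log(L), which is consistent with reading log(L) as an upper bound for G = 1.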

As a brief aside, while I find Theorem 1, about bounded distortion as the codebook size grows, interesting, it seems largely at odds with the main thrust of the paper. In particular, if most of the paper is about how using discrete representations improves robustness by passing through a discrete bottleneck, this section seems to be stating that the bottleneck is not that tight and, in particular, as L grows, it becomes like a continuous representation space.
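
The intuition that a growing codebook approaches a continuous space can be illustrated with a toy case. This is a minimal sketch under stated assumptions, not the paper's Theorem 1 or its proof: it uses a hypothetical 1-D codebook evenly spaced on [0, 1], for which the worst-case nearest-neighbor distortion is 1/(2L). The function names are illustrative.

```python
def uniform_codebook(L):
    """L codes evenly spaced on [0, 1], each at the center of a cell of width 1/L."""
    return [(i + 0.5) / L for i in range(L)]

def max_distortion(L, grid=1001):
    """Largest distance from any grid point in [0, 1] to its nearest code."""
    codebook = uniform_codebook(L)
    xs = [j / (grid - 1) for j in range(grid)]
    return max(min(abs(x - c) for c in codebook) for x in xs)

for L in (2, 8, 32):
    # The measured worst case matches the analytic value 1 / (2L).
    print(L, max_distortion(L), 1 / (2 * L))
```

As L increases, the distortion shrinks toward zero, so the bottleneck becomes progressively less restrictive, which is the tension with the robustness argument raised above.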

  4. Theoretical guarantees, or intuition, for robustness via discretization.

My greatest concern with the current framing of this work is its fundamental claim that discrete representations support robustness to noise. To be clear, I understand that the experimental results are compelling, and (in my quick experiments with the code) I was able to reproduce the results.

My issue is that I do not think that the claims that discretization provides robustness are substantiated. Intuitively, I understand that, if the agents can robustly generate the right discrete representation given a noisy input, robustness will improve. However, I do not see how generating a representation should be any more robust to noise than taking an action. In other words, couldn't noise mess up the discretization process just as much as it messes up normal action selection?

I readily concede, again, that the results are compelling insofar as they argue that the LDR agents outperform other models. Beyond passing through a discrete bottleneck, however, LDR agents also differ from baselines via the precise discretization process. If the authors really wanted to make a strong claim about discretization, they should consider other discrete representations via the gumbel-softmax trick, for example.

My suspicion is that it is the vector quantization mechanism of categorization based on proximity in a latent space that affords the robustness benefits reported in the paper.
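
Both sides of this point can be seen in a toy nearest-neighbor quantizer. The sketch below is illustrative only and assumes a hypothetical 1-D codebook with Gaussian noise; `quantize` and `flip_rate` are made-up helper names, not functions from the paper's code. Inputs far from a cell boundary keep their code under small noise, while inputs near a boundary are frequently reassigned.

```python
import random

def quantize(x, codebook):
    """Return the index of the codebook entry nearest to the scalar x."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

def flip_rate(x, codebook, noise_std, trials=2000, seed=0):
    """Fraction of noisy copies of x that land on a different code than x."""
    rng = random.Random(seed)
    clean = quantize(x, codebook)
    flips = sum(
        quantize(x + rng.gauss(0.0, noise_std), codebook) != clean
        for _ in range(trials)
    )
    return flips / trials

codebook = [0.0, 1.0, 2.0, 3.0]  # hypothetical codes; cell boundaries at 0.5, 1.5, 2.5

center_rate = flip_rate(1.0, codebook, noise_std=0.1)     # input sits on a code
boundary_rate = flip_rate(1.45, codebook, noise_std=0.1)  # input sits near a boundary
print(center_rate, boundary_rate)
```

Proximity-based assignment therefore absorbs perturbations that are small relative to the spacing between codes, but it offers no protection for inputs near a boundary, which is why attributing the gains to discretization in general needs separate evidence.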

  5. Minor points:

I find the graphs in Figure 3 (and Figures 9 and 10) misleading. The red line represents the noiseless performance after convergence, right? And the other curves are the training curves, evaluated under noise, at different checkpoints during training. Instead of plotting a flat red dashed line representing peak performance, why don't the authors plot the noiseless performance as a training curve as well, changing over time? That seems much more clear and informative, to me.

In the first sentence of the abstract, the authors write that "agents often face inaccurate environments..." The phrase "inaccurate environments" doesn't seem right. Environments are always "correct." The authors may observe inaccurate views of their environment. Obviously, this is very minor, but I suggest the authors reword that little section.

In 5.4, the authors discuss visualizations of noisy observations (from Figure 6). The authors write, "discretization can be interpreted as a smooth operation in the space of noisy observations, which can reduce noise effectively." I don't understand how discretization is a smooth process. This seems unclear.

Overview

Based on these limitations, including what I believe are false theoretical claims, I am arguing for rejection. I think many of these issues are addressable, however, so I look forward to engaging with the authors during the review process.

Questions:
  1. What are the input formats and noise models for the different experiments?

The authors write that observations are corrupted by zero-mean Gaussian noise. In experiments, they describe corruptions by 20% or 30%. What are these percentages of? Further, without knowing the exact input features (which may be written somewhere, but I cannot find them now), it is hard to calibrate what the different noise levels are corrupting.

  2. Please address my question about the bound on mutual information of the codebook. Did I misunderstand something?

  3. I am happy to engage in discussion about discretization in general vs. vector quantization specifically. Do the authors think that it is discretization in particular that is generating robustness benefits?

Limitations:

The authors did a good job discussing limitations of their approach. In particular, I appreciated the discussion about a tradeoff with codebook size, noting how there seems to be an important balance to strike between large enough to capture important information, and not too large to reduce benefits of robustness.

Flag For Ethics Review: No ethics review needed.
Rating: 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Code Of Conduct: Yes

Rebuttal by Authors

Rebuttal by Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 08 Aug 2023, 18:47 (modified: 23 Aug 2023, 23:41) | Readers: Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Ethics Reviewers Submitted, Authors
Rebuttal:

We thank Reviewer ypgJ for the constructive comments, which will surely help improve our paper.

Q1: Relation to prior work.

A1.1: (scale back claims) We acknowledge that the claim may be overly ambitious. Indeed, there are prior works that can be viewed as using discretizations in MARL frameworks, in the context of discrete emergent communication. We will scale back our claims accordingly to state that our work is the "first effort to leverage the discretization idea in robust MARL".

A1.2: (communication is an observation) Yes, communication can be seen as an observation. We would like to highlight that noise in communication and noise in observation arise at different levels and have different impacts. Noise in communication includes both channel-related noise and lossy networking [1-4]. In our work, we consider observation noise, which is inherent in the agent's sensors. If agents build communication on noisy observations, the messages become inaccurate (as they are derived from observations), potentially leading to greater impacts. We will include the relevant citations you provided in the revised version.

Q2: Problem formulation.

A2: Yes. We agree with you and acknowledge that ON-Dec-POMDP can be seen as a version of robust Dec-POMDP where the policy needs to be robust under noisy and incomplete observations [5]. This also explains why baselines based on Dec-POMDP can still achieve some level of reward. To improve clarity, we will remove ON-Dec-POMDP in the revised version and integrate noisy observations with the Dec-POMDP. (We have made this change in the revised version but cannot show it here due to the character limit.) Thank you for bringing this to our attention and helping us improve the quality of our work.

Q3: Theoretical calculations of model capacity.

A3: We apologize for the confusion. We have reexamined the proof and confirmed its correctness. Firstly, we would like to clarify that in our paper, e represents the codebook vector, while h represents the continuous input. In your example, all h are discretized to the e at the origin. In this case, I(e;h) should indeed equal log(L) (we can provide a detailed derivation in the discussion stage). From another perspective, the codebook and the embedding vectors are not independent, so I(e;h) should not be equal to 0. However, there is a case where I(e;h) = 0: when all codebook vectors are located at the origin and h is also close to the origin. In this case, p(e=h) = 1, but this situation is unrealistic during training, because codebook vectors would undergo small-scale shifts to align with different h. Our paper assumes p(e=h) = 1/L, which is consistent with VQ-VAE [6].

Q4: Theorem 1 seems largely at odds with the main thrust.

A4: Theorem 1 highlights the relationship between codebook size and discrete distortion. While discretization enhances robustness, it also introduces inherent information loss, so a balance must be struck between robustness and distortion. We will remove the statement (L222) to preserve the main thrust of our work.

Q5: Theoretical guarantees.

A5.1: (Theoretical guarantees) By discretizing noisy information, we effectively fit the data into bins and reduce the impact of small perturbations, which are often considered noise. This is the process of “smoothing”, wherein each bin smooths fluctuations, thus reducing noise in the data [7, 8].

A5.2: (Discussion about discretization vs. quantization) We agree that VQ based on categorization using proximity in a latent space can enhance robustness. However, as mentioned in A5.1, we believe that discretization can also provide robustness advantages. In response to your suggestion, we conducted experiments using the gumbel-softmax trick as a discretization method:

  • 8m_vs_9m: 71.9±15.7
  • Hopper(3x1): 1553.1±168.6

The detailed curves can be found in Figure 1 of the global rebuttal. Under noisy observations, the discretization algorithm still outperforms the baselines. In the revised version, we will discuss both discretization and vector quantization. Given the character limit, we look forward to discussing this further with you!
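
For readers unfamiliar with the Gumbel-softmax trick mentioned in A5.2, the following is a minimal standard-library sketch of the sampling step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, and the example logits are all hypothetical, and the straight-through gradient used in training frameworks is omitted.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed (soft) one-hot sample over categories given logits."""
    rng = rng or random.Random()
    # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
    gumbels = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_sample(soft):
    """Turn a soft sample into a one-hot vector by taking its argmax."""
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]

rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]  # assumed category probabilities
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
hard = hard_sample(soft)
print(soft, hard)
```

Unlike vector quantization, the selected category here depends on learned logits plus sampled noise rather than on proximity to a codebook vector in latent space, which is the distinction the additional experiments are meant to probe.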

Q6: What are the input formats and noise models for the different experiments?

A6: The noise level refers to the percentage of noise amplitude relative to the range of normalized observations (L257). The information included in the input can be found in L249 and L253. More details can be found in the corresponding environment documentation. Due to character limitations, we regret that we are unable to present the input format here.
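
One possible reading of this definition is sketched below. It assumes that "noise amplitude" means the standard deviation of zero-mean Gaussian noise and that observations are normalized to a known range; both are assumptions, as are the function name and the default range, and the paper's actual noise model may differ.

```python
import random

def add_observation_noise(obs, level, obs_min=0.0, obs_max=1.0, rng=None):
    """Add zero-mean Gaussian noise with std = level * (obs_max - obs_min)."""
    rng = rng or random.Random()
    sigma = level * (obs_max - obs_min)  # e.g. level=0.2 for "20% noise"
    return [o + rng.gauss(0.0, sigma) for o in obs]

rng = random.Random(0)
clean = [0.2, 0.5, 0.9]  # hypothetical normalized observation features
noisy = add_observation_noise(clean, level=0.2, rng=rng)
print(noisy)
```

Under this reading, a 20% noise level on features normalized to [0, 1] corresponds to a standard deviation of 0.2 per feature.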

Q7: Minor points

A7.1: (Misleading figures) Thank you for pointing out the need to improve these figures. We will revise them in the revised version.

A7.2: (Environments are always correct) Thank you for pointing this out, and we will reword the first sentence of the abstract in the revised version.

Citations:

[1] Xue, Wanqi, et al. "Mis-spoke or mis-lead: Achieving Robustness in Multi-Agent Communicative Reinforcement Learning." AAMAS. 2022.

[2] Tu, James, et al. "Adversarial attacks on multi-agent communication." ICCV. 2021.

[3] Zhang, et al. "Succinct and robust multi-agent communication with temporal message control." NeurIPS. 2020.

[4] Sun, Yanchao, et al. "Certifiably Robust Policy Learning against Adversarial Multi-Agent Communication." ICLR. 2023.

[5] Zhang, Huan, et al. "Robust Reinforcement Learning on State Observations with Learned Optimal Adversary." ICLR. 2021.

[6] Van Den Oord, et al. "Neural discrete representation learning." NeurIPS. 2017.

[7] Chen, Jiefeng, et al. "Improving adversarial robustness by data-specific discretization." arXiv preprint. 2018.

[8] Liu, Qiang, et al. "An empirical study on feature discretization." arXiv preprint. 2020.

Replying to Rebuttal by Authors

Thanks, brief followups

Official Comment by Reviewer ypgJ | 10 Aug 2023, 20:22 (modified: 29 Aug 2023, 03:14) | Readers: Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
Comment:

Question about gumbel-softmax results

Thank you for performing the gumbel-softmax experiments. I think that nicely separates aspects of discretization generally vs. vector quantization specifically. I see Figure 1 in the attached PDF but want to make sure I understand it. For each domain, you plot reward over training time. What is the noise level used during evaluation? I'd guess 20%, because that's what you used in Figure 3 in the main paper, and the Hopper (3x1) results for LDR line up with that. But then I'm guessing the StarCraft results are for 10% noise, based on results in Table 1 in the main paper?

Assuming that's the right interpretation of noise levels (and please correct me if I'm wrong), then, at least in the Hopper domain, it doesn't seem like Gumbel-softmax discretization results meaningfully outperform the non-discrete baselines. As in the figure at the last timestep, and written elsewhere in the rebuttal, the GS agents achieve reward of about 1550. Zooming in to Figure 3 of the main paper, that seems very similar to the MAT and MAPPO results, right? Or am I misinterpreting?

I also readily concede that GS outperforms MAT and MAPPO in the SMAC domain (again, assuming I made the correct assumptions about noise).

Questions about theoretical capacity.

Sorry, I think I still don't understand how the mutual information >= log(L). I think we agree on my example setup: all continuous representations, h, are discretized to the same encoding, e. If literally all inputs are encoded via the same discrete representation, doesn't that mean the mutual information of input and representation is 0? E.g., I(e;h)=0? I believe I disagree with the authors in this analysis, as they write about the same example, "In this case, I(e;h) should indeed equal log(L)... From another perspective, the codebook and the embedding vectors are not independent, so I(e;h) should not be equal to 0."

I understand the frustrations of character limits, so I welcome a more extensive discussion of this point.

Overall updates

Overall, I appreciate the new experiments that the authors conducted in this rebuttal. I'm inclined to increase my score once 1) the authors clarify my very brief question about noise levels in the new figures attached in the rebuttal PDF and 2) the authors and I figure out my confusion about the theoretical capacity of the discretization process.

Replying to Thanks, brief followups

Response to the follow-up points

Official Comment by Authors (Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Dongbin Zhao) | 11 Aug 2023, 14:08 (modified: 29 Aug 2023, 03:14) | Readers: Program Chairs, Senior Area Chairs, Area Chairs, Reviewers Submitted, Authors, Ethics Reviewers Submitted
Comment:

Thank you once again for your time and consideration. We hope we can address your concerns below.

Q1: Question about Gumbel-softmax results.

Yes. We used a noise level of 20% in MAMuJoCo and 10% in SMAC, as stated in the main paper. As you rightly pointed out, Gumbel-softmax's performance in MAMuJoCo is similar to baselines. In our revised version, we will analyze the similarities and differences between vector quantization and discretization for noise robustness. Consequently, we will also include additional performance experiments with Gumbel-softmax in MAMuJoCo to provide a more comprehensive evaluation.

Q2: Questions about theoretical capacity.

After carefully examining your concerns, we believe we have identified the cause of the discrepancy. In our paper, we assume a uniform distribution for p(e), which is consistent with the assumption made in VQ-VAE [1]. However, in the example you provided, p(e) follows a Dirac distribution. Let us elucidate the derivation in more detail:

$$p(e_i) = \sum_h p(e_i \mid h)\, p(h) = \begin{cases} 1, & i = 1 \\ 0, & i \in \{2, \dots, L\} \end{cases}$$

where $e_1$ is located at the origin ($p(e_1 \mid h) = 1$), and $e_2, \dots, e_L$ are placed infinitely far away ($p(e_i \mid h) = 0$ for $i \in \{2, \dots, L\}$). We hope that this response clarifies your concerns.