An Analysis of OpenAI Five, a Dota 2 AI That Can Team-Fight
First published in the column 论智
Source: OpenAI | Compiled by: Bot

Editor's note: Many players will have seen OpenAI's blog post first thing this morning. After finishing my Dota 2 games last night, I watched the related videos on Reddit and spent a while speculating with teammates about the underlying mechanics, then slept right through the headline. This article re-translates the original post and fills in some key points that have been widely overlooked.

Last year, OpenAI's reinforcement-learning bot drew wide attention by beating the professional player Dendi in a 1v1 mid-lane solo. But Dota 2 is a five-player game, so since then our goal has been to build a team of five neural networks that can defeat a professional team, on a limited hero pool, at The International 8 in August. Today we have OpenAI Five, and it can already beat amateur players in full matches.

OpenAI Five plays a restricted version of Dota 2. It plays only five heroes: Necrophos, Sniper, Viper, Crystal Maiden, and Lich, and because it trains in mirror matches, its opponents are limited to the same five. The "restrictions" are mainly these:

- Limited hero pool (the five above);
- No Observer or Sentry Wards;
- No Roshan;
- No invisibility (consumables and related items: no Smoke, Glimmer Cape, Shadow Blade, Silver Edge, invisibility runes, and so on);
- No summons or illusions (no Manta Style, illusion runes, Helm of the Dominator, and so on);
- No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, or Infused Raindrops;
- Five invulnerable couriers per team (as in Turbo mode);
- No Scan.

These restrictions make OpenAI Five's games somewhat different from normal Dota 2, especially Captains Mode, but overall not far from modes like Random Draft (the strongest region indeed: even NA's Crystal Maiden doesn't buy wards!).

OpenAI Five plays the equivalent of 180 years of human play every day and, like the Go AIs, distills its experience from self-play. It trains on 256 GPUs and 128,000 CPU cores with Proximal Policy Optimization (PPO). Because the five heroes differ in skills and item builds, they use five independent LSTMs; with no human data, each hero learns recognizable strategies from its own experience. The experiments show that, without fundamentally new advances, reinforcement learning with LSTMs can deliver long-term planning at large but achievable scale, which exceeded our expectations.

To benchmark this progress, OpenAI Five will play top players on July 28, with a live broadcast on Twitch.

[Video: OpenAI Five defeats a team of OpenAI employees]

The problem

An AI that surpasses human performance in a game as complex as StarCraft or Dota would be a milestone. Compared with AI's earlier achievements in chess and Go, such games better capture the messiness and continuity of the real world, which means AI systems that can solve them should generalize better. And the real aim lies beyond the games themselves.

Dota 2 is a real-time strategy game between two teams of five, each player controlling one "hero" unit. An AI that wants to play Dota 2 must master the following:

Long time horizons. Dota 2 runs at 30 frames per second and an average game lasts 45 minutes, roughly 80,000 frames per game. Most actions (such as ordering a hero to a location) have small individual effects, but some single actions, like using Town Portal, can change the game strategically. There are also strategies that span the entire game, such as pushing lanes, farming (gold), and ganking. OpenAI Five observes every fourth frame, about 20,000 moves per game, while chess is usually decided within 40 moves and Go within 150, almost all of them strategically meaningful.

Limited vision. In Dota 2 the map itself is dark; vision comes only from heroes and buildings (warding is banned), so the game must be reasoned out from incomplete information while predicting the enemy heroes' progress. Chess and Go are both perfect-information games.

High-dimensional, continuous action space. A hero can take dozens of actions, some targeting a unit, some a point on the ground. For each hero we discretize the continuous space into 170,000 possible actions (not all available every tick; think cooldowns); setting aside the continuous parts, there are on average about 1,000 valid actions per frame. In chess the branching factor is about 35 per node; in Go, about 250.

High-dimensional, continuous observation space. The Dota 2 map is rich: a game has 10 heroes, dozens of buildings, many NPC units, and elements such as runes, trees, and shrines. Our model observes the game state through Valve's Bot API, summarizing everything on the map as 20,000 (mostly floating-point) numbers. By contrast, chess needs only about 70 values (an 8×8 board) and Go about 400 (a 19×19 board).

Dota 2's rules are also extremely complex: the game has been actively developed for more than a decade, with hundreds of thousands of lines of game-logic code. For the AI, that logic takes milliseconds to execute, versus nanoseconds for chess or Go. The game is still updated about every two weeks, constantly changing the environment's semantics.

Our approach

Our algorithm is a large-scale version of the recently introduced PPO. Like last year's 1v1 bot, OpenAI Five learns from self-play: it starts from random parameters and uses no human data. Reinforcement-learning (RL) researchers have generally believed that performing well over long horizons inevitably requires fundamentally new breakthroughs, such as hierarchical reinforcement learning. Our results suggest we should give existing algorithms more credit: at sufficient scale and with a sensible structure, they can excel too.

The agent is trained to maximize future rewards weighted by a discount factor γ. In OpenAI Five's latest training run, we annealed γ from 0.998 to 0.9997, stretching the half-life for valuing future rewards from 46 seconds to five minutes. For a sense of how large a step that is: the longest half-life in the PPO paper was 0.5 seconds, in the Rainbow paper 4.4 seconds, and in the Observe and Look Further paper 46 seconds.

Although the current version of OpenAI Five is weak at last-hitting (roughly the median of Dota players), its prioritization of experience and gold matches professional play. Sacrificing short-term rewards for long-term ones is normal: when teammates group up to push a tower, a player shouldn't be farming alone in lane. This is an encouraging finding, because it means our AI system really is optimizing over the long term.

Model structure

[Diagram: model architecture. Too large to read? Leave a comment.]

Each OpenAI Five neural network contains a single-layer LSTM with 1,024 units (lower left, light purple). Given the current game state (extracted from Valve's Bot API), it computes each action head separately (the bright blue boxes at the bottom: X coordinate, Y coordinate, target unit, and so on), then combines the heads into a full action.

Below is an interactive demonstration of OpenAI Five's observation and action spaces. It treats the entire map as a list of 20,000 numbers and acts by emitting a list of 8 enumeration values. The scene shows Dire attacking the Radiant high ground. With Crystal Maiden selected, the 9×9 grid at her feet shows the positions she can move to, the white target square being at (-300, 0). The larger boxes show where Crystal Nova can be cast, the possible targets being a catapult, creeps, Viper, Lich, Necrophos, and the other Crystal Maiden.

OpenAI Five can react to missing information that correlates with what it does observe. Sniper's first ability, Shrapnel, for instance, deals area damage; any normal player (StarCraft players aside) can see the affected area on screen, but it is not part of OpenAI Five's observations. Even without "seeing" it, whenever the AI walks into a shrapnel zone it hurries back out, because its health keeps dropping.

Exploration

If the AI can learn to plan ahead, the next question is how it explores its environment. As noted above, OpenAI Five plays a restricted Dota 2, but even with much of the complexity removed, there are still hundreds of items and dozens of buildings, spells, unit types, and game mechanics to learn, some of which combine into something stronger. Exploring this combinatorially vast space efficiently is not easy.

OpenAI Five learns by self-play (from random parameters), which supplies its first experience of the environment. To avoid "strategy collapse", self-play is split: 80% of games are against the current self and 20% against past versions. After a few hours of training, strategies such as laning, farming, and mid-game ganking appear. After a few days, the heroes consistently adopt basic human strategies: stealing the enemy's Bounty Runes, farming near their own outer towers, and rotating heroes around the map to press an advantage. On this foundation, further training makes OpenAI Five proficient at high-level strategies like the five-man push.

In March 2017, our first agent beat scripted bots but was helpless against human players. To force exploration in strategy space, during training (and only during training) we randomized unit properties (health, movement speed, starting level, and so on), after which it began beating some players. Later, another test player kept beating it; we added more randomization, the AI got stronger, and that player started losing too. OpenAI Five uses the randomizations we wrote earlier for the 1v1 agent, plus a new "lane assignment" method: at the start of each training game, each hero is randomly "assigned" to a subset of lanes and penalized for straying from them.

Exploration, of course, is guided by rewards. Our reward design for Dota 2 is based on the metrics human players use to judge play: team contribution, deaths, assists, kills, and the like. To keep agents from gaming the reward, we compute the other team's average performance and subtract it from each hero's score. Skill builds, item builds, and courier management are imported from scripts.

Teamwork

Dota 2 is a team game, but there is no explicit communication channel between the neural networks of OpenAI Five's five heroes. Their teamwork is governed by a hyperparameter called "team spirit", ranging from 0 to 1, which weights each hero's reward to determine how much the team's interest matters versus its own farm.

Rapid

The AI is built on Rapid, our reinforcement-learning training system, which can be applied to any Gym environment. We have already used Rapid for many other problems at OpenAI, such as Competitive Self-Play. The training system is split into rollout workers, which each run a copy of the game and an agent collecting experience, and optimizers, which perform synchronous gradient descent across a fleet of GPUs. The rollout workers sync their experience to the optimizers through Redis.

As shown in the figure above, each experiment also includes eval workers, which evaluate the trained agent against reference agents, plus monitoring software such as TensorBoard, Sentry, and Grafana.

During synchronous gradient descent, each GPU computes a gradient on its own batch, and the gradients are then globally averaged. We originally averaged with MPI's allreduce, but now use our own NCCL2 wrappers to overlap GPU computation with network transfer. The figure above shows the latency of synchronizing 58 MB of data (OpenAI Five's parameters) across different numbers of GPUs; it can be almost entirely hidden by the GPU computation running in parallel. We have also built Kubernetes, Azure, and GCP backends for Rapid.

The games

So far, OpenAI Five has put up an impressive record in restricted Dota 2:

- Best OpenAI employee team: 2,500+ MMR (46th percentile)
- Strongest audience members watching the match (including the commentator Blitz): 4,000–6,000 MMR (90th–99th percentile), not a premade team
- Valve employee team: 2,500–4,000 MMR (46th–90th percentile)
- Amateur team: 4,200 MMR (93rd percentile), plays as a team
- Semi-pro team: 5,500 MMR (99th percentile), plays as a team

On April 23, OpenAI Five beat a scripted bot for the first time; on May 15 it went 1–1 against the OpenAI employee team, its first win over human players; on June 6 it broke through the OpenAI, audience, and Valve teams, decisively winning every game. We then played informal scrims with the amateur and semi-pro teams; far from the expected string of defeats, OpenAI Five won two of the first three games against each.

"The teamwork aspect of these AI bots was just overwhelming. It feels like five selfless players that know a good general strategy." (Blitz)

We also observed a few things in OpenAI Five's games:

- It would give up its own safe lane (Radiant bottom, Dire top) to take the enemy's safe lane, leaving the opponent unable to defend. The strategy has appeared in professional matches in recent years; Blitz says he learned it from Team Liquid.
- It drove the tempo, pushing the game from early- into mid-game faster than its opponents, by (1) setting up successful ganks, as shown below, and (2) grouping up to strike before the other side could respond.
- It departed from the current meta in a few areas: early on, the AI gives its supports extra experience and gold so they can deal full damage during their power spike, widening the advantage, winning team fights, and converting enemy mistakes into a quick win.

Differences between the AI and humans

OpenAI Five sees the same information as human players: whatever data the game exposes, it sees. A player, by contrast, must check hero positions, health, and inventories manually. Our method is not fundamentally tied to observing game state, but learning from rendered pixels alone would take thousands of GPUs.

As for the much-discussed APM question, OpenAI Five sits at only 150–170 (one action every 4 frames, a theoretical maximum of 450). Note, though, that these ~150 are effective actions, not wandering around or typing taunts, and its average reaction time is 80 ms, faster than a human's. These two differences matter most in 1v1, but in full games we found that human players can keep up with the AI's pace, so the contest is still reasonably fair. In fact, around TI7 last year, several professionals trained repeatedly against our 1v1 AI; according to Blitz, the 1v1 AI changed how people think about 1v1 (it adopted a fast-paced style that everyone has since adapted to).

Surprising findings

Binary rewards can give good performance. The 1v1 model's reward was shaped on many scales, including hero kills, kill streaks, and so on. We ran an experiment in which the agent was rewarded only for winning or losing. As shown above, compared with the usual smooth curve (yellow), the binary run (purple) was slower and somewhat flat in the middle of training, but its final result came close to the yellow line. The experiment used 4,500 CPUs and 16 K80 GPUs, reaching semi-pro level (70 TrueSkill) versus 90 TrueSkill for our 1v1 model.

Creep blocking can be self-taught. In last year's 1v1 model, we trained creep blocking separately, with an extra "creep block" reward. One of our team members left a 2v2 model training while he went on vacation (he proposed to his now-wife!), meaning to see how much longer training would improve performance. To his surprise, the model had concluded on its own, without any special guidance or reward, that creep blocking confers an advantage.

We are still fixing bugs. The yellow-line model above could already beat amateur players, yet after we fixed a handful of bugs it improved markedly. The lesson for us: even when a model already beats strong human players, it may still be hiding serious errors.

What's next

We are now preparing for TI8 in August. We don't know whether the hope of beating a professional team can be realized, but we believe that with hard work we have at least a real shot. The above describes OpenAI Five as of June 6; we will keep publishing updates and compile the follow-up work and results into a full report. On July 28 we will invite top players to play the AI online; be sure to tune in.

PS: Our work revolves around Dota 2 but is not limited to it. We're hiring!

Published 2018-06-26 19:23 · Topics: DotA (game), Artificial Intelligence, Reinforcement Learning
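An editor's check on the numbers: the half-lives quoted for γ above follow directly from the discount factor and the observation rate (one observation every 4th frame at 30 fps, i.e. 7.5 per second):

```python
import math

def reward_half_life(gamma, obs_per_sec=7.5):
    """Seconds until a future reward's weight gamma**n decays to 1/2."""
    steps = math.log(0.5) / math.log(gamma)  # solve gamma**n == 0.5 for n
    return steps / obs_per_sec

print(round(reward_half_life(0.998)))           # 46  (seconds)
print(round(reward_half_life(0.9997) / 60, 1))  # 5.1 (minutes)
```

So annealing γ from 0.998 to 0.9997 really does stretch the reward horizon from about 46 seconds to about five minutes, as the post states.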
Dota 2
August 11, 2017

We’ve created a bot which beats the world’s top professionals at 1v1 matches of Dota 2 under standard tournament rules. The bot learned the game from scratch by self-play, and does not use imitation learning or tree search. This is a step towards building AI systems which accomplish well-defined goals in messy, complicated situations involving real humans.

Today we played Dendi on mainstage at The International, winning a best-of-three match. Over the past week, our bot was undefeated against many top professionals including SumaiL (top 1v1 player in the world) and Arteezy (top overall player in the world).

Dota 1v1 is a complex game with hidden information. Agents must learn to plan, attack, trick, and deceive their opponents. The correlation between player skill and actions-per-minute is not strong, and in fact, our AI’s actions-per-minute are comparable to that of an average human player.

[Video: Learned Bot Behaviors, 1:56]

Success in Dota requires players to develop intuitions about their opponents and plan accordingly.
In the above video you can see that our bot has learned—entirely via self-play—to predict where other players will move, to improvise in response to unfamiliar situations, and how to influence the other player’s allied units to help it succeed.

The full game of Dota is played by two teams of five. Each player chooses from a hundred heroes and hundreds of items. Our next step is to create a team of Dota 2 bots which can compete and collaborate with the top human teams. If you’d like to work on the next phase of the project, consider joining OpenAI.

Authors: OpenAI
Behind ChatGPT's Success, OpenAI Played 45,000 Years of DOTA2 (Tencent News)
By Wang Shu (researcher, Tencent Research Institute)
In 2022, OpenAI's ChatGPT burst onto the scene, and artificial intelligence once again became the focus of global attention. ChatGPT's success rests on the OpenAI team's sustained investment, exploration, and innovation in large language models and reinforcement learning. Less well known is that, as OpenAI iterated and improved, video games played a pivotal role. The early OpenAI built a game AI called OpenAI Five, which in 2019 defeated OG, two-time world champions of The International, DOTA2's premier tournament (congratulations, OG!).

A few days ago, researchers from Stanford University and Google built a 2D virtual game world named Smallville and placed 25 ChatGPT-based AI agents in it for training. They found that the 25 agents produced believable simulations of human behavior: the agents not only talked with one another but also interacted with their environment, remembered and recalled what they had done and observed, and made decisions accordingly.[1]

So why did OpenAI choose video games to train and test its AI models, and what do video games really mean for the development of AI?

01 A little-known history: the OpenAI team built a dedicated "game training" platform for AI

Before telling the story of OpenAI and DOTA2, it is worth briefly reviewing this little-known chapter of the history between video games and OpenAI; it may help us better understand the relationship between the two.

Founded in San Francisco in December 2015, OpenAI started as a small, non-profit artificial-intelligence laboratory whose goal was to open AI patents and research results to the public through "free collaboration" with other institutions and researchers. OpenAI drew little attention at its founding. A year later, in December 2016, it released its first product: Universe, an AI testing platform built on video games. Universe was an open-source platform for measuring and training an AI's general intelligence in almost any environment, and its release even predates the first-generation GPT (generative pre-trained transformer) products.

Universe, built with participation from companies such as Microsoft and Nvidia, contained more than 1,000 game training environments, mainly various Flash games, Atari 2600 games, and PC games such as GTA 5. OpenAI researchers said Universe was originally inspired by the ImageNet database project founded by Fei-Fei Li and others: they hoped to carry ImageNet's success in reducing image-recognition error rates over into research on general artificial intelligence and make substantive progress.[2]

Figure 1: The OpenAI Universe platform

For OpenAI, the ultimate goal of building Universe was to train a "general artificial intelligence" that could flexibly and quickly apply the experience accumulated and mastered in its training environments to unfamiliar, difficult ones.

AI at the time had already made breakthroughs in perceptual intelligence (hearing, speech, and vision), and the reinforcement-learning-based AlphaGo had just beaten the human world champion at Go. In the OpenAI team's view, however, these breakthroughs still fell within the scope of narrow AI and showed no real ability to understand and solve problems.[2]

The OpenAI team believed that for AI to acquire that ability, it had to be trained in broader and more complex environments; only continual training would let it develop knowledge and problem-solving strategies that transfer and can be reused, and video games were the perfect choice for that "training environment".[3]

02 The best sparring partner: what did OpenAI learn from DOTA2?

In fact, as early as The International 2017, OpenAI's agent could already beat top human professionals in 1v1; at The International 2018 it made its name, going up against a professional team of human players; and in April 2019 OpenAI announced that its agent project OpenAI Five could defeat DOTA2 world champions OG in 5v5, making it the first AI system to beat a world-champion team in an esport.

Why did the OpenAI team choose DOTA2 as its training environment? Before developing OpenAI Five, the team had long been exploring how to achieve a breakthrough in deep reinforcement learning and raise agents' efficiency in creative ways. At the time, reinforcement-learning (RL) researchers generally believed that making agents perform well over long horizons would require fundamentally new breakthroughs, such as hierarchical reinforcement learning: decomposing a complex problem into sub-problems and, by divide and conquer, solving the sub-problems one by one until the complex problem is solved.[4]
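The sub-problem decomposition described above can be made concrete with a toy sketch (entirely illustrative, not OpenAI's code): a high-level policy splits a long-horizon task into short subgoals, and a low-level policy executes primitive steps toward each subgoal.

```python
def low_level_steps(pos, subgoal):
    """Low-level policy: primitive actions, one unit of movement at a time."""
    steps = []
    while pos != subgoal:
        pos += 1 if subgoal > pos else -1
        steps.append(pos)
    return steps

def hierarchical_run(start, goal, segment=5):
    """High-level policy: divide the long-horizon task into short subgoals."""
    pos, trajectory = start, []
    while pos != goal:
        subgoal = min(pos + segment, goal)   # the next sub-problem
        trajectory += low_level_steps(pos, subgoal)
        pos = subgoal
    return trajectory

print(hierarchical_run(0, 12))  # 12 primitive steps, via subgoals 5, 10, 12
```

Each level then only has to reason over a short horizon, which is exactly the appeal of the divide-and-conquer approach; OpenAI's result, discussed below, was that such decomposition turned out not to be necessary at sufficient scale.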
DOTA2, with its complex rules, many interacting elements, ever-changing environment, and enormous global popularity, naturally became OpenAI's first choice. As the OpenAI team put it: "Compared with standard RL development environments, DOTA2 is more interesting and also more difficult. But if an AI can exceed human performance in a game as complex as DOTA, that AI is itself a milestone." Relative to AI's earlier achievements in chess and Go, complex games like DOTA2 better capture the messiness and continuity of the real world, so the AI trained on them can be more general and more plausibly applied to human society beyond games.

To defeat professional human DOTA2 teams, the OpenAI team worked for years, dissecting the game's many complex rules and problems in detail and continually tuning and optimizing the AI model accordingly.

DOTA2's content is extremely rich, and it has a "fog of war": units and buildings can see only the area around them, while the rest of the map is covered in fog, so the AI must make inferences from incomplete information; chess and Go, by contrast, are full-information games. The figure below is an interactive demonstration of the observation and action spaces OpenAI Five uses: it treats the entire map as a list of 20,000 numbers and acts by emitting a list of 8 enumeration values[5]:

Figure 2: OpenAI Five's decision process
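To make that interface concrete, here is a heavily simplified sketch (all field names are illustrative, not Valve's actual Bot API): the game state is flattened into one long list of floats for the policy, and an action is just a short list of enumeration values.

```python
# Toy game state; the real observation is ~20,000 mostly floating-point values.
state = {
    "hero":   {"x": -300.0, "y": 0.0, "health": 560.0, "mana": 291.0},
    "creeps": [{"x": -120.0, "y": 40.0, "health": 300.0}],
}

def flatten(state):
    """Flatten nested game state into the flat float list the policy consumes."""
    obs = [state["hero"][k] for k in ("x", "y", "health", "mana")]
    for creep in state["creeps"]:
        obs += [creep[k] for k in ("x", "y", "health")]
    return obs

# An action as a list of enumeration values (the real system emits 8 values,
# e.g. which action, delay, grid X offset, grid Y offset, target unit, ...).
action = [3, 0, -300, 0, 1, 0, 0, 0]

print(flatten(state))  # [-300.0, 0.0, 560.0, 291.0, -120.0, 40.0, 300.0]
```

The point of the flat representation is that one fixed-size numeric vector in, one short enumeration list out, is all the neural network ever sees of the game.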
By the time OpenAI beat TI champions OG, the OpenAI Five team was using 8 times the training compute of the 2018 version and had played roughly 45,000 years of DOTA2 over 10 real-time months, a daily volume equivalent to about 250 years of human play.[6]

Summing up after defeating the human world champions, the OpenAI team said the most important lesson from the DOTA2 training environment was this: improving an agent's performance fundamentally requires not a breakthrough in training methods but ever-greater scale. Given enough scale and a reasonable structure, AI can still display formidable capability. As OpenAI's chief scientist Ilya Sutskever put it: "We firmly believed that bigger is better, and OpenAI's goal was to scale up."[7]

OpenAI's experiments on DOTA2 pointed the way to more efficient AI reinforcement learning, and all of it became nourishment for ChatGPT. In Ilya Sutskever's view, "through the DOTA2 training, OpenAI's learning paradigm shifted from reinforcement learning to reinforcement learning from human feedback (RLHF), and it was the combination of this with GPT technology that ultimately shaped ChatGPT."[8]

03 The road ahead: AIGC empowers game content production

Today, as AIGC technology represented by ChatGPT matures, AI has begun to flow back into game content production. Can ChatGPT really assist game development? First, let's hear ChatGPT's own answer:

GPT-4 can assist game production in several ways. Here are some concrete scenarios[9]:

1. Scripts and story writing: GPT-4 can generate plotlines, character designs, and dialogue, providing creative inspiration for game production.

2. Quest and level design: GPT-4 can generate quests, level designs, and puzzles for a given game genre and theme, helping designers produce game content more efficiently.

3. AI NPCs: GPT-4 can generate natural dialogue and reactions, making non-player characters (NPCs) in games more believable and engaging.

4. Automatic generation of game documents: GPT-4 can help production teams write design documents, technical documents, and test cases, raising productivity.

5. Game localization: GPT-4 supports many languages and can assist with translation and localization, helping games better fit markets in different regions.

6. Community management and customer support: GPT-4 can help answer players' questions, resolve technical issues, and classify and summarize player feedback, improving the efficiency and quality of customer support.

7. In-game text and asset generation: GPT-4 can help generate in-game text such as item descriptions and quest instructions, easing the burden on developers.

8. Testing and quality assurance: GPT-4 can help analyze bug reports, identify likely sources of problems, and offer the development team suggested fixes.

9. Creative concepts and art design: although GPT-4 deals mainly in text, it can offer art teams creative concepts and design suggestions, helping create a distinctive visual style.

Through collaboration with game production teams, GPT-4 promises to raise the efficiency, inventiveness, and quality of game development. Of course, GPT-4 is still an AI assistant; it cannot fully replace human creativity and expertise. In practice, developers need to use GPT-4's capabilities judiciously, according to their own needs and the characteristics of their projects.

Reading GPT-4's answer, one can't help asking: can it actually deliver all this? For now, ChatGPT can indeed build some simple digital games, but for a AAA blockbuster it is clearly out of its depth.

A new report from Digital Trends details how the popular conversation and script generator was used to create a playable PC game. With just a few simple prompts from the user, the report notes, ChatGPT invented its own math-based logic puzzle game. Not only that, it generated some simple game code that later became an addictive free browser game, which has caused something of a stir online. But when asked whether it could produce a AAA title like The Last of Us, ChatGPT was clearly at a loss, able only to sketch out bits of story and unable to generate code for the game.[10]

Figure 3: A simple digital game generated automatically by ChatGPT

Designing a game's complex rules and writing its code will, in the short term, still depend on humans, but AI of the ChatGPT kind can already help game developers generate dialogue, scripts, and other digital assets, raising their productivity, letting them populate virtual game spaces with ease, and shortening production cycles.

As AIGC technology develops, game AI agents (decision-making intelligence) will keep iterating as well. The Stanford and Google agents mentioned at the start can already make simple decisions on top of a large model, and the combination of generative intelligence (AIGC) and decision-making intelligence will open the door to general artificial intelligence.

Predictably, AI and games will be bound ever more tightly together. Already, more and more people recognize the symbiosis between games and artificial intelligence. The March 25, 2023 issue of The Economist argued that games hold an important place in 21st-century global popular culture and international competition. In the same series, The Economist also held that the revolution and spread of AI technology will drive "the rise of user-made games": "the development of AI will allow developers to create interactive 3D models with simple text or voice commands", greatly lowering the barrier to game production. Omdia's 2023 Trends to Watch report likewise listed "GamesTech" among the technology trends most worth following and judged that game AI would be the hottest technical topic in game development in 2023.[11] In addition, in the research report on the capabilities and value of games technology (《游戏科技能力与科技价值研究报告》) produced by the Game Publishing Committee of the China Audio-video and Digital Publishing Association, the China Game Industry Research Institute, and several partner organizations, survey data from the games, telecommunications, and hardware-manufacturing industries showed that 81% of respondents agreed that games have promoted the development of AI technology.

After OpenAI Five, several technology companies, including Sony and Tencent, began training AI agents on games. The former devised new AI reinforcement-learning algorithms on Gran Turismo, with results that made the cover of Nature; the latter built the open AI research platform Kaiwu (开悟) on Honor of Kings, helping construct an industry-academia-research ecosystem.

To return to where we began: OpenAI's original intention in training AI on games was to build "general artificial intelligence". For the development of AGI, large language models such as ChatGPT have given us a glimpse of its future, while decision-making intelligence exemplified by game AI, and the superb AI training grounds that games provide, are accelerating AI's march toward generality.

We look forward to AI and games advancing hand in hand on the road to "general artificial intelligence", bringing more to look forward to in the development of human society.

Thanks to danierdeng of Tencent AI Lab, and to Tian Xiaojun, Hu Xuan, and other colleagues at Tencent Research Institute, for their support and help during the writing of this article.
References:
[1]Source:https://www.businessinsider.com/ai-avatars-let-loose-in-virtual-town-display-beginnings-agi-2023-4
[2]Source:https://openai.com/research/universe
[3]Source:https://openai.com/research/universe
[4]Source:https://openai.com/research/openai-five
[5]Source:https://openai.com/research/openai-five
[6]Source:https://openai.com/research/openai-five-defeats-dota-2-world-champions
[7]Source:https://blogs.nvidia.com/blog/2023/03/22/sutskever-openai-gtc/
[8]Source:https://www.youtube.com/watch?v=goOa0biX6Tc
[9] Compiled by the author from GPT-4's answers.
[10]Source:https://www.digitaltrends.com/gaming/sumplete-chatgpt-ai-game-design-ethics/
[11]Source:https://omdia.tech.informa.com/OM027558/2023-Trends-to-Watch-Games-Tech
OpenAI Five
June 25, 2018

Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2. While today we play with restrictions, we aim to beat a team of top professionals at The International in August subject only to a limited set of heroes. We may not succeed: Dota 2 is one of the most popular and complex esports games in the world, with creative and motivated professionals who train year-round to earn part of Dota’s annual $40M prize pool (the largest of any esports game).

OpenAI Five plays 180 years worth of games against itself every day, learning via self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores—a larger-scale version of the system we built to play the much-simpler solo variant of the game last year. Using a separate LSTM for each hero and no human data, it learns recognizable strategies. This indicates that reinforcement learning can yield long-term planning with large but achievable scale—without fundamental advances, contrary to our own expectations upon starting the project.

To benchmark our progress, we’ll host a match versus top players on August 5th.
Follow us on Twitch to view the live broadcast, or request an invite to attend in person!

The problem

One AI milestone is to exceed human capabilities in a complex video game like StarCraft or Dota. Relative to previous AI milestones like Chess or Go, complex video games start to capture the messiness and continuous nature of the real world. The hope is that systems which solve complex video games will be highly general, with applications outside of games.

Dota 2 is a real-time strategy game played between two teams of five players, with each player controlling a character called a “hero”. A Dota-playing AI must master the following:

Long time horizons. Dota games run at 30 frames per second for an average of 45 minutes, resulting in 80,000 ticks per game. Most actions (like ordering a hero to move to a location) have minor impact individually, but some individual actions like town portal usage can affect the game strategically; some strategies can play out over an entire game. OpenAI Five observes every fourth frame, yielding 20,000 moves. Chess usually ends before 40 moves, Go before 150 moves, with almost every move being strategic.

Partially-observed state. Units and buildings can only see the area around them. The rest of the map is covered in a fog hiding enemies and their strategies. Strong play requires making inferences based on incomplete data, as well as modeling what one’s opponent might be up to. Both chess and Go are full-information games.

High-dimensional, continuous action space. In Dota, each hero can take dozens of actions, and many actions target either another unit or a position on the ground. We discretize the space into 170,000 possible actions per hero (not all valid each tick, such as using a spell on cooldown); not counting the continuous parts, there are an average of ~1,000 valid actions each tick. The average number of actions in chess is 35; in Go, 250.

High-dimensional, continuous observation space.
Dota is played on a large continuous map containing ten heroes, dozens of buildings, dozens of NPC units, and a long tail of game features such as runes, trees, and wards. Our model observes the state of a Dota game via Valve’s Bot API as 20,000 (mostly floating-point) numbers representing all information a human is allowed to access. A chess board is naturally represented as about 70 enumeration values (an 8x8 board of 6 piece types and minor historical info); a Go board as about 400 enumeration values (a 19x19 board of 2 piece types plus Ko).

The Dota rules are also very complex — the game has been actively developed for over a decade, with game logic implemented in hundreds of thousands of lines of code. This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines. The game also gets an update about once every two weeks, constantly changing the environment semantics.

Our approach

Our system learns using a massively-scaled version of Proximal Policy Optimization. Both OpenAI Five and our earlier 1v1 bot learn entirely from self-play. They start with random parameters and do not use search or bootstrap from human replays.

|                                     | OpenAI 1v1 bot            | OpenAI Five                          |
|-------------------------------------|---------------------------|--------------------------------------|
| CPUs                                | 60,000 CPU cores on Azure | 128,000 preemptible CPU cores on GCP |
| GPUs                                | 256 K80 GPUs on Azure     | 256 P100 GPUs on GCP                 |
| Experience collected                | ~300 years per day        | ~180 years per day (~900 years per day counting each hero separately) |
| Size of observation                 | ~3.3 kB                   | ~36.8 kB                             |
| Observations per second of gameplay | 10                        | 7.5                                  |
| Batch size                          | 8,388,608 observations    | 1,048,576 observations               |
| Batches per minute                  | ~20                       | ~60                                  |

RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning.
Our results suggest that we haven’t been giving today’s algorithms enough credit — at least when they’re run at sufficient scale and with a reasonable way of exploring.

Our agent is trained to maximize the exponentially decayed sum of future rewards, weighted by an exponential decay factor called γ. During the latest training run of OpenAI Five, we annealed γ from 0.998 (valuing future rewards with a half-life of 46 seconds) to 0.9997 (valuing future rewards with a half-life of five minutes). For comparison, the longest horizon in the PPO paper was a half-life of 0.5 seconds, the longest in the Rainbow paper was a half-life of 4.4 seconds, and the Observe and Look Further paper used a half-life of 46 seconds.

While the current version of OpenAI Five is weak at last-hitting (observing our test matches, the professional Dota commentator Blitz estimated it around median for Dota players), its objective prioritization matches a common professional strategy. Gaining long-term rewards such as strategic map control often requires sacrificing short-term rewards such as gold gained from farming, since grouping up to attack towers takes time. This observation reinforces our belief that the system is truly optimizing over a long horizon.

[Video: OpenAI Five: Dota Gameplay, 4:20]

Model structure

Each of OpenAI Five’s networks contains a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve’s Bot API) and emits actions through several possible action heads. Each head has semantic meaning, for example, the number of ticks to delay this action, which action to select, the X or Y coordinate of this action in a grid around the unit, etc. Action heads are computed independently.

Interactive demonstration of the observation space and action space used by OpenAI Five. OpenAI Five views the world as a list of 20,000 numbers, and takes an action by emitting a list of 8 enumeration values.
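The independent action heads described above can be sketched schematically (head names follow the post; the enumeration sizes here are invented for illustration):

```python
import random

# Each head is an independent categorical choice over a small enumeration.
# Head names follow the post; the sizes are illustrative, not OpenAI's.
ACTION_HEADS = {
    "action_type": 30,   # which ability/move/attack to issue
    "delay_ticks": 4,    # how many ticks to delay this action
    "grid_x":      9,    # X offset in a grid around the unit
    "grid_y":      9,    # Y offset in a grid around the unit
    "target_unit": 16,   # which visible unit to target
}

def sample_action(seed=0):
    """Sample one value per head independently, mirroring how the heads are
    computed independently from the shared LSTM state."""
    rng = random.Random(seed)
    return {head: rng.randrange(n) for head, n in ACTION_HEADS.items()}

print(sample_action())
```

In the real policy each head would be a softmax over its enumeration conditioned on the LSTM output rather than a uniform draw, but the factored structure, one small categorical choice per head, is the point.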
Select different actions and targets to understand how OpenAI Five encodes each action, and how it observes the world. The image shows the scene as a human would see it.

OpenAI Five can react to missing pieces of state that correlate with what it does see. For example, until recently OpenAI Five’s observations did not include shrapnel zones (areas where projectiles rain down on enemies), which humans see on screen. However, we observed OpenAI Five learning to walk out of (though not avoid entering) active shrapnel zones, since it could see its health decreasing.

Exploration

Given a learning algorithm capable of handling long horizons, we still need to explore the environment. Even with our restrictions, there are hundreds of items, dozens of buildings, spells, and unit types, and a long tail of game mechanics to learn about—many of which yield powerful combinations. It’s not easy to explore this combinatorially-vast space efficiently.

OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves. In the first games, the heroes walk aimlessly around the map. After several hours of training, concepts such as laning, farming, or fighting over mid emerge. After several days, they consistently adopt basic human strategies: attempt to steal Bounty runes from their opponents, walk to their tier one towers to farm, and rotate heroes around the map to gain lane advantage. And with further training, they become proficient at high-level strategies like 5-hero push.

In March 2017, our first agent defeated bots but got confused against humans. To force exploration in strategy space, during training (and only during training) we randomized the properties (health, speed, start level, etc.) of the units, and it began beating humans.
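The property randomization described above is a form of domain randomization: sampling fresh unit attributes per training game so the policy cannot overfit to one fixed configuration. A minimal sketch (the ranges are invented for illustration, not OpenAI's values):

```python
import random

def randomized_unit_properties(rng):
    """Sample unit properties at the start of a training game. Wider ranges
    force the policy to explore more of strategy space (ranges are invented)."""
    return {
        "health_scale": rng.uniform(0.5, 1.5),
        "move_speed_scale": rng.uniform(0.8, 1.2),
        "start_level": rng.randint(1, 6),
    }

rng = random.Random(42)
print(randomized_unit_properties(rng))
```

Increasing the training randomization then simply means widening these sampling ranges, which is what the next paragraph describes doing when a test player kept winning.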
Later on, when a test player was consistently beating our 1v1 bot, we increased our training randomizations and the test player started to lose. (Our robotics team concurrently applied similar randomization techniques to physical robots to transfer from simulation to the real world.)

OpenAI Five uses the randomizations we wrote for our 1v1 bot. It also uses a new “lane assignment” one. At the beginning of each training game, we randomly “assign” each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game.

Exploration is also helped by a good reward. Our reward consists mostly of metrics humans track to decide how they’re doing in the game: net worth, kills, deaths, assists, last hits, and the like. We postprocess each agent’s reward by subtracting the other team’s average reward to prevent the agents from finding positive-sum situations.

We hardcode item and skill builds (originally written for our scripted baseline), and choose which of the builds to use at random. Courier management is also imported from the scripted baseline.

Coordination

OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.

Rapid

Our system is implemented as a general-purpose RL training system called Rapid, which can be applied to any Gym environment. We’ve used Rapid to solve other problems at OpenAI, including Competitive Self-Play.

The training system is separated into rollout workers, which run a copy of the game and an agent gathering experience, and optimizer nodes, which perform synchronous gradient descent across a fleet of GPUs.
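A toy stand-in for the synchronous update just described (pure Python on a one-parameter model; the real system shards batches across GPU optimizer nodes and averages gradients with NCCL2 allreduce):

```python
def grad(w, batch):
    """Gradient of mean squared error 0.5*(w*x - y)**2 on one worker's shard."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sync_sgd_step(w, shards, lr=0.1):
    """Each worker computes a gradient on its own shard; the gradients are
    globally averaged (the allreduce step) before one synchronized update."""
    grads = [grad(w, shard) for shard in shards]
    avg = sum(grads) / len(grads)   # stand-in for allreduce across GPUs
    return w - lr * avg

# Three "workers", each with a shard of data consistent with w = 2.
shards = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(100):
    w = sync_sgd_step(w, shards)
print(round(w, 3))  # 2.0
```

Because every worker applies the same averaged gradient, all replicas stay bit-identical after each step; the engineering work described next (Redis for experience transport, NCCL2 wrappers for averaging) is about making that averaging fast at 58 MB of parameters.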
The rollout workers sync their experience through Redis to the optimizers. Each experiment also contains workers evaluating the trained agent versus reference agents, as well as monitoring software such as TensorBoard, Sentry, and Grafana.

During synchronous gradient descent, each GPU computes a gradient on its part of the batch, and then the gradients are globally averaged. We originally used MPI’s allreduce for averaging, but now use our own NCCL2 wrappers that parallelize GPU computations and network data transfer.

The latencies for synchronizing 58MB of data (size of OpenAI Five’s parameters) across different numbers of GPUs are shown on the right. The latency is low enough to be largely masked by GPU computation which runs in parallel with it.

We’ve implemented Kubernetes, Azure, and GCP backends for Rapid.

The games

Thus far OpenAI Five has played (with our restrictions) versus each of these teams:

- Best OpenAI employee team: 2.5k MMR (46th percentile)
- Best audience players watching OpenAI employee match (including Blitz, who commentated the first OpenAI employee match): 4–6k MMR (90th–99th percentile), though they’d never played as a team
- Valve employee team: 2.5–4k MMR (46th–90th percentile)
- Amateur team: 4.2k MMR (93rd percentile), trains as a team
- Semi-pro team: 5.5k MMR (99th percentile), trains as a team

The April 23rd version of OpenAI Five was the first to beat our scripted baseline. The May 15th version of OpenAI Five was evenly matched versus team 1, winning one game and losing another. The June 6th version of OpenAI Five decisively won all its games versus teams 1–3. We set up informal scrims with teams 4 & 5, expecting to lose soundly, but OpenAI Five won two of its first three games versus both.

“The teamwork aspect of the bot was just overwhelming.
It feels like five selfless players that know a good general strategy.” (Blitz)

We observed that OpenAI Five:

- Repeatedly sacrificed its own safe lane (top lane for dire; bottom lane for radiant) in exchange for controlling the enemy’s safe lane, forcing the fight onto the side that is harder for their opponent to defend. This strategy emerged in the professional scene in the last few years, and is now considered to be the prevailing tactic. Blitz commented that he only learned this after eight years of play, when Team Liquid told him about it.
- Pushed the transitions from early- to mid-game faster than its opponents. It did this by: (1) setting up successful ganks (when players move around the map to ambush an enemy hero—see animation) when players overextended in their lane, and (2) by grouping up to take towers before the opponents could organize a counterplay.
- Deviated from current playstyle in a few areas, such as giving support heroes (which usually do not take priority for resources) lots of early experience and gold. OpenAI Five’s prioritization allows for its damage to peak sooner and push its advantage harder, winning team fights and capitalizing on mistakes to ensure a fast win.

[Photo: Trophies awarded after the match between the best players at OpenAI and our bot team. One trophy for the humans, one for the bots (represented by Susan Zhang from our team!)]

Differences versus humans

OpenAI Five is given access to the same information as humans, but instantly sees data like positions, healths, and item inventories that humans have to check manually. Our method isn’t fundamentally tied to observing state, but just rendering pixels from the game would require thousands of GPUs.

OpenAI Five averages around 150-170 actions per minute (and has a theoretical maximum of 450 due to observing every 4th frame). Frame-perfect timing, while possible for skilled players, is trivial for OpenAI Five.
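The 450-APM ceiling quoted above is just the observation rate expressed per minute:

```python
FPS = 30             # Dota 2 runs at 30 frames per second
FRAMES_PER_OBS = 4   # OpenAI Five acts on every 4th frame

actions_per_second = FPS / FRAMES_PER_OBS   # 7.5: one action per observation
max_apm = actions_per_second * 60           # theoretical actions per minute
print(max_apm)  # 450.0
```

The observed 150–170 APM is well under this ceiling, which is why the post argues the advantage lies less in raw action rate than in reaction time and precision.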
OpenAI Five has an average reaction time of 80ms, which is faster than humans.

These differences matter most in 1v1 (where our bot had a reaction time of 67ms), but the playing field is relatively equitable as we’ve seen humans learn from and adapt to the bot. Dozens of professionals used our 1v1 bot for training in the months after last year’s TI. According to Blitz, the 1v1 bot has changed the way people think about 1v1s (the bot adopted a fast-paced playstyle, and everyone has now adapted to keep up).

Surprising findings

Binary rewards can give good performance. Our 1v1 model had a shaped reward, including rewards for last hits, kills, and the like. We ran an experiment where we only rewarded the agent for winning or losing, and it trained an order of magnitude slower and somewhat plateaued in the middle, in contrast to the smooth learning curves we usually see. The experiment ran on 4,500 cores and 16 K80 GPUs, training to the level of semi-pros (70 TrueSkill) rather than the 90 TrueSkill of our best 1v1 bot.

Creep blocking can be learned from scratch. For 1v1, we learned creep blocking using traditional RL with a “creep block” reward. One of our team members left a 2v2 model training when he went on vacation (proposing to his now wife!), intending to see how much longer training would boost performance. To his surprise, the model had learned to creep block without any special guidance or reward.

We’re still fixing bugs. The chart shows a training run of the code that defeated amateur players, compared to a version where we simply fixed a number of bugs, such as rare crashes during training, or a bug which resulted in a large negative reward for reaching level 25. It turns out it’s possible to beat good humans while still hiding serious bugs!

[Photo: A subset of the OpenAI Dota team, holding the laptop that defeated the world’s top professionals at Dota 1v1 at The International last year.]

What’s next

Our team is focused on making our August goal.
We don’t know if it will be achievable, but we believe that with hard work (and some luck) we have a real shot.

This post described a snapshot of our system as of June 6th. We’ll release updates along the way to surpassing human performance and write a report on our final system once we complete the project. Please join us on August 5th virtually or in person, when we’ll play a team of top players!

Our underlying motivation reaches beyond Dota. Real-world AI deployments will need to deal with the challenges raised by Dota which are not reflected in Chess, Go, Atari games, or Mujoco benchmark tasks. Ultimately, we will measure the success of our Dota system in its application to real-world tasks. If you’d like to be part of what comes next, we’re hiring!

Authors: Greg Brockman, Christy Dennison, Susan Zhang, Jakub Pachocki, Michael Petrov, Henrique Pondé, Przemysław Dębiak, David Farhi, Filip Wolski, Jonathan Raiman, Jie Tang, Szymon Sidor, Brooke Chan

Contributors: Quirin Fischer, Christopher Hesse, Shariq Hashme, Ilya Sutskever, Alec Radford, Scott Gray, Jack Clark, Paul Christiano, David Luan, Christopher Berner, Eric Sigler, Jonas Schneider, Larissa Schiavo, Diane Yoon, John Schulman

Current set of restrictions:

- Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich
- No warding
- No Roshan
- No invisibility (consumables and relevant items)
- No summons/illusions
- No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, Infused Raindrop
- 5 invulnerable couriers, no exploiting them by scouting or tanking
- No Scan

The hero set restriction makes the game very different from how Dota is played at world-elite level (i.e. Captains Mode drafting from all 100+ heroes). However, the difference from regular “public” games (All Pick / Random Draft) is smaller. Most of the restrictions come from remaining aspects of the game we haven’t integrated yet. Some restrictions, in particular wards and Roshan, are central components of professional-level play.
We’re working to add these as soon as possible.

Draft feedback

Thanks to the following for feedback on drafts of this post: Alexander Lavin, Andrew Gibiansky, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, David Dohan, David Ha, Denny Britz, Erich Elsen, James Bradbury, John Miller, Luke Metz, Maddie Hall, Miles Brundage, Nelson Elhage, Ofir Nachum, Pieter Abbeel, Rumen Hristov, Shubho Sengupta, Solomon Boulos, Stephen Merity, Tom Brown, Zak Stone