Reproducing the 2015 DQN Nature Paper: Atari Pong from 84x84 Pixel Input (with PyTorch Code)

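When DeepMind published the DQN algorithm in Nature in 2015, building on its 2013 workshop preprint, it shook the reinforcement learning field. The work showed that a single deep reinforcement learning agent, with one architecture and one set of hyperparameters, could reach performance comparable to a human tester on many of the 49 Atari 2600 games it was evaluated on. This article walks through a from-scratch reproduction of that milestone in modern PyTorch, focusing on the implementation details for Atari Pong.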

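1. Environment Setup and Preprocessing

Before building the DQN, we need a suitable training environment. Raw Atari frames are 210×160-pixel RGB images, which are expensive to process directly. Following the original paper, we preprocess them as follows: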

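```python
from collections import deque

import cv2  # opencv-python, used only for resizing
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


class AtariPreprocessor:
    """Frame skipping, cropping, grayscale conversion, resizing and frame stacking.

    Assumes the classic Gym API (gym < 0.26): reset() returns only the
    observation and step() returns (obs, reward, done, info).
    """

    def __init__(self, env_name, frame_skip=4, history_length=4):
        self.env = gym.make(env_name)
        self.frame_skip = frame_skip
        self.history = deque(maxlen=history_length)

    def reset(self):
        frame = self.env.reset()
        processed = self._process_frame(frame)
        # Fill the whole history with copies of the first frame
        for _ in range(self.history.maxlen):
            self.history.append(processed)
        return np.stack(self.history)

    def step(self, action):
        total_reward = 0.0
        # Repeat the chosen action and accumulate the reward
        for _ in range(self.frame_skip):
            frame, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        processed = self._process_frame(frame)
        self.history.append(processed)
        return np.stack(self.history), total_reward, done, info

    def _process_frame(self, frame):
        frame = frame.mean(axis=2).astype(np.float32)  # RGB -> grayscale
        frame = frame[34:34 + 160, :160]               # crop away the score area
        # Resize to the 84x84 input that the network's 7*7*64 layer expects
        frame = cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)
        # Keep raw 0-255 values as uint8; DQN.forward() does the /255 scaling
        return frame.astype(np.uint8)
```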

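The key preprocessing steps are:

  • Frame stacking: stack the 4 most recent frames as the network input to provide temporal information
  • Frame skipping: choose an action once every 4 frames and repeat it in between, which speeds up training
  • Cropping: remove irrelevant screen regions such as the score display
  • Grayscale conversion: reduce the three RGB channels to a single channel
  • Normalization: scale pixel values to the [0, 1] range, exactly once in the pipeline

Note: the original paper uses an 84×84 resolution, and the network in Section 2 assumes it (its first fully connected layer expects 7×7×64 features). 80×80 is also a common choice in practice, but the last convolution then outputs 6×6×64, so that layer's input size must be changed to match. Use the same resolution at test time as in training.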
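To make the frame-stacking step concrete, here is a minimal sketch that uses synthetic frames instead of a real emulator. The helper name `push_frame` and the constant-valued frames are illustrative assumptions, not part of the preprocessing class above.

```python
from collections import deque

import numpy as np

HISTORY_LENGTH = 4  # number of stacked frames, as in the text


def push_frame(history, frame):
    """Append a frame and return the stacked observation (hypothetical helper)."""
    history.append(frame)
    return np.stack(history)


# On reset, the history is filled with copies of the first frame
history = deque(maxlen=HISTORY_LENGTH)
first = np.zeros((84, 84), dtype=np.uint8)
for _ in range(HISTORY_LENGTH):
    history.append(first)

# Three environment steps; synthetic frame t is filled with the value t
obs = None
for t in range(1, 4):
    obs = push_frame(history, np.full((84, 84), t, dtype=np.uint8))

print(obs.shape)      # (4, 84, 84): channels first, oldest frame first
print(obs[:, 0, 0])   # [0 1 2 3]: one reset frame left, then frames 1..3
```

Because the deque has a fixed maximum length, appending a new frame silently drops the oldest one, so the observation always covers the four most recent frames.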

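2. DQN Network Architecture

At the core of DQN is a deep convolutional neural network whose architecture determines how well it can extract features from pixels. A PyTorch implementation: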

```python
class DQN(nn.Module):
    def __init__(self, action_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(7 * 7 * 64, 512)  # 7x7x64 assumes an 84x84 input
        self.fc2 = nn.Linear(512, action_dim)

    def forward(self, x):
        # Expects raw pixel values in [0, 255]; do not also normalize upstream
        x = x.float() / 255.0
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)  # flatten to (batch, 7*7*64)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)         # one Q-value per action, no activation
```

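The layer parameters, for an 84×84×4 input, are:

| Layer | Parameters | Output size | Activation |
|---|---|---|---|
| Conv layer 1 | 32 filters of 8×8, stride 4 | 20×20×32 | ReLU |
| Conv layer 2 | 64 filters of 4×4, stride 2 | 9×9×64 | ReLU |
| Conv layer 3 | 64 filters of 3×3, stride 1 | 7×7×64 | ReLU |
| Fully connected layer 1 | 512 units | 512 | ReLU |
| Output layer | One unit per action | action_dim | Linear |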
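The output sizes in the table follow from the standard formula for a convolution without padding, floor((size - kernel) / stride) + 1. The short sketch below applies it to the three layers; the function names are illustrative, not part of the article's code.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a convolution with no padding."""
    return (size - kernel) // stride + 1


def dqn_feature_size(input_size):
    """Spatial size after the three conv layers listed in the table."""
    size = input_size
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        size = conv_out(size, kernel, stride)
    return size


for res in (84, 80):
    side = dqn_feature_size(res)
    print(res, "->", side, "x", side, "x 64 =", side * side * 64, "features")
# 84 -> 7 x 7 x 64 = 3136 features
# 80 -> 6 x 6 x 64 = 2304 features
```

This is why an 80×80 input cannot be fed to a network whose first linear layer is sized for 7×7×64 features.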

3. Experience Replay and the Target Network

DQN's two key innovations each need their own implementation:

```python
class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque evicts the oldest transition once it is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, next_states, dones = zip(
            *[self.buffer[idx] for idx in indices]
        )
        return (
            torch.as_tensor(np.array(states), dtype=torch.float32),
            torch.as_tensor(np.array(actions), dtype=torch.int64),
            torch.as_tensor(np.array(rewards), dtype=torch.float32),
            torch.as_tensor(np.array(next_states), dtype=torch.float32),
            torch.as_tensor(np.array(dones), dtype=torch.float32),
        )

    def __len__(self):
        return len(self.buffer)


class DQNAgent:
    def __init__(self, action_dim, lr=1e-4, gamma=0.99, tau=1e-3):
        self.policy_net = DQN(action_dim)
        self.target_net = DQN(action_dim)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.gamma = gamma
        self.tau = tau  # only used when soft=True

    def update_target(self, soft=False):
        if not soft:
            # Hard update: periodic full copy, as in the original paper
            self.target_net.load_state_dict(self.policy_net.state_dict())
            return
        # Soft update: only effective if called after every gradient step
        with torch.no_grad():
            for target_param, policy_param in zip(self.target_net.parameters(),
                                                  self.policy_net.parameters()):
                target_param.copy_(
                    self.tau * policy_param + (1.0 - self.tau) * target_param
                )

    def get_action(self, state, epsilon):
        # Epsilon-greedy: random action with probability epsilon
        if np.random.random() < epsilon:
            return np.random.randint(self.policy_net.fc2.out_features)
        with torch.no_grad():
            q_values = self.policy_net(state.unsqueeze(0))
            return q_values.argmax().item()
```

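What experience replay and the target network do:

  • Experience replay: breaks the correlation between consecutive samples and lets each transition be reused, which improves sample efficiency
  • Target network: keeps the bootstrap target fixed between updates, which stabilizes training (Q-value overestimation is a separate issue, addressed by Double DQN in Section 7)
  • Soft update: an alternative to the periodic full copy used in the original paper; it blends a small fraction τ (typically 0.001) of the online weights into the target network, and is meant to be applied after every gradient step rather than once every 1,000 steps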
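To see why the update frequency matters for soft updates, consider the simplified case where the online weight stays fixed. After k soft updates the target has closed a fraction 1 - (1 - τ)^k of the gap. The sketch below checks this with a scalar "weight"; the function names and the fixed-weight assumption are illustrative only.

```python
def fraction_tracked(tau, num_updates):
    """Fraction of the gap to a fixed online weight closed after num_updates soft updates."""
    return 1.0 - (1.0 - tau) ** num_updates


def simulate(tau, num_updates, online=1.0, target=0.0):
    """Apply target <- tau * online + (1 - tau) * target repeatedly."""
    for _ in range(num_updates):
        target = tau * online + (1.0 - tau) * target
    return target


tau = 1e-3
# One soft update after each of 1,000 gradient steps
print(round(fraction_tracked(tau, 1000), 3))   # 0.632
# A single soft update per 1,000 steps barely moves the target
print(round(fraction_tracked(tau, 1), 3))      # 0.001
```

With τ = 0.001, a soft update that runs only once every 1,000 steps leaves the target network almost frozen, which is why a periodic schedule should use a full copy instead.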

4. The Full Training Loop

We now combine the components above into a complete training system:

```python
def train_dqn(env_name="PongNoFrameskip-v4",
              batch_size=32,
              buffer_size=100000,
              total_steps=1000000,
              learning_starts=10000,
              target_update=1000,
              gamma=0.99,
              epsilon_start=1.0,
              epsilon_end=0.1,
              epsilon_decay=100000):
    env = AtariPreprocessor(env_name)
    agent = DQNAgent(env.env.action_space.n, gamma=gamma)
    buffer = ReplayBuffer(buffer_size)

    state = env.reset()
    episode_reward = 0.0
    total_rewards = []

    for step in range(1, total_steps + 1):
        # Epsilon-greedy exploration with exponential decay
        epsilon = epsilon_end + (epsilon_start - epsilon_end) * \
            np.exp(-1.0 * step / epsilon_decay)

        # Select and execute an action
        action = agent.get_action(torch.as_tensor(state, dtype=torch.float32), epsilon)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward

        # Store the transition
        buffer.push(state, action, reward, next_state, done)
        state = next_state

        # One gradient update every 4 environment steps
        if len(buffer) >= learning_starts and step % 4 == 0:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)

            # Q(s, a) for the actions actually taken
            current_q = agent.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

            # TD target; (1 - dones) removes the bootstrap term at episode end
            with torch.no_grad():
                next_q = agent.target_net(next_states).max(1)[0]
                target_q = rewards + (1 - dones) * gamma * next_q

            loss = nn.functional.mse_loss(current_q, target_q)
            agent.optimizer.zero_grad()
            loss.backward()
            agent.optimizer.step()

        # Periodic hard update: copy the online weights into the target network
        if step % target_update == 0:
            agent.target_net.load_state_dict(agent.policy_net.state_dict())

        # End-of-episode bookkeeping
        if done:
            total_rewards.append(episode_reward)
            print(f"Step: {step}, Reward: {episode_reward}, Epsilon: {epsilon:.2f}")
            state = env.reset()
            episode_reward = 0.0

    return total_rewards
```
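The target computation in the loop is easier to follow on a tiny batch. The following sketch reproduces it in NumPy with made-up rewards and target-network outputs; none of these numbers come from a real run.

```python
import numpy as np

gamma = 0.99
rewards = np.array([0.0, 1.0, -1.0], dtype=np.float32)
dones = np.array([0.0, 0.0, 1.0], dtype=np.float32)   # last transition ends the episode

# Made-up target-network outputs for 3 next states and 2 actions
next_q_values = np.array([[0.5, 2.0],
                          [1.0, 0.0],
                          [3.0, 4.0]], dtype=np.float32)

next_q = next_q_values.max(axis=1)                     # best next-state value per row
target_q = rewards + (1.0 - dones) * gamma * next_q    # no bootstrap when done == 1
print(target_q)   # approximately [1.98, 1.99, -1.0]
```

The third transition shows the role of the `dones` mask: the large next-state value of 4.0 is ignored because the episode has ended, so the target is just the reward.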

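Key hyperparameters used during training:

| Parameter | Recommended value | Role |
|---|---|---|
| batch_size | 32 | Number of transitions per gradient update |
| buffer_size | 100,000 | Capacity of the replay buffer |
| learning_starts | 10,000 | Transitions collected before gradient updates begin |
| target_update | 1,000 | Environment steps between target-network updates |
| gamma | 0.99 | Discount factor for future rewards |
| epsilon_start | 1.0 | Initial exploration rate |
| epsilon_end | 0.1 | Final exploration rate |
| epsilon_decay | 100,000 | Decay scale of the exploration rate, in steps (time constant of the exponential schedule) |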
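Since the schedule in the training loop is exponential, `epsilon_decay` is a time constant rather than the number of steps needed to reach `epsilon_end`. A small sketch using the table's values makes this visible; the function name `epsilon_at` is an illustrative assumption.

```python
import math


def epsilon_at(step, start=1.0, end=0.1, decay=100_000):
    """Exponentially decayed exploration rate, same formula as the training loop."""
    return end + (start - end) * math.exp(-step / decay)


for step in (0, 100_000, 300_000, 1_000_000):
    print(step, round(epsilon_at(step), 3))
# 0 1.0
# 100000 0.431
# 300000 0.145
# 1000000 0.1
```

After 100,000 steps the agent still takes a random action about 43% of the time; it only gets close to the final rate after several multiples of `epsilon_decay`.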

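5. Training Tricks and Performance Tuning

In practice, the following techniques can noticeably improve training:

Reward clipping: clip positive rewards to +1 and negative rewards to -1, so that the same hyperparameters transfer across games with different score scales

```python
reward = np.clip(reward, -1, 1)
```

Frame max-pooling: take the pixel-wise maximum over two consecutive frames to remove the flicker found in Atari games (this is a maximum, not a frame difference)

```python
frame = np.maximum(frame, last_frame)
```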
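As a small illustration of why the maximum helps, the sketch below builds two synthetic 4×4 frames in which each object is drawn only on alternating frames. The frame contents are made up for the example.

```python
import numpy as np

# Each object is visible on only one of the two consecutive frames
frame_even = np.zeros((4, 4), dtype=np.uint8)
frame_odd = np.zeros((4, 4), dtype=np.uint8)
frame_even[1, 1] = 255   # object A, drawn on even frames only
frame_odd[2, 3] = 255    # object B, drawn on odd frames only

merged = np.maximum(frame_even, frame_odd)
print(int(merged[1, 1]), int(merged[2, 3]))   # 255 255: both objects survive
print(int((merged > 0).sum()))                # 2
```

Either frame alone would hide one of the objects from the network; the merged frame contains both.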

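Gradient clipping: prevents exploding gradients and stabilizes training; apply it between loss.backward() and optimizer.step()

```python
torch.nn.utils.clip_grad_value_(agent.policy_net.parameters(), clip_value=1.0)
```

Learning-rate scheduling: lower the learning rate as training progresses; call scheduler.step() once after each optimizer step, and note that step_size counts those calls (gradient updates), not environment steps

```python
scheduler = optim.lr_scheduler.StepLR(agent.optimizer, step_size=250000, gamma=0.1)
```

On Pong, a typical training curve goes through roughly the following phases:

  1. Random exploration (0-10k steps): the agent moves randomly and loses almost every point, so episode returns sit close to -21
  2. Early learning (10k-100k steps): it starts to learn basic ball-returning behavior
  3. Policy refinement (100k-500k steps): it develops paddle positioning and counter-attack strategies
  4. Stable performance (500k+ steps): it reaches human-level play and wins more than 90% of games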

6. Evaluation and Visualization

After training, we need to evaluate how the agent actually performs:

```python
def evaluate(agent, env, episodes=10, render=True):
    total_rewards = []
    for _ in range(episodes):
        state = env.reset()
        episode_reward = 0.0
        done = False
        while not done:
            # A small epsilon keeps some randomness during evaluation
            action = agent.get_action(torch.as_tensor(state, dtype=torch.float32),
                                      epsilon=0.05)
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            if render:
                # AtariPreprocessor defines no render(); call the wrapped env
                env.env.render()
        total_rewards.append(episode_reward)
    return np.mean(total_rewards)
```

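For Pong, a successful training run should reach the following:

| Metric | Expected value | Notes |
|---|---|---|
| Average return | > 18 | Games are played to 21 points; human-level play |
| Win rate | > 90% | Fraction of games won against the built-in AI |
| Training time | 8-12 hours | On a modern GPU such as an RTX 3080 |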
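Both metrics in the table can be derived from the list of episode returns. In Pong the return is the agent's points minus the opponent's, so a positive return means the game was won. A minimal sketch, with a hypothetical helper name and made-up returns:

```python
def summarize_returns(episode_returns):
    """Mean return and win rate, where return = own points - opponent points."""
    mean_return = sum(episode_returns) / len(episode_returns)
    wins = sum(1 for r in episode_returns if r > 0)   # positive return = win
    return mean_return, wins / len(episode_returns)


# Made-up returns from 5 evaluation episodes
returns = [21.0, 19.0, 17.0, -3.0, 20.0]
mean_return, win_rate = summarize_returns(returns)
print(mean_return, win_rate)   # 14.8 0.8
```

The two numbers measure different things: a few heavy losses can pull the average return below the target even when the win rate is high.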

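7. Directions for Improvement

The original DQN is powerful but leaves room for improvement. Here are a few extensions worth trying:

Double DQN: reduces Q-value overestimation by selecting the next action with the online network and evaluating it with the target network

```python
next_actions = agent.policy_net(next_states).max(1)[1]
# squeeze(1) brings the result back to shape (batch,), matching rewards and dones
next_q = agent.target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
```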
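The difference between the two targets is easiest to see with numbers. The NumPy sketch below uses made-up Q-values for two next states; it illustrates the selection and evaluation split and is not taken from a trained model.

```python
import numpy as np

# Made-up Q-values for 2 next states and 3 actions
q_online = np.array([[1.0, 3.0, 2.0],
                     [0.5, 0.2, 0.9]])
q_target = np.array([[2.5, 1.5, 4.0],
                     [0.1, 0.8, 0.3]])

# Standard DQN: the target network both selects and evaluates the action
dqn_next_q = q_target.max(axis=1)                    # [4.0, 0.8]

# Double DQN: the online network selects, the target network evaluates
next_actions = q_online.argmax(axis=1)               # [1, 2]
ddqn_next_q = q_target[np.arange(len(q_target)), next_actions]   # [1.5, 0.3]

print(dqn_next_q, ddqn_next_q)
```

The Double DQN value can never exceed the standard one for the same target network, because it evaluates a specific action instead of taking the maximum.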

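Prioritized experience replay: replays important transitions more often, using the TD error as the priority signal

```python
td_error = (current_q - target_q).abs()
# alpha is the prioritization exponent; alpha = 0 gives uniform sampling
priority = (td_error + 1e-5).pow(alpha)
```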
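Priorities only matter once they are turned into sampling probabilities, and the resulting bias is usually corrected with importance-sampling weights. The sketch below shows both steps in NumPy under stated assumptions: the function names are hypothetical, and the values alpha=0.6 and beta=0.4 are illustrative choices rather than settings from this article.

```python
import numpy as np


def sampling_probabilities(td_errors, alpha, eps=1e-5):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |td_error_i| + eps."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()


def importance_weights(probs, beta):
    """w_i = (N * P(i))^(-beta), normalized so the largest weight is 1."""
    weights = (len(probs) * probs) ** (-beta)
    return weights / weights.max()


td_errors = np.array([0.1, 0.5, 2.0, 0.0])   # made-up TD errors
probs = sampling_probabilities(td_errors, alpha=0.6)
weights = importance_weights(probs, beta=0.4)

rng = np.random.default_rng(0)
batch = rng.choice(len(td_errors), size=2, replace=False, p=probs)
print(probs.round(3), weights.round(3), batch)
```

The transition with the largest TD error is sampled most often and receives the smallest weight, so its more frequent updates are scaled down accordingly. A production buffer would typically use a sum tree instead of recomputing all probabilities on each sample.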

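Dueling architecture: separates the state value from the action advantages

```python
class DuelingDQN(nn.Module):
    def __init__(self, action_dim):
        super().__init__()
        # Shared feature extractor (elided here; it must output 512 features)
        self.feature = nn.Sequential(...)
        # Value stream
        self.value = nn.Linear(512, 1)
        # Advantage stream
        self.advantage = nn.Linear(512, action_dim)

    def forward(self, x):
        features = self.feature(x)
        value = self.value(features)
        advantage = self.advantage(features)
        # Subtract the per-sample mean over actions, not the mean over the whole batch
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```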
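The aggregation step is where dueling networks are most often implemented incorrectly, so here is a small NumPy sketch of it with made-up values. The function name `dueling_q` is illustrative.

```python
import numpy as np


def dueling_q(value, advantage):
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a), per sample."""
    return value + advantage - advantage.mean(axis=1, keepdims=True)


value = np.array([[1.0], [5.0]])             # V(s) for 2 states
advantage = np.array([[0.0, 2.0, 4.0],       # A(s, a) for 3 actions
                      [1.0, 1.0, 1.0]])

q = dueling_q(value, advantage)
print(q)
# [[-1.  1.  3.]
#  [ 5.  5.  5.]]
```

With the per-sample mean, the average of each row of Q-values equals that state's value (1.0 and 5.0 here). Averaging over the whole batch instead would let one sample's advantages shift another sample's Q-values.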

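In my own projects, I have found that the Dueling architecture noticeably speeds up training on Pong, typically reaching decent performance at around 200k steps. Prioritized replay tends to pay off more in more complex games.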