PPO Debug Experience
Recently, I need to perform PPO in a complex env. I refer to some code in GitHub, however, I can't grasp their meaning...
After reading PPO paper, I decided to code by myself.
I already have some experience writing RL code. After several minutes, I finished the first version with gym-cart-pole-v0. However, that didn't work...
Then I started to check the core algorithm again and again...It's very sad, the code still did not work.
So I suspect whether the agent's interacting with env is right or not...
Then I started to debug the interaction between agent and env.
Luckily, I found that the reward(or Gt/advantage) went wrong. So I refer to some papers about advantage such as GAE, TRPO and so on...
Then I changed the way reward is calculated. The code work.
You can click here to ref my code.
网友评论