DQN in TensorFlow: Q_max stays low for some reason

Asked 2 years ago, Updated 2 years ago, 77 views

Thank you for your help.

I have written the following article:

Tried creating Othello AI with tensorflow without understanding the theory of machine learning

What I'd like to ask about this time is that Q_max does not improve while training the Othello AI described above.

The source code is linked from the article above (ttps://github.com/sasaco/tf-dqn-reversi.git).

  • train.py --- trains the AI
  • Reversi.py --- manages the Othello game
  • dqn_agent.py --- manages the AI's training

python3:train.py

players[j].store_experience(state, targets, tr, reward, state_X, targets_X, end)
players[j].experience_replay()

Variable names (a hypothetical example follows the list):
- state --- the board state (= Reversi.screen[0-7][0-7])
- targets --- the positions where a piece can be placed
- tr --- the action that was selected
- reward --- reward for the action, 0 to 1
- state_X --- the board state after the action
- targets_X --- the positions where a piece can be placed after the action
- end --- True when the game has ended
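
To make the shapes concrete, here is a purely hypothetical snapshot of one transition using the names above; the encodings (0 = empty, 1/2 = discs, squares numbered 0-63) are my assumptions, not necessarily what Reversi.py actually produces:

# Hypothetical example of one experience tuple, using the variable names above.
# The encodings (0 = empty, 1/2 = discs, squares numbered 0-63) are assumptions.
state = [[0] * 8 for _ in range(8)]      # 8x8 board before the move
targets = [19, 26, 37, 44]               # squares where a piece can be placed
tr = 26                                  # the action that was chosen
reward = 0                               # becomes 1 only on a winning move
state_X = [[0] * 8 for _ in range(8)]    # 8x8 board after the move
targets_X = [18, 20, 34]                 # playable squares after the move
end = False                              # True once the game has ended

experience = (state, targets, tr, reward, state_X, targets_X, end)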

python3:dqn_agent.py

def store_experience(self, state, targets, action, reward, state_1, targets_1, terminal):
    self.D.append((state, targets, action, reward, state_1, targets_1, terminal))

def experience_replay(self):
    state_minibatch = []
    y_minibatch = []

    # sample a random minibatch
    minibatch_size = min(len(self.D), self.minibatch_size)
    minibatch_indexes = np.random.randint(0, len(self.D), minibatch_size)

    for j in minibatch_indexes:
        state_j, targets_j, action_j, reward_j, state_j_1, targets_j_1, terminal = self.D[j]
        action_j_index = self.enable_actions.index(action_j)

        y_j = self.Q_values(state_j)

        if terminal:
            y_j[action_j_index] = reward_j
        else:
            # reward_j + gamma * max_a' Q(state', a')
            qvalue, action = self.select_enable_action(state_j_1, targets_j_1)
            y_j[action_j_index] = reward_j + self.discount_factor * qvalue

        state_minibatch.append(state_j)
        y_minibatch.append(y_j)

    # training
    self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})

    # for logging
    self.current_loss = self.sess.run(self.loss, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})
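
For reference, the target that this loop builds for the chosen action of each sampled transition is just reward + discount_factor * max Q(next state); below is a small standalone sketch with made-up numbers (gamma = 0.99 is an assumption, not the repo's actual discount_factor):

# Standalone illustration of the Q-learning target computed in experience_replay.
# gamma = 0.99 is an assumed value; dqn_agent.py uses self.discount_factor.
gamma = 0.99

def q_target(reward, terminal, max_next_q):
    """Target for the chosen action of one sampled transition."""
    if terminal:
        return reward                      # no future state to bootstrap from
    return reward + gamma * max_next_q     # Bellman backup

print(q_target(reward=1.0, terminal=True, max_next_q=0.0))   # 1.0 (winning move)
print(q_target(reward=0, terminal=False, max_next_q=0.002))  # ~0.00198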

I run this update every turn, as shown below:

y_j[action_j_index] = reward_j + self.discount_factor * qvalue
state_minibatch.append(state_j)
y_minibatch.append(y_j)

# training
self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})

The reward is 1 when the AI wins, but Q_max only reaches about 0.002, which seems far too small.

I used "Try implementing Deep Q Network (DQN) with TensorFlow ultra-simple ~Introduction~ | ALGO GEEKS" as a reference.

Thank you for your cooperation.

python machine-learning tensorflow deep-learning

2022-09-30 19:47

1 Answer

It looks like the probability that the player takes a random action is fixed at exploration=0.1, and I think it is strange to fix it at such a small value from the very beginning.
If epsilon is small from the start, you only accumulate experience about whatever actions the still-untrained model happens to take, and learning will not go well.
To gather experience by trying a wide variety of actions early on, and then maximize reward with the learned model later, DQN implementations commonly use epsilon-greedy with an annealed epsilon (e.g., start with exploration=1.0 and gradually decrease it to some fixed value).
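
For example, here is a minimal linear-annealing sketch; the constants and function names are illustrative assumptions, not fields of the actual dqn_agent.py:

# Minimal epsilon-greedy annealing sketch; names and numbers are assumptions,
# not taken from dqn_agent.py. Start exploring a lot, then settle at 0.1.
import numpy as np

EPSILON_START = 1.0    # explore almost always at the beginning
EPSILON_FINAL = 0.1    # long-run exploration rate
ANNEAL_STEPS = 10000   # number of steps over which epsilon decays

def epsilon_at(step):
    """Linearly decay epsilon from EPSILON_START to EPSILON_FINAL."""
    frac = min(step / ANNEAL_STEPS, 1.0)
    return EPSILON_START + frac * (EPSILON_FINAL - EPSILON_START)

def select_action(q_values, enable_actions, step):
    """Epsilon-greedy over the legal actions only."""
    if np.random.rand() < epsilon_at(step):
        return int(np.random.choice(enable_actions))
    # otherwise pick the legal action with the highest predicted Q value
    return max(enable_actions, key=lambda a: q_values[a])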

The task introduced in the "ultra-simple DQN with TensorFlow" article is very simple (there are only two actions, moving left or right?), so even with exploration=0.1 the agent eventually accumulates enough experience and learns well.

I didn't read the code closely, so I'm sorry if this misses the mark.


2022-09-30 19:47
