Thank you for your help.
I am writing the following article:
"Tried creating an Othello AI with TensorFlow without understanding the theory of machine learning"
What I would like to ask about this time is that Q_max does not grow while training the Othello AI above.
A link to the source code is in the article above (ttps://github.com/sasaco/tf-dqn-reversi.git).
python3:train.py
players[j].store_experience(state, targets, tr, reward, state_X, targets_X, end)
players[j].experience_replay()
Variable names
- state --- the board (= Reversi.screen[0-7][0-7])
- targets --- positions where a piece can be placed
- tr --- the selected action
- reward --- reward for the action, 0 to 1
- state_X --- the board after the action
- targets_X --- positions where a piece can be placed after the action
- end --- True when the game has ended
python3:dqn_agent.py
def store_experience(self, state, targets, action, reward, state_1, targets_1, terminal):
    self.D.append((state, targets, action, reward, state_1, targets_1, terminal))

def experience_replay(self):
    state_minibatch = []
    y_minibatch = []

    # sample random minibatch
    minibatch_size = min(len(self.D), self.minibatch_size)
    minibatch_indexes = np.random.randint(0, len(self.D), minibatch_size)

    for j in minibatch_indexes:
        state_j, targets_j, action_j, reward_j, state_j_1, targets_j_1, terminal = self.D[j]
        action_j_index = self.enable_actions.index(action_j)

        y_j = self.Q_values(state_j)

        if terminal:
            y_j[action_j_index] = reward_j
        else:
            # reward_j + gamma * max_a' Q(state', a')
            qvalue, action = self.select_enable_action(state_j_1, targets_j_1)
            y_j[action_j_index] = reward_j + self.discount_factor * qvalue

        state_minibatch.append(state_j)
        y_minibatch.append(y_j)

    # training
    self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})

    # for log
    self.current_loss = self.sess.run(self.loss, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})
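If I understand the reference article correctly, the target built here should correspond to the usual Q-learning target (with gamma = self.discount_factor and the max taken over the legal moves returned by select_enable_action):

y_j = r_j                                   (if the state is terminal)
y_j = r_j + gamma * max_a' Q(s_{j+1}, a')   (otherwise)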
I am updating the network every turn, as shown below:

y_j[action_j_index] = reward_j + self.discount_factor * qvalue
state_minibatch.append(state_j)
y_minibatch.append(y_j)
# training
self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})
However, although the reward when it wins is 1, Q_max stays around 0.002, which is very small.
I used "Try implementing Deep Q Network (DQN) super-simply with TensorFlow ~Introduction~ | ALGO GEEKS" as a reference.
That is all.
Thank you for your cooperation.
python machine-learning tensorflow deep-learning
It seems that the probability of the player taking a random action is fixed at exploration=0.1, and I think it is odd that it is fixed at such a small value from the very beginning.
If epsilon is small from the start, you only accumulate experience from whatever actions the still-untrained model happens to take, so learning does not progress well.
To gather experience by trying a variety of actions at first and then maximize return with the learned model, DQN implementations commonly use epsilon-greedy with annealing (for example, starting with exploration=1.0 and gradually decreasing it to some fixed value), as sketched below.
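For example, here is a minimal sketch of the kind of annealed epsilon-greedy selection I mean (the names exploration_start, exploration_end, anneal_steps and the integer action encoding are just for illustration, not taken from the repository):

import numpy as np

exploration_start = 1.0   # act almost completely at random at first
exploration_end = 0.1     # keep a little exploration even late in training
anneal_steps = 10000      # number of steps over which epsilon is decayed

def epsilon_at(step):
    # Linearly decay epsilon from exploration_start down to exploration_end.
    fraction = min(step / anneal_steps, 1.0)
    return exploration_start + fraction * (exploration_end - exploration_start)

def select_action(step, q_values, enable_actions):
    # Epsilon-greedy choice restricted to the currently legal moves
    # (enable_actions is assumed to be a list of integer action indices).
    if np.random.rand() < epsilon_at(step):
        return int(np.random.choice(enable_actions))        # explore
    return max(enable_actions, key=lambda a: q_values[a])   # exploit

Calling epsilon_at(step) every turn and using it for action selection lets the agent gather varied experience early in training and rely more and more on the learned Q-values later on.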
The task introduced in "Try implementing DQN super-simply with TensorFlow ~" is very simple (there are only two possible actions, moving left or right, if I remember correctly), so even with exploration=0.1 it will eventually accumulate enough experience and learn well, given time.
I have not read the code closely, so my apologies if this misses the mark.