Modifying the TD Code in PDP++
Hello all,
I'm trying to revamp the discounted-reward TD-learning aspect of PDP++ to use
a running average reward function (Daw, Touretzky, 2000) but I'm having some
problems identifying the code I need to change.
First, which objects would I need to modify? My candidates are
TDModUnit, TDLayerSpec, and LeabraACLayerSpec, with the last being the most
likely. Specifically, this line here (3393 in Leabra.cc) is what I'm thinking
of:
// use prior plus phase, which was v(t+1) and is now v(t)
// but assume that v(t+1) was discounted so, now "undiscount" it
acu->ext = rew.inv_disc * acu->act_p;
But I'm not sure if this is the only place I'd have to change. I was hoping to
simply change the ACRewSpec to internally use an averaged reward, but that
object only has Initialize and UpdateAfterEdit functions.
The changes I need to implement are:
1) Keep a running sum of reward at each timestep (plus phase) and the number
of timesteps that have occurred.
2) Change the evaluation of delta from the exponential version:
delta(t) = (r(t) + discount * V(t+1)) - V(t)
to the running average version:
delta(t) = (r(t) - ravg(t)) + (V(t+1) - V(t))
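To make the two changes concrete, here is a rough standalone C++ sketch of what I have in mind. It is not PDP++ code: RunningAvgReward, DeltaDiscounted, and DeltaAvgReward are hypothetical names made up for illustration, and it assumes ravg(t) is the plain mean of all rewards seen so far (running sum divided by timestep count). Whether the current reward is folded into the average before or after delta is computed is left to the caller.

```cpp
// Hypothetical helper, not part of PDP++: tracks the running sum of reward
// and the number of timesteps seen, as described in change (1).
struct RunningAvgReward {
  double sum = 0.0;  // running sum of reward over all timesteps so far
  long   n   = 0;    // number of timesteps that have occurred

  // Record the reward received at this timestep (e.g. in the plus phase).
  void Update(double r) { sum += r; ++n; }

  // Current average reward; defined as 0 before any reward is recorded.
  double Avg() const { return n > 0 ? sum / n : 0.0; }
};

// Discounted TD error: delta(t) = (r(t) + discount * V(t+1)) - V(t)
double DeltaDiscounted(double r, double v_t, double v_next, double discount) {
  return (r + discount * v_next) - v_t;
}

// Average-reward TD error: delta(t) = (r(t) - ravg(t)) + (V(t+1) - V(t))
double DeltaAvgReward(double r, double ravg, double v_t, double v_next) {
  return (r - ravg) + (v_next - v_t);
}
```

In use, each timestep would call Update(r) on the tracker and pass Avg() as ravg to DeltaAvgReward. Note that the average-reward version has no discount term, so any "undiscounting" of the previous plus-phase value would presumably no longer apply.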
I'm figuring I need to make these changes to the LeabraACLayerSpec but am
unsure of what code will need to be changed since the processing seems to be
distributed across multiple functions...
Thank you in advance,
-Roberto Olivares
Vanderbilt University
Intelligent Robotics Laboratory