
Modifying the TD Code in PDP++



Hello all,

I'm trying to change the discounted-reward TD learning in PDP++ to use a 
running-average reward formulation (Daw & Touretzky, 2000), but I'm having 
trouble identifying the code I need to change.

First, which objects would I need to modify? My candidates are TDModUnit, 
TDLayerSpec, and LeabraACLayerSpec, with the last being the most likely. 
Specifically, this line (3393 in Leabra.cc) is the one I have in mind:

// use prior plus phase, which was v(t+1) and is now v(t)
// but assume that v(t+1) was discounted so, now "undiscount" it
acu->ext = rew.inv_disc * acu->act_p;

But I'm not sure whether this is the only place I'd have to modify. I was 
hoping to simply change ACRewSpec to use an averaged reward internally, but 
that object has only Initialize and UpdateAfterEdit functions.

The changes I need to implement are:

1) Keep a running sum of reward at each timestep (plus phase) and the number 
of timesteps that have occurred.
2) Change the evaluation of delta from the exponential version:

delta(t) = (r(t) + discount * V(t+1)) - V(t)

to the running average version:

delta(t) = (r(t) - ravg(t)) + (V(t+1) - V(t))

I'm figuring I need to make these changes to the LeabraACLayerSpec but am 
unsure of what code will need to be changed since the processing seems to be 
distributed across multiple functions...

Thank you in advance,
-Roberto Olivares
Vanderbilt University
Intelligent Robotics Laboratory