[pdp-discuss] TD learning in PDP
Randall C. O'Reilly
Randy.OReilly at colorado.edu
Wed Apr 18 23:40:54 MDT 2007
Matthijs,
this is a bit confusing, but p_dwt only works in conjunction with the
TDRewPredConSpec, which is what actually uses the prior activations. On a
regular DaModUnitSpec guy, turning it on will ENABLE such learning by
maintaining the appropriate variables and calling the dwt functions, but the
ConSpec must also be configured to actually do the weight change using the
prior sending unit variables. Although TDRewPredConSpec is typically used
only for that specific rew pred guy in TD, I'm pretty sure you can just use
it on the action connections. Give that a try.
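
To make the timing issue concrete, here is a rough sketch in plain Python.
It is not PDP++ code: every name in it is made up for illustration, and the
simple da * sender * receiver rule is an assumption, not the actual
TDRewPredConSpec learning rule. It only shows why the weight change has to
use the stored prior activations when the da signal arrives a step late.

```python
# Illustrative sketch only: plain Python, not the PDP++ API.
# All function and variable names here are hypothetical.

def td_error(reward, value_now, value_prev, gamma=0.9):
    """TD error available at time t: r(t) + gamma * V(t) - V(t-1)."""
    return reward + gamma * value_now - value_prev


def update_action_weights(weights, prev_send, prev_recv, da, lrate=0.1):
    """Change weights[j][i] using activations saved from the PREVIOUS step.

    The da signal arrives one step after the action was taken, so the
    sender (input) and receiver (action) activations must be the stored
    prior values, not the current ones. The product rule used here is an
    assumed stand-in for whatever the real ConSpec computes.
    """
    for j, recv in enumerate(prev_recv):
        for i, send in enumerate(prev_send):
            weights[j][i] += lrate * da * send * recv
    return weights


# Two input units, two action units, weights indexed [action][input].
weights = [[0.0, 0.0], [0.0, 0.0]]
prev_send = [1.0, 0.0]   # input pattern at t-1
prev_recv = [0.0, 1.0]   # action chosen at t-1
da = td_error(reward=1.0, value_now=0.0, value_prev=0.5)  # computed at t
update_action_weights(weights, prev_send, prev_recv, da)
```

Only the connection from the previously active input to the previously
chosen action changes; if the current activations were used instead, the
credit would go to the wrong state/action pair.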
Meanwhile, here is some irrelevant info about the p_dwt variable in the
connection that I typed before I realized what you were talking about -- it
might be useful to someone.
p_dwt is a bit of a weird variable. The dwt variable reflects any currently
accumulating weight changes, which may not actually be applied for several
steps depending on the learning parameters (e.g., SMALL_BATCH in the epoch
process). Once the weight changes are applied to update the weights, dwt is
reset to 0 and p_dwt is set to the value that was just applied. So, in the
usual ON_LINE mode, dwt is 0 and p_dwt shows the weight change that was
computed on the trial that just finished.
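
As a rough picture of that bookkeeping, here is a toy sketch in plain Python.
The class and method names are hypothetical, not the actual PDP++
connection code, and the learning rate and any other scaling are left out.

```python
# Toy model of the dwt / p_dwt bookkeeping described above.
# Hypothetical names; this is not the PDP++ implementation.

class ToyConnection:
    def __init__(self, wt=0.0):
        self.wt = wt        # current weight
        self.dwt = 0.0      # accumulating, not-yet-applied weight change
        self.p_dwt = 0.0    # the last weight change that was applied

    def compute_dwt(self, delta):
        """Accumulate a weight change without applying it."""
        self.dwt += delta

    def update_weights(self):
        """Apply the accumulated change, remember it, and reset dwt."""
        self.wt += self.dwt
        self.p_dwt = self.dwt
        self.dwt = 0.0


# On-line style: apply after every trial.
online = ToyConnection()
online.compute_dwt(0.25)
online.update_weights()     # now dwt == 0.0 and p_dwt == 0.25

# Small-batch style: accumulate over two trials, then apply once.
batch = ToyConnection()
batch.compute_dwt(0.25)
batch.compute_dwt(0.5)      # dwt is 0.75 here, p_dwt is still 0.0
batch.update_weights()      # now dwt == 0.0 and p_dwt == 0.75
```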
- Randy
On Tuesday 10 April 2007 17:54, M Snel wrote:
> Hi,
>
> I am trying to construct a model of TD learning in a simple navigation
> task. The network should learn to select the optimal action to get to a
> goal state; thus, the weights between inputs (encoding location in the
> environment) and outputs (encoding navigational actions) should be
> updated based on reward.
>
> I have connected the input units to the predicted reward layer, and
> clamp the external reward upon reaching goal state. By this
> construction the network accurately learns to represent the "value" of
> each input unit (i.e. higher expected reward closer to goal). I have
> connected the TDlayer to the DaModUnit action units and have turned on
> the Da modulation and "p dwt" in those units so that they should learn
> from the modulation.
>
> However, results for learning in the action units are not as I
> expected: the action units don't learn to map an input state to a
> correct action. I was assuming that in PDP the modulation from the
> TDlayer would be "interpreted" by the DaModUnits as feedback on the
> PREVIOUS action (as per the "p dwt" parameter). Is this correct? Also,
> do the units learn using the Da modulation directly or by a difference
> in Da modulations from one timestep to the next?
>
> Thanks,
> Matthijs