Re: Modifying the TD Code in PDP++



Yeah, the downside of object-oriented programming is that things get a
bit distributed.  The LeabraACLayerSpec is the main thing you'll need
to change to compute the average reward value.  However, you'll then
need to store that average somewhere.  I recommend co-opting the
act_avg variable on the LeabraUnit; to do that, set act.avg_dt = 0 on
the UnitSpec so that act_avg is no longer updated by the normal
activation averaging.
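
To make the idea concrete, here is a rough standalone sketch of it.
These are NOT the real PDP++ classes: SketchUnit, SketchUnitSpec,
SketchACLayerSpec, and the rew_avg_dt parameter are all hypothetical
stand-ins, and the numeric defaults are arbitrary.  It only shows a
running average of reward being parked in act_avg while avg_dt = 0
keeps the usual update from touching it.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a unit; not the real LeabraUnit.
struct SketchUnit {
    float act = 0.0f;      // current activation
    float act_avg = 0.0f;  // co-opted here to hold the average reward
};

// Hypothetical stand-in for the unit spec; not the real UnitSpec.
struct SketchUnitSpec {
    float avg_dt = 0.005f;  // rate of the usual activation average (arbitrary value)

    // Usual running average of activation.  With avg_dt == 0 this
    // leaves act_avg exactly as it was.
    void UpdateActAvg(SketchUnit& u) const {
        u.act_avg += avg_dt * (u.act - u.act_avg);
    }
};

// Hypothetical stand-in for the layer spec that computes the average reward.
struct SketchACLayerSpec {
    float rew_avg_dt = 0.1f;  // assumed rate for the reward average

    // Move each unit's act_avg a fraction rew_avg_dt toward the current reward.
    void UpdateAvgReward(std::vector<SketchUnit>& units, float reward) const {
        for (auto& u : units) {
            u.act_avg += rew_avg_dt * (reward - u.act_avg);
        }
    }
};
```

With avg_dt left at a nonzero value, UpdateActAvg would pull act_avg
back toward the unit's activation on every update and the stored
reward average would be lost.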

BTW, I've been doing a lot of work with TD learning in the Leabra
framework and have found that it works *much* better if you use a
coarse-coded distributed representation of expected rewards instead of
using the activity value of a single unit as in the existing
implementation.  Also, I reorganized the phase-based mapping of the TD
algorithm to work a lot better -- the previous implementation (in the
LeabraACLayerSpec) does not handle non-absorbing rewards very well,
due to a basic problem in the way I had conceptualized TD.  This new
improved code is still undergoing some development, and is closely
linked with code that simulates the BG & PFC, but you can grab a copy
of it from the following link if you want to take a look:

ftp://grey.colorado.edu/pub/oreilly/pdp++/pdp++_2.3a_src.tar.gz
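
For a rough picture of what a coarse-coded representation of a scalar
such as expected reward can look like, here is a generic standalone
sketch.  It is not taken from the code at the link above: the
function names, the Gaussian tuning curves, and the weighted-average
readout are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Encode a scalar as a bump of activity over n_units units whose
// preferred values are evenly spaced over [min_val, max_val].
// Each unit responds with a Gaussian of width sigma around its
// preferred value.  Assumes n_units >= 2 and sigma > 0.
std::vector<float> CoarseEncode(float value, std::size_t n_units,
                                float min_val, float max_val, float sigma) {
    std::vector<float> acts(n_units);
    const float step = (max_val - min_val) / static_cast<float>(n_units - 1);
    for (std::size_t i = 0; i < n_units; ++i) {
        const float pref = min_val + step * static_cast<float>(i);
        const float d = (value - pref) / sigma;
        acts[i] = std::exp(-0.5f * d * d);
    }
    return acts;
}

// Decode by taking the activation-weighted average of the preferred
// values.  Falls back to the midpoint of the range if nothing is active.
// Assumes acts.size() >= 2.
float CoarseDecode(const std::vector<float>& acts, float min_val, float max_val) {
    const float step = (max_val - min_val) / static_cast<float>(acts.size() - 1);
    float sum = 0.0f;
    float weighted_sum = 0.0f;
    for (std::size_t i = 0; i < acts.size(); ++i) {
        const float pref = min_val + step * static_cast<float>(i);
        sum += acts[i];
        weighted_sum += acts[i] * pref;
    }
    return sum > 0.0f ? weighted_sum / sum : 0.5f * (min_val + max_val);
}
```

Encoding a value with, say, 11 units over [0, 1] and sigma = 0.1 and
then decoding it gives back approximately the original value, with a
small bias near the ends of the range where the bump is cut off.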

Meanwhile, I'll email you a copy of some notes I wrote describing how
the new phase-based scheme works.

				- Randy