[pdp-discuss] TD demo and delayed rewards

allan.randall at ntt.ca
Thu Feb 1 09:35:48 MST 2007


The simple TD demo in "Explorations" (p 199) is a really nice way of showing the basic algorithm at work. I find that even the serial compound "cheat" is readily accepted when I explain the demo to others. The bigger problem is that the stimulus must be maintained right up to the point of reward. This is readily apparent when you show the demo, and it makes it look as though the model is not really detecting delayed reward at all.

What would be the easiest way to fix this, without destroying the simple, pedagogically clean presentation of the demo? The active memory model (p 307) gets into numerous other things, and is no longer just a demo of the basic TD algorithm. 

I'm guessing the problem could be fixed with a simpler modification of the basic TD demo... but maybe that will get me into trouble? Has anyone tried this?
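
To make the kind of modification I have in mind concrete, here is a rough sketch. It is an assumption for illustration, not the demo from the book or its simulator. The stimulus is present for a single time step only; its onset starts a tapped delay line (one unit per step since onset), and plain TD(0) learns over those units. The function name run_td_csc, the trial layout (onset, reward_t, n_steps) and the learning parameters (alpha, gamma) are all hypothetical choices.

```python
import numpy as np


def run_td_csc(n_steps=10, onset=2, reward_t=7, n_trials=200,
               alpha=0.2, gamma=1.0):
    """TD(0) over a tapped delay line started by a brief stimulus.

    Hypothetical setup: the stimulus is on only at step `onset`.
    Delay-line unit k is active k steps after onset, so nothing
    external has to be held on until the reward at `reward_t`.
    """
    w = np.zeros(n_steps)  # one weight per delay-line unit

    def features(t):
        # One-hot "time since onset" code; all zeros before onset
        # and after the end of the trial.
        x = np.zeros(n_steps)
        if onset <= t < n_steps:
            x[t - onset] = 1.0
        return x

    deltas = np.zeros(n_steps)
    for _ in range(n_trials):
        for t in range(n_steps):
            x_t = features(t)
            x_next = features(t + 1)
            # Reward is delivered on the transition into reward_t.
            r = 1.0 if t + 1 == reward_t else 0.0
            # TD error: r + gamma * V(t+1) - V(t)
            delta = r + gamma * (w @ x_next) - (w @ x_t)
            w += alpha * delta * x_t
            deltas[t] = delta  # ends up holding the last trial's errors

    values = np.array([w @ features(t) for t in range(n_steps)])
    return values, deltas


values, deltas = run_td_csc()
print(np.round(values, 3))  # learned value estimate at each time step
print(np.round(deltas, 3))  # TD errors on the final trial
```

With these assumed settings, the value estimate should be high from stimulus onset up to the reward and zero elsewhere, and the TD error on the last trial should sit at the transition into stimulus onset rather than at the reward. The delay line is still the serial compound "cheat", just driven by onset instead of by a sustained input.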

I'm not questioning the value of the algorithm... I understand why the demo makes the compromise it does. I'm just trying to respond to objections that I have heard raised by those watching the demo, without getting into a whole other more complicated demo. 

Any thoughts would be appreciated.

Cheers,

Allan Randall, NTT Systems Inc.


