1. Agarwal, A., Kakade, S. M., Lee, J. D., & Mahajan, G. (2020). Optimality and approximation with policy gradient methods in Markov decision processes. In Proceedings of Thirty Third Conference on Learning Theory (pp. 64–66). PMLR. URL: https://proceedings.mlr.press/v125/agarwal20a.html.
2. Awate (2009). Policy-gradient based actor-critic algorithms.
3. Baird, L. C. (1994). Reinforcement learning in continuous time: advantage updating. In Proceedings of 1994 IEEE international conference on neural networks, Vol. 4 (pp. 2448–2453). http://dx.doi.org/10.1109/ICNN.1994.374604.
4. Baxter (2001). Infinite-horizon policy-gradient estimation. The Journal of Artificial Intelligence Research (JAIR).
5. Bhatnagar (2009). Natural actor-critic algorithms. Automatica.