“Model-free Control of Partially Observable Underactuated Systems by pairing Reinforcement Learning with Delay Embeddings”

Authors: Martinius Knudsen, Sverre Hendseth, Gunnar Tufte and Axel Sandvig
Affiliation: NTNU, Department of Engineering Cybernetics and NTNU
Reference: 2022, Vol 43, No 1, pp. 1-8.

Keywords: Control, Delay embeddings, Reinforcement learning, Partial observability, Underactuated systems

Abstract: Partial observability is a problem in control design where the measured states are insufficient to describe the system's trajectory. Interesting real-world systems often exhibit nonlinear behavior and noisy, continuous-valued states that are poorly described by first principles, and which are only partially observable. If partial observability can be overcome, these conditions suggest the use of reinforcement learning (RL). In this paper we tackle the problem of controlling highly nonlinear underactuated dynamical systems, without a model, and with insufficient observations to infer the system's internal states. We approach the problem by creating a time-delay embedding from a subset of the observed state and applying RL on this embedding rather than the original state manifold. We find that delay embeddings work well with learning-based methods, as such methods do not require a precise description of the system's state. Instead, RL learns to map any observation to an appropriate action (determined by a reward function), even if these observations do not lie on the original geometric state manifold.
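
As a rough illustration of the time-delay embedding the abstract refers to, the sketch below builds delay vectors [y_t, y_(t-tau), ..., y_(t-(dim-1)*tau)] from a stream of scalar measurements, which an RL agent could then receive in place of the full state. It is a minimal sketch, not the authors' implementation: the class name DelayEmbedder, the parameters dim and tau, their example values, and the choice to pad the buffer with the first measurement are all assumptions made for illustration.

```python
from collections import deque


class DelayEmbedder:
    """Hypothetical helper: builds delay vectors from scalar observations online."""

    def __init__(self, dim, tau):
        if dim < 1 or tau < 1:
            raise ValueError("dim and tau must be positive integers")
        self.dim = dim  # embedding dimension (assumed value, not from the paper)
        self.tau = tau  # delay in time steps (assumed value, not from the paper)
        # Keep just enough history to reach back (dim - 1) * tau steps.
        self.history = deque(maxlen=(dim - 1) * tau + 1)

    def reset(self, first_obs):
        # Assumption: pad with the first measurement so a full delay
        # vector is available from the very first step of an episode.
        self.history.clear()
        self.history.extend([first_obs] * self.history.maxlen)
        return self.vector()

    def step(self, obs):
        # Append the newest measurement; the oldest one drops out automatically.
        self.history.append(obs)
        return self.vector()

    def vector(self):
        # Newest sample first, then samples tau, 2*tau, ... steps in the past.
        newest = len(self.history) - 1
        return [self.history[newest - k * self.tau] for k in range(self.dim)]


embedder = DelayEmbedder(dim=3, tau=2)
embedder.reset(0.0)
for y in [1.0, 2.0, 3.0, 4.0, 5.0]:
    state = embedder.step(y)
print(state)  # [5.0, 3.0, 1.0]
```

In an RL loop, the vector returned by step would be passed to the policy instead of the raw measurement; suitable values of dim and tau depend on the system and would have to be chosen separately.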

DOI: 10.4173/mic.2022.1.1

BibTeX:
@article{MIC-2022-1-1,
  title={{Model-free Control of Partially Observable Underactuated Systems by pairing Reinforcement Learning with Delay Embeddings}},
  author={Knudsen, Martinius and Hendseth, Sverre and Tufte, Gunnar and Sandvig, Axel},
  journal={Modeling, Identification and Control},
  volume={43},
  number={1},
  pages={1--8},
  year={2022},
  doi={10.4173/mic.2022.1.1},
  publisher={Norwegian Society of Automatic Control}
}