Balancing an Inverted Pendulum
Kajal Damji Gada
ENPM808F Robot Learning, Final Project
12th December 2016
Contents

Abstract
1 Introduction
2 Related Work
3 Approach
  3.1 Q-learning: Introduction
  3.2 Q-learning: Exploration
  3.3 Q-learning: Formula
4 Implementation
5 Results
6 Analysis
7 Conclusion
8 Future Work
References
Appendix
Abstract
1 Introduction
An inverted pendulum is a touchstone that every robotics student encounters at some point [1]. From the stabilization of an unstable open-loop system to the real-world application of the Segway, it is a benchmark in control theory and robotics. It is also a good test case to aid the learning of a new algorithm, which in this scenario is Q-learning. Thus, the goal of the project is to understand the working of Q-learning, a machine learning algorithm, by implementing it for an inverted pendulum.
Figure 1: (a) Segway [2] (b) Furuta pendulum [3]
The inverted pendulum problem has many variations: the Furuta pendulum [3], the double inverted pendulum [4], etc. In this project, the case of an inverted pendulum on a cart is considered. The system may appear simple in design. However, it is a non-linear system with a statically stable equilibrium point in the pendant (face-down) position and an unstable equilibrium point in the upright position.
This makes designing a control system for an inverted pendulum a challenging problem. In the case of Q-learning, the model does not need to be known: Q-learning is regarded as a model-free [5] reinforcement learning method. However, it comes with its own set of challenges, one of the most important being discretization of the model, since Q-learning works on discrete systems with an end-game reward.
Literature related to this project is discussed in Section 2. In Section 3, the plan for the project problem is charted out. Sections 4 and 5 then present the actual implementation and results. The results are analyzed in Section 6, and conclusions are drawn in Section 7.
2 Related Work
The work by Lasse Scherffig [6] starts with an explanation of reinforcement learning theory and goes on to explain the difference between supervised learning and reinforcement learning. The main difference is that reinforcement learning does not have a set of sample actions to imitate; it instead learns by exploring and assessing the rewards.
The paper then discusses the inverted pendulum model, followed by the work done. The paper addresses two problems: balancing and full control. Balancing means maintaining balance when in the face-up position, and full control means reaching the face-up position from any position, including face-down. While the first problem is solved using Q-learning, the second uses an artificial neural network (ANN), as the number of states is too large.
In a second paper, the author discusses the use of a resource-allocation network with Q-learning [7]. The paper starts with a discussion of supervised learning and memorization for balancing an inverted pendulum; that method essentially memorizes each move using a Gaussian signal. The discussion then moves on to how Q-learning can be used to solve the problem.
Figure 2: Q-learning network with Restart algorithm [7]
Instead of using a Q-table, the paper proposes the Q-learning network shown in Figure 2. The point is that instead of storing each state-action pair in a large memorization table, as in supervised learning, a network is used and its resources are reallocated. Every time a new state-action pair is learned, it is stored at the unit that is least useful. This approach is called the Restart algorithm, and it gives results that work better than a combination of supervised learning and memorization.
3 Approach
3.1 Q-learning: Introduction
The task is defined as balancing an inverted pendulum on a cart in an upright position. The method chosen for this task is a machine learning algorithm: Q-learning. It is a method that does not require knowledge of the model for learning; it learns by experiencing the reward for taking a sequence of actions [5].
Figure 3: Interface between Agent and Environment in Q-learning [6]
In other words, the agent takes an action and observes the resulting reward from the environment, as shown in Figure 3. The reward is stored, along with the state, in a table called the Q-table. The next time the same state is encountered, the agent decides which action to take based on the rewards learned previously.
3.2 Q-learning: Exploration
A good reward leads to taking the action again, and a bad reward leads to avoiding it. But what if there is a better reward to be found? This is why there is a component of exploration: when deciding the next action, the agent sometimes takes an action it has not explored, even when an existing action already gives a good reward.
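This explore-exploit trade-off is commonly realized with an ε-greedy rule: with probability ε a random action is taken, otherwise the best-known action. A minimal sketch of the idea (the dictionary-based Q-table and the helper name `epsilon_greedy` are illustrative, not the project's actual code):

```python
import random

def epsilon_greedy(q_table, state, n_actions, epsilon):
    """With probability epsilon explore a random action;
    otherwise exploit the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q_values = [q_table[(state, a)] for a in range(n_actions)]
    return q_values.index(max(q_values))

# Action 1 already looks best for state 0, but with epsilon > 0
# action 0 would still be tried occasionally.
q = {(0, 0): 0.2, (0, 1): 0.8}
print(epsilon_greedy(q, 0, 2, epsilon=0.0))  # 1: pure exploitation
```

With ε = 0 the rule always exploits; raising ε trades some immediate reward for the chance of discovering a better action.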
The size of the Q-table is determined by the available combinations of state-action pairs. This size also affects the number of iterations required to obtain satisfactory results.
3.3 Q-learning: Formula
In each iteration, the current state (s) is observed. An action is chosen for execution based on equation (1), and the Q-table is then updated based on the chosen action as given in equation (2):
π(s) = argmax_a Q(s, a)    (1)

Q(s, a) = r + γ max_a′ Q(s′, a′)    (2)

where π(s) is the policy for state s, a is the chosen action, r is the reward for the chosen action, γ is the delayed-reward factor, and s′ is the new state after the action is executed [6].
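Equation (2) translates into a one-line table update. A minimal sketch under assumed values (the dict-based Q-table, the state names, and γ = 0.9 are illustrative only, not the project's settings):

```python
gamma = 0.9  # delayed-reward factor (illustrative value)

# Q-table keyed by (state, action); the values for s1 are assumed known.
Q = {('s0', 'left'): 0.0, ('s0', 'right'): 0.0,
     ('s1', 'left'): 2.0, ('s1', 'right'): 5.0}

def q_update(Q, s, a, r, s_next, actions):
    """Equation (2): Q(s, a) = r + gamma * max over a' of Q(s', a')."""
    Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in actions)

# Taking 'right' in s0 yields reward 1 and lands in s1, whose best value is 5.
q_update(Q, 's0', 'right', 1.0, 's1', ['left', 'right'])
print(Q[('s0', 'right')])  # 5.5 = 1.0 + 0.9 * 5.0
```

Repeating such updates over many episodes propagates reward backwards through the state-action pairs that lead to it.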
4 Implementation
The program is implemented in Python 3. The code builds the Q-table over multiple iterations and stores the best result. The best result can then be played as an animation using the Penplot command from the plot.py file.
The program (inverted_pendulum_q_learning.py) starts with an empty Q-table. It iterates over multiple episodes, and for each episode the current state is randomized. A policy is calculated for the current state over all actions. An action is chosen based on the calculated policy and executed.
Based on the chosen action, a new state is calculated from the system model, and from this new state a reward is calculated. The reward depends on the position of the cart and the angle of the pendulum, and it is used to update the Q-table. If the pendulum is dropped, a new episode begins with a new random start state.
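The reward calculation described above can be sketched as a simple shaping function. The thresholds below mirror those in the appendix code; the helper name `reward_for` and the convention of keeping θ directly in degrees are simplifications for illustration:

```python
def reward_for(new_state):
    """Reward grows as the pole nears upright; it is zero once the cart
    leaves the track or the pole tilts past the widest band."""
    x, _, theta_deg, _ = new_state  # theta in degrees for this sketch
    if abs(x) >= 2.4:  # cart ran off the end of the track
        return 0
    for limit, r in [(1.0, 10), (3.0, 5), (6.0, 2), (20.0, 1)]:
        if abs(theta_deg) < limit:
            return r
    return 0  # pole has effectively dropped

print(reward_for((0.0, 0.0, 0.5, 0.0)))   # 10: nearly upright
print(reward_for((0.0, 0.0, 15.0, 0.0)))  # 1: barely inside the widest band
print(reward_for((3.0, 0.0, 0.5, 0.0)))   # 0: cart off the track
```

A reward of 0 marks the end of an episode, which is why the training loop treats `reward < 1` as the pendulum being dropped.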
Note that an inverted pendulum is a continuous system. Thus, each state is discretized for implementation.
The states chosen are:
• Position of cart (x)
• Linear velocity of cart (ẋ)
• Angle of pendulum with cart (θ)
• Angular velocity of pendulum (θ̇)
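Each of these continuous quantities is mapped to a small number of bins. A minimal sketch of the idea (`bisect.bisect_right` reproduces the strict `<` threshold cascade used in the appendix code; the edge values below cover only the cart-position case):

```python
import bisect

def discretize(value, edges):
    """Return the bin index for value given sorted bin edges:
    index 0 is below the first edge, len(edges) is above the last."""
    return bisect.bisect_right(edges, value)

# Cart position split into 3 bins with edges at -0.8 and 0.8
x_edges = [-0.8, 0.8]
print(discretize(-1.0, x_edges))  # 0: left region
print(discretize(0.0, x_edges))   # 1: middle region
print(discretize(1.0, x_edges))   # 2: right region
```

The four bin indices together address one cell of the Q-table for each action.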
Next, the actions set includes:
• Move left (−1)
• Move right (1)
Thus, the cart moves with a force of F newtons to the left or right based on the chosen action. F is set to 10 N and can be changed. Other variables include:
• Magnitude of Force on cart (F ) and Gravity constant (g)
• Mass of cart (mc), Mass of pole (mp) and Length of Pole (lp)
• Reward delay factor (γ)
• Exploration factor (ε)
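These parameters enter one Euler-integration step of the cart-pole dynamics. A hedged sketch of such a step, using the same equations of motion as the appendix code (the parameter values and the helper name `step` are illustrative):

```python
from math import sin, cos

# Illustrative parameter values; the project's actual values may differ.
g, m_c, m_p, l_p, tau = 9.8, 0.5, 0.1, 0.3, 0.02
F = 10.0  # magnitude of the push, in newtons

def step(state, action):
    """Advance (x, x_dot, theta, theta_dot) by one time step tau;
    action is -1 (push left) or +1 (push right)."""
    x, x_dot, theta, theta_dot = state
    force = action * F
    m_total = m_c + m_p
    temp = (force + m_p * l_p * theta_dot ** 2 * sin(theta)) / m_total
    theta_acc = (g * sin(theta) - cos(theta) * temp) / \
                (l_p * (4.0 / 3.0 - m_p * cos(theta) ** 2 / m_total))
    x_acc = temp - m_p * l_p * theta_acc * cos(theta) / m_total
    return (x + tau * x_dot, x_dot + tau * x_acc,
            theta + tau * theta_dot, theta_dot + tau * theta_acc)

# From rest at upright, a rightward push accelerates the cart to the right.
s = step((0.0, 0.0, 0.0, 0.0), +1)
```

Because the pendulum is attached to the cart, the same push that moves the cart one way tips the pole the other way, which is what makes balancing non-trivial.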
5 Results
Figure 5 shows an example of the results after 1,000,000 iterations. As seen, the pendulum is able to maintain itself in the upright position and eventually stops when it reaches the end of the cart track (beyond 2.4 units).
Figure 4: Snapshot of animation for Inverted Pendulum Balancing
It can be seen that, since the reward is maximum at the top, the system attempts to maintain that state. Note that this is a dynamic system and must therefore move continuously to stay at the unstable equilibrium point.
Figure 5: Results
6 Analysis
Based on the results of various experimental runs, it was observed that the system is able to identify a policy for maintaining the angle of the pendulum between −1 and 1 degrees. To assist in learning, the initial trials had the start state at (x, ẋ, θ, θ̇) = (0, 0, 0, 0). In later iterations (episodes), the system starts with a randomized initial state. This helps it learn better in fewer iterations.
Another way to achieve better results would be to create more discrete states. This also applies to the case where the algorithm has to learn to bring the pendulum up from the face-down to the face-up position. However, a Q-table would not be ideal for a high number of states. For such cases, artificial neural networks (ANNs) should be considered, as shown in [6].
7 Conclusion
The project concluded with an implementation of the Q-learning algorithm to balance an inverted pendulum in an upright position. It was also realized that it is difficult to implement a continuous system: it requires discretization of the states, which can prove challenging.
If the discretization is too coarse, the transition from one state to another is less accurate, while with more states the Q-table becomes quite big. With many states, even more iterations are required to learn and build the Q-table. In such cases, other options such as artificial neural networks should be explored.
8 Future Work
This project focused on balancing the pendulum; a natural extension would be to get the pendulum to come into an upright position from a face-down position.
Figure 6: Balancing a glass of Wine
Another interesting direction for future work would be to learn to balance the inverted pendulum while moving in a particular direction. This could apply to a scenario in which a mobile robot brings you a glass of wine while balancing it at the end of a stick (an inverted pendulum), as shown in Figure 6.
References
[1] O. Boubaker, "The Inverted Pendulum: A Fundamental Benchmark in Control Theory and Robotics", Education and e-Learning Innovations (ICEELI), 2012.
[2] W. Younis and M. Abdelati, "Design and Implementation of an Experimental Segway Model", AIP Conference Proceedings, vol. 1107, pp. 350-354, 2009.
[3] J. Á. Acosta, "Furuta's Pendulum: A Conservative Nonlinear Model for Theory and Practice", Mathematical Problems in Engineering, 2010.
[4] T. Henmi, M. Deng, A. Inoue, N. Ueki and Y. Hirashima, "Swing-up Control of a Serial Double Inverted Pendulum", American Control Conference, 2004.
[5] C. J. C. H. Watkins and P. Dayan, "Technical Note: Q-learning", Machine Learning, pp. 279-292, 1992.
[6] L. Scherffig, "Reinforcement Learning in Motor Control".
[7] C. W. Anderson, "Q-learning with Hidden-Unit Restarting".
Appendix
Read Me
The program is coded in Python 3. To run the program:
python3 inverted_pendulum_q_learning.py

Ensure that both files, (1) inverted_pendulum_q_learning.py and (2) plot.py, are in the same folder. First make the scripts executable, then run the code. In Ubuntu:

chmod +x inverted_pendulum_q_learning.py
chmod +x plot.py

To change the values of parameters such as γ, ε, etc., change the values at the start of the file. To change the display settings, use the command:

Penplot(best_states, anime=True, fig=True)

where anime=True enables the animation and fig=True enables the graph.
Main Program (in Python 3)

#!/usr/bin/env python3

import numpy as np
from plot import Penplot
import random
from math import degrees, sin, cos

# ------------------------------#
#        CONSTANT VALUES        #
# ------------------------------#

mass_pole = 0.1
mass_cart = 0.5
mass_total = mass_pole + mass_cart

length_pole = 0.3

force_magnitude = 2
constant_gravity = 9.8

tau = 0.02
alpha = 0.5
gamma = 0.5

epsilon = 0.2

# ------------------------------#
#           FUNCTIONS           #
# ------------------------------#

def calculate_index(current_state):
    # Discretize the continuous state into Q-table indices.

    if current_state[0] < -0.8:
        x = 0
    elif current_state[0] < 0.8:
        x = 1
    else:
        x = 2

    if current_state[1] < -0.5:
        x_dot = 0
    elif current_state[1] < 0.5:
        x_dot = 1
    else:
        x_dot = 2

    if degrees(current_state[2]) < -12.0:
        theta = 0
    elif degrees(current_state[2]) < -6.0:
        theta = 1
    elif degrees(current_state[2]) < -1.0:
        theta = 2
    elif degrees(current_state[2]) < 0.0:
        theta = 3
    elif degrees(current_state[2]) < 1.0:
        theta = 4
    elif degrees(current_state[2]) < 6.0:
        theta = 5
    elif degrees(current_state[2]) < 12.0:
        theta = 6
    else:
        theta = 7

    if degrees(current_state[3]) < -50.0:
        theta_dot = 0
    elif degrees(current_state[3]) < -25.0:
        theta_dot = 1
    elif degrees(current_state[3]) < 25.0:
        theta_dot = 2
    elif degrees(current_state[3]) < 50.0:
        theta_dot = 3
    else:
        theta_dot = 4

    return x, x_dot, theta, theta_dot

def calculate_prob(current_state, Q_table):
    # Epsilon-greedy action probabilities for the current state.
    policy = []

    x, x_dot, theta, theta_dot = calculate_index(current_state)

    value = [Q_table[action, x, x_dot, theta, theta_dot] for action in range(2)]

    for action in value:
        if action == max(value):
            policy.append(1.0 - epsilon + epsilon / 2)
        else:
            policy.append(epsilon / 2)

    if sum(policy) == 1.0:
        return policy
    else:
        # Both actions tied for the maximum: fall back to a uniform policy.
        policy = [0.5, 0.5]
        return policy

def choose_action(policy):
    prob_num = random.randrange(0, 100) / 100.0

    if prob_num <= policy[0]:
        action_chosen = 0
    else:
        action_chosen = 1

    return action_chosen

def update_state(current_state, action_chosen):
    x_cur, x_dot_cur, theta_cur, theta_dot_cur = current_state

    if action_chosen == 0:  # action 0 is left
        force_value = -force_magnitude
    else:                   # action 1 is right
        force_value = force_magnitude

    temp = (force_value + (mass_pole * length_pole) * theta_dot_cur ** 2 * sin(theta_cur)) / mass_total

    theta_acc = (constant_gravity * sin(theta_cur) - cos(theta_cur) * temp) / \
                (length_pole * ((4.0 / 3.0) - mass_pole * cos(theta_cur) ** 2 / mass_total))

    x_acc = temp - (mass_pole * length_pole) * theta_acc * cos(theta_cur) / mass_total

    # Euler integration with time step tau
    x_new = x_cur + (tau * x_dot_cur)
    x_dot_new = x_dot_cur + (tau * x_acc)
    theta_new = theta_cur + (tau * theta_dot_cur)
    theta_dot_new = theta_dot_cur + (tau * theta_acc)

    return x_new, x_dot_new, theta_new, theta_dot_new

def update_Qtable(current_state, action_chosen, new_state, reward, Q_table):
    x, x_dot, theta, theta_dot = calculate_index(new_state)
    Q_max = max(Q_table[0, x, x_dot, theta, theta_dot],
                Q_table[1, x, x_dot, theta, theta_dot])

    x, x_dot, theta, theta_dot = calculate_index(current_state)
    Q_cur = Q_table[action_chosen, x, x_dot, theta, theta_dot]

    Q_table[action_chosen, x, x_dot, theta, theta_dot] = Q_cur + alpha * (reward + (gamma * Q_max) - Q_cur)

    return Q_table

def take_action(current_state, Q_table):
    policy = calculate_prob(current_state, Q_table)
    action_chosen = choose_action(policy)
    new_state = update_state(current_state, action_chosen)

    reward = 0

    if abs(new_state[0]) < 2.4:
        if abs(degrees(new_state[2])) < 1.0:
            reward = 10
        elif abs(degrees(new_state[2])) < 3.0:
            reward = 5
        elif abs(degrees(new_state[2])) < 6.0:
            reward = 2
        elif abs(degrees(new_state[2])) < 20.0:
            reward = 1

    Q_table = update_Qtable(current_state, action_chosen, new_state, reward, Q_table)

    return reward, new_state, Q_table

# ------------------------------#
#          MAIN PROGRAM         #
# ------------------------------#

Q_table = np.zeros([2, 3, 3, 8, 5])  # action (2) * x (3) * x_dot (3) * theta (8) * theta_dot (5)

max_steps = 0
best_states = []

max_episodes = 1000000
# max_episodes = 10000

for episode in range(1, max_episodes + 1):

    states = []

    # Start states widen gradually as training progresses.
    if episode < 10000:
        current_state = (0, 0, random.randrange(-1, 1), 0)  # start state near 0
    elif episode < 20000:
        current_state = (0.1 * random.randrange(-5, 5), 0, random.randrange(-3, 3), 0)
    elif episode < 30000:
        current_state = (0.1 * random.randrange(-8, 8), 0, random.randrange(-5, 5), 0)
    elif episode < 50000:
        current_state = (0.1 * random.randrange(-15, 15), 0, random.randrange(-12, 12), 0)
    else:
        current_state = (0.1 * random.randrange(-20, 20), 0, random.randrange(-15, 15), 0)

    states.append(current_state)

    for step in range(1, 1000):

        reward, new_state, Q_table = take_action(current_state, Q_table)
        current_state = new_state
        states.append(current_state)

        if reward < 1:  # pendulum dropped

            if step > max_steps:
                best_states = states
                max_steps = step

            if (episode % 10000) == 0:
                print('After', episode, 'episodes')
                print('Max steps:', max_steps)
                print('---------------------')

                # Penplot(best_states, anime=True, fig=False)

                epsilon -= 0.002

                if epsilon < 0:
                    epsilon = 0

            break

Penplot(best_states, anime=True, fig=True)
Program for animation (in Python 3)

#!/usr/bin/env python3

import math
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
import matplotlib.animation as animation

class Penplot(object):
    def __init__(self, states, anime=False, fig=False):
        self.anime = anime
        self.fig = fig
        self.x = [state[0] for state in states]
        self.x_dot = [state[1] for state in states]
        self.theta = [state[2] for state in states]
        self.theta_dot = [state[3] for state in states]
        self.process()

    def plot(self, data):
        # Draw one animation frame: cart position and pole line.
        x, theta, frame = data
        self.time_text.set_text("time: %.2fs\nstep: %d" % (frame * 0.02, frame))

        y = 0.05
        theta_x = x + math.sin(theta) * 0.25
        theta_y = y + math.cos(theta) * 0.25

        self.car.set_data(x, y / 2.0)
        self.line.set_data((x, theta_x), (y, theta_y))

    def gen(self):
        # Yield one (x, theta, frame) tuple per recorded state.
        for frame in range(len(self.x)):
            yield self.x[frame], self.theta[frame], frame

    def process(self):
        if self.anime:
            fig = plt.figure(figsize=(20, 4.5))
            ax = fig.add_subplot(1, 1, 1)
            ax.set_xlim(-3.0, 3.0)
            ax.set_ylim(-0.1, 0.9)
            ax.grid()

            self.time_text = ax.text(0.05, 0.9, "", transform=ax.transAxes)
            self.car, = ax.plot([], [], "s", ms=15)
            self.line, = ax.plot([], [], "b-", lw=2)

            ani = animation.FuncAnimation(fig, self.plot, self.gen,
                                          interval=1, repeat_delay=3000, repeat=True)

            plt.show()

        if self.fig:
            steps = range(len(self.x))

            plt.subplot(2, 1, 1)
            plt.title("x, theta")
            plt.plot(steps, self.x, label="x")
            plt.plot(steps, self.theta, label="theta")
            plt.legend(loc="best")
            plt.grid()

            plt.subplot(2, 1, 2)
            plt.title("x_dot, theta_dot")
            plt.plot(steps, self.x_dot, label="x_dot")
            plt.plot(steps, self.theta_dot, label="theta_dot")
            plt.legend(loc="best")
            plt.grid()
            plt.show()
            plt.close()