    Assignment #3

LQR and Dynamic Planning.

Design a regulator to make a two-link inverted pendulum stand on end.  Use dynamic programming to have a one-link pendulum perform a ‘swing up’.

Code:

Monday, March 15, 2010
Balancing and LQR

Part 1b.
Design an LQR Controller.

The first step in designing the regulator was to generate a linear model of the system dynamics.  In our case we linearize around the goal state (the arm positioned straight up with zero velocity), so A and B are time-invariant.  For the A and B matrices, I started with a derivation of two-link dynamics from A Mathematical Introduction to Robotic Manipulation by Murray, Li and Sastry.  That derivation neglects gravity and viscous damping, but adding these terms and augmenting the state space to include position was fairly straightforward.  I then inverted the mass matrix and left-multiplied it into the augmented matrices to get A and B.  See lqr_balance.m for the full expression.

Q and R are fixed at 0.1 times the identity matrix and 1 times the identity matrix, respectively.  This has the effect of penalizing large torques more strongly than large errors in position.  Since the above A and B matrices were derived in continuous time, MATLAB’s lqrd function was used to produce a discrete-time K matrix for a particular time step (0.001 s).

K =
11.8122    3.0090    2.5116    0.6574
2.6454    2.8750    0.5840    0.6412
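As a rough illustration of the design procedure (the assignment code itself is MATLAB, and the real two-link A and B live in lqr_balance.m), here is a minimal Python sketch that discretizes a hypothetical single-link linearization and solves the discrete-time Riccati equation for K:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dlqr(Ad, Bd, Q, R):
    # Solve the discrete-time algebraic Riccati equation and
    # back out the optimal state-feedback gain K.
    P = solve_discrete_are(Ad, Bd, Q, R)
    return np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)

# Hypothetical single-link linearization about upright: x = [theta, theta_dot]
dt = 0.001
A = np.array([[0.0, 1.0], [9.81, 0.0]])   # gravity term linearized at 0 angle
B = np.array([[0.0], [1.0]])
Ad = np.eye(2) + A * dt                    # first-order (Euler) discretization
Bd = B * dt

K = dlqr(Ad, Bd, Q=0.1 * np.eye(2), R=1.0 * np.eye(1))
print(K.shape)  # (1, 2)
```

SciPy's solve_discrete_are returns the stabilizing solution, so the closed-loop matrix Ad - Bd*K has all its eigenvalues inside the unit circle by construction.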

The advantage of choosing such basic Q and R values is that the resulting K terms are intuitive.  Since Q and R are scaled identity matrices, each parameter in our state is assumed to be independent in terms of cost.  The disadvantage is that it might be more effective to tune the elements of Q and R in order to find an optimal controller that uses smaller overall gains.

To explore the minimal gains that still result in a stable controller, I swept through a range of ratios of the scalar coefficients R / Q.  The K matrix above corresponds to R / Q = 1 / 0.1 = 10.  The different K matrices were compared using their Frobenius norm, which appeared to be asymptotically constant at very large and very small ratios.  This makes sense for large ratios, since there comes a point where a certain set of torques is required for stability, regardless of how heavily they are penalized in the cost function.  The convergence at small ratios may be due to a similar situation, where the controller converges so quickly in discrete time that additional torque has no benefit.  However, K matrices derived at these very small ratios were unstable in simulation.
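The ratio sweep can be sketched the same way; the single-link A and B below are stand-ins for the real two-link matrices, and the sweep values are illustrative:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Sweep the R/Q ratio and compare the resulting gain matrices by
# Frobenius norm.  A and B are a hypothetical single-link stand-in.
dt = 0.001
A = np.array([[0.0, 1.0], [9.81, 0.0]])
B = np.array([[0.0], [1.0]])
Ad, Bd = np.eye(2) + A * dt, B * dt

norms = []
for ratio in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    Q = np.eye(2)
    R = ratio * np.eye(1)
    P = solve_discrete_are(Ad, Bd, Q, R)
    K = np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)
    norms.append(np.linalg.norm(K, 'fro'))

# Gains shrink as torque is penalized more heavily (larger R/Q).
print(norms)
```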

Part 1c.
Robustness.

As seen in the above video, the controller was already robust to a relatively large set of errors in the parameters.  This is probably in part because the linearization of the sine function for gravity always overestimates the gravitational force outside the region of linearity, since we are linearizing about an angle of zero.
Plot of the manual controller converging.  From top to bottom:

Shoulder position (blue)
Elbow position (green)

Shoulder velocity (blue)
Elbow velocity (green)

Shoulder torque

Elbow torque

Plot of the LQR controller converging.  Note the faster convergence compared to the manual controller.
Part 2a and 2b.
Swing up.

A one-link pendulum with a motor at the base was modeled in MATLAB.  Finite-horizon policy iteration was used to optimize a trajectory towards a goal.  The cost function was defined using the same criteria as the LQR controller for the two-link arm.  Since the state space for this problem only had 2 parameters, it was feasible to divide the workspace up into a grid and perform policy iteration across every (reasonably) possible position and velocity.  At every point on this grid a set of values and corresponding set of optimal policies were stored.

This method has the advantage that at runtime we have a lookup table of optimal policies to interpolate for at every possible state.  Thus we can start at any position within the smooth region of the policy map and converge to the goal relatively quickly.
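A minimal sketch of that runtime lookup, with a made-up grid and random stand-in torques in place of the optimized policy:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# The optimized policy is stored on a (position x velocity) grid and
# interpolated at arbitrary states at runtime.  The grid resolution
# and policy values here are illustrative only.
thetas = np.linspace(-np.pi, np.pi, 20)
omegas = np.linspace(-8.0, 8.0, 17)
policy = np.random.default_rng(0).uniform(-2, 2, (20, 17))  # stand-in torques

lookup = RegularGridInterpolator((thetas, omegas), policy)
torque = lookup([[0.5, 1.2]])[0]   # bilinear interpolation at a query state
print(float(torque))
```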

The algorithm for policy iteration was roughly as follows:
1.    Start at a point in the gridded world
2.    Execute the current policy for the given starting point
3.    Integrate forward in time, summing costs at each step
4.    When outside of the current ‘box’, stop.
5.    Interpolate “cost-to-go” from the surrounding value map
6.    “Optimize”
7.    Repeat starting on the next grid location

In my case “Optimize” consisted of trying random moves in addition to executing the last policy.  At each grid location, a random move within +/- 50% of the current torque and a random move within the max range of all current torques were tried.  If either resulted in a lower estimated cost based on the above algorithm, it was assigned as the new optimal policy.  This method was chosen over a more rigorous optimization because the value map is very inaccurate early in the iterations, and it seemed better to perform many cheap iterations than a smaller number of costly ones.
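The loop above can be condensed into a rough Python sketch.  The grid resolution, cost weights, and pendulum constants are illustrative rather than the values in swing_up.m, and the cost-to-go interpolation is simplified to nearest-cell:

```python
import numpy as np

# Gridded policy iteration for a one-link pendulum (illustrative constants).
g, l, b, dt = 9.81, 1.0, 0.1, 0.01
thetas = np.linspace(-np.pi, np.pi, 25)
omegas = np.linspace(-8.0, 8.0, 21)
rng = np.random.default_rng(0)
policy = rng.uniform(-3, 3, (25, 21))      # random initialization, as in the text
value = np.zeros((25, 21))

def step(th, om, u):
    # Pendulum dynamics with viscous damping, Euler-integrated.
    om += (u - b * om - (g / l) * np.sin(th)) * dt
    th += om * dt
    return th, om

def rollout(th, om, u, n=5):
    # Integrate forward a few steps, summing costs (Q-like and R-like terms),
    # then add interpolated cost-to-go (nearest-cell for brevity).
    cost = 0.0
    for _ in range(n):
        th, om = step(th, om, u)
        cost += (0.1 * (th**2 + om**2) + u**2) * dt
    i = np.clip(np.searchsorted(thetas, th), 0, 24)
    j = np.clip(np.searchsorted(omegas, om), 0, 20)
    return cost + value[i, j]

for sweep in range(3):                     # a few sweeps, for brevity
    for i, th in enumerate(thetas):
        for j, om in enumerate(omegas):
            u0 = policy[i, j]
            # Current policy, a +/-50% perturbation, and a global random move.
            candidates = [u0,
                          u0 * (1 + rng.uniform(-0.5, 0.5)),
                          rng.uniform(policy.min(), policy.max())]
            costs = [rollout(th, om, u) for u in candidates]
            k = int(np.argmin(costs))
            policy[i, j], value[i, j] = candidates[k], costs[k]

print(policy.shape, value.shape)
```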

The difference in average torque and cost across the gridded world was calculated at every iteration, which showed that this method converged in approximately 300 iterations (~5 min of runtime).  A usable policy over most of the workspace takes shape at about 80 iterations.

This simulation is contained within swing_up.m.
Dynamic Planning
In order to assure convergence, the goal state was expanded to include the nearest grid location.  This assured an absorbing goal state where costs remained at 0.  A very coarse grid (0.33 rad x 1 rad/s per cell) was explored to see how fast the algorithm could run and still converge to a solution.  Since the random perturbations to the policy were scaled by the current range of policy values, the policy was initialized to random values, rather than all zeros as is common practice.
2c.
Robustness

To perform a swing up on a system that might not be modeled correctly, a trajectory was created using the velocity, position and torque history from a simulated swing up with nominal parameters.  These torques became feed-forward torques in a new controller, and a PD controller was implemented to force the new system to follow the given velocity and position trajectory.  By changing the gains Kp and Kd, this controller can be tuned to handle increasing amounts of error.  The plot below shows the trajectory taken when the system parameters were perturbed by random noise within +/- 25% of each nominal parameter value.
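A minimal sketch of this feed-forward-plus-PD scheme, using an illustrative sine reference and gains in place of the recorded swing-up trajectory, with gravity perturbed by +25%:

```python
import numpy as np

# Feed-forward torques from the nominal model, plus PD feedback on the
# recorded trajectory.  Constants and the reference are illustrative.
g, l, b, dt = 9.81, 1.0, 0.1, 0.001
Kp, Kd = 40.0, 8.0

# Nominal reference trajectory (a sine, standing in for the real swing up)
t = np.arange(0.0, 1.0, dt)
theta_ref = 0.5 * np.sin(t)
omega_ref = 0.5 * np.cos(t)
alpha_ref = -0.5 * np.sin(t)

# Feed-forward torques via inverse dynamics of the nominal model
tau_ff = alpha_ref + b * omega_ref + (g / l) * np.sin(theta_ref)

# Simulate on a perturbed model (gravity +25%) with PD correction
g_actual = 1.25 * g
th, om = theta_ref[0], omega_ref[0]
errors = []
for k in range(len(t)):
    u = tau_ff[k] + Kp * (theta_ref[k] - th) + Kd * (omega_ref[k] - om)
    om += (u - b * om - (g_actual / l) * np.sin(th)) * dt
    th += om * dt
    errors.append(abs(th - theta_ref[k]))

print(max(errors))   # tracking error stays small despite the model error
```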

This simulation is contained within swing_trajectory.m.

Plots of the Value Map and Policy Map with the pendulum trajectory starting at (0,0).

Overall, policy iteration worked very well for this system, even with such a coarse grid size.  However, this was made possible by the relatively smooth system dynamics and by the fact that having only two state parameters made an exhaustive search of the state space feasible.

The trajectory produced by this optimization seemed to dwell near the origin for a while before swinging up.  This was probably because the coarse grid made it difficult to provide a very useful policy at slow speeds, and because I included a damping factor of 0.1 in the model by default.

In order to improve the rate of convergence (at the cost of larger torques), I solved for a range of A and B matrices with parameters perturbed by random values in a finite range.  I iterated this process 50 times and chose the K matrix with the maximum Frobenius norm.  This resulted in a controller that would be stable in the ‘worst case’ scenario within a confidence interval specified by the range of random perturbations.
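The 'worst case' gain selection can be sketched as follows; the nominal A and B are hypothetical single-link stand-ins, and the perturbation range is illustrative:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Solve LQR for many randomly perturbed (A, B) pairs and keep the gain
# matrix with the largest Frobenius norm as the conservative choice.
rng = np.random.default_rng(1)
dt = 0.001
A0 = np.array([[0.0, 1.0], [9.81, 0.0]])   # hypothetical nominal model
B0 = np.array([[0.0], [1.0]])
Q, R = 0.1 * np.eye(2), 1.0 * np.eye(1)

best_K, best_norm = None, -np.inf
for _ in range(50):
    # Perturb each nonzero parameter by up to +/- 25%
    A = A0 * (1 + rng.uniform(-0.25, 0.25, A0.shape))
    B = B0 * (1 + rng.uniform(-0.25, 0.25, B0.shape))
    Ad, Bd = np.eye(2) + A * dt, B * dt
    P = solve_discrete_are(Ad, Bd, Q, R)
    K = np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)
    n = np.linalg.norm(K, 'fro')
    if n > best_norm:
        best_K, best_norm = K, n

print(best_K.shape)
```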

The LQR controller was relatively robust to zero-mean noise on the measured angles and velocities, although the torques oscillated harshly (see below).  If a bias in the measured angles needed to be compensated for, a kinematic calibration similar to the one in the last homework assignment would be required.