Learning to Guide Online Multi-Contact Receding Horizon Planning

In Receding Horizon Planning (RHP), it is critical that the motion being executed facilitates the completion of the task, e.g. building momentum to overcome large obstacles. This requires a value function to inform the desirability of robot states. However, given the complex dynamics, value functions are often approximated by expensive computation of trajectories in an extended planning horizon. In this work, to achieve online multi-contact Receding Horizon Planning (RHP), we propose to learn an oracle that can predict local objectives (intermediate goals) for a given task based on the current robot state and the environment. Then, we use these local objectives to construct local value functions to guide a short-horizon RHP. To obtain the oracle, we take a supervised learning approach, and we present an incremental training scheme that can improve the prediction accuracy by adding demonstrations on how to recover from failures. We compare our approach against the baseline (long-horizon RHP) for planning centroidal trajectories of humanoid walking on moderate slopes as well as large slopes where static stability cannot be achieved. We validate these trajectories by tracking them via a whole-body inverse dynamics controller in simulation. We show that our approach can achieve online RHP for 95%-98.6% cycles, outperforming the baseline (8%-51.2%).


I. INTRODUCTION
In this work, we consider the problem of planning multicontact motions for legged robots to traverse uneven terrain (Fig. 1).This problem is high-dimensional, nonlinear, and subject to discrete changes of dynamics arising from contact switches [1], [2], [3], [4].Traditionally, multi-contact motions are pre-planned offline and then tracked by a controller [5], [2].However, in the real world, the preplanned motion can become invalid due to perturbations, e.g.environment changes and state drifts [5], [6].To adapt to these changes, online motion (re)-planning is required.
To this end, Receding Horizon Planning (RHP) [6], [7] is a promising solution.Similar to Model Predictive Control (MPC) [8], RHP aims at constantly updating the motion plan for immediate execution based on the state of the robot and the environment.This is usually achieved by solving a finite-horizon Trajectory Optimization (TO) problem [9], which comprises an execution horizon to plan the optimal actions for execution and a prediction horizon to foresee the future.Although the prediction horizon is never executed, it is critical to the success of RHP.Drawing a parallel with Fig. 1: Snapshots of simulations.Video is available at https://youtu.be/oCsOBHHc9XM Bellman's equation [10], the prediction horizon can be seen as an approximation of the value function, which directs the decision of the execution horizon towards optimal actions that are favorable for the future.
However, approximating the value function with the prediction horizon can significantly increase the computational complexity.To reduce the computation burden, [11] and [12] propose a multi-fidelity TO that relaxes the model accuracy in the prediction horizon.Nevertheless, such a model-based formalism can still struggle to compute online when a long lookahead is needed, e.g. when traversing large slopes.
To further accelerate the computation speed of multicontact RHP, we explore the possibility of approximating the value function with a learned model.To achieve this, one of the options is to learn a global value function [13] that can reflect the value of the states for a given environment.However, this requires a flexible representation that can parameterize the value function in the coupled stateenvironment space [14].In this work, instead of learning the value function directly, we propose to learn an oracle that can predict a local objective-an intermediate goal state that is favourable towards the completion of a given task-based on the current robot state, goal position, and environment model.We then employ a short-horizon TO to compute the optimal actions for reaching the local objective.This is achieved by constructing a local value function based on the local objective and placing it as the terminal cost of the short-horizon TO.We refer to this novel RHP framework as Locally-Guided Receding Horizon Planning (LG-RHP).The oracle is obtained through supervised learning based on the expert knowledge computed by the long-horizon RHP.
To demonstrate the computational advantage of our approach, we use LG-RHP to plan the centroidal trajectories of the humanoid robot Talos [15] in an online RHP setting, and compare it against the long-horizon RHP as the baseline.This requires the TO to converge within a time budget-the duration of the motion to be executed.
We show that LG-RHP can achieve online computation for 95%-98.6%cycles in the scenarios considered, outperforming the baseline (8%-51.2%cycles computed online).However, due to the prediction error, LG-RHP can reach ill-posed states, from which the TO fails to converge.We show that this issue can be mitigated by a data augmentation technique, in which we add data points to demonstrate how to recover from the states that cause convergence failures.
Our main contributions are: • We propose to achieve online RHP by approximating the value function with a learned oracle that can predict local objectives as intermediate goals for a given task.These local objectives are then used to construct local value functions to guide a short-horizon TO. • We realize this approach for planning multi-contact locomotion in an online RHP fashion, and we validate the dynamic feasibility of the planned trajectories by tracking them with a whole-body inverse dynamics controller in simulation.• We present an incremental training scheme that can iteratively improve the prediction accuracy of the oracle by demonstrating recovery actions from the states that cause convergence failures.

A. Multi-contact Motion Planning via TO
RHP frameworks usually employ TO to compute the motion plan, which can be time-consuming due to complex dynamics.For instance, although we can generate impressive motions with the whole-body dynamics model [1], [16], [17], the computation time is far from online usage.This is because the whole-body model considers the inertia of every link, which results in high dimensionality and non-convexity.
Alternatively, the centroidal dynamics model [18] has become popular for multi-contact planning [19], [20], [21].This model has lower dimension, as it only considers the dynamics of the total linear and angular momentum expressed at the Center of Mass (CoM) [18].Moreover, approximations are introduced to the robot kinematics and the momentum variation induced by the movements of individual robot links.Unfortunately, this model is still non-convex, which can make online computation challenging, except when limiting assumptions (e.g.fixed gait, flat surface) are made [22], [23].
To improve the computation speed, convex inner/outer approximations [21], [24], [25], [26] of the centroidal dynamics are proposed.Nevertheless, convex inner approximations can fail due to a restricted search space [21], while convex outer approximation can violate the dynamics unless it is gradually tightened [26].To combine the benefit of using both the accurate model and the approximated model, [11] and [12] present a TO formulation that relaxes the model accuracy along the lookahead horizon.However, online computation can still be challenging when a long lookahead is needed.

B. Learning-enhanced Locomotion Planning
Recently, researchers have explored using machine learning to bootstrap locomotion planning.For instance, [27] proposes to learn the evolution of centroidal momenta, which guides an A* planner to generate contact plans.Another line of research tries to accelerate the computation of TO.For example, [28], [29], [7] propose to learn (near)-optimal solutions to warm-start TO.Alternatively, [13], [14], [30], [31] shorten the planning horizon with a learned global value function placed as the terminal cost.However, learning a value function for the multi-contact problem is challenging.The main difficulty is that the value function is defined in a coupled state-environment space, which requires a flexible parameterization that can capture the landscape changes of the value function with respect to environment variations [14].In contrast, we avoid this issue by learning an oracle to predict intermediate goal states for completing a given task based on the current state, the final goal, and the environment, and then we construct local value functions based on these intermediate goals.
When predicting sequential actions with a learned model, the prediction accuracy can drop dramatically once the robot reaches a state that is un-explored in the training data-set.This problem is known as distribution shift [32], which can be mitigated by data augmentation, i.e. adding demonstrations from the states that either appeared from the roll-out of the learned policy [32], [33], or sampled from the expert policy with injected noise [34].In this work, we present a similar but more targeted data augmentation strategy that only focuses on demonstrating corrective actions from the states that cause convergence failures.

III. PROBLEM DESCRIPTION
Let us denote by x the robot state and u the control input.In traditional RHP frameworks, each cycle aims to compute a motion plan that consists of a state trajectory {x 0 , ..., x T }, and a control trajectory {u 0 , ..., u T } over a finite horizon [0, T ].This is achieved by solving a TO problem: where l is the running cost1 and (1b) is the dynamics constraint.Each cycle only executes part of the plan until t * < T .We refer to the horizon being executed as Execution Horizon (EH), and the rest as Prediction Horizon (PH).

Execution Horizon (EH)
Prediction Horizon (PH) LG-RHP Fig. 2: As Bellman equation ( 2) suggests, the EH should not only optimize its running cost, but also reach a state xt * that optimizes the value function V (x t * ).Since the value function is often hard to obtain analytically, traditional RHP approximates it by expensive computation of trajectories in the PH.In contrast, based on the initial state x 0 , the final goal x g , and the environment Ω, our LG-RHP employs a learned oracle O to predict the optimal xt * as a local objective for guiding a short-horizon TO to plan the EH.
Although the PH never gets executed, computing it is critical to RHP.To illustrate this, let us consider the TO formulation from Bellman's principle of optimality, where u t is the optimal policy, V (x) is the value function, x t+1 is the state evolved from the dynamics (1b).Drawing a parallel to the Bellman equation, we note that the EHwhich approximates the optimal policy-should optimize its running cost l while leading to a state xt * that optimizes the value function V (x t * ).The value function V (x t * ) reflects the desirability for the EH to arrive at a certain state x t * .Unfortunately, for complex systems, an analytical expression of the value function is often unavailable.Hence, TO approximates the value function by computing trajectories in the PH (Fig. 2).Nevertheless, this can increase the complexity of TO, which hinders online computation.
To speed up the computation, a promising direction is to replace the PH with an approximate value function model Ṽ (x).This leads to a short-horizon TO where the planning horizon terminates at t * (the end of EH), and (1) becomes where x t and u t subject to system dynamics (1b).Based on this idea, [13] proposes to learn a global value function Ṽ (x|θ) parameterized by θ.However, multi-contact problem requires the consideration of environment geometry.This gives rise to the challenge of finding a flexible representation of the coupled state-environment space [14] We refer to these intermediate goal states as local objectives, and use them to construct local value functions Ṽ (x t * | xt * ) to attract the short-horizon TO towards the local objective xt * .We call this approach Locally-Guided Receding Horizon Planning (LG-RHP) and illustrate the idea in Fig. 2.
Although we can learn to predict the optimal policy of the EH directly from past experiences, the learning error may lead to trajectories that violate system dynamics and cause tracking failures.Hence, we decide to compute the EH using TO, which guarantees the dynamic feasibility.Next, we present how to implement LG-RHP in the context of multicontact planning and the approach for learning the oracle.

IV. METHODS
This section presents the technical details of LG-RHP.In Section IV-A, we briefly describe the RHP formulation for planning multi-contact motions, and we refer the reader to [11] for more details.The long-horizon version of this formulation is used to compute the training data offline, while the short-horizon version is employed in LG-RHP for planning the EH.In Section IV-B, we elaborate the oracle formulation for multi-contact planning, and we present the incremental training scheme in Section IV-C.
Furthermore, the following assumptions are made: • We plan bipedal walking motion, in which a step consists of 3 phases: pre-swing (double support), swing (single support), and landing (double support).• EH always plans a single-step motion, and thus the oracle predicts the local objective for making one step.• We model the robot feet as rectangular patches, and we assign a contact point to each vertex of the rectangle.• We assume that the environment is made of rectangular surfaces.We pre-define the sequence of these surfaces upon which the swing foot lands, while we optimize the contact timings and the locations (within the surfaces).• We approximate the kinematic constraints of the CoM and the relative positions of contacts as convex polytopes attached to the feet in contact, with the orientations aligned to the contact surfaces [35].

A. Receding Horizon Planning for Multi-contact Motions
This section briefly describes the RHP formulation for planning multi-contact motions (see [11] for more details).In each planning cycle, given a finite lookahead of n steps, the current robot state x cur , the goal state x g , and the sequence of contact surfaces on which the robot will step upon, the RHP framework computes a motion plan of N ph phases by solving a TO problem.The motion plan is composed of: • X = {X 1 , ..., X N ph }: state trajectory X q of all phases q ∈ N ph .Each phase is discretized to N k knots: where c, ċ ∈ R 3 are the CoM position and velocity, and L ∈ R 3 is the total angular momentum.• U = {U 1 , ..., U N ph }: discretized control trajectory U q = {u q 1 , ..., u q N k } of all phases.We denote u = [f ⊤ 1 , ..., f ⊤ Nc ] ⊤ , which collects the contact force f c ∈ R 3 of all contact points c ∈ {1, ..., N c }.
• P = {p 1 , ..., p n }: a list of contact locations, where p i ∈ R 3 is the contact location of the i-th step.• T = {t 1 , ..., t N ph }: a list of phase switching timings.
The time step of each phase is τ q = (t q − t q−1 )/N k .The discretized TO is given by: min X ,U ,T ,P where is the running cost of each phase that encourages smooth trajectory, ϕ(x T ) = (x T − x g ) ⊤ (x T − x g ) is the terminal cost that attracts the end state x T towards the final goal x g , and (5b) collects the constraints described as: switching timings to increase monotonically.• p i ∈ S i constrains each contact p i to stay on the preassigned surface where the equality defines the plane containing the surface and the inequalities bound the surface.• p i ∈ R i−1 constrains each contact p i to be within the reachable space R i−1 of the previous footstep p i−1 .• c q k ∈ K l , ∀l ∈ C q restricts the CoM position at the k-th knot to stay in the reachable space K l established by each active contact l ∈ C q at each phase q.
• x q k+1 = x q k + τ q F q (x q k , u q k ), the dynamics constraint approximated by forward Euler scheme.We consider the centroidal dynamic model as detailed in [11].In this work, we use (5) with a long lookahead (n ≥ 2) to compute the data-set and also employ it as the baseline.For LG-RHP, we plan optimal actions by ( 5) with 1-step lookahead (only covers EH) along with the adapted terminal cost and the constraints as described in Section IV-B.

B. Oracle Formulation for Multi-Contact Planning
This section presents the oracle formulation for multicontact planning and the associated variable definitions (Fig. 3-a).Following the idea in Section III, we define the oracle as: The oracle's output is an intermediate goal configuration after making a step, which includes: i) target CoM state x * ; ii) target contact location p * ; iii) target phase switching timings T * = { t1 , t2 , t3 } for the 3 phases that complete the step.To predict this goal configuration, we require: • δ l/r ∈ {L, R}: swing foot indicator that informs which foot (left/right) will be re-positioned.• x 0 : initial CoM state.
• p 0 : initial contact location of the swing foot.
• Ω = {S l0 , S r0 , S 1 , ..., S n }: a finite-preview of the environment, where S l0 , S r0 are the surfaces that the left and right feet initially stand upon, S 1 , ..., S n are the surfaces that future n steps will land on; We represent each surface with their vertices S = {V 1 , ..., V 4 }, V ∈ R 3 .• x g : final goal position (fixed in our case).We define all spatial terms in the contact-foot frame W Σ , whose origin locates at the position of the non-swing foot, and the orientation aligns with the surface in contact.
To guide the LG-RHP, we adapt the short-horizon version of (5) that only covers the EH with the following changes.First, we replace the terminal cost with that encourages the end state x T and the contact location p 1 to approach the predicted targets.Second, we narrow down2 the search space of phase switching timings t i centered at their predicted values (1 − ϵ) ti ≤ t i ≤ (1 + ϵ) ti , where ϵ is a user-defined slack.

C. Incremental Training Scheme
This section describes the incremental training scheme for learning the oracle.The key idea is to improve the prediction accuracy by adding data points to demonstrate corrective actions from the states that cause convergence failures.As depicted in Fig. 4

Add
LG-RHP Traj.Corrective Traj.i , we firstly use LG-RHP with the currently trained oracle O i to plan trajectories in a RHP fashion on the previously sampled environments.Then, we use the long-horizon RHP to compute recovery actionsstarting from 1-3 cycles before LG-RHP fails-until the cycle converges back to the ground-truth trajectory (roll-out of long-horizon RHP on the same terrain).We repeat the process until there is no further improvement.

V. RESULTS
To highlight the computational advantage of our approach, we compare Locally-Guided Receding Horizon Planning (LG-RHP) against the baseline (long-horizon RHP) in an online setting, in which the TO should converge within a time budget, i.e. the duration of the motion under execution 4 .

A. Experiment Setup
We carry out two case studies: walking on moderate slope (Fig. 5) and large slope (Fig. 6).Planning motions on these terrains is challenging because the admissible contact force is limited by the orientation of the surface in contact, requiring careful selections of contact locations and the timings [26].
We use LG-RHP and the baseline to compute centroidal trajectories of the humanoid robot Talos [15] in RHP fashion.Both LG-RHP and the baseline are tested on the terrains that are unseen during training 5 .We refer to the trial on each terrain as an episode.In each cycle, we warm-start the TO with the solution of two cycles before.A cycle converges online if the TO is solved within the time budget.If it fails to converge online, we still let the TO compute until convergence and continue the episode, unless no solution is found (fail to converge).In LG-RHP, we employ the oracle trained with the best prediction accuracy.Lastly, we validate the planned trajectories by tracking them via a whole-body inverse dynamics controller [36] in simulation (Video: https://youtu.be/oCsOBHHc9XM).

B. Implementation Details
We model TO problems in Python and solve them by the interior-point method of KNITRO [37].We use the automatic differentiation framework CasADi [38] to provide the gradients and the Hessians.Computations are achieved on a desktop with an Intel i9-CPU (3.6GHz) and 64GB memory.
The oracle is modelled as a Neural Network (NN) of 4 hidden layers, each layer has 256 nodes with ReLu activation functions.The NN is implemented and trained with Tensorflow [39].Furthermore, we find that the data distributions of the two types of terrains exhibit different modalities, i.e. when traversing the large slope, the robot has larger momentum variations than walking on the moderate slopes.Mixing these data points can result in a discontinuous and unbalanced data-set, on which a single NN can struggle to interpolate.As a result, we decide to train separate oracles for the two types of terrains.The possibility of generalizing across these two modalities is discussed in Section VI.

C. Case Study 1 (CS1): Moderate Slope
This section presents the experiment result on the terrain with moderate slopes (Fig. 5).Although quasi-static motions can be quickly found for this terrain6 [40], we are interested in planning dynamic walking motions using TO.This provides a unified approach to handle non-quasi-static cases (e.g.large slope) and also allows more efficient task completion.
In this case study, we choose the long-horizon RHP with 1-step PH as the (fastest) baseline.For LG-RHP, we set the slack of the phase switching timing constraints as ϵ = 0.6.We find this can increase the chance of finding a solution (enlarged search space) without sacrificing much computation speed.Each episode plans maximum 28 cycles.
To show the computational benefit of our approach, we compare LG-RHP and the baseline based on their cycle-wise success rate (the percentage of cycles that converged) for online computation.As Table I lists, in offline mode (unlimited time budget), both methods can achieve 99% cyclewise success rate, while having up to 1% of the cycles fail to converge.However, when online computation is required (consider time out), only LG-RHP maintains a success rate of 0.21s    LG-RHP 23.5 +/-2.0 cycles Baseline (1-Step PH) 0.6 +/-0.3 cycles 98.63%, whereas the baseline drops to 51.21%.Additionally, as Table II shows, LG-RHP only consumes on average 19% of the time budget.This suggests the potential of using LG-RHP in real robot control, as the remaining time budget can be allocated to the overheads, e.g.data transmission.Moreover, to demonstrate that LG-RHP can consecutively achieve online RHP in an episode, we check its 95% Confidence Interval (CI) on the number of cycles that continually converged online.As Table III shows, the CI of LG-RHP is 23.5 +/-2.0 cycles, indicating that LG-RHP can perform online computation for 76.8% to 91.1% cycles of an episode.In contrast, the baseline CI is 0.6 +/-0.3 cycles, implying its incapability of achieving online RHP.A sequence of snapshots of our simulation can be seen in Fig. 5.

D. Case Study 2 (CS2): Large Slope
This section considers the large slope terrain (Fig. 6), on which the robot cannot maintain static balance.For this terrain, the baseline requires 2-step PH to foresee the large slope early enough and start building momentum.
Each episode starts from the cycle when the large slope is inside the lookahead horizon and ends at the cycle when the robot gets off the large slope (6 cycles).For LG-RHP, we define the slack of the phase switching timing constraints ϵ = 0.15, as empirically determined to give a good balance between the success rate and the computation speed.
In Table IV, we compare the cycle-wise success rate of LG-RHP and the baseline.In offline mode, the success rate of the baseline and LG-RHP are 99.5% and 95.99% respectively, while the rest of the cycles fail to converge.However, when it comes to online mode, the cycle-wise success rate of LG-RHP stays at 95.0%, whereas the baseline decreases to 7.99%.Furthermore, as indicated in Table II, LG-RHP only consumes on average 18% of the time budget.
To examine whether the online RHP can be consecutively achieved in an episode, we compare the episodic success rate of each method.We define an episode as successful if all the cycles have their TO converged online.As Table V lists, the baseline never completes an episode in an online fashion (0.0%), and 98.4% episodes failed due to time out.In contrast, LG-RHP can achieve online RHP over 76.1% episodes, while only 3.8% of the failures are due to time out, and the rest (20.1%) are caused by convergence failures induced by the prediction error of the oracle.A sequence of simulation snapshots are presented in Fig. 6.

E. Improving Prediction with Incremental Training Scheme
This section demonstrates the effectiveness of our incremental training scheme (Section IV-C).In Table VI, we list the episodic success rate of LG-RHP achieved on the training environments with oracles trained from different iterations of the data augmentation process.Our result shows that adding corrective data points of interest can increase the prediction accuracy, which improves the episodic success rate of LG-RHP.The success rate saturates after 5 training iterations.

VI. DISCUSSION
We empirically show that LG-RHP can achieve online RHP of multi-contact motions while consuming on average 19% of the time budget.This suggests that LG-RHP can be used for real-world robot control, as it leaves sufficient time for overheads, e.g.data transmission. Step-Down Step-up  However, as mentioned in Section V-B, we find that the data points for the two types of terrains exhibit different modalities.This can impose challenges when training a single NN on the combined data-set.Although we capture the two modalities by using separate NNs, it is worthwhile to explore a more unified approach that can handle multi-modal data, e.g. using mixture density networks [41].
Furthermore, LG-RHP struggles in the following two cases.First, the oracle has prediction errors due to imperfect fitting and insufficient data coverage.This can lead the robot to ill-posed states, and cause failures to converge.Although our incremental training scheme can mitigate this issue, it is hard to reach a 100% success rate.To improve the performance, we can use a Recurrent Neural Network (RNN) or impose safety constraints.Second, although LG-RHP features a short-horizon TO, it is still a nonlinear problem that may time out.To alleviate this issue, we can reduce the number of decision variables by representing trajectories with parameterized curvatures, e.g.Bezier Curves [21].
Lastly, multi-contact problems necessarily require discrete decisions, i.e. the sequence of contact surfaces [40] and the gait pattern [42].It would be beneficial to extend the oracle to predict these discrete choices.

VII. CONCLUSION
We present Locally-Guided Receding Horizon Planning (LG-RHP)-a novel RHP framework for planning multicontact motions.The proposed approach features a learned oracle that can predict local objectives, which are used for building local value functions for guiding a short-horizon TO.Our experiment shows that LG-RHP can achieve online computation for 95%-98.6%cycles, which outperforms the baseline (8%-51.2%).Moreover, LG-RHP only requires around 19% of the time budget, suggesting the potential for real-robot control.However, due to the prediction error, LG-RHP does not achieve 100% success rates.In future work, we plan to improve the success rate by using RNN or adding safety constraints.We also plan to test LG-RHP on real robot.

Fig. 3 :
Fig. 3: (a) Variables definition of the oracle.x 0 and p 0 are the initial CoM state and initial swing-foot position.x * and p * are the target CoM state and the contact location respectively.Purple patches are initial contact surfaces, while blue patches are for future steps.Contact surfaces are modeled by their vertices V i .Spatial terms are defined in the contact frame W Σ established at the non-swing foot.(b) The target contact location p * is represented as the vector sum of α 1 r 1 and α 2 r 2 , which scale along the contact surface borders r 1 , r 2 ∈ R 3 with the proportion defined by α 1 , α 2 ∈ [0, 1].
, in each training iteration i, we train an oracle O i based on the current data-set D = D 0 ∪ D * 1 ∪ ... ∪ D * i−1 , where D 0 is the initial data-set,

Fig. 4 :
Fig. 4: The incremental training scheme concept.Each iteration i trains an oracle O i on the data-set D that aggregates the initial data-set D 0 and the augmented data-sets D *i .The augmented data-set contains recovery actions (purple), starting from 1-3 cycles before LG-RHP fails (dashed blue circle) until the cycle converges back to ground-truth trajectory.

Fig. 5 :
Fig. 5: Snapshots of simulations on moderate slopes (5-12 degrees).We use orange bar to represent the motion duration of current cycle, and green bar as the computation time for the next cycle.

Fig. 6 :
Fig.6: Snapshots of simulations on the large slope (17-25 degrees).The blue block is the large slope (25 degrees), while the rest are moderate slopes (5-12 degrees).The robot tends to build momentum to achieve dynamic balancing on the large slope.We show that LG-RHP can be used online, as the computation time of the next cycle (green bar) is smaller than the motion duration of the current cycle (orange bar).
that are favorable for completing a given task, based on the current state x 0 , final goal state x g , and environment model Ω: . In this work, instead of learning a value function over the entire stateenvironment space, we propose to learn an oracle O to predict intermediate goal states xt *

TABLE I :
Cycle-wise success rate for CS1.

TABLE II :
Average computation time of LG-RHP v.s.average time budget for the cycles that converged online.

TABLE III :
95% CI of the average number of cycles that consecutively converge online (max.28).

TABLE IV :
Number of cycles converged online for CS2.

TABLE V :
Episodic success rate for CS2.

TABLE VI :
Episodic success rate of different iterations of the incremental training scheme on the training environments.