8 changes: 4 additions & 4 deletions sections/04_imitation_learning.tex
@@ -298,7 +298,7 @@ \subsubsection{Flow Matching}
This results in a continuous (and deterministic) trajectory at inference, which in practice is more efficient than following stochastic paths as in DMs.
Formally, FM can be fully characterized by an ordinary differential equation (ODE) relating the instantaneous variation of a flow to the underlying vector field, hence providing complete trajectories over the distributions' support when integrated over time,
\begin{align}
\frac{d}{dt} \psi(z, t) &= v(t, \psi(t, z)), \\
\frac{d}{dt} \psi(t, z) &= v(t, \psi(t, z)), \\
\psi(0, z) &= z .
\end{align}
In practice, flow models learn to approximate these dynamics by estimating a vector field \( v \) that matches the true, unknown \( u \), so that the induced flows \( \psi \) can approximate the ideal trajectories \( \psi^* \).
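To make the link between the learned vector field \( v \) and the flow \( \psi \) concrete, the snippet below is a minimal, PyTorch-style sketch of a conditional flow-matching objective together with Euler integration of the ODE above at inference time; all names are illustrative, and the straight-line conditional paths are an assumption rather than the exact construction of any specific work cited here.
\begin{verbatim}
import torch

def flow_matching_loss(v_net, x1, sigma_min=1e-4):
    """Conditional flow-matching loss with straight-line probability paths.

    v_net: network approximating the vector field v(t, x).
    x1:    batch of target samples (e.g. flattened action chunks), shape (B, D).
    """
    x0 = torch.randn_like(x1)                         # z ~ N(0, I), the source
    t = torch.rand(x1.shape[0], 1, device=x1.device)  # one time per sample
    # Point on the straight-line path from x0 (t = 0) to x1 (t = 1).
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # Conditional target vector field u_t for this path.
    u = x1 - (1 - sigma_min) * x0
    return ((v_net(t, xt) - u) ** 2).mean()

@torch.no_grad()
def integrate_flow(v_net, z, n_steps=10):
    """Euler integration of d/dt psi(t, z) = v(t, psi(t, z)), psi(0, z) = z."""
    x, dt = z, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((z.shape[0], 1), i * dt, device=z.device)
        x = x + dt * v_net(t, x)                      # one deterministic ODE step
    return x
\end{verbatim}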
@@ -422,7 +422,7 @@ \subsubsection{Code Example: Training and Using ACT in Practice}

\subsection{Diffusion Policy}
DMs have proven very effective in approximating complex, high-dimensional distributions, such as distributions over images~\citep{hoDenoisingDiffusionProbabilistic2020} or videos~\citep{polyakMovieGenCast2025}, thanks to their inherent ability to handle multimodal data and their training stability.
In Diffusion Policy (DP),~\citet{chiDiffusionPolicyVisuomotor2024} present an application of DMs the field of robot learning, leveraging diffusion to model expert demonstrations in a variety of simulated and real-world tasks.
In Diffusion Policy (DP),~\citet{chiDiffusionPolicyVisuomotor2024} present an application of DMs to the field of robot learning, leveraging diffusion to model expert demonstrations in a variety of simulated and real-world tasks.
Similarly to ACT~\citep{zhaoLearningFineGrainedBimanual2023},~\citet{chiDiffusionPolicyVisuomotor2024} (1) adopt a modified \emph{observation-conditioned target distribution} instead of the full joint \( p(o,a) \), and (2) predict multiple actions into the future instead of a single action.
Besides the intractability of the observations' marginal \( p_\theta(o) \) given \(p_\theta(o,a) \), DP's choice to model the data distribution through \( p_\theta(a \vert o) \) also stems from the computational burden of diffusion at test time: generating actions together with observations would require a large number of denoising steps—an unnecessarily slow and ultimately unhelpful process, given that robotics focuses on producing controls rather than reconstructing observations.
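As a rough illustration of what sampling from \( p_\theta(a \vert o) \) with a diffusion model involves, the sketch below runs a DDPM-style reverse process conditioned on the current observation to produce an action chunk; the network interface, noise schedule, and shapes are placeholders rather than DP's actual code.
\begin{verbatim}
import torch

@torch.no_grad()
def sample_action_chunk(eps_net, obs, chunk_shape, betas):
    """DDPM-style reverse process for p(a | o): denoise a Gaussian sample into
    an action chunk, conditioning the noise-prediction network on obs."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(chunk_shape, device=obs.device)      # a_T ~ N(0, I)
    for k in reversed(range(len(betas))):                 # k = T-1, ..., 0
        eps = eps_net(a, obs, k)                          # predicted noise
        coef = betas[k] / torch.sqrt(1.0 - alpha_bar[k])
        mean = (a - coef * eps) / torch.sqrt(alphas[k])
        noise = torch.randn_like(a) if k > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[k]) * noise           # stochastic update
    return a                                              # denoised action chunk
\end{verbatim}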

@@ -454,7 +454,7 @@ \subsection{Diffusion Policy}
Notably, the authors ablated the relevance of using RGB camera streams as input to their policy, and observed that high frame-rate visual observations can be used to attain performance (measured as success rate) comparable to that of state-based policies, which are typically trained in simulation with privileged information not directly available in real-world deployments.
As high frame-rate RGB inputs naturally accommodate dynamic, fast-changing environments,~\citet{chiDiffusionPolicyVisuomotor2024}'s conclusion offers significant evidence for learning streamlined control policies directly from pixels.
In their work,~\citet{chiDiffusionPolicyVisuomotor2024} also ablate the performance of DP against the size of the collected dataset, showing that DP reliably outperforms the considered baseline across all dataset sizes.
Further, in order accelerate inference,~\citet{chiDiffusionPolicyVisuomotor2024} employ Denoising Diffusion Implicit Models~\citep{songDenoisingDiffusionImplicit2022}, a variant of Denoising Diffusion Probabilistic Models~\citep{hoDenoisingDiffusionProbabilistic2020} (DDPM) adopting a strictly deterministic denoising paradigm (differently from DDPM's natively stochastic one) inducing the same final distribution's as DDPM's, and yet resulting in 10x less denoising steps at inference time~\citep{chiDiffusionPolicyVisuomotor2024}.
Further, in order to accelerate inference,~\citet{chiDiffusionPolicyVisuomotor2024} employ Denoising Diffusion Implicit Models~\citep{songDenoisingDiffusionImplicit2022} (DDIM), a variant of Denoising Diffusion Probabilistic Models~\citep{hoDenoisingDiffusionProbabilistic2020} (DDPM) adopting a strictly deterministic denoising paradigm (unlike DDPM's natively stochastic one) that induces the same final distribution as DDPM, yet requires 10x fewer denoising steps at inference time~\citep{chiDiffusionPolicyVisuomotor2024}.
Across a range of simulated and real-world tasks,~\citet{chiDiffusionPolicyVisuomotor2024} find DPs particularly performant when modeling \( \epsilon_\theta \) with a transformer-based network, although the authors note the increased sensitivity of transformer networks to hyperparameters.
Thus,~\citet{chiDiffusionPolicyVisuomotor2024} explicitly recommend starting out with a simpler, convolution-based architecture for diffusion (Figure~\ref{fig:diffusion-policy-architecture}), which is however reported to be biased towards learning low-frequency components~\citep{tancikFourierFeaturesLet2020}, and thus may prove more challenging to train with non-smooth action sequences.
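The deterministic denoising just described can be sketched as a DDIM update with \( \eta = 0 \) over a short subsequence of the original timesteps (e.g., 10 out of 100); the code below is a simplified illustration with hypothetical names, not DP's exact implementation. Because no fresh noise is injected between steps, the path from Gaussian noise to the action chunk is fully determined by \( \epsilon_\theta \), which is what allows skipping most denoising steps.
\begin{verbatim}
import torch

@torch.no_grad()
def ddim_sample(eps_net, obs, chunk_shape, alpha_bar, step_indices):
    """Deterministic DDIM sampling (eta = 0) over a decreasing subsequence of
    timesteps, e.g. step_indices = list(range(99, -1, -10))."""
    a = torch.randn(chunk_shape, device=obs.device)
    for k, k_prev in zip(step_indices, step_indices[1:] + [-1]):
        ab_k = alpha_bar[k]
        ab_prev = alpha_bar[k_prev] if k_prev >= 0 else torch.ones((), device=a.device)
        eps = eps_net(a, obs, k)
        # Predict the clean action chunk, then move deterministically to the
        # previous (less noisy) level: no fresh noise is injected.
        a0_hat = (a - torch.sqrt(1.0 - ab_k) * eps) / torch.sqrt(ab_k)
        a = torch.sqrt(ab_prev) * a0_hat + torch.sqrt(1.0 - ab_prev) * eps
    return a
\end{verbatim}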

@@ -483,7 +483,7 @@ \subsection{Optimized Inference}
Sync inference allocates computation every \( H_a \) timesteps, resulting in a reduced computational burden (on average) at control time.
However, sync inference inherently hinders the responsiveness of robot systems, introducing blind lags as the robot remains \emph{idle} while computing \( \actionchunk \).
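Concretely, sync inference alternates one blocking chunk prediction with \( H_a \) control steps; a schematic loop (with hypothetical \texttt{predict\_chunk} and \texttt{env.step} helpers) reads as follows.
\begin{verbatim}
def synchronous_control(policy, env, horizon, H_a):
    """Sync inference: predict a chunk, then execute its first H_a actions
    open-loop; the robot is idle during each (blocking) prediction."""
    obs = env.reset()
    t = 0
    while t < horizon:
        chunk = policy.predict_chunk(obs)   # blocking call: robot is idle
        for a in chunk[:H_a]:               # consume H_a actions, then re-plan
            obs = env.step(a)
            t += 1
\end{verbatim}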

One can use the fact that policies output multiple actions at the same time to directly (1) the lack of adaptiveness and (2) the presence of lags at runtime by decoupling action chunk \emph{prediction} \( \actionchunk \) from action \emph{execution} \( a_t \gets \textsc{PopFront}(\actionchunk_t) \).
One can use the fact that policies output multiple actions at the same time to directly address (1) the lack of adaptiveness and (2) the presence of lags at runtime by decoupling action chunk \emph{prediction} \( \actionchunk \) from action \emph{execution} \( a_t \gets \textsc{PopFront}(\actionchunk_t) \).
This decoupled stack, which we refer to as \emph{asynchronous} (async) inference (Algorithm~\ref{alg:async-inference}), also enables optimized inference by allowing action-chunk prediction to run on a separate machine, typically equipped with better computational resources than those available onboard the robot.
In async inference, a \( \textsc{RobotClient} \) sends an observation \( o_t \) to a \( \textsc{PolicyServer} \), receiving an action chunk \( \actionchunk_t \) once inference is complete (Figure~\ref{fig:ch4-async-inference}).
In this way, we avoid execution lags by triggering chunk prediction while the control loop is still consuming a previously available chunk, aggregating the previous and incoming chunks whenever the latter becomes available to the \( \textsc{RobotClient} \).
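As a single-process approximation of this scheme, the sketch below decouples prediction from execution with a background thread standing in for the \( \textsc{PolicyServer} \); for brevity, the incoming chunk simply replaces the current one rather than being aggregated with it, and all names are illustrative.
\begin{verbatim}
import threading, queue

def async_control(policy, env, horizon, trigger_threshold):
    """Schematic async inference: chunk prediction runs in the background
    while the control loop keeps consuming the current chunk."""
    fresh = queue.Queue(maxsize=1)

    def predict(obs):                           # stands in for the PolicyServer
        fresh.put(list(policy.predict_chunk(obs)))

    obs = env.reset()
    current = list(policy.predict_chunk(obs))   # only the first call blocks
    worker = None
    for _ in range(horizon):
        if not current:                         # ran out of actions: must wait
            current, worker = fresh.get(), None
        a = current.pop(0)                      # a_t <- PopFront(chunk)
        obs = env.step(a)
        # Fire a new request early enough to hide the inference latency.
        if worker is None and len(current) <= trigger_threshold:
            worker = threading.Thread(target=predict, args=(obs,))
            worker.start()
        # Switch to the fresh chunk as soon as it is available.
        if worker is not None and not fresh.empty():
            current, worker = fresh.get(), None
\end{verbatim}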