This repository contains the implementation for teleoperation and imitation learning of Open-TeleVision.
git clone https://github.com/ShalikAI/TeleVision.git
Create and activate the conda environment, then install the packages:
cd TeleVision
source ~/miniconda3/bin/activate
conda create -n tv python=3.8
conda activate tv
conda install -c conda-forge av ffmpeg
# pip3 install -r requirements.txt
pip3 install -r requirements.txt --no-deps
cd act/detr
pip3 install -e .
Install the ZED SDK: https://www.stereolabs.com/developers/release/
Install ZED Python API:
cd /usr/local/zed/
python3 get_python_api.py
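After installation, a quick way to check that the ZED Python API works is the short sketch below; it uses the documented pyzed.sl API and reports SUCCESS only when a ZED camera is connected.

```python
# Quick check that the ZED Python API installed correctly (a sketch using
# default InitParameters).
import pyzed.sl as sl

zed = sl.Camera()
status = zed.open(sl.InitParameters())
print(status)   # sl.ERROR_CODE.SUCCESS if a ZED camera is connected
zed.close()
```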
Install Isaac Gym: download the Isaac Gym Preview 4 release from the website, extract the zip file, and copy the isaacgym folder into TeleVision. Then install it:
cd ~/TeleVision/isaacgym/python/
pip3 install -e .
Clone DINOv2 for Python 3.8:
cd ~/TeleVision
git clone https://github.com/facebookresearch/dinov2.git
cd dinov2
git checkout b48308a394a04ccb9c4dd3a1f0a4daa1ce0579b8
For Quest local streaming, we will use adb (Android Debug Bridge):
sudo apt install -y adb
adb version
sudo apt install -y android-sdk-platform-tools-common
sudo usermod -aG plugdev $USER
Use adb reverse to forward the TCP port:
adb reverse tcp:8012 tcp:8012
Then run adb reverse --list to verify:
adb reverse --list
See the hand in action on the Vuer site (https://vuer.ai?grid=False):
To test the application locally, we need to create a self-signed certificate and install it on the client. You need an Ubuntu machine and a router. Connect the Quest 3 and the Ubuntu machine to the same router through WiFi.
- Install mkcert:
wget https://github.com/FiloSottile/mkcert/releases/download/v1.4.4/mkcert-v1.4.4-linux-amd64
chmod +x mkcert-v1.4.4-linux-amd64
sudo mv mkcert-v1.4.4-linux-amd64 /usr/local/bin/mkcert
mkcert -version
- Check the local IP address:
ifconfig | grep inet
Suppose the local IP address of the Ubuntu machine is 192.168.8.102.
- Create the certificate:
cd TeleVision/teleop
mkcert -install && mkcert -cert-file cert.pem -key-file key.pem 192.168.8.102 localhost 127.0.0.1
P.S. The cert.pem and key.pem files should be generated in (or moved to) TeleVision/teleop.
- Open the firewall on the server:
sudo iptables -A INPUT -p tcp --dport 8012 -j ACCEPT
sudo iptables-save
sudo iptables -L
Alternatively, this can be done with ufw:
sudo ufw allow 8012
For local streaming, call OpenTeleVision with ngrok=False:
tv = OpenTeleVision(self.resolution_cropped, shm.name, image_queue, toggle_streaming, ngrok=False)
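For context, here is a minimal sketch of how the objects passed to OpenTeleVision can be created with Python's standard multiprocessing tools. The image layout, resolution, and import path are assumptions; see teleop_hand.py in the repo for the actual setup.

```python
# Minimal sketch of the objects around the OpenTeleVision call (shared-memory
# image buffer, queue, and event). Import path and image layout are assumptions.
from multiprocessing import Event, Queue, shared_memory
import numpy as np
from TeleVision import OpenTeleVision   # import path is an assumption

resolution_cropped = (720, 1280)        # (H, W) of one cropped camera image (placeholder)
img_shape = (resolution_cropped[0], 2 * resolution_cropped[1], 3)  # side-by-side stereo (assumption)

shm = shared_memory.SharedMemory(create=True, size=int(np.prod(img_shape)))  # uint8 buffer
img_array = np.ndarray(img_shape, dtype=np.uint8, buffer=shm.buf)            # write frames here

image_queue = Queue()
toggle_streaming = Event()

# ngrok=False -> serve over the LAN using the self-signed cert created above
tv = OpenTeleVision(resolution_cropped, shm.name, image_queue, toggle_streaming, ngrok=False)
```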
On the client device, go to Settings > General > About > Certificate Trust Settings. Under "Enable full trust for root certificates", turn on trust for the certificate.
- Open the browser on Quest 3 and go to https://192.168.8.102:8012?ws=wss://192.168.8.102:8012
- Click Enter VR and Allow to start the VR session.
For the Meta Quest 3, installation of the certificate is not trivial, so we use a network streaming solution instead. We use ngrok to create a secure tunnel to the server.
- Install ngrok:
curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
| sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
&& echo "deb https://ngrok-agent.s3.amazonaws.com bookworm main" \
| sudo tee /etc/apt/sources.list.d/ngrok.list \
&& sudo apt update \
&& sudo apt install ngrok
- Add the authentication token:
ngrok config add-authtoken <authentication_token>
The first time, you have to generate the authentication token by creating an ngrok account and signing up for 2FA. Afterwards, you will find the authentication token inside ~/.config/ngrok/ngrok.yml.
- Run ngrok:
ngrok http 8012
- Copy the https address (e.g. https://unenthralling-nonunited-thomas.ngrok-free.dev), open the browser on the Meta Quest 3, and go to that address. You can also open http://127.0.0.1:4040 or http://localhost:8012 to see the same thing on your Ubuntu PC.
P.S. When using ngrok for network streaming, remember to call OpenTeleVision with:
self.tv = OpenTeleVision(self.resolution_cropped, self.shm.name, image_queue, toggle_streaming, ngrok=True)
You can record the VR streaming and copy files from the Oculus Quest. List the adb devices:
adb devices
List the files inside the SD card of the Oculus Quest 3:
adb shell ls /sdcard/Oculus/Screenshots
adb shell ls /sdcard/Oculus/VideoShots
Copy the necessary files:
adb pull /sdcard/Oculus/VideoShots/<file_name>.mp4 .
- After setting up either local or network streaming following the above instructions, you can try teleoperating two robot hands in Isaac Gym:
cd teleop && python teleop_hand.py
- Confirm your Vuer server is actually on 8012:
sudo lsof -iTCP:8012 -sTCP:LISTEN -n -P
You should see which process is listening on 8012.
- Go to the local site (https://localhost:8012/) and see the hands in action:
- Go to the Vuer site (https://localhost:8012/?ws=wss://localhost:8012) on the Quest 3 for local streaming, or the ngrok site (https://unenthralling-nonunited-thomas.ngrok-free.dev) for network streaming, then click Enter VR and Allow to enter the immersive environment.
- See the hands in 3D inside Isaac Gym.
- Download the dataset from here. There are 5 datasets in total, namely 00-can-sorting, 01-can-insertion, 02-folding, 03-unloading, and 04-gr1-can-sorting.
- Place the downloaded dataset in data/recordings/. Each dataset contains observation states as multiple ZED svo files (left and right image pairs) and corresponding hdf5 files (action sequences / joint positions) as episodes.
- Process the specified dataset for training using scripts/post_process.py:
python post_process.py
This will create a folder named processed inside every dataset, where the corresponding svo and hdf5 files for each episode are merged to produce a single hdf5 file per episode.
- You can verify the observation states (image and action sequences) of a specific episode in the dataset using scripts/replay_demo.py:
python replay_demo.py
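If you want a quick programmatic sanity check of the merged files instead of a full replay, a minimal sketch with h5py is shown below. The file path is hypothetical, and the key names follow those referenced in the observation/action notes further down; they may differ in your version of the processing script.

```python
# Inspect one processed episode (path and key names are assumptions).
import h5py

with h5py.File("data/recordings/00-can-sorting/processed/episode_0.hdf5", "r") as f:
    f.visit(print)                       # list every group/dataset in the file
    qpos = f["observation.state"][:]     # expected (T, 26) joint positions
    action = f["qpos_action"][:]         # expected (T, 28) commanded joint targets
    print(qpos.shape, action.shape)
```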
- To train ACT, run:
python imitate_episodes.py --policy_class ACT --kl_weight 10 --chunk_size 60 --hidden_dim 512 --batch_size 45 --dim_feedforward 3200 --num_epochs 50000 --lr 5e-5 --seed 0 --taskid 00 --exptid 01-sample-expt
- After training, save a TorchScript (JIT) export of the desired checkpoint:
python imitate_episodes.py --policy_class ACT --kl_weight 10 --chunk_size 60 --hidden_dim 512 --batch_size 45 --dim_feedforward 3200 --num_epochs 50000 --lr 5e-5 --seed 0 --taskid 00 --exptid 01-sample-expt \
--save_jit --resume_ckpt 25000
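For reference, a hedged sketch of loading that export with TorchScript; the checkpoint file name and location are assumptions, so check the experiment directory that imitate_episodes.py writes.

```python
# Load the exported TorchScript policy (path is an assumption).
import torch

policy = torch.jit.load("checkpoints/00-can-sorting/01-sample-expt/policy_jit_25000.pt",
                        map_location="cpu")
policy.eval()
# scripts/deploy_sim.py (next step) can then roll the trained policy out in simulation.
```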
- You can visualize the action states with the trained policy on inputs from the dataset using scripts/deploy_sim.py, for example:
python deploy_sim.py --taskid 00 --exptid 01 --resume_ckpt 25000
After expansion, the full H1 robot qpos has 51 DOF.
A) Joint Positions qpos → observation state
From imitate_episodes.py and utils.load_data():
state_dim = 26
This is:
qpos_t ∈ ℝ²⁶
✔ Comes from: data['observation.state']
✔ Semantics:
- Robot joint positions at time t
- Low-dimensional, compact representation
- Does NOT include every physical joint
- Typically:
  - left arm
  - left hand
  - right arm
  - right hand
  - possibly torso / base
B) Images → observation state
Also part of the observation:
image_data.shape = (B, 2, 3, H, W)
Processed by:
- DINOv2 backbone
- Produces visual tokens
Role:
- Images tell the policy what the world looks like
So, Observation = { qpos_t , left_image_t , right_image_t }
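To make the shapes concrete, a small illustrative sketch follows; H and W are placeholders, and the batch size matches --batch_size from the training command above.

```python
# Illustrative tensor shapes for one training batch (a sketch, not repo code).
import torch

B, H, W = 45, 240, 320
qpos = torch.zeros(B, 26)                  # data['observation.state'] -> state_dim = 26
image_data = torch.zeros(B, 2, 3, H, W)    # (batch, camera [left/right], channels, H, W)
# The DINOv2 backbone turns image_data into visual tokens; together with qpos
# they form the observation the ACT policy conditions on.
```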
A) action_dim = 28 → action state
From config:
action_dim = 28
This is:
action_t ∈ ℝ²⁸
✔ Comes from: data['qpos_action']
✔ Predicted by ACT: pred_actions.shape = (chunk_size, 28)
✔ Represents:
Desired joint targets to command the robot next
Observation qpos (26D):
- What the robot is
- Reduced state
- Does not include detailed finger joints

Action (28D):
- What the robot should do
- Includes:
  - left arm (7)
  - right arm (7)
  - compressed left hand (6 → 12 joints)
  - compressed right hand (6 → 12 joints)
From convert_h1_qpos():
| Action indices | Meaning |
|---|---|
| action[0:7] | Left arm (7 DOF) |
| action[7:13] | Left hand compressed (6 scalars → 12 joints) |
| action[13:20] | Right arm (7 DOF) |
| action[20:26] | Right hand compressed (6 scalars → 12 joints) |
| action[26:28] | (unused / padding / legacy) |
That’s why action_dim = 28.
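For illustration, here is a sketch of slicing a 28-D action according to the table above; the authoritative mapping is whatever convert_h1_qpos() implements in the repo.

```python
# Split a 28-D action vector per the index table above (a sketch).
import numpy as np

def split_action(action: np.ndarray) -> dict:
    assert action.shape == (28,)
    return {
        "left_arm":   action[0:7],     # 7 DOF
        "left_hand":  action[7:13],    # 6 compressed scalars -> 12 hand joints
        "right_arm":  action[13:20],   # 7 DOF
        "right_hand": action[20:26],   # 6 compressed scalars -> 12 hand joints
        "extra":      action[26:28],   # unused / padding / legacy
    }
```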
At time t:
- Observation: obs_t = {qpos_t (26D), left_img_t, right_img_t}
- Policy prediction (chunk size = 60): actions_t:t+59 = ACT(obs_t)  # shape (60, 28)
- Execution: apply action_t = actions[0]
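Putting the three steps together, a schematic rollout loop looks like the sketch below; policy, get_obs, and send_command are placeholder callables, not names from this repo.

```python
# Schematic rollout loop for the flow above (placeholder callables).
CHUNK_SIZE = 60   # matches --chunk_size from the training command
ACTION_DIM = 28
STATE_DIM = 26

def rollout(policy, get_obs, send_command, num_steps=1000):
    for t in range(num_steps):
        qpos, left_img, right_img = get_obs()          # qpos: (26,), images: (H, W, 3)
        actions = policy(qpos, left_img, right_img)    # predicted action chunk, shape (60, 28)
        assert qpos.shape == (STATE_DIM,)
        assert actions.shape == (CHUNK_SIZE, ACTION_DIM)
        send_command(actions[0])                       # execute only the first action of the chunk
```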
