This AI Learns to Handle the Unknown—By Touch Alone

by Large Models (dot tech), June 13th, 2025

Too Long; Didn't Read

Robots learn to "feel" like humans by combining touch, vision, and language—boosting their ability to understand objects and act in the real world.


Authors:

(1) Samson Yu, Dept. of Computer Science, National University of Singapore (samson.yu@u.nus.edu);

(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;

(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;

(4) Jiafei Duan, University of Washington;

(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute (harold@comp.nus.edu.sg).

V. EXPERIMENTAL SETUP

In this section, we evaluate the physical property prediction and reasoning capabilities of our proposed method. We design several experiments to answer the following questions:


  1. Are our physical property predictions useful for OCTOPI to reason about everyday scenarios?


  2. Can OCTOPI be used in real robots to help them accomplish tasks using physical reasoning?


  3. Can OCTOPI’s understanding of physical properties generalize to unseen everyday objects?


A. Data Processing


The tactile videos were processed into frames. To focus on a few salient frames for better efficiency, we kept the frames whose total pixel-intensity difference from the preceding frame fell in the top 30%. During training, we randomly sampled 5 of these salient frames; during evaluation, we selected 5 frames at uniform intervals starting from the first salient frame. Training data was augmented with random horizontal and vertical flips, each applied with 50% probability. A minimal code sketch of this pipeline is given below.
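The following is a minimal sketch of the frame selection, sampling, and augmentation steps described above. The function names (select_salient_frames, sample_training_frames, sample_eval_frames, augment) and the assumption that frames arrive as a NumPy array are illustrative, not the authors' actual implementation.

```python
import random
import numpy as np

def select_salient_frames(frames: np.ndarray, keep_ratio: float = 0.3) -> np.ndarray:
    """Keep frames whose total pixel-intensity difference from the
    preceding frame is in the top `keep_ratio` fraction."""
    diffs = np.abs(frames[1:].astype(np.int64) - frames[:-1].astype(np.int64))
    scores = diffs.reshape(len(diffs), -1).sum(axis=1)      # one score per frame (from index 1 onward)
    k = max(1, int(len(scores) * keep_ratio))
    top_idx = np.sort(np.argsort(scores)[-k:]) + 1          # salient-frame indices, in temporal order
    return frames[top_idx]

def sample_training_frames(salient: np.ndarray, n: int = 5) -> np.ndarray:
    """Training: randomly sample n salient frames."""
    idx = sorted(random.sample(range(len(salient)), min(n, len(salient))))
    return salient[idx]

def sample_eval_frames(salient: np.ndarray, n: int = 5) -> np.ndarray:
    """Evaluation: n frames at uniform intervals, starting from the first salient frame."""
    idx = np.linspace(0, len(salient) - 1, num=min(n, len(salient)), dtype=int)
    return salient[idx]

def augment(frame: np.ndarray) -> np.ndarray:
    """Training augmentation: independent 50% horizontal and vertical flips."""
    if random.random() < 0.5:
        frame = np.flip(frame, axis=1)   # horizontal flip
    if random.random() < 0.5:
        frame = np.flip(frame, axis=0)   # vertical flip
    return frame
```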


TABLE VI. Results on PHYSICLEAR Physical Understanding Tasks. OCTOPI’s performance on physical understanding tasks improves with object property descriptions (OPD). Performance also increases with larger LLM size, with OCTOPI-13b outperforming OCTOPI-7b across all three tasks.


B. Training Hyperparameters



C. Training Requirements


Encoder fine-tuning took 6 hours and required less than 5 GB of GPU memory. Tactile feature alignment together with end-to-end fine-tuning took 5 hours for OCTOPI-7b and 6.5 hours for OCTOPI-13b. We used one NVIDIA RTX A6000 for OCTOPI-7b and two NVIDIA RTX A6000s for OCTOPI-13b.


This paper is available on arXiv under the CC BY 4.0 DEED license.

