Abstract: Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information ...
For inspection robots to achieve generalizability, stability, and ease of use, it is crucial that they understand natural language commands and navigate accurately to specified target objects. We ...