Using large-scale language models to learn robust robotic representations
Note that this project will be co-supervised by Alara Dirik from Colors Lab.
 
Foundation models are large-scale models trained on huge unlabeled datasets. While foundation language models such as BERT or GPT are trained on specific objectives (predicting masked words, predicting the next word), their intermediate feature spaces are used to generate language embeddings and to train smaller downstream models for tasks such as question answering and sentence classification, as well as to compute word/sentence probabilities and similarities. Other foundation models such as CLIP and ALIGN learn joint image-text representations that can be used in various downstream tasks such as image captioning, image-text matching, and image querying / content-based image retrieval.
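
As a rough illustration of how such joint image-text representations can be queried off the shelf, the sketch below scores candidate captions against an image with CLIP via the Hugging Face transformers library. The image file name and the caption strings are placeholders chosen for this example, not part of the project description.

```python
# Minimal sketch: score candidate captions against an image with CLIP.
# "scene.jpg" and the caption strings are illustrative placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # e.g., a tabletop scene observed by the robot
captions = ["a red mug on the table", "a blue box on the shelf"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-caption similarities; softmax gives a distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```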
 
Given the need for scalable and generalizable object and task representations in robotics, foundation models provide a tremendous opportunity that is still largely untapped. A very recent work submitted to ICLR by Huang et al. shows that these models can successfully break down seemingly simple commands, which are otherwise incomprehensible to robots, into understandable subtasks. However, the consensus among the reviewers is that, while the work is very promising, it suffers from a narrow action space, constraints on those actions, and the lack of multiple objects of the same class.
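
In the spirit of that line of work, the hedged sketch below prompts a pretrained causal language model with a few worked examples so that a free-form command is decomposed into step-by-step subtasks. The prompt format, the example tasks, and the model choice ("gpt2-large") are assumptions made for illustration, not the setup used by Huang et al.

```python
# Sketch: few-shot prompting of a language model to decompose a command into subtasks.
# Model choice and prompt format are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-large")

prompt = (
    "Task: throw away the empty bottle\n"
    "Step 1: walk to the bottle\n"
    "Step 2: pick up the bottle\n"
    "Step 3: walk to the trash can\n"
    "Step 4: put the bottle in the trash can\n\n"
    "Task: bring me a glass of water\n"
    "Step 1:"
)

out = generator(prompt, max_new_tokens=60, do_sample=False)[0]["generated_text"]
print(out[len(prompt):])  # model-proposed subtasks for the new command
```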
 
Can we come up with a more generalizable framework that draws insights from a large concept dataset such as ConceptNet, or even leverages the capabilities of joint image-text models and applications (CLIP, CLIP-caption, Crop CLIP)? How should we handle situations where there is more than one object of the target category?
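
One possible direction for the multi-object question, sketched below under the assumption that candidate object crops are already available (e.g., from an off-the-shelf detector, as in the Crop CLIP idea): CLIP ranks the crops against a referring expression so the robot can disambiguate between several objects of the same class. The crop file names and the query string are placeholders.

```python
# Sketch: disambiguate between multiple objects of the same class by ranking
# detector crops against a referring expression with CLIP. File names are placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "the mug closest to the laptop"  # referring expression from the user command
crops = [Image.open(p) for p in ["mug_0.jpg", "mug_1.jpg", "mug_2.jpg"]]

inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]  # one similarity score per crop

best = int(scores.argmax())
print(f"Selected crop index: {best}")
```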
 
Please see the works by DeChant et al., Luo et al., Lynch et al., and Suglia et al. for discussions and applications of using language cues for robotic tasks.
 
Datasets and Environments: ALFRED, VirtualHome, ALFWorld, BIG-Bench, Habitat
Keywords: Human-Computer Interaction, Reasoning, Scene and Task Representations

Project Advisor: Emre Uğur

Project Status:

Project Year: 2023, Spring
