Researchers from Google and MIT have published a study, “Learning to See before Learning to Act: Visual Pretraining for Manipulation”, which investigates whether pretraining a robot’s vision system on computer vision tasks can improve its grasping ability.

They show that affordance-based manipulation, in which a robot learns how it can interact with the objects in its environment, can teach a robot to grasp objects in less than 10 minutes of trial and error. Affordances enable learning complex vision-based manipulation skills, including grasping, pushing and throwing.

To robots, these affordances are represented as dense pixel-wise action-value maps, where each pixel estimates the value of executing a predefined action at that location in the image.
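The idea can be sketched in a few lines. The following is a minimal illustration with hypothetical shapes and values, not the paper’s implementation: a toy array stands in for the network’s dense output, and the robot acts at the pixel with the highest predicted value.

```python
import numpy as np

def select_action(value_map: np.ndarray) -> tuple:
    """Pick the pixel with the highest predicted action value."""
    flat_index = np.argmax(value_map)
    return tuple(np.unravel_index(flat_index, value_map.shape))

# Toy 4x4 "value map" standing in for a real affordance model's output:
# each entry estimates how likely a predefined action (e.g. a top-down
# grasp) is to succeed if executed at that pixel.
value_map = np.zeros((4, 4))
value_map[2, 1] = 0.9  # the model is most confident a grasp here will succeed
print(select_action(value_map))  # -> (2, 1)
```

In a real system the value map comes from a convolutional network over the camera image, and the selected pixel is converted into a physical grasp pose.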

Recent research in transfer learning has shown that visual representations learned from large-scale datasets transfer well to visual recognition tasks. Similarly, for robots operating on affordance models, i.e. mappings from pixels to actions, reusing these visual representations means the vast amount of image data already collected and available can help robots learn real-world skills in less time.

The team transferred neural network weights, such as backbones from computer vision models that encode low-level features like corners, colors and edges, into affordance-based models. A robot initialized with this pretraining then learned to grasp objects through trial and error.

In computer vision, a deep model architecture typically consists of two parts: the backbone and the head. The backbone handles early-stage processing, such as detecting colors and edges, while the head handles later-stage processing, such as identifying contextual cues and spatial reasoning. For each new task, it is common in transfer learning to reuse the pretrained backbone and retrain the head.
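This backbone/head transfer can be sketched without any deep learning framework. The layer names and weight shapes below are hypothetical, chosen only to illustrate the two transfer strategies the study compares: copying the backbone alone versus copying both backbone and head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer names: "backbone.*" = early layers (edges, colors),
# "head.*" = later layers (contextual cues, spatial reasoning).
pretrained = {
    "backbone.conv1": rng.standard_normal((8, 3)),
    "backbone.conv2": rng.standard_normal((8, 8)),
    "head.fc": rng.standard_normal((8, 10)),
}

def transfer(pretrained: dict, keep_head: bool) -> dict:
    """Initialize a new task's model from a pretrained one.

    Backbone weights are always copied; head weights are copied only when
    keep_head is True, otherwise they are re-initialized from scratch.
    """
    new_model = {}
    for name, weights in pretrained.items():
        if name.startswith("backbone.") or keep_head:
            new_model[name] = weights.copy()  # reuse pretrained weights
        else:
            # fresh random initialization for the head
            new_model[name] = rng.standard_normal(weights.shape)
    return new_model

backbone_only = transfer(pretrained, keep_head=False)
full_transfer = transfer(pretrained, keep_head=True)
print(np.array_equal(full_transfer["head.fc"], pretrained["head.fc"]))   # True
print(np.array_equal(backbone_only["head.fc"], pretrained["head.fc"]))   # False
```

Either way, the copied weights are then fine-tuned on the robot’s own trial-and-error data rather than frozen.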

Transferring both the backbone and head weights to the affordance models sped up learning, after some initial hiccups.

The technique achieved a grasping success rate of 73 percent after 500 attempts, which jumped to 86 percent after 1,000 attempts.

On new objects, the success rate was 83 percent with backbone weights alone, and 90 percent with backbone and head weights transferred together.

The team found that the experiment produced better results when network weights from both the backbone and the head of pretrained vision models were transferred to the affordance models, as opposed to transferring only the backbone.

Using object localization weights from computer vision tasks improved exploration in the affordance models: early grasp attempts landed on objects more often, which in turn generated datasets that better distinguished good grasps from bad ones.
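The dataset generation step described above is self-supervised: each grasp attempt labels itself by whether the gripper succeeded. A minimal sketch with made-up trial records, not the study’s actual data format, looks like this:

```python
# Hypothetical trial log: each entry records the pixel where the robot
# attempted a grasp and whether the gripper reported success.
trials = [
    {"pixel": (12, 40), "success": True},
    {"pixel": (3, 7), "success": False},
    {"pixel": (25, 18), "success": True},
]

# Split attempts into positive and negative pixel labels; these become
# training targets for the pixel-wise affordance model.
good_grasps = [t["pixel"] for t in trials if t["success"]]
bad_grasps = [t["pixel"] for t in trials if not t["success"]]
print(len(good_grasps), len(bad_grasps))  # 2 1
```

Better exploration matters here because a policy that rarely touches objects produces almost no positive labels, leaving the model little signal to learn from.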

The researchers, further elaborating on the transfer methodology and its future application in robot manipulation, said, “Many of the methods that we use today for end-to-end robot learning are effectively the same as those being used for computer vision tasks. Our work here on visual pretraining illuminates this connection and demonstrates that it is possible to leverage techniques from visual pretraining to improve the learning efficiency of affordance-based manipulation applied to robotic grasping tasks. While our experiments point to a better understanding of deep learning for robots, there are still many interesting questions that have yet to be explored. For example, how do we leverage large-scale pretraining for additional modes of sensing (e.g. force-torque or tactile)? How do we extend these pretraining techniques toward more complex manipulation tasks that may not be as object-centric as grasping? These areas are promising directions for future research.”

Computer vision is seeing broad adoption across many fields, from medicine and manufacturing to e-commerce and security. Estimates put the global market for computer vision anywhere from $17.4 billion to $48.32 billion by 2023.

Once computer vision and its vast datasets are harnessed for robot manipulation through deep learning, they can be leveraged to improve tasks across these industries.