For service robots, an accurate understanding of human behavior is the basis for personalized interactions. With the vigorous development of computer vision technology in recent years, understanding human behavior from images has been a basically solved problem, which is also a well-known task – visual action recognition.
But is this enough for a robot? In addition to recognizing some action patterns (e.g., eating, running), we humans also have forward-looking predictions of events (such as drinking water after running). This ability comes from our accumulated experience of various events in our lives. Mathematically speaking, we accumulate samples of various random events, then these samples form a distribution. This kind of forward-looking prediction of behavior/events is crucial for robots that provide personalized service. If the robot can anticipate human behavior, it can plan the service content in advance, making the service and interaction more human-oriented and intelligent.
It is challenging to enable robots with such forward-looking prediction capabilities. Recently, action anticipation in the field of computer vision attempts to solve such a problem: given a few past frames, the model needs to predict the action corresponding to the future frames (answer phone in Figure 1). However, such predictions are usually discriminative and cannot model uncertainty about future actions. For example, the person’s future actions may also be reading the message.
To model the multiple possibilities of future actions, generative models are in the position to capture the distribution of actions/events. Generally speaking, discriminative models determine whether the current event belongs to a or b by learning a decision surface, while generative models need to have a global understanding of events a and b and learn their distributions (see Figure 2). Therefore, the action anticipation task mentioned above can be better achieved by generative models. Specifically, past frames are first input into the model, and then the model outputs multiple possible future frames.
The is an overall introduction to action anticipation. We expect that generative models will bring new technological breakthroughs to this task and move robots towards a better human behavior understanding