Article Review Assignment

DTE
percivingtheperson.pdf

Pattern Recognition Letters 118 (2019) 3–13


Perceiving the person and their interactions with the others for social robotics – A review

Adriana Tapus a, Antonio Bandera b,∗, Ricardo Vazquez-Martin c, Luis V. Calderita b

a Autonomous Systems and Robotics Lab, Computer Science and System Engineering Department (U2IS), ENSTA ParisTech, 828 Blv des Maréchaux, Palaiseau 91120, France

b AVISPA Group, Department of Electronic Technology, Universidad de Málaga, Málaga, 29071, Spain

c Robotics and Mechatronics Lab., Department of System Engineering and Automation, Universidad de Málaga, Málaga, 29071, Spain

Article info

Article history: Available online 6 March 2018

Keywords: Social robots; Human perception; Human–robot interaction; Social interactions; Proxemics

Abstract

Social robots need to understand human activities, dynamics, and the intentions behind their behaviors. Most of the time, this implies modeling the whole scene. The activities and intentions of a person are inferred from the perception of the individual, but also from their interactions with the rest of the environment (i.e., objects and/or people). Centering on the social nature of the person, robots need to understand human social cues, which include verbal but also nonverbal behavioral signals such as actions, gestures, body postures, facial emotions, and proxemics. The correct understanding of these signals helps these robots to anticipate the needs and expectations of people. It also avoids abrupt changes in the human–robot interaction, as the temporal dynamics of interactions are anchored and driven by a major repertoire of social landmarks. Within the general framework of interaction of robots with their human counterparts, this paper reviews recent approaches for recognizing human activities, but also for perceiving social signals emanating from a person or a group of people during an interaction. The perception of visual and/or audio signals allows robots to correctly localize themselves with respect to the humans in the environment while also navigating and/or interacting with a person or a group of people.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

One of the basic skills allowing people to interact in a safe and comfortable way is their ability to understand intuitively each other's role and activities. Every day, people observe one another and, through these observations, they recognize what others are doing and also infer their intentions. Moreover, this is achieved without remarkable effort. It is clear that this ordinary and effortless ability is not only the result of having at our disposal a complex multimodal perception system; other complex systems, related to learning and planning, are also involved. Activities that have not been seen before cannot be recognized. In the same way, intentions that do not respond to, or cannot be included within, a normal course of actions will not be correctly inferred. The recognition of activities and intentions is therefore intimately tied to the existence of a specific, shared socio-cultural background, which is continuously acquired and improved within the framework of the interaction with others. The observation and interpretation of the various social cues emanating from social interaction with others is therefore also crucial for our acquisition of the correct collection of social rules.

∗ Corresponding author. E-mail address: ajbandera@uma.es (A. Bandera).

https://doi.org/10.1016/j.patrec.2018.03.006

0167-8655/© 2018 Elsevier B.V. All rights reserved.

Now that robots are moving from automated factories into our everyday environments, it is natural to endow them with some of the aforementioned skills (e.g., based on a set of social rules) centered on the challenge of interacting with humans. In this scenario, it is fundamental to have a robot perception system capable of reading the social signals that emerge from the interaction. The aim is to produce a socially correct and smooth interaction between the robot and the humans in its surroundings, based on the prediction of their behaviors [76]. Anticipating which activities people in our surroundings will do next (and why they will do so) can help the robot to plan its next responses and behaviors in advance [94]. Robots need to understand verbal and nonverbal social cues from each individual person and from the dynamics of their relationships. Signals such as body postures, gestures, and facial emotions are relevant for estimating the internal state of the humans. Understanding the dynamics of a group of people and identifying the social role of each member of the group help the robot to exhibit a correct behavior from a social perspective. All this knowledge can only be acquired from the observation and modeling of the human and of their social interactions with other people.

Fig. 1. Robots interact with people in human-centred environments. In the figure, the Gualzru robot tries to convince a woman to follow it to an interactive advertising panel [62].

This paper focuses on reviewing recent approaches and relevant topics related to the perception and modeling of the human, as an isolated individual, but also as part of a group of people. Restricted to the ability to identify the signals that can help in having a social interaction, the acquisition of this complex skill requires the robot to be equipped with hardware and software modules that allow it (i) to perceive humans and their static and dynamic attributes; and (ii) to match the obtained features with a specific, memorized or on-line captured state (social knowledge) for modeling them. It is important to note that the static role, as a passive observer, that we are assuming here for the robot is not the real situation. Our robots are situated agents that perceive but also act in this outer world. The Theory of Event Coding [32] proposes that the stimulus representations underlying perception are encoded using the same format as the sensorimotor representations underlying action. This is a significant difference with respect to the analysis of video sequences captured from static cameras. Although we do not cover topics such as affordance or goal directedness in this contribution, we must consider that the situatedness of the robot within the whole context plays a significant role in its ability to recognize the behaviors and social interactions of the humans in its surroundings.

The rest of the paper is organized as follows: Section 2 overviews the problem tackled in this work, namely the modeling of the activities and social behavior of individuals, and of their social interactions. Among the most important requirements are the extraction and classification of hand-crafted or learned features, and the modeling and internalizing of the social relationships. Both topics are described in Sections 3 and 4, respectively. Section 3 is divided into two main subsections, which review the typical parameters of the perception system designed for a dyadic interaction (Section 3.1) and for the interaction with a group of people (Section 3.2). It is important to note that a strict separation between feature extraction algorithms and classifiers does not always exist and that both processes can be encoded together within the same solution. A general discussion follows this study in Section 5. Finally, our conclusions are drawn in Section 6.

2. Understanding a scene populated by humans

In the last decade, there has been growing interest in the design, methodology, and theory of human–robot interaction [29]. This is justified by the fact that robots are expected to share our environments and to cooperate with us to a greater or lesser extent in our daily activities. Hence, autonomous robots used for specific tasks with very limited interaction with humans are not a viable solution. Restricting the human–robot interaction (HRI) scenario to a dyadic interaction, where the robot interacts with only one human, is not realistic most of the time. Robots are increasingly part of teams (of robots or humans), for instance, working closely alongside humans in industrial settings [66] or helping physiotherapists to evaluate how a patient performs a motion-based test in a hospital room [84]. Understanding such a situation forces the robot to perceive details from the whole scene, capturing not only the human but also their interaction with the surrounding objects and, especially, their social interaction with other people (Fig. 1). Focusing on only one person could lead to the omission of important information, and this can lead to wrong decisions.

The recognition of human activities and social interactions is a complex task for robots, which requires the design and interaction of several modules. Detailing the scheme stated in Section 1, these systems typically include modules for (i) extracting significant unary and pairwise-interaction human-related features [74,89] from the scene; (ii) obtaining meaningful, semantic information (gender, gestures, ...) [67] from these descriptors; and (iii) using the information coming from several sources for modeling and internalizing the scene (usually employing a graphical model [34,43]). The internalization of the perceived information can help to fuse multimodal cues or to deal with the subsequent intention recognition problem. Fig. 2 summarizes this approach. As in other related approaches, the classification algorithms need to have at their disposal datasets (knowledge) for comparison and matching. Thus, although it is not drawn in the figure, the scheme must incorporate the learning mechanisms for updating this knowledge.

The modules in charge of extracting features (unary- and pairwise-interaction features including objects) and the ones responsible for returning semantic concepts from these features must try to build a model of the scene. This allows the robot to understand the behaviors of the people and even get the gist of this scene (e.g., catalog the event as a birthday celebration, award function, etc. [59]). The parameters of these modules are tuned according to the final use case or application: encouraging a child to perform an exercise within a rehabilitation session is not the same as guiding a group of people through a museum. Moreover, the sensors, features and recognition needs are not the same either. The need to fuse perceptions coming from different modalities (e.g., audio and video for emotion recognition) could be a reason for adding a new module, the so-called ‘Internal representation’, to the scheme in Fig. 2. In some cases, the internal representation can include part of this knowledge (e.g., a priori known models of human bodies or faces) and can then also be used as an additional source of information for action recognition [6] or emotion recognition [17]. For instance, the hierarchical recognition approaches that are built over primitive sub-actions or sub-activities do not directly deal with the raw data for activity recognition [1]. An additional advantage of working over an inner representation is that the approaches designed for performing the recognition processes can be partially decoupled from the hardware resources available on the robot [45]. Finally, the inference module in Fig. 2 encodes the processes in charge of extending the model with data obtained from the outcomes of the classifiers.

Within this paper, we conduct a survey of the solutions proposed for allowing a robot to perceive and internalize the activities and social interactions of a group of people. Thus, this review covers the perception and modeling of the activities of a group of people that share the environment with the robot. The term activity takes in this context a significance that exceeds the simple execution of certain movements. Following the terminology given by Turaga et al. [79], we distinguish between action and activity. An action is a simple motion pattern, executed by a single person and usually of short duration. Activities are complex sequences of actions performed by one or several people, in a scenario that is typically driven by social cues.

Fig. 2. Major modules of a system in charge of modeling human behaviors and interactions. It typically includes feature extraction and classification, internal representation and inference mechanisms. The pipeline scheme is only partially true. The internal representation can store raw data from the feature extractors and help in modeling the whole scene (see text for details).

3. Perceiving and modeling people and their interactions

3.1. Modeling the human

As mentioned above, there exist a large number of signals that can be captured for modeling a person: speech, facial expression, gaze, gestures, and any sort of measurement related to social interaction that a robot can record from the environment. The robot probably needs dedicated hardware and software resources for dealing with each one of them. In the simplest scenario, concerned only with activity recognition, the robot typically requires at least visual information for extracting motion information and characterizing the dynamics of the scene. It concentrates all resources on the interaction with one human counterpart: an action is in any case a sequence of body movements, and it usually involves several body parts concurrently.

Fig. 3 provides some snapshots of human–robot dyadic interaction. On the right, the ARMAR-III from the Karlsruhe Institute of Technology in Germany [80] is shown. It focuses on detecting and tracking the gestures of a human teacher [26]. The whole system allows the transfer of motion based on predefined gestures and force interaction. Initially, a dynamic movement primitive (DMP) [37] is learned from a human wiping movement. Given the color of the wiping tool, the robot tracks the movements of the tool using a stereo camera system. For the subsequent force-based adaptation of the learned DMPs, it relies on the readings of the force-torque sensor installed on the wrist of the robot. In Fig. 3(b), the Loki robot plays a simple game with a person. It is able to detect the presence of a person and recognize verbal commands. Thus, when the human introduces themselves and asks it to play the game, Loki uses color and distance information for tracking a yellow ball. For doing this, it has an RGB-D sensor placed on the forehead. It continuously fixates its gaze upon the ball. After a verbal indication, it reaches the ball with its hand and waits for a new interaction. Loki tracks the object and accepts new speech commands during the whole span of the game, representing the whole scene using an undirected graph [6]. Fig. 3(c) shows the Nao robot coaching a child during a rehabilitation session [58]. An external Microsoft Kinect sensor is employed for capturing the skeleton of the human user, and threshold values are used for determining the correct execution of certain exercises. The same kind of interaction between a Nao robot and children with autism in an imitation task is also described in [14]. These examples show how different modalities, features and classifiers are used for modeling the human and their interaction with the robot. If we analyze the details of the hardware and software architectures behind these experiments, we can also note the complexity of the perception and actuation systems. As it is probably not possible to summarize all perceptual possibilities within one paper, here we provide a brief description of relevant issues, which are classified in Fig. 4.
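To make the threshold-based exercise check mentioned for the Nao rehabilitation example more concrete, the following is a minimal sketch, not the method of [58]. It assumes that the skeleton is available as 3D joint positions and that an exercise counts as correct when the elbow angle exceeds a threshold; the function names and the 160-degree default are hypothetical.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate joint: coincident points")
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_angle))


def arm_extended(shoulder, elbow, wrist, min_angle=160.0):
    """Hypothetical rule: the arm is 'extended' if the elbow angle
    reaches min_angle degrees (an assumed, illustrative threshold)."""
    return joint_angle(shoulder, elbow, wrist) >= min_angle
```

In practice, the thresholds would be set per exercise and per patient, and the check would be applied over a sequence of skeleton frames rather than a single one.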

3.1.1. Feature extraction

In a dyadic scenario, feature extraction aims to transform the raw information captured by sensors into feature vectors for the subsequent modeling of the human. Robots usually employ vision, audio, and/or range sensors. Table 1 summarizes the features and techniques for semantic understanding employed by several social robots. Typical tasks include human tracking and face and/or speech recognition, and scale up to action and activity recognition. It is however noticeable that social robots are not usually endowed with the ability to recognize intentions. In fact, it is not common for them to consider the activity recognition task, in the sense briefly stated in Section 2.

With respect to the features employed, they usually depend on the task to solve. We can group them into three major classes according to the temporal dimension. On the one hand, we have tasks

Fig. 3. Dyadic human–robot interaction: ARMAR-III interacting with a human teacher [26]; Loki playing with one person [6]; and the Nao robot coaching a child in a rehabilitation session [58].

Table 1
Representative perception modalities on social robots.

| Social robot | Task | Features | Algorithms for semantic understanding |
|---|---|---|---|
| PaPero | Face detection and recognition | Shape, 3D model | Template matching |
| | Speech recognition | Filter banks | Hidden Markov model |
| i-Cub | Human detection | Motion-based | Machine learning [82] |
| | Human/face tracking | Color | Hierarchical temporal memory [41] |
| | Sound localization | ITD, ILD, and notches | Active mapping [33] |
| Maggie | Emotion recognition | Voice and face expression | [3] |
| | Pose recognition | Skeleton | Template matching [28] |
| | Speech processing | | Grammar-based [4] |
| ARMAR-III | Human tracking | Haar-like, color... | Particle filters [54] |
| | Human tracking | Time-delay | Particle filters [54] |
| | Face recognition | DCT-based | Nearest neighbor |
| | Gesture recognition | Intensity, color | Neural network + hidden Markov model |
| | Head pose estimation | Intensity, shape | Neural network |
| | Sound recognition | ICA-transformed features | Hidden Markov model [75] |
| | Speech recognition | MFCC | RTN [75] |
| Loki | Face detection and tracking | Color, depth | Active appearance model |
| | Human motion capture | Skeleton | Template matching [9] |
| | Speech recognition | | CNN-BLSTM |
| | Emotion recognition | Candide model | DBN [17] |
| NaoTherapist | Human motion capture | Skeleton | Machine learning for body-part |

Fig. 4. Taxonomy of the methods and approaches covered in this survey.

such as emotion detection from facial features or the recognition of a specific verbal command. Although we can incorporate time for improving the classification results, these tasks put the emphasis on the current instant of time: an image for facial expression, or a word for verbal command recognition. Within each observation, these approaches employ static data such as the brightness or color values of images. These raw data are usually provided as input to modules that obtain feature vectors such as Local Binary Patterns (LBP) or Haar-like features. Both features have been successfully employed for face detection [83] or for gender and age estimation [52]. Other popular descriptors for characterizing static images are the scale-invariant feature transform (SIFT) or the speeded-up robust features (SURF). In audio perception, significant features are the inter-aural level difference (ILD) and the inter-aural time difference (ITD) [33]. But the most commonly used feature extraction method in automatic speech recognition is probably the Mel-Frequency Cepstral Coefficients (MFCC) [75]. In contrast to static approaches, sequential algorithms consider the scene as an ordered collection of individual observations. However, within each observation, they deal with static features. Matching these features across the sequence of images allows, for example, the tracking of human body or face parts [31]. In these approaches, the feature extraction can be supported by inner models of the human [5,6]. For instance, the Candide model has been successfully employed for human face tracking [78] or emotion recognition through the definition of action unit features [17] (Fig. 5). The tracking of the joints (head, left shoulder, center shoulder, right shoulder, left elbow, right elbow, left wrist, etc.) composing the three-dimensional (3D) representation of the human body as a skeleton is also widely used for action recognition in robots equipped with RGB-D sensors [28,58]. Both schemes show the advantages of tying together internal representation and perception. Finally, space-time approaches treat the space and time dimensions equally, and work in a 3D space. There exist 3D versions of typical image-based descriptors, such as SIFT3D [69] or SURF3D [93]. Unfortunately, they inherit from their predecessors the limitations in performance generalization [47]. Many efforts have been made to define features based on other principles: representing actions by a temporally integrated spatial response (TISR descriptor) that extracts bag-of-words features

Fig. 5. Recognizing emotions from facial features using the Candide model [17].

[99]; trajectories described using histograms of oriented gradients (HOG), histograms of optical flow (HOF) and motion boundary histograms (MBH) around interest points (iDT descriptors) [86], etc. The plethora of descriptors allows researchers to fuse them and obtain successful schemes for recognition, as we briefly describe in Section 3.1.2.
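As an illustration of the static descriptors discussed in this subsection, the following minimal sketch computes the basic 8-neighbour Local Binary Pattern and a histogram of LBP codes over a grayscale image. It assumes the image is a list of rows of intensity values; the function names and the bit ordering are illustrative choices, not those of any cited work.

```python
def lbp_code(patch):
    """Basic 8-neighbour LBP code for the centre pixel of a 3x3 patch.

    Each neighbour contributes one bit, set when the neighbour is at
    least as bright as the centre. Bits are assigned clockwise starting
    at the top-left neighbour (an arbitrary but fixed convention).
    """
    center = patch[1][1]
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

The histogram, typically computed per image cell and concatenated, is what would be passed to a classifier as the feature vector.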

3.1.2. Feature vectors classification

Feature vectors can be classified for solving tasks (see Table 1) using a large variety of approaches. Using skin color and image disparity, Nickel and Stiefelhagen [54] used a k-means clustering approach for face detection. Stiefelhagen et al. [75] proposed to solve face recognition by computing the distances between the input images and a collection of training images. A Min–Max normalization approach and a sum rule that normalize and fuse the scores are applied. Then, the face is classified according to the highest score and a predefined threshold value. However, the most popular strategy for detecting faces was the combination of Haar-like features with an AdaBoost classifier, originally proposed by Viola and Jones [83]. The approach was extended for dealing with rotated faces, and for performing face recognition using the Eigenfaces approach [73]. Another boosting approach, the so-called GentleBoost, was used for recognizing children’s emotions [63]. When input data is represented as a sequence of ordered observations, the problem is how to compare the incoming stream with the stored template. Previous approaches used dynamic time warping (DTW) or a simple matching of coefficients obtained from the activities by principal component analysis (PCA). Lin et al. [48] described the activity as a hierarchical prototype tree, which is matched to the trees in the dataset for recognition. Hidden Markov Models (HMMs) were applied for speech recognition [49]. HMMs or their extensions have also been widely applied in human activity recognition, and novel versions are still being proposed [42]. Surveys such as those by Cheng et al. [12] (for activity recognition) or Mishra et al. [51] (for facial emotion recognition) provide information about databases and approaches. New schemes are continuously proposed, and it is now possible to adopt one of these state-of-the-art algorithms in a robotics architecture and obtain good results in a short time. Closed solutions for solving human-perception tasks are widely employed [4,6].
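The comparison of an incoming stream with stored templates by dynamic time warping, mentioned above, can be sketched as follows. This is the textbook DTW recurrence on one-dimensional sequences with an absolute-difference local cost, followed by nearest-template classification; the function names and the template dictionary are hypothetical and are not taken from the cited works.

```python
def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping cost between two sequences.

    cost[i][j] holds the minimal accumulated cost of aligning the first
    i elements of seq_a with the first j elements of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat seq_b[j-1]
                                 cost[i][j - 1],      # repeat seq_a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify(stream, templates):
    """Return the label of the stored template closest to the stream."""
    return min(templates,
               key=lambda label: dtw_distance(stream, templates[label]))
```

For multi-dimensional features such as skeleton joints, `dist` could be replaced by a Euclidean distance between feature vectors; the recurrence itself is unchanged.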

3.1.3. Convolutional Neural Networks (CNNs)

Instead of defining handcrafted features and training traditional machine learning methods, another option is to learn these descriptors directly from the raw data. Deep Convolutional Neural Networks (CNNs) are currently the state-of-the-art solution for several computer vision problems such as object detection [55] and classification [27,57]. In a CNN, cells act as local filters over the input space, exploiting the strong spatially local correlation; this is the main reason behind their success in computer vision applications. Combined with multi-layered recurrent networks (long short-term memory, LSTM) used for learning temporal series, CNNs are also the state-of-the-art solution for speech recognition [95]. In general, it can be considered that CNNs and their extensions are currently the strategy of choice for dealing with the challenge of perceiving the human.
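To illustrate what it means for cells to act as local filters over the input space, the following minimal sketch applies a single small kernel to every local window of an image. It is written in plain Python for clarity; the function name is hypothetical, and a real CNN would learn many such kernels and stack them with nonlinearities.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image without padding ('valid' mode).

    As in most deep learning libraries, the kernel is not flipped, so
    this is strictly a cross-correlation. Each output value depends only
    on a small local window of the input, and the same weights are
    reused at every position.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out
```

For example, the kernel `[[-1, 1]]` responds only where the intensity increases from left to right, that is, at a vertical edge.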

In contrast to the pipeline strategy followed by most of the approaches described in Section 2, CNNs can be trained to link raw information with class labels. This end-to-end training is performed in a supervised way [35], the traditional major drawback being that good training requires a vast number of labeled training patterns [38]. Fortunately, image-based models trained using millions of labeled patterns are now readily available [39]. It has also been demonstrated that a model trained on a large dataset can be transferred to other visual recognition tasks with limited training data [21,55]. Recently, Zhang et al. [97] proposed a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences. Combined with a multi-signal CNN (MSCNN), the resulting deep evolutional spatial-temporal network effectively boosts the performance of facial expression recognition.

This last work captures the dynamic variation of facial physical structure from a sequence of images. Similarly, to be useful in recognizing human activities, the CNN needs to be extended from the two-dimensional domain of the image to the three-dimensional, spatio-temporal domain of the video sequence. The solutions for taking the temporal cue into account can be grouped into three major clusters: (i) three-dimensional (3D) CNNs; (ii) motion-based CNNs; and (iii) fusion approaches. The first cluster includes those approaches that perform 3D convolutions on the video sequence. The second one groups the methods that adopt the scene information related to motion as an input for the CNN. The third cluster proposes to fuse the information in temporal domains. These approaches are complementary and it is typical that CNN-based approaches merge techniques from different clusters for activity recognition.

The best results are typically provided by those approaches that adopt the two-stream model [71]. Basically, the idea is to characterize the sequence of images using two different convolutional network (ConvNet) streams: a temporal stream of motion-based features and a spatial stream of appearance-based features. Fig. 6 provides a graphical illustration of the two-stream proposal by Wang et al. [90]. As Fig. 6 shows, a fusion process combines the obtained results and delivers the final decision. Wang et al. [90] proposed a temporal segment network (TSN) to recognize actions. The approach consists of three steps. First, the input video is divided into K segments and a short portion (fragment) is randomly selected from each segment. Second, the class scores of the different fragments are fused by the segmental consensus function to yield a video-level prediction. Third, the predictions from the spatial and temporal streams are fused to produce the final prediction. The second step of this scheme was modified in the sequential segment network (SSN) [11]. The aim is to concatenate the outputs of the different segment portions as the video-level representation. This representation is fed into the fully-connected layer. Feichtenhofer et al. [24,25] proposed to generalize residual networks (ResNets) to the spatio-temporal domain by introducing residual connections within the two-stream model. Specifically, Feichtenhofer et al. [24] injected residual connections between the appearance and temporal streams. Moreover, they transformed pre-trained image ConvNets into spatio-temporal networks by equipping them with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. Feichtenhofer et al. [25] fused the two streams by motion gating and injected identity mapping kernels as temporal filters to learn long-term temporal information.

Fig. 6. Temporal segment network [90].

Table 2
Activity recognition results on UCF101 and HMDB51 databases.

| Approach | CNN scheme | Features | UCF101 – mAP (%) | HMDB51 – mAP (%) |
|---|---|---|---|---|
| Wang et al. [86] | – | iDT | 85.9 | 57.2 |
| Wang and Schmid [87] | – | iDT | 87.9 | 61.1 |
| Wang et al. [90] | BN-Inception | CNN | 94.2 | 69.4 |
| Chen and Zhang [11] | BN-Inception | CNN | 94.8 | 73.8 |
| Feichtenhofer et al. [24] | ST-ResNet | CNN | 93.4 | 66.4 |
| | | CNN + iDT | 94.6 | 70.3 |
| Feichtenhofer et al. [25] | ResNet-50 | CNN | 94.2 | 68.9 |
| | | CNN + iDT | 94.9 | 72.2 |
| Wang et al. [91] | BN-Inception | CNN | 94.6 | 68.9 |
| Duta et al. [22] | VGG-16, VGG-19 | CNN | 93.6 | 69.5 |
| | | CNN + HMG | 94.0 | 70.3 |
| | | CNN + HMG + iDT | 94.3 | 73.1 |
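The three steps of the temporal segment network described in the text (segment-wise sampling, segmental consensus, and stream fusion) can be sketched as follows. This is only an illustration under stated assumptions: the per-fragment class scores are taken as given (in TSN they come from the ConvNets), averaging is assumed as the consensus function, and the stream weights are hypothetical values, not those reported in [90].

```python
import random


def sample_fragments(num_frames, k, rng=random):
    """Step 1: split the frame range into k segments and draw one
    frame index at random from each segment."""
    if k <= 0 or num_frames < k:
        raise ValueError("need at least one frame per segment")
    bounds = [i * num_frames // k for i in range(k + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(k)]


def segmental_consensus(fragment_scores):
    """Step 2: fuse per-fragment class scores into one video-level
    score vector (averaging is an assumed choice of consensus)."""
    k = len(fragment_scores)
    return [sum(col) / k for col in zip(*fragment_scores)]


def fuse_streams(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """Step 3: weighted sum of the two streams (weights are assumed)."""
    return [w_spatial * s + w_temporal * t
            for s, t in zip(spatial, temporal)]


def predict(spatial_fragments, temporal_fragments):
    """Index of the class with the highest fused score."""
    fused = fuse_streams(segmental_consensus(spatial_fragments),
                         segmental_consensus(temporal_fragments))
    return max(range(len(fused)), key=fused.__getitem__)
```

Because the consensus operates on scores from sparsely sampled fragments, the cost of a prediction depends on K rather than on the length of the video.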

Wang et al. [91] provided a pyramid two-stream model for merging the spatial and temporal information. The goal is to make both streams reinforce each other. Duta et al. [22] added a third, spatio-temporal stream, built with the C3D architecture [77], to the spatial and temporal streams. The Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF) is proposed to build an action representation over the entire video. Table 2 shows the classification accuracy of these approaches on the UCF101 and HMDB51 databases.

The UCF101 database consists of 13,320 videos with 101 action classes [72]. It is characterized by large diversity in terms of variations in background, camera motion, illumination and viewpoint, as well as object scale, appearance or pose. The HMDB51 dataset consists of 6766 videos [44]. It has a smaller repertoire of classes (51 action classes), but it is typically considered more challenging than UCF101 due to the even wider variations in which actions are performed [24]. Both datasets provide an evaluation protocol. The evaluation metric is the mean Average Precision (mAP) [23].
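For readers unfamiliar with the metric, the following sketch shows one common, non-interpolated way of computing Average Precision for a class and averaging it over classes. It is an assumption that this matches the protocol in [23] in every detail; the function names are illustrative.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    Samples are ranked by decreasing score; AP is the mean of the
    precision values measured at the rank of each positive sample.
    """
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("no positive samples for this class")
    return sum(precisions) / len(precisions)


def mean_average_precision(per_class):
    """Mean of the per-class APs; per_class is a list of
    (scores, labels) pairs, one pair per class."""
    aps = [average_precision(s, l) for s, l in per_class]
    return sum(aps) / len(aps)
```

With scores `[0.9, 0.8, 0.7]` and labels `[1, 0, 1]`, the positives sit at ranks 1 and 3, so the AP is (1/1 + 2/3) / 2.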

3.2. Modeling a group of people

Understanding the activities and social interactions in a group of people is a challenging topic that is starting to gain increasing attention from researchers. Several works seek to determine social networks from appearance- and motion-based parameters characterizing the people in the scene. For instance, Yu et al. [96] estimated the social network encoding the interactions among people by combining face recognition and motion similarities between tracks of people on the ground plane. The association problem of mapping faces and tracks was solved using a novel graph-cut based algorithm. In the proposal by Ding and Yilmaz [20], this social network was extracted from the video sequence by analyzing the relationships among visual concepts. A probabilistic graphical model (PGM) with temporal smoothing was employed for analyzing social relations among actors and for detecting communities. The approach assumes that the relations remain constant throughout the video sequence. RoleNet is a model for describing social relationships within a group of people [92]. It is built as a weighted graph, where nodes are people, arcs represent relationships, and a third set of weights encodes the strength of the arcs (relationships). Using co-occurrence matrices and recognizing people by face recognition, the social interaction is driven by the actors and not by audiovisual features. The method determines roles (leading roles and supporting roles) and divides up the sequence into scenes according to the context of roles [92]. As a major disadvantage, none of these approaches extrapolates generic social events or situations (birthday, wedding...) from one video sequence to another. The grouping of the people is local to each sequence, and social roles within an event (e.g. priest, groom, bride...) are not recognized. Some authors have addressed the problem of detecting groups of interacting people using the concept of F-formations [40,50]. F-formations are defined as geometric arrangements encoding the position and orientation information of the people standing in the formation (Fig. 7). These F-formations can be inferred from body poses and/or head orientations. Vascon et al. [81] associated each person with a frustum, which was computed from the position and orientation information. They designed a game-theoretic framework in which the concept of the F-formation was embedded, together with the biological constraints of social attention. Orientation was the main cue for Ricci et al. [60]. A joint learning approach was suggested for estimating the pose and F-formation for groups of people. Zhang and Hung [98] also employed the frustum of attention but, contrary to Vascon et al. [81], they used this frustum to obtain features from people. These features labeled people as associates, singletons and members of F-formations. Using the Group Interaction Zone (GIZ), Cho et al. [15] also addressed the problem of detecting meaningful groups by


Fig. 7. Two-people formations [40] and a three-people formation [61] .


Fig. 8. Sample frames from a ‘wedding’ event from two films with manual role

annotations.

Table 3

Group recognition results on NUS-HGA and BEHAVE databases.

Approach                     NUS-HGA accuracy (%)   BEHAVE accuracy (%)
Cheng et al. [13]            96.20                  92.93
Cho et al. [15]              96.03                  93.74
Al-Raziqi and Denzler [2]    81.94                  79.35
Zhuang et al. [100]          99.25                  94.63


modeling proxemics. They described the group activity in a GIZ using attraction and repulsion properties, which characterize an interaction in terms of “getting close”, “away”, and “keeping the same distance together”.
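The three interaction terms quoted above can be illustrated with a small sketch that labels a pair of people from the change in their mutual distance. This is not the formulation of Cho et al. [15]: the comparison between first and last frame, the tolerance `tol` and the function name are assumptions made only for illustration.

```python
import math


def pairwise_trend(track_a, track_b, tol=0.1):
    """Label how the distance between two people evolves.

    track_a, track_b: lists of (x, y) ground-plane positions, one per frame.
    tol: assumed tolerance (same units as the positions) below which the
         distance is considered unchanged.
    """
    d_start = math.dist(track_a[0], track_b[0])
    d_end = math.dist(track_a[-1], track_b[-1])
    change = d_end - d_start
    if change < -tol:
        return "getting close"  # attraction
    if change > tol:
        return "away"  # repulsion
    return "keeping the same distance together"


# Two people walking towards each other along the x axis.
print(pairwise_trend([(0, 0), (1, 0)], [(4, 0), (3, 0)]))
```

A real system would smooth the tracks and evaluate the trend over a sliding window rather than over two frames.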

Other works try to capture social interactions to help in the recognition of joint activities. Facial features were modeled for recognizing activities such as hand-shaking [56]. The relation history image (RHI) descriptor was proposed by Gori et al. [30] for discriminating activities and interactions that happen at the same time. The RHI is built as the temporal variation of relational information between every pair of local subparts belonging to one person or to a pair of people. Choi and Savarese [16] proposed a model that unifies the tracking of multiple people, the recognition of individual actions, and the identification of the interactions and collective activities. It assumes that there is a strong correlation between the individual activity of each person and the activities of the other people. Cheng et al. [13] proposed a layered model. They first extracted various motion and appearance features from the video and trajectory data. Features were then randomly sampled from the training features to generate codebooks of visual words using K-means clustering. All features were quantized by assigning them to their nearest visual words under the Euclidean distance. The resulting normalized histograms of visual word occurrences formed the final representations, one feature type per group action instance. A multi-class Support Vector Machine (SVM) was used to build the classifier and make the recognition decisions. Al-Raziqi and Denzler [2] proposed to divide up the video sequence into clips using an unsupervised clustering approach. Within the clips, significant groups of objects were detected using bottom-up hierarchical clustering and then tracked over time. Furthermore, the mutual effect between objects was computed based on motion and appearance features. Finally, the Hierarchical Dirichlet Process (HDP) was employed to cluster the clips.
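The codebook step described for the layered model of Cheng et al. [13] follows the general bag-of-visual-words recipe, which can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: features are plain tuples of floats, K-means is run for a fixed number of iterations, and the function names are hypothetical. The final multi-class SVM is not included.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to point under the Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain K-means; returns k centers (the visual words)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers


def bovw_histogram(features, codebook):
    """Quantize features to their nearest visual word and return a normalized histogram."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy 2-D "features" forming two well-separated groups.
training = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codebook = kmeans(training, k=2)
print(bovw_histogram([(0.2, 0.1), (9.5, 10.5), (10.5, 9.5)], codebook))
```

In the layered model one such histogram would be built per feature type, and the histograms would then be passed to the classifier.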

The recognition of social roles and its importance for predicting group activities has been explored by Ramanathan et al. [59] and Lan et al. [46]. The aim is to identify events and roles, being able to label people (Fig. 8). The first proposal addressed the identification of social roles in a weakly supervised framework, whereas the second one works in a fully supervised scenario. Ramanathan et al. [59] tackled the problem from the perspective of recognizing social roles, which emerge from the interactions among people and between people and objects. They proposed to model the inter-role interactions using a Conditional Random Field (CRF) under a weakly supervised setting. Unary component representations included HOG3D, spatio-temporal features, object interaction features (restricted to two objects per event) and social role features (clothing and gender of the person). These features were refined in a subsequent layer consisting of pairwise spatio-temporal interaction features. The parameters of the CRF-based model and the role labels were learned by adapting a joint variational inference procedure. Focused on group activities, a hierarchical classifier was proposed by Lan et al. [46]. Using an undirected graphical model, the hierarchy encoded individual actions, role-based unary components, pairwise roles, and group activities. Thus, at a low level, the classifier recognizes single activities. At a mid level, it infers social roles. The parameters of the model are learned using a structured support vector machine (SVM). It works under a completely supervised setting.
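To make the combination of unary and pairwise terms concrete, the sketch below scores every joint assignment of roles to people and keeps the best one. It is a toy illustration only: the scores are invented, inference is done by exhaustive enumeration rather than by the variational or structured-SVM procedures used in [59] and [46], and all names are hypothetical.

```python
from itertools import product


def best_role_assignment(people, roles, unary, pairwise):
    """Return the role tuple (one role per person) with the highest total score.

    unary[person][role]: how well the person's own features fit the role.
    pairwise[(role_a, role_b)]: compatibility of two roles co-occurring,
        keyed by the alphabetically sorted pair; missing pairs score 0.
    """
    def score(assignment):
        total = sum(unary[p][r] for p, r in zip(people, assignment))
        for i in range(len(people)):
            for j in range(i + 1, len(people)):
                key = tuple(sorted((assignment[i], assignment[j])))
                total += pairwise.get(key, 0.0)
        return total

    # Exhaustive search: feasible only for a handful of people and roles.
    return max(product(roles, repeat=len(people)), key=score)


people = ["p1", "p2", "p3"]
roles = ["bride", "groom", "guest"]
unary = {
    "p1": {"bride": 2.0, "groom": 0.5, "guest": 1.0},
    "p2": {"bride": 1.9, "groom": 1.5, "guest": 1.0},
    "p3": {"bride": 0.2, "groom": 0.3, "guest": 1.0},
}
pairwise = {("bride", "bride"): -5.0, ("groom", "groom"): -5.0, ("bride", "groom"): 1.0}
print(best_role_assignment(people, roles, unary, pairwise))
```

With the unary scores alone, both p1 and p2 would be labeled as bride; the pairwise terms penalize that duplicate and move p2 to groom.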

3.2.1. Convolutional Neural Networks (CNNs)

Similarly to the approaches described in Section 3.1.3, there are proposals that deal with the problem of recognizing the activity of a group of people by using a layered model where both motion and appearance information are employed. For instance, Zhuang et al. [100] proposed the Differential Recurrent Convolutional Neural Networks (DRCNN). As Fig. 9 shows, the DRCNN combines convolutional, max-pooling and fully-connected layers with differential long short-term memory (DLSTM) networks and a soft-max layer. Contrary to Cheng et al. [13] and Cho et al. [15], this method does not need the previous detection of the people in the images. For assessing the performance of approaches for group activity recognition, two popular public video datasets are used: BEHAVE and NUS-HGA. The NUS-HGA dataset consists of 476 video clips, which cover six group activity classes (Fight, Gather, Ignore, RunInGroup, StandTalk and WalkInGroup). The BEHAVE dataset consists of 7 long video sequences. As these video sequences include different classes of group activities, video clips containing group activity instances have been extracted from the sequences. These video clips cover ten group activity classes, but it is typical to use only six of them (Approach, Fighting, InGroup, RunTogether, Split, and WalkTogether), because the rest only contain a few short sequences. Table 3 shows the group recognition results provided by several approaches on these datasets.

Other approaches represent activities and interactions within a hierarchical representation. Taking into consideration scene classification and group activity recognition, Deng et al. [19] proposed a hierarchical model that predicts scores for individual actions, obtained from bounding boxes around each person, and the group


Fig. 9. Differential Recurrent Convolutional Neural Networks [100] .

Fig. 10. Overview of the software architecture for human perception proposed by

Lallée et al. [45] .


activity, from the whole scene. The obtained labels were refined by applying a belief propagation-like neural network. The dependencies between individual actions and the group activity are taken into account in the network. The model learns the message passing parameters and performs inference and learning in a unified framework using back-propagation. While this approach uses neural network-based graphical representations, Ibrahim et al. [36] leveraged LSTM-based temporal modeling to learn discriminative information from time-varying sports activity data.

4. Internalizing the information

The isolated feature descriptors provided by individual perceptual units can be integrated into a whole view of the scene by internalizing all this information into a single representation. This scheme has been widely employed in robotics, especially when robots are expected to deploy cognitive functionalities. If cognition is the ability that allows us to internally deal with the information about ourselves and the external world, this ability is subject to the existence of an internal active representation handling all this information. For instance, Fig. 10 shows an overview of the architecture proposed by Lallée et al. [45] for the iCub robot. In the figure, we can note the presence of a module for storing the spatial knowledge of the scene, which receives inputs from the 3D perception module. The presence of the first module, in the ‘Platform independent’ part of the software architecture, allows the system to decouple sensing and perception. This module is a geometric memory in Lallée et al. [45], the so-called EgoSphere. In the proposal by Romero-Garcés et al. [62], the knowledge is stored in a graphical representation that merges symbolic and metric information.
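A graph that merges symbolic and metric information can be pictured with the following minimal sketch. It is not the data structure of [62] or [45]: the class, its methods, the edge labels and the (x, y, theta) pose convention are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy internal representation: symbolic labels plus optional metric poses."""
    nodes: dict = field(default_factory=dict)  # node name -> symbolic type
    edges: dict = field(default_factory=dict)  # (src, dst) -> {"label", "pose"}

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, src, dst, label, pose=None):
        # pose: assumed (x, y, theta) of dst expressed in the frame of src.
        self.edges[(src, dst)] = {"label": label, "pose": pose}

    def neighbours(self, src, label):
        """Symbolic query: nodes linked from src by an edge with this label."""
        return [d for (s, d), e in self.edges.items() if s == src and e["label"] == label]

    def pose(self, src, dst):
        """Metric query: stored pose of dst relative to src, if any."""
        edge = self.edges.get((src, dst))
        return edge["pose"] if edge else None


world = SceneGraph()
world.add_node("robot", "robot")
world.add_node("anna", "person")
world.add_node("cup", "object")
world.add_edge("robot", "anna", "sees", pose=(1.5, 0.2, 3.14))
world.add_edge("anna", "cup", "holds")
print(world.neighbours("robot", "sees"), world.pose("robot", "anna"))
```

Keeping both kinds of information on the same edges lets perception modules update poses while reasoning modules query the symbolic relations.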

The use of an internal representation can be a good solution for encoding the complexity of a scene populated by several people. As shown in previous sections, rich semantic relations are important for understanding these events. If these relations can be useful for understanding the activities of an individual, building relationships among the people sharing a common task will be basic for recognizing group activity. Ramanathan et al. [59] encode relationships among people but also between people and objects. Some of the state-of-the-art approaches presented above successfully label the perceived sequence of images, but they are unable to provide fine details about the individual role or activity of each person in the scene. Hierarchical approaches recognize the activities of each individual person and of the group of people, but they typically do not encode all the richness of the interactions. Graphical models emerge as a solution to encode components of visual appearance and their relations and interactions [6]. Chen et al. [10] combined graphical models and deep neural networks, feeding the outcomes of the final layer of a deep network to a CRF model. Schwing and Urtasun [68] designed an iterative process for training a CRF model and, expanding this approach, Deng et al. [18] used an iterative approach that employs the actions of other people in the scene to disambiguate the action of each individual. They accomplished this with a recurrent neural network, refined by repeatedly passing messages with estimates of each individual person's action. The inner representation of the scene is configurable, using trainable gating functions for turning on and off arcs between individual people in the scene.
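The idea of refining each person's action estimate with gated messages from the other people can be sketched as below. This is a simplified illustration, not the network of Deng et al. [18]: the gates and the compatibility matrix are fixed numbers here, whereas in the cited work they are learned, and the function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def refine_actions(unary, gates, compat, steps=3):
    """Iteratively refine per-person action beliefs.

    unary[i][a]: score of action a for person i from that person's own features.
    gates[i][j]: value in [0, 1]; 0 switches off the arc from person j to person i.
    compat[a][b]: how much action b on a neighbour supports action a.
    """
    n, k = len(unary), len(unary[0])
    beliefs = [softmax(u) for u in unary]
    for _ in range(steps):
        updated = []
        for i in range(n):
            scores = list(unary[i])
            for j in range(n):
                if i == j:
                    continue
                for a in range(k):
                    # Message from j to i, weighted by the gate on that arc.
                    message = sum(compat[a][b] * beliefs[j][b] for b in range(k))
                    scores[a] += gates[i][j] * message
            updated.append(softmax(scores))
        beliefs = updated
    return beliefs


# Actions: index 0 = walking, index 1 = talking. Person 2 is ambiguous on its own.
unary = [[0.0, 3.0], [0.0, 3.0], [0.0, 0.0]]
compat = [[2.0, 0.0], [0.0, 2.0]]  # neighbours tend to share the same action
all_on = [[1.0] * 3 for _ in range(3)]
print(refine_actions(unary, all_on, compat)[2])
```

Setting the gates of a person to zero leaves that person's belief at the softmax of its own unary scores, which is the effect of turning the corresponding arcs off.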

5. Discussion

The previous sections review the state-of-the-art approaches for perceiving people and their social interactions. Although the advances in accuracy are remarkable, some doubts appear when these algorithms must be transferred to robotics. One of the major difficulties is related to the response time of these algorithms. The robots illustrated in Fig. 3 need to interact with a person at human interaction rates. The hardware and software complexities underlying some of the architectures in this review are considerable, and their integration within a robot would increase its price. This is a significant issue: how much would a social robot cost? As Blackman [7] pointed out for care robots, there is a serious lack of robust evidence of cost-effectiveness. Even if we solve the technological challenge of endowing a robot with the abilities for understanding


Fig. 11. Samples from a video sequence capturing a ‘Handshaking with the observer’: (top) from a 3rd-person viewpoint; and (down) from a first-person viewpoint.


our activities and intentions, it will be difficult to bridge the gap between the research or academic domain and the market shelf. Another significant problem is that most of the approaches focus on recognition from a 3rd-person perspective (i.e., viewpoint). In these cases, the camera is typically far away from people, and the algorithms recognize what people are doing to each other without getting involved in the activities (e.g., two people walking together). This paradigm is insufficient when the observer itself is involved in interactions [65].

5.1. Networked robotics: the strength of being part of an ecology

For addressing both problems, recent proposals suggest embedding intelligent networked robotic devices in our everyday environments (homes, offices, public buildings...). Similar to ubiquitous computing, the robot is now one element within an ecology of connected devices. In fact, extending the definition of robot to any embedded device with computing, communication, and sensing or actuation capabilities [8], we can refer to this as an ‘ecology of robots’. Within these approaches, the perceptual and social abilities of each robot are augmented by adding the ones provided by the rest of the robots. Each robot is in charge of solving a specific task, and human activity understanding can be addressed using wearable sensors [70] or external cameras that provide the usual 3rd-person perspective. Moreover, the robot can share the acquired knowledge by uploading it to a distributed database [85].

5.2. Approaches for first-person activity recognition

First-person cameras or microphones are the appropriate input devices for providing researchers with the information needed to endow the robot with the situation awareness described at the end of Section 1. In this egocentric scenario, the observer wearing the camera is involved in the ongoing activities. It must be noted that these videos display very different visual properties when compared to video captured from a conventional, 3rd-person viewpoint. As an example, Fig. 11 shows some samples of the task ‘Handshaking with the observer’ captured from the two viewpoints.

The research area of first-person activity recognition or scene understanding has been gaining an increasing amount of attention in recent years. There are works on the recognition of activities of daily living [53], early recognition [64], etc., and it is expected that new datasets and approaches will appear in the coming years.

6. Conclusions

This review provides a summary of approaches that have been applied to characterize and recognize the behaviors of an individual or a group of people. Specifically, the understanding of the interaction with a group of people has received significant attention from the research community in recent years. Similarly, a large set of concepts and different approaches have emerged recently. This paper summarizes some of these advances for modeling the social setting in which the robot is involved and for extracting the relevant information during the interaction. Convolutional and recurrent neural networks (CNN and LSTM) represent promising techniques for the detection and classification tasks in the interaction of a social robot. As discussed above, these techniques require a vast number of labeled training patterns, but this is not a problem due to the availability of large labeled datasets and trained networks. These approaches have shown impressive results in the recognition of human activity in the field of computer vision. While these results are a significant achievement, researchers still have many challenges to deal with, such as reproducing the achieved recognition rates in egocentric videos, dealing with noise due to the dynamics associated with the robot’s motion, etc. The whole problem should be approached from the robotics point of view, and the algorithms should work with low memory and short computation times. The development of new methods on Application Specific Integrated Circuits (ASIC) or Field Programmable Gate Arrays (FPGA) was recently discussed in [88]. A transversal effort will require joint expertise in embedded vision together with traditional teams of robotics, software engineering and computer vision researchers. Furthermore, the work on activity recognition should be extended to deal with early recognition, where pre-activity observations and context awareness are basic concepts.

Acknowledgments

The research work of A. Bandera, R. Vazquez-Martin and L.V. Calderita within this scope has been partially funded by the EU ECHORD++ project (FP7-ICT-601116) and the TIN2015-65686-C5-1-R (Gobierno de España and FEDER funds).


References

[1] J. Aggarwal, M. Ryoo, Human activity analysis: a survey, ACM Comput. Surv. 43 (2011) 1–43.
[2] A. Al-Raziqi, J. Denzler, Unsupervised Group Activity Detection by Hierarchical Dirichlet Processes, Springer International Publishing, Cham, pp. 399–407. doi: 10.1007/978-3-319-59876-5_44.
[3] F. Alonso-Martín, M. Malfaz, J. Sequeira, J. Gorostiza, M. Salichs, A multimodal emotion detection system during human-robot interaction, Sensors (Basel) 13 (11) (2013) 15549–15581, doi: 10.3390/s131115549.
[4] F. Alonso-Martín, M.A. Salichs, Integration of a voice recognition system in a social robot, Cybern. Syst. 42 (4) (2011) 215–245, doi: 10.1080/01969722.2011.583593.
[5] A. Aly, A. Tapus, Multimodal adapted robot behavior synthesis within a narrative human-robot interaction, in: Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 2986–2993, doi: 10.1109/IROS.2015.7353789.
[6] A. Bandera, P. Bustos, Toward the development of cognitive robots, in: L. Grandinetti, T. Lippert, N. Petkov (Eds.), Proceedings of the International Workshop on Brain-Inspired Computing, BrainComp 2013, Springer International Publishing, Cham, 2014, pp. 88–99, doi: 10.1007/978-3-319-12084-3_8. Cetraro, Italy.
[7] T. Blackman, Care robots for the supermarket shelf: a product gap in assistive technologies, Ageing Soc. 33 (5) (2013) 763–781.
[8] M. Bordignon, M.J. Rashid, M. Broxvall, A. Saffiotti, Seamless integration of robots and tiny embedded devices in a PIES-Ecology, in: Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sheraton Hotel and Marina, San Diego, California, USA, 2007, pp. 3101–3106. October 29–November 2, 2007. doi: 10.1109/IROS.2007.4399282.
[9] L.V. Calderita, J.P. Bandera, P. Bustos, A. Skiadopoulos, Model-based reinforcement of kinect depth data for human motion capture applications, Sensors 13 (7) (2013) 8835–8855, doi: 10.3390/s130708835.
[10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, CoRR (2014).
[11] Q. Chen, Y. Zhang, Sequential segment networks for action recognition, IEEE Signal Process. Lett. 24 (5) (2017) 712–716.
[12] G. Cheng, Y. Wan, A.N. Saudagar, K. Namuduri, B.P. Buckles, Advances in human action recognition: a survey, CoRR (2015). abs/1501.05964.
[13] Z. Cheng, L. Qin, Q. Huang, S. Yan, Q. Tian, Recognizing human group action by layered model with multiple cues, Neurocomputing 136 (2014) 124–135.
[14] P. Chevalier, J. Martin, B. Isableu, C. Bazile, A. Tapus, Impact of sensory preferences of children with ASD on imitation with a robot, in: Proceedings of the 2017 IEEE International Conference on Human–Robot Interaction (HRI), 2017, doi: 10.1145/2909824.3020234.
[15] N.-G. Cho, Y.-J. Kim, U. Park, J.-S. Park, S.-W. Lee, Group activity recognition with group interaction zone based on relative distance between human objects, Int. J. Pattern Recognit. Artif. Intell. 29 (5) (2015) 1555007.
[16] W. Choi, S. Savarese, A unified framework for multi-target tracking and collective activity recognition, in: Proceedings of the 2012 European Conference on Computer Vision (ECCV), 2012, pp. 215–230.
[17] F. Cid, J. Moreno, P. Bustos, P. Núñez, Muecas: a multi-sensor robotic head for affective human robot interaction and imitation, Sensors 14 (5) (2014) 7711–7737, doi: 10.3390/s140507711.
[18] Z. Deng, A. Vahdat, H. Hu, G. Mori, Structure inference machines: recurrent neural networks for analyzing relations in group activity recognition, in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 4772–4781. June 27–30, 2016. doi: 10.1109/CVPR.2016.516.
[19] Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan, M. Roshtkhari, G. Mori, Deep structured models for group activity recognition, in: Proceedings of the 2015 British Machine Vision Conference (BMVC), 2015.
[20] L. Ding, A. Yilmaz, Inferring social relations from visual concepts, in: Proceedings of the 2011 International Conference on Computer Vision, 2011, pp. 699–706, doi: 10.1109/ICCV.2011.6126306.
[21] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: a deep convolutional activation feature for generic visual recognition, in: Proceedings of the 2015 International Conference on Machine Learning (ICML), 32, 2014, pp. 1–9.
[22] I. Duta, B. Ionescu, K. Aizawa, N. Sebe, Spatio-temporal vector of locally max pooled features for action recognition in videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[23] M. Everingham, L. Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338, doi: 10.1007/s11263-009-0275-4.
[24] C. Feichtenhofer, A. Pinz, R. Wildes, Spatiotemporal residual networks for video action recognition, in: Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2016.
[25] C. Feichtenhofer, A. Pinz, R. Wildes, Spatiotemporal multiplier networks for video action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] A. Gams, T. Petric, M. Do, B. Nemec, J. Morimoto, T. Asfour, A. Ude, Adaptation and coaching of periodic motion primitives through physical and visual interaction, Robot. Auton. Syst. 75 (2016) 340–351.
[27] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580–587.

[28] V. Gonzalez-Pacheco, M. Malfaz, F. Fernandez, M.A. Salichs, Teaching human poses interactively to a social robot, Sensors 13 (9) (2013) 12406–12430, doi: 10.3390/s130912406.
[29] M. Goodrich, A. Schultz, Human-robot interaction: a survey, Found. Trends Hum.-Comput. Interact. 1 (2007) 203–275.
[30] I. Gori, J. Aggarwal, L. Matthies, M. Ryoo, Multitype activity recognition in robot-centric scenarios, IEEE Robot. Autom. Lett. 1 (1) (2016) 593–600.
[31] A.M. Gupta, B.S. Garg, C.S. Kumar, D.L. Behera, An on-line visual human tracking algorithm using surf-based dynamic object model, in: Proceedings of the 2013 IEEE International Conference on Image Processing, 2013, pp. 3875–3879, doi: 10.1109/ICIP.2013.6738798.
[32] B. Hommel, J. Müsseler, G. Aschersleben, W. Prinz, The theory of event coding (TEC): a framework for perception and action planning, Behav. Brain Sci. 24 (5) (2001) 849–937.
[33] J. Hornstein, M. Lopes, J. Santos-Victor, F. Lacerda, Sound localization for humanoid robots - building audio-motor maps based on the HRTF, in: Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 1170–1176, doi: 10.1109/IROS.2006.281849.
[34] N. Hu, G. Englebienne, Z. Lou, B. Krose, Learning latent structure for activity recognition, in: Proceedings of the IEEE Conference Robotics and Automaton (ICRA), 2014, pp. 1048–1053.
[35] F. Husain, B. Dellen, C. Torras, Action recognition based on efficient deep feature learning in the spatio-temporal domain, IEEE Robot. Autom. Lett. 1 (2) (2016) 984–991.
[36] M.S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, G. Mori, A hierarchical deep temporal model for group activity recognition, in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1971–1980, doi: 10.1109/CVPR.2016.217.
[37] A. Ijspeert, J. Nakanishi, P. Pastor, H. Hoffmann, S. Schaal, Dynamical movement primitives: learning attractor models for motor behaviors, Neural Comput. 25 (2) (2013) 328–373.
[38] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, arXiv: 1406.2227 (2014).
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, arXiv: 1408.5093 (2014).
[40] A. Kendon, Conducting Interaction: Patterns of Behavior in Focused Encounters, Studies in Interactional Socio, Cambridge University Press, 1990.
[41] M. Kirtay, E. Falotico, A. Ambrosano, U. Albanese, L. Vannucci, C. Laschi, Visual Target Sequence Prediction via Hierarchical Temporal Memory Implemented on the iCub Robot, Springer International Publishing, Cham, pp. 119–130. doi: 10.1007/978-3-319-42417-0_12.
[42] M.H. Kolekar, D.P. Dash, Hidden Markov model based human activity recognition using shape and optical flow based features, in: Proceedings of the 2016 IEEE Region 10 Conference (TENCON), 2016, pp. 393–397, doi: 10.1109/TENCON.2016.7848028.
[43] H. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, Int. J. Robot. Res. 32 (8) (2013) 951–970.
[44] H. Kuhne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: A large video database for human motion recognition, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
[45] S. Lallée, S. Lemaignan, A. Lenz, C. Melhuish, L. Natale, S. Skachek, T. van der Zant, F. Warneken, P.F. Dominey, Towards a platform-independent cooperative human–robot interaction system: I. Perception, in: Proceedings of the International Conference on Intelligent Robots and Systems (IROS), IEEE, 2010, pp. 4444–4451.
[46] T. Lan, L. Sigal, G. Mori, Social roles in hierarchical models for human activity recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1354–1361.
[47] Q. Le, W. Zou, S. Yeung, A. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3361–3368.
[48] Z. Lin, Z. Jiang, L. Davis, Recognizing actions by shape-motion prototype trees, in: Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 444–451.
[49] C.Y. Liu, T.H. Hung, K.C. Cheng, T.H.S. Li, HMM and BPNN based speech recognition system for home service robot, in: Proceedings of the 2013 International Conference on Advanced Robotics and Intelligent Systems, 2013, pp. 38–43, doi: 10.1109/ARIS.2013.6573531.
[50] P. Marshall, Y. Rogers, N. Pantidi, Using F-formations to analyse spatial patterns of interaction in physical environments, in: Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW ’11, ACM, New York, NY, USA, 2011, pp. 445–454, doi: 10.1145/1958824.1958893.
[51] B. Mishra, S.L. Fernandes, K. Abhishek, A. Alva, C. Shetty, C.V. Ajila, D. Shetty, H. Rao, P. Shetty, Facial expression recognition using feature based techniques and model based techniques: A survey, in: Proceedings of the Second International Conference on Electronics and Communication Systems (ICECS), 2015, pp. 589–594, doi: 10.1109/ECS.2015.7124976.
[52] D. Nguyen, S. Cho, K. Shin, J. Bang, K. Park, Comparative study of human age estimation with or without preclassification of gender and facial expression, Sci. World J. 2014 (2014) 905269, doi: 10.1155/2014/905269. 15 pages

[53] T.-H.-C. Nguyen, J.-C. Nebel, F. Florez-Revuelta, Recognition of activities of

A. Tapus et al. / Pattern Recognition Letters 118 (2019) 3–13 13

[

daily living with egocentric vision: a review, Sensors 16 (1) (2016), doi: 10. 3390/s16010072 .

[54] K. Nickel , R. Stiefelhagen , Visual recognition of pointing gestures for hu- man–robot interaction, Image Vis. Comput. 25 (12) (2007) 1875–1884 .

[55] M. Oquab , L. Bottou , I. Laptev , J. Sivic , Learning and transferring mid-level image representations using convolutional neural networks, in: Proceedings

of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1717–1724 .

[56] A. Patron-Perez , M. Marszaek , A. Zisserman , I.D. Reid , High five: Recognising

human interactions in tv shows, in: Proceedings of the British Machine Vision Conference, 2010 .

[57] P. Pinheiro , R. Collobert , Recurrent convolutional neural networks for scene labeling, in: Proceedings of the Thirty-First International Conference on Ma-

chine Learning (ICML), 2014, pp. 82–90 . [58] J.C. Pulido, J.C. González, C. Suárez-Mejías, A. Bandera, P. Bustos, F. Fernán-

dez, Evaluating the child–robot interaction of the NAO therapist platform in

pediatric rehabilitation, Int. J. Soc. Robot. 9 (3) (2017) 343–358, doi: 10.1007/ s12369- 017- 0402- 2 .

[59] V. Ramanathan, B. Yao, L. Fei-Fei, Social Role Recognition for Human Event Understanding, Springer International Publishing, Cham, pp. 75–93, doi: 10.1007/978-3-319-05491-9_4.

[60] E. Ricci, J. Varadarajan, R. Subramanian, S. Rota Bulo, N. Ahuja, O. Lanz, Uncovering interactions and interactors: joint estimation of head, body orientation and f-formations from surveillance videos, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4660–4668.

[61] J. Rios-Martinez, A. Spalanzani, C. Laugier, From proxemics theory to socially-aware navigation: a survey, Int. J. Soc. Robot. 7 (2015) 137–153, doi: 10.1007/s12369-014-0251-1.

[62] A. Romero-Garcés, L. Calderita, J. Martínez-Gómez, J. Bandera, R. Marfil, L. Manso, A. Bandera, P. Bustos, Testing a fully autonomous robotic salesman in real scenarios, in: Proceedings of the International Conference on Autonomous Robot Systems and Competitions (ICARSC 2017), IEEE, 2015, pp. 124–130.

[63] P. Ruvolo, I. Fasel, J. Movellan, Auditory mood detection for social and educational robots, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2008, pp. 3551–3556.

[64] M.S. Ryoo, T.J. Fuchs, L. Xia, J. Aggarwal, L. Matthies, Robot-centric activity prediction from first-person videos: What will they do to me? in: Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI ’15, ACM, New York, NY, USA, 2015, pp. 295–302, doi: 10.1145/2696454.2696462.

[65] M.S. Ryoo, L. Matthies, First-person activity recognition: what are they doing to me? in: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2730–2737, doi: 10.1109/CVPR.2013.352.

[66] A. Sauppé, B. Mutlu, The social impact of a robot co-worker in industrial settings, in: Proceedings of the ACM Conference on Human Factors in Computing Systems, 2015, pp. 3613–3622.

[67] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: Proceedings of the Seventeenth International Conference on Pattern Recognition (ICPR), 3, 2004, pp. 32–36.

[68] A.G. Schwing, R. Urtasun, Fully connected deep structured networks, CoRR abs/1503.02351 (2015).

[69] P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: Proceedings of the Fifteenth International Conference on Multimedia, 2007, pp. 357–360.

[70] W. Sheng, J. Du, Q. Cheng, G. Li, C. Zhu, M. Liu, G. Xu, Robotic semantic mapping through human activity recognition: a wearable sensing and computing approach, Robot. Auton. Syst. 68 (2015) 47–58.

[71] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 568–576.

[72] K. Soomro, A. Roshan Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, CoRR abs/1212.0402 (2012).

[73] T. Spexard, M. Hanheide, G. Sagerer, Human-oriented interaction with an anthropomorphic robot, IEEE Trans. Robot. 23 (5) (2007) 852–862.

[74] C. Stefan, Dynamic eye movement datasets and learnt saliency models for visual action recognition, in: Proceedings of the Twelfth European Conference on Computer Vision (ECCV), 2012, pp. 842–856.

[75] R. Stiefelhagen, H. Ekenel, C. Fügen, P. Gieselmann, H. Holzapfel, F. Kraft, K. Nickel, A. Waibel, Enabling multimodal human–robot interaction for the Karlsruhe humanoid robot, IEEE Trans. Robot. 23 (5) (2007) 840–851.

[76] Y. Tamura, T. Akashi, S. Yano, H. Osumi, Human visual attention model based on analysis of magic for smooth human–robot interaction, Int. J. Soc. Robot. 8 (2016) 685–694.

[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.

[78] N.-T. Tran, F.-E. Ababsa, M. Charbit, J. Feldmar, D. Petrovska-Delacrétaz, G. Chollet, 3D Face Pose and Animation Tracking via Eigen-Decomposition Based Bayesian Approach, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 562–571, doi: 10.1007/978-3-642-41914-0_55.

[79] P.K. Turaga, R. Chellappa, V.S. Subrahmanian, O. Udrea, Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol. 18 (2008) 1473–1488.

[80] N. Vahrenkamp, T. Asfour, R. Dillmann, Simultaneous grasp and motion planning: humanoid robot ARMAR-III, IEEE Robot. Autom. Mag. 19 (2) (2012) 43–57, doi: 10.1109/MRA.2012.2192171.

[81] S. Vascon, E. Mequanint, M. Cristani, H. Hung, M. Pelillo, V. Murino, A game-theoretic probabilistic approach for detecting conversational groups, in: Proceedings of the Asian Conference on Computer Vision, Springer, 2014, pp. 658–675.

[82] A. Vignolo, F. Rea, N. Noceti, A. Sciutti, F. Odone, G. Sandini, Biological movement detector enhances the attentive skills of humanoid robot ICUB, in: Proceedings of the IEEE-RAS Sixteenth International Conference on Humanoid Robots (Humanoids), 2016, pp. 338–344, doi: 10.1109/HUMANOIDS.2016.7803298.

[83] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1, 2001, pp. I511–I518.

[84] D. Voilmy, C. Suarez, A. Romero-Garces, C. Reuther, J. Pulido, R. Marfil, L. Manso, K. Lan Hing Ting, A. Iglesias, J. Gonzalez, J. Garcia, A. Garcia-Olaya, R. Fuentetaja, F. Fernandez, A. Dueñas, L. Calderita, P. Bustos, T. Barile, J. Bandera Rubio, A. Bandera, CLARC: a cognitive robot for helping geriatric doctors in real scenarios, in: A. Ollero, A. Sanfeliu, L. Montano, N. Lau, C. Cardeira (Eds.), Proceedings of the ROBOT 2017: Third Iberian Robotics Conference, 2017.

[85] M. Waibel, M. Beetz, J. Civera, R. D’Andrea, J. Elfring, D. Gálvez-López, K. Häussermann, R. Janssen, J.M.M. Montiel, A. Perzylo, B. Schiele, M. Tenorth, O. Zweigle, R.D. Molengraft, Roboearth, IEEE Robot. Autom. Mag. 18 (2) (2011) 69–82, doi: 10.1109/MRA.2011.941632.

[86] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, in: Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, 2011, pp. 3169–3176.

[87] H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3551–3558.

[88] H. Wang, M. Shao, Y. Liu, W. Zhao, Enhanced efficiency 3D convolution based on optimal FPGA accelerator, IEEE Access 5 (2017) 6909–6916, doi: 10.1109/ACCESS.2017.2699229.

[89] H. Wang, H. Zhou, A. Finn, Discriminative dictionary learning via shared latent structure for object recognition and activity recognition, in: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 6299–6304.

[90] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. Gool, Temporal segment networks: towards good practices for deep action recognition, in: Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 20–36.

[91] Y. Wang, M. Long, J. Wang, P. Yu, Spatiotemporal pyramid network for video action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[92] C.Y. Weng, W.T. Chu, J.L. Wu, RoleNet: movie analysis from the perspective of social networks, IEEE Trans. Multimed. 11 (2) (2009) 256–271.

[93] G. Willems, T. Tuytelaars, L. Gool, An efficient dense and scale-invariant spatio-temporal interest point detector, in: Proceedings of the European Conference on Computer Vision (ECCV), 2008, pp. 650–663.

[94] M. Williams, P. Gardenfors, B. Johnston, G. Wightwick, Anticipation as a strategy: a design paradigm for robotics, in: Y. Bi, M.A. Williams (Eds.), Proceedings of the Knowledge Science, Engineering and Management (KSEM2010), Lecture Notes in Computer Science, 6291, Springer, Heidelberg, 2010.

[95] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 conversational speech recognition system, CoRR abs/1708.06073 (2017).

[96] T. Yu, S.N. Lim, K. Patwardhan, N. Krahnstoever, Monitoring, recognizing and discovering social networks, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1462–1469, doi: 10.1109/CVPR.2009.5206526.

[97] K. Zhang, Y. Huang, Y. Du, L. Wang, Facial expression recognition based on deep evolutional spatial-temporal networks, IEEE Trans. Image Process. 26 (9) (2017) 4193–4203, doi: 10.1109/TIP.2017.2689999.

[98] L. Zhang, H. Hung, Beyond F-formations: determining social involvement in free standing conversing groups from static images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, doi: 10.1109/CVPR.2016.123.

[99] G. Zhu, M. Yang, K. Yu, W. Xu, Y. Gong, Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor, in: Proceedings of the Seventeenth ACM International Conference on Multimedia, MM ’09, ACM, 2009, pp. 165–174, doi: 10.1145/1631272.1631297.

[100] N. Zhuang, T. Yusufu, J. Ye, K.A. Hua, Group activity recognition with differential recurrent convolutional neural networks, in: Proceedings of the Twelfth IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), 2017, pp. 526–531, doi: 10.1109/FG.2017.70.
