20 Jan 2020 |
Research article |
Information and Communications Technologies
Automatic Annotation of Non-Verbal Language among the Elderly
Humans need to express themselves by communicating and sharing emotions through non-verbal communication. Consequently, analysing their behaviour can contribute to a better understanding of their needs. This article presents a new system model for non-verbal language annotation: an approach to head-gesture annotation for elderly people based on machine learning techniques applied in the field of linguistics. Using the proposed approach, we were able to annotate frames in the same way as the experts who produced the ground truth. Elderly people's head movements are identified through automatic annotation of videos of their natural conversations. Three machine learning techniques were used and compared: decision tree (DT), K-nearest neighbours (KNN) and support-vector machine (SVM). They were applied to annotate 3657 frames extracted from natural conversations collected in the CorpAGEst corpus. The proposed approach shows promising results in covering all head movements, although the techniques yielded different accuracy rates: SVM and KNN both reached the highest accuracy (93%), compared to DT (68%). The next step in this research will focus on annotating hand movements, allowing a more complete characterization of non-verbal communication in elderly people.
The study of non-verbal communication has grown rapidly over the past decade, and thousands of empirical studies have examined its role in human lives. Humans need to express themselves by communicating and sharing their emotions, and they do so through both verbal and non-verbal channels. Research (1) suggests that in communication, only 5% of the effect is produced by the spoken word, 45% by tone, inflection and other vocal elements, and 50% by body language, movement, and eye contact. Non-verbal communication can therefore be more useful than words for understanding a speaker's exact intention.
For this work, we used the CorpAGEst corpus (2), composed of face-to-face conversations between adults and an elderly subject (aged 75 and over). It is a longitudinal corpus containing audio and video annotations, as well as time-aligned transcriptions. The goal of this corpus is to study verbal and gestural markers and build a pragmatic profile of elderly people by following their verbal and gestural pragmatic markers in real-world situations. The corpus consists of transversal and longitudinal sub-corpora:
- The transversal corpus includes 18 spontaneous conversations in Belgian French. The participants were 8 women and 1 man, with a mean age of 85; each had two conversations. The corpus serves to explore non-verbal markers of stance and their combination in language interaction, since they are representative indicators of attitudinal behaviour and of speakers' emotional state.
The proposed approach to automating head movement annotation is based on two main ideas: first, the creation of a class that encompasses the complex classes in order to simplify the model; second, the use of machine learning techniques to reproduce the experts' annotations. The machine learning process comprises three steps:
- Head feature extraction (landmark identification);
- Pre-processing of the data extracted in the first step;
- Application of three machine learning techniques to characterize head movement.
Feature extraction and identification
Different techniques are used for feature extraction, including edge detection (3), boosted Haar cascades (4), and Gabor filters (5). The proposed system used OpenFace (6) to extract the head features. OpenFace is based on a neural network model called the "Convolutional Expert Constrained Local Model" (7). OpenFace generated 714 features, from which we removed those related to gaze and to 3D landmarks. After this pre-processing step, we obtained 211 features for each frame.
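The filtering step above can be sketched as follows. This is a minimal sketch assuming OpenFace's standard CSV export; the column-name prefixes used here (`gaze_*`, `eye_lmk_*`, and the 3D landmark columns `X_*`/`Y_*`/`Z_*`) are assumptions based on OpenFace's documented output naming, not column names stated in the article.

```python
import pandas as pd

# Assumed OpenFace column prefixes to discard: gaze direction vectors,
# eye landmarks, and 3-D facial landmark coordinates (X_*, Y_*, Z_*).
DROP_PREFIXES = ("gaze_", "eye_lmk_", "X_", "Y_", "Z_")

def filter_head_features(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns relevant to head movement (2-D landmarks,
    head pose, action units, etc.)."""
    keep = [c for c in df.columns if not c.startswith(DROP_PREFIXES)]
    return df[keep]
```

Note that the 2D landmark columns (`x_0`, `y_0`, …) are lowercase in OpenFace's output, so they survive the case-sensitive prefix match.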
One of the most important steps in obtaining adequate results is data pre-processing, and different techniques can be applied. In this work, we used data cleaning to remove noise and correct inconsistencies in the data generated by OpenFace. We also used data transformations, such as normalization, to improve the precision of algorithms involving distance measurements. Finally, we reduced the dataset size by eliminating redundant features (data reduction).
Dimensionality reduction is the process of finding a low-dimensional representation of the data that retains as much information as possible by keeping the most relevant variables. Two different techniques were applied in this work.
Pearson correlation
The Pearson correlation coefficient measures the linear correlation between two variables. We used it to identify strongly correlated, and therefore redundant, features; the output of this step was then used to reduce the number of features further with principal component analysis.
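A correlation-based filter of this kind can be sketched as below. The 0.95 threshold is illustrative; the article does not state the cutoff used.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature out of each pair whose absolute Pearson
    correlation exceeds `threshold` (threshold is illustrative)."""
    corr = df.corr().abs()
    # Keep only the upper triangle, excluding the diagonal, so each
    # feature pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```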
Principal component analysis
Given the large number of extracted features, we needed to reduce the number of variables considerably while retaining as much of the information in the original dataset as possible. Principal component analysis (PCA) is one technique that meets both requirements.
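A minimal PCA sketch is shown below. The random matrix merely stands in for the 211 standardised OpenFace features per frame, and the 95% explained-variance target is an assumption, not the article's setting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 100 frames x 211 features (the feature count comes
# from the article; the values are random stand-ins for the real,
# already-standardised OpenFace features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 211))

# Keep the smallest number of principal components that together
# explain at least 95 % of the variance (illustrative ratio).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```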
Feature scaling (standardisation)
One of the pre-processing steps in this work was feature scaling. We applied this technique because of the varying frame sizes in our dataset, to avoid the performance loss that scale variation would otherwise cause.
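Standardisation rescales every feature to zero mean and unit variance, z = (x − μ)/σ, which removes scale differences between frames. A minimal sketch with toy coordinates (the values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy landmark coordinates: the last two rows mimic the same head
# shape at roughly double the scale of the first two.
X = np.array([[10., 20.],
              [12., 18.],
              [22., 41.],
              [25., 37.]])

# After standardisation each column has zero mean and unit variance,
# so distance-based algorithms treat all features comparably.
X_std = StandardScaler().fit_transform(X)
```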
Minimizing the head movement class
We distinguished 38 classes used by the experts to annotate head movements in the CorpAGEst corpus, as shown in Table 1. Among these, 8 classes represent "simple" head movement directions (Turn Right, Turn Left, Tilt Right, Tilt Left, Down, Up, Back, Forward). The remaining classes represent "complex" head movements, in other words, combinations of more than one direction (e.g. Down + Turn Right, Tilt Left + Turn Right). The proposed approach stays true to the expert annotations: we kept the same simple classes and added a single new class that encompasses all the composed classes. The annotations done by experts can be found in the table below.
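The class reduction can be sketched as a simple label mapping. The eight simple labels come from the article; the name of the merged class ("Composed") is our assumption, since the article does not name it.

```python
# The eight simple direction labels kept from the expert annotation
# scheme (names taken from the article).
SIMPLE = {"Turn Right", "Turn Left", "Tilt Right", "Tilt Left",
          "Down", "Up", "Back", "Forward"}

def simplify_label(label: str) -> str:
    """Map any of the 38 expert labels to one of 9 classes: the 8
    simple directions, or a single class for all composed movements
    ("Composed" is a hypothetical name for that class)."""
    return label if label in SIMPLE else "Composed"
```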
Table 1 Annotations done by experts
Table 2 Proposed annotation classes
Machine learning techniques to characterize head movement
We validated our work against the ground truth produced by experts (linguists), using 3657 frames extracted from a natural conversation. We divided the dataset into two parts:
The training set is a set of examples used for learning. The model is trained on the dataset using a supervised learning method to fit the parameters of the classifier. The model sees and learns from this data.
The testing set is a set of examples used to evaluate the performance of the trained classifier. After the final model was chosen, a test phase was run to estimate its error rate; this dataset provides an unbiased evaluation of the final model fitted on the training data.
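The split described above can be sketched as follows. The random features and labels stand in for the 3657 annotated frames, and the 80/20 ratio is illustrative; the article does not state the split proportion used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 3657 annotated frames with
# 9 head-movement classes (feature values are random stand-ins).
rng = np.random.default_rng(0)
X = rng.normal(size=(3657, 30))
y = rng.integers(0, 9, size=3657)

# Stratified split so each class keeps the same proportion in the
# training and testing sets (the 20 % test size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```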
The purpose of automatic annotation was reached by considering the problem as a classification task, which was solved by using supervised learning represented by three machine learning techniques, namely decision tree, support-vector machine, and the K-nearest neighbour algorithm.
The decision tree technique (DT) is structured as a flow-diagram-like tree. DT is a multistage decision-making approach, widely used to represent classification models thanks to its simple, comprehensible structure, similar to human reasoning. Several reasons explain its popularity: the induced model generalizes adequately, handles redundant attributes and noisy data well, and has a low computational cost (8). In our case, however, the results were not promising: we reached only a 67% accuracy rate.
The support-vector machine technique (SVM) is based on supervised non-parametric statistical learning algorithms defined by a separating hyperplane, designed to increase the generalization capacity of the model while avoiding overfitting (9). SVM is used for two tasks, classification and regression analysis; in our case, it was used for classification. In other words, given a labelled training dataset, the algorithm was able to categorize new examples into the defined classes. The results we obtained with this method were promising, reaching a 93% accuracy rate.
The K-nearest neighbour algorithm (KNN) is an efficient learning algorithm. It belongs to the non-parametric classification methods widely used in real machine learning applications, thanks to its high performance and simple implementation (10). A case is classified by a majority vote among its neighbours: it is assigned to the class most common among its K nearest neighbours, as measured by a distance function. The KNN algorithm gave good results, exceeding 93%.
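The three-classifier comparison can be sketched as below. The synthetic data stands in for the annotated frames, and the hyper-parameters are scikit-learn defaults, not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the annotated frames (the real features come
# from OpenFace after pre-processing).
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=10, n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# The three techniques compared in the article, with default settings.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {acc:.3f}")
```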
Discussion and Conclusion
We presented a methodology for computer-based annotation, an approach that aims to automate the manual annotation process. Our work makes two major contributions. The first is the use of machine learning techniques in the field of linguistics, an attempt to overcome several limitations related to precision and timing. The second is the idea of imitating expert annotations based on standards.
We presented an interdisciplinary approach to automatic annotation. The proposed process focused on the automatic annotation of head movements, reducing the cost of manual annotation and establishing a back-and-forth dialogue between the computer science community and other researchers. This research illustrated how automated techniques can be used to annotate videos according to a standard. The results suggest two avenues for further exploration. First, the automatic annotation of non-verbal behaviour in videos is quite feasible; second, machine learning algorithms can identify features that are not fully aligned with what human annotators are used to. This characteristic opens up a new dialogue between artificial intelligence and researchers in linguistics, communication, and aging in terms of interpreting results and features in videos. The principal impact of this work is to contribute to establishing stronger gesture recognition techniques adapted to the aging population, giving the research community tools to explore its specificities in an automated fashion.
This research was presented at the 87th Acfas Conference on May 29, 2019.
Helmi Garraoui is a PhD candidate at ÉTS under the supervision of Sylvie Ratté and Luc Duong. His research interests include non-verbal communication and machine learning video analysis.
Program: Information Technology Engineering
Research laboratories: LiNCS - Cognitive and Semantic Interpretation Engineering Laboratory
Luc Duong is a professor in the Software Engineering and IT Department at ÉTS, and researcher at the CHU Research Center. His research focuses on medical imaging, computer vision, algorithms and artificial intelligence.
Research laboratories: LIVE – Interventional Imaging Laboratory
Sylvie Ratté is a professor in the Software and IT Engineering Department at ÉTS. Her research interests include linguistic engineering, artificial intelligence, ontology, data and text mining, formal and visual languages.
Research laboratories: LiNCS - Cognitive and Semantic Interpretation Engineering Laboratory