DI - GIAA - Capítulos de Monografías

Recent Submissions

Now showing 1 - 20 of 114
  • Publication
    Silhouette-based human action recognition with a multi-class support vector machine
    (Institution Of Engineering And Technology (IET), 2018-05) González, Luis; Velastin Carroza, Sergio Alejandro; Acuña Leiva, Gonzalo; European Commission; Ministerio de Economía y Competitividad (España); Ministerio de Educación, Cultura y Deporte (España)
    Computer vision systems have become increasingly popular, being used to solve a wide range of problems. In this paper, a computer vision algorithm with a support vector machine (SVM) classifier is presented. The work focuses on the recognition of human actions through computer vision, using a multi-camera dataset of human actions called MuHAVi. The algorithm uses a silhouette-based method to extract features. The challenge is that in MuHAVi these silhouettes are noisy and in many cases include shadows. As there are many actions that need to be recognised, we take a multiclass classification approach that combines binary SVM classifiers. The results are compared with previous results on the same dataset and show a significant improvement, especially for recognising actions on a different view, obtaining overall accuracies of 85.5% and 93.5% for leave-one-camera-out and leave-one-actor-out tests respectively.
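The multiclass strategy described above (combining binary SVM classifiers) is commonly realised as one-vs-one majority voting. A minimal numpy sketch of that voting step, with stub decision functions standing in for trained binary SVMs (names and data are illustrative, not taken from the paper):

```python
import numpy as np

def one_vs_one_predict(x, binary_classifiers, n_classes):
    """Combine binary classifier decisions by majority vote (one-vs-one).

    `binary_classifiers` maps a class pair (i, j) to a decision
    function f(x) that returns > 0 for class i and <= 0 for class j.
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), decide in binary_classifiers.items():
        votes[i if decide(x) > 0 else j] += 1
    return int(np.argmax(votes))
```

With k classes this needs k(k-1)/2 binary classifiers, one per pair, which is the usual trade-off of the one-vs-one scheme.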
  • Publication
    Detection of People Boarding/Alighting a Metropolitan Train using Computer Vision
    (Institution Of Engineering And Technology (IET), 2018-05) Belloc, M.; Velastin Carroza, Sergio Alejandro; Fernández, R.; Jara, M.; European Commission; Ministerio de Economía y Competitividad (España)
    Pedestrian detection and tracking have seen major progress in the last two decades. Nevertheless, there are always application areas which either require further improvement, have not been sufficiently explored, or where production-level performance (accuracy and computing efficiency) has not been demonstrated. One such area is pedestrian monitoring and counting on metropolitan railway platforms. In this paper we first present a new partly annotated dataset of a full-size laboratory observation of people boarding and alighting from a public transport vehicle. We then present baseline results for the automatic detection of such passengers, based on computer vision, which could open the way to computing variables of interest to traffic engineers and vehicle designers, such as counts and flows and how they relate to vehicle and platform layout.
  • Publication
    Motorcycle detection and classification in urban scenarios using a model based on Faster R-CNN
    (The Institution of Engineering and Technology, 2018-05-22) Espinosa, Jorge E.; Velastin Carroza, Sergio Alejandro; Branch, John W.; European Commission
    This paper introduces a Deep Learning Convolutional Neural Network model based on Faster R-CNN for motorcycle detection and classification in urban environments. The model is evaluated in occluded scenarios where more than 60% of the vehicles present a degree of occlusion. For training and evaluation, we introduce a new dataset of 7500 annotated images, captured in real traffic scenes using a drone-mounted camera. Several tests were carried out to design the network, achieving promising results of 75% in average precision (AP), even with the high number of occluded motorbikes, the low angle of capture and the moving camera. The model is also evaluated on low-occlusion datasets, reaching results of up to 92% in AP.
  • Publication
    Evaluation Framework for Crowd Behaviour Simulation and Analysis based on Real Videos and Scene Reconstruction
    (IEEE, 2017-01-23) Jablonski, Konrad; Argyriou, Vasileios; Greenhill, Darrel; Velastin Carroza, Sergio Alejandro
    Crowd simulation has been regarded as an important research topic in computer graphics, computer vision, and related areas. Various approaches have been proposed to simulate real life scenarios. In this paper, a novel framework that evaluates the accuracy and the realism of crowd simulation algorithms is presented. The framework is based on the concept of recreating real video scenes in 3D environments and applying crowd and pedestrian simulation algorithms to the agents using a plug-in architecture. The real videos are compared with recorded videos of the simulated scene and novel Human Visual System (HVS) based similarity features and metrics are introduced in order to compare and evaluate simulation methods. The experiments show that the proposed framework provides efficient methods to evaluate crowd and pedestrian simulation algorithms with high accuracy and low cost.
  • Publication
    Characterisation of the spatial sensitivity of classifiers in pedestrian detection
    (IEEE, 2015-09) Quinteros, Daniel; Velastin Carroza, Sergio Alejandro; Acuña Leiva, Gonzalo; European Commission; Ministerio de Economía y Competitividad (España); Ministerio de Educación, Cultura y Deporte (España)
    In this paper, a study of spatial sensitivity in the pedestrian detection context is carried out by comparing two descriptor-classifier combinations, using the well-known sliding-window approach and looking for a well-tuned response of the detector. By well-tuned, we mean that multiple detections are minimised so as to facilitate the usual non-maximal suppression stage. To guide the evaluation we introduce the concept of spatial sensitivity, such that a pedestrian detection algorithm with good spatial sensitivity can reduce the number of classifications in the pedestrian neighbourhood, ideally to one. To characterise spatial sensitivity we propose and use a new metric to measure it. Finally, we carry out a statistical analysis (ANOVA) to validate the results obtained from the metric.
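The non-maximal suppression stage mentioned above is typically the greedy IoU-based procedure below: keep the highest-scoring detection, discard its strong overlaps, repeat. A minimal numpy sketch (box format and threshold are illustrative, not the paper's settings):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlaps above iou_thresh."""
    order = np.argsort(scores)[::-1]          # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[ious <= iou_thresh]      # survivors only
    return keep
```

A detector with good spatial sensitivity, in the paper's sense, hands this routine few redundant boxes per pedestrian in the first place.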
  • Publication
    Multi-view Human Action Recognition using Histograms of Oriented Gradients (HOG) Description of Motion History Images (MHIs)
    (IEEE, 2015-12) Murtaza, Fiza; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro; European Commission; Ministerio de Economía y Competitividad (España)
    In this paper, a silhouette-based view-independent human action recognition scheme is proposed for multi-camera datasets. To overcome the high dimensionality incurred by multi-camera data, a low-dimensional representation based on the Motion History Image (MHI) is extracted: a single MHI is computed for each view/action video. For efficient description of MHIs, Histograms of Oriented Gradients (HOG) are employed. Finally, the HOG-based descriptions of the MHIs are classified with a Nearest Neighbor (NN) classifier. The proposed method does not employ feature fusion for multi-view data and therefore does not require a fixed camera setup during the training and testing stages; it is suitable for multi-view as well as single-view datasets. Experimental results on the multi-view MuHAVi-14 and MuHAVi-8 datasets give high accuracy rates of 92.65% and 99.26% respectively using the Leave-One-Sequence-Out (LOSO) cross-validation technique, compared to similar state-of-the-art approaches. The proposed method is computationally efficient and hence suitable for real-time action recognition systems.
  • Publication
    People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks
    (Springer, 2016-11-03) Sourtzinos, Panos; Velastin Carroza, Sergio Alejandro; Jara, Miguel; Zegers, Pablo; Makris, Dimitrios
    We present an efficient method for people counting in video sequences from fixed cameras by utilising the responses of spatially context-aware convolutional neural networks (CNN) in the temporal domain. For stationary cameras, the background information remains fairly static, while foreground characteristics, such as size and orientation, may depend on their image location; thus the use of whole frames for training a CNN improves the differentiation between background and foreground pixels. Foreground density, representing the presence of people in the environment, can then be associated with people counts. Moreover, the fusion of the count estimation responses in the temporal domain can further enhance the accuracy of the final count. Our methodology was tested on the publicly available Mall dataset and achieved a mean deviation error of 0.091.
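The temporal fusion of per-frame count estimates can be as simple as a sliding-window median, which suppresses single-frame outliers while leaving stable counts untouched. A minimal sketch of that idea (the window size is illustrative, not the paper's setting):

```python
import numpy as np

def fuse_counts(frame_counts, window=5):
    """Temporal fusion: smooth noisy per-frame counts with a sliding median."""
    counts = np.asarray(frame_counts, dtype=float)
    pad = window // 2
    padded = np.pad(counts, pad, mode="edge")   # repeat edge values at the ends
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(counts))])
```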
  • Publication
    DA-VLAD: Discriminative action vector of locally aggregated descriptors for action recognition
    (IEEE, 2018-09-06) Murtaza, Fiza; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro; European Commission; Ministerio de Economía y Competitividad (España)
    In this paper, we propose a novel encoding method for the representation of human action videos, which we call Discriminative Action Vector of Locally Aggregated Descriptors (DA-VLAD). DA-VLAD is motivated by the fact that there are many unnecessary and overlapping frames that cause non-discriminative codewords during the training process. DA-VLAD deals with this issue by extracting class-specific clusters and learning the discriminative power of these codewords in the form of informative weights. We use these discriminative action weights with standard VLAD encoding to weight the contribution of each codeword. DA-VLAD reduces inter-class similarity efficiently by diminishing the effect of codewords common to multiple action classes during the encoding process. We demonstrate the effectiveness of DA-VLAD on two challenging action recognition datasets, UCF101 and HMDB51, improving on the state of the art with accuracies of 95.1% and 80.1% respectively.
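The core of VLAD encoding, and the place where per-codeword discriminative weights enter, can be sketched as follows. This is an illustration of the general idea (residuals to the nearest codeword, scaled by a learned weight, then L2-normalised), not the paper's exact formulation:

```python
import numpy as np

def weighted_vlad(descriptors, codebook, weights=None):
    """VLAD: accumulate residuals to the nearest codeword, optionally
    scaled by per-codeword weights, then L2-normalise the result."""
    k, d = codebook.shape
    if weights is None:
        weights = np.ones(k)
    v = np.zeros((k, d))
    for x in descriptors:
        i = np.argmin(np.linalg.norm(codebook - x, axis=1))  # nearest codeword
        v[i] += weights[i] * (x - codebook[i])               # weighted residual
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Down-weighting codewords shared by many classes, as DA-VLAD does, shrinks their residual contribution and hence the inter-class similarity of the final vectors.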
  • Publication
    3D-Hog Embedding Frameworks for Single and Multi-Viewpoints Action Recognition Based on Human Silhouettes
    (IEEE, 2018-09-13) Angelini, Federico; Fu, Zeyu; Velastin Carroza, Sergio Alejandro; Chambers, Jonathon A.; Naqvi, Syed Mohsen
    Given the high demand for automated systems for human action recognition, great efforts have been undertaken in recent decades to progress the field. In this paper, we present frameworks for single and multi-viewpoint action recognition based on the Space-Time Volume (STV) of human silhouettes and 3D-Histogram of Oriented Gradient (3D-HOG) embedding. We exploit fast computational approaches involving Principal Component Analysis (PCA) over the local feature spaces for compactly describing actions as combinations of local gestures, and L2-Regularized Logistic Regression (L2-RLR) for learning the action model from local features. Outperforming results on the Weizmann and i3DPost datasets confirm the efficacy of the proposed approaches compared to the baseline method and other works, in terms of accuracy and robustness to appearance changes.
  • Publication
    An Optimized and Fast Scheme for Real-time Human Detection using Raspberry Pi
    (IEEE, 2016-11-30) Noman, Mubashir; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro
    Real-time human detection is a challenging task due to appearance variance, occlusion and rapidly changing content; it therefore requires efficient hardware and optimized software. This paper presents a real-time human detection scheme on a Raspberry Pi. An efficient algorithm for human detection is proposed by processing regions of interest (ROI) based upon foreground estimation. Different numbers of scales have been considered for computing Histogram of Oriented Gradients (HOG) features for the selected ROI. A support vector machine (SVM) is employed to classify HOG feature vectors into detected and non-detected human regions. Detected human regions are further filtered by analyzing the area of overlapping regions. Considering the limited capabilities of the Raspberry Pi, the proposed scheme is evaluated using six different testing schemes on the Town Centre and CAVIAR datasets. Of these, Single Window with two Scales (SW2S) processes 3 frames per second with acceptably lower accuracy than the original HOG. The proposed algorithm is about 8 times faster than the original multi-scale HOG and is recommended for real-time human detection on a Raspberry Pi.
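The final filtering step (analyzing the area of overlapping regions) can be sketched as dropping any detection that overlaps an already-kept one by more than a fraction of its own area. The threshold, box format, and priority order below are illustrative assumptions, not details from the paper:

```python
def overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def filter_by_overlap(boxes, min_frac=0.5):
    """Keep a box only if it overlaps every earlier kept box by less than
    `min_frac` of its own area (boxes assumed sorted by priority)."""
    kept = []
    for b in boxes:
        area = (b[2] - b[0]) * (b[3] - b[1])
        if all(overlap_area(b, k) < min_frac * area for k in kept):
            kept.append(b)
    return kept
```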
  • Publication
    Feature Similarity and Frequency-Based Weighted Visual Words Codebook Learning Scheme for Human Action Recognition
    (Springer, 2017-11-20) Nazir, Saima; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro; European Commission; Ministerio de Economía y Competitividad (España)
    Human action recognition has become a popular field for computer vision researchers in the last decade. This paper presents a human action recognition scheme based on a textual information concept inspired by document retrieval systems. Videos are represented using a commonly used local feature representation. In addition, we formulate a new weighted class-specific dictionary learning scheme to reflect the importance of visual words for a particular action class. Weighted class-specific dictionary learning enriches the scheme to learn a sparse representation for a particular action class. To evaluate our scheme on realistic and complex scenarios, we have tested it on the UCF Sports and UCF11 benchmark datasets. This paper reports experimental results that outperform recent state-of-the-art methods on the UCF Sports and UCF11 datasets, i.e. 98.93% and 93.88% average accuracy respectively. To the best of our knowledge, this contribution is the first to apply a weighted class-specific dictionary learning method to realistic human action recognition datasets.
  • Publication
    Inter and Intra Class Correlation Analysis (IICCA) for Human Action Recognition in Realistic Scenarios
    (The Institution Of Engineering And Technology, 2017-07-11) Nazir, Saima; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro; European Commission; Ministerio de Economía y Competitividad (España); Ministerio de Educación, Cultura y Deporte (España)
    Human action recognition in realistic scenarios is an important yet challenging task. In this paper we propose a new method, Inter and Intra Class Correlation Analysis (IICCA), to handle the inter- and intra-class variations observed in realistic scenarios. Our contribution includes learning a class-specific visual representation that efficiently represents a particular action class and has high discriminative power with respect to other action classes. We use statistical measures to extract visual words that are highly intra-correlated and less inter-correlated. We evaluate and compare our approach with state-of-the-art work using a realistic benchmark human action recognition dataset.
  • Publication
    Shadow Detection for Vehicle Detection in Urban Environments
    (Springer, 2017-07-02) Hanif, Muhammad; Hussain, Fawad; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro; Chen, Zezhi
    Finding an accurate and computationally efficient vehicle detection and classification algorithm for urban environments is challenging due to large video datasets and the complexity of the task. Many algorithms have been proposed, but no efficient one exists due to various real-time issues. This paper proposes an algorithm that addresses shadow detection (a cause of vehicle misdetection and misclassification) and incorporates solutions to other challenges such as camera vibration, blurred images, illumination and changing weather effects. For accurate vehicle detection and classification, a combination of a self-adaptive GMM and a multi-dimensional Gaussian density transform is used to model the distribution of color image data. Shadow detection based on the RGB and HSV color spaces is proposed. Measurement-based features and an intensity-based pyramid histogram of orientation gradients are used for classification into four main vehicle categories. The proposed method achieved 96.39% accuracy when tested on the Chile (MTT) dataset, recorded at different times and in different weather conditions, and is hence suitable for urban traffic environments.
  • Publication
    Skeletal Movement to Color Map: A Novel Representation for 3D Action Recognition with Inception Residual Networks
    (IEEE, 2018-09-06) Hieu Pham, Huy; Khoudour, Louahdi; Crouzil, Alain; Zegers, Pablo; Velastin Carroza, Sergio Alejandro
    We propose a novel skeleton-based representation for 3D action recognition in videos using Deep Convolutional Neural Networks (D-CNNs). Two key issues have been addressed: first, how to construct a robust representation that easily captures the spatial-temporal evolution of motion from skeleton sequences; second, how to design D-CNNs capable of learning discriminative features from the new representation in an effective manner. To address these tasks, a skeleton-based representation, namely the SPMF (Skeleton Pose-Motion Feature), is proposed. The SPMFs are built from two of the most important properties of a human action, postures and their motions, and are therefore able to effectively represent complex actions. For the learning and recognition tasks, we design and optimize new D-CNNs based on the idea of Inception Residual networks to predict actions from SPMFs. Our method is evaluated on two challenging datasets, MSR Action3D and NTU-RGB+D. Experimental results indicate that the proposed method surpasses state-of-the-art methods whilst requiring less computation.
  • Publication
    Motorcycle Classification in Urban Scenarios using Convolutional Neural Networks for Feature Extraction
    (Institution Of Engineering And Technology (IET), 2017-07-12) Espinosa, Jorge E.; Velastin Carroza, Sergio Alejandro; Branch, John W.; Ministerio de Economía y Competitividad (España); European Commission
    This paper presents a motorcycle classification system for urban scenarios using a Convolutional Neural Network (CNN). Significant results in image classification have been achieved using CNNs, at the expense of a high computational cost for training with thousands or even millions of examples. Nevertheless, features can be extracted from CNNs that have already been trained. In this work AlexNet (in its CaffeNet variant) is used to extract features from frames taken in a real urban scenario. The features extracted from the CNN are used to train a support vector machine (SVM) classifier to discriminate motorcycles from other road users. The results show a mean accuracy of 99.40% and 99.29% on classification tasks of three and five classes respectively. Further experiments performed on a validation set of images show satisfactory classification.
  • Publication
    Learning and Recognizing Human Action from Skeleton Movement with Deep Residual Neural Networks
    (The Institution Of Engineering And Technology, 2017-07-11) Pham, Huy-Hieu; Khoudour, Louahdi; Crouzil, Alain; Zegers, Pablo; Velastin Carroza, Sergio Alejandro
    Automatic human action recognition is indispensable for almost all artificial intelligence systems, such as video surveillance, human-computer interfaces and video retrieval. Despite much progress, recognizing actions in an unknown video is still a challenging task in computer vision. Recently, deep learning algorithms have proved their great potential in many vision-related recognition tasks. In this paper, we propose the use of Deep Residual Neural Networks (ResNets) to learn and recognize human actions from skeleton data provided by a Kinect sensor. First, the body joint coordinates are transformed into 3D arrays and saved as RGB images. Five different deep learning models based on ResNet have been designed to extract image features and classify them into action classes. Experiments are conducted on two public video datasets for human action recognition containing various challenges. The results show that our method achieves state-of-the-art performance compared with existing approaches.
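The first step above, turning joint coordinates into an image a CNN can consume, can be sketched as follows: each frame becomes a column, each joint a row, and the (x, y, z) coordinates map to (R, G, B) after rescaling to 0-255. The layout and normalisation choices here are illustrative assumptions, not necessarily the paper's:

```python
import numpy as np

def skeleton_to_image(joint_sequence):
    """Encode a skeleton sequence of shape (T frames, J joints, 3 coords)
    as a (J, T, 3) uint8 image: coordinates become RGB intensities."""
    seq = np.asarray(joint_sequence, dtype=float)          # (T, J, 3)
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    span = np.where(hi - lo > 0, hi - lo, 1)               # avoid divide-by-zero
    scaled = (seq - lo) / span * 255
    return scaled.transpose(1, 0, 2).astype(np.uint8)      # (J, T, 3)
```

Once the sequence is an ordinary image, standard image-classification networks such as ResNets can be applied without architectural changes.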
  • Publication
    People Detection and Pose Classification Inside a Moving Train Using Computer Vision
    (Springer, 2017-11-29) Velastin Carroza, Sergio Alejandro; Gómez-Lira, Diego A.; European Commission; Ministerio de Economía y Competitividad (España)
    The use of surveillance video cameras in public transport is increasingly regarded as a solution to control vandalism and emergency situations. The widespread use of cameras brings in the problem of managing high volumes of data, resulting in pressure on people and resources. We illustrate a possible step to automate the monitoring task in the context of a moving train (where popular background removal algorithms will struggle with rapidly changing illumination). We look at the detection of people in three possible postures: Sat down (on a train seat), Standing, and Sitting (halfway between sat down and standing). We then use the popular Histogram of Oriented Gradients (HOG) descriptor to train Support Vector Machines to detect people in any of the predefined postures. As a case study, we use the public BOSS dataset. We show different ways of training and combining the classifiers, obtaining a sensitivity improvement of about 12% when using a combination of three SVM classifiers instead of a global (all-classes) classifier, at the expense of a 6% increase in false positive rate. We believe this is the first set of public results on people detection using the BOSS dataset, so future researchers can use our results as a baseline to improve upon.
  • Publication
    Vehicle Detection Using Alex Net and Faster R-CNN Deep Learning Models: A Comparative Study
    (Springer, 2017-11-29) Espinosa, Jorge E.; Velastin Carroza, Sergio Alejandro; Branch, John W.; European Commission; Ministerio de Economía y Competitividad (España)
    This paper presents a comparative study of two deep learning models used here for vehicle detection. AlexNet and Faster R-CNN are compared through the analysis of an urban video sequence. Several tests were carried out to evaluate the quality of detections, failure rates and the time taken to complete the detection task. The results allow us to draw important conclusions regarding the architectures and strategies used for implementing such networks for the task of video detection, encouraging future research in this topic.
  • Publication
    Human Action Recognition using Multi-Kernel Learning for Temporal Residual Network
    (SciTePress, 2019-02) Nazir, Saima; Qian, Yu; Yousaf, Muhammad Haroon; Velastin Carroza, Sergio Alejandro; Izquierdo, Ebroul; Vazquez, Eduard; European Commission; Ministerio de Economía y Competitividad (España); Ministerio de Educación, Cultura y Deporte (España)
    Deep learning has led to a series of breakthroughs in the human action recognition field. Given the powerful representational ability of residual networks (ResNet), performance in many computer vision tasks, including human action recognition, has improved. Motivated by the success of ResNet, we use the residual network and its variations to obtain feature representations. Bearing in mind the importance of appearance and motion information for action representation, our network utilizes both for feature extraction. Appearance and motion features are further fused for action classification using a multi-kernel support vector machine (SVM). We also investigate the fusion of dense trajectories with the proposed network to boost its performance. We evaluate the proposed methods on a benchmark dataset (HMDB-51); the results show that multi-kernel learning performs better than fusing the classification scores from the deep network's SoftMax layer. Our proposed method also performs well compared to recent state-of-the-art methods.
  • Publication
    Bag of Deep Features for Instructor Activity Recognition in Lecture Room
    (Springer, 2019-01) Nida, Nudrat; Yousaf, Muhammad Haroon; Irtaza, Aun; Velastin Carroza, Sergio Alejandro; European Commission; Ministerio de Economía y Competitividad (España); Ministerio de Educación, Cultura y Deporte (España)
    This research aims to explore contextual visual information in the lecture room to assist an instructor in articulating the effectiveness of the delivered lecture. The objective is to enable a self-evaluation mechanism for instructors to improve lecture productivity by understanding their activities. A teacher's effectiveness has a remarkable impact on uplifting students' performance, helping them succeed academically and professionally. Therefore, the process of lecture evaluation can significantly contribute to improving academic quality and governance. In this paper, we propose a vision-based framework to recognize the activities of the instructor for self-evaluation of the delivered lectures. The proposed approach uses motion templates of instructor activities and describes them through a Bag-of-Deep-Features (BoDF) representation. Deep spatio-temporal features extracted from motion templates are utilized to compile a visual vocabulary, which is quantized to optimize the learning model. A Support Vector Machine classifier is used to generate the model and predict the instructor activities. We evaluated the proposed scheme on a self-captured lecture room dataset, IAVID-1. Eight instructor activities (pointing towards the student, pointing towards the board or screen, idle, interacting, sitting, walking, using a mobile phone and using a laptop) are recognized with 85.41% accuracy. As a result, the proposed framework enables instructor activity recognition without human intervention.