Intelligent Android malware family classification using Genetic Algorithms and SVM

Thumbnail Image
Publication date
Defense date
Journal Title
Journal ISSN
Volume Title
Google Scholar
Research Projects
Organizational Units
Journal Issue
As of April 2019, Android was the most popular mobile operating system amongst smartphone users[1]. Its high popularity, combined with the extended use of smartphones for everyday tasks as well as storing or accessing sensitive and personal data, has made Android applications the target of numerous malware attacks over the last few years and in the present. The malware attacks have been perfected to target specific vulnerabilities in the operating system or the user; thus specializing in types of malware and families within each type. The malware is usually distributed in infected applications (or APKs), which contain malicious behaviours that can be found looking into their code (known as static analysis) or analysing the behaviour of the application while running (known as dynamic analysis). This document describes the implementation of an intelligent system that aims to classify a series of malicious APK samples obtained from the free repository ContagioDump. These samples are classified inside the type and family they belong to. To create the classifier system, a Support Vector Machine (SVM) is implemented using Python’s library Scikit Learn. A series of attributes are extracted from the samples of malicious APK by analysing the code of the APKs via static analysis, using Python’s library Androguard, which contains a parser that allows to interact with all the relevant parts of the APK file. The attributes obtained are very high in number, and for that reason a Genetic Algorithm is used to optimize the attributes that the SVM uses in the learning process. The algorithm codifies a subset of attributes from all the attributes extracted in the static analysis, and is evaluated using the accuracy score obtained when training the SVM with said subset. As a result, a subset of attributes and a trained model for the classification are obtained. This model is then tested with a new set of malware samples, belonging to all the families classified in the learning. The present document contains the explanation of the process of designing, creating and testing the system. It is developed as bachelor’s thesis for computer science and engineering degree in Universidad Carlos III de Madrid.
Genetic algorithms, Neural networks, Support Vector Machine (SVM), Android (Operating system), Malware, Artificial Intelligence
Bibliographic citation