Abstract. Android is the most popular mobile operating system, with billions of active users, which has encouraged hackers and cyber-criminals to target it with malware. Accordingly, extensive research has been conducted on malware analysis and detection for Android in recent years, and Android has developed and implemented numerous security controls to deal with the problem, including a unique ID (UID) for each application, system permissions, and its distribution platform Google Play. In this paper, we evaluate four tree-based machine learning algorithms for detecting Android malware in conjunction with a substring-based feature selection method for the classifiers. In the experiments, 11,120 apps of the DREBIN dataset were used, of which 5,560 are malware samples and the rest are benign. We find that the Random Forest classifier, with 97.24% accuracy, outperforms the best previously reported result (around 94% accuracy, obtained by SVM), and thus provides a strong basis for building effective tools for Android malware detection.

Keywords: Machine learning · Classifier · DREBIN · Substring Malware

1 Introduction

Android is an open-source mobile operating system built on the Linux kernel. Its architecture is divided into five components with two permission models: (i) a sandbox environment at the kernel level, which prevents access to the file system and other resources, and (ii) API permissions, which are exposed to the user during installation of an application [1, 2].

Every Android application consists of application code (.dex files), resources, and an AndroidManifest.xml file, which provides information about the application's features and security configuration (e.g. the permissions API, activities, services, content providers and broadcast receivers) [3, 4]. After decompiling an Android APK file, we study the AndroidManifest.xml file to check which permissions are used, and then examine the API function calls in the Java code to check whether any code-hiding image script is present.
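As an illustration of the manifest inspection step, the following minimal sketch lists the permissions requested in a manifest. It assumes the manifest has already been decoded to plain-text XML by a decompilation tool; the sample manifest, the package name, and the function name requested_permissions are hypothetical and not taken from our implementation.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace used for android:* attributes in AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def requested_permissions(manifest_xml: str) -> List[str]:
    """Return the android:name of every <uses-permission> element."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{%s}name" % ANDROID_NS
    return [
        element.attrib[name_attr]
        for element in root.iter("uses-permission")
        if name_attr in element.attrib
    ]


# Hypothetical decoded manifest, for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <application>
        <activity android:name=".MainActivity"/>
    </application>
</manifest>
""".strip()

permissions = requested_permissions(SAMPLE_MANIFEST)
print(permissions)
```

The same traversal could be extended to activity, service, provider and receiver elements, which correspond to the app components mentioned above.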

In this paper, we propose a substring-based method for feature selection for use in Android malware detection, and evaluate tree-based machine learning algorithms including Decision Tree, Random Forest, Extremely Randomized Tree, and Gradient Tree Boosting to detect malware on Android by performing static analysis on the DREBIN dataset [5]. The substring-based method helps remove possibly irrelevant information and speeds up detection. The dataset is composed of malware and benign samples, and each file contains various features such as requested hardware components, requested permissions, app components, and network addresses. The four classifiers are assessed on feature categories including api_call, feature, url, service_receiver, permission, call, intent, real_permission, activity, and provider. Analyzing the outcome of the experiments, we found that the Random Forest (RF) and Extremely Randomized Tree (ERT) algorithms give the best results in malware detection, with 97.24% and 96.97% accuracy, respectively.
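To make the feature categories above concrete, the sketch below groups the feature lines of one app by category. It assumes that each line has the form prefix::value (for example permission::android.permission.SEND_SMS); this layout, the sample lines, and the helper name parse_feature_lines are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_feature_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group 'prefix::value' lines by prefix, skipping malformed lines."""
    grouped = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if "::" not in line:
            continue  # blank or malformed line
        prefix, value = line.split("::", 1)
        grouped[prefix].append(value)
    return dict(grouped)


# Hypothetical feature lines for a single app.
sample_lines = [
    "permission::android.permission.SEND_SMS",
    "feature::android.hardware.telephony",
    "url::http://example.com/update",
    "api_call::android/telephony/SmsManager;->sendTextMessage",
    "",
]

features = parse_feature_lines(sample_lines)
for category, values in sorted(features.items()):
    print(category, values)
```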

The remainder of the paper is organized as follows: Sect. 2 gives an overview of mobile malware; Sect. 3 describes related work; Sect. 4 presents our proposed malware detection methodology, including dataset description, feature extraction, technology used, training and testing; Sect. 5 presents results and analysis including performance metrics; and Sect. 6 gives conclusions and future work.

2 Overview of Mobile Malware

Mobile malware can be grouped into two types according to the manner in which it infects a device: (i) self-propagating, and (ii) social-engineering based. The first type automatically installs malware on mobile devices using different approaches such as worms, while the second type tricks users into taking certain actions that install the malware. According to recent reports from Kaspersky, McAfee, Nokia and Proofpoint, DangerousObject, Trojans, Backdoor, Spyware, and Adware are the most prevalent kinds of mobile malware [6–10].

In order to extract the useful features to facilitate malware detection it is also useful to categorize the different types of attacks, as follows:

• Hardware-based attacks: Attackers use specific commands or operations to crash hardware or change its characteristics.

• Software-based attacks: Attackers insert malware into benign official apps then upload to third-party Android app markets for download by unsuspecting users.

• Firmware-based attacks: Attackers modify or change the devices’ firmware through obtaining control privilege or creating backdoors so that they can install malware.

3 Related Works

Urcuqui and Navarro [11] proposed a static analysis framework using Naïve Bayes, Bagging, K-Neighbors, Support Vector Machines (SVM), Stochastic Gradient Descent and Decision Tree to detect malware on Android, where the K-Neighbors algorithm performed the best in the classification task with 94%.

378 Md. S. Rana et al.

Talha, et al. [12] proposed a permission-based detection system ‘APK Auditor’ to classify the Android apps as benign or malicious and obtained 88% accuracy with a 0.925 specificity using 8762 applications containing 1853 benign applications and 6909 malicious applications.

Sahs and Khan [13] proposed a supervised machine learning technique to detect malware on Android using SVM, in which the ‘Androguard’ tool and the Scikit-learn framework are first used to extract information from the APKs [14].

Yerima, et al. [15] proposed a model that provides indicators of potential malicious activities based on Bayesian classification using static analysis; the best results obtained were a TPR (True Positive Rate) of 90.6%, an FNR (False Negative Rate) of 9.4%, an accuracy of 93.5% and an AUC (Area Under Curve) of 97.22%.

DroidDolphin [16] is a dynamic analysis framework based on machine learning to detect malware on Android. It performs analysis by extracting information from API calls and 13 activities by running the application on virtual environments, and achieved a precision of 86.1% and an F-score of 0.875 by using SVM and the LIBSVM library [17].

Feizollah, et al. [18] proposed a model using dynamic analysis by applying five supervised machine learning algorithms, and the best results were obtained by KNN: TPR of 99.94% against an FPR (False Positive Rate) of 0.06%.

4 Methodology

We propose a malware detection method based on substring-based feature selection (Fig. 1) and apply it to the DREBIN dataset for performance comparison with previously published results. Details are described in the following:

4.1 Dataset Description

For the experiments, we use 11,120 of the 123,453 real Android applications of the ‘DREBIN’ dataset: 5,560 malware samples from 179 different malware families and 5,560 benign samples. The samples were collected in the period from August 2010 to October 2012; the top 20 malware families are listed below (Table 1).

Table 1. Top malware families of our dataset

Malware family   # Entries   Malware family       # Entries

FakeInstaller    925         Adrd                 91
DroidKungFu      667         DroidDream           81
Plankton         625         ExploitLinuxLotoor   70
Opfake           613         Glodream             69
Ginmaster        339         MobileTx             69
BaseBridge       330         FakeRun              61
Iconosys         152         SendPay              59
Kwin             147         Gappusin             58
FakeDoc          132         Imlog                43
Geinimi          92          SMSreg               41

Evaluation of Tree Based Machine Learning Classifiers 379

4.2 Data Preprocessing and Feature Selection

a. Data collection and balancing: The dataset is composed of benign apps and apps infected with malware; each file (app) contains features including requested hardware components, requested permissions, app components, network addresses, etc. A balanced dataset is constructed for the experiments by randomly selecting the same number of malware samples and benign samples.

b. Substring array creation based on feature sets: We use the DREBIN dataset, which focuses on the manifest XML file and the disassembled (.dex) code of the Android apps. The dataset has 8 feature sets from two sources:

i. Feature sets from the manifest:

• Feature 1 (Hardware components): The set of requested hardware components, for example touchscreen, camera, GPS and so on.

• Feature 2 (Requested permissions): Android permissions, which users grant during installation of an application, play a vital role in the security mechanism. Malicious applications tend to request dangerous permissions through which they can gain access to sensitive information.

• Feature 3 (App components): App components are of four types: activities, services, content providers and broadcast receivers.

• Feature 4 (Filtered intents): Intents are used for inter-process and intra-process communication in Android. For example, BOOT_COMPLETED is used by most malware to trigger malicious activities directly after the Android phone reboots.

ii. Feature sets from disassembled code:

• Feature 5 (Restricted API calls): Restricted API calls are governed by the Android permissions granted during installation. The use of a restricted API function call without requesting the corresponding permission in the manifest indicates malicious activity, possibly the use of root exploits.

• Feature 6 (Used permissions): A permission is sometimes needed for non-malicious API function calls. Using this set of features, we can check whether the requested permissions and the API function calls are directed to malicious activities.

• Feature 7 (Suspicious API calls): Some API functions are used by malware to gain access to sensitive information about the device or for obfuscation, for example getDeviceId(), Runtime.exec(), Cipher.getInstance(), etc.

• Feature 8 (Network addresses): Network addresses or URLs are commonly used by malware to pass data or to execute external commands.

iii. Variation of words from feature sets: API calls and URLs are unique sentences, so we count each as a single word. Permissions, activities, intents and services, however, consist of sequences of multiple words, for example android.hardware.telephony. We create 3 substrings using the last 1, 2 and 3 words, respectively, as meaningful information to identify the most important word in any permission, activity, intent, service, etc.

c. Substring array loading: The substring array is loaded and split in a 70:30 ratio to train and test the learning machine.
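Steps a to c above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact implementation: words are assumed to be separated by dots, the selected words are re-joined with dots, and the helper names (balance, tail_substring, split_70_30), the random seed, and the sample identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def balance(malware: Sequence[str], benign: Sequence[str],
            seed: int = 0) -> Tuple[List[str], List[str]]:
    """Step a: randomly keep the same number of samples from each class."""
    rng = random.Random(seed)
    n = min(len(malware), len(benign))
    return rng.sample(list(malware), n), rng.sample(list(benign), n)


def tail_substring(value: str, n_words: int, sep: str = ".") -> str:
    """Step b: keep only the last n_words dot-separated words of a value."""
    words = [word for word in value.split(sep) if word]
    return sep.join(words[-n_words:])


def split_70_30(samples: Sequence[str],
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Step c: shuffle, then use 70% for training and 30% for testing."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample identifiers, for illustration only.
malware_ids = ["mal_%d" % i for i in range(6)]
benign_ids = ["ben_%d" % i for i in range(14)]

malware_sel, benign_sel = balance(malware_ids, benign_ids)
variants = {n: tail_substring("android.hardware.telephony", n) for n in (1, 2, 3)}
train, test = split_70_30(malware_sel + benign_sel)

print(variants)
print(len(train), len(test))
```

API calls and URLs would bypass tail_substring, since each of them is counted as a single word.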


4.3 Technology Used

The experiments are performed using Python from the Anaconda distribution, and the machine used is a MacBookPro11,5 with an Intel(R) Core(TM) i7-4870HQ CPU @ 2.50 GHz (8 CPUs), ~2.5 GHz, 64-bit, with 16 GB of RAM.

4.4 Training and Testing

We split our dataset into 70% for training and 30% for testing. To perform the classification task, we apply four tree-based machine learning classifiers: Decision Tree [19], Random Forest [20], Extremely Randomized Tree [21], and Gradient Boosted Tree [22]. Each tree performs attribute tests at its interior nodes and makes decisions at its leaf nodes. After training a model, we evaluate its accuracy by testing it on new instances.
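The training and testing procedure can be sketched with scikit-learn as follows. The toy corpus, the binary bag-of-tokens encoding, the default hyperparameters and the fixed random seeds are assumptions made for illustration, and mapping Extremely Randomized Tree to ExtraTreesClassifier is likewise an assumption about the implementation.

```python
import random

from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus: each app is a list of one-word substrings.
rng = random.Random(0)
COMMON = ["INTERNET", "CAMERA", "MainActivity", "ACCESS_FINE_LOCATION"]
apps, labels = [], []
for i in range(200):
    is_malware = i % 2
    tokens = rng.sample(COMMON, 2)
    if is_malware:
        tokens += ["SEND_SMS", "BOOT_COMPLETED"]
    apps.append(tokens)
    labels.append(is_malware)

# Binary bag-of-tokens encoding; the identity analyzer keeps tokens unchanged.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
X = vectorizer.fit_transform(apps).toarray()

# 70% of the apps for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)

classifiers = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Ext. randomized": ExtraTreesClassifier(random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

for name, accuracy in accuracies.items():
    print(name, round(accuracy, 4))
```

Because the toy data are synthetic, the accuracies printed by this sketch say nothing about the results on DREBIN.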

Fig. 1. Substring-based malware detection method


5 Results and Analysis

5.1 Measurement Metrics

Common metrics are used to evaluate the performance of the various combinations of learning machines and features on the dataset, including the following. Accuracy (AC) is the proportion of all predictions that are correct. Precision (P) is the proportion of predicted positive cases that are actually positive. Recall, or True Positive Rate (TPR), is the proportion of actual positive cases that are correctly identified. False Positive Rate (FPR) is the proportion of negative cases that are incorrectly classified as positive. The f1-score, or F-measure, is the harmonic mean of recall (TPR) and precision (P). The ROC curve is a graph that summarizes the performance of a classifier over all possible thresholds by plotting the TPR on the Y-axis against the FPR on the X-axis.
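For reference, these metrics can be computed from the four confusion-matrix counts as in the sketch below. The counts used in the example are hypothetical and are not results from our experiments; the function name classification_metrics is likewise illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the metrics of this section from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur and at
    least one sample is predicted positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called True Positive Rate (TPR)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(name, round(value, 4))
```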

5.2 Performance Results

The performance of the algorithms for each substring length (number of words) is shown in Table 2.

Finally, we observe that the Random Forest classifier almost always produces the best results compared to the other classifiers. We also find that the last word of a permission, activity, intent or service is generally the most important, while combinations with other words may affect classification negatively; for example, substrings created from the last 2 and 3 words often seem to introduce irrelevant information to the classifiers. We compare the overall accuracy of each tree-based machine learning algorithm using the same parameters and summarize the results in Figs. 2 and 3. The best performance is obtained by Random Forest using the features of Hardware Components, Requested Permissions, App Components, Filtered Intents, Restricted API Calls, Used Permissions, Suspicious API Calls, and Network Addresses, which yields an overall accuracy of 97.24%, a TPR of 96.88%, an FPR of 2.39%, and an AUC of 97.24%. We can consider the “best” outcome to be the experiment that maximizes the difference between the TPR and the FPR, since this rewards both a higher TPR and a lower FPR.

Table 2. Performance results of tree-based machine learning algorithms

Substring                   Algorithm           Precision     Recall        f1-score      Accuracy (%)
                                                0      1      0      1      0      1

3 words from each feature   Decision tree       0.95   0.89   0.89   0.95   0.92   0.92   91.76
                            Random forest       0.95   0.92   0.93   0.95   0.94   0.95   93.87
                            Gradient boosting   0.86   0.89   0.90   0.84   0.88   0.87   87.26
                            Ext. randomized     0.95   0.91   0.91   0.95   0.93   0.93   92.62

2 words from each feature   Decision tree       0.93   0.93   0.92   0.93   0.93   0.93   92.87
                            Random forest       0.93   0.95   0.95   0.94   0.94   0.94   94.17
                            Gradient boosting   0.87   0.91   0.91   0.87   0.89   0.89   89.96
                            Ext. randomized     0.94   0.94   0.94   0.94   0.94   0.94   93.97

1 word from each feature    Decision tree       0.96   0.96   0.96   0.96   0.96   0.96   96.13
                            Random forest       0.97   0.98   0.98   0.97   0.97   0.97   97.24
                            Gradient boosting   0.93   0.94   0.94   0.93   0.94   0.94   93.68
                            Ext. randomized     0.97   0.97   0.97   0.97   0.97   0.97   96.97
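The selection criterion of maximizing the difference between TPR and FPR (also known as Youden's J statistic) can be illustrated with the rounded one-word recall values from Table 2. The sketch assumes that class 1 denotes malware and class 0 benign, so that the TPR is the recall of class 1 and the FPR is one minus the recall of class 0; because the table values are rounded, the numbers only approximate the TPR and FPR reported above.

```python
from typing import Dict, Tuple

# (recall of class 0, recall of class 1) from the "1 word" rows of Table 2.
# Assumption: class 0 = benign, class 1 = malware.
one_word_recall: Dict[str, Tuple[float, float]] = {
    "Decision tree": (0.96, 0.96),
    "Random forest": (0.98, 0.97),
    "Gradient boosting": (0.94, 0.93),
    "Ext. randomized": (0.97, 0.97),
}


def tpr_minus_fpr(recall_benign: float, recall_malware: float) -> float:
    """TPR is the malware recall; FPR is the share of benign apps flagged."""
    tpr = recall_malware
    fpr = 1.0 - recall_benign
    return tpr - fpr


scores = {name: tpr_minus_fpr(*recalls) for name, recalls in one_word_recall.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 2))
print("best:", best)
```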

Fig. 2. ROC curves: (a) using three words as substring; (b) using two words as substring

Fig. 3. Accuracy: (a) using three words as substring; (b) using two words as substring


6 Conclusion and Future Work

Detecting mobile malware has become an important problem due to the rapid worldwide proliferation of mobile devices; accordingly, several datasets of Android malware have been created for research and analysis. With respect to the DREBIN dataset, previous research has shown that SVM gives the best results in detection. In this paper, we studied the problem on the same dataset using tree-based learning machines in conjunction with a substring-based method for feature selection that aims to remove irrelevant information. Most of the previous research on Android malware used only two common features: system permission and API call. In our investigation, we used 8 features and also observed the variations of parameters and their effect on the accuracy of detection. We conducted experiments using the DREBIN dataset with four different learning machines and concluded that the Random Forest classifier delivers the best performance of 97.24% accuracy (with 96.88% TPR, 2.39% FPR and 97.23% f1-score with 97.58% precision) and thereby provides a strong basis for building effective malware scanners.

For future work, we propose to study the use of a consortium blockchain and a deep neural network model based on the DanKu protocol [23] to detect Android malware in real time. The DanKu protocol utilizes blockchain technology through decentralized contracts that allow anyone to post a dataset, an evaluation function, and a reward for the best trained machine learning model. Contributors train deep neural networks to model the data and submit the trained networks to the blockchain. Finally, the blockchain executes the submitted neural network models and rewards the best one.

References

1. Christian, U., Andres, N.: Framework for malware analysis in android. Sist. Telemát. 14(37), 45–56 (2016)

2. Drake, J.J., Lanier, Z., Mulliner, C., Fora, P.O., Ridley, S.A., Wicherski, G.: Android Hacker’s Handbook. Wiley, Indianapolis (2014)

3. Peiravian, N., Zhu, X.: Machine learning for android malware detection using permission and API calls. In: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 300–305 (2013)

4. Yan, P., Yan, Z.: A survey on dynamic mobile malware detection. Softw. Qual. J. (2017). https://doi.org/10.1007/s11219-017-9368-4. Accessed 31 May 2018

5. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K.: DREBIN: effective and explainable detection of android malware in your pocket. In: Symposium on Network and Distributed System Security (NDSS) (2014). https://doi.org/10.14722/ndss.2014.23247

6. Motive Security Labs Malware Reports - Nokia Networks. https://networks.nokia.com/solutions/malware-reports. Accessed 31 May 2018

7. Mobile Malware Evolution 2015 - Securelist. https://securelist.com/analysis/kasperskysecurity-bulletin/73839/mobile-malware-evolution-2015/. Accessed 31 May 2018

8. IT Threat Evolution in Q2 2016. Statistics - Securelist. https://securelist.com/analysis/quarterly-malware-reports/75640/it-threat-evolution-in-q22016-statistics/. Accessed 24 May 2018


9. Threat Insight/Threat Reports - Proofpoint. https://www.proofpoint.com/us/threatinsight/threat-reports. Accessed 24 May 2018

10. Qiao, M., Sung, A.H., Liu, Q.: Merging permission and API features for android malware detection. In: 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), Kumamoto, pp. 566–571 (2016). https://doi.org/10.1109/iiai-aai.2016.237

11. Urcuqui, C., Navarro, A.: Machine learning classifiers for android malware analysis. In: IEEE Colombian Conference on Communications and Computing (COLCOM), pp. 1–6 (2016)

12. Talha, K.A., Alper, D.I., Aydin, C.: APK Auditor: permission-based android malware detection system. Digital Invest. 13, 1–14 (2015)

13. Sahs, J., Khan, L.: A machine learning approach to android malware detection. In: Intelligence and Security Informatics Conference (EISIC), pp. 141–147. IEEE (2012)

14. Scikit-learn: machine learning in Python. Scikit-learn 0.16.1 documentation. http://scikit-learn.org/stable/. Accessed 31 May 2018

15. Yerima, S.Y., Sezer, S., McWilliams, G., Muttik, I.: A new android malware detection approach using Bayesian classification. In: IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), pp. 121–128 (2013)

16. Wu, W.C., Hung, S.H.: DroidDolphin: a dynamic android malware detection framework using big data and machine learning. In: Conference on Research in Adaptive and Convergent Systems, pp. 247–252. ACM (2014)

17. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. Trans. Intell. Syst. Technol. 2(3), 1–39 (2011)

18. Feizollah, A., Anuar, N.B., Salleh, R., Amalina, F., Ma’arof, R.U.R., Shamshirband, S.: A study of machine learning classifiers for anomaly-based mobile botnet detection. Malays. J. Comput. Sci. 26(4), 251–265 (2014)

19. Towards Data Science. https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052. Accessed 31 May 2018

20. A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System. http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics. Accessed 31 May 2018

21. Ensemble Methods. https://www3.nd.edu/~rjohns15/cse40647.sp14/www/content/lectures/31%20-%20Decision%20Tree%20Ensembles.pdf. Accessed 31 May 2018

22. Introduction to Boosting Trees for Regression and Classification. http://www.statsoft.com/Textbook/Boosting-Trees-Regression-Classification. Accessed 31 May 2018

23. Kurtulmus, A.B., Daniel, K.: Trustless Machine Learning Contracts; Evaluating and Exchanging Machine Learning Models on the Ethereum Blockchain. Algorithmia Research (2018). https://algorithmia.com/static/documents/d3a4c04/Machine_Learning_Models_on_the_Ethereum_Blockchain.pdf
