Deep Learning for the Detection and Classification of Obfuscated Codes

JavaScript is a common attack vector: attackers use it to probe web browsers for known vulnerabilities, select a fitting exploit, redirect the internet traffic to an infected web server, and inject the payload into the victim’s machine. The JavaScript files used in such attacks are often obfuscated to make them hard to detect with the signature-based approaches found in all web browsers.
On the other hand, since the only legitimate reason to obfuscate a script is to protect intellectual property, there are not many scripts that are both benign and obfuscated. Code obfuscation was introduced as a viable technique to prevent reverse engineering of software applications: it was intended to protect an application's key algorithms and data structures from theft by hackers or even competitors. Code minification and code obfuscation may hide the behavior of a script. However, malware authors use the same techniques to create malware or insert malicious logic into a legitimate application, since obfuscation hinders both manual and automated analysis from detecting malicious scripts. A detector that can reliably identify malicious obfuscated JavaScript files would therefore be a valuable tool in fighting JavaScript-based attacks.
This project proposes an analysis system to detect obfuscated malicious JavaScript files found on websites, mainly the scripts used in exploit kits to probe for a web browser’s vulnerabilities. The main goals of this project are to distinguish obfuscated from non-obfuscated JavaScript files and to predict whether the obfuscated scripts are benign or malicious.
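The two goals above can be read as a two-stage decision. The following is a minimal sketch of that flow only, assuming two already-trained predictors are available; the names `classify_script`, `is_obfuscated` and `is_malicious`, as well as the label strings, are hypothetical and do not come from the project.

```python
from typing import Callable


def classify_script(
    script: str,
    is_obfuscated: Callable[[str], bool],
    is_malicious: Callable[[str], bool],
) -> str:
    """Label a script using two predictors applied in sequence.

    Both predictors are placeholders for trained models (assumption).
    """
    # Stage 1: only obfuscated scripts are passed on to the second stage.
    if not is_obfuscated(script):
        return "not_obfuscated"
    # Stage 2: decide whether the obfuscated script is benign or malicious.
    return "obfuscated_malicious" if is_malicious(script) else "obfuscated_benign"
```

In practice the two callables would wrap the trained models; trivial stand-ins such as `lambda s: "eval(" in s` are enough to exercise the control flow.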
To make accurate predictions, we will develop multiple Deep Learning models. Some of these models need manual feature selection, so a descriptive set of features and a rich dataset must be chosen to make it easier for the model to detect obfuscated malware; other models will handle feature extraction on their own. The manually selected features combine those used in previous studies on this topic (the number of words in the script, the number of peculiar characters such as / and %, and their occurrence relative to the file’s total number of characters) with features extracted from further personal research (special JavaScript keywords such as eval, and some obfuscator-specific patterns such as (p,a,c,k,e,d)).
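As an illustration of the manually selected features, here is a minimal sketch of a feature extractor. Only /, %, eval and the (p,a,c,k,e,d) pattern come from the description above; the backslash in the character set, the `unescape` keyword, the word definition (identifier-like tokens) and all function and feature names are assumptions made for this example.

```python
import re

# "/" and "%" are named in the text; the backslash is an assumed extra.
PECULIAR_CHARS = set("/%\\")
# "eval" is named in the text; "unescape" is an assumed extra.
KEYWORDS = ("eval", "unescape")
# Matches the function(p,a,c,k,e,d) argument list, tolerating whitespace.
PACKER_PATTERN = re.compile(
    r"function\s*\(\s*p\s*,\s*a\s*,\s*c\s*,\s*k\s*,\s*e\s*,\s*d\s*\)"
)
# Assumption: a "word" is an identifier-like token.
WORD_PATTERN = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def extract_features(script: str) -> dict:
    """Compute simple lexical features for one JavaScript source string."""
    total_chars = len(script)
    peculiar = sum(1 for ch in script if ch in PECULIAR_CHARS)
    features = {
        "num_words": len(WORD_PATTERN.findall(script)),
        "num_peculiar_chars": peculiar,
        # Share of peculiar characters among all characters in the file.
        "peculiar_char_ratio": peculiar / total_chars if total_chars else 0.0,
        "has_packer_pattern": int(bool(PACKER_PATTERN.search(script))),
    }
    for keyword in KEYWORDS:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        features["count_" + keyword] = len(re.findall(pattern, script))
    return features
```

The resulting dictionary can be turned into a fixed-order numeric vector before being fed to a model that relies on manual features.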
For the dataset, Alexa’s top 5000 websites were crawled and their JavaScript files collected in previous research to train a Machine Learning model. Those scripts are considered benign and not obfuscated, so they still need to be obfuscated. For the malicious scripts, we use a repository created by web developers to collect malicious JavaScript files for analysis; these scripts are also not obfuscated.
The core of our study is a highly accurate (90%) neural network-based classifier that will be trained to identify whether obfuscation has been applied to a script and whether that script is malicious.

General information
  • Date: 31.05.2020
  • Type: Master project
  • Responsible: Karl Daher

People

Students
  • Maroun Antoun
Supervisors
  • Omar Abou Khaled, Professor
  • Elena Mugellini, Head of HumanTech
  • Karl Daher, PhD Student
Partners: Université Saint-Joseph de Beyrouth