The threat from ransomware continues to grow both in the number of affected victims as well as the cost incurred by the people and organisations impacted in a successful attack. In the majority of cases, once a victim has been attacked there remain only two courses of action open to them; either pay the ransom or lose their data. One common behaviour shared between all crypto ransomware strains is that at some point during their execution they will attempt to encrypt the users' files. This paper demonstrates a technique that can identify when these encrypted files are being generated and is independent of the strain of the ransomware. An enhanced mixed file ransomware data set of more than 130,000 files was developed based on the govdocs corpus. This data set was enriched to contain examples of files that reflect the more modern Microsoft file formats, as well as examples of high entropy file formats such as compressed files and archives. The data set also contained eight different sets of files that were generated as the result of different real-world high profile ransomware attacks such as WannaCry, Ryuk, Phobos, Sodinokibi and NetWalker. Previous research has highlighted the difficulty in differentiating between compressed and encrypted files using Shannon entropy as both file types exhibit similar values. One of the experiments described in this paper shows a unique characteristic for the Shannon entropy of encrypted file header fragments. This characteristic was used to differentiate between encrypted files and other high entropy files such as archives. This discovery was leveraged in the development of a file classification model that used the differential area between the entropy curve of a file under analysis and one generated from random data. When comparing the entropy plot values of a file under analysis against one generated by a file containing purely random numbers, the greater the correlation of the plots is, the higher the confidence that the file under analysis contains encrypted data. The experiments demonstrate a high degree of confidence in the accuracy of the model achieving a success rate of more than 99.96% when examining only the first 192 bytes of a file, using a mixed data set of more than 80,000 files. This technique successfully addresses the problem of using file entropy to differentiate compressed and archived files from files encrypted by ransomware in a timely manner.
Davies, S. R., Macfarlane, R., & Buchanan, W. J. (2021). Differential Area Analysis for Ransomware Attack Detection within Mixed File Datasets. Computers and Security, 108, Article 102377. https://doi.org/10.1016/j.cose.2021.102377