Revisiting the Microsoft Malware Classification Challenge (BIG 2015) in 2023

Alejandro Mosquera
2 min read · Feb 16, 2023

In 2015, Microsoft provided the data science community with an unprecedented malware dataset, encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families. Run as a Kaggle competition, it featured a very large (for that time) dataset comprising almost 40 GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.

At the time, my submitted solution used only a dozen heuristic features and a simple Random Forest as the model, enough to secure a top-30 score (since it was my first Kaggle competition I ended up overfitting the public leaderboard and dropped 29 more positions, but that is a different story :) ). Revisiting publicly available feature sets from the top solutions (e.g. the top 10, with almost perfect scores), I was curious to see which types of features they relied on the most.

To do that, I quickly trained a LightGBM model and plotted the feature importances (a minimal sketch of this step follows the list below):

index  feature  importance
293 section_names_header 1707
1367 Offset.1 839
430 VirtualAlloc 594
284 Entropy 504
37 DllEntryPoint 483
81 misc1_assume 461
126 ent_q_diff_diffs_12 427
189 ent_q_diff_diffs_1_median 426
1371 dc_por 414
1393 string_len_counts_2 414
1272 regs_esp 400
22 byte 387
263 ent_p_19 387
0 Virtual 386
287 section_names_.edata 377
237 ent_q_diff_block_3_19 350
148 ent_q_diff_block_0_8 342
991 TB_00 342
1377 db3_rdata 339
19 DATA 339
107 ent_q_diffs_19 318
1398 string_len_counts_7 314
45 void 301
1387 db3_NdNt 297
1258 regs_bh 285
23 word 283
112 ent_q_diffs_max 282
290 section_names_.rsrc 280
1304 asm_commands_jnb 277
1296 asm_commands_in 274
165 ent_q_diff_diffs_0_min 244
1374 dd_text 239
1366 FileSize 235
135 ent_q_diff_diffs_mean 228
1196 TB_cd 222
1246 TB_ff 219
300 Unknown_Sections_lines_por 215
295 Unknown_Sections 213
426 GetProcAddress 211
1686 contdll 208
1331 asm_commands_std 208
1257 regs_ax 203
163 ent_q_diff_diffs_0_median 197
1381 dd5 188
2 loc 181
79 misc_visualc 181
1322 asm_commands_ror 179
468 FindFirstFileA 177
5 var 174
59 entry 173
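
The list above can be reproduced with a few lines of LightGBM. The sketch below is an assumption about the setup, not the original pipeline: the file name big2015_features.csv and the Id/Class column names are placeholders for whatever export of the published feature set you are working with.

```python
# Minimal sketch: train a LightGBM multiclass model on a precomputed
# feature matrix and plot split-based feature importance.
import lightgbm as lgb
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature dump; column names "Id" and "Class" are assumptions.
df = pd.read_csv("big2015_features.csv")
X = df.drop(columns=["Id", "Class"])
y = df["Class"]  # the 9 BIG 2015 malware families

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = lgb.LGBMClassifier(
    objective="multiclass",
    n_estimators=500,
    learning_rate=0.05,
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

# Split-count importance, the same kind of ranking shown in the list above.
lgb.plot_importance(model, importance_type="split", max_num_features=50)
plt.show()
```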

Going over the top features, it is clear that they would not be particularly resistant to adversarial modification: adding fake imports, renaming section names or appending extra padding bytes would likely upset the model's predictions and would be very easy to perform. Entropy-based features (e.g. ent_q_diff_diffs_12) should in theory be more robust, although that depends on which portions of the executable they are computed over.
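
For context, features of that kind are typically derived from per-block Shannon entropy of the raw bytes. The sketch below illustrates the idea under my own assumptions (a fixed 4 KB block size, a hypothetical sample.exe input, and simple summary statistics); it is not the exact feature definition used in the competition solutions.

```python
# Illustrative per-block Shannon entropy over a sample's raw bytes.
import math
from collections import Counter

def block_entropy(data: bytes, block_size: int = 4096) -> list[float]:
    """Shannon entropy (bits per byte) for each fixed-size block of `data`."""
    entropies = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        counts = Counter(block)
        total = len(block)
        ent = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(ent)
    return entropies

# Example: summarise a (non-empty) sample's entropy profile into scalar features.
with open("sample.exe", "rb") as f:  # hypothetical file name
    ents = block_entropy(f.read())

features = {
    "ent_mean": sum(ents) / len(ents),
    "ent_max": max(ents),
    # Largest jump between adjacent blocks, in the spirit of the diff features.
    "ent_diff_max": max(abs(a - b) for a, b in zip(ents, ents[1:]))
    if len(ents) > 1 else 0.0,
}
print(features)
```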

Overall, it seems that this dataset hasn’t aged very well; it is therefore surprising to still see recently published papers using it to evaluate new detection approaches.

References:
https://www.kaggle.com/code/x75a40890/ms-malware-big-2015-0-0067-score

Originally published at https://www.alejandromosquera.net.
