TY - GEN
T1 - A data compacting technique to reduce the NetFlow size in botnet detection with BotCluster
AU - Wang, Chun Yu
AU - Chen, Yu Cheng
AU - Fuh, Shih Hao
AU - Cho, Feng Min
AU - Lo, Ta Chun
AU - Chang, Jyh Biau
AU - Cheng, Qi Jun
AU - Shieh, Ce Kuen
PY - 2019/12/2
Y1 - 2019/12/2
N2 - Big data analytics helps us find potentially valuable knowledge, but as the size of a dataset increases, the computing cost also grows exponentially. In our previous work, BotCluster, we designed a pre-processing filtering pipeline, including a whitelist filter and a flow loss-response rate (FLR) filter, for data reduction, intended to remove irrelevant noise and reduce computing overhead. However, we still face a data redundancy phenomenon in which identical feature vectors emerge repeatedly. In this paper, we propose a data compacting approach that reduces the input volume while keeping enough representative feature vectors to satisfy the criteria of DBSCAN (density-based spatial clustering of applications with noise). It purges redundant vectors according to a purging threshold and keeps the primary representatives. Experimental results show that the average data reduction ratio is about 81.34%, while precision decreases by only 1.6% on average, and the results still have 99.88% of IPs overlapping with the previous system.
AB - Big data analytics helps us find potentially valuable knowledge, but as the size of a dataset increases, the computing cost also grows exponentially. In our previous work, BotCluster, we designed a pre-processing filtering pipeline, including a whitelist filter and a flow loss-response rate (FLR) filter, for data reduction, intended to remove irrelevant noise and reduce computing overhead. However, we still face a data redundancy phenomenon in which identical feature vectors emerge repeatedly. In this paper, we propose a data compacting approach that reduces the input volume while keeping enough representative feature vectors to satisfy the criteria of DBSCAN (density-based spatial clustering of applications with noise). It purges redundant vectors according to a purging threshold and keeps the primary representatives. Experimental results show that the average data reduction ratio is about 81.34%, while precision decreases by only 1.6% on average, and the results still have 99.88% of IPs overlapping with the previous system.
UR - http://www.scopus.com/inward/record.url?scp=85077338015&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077338015&partnerID=8YFLogxK
U2 - 10.1145/3365109.3368778
DO - 10.1145/3365109.3368778
M3 - Conference contribution
T3 - BDCAT 2019 - Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
SP - 81
EP - 84
BT - BDCAT 2019 - Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
PB - Association for Computing Machinery, Inc
T2 - 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT 2019
Y2 - 2 December 2019 through 5 December 2019
ER -