GA-iForest: An Efficient Isolated Forest Framework Based on Genetic Algorithm for Numerical Data Outlier Detection
Abstract:
With the arrival of the data age, data quality has become a widely discussed concern. Outlier detection, a branch of data mining, bears directly on data quality. The isolation forest (iForest) algorithm is one of the more prominent outlier detection algorithms for numerical data in recent years. As iForest keeps generating isolation trees, however, the differences between the trees gradually shrink and may vanish altogether, which wastes memory and reduces the efficiency of outlier detection. In addition, some of the constructed isolation trees cannot detect outliers at all. This paper proposes GA-iForest, an improved iForest-based method. It optimizes the forest by selecting the better isolation trees according to their detection accuracy and their difference from one another, thereby pruning duplicate, similar, and poorly performing trees and improving the accuracy and stability of outlier detection. The experimental environment is built on the Ubuntu system and the Spark platform, and the outlier datasets provided by ODDS are used for testing. The performance of the proposed method is evaluated in terms of accuracy, recall, ROC curves, AUC, and execution time. Experimental results show that the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20% to 40% compared with the original iForest method.
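The tree-selection idea described in the abstract can be illustrated with a small genetic algorithm over bitmasks, where each bit marks whether a tree is kept. This is a minimal sketch, not the authors' implementation: the abstract does not state the fitness function, the chromosome encoding, or the genetic operators, so the weighted sum of mean accuracy and mean pairwise disagreement, the single-point crossover, the bit-flip mutation, and all names (`hamming`, `fitness`, `select_trees`, `diversity_weight`) are assumptions made for illustration.

```python
import random


def hamming(a, b):
    """Fraction of samples on which two trees' outlier predictions disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def fitness(mask, accuracies, predictions, diversity_weight=0.5):
    """Assumed fitness: mean accuracy plus weighted mean pairwise difference."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    if len(chosen) < 2:
        return 0.0  # a usable forest needs at least two trees
    mean_acc = sum(accuracies[i] for i in chosen) / len(chosen)
    pairs = [(i, j) for k, i in enumerate(chosen) for j in chosen[k + 1:]]
    mean_div = sum(hamming(predictions[i], predictions[j]) for i, j in pairs) / len(pairs)
    return mean_acc + diversity_weight * mean_div


def select_trees(accuracies, predictions, pop_size=30, generations=40,
                 mutation_rate=0.05, seed=0):
    """Return indices of the trees kept by the best bitmask found."""
    rng = random.Random(seed)
    n = len(accuracies)

    def score(mask):
        return fitness(mask, accuracies, predictions)

    # Seed the population with the full forest so the result never scores
    # below "keep every tree" under the assumed fitness.
    population = [[1] * n]
    population += [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children

    best = max(population, key=score)
    return [i for i, bit in enumerate(best) if bit]
```

Here `accuracies[i]` would be the detection accuracy of tree `i` on labeled validation data, and `predictions[i]` its 0/1 outlier predictions on the same samples; both are hypothetical inputs that a real pipeline would have to compute from the trained forest. Calling `select_trees(accuracies, predictions)` returns the indices of the trees to keep.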
LI Kexin, LI Jing, LIU Shuji, LI Zhao, BO Jue, LIU Biqi. GA-iForest: An Efficient Isolated Forest Framework Based on Genetic Algorithm for Numerical Data Outlier Detection[J]. Transactions of Nanjing University of Aeronautics & Astronautics,2019,36(6):1026-1038