Construction of a non-communicable disease risk prediction model using data mining methods

  • 黃 唯軒

Student thesis: Doctoral Thesis


This thesis aims to construct risk prediction models for non-communicable diseases using data collected from a physical and mental health self-assessment questionnaire and from clinical records. The questionnaire data include personal physiological conditions and lifestyle, while the clinical data include subject demographics, biochemical laboratory test results, X-ray, etc. Two types of prediction models were developed. The first type uses the questionnaires filled out by the subjects to assess the risk of non-communicable diseases. The second type predicts the risk of non-communicable diseases from the clinical data. In this study, laboratory data from 2,361 subjects and 2,270 questionnaire records were collected retrospectively from the Health Management Center of National Cheng Kung University Hospital. After records with missing data were removed, the Boruta algorithm was applied to select the important features. With the selected features, five prediction models were trained to predict the risk of the diseases: decision trees, random forests, support vector machines (SVM), backpropagation neural networks (BPNN), and light gradient boosting machines (LightGBM). The results showed that the best model was LightGBM, which reached an average accuracy, sensitivity, specificity, and area under the curve (AUC) of 73.3%, 73.52%, 72.86%, and 0.7319, respectively, with random validation. We also found that sleep conditions and coffee drinking have an impact on the risk of non-alcoholic fatty liver disease (NAFLD). In addition, LightGBM outperformed the other models in predicting hypertension, hyperglycemia, and hyperlipidemia, with AUCs of 0.7384, 0.7137, and 0.6181, respectively. The AUC for hyperlipidemia indicates that LightGBM performs poorly on this disease. After discussion with medical experts, the poor performance was attributed to the content of the questionnaire: the symptoms of hyperlipidemia cannot be fully explored by the current questions. To improve prediction performance, it is necessary to add questions that are highly related to the disease, such as a hyperlipidemia questionnaire. The results of the second type of prediction model showed that LightGBM gave the best results in predicting NAFLD, with an average accuracy, sensitivity, specificity, and AUC of 80.9%, 81.25%, 80.3%, and 0.8077, respectively, with random validation. In summary, the best results were obtained with the LightGBM model, which achieved an average accuracy of 80.9% for the prediction of non-alcoholic fatty liver disease and identified valuable influencing factors for the disease. These results validate the effectiveness of the proposed model. It is hoped that the proposed model will become a convenient and fast tool for recommending health examinations and for analyzing daily life habits, so that people can learn how to prevent these diseases.
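The four metrics reported above (accuracy, sensitivity, specificity, and AUC) can be illustrated with a minimal sketch. This is not the thesis's code: the function names `confusion_counts`, `roc_auc`, and `evaluate`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. The AUC is computed here with the pairwise rank formulation, which may differ from whatever tooling the author used.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = disease present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, scores, threshold=0.5):
    """Compute the four metrics; the 0.5 threshold is an assumed default."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "auc": roc_auc(y_true, scores),
    }


# Toy data invented for illustration only; these are not values from the thesis.
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
metrics = evaluate(toy_labels, toy_scores)
print(metrics)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the abstract reports both kinds of measure.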
Date of Award: 2019
Original language: English
Supervisor: Jeen-Shing Wang (Supervisor)
