Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques

Zidian Xie, PhD; Olga Nikolayeva, MS; Jiebo Luo, PhD; Dongmei Li, PhD


Prev Chronic Dis. 2019;16(9):e130 

Abstract and Introduction


Introduction: As one of the most prevalent chronic diseases in the United States, diabetes, especially type 2 diabetes, affects the health of millions of people and puts an enormous financial burden on the US economy. We aimed to develop predictive models to identify risk factors for type 2 diabetes, which could help facilitate early diagnosis and intervention and also reduce medical costs.

Methods: We analyzed cross-sectional data on 138,146 participants, including 20,467 with type 2 diabetes, from the 2014 Behavioral Risk Factor Surveillance System. We built several machine learning models for predicting type 2 diabetes, including support vector machine, decision tree, logistic regression, random forest, neural network, and Gaussian Naive Bayes classifiers. We used univariable and multivariable weighted logistic regression models to investigate the associations of potential risk factors with type 2 diabetes.

Results: All predictive models for type 2 diabetes achieved a high area under the curve (AUC), ranging from 0.7182 to 0.7949. Although the neural network model had the highest accuracy (82.4%), specificity (90.2%), and AUC (0.7949), the decision tree model had the highest sensitivity (51.6%) for type 2 diabetes. We found that people who slept 9 or more hours per day (adjusted odds ratio [aOR] = 1.13, 95% confidence interval [CI], 1.03–1.25) or had a checkup frequency of less than 1 year (aOR = 2.31, 95% CI, 1.86–2.85) had a higher risk for type 2 diabetes.

Conclusion: Of the 8 predictive models, the neural network model performed best overall, with the highest AUC value; however, the decision tree model is preferred for initial screening for type 2 diabetes because it had the highest sensitivity and, therefore, the highest detection rate. We confirmed previously reported risk factors and also identified sleeping time and frequency of checkup as 2 new potential risk factors related to type 2 diabetes.
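
The metrics compared in the abstract can be made concrete with a short sketch. The Python below is an illustration only, not the authors' code: the function names and all counts and scores are hypothetical. It computes sensitivity, specificity, and accuracy from confusion-matrix counts, and AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # share of true cases that are flagged
    specificity = tn / (tn + fp)                 # share of non-cases that are cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that are correct
    return sensitivity, specificity, accuracy


def auc_from_scores(scores_pos, scores_neg):
    """AUC as P(score of a case > score of a non-case); ties count as half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and scores, not taken from the study.
sens, spec, acc = screening_metrics(tp=50, fp=100, tn=800, fn=50)
auc = auc_from_scores([0.9, 0.6, 0.4], [0.5, 0.3, 0.4])
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f} auc={auc:.3f}")
```

When cases are a minority, as in this hypothetical example, a model can reach high accuracy and specificity while still missing half of the true cases, which is why the model recommended for screening can differ from the model with the best overall ranking.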


Diabetes is a chronic disease that increases risk for stroke, kidney failure, renal complications, peripheral vascular disease, heart disease, and death.[1] The International Diabetes Federation estimates that by 2045, at the current growth rate, 693 million people will have diabetes worldwide.[2] According to the Centers for Disease Control and Prevention (CDC), in 2012, 29.1 million people in the United States were diagnosed with diabetes, and diabetes was the seventh leading cause of death in the country.[3] Diabetes also puts a high financial burden on the US economy. The total estimated cost of diagnosed diabetes increased to $327 billion in 2017, including $237 billion in direct medical costs and $90 billion in reduced productivity.[4]

There are 3 main types of diabetes: type 1, type 2, and gestational. Of those 3, type 2 diabetes is the most prevalent and accounts for 90% to 95% of all cases. Type 2 diabetes is a predictable and preventable disease because it usually develops later in life (age >30) as a result of lifestyle risk factors (eg, low physical activity, obesity) and other risk factors (eg, age, sex, race, family history).[5,6] Many models have been built to predict the occurrence of type 2 diabetes.[7–10] However, because of the disease's causal complexity, the prediction performance (especially sensitivity) of models for type 2 diabetes based on survey data needs improvement.[11] In addition, although many risk factors, including obesity and age, are well established for type 2 diabetes, others remain to be identified.

To identify the risk factors for a variety of human diseases, in 1984 CDC initiated the state-wide Behavioral Risk Factor Surveillance System (BRFSS), an ongoing, state-based, random-digit–dialed telephone survey of noninstitutionalized US adults aged 18 years or older. The goal of our study was to build predictive models for type 2 diabetes using 2014 BRFSS data by applying machine learning techniques, including support vector machine (SVM), decision tree, logistic regression, random forest, Gaussian Naive Bayes classifiers, and neural network. In addition, we expected to identify other risk factors for type 2 diabetes using statistical methods.
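
Associations between a risk factor and type 2 diabetes are reported in the abstract as odds ratios with 95% confidence intervals. As a minimal sketch of that quantity, the Python below computes an unadjusted odds ratio and a Wald interval from a 2x2 table. This is a simplification and not the study's method: the study used weighted multivariable logistic regression, and the function name and all counts here are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval from a 2x2 table.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from BRFSS.
or_value, ci_low, ci_high = odds_ratio_ci(a=30, b=70, c=20, d=80)
print(f"OR={or_value:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

An interval that excludes 1 suggests an association at roughly the 5% level. An adjusted odds ratio additionally accounts for the other covariates in the model, so it can differ substantially from this crude estimate.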