A Machine Learning Approach to Identification of Unhealthy Drinking

Levi N. Bonnell, MPH; Benjamin Littenberg, MD; Safwan R. Wshah, PhD; Gail L. Rose, PhD


J Am Board Fam Med. 2020;33(3):397-406. 

Abstract and Introduction


Introduction: Unhealthy drinking is prevalent in the United States, and yet it is underidentified and undertreated. Identifying unhealthy drinkers can be time-consuming and uncomfortable for primary care providers. An automated rule for identification would focus attention on patients most likely to need care and, therefore, increase efficiency and effectiveness. The objective of this study was to build a clinical prediction tool for unhealthy drinking based on routinely available demographic and laboratory data.

Methods: We obtained 38 demographic and laboratory variables from the National Health and Nutrition Examination Survey (1999 to 2016) on 43,545 nationally representative adults who had information on alcohol use available as a reference standard. Logistic regression, support vector machines, k-nearest neighbor, neural networks, decision trees, and random forests were used to build clinical prediction models. The model with the largest area under the receiver operating characteristic curve was selected to build the prediction tool.

Results: A random forest model with 15 variables produced the largest area under the receiver operating characteristic curve (0.78) in the test set. The most influential predictors were age, current smoking status, hemoglobin, sex, and high-density lipoprotein. The optimum operating point had a sensitivity of 0.50, specificity of 0.86, positive predictive value of 0.55, and negative predictive value of 0.83. Application of the tool resulted in a much smaller target sample (a 75% reduction).

Conclusion: Using commonly available data, a decision tool can identify a subset of patients who seem to warrant clinical attention for unhealthy drinking, potentially increasing the efficiency and reach of screening.


An estimated 27% of adults in the United States drink alcohol at a level considered unhealthy,[1] which is defined as consuming ≥1 drink per day for women or ≥2 for men or binge drinking (consuming ≥4 drinks on the same occasion for women or ≥5 for men) at least once in the past year.[2] Consuming more than the recommended amount of alcohol is a major risk factor for health and social issues, injuries, accidents, and early death.[3–5] Unhealthy drinking has been associated with cancer, pancreatitis, liver disease, psychopathology, sleep problems, hypertension, and other serious diseases,[6–10] costing the United States $249 billion in 2010.[11] Moreover, 88,000 deaths are attributable to consuming unhealthy levels of alcohol each year,[12] making it the third leading preventable cause of death in the United States behind tobacco use and poor diet/lack of exercise.

The United States Preventive Services Task Force recommends screening for unhealthy drinking among adults ages 18 and older,[13] and valid screening tools such as the Alcohol Use Disorders Identification Test (AUDIT),[14] AUDIT-Consumption,[15] and the Single Alcohol Screening Question[16] exist for this purpose.

Primary care providers (PCPs) have an important role in identifying people with unhealthy drinking, yet screening rates in primary care are low. In a representative survey of the US population, only 25% reported having been screened for alcohol use in the last year.[17] Barriers to screening include lack of time and administrative support, need for modifications to office workflow, lack of training for PCPs, the stigma associated with alcohol misuse, and the fact that universal screening will not be applicable to the majority of patients.[18–20] Efforts to impose universal screening through electronic clinical reminders and/or performance measures have improved screening rates in some health care systems, but these tools are inconsistently used and can be hampered by low clinical staff buy-in.[21,22]

An alternative approach is a clinical prediction rule, which can automatically identify patients most likely to have unhealthy drinking, thereby reducing the burden on PCPs and staff. Previous research has shown that clinical prediction rules using prospectively collected data can successfully identify unhealthy drinking. Hartz et al[23] used logistic regression and 40 laboratory values to distinguish 426 heavy drinkers from 188 light drinkers. Lichtenstein et al[24] used linear regression plus clinical and laboratory values to predict heavy drinking. Harasymiw et al[25,26] used discriminant function analysis to predict patient-reported alcohol use from a set of blood chemistry profiles. Korzec and colleagues[27] built a predictive test for unhealthy drinking based on laboratory values and a clinical questionnaire using Bayesian networks. However, the generalizability of these studies is limited by small sample sizes and highly selected populations. Furthermore, questionnaires or prospective data collection offer little advantage over universal screening. Finally, neither logistic regression nor discriminant function analysis accommodates missing values, which are common in clinical data.

Clinical prediction rules using large, existing datasets and machine learning methods are gaining momentum in the medical literature and have been used to predict poststroke mortality,[28] in-hospital mortality,[29] peripheral artery disease and future mortality risk,[30] infection in the emergency department,[31] and mortality among colon cancer patients,[32] to mention a few.

The purpose of this study was to build a clinical prediction rule for unhealthy drinking based on routinely collected demographic, clinical, and laboratory data and to compare its performance to a universal screening strategy. We hypothesized that a clinical prediction rule could discriminate patients with a greater likelihood of unhealthy drinking from those with a low probability of unhealthy drinking who would not require further evaluation. The population of patients needing further evaluation would therefore be smaller, have a higher prevalence of unhealthy drinking, and produce a greater yield from additional evaluations. In this way, a prediction rule could save time and clinical resources, relieving providers from a function that is challenging to implement reliably.[16,18,19,33]
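
To make the hypothesized trade-off concrete, the following minimal Python sketch computes how a binary prediction rule changes the size and prevalence of the group sent for further evaluation. It is an illustration, not the authors' analysis code. The function name `screening_yield` is hypothetical. The sensitivity (0.50) and specificity (0.86) are the values reported in the abstract; the prevalence of 0.27 is the national estimate cited in the introduction and is only assumed here to approximate the prevalence in the study sample.

```python
def screening_yield(prevalence, sensitivity, specificity):
    """Return PPV, NPV, and the fraction of patients flagged by a binary rule.

    All inputs are proportions between 0 and 1. The calculation is the
    standard 2x2 table arithmetic applied to a population of size 1.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)

    flagged = true_pos + false_pos  # patients the rule sends on for evaluation
    return {
        "ppv": true_pos / flagged,                # prevalence among flagged patients
        "npv": true_neg / (true_neg + false_neg), # proportion correctly ruled out
        "flagged_fraction": flagged,
    }


# Assumed inputs: prevalence from the introduction, accuracy from the abstract.
result = screening_yield(prevalence=0.27, sensitivity=0.50, specificity=0.86)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, roughly 24% of patients would be flagged, and the prevalence of unhealthy drinking among them would be about 0.57, roughly double the assumed baseline. These figures are close to, but not identical to, the reported positive predictive value (0.55), negative predictive value (0.83), and 75% reduction in the target sample, which is expected because the assumed prevalence is not the study sample's own.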