Development of a Model for Predicting the 4-year Risk of Symptomatic Knee Osteoarthritis in China

A Longitudinal Cohort Study

Limin Wang; Han Lu; Hongbo Chen; Shida Jin; Mengqi Wang; Shaomei Shang


Arthritis Res Ther. 2021;23(65) 



Study Design and Data Source

The present retrospective cohort study relied on 4-year data from the China Health and Retirement Longitudinal Study (CHARLS), a nationwide study of Chinese adults aged 45 years or older whose detailed cohort profile has been published.[35] The national baseline survey was conducted between June 2011 and March 2012 (CHARLS2011), and 17,708 respondents across 150 counties/districts and 450 villages/resident committees were recruited using a multistage sampling strategy. Respondents are followed up every 2 years via face-to-face computer-assisted personal interviews. Detailed information on demographic background, socioeconomic status, biomedical findings, health status, and functioning was collected at baseline and at each follow-up using a structured questionnaire.[35] Blood samples were also obtained at each time point. The present study included participants recruited in CHARLS2011 and re-examined in CHARLS2015.


In this study, the unit of analysis was the person. Individuals who were free of symptomatic knee osteoarthritis (KOA) in CHARLS2011 and had complete diagnostic information on symptomatic KOA in CHARLS2015 were included. Participants without complete diagnostic information on symptomatic KOA in CHARLS2011 or CHARLS2015 were excluded, as were those for whom more than 50% of the predictor variables were unavailable.


The primary outcome was incident symptomatic KOA during the 4-year follow-up period. In accordance with the definition used in a previous study,[36] symptomatic KOA was defined as physician-diagnosed arthritis together with concurrent pain in either knee joint. Incident symptomatic KOA was defined as being free of symptomatic KOA in CHARLS2011 and having symptomatic KOA in CHARLS2015. Knee pain was assessed from responses to the following question: "Are you often troubled by pain in any part of your body?" If the participant answered in the affirmative, a follow-up question was asked: "In what part of your body do you feel pain?"
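
The outcome definition above can be written as a small decision rule. The following Python sketch is illustrative only: the function and argument names are hypothetical, and treating any missing component as "no complete diagnosis" is an assumption, since the text does not say how partially missing responses were handled.

```python
from typing import Optional


def symptomatic_koa(arthritis: Optional[bool],
                    knee_pain: Optional[bool]) -> Optional[bool]:
    """Symptomatic KOA = physician-diagnosed arthritis AND knee pain.

    Returns None when either component is missing (assumed rule for
    an incomplete diagnosis).
    """
    if arthritis is None or knee_pain is None:
        return None
    return arthritis and knee_pain


def incident_koa(koa_2011: Optional[bool],
                 koa_2015: Optional[bool]) -> Optional[bool]:
    """Incident case = free of symptomatic KOA in 2011, present in 2015.

    Returns None for participants outside the at-risk cohort: those with
    an incomplete diagnosis at either wave or with KOA at baseline.
    """
    if koa_2011 is None or koa_2015 is None or koa_2011:
        return None
    return koa_2015
```

For example, a participant without symptomatic KOA in 2011 who reports both physician-diagnosed arthritis and knee pain in 2015 is counted as an incident case.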

Predictor Variables

Data on demographic background, socioeconomic status, biomedical findings, and levels of blood biomarkers were extracted from CHARLS2011. We included the following risk factors highlighted in previous studies and imputed missing values when necessary. The demographic variables included gender, age (years), body mass index (BMI; categorized as underweight [< 18.5 kg/m²], normal [18.5–24.9 kg/m²], overweight [25.0–29.9 kg/m²], or obese [≥ 30.0 kg/m²]), waist circumference (cm), and residential area (urban vs. rural). Waist circumference was categorized into four groups for men/women: < 85/80 cm, 85/80 to < 90/85 cm, 90/85 to < 95/90 cm, and ≥ 95/90 cm. The first group was treated as the normal group, and the other three groups were classified as central obesity according to the diagnostic criteria recommended by the Department of Disease Control at the Ministry of Health.[37] Behavioral variables included smoking status and engagement in vigorous, moderate, or light physical activity. The physical activity score was calculated by multiplying the duration code by the frequency code for a 1-week period.[38] Based on this score, physical activity was divided into three levels (none, 0; low, 1–4; moderate-to-high, ≥ 5). Health-related variables included history of hip fracture, number of other diagnosed comorbidities, metabolic syndrome (MS) according to Chinese Diabetes Society (CDS) criteria,[39] depressive symptoms based on the Center for Epidemiologic Studies Depression Scale (CESD-10) score,[40] self-rated health status, and self-reported difficulties with activities of daily living (ADLs)[41] or instrumental activities of daily living (IADLs).[42] The list of potential predictors is presented in Supplementary Box 1, along with details of how each predictor was assessed and the tools used.
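
The categorizations described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the BMI boundaries are treated as half-open intervals (an assumption that closes the gap between 24.9 and 25.0), and the duration and frequency codes are taken as given integers because their coding scheme is defined in the cited reference rather than in the text.

```python
def bmi_category(bmi: float) -> str:
    """Map BMI (kg/m^2) to the four categories used in the text."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


def waist_group(waist_cm: float, male: bool) -> int:
    """Return 0 for the normal group, 1-3 for the central obesity groups.

    Cut-offs are 85/90/95 cm for men and 80/85/90 cm for women.
    """
    cuts = (85, 90, 95) if male else (80, 85, 90)
    return sum(waist_cm >= c for c in cuts)


def activity_level(duration_code: int, frequency_code: int) -> str:
    """Score = duration code x frequency code; levels as in the text."""
    score = duration_code * frequency_code
    if score == 0:
        return "none"
    if score <= 4:
        return "low"
    return "moderate-to-high"
```

Under these cut-offs, a waist circumference of 86 cm falls in the first central obesity group for a man and in the second for a woman.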

Statistical Analysis

Model Structure. In CHARLS2011, physical activity measures were available for only 3684 randomly selected participants, while blood samples were available for 11,847 participants. Hence, the vigorous/moderate/light physical activity scores and MS were the main predictors with missing values. The percentage of missing values across the predictors ranged from 0.04% to 57%. We assumed that data were missing at random and generated 50 imputed datasets using the multiple imputation by chained equations (MICE) procedure.[43] MICE is suited to this setting because, under the missing-at-random assumption, the reasons for missing data can be explained by the observed variables included in the imputation model. We included all predictor variables in the MICE process, along with the diagnosis of symptomatic KOA in CHARLS2011 and in CHARLS2015, because this information strengthens the correlation structure among the covariates used as predictors in the imputation model. Continuous variables (including systolic blood pressure, diastolic blood pressure, triacylglycerol, HDL cholesterol, and fasting blood glucose) were imputed using linear regression, and binary and multicategory variables (including duration and frequency of physical activity, history of hip fracture, smoking behavior, self-rated health status, CESD-10 items, and ADL/IADL items) were imputed using logistic regression.

Descriptive statistics (means and standard deviations for continuous data, and counts and percentages for categorical data) were used to report key variables. Univariable and multivariable logistic regression analyses were used to establish a model for predicting the risk of symptomatic KOA. All candidate variables were first evaluated in an unconditional univariable logistic regression analysis, and variables were then selected for the multivariable logistic regression analysis on the basis of clinical value combined with statistical significance. In the multivariable analysis, stepwise selection based on the Akaike information criterion (AIC) was used to determine the final model structure. The coefficients, odds ratios (ORs), and 95% CIs were estimated via 1000-replication bootstrapping to obtain stable and unbiased parameters.[44] Estimates from the 50 imputed datasets were combined using Rubin's rules.[45]
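
As an illustration of how Rubin's rules combine results across imputed datasets, the sketch below pools one coefficient and its variance. It is a generic statement of the rules, not the authors' code; the function name and the example numbers are hypothetical.

```python
from math import exp, sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates of one coefficient with Rubin's rules.

    estimates: coefficient (log odds ratio) from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)          # average of the m estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical example with three imputations of a single log odds ratio.
beta, var = pool_rubin([0.5, 0.7, 0.6], [0.04, 0.04, 0.04])
odds_ratio = exp(beta)
standard_error = sqrt(var)
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters the pooled confidence interval.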

Internal Validation. The multivariable models were internally validated using a bootstrap procedure (sampling with replacement for 1000 iterations) to obtain bias-corrected estimates of predictive ability.
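
The text does not specify which bootstrap variant was used. One common choice is the optimism correction, sketched below under that assumption. The `fit` and `score` callables are hypothetical placeholders for a model-fitting routine and a performance measure such as the AUC; a score that is undefined on a resample (for example, an AUC with only one outcome class) would need extra handling.

```python
import random
from typing import Callable, Sequence, TypeVar

Row = TypeVar("Row")
Model = TypeVar("Model")


def optimism_corrected(rows: Sequence[Row],
                       fit: Callable[[Sequence[Row]], Model],
                       score: Callable[[Model, Sequence[Row]], float],
                       n_boot: int = 1000,
                       seed: int = 0) -> float:
    """Return the apparent performance minus the mean bootstrap optimism."""
    rng = random.Random(seed)
    apparent = score(fit(rows), rows)
    total_optimism = 0.0
    for _ in range(n_boot):
        # Resample participants with replacement, same size as the original.
        sample = rng.choices(rows, k=len(rows))
        model = fit(sample)
        # Optimism: performance on the resample minus performance on
        # the original data for the model fitted to the resample.
        total_optimism += score(model, sample) - score(model, rows)
    return apparent - total_optimism / n_boot
```

The corrected value estimates how the model would perform in new participants from the same population, rather than in the data used to fit it.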

Model Performance. We assessed the predictive performance of the final model using calibration and discrimination measures. Discrimination refers to the ability to distinguish patients experiencing an event from those not experiencing the event and was quantified based on the area under the receiver operating characteristic curve (AUC) in this study. Calibration refers to how closely the predicted risk corresponds with the observed risk and was assessed visually using calibration plots.
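
For illustration, the AUC can be computed directly from its interpretation as the probability that a randomly chosen participant with the event receives a higher predicted risk than a randomly chosen participant without it. The function below is a minimal sketch with hypothetical names, not the routine used in the study.

```python
def auc(predicted_risk, outcome):
    """Area under the ROC curve via pairwise comparison.

    predicted_risk: predicted probabilities or scores
    outcome: 1 for participants with the event, 0 otherwise
    Ties between an event and a non-event count as half a concordant pair.
    """
    events = [p for p, y in zip(predicted_risk, outcome) if y == 1]
    non_events = [p for p, y in zip(predicted_risk, outcome) if y == 0]
    if not events or not non_events:
        raise ValueError("AUC needs at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))
```

An AUC of 0.5 corresponds to discrimination no better than chance, and 1.0 to perfect separation of events from non-events.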

Clinical Scoring Tool. For ease of clinical use, we developed a points-based risk-scoring tool from the final model, following a widely used method of clinical scoring.[23] This tool can be used to identify individuals at high risk of developing symptomatic KOA during the following 4 years. Continuous factors were categorized based on the results of meta-analyses and clinical practice guidelines. Scores for categorical variables were determined by multiplying the β coefficients (log odds ratios) from the multivariable logistic regression model by ten and rounding to the nearest integer. The total score was the sum of the scores for all variables. Sensitivity, specificity, and the AUC were calculated at different cut-off values, and the maximal Youden index was used to identify the optimal cut-off point.[46] The Youden index was calculated as sensitivity + specificity − 1.
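
The scoring and cut-off steps can be sketched as follows. The coefficients and scores in the example are made up for illustration and are not results from the study; the function names are hypothetical, and classifying a participant as high risk when the total score is at or above the cut-off is an assumption.

```python
def points(beta: float) -> int:
    """Convert a log odds ratio to points: multiply by ten and round.

    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    return round(beta * 10)


def best_cutoff(total_scores, outcome):
    """Return (cut-off, Youden index, sensitivity, specificity) at the
    cut-off that maximizes sensitivity + specificity - 1."""
    n_pos = sum(1 for y in outcome if y == 1)
    n_neg = sum(1 for y in outcome if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    best = None
    for cut in sorted(set(total_scores)):
        pairs = list(zip(total_scores, outcome))
        true_pos = sum(1 for s, y in pairs if s >= cut and y == 1)
        true_neg = sum(1 for s, y in pairs if s < cut and y == 0)
        sensitivity = true_pos / n_pos
        specificity = true_neg / n_neg
        youden = sensitivity + specificity - 1
        if best is None or youden > best[1]:
            best = (cut, youden, sensitivity, specificity)
    return best


# Hypothetical coefficients for two predictor categories.
example_points = [points(0.47), points(-0.32)]   # [5, -3]
```

With total scores of 2, 5, 7, 9, 12, and 15 and events only in the three highest-scoring participants, the cut-off selected by this rule is 9.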

The present study was conducted in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines for model development and reporting. All analyses were performed using Stata version 15.1 (StataCorp, College Station, TX) and R version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria). All statistical tests were two-sided, and P values < 0.05 were considered statistically significant.

Ethics Statement

Given that the present study is a secondary analysis of publicly available CHARLS data, the Medical Ethics Board Committee of Peking University granted the study an exemption from review.