Predictors of Mortality Among Long-term Care Residents With SARS-CoV-2 Infection

Douglas S. Lee MD, PhD; Shihao Ma BASc; Anna Chu MHSc; Chloe X. Wang BSc; Xuesong Wang MSc; Peter C. Austin PhD; Finlay A. McAlister MD, MSc; Sunil V. Kalmady PhD; Moira K. Kapral MD, MSc; Padma Kaul PhD; Dennis T. Ko MD, MSc; Paula A. Rochon MD, MPH; Michael J. Schull MD, MSc; Barry B. Rubin MD, PhD; Bo Wang PhD


J Am Geriatr Soc. 2021;69(12):3377-3388. 


Data sources

We used data from SARS-CoV-2 viral RNA PCR testing conducted in the population of Ontario. The Registered Persons Database (RPDB) was used to identify all individuals living in Ontario who were alive and eligible for the Ontario Health Insurance Plan (OHIP), the province's universal health insurance. Residents of LTC homes were identified using the Ontario Drug Benefit database and the Continuing Care Reporting System (CCRS). Data from CCRS interRAI assessments were also used to obtain characteristics of residents of LTC facilities. InterRAI is a mandated clinical assessment tool that is completed for every LTC resident in Ontario at admission and at quarterly intervals. Staff within the LTC home complete it using information from the resident chart, discussions with the care team, resident, and other caregivers, and observations over the days prior to completion.[7] Immigrant status was determined from the Immigration, Refugees and Citizenship Canada Permanent Residents database. The Canadian Institute for Health Information Discharge Abstract Database (CIHI-DAD) and National Ambulatory Care Reporting System (NACRS) database were used to obtain information on all hospitalizations and emergency department visits, respectively, and related medical diagnoses in the year prior for each individual in the study cohort, using the International Classification of Diseases 10th Canadian Edition (ICD-10-CA) coding system. Select comorbidities in the prior 5 years were also extracted from the CIHI-DAD, NACRS, and OHIP databases. Where available, chronic diseases were identified using validated, disease-specific provincial databases at ICES (formerly Institute for Clinical Evaluative Sciences). We also assessed select laboratory tests conducted at any time between 2015 and 2019 using the Ontario Laboratories Information System (OLIS) database. Finally, regional population characteristics were obtained at the dissemination area level using data from the 2016 Canadian Census.
In Canada, dissemination areas are small, relatively stable geographic areas with populations of 400–700 people, bounded by the road network. Similar to ZIP Code Tabulation Areas (ZCTAs) in the United States, dissemination areas are geographic units created by Statistics Canada for census purposes and reporting; however, unlike ZCTAs, which are built from US Postal Service ZIP Codes designed for efficient mail delivery, dissemination areas are independent of mail delivery areas. Community characteristics included information on age and sex distributions, English and French language ability, education, ethnic population, housing, and employment status in the regions surrounding LTC homes.

Study Cohort

We studied all residents of LTC homes in Ontario, Canada who tested positive for SARS-CoV-2 from January 1, 2020 to August 31, 2020. We excluded non-Ontario residents, those who were not eligible for OHIP, and those whose test was performed after death. If an individual had repeat tests, the first positive test was considered the index test. This study was performed under Section 45 of Ontario's Personal Health Information Protection Act (PHIPA), did not require approval by a Research Ethics Board, and did not require individual consent to be obtained. Therefore, all LTC residents in the province who tested positive for SARS-CoV-2 were included without participant bias.
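The index-test rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the record layout, the `select_index_tests` helper, and the sample records are hypothetical. The exclusion of tests performed after death is applied here per test, which is an assumption about how that rule was implemented.

```python
from datetime import date

def select_index_tests(records, start, end):
    """Return {resident_id: date of first positive test} within [start, end].

    records: iterable of (resident_id, test_date, result, death_date_or_None).
    Tests dated after the resident's death are ignored (assumed reading of the rule).
    """
    index = {}
    for resident_id, test_date, result, death_date in records:
        if result != "positive":
            continue
        if not (start <= test_date <= end):
            continue
        if death_date is not None and test_date > death_date:
            continue
        # Keep the earliest qualifying positive test per resident.
        if resident_id not in index or test_date < index[resident_id]:
            index[resident_id] = test_date
    return index

# Hypothetical records for illustration only.
records = [
    ("A", date(2020, 4, 10), "positive", None),
    ("A", date(2020, 4, 2), "positive", None),               # earlier repeat test
    ("B", date(2020, 5, 1), "negative", None),               # never positive
    ("C", date(2020, 6, 3), "positive", date(2020, 6, 1)),   # test after death
]
index_tests = select_index_tests(records, date(2020, 1, 1), date(2020, 8, 31))
```

In this toy input, only resident A enters the cohort, with the earlier of the two positive tests as the index date.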


Outcomes

The primary outcome was time to all-cause mortality occurring up to 30 days after a positive SARS-CoV-2 test. Deaths were identified using the RPDB.
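As a rough sketch, the time-to-event outcome could be derived from an index date and a death date as follows. The `thirty_day_outcome` function is hypothetical, and censoring survivors at day 30 is an assumption: the text states the 30-day window but does not spell out the censoring rule.

```python
from datetime import date

def thirty_day_outcome(index_date, death_date, horizon_days=30):
    """Return (follow_up_days, event) for all-cause death within the horizon.

    event is 1 if death occurred 0..horizon_days after the index test;
    otherwise follow-up is censored at the horizon (assumed convention).
    """
    if death_date is not None:
        days = (death_date - index_date).days
        if 0 <= days <= horizon_days:
            return days, 1
    return horizon_days, 0

died_early = thirty_day_outcome(date(2020, 4, 1), date(2020, 4, 11))
survived = thirty_day_outcome(date(2020, 4, 1), None)
died_late = thirty_day_outcome(date(2020, 4, 1), date(2020, 5, 16))
```

A death on day 45 is treated the same as survival to day 30, because the outcome is restricted to the 30-day window.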


Covariates

Potential covariates included patient-level demographic characteristics (e.g., age, sex, education, English language ability, ethnicity, and immigrant status), acute hospitalizations for infectious or respiratory disease (e.g., pneumonia or influenza, SARS-CoV-1 infection [ca. 2003], H1N1 infection, respiratory tuberculosis, invasive pneumococcal disease or other acute respiratory infections) since 2000, Hospital Frailty Risk Score (HFRS),[8] Charlson comorbidity score,[9] chronic diseases, and comorbidities (e.g., prior organ transplant, liver disease, hypertension, diabetes, heart failure, cancer, COPD, asthma, need for home oxygen, chronic kidney disease, atrial fibrillation, peripheral vascular disease, rheumatoid arthritis, inflammatory bowel disease, HIV infection, and dementia). Comorbidities with an onset date were modeled as the duration of illness in years, and residents without the condition were assigned a value of zero (see Table S1 for comorbidity codes). We also included outcome scales and clinical assessment protocols from the interRAI Resident Assessment Instrument (RAI-MDS 2.0), to assess clinical and functional status and identify areas for potential intervention,[10] and laboratory test results from the province-wide OLIS database (see Table S2 for a list of all laboratory tests considered). Missing lab data were imputed using the age/sex-stratified mean of available measurements. Since prior reports identified the importance of neighborhood characteristics on the likelihood of SARS-CoV-2 infection, we included regional socio-demographic and population characteristics (e.g., neighborhood income quintile of LTC home and community size) of the 20,160 dissemination areas in the province. Potential predictors were included in our models only if they were present prior to the date of the index positive SARS-CoV-2 test.
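The age/sex-stratified mean imputation described above might look like the following minimal sketch. It is illustrative only: the 10-year age bands, the `impute_stratified_mean` helper, and the `hemoglobin` field are assumptions, since the text does not say how age strata were defined or name this test as an example.

```python
from collections import defaultdict

def age_band(age, width=10):
    """Lower bound of the age stratum (the band width is an assumption)."""
    return (age // width) * width

def impute_stratified_mean(rows, lab):
    """Fill missing values of `lab` with the mean of its (age band, sex) stratum.

    rows: list of dicts with 'age', 'sex', and the lab field (None if missing).
    Returns new dicts; a value stays None if its stratum has no measurements.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        if r[lab] is not None:
            stratum = (age_band(r["age"]), r["sex"])
            sums[stratum] += r[lab]
            counts[stratum] += 1
    imputed = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[lab] is None:
            stratum = (age_band(r["age"]), r["sex"])
            if counts[stratum]:
                r[lab] = sums[stratum] / counts[stratum]
        imputed.append(r)
    return imputed

# Hypothetical rows for illustration only.
rows = [
    {"age": 82, "sex": "F", "hemoglobin": 120.0},
    {"age": 85, "sex": "F", "hemoglobin": 130.0},
    {"age": 88, "sex": "F", "hemoglobin": None},
    {"age": 83, "sex": "M", "hemoglobin": None},
]
filled = impute_stratified_mean(rows, "hemoglobin")
```

The third row receives the mean of the two measured values in its stratum, while the fourth stays missing because no male aged 80 to 89 has a measurement in this toy input.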


Statistical Analysis

Our analyses were performed on the private cloud-based Ontario Health Data Platform at ICES, enabling secure remote machine learning analyses on linked population-level health and other data. Continuous variables were presented using median (25th and 75th percentiles) and categorical variables using proportions. With death as the outcome of interest, we conducted survival analysis to predict 30-day mortality using a random survival forest (RSF) model, which is an alternative to the Cox proportional hazards model for survival prediction and allows for evaluation of time-to-event outcomes.[11] We first split the study data into training (80%) and testing (20%) sets. To improve the models' discriminative ability, we then randomly sub-sampled the non-deaths in the training set to match the number of deaths in a ratio of 1:1 using simple random sampling, creating an enriched training set. We repeated this step to create 50 balanced training datasets, training 50 separate models. The final model, which is an ensemble of the 50 models, outputs a 30-day survival probability by averaging the predicted probabilities from the 50 models with uniform weighting, and was further evaluated in the test set. We used the predicted survival probabilities on the original training set to divide the training samples into four risk quartiles. The threshold values for each quartile were noted and used when we divided the testing samples into risk quartiles given their predicted survival probabilities.
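The resampling and ensembling steps above can be sketched in plain Python. This is a schematic under stated assumptions, not the authors' pipeline: `fit_stand_in` is a deliberately trivial placeholder learner rather than a random survival forest, all function names are hypothetical, and the data are synthetic.

```python
import random
import statistics

def balanced_subsamples(labels, n_sets, rng):
    """Yield index lists holding every death plus an equal-sized random draw of survivors."""
    deaths = [i for i, y in enumerate(labels) if y == 1]
    survivors = [i for i, y in enumerate(labels) if y == 0]
    for _ in range(n_sets):
        yield deaths + rng.sample(survivors, len(deaths))

def fit_stand_in(xs, ys):
    """Placeholder learner (NOT an RSF): survival rate on each side of the median of x."""
    cut = statistics.median(xs)
    def death_rate(group):
        return sum(group) / len(group) if group else 0.5
    high = death_rate([y for x, y in zip(xs, ys) if x >= cut])
    low = death_rate([y for x, y in zip(xs, ys) if x < cut])
    return lambda x: 1.0 - (high if x >= cut else low)

def fit_ensemble(xs, ys, n_models=50, seed=0):
    """Train one model per balanced subsample; predict by uniform averaging."""
    rng = random.Random(seed)
    models = [fit_stand_in([xs[i] for i in idx], [ys[i] for i in idx])
              for idx in balanced_subsamples(ys, n_models, rng)]
    return lambda x: sum(m(x) for m in models) / len(models)

def risk_quartile(survival_prob, cuts):
    """Map a survival probability to 1 (lowest risk) .. 4 (highest risk)."""
    return 1 + sum(survival_prob < c for c in cuts)

# Synthetic data: one covariate (age), with death more likely at older ages.
rng = random.Random(1)
ages = [rng.randint(65, 100) for _ in range(400)]
died = [1 if rng.random() < (a - 60) / 100 else 0 for a in ages]

# 80/20 split, then the balanced ensemble is fitted on the training part only.
order = list(range(len(ages)))
rng.shuffle(order)
train, test = order[:320], order[320:]
train_ages = [ages[i] for i in train]
train_died = [died[i] for i in train]
predict = fit_ensemble(train_ages, train_died)

# Quartile thresholds come from the original (unbalanced) training set ...
train_probs = [predict(a) for a in train_ages]
cuts = statistics.quantiles(train_probs, n=4)
# ... and are reused unchanged to stratify the test set.
test_quartiles = [risk_quartile(predict(ages[i]), cuts) for i in test]
```

Taking the thresholds from the training predictions and reusing them on the test set means the test quartiles need not each hold exactly 25% of test residents.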

To identify variables important for predicting time to death within 30 days, we used the VIMP function in the randomForestSRC package.[12] To calculate the importance score for a variable x, VIMP randomly permutes the values of x so that, when a sample is dropped down the survival tree and a split on x is encountered, the sample is assigned to a daughter node at random. The importance score for x is the prediction error of the new model obtained with randomized x values minus the prediction error of the original model. Thus, the VIMP for x measures the increase or decrease in prediction error on the testing set if x were not available. Larger importance scores indicate variables with more predictive power, and zero or negative scores indicate nonpredictive variables. A priori, we chose to select the 50 most important covariates from the survival models, and further evaluated the directionality of association with mortality using Spearman's correlation analysis in the test set. The directionality of all covariates (particularly those with low Spearman correlation) was confirmed by visual plot. We determined the time-dependent area under the curve (AUC) at day 30[13] on the test set for models using covariate groupings, including base demographics (e.g., age, sex, and sociodemographic variables), interRAI (functional), LTC information, comorbidities including HFRS, surrounding community characteristics, and laboratory test data from OLIS. All variables were included in our final full model, and individuals in the test sample were stratified into risk quartiles based on their predicted 30-day risk. Odds ratios were determined to compare risk across quartiles with the lowest risk quartile as the reference category. Cumulative incidence of death was compared between strata using the log-rank test. Machine learning analyses were performed using the R package randomForestSRC (R Foundation for Statistical Computing, Vienna, Austria). Statistical analyses were performed using SAS version 9.4 (Cary, North Carolina).
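The idea behind the importance score (error with x randomized minus error of the original model) can be illustrated with a generic permutation-importance sketch. This is not the randomForestSRC routine: it shuffles a column of the data rather than randomizing daughter-node assignment inside trees, uses a simple misclassification rate instead of a survival error measure, and relies on a hypothetical rule-based stand-in model and synthetic rows.

```python
import random

def error_rate(model, X, y):
    """Fraction of rows where the model's prediction differs from the label."""
    return sum(model(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, col, n_repeats=20, seed=0):
    """Mean increase in error after shuffling column `col` (permuted minus original)."""
    rng = random.Random(seed)
    base = error_rate(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        increases.append(error_rate(model, X_perm, y) - base)
    return sum(increases) / n_repeats

# Synthetic rows: [age, noise]; the label depends on age only.
X = [[age, i % 3] for i, age in enumerate(range(70, 100))]
y = [1 if row[0] >= 85 else 0 for row in X]

def model(row):
    """Stand-in predictor that uses age and ignores the noise column."""
    return 1 if row[0] >= 85 else 0

vimp_age = permutation_importance(model, X, y, col=0)
vimp_noise = permutation_importance(model, X, y, col=1)
```

Shuffling the age column degrades the predictions, so its score is positive, while shuffling the ignored noise column leaves the error unchanged and yields a score of zero, matching the interpretation of zero scores as nonpredictive variables.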