Selection of the Most Accurate Thermometer Devices for Clinical Practice: Part 1

Meta-Analysis of the Accuracy of Non-Core Thermometer Devices Compared to Core Body Temperature

Nancy A. Ryan-Wenger; Maureen A. Sims; Rebecca A. Patton; Jayme Williamson


Pediatr Nurs. 2018;44(3):116-133. 


Search Method

Our Research and Evidence-Based Practice (EBP) team conducted a general search for peer-reviewed evidence related to temperature measurement in the following databases: Medline, Cumulative Index of Nursing and Allied Health Literature (CINAHL), Cochrane Database, Clinical Key, and the National Guideline Clearinghouse. Keywords included body temperature and thermometer accuracy, with subcategories of core, axillary, rectal, temporal, tympanic, and oral temperatures. Search limits were set for English language, humans, a time frame from 1990 to 2017, and journal articles. Relevant articles were also derived from the reference lists of research studies. The initial selection criterion for inclusion in the meta-analysis was that a study compared core body temperatures with body temperatures from non-core thermometer devices in children older than one month and in adults. There is no evidence that the physiology of temperature fluctuations differs between children and adults; therefore, it was appropriate to include both groups.

Search Outcome

Articles that met the initial selection criteria were subjected to a secondary review. The secondary criteria were as follows: concurrent or sequential core and non-core temperature measurements; and reports of the mean difference, standard deviation, and 95% confidence intervals (CI) between core and non-core temperatures, or sufficient data to calculate these statistics. Very small samples are likely to have wide variability in mean differences that may not be representative of the population. Therefore, we set our criterion for sample size as 10 or more subjects. Data from glass thermometers containing mercury or gallium were excluded for safety reasons.

From a total of 244 articles identified from the literature search, 47 duplicate articles were eliminated, leaving a total of 197 articles. Of these, 159 were research-based and 38 were non-research articles (see Figure 1). Of the 159 research studies, 39 studies compared core with non-core thermometer devices, and 34 of these met the initial and secondary criteria for inclusion in the meta-analysis, while five did not include sufficient statistical data. The remaining research studies included comparisons of non-core vs. non-core thermometer devices (n=83), studies of core vs. core devices (n=9), and studies of temperature accuracy involving experiments, parents, eye temperatures, and computer modeling (n=26). Research articles that focused only on intra- or inter-rater repeatability of non-core temperature devices (n=2) were not included in the meta-analysis, but were included in the discussion of intra- and inter-rater repeatability. Of the 38 non-research studies, 28 were clinical articles and 10 were systematic literature reviews and/or meta-analyses.

Figure 1. Flow Chart of Literature Search

Data Extraction

In this study, temperature accuracy was defined as the extent to which measurements from a non-core thermometer device are the same as measurements from a gold standard device (Ryan-Wenger, 2017). In the context of temperature measurement, bias was the absolute or pooled mean difference between core and non-core measurements, while effect size (ES) (i.e., accuracy) was defined as the standard deviation (SD) of mean differences. When only standard error (SE) was reported, we calculated SD = SE × √n. When only confidence limits (CL) or confidence intervals (CI) were reported, we used the formula SD = CI/3.12. To ensure equality in the method used to measure 5% and 95% confidence limits (CL) among the studies, we used the Confidence Interval Calculator for Means (Allto Consulting, 2017) software to calculate them from study means and SDs.

Repeatability was defined as the extent to which duplicate temperature measurements taken by one rater (intra-rater) or more than one rater (inter-rater) on the same person are the same as measured by appropriate statistics, such as kappa coefficients or intra-class correlations (Bland & Altman, 1986; Cho, 1981; Collis, 1985). Many authors report inappropriate statistics, including Pearson correlations, paired t tests, sensitivity, and specificity. Correlations measure the strength and direction of association between two variables, not the extent to which the measurements are the same (Bland & Altman, 1986, 1992). Paired t tests test the null hypothesis that the difference between two measurements is equal to zero, which is not clinically useful information. Sensitivity and specificity indicate the proportion of agreement between dichotomous variables (for example, fever vs. no fever) and not the extent to which the actual temperature measurements were the same (Bland & Altman, 1986, 1992).
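The SE-to-SD conversion and the recalculation of confidence limits from a study's mean and SD can be illustrated with a minimal Python sketch. The function names are hypothetical, the input values are invented for illustration, and the multiplier z = 1.96 (a normal approximation for a 95% interval) is an assumption; the calculator software cited above may use a different multiplier, such as a t value.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """Recover the standard deviation from a standard error: SD = SE * sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return se * math.sqrt(n)


def confidence_limits(mean: float, sd: float, n: int, z: float = 1.96):
    """Return (lower, upper) limits for a mean.

    Assumption: z = 1.96 gives a normal-approximation 95% interval.
    """
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Invented example: a study reporting a mean difference of 0.30 degrees C
# with SE = 0.10 in a sample of 25 subjects.
sd = sd_from_se(se=0.10, n=25)
lower, upper = confidence_limits(mean=0.30, sd=sd, n=25)
print(f"SD = {sd:.2f}, limits = ({lower:.3f}, {upper:.3f})")
```

Recomputing the limits from means and SDs in one consistent way, as sketched here, is what makes limits comparable across studies that reported them differently.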

Analysis Plan

All authors evaluated measurement accuracy in the total sample, while some authors also reported results for patients with hypothermia, normothermia, and/or hyperthermia. We believe that a clinically useful thermometer must demonstrate accuracy among all patients regardless of setting or range of body temperatures; therefore, our meta-analysis was limited to comparisons in the total sample of each study. To account for variation in sample size across studies, we calculated inverse variance weights and weighted mean effect size for each study (Lipsey & Wilson, 2001). The accuracy of temperatures from non-core thermometer devices was graphically displayed as Forest plots, which depict pooled weighted mean differences between core and non-core temperatures and 95% CL with 0 (perfect agreement) as the reference line. MetaData Viewer software v1.05 (Boyles, Harris, Rooney, & Thayer, 2011, 2016) was used to generate Forest plots.
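As a rough illustration of inverse variance weighting in the style of Lipsey and Wilson (2001), the sketch below weights each study by the reciprocal of its squared standard error and then forms the weighted mean. The function name and the study values are hypothetical; this is not the SPSS procedure used in the analysis.

```python
def inverse_variance_pool(effects, std_errors):
    """Return (pooled_effect, weights) using weights w_i = 1 / SE_i**2.

    The pooled effect is sum(w_i * ES_i) / sum(w_i), so studies with
    smaller standard errors (usually larger samples) count for more.
    """
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("effects and std_errors must be non-empty and equal in length")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * es for w, es in zip(weights, effects)) / total_weight
    return pooled, weights


# Invented example: two studies with mean differences of 0.2 and 0.4 degrees C.
pooled, weights = inverse_variance_pool([0.2, 0.4], [0.1, 0.2])
print(f"weights = {[round(w, 1) for w in weights]}, pooled = {pooled:.2f}")
```

In this invented example the first study has half the standard error of the second, so it receives four times the weight and pulls the pooled value toward its own estimate.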

Precision of mean ES measurements within studies of each type of thermometer device was a function of the mean ES, inverse variance weight, and SE of mean ES (Lipsey & Wilson, 2001). A small SE indicates a precise mean ES. A significant z-test means that the ES is likely to be representative of the population. We evaluated the extent of heterogeneity between studies on effect size with the Cochran Q statistic, which tests the null hypothesis of homogeneity (Lipsey & Wilson, 2001). Under that null hypothesis, Q has a chi-square distribution with k − 1 degrees of freedom, where k is the number of studies. We also measured the impact of heterogeneity with the I² statistic, which accounts for within-study and between-study variation in estimates of pooled mean effect size. I² is expressed as a percentage and describes the proportion of total variability in study estimates that is due to heterogeneity between studies rather than to random sampling error (Higgins & Thompson, 2002). IBM SPSS v. 24 was used to analyze the data, and the significance level was set at p ≤ 0.05.
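The precision and heterogeneity statistics named above can be sketched in a few lines of Python. This is an illustrative sketch, not the SPSS analysis itself: the function name and input values are hypothetical, the formulas follow the standard fixed-effect forms (SE of the mean ES = √(1/Σw), z = mean ES / SE, Q = Σw(ES − mean ES)²), and I² is computed as 100 × (Q − df)/Q, truncated at zero, following Higgins and Thompson (2002).

```python
import math


def heterogeneity_summary(effects, std_errors):
    """Return pooled ES, its SE, the z statistic, Cochran's Q, df, and I-squared (%)."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * es for w, es in zip(weights, effects)) / total_weight
    se_pooled = math.sqrt(1.0 / total_weight)  # smaller SE = more precise mean ES
    z = pooled / se_pooled                     # tests whether the pooled ES differs from 0
    q = sum(w * (es - pooled) ** 2 for w, es in zip(weights, effects))
    df = len(effects) - 1                      # Q is referred to chi-square with k - 1 df
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "se": se_pooled, "z": z,
            "Q": q, "df": df, "I2": i_squared}


# Invented example: three studies whose estimates disagree noticeably.
summary = heterogeneity_summary([0.2, 0.4, 0.9], [0.1, 0.2, 0.1])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A Q value well above its degrees of freedom, together with a high I², would indicate that the studies differ by more than sampling error alone can explain; the p value for Q would come from the chi-square distribution with the reported df.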