Selection of the Most Accurate Thermometer Devices for Clinical Practice: Part 1

Meta-Analysis of the Accuracy of Non-Core Thermometer Devices Compared to Core Body Temperature

Nancy A. Ryan-Wenger; Maureen A. Sims; Rebecca A. Patton; Jayme Williamson


Pediatr Nurs. 2018;44(3):116-133. 

Description of Studies Included in the Meta-analysis

Thirty-four research studies conducted between 1990 and 2017 met the primary and secondary criteria for inclusion in our meta-analysis. Details on samples, core and non-core temperature sites and devices, and mode used with electronic devices are described in Table 1. Eight (23.5%) samples were children one month to 18 years old, and 27 (79.4%) samples were adults. One study included both adults and children. Core body temperature was measured from a variety of sites in these studies: pulmonary artery (n=21, 61.8%), bladder (n=7, 20%), esophageal (n=4, 11.8%), and nasopharyngeal (n=3, 8.6%). One study included both pulmonary artery and bladder sites. Core body temperatures were compared to temperatures from six types of non-core sites and devices, including oral electronic, rectal electronic, axillary chemical, axillary electronic, temporal artery, and tympanic sites. Oral chemical devices were evaluated in only one study, and thus, were not included in the meta-analysis.

Quality Assessment of Studies Included in the Meta-analysis

Two investigators individually applied the GRADE criteria for diagnostic tests and strategies (Schünemann, Brożek, Guyatt, & Oxman, 2013) to evaluate the quality of studies included in the meta-analysis. The purpose of testing the non-core thermometer devices was for replacement (to serve as a substitute for core body temperature methods). Cohort studies in patients with diagnostic uncertainty (normothermic, hypothermic, or hyperthermic) in which non-core thermometer devices were directly compared to an appropriate reference standard (core body temperature) are initially considered high quality. Thus, we rated the studies as high quality, then evaluated the studies for factors that could lower the quality, including risk of bias, indirectness, inconsistency, imprecise evidence, and publication bias.

Risk of Bias

The risk of bias among the 34 studies included in our meta-analysis was evaluated as low according to the four domains of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) criteria (Whiting et al., 2011). Brief summaries of this evaluation are listed below.

Description. All studies were rated low risk because they adequately described methods of patient selection and patient characteristics. In all studies, index tests (non-core thermometer devices) and reference standards (core body temperature), protocols for measuring temperatures, and the interval between measurements with non-core devices and core body temperature (simultaneous or consecutive) were adequately described.

Signaling questions. Ten of 11 criteria were rated in the positive direction. For example, non-core and core temperatures were interpreted without knowledge of the other, the interval between tests was appropriate, and all patients received the reference standard. One unclear criterion in many studies was whether consecutive or random samples were enrolled.

Risk of bias. Three of four criteria in this domain were rated low risk because the selection of patients, conduct, and interpretation of temperatures from non-core and core devices were not likely to introduce bias. Each study had a protocol for use of the devices. We only used reported mean temperature data in our meta-analysis, not data related to authors' interpretation of temperatures. One criterion was rated low quality because patient flow was poorly described and could have introduced bias.

Concerns about applicability. All three criteria were rated low risk: the patients included in the studies and the non-core and core devices matched the purpose of the meta-analysis.

Overall, we determined the risk of bias was low.

Indirectness of the Evidence

The patients enrolled in the studies were surgical or critically ill patients for whom devices to measure core body temperature were routinely inserted for monitoring purposes and could serve as the reference standard for non-core body temperatures. The target population for this meta-analysis was all patients regardless of setting. The two populations are not significantly different because body temperatures are primarily influenced by physiologic processes and less by the setting in which they are measured. Thus, the evidence can be directly applied to both healthy and sick individuals. We evaluated this criterion as low risk.

Overall Confidence in Estimates of Effect Size

Our confidence in the estimates of ES between mean non-core and core body temperatures reported in the 34 studies was high based on four criteria.

Consistency. All studies used the same research design in which core and non-core temperatures were compared.

Precision. Narrow and wide CIs were explained by most authors. Our calculations showed highly precise ES within studies of each thermometer device as indicated by very low SE (Range=0.02 to 0.06) and significant z-tests (see Table 2).
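As a rough illustration of how a small SE translates into a significant z-test, the sketch below divides an effect size by its standard error and converts the result to a two-sided p-value. The function name and the input values are hypothetical; only the SE range (0.02 to 0.06) comes from the text, and the authors' exact computation is not described here.

```python
import math


def z_test(estimate, standard_error):
    """Return (z, two-sided p-value) for the null hypothesis of zero effect."""
    z = estimate / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical effect size; the SE is chosen inside the reported 0.02-0.06 range.
z, p = z_test(0.20, 0.04)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With an SE this small, even a modest effect size yields a large z statistic, which is why the z-tests are reported as significant.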

Publication bias. We rated bias as low because studies with large and small samples, and tests of each non-core device were represented, and both positive and negative findings were reported and explained by the authors.

Heterogeneity assessment. We calculated two statistics to evaluate the heterogeneity of effect sizes within and between studies. The extent of heterogeneity was Q=0% for each type of thermometer device (see Table 2), meaning that "variability across effect sizes does not exceed what would be expected based on sampling error" (Lipsey & Wilson, 2001, Slide #24). The impact of heterogeneity, measured as the proportion of variance in pooled mean differences for each type of device, was 0% due to true differences and I²=100% due to random sampling error (see Table 2) (Higgins & Thompson, 2002). Overall, we determined the studies were high quality, and confidence in the ES was high.
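The two heterogeneity statistics can be sketched as follows, assuming a fixed-effect, inverse-variance weighting of the study-level mean differences. This is a generic illustration, not the authors' calculation: the function name and the input data are hypothetical, and I² follows the usual formula (Q − df)/Q, floored at zero, so a value of 0% corresponds to variability no larger than sampling error would produce.

```python
def heterogeneity(effects, standard_errors):
    """Return (pooled effect, Cochran's Q, I-squared in percent)."""
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation beyond what chance predicts.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical mean differences (degrees C) and standard errors.
pooled, q, i2 = heterogeneity([0.10, 0.12, 0.08], [0.05, 0.04, 0.06])
print(f"pooled = {pooled:.3f}, Q = {q:.2f}, I2 = {i2:.1f}%")
```

When Q does not exceed its degrees of freedom (the number of studies minus one), I² is reported as 0%.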

Intra-rater Repeatability

Training of data collectors in the use of thermometer devices increases the likelihood that temperatures are taken in the same manner by all data collectors (Ryan-Wenger, 2017). Only about one-third of authors mentioned that data collectors were trained to use thermometer devices (n=13, 38.2%). Authors of four studies evaluated intra-rater reliability in thermometer measurements but reported inappropriate statistics, including SD (Childs, Harrison, & Hodkinson, 2014), mean differences (Rotello et al., 1996), Pearson correlations (Myny, Dewaele, Defloor, Blot, & Colardyn, 2005), and percentage of mean differences within a clinically acceptable range of 0.2°C (Rubia-Rubia, Arias, Sierra, & Aguirre-Jaime, 2011). Only Kimberger, Cohen, Illievich, and Lenhardt (2007) reported an appropriate statistic: the intraclass correlation between two subsequent temporal artery measurements was 0.73 (CI: 0.67 to 0.78).
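For readers unfamiliar with the intraclass correlation, the sketch below computes one common form, the one-way random-effects ICC for single measurements, from repeated readings on the same subjects. The ICC model used in the cited study is not stated here, so the choice of form, the function name, and the temperatures are all assumptions made for illustration.

```python
def icc_oneway(measurements):
    """One-way random-effects ICC for single measurements.

    measurements: one list per subject, each holding k repeated readings.
    """
    n = len(measurements)
    k = len(measurements[0])
    grand_mean = sum(sum(row) for row in measurements) / (n * k)
    subject_means = [sum(row) / k for row in measurements]
    # Between-subject mean square: spread of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: spread of repeated readings around each subject mean.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(measurements, subject_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical pairs of consecutive temperatures (degrees C) on four subjects.
readings = [[36.5, 36.7], [37.1, 37.0], [38.2, 38.4], [36.9, 36.8]]
print(f"ICC = {icc_oneway(readings):.2f}")
```

Unlike a Pearson correlation, this statistic penalizes systematic shifts between the first and second reading, which is why it suits repeatability questions.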

Inter-rater Repeatability

Among the 34 studies, 5 (14.7%) had one observer, 17 (50%) had multiple observers, and the number of observers was not reported in 12 (35.3%) studies. When studies have more than one data collector, it is essential to evaluate the extent to which observers' measurements on the same subject agree with each other (Ryan-Wenger, 2017). Inter-rater repeatability of tympanic temperatures was reported in only one study (Rubia-Rubia et al., 2011), but again, the authors reported only the percentage of measurements within a clinically acceptable range. Overall, there was insufficient evidence to evaluate the repeatability of temperature measurements from various types of thermometer devices.

Temporal Artery Thermometers

Temporal artery temperatures were compared with core body temperatures in 16 total samples. Mean differences (bias) were quite variable, ranging from −0.44 to +1.3°C (see Figure 2). Core body temperature was overestimated in 81.3% (n=13) of the samples. Effect sizes were between ±0.02 and ±1.8°C. Lower CLs ranged from −2.99 to +4.1°C, while upper CLs ranged from −0.41 to +3.74°C. The forest plot illustrates the wide variability of temperatures between and within the studies (see Figure 2).

Figure 2.

Forest Plot of Bias (Mean Differences), 95% Confidence Limits (CL), and Confidence Intervals (CI) Between Temporal Artery Temperatures and Core Body Temperatures among 16 Samples

Tympanic Thermometers

Tympanic thermometer temperatures from 39 total samples were compared to core body temperatures. The forest plot illustrates bias from −1.06 to +0.98°C (see Figure 3). Core temperatures were underestimated in 41% (n=16) of samples and overestimated in 51% (n=20) of samples. Three (8%) mean differences were 0°C. Effect sizes were from ±0.30 to ±1.02°C. Lower CL ranged from −2.98 to −0.24°C, and upper CL ranged from −0.38 to +3.8°C (see Figure 3).

Figure 3.

Forest Plot of Bias (Mean Differences), 95% Confidence Limits (CL), and Confidence Intervals (CI) Between Tympanic Temperatures and Core Body Temperatures in 39 Samples

Axillary Chemical Thermometers

Temperatures from axillary chemical thermometers were compared to core temperatures in five total samples (see Figure 4). Core body temperature was overestimated in four of the five comparisons. Bias ranged from −0.01 to +0.50°C, and effect sizes were ±0.35 to ±0.53°C. Lower CL ranged from −0.75 to −0.49°C, and upper CL were from +0.73 to +1.5°C (see Figure 4).

Figure 4.

Forest Plot of Bias (Mean Differences), 95% Confidence Limits (CL), and Confidence Intervals (CI) Between Core Body Temperature and Chemical Axillary Temperature (N=5) and Electronic Axillary Temperature (N=15 Samples)

Axillary Electronic Thermometers

Temperatures from axillary electronic thermometers were compared to core temperatures in 15 total samples (see Figure 4). Core body temperature was underestimated in 8 (53%) samples and overestimated in 7 (47%) samples. Bias ranged from −1.25 to +0.60°C, and effect sizes were ±0.26 to ±1.00°C. Lower CL ranged from −2.94 to −0.27°C, and upper CL were from −0.33 to +2.17°C (see Figure 4).

Oral Electronic Thermometers

Temperatures from oral electronic thermometers were compared to core body temperatures in 7 total samples. Bias ranged from −0.25 to +0.12°C, 3 (43%) of which underestimated and 4 (57%) overestimated core temperature (see Figure 5). Effect sizes varied from ±0.15 to ±0.45°C. Lower CL ranged from −1.07 to −0.30°C, while upper CL ranged from +0.29 to +1.02°C (see Figure 5).

Figure 5.

Forest Plot of Bias (Mean Differences), 95% Confidence Limits (CL), and Confidence Intervals (CI) Between Core Body Temperature and Electronic Oral Temperature (N=7 Samples) and Electronic Rectal Temperature (N=15 Samples)

Rectal Electronic Thermometers

Fourteen comparisons of core body temperature and rectal electronic thermometer temperatures resulted in mean differences that underestimated (n=6, 40%) or overestimated (n=7, 47%) core temperature, with one (14%) mean difference of 0°C (see Figure 5). Bias ranged from −0.69 to +0.54°C, and effect sizes varied from ±0.10 to ±1.00°C. Lower CLs ranged from −2.36 to +0.19°C, and upper CLs ranged from −0.11 to +1.56°C (see Figure 5).

Overall Accuracy of Non-Core Thermometer Devices

A clearer view of the accuracy of these six non-core body temperature devices is provided by a forest plot of the pooled mean differences from studies of the six non-core sites and devices (see Figure 6). Oral and rectal electronic thermometers had the least bias (−0.05°C and −0.04°C, respectively; closest to the reference line of 0 difference) and the narrowest confidence intervals: oral electronic CI=0.58°C, rectal electronic CI=1.18°C. Axillary electronic thermometers underestimated core body temperature by a pooled mean difference of −0.19°C with a large CI=2.36°C. Temporal, axillary chemical, and tympanic devices overestimated core body temperature and had wide confidence intervals (temporal CI=1.88°C, axillary chemical CI=2.25°C, and tympanic CI=2.62°C).
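The comparison of interval widths can be made concrete with a short sketch that sorts the six devices by the CI widths quoted in the paragraph above. The dictionary name is hypothetical, and the sketch assumes each reported CI value is the full width of the interval (upper CL minus lower CL).

```python
# CI widths in degrees C, copied from the pooled results quoted in the text.
reported_ci_width = {
    "oral electronic": 0.58,
    "rectal electronic": 1.18,
    "temporal artery": 1.88,
    "axillary chemical": 2.25,
    "axillary electronic": 2.36,
    "tympanic": 2.62,
}

# Narrower intervals indicate more precise agreement with core temperature.
ranked = sorted(reported_ci_width.items(), key=lambda item: item[1])

for device, width in ranked:
    print(f"{device:<20} CI width = {width:.2f} C")
```

Sorting by width alone ignores the direction and size of the pooled bias, so it complements rather than replaces the forest plot.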

Figure 6.

Forest Plot of Bias (Pooled Mean Differences), Effect Size, 95% Confidence Limits (CL), and Confidence Intervals (CI) Between Core Body Temperatures and Temperatures from 6 Types of Non-Core Thermometer Devices