How Well Does the STOPBANG Score Predict OSA?

Aaron B. Holley, MD


January 05, 2016

I'm going to start this article by talking about basic statistics. I know that I'm gambling a little here, and I hope I don't lose too many readers up front. I figured that I would lay all the cards on the table, however, because I can't make my point without a little math.

Researchers talk a lot about sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for the tests they use. The sensitivity and specificity for a given test are fixed performance values for that test, assuming that the patient populations tested have similar characteristics (demographics, comorbid diseases, etc). Not so for the PPV and NPV, which vary significantly by the prevalence of the disease the test is designed to identify. I'll provide some concrete examples later.
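For readers who want to see the dependence on prevalence directly, here is a minimal sketch in Python. The function name `predictive_values` and the sensitivity and specificity values are hypothetical, chosen only for illustration; they are not taken from any study cited here. The sketch applies the standard definitions of PPV and NPV to a fixed test across three prevalence levels.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Fractions of the whole tested population in each cell of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test characteristics (assumptions, for illustration only).
SENSITIVITY = 0.85
SPECIFICITY = 0.55

for prevalence in (0.10, 0.50, 0.90):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With sensitivity and specificity held constant, the PPV climbs and the NPV falls as prevalence rises, which is the behavior the rest of this article leans on.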

The funny thing about the prevalence of sleep apnea, however, is that it varies markedly by the definition used to identify a hypopnea on polysomnogram.[1,2] If your definition requires a ≥4% drop in oxygen saturation (SpO2) instead of an arousal, you will end up reducing your hypopnea index by a factor of eight, and your apnea-hypopnea index (AHI) will fall by a factor of three.[1] For the Wisconsin Sleep Cohort Study (from which most obtain their OSA prevalence estimates for the general population), hypopneas were only scored when there was ≥4% drop in SpO2.[3] The 2012 American Academy of Sleep Medicine (AASM) Scoring Guidelines recommend that hypopneas be scored when a 30% drop in airflow is accompanied by an arousal (in truth, they recommend using an arousal or a desaturation of ≥3% to score a hypopnea, but this effectively means that all hypopneas with arousals will be scored).[4] Translation? If the 2012 AASM criteria were applied to the original Wisconsin sleep cohort or to the updated analysis performed in 2013,[5] conservative estimates would put the prevalence of OSA (AHI ≥5) in the general population at more than 40%. Never mind the fact that the Wisconsin sleep cohort used a thermistor to detect changes in airflow and not a pressure transducer. A thermistor is less sensitive for detecting hypopneas, so the cohort's counts are already low, and the prevalence of OSA under 2012 criteria would be higher still.[6] That's right, we are closing in on saying that 50% of the male population between the ages of 30 and 60 years has an AHI ≥5.

Now let's do a little math using a published tool. The STOPBANG (S = snoring, T = tired, O = observed apneas, P = elevated blood pressure, B = body mass index >35 kg/m2, A = age >50 years, N = neck circumference >40 cm, G = male gender) questionnaire was recently recommended as the tool of choice for predicting OSA in surgical patients.[7] The investigators provide a comprehensive review of the STOPBANG score in different populations. As they do, they vary the AHI and score thresholds (both measures are linear) for both the polysomnography and the STOPBANG. Although they appropriately focus on the importance of population characteristics (surgical vs sleep clinic vs primary care patients), there is no mention of hypopnea scoring criteria.

The original STOPBANG score was derived from a surgical population using the ≥4% desaturation criterion for scoring hypopneas.[8,9] It's fair to estimate that the AHI will increase by a factor of three if hypopneas are scored by arousal criteria instead of ≥4% desaturation. Remember, the 2012 AASM guidelines recommend using arousal criteria for scoring hypopneas. How would the STOPBANG perform if 2012 AASM criteria were applied to the population from which it was derived? In one of the studies cited above, switching from ≥4% desaturation to arousal criteria increased the prevalence of AHI ≥5 from 59% to 92%.[1] In the other, roughly half of the patients diagnosed with an AHI ≥5 using arousal criteria were missed with the ≥4% desaturation definition for hypopnea.[2] Can we split the difference, then, and say that the prevalence of an AHI ≥5 would increase by 40% if the STOPBANG investigators had used 2012 AASM scoring criteria in their population?

The prevalence of obstructive sleep apnea in the original STOPBANG study was 69.7%.[8] Had they used the 2012 AASM criteria, the prevalence would have risen to >90%, probably close to 100%. Let's use 95% to do the math. Admittedly this is a silly exercise because it is obvious that as prevalence approaches 100%, the NPV will approach zero. Using 95% as the prevalence estimate, the STOPBANG has an NPV of 15.4%. You could increase the sensitivity (and improve the NPV) by requiring only one or two of the STOPBANG criteria to consider the test positive, but that is unlikely to be practically useful. How many patients do you see who don't meet at least two STOPBANG criteria? Will you order polysomnography on every 50-year-old man who is tired or who snores? It sort of defeats the purpose of having the tool if every patient is positive.
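The arithmetic behind this paragraph can be sketched in a few lines of Python. Two things here are assumptions rather than figures stated in this article: the 40% increase is read as a relative increase applied to the 69.7% prevalence, and the sensitivity and specificity are illustrative inputs in the range that has been reported for the STOPBANG at an AHI ≥5 threshold. Because the exact inputs behind the 15.4% figure are not given above, the result may land near that number without matching it.

```python
def npv(sensitivity, specificity, prevalence):
    """Negative predictive value at a given disease prevalence."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


REPORTED_PREVALENCE = 0.697   # prevalence in the original STOPBANG study
RELATIVE_INCREASE = 0.40      # assumption: "increase by 40%" read as relative
ASSUMED_SENSITIVITY = 0.836   # assumption: illustrative, not from this article
ASSUMED_SPECIFICITY = 0.564   # assumption: illustrative, not from this article

# Rescale the prevalence and cap it at 100%.
rescaled = min(REPORTED_PREVALENCE * (1 + RELATIVE_INCREASE), 1.0)
print(f"rescaled prevalence: {rescaled:.1%}")

# Watch the NPV shrink as prevalence climbs toward 100%.
for prevalence in (REPORTED_PREVALENCE, 0.90, 0.95, 0.99):
    value = npv(ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.1%}: NPV {value:.1%}")
```

Under these assumptions the rescaled prevalence falls between 90% and 100%, and the NPV drops steadily toward zero as prevalence rises.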

Dispensing with the math for a second, one has to consider whether any score designed to predict AHI by one definition will perform well when that definition is changed. Yes, the raw value of the AHI will increase predictably when hypopnea criteria are different. It's not just a numbers issue, though. We have to ask ourselves whether scores designed to predict desaturations will work to predict arousals. Patients who desaturate tend to be older and heavier, and they often have comorbid pulmonary disease.[10] For patients who arouse easily, those with a low arousal threshold, this is not the case. In fact, patients with a low arousal threshold tend to be younger and thinner.[11] It doesn't necessarily follow that body mass index, snoring, age, or witnessed apneas would identify a patient with OSA and a low arousal threshold who would be prone to having lots of hypopneas with arousals but very few desaturations.

Where am I going with all of this? For the reasons cited above, we don't know whether such pre-test probability (PTP) tools as the STOPBANG perform well when the 2012 AASM recommended hypopnea criteria are used to score polysomnograms. The AASM absolutely did the right thing by committing to one definition for hypopneas so that outcomes can be compared across studies (although they added an alternative definition that includes the 4% desaturation criterion in a subsequent edition[12]). Still, researchers and physicians need to understand the profound effects of this change.

What are the consequences? Who knows? There is considerable debate about whether an AHI between 5 and 30 events per hour needs to be treated, never mind the fact that we are just now starting to agree on how to define an AHI. In the short term we are left with considerable uncertainty and a recommended definition that dramatically increases prevalence. We have PTP scores that haven't been validated using this definition. For physicians who want to find patients with an AHI ≥5 using arousal criteria to score hypopneas, it will be difficult to find a tool that has an NPV high enough to rule it out.

