How to Critically Appraise an Article

Jane M Young; Michael J Solomon


Nat Clin Pract Gastroenterol Hepatol. 2009;6(2):82-91. 

Selection and Critical Appraisal of Research Literature

Ten key questions (Box 1) can be used to assess the validity and relevance of a research article. These questions can assist clinicians to identify the most relevant, high-quality studies that are available to guide their clinical practice.

Is the Study's Research Question Relevant?

Even if a study is of the highest methodological rigor, it is of little value unless it addresses an important topic and adds to what is already known about that subject.[17] The assessment of whether the research question is relevant is inevitably based on subjective opinion, as what might be crucial to some will be irrelevant to others. Nonetheless, the first question to ask of any research article is whether its topic is relevant to one's own field of work.

Does the Study Add Anything New?

Scientific-research endeavor is often likened to 'standing on the shoulders of giants', because new ideas and knowledge are developed on the basis of previous work.[18] Seminal research papers that make a substantive new contribution to knowledge are a relative rarity, but research that makes an incremental advance can also be of value. For example, a study might increase confidence in the validity of previous research by replicating its findings, or might enhance the ability to generalize a study by extending the original research findings to a new population of patients or clinical context.[17]

What Type of Research Question Does the Study Pose?

The most fundamental task of critical appraisal is to identify the specific research question that an article addresses, as this process will determine the optimal study design and have a major bearing on the importance and relevance of the findings. A well-developed research question usually identifies three components: the group or population of patients, the studied parameter (e.g. a therapy or clinical intervention) and the outcomes of interest.[10] In general, clinical research questions fall into the two distinct categories described below.

Questions About the Effectiveness of Treatment. These types of questions relate to whether one treatment is better than another in terms of clinical effectiveness (benefit and harm) or cost-effectiveness.

Questions About the Frequency of Events. Such questions refer to the incidence or prevalence of disease or other clinical phenomena, risk factors, diagnosis, prognosis or prediction of specific clinical outcomes and investigations on the quality of health care.

Was the Study Design Appropriate for the Research Question?

Studies that answer questions about effectiveness have a well-established hierarchy of study designs based on the degree to which the design protects against bias. Meta-analyses of well-conducted RCTs and individual RCTs provide the most robust evidence followed by nonrandomized controlled trials, cohort studies, case-control studies, and other observational study designs.[19,20] However, in some circumstances, RCTs are either not feasible or considered ethically inappropriate. These issues are more common in nonpharmaceutical trials, such as those of surgical procedures. One review of gastrointestinal surgical research found that only 40% of research questions could have been answered by an RCT, even when funding was not an impediment. Patients' preferences, the rarity of some conditions, and the absence of equipoise among surgeons proved to be the major obstacles to performing RCTs of gastrointestinal surgery in this setting.[21] When an RCT is not feasible, the specific reasons that preclude its use will determine the type of alternate study design that can be used.[21] Observational studies, rather than RCTs, are the most appropriate study design for research questions on the frequency of events.

Did the Study Methods Address the Key Potential Sources of Bias?

In epidemiological terms, the presence of bias does not imply a preconception on the part of the researcher, but rather means that the results of a study have deviated from the truth.[3] Bias can be attributed to chance (e.g. a random error) or to the study methods (systematic bias). Random error does not influence the results in any particular direction, but it will affect the precision of the study;[22] by contrast, systematic bias has a direction and results in the overestimation or underestimation of the 'truth'. Systematic biases arise from the way in which the study is conducted, be it how study participants were selected, how data were collected, or how the researchers analyzed or interpreted the results.[23]

Different study designs are prone to varying sources of systematic bias. Once the study design of a given article has been identified, we recommend that clinicians use one of the available design-specific critical-appraisal checklists to decide whether the study in question is of high quality. The Critical Appraisal Skills Programme (CASP) includes such tools and the program coordinators have developed separate checklists for the appraisal of systematic reviews, RCTs, cohort studies, case-control studies, diagnostic test studies, economic evaluations and qualitative research that each comprise 10 questions.[9] They have been developed from the Users' guides to the medical literature series of articles that were originally published in the Journal of the American Medical Association. These articles are now available in book form[5] and are readily accessible on the internet.[9]

Systematic Reviews and Meta-analyses

A meticulous, standardized protocol is used in a systematic review to identify, critically appraise and synthesize all the relevant studies on a particular topic. Some systematic reviews may then proceed to a meta-analysis, in which the results from individual studies are combined statistically to produce a single pooled result.[3] Although planning to undertake a systematic review or a meta-analysis prospectively is possible,[24] the majority of these types of article are retrospective and a risk of bias exists, which arises from the selection of studies and the quality of these primary sources.[25] Publication bias, which results from the selective publication of studies with positive findings, is of particular concern, as it distorts overall perceptions of the findings on a particular topic.[26,27]
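
As an illustration of how results from individual studies can be combined statistically, the following Python sketch applies inverse-variance fixed-effect pooling. This is one common approach and is not necessarily the method used in any particular meta-analysis. The function name and the example values (log odds ratios with their standard errors) are hypothetical and are not taken from any study cited here.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Pool study-level effect estimates with inverse-variance weights.

    Assumes a fixed-effect model: every study estimates the same
    underlying effect, so more precise studies simply get more weight.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Approximate 95% confidence interval for the pooled estimate.
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical log odds ratios and standard errors from three studies.
log_odds_ratios = [-0.40, -0.10, -0.25]
std_errors = [0.20, 0.15, 0.30]

pooled, pooled_se, ci = pool_fixed_effect(log_odds_ratios, std_errors)
print(f"Pooled log odds ratio: {pooled:.3f} (SE {pooled_se:.3f})")
print(f"Approximate 95% CI: {ci[0]:.3f} to {ci[1]:.3f}")
```

Because the weights depend only on the precision of each study, a pooled result of this kind inherits any bias present in the primary studies or introduced by the way they were selected.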

The QUORUM (Quality of Reporting of Meta-Analyses) statement provides a comprehensive framework for assessments of the quality of reporting in meta-analyses and systematic reviews.[25,28] In addition, the AMSTAR[29] assessment tool, which comprises 11 questions, has been developed for the appraisal of systematic reviews, and this tool or the CASP checklist[9] could be more useful than the QUORUM statement for clinicians who wish to undertake a rapid appraisal of these types of articles. Key methodological points to consider in the appraisal of systematic reviews and meta-analyses are listed in Box 2.

Systematic reviews and meta-analyses are not restricted to RCTs alone. The MOOSE (Meta-Analysis Of Observational Studies in Epidemiology) guidelines have been developed as a corollary of the QUORUM statement for meta-analyses of non-RCTs.[30]

Randomized Controlled Trials

In an RCT, the random allocation of participants should ensure that treatment groups are equivalent in terms of both known and unknown confounding factors; any differences in outcomes between groups can, therefore, be ascribed to the effect of treatment.[31] Study design alone, however, will not guard against bias if crucial aspects of the study protocol are suboptimal. The potential for selective enrollment of patients into the study can be an important source of bias if the group to which individuals will be allocated is known or can be guessed.[32] Centralized methods of randomization, for example a computer-generated allocation, are preferable to less concealed methods, such as use of color-coded forms or pseudo-random sequences based on medical record numbers or days of the week.[31] Failure to conceal the allocation sequence has been shown to result in a greater distortion of the results than lack of double-blinding -- another major source of bias in RCTs.[33]

The CONSORT (Consolidated Standards of Reporting Trials) statement flow chart (Figure 1) is functionally equivalent to the QUORUM statement for systematic reviews, and provides a comprehensive tool with which to assess the standard of reporting in randomized trials.[34] Key points to consider in the appraisal of an RCT are listed in Box 3.

Figure 1.

Consolidated standards of reporting trials (CONSORT) statement flowchart for the standard reporting and appraisal of randomized controlled trials. With permission from CONSORT

Cohort Studies

Cohort, or longitudinal, studies involve following up two or more groups of patients to observe who develops the outcome of interest. Prospective cohort studies have been likened to natural experiments, as outcomes are measured in large groups of individuals over extended periods of time in the real world.[35] Cohort studies can also be performed retrospectively; such studies usually involve identifying a group of patients and following up their progress by examining records that have been collected routinely or for another purpose, such as medical data, death registry records and hospital admission databases.

The major methodological concern with cohort studies is their high potential for selection bias and confounding factors. These problems are particularly relevant when cohort studies (or non-RCTs) are used to evaluate therapeutic interventions. In this situation, the treatment that someone receives is determined by the patient's or clinician's preferences, referral patterns, current treatment paradigms or local policy.[36] Important differences are likely to exist between patients who receive disparate treatments and these differences, rather than the treatment itself, might be responsible for the observed outcomes. Although some potential confounding factors can be measured and accounted for in the analysis,[37] such adjustments are more difficult in retrospective than prospective studies, as data on important potential confounders might not have been collected, or might be of poor quality.

The STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) statement is the corollary of the QUORUM and CONSORT statements for observational studies, including cohort, case-control and cross-sectional studies.[38] Key methodological features to consider in the appraisal of cohort studies are listed in Box 4.

Case-control Studies

Case-control studies are always retrospective by their very nature -- the case patients are selected because they have already developed the outcome of interest (e.g. a disease). Data are then collected about factors that might have influenced this outcome, and these exposures are compared with those of a group of people who differ from the case patients only in that they have not developed the outcome of interest. Case-control studies are ideal for the investigation of risk factors when the outcome of interest is rare, as it would take too long to recruit a prospective cohort.

Major methodological difficulties with case-control studies are the selection of appropriate control individuals and the possibility of 'recall bias' (a patient's subjective interpretation of what caused their condition can alter their recall of certain events or experiences). Controls should be drawn from exactly the same population as the cases, and the only difference between controls and cases should be that the controls have not developed the condition of interest. Although objective measures of possible causative factors are preferable, case-control studies often rely on participants' recall, and patients might be more likely to remember certain events or experiences than controls.[39] Key aspects to consider when assessing a case-control study are listed in Box 5.

Cross-sectional Analyses

Cross-sectional studies provide a 'snapshot' in which all parameters (exposures and outcomes) are assessed at the same time; examples of cross-sectional designs include one-off surveys and audits of practice. Key methodological points to consider in the appraisal of a cross-sectional study are listed in Box 6.

Case Series

Case series provide low-level evidence about therapeutic effectiveness; however, these articles are very common in medical literature. Key methodological issues to consider when assessing such articles are listed in Box 7.

Studies that Assess the Accuracy of Diagnostic Tests

These studies are usually cross-sectional in design, but possess a number of specific methodological issues that should be considered in addition to those noted above.[40] To investigate the accuracy of a diagnostic test, it is performed on a sample of patients and the results are compared with those of a reference or gold-standard diagnostic test.[41] The level of agreement between the investigated test and the gold-standard diagnostic test can then be reported either in terms of the sensitivity and specificity, or likelihood ratio.[4,41]
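
The measures of agreement mentioned above can be computed from a 2x2 table that cross-classifies the investigated test against the gold standard. The Python sketch below uses the standard definitions of sensitivity, specificity and likelihood ratios; the function name and the counts are hypothetical and serve only as an illustration.

```python
import math


def diagnostic_accuracy(true_pos, false_pos, false_neg, true_neg):
    """Summarize a 2x2 table of index test versus gold standard.

    true_pos / false_neg: gold-standard positives with a positive /
    negative index test; false_pos / true_neg: gold-standard negatives
    with a positive / negative index test.
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    # Likelihood ratios are undefined (reported here as infinity) when
    # the denominator is zero.
    lr_positive = (sensitivity / (1 - specificity)
                   if specificity < 1 else math.inf)
    lr_negative = ((1 - sensitivity) / specificity
                   if specificity > 0 else math.inf)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }


# Hypothetical counts: 100 patients with and 100 without the disease.
result = diagnostic_accuracy(true_pos=90, false_pos=20,
                             false_neg=10, true_neg=80)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```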

The STARD (Standards for the Reporting of Diagnostic Accuracy Studies) website provides a detailed flowchart (Figure 2) and 25-item checklist for standardized reporting and appraisal of studies that assess the accuracy of diagnostic tests.[42,43] The CASP also provides a similar, but simpler, tool for this type of study.[9] Important features to consider when appraising a study of diagnostic accuracy are listed in Box 8.

Figure 2.

Standards for the reporting of diagnostic accuracy studies (STARD) statement flowchart for the standard reporting and appraisal of studies examining the accuracy of diagnostic tests. With permission from STARD

Economic Evaluations

Economic-evaluation studies focus on cost-efficiency, or which treatment can provide the greatest benefit for the least cost.[44] Several types of economic-evaluation studies exist, including cost-benefit, cost-effectiveness and cost-utility analyses, all of which differ in how they measure health benefits.[45] An important feature of critical appraisal of any cost analysis is an assessment of how well the various costs and consequences of individual treatments have been identified and measured. The CASP has developed a checklist to aid with the appraisal of economic evaluation studies.[9]

Was the Study Performed in Line with the Original Protocol?

Deviations from the planned protocol can affect the validity or relevance of a study. One of the most common problems encountered in clinical research is the failure to recruit the planned number of participants. One estimate suggests that more than a third of RCTs recruit less than 75% of their planned sample.[46] This deviation from the study plan potentially reduces the extent to which the results of the study can be generalized to real-world situations, because those who were actually recruited might differ in some way from those who were not. It also reduces the power of the study to demonstrate significant findings. Other deviations from the original protocol might include changes to the inclusion and exclusion criteria, variation in the provided treatments or interventions, changes to the employed techniques or technologies, and changes to the duration of follow-up.
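
The loss of power caused by under-recruitment can be illustrated with a rough calculation. The Python sketch below uses a normal approximation for a two-sided comparison of two proportions; the function name, the event rates and the sample sizes are hypothetical, and a formal power calculation for a real trial would usually be carried out with dedicated software.

```python
import math
from statistics import NormalDist


def approximate_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power to detect a difference between two proportions.

    Uses the normal approximation with an unpooled standard error and
    ignores the negligible chance of a significant result in the wrong
    direction.
    """
    se = math.sqrt(p1 * (1 - p1) / n_per_group
                   + p2 * (1 - p2) / n_per_group)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p2) / se - z_critical)


# Hypothetical trial: event rates of 50% versus 35%, planned for 200
# participants per group but recruiting only 75% of that number.
planned = approximate_power(0.50, 0.35, n_per_group=200)
achieved = approximate_power(0.50, 0.35, n_per_group=150)
print(f"Power with planned sample:  {planned:.2f}")
print(f"Power with 75% recruitment: {achieved:.2f}")
```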

Does the Study Test a Stated Hypothesis?

A hypothesis is a clear statement of what the investigators expect the study to find and is central to any research as it states the research question in a form that can be tested and refuted.[3] A null hypothesis states that the findings of a study are no different to those that would have been expected to occur by chance. Statistical hypothesis testing involves calculating the probability of achieving the observed results if the null hypothesis were true. If this probability is low (conventionally less than 1:20 or P < 0.05), the null hypothesis is rejected and the findings are said to be 'statistically significant' at that accepted level.

Crucially, study hypotheses must be identified a priori (that is, before the study is conducted) and should be developed from theory or previous experience. If the study investigates the statistical significance of associations that were not prespecified in the original hypothesis (post-hoc analysis), such analyses are prone to false-positive findings because, at a significance level of 5% (P = 0.05), on average 1 in 20 associations tested will be significant (positive) by chance alone. When a large number of such tests are conducted, some false-positive results are highly likely to occur. Another important consideration is to check that all data relevant to the stated study objectives have been reported, and that selected outcomes have not been omitted.
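
The accumulation of false-positive findings can be made concrete with a short calculation. The Python sketch below assumes that the tests are independent and that no true associations exist, which is a simplification; the function name is illustrative only.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one 'significant' result by chance alone.

    Assumes n_tests independent tests, each at significance level alpha,
    with the null hypothesis true for every one of them.
    """
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 50):
    probability = chance_of_false_positive(n_tests)
    print(f"{n_tests:>2} tests: {probability:.2f}")
```

Under these assumptions, the probability of at least one spurious finding rises quickly with the number of associations tested, which is why prespecified hypotheses carry more weight than post-hoc analyses.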

Where treatments for a medical condition already exist, trials can be designed to test whether a new therapy has similar efficacy to an existing one. This type of trial is called an equivalence or noninferiority trial, as its purpose is to establish that the new treatment is no worse than the existing one.[47] Equivalence studies require that the degree of outcome difference at which the two treatments will not be considered equivalent be determined in advance.[48] For example, researchers might decide that if the primary outcome for a new treatment is no greater than 5% worse than that of the existing treatment, the two treatments will be considered to be equivalent. Equivalence studies determine whether a new treatment is at least as good as an existing treatment so that decisions about which treatment to administer to a given patient can be made on the basis of criteria, such as cost or ease of administration.[47,48]

The CONSORT statement for randomized trials has been extended to incorporate guidelines for reporting equivalence studies.[49] A key question when appraising this type of study is whether the trial results were analyzed appropriately for an equivalence study. If a study is designed to show that a new treatment is at least as good as an existing treatment, statistical methods for conventional testing of a hypothesis that one treatment is superior to another should not be used. Appropriate analysis of the results in an equivalence study often involves calculating confidence intervals for the treatment effect, and determining whether these limits are within the predetermined margin of noninferiority.[48] Another key question is whether the sample size was calculated correctly for an equivalence study, as these types of study usually require a larger sample size than a corresponding superiority trial.[49]
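
The confidence-interval approach can be sketched as follows. This Python example assumes a binary outcome for which a higher proportion is better, uses a simple Wald interval for the difference in proportions, and takes a noninferiority margin of 5 percentage points; the function name and all of the numbers are hypothetical, and real trials might use other interval methods.

```python
import math


def noninferiority_check(success_new, n_new, success_old, n_old,
                         margin, z=1.96):
    """Compare the lower confidence limit with the noninferiority margin.

    Returns the difference in success proportions (new minus old), its
    approximate 95% Wald confidence interval, and whether the lower
    limit lies above -margin.
    """
    p_new = success_new / n_new
    p_old = success_old / n_old
    difference = p_new - p_old
    se = math.sqrt(p_new * (1 - p_new) / n_new
                   + p_old * (1 - p_old) / n_old)
    lower, upper = difference - z * se, difference + z * se
    return difference, (lower, upper), lower > -margin


# Hypothetical trials with identical observed success rates (85%).
for n in (200, 1000):
    successes = int(0.85 * n)
    diff, (lower, upper), noninferior = noninferiority_check(
        successes, n, successes, n, margin=0.05)
    print(f"n={n} per group: CI {lower:.3f} to {upper:.3f}, "
          f"noninferior: {noninferior}")
```

In this illustration the observed success rates are the same in both trials, yet only the larger trial has a confidence interval narrow enough to exclude the margin, which reflects the sample-size consideration noted above.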

Were the Statistical Analyses Performed Correctly?

Assessing the appropriateness of statistical analyses can be difficult for nonstatisticians. However, all quantitative research articles should include a segment within their 'Method' section that explains the tools used in the statistical analysis and the rationale for this approach, which should be written in terms that are appropriate for the journal's readership. In particular, the approach to dealing with missing data and the statistical techniques that have been applied should be specified; patients who are lost to follow-up and missing data should be clearly identified in the 'Results' section. Original data should be presented in such a way that readers can check the statistical accuracy of the paper.

An important consideration in the statistical analysis of RCTs is whether intention-to-treat (ITT) or per-protocol analyses were conducted. According to the ITT principle, participants' data are analyzed with reference to the group to which they were randomly allocated, regardless of whether they actually received the allocated treatment. ITT analyses are preferred, because they maintain the randomization and ensure that the two treatment groups are comparable at baseline.[50] However, if many participants are nonadherent or a large proportion cross over to other treatments, an ITT analysis will be somewhat conservative and the results might be difficult to interpret. In this situation, a per-protocol analysis that includes only those patients who complied with the trial protocol can be used to supplement the ITT analysis. As per-protocol analyses are at increased risk of selection bias, they should not usually be used as the primary method of analysis unless a compelling reason exists to justify this approach.[50] The CONSORT flowchart (Figure 1) enables the flow of participants and the groups used in the analysis of the trial to be clearly identified.[34]
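
The difference between the two analyses can be seen in a toy data set. In the Python sketch below, the record structure, the arm names and the eight participants are all hypothetical; the only point is that the ITT analysis groups participants by allocation, whereas the per-protocol analysis first drops anyone who did not receive the allocated treatment.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    allocated: str   # arm assigned at randomization
    received: str    # treatment actually received
    recovered: bool  # outcome of interest


def recovery_rates(participants):
    """Recovery rate in each arm, grouped by randomized allocation."""
    rates = {}
    for arm in sorted({p.allocated for p in participants}):
        group = [p for p in participants if p.allocated == arm]
        rates[arm] = sum(p.recovered for p in group) / len(group)
    return rates


def intention_to_treat(participants):
    """Analyze everyone in the arm to which they were allocated."""
    return recovery_rates(participants)


def per_protocol(participants):
    """Analyze only those who received their allocated treatment."""
    adherent = [p for p in participants if p.allocated == p.received]
    return recovery_rates(adherent)


# Hypothetical trial: one participant allocated to the new treatment
# crossed over to standard care and did not recover.
example_trial = [
    Participant("new", "new", True),
    Participant("new", "new", True),
    Participant("new", "new", False),
    Participant("new", "standard", False),
    Participant("standard", "standard", True),
    Participant("standard", "standard", True),
    Participant("standard", "standard", False),
    Participant("standard", "standard", False),
]

print("ITT:", intention_to_treat(example_trial))
print("Per-protocol:", per_protocol(example_trial))
```

Excluding the participant who crossed over makes the new treatment look better in the per-protocol analysis than in the ITT analysis, which illustrates why per-protocol results are usually treated as supplementary.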

Do the Data Justify the Conclusions?

The next consideration is whether the conclusions that the authors present are reasonable on the basis of the accumulated data. Sometimes an overemphasis is placed on statistically significant findings that invoke differences that are too small to be of clinical value; alternatively, some researchers might dismiss large and potentially important differences between groups that are not statistically significant, often because sample sizes were small. Other issues to be wary of are whether the authors generalized their findings to broader groups of patients or contexts than was reasonable given their study sample, and whether statistically significant associations have been misinterpreted to imply a cause and effect.

Are There any Conflicts of Interest?

Conflicts of interest occur when personal factors have the potential to influence professional roles or responsibilities.[51] Members of a research team must make judgments that have the potential to affect the safety of the participants and the validity of the research findings. Researchers are in a position to decide which studies will be conducted in their unit, which patients will be invited to participate in a study and whether certain clinical occurrences should be reported as adverse events.[52] These decisions require researchers to act with integrity and not for personal or institutional gain.

Potential financial conflicts of interest include the receipt of salary and consultation fees from the company that has sponsored the research and ownership of stocks and shares or other pecuniary interests, such as patents related to the research.[52] Units that recruit research participants might be paid a per-capita fee for every patient enrolled, which can be greater than the expenses involved.[53] Many potential financial sources of conflicts of interest, such as industry funding for educational events, travel or gifts, are increasingly recognized both within the context of daily clinical practice and research.[54] However, other potential conflicts are inherent to the research setting. An example is that medical researchers' status and future research income is dependent on the success of their research.[55]

Identification of a potential conflict of interest is not synonymous with having an actual conflict of interest or poor research practice. Potential conflicts of interest are extremely common, and the most important questions are whether they have been recognized and how they have been dealt with.[56] A main mechanism for dealing with potential conflicts of interest is open disclosure.[56] In the process of critically appraising a research article, one important step is to check for a declaration about the source of funding for the study and, if a potential conflict of interest has been identified, for a statement about how this conflict was managed. For example, the researchers might state specifically that the sponsoring agency had no input into the research protocol, data analysis or interpretation of the findings. Many journals now routinely require authors to declare any potential financial or other conflicts of interest when an article is submitted. The reader must then decide whether the declared factors are important and might have influenced the validity of the study's findings.

