Systematic Review and Meta-analysis of Cognitive Interventions for Children With Central Nervous System Disorders and Neurodevelopmental Disorders

Kristen E. Robinson, PhD; Eloise Kaizar, PhD; Cathy Catroppa, PhD; Celia Godfrey, DPsych; Keith Owen Yeates, PhD


J Pediatr Psychol. 2014;39(8):846-865. 

Selection of Studies

A comprehensive search of existing peer-reviewed studies was conducted using multiple strategies. First, keyword searches were conducted in PubMed, PsycInfo, Embase, and the Cochrane Central Register of Controlled Trials (CENTRAL). In each search, keywords were grouped as either diagnosis terms (e.g., brain injuries, learning disability) or intervention terms (e.g., treatment outcome, rehabilitation), and all possible combinations of keywords were used. An example search strategy including all search terms can be found in Supplementary Table I. Database filters limited the search to articles published between 1980 and May 2013. Second, reference lists from articles found in the initial search were reviewed and additional potential studies were identified. Third, relevant review articles and meta-analyses were similarly reviewed and their reference lists were also examined. After eliminating duplicate articles, a total of 623 studies published between 1980 and 2013 were identified and reviewed for potential inclusion in the meta-analysis.
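As an illustration of how grouped keywords can be combined, the following minimal Python sketch pairs each diagnosis term with each intervention term. It uses only the example terms named above; the quoting and the `AND` operator are assumptions for illustration, not the authors' actual query syntax, which is given in Supplementary Table I.

```python
from itertools import product

# Only the example terms named in the text; the full lists are in Supplementary Table I.
diagnosis_terms = ["brain injuries", "learning disability"]
intervention_terms = ["treatment outcome", "rehabilitation"]


def build_queries(diagnoses, interventions):
    """Pair every diagnosis term with every intervention term (hypothetical query syntax)."""
    return [f'"{d}" AND "{i}"' for d, i in product(diagnoses, interventions)]


queries = build_queries(diagnosis_terms, intervention_terms)
for query in queries:
    print(query)
```

With two terms in each group, the sketch yields four queries; the number grows as the product of the two list lengths.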

Requirements for Inclusion

To be eligible for inclusion in the systematic review, each study was required to meet all of the following inclusion criteria. First, studies must have included a sample of children or adolescents who were diagnosed with either a neurodevelopmental (e.g., ADHD, specific learning disability) or central nervous system (e.g., epilepsy, traumatic brain injury, brain tumor) disorder. We did not require that a study specifically state that participants met diagnostic criteria (e.g., Diagnostic and Statistical Manual of Mental Disorders, 4th Edition criteria), although most included studies did so. Second, studies must have included only participants aged <19 years; if both older and younger participants were included, data on child and adolescent participants must have been presented separately. Third, study designs were limited to randomized controlled trials (RCTs) and other controlled clinical trials that included at least two points of measurement, one at baseline and another at a point in time afterward to measure the efficacy of the intervention. Fourth, studies must have involved an intervention with a primary aim of promoting attention, memory, or executive function, including metacognition. Fifth, the treatment arm and all control/comparison arms must have included at least 10 participants at the end of treatment or follow-up. Finally, studies must have been published in English. We excluded studies that focused on speech/language/communication skills or specific academic skills (e.g., reading) because they are not typically considered cognitive interventions. We also excluded studies that targeted only parents.
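The six criteria above amount to a checklist that every study must pass in full. A minimal Python sketch of such a checklist follows; the `Study` fields and the failure labels are hypothetical names chosen for illustration and do not reproduce the authors' data collection form.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # All field names are illustrative assumptions, not the authors' coding scheme.
    has_eligible_diagnosis: bool          # neurodevelopmental or CNS disorder
    max_age_years: float                  # oldest participant in the sample
    reports_child_data_separately: bool   # relevant only for mixed-age samples
    is_controlled_trial: bool             # RCT or other controlled clinical trial
    n_measurement_points: int             # baseline plus at least one later point
    targets_cognition: bool               # attention, memory, or executive function
    smallest_arm_n: int                   # smallest arm at end of treatment/follow-up
    in_english: bool


def failed_criteria(study):
    """Return the criteria a study fails; an empty list means it is eligible."""
    failures = []
    if not study.has_eligible_diagnosis:
        failures.append("diagnosis")
    if study.max_age_years >= 19 and not study.reports_child_data_separately:
        failures.append("age")
    if not study.is_controlled_trial or study.n_measurement_points < 2:
        failures.append("design")
    if not study.targets_cognition:
        failures.append("intervention target")
    if study.smallest_arm_n < 10:
        failures.append("arm size")
    if not study.in_english:
        failures.append("language")
    return failures


# Made-up example: a controlled trial whose smallest arm has only 8 participants.
example = Study(True, 17, False, True, 2, True, 8, True)
print(failed_criteria(example))
```

Returning the list of failed criteria, rather than a single yes/no flag, mirrors the way exclusion reasons are tallied in a PRISMA flowchart.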

Primary outcomes of interest were tests of specific cognitive skills (both standardized and experimental). Secondary outcomes included rating scales meant to measure behaviors that reflect the everyday manifestation of specific cognitive skills (e.g., Behavior Rating Inventory of Executive Function for executive functions) and other functional domains thought to be promoted by the intervention (e.g., academic skills, behavioral adjustment).

Determination of Eligibility

The process of study selection and determination of eligibility is summarized in graphical form consistent with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines in Figure 1. Abstracts from each of 623 original studies were reviewed for potential inclusion in the meta-analysis. Each abstract was randomly assigned for review using a random number generator. Abstracts were each reviewed by two independent reviewers and were categorized as (a) likely eligible, (b) maybe eligible, or (c) not eligible. Inter-reviewer agreement was perfect for 549 of the 623 articles (88.1%). All articles categorized by at least one of the reviewers as likely or maybe eligible underwent a full-text review by one reviewer, using a standard data collection form. Full-text reviews were completed for 117 articles, which were again randomly assigned for review using a random number generator. Of the 117 articles that underwent full-text review, 104 did not meet all of the inclusion criteria for the systematic review. Studies were excluded for a variety of reasons, the most common being that they did not involve a randomized or controlled clinical trial, lacked a control or comparison arm, had <10 participants at the follow-up assessment, or did not report on an intervention targeting attention, memory, or executive function (see Figure 1). Several studies were excluded because their participants had a disorder that could affect brain function, but only as a secondary effect rather than a primary consequence of the disease or injury (e.g., low birth weight, HIV); we excluded those studies because brain impairment is not characteristic of all children with those disorders.

Figure 1. PRISMA flowchart.

The full-text coding process yielded 13 original studies that met inclusion criteria for the meta-analysis (see studies preceded by asterisk in References). The data presented in some papers were not adequate to support meta-analysis; we contacted the authors of those papers and asked that they provide whatever data were necessary for inclusion. Sufficient data for inclusion were available for 12 studies; one study (Chan & Fong, 2011) was excluded from the meta-analysis but was included in qualitative reviews. Table I provides descriptive information about each of the 13 included studies.
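The screening counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below is in Python; the variable names are ours, and the numbers are those given in the text.

```python
def percent_agreement(n_agree, n_total):
    """Simple percent agreement between two reviewers."""
    return 100.0 * n_agree / n_total


abstracts_screened = 623
abstracts_with_agreement = 549
full_text_reviewed = 117
full_text_excluded = 104
excluded_from_meta_analysis = 1  # Chan & Fong (2011), kept for qualitative review

agreement = percent_agreement(abstracts_with_agreement, abstracts_screened)
in_systematic_review = full_text_reviewed - full_text_excluded
in_meta_analysis = in_systematic_review - excluded_from_meta_analysis

print(f"{agreement:.1f}% agreement")
print(f"{in_systematic_review} studies in the review, {in_meta_analysis} in the meta-analysis")
```

Percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa would require the full cross-classification of reviewer ratings, which is not reported here.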

Assessment of Risk of Bias

The methodological quality of the 13 studies included in the systematic review was rated using the Cochrane Risk of Bias Tool (Higgins & Green, 2011). This procedure involves rating each study for several potential sources of bias: selection bias (i.e., random sequence generation for assignment to treatment conditions and allocation concealment); performance bias (i.e., blinding of participants and study personnel); detection bias (i.e., blinding of outcome assessment); attrition bias (i.e., incomplete outcome data); and reporting bias (i.e., selective reporting of results). Each form of bias is rated low, high, or unclear based on the information provided in the published article. Other forms of bias may also be rated if noted by the reviewer.

Data Analysis

To facilitate data analysis, we first identified seven major categories of outcomes for which at least two studies presented data, as listed in Table II. The table also lists the number of studies that included outcome measures within each outcome category, and the total number of outcome measurements used across those studies. Most studies included working memory tasks (8 of 13 studies) and behavioral rating scales meant to assess attention (8 of 13 studies), but fewer than half of the studies addressed any of the other categories. Unfortunately, the small number of studies and measurements within each category was exacerbated by the diverse choices of outcome metrics. Within each category, most studies chose unique measures. The most consistent choice of metric was for behavioral rating scales measuring attention, where five studies used the Conners ADHD parent rating scale (Conners, 1997) and three studies used the Conners Cognitive Problems and Hyperactivity parent rating scales (Conners, 1997). Four studies used a digit span task to measure working memory. No other measure was used in more than two studies.

To combine effects across studies to estimate an overall effect size, we first defined a standardized effect size to be the difference between treatment groups divided by the population standard deviation:

$$\delta = \frac{(\mu_{T1} - \mu_{T0}) - (\mu_{C1} - \mu_{C0})}{\sigma}$$

where μ denotes the mean outcome for a given group and time point, T indicates the active or more intense treatment group, C indicates the control or less intense treatment group, 1 indicates the posttreatment time point, 0 indicates the pretreatment time point, and σ indicates the standard deviation within treatment and period. A positive effect indicates that the active or more intense treatment was beneficial compared with the control or less intense treatment group. We follow the standard repeated measures analysis of variance assumption that the standard deviation is constant across the groups and time points. Further, because most authors only report pre- and posttreatment measures, we disregarded any follow-up measurements.

Most authors did not report the standardized difference and its standard error, but most did publish or provide us with enough information to calculate reasonable estimates of the effect of interest, δ. Supplementary Table II summarizes how each reported statistic or collection of statistics was transformed to provide an estimate and the standard error. Some studies, however, provided only group-by-time-specific means and measures of variability. In that case, we could not exploit the pre–post nature of the study to correctly estimate the standard deviation of the change in outcome across participants. We conservatively assumed that the pre- and postmeasures for each individual are independent, thus producing a much larger standard error for the effect than would most likely have been calculated with complete data. Results based on this assumption are plotted using open (versus filled) squares in the forest plots presented in the Results section. Because Gibson et al. (2011) compared two active treatments, effect size estimates had no meaningful direction, and so these estimates were not included in the meta-analysis and "NA"s were noted for relevant measures in the forest plots presented in the Results section.
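To make the consequence of the independence assumption concrete, the following Python sketch computes the standardized effect from group-by-time means and an approximate standard error. It is not a reproduction of the transformations in Supplementary Table II: the pooling rule, the large-sample standard error (which ignores sampling error in the standard deviation), the assumption of equal sample sizes at both time points, and all numbers are illustrative assumptions.

```python
import math


def pooled_sd(sds, ns):
    """Pool group-by-time SDs, weighting each variance by its degrees of freedom."""
    numerator = sum((n - 1) * sd ** 2 for sd, n in zip(sds, ns))
    denominator = sum(n - 1 for n in ns)
    return math.sqrt(numerator / denominator)


def effect_and_se(mean_t0, mean_t1, mean_c0, mean_c1, sd, n_t, n_c, rho=0.0):
    """Standardized difference in pre-post change and an approximate standard error.

    rho is the assumed pre-post correlation within individuals;
    rho=0.0 corresponds to the independence assumption.
    """
    delta = ((mean_t1 - mean_t0) - (mean_c1 - mean_c0)) / sd
    se = math.sqrt(2.0 * (1.0 - rho) * (1.0 / n_t + 1.0 / n_c))
    return delta, se


# Made-up summary statistics: four group-by-time cells with SD 4 and n = 20 each.
sd = pooled_sd([4.0, 4.0, 4.0, 4.0], [20, 20, 20, 20])
delta, se_independent = effect_and_se(10.0, 14.0, 10.0, 11.0, sd, 20, 20)
_, se_correlated = effect_and_se(10.0, 14.0, 10.0, 11.0, sd, 20, 20, rho=0.5)

print(f"delta = {delta:.2f}")
print(f"SE assuming independence = {se_independent:.3f}")
print(f"SE assuming rho = 0.5    = {se_correlated:.3f}")
```

Because repeated measures on the same child are usually positively correlated, setting the correlation to zero inflates the standard error, which is why the assumption is described as conservative.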

Although many of the included studies used a large number of outcome measures, none of the published data provided information about correlation among effect sizes that would allow us to correctly use more than one measure per study in any estimation of effect size within domain. Thus, we took two separate approaches to the estimation of an overall effect size. First, we took a traditional meta-analytic approach where we used the inverse variance estimator under the unrealistic assumption that the effect on each measure is independent across all the studies (DerSimonian & Laird, 1986). We labeled this the "Overall Mean Effect." We next relaxed this assumption to fit a weighted linear random effects model that accommodates within-study correlation. We called this the "Hierarchical Mean Effect."
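The inverse variance approach of DerSimonian and Laird (1986) can be sketched in a few lines of Python. This is a minimal illustration of the standard method-of-moments estimator, treating each effect estimate as independent; it is not the authors' R code, it does not implement the hierarchical model, and the input values are made up.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooled mean, its standard error, and between-study variance."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Method-of-moments estimate of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    mean = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    return mean, se, tau2


# Illustrative effect sizes and sampling variances (not data from the review).
mean, se, tau2 = dersimonian_laird([0.1, 0.5, 0.9], [0.04, 0.04, 0.04])
print(f"pooled effect = {mean:.2f}, SE = {se:.3f}, tau^2 = {tau2:.2f}")
```

When several effects come from the same study, the independence assumption built into these weights understates the uncertainty of the pooled estimate, which motivates the hierarchical alternative.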

The protocol for the review initially sought to determine the relative efficacy of interventions along six major dimensions: type of participants (i.e., are interventions more effective for neurological disorders or acquired brain injuries versus neurodevelopmental disorders?); age of participants (i.e., are interventions more effective for older than younger children?); parent involvement in intervention (i.e., are interventions more effective when they include a parent component than when they focus solely on children?); methodological quality (i.e., is efficacy a function of methodological quality of the selected studies?); transfer effects (i.e., is efficacy a function of whether outcomes are similar to those that have been trained?); and early versus late outcomes (i.e., is efficacy sustained over time after the completion of the intervention?). Unfortunately, the small number of studies and nature of the data available in those studies precluded most moderator analyses. We were able to assess differences in effect size related to type of participant (i.e., neurological disorder/acquired brain injury versus neurodevelopmental disorder) in two domains (i.e., attention tasks and working memory tasks), which were the only ones that included at least two studies of each type of participant. In each case, we used a weighted random-effects hierarchical model with a fixed effect for type of participant. We used the Kenward-Roger adjusted F-test to test for an effect of participant type (Kenward & Roger, 1997).

Heterogeneity of effects was assessed based on the I² metric, which quantifies the proportion of an observation's total variability that is due to study-to-study variation (Higgins & Thompson, 2002). We used the I² metric directly to assess heterogeneity for outcome categories with only one measurement per study (i.e., inhibitory control). For the other outcome categories, we calculated an approximate I² value by treating the study-level random effect estimate as if it were known in a standard random-effects meta-analysis. Treating the study-specific estimate as if it were directly measured in this two-step approximation slightly underestimates the study-specific variation, and thus generally overestimates I². We believe this overestimation to be small, since similar calculations based on hierarchical models result in similar estimates of heterogeneity.
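For the simple case of one measurement per study, the heterogeneity statistic of Higgins and Thompson (2002) can be computed from Cochran's Q. The Python sketch below shows that direct calculation only; it does not reproduce the two-step approximation used for the other outcome categories, and the input values are made up.

```python
def i_squared(effects, variances):
    """Percentage of total variability attributed to between-study variation."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    # Truncate at zero: Q below its degrees of freedom indicates no excess variation.
    return max(0.0, (q - df) / q) * 100.0


# Illustrative values: Q is about 8 on 2 degrees of freedom.
heterogeneity = i_squared([0.1, 0.5, 0.9], [0.04, 0.04, 0.04])
print(f"I^2 = {heterogeneity:.0f}%")
```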

Finally, we were concerned about publication bias because many studies openly noted that they reported only those results that were deemed "significant" (i.e., generally with p < .05). For these studies, we could not include any estimate for those effects that were not "significant"; even though we could estimate the effect size to be zero, we did not have a corresponding standard error necessary for including the effect in the meta-analysis. Because Johnstone et al. (2012) explicitly noted nonsignificant tests for relevant measures, but did not report corresponding statistics that could be converted into an effect estimate, we note an "NA" for these measures in the forest plots. In addition to incomplete results, entire studies may exist that were not reported in the literature or that we did not discover in our search. To address these concerns, we examined funnel plots and used the "trim and fill" method to approximate an overall effect size when a funnel plot showed evidence of publication bias. For this rough approximation, we used the estimated within-study overall effect size ("Study Mean Effect") as a single observation, and "filled in" potentially missed studies to balance the funnel plot. We take some of these "filled-in" studies to represent the average "nonsignificant" effects for the measures that had not been reported. Because this method is based entirely on the supposed symmetry of our incomplete study-specific effect sizes, we do not interpret these as true estimates, but rather as indications of the direction and magnitude of possible publication bias.

All calculations were carried out using R software version 3.0.1, including the meta package version 3.0.1 and the lme4 package version 1.0.5.

Assessment of Quality of Evidence

To complement the meta-analytic results, we also assessed the overall quality of the evidence for each outcome using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system (Guyatt et al., 2011). In the GRADE system, quality of evidence is rated very low, low, moderate, or high across all studies within a particular outcome domain. Randomized trials are initially considered to provide high-quality evidence. Five factors can lead to rating the quality of evidence lower when considered across studies (i.e., risk of bias; inconsistency in results across trials; indirectness of outcome measurement; imprecision in effect size estimates resulting from small samples or few studies; and publication bias). Three factors can lead to higher quality ratings (i.e., large magnitude of effect; dose–response gradient; and confounders minimize effect).
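The rating logic described above can be summarized as a starting level that is moved down or up by the listed factors. The Python sketch below is a deliberate simplification: GRADE ratings rest on reviewer judgment rather than mechanical arithmetic, and the factor keys and the number of levels moved per factor are assumptions made for illustration.

```python
GRADE_LEVELS = ["very low", "low", "moderate", "high"]
DOWNGRADE_FACTORS = {
    "risk_of_bias", "inconsistency", "indirectness", "imprecision", "publication_bias",
}
UPGRADE_FACTORS = {"large_effect", "dose_response", "confounders_minimize_effect"}


def grade_rating(downgrades, upgrades, start="high"):
    """Combine per-factor level changes into a final rating.

    downgrades and upgrades map factor names to the number of levels moved.
    Randomized trials start at "high", as stated in the text.
    """
    unknown = (set(downgrades) - DOWNGRADE_FACTORS) | (set(upgrades) - UPGRADE_FACTORS)
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")
    index = GRADE_LEVELS.index(start)
    index = index - sum(downgrades.values()) + sum(upgrades.values())
    # Ratings cannot fall below "very low" or rise above "high".
    index = max(0, min(index, len(GRADE_LEVELS) - 1))
    return GRADE_LEVELS[index]


# Hypothetical outcome domain: randomized trials with imprecise, inconsistent results.
rating = grade_rating({"imprecision": 1, "inconsistency": 1}, {})
print(rating)
```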