Prospective External Validation of a New Non-invasive Test for the Diagnosis of Non-alcoholic Steatohepatitis in Patients With Type 2 Diabetes

Thierry Poynard; Valérie Paradis; Jimmy Mullaert; Olivier Deckmyn; Nathalie Gault; Estelle Marcault; Pauline Manchon; Nassima Si Mohammed; Beatrice Parfait; Mark Ibberson; Jean-Francois Gautier; Christian Boitard; Sébastien Czernichow; Etienne Larger; Fabienne Drane; Jean Marie Castille; Valentina Peta; Angélique Brzustowski; Benoit Terris; Anais Vallet-Pichard; Dominique Roulot; Cédric Laouénan; Pierre Bedossa; Laurent Castera; Stanislas Pol; Dominique Valla


Aliment Pharmacol Ther. 2021;54(7):952-966. 

Patient Characteristics

The study flow chart of patients included in the validation population with biopsy is presented in Figure 1. Table 2 presents the clinical, serological and histological characteristics and the NashFibroTest data for all 272 included patients.

Figure 1.

Study flow chart of the core population with biopsy. Of 325 patients enrolled, 272 were eligible. Among the 275 patients with an interpretable biopsy, 272 had a reliable FibroTest, NashTest-2 and SteatoTest-2. Only one patient with an unreliable FibroTest was excluded.

Liver Biopsies. A total of 325 patients underwent liver biopsy (see Figure 1). The median (IQR) biopsy length was 17 mm (8 mm), and the median interval between the blood test and the biopsy was 55 days (31–82) (Table 2). Biopsies were not available for 50 patients, 22 were not read centrally, and 18 were inadequate (Table S1). Only one significant adverse event was observed among the 325 patients: an accidental, asymptomatic intestinal biopsy.

Main Outcomes. The Obuchowski measure (SE; significance) was 0.862 (0.012; P < 0.001) for the FibroTest, 0.827 (0.015; P < 0.001) for NashTest-2 and 0.794 (0.020; P < 0.01) for SteatoTest-2. The corresponding medians and IQRs by stage and grade are shown in Figure 2A–C, respectively.
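The Obuchowski measure generalises the binary AUROC to an ordinal reference with several categories: it combines the pairwise AUROCs over every pair of stages (or grades), with misranking between two categories optionally scaled by a penalty. The Python sketch below shows a simplified version that gives every pair of categories equal weight. The function names and the `penalty` argument are hypothetical, and the penalty function used by the authors is not stated in this section.

```python
from itertools import combinations


def pairwise_auroc(lower, higher):
    """Mann-Whitney estimate of P(score in `higher` > score in `lower`); ties count 0.5."""
    wins = 0.0
    for h in higher:
        for lo in lower:
            if h > lo:
                wins += 1.0
            elif h == lo:
                wins += 0.5
    return wins / (len(lower) * len(higher))


def obuchowski_measure(scores, stages, penalty=None):
    """Simplified multi-category AUROC: 1 - mean over stage pairs of penalty * (1 - AUROC).

    `penalty(a, b)` is a hypothetical weight in [0, 1] for confusing stages a < b;
    the default of 1 for every pair reduces to the plain mean of pairwise AUROCs.
    """
    by_stage = {}
    for score, stage in zip(scores, stages):
        by_stage.setdefault(stage, []).append(score)
    levels = sorted(by_stage)
    pairs = list(combinations(levels, 2))  # every (a, b) with a < b
    if not pairs:
        raise ValueError("need at least two observed stages")
    loss = 0.0
    for a, b in pairs:
        weight = 1.0 if penalty is None else penalty(a, b)
        loss += weight * (1.0 - pairwise_auroc(by_stage[a], by_stage[b]))
    return 1.0 - loss / len(pairs)


# Toy data (not from the study): three stages, two patients each
stages = [0, 0, 1, 1, 2, 2]
scores = [0.10, 0.25, 0.20, 0.40, 0.55, 0.70]
print(obuchowski_measure(scores, stages))  # ≈ 0.917
# Distance-based penalty scaled to [0, 1] for stages 0 to 2
print(obuchowski_measure(scores, stages, penalty=lambda a, b: (b - a) / 2))  # ≈ 0.958
```

With the default penalty of 1 for every pair, the result is simply the mean of the pairwise AUROCs; a penalty that grows with the distance between stages makes confusing adjacent stages cost less than confusing distant ones.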

Figure 2.

FibroTest performance in 272 patients with type 2 diabetes for fibrosis staging. (A) FibroTest differed significantly for stage F0 (n = 54) vs F2, F3 and F4; stage F1 (n = 65) vs F2, F3 and F4; stage F2 (n = 50) vs F0, F1 and F4; stage F3 (n = 74) vs F0 and F1; and stage F4 (n = 29) vs F0, F1 and F2. All 272 patients had reliable tests and centralised biopsies. The corresponding Obuchowski measure (SE; significance) was 0.862 (0.012; P < 0.001). (B) NashTest-2 performance in 272 patients with type 2 diabetes for NASH grading. NashTest-2 differed significantly for grade A0 (n = 57) vs A2 and A3; grade A1 (n = 51) vs A3; grade A2 (n = 73) vs A0 and A3; and grade A3 (n = 91) vs A0, A1 and A2. The corresponding Obuchowski measure (SE; significance) was 0.827 (0.015; P < 0.001). (C) SteatoTest-2 performance in 272 patients with type 2 diabetes for steatosis grading. By definition there were no S0 patients, and only two were S1. SteatoTest-2 differed significantly for grade S3 (n = 207) vs S2 (n = 58; P = 0.03).

Test performance in the diabetes population is compared with that in the original population in Table 3 (Obuchowski measures, standard binary AUROCs and, for the FibroTest, the adjusted AUROC), and the spectrum of stages and grades in the two populations is compared in Figure 3.

Figure 3.

Spectrum of stages and grades in the original (upper row) and diabetes (lower row) subsets. The spectrum of fibrosis stages was not uniform in the original subset but almost uniform in the diabetes subset. The prevalence of F3 and F4 was twice as high in the diabetes subset as in the original subset. The difference between the mean stages of advanced and non-advanced fibrosis was 2.32 in diabetes and 2.11 in the original population, resulting in a slight underestimation of the binary AUROC in both subsets relative to a perfectly uniform distribution. Binary AUROCs were 0.76 and 0.84 after standardisation vs 0.72 and 0.80 before, for the diabetes and original subsets, respectively.
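The spectrum argument in this caption rests on the difference between the mean stage of "advanced" and "non-advanced" patients, often abbreviated DANA. The sketch below computes it from per-stage counts. It assumes, as an interpretation rather than a statement from this section, that the binary endpoint is F2–F4 vs F0–F1: with that cutoff the stage counts in the Figure 2A caption give a prevalence of 153/272 (56%) and a DANA of 2.32, matching the values quoted for the diabetes population, while a perfectly uniform spectrum gives 2.5. The function name is hypothetical.

```python
def dana(stage_counts, cutoff):
    """Mean stage of 'advanced' patients (stage >= cutoff) minus mean stage of
    'non-advanced' patients (stage < cutoff), from a {stage: count} mapping."""
    adv = {s: n for s, n in stage_counts.items() if s >= cutoff}
    non = {s: n for s, n in stage_counts.items() if s < cutoff}
    mean_adv = sum(s * n for s, n in adv.items()) / sum(adv.values())
    mean_non = sum(s * n for s, n in non.items()) / sum(non.values())
    return mean_adv - mean_non


# Stage counts F0..F4 from the Figure 2A caption (diabetes population)
diabetes = {0: 54, 1: 65, 2: 50, 3: 74, 4: 29}
print(round(dana(diabetes, cutoff=2), 2))  # 2.32
print(sum(n for s, n in diabetes.items() if s >= 2) / sum(diabetes.values()))  # 0.5625

# A perfectly uniform spectrum over F0..F4 gives the reference value
print(dana({s: 1 for s in range(5)}, cutoff=2))  # 2.5
```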

The influence of the prevalence of advanced fibrosis on the PPV and NPV (95% CI) is provided in Table 5, with references in File S3. At the standard predetermined cutoff, the PPV was 80% (71–89) and the NPV was 56% (48–63) in the present type 2 diabetes population, which had a high (56%) prevalence of advanced fibrosis. In a large cohort of 30 761 NAFLD patients with type 2 diabetes and a 32% prevalence of advanced fibrosis, the PPV was 61% (52–69) and the NPV was 78% (69–86). In a group representative of the French general population, with the lowest prevalence of advanced fibrosis (2.8%), the PPV was 9% (8–10) and the NPV was 98% (97–99).

The influence of the prevalence of significant NASH on the PPV and NPV is also presented in Table 5. At the standard predetermined NashTest-2 cutoff (0.50), the PPV was 64% (57–70) and the NPV was 63% (45–79) in the present type 2 diabetes population, which had a high (56%) prevalence of significant NASH. In a large cohort of 89 427 NAFLD patients with type 2 diabetes and a 60% prevalence of significant NASH, the PPV was 67% (60–73) and the NPV was 59% (49–74). In a group representative of the French general population, with the lowest prevalence of NASH (1.1%), the PPV was 1.3% (0.8–1.8) and the NPV was 99% (98–100).
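The dependence of PPV and NPV on prevalence follows from Bayes' theorem if sensitivity and specificity are assumed to be the same in every population. The sketch below, with hypothetical function names, recovers sensitivity and specificity from the PPV, NPV and prevalence observed in this study and projects them to other prevalences. It illustrates that transportability assumption and is not presented as the authors' method, which is described in Table 5 and File S3.

```python
def predictive_values(sens, spec, prev):
    """PPV and NPV from sensitivity, specificity and prevalence (Bayes' theorem)."""
    tp = sens * prev
    fp = (1 - spec) * (1 - prev)
    tn = spec * (1 - prev)
    fn = (1 - sens) * prev
    return tp / (tp + fp), tn / (tn + fn)


def sens_spec_from_predictive(ppv, npv, prev):
    """Invert predictive_values: recover sensitivity and specificity from the PPV
    and NPV observed at a given prevalence (requires ppv + npv != 1)."""
    # x = proportion of test-positive patients; FP + TN must equal 1 - prev,
    # i.e. (1 - ppv) * x + npv * (1 - x) = 1 - prev
    x = (1 - prev - npv) / (1 - ppv - npv)
    sens = ppv * x / prev
    spec = npv * (1 - x) / (1 - prev)
    return sens, spec


# Values quoted in the text, all observed at 56% prevalence in this study
for label, ppv0, npv0, prevalences in [
    ("Fibrosis (FibroTest)", 0.80, 0.56, (0.56, 0.32, 0.028)),
    ("NASH (NashTest-2)", 0.64, 0.63, (0.56, 0.60, 0.011)),
]:
    sens, spec = sens_spec_from_predictive(ppv0, npv0, 0.56)
    for prev in prevalences:
        ppv, npv = predictive_values(sens, spec, prev)
        print(f"{label}, prevalence {prev:.1%}: PPV {ppv:.0%}, NPV {npv:.0%}")
```

Starting from the fibrosis values, this gives a PPV of about 60% and an NPV of about 77% at 32% prevalence, and about 8% and 98% at 2.8%, within about 1.5 percentage points of the reported figures. The NashTest-2 values project to about 68% and 59% at 60% prevalence.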

Secondary Outcomes. NashTest-2 also showed significant diagnostic performance for each elementary feature of NASH activity, according to both the SAF and CRN scoring systems. All w-AUROCs were above 0.790 (P < 0.001). Results are presented in Figure S1A for CRN ballooning, Figure S1B for SAF ballooning (0.794 [0.021]), Figure S1C for CRN lobular inflammation, Figure S1D for SAF lobular inflammation (0.821 [0.017]), Figure S1E for portal inflammation and Figure S1F for Mallory bodies.

Post hoc Comparisons Between NITs, VCTE, FIB-4 and CAP. FibroTest was performed in 273 of the eligible patients and was reliable in 272, a reliability of 99.6% (98.0–100). VCTE was reliable in 260 of the 272 included patients, a reliability of 95.6% (92.5–97.4). Among the 272 patients with paired NITs, FibroTest reliability exceeded that of VCTE by 4.1% (2.5–6.5; P < 0.001). In an intention-to-diagnose analysis, the Obuchowski measure (SE) for fibrosis stage was 0.859 (0.012) for the FibroTest and 0.870 (0.009) for VCTE, a non-significant difference (P = 0.10). When the analysis was restricted to reliable stiffness measurements, the Obuchowski measure was higher for VCTE, 0.910 (0.009), than for the FibroTest, 0.862 (0.012; P = 0.009). FIB-4 could not be analysed on an intention-to-diagnose basis; its standard Obuchowski measure, 0.828 (0.011), was lower than that of the FibroTest (P = 0.02) and VCTE (P = 0.001).
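Reliability here is a binomial proportion, so its confidence interval can be sketched with a standard interval for a proportion. The section does not say which method the authors used; the example below uses the Wilson score interval as a stand-in. It gives 98.0% to 99.9% for 272/273 and 92.4% to 97.5% for 260/272, close to but not identical with the reported intervals, so the authors' method likely differs. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval (95% by default) for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Reliable tests / tests performed, as reported in the text
for label, k, n in [("FibroTest", 272, 273), ("VCTE", 260, 272)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```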

Diagnostic performance for other endpoints is described in Table S2. Overall results were similar for cases with a biopsy ≥15 mm; for cases with a biopsy <15 mm, the only significant difference was a higher Obuchowski measure for VCTE than for FIB-4. CAP could not be compared with SteatoTest-2 on an intention-to-diagnose basis because there is no recognised cutoff for CAP reliability. For S3 vs S2, the binary AUROC was 0.60 (0.52–0.67) for SteatoTest-2 and 0.69 (0.60–0.77) for CAP, a non-significant difference (Z = 1.71; P = 0.09).
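The section reports a Z statistic for comparing two AUROCs measured on the same patients but does not name the test. A common choice for correlated AUROCs is the nonparametric method of DeLong and colleagues; the sketch below implements it on the assumption that this, or something close to it, was used. Function names and the toy data are hypothetical. As a check that does not depend on that assumption, the two-sided P for Z = 1.71 under a standard normal distribution is about 0.087, which rounds to the reported 0.09.

```python
import math
from statistics import NormalDist


def _psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 if tied, 0 otherwise."""
    return 1.0 if x > y else (0.5 if x == y else 0.0)


def _components(pos, neg):
    """AUROC of one test plus DeLong's per-patient structural components."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]  # one per positive
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]  # one per negative
    return sum(v10) / len(pos), v10, v01


def _cov(u, v):
    """Sample covariance with an n - 1 denominator."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def delong_paired(pos_a, neg_a, pos_b, neg_b):
    """Two-sided DeLong test for two correlated AUROCs on the same patients.

    pos_a[i] and pos_b[i] are tests A and B on the same i-th reference-positive
    patient; neg_a[j] and neg_b[j] likewise for reference-negative patients.
    Returns (auc_a, auc_b, z, p).
    """
    auc_a, v10_a, v01_a = _components(pos_a, neg_a)
    auc_b, v10_b, v01_b = _components(pos_b, neg_b)
    m, n = len(pos_a), len(neg_a)
    variance = (
        (_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
        + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n
    )
    if variance <= 0:
        raise ValueError("zero variance; the test is undefined for these data")
    z = (auc_a - auc_b) / math.sqrt(variance)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return auc_a, auc_b, z, p


# Toy paired data (not from the study): 3 positives and 3 negatives
print(delong_paired([0.9, 0.8, 0.7], [0.1, 0.2, 0.75],
                    [0.6, 0.85, 0.3], [0.5, 0.4, 0.2]))

# Consistency check on the reported statistic: two-sided P for Z = 1.71
print(round(2 * (1 - NormalDist().cdf(1.71)), 3))  # 0.087
```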

Effect of Biopsy Uncertainty on Test Performance. The effect of biopsy uncertainty on the diagnostic performance of the FibroTest was significantly associated with specimen length (Figure 4A). With 25 mm biopsies as the comparator,[33] the maximum expected binary AUROC of an ideal NIT for fibrosis in a study of 272 patients would be 0.83. In the present study, the median biopsy length was 17 mm, and the maximum expected AUROC of an ideal NIT decreased to 0.70 because of the 30% misclassification rate of the biopsy.[33]
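The ceiling on the AUROC comes from treating the biopsy as an imperfect reference. The sketch below uses a deliberately simple model, which is an assumption and not necessarily the model of reference 33: the biopsy mislabels the same fraction e of diseased and non-diseased patients, and a perfect test ranks every truly diseased patient above every truly non-diseased one but carries no information about which biopsies are wrong. At 50% prevalence this reduces to AUROC = 1 − e, so a 30% misclassification rate caps the AUROC at 0.70, as in the text, and a ceiling of 0.83 would correspond to about 17% misclassification under the same assumptions. The function name is hypothetical.

```python
def max_expected_auroc(prevalence, misclassification):
    """Expected AUROC of a perfect test when the reference standard mislabels the
    same fraction `misclassification` of diseased and non-diseased patients."""
    p, e = prevalence, misclassification
    ref_pos = p * (1 - e) + (1 - p) * e   # proportion labelled positive by the reference
    ref_neg = (1 - p) * (1 - e) + p * e   # proportion labelled negative by the reference
    a = p * (1 - e) / ref_pos             # P(truly diseased | reference positive)
    b = (1 - p) * (1 - e) / ref_neg       # P(truly non-diseased | reference negative)
    # A reference-positive/negative pair is ordered correctly if both labels are right,
    # reversed (contributes 0) if both are wrong, and ranked at random (0.5) when
    # both patients share the same true status.
    return a * b + 0.5 * (a * (1 - b) + (1 - a) * b)


print(round(max_expected_auroc(0.5, 0.30), 2))  # 0.7
print(round(max_expected_auroc(0.5, 0.17), 2))  # 0.83
```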

Figure 4.

The effect of biopsy uncertainty on patient classification, due to specimen length, in relation to the diagnostic performance of FibroTest (Panel A) and NashTest-2 (Panel B). The ground truth is a large surgical liver specimen (FP, false positive; FN, false negative). In this study of 272 patients with a median biopsy length of 17 mm, the expected area under the ROC curve (AUROC) of the FibroTest (or NashTest-2), with biopsy as the comparator, cannot exceed 0.70 whatever its true performance, because of the 30% misclassification rate of the biopsy. PPV, positive predictive value; NPV, negative predictive value. The terms positive per cent agreement (PPA) and negative per cent agreement (NPA) are used instead of sensitivity and specificity, respectively, when the comparator is known to contain uncertainty.

The effect of biopsy uncertainty on the diagnostic performance of NashTest-2 was even greater than for fibrosis and was also significantly associated with specimen length, with the maximum expected AUROC decreasing from 0.69 with 25 mm biopsies to 0.60 with 17 mm biopsies (Figure 4B).

Analysis of Severe Discordances. A major discordance between biopsy and FibroTest was found in 28 patients (10.3%; 7.0–14.5). After adjudication, 10 (3.7%; 1.8–6.7) were considered FibroTest errors, 5 (1.8%; 0.6–4.2) biopsy errors and 14 (5.1%; 2.8–8.5) indeterminate (Table S3A). A major discordance between biopsy and NashTest-2 was found in nine patients (3.3%; 1.5–6.2). After adjudication, 7 (3.7%; 1.8–6.7) were considered biopsy errors and two (2.6%; 1.0–5.2) indeterminate (Table S3B).