Prospective External Validation of a New Non-invasive Test for the Diagnosis of Non-alcoholic Steatohepatitis in Patients With Type 2 Diabetes

Thierry Poynard; Valérie Paradis; Jimmy Mullaert; Olivier Deckmyn; Nathalie Gault; Estelle Marcault; Pauline Manchon; Nassima Si Mohammed; Beatrice Parfait; Mark Ibberson; Jean-Francois Gautier; Christian Boitard; Sébastien Czernichow; Etienne Larger; Fabienne Drane; Jean Marie Castille; Valentina Peta; Angélique Brzustowski; Benoit Terris; Anais Vallet-Pichard; Dominique Roulot; Cédric Laouénan; Pierre Bedossa; Laurent Castera; Stanislas Pol; Dominique Valla


Aliment Pharmacol Ther. 2021;54(7):952-966. 

Research Design and Methods

Study Participants and Design

The primary aim of this prospective cross-sectional multicentre study in patients with type 2 diabetes was to assess the diagnostic accuracy of the FibroTest, NashTest-2 and SteatoTest-2, using liver histology as the reference to evaluate liver fibrosis, activity and steatosis. NAFLD was suspected based on the presence of abnormal liver enzymes as well as an ultrasound scan showing a bright liver echo pattern, in patients with type 2 diabetes diagnosed in a diabetology outpatient clinic.

The STARD and FibroSTARD guidelines were followed (File S2) particularly for items 13.7 and 13.8.[23] Consecutive patients were prospectively recruited between October 2018 and 2020 in four outpatient diabetology clinics in the Assistance-Publique-Hôpitaux-de-Paris (File S1). The study (NCT03634098) was approved by the Research Ethics Committee #18.021–2018-A00311-54. All patients gave written informed consent. The study was performed in accordance with the declaration of Helsinki. All authors had access to the study data and reviewed and approved the final manuscript.

Main Analyses

The primary objective of the study was to evaluate the diagnostic accuracy of each component of the NashFibroTest (the FibroTest, the NashTest and the SteatoTest) in relation to the histological evaluation of fibrosis, NASH activity and steatosis. The primary endpoint was the validation of the FibroTest because the stage of fibrosis is the main prognostic criterion compared with the grades of NASH or steatosis.[1–4]

Inclusion and Exclusion Criteria in the Validation Population

Inclusion criteria were as follows: patients were ≥18 years of age, able to give written informed consent, with type 2 diabetes defined according to American Diabetes Association (ADA) or World Health Organization (WHO) criteria,[24] and were scheduled, independently from this study, to undergo a liver biopsy for investigation of suspected NAFLD within 4 weeks after ultrasonography and alanine aminotransferase (ALT) assessment. These patients had abnormal transaminases and had to be negative for standard tests for liver diseases (File S3). Exclusion criteria were as follows: patients with HBV, HCV or autoimmune diseases, pregnant women, patients without national health insurance, patients with a history of chronic liver disease, patients with serum haemoglobin <7 g/L or <10 g/L in the presence of cardiovascular or pulmonary disease, patients who refused liver biopsy or tests, patients with significant alcohol consumption (≥30 g/day for males and ≥20 g/day for females) and those with a serum carbohydrate-deficient transferrin percentage >2%, and patients with a terminal disease.

Patient Characteristics

The following characteristics were recorded in all patients: age, gender, body mass index (BMI), temperature, the presence of diabetes and arterial pressures. A 12-hour fasting blood test was performed locally on fresh samples for assessment of the following parameters: platelet count, aspartate transaminase (AST), ALT, gamma-glutamyl-transferase (GGT), alkaline phosphatase, albumin, bilirubin, fasting glucose, total cholesterol, high-density and low-density lipoprotein cholesterol, triglycerides, ferritin, urea, creatinine, alpha-2-macroglobulin (A2M), A1C-haemoglobin, insulinaemia, HOMA score, sodium, potassium, calcium and C-reactive protein.

Histopathologic Evaluation in the Validation Group

Liver biopsy (intercostal or transvenous) was performed in all patients according to the standard local procedure. Biopsy specimens were fixed in formalin, embedded in paraffin and stained with haematoxylin and eosin and Sirius Red. Slides were analysed in each centre by an experienced pathologist (VP, BT) and then centrally reviewed by a single experienced pathologist (PB) for the read-outs, blinded to all patient characteristics. The length and the number of fragments were assessed, and the quality scored according to a three-class classification (adequate, marginal and inadequate). The cause of any inadequate liver biopsy was specified: length, fragmentation or technical issues, that is, inadequate staining, or granuloma. NASH was diagnosed according to the presence of steatosis, hepatocyte ballooning (three grades 0–2) and lobular inflammation (three grades 0–2) with at least 1 point for each category. NAFLD activity (Nash score) was scored using both the SAF (main outcome in four classes),[8–10] and NASH-CRN scoring systems, which are different for several feature scores (File S3).[21]

Fibrosis was scored using the same SAF and CRN definition in five stages from 0 to 4.[21] Steatosis was scored in four grades (from 0, 'less than 5%', 1 '5%-30%', 2 '33%-66%', to 3 'more than 66% of hepatocytes with steatosis'). Portal inflammation and Mallory bodies were also graded into three classes. Liver biopsies were categorised by pathologists as a normal liver (no liver pathology), NAFL (steatosis but no NASH), NASH, or other diagnosis when there was no NAFLD but histological features suggesting another diagnosis were observed.

Nash FibroTest panel

The FibroTest is called the NASH-FibroSure® (LabCorp) in the USA. The FibroTest includes A2M, apolipoprotein-A1, haptoglobin, total bilirubin and GGT.[5,25] The comparative components of the FibroTest, the new NashTest-2,[6] the SteatoTest-2[7] and the original NashTest and SteatoTest are described in Table 1. Compared to the original NashTest,[26] NashTest-2 was developed for a quantitative diagnosis of NASH (SAF score as reference) with no need for the body mass index (BMI). Compared to the original SteatoTest,[27] SteatoTest-2 was constructed without total bilirubin and BMI.[7] The tests were all adjusted for age and gender. All components were assessed on fresh samples. The pre-analytical and analytical procedures were those recommended by BioPredictive. Non-reliable results identified using security control algorithms were excluded.[28] Using both the FibroTest and the NashTest-2, it was possible to predict the presence or the absence of clinically significant NAFLD as defined by the histological SAF score: fibrosis stage ≥F2 (FibroTest >0.48, the standard cutoff for stage F2-F3-F4)[5,25] and/or activity grade ≥A2 (NashTest-2 ≥0.50).[6]
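The combined decision rule described above can be sketched in a few lines. This is a minimal illustration only: the two cutoffs are taken from the text, while the function name and the assumption that both scores lie between 0 and 1 are ours and are not part of the BioPredictive algorithms.

```python
# Cutoffs quoted in the text; everything else here is an illustrative assumption.
FIBROTEST_F2_CUTOFF = 0.48   # FibroTest > 0.48 predicts fibrosis stage >= F2
NASHTEST2_A2_CUTOFF = 0.50   # NashTest-2 >= 0.50 predicts activity grade >= A2


def clinically_significant_nafld(fibrotest: float, nashtest2: float) -> bool:
    """Return True when either score crosses its cutoff (the "and/or" rule).

    Hypothetical helper: assumes both scores are reported on a 0-1 scale.
    """
    for name, value in (("fibrotest", fibrotest), ("nashtest2", nashtest2)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    significant_fibrosis = fibrotest > FIBROTEST_F2_CUTOFF    # strict inequality
    significant_activity = nashtest2 >= NASHTEST2_A2_CUTOFF   # inclusive
    return significant_fibrosis or significant_activity
```

Note that the two comparisons differ at the boundary: a FibroTest of exactly 0.48 does not count as significant fibrosis, whereas a NashTest-2 of exactly 0.50 counts as significant activity.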

Effect of the Uncertainty of Biopsy on Tests Performances

Biopsy is an imperfect gold standard.[18,29–33] We used the method recently suggested by McHugh et al[33] and, for the first time, assessed the effect of uncertainty in the patient classification (Files S3 and S4). The performance of any test must be evaluated with reference to a comparator. The presence of classification uncertainty in the comparator (here biopsy) is therefore an important confounding factor when interpreting the diagnostic performance of the test (ie FibroTest). We report the comparator uncertainty together with the estimated performance of the test. As increasing amounts of noise are introduced into the biopsy, for example through a shorter biopsy length,[32,33] the apparent performance of the diagnostic test compared to biopsy, the comparator, will decrease accordingly. Each amount of uncertainty (here the false positive/negative rate of biopsy according to the specimen length vs large surgical biopsy) was randomly introduced over 100 iterations and the aggregate results are shown. This simulation was implemented using the online simulation tool.[33] The performances of liver biopsy were those assessed using large surgical biopsy as the ground truth for staging.[34] The percentage of 25 mm liver biopsies that were correctly classified for fibrosis by the METAVIR score was 75%. Thus, a 25 mm biopsy was considered to have a sensitivity of 82.5% and a specificity of 82.5% for the percentage of correct classifications into the five stages of fibrosis. The same method was applied for the NashTest-2. There was no large surgical biopsy for ground truth in NASH; thus, we used the results of repeated biopsies as ground truth, as recommended.[18,33–35]
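The principle behind this simulation can be illustrated with a simplified binary sketch. It is not the online tool of McHugh et al: here the true status is assumed known, the comparator is degraded by flipping labels according to an assumed sensitivity and specificity, and the apparent AUROC of the test is averaged over 100 iterations. All function names are hypothetical.

```python
import random
from statistics import mean


def auroc(scores, labels):
    """Binary AUROC as the Mann-Whitney probability (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def degrade_reference(truth, sensitivity, specificity, rng):
    """Simulate an imperfect comparator by randomly flipping true labels."""
    noisy = []
    for y in truth:
        keep_probability = sensitivity if y == 1 else specificity
        noisy.append(y if rng.random() < keep_probability else 1 - y)
    return noisy


def apparent_auroc(scores, truth, sensitivity, specificity,
                   iterations=100, seed=0):
    """Mean AUROC of the test measured against the noisy comparator."""
    rng = random.Random(seed)
    results = []
    for _ in range(iterations):
        noisy = degrade_reference(truth, sensitivity, specificity, rng)
        if 0 < sum(noisy) < len(noisy):   # skip degenerate one-class draws
            results.append(auroc(scores, noisy))
    return mean(results)
```

With a comparator sensitivity and specificity of 0.825, as assumed in the text for a 25 mm biopsy, the apparent AUROC of any informative test falls below the value obtained against the error-free reference, even though the test itself is unchanged.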

Discordance Analysis

A major discordance was defined as a difference >2 stages for fibrosis, or >2 grades for activity according to the SAF score, which could influence clinical decision-making. For steatosis, as NAFLD and NASH required the presence of steatosis, no major discordances could be observed. To attribute these major discordances to biopsy or to the Nash-FibroTest panel, reliable VCTE and FIB4 were used for the staging of fibrosis, and ALT, AST and GGT levels were used to grade significant NASH. All cases with such major discordances were independently adjudicated by two clinicians (DV and TP).[32]

Comparisons Between NashFibroTest, VCTE and FIB-4

A prospective, direct comparison between the FibroTest, VCTE and FIB4 in intention to diagnose and per-protocol analyses would have required 600 cases, based on the multiple comparisons between the Obuchowski measure and reliability.[36] These comparisons have been scheduled in other work packages of the Quid-Nash consortium. In this study, we performed a post hoc analysis to compare the reliabilities and diagnostic performances of FibroTest, VCTE and FIB4 for fibrosis, and of SteatoTest-2 and controlled attenuation parameter (CAP) for steatosis. The VCTE FibroScan (FibroScan 502Touch model, Echosens, Paris, France) examination was performed by nurses or physicians trained and certified by the manufacturer and blinded to the patient's histological evaluation and NashFibroTest. Only examinations with at least 10 valid liver stiffness measurements (LSMs) as well as those with LSMs median/IQR ratio ≥30%, both for LSMs and CAP, were considered to be valid.[30,37,38] FIB-4 was assessed with the original formula: $\text{FIB-4} = (\text{age [yr]} \times \text{AST [IU/L]}) / (\text{PLT } [10^{9}/\text{L}] \times \sqrt{\text{ALT [IU/L]}})$.[39]
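As a worked illustration of the FIB-4 formula, the following minimal sketch computes the index from the four inputs. The function and parameter names are hypothetical, and the input check is our own addition.

```python
import math


def fib4(age_years: float, ast_iu_l: float, alt_iu_l: float,
         platelets_1e9_l: float) -> float:
    """FIB-4 = (age x AST) / (platelets x sqrt(ALT)).

    Units: age in years, AST and ALT in IU/L, platelets in 10^9/L.
    """
    if min(age_years, ast_iu_l, alt_iu_l, platelets_1e9_l) <= 0:
        raise ValueError("all inputs must be strictly positive")
    return (age_years * ast_iu_l) / (platelets_1e9_l * math.sqrt(alt_iu_l))


# Example: a 60-year-old with AST 40 IU/L, ALT 36 IU/L, platelets 200 x 10^9/L
# gives (60 x 40) / (200 x 6) = 2.0.
example_value = fib4(60, 40, 36, 200)
```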

Statistical Analysis

The chosen sample size of n = 300 for the primary aim of the study was the same as that used for the internal validation of SteatoTest-2 and for validation of the original SteatoTest.[7] Evidence of differences in variables between the stages of fibrosis and the grades of NASH or of steatosis was evaluated with the Kruskal-Wallis test, followed by post hoc comparisons with Dunnett's test. P values <0.05 were considered to be statistically significant.

The overall diagnostic accuracy of tests (main outcome) and VCTE and FIB4 (post hoc analysis) was estimated by the Obuchowski measure together with the standard error, to take into account the spectrum effect.[19,20,23] The performances of the FibroTest, NashTest-2 and SteatoTest-2 were assessed using the Obuchowski measure, the main outcome recommended as a summary measure of accuracy, which includes all pairwise comparisons of stages and grades; this is not provided by the extensively used binary area under the ROC curve.[19,20] The Obuchowski measure can be interpreted as the probability that the non-invasive index will correctly rank two randomly chosen patient samples from different fibrosis stages according to the weighting scheme, with a penalty for misclassifying patients. The binary area under the ROC curve only measures the probability of being lower or higher than the cutoff, that is, 0.48 for the FibroTest for stages F0F1 vs F2F3F4 (significant fibrosis), which is a single comparison. The Obuchowski measure summarises the performance of all pairwise comparisons, that is, 10 comparisons for the five stages of fibrosis (F0 to F4).[19]
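To make the idea of a weighted summary over all pairwise comparisons concrete, here is a simplified sketch. It is not the authors' implementation: the weights (proportional to the product of the two group sizes) and the optional stage-distance penalty are illustrative assumptions, and all function names are hypothetical.

```python
from itertools import combinations


def pairwise_auc(lower_scores, higher_scores):
    """Probability that a higher-stage score exceeds a lower-stage score."""
    wins = sum((h > l) + 0.5 * (h == l)
               for h in higher_scores for l in lower_scores)
    return wins / (len(higher_scores) * len(lower_scores))


def stage_pairs(scores_by_stage):
    """All pairs (i, j) with i < j among the non-empty stages."""
    stages = sorted(s for s, values in scores_by_stage.items() if values)
    return list(combinations(stages, 2))


def uniform_penalty(i, j, max_distance):
    """No extra penalty: every pair counts fully."""
    return 1.0


def stage_distance_penalty(i, j, max_distance):
    """Assumed penalty: confusing distant stages costs more than adjacent ones."""
    return abs(j - i) / max_distance


def obuchowski_measure(scores_by_stage, penalty=uniform_penalty):
    """1 - sum over pairs of weight x penalty x (1 - pairwise AUC)."""
    pairs = stage_pairs(scores_by_stage)
    if not pairs:
        raise ValueError("at least two non-empty stages are required")
    sizes = {s: len(v) for s, v in scores_by_stage.items() if v}
    norm = sum(sizes[i] * sizes[j] for i, j in pairs)
    max_distance = max(j - i for i, j in pairs)
    loss = 0.0
    for i, j in pairs:
        weight = sizes[i] * sizes[j] / norm
        auc = pairwise_auc(scores_by_stage[i], scores_by_stage[j])
        loss += weight * penalty(i, j, max_distance) * (1.0 - auc)
    return 1.0 - loss
```

With five non-empty stages, `stage_pairs` yields the 10 comparisons mentioned in the text. With the uniform penalty the result reduces to a weighted mean of the pairwise AUROCs, so an uninformative index scores 0.5 and a perfectly ranking index scores 1.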

'To compare the performance of FibroTest between the original Construction and the Validation subsets, we assessed the "spectrum adjusted" binary AUROC (binaryAUROCsa), together with the associated difference between the mean fibrosis stage of (F2 + F3 + F4) and the mean fibrosis stage of (F0 + F1), as previously described.[20] This allowed the spectrum effect to be estimated without computing the individual data. The binaryAUROCsa is calculated from its linear regression curve with the binary AUROC. The maximum difference is 4, when all patients are F0 or F4. The minimum is 1, when all patients are F1 or F2. With a uniform prevalence of stages (20% for each of the five stages), the difference is 2.5.'[20]
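The difference in mean fibrosis stage between the (F2 + F3 + F4) and (F0 + F1) groups can be checked with a short calculation. This sketch covers only that difference; the regression used to derive the binaryAUROCsa is not specified in this section and is not reproduced. The function name and the dictionary layout are assumptions.

```python
def mean_stage_difference(counts_by_stage, cutoff=2):
    """Mean stage of patients at or above `cutoff` minus mean stage below it.

    `counts_by_stage` maps a fibrosis stage (0-4) to a number of patients.
    """
    def mean_stage(group):
        total = sum(group.values())
        if total == 0:
            raise ValueError("each group needs at least one patient")
        return sum(stage * n for stage, n in group.items()) / total

    advanced = {s: n for s, n in counts_by_stage.items() if s >= cutoff}
    non_advanced = {s: n for s, n in counts_by_stage.items() if s < cutoff}
    return mean_stage(advanced) - mean_stage(non_advanced)


# Uniform spectrum: mean(F2, F3, F4) = 3 and mean(F0, F1) = 0.5, so 2.5.
uniform = mean_stage_difference({0: 20, 1: 20, 2: 20, 3: 20, 4: 20})
# Extreme spectrum (only F0 and F4): 4 - 0 = 4.
extreme = mean_stage_difference({0: 50, 4: 50})
# Narrow spectrum (only F1 and F2): 2 - 1 = 1.
narrow = mean_stage_difference({1: 50, 2: 50})
```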

Due to the absence of patients with grade S0 and with only two S1 (Table 2), we could only validate the SteatoTest-2 vs the original population, and performed a binary AUROC for the diagnosis of S3 vs S2. Data were reported for standard predetermined thresholds of the stages of fibrosis for the FibroTest (0.27, 0.48, 0.58 and 0.74 for F1, F2, F3 and F4 respectively), grades of activity for the NashTest-2 (0.25, 0.50 and 0.75 for A1, A2 and A3 respectively) and of steatosis for the SteatoTest-2 (0.40, 0.55 and 0.62 for S1, S2 and S3 respectively). We reported the sensitivity (Se), specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio and negative likelihood ratio together with 95% CI for each cutoff value. We also investigated the performance of the tests in settings with different prevalences using Bayes' equation to estimate post-test probabilities. In this case we used the F2 threshold for fibrosis and the A2 threshold for NASH activity, which correspond to clinically significant liver disease.[1,2] The post hoc analysis was performed in intention to diagnose, the reliability and the diagnostic performances being compared by the paired binary test. For FIB-4 there was no definition of reliability in the literature. The FibroTest reliability definition followed the manufacturer's recommendation.[28] TE reliability was assessed among the participants of the core group, as it was not prospectively scheduled in the eligible participants.
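The use of Bayes' equation to carry a fixed sensitivity and specificity over to settings with different prevalences can be sketched as follows. The function and field names are hypothetical, and the numbers in the example are arbitrary inputs chosen for the arithmetic, not results from this study.

```python
from typing import NamedTuple


class PostTest(NamedTuple):
    ppv: float                 # probability of disease after a positive test
    npv: float                 # probability of no disease after a negative test
    lr_positive: float         # Se / (1 - Sp)
    lr_negative: float         # (1 - Se) / Sp


def post_test_probabilities(sensitivity: float, specificity: float,
                            prevalence: float) -> PostTest:
    """Apply Bayes' equation for a binary cutoff at a given prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return PostTest(
        ppv=true_pos / (true_pos + false_pos),
        npv=true_neg / (true_neg + false_neg),
        lr_positive=sensitivity / (1.0 - specificity),
        lr_negative=(1.0 - sensitivity) / specificity,
    )


# Arbitrary example: Se = 0.80, Sp = 0.90 at prevalences of 50% and 10%.
high_prevalence = post_test_probabilities(0.80, 0.90, 0.50)
low_prevalence = post_test_probabilities(0.80, 0.90, 0.10)
```

The likelihood ratios do not depend on prevalence, whereas the PPV falls and the NPV rises as the prevalence decreases, which is why the same cutoff can behave differently across care settings.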

To assess possible variability due to the length of biopsy, two subsets were also analysed: one with a biopsy length of 15 mm or longer, and one with a length shorter than 15 mm. The original cutoffs for F2 were used: 7.1 kPa for VCTE[38] and 1.45 for FIB-4.[39] All analyses were performed using the software R, version and NCSS 2020, and in duplicate by two independent teams of statisticians, one independent from the inventor (JM, PM) and one including the inventor (TP). Continuous variables were expressed as medians (interquartile range [IQR]) and categorical variables as absolute figures with percentages. CIs were reported at the 95% level. Details are given in File S3. An explanation of the impact of the spectrum effect and of the uncertainty of biopsy is given in File S4.