Sex Steroid Hormones and Risk of Breast Cancer

A Two-Sample Mendelian Randomization Study

Aayah Nounu; Siddhartha P. Kar; Caroline L. Relton; Rebecca C. Richmond


Breast Cancer Res. 2022;24(66) 

In This Article

Materials and Methods

Two-sample MR

To investigate the effect of sex hormone levels on BC risk, we applied a two-sample MR approach.[29] Firstly, genetic instruments to proxy for nine hormones and SHBG were obtained from GWAS summary statistics (sample 1). These were then integrated with BC risk genetic association effect estimates from published GWAS results (sample 2). We adopted a two-sample MR as opposed to a one-sample MR study because the risk of the winners' curse that happens in a one-sample MR study is unlikely to happen in a two-sample MR. Furthermore, in the presence of weak instruments, bias in a two-sample MR is towards the null, whereas in one-sample MR it is biased towards the confounded multivariable regression result.[29]

Genetic Predictors for sex Hormones

Single nucleotide polymorphisms (SNPs) were extracted from summary data for total testosterone (TT), bioavailable testosterone (BT), DHEAS, estradiol, cortisol, androstenedione, aldosterone, progesterone, 17-hydroxyprogesterone (17OHP) and SHBG. The threshold for SNP selection was P < 5 × 10–8. When no SNPs were found at this association, the threshold was relaxed to P < 5 × 10–7, as was the case for estradiol, aldosterone and androstenedione (Table 2).

SNPs predicting levels of TT, BT and SHBG were obtained from publicly available summary statistics provided by Ruth et al. using UK Biobank data,[27] which consists of phenotype and biological samples collected from around 500,000 individuals across Great Britain.[30] Testosterone and SHBG levels were measured (nmol/L) using a one-step competitive analysis and two-step sandwich immunoassay analysis, respectively, on a Beckman Coulter UniCel Dxl 800 in 230,454 and 189,473 participants, respectively.[27] BT (nmol/L; female N: 188,507) was calculated from TT and albumin, which was also measured by BCG analysis on a Beckman Coulter AU5800 (g/L). Genotyping and imputation (using the Haplotype Reference Consortium (HRC) and 1000 Genomes) and quality control (removal of SNPs with MAF ≤ 0.01) filtering resulted in 16,580,850–16,585,865 SNPs for the three measures (Table 2).[27] Measures were subjected to inverse normal transformation of rank before being taken forward in a sex-stratified GWAS study.[27]

Estradiol has also been measured in UK Biobank using a Beckman Coulter DXI 800 with a detection range between 73 and 17,621 pmol/L.[31] However, since the majority of females (162,691/273,455 (59.49%),[32] mean age 56.35 years)[33] enrolled in the UK Biobank were postmenopausal, levels of estradiol were below the limit of detection for 75% of these women.[27,34] For this reason, we used summary data for estradiol from the LIFE-Adult and LIFE-Heart cohorts. The LIFE-Adult cohort consists of a random selection of 10,000 participants from Leipzig, Germany. Conversely, 7000 participants were chosen for the LIFE-Heart study based on having suspected or confirmed coronary artery disease. In contrast to UK Biobank, estradiol measurements in the LIFE-Adult and LIFE-Heart cohort (total N: 2607) were measured using an electrochemiluminescence immunoassay (ECLIA) with a detection limit of 18.4 pmol/L[35] and liquid chromatography-tandem mass spectrometry (LC–MS/MS) with a lower detection limit of 37 pmol/L, respectively.[36] For this reason, we used these two cohorts to obtain genetic instruments to proxy for estradiol levels. The mean age of women in the LIFE-Adult and LIFE-Heart cohorts was 59.4 and 64.8 years, respectively.[37] Imputation in this GWAS was performed using 1000 Genomes Phase 3 as the reference panel. Estradiol measurements were log-transformed prior to analysis, and so SNP associations represent a log-transformed unit increase (pmol/L) in levels (Table 2).[37]

Summary data were also obtained for the hormones androstenedione, aldosterone and 17-OHP, which were measured in females in LIFE-Heart only (N = 711, N = 685 and N = 711, respectively) (Table 2) using LC–MS/MS. Progesterone was measured in females in both LIFE-Heart and LIFE-Adult using LC–MS/MS (N = 1259). These GWASs were imputed using the 1000 Genomes reference panel with further information in Table 2. Steroid hormone measurements were log-transformed prior to analysis and so SNP associations represent a log-transformed unit increase (nmol/L or pmol/L) in hormone levels (Table 2).[37]

We obtained summary statistics for DHEAS associations from Prins et al. which included DHEAS measures for 9722 participants (4308 males and 5414 females) obtained from the United Kingdom Household Longitudinal Study—a longitudinal survey across the UK (England, Wales, Scotland and Northern Ireland) consisting of 40,000 households.[38] DHEAS was measured (μmol/L) in serum samples using a competitive immunoassay on the Roche E module analyser, and measurements were log-transformed and adjusted for age and sex; thereby, SNPs represented a log-transformed unit (μmol/L) increase in DHEAS levels.[38] Imputation was performed using the UK10K project and 1000 Genomes phase 3 panels in this GWAS (Table 2).[38]

We obtained summary statistics for cortisol from the CORtisol NETwork (CORNET) consortium that meta-analysed GWAS statistics from 17 population-based cohorts of a European background including 25,314 individuals (36.27% men and 63.73% women).[39] Cortisol was measured using immunoassays in blood samples for all studies except TwinsUK which used liquid chromatography-mass spectrometry. The studies performed linear regressions on z-scores of log-transformed morning plasma cortisol and were also adjusted for sex, age and genetic ancestry. Imputation was performed using the Haplotype Reference Consortium and 1000 Genomes Phase 3 panels in this GWAS, with SNPs representing a standard deviation (SD) increase in cortisol levels (Table 2).[39]

For estradiol, DHEAS, progesterone, 17-OHP, aldosterone and androstenedione, we converted log-transformed units to a standard deviation scale so that results would represent an SD increase in hormone levels. For hormones with reported median and interquartile (IQR) ranges, these were transformed to the log scale, and the SD was calculated using the method presented by Wan et al..[40] DHEAS study characteristics reported mean, maximum and minimum values, which were transformed to the log scale and the SD was calculated using the formula: (maximum–minimum)/4. When the hormone was measured in more than one study (estradiol, DHEAS and progesterone), a combined SD was calculated using the formula from the Cochrane Handbook (Sect.[41]

To adjust for multiple testing, we applied a genome-wide significance threshold for SNP associations with metabolites (P value ≤ 5 × 10–8). When no SNPs were available at this cut-off, we relaxed the threshold to a P value ≤ 5 × 10–7 (Table 2). We also chose to include independent SNPs to avoid multi-collinearity and therefore carried out linkage disequilibrium (LD) clumping at an R 2 < 0.001 so that only the SNP most strongly associated with the hormone within a 10,000 kb window was taken forward in the analysis.

We calculated the variance explained as well as the F statistic to assess whether the identified SNPs may be weak instruments. When weak instruments are used in a two-sample MR analysis, the estimate obtained tends to be biased towards the null.[42] The F statistic helps to determine the strength of the bias, with lower F statistics indicating a greater bias towards the null.[43] Power calculations were conducted using the mRnd online calculator to identify the effect size (odds ratio, OR) in both directions that could be detected based on the variance explained by the instruments and the sample sizes available.[44] Due to the absence of effect allele frequencies for the DHEAS GWAS, the variance explained, F statistic and power calculations could not be calculated for this hormone.

Genetic Associations for Breast Cancer

Breast Cancer Risk. Genetic association summary statistics for BC risk were obtained from the Breast Cancer Association Consortium (BCAC) (consisting of 68 studies combined together) as well as the Discovery, Biology and Risk of Inherited Variants in Breast Cancer Consortium (DRIVE).[45] This study includes 122,977 BC cases and 105,974 controls and when stratified based on oestrogen receptor (ER) expression, there were 69,501 ER + BC cases and 21,468 ER–BC cases (Table 1). Genotyping was carried out using both the iCOGS array or the OncoArray with imputation (using the version 3 release of the 1000 Genomes Project data set as a reference panel) to obtain data on 10,680,257 SNPs and the results were combined using a fixed-effect meta-analysis.[45]

Breast Cancer Risk Subtypes. To further identify subtype-specific effects of hormones on BC risk, we also tested their association with six subtypes of BC. Data from 118,474 cases and 96,201 controls were previously analysed from 82 studies from BCAC to obtain summary statistics associations for five subtypes of BC. These subtypes are defined by expression of ER, progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) and cancer grade: luminal A-like (ER+, and/or PR+, HER2- and grades 1/2; 7325 cases and 20,815 controls), luminal B-like (ER+ and/or PR+, HER2+; 1682 cases and 20,815 controls), luminal B/HER2-negative-like (ER+ and/or PR+, HER2-, grade 3; 1779 cases and 20,815 controls), HER2-enriched-like (ER-, PR-, HER2+; 718 cases and 20,815 controls) and TNBC (ER-, PR-, HER2-; 2006 and 20,815 controls).[46] Furthermore, summary data from the Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA) comprising 9414 cases with BRCA1 mutation and 9494 controls with BRCA1 mutation were also used in this analysis. Since the majority of BRCA1-mutated cancers were also triple-negative, we used summary statistics that meta-analysed associations of BRCA1-mutated cancers with TNBC (18,016 cases and 100,971 controls) (Table 1).[46]

Statistical Analysis

Analyses were carried out in R version 3.3.1 using the "Two-Sample MR" package,[47] which allows data formatting, harmonization and application of MR methods in a semi-automated manner. This package automatically assigns the allele with a positive effect on the exposure as the effect allele, so that the effect allele predicts an increase in hormone levels. The SNPs used to proxy for the exposure are also extracted from the BC outcome data sets. Exposure and outcome summary statistics are then subject to allele harmonization to ensure that the effect allele in the exposure data set (hormone-increasing) is the same effect allele in the outcome data set, with effect allele frequencies used to assist in harmonizing palindromic SNPs.

In the presence of one SNP to proxy for hormone levels, Wald ratios (SNP-outcome estimate/SNP-exposure estimate) were used to calculate the change in log OR (risk analysis) per SD increase in hormone levels. When more than one SNP was present, an inverse variance weighted (IVW) method was applied, which is an average of the Wald ratios where the weight of the SNP contribution to the overall estimate is the inverse of the SNP effect on the outcome.[48,49] The default of the two-sample MR package is a random effects IVW model; however, a fixed effects model is used when there is underdispersion in causal estimates between SNPs.[47] The fixed effects model assumes that there is no horizontal pleiotropy and that each SNP provides the same estimate, whereas the random effects model allows each SNP to have different means.[50] We performed the analysis for the risk of overall BC, ER + BC, ER–BC, luminal A-like BC, luminal B-like BC, luminal B/HER2-negative-like BC, HER2-enriched-like BC, TNBC and BRCA1 mutated TNBC (Table 2).

The IVW method is prone to bias if one of the genetic instruments is invalid due to its association with another trait through an independent pathway (horizontal pleiotropy).[51] For this reason, we also applied alternative MR methods that produce unbiased estimators even in the presence of some invalid genetic instruments. When more than two SNPs were present, we calculated a weighted median, weighted mode and an MR-Egger estimate.[47,52,53,54] The weighted median approach allows a consistent estimate even if 50% of the information contributing to the overall estimate comes from invalid genetic instruments.[53] The weighted mode estimate may also be used even when the majority of the SNPs are invalid instruments so long as the SNPs that form a cluster of homogenous results are valid.[48,52] Finally, we adopted an MR-Egger analysis to evaluate evidence for the presence of horizontal pleiotropy. This method is not constrained to pass through an effect size of 0; therefore, the y-intercept gives an indication of the presence of directional pleiotropy.[51,54] We used an MR-Egger intercept with a P value below 0.05 to indicate the presence of directional pleiotropy that may be influencing the MR results. For hormones proxied by weak instruments, we conducted an MR robust adjusted profile score (MR RAPS), a method that provides robust inference when weak instruments are present.[44]

Linkage disequilibrium score regression (LDSC) was used to assess the genetic correlation between TT and BT with estradiol using the settings advised in the software package LDSC (v1.0.1).[55] We tested these hormones due to the direct conversion of testosterone to estradiol and therefore to identify whether SNP associations are shared between the two traits. Firstly, quality control was performed on the summary statistics to exclude variants with missing data, non-biallelic, strand-ambiguous alleles which could not be matched in the European ancestry 1000 Genomes reference panel, variants with imputation scores below 0.90 and rare variants with minor allele frequencies below 0.01. A score was calculated to reflect whether GWAS test statistics for variants correlates with nearby variants that are in high LD. A z statistic was generated for each variant in trait 1, and this was multiplied with the z statistic from trait 2. The product was regressed against the LD scores, and the resultant coefficient/slope was the genetic correlation statistic.