### Methods

Our analysis is based on a model developed recently to interpret data on clustering of DNA fingerprint patterns in the Netherlands.^{[6]} Equations describing the model's formulation are provided in the Appendix.

The model's structure, parameters, and assumptions have been published.^{[6]} Persons are assumed to be born uninfected. Infected persons are divided into those in whom primary disease has not yet developed (defined by convention as disease within 5 years of initial infection^{[7]}), and those in the "latent" class, who are at risk for endogenous reactivation or for reinfection, which can be followed by exogenous disease. Exogenous disease is here defined as the first disease episode within 5 years of the most recent reinfection; endogenous disease includes disease occurring >5 years after the most recent (re)infection event, and second or subsequent disease episodes occurring <5 years after the most recent (re)infection event. (These definitions differ slightly from those of Sutherland et al.^{[8]} to include the assumption that once persons have recovered from disease during the first 5 years after initial infection or reinfection, their risk of developing disease becomes the same as that of developing disease through reactivation, until they are newly reinfected.)
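The disease definitions above can be read as a small decision rule. The following is a minimal sketch, not the model's published code: the function name, its arguments, and the treatment of an episode falling at exactly 5 years (counted here as no longer recent) are assumptions made for illustration.

```python
WINDOW_YEARS = 5.0  # the 5-year window used by convention in the text


def classify_episode(years_since_last_infection, reinfected, prior_episodes):
    """Classify a disease episode as 'primary', 'exogenous', or 'endogenous'.

    years_since_last_infection: time since the most recent (re)infection event.
    reinfected: True if that most recent event was a reinfection.
    prior_episodes: disease episodes already experienced since that event.
    All three argument names are hypothetical.
    """
    # Assumption: an episode at exactly 5 years is not treated as recent.
    recent = years_since_last_infection < WINDOW_YEARS
    first_episode = prior_episodes == 0
    if recent and first_episode:
        # First episode inside the window: primary after an initial
        # infection, exogenous after a reinfection.
        return "exogenous" if reinfected else "primary"
    # Episodes outside the window, and second or subsequent episodes inside
    # it, are counted as endogenous.
    return "endogenous"
```

For example, `classify_episode(2.0, True, 1)` is classed as endogenous, reflecting the assumption that a person who has recovered from disease within the window returns to the reactivation risk until newly reinfected.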

The infection and reinfection risks are assumed to be identical, but reinfection is less likely to lead to disease than is initial infection, due to some immunity induced by the prior infection.^{[9]} We explored the implications of four assumptions for the magnitude (and trend) in the annual risk for infection, namely, that the risk for infection 1) declined over time, as estimated for the Netherlands (from approximately 2% in 1940 to approximately 5/10,000 by 1979^{[4,10]}); 2) remained unchanged over time at a very low level (0.1%); 3) remained unchanged at 1%; or 4) remained unchanged at 3%. Infection risks of 1% have been found in several populations (e.g., Malawi^{[11]}). Infection risks of 3% are uncommon today but have been reported in parts of South Africa.^{[12]} For simplicity, we assumed that persons cannot be reinfected during the period between initial infection (or reinfection) and onset of the first primary episode (or exogenous disease).
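The four infection-risk scenarios can be written as functions of calendar year. This sketch is illustrative only. The text gives just the endpoints of the declining scenario (approximately 2% in 1940 and approximately 5/10,000 by 1979), so the log-linear interpolation between them is an assumption, as are the function and scenario names. The sketch does not extend the declining scenario beyond the years stated in the text.

```python
import math


def declining_risk(year):
    """Annual risk of infection for the declining scenario, 1940 to 1979.

    Assumption: exponential (log-linear) decline between the two endpoints
    quoted in the text. The published estimates may follow another shape.
    """
    y0, r0 = 1940, 0.02          # approximately 2% in 1940
    y1, r1 = 1979, 5 / 10_000    # approximately 5 per 10,000 by 1979
    if not y0 <= year <= y1:
        raise ValueError("sketch only covers the years quoted in the text")
    rate = math.log(r1 / r0) / (y1 - y0)  # negative: the risk declines
    return r0 * math.exp(rate * (year - y0))


# Scenario names are hypothetical labels for the four assumptions.
SCENARIOS = {
    "declining": declining_risk,
    "constant_0.1%": lambda year: 0.001,
    "constant_1%": lambda year: 0.01,
    "constant_3%": lambda year: 0.03,
}

if __name__ == "__main__":
    for name, risk in SCENARIOS.items():
        print(name, risk(1960))
```

Keeping each scenario as a function of year lets the same downstream calculation run unchanged across all four assumptions.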

The risks of developing disease depend on age and sex (Figure 1A;^{[6]}); they are based on previous analyses, in which we fitted predictions of disease incidence to observed notifications in the U.K.^{[9]} The risks of developing either a first primary episode or disease following exogenous reinfection also depend on the time since infection and reinfection, respectively (Figure 1B). The probability that a disease episode is infectious (sputum smear/culture-positive) is age dependent (Figure 1C).^{[9]} The demography of the population described in the model is assumed to be that for the Netherlands. Analyses are restricted to respiratory (pulmonary) forms of tuberculosis, since these are far more likely than extrapulmonary forms to lead to transmission. Although additional factors such as immigration and HIV can influence the extent of clustering in complicated ways,^{[14]} these factors are not considered here, where the focus is upon the effect of the magnitude and trend in the annual risk for infection on clustering.

Figure 1. Summary of the main assumptions in the model relating to the risks of developing disease.

Recent studies suggest that the half-life of DNA fingerprint patterns based on IS *6110* restriction fragment length polymorphism (RFLP, which has been used for the DNA fingerprinting conducted to date in most studies) is 2-5 years.^{[5,15]} If the molecular clock speed for IS *6110* RFLP patterns of strains involved in latent infection (currently unknown) were to be similar, this relatively short half-life implies that most of the fingerprint patterns of the strains causing disease today differ from those that caused disease many years ago. Similarly, this short half-life implies that the *M. tuberculosis* fingerprint types and cluster distributions in tuberculosis cases today depend only loosely upon those that existed 50 years ago. Based on this assumption, to derive clustering estimates for a given population for recent years, we designed the model to simulate the introduction and subsequent transmission of strains with new DNA fingerprint patterns from a sufficiently distant time in the past (taken to be 1950), so that a) all cases with onset in recent years involved a strain whose DNA fingerprint pattern had first appeared since then and b) no assumptions would be required about the distributions of strains that existed before 1950. The general steps in the calculations are outlined briefly below.
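To see why a half-life of 2-5 years makes the 1950 starting point "sufficiently distant", the fraction of patterns expected to remain unchanged can be computed directly. This is a back-of-the-envelope sketch that assumes a constant rate of pattern change (simple exponential decay, which is what a half-life implies); the function name is hypothetical.

```python
def fraction_unchanged(years, half_life_years):
    """Expected fraction of fingerprint patterns still unchanged after `years`.

    Assumption: patterns change at a constant rate, so the unchanged
    fraction halves every `half_life_years`.
    """
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    # Both ends of the 2-5 year half-life range quoted in the text.
    for half_life in (2.0, 5.0):
        print(half_life, fraction_unchanged(50, half_life))
```

Even at the slower end of the range (a 5-year half-life), 50 years spans ten half-lives, so fewer than 1 in 1,000 patterns would be expected to persist unchanged.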

The numbers of persons of each age in each of the epidemiologic categories for 1950 were calculated by using the model, based on described equations.^{[9]} From 1950, each of these age-sex classes was stratified to distinguish between those who had, versus those who had not, been (re)infected since 1950. Those who had been (re)infected since 1950 were subdivided further according to the time of infection or reinfection. The transmission dynamics were tracked simultaneously for all persons with the equations described in the Appendix and elsewhere,^{[6]} by using time steps of 6 months and 1 year for calendar year and age, respectively.

In each interval, disease was assumed to develop in a proportion of infected persons, and a proportion of these disease episodes was attributed to a strain for which the DNA fingerprint pattern differed from that of the strain with which the persons were originally infected. This latter proportion depended on the time since infection (see below), and each of the new DNA fingerprint patterns was assigned a unique identity number. Each infectious patient with onset at a given time was assumed to contact a different number of persons (see Appendix and Figure 2). The frequency distributions of the number of persons contacted by each patient were used to derive the total number of persons who were newly (re)infected at this time. The corresponding equations were then applied to this number to determine the total number of persons in whom disease developed at a later time, *T,* among those who had been infected at time *t*. The DNA fingerprint patterns of the strains in these diseased persons were then determined by using the frequency distribution of the number of persons contacted by each case-patient at time *t*. These calculations are described further in the Appendix.

Figure 2. Summary of the assumptions defining contact between persons in the model.

Our model was used to calculate the age-specific proportion of disease attributable to primary and exogenous disease from 1993 to 1997 for the Netherlands and for settings in which the annual risk for infection is assumed to have remained unchanged over time at 0.1%, 1%, and 3%. Primary and exogenous disease involve disease occurring during the first 5 years after the most recent (re)infection event, although the majority of persons in whom primary or exogenous (reinfection) disease develops acquire the disease within 2-3 years (Figure 1B). The clustering by sex and age for cases with onset in different periods between 1993 and 1997 for the Netherlands, and for settings in which the annual risk for infection is assumed to have remained unchanged over time at 0.1%, 1%, and 3%, was also calculated by using the age and sex distribution of the cases with onset in that period (see equations in^{[6]}). For simplicity, we present age-specific levels of clustering for male patients only. Model predictions for male patients generally compared better against the observed data in the Netherlands than did those for female patients.^{[6]}

The predictive values of clustering for the identification of recent transmission were calculated as follows. The positive predictive value of clustering for identifying recent transmission in different age groups in different periods was calculated as the proportion of case-patients who were in a cluster in a given period who had been infected or reinfected <5 years before disease onset. The negative predictive value of clustering for identifying recent transmission in different age groups was calculated as the proportion of case-patients who were not in a cluster in a given period who had been infected or last reinfected >5 years before disease onset.
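The two predictive values defined above reduce to simple proportions over the case-patients in a given age group and period. The sketch below illustrates that arithmetic on per-case records; the `Case` record, its field names, and the handling of a case at exactly 5 years (counted as not recent) are assumptions, and the model itself works with aggregated numbers rather than individual records.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical per-case record used only for this illustration."""
    clustered: bool                    # case belongs to a cluster in the period
    years_since_last_infection: float  # time from last (re)infection to onset


def predictive_values(cases, window_years=5.0):
    """Return (PPV, NPV) of clustering for identifying recent transmission.

    PPV: proportion of clustered cases (re)infected < window_years before onset.
    NPV: proportion of non-clustered cases last (re)infected earlier than that.
    Either value is None when its denominator is empty.
    """
    def is_recent(case):
        # Assumption: exactly `window_years` counts as not recent.
        return case.years_since_last_infection < window_years

    clustered = [c for c in cases if c.clustered]
    unclustered = [c for c in cases if not c.clustered]

    ppv = sum(is_recent(c) for c in clustered) / len(clustered) if clustered else None
    npv = (sum(not is_recent(c) for c in unclustered) / len(unclustered)
           if unclustered else None)
    return ppv, npv
```

In practice the function would be applied separately to each age group and period, since both values are reported by age group in the text.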

Emerging Infectious Diseases. 2003;9(2) © 2003 Centers for Disease Control and Prevention (CDC)

Cite this: Annual Mycobacterium tuberculosis Infection Risk and Interpretation of Clustering Statistics - *Medscape* - Feb 01, 2003.
