Diabetes Spectrum Volume 12 Number 4, 1999, Page
EPINTAQ: A Software Designed to Teach Statistical Methods Applied in the Epidemiology of Diabetes

José Luis Santos, PhD, MSc, Francisco Pérez-Bravo, PhD, Elena Carrasco, BSc, Marcelo Calvillán, BSc, and Cecilia Albala, MD, MPH

ABSTRACT

EPINTAQ is a software program designed for teaching the principles and methods of epidemiology applied to studies of diabetes. It can be used in both undergraduate and postgraduate education. The program has three sections. The first section is devoted to teaching some basic concepts commonly used in the epidemiology of diabetes, such as prevalence, risk, odds, incidence rate, logistic function, and the capture-recapture method. The second section shows the basis of the epidemiological analysis of two-by-two tables (chi-square procedure, P-values, risk ratio, odds ratio, and their confidence intervals). The program shows how epidemiological parameters are estimated from data. The third section covers the analysis of genetic association studies, mainly the assessment of the relationship between type 1 diabetes and the polymorphism in the human leucocyte antigen (HLA) genes. The program performs a statistical test for Hardy-Weinberg genetic equilibrium for HLA data, computes allele frequencies, and calculates the incidence of type 1 diabetes in subjects with different degrees of genetic susceptibility. Finally, it computes the odds ratio and its confidence interval from case-parental studies, a special case-control design that uses parents instead of nonrelated controls.

Statistical methods are extensively used in experimental and nonexperimental epidemiological research based on surveys, case-control studies, cohort studies, and clinical trials.^{1} As a consequence, articles in scientific journals on the epidemiology of diabetes usually present a large amount of numerical information that can be obscure and difficult to understand for many biomedical professionals.
For this reason, an investment of time in learning numerical methods and statistical inference is usually required in every epidemiology course. Other considerations are also important in teaching epidemiology, and special care is required to clearly distinguish between systematic error (connected to validity) and random error (connected to precision).^{2}

The epidemiology group of the University of Pittsburgh has pointed out that "diabetes is a disease made for epidemiology."^{3} The diversity of factors that potentially influence the onset of both type 1 and type 2 diabetes makes this disorder very useful for explaining epidemiological concepts and methods. Recently, the development of the discipline known as molecular epidemiology^{4} has given us an excellent opportunity to join epidemiological methods with recent advances in molecular biology in areas such as mental disorders, diabetes, and other chronic diseases.^{5}

This article describes the features of EPINTAQ, a software program that is designed as a computational tool to help diabetologists learn epidemiological methods applied to diabetes. However, it is also possible to use the software as a sophisticated calculator to statistically analyze real epidemiological data.
EPINTAQ has three sections (Figure 1). The first section teaches some basic concepts used in the epidemiology of diabetes, such as prevalence, risk, odds, incidence rate, logistic function, and more recent developments for epidemiological surveillance such as the capture-recapture method. The second section shows the basis of the epidemiological analysis of two-by-two tables (chi-square procedure, P-values, calculation of risk ratio, odds ratio, and their confidence intervals). The program shows how epidemiological parameters are estimated from data. The third section covers the analysis of genetic association studies in the assessment of the relationship between type 1 diabetes and the polymorphism in the human leucocyte antigen (HLA) genes. The program performs a statistical test for Hardy-Weinberg equilibrium for HLA data, computes allele frequencies, and calculates the incidence of type 1 diabetes in subjects with different degrees of genetic susceptibility. Finally, it computes the odds ratio and its confidence interval from case-parental studies, a special case-control design that uses parents instead of nonrelated controls.

This software program was written with Microsoft Visual Basic 4.0.^{6} The installation files are approximately 3 megabytes, and the program has been tested and found to run correctly under both Windows 3.1x and Windows 95. To obtain a free copy of the compiled version of this program, contact the authors via e-mail (jsantos@uec.inta.uchile.cl).

SECTION 1: BASIC EPIDEMIOLOGICAL MEASURES AND CONCEPTS

Prevalence, Incidence Rate, and Cumulative Incidence

The pattern of distribution of type 1 diabetes has drawn the attention of epidemiologists because both the incidence of the disease and the frequency of genetic susceptibility factors differ sharply among countries and ethnic groups.^{7} A great effort has been made to generate estimates of incidence and prevalence of type 1 diabetes around the world.

The program also shows the mathematical relationship between cumulative incidence and incidence rate, using examples from type 1 diabetes. It is possible to enter the age-specific incidence rates of type 1 diabetes in the groups of 0-4 years, 5-9 years, and 10-14 years, which have been tabulated for several European countries.^{8} The output gives the 15-year cumulative incidence of type 1 diabetes for a newborn using the formula described by Morgenstern and associates.^{9} For example, EPINTAQ shows that the 15-year cumulative incidence of type 1 diabetes in Sardinia, Italy, is approximately 4.5 times greater than that in the Lazio region.^{8} In this way, students can compare cumulative risk in different countries. When analyzing data from type 1 diabetes, most of the statistical procedures carried out in EPINTAQ are in concordance with the standards defined in the World Health Organization DiaMond Molecular Epidemiology Project for type 1 diabetes.^{10}

On the other hand, the dramatic worldwide differences in the prevalence of type 2 diabetes are also very interesting to epidemiologists.
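The relationship between age-specific incidence rates and cumulative incidence described above for type 1 diabetes can be sketched in a few lines of Python. This is only an illustration: it assumes the usual exponential formula, cumulative incidence = 1 - exp(-sum of rate times interval width), which is taken here to be the formula of Morgenstern and associates. The function name and the rates are hypothetical and are not taken from EPINTAQ or from the published European tables.

```python
import math

def cumulative_incidence(rates_per_100k, widths_years):
    """Cumulative incidence (risk) from age-specific incidence rates.

    Assumed formula: CI = 1 - exp(-sum(rate_i * width_i)), with rates
    given per 100,000 person-years and widths in years.
    """
    total_hazard = sum(r / 100_000 * w
                       for r, w in zip(rates_per_100k, widths_years))
    return 1.0 - math.exp(-total_hazard)

# Hypothetical rates for the age groups 0-4, 5-9, and 10-14 years.
risk_a = cumulative_incidence([10.0, 20.0, 30.0], [5, 5, 5])
risk_b = cumulative_incidence([2.0, 5.0, 8.0], [5, 5, 5])

print(f"15-year cumulative incidence, population A: {risk_a:.6f}")
print(f"15-year cumulative incidence, population B: {risk_b:.6f}")
print(f"Ratio A/B: {risk_a / risk_b:.2f}")
```

Because type 1 diabetes is rare, the cumulative incidence is very close to the simple sum of rate times width, so the ratio between two populations is nearly the ratio of their summed rates.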
These differences range from virtually 0% in Papua New Guinea to >50% in the Pimas of Arizona.^{11} Type 2 diabetes is also a disease with a multifactorial etiology that includes genetic and environmental factors.^{12} Interestingly, type 2 diabetes is often clustered with risk factors for cardiovascular disease, such as obesity, physical inactivity, hypertension, and dyslipidemia.^{13} This is an important fact, since cardiovascular disease remains the most costly disease in terms of mortality and morbidity in industrialized nations.

EPINTAQ writes out the mathematical formula involved in the calculation of rates as students change the incidence or prevalence figures with a scroll bar. This system is used repeatedly in sections 1 and 2 of EPINTAQ. It saves students from making calculations by hand, thus allowing them to concentrate on how the statistical expressions are built.

Logistic Function

Capture-Recapture Method

SECTION 2: ANALYSIS OF TWO-BY-TWO TABLES

Risk Ratio and Odds Ratio in Cohort Studies
The risk ratio (RR) and the odds ratio (OR) are common measures in epidemiological research, providing a quantitative assessment of the magnitude of the association between an exposure variable and the outcome, which usually is the disease status. RR cannot be computed in retrospective studies, since the numbers of diseased and disease-free subjects are fixed by the investigator in this design. It has been demonstrated that the exposure OR in cases versus controls equals the disease OR for exposed versus unexposed in prospective research (Figure 2). When the disease is rare, this OR approximates the ratio of the proportions of exposed and unexposed subjects who develop the disease in a specific period of time. Moreover, in incidence-density sampling, the OR approximates the ratio of instantaneous disease incidence rates, even without the assumption of rarity.^{18}
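The measures just described can be made concrete with a short Python sketch for a single two-by-two table. It is an illustration under stated assumptions, not EPINTAQ's own code: the function name, the cell layout (a, b, c, d), and the counts are hypothetical. The chi-square statistic is the uncorrected Pearson statistic with one degree of freedom, and the intervals use the standard log-scale (Woolf-type for the OR, Katz-type for the RR) standard errors.

```python
import math

def two_by_two_summary(a, b, c, d, z=1.96):
    """Summary of a 2x2 table from a prospective study.

    a: exposed, diseased        b: exposed, disease-free
    c: unexposed, diseased      d: unexposed, disease-free
    """
    n1, n0 = a + b, c + d                  # exposed and unexposed totals
    n = n1 + n0
    rr = (a / n1) / (c / n0)               # ratio of cumulative incidences
    odds_ratio = (a * d) / (b * c)         # cross-product ratio
    # Pearson chi-square, 1 degree of freedom, no continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / (n1 * n0 * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square(1)
    # Standard errors on the log scale.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return {
        "rr": rr,
        "or": odds_ratio,
        "chi2": chi2,
        "p_value": p_value,
        "or_ci": (odds_ratio * math.exp(-z * se_log_or),
                  odds_ratio * math.exp(z * se_log_or)),
        "rr_ci": (rr * math.exp(-z * se_log_rr),
                  rr * math.exp(z * se_log_rr)),
    }

# Hypothetical cohort in which the disease is common: OR and RR differ.
common = two_by_two_summary(20, 80, 10, 90)
# Hypothetical cohort in which the disease is rare: OR is close to RR.
rare = two_by_two_summary(2, 998, 1, 999)

print(common["rr"], common["or"])   # 2.0 and 2.25
print(rare["rr"], rare["or"])
```

Changing one cell and calling the function again mimics, in a very reduced form, what students do when they move the scrolls: all statistics are recomputed from the four counts.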
EPINTAQ uses a simple two-by-two contingency table to show the calculation of OR and RR in prospective studies. Both exposure and disease status are encoded as dichotomous variables. The program shows four odds (by row and by column) and the cumulative incidence rate for exposed and unexposed groups. It also presents a simple chi-square statistic with one degree of freedom together with its P-value. The estimates of OR and RR are shown with their 95% confidence intervals, which are computed with the methods of Woolf and Katz, respectively.^{19}

Using the numerous published examples of diabetes epidemiology in the literature, students can explore interactively the process of estimation and inference of RR and OR in different study designs by moving four scrolls that correspond to the four cells in the contingency table of an epidemiological research study. Likewise, professors can use EPINTAQ by showing this screen to the whole class. When a cell is changed in the table, all statistics are recalculated instantaneously. At the same time, the complete new formula for each statistic is also written out with the numbers chosen in the two-by-two table.

Students can experiment by themselves (or guided by the teacher) with a) the exposure OR in cases versus controls equaling the disease OR for exposed versus unexposed; b) OR approximating the RR in prospective studies when the disease is rare; c) how the confidence intervals are calculated; d) the effect of the sample size on the P-value of the chi-square statistic and on the confidence intervals; and e) how the chi-square statistic is built.

Case-Control Studies of Host Susceptibility in Type 1 Diabetes
Most of the association studies between genetic susceptibility and type 1 diabetes have been carried out through the use of case-control studies. In this type of observational research, OR is usually estimated from the data. Depending on the design of the research, the OR would estimate different parameters.^{25} In general terms, the OR computed in a case-control study in genetic epidemiology can be interpreted as the relative risk of subjects with the genetic susceptibility compared to that of people without genetic susceptibility. EPINTAQ can be used to compute the OR that shows the comparative risk of type 1 diabetes for children with four susceptible heterodimers (genetically susceptible) compared to children with zero susceptible heterodimers (not genetically susceptible) using the table from Pérez-Bravo and associates (Figure 3).^{26}

SECTION 3: GENETIC EPIDEMIOLOGY IN THE STUDY OF DIABETES

Assessment of the Hardy-Weinberg Law

It has been suggested that departure from the Hardy-Weinberg law should be tested in the population before designing a population-based case-control study.^{27} This issue is important in retrospective studies, such as HLA-disease association studies, since uncontrolled population stratification (a mixture of populations with different genetic backgrounds) could lead to deviations from the Hardy-Weinberg law. Disturbingly, this genetic stratification could produce spurious marker-disease associations.^{28}

EPINTAQ uses a statistical test for HLA data that assesses the likelihood of the hypothesis of Hardy-Weinberg equilibrium using a sample of unrelated subjects.^{29} This part of the software was designed for analytical purposes, although it can also be used to introduce courses in genetic epidemiology, since Hardy-Weinberg equilibrium is a key concept in population genetics.
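For readers new to the concept, the idea of testing Hardy-Weinberg proportions can be illustrated with the simplest possible case. The Python sketch below is not the HLA-specific test cited above; it is the textbook chi-square goodness-of-fit test for one locus with two codominant alleles, and the function name and genotype counts are hypothetical.

```python
import math

def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test of Hardy-Weinberg proportions
    for a biallelic locus with genotype counts AA, AB, and BB."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of allele A by gene counting
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Three genotype classes minus one estimated parameter minus one
    # leaves 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value

# Hypothetical sample in perfect Hardy-Weinberg proportions.
freq_eq, chi2_eq, pval_eq = hardy_weinberg_chi2(25, 50, 25)
# Hypothetical sample with a marked deficit of heterozygotes,
# as might arise from a mixture of two populations.
freq_dev, chi2_dev, pval_dev = hardy_weinberg_chi2(40, 20, 40)

print(freq_eq, chi2_eq, pval_eq)
print(freq_dev, chi2_dev, pval_dev)
```

A heterozygote deficit of the kind shown in the second sample is the typical signature of population stratification, which is why the test is recommended before interpreting a marker-disease association.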
Several forces may affect the genetic equilibrium in a given population, such as nonrandom mating, migration, mutation, genetic drift, or natural selection. Extreme care must be taken when interpreting the result of this statistical test, since several of the assumptions of genetic equilibrium may be violated simultaneously in such a way that their effects cancel each other, and the population would exhibit genotype frequencies concordant with Hardy-Weinberg proportions.^{30} On the other hand, some authors advocate the use of exact tests instead of goodness-of-fit tests based on the chi-square statistic.^{31}

Allele Frequency

Other measures, such as the Bernstein estimator, consider the set of undetectable alleles as the "recessive" (as in the ABO blood group). This estimator is very common when HLA typing is done by serology, so that it is not possible to distinguish between someone who is homozygous for the allele of interest and someone who is heterozygous with one unknown allele. If p_{i} is the probability that a random person carries the allele i, the frequency of that allele is estimated as 1 - (1 - p_{i})^{1/2}. In order to calculate a confidence interval for the Bernstein estimator, the reparametrization described by Farewell and Dahlberg^{33} is used. EPINTAQ provides the estimation of gene frequency and confidence intervals by both procedures, using data entry from a sample of unrelated individuals.

Incidence of Type 1 Diabetes by Level of Genetic Susceptibility

The algorithm that the program uses operates in the following way: the overall incidence rate (I_{t}) can be expressed as a weighted average of the incidence rates of type 1 diabetes among the subjects that belong to the different levels of genetic susceptibility (defined by the number of susceptible heterodimers). Then, we have equation (a):

I_{t} = f_{0} I_{0} + f_{1} I_{1} + f_{2} I_{2} + f_{4} I_{4},

where f_{0}, f_{1}, f_{2}, and f_{4} are the respective frequencies of people with 0, 1, 2, or 4 susceptible heterodimers in the population, estimated from the sample of controls. At the same time, the knowledge of the incidence rate of type 1 diabetes for the not-genetically-susceptible group (I_{0}), and the OR for each level of susceptibility (OR_{x}), would allow us to compute an approximation of the incidence rate of type 1 diabetes for each level of genetic susceptibility (I_{x}) using equation (b):

I_{x} = OR_{x} I_{0} / (1 + (OR_{x} - 1) I_{0}).

OR_{x} (odds ratio) can be calculated using the programs in Section 2. By substituting equation (b) into equation (a), we obtain a fourth-degree equation where I_{0} is the unknown quantity. EPINTAQ uses the Newton-Raphson iterative method in order to get a value for I_{0}.^{35} Finally, I_{0} is substituted again in equation (b) in order to get an estimate of the incidence rate in the different groups of subjects who carry 0, 1, 2, or 4 susceptible heterodimers.

Analysis of Case-Parental Control Studies

In most adult chronic diseases, it may be impossible to find the parents of the affected proband. However, these practical problems are not that serious when collecting families of incident cases of type 1 diabetes, since parents are usually young and often come to the hospital when the child is diagnosed. The difficulty of collecting families has limited the application of the case-parental design in the study of type 2 diabetes.
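Returning briefly to the calculation of incidence by level of genetic susceptibility, the combination of equations (a) and (b) and its solution by Newton-Raphson can be sketched in Python as follows. This is a minimal sketch and not EPINTAQ's implementation: the function names, the starting value, the stopping rule, and all frequencies, odds ratios, and rates are hypothetical. The non-susceptible level is handled by giving it an OR of 1, which makes equation (b) return I_{0} itself.

```python
def incidence_by_level(i0, odds_ratio):
    """Equation (b): incidence for a susceptibility level with the given OR."""
    return odds_ratio * i0 / (1 + (odds_ratio - 1) * i0)

def solve_baseline_incidence(i_total, freqs, odds_ratios,
                             tol=1e-14, max_iter=50):
    """Newton-Raphson solution for I0 in
    sum_x f_x * OR_x * I0 / (1 + (OR_x - 1) * I0) = I_t."""
    i0 = i_total                       # assumed starting value
    for _ in range(max_iter):
        value = -i_total               # g(I0) = weighted average - I_t
        slope = 0.0                    # g'(I0)
        for f, o in zip(freqs, odds_ratios):
            denom = 1 + (o - 1) * i0
            value += f * o * i0 / denom
            slope += f * o / denom ** 2
        step = value / slope
        i0 -= step
        if abs(step) < tol:
            break
    return i0

# Hypothetical inputs: frequencies of 0, 1, 2, and 4 susceptible
# heterodimers among controls, their odds ratios, and an overall
# incidence rate of 20 per 100,000 person-years.
freqs = [0.50, 0.30, 0.15, 0.05]
odds_ratios = [1.0, 2.0, 5.0, 20.0]
i_total = 20 / 100_000

i0 = solve_baseline_incidence(i_total, freqs, odds_ratios)
levels = [incidence_by_level(i0, o) for o in odds_ratios]

for label, rate in zip(("0", "1", "2", "4"), levels):
    print(f"{label} heterodimers: {rate * 100_000:.2f} per 100,000")
```

The function being solved is increasing and concave in I_{0} when all odds ratios are at least 1, so the iteration converges quickly from the overall incidence rate as a starting value.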
Recently, researchers have described alternative methods that compare the marker genotypes in affected and unaffected offspring instead of using marker data from affected offspring and parents.^{38,39} All these methods try to reproduce the advantages of the case-parental design in situations where parents are not genotyped, and their use will probably increase in future epidemiological literature. In contrast, some authors^{40} have argued that traditional epidemiological designs are superior to family-based association studies, since cohort or population-based incident case-control studies can define the magnitude of risk associated with a genotype and gene-environment interaction, a crucial step to disease prevention and health promotion.

User-Based Evaluation of EPINTAQ

The results of the classes are, in general, very satisfactory from the perspective of both students and professors. In spite of the simplicity of EPINTAQ, two teachers are usually required during a class, since there are students who need help with scrolls, buttons, menus, and other specific tools of the visual interface of the program. We have assessed the performance of the software with a small questionnaire administered to a group of 24 postgraduate medical students who received a standard class of epidemiology, and later a class with EPINTAQ. All students concurred that the sequence of classical lectures with discussion of biomedical papers, followed by classes with EPINTAQ analyzing research data, is a very useful combination for learning principles and methods in the epidemiology of diabetes.

Acknowledgments

References

^{2}Charlton BG: The scope and nature of epidemiology. J Clin Epidemiol 49:623-26, 1996.
^{3}Orchard TJ, Dorman JS, LaPorte RE, Ferrell RE, Drash AL: Host and environmental interactions in diabetes mellitus. J Chron Dis 39:979-99, 1986.
^{4}McMichael AJ: Molecular epidemiology: new pathway or new travelling companion? Am J Epidemiol 140:1-11, 1994.
^{5}Shilberg O, Dorman JS, Ferrell RE, Trucco M, Shahar A, Kuller L: The next stage: molecular epidemiology. J Clin Epidemiol 50:633-38, 1997.
^{6}Microsoft Visual Basic Programming System for Windows: Programmer's Guide. Version 4.0. Microsoft Corporation, 1995.
^{7}Dorman JS, LaPorte RE, Stone RA, Trucco M: Worldwide differences in the incidence of type I diabetes are associated with amino acid variation at position 57 of the DQb chain. Proc Natl Acad Sci USA 87:7370-74, 1990.
^{8}Green A, Gale EAM, Patterson CC: Incidence of childhood-onset insulin-dependent diabetes mellitus: the Eurodiab Ace Study. Lancet 339:905-909, 1992.
^{9}Morgenstern H, Kleinbaum DG, Kupper LL: Measures of disease incidence used in epidemiologic research. Int J Epidemiol 9:97-104, 1980.
^{10}WHO DIAMOND Project Group on Epidemics: Childhood diabetes, epidemics, and epidemiology: an approach for controlling diabetes. Am J Epidemiol 135:803-16, 1992.
^{11}Gruber W, King H: The WHO national diabetes programme initiative. Diab Res Clin Pract 34 (Suppl):S1-6, 1996.
^{12}Gerich JE: Pathogenesis and treatment of type 2 (non-insulin-dependent) diabetes mellitus (NIDDM). Horm Metab Res 28:404-12, 1996.
^{13}Neel JV, Julius S, Weder A, Yamada M, Kardia SLR, Haviland MB: Syndrome X: is it for real? Genet Epidemiol 15:19-32, 1998.
^{14}Kannel WB, McGee D, Gordon T: A general cardiovascular risk profile: the Framingham study. Am J Cardiol 38:46-51, 1976.
^{15}LaPorte RE, McCarty DJ, Tull ES, Tajima N: Counting birds, bees, and NCD's. Lancet 339:18-19, 1992.
^{16}Hook EB, Regal RR: Capture-recapture methods in epidemiology: methods and limitations. Epidemiol Rev 17:243-64, 1995.
^{17}Serrano-Rios M, Moy CS, Serrano RM, Asensio AM, Labat MET, Romero GZ, Herrera J: Incidence of type 1 (insulin-dependent) diabetes mellitus in subjects 0-14 years of age in the Comunidad of Madrid, Spain. Diabetologia 33:422-24, 1990.
^{18}Breslow EN: Statistics in epidemiology: the case-control study. J Am Stat Assoc 91:14-28, 1996.
^{19}Kahn HA, Sempos CT: Statistical Methods in Epidemiology. New York, Oxford University Press, 1989, p. 45-71.
^{20}Akerblom HK, Knip M, Hyoty H, Reijonen H, Virtanen S, Savilahti E, Ilonen J: Interaction of genetic and environmental factors in the pathogenesis of insulin-dependent diabetes mellitus. Clin Chim Acta 257:143-56, 1997.
^{21}Thorsby E, Ronningen KS: Role of HLA genes in predisposition to develop insulin-dependent diabetes mellitus. Ann Med 24:523-31, 1992.
^{22}Trucco M: To be or not to be ASP 57, that is the question. Diabetes Care 15:705-15, 1992.
^{23}Khalil I, D'Auriol L, Gobet M, Morin L, Lepage V, Deschamps I, Park MS, Degos L, Galibert F, Hors J: A combination of HLA-DQ Asp57 negative and HLA-DQ Arg52 confers susceptibility to insulin-dependent diabetes mellitus. J Clin Invest 85:1315-19, 1990.
^{24}Khalil I, Deschamps I, Lepage V, Al-Daccak R, Degos L, Hors J: Dose effect of cis- and trans-encoded HLA-DQ heterodimers in IDDM susceptibility. Diabetes 41:378-84, 1991.
^{25}Green A: The epidemiologic approach to studies of association between HLA and disease. Tissue Antigens 19:245-58, 1982.
^{26}Pérez-Bravo F, Carrasco E, Gutiérrez-López MD, Martínez MT, López G, García de los Ríos M: Genetic predisposition and environmental factors leading to the development of insulin-dependent diabetes mellitus in Chilean children. J Mol Med 74:105-109, 1996.
^{27}Tiret L, Cambien F: Departure from Hardy-Weinberg equilibrium should be systematically tested in studies of association between genetic markers and disease. Circulation 92:3364, 1995.
^{28}Lander ES, Schork NJ: Genetic dissection of complex traits. Science 265:2037-48, 1994.
^{29}Nam J: Simple test for the Hardy-Weinberg law for HLA data with no observed double blanks. Biometrics 51:354-57, 1995.
^{30}Hernández JL, Weir BS: A disequilibrium coefficient approach to Hardy-Weinberg testing. Biometrics 45:53-70, 1989.
^{31}Guo SW, Thompson EA: Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48:361-72, 1992.
^{32}Olson JM: Robust estimation of gene frequency and association parameters. Biometrics 50:665-74, 1994.
^{33}Farewell VT, Dahlberg S: Some statistical methodology for the analysis of HLA data. Biometrics 40:547-60, 1984.
^{34}Dorman JS, McCarthy B, McCanlies E, Kramer MK, Vergona RJ, Stone R, Steenkistie AR, Kocova M, Trucco M, and the WHO DiaMond Molecular Epidemiology Sub-Project Group: Molecular IDDM epidemiology: international studies. Diab Res Clin Pract 34 (Suppl):S107-16, 1996.
^{35}Santos Martin JL, Pérez-Bravo F, Carrasco E, Icaza G, Calvillán M, Albala C: Different statistical models used in the calculation of the prevalence of insulin-dependent diabetes mellitus according to the polymorphism of the HLA-DQ region. Immunol Cell Biol 75:351-55, 1997.
^{36}Schaid DJ, Sommer SS: Comparison of statistics for candidate-gene association studies using cases and parents. Am J Hum Genet 55:402-409, 1994.
^{37}Flanders WD, Khoury MJ: Analysis of case-parental control studies: method for the study of associations between disease and genetic markers. Am J Epidemiol 144:696-703, 1996.
^{38}Spielman RS, Ewens WJ: A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet 62:450-58, 1998.
^{39}Boehnke M, Langefeld CD: Genetic association mapping based on discordant sib pairs: the discordant-alleles test. Am J Hum Genet 62:950-61, 1998.
^{40}Khoury M, Yang Q: The future of genetic studies of complex human diseases: an epidemiologic perspective. Epidemiology 9:350-54, 1998.
^{41}Santos JL, Pérez-Bravo F, Albala C, Carrasco E: DINTAQ: A software program for epidemiological analysis of the association between HLA polymorphisms and insulin-dependent diabetes mellitus in case-control studies. J Mol Med 75:B2, 1997.

José Luis Santos, PhD, MSc, is an assistant professor of molecular epidemiology, Francisco Pérez-Bravo, PhD, is an assistant professor of molecular biology, and Cecilia Albala, MD, MPH, is an associate professor of epidemiology at the Institute of Nutrition and Food Technology, Department of Nutritional Epidemiology and Molecular Biology Group of the University of Chile in Santiago. Elena Carrasco, BSc, is an associate professor of nutrition, and Marcelo Calvillán, BSc, is a research assistant in diabetes in the Department of Medicine, Diabetes Section, of the San Juan de Dios Hospital Faculty of Medicine at the University of Chile in Santiago.

Address correspondence and requests for reprints to José Luis Santos, PhD, MSc, Department of Epidemiology, INTA-University of Chile, Macul 5540. Comuna de Macul, Santiago, Chile.

Copyright © 1999 American Diabetes Association
Last updated: 12/99