TY - JOUR

T1 - Regression-based association analysis with clustered haplotypes through use of genotypes

AU - Tzeng, Jung Ying

AU - Wang, Chih Hao

AU - Kao, Jau Tsuen

AU - Hsiao, Chuhsing Kate

N1 - Funding Information:
The authors thank the reviewers for their constructive and detailed comments, which improved the manuscript. J.Y.T. was supported by National Institutes of Health grant GM45344 and National Science Foundation grant DMS-0504726.
Appendix A. Let $S_\alpha(Y, G, Z, \alpha, \zeta)$ denote the score function of the observed data $(Y, G, Z)$ for $\alpha$. As set forth by Louis (1982), $S_\alpha(Y, G, Z, \alpha, \zeta)$ is the expectation of the complete-data score function given the observed data; that is, $S_\alpha(Y, G, Z, \alpha, \zeta) = E[S_\alpha(Y, X_F, Z, \alpha, \zeta) \mid G]$. Hence, $S_\alpha(Y, G, Z, \alpha, \zeta) = \sum_{i=1}^{n} E\left[\frac{\partial}{\partial\alpha} \log L(\alpha, \zeta; y_i, x_{F,i}, z_i) \,\middle|\, g_i\right] = \sum_{i=1}^{n} E\left[\frac{\partial}{\partial\alpha}\left[\log f(y_i \mid x_{F,i}, z_i; \alpha, \zeta) + \log P(x_{F,i}; \Pi)\right] \,\middle|\, g_i\right] = \sum_{i=1}^{n} E\left[\frac{y_i - b'(\eta)}{a(\phi)} B(\Pi)_{-0}' X_i \,\middle|\, g_i\right] = \sum_{i=1}^{n} \frac{y_i - E(y_i)}{a(\phi)} B(\Pi)_{-0}' E(X_i \mid g_i)$.
Appendix B. Let $\Gamma = (\mu, \gamma)$. The expected Fisher information function of the observed data $(Y, G, Z)$, $I$, is $I = \begin{pmatrix} I_{\alpha\alpha} & I_{\alpha\Gamma} & I_{\alpha\phi} & I_{\alpha\Pi} \\ I_{\alpha\Gamma}' & I_{\Gamma\Gamma} & I_{\Gamma\phi} & I_{\Gamma\Pi} \\ I_{\alpha\phi}' & I_{\Gamma\phi}' & I_{\phi\phi} & I_{\phi\Pi} \\ I_{\alpha\Pi}' & I_{\Gamma\Pi}' & I_{\phi\Pi}' & I_{\Pi\Pi} \end{pmatrix}$, where $I_{\alpha\phi} = 0_{L^* \times 1}$, $I_{\alpha\Pi} = 0_{L^* \times (L+1)}$, $I_{\Gamma\phi} = 0_{(1+P) \times 1}$, $I_{\Gamma\Pi} = 0_{(1+P) \times (L+1)}$, and $I_{\phi\Pi} = 0_{1 \times (L+1)}$. The hybrid estimate of $I$ is obtained by replacing the nonzero entries of $I$ with the observed Fisher information (denoted by $i$): $I = \begin{pmatrix} i_{\alpha\alpha} & i_{\alpha\Gamma} & 0 & 0 \\ i_{\alpha\Gamma}' & i_{\Gamma\Gamma} & 0 & 0 \\ 0 & 0 & i_{\phi\phi} & 0 \\ 0 & 0 & 0 & i_{\Pi\Pi} \end{pmatrix}$. Hence, equation (6) can be simplified as $V_\alpha = D_{\alpha\alpha} - i_{\alpha\Gamma}\, i_{\Gamma\Gamma}^{-1} D_{\alpha\Gamma}' - D_{\alpha\Gamma}\, i_{\Gamma\Gamma}^{-1} i_{\alpha\Gamma}' + i_{\alpha\Gamma}\, i_{\Gamma\Gamma}^{-1} D_{\Gamma\Gamma}\, i_{\Gamma\Gamma}^{-1} i_{\alpha\Gamma}'$.
Recall that $D = \sum_{i=1}^{n} s_i(y_i, g_i, z_i, \Theta)\, s_i'(y_i, g_i, z_i, \Theta)$ and that Louis (1982) proposed $s_i(y_i, g_i, z_i, \Theta) = E\{s_i(y_i, x_{F,i}, z_i, \Theta) \mid g_i\}$ and $i = \sum_{i=1}^{n} \left\{ E\left\{ -\frac{\partial s_i(y_i, x_{F,i}, z_i, \Theta)}{\partial\Theta} \,\middle|\, g_i \right\} - E\{ s_i(y_i, x_{F,i}, z_i, \Theta)\, s_i'(y_i, x_{F,i}, z_i, \Theta) \mid g_i \} + E\{ s_i(y_i, x_{F,i}, z_i, \Theta) \mid g_i \}\, E\{ s_i'(y_i, x_{F,i}, z_i, \Theta) \mid g_i \} \right\}$. We have $D_{\alpha\alpha} = \sum_{i=1}^{n} \left(\frac{y_i - b'(\eta_i)}{a(\phi)}\right)^2 B(\Pi)_{-0}' E(x_{F,i} \mid g_i)\, E(x_{F,i}' \mid g_i)\, B(\Pi)_{-0}$, $D_{\alpha\Gamma} = \sum_{i=1}^{n} \left(\frac{y_i - b'(\eta_i)}{a(\phi)}\right)^2 B(\Pi)_{-0}' E(x_{F,i} \mid g_i) \begin{pmatrix} 1 & z_i' \end{pmatrix}$, $D_{\Gamma\Gamma} = \sum_{i=1}^{n} \left(\frac{y_i - b'(\eta_i)}{a(\phi)}\right)^2 \begin{pmatrix} 1 \\ z_i \end{pmatrix} \begin{pmatrix} 1 & z_i' \end{pmatrix}$, $i_{\alpha\Gamma} = \sum_{i=1}^{n} \frac{b''(\eta)}{a(\phi)} B(\Pi)_{-0}' E(x_{F,i} \mid g_i) \begin{pmatrix} 1 & z_i' \end{pmatrix}$, and $i_{\Gamma\Gamma} = \sum_{i=1}^{n} \frac{b''(\eta)}{a(\phi)} \begin{pmatrix} 1 \\ z_i \end{pmatrix} \begin{pmatrix} 1 & z_i' \end{pmatrix}$.

PY - 2006/2

Y1 - 2006/2

N2 - Haplotype-based association analysis has been recognized as a tool with high resolution and potentially great power for identifying modest etiological effects of genes. However, in practice, its efficacy has not been as successfully reproduced as expected in theory. One primary cause is that such analysis tends to require a large number of parameters to capture the abundant haplotype varieties, and many of those are expended on rare haplotypes for which studies would have insufficient power to detect association even if it existed. To concentrate statistical power on more-relevant inferences, in this study, we developed a regression-based approach using clustered haplotypes to assess haplotype-phenotype association. Specifically, we generalized the probabilistic clustering methods of Tzeng to the generalized linear model (GLM) framework established by Schaid et al. The proposed method uses unphased genotypes and incorporates both phase uncertainty and clustering uncertainty. Its GLM framework allows adjustment of covariates and can model qualitative and quantitative traits. It can also evaluate the overall haplotype association or the individual haplotype effects. We applied the proposed approach to study the association between hypertriglyceridemia and the apolipoprotein A5 gene. Through simulation studies, we assessed the performance of the proposed approach and demonstrated its validity and power in testing for haplotype-trait association.

UR - http://www.scopus.com/inward/record.url?scp=31544481920&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=31544481920&partnerID=8YFLogxK

U2 - 10.1086/500025

DO - 10.1086/500025

M3 - Article

C2 - 16365833

AN - SCOPUS:31544481920

VL - 78

SP - 231

EP - 242

JO - American Journal of Human Genetics

JF - American Journal of Human Genetics

SN - 0002-9297

IS - 2

ER -