GENETIC ANALYSIS OF BINARY and CATEGORICAL TRAITS PART ONE


TABLE 1. Twin Pair Concordances for Major Depression

(Virginia Twin Study data, adapted from Neale and Cardon, 1992)

                        MZ FEMALE PAIRS            DZ FEMALE PAIRS
                        Twin B                     Twin B
                        Unaffected   Affected      Unaffected   Affected
Twin A - Unaffected     329          83            201          94
Twin A - Affected       95           83            82           63

Prevalence = proportion of affected (here, depressed) twins in the general population.

Prevalence = (2 x concordant affected pairs + discordant pairs) / (2 x total pairs)

e.g. for MZ pairs = (166 + 95 + 83) / 1180 = 29.2%

e.g. for DZ pairs = (126 + 82 + 94) / 880 = 34.3%

Probandwise concordance rate = probability that the cotwin of a depressed twin will also have a history of depression.

Probandwise concordance rate = (2 x concordant affected pairs) / (2 x concordant affected pairs + discordant pairs)

e.g. for MZ pairs = 166 / (166 + 95 + 83) = 48.3%

e.g. for DZ pairs = 126 / (126 + 82 + 94) = 41.7%

Recurrence risk-ratio = probandwise concordance rate / prevalence

Why do we have (2 x number of concordant affected pairs) in the numerator and denominator of the expression for the probandwise concordance rate? Consider a simple example where there are 4 affected individuals, who came from 3 twin pairs, i.e.,

1 - 0     1 - 0     1 - 1

There are 4 potential probands, so if we randomly select an affected individual, the probability that the cotwin of that individual is also affected will be 2/4 = 50%.
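The arithmetic for Table 1 can be checked with a short sketch. The function name and argument order below are illustrative assumptions, not part of the original course material; the cell counts are those of Table 1.

```python
def twin_concordance_summary(both_unaffected, a_only, b_only, both_affected):
    """Prevalence, probandwise concordance and recurrence risk ratio for a 2x2 twin table.

    a_only / b_only are the discordant pairs in which only twin A / only twin B is affected.
    (Hypothetical helper written for this illustration.)
    """
    total_pairs = both_unaffected + a_only + b_only + both_affected
    discordant = a_only + b_only
    # Each concordant affected pair contributes two affected twins (two potential probands).
    affected_twins = 2 * both_affected + discordant
    prevalence = affected_twins / (2 * total_pairs)
    probandwise = 2 * both_affected / affected_twins
    return prevalence, probandwise, probandwise / prevalence


mz = twin_concordance_summary(329, 95, 83, 83)   # MZ female pairs, Table 1
dz = twin_concordance_summary(201, 82, 94, 63)   # DZ female pairs, Table 1

for label, (prev, conc, ratio) in (("MZ", mz), ("DZ", dz)):
    print(f"{label}: prevalence {prev:.1%}, probandwise concordance {conc:.1%}, "
          f"recurrence risk ratio {ratio:.2f}")
```

The recurrence risk ratio is simply the third returned value, i.e. the probandwise concordance divided by the prevalence.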

TABLE 1a. Twin Pair Concordances for Alcohol Dependence (DSM-IIIR)

(Virginia Twin Study data, from Kendler et al., 1992)

                            MZ Female Pairs    DZ Female Pairs
N pairs                     590                440
Population prevalence       8.1%               10.2%
Probandwise concordance     31.6%              24.4%

Number of concordant alcoholic pairs = N pairs x prevalence x probandwise concordance

MZ: 15 pairs DZ: 11 pairs

Number of discordant pairs = 2 x N pairs x prevalence x (1 - probandwise concordance)

MZ: 65 pairs DZ: 68 pairs

Number of concordant unaffected pairs = N pairs - concordant alcoholic pairs - discordant pairs

MZ: 510 pairs DZ: 361 pairs
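As a rough check of these formulas, the following sketch rebuilds the 2x2 cell counts from the published summary statistics in Table 1a. The function name is a hypothetical one chosen for this example.

```python
def cells_from_summary(n_pairs, prevalence, probandwise):
    """Reconstruct expected pair counts from N pairs, prevalence and probandwise concordance.

    Returns (concordant affected, discordant, concordant unaffected).
    (Hypothetical helper written for this illustration.)
    """
    concordant_affected = n_pairs * prevalence * probandwise
    discordant = 2 * n_pairs * prevalence * (1 - probandwise)
    concordant_unaffected = n_pairs - concordant_affected - discordant
    return concordant_affected, discordant, concordant_unaffected


mz_cells = cells_from_summary(590, 0.081, 0.316)   # MZ female pairs, Table 1a
dz_cells = cells_from_summary(440, 0.102, 0.244)   # DZ female pairs, Table 1a

for label, cells in (("MZ", mz_cells), ("DZ", dz_cells)):
    print(label, [round(c, 1) for c in cells])
```

Because the published percentages are rounded, the reconstructed counts are not whole numbers; the discordant count is split equally between the two discordant cells when building a data file.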


Some investigators also report a “PAIRWISE” CONCORDANCE RATE: the proportion of pairs with at least one twin affected who are concordant.

The “PAIRWISE” concordance rate is redundant, because it can be computed directly from the probandwise rate:

PAIRWISE CONCORDANCE RATE = CR / (2 - CR)

where CR is the probandwise concordance rate.
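A quick numerical check of this identity, using the MZ depression counts from Table 1 (83 concordant affected pairs, 95 + 83 discordant pairs). The helper name is an assumption made for the example.

```python
def pairwise_from_probandwise(cr):
    """Pairwise concordance implied by a probandwise concordance rate CR."""
    return cr / (2 - cr)


concordant, discordant = 83, 95 + 83          # MZ pairs, Table 1

probandwise = 2 * concordant / (2 * concordant + discordant)
pairwise_direct = concordant / (concordant + discordant)   # definition of the pairwise rate
pairwise_derived = pairwise_from_probandwise(probandwise)  # via CR / (2 - CR)

print(f"probandwise {probandwise:.3f}, pairwise {pairwise_direct:.3f} (direct), "
      f"{pairwise_derived:.3f} (derived)")
```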

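[Figure: distribution of liability (horizontal axis: Alcoholism Risk, centered at 0). a) Normal Liability Threshold Model: a single threshold t1 separates UNAFFECTED from AFFECTED. b) Multiple-threshold Model: thresholds t1 and t2 separate UNAFFECTED, MILD CASES and SEVERE CASES.]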

CUMULATIVE NORMAL FREQUENCY DISTRIBUTION

Threshold value (t)     Prevalence (area under the standard normal curve above t)
-0.25                   60%
0.0                     50%
0.25                    40%
0.52                    30%
0.84                    20%
1.04                    15%
1.28                    10%
1.64                    5%
1.96                    2.5%
2.33                    1%
3.09                    0.1%
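The threshold table can be reproduced from the inverse of the standard normal distribution function. This is a minimal sketch using Python's standard library; the function names are illustrative assumptions.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal liability distribution (mean 0, variance 1)


def threshold_for_prevalence(prevalence):
    """Threshold t such that the area above t equals the prevalence."""
    return STD.inv_cdf(1.0 - prevalence)


def prevalence_for_threshold(t):
    """Area under the standard normal curve above the threshold t."""
    return 1.0 - STD.cdf(t)


for prevalence in (0.60, 0.50, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05, 0.025, 0.01, 0.001):
    print(f"prevalence {prevalence:6.1%}  threshold {threshold_for_prevalence(prevalence):6.2f}")
```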

Table 3. Population distribution of pairs of relatives with both alcoholic, neither alcoholic, or only one relative alcoholic, as a function of (i) lifetime prevalence of alcoholism, and (ii) liability correlation for alcoholism in relatives

PREVALENCE                Liability     Both        Discordant               Both         Risk to relative    Relatives’
Relative A   Relative B   correlation   affected    A affected  B affected   unaffected   of an alcoholic(a)  Recurrence
                                        (%)         (%)         (%)          (%)          (%)                 Risk Ratio

30%          30%          0.6           17.3        12.7        12.7         57.3         57.6                1.9
30%          30%          0.3           12.8        17.2        17.2         52.8         42.7                1.4
30%          30%          0.15          10.9        19.1        19.1         50.9         36.2                1.2

20%          20%          0.6           9.9         10.1        10.1         69.9         49.6                2.5
20%          20%          0.3           6.6         13.4        13.4         66.6         33.1                1.7
20%          20%          0.15          5.2         14.8        14.8         65.2         26.2                1.3

10%          10%          0.6           3.9         6.1         6.1          83.9         39.0                3.9
10%          10%          0.3           2.2         7.8         7.8          82.2         21.6                2.2
10%          10%          0.15          1.5         8.5         8.5          81.5         15.2                1.5

(a) i.e. Probandwise concordance rate
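The entries of Table 3 follow from a bivariate normal liability distribution with a common threshold. The sketch below is one way to compute them, assuming equal prevalence in both relatives and a liability correlation strictly between -1 and 1; it uses simple Simpson's-rule integration rather than any particular library routine, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal liability distribution


def prob_both_affected(prevalence, rho, steps=2000):
    """P(both relatives exceed the threshold); requires -1 < rho < 1 and an even `steps`."""
    t = STD.inv_cdf(1.0 - prevalence)
    scale = math.sqrt(1.0 - rho * rho)
    upper = max(t, 0.0) + 10.0   # the normal density is negligible beyond this point

    def integrand(x):
        # density of relative A's liability at x, times P(relative B above t | A = x)
        return STD.pdf(x) * (1.0 - STD.cdf((t - rho * x) / scale))

    h = (upper - t) / steps
    total = integrand(t) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t + i * h)
    return total * h / 3.0       # Simpson's rule


def pair_distribution(prevalence, rho):
    """Return (both affected, each discordant cell, both unaffected, risk, risk ratio)."""
    both = prob_both_affected(prevalence, rho)
    discordant_each = prevalence - both
    both_unaffected = 1.0 - both - 2.0 * discordant_each
    risk = both / prevalence            # probandwise concordance rate
    return both, discordant_each, both_unaffected, risk, risk / prevalence


for prevalence in (0.30, 0.20, 0.10):
    for rho in (0.6, 0.3, 0.15):
        row = pair_distribution(prevalence, rho)
        print(prevalence, rho, [round(100 * v, 1) for v in row[:4]], round(row[4], 1))
```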

EXAMPLE DATA-FILE FOR MX RAW ORDINAL DATA:

MZF DEPRESSION DATA (depmzf.rec)

(columns: twin A status, twin B status, number of pairs)

0 0 329
0 1 83
1 0 95
1 1 83

EXAMPLE DATA-FILE (II):

DERIVED FROM PUBLISHED SOURCES

MZF ALCOHOL DEPENDENCE DATA (alcmzf.rec)

0 0 510
0 1 32.5
1 0 32.5
1 1 15

! tetrachoric.mx
! estimating tetrachoric correlations
#define nvar 1
#define maxthresf 1 ! number of thresholds
Analysis of depression data: estimating tetrachorics & confidence intervals
data NI=3 NG=4
LAbels twina twinb countmz
Ordinal fi=depmzf.rec
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
Definition_variables countmz /
Begin matrices;
W LO nvar nvar fr ! w*w' is the tetrachoric correlation
Y LO nvar nvar fr ! y*y' is 1-tetrachoric correlation
M FU maxthresf nvar fi
S DI nvar nvar ! Matrix that will store weight variable
end matrices;
SP M
3
MATRIX M
1.5487
! This tells MX to store the definition variable count in S
SP S
-1
mat w 0.7
mat y 0.7
Begin algebra;
R=W*W';
E=Y*Y';
V=R+E;
end algebra;
FREQ S; ! tells MX that S contains the weight (frequency) variable
TH M|M; ! tells MX that row and column thresholds contained in M|M
CO V|R_
R'|V; ! formula for correlation matrix
!
bo 0.001 1.0 y(1,1)
bo 0.0001 0.999 w(1,1)
bo -5.0 5.0 m(1,1)
interval r(1,1) ! compute 95% confidence interval for correlation
OPT func=1.E-12
OPT RS
END

Analysis of depression data: DZm
data NI=3
LAbels twina twinb countdz
OR fi=depdzf.rec
Definition_variables countdz /
Begin matrices;
W LO nvar nvar fr ! w*w' is the tetrachoric correlation for DZ group
Y LO nvar nvar fr ! y*y' is 1-tetrachoric correlation for DZ group
N FU maxthresf nvar fr
S DI nvar nvar ! Matrix that will store weight variable
end matrices;
SP N
6
MATRIX N
1.4487
SP S
-1
mat w 0.6
mat y 0.8
Begin algebra;
R=W*W';
E=Y*Y';
V=R+E;
end algebra;
FREQ S;
TH N|N;
CO V|R_
R'|V;
bo 0.001 1.0 y(1,1)
bo 0.0001 0.999 w(1,1)
bo -5.0 5.0 n(1,1)
interval r(1,1) ! compute 95% confidence interval for correlation
OPT RS
END

Constraint function - constrain variances to unity for MZ group
CO NI=1
Begin matrices = group 1;
U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end

Constraint function - constrain variances to unity for DZ group
CO NI=1
Begin matrices = group 2;
U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end

Summary of VL file data for group 1

              COUNTMZ        TWINA        TWINB
Code      -1.0000E+00   1.0000E+00   2.0000E+00
Number     4.0000E+00   4.0000E+00   4.0000E+00
Mean       1.4750E+02   5.0000E-01   5.0000E-01
Variance   1.1005E+04   2.5000E-01   2.5000E-01
Minimum    8.3000E+01   0.0000E+00   0.0000E+00
Maximum    3.2900E+02   1.0000E+00   1.0000E+00

Summary of VL file data for group 2

             COUNTDZ    TWINA    TWINB
Code         -1.0000   1.0000   2.0000
Number        4.0000   4.0000   4.0000
Mean        110.0000   0.5000   0.5000
Variance   2882.5000   0.2500   0.2500
Minimum      63.0000   0.0000   0.0000
Maximum     201.0000   1.0000   1.0000

PARAMETER SPECIFICATIONS

GROUP NUMBER: 1
Analysis of depression data: estimating tetrachorics & confidence intervals

MATRIX E
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX M
This is a FULL matrix of order 1 by 1
    1
1   3

MATRIX R
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
    1
1  -1

MATRIX V
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
    1
1   1

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
    1
1   2

GROUP NUMBER: 2
Analysis of ordinal alcohol tolerance and dependence data: DZm

MATRIX E
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX N
This is a FULL matrix of order 1 by 1
    1
1   6

MATRIX R
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
    1
1  -1

MATRIX V
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
    1
1   4

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
    1
1   5

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of depression data: estimating tetrachorics & confidence intervals

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.5660

MATRIX M
This is a FULL matrix of order 1 by 1
        1
1  0.5489

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.4340

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
         1
1  83.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
        1
1  1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.6588

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.7523

Matrix of EXPECTED thresholds
              TWINA    TWINB
Threshold 1  0.5489   0.5489
Threshold 2  1.0000   1.5487

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
         TWINA    TWINB
TWINA   1.0000
TWINB   0.4340   1.0000

Function value of this group: 1383.2565
Where the fit function is -2 * Log-likelihood of raw ordinal

GROUP NUMBER: 2
Analysis of ordinal alcohol tolerance and dependence data: DZm

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.8157

MATRIX N
This is a FULL matrix of order 1 by 1
        1
1  0.4038

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.1843

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
         1
1  63.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
        1
1  1.0000

Your model has 6 estimated parameters and 18 Observed statistics
Observed statistics include 2 constraints.
-2 times log-likelihood of data >>> 2509.632
Degrees of freedom >>>>>>>>>>>>>>>> 12

1 Confidence intervals requested in group 1
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
R 1 1 1         95.0   0.4340     0.3086   0.5477   0 0    0 0

1 Confidence intervals requested in group 2
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
R 2 1 1         95.0   0.1843     0.0306   0.3316   0 0    0 0

This problem used 0.2% of my workspace

Task                    Time elapsed (DD:HH:MM:SS)
Reading script & data   0: 0: 0: 0.11
Execution               0: 0: 0: 2.85
TOTAL                   0: 0: 0: 2.96

Total number of warnings issued: 0

** Mx startup successful **
**MX-Sunos version 1.49**
! tetrachoric.mx
! estimating tetrachoric correlations
The following MX script lines were read for group 1
#DEFINE NVAR 1
#DEFINE MAXTHRESF 1 ! NUMBER OF THRESHOLDS
ANALYSIS OF ALCOHOLISM DATA: ESTIMATING TETRACHORICS & CONFIDENCE INTERVALS
DATA NI=3 NO=2 NG=4
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=ALCMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 4 records with data
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 4 data vectors for analysis
NOTE: Vectors contain a total of 8 observations

Summary of VL file data for group 1

              COUNTMZ        TWINA        TWINB
Code      -1.0000E+00   1.0000E+00   2.0000E+00
Number     4.0000E+00   4.0000E+00   4.0000E+00
Mean       1.4750E+02   5.0000E-01   5.0000E-01
Variance   4.3853E+04   2.5000E-01   2.5000E-01
Minimum    1.5000E+01   0.0000E+00   0.0000E+00
Maximum    5.1000E+02   1.0000E+00   1.0000E+00

Summary of VL file data for group 2

              COUNTDZ        TWINA        TWINB
Code      -1.0000E+00   1.0000E+00   2.0000E+00
Number     4.0000E+00   4.0000E+00   4.0000E+00
Mean       1.1000E+02   5.0000E-01   5.0000E-01
Variance   2.1088E+04   2.5000E-01   2.5000E-01
Minimum    1.1000E+01   0.0000E+00   0.0000E+00
Maximum    3.6100E+02   1.0000E+00   1.0000E+00

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of alcoholism data: estimating tetrachorics & confidence intervals

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.4688

MATRIX M
This is a FULL matrix of order 1 by 1
        1
1  1.4017

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.5312

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
         1
1  15.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
        1
1  1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.7288

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.6847

Matrix of EXPECTED thresholds
              TWINA    TWINB
Threshold 1  1.4017   1.4017
Threshold 2  1.0000   1.5487

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
         TWINA    TWINB
TWINA   1.0000
TWINB   0.5312   1.0000

Function value of this group: 635.6429
Where the fit function is -2 * Log-likelihood of raw ordinal

GROUP NUMBER: 2
Analysis of ordinal alcohol tolerance and dependence data: DZm

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.6482

MATRIX N
This is a FULL matrix of order 1 by 1
        1
1  1.2687

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.3518

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
         1
1  11.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
        1
1  1.0000

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.8051

Your model has 6 estimated parameters and 18 Observed statistics
Observed statistics include 2 constraints.
-2 times log-likelihood of data >>> 1207.896
Degrees of freedom >>>>>>>>>>>>>>>> 12

1 Confidence intervals requested in group 1
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
R 1 1 1         95.0   0.5312     0.3367   0.6903   0 0    0 0

1 Confidence intervals requested in group 2
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
R 2 1 1         95.0   0.3518     0.1190   0.5558   0 0    0 0

This problem used 0.2% of my workspace

Task                    Time elapsed (DD:HH:MM:SS)
Reading script & data   0: 0: 0: 0.11
Execution               0: 0: 0: 3.41
TOTAL                   0: 0: 0: 3.52

Total number of warnings issued: 0

ESTIMATED TETRACHORIC CORRELATIONS
(estimating separate thresholds for each zygosity group)

                      DEPRESSION             ALCOHOL DEPENDENCE
                      ρ       95% CI         ρ       95% CI
MZF                   0.43    0.31-0.55      0.53    0.34-0.69
DZF                   0.18    0.03-0.33      0.35    0.12-0.56
-2 log-likelihood     2509.632               1207.896
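The point estimates in this table can be approximated without Mx. The sketch below is not the Mx algorithm: it assumes that, with a single threshold shared by both twins, the maximum-likelihood fit reproduces the pooled prevalence and the proportion of concordant affected pairs, so the threshold comes from the inverse normal and the correlation from a bisection search. Function names are hypothetical, and confidence intervals are not computed.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def prob_both_above(t, rho, steps=1000):
    """P(both liabilities exceed t) for a standard bivariate normal, -1 < rho < 1."""
    scale = math.sqrt(1.0 - rho * rho)
    upper = max(t, 0.0) + 10.0

    def integrand(x):
        return STD.pdf(x) * (1.0 - STD.cdf((t - rho * x) / scale))

    h = (upper - t) / steps
    total = integrand(t) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t + i * h)
    return total * h / 3.0       # Simpson's rule


def tetrachoric(both_unaffected, a_only, b_only, both_affected):
    """Return (threshold, tetrachoric correlation) for a 2x2 twin table with equal thresholds."""
    n = both_unaffected + a_only + b_only + both_affected
    prevalence = (2 * both_affected + a_only + b_only) / (2 * n)
    t = STD.inv_cdf(1.0 - prevalence)
    target = both_affected / n
    lo, hi = -0.99, 0.99
    for _ in range(40):          # P(both affected) increases with rho, so bisect
        mid = (lo + hi) / 2.0
        if prob_both_above(t, mid) < target:
            lo = mid
        else:
            hi = mid
    return t, (lo + hi) / 2.0


t_mz, r_mz = tetrachoric(329, 95, 83, 83)   # MZ depression data (Table 1)
t_dz, r_dz = tetrachoric(201, 82, 94, 63)   # DZ depression data (Table 1)
print(f"MZ: threshold {t_mz:.4f}, correlation {r_mz:.3f}")
print(f"DZ: threshold {t_dz:.4f}, correlation {r_dz:.3f}")
```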

TEST FOR ZYGOSITY DIFFERENCE IN PREVALENCE
(takes into account non-independence!)

                                   DEPRESSION              ALCOHOL DEPENDENCE
                                   -2 ln L                 -2 ln L
(i) Separate thresholds model      2509.632                1207.896
(ii) Equal thresholds              2514.897                1210.304
Heterogeneity (ii - i), 1 d.f.     χ2 = 5.265, p=0.02      χ2 = 2.408, p=0.12
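The heterogeneity test is a likelihood-ratio test: the difference in -2 ln L between the equal-thresholds and separate-thresholds models is referred to a chi-square distribution with 1 degree of freedom (one threshold fewer is estimated). A minimal sketch, using the closed form for the 1 d.f. chi-square tail; the function name is an assumption.

```python
import math


def lrt_1df(minus2lnl_restricted, minus2lnl_full):
    """Return (chi-square, p-value) for a 1 d.f. likelihood-ratio test."""
    chi2 = minus2lnl_restricted - minus2lnl_full
    # For 1 d.f., P(chi-square > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


depression = lrt_1df(2514.897, 2509.632)
alcohol = lrt_1df(1210.304, 1207.896)
print(f"depression: chi2 = {depression[0]:.3f}, p = {depression[1]:.3f}")
print(f"alcohol dependence: chi2 = {alcohol[0]:.3f}, p = {alcohol[1]:.3f}")
```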

This approach extends naturally to fitting univariate genetic models.

! univariate.mx
! fitting a univariate genetic model to 2x2 data
#define nvar 1
#define maxthresf 1 ! number of thresholds
Analysis of depression data: fitting ACE model
data NI=3 NG=3
LAbels twina twinb countmz
Ordinal fi=depmzf.rec
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
Definition_variables countmz /
Begin matrices;
W LO nvar nvar fr ! additive genetic path (A=w*w')
X LO nvar nvar fr ! shared environmental path (C=x*x')
Y LO nvar nvar fr ! non-shared environmental path (E=y*y')
Z LO nvar nvar fi ! non-additive genetic path (D=z*z')
M FU maxthresf nvar fi ! matrix of thresholds
S DI nvar nvar ! Matrix that will store weight variable
end matrices;
SP M
4
MATRIX M
1.5487
! This tells MX to store the definition variable count in S
SP S
-1
mat w 0.5
mat x 0.5
mat y 0.7
Begin algebra;
A=W*W';
C=X*X';
E=Y*Y';
D=Z*Z';
V=A+C+D+E;
end algebra;
FREQ S; ! tells MX that S contains the weight (frequency) variable
TH M|M; ! tells MX that row and column thresholds contained in M|M
CO V|A+D+C_
A'+D'+C'|V; ! formula for correlation matrix
!
bo 0.001 1.0 y(1,1)
bo 0.0001 0.999 w(1,1) x(1,1)
bo -5.0 5.0 m(1,1)
interval a(1,1) c(1,1) e(1,1) ! compute 95% confidence intervals for variance components
OPT func=1.E-12
OPT RS
END

Analysis of depression data: DZm
data NI=3 NO=4
LAbels twina twinb countdz
OR fi=depdzf.rec
Definition_variables countdz /
Begin matrices = group 1;
S DI nvar nvar ! Matrix that will store weight variable
g DI 1 1 ! constant (=0.5) for coefficient of additive genetic component
h DI 1 1 ! constant (=0.25) for coefficient of dominance genetic component
n FU maxthresf nvar fi ! matrix of thresholds
end matrices;
SP N
5
MATRIX N
1.4487
MAT g 0.5
MAT h 0.25
SP S
-1
FREQ S;
TH N|N;
CO V|g@A+h@D+C_
g@A'+h@D'+C'|V; ! formula for correlation matrix
!
bo -5.0 5.0 n(1,1)
OPT RS
END

Constraint function - constrain variance to unity

CO NI=1

Begin matrices = group 1;

U unit 1 nvar

end matrices;

CO \d2v(V) = u;

end

** Mx startup successful **
**MX-Sunos version 1.49**
! univariate.mx
! fitting a univariate genetic model to 2x2 data
The following MX script lines were read for group 1
#DEFINE NVAR 1
#DEFINE MAXTHRESF 1 ! NUMBER OF THRESHOLDS
ANALYSIS OF DEPRESSION DATA: FITTING ACE MODEL
DATA NI=3 NO=2 NG=3
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=DEPMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 4 records with data
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 4 data vectors for analysis
NOTE: Vectors contain a total of 8 observations
BEGIN MATRICES;
W LO NVAR NVAR FR ! ADDITIVE GENETIC PATH (A=W*W')
X LO NVAR NVAR FR ! SHARED ENVIRONMENTAL PATH (C=X*X')
Y LO NVAR NVAR FR ! NON-SHARED ENVIRONMENTAL PATH (E=Y*Y')
Z LO NVAR NVAR FI ! NON-ADDITIVE GENETIC PATH (D=Z*Z')
M FU MAXTHRESF NVAR FI ! MATRIX OF THRESHOLDS
S DI NVAR NVAR ! MATRIX THAT WILL STORE WEIGHT VARIABLE
END MATRICES;

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of depression data: fitting ACE model

MATRIX A
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.4250

MATRIX C
This is a computed FULL matrix of order 1 by 1
[=X*X']
            1
1  1.0000E-08

MATRIX D
This is a computed FULL matrix of order 1 by 1
[=Z*Z']
        1
1  0.0000

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.5750

MATRIX M
This is a FULL matrix of order 1 by 1
        1
1  0.5493

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
         1
1  83.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=A+C+D+E]
        1
1  1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.6519

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
            1
1  1.0000E-04

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.7583

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.0000

Matrix of EXPECTED thresholds
              TWINA    TWINB
Threshold 1  0.5493   0.5493
Threshold 2  1.0000   1.5487

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
         TWINA    TWINB
TWINA   1.0000
TWINB   0.4250   1.0000

Function value of this group: 1383.2782
Where the fit function is -2 * Log-likelihood of raw ordinal

Your model has 5 estimated parameters and 17 Observed statistics
Observed statistics include 1 constraints.
-2 times log-likelihood of data >>> 2509.788
Degrees of freedom >>>>>>>>>>>>>>>> 12

3 Confidence intervals requested in group 1
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
A 1 1 1         95.0   0.4250     0.1045   0.5325   0 0    0 0
C 1 1 1         95.0   0.0000     0.0000   0.2609   0 0    0 1
E 1 1 1         95.0   0.5750     0.4675   0.6940   0 0    0 0

This problem used 0.1% of my workspace

Task                    Time elapsed (DD:HH:MM:SS)
Reading script & data   0: 0: 0: 0.10
Execution               0: 0: 0:15.76
TOTAL                   0: 0: 0:15.86

Total number of warnings issued: 1

** Mx startup successful **
**MX-Sunos version 1.49**
! univar2.mx
! fitting a univariate genetic model to 2x2 data
The following MX script lines were read for group 1
#DEFINE NVAR 1
#DEFINE MAXTHRESF 1 ! NUMBER OF THRESHOLDS
ANALYSIS OF ALCOHOL DEPENDENCE DATA: FITTING ACE MODEL
DATA NI=3 NO=2 NG=3
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=ALCMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 4 records with data
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 4 data vectors for analysis
NOTE: Vectors contain a total of 8 observations

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of alcohol dependence data: fitting ACE model

MATRIX A
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.3588

MATRIX C
This is a computed FULL matrix of order 1 by 1
[=X*X']
        1
1  0.1724

MATRIX D
This is a computed FULL matrix of order 1 by 1
[=Z*Z']
        1
1  0.0000

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.4688

MATRIX M
This is a FULL matrix of order 1 by 1
        1
1  1.4017

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
         1
1  15.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=A+C+D+E]
        1
1  1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.5990

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.4152

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.6847

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.0000

Your model has 5 estimated parameters and 17 Observed statistics
Observed statistics include 1 constraints.
-2 times log-likelihood of data >>> 1207.896
Degrees of freedom >>>>>>>>>>>>>>>> 12

3 Confidence intervals requested in group 1
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
A 1 1 1         95.0   0.3588     0.0000   0.6902   0 0    0 0
C 1 1 1         95.0   0.1724     0.0000   0.5542   0 0    0 0
E 1 1 1         95.0   0.4688     0.3097   0.6628   0 1    0 0

This problem used 0.1% of my workspace

Task                    Time elapsed (DD:HH:MM:SS)
Reading script & data   0: 0: 0: 0.10
Execution               0: 0: 0: 7.55
TOTAL                   0: 0: 0: 7.65

Total number of warnings issued: 1

VIRGINIA TWIN STUDY: Female Like-Sex Pairs
Summary Model-Fitting Results (% of variance)

                       Additive Genetic        Shared Environmental     Non-Shared Environmental
                       Variance   95% CI       Variance   95% CI        Variance   95% CI
Major depression       42.5       10.5-53.3    0.0        0.0-26.1      57.5       46.8-69.4
Alcohol dependence     35.9       0.0-69.0     17.2       0.0-55.4      46.9       31.0-66.3
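The variance components map back onto expected twin correlations through the covariance formulas in the script: A + C + D for MZ pairs and 0.5A + C + 0.25D for DZ pairs. The sketch below illustrates this with the alcohol dependence estimates from the Mx output; the function name is a hypothetical one.

```python
def expected_twin_correlations(a, c, d=0.0):
    """Expected (MZ, DZ) liability correlations from standardized variance components."""
    r_mz = a + c + d
    r_dz = 0.5 * a + c + 0.25 * d
    return r_mz, r_dz


# Alcohol dependence ACE estimates from the Mx output (A = 0.3588, C = 0.1724)
alc_mz, alc_dz = expected_twin_correlations(0.3588, 0.1724)
# Depression ACE estimates (A = 0.4250, C estimated at its lower bound of zero)
dep_mz, dep_dz = expected_twin_correlations(0.4250, 0.0)

print(f"alcohol dependence: rMZ = {alc_mz:.4f}, rDZ = {alc_dz:.4f}")
print(f"depression:         rMZ = {dep_mz:.4f}, rDZ = {dep_dz:.4f}")
```

For alcohol dependence the ACE model reproduces the tetrachoric correlations (0.5312 and 0.3518) exactly, which is why its -2 log-likelihood equals that of the tetrachoric model.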

Model-fitting results: Depression in the Virginia Twin Study

             Parameter Estimates (%)         Likelihood-ratio test versus full model
Model        A      C      E      D          d.f.    χ2      p
A D E        30     --     57     13         --      --      --
A C E        43     0      57     --         --      --      --
A E          43     --     57     --         1       0.15    0.70
C E          --     33     67     --         1       6.44    0.01
E            --     --     100    --         --      --      --

We can easily handle data where only one twin has responded. HOWEVER, we are assuming that missing data are MCAR (Missing Completely at Random).

We can include twins with missing cotwins (indicated by .) in the same data-file as complete pairs.

Alternatively, if we want to test for differences in prevalence for complete pairs versus singles (suggestive of an ascertainment bias), we can include singleton twins as separate groups, allowing a test of equality of thresholds.

EXAMPLE: Alcohol Dependence Data from 1992 Survey of the Australian Twin Panel (1981 cohort)

Twin A   Twin B     MZ Male    DZ Male
0        0          274        138
1        0          37.5       38.5
0        1          37.5       38.5
1        1          49         19
0        .          34         31.5
1        .          8          8
.        0          34         31.5
.        1          8          8

Table 2. Numbers of twin pairs concordant and discordant for smoking status in the Australian twin panel 1981 survey.

                           MZ Female (N=1232 pairs)      DZ Female (N=747 pairs)
                           I       II      III           I       II      III
I   Non-smoker             629                           310
II  Successful quitter     110     64                    98      33
III Current smoker         124     115     190           146     61      99

                           MZ Male (N=567 pairs)         DZ Male (N=350 pairs)
                           I       II      III           I       II      III
I   Non-smoker             221                           121
II  Successful quitter     77      70                    44      27
III Current smoker         31      61      77            61      53      44

MULTIPLE THRESHOLD MODEL

For n categories, we need to estimate (n-1) thresholds.

The safest way to estimate multiple thresholds is to estimate the first threshold and then a positive increment for each later threshold:

t0
t1 = t0 + Δ1 (Δ1 > 0)
t2 = t1 + Δ2 (Δ2 > 0)

and so on. This is especially important when we estimate confidence intervals.

Note that if

L = | 1 0 |     and     M = | t0 |
    | 1 1 |                 | Δ1 |

then

LM = | t0      |
     | t0 + Δ1 |     etc.

Hence, we merely need to constrain M2 etc. > 0.
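The effect of premultiplying by L can be seen in a few lines. This sketch builds a lower-triangular matrix of ones and applies it to a column of (first threshold, increments); the values are the MZ female smoking estimates of M reported in the Mx output further on (0.2809 and 0.3979). The function names are illustrative assumptions.

```python
def lower_triangular_ones(n):
    """n x n lower-triangular matrix of ones (the matrix L)."""
    return [[1.0 if col <= row else 0.0 for col in range(n)] for row in range(n)]


def thresholds_from_increments(m):
    """Compute T = L * M, where m = [t0, increment 1, increment 2, ...]."""
    l_matrix = lower_triangular_ones(len(m))
    return [sum(l_matrix[row][col] * m[col] for col in range(len(m)))
            for row in range(len(m))]


m_estimates = [0.2809, 0.3979]                 # first threshold and positive increment
t_estimates = thresholds_from_increments(m_estimates)
print([round(t, 4) for t in t_estimates])
```

Because every increment is bounded below by a positive number, the resulting thresholds are guaranteed to be in increasing order, whatever values the optimizer tries.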

! univariate3x3.mx
! fitting a univariate genetic model to 3x3 data
#define nvar 1
#define maxthresf 2 ! number of thresholds
Analysis of smoking data: fitting ACE model
data NI=3 NG=3
LAbels twina twinb countmz
Ordinal fi=smkmzf.rec
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 3x3 table
!
Definition_variables countmz /
Begin matrices;
W LO nvar nvar fr ! additive genetic path (A=w*w')
X LO nvar nvar fr ! shared environmental path (C=x*x')
Y LO nvar nvar fr ! non-shared environmental path (E=y*y')
Z LO nvar nvar fi ! non-additive genetic path (D=z*z')
M FU maxthresf nvar fi ! matrix of thresholds
L LO maxthresf maxthresf ! used to ensure t1 < t2
S DI nvar nvar ! Matrix that will store weight variable
end matrices;
SP M
4
5
MATRIX M
1.5487
0.5
MATRIX L
1
1 1
! This tells MX to store the definition variable count in S
SP S
-1
mat w 0.5
mat x 0.5
mat y 0.7
Begin algebra;
A=W*W';
C=X*X';
E=Y*Y';
D=Z*Z';
V=A+C+D+E;
T=L*M;
end algebra;
FREQ S; ! tells MX that S contains the weight (frequency) variable
TH T|T; ! tells MX that row and column thresholds contained in T|T
CO V|A+D+C_
A'+D'+C'|V; ! formula for correlation matrix
!
bo 0.001 1.0 y(1,1) m(2,1)
bo 0.0001 0.999 w(1,1) x(1,1)
bo -5.0 5.0 m(1,1)
interval a(1,1) c(1,1) e(1,1) ! compute 95% confidence intervals for variance components
OPT func=1.E-12
OPT RS
END

Analysis of ordinal smoking data: DZm
data NI=3 X
LAbels twina twinb countdz
OR fi=smkdzf.rec
Definition_variables countdz /
Begin matrices = group 1;
S DI nvar nvar ! Matrix that will store weight variable
g DI 1 1 ! constant (=0.5) for coefficient of additive genetic component
h DI 1 1 ! constant (=0.25) for coefficient of dominance genetic component
n FU maxthresf nvar fi ! matrix of thresholds
end matrices;
SP N
6
7
MATRIX N
1.4487
0.5
MAT g 0.5
MAT h 0.25
SP S
-1
Begin algebra;
T=L*N;
end algebra;
FREQ S;
TH T|T;
CO V|g@A+h@D+C_
g@A'+h@D'+C'|V; ! formula for correlation matrix
!
bo -5.0 5.0 n(1,1)
bo 0.001 1.0 n(2,1)
OPT RS
END

Constraint function - constrain variances to unity
CO NI=1
Begin matrices = group 1;
U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end

** Mx startup successful **
**MX-Sunos version 1.49**
! univariate3x3.mx
! fitting a univariate genetic model to 3x3 data
The following MX script lines were read for group 1
#DEFINE NVAR 1
#DEFINE MAXTHRESF 2 ! NUMBER OF THRESHOLDS
ANALYSIS OF SMOKING DATA: FITTING ACE MODEL
DATA NI=3 NO=9 NG=3
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=SMKMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 9 records with data
and 1 records where all data were missing
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 3x3 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 9 data vectors for analysis
NOTE: Vectors contain a total of 18 observations
BEGIN MATRICES;
W LO NVAR NVAR FR ! ADDITIVE GENETIC PATH (A=W*W')
X LO NVAR NVAR FR ! SHARED ENVIRONMENTAL PATH (C=X*X')
Y LO NVAR NVAR FR ! NON-SHARED ENVIRONMENTAL PATH (E=Y*Y')
Z LO NVAR NVAR FI ! NON-ADDITIVE GENETIC PATH (D=Z*Z')
M FU MAXTHRESF NVAR FI ! MATRIX OF THRESHOLDS
L LO MAXTHRESF MAXTHRESF ! USED TO ENSURE T1 < T2
S DI NVAR NVAR ! MATRIX THAT WILL STORE WEIGHT VARIABLE
END MATRICES;

Summary of VL file data for group 1

              COUNTMZ        TWINA        TWINB
Code      -1.0000E+00   1.0000E+00   2.0000E+00
Number     9.0000E+00   9.0000E+00   9.0000E+00
Mean       1.3689E+02   1.0000E+00   1.0000E+00
Variance   3.1949E+04   6.6667E-01   6.6667E-01
Minimum    5.5000E+01   0.0000E+00   0.0000E+00
Maximum    6.2900E+02   2.0000E+00   2.0000E+00

Summary of VL file data for group 2

             COUNTDZ    TWINA    TWINB
Code         -1.0000   1.0000   2.0000
Number        9.0000   9.0000   9.0000
Mean         83.0000   1.0000   1.0000
Variance   6923.2778   0.6667   0.6667
Minimum      30.5000   0.0000   0.0000
Maximum     310.0000   2.0000   2.0000

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of smoking data: fitting ACE model

MATRIX A
This is a computed FULL matrix of order 1 by 1
[=W*W']
        1
1  0.5836

MATRIX C
This is a computed FULL matrix of order 1 by 1
[=X*X']
        1
1  0.1823

MATRIX D
This is a computed FULL matrix of order 1 by 1
[=Z*Z']
        1
1  0.0000

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
        1
1  0.2341

MATRIX L
This is a LOWER TRIANGULAR matrix of order 2 by 2
        1        2
1  1.0000
2  1.0000   1.0000

MATRIX M
This is a FULL matrix of order 2 by 1
        1
1  0.2809
2  0.3979

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
          1
1  190.0000

MATRIX T
This is a computed FULL matrix of order 2 by 1
[=L*M]
        1
1  0.2809
2  0.6788

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=A+C+D+E]
        1
1  1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.7639

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.4269

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.4839

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
        1
1  0.0000

Matrix of EXPECTED thresholds
              TWINA    TWINB
Threshold 1  0.2809   0.2809
Threshold 2  0.6788   0.6788

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
         TWINA    TWINB
TWINA   1.0000
TWINB   0.7659   1.0000

Function value of this group: 4094.3378
Where the fit function is -2 * Log-likelihood of raw ordinal


Your model has 7 estimated parameters and 37 Observed statistics
Observed statistics include 1 constraints.
-2 times log-likelihood of data >>> 6860.060
Degrees of freedom >>>>>>>>>>>>>>>> 30

3 Confidence intervals requested in group 1
Matrix Element  Int.   Estimate   Lower    Upper    Lfail  Ufail
A 1 1 1         95.0   0.5836     0.3990   0.7781   0 0    0 1
C 1 1 1         95.0   0.1823     0.0000   0.3510   0 0    0 0
E 1 1 1         95.0   0.2341     0.1958   0.2777   0 1    0 0

This problem used 0.1% of my workspace

Task                    Time elapsed (DD:HH:MM:SS)
Reading script & data   0: 0: 0: 0.24
Execution               0: 0: 0:29.75
TOTAL                   0: 0: 0:30.00

Total number of warnings issued: 2

SMOKING IN WOMEN

                                      %       95% CI
Additive genetic variance             58.4    39.9-77.8
Shared environmental variance         18.2    0.0-35.1
Non-shared environmental variance     23.4    19.6-27.8
-2 log-likelihood                     6860.06

BIVARIATE GENETIC APPLICATIONS

It is a simple step to modify the univariate script to allow for bivariate (or even trivariate) genetic analyses.

If the traits being analyzed have varying numbers of thresholds, maxthres will be the maximum number of thresholds, and we will have, say,

MAT M
0.5 -0.5
0 1.0

Each column of M holds the first threshold and the increments for one trait. A trait with fewer thresholds has its unused increment fixed at 0 (here the first trait, which has a single threshold).

In the next example, we analyze Australian twin data on lifetime history of major depression and current smoking status. Here, the original raw data are given in depsmkmf.rec and depsmkdf.rec. Notice that the data have been sorted; this will improve the efficiency of the MX run.

! ordinal_bivariate.mx
#define nvar 2
#define nvar2 4
#define maxthres 2
Analysis of ordinal depression (0/1) and smoking ! initiation/persistence (0/1/2)
data NI=nvar2 NG=3
Ordinal fi=depsmkmf.rec
Begin matrices;
M FU maxthres nvar fr
L LO maxthres maxthres
W LO nvar nvar fr
X LO nvar nvar fr
Y LO nvar nvar fr
end matrices;
MAT L
1.0
1.0 1.0
MATRIX M
0.5294 0.7191
0.0 0.5
SP M
1 2
0 3
st 0.7 y(1,1) y(2,2) w(1,1) w(2,2)
st 0.2 x(1,1) x(2,2)
st 0.2 w(2,1) x(2,1) y(2,1)
Begin algebra;
A=W*W';
O=\stnd(A);
C=X*X';
r=\stnd(C);
E=Y*Y';
q=\stnd(E);
P=A+C+E;
end algebra;
TH L*M|L*M;
CO P | A + C _
A' + C' | P ;
bo 0.001 1.0 y(1,1) y(2,2)
bo 0.0001 0.999 x(1,1) x(2,2) w(1,1) w(2,2)
bo -0.999 0.999 x(2,1) y(2,1) w(2,1)
bo 0.001 3.0 m(2,2)
bo -5.0 5.0 m(1,1)
! interval a(1,1) a(2,2) c(1,1) c(2,2) e(1,1) e(2,2) o(1,2) r(1,2) q(1,2)
OPT func=1.E-12
OPT RS
END

Analysis of ordinal depression and smoking data: DZF
data NI=nvar2
Ordinal fi=depsmkdf.rec
Begin matrices = group 1;
N FU maxthres nvar fr
g fu 1 1
end matrices;
MATRIX N
0.5781 0.6884
0 0.72
SP N
101 102
0 103
mat g
0.5
TH L*N | L*N ;
CO P | g@A + C _
g@A' + C' | P ;
bo 0.001 3.0 n(2,2)
bo -5.0 5.0 n(1,1)
OPT RS
END

Data constraint
CO NI=1
Begin matrices = group 1;
U unit 1 nvar
end matrices;
CO \d2v(P) = u;
end

Summary of VL file data for group 1 Code 1.0000 2.0000 3.0000 4.0000 Number 1013.0000 1286.0000 982.0000 1254.0000 Mean 0.1925 0.6454 0.2169 0.6555 Variance 0.1554 0.7234 0.1699 0.7426 Minimum 0.0000 0.0000 0.0000 0.0000 Maximum 1.0000 2.0000 1.0000 2.0000 Summary of VL file data for group 2 Code 1.0000 2.0000 3.0000 4.0000 Number 598.0000 826.0000 586.0000 786.0000 Mean 0.1940 0.6525 0.2526 0.7468 Variance 0.1564 0.7086 0.1888 0.7820 Minimum 0.0000 0.0000 0.0000 0.0000 Maximum 1.0000 2.0000 1.0000 2.0000

*** WARNING! *** I am not sure I have found a solution that satisfies Kuhn-Tucker conditions for a minimum. NAG's IFAIL parameter is 6 Looks like I got stuck here. Check the following: 1. The model is correctly specified 2. Starting values are good 3. You are not already at the solution The error can arise if the Hessian is ill-conditioned You can try resetting it to an identity matrix and fit from the solution by putting TH=-n on the OU line where n is the number of refits that you want to do If all else fails try putting NAG=30 on the OU line and examine the file NAGDUMP.OUT and the NAG manual

MX PARAMETER ESTIMATES

GROUP NUMBER: 1

Analysis of ordinal depression (0/1) and smoking initiation/persistence (0/1/2)

MATRIX A
This is a computed FULL matrix of order 2 by 2
[=W*W']
            1            2
1      0.3551       0.0782
2      0.0782       0.6147

MATRIX C
This is a computed FULL matrix of order 2 by 2
[=X*X']
            1            2
1  3.5944E-08   5.0292E-05
2  5.0292E-05   1.5118E-01

MATRIX E
This is a computed FULL matrix of order 2 by 2
[=Y*Y']
            1            2
1      0.6451       0.0270
2      0.0270       0.2343

MATRIX L
This is a LOWER TRIANGULAR matrix of order 2 by 2
            1            2
1      1.0000
2      1.0000       1.0000

MATRIX M
This is a FULL matrix of order 2 by 2
            1            2
1      0.8249       0.2698
2      0.0000       0.4014

MATRIX O
This is a computed FULL matrix of order 2 by 2
[=\STND(A)]
            1            2
1      1.0000       0.1675
2      0.1675       1.0000

MATRIX P
This is a computed FULL matrix of order 2 by 2
[=A+C+E]
            1            2
1      1.0003       0.1053
2      0.1053       1.0001

MATRIX Q
This is a computed FULL matrix of order 2 by 2
[=\STND(E)]
            1            2
1      1.0000       0.0695
2      0.0695       1.0000

MATRIX R
This is a computed FULL matrix of order 2 by 2
[=\STND(C)]
            1            2
1      1.0000       0.6822
2      0.6822       1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 2 by 2
            1            2
1      0.5959
2      0.1313       0.7729

MATRIX X
This is a LOWER TRIANGULAR matrix of order 2 by 2
            1            2
1  1.8959E-04
2  2.6527E-01   2.8428E-01

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 2 by 2
            1            2
1      0.8032
2      0.0337       0.4829

Matrix of EXPECTED thresholds
                   1        2        3        4
Threshold 1   0.8249   0.2698   0.8249   0.2698
Threshold 2   0.8249   0.6712   0.8249   0.6712

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
            1        2        3        4
1      1.0003
2      0.1053   1.0001
3      0.3551   0.0783   1.0003
4      0.0783   0.7659   0.1053   1.0001

Function value of this group: 6219.6048
Where the fit function is -2 * Log-likelihood of raw ordinal
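The expected covariance matrix printed above can be rebuilt by hand from the estimated A, C and E matrices. The sketch below, in plain Python, mirrors the block layout of the statement CO P | A + C _ A' + C' | P, reading | as side-by-side joining and _ as stacking. The function names are ours, and treating g@A as multiplication of A by the scalar 0.5 is our reading of the DZ group, where g is a 1 x 1 matrix.

```python
def mat_add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def mat_scale(k, x):
    return [[k * a for a in row] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def twin_covariance(A, C, E, g):
    """Assemble the 4 x 4 matrix [[P, gA + C], [gA' + C', P]], P = A + C + E."""
    P = mat_add(mat_add(A, C), E)
    cross = mat_add(mat_scale(g, A), C)
    cross_t = mat_add(mat_scale(g, transpose(A)), transpose(C))
    top = [p_row + c_row for p_row, c_row in zip(P, cross)]
    bottom = [c_row + p_row for c_row, p_row in zip(cross_t, P)]
    return top + bottom


# Estimates copied from the Mx output (group 1).
A = [[0.3551, 0.0782], [0.0782, 0.6147]]
C = [[3.5944e-08, 5.0292e-05], [5.0292e-05, 1.5118e-01]]
E = [[0.6451, 0.0270], [0.0270, 0.2343]]

mz = twin_covariance(A, C, E, g=1.0)   # MZ pairs share all additive variance
dz = twin_covariance(A, C, E, g=0.5)   # DZ pairs share half of it
for name, m in (("MZ", mz), ("DZ", dz)):
    print(name)
    for row in m:
        print("  ".join(f"{v:7.4f}" for v in row))
```

The MZ cross-twin block reproduces the printed values 0.3551, 0.0783 and 0.7659 up to rounding of the inputs.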

*** WARNING! ***
Minimization may not be successful. See above
CODE RED - Hessian/precision problem

Your model has 15 estimated parameters and 7333 Observed statistics
Observed statistics include 2 constraints.

-2 times log-likelihood of data >>> 10511.502
Degrees of freedom >>>>>>>>>>>>>>>> 7318

This problem used 1.2% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data    0: 0: 0: 6.63
Execution                0: 0: 5:18.42
TOTAL                    0: 0: 5:25.04

Total number of warnings issued: 2

GENETIC ANALYSIS OF

BINARY and CATEGORICAL TRAITS

PART TWO

HIGH-RISK SAMPLING SCHEMES

The Ordinal data-option in MX allows us to analyze twin or family data collected under a two-stage sampling scheme, where in the first stage we study a random sample of families, but in the second stage the probability that a family will be assigned for interview is a function of phenotypic values observed at the first stage. For example, we may decide that we will do follow-up assessments with all pairs where at least one twin is affected at stage one, but only 10% of pairs where neither twin was affected at stage one.

To illustrate this, we have created a simulated data-set, with the following parameters, using multsim2_2mz.mx and multsim2_2dz.mx.

              WAVE 1    WAVE 2
VA             50%       50%      rG = 1.00
VC              9%        9%      rC = 1.00
VE             41%       41%      rE = 0.71
Prevalence     25%       25%

First, we analyze these data assuming that all twin pairs (1000 MZ, 1000 DZ pairs) are assessed at both waves (ordinal_bivariate_simulated.mx).

SIMULATED TWO-WAVE DATA

     TWIN A            TWIN B
Wave 1   Wave 2   Wave 1   Wave 2    MZ_FULL    DZ_FULL
  0        0        0        0        560        519.5
  0        0        0        1         33.6       37.8
  0        0        1        0         33.6       37.8
  0        0        1        1         60.35      92.4
  0        1        0        0         33.6       37.8
  0        1        0        1          5.8        4.7
  0        1        1        0          5.8        4.7
  0        1        1        1         17.25      15.3
  1        0        0        0         33.6       37.8
  1        0        0        1          5.8        4.7
  1        0        1        0          5.8        4.7
  1        0        1        1         17.25      15.3
  1        1        0        0         60.35      92.4
  1        1        0        1         17.25      15.3
  1        1        1        0         17.25      15.3
  1        1        1        1         92.7       64.6

TOTAL PAIRS                          1000.00    1000.10

Estimated parameters:

                   WAVE 1                  WAVE 2
               %     95% CI            %     95% CI          r
VA            49.7   (34.5-61.7)      49.5   (26.6-55.0)     rG = 1.00
VC             9.3   (0.0-24.5)        9.5   (0.0-25.3)      rC = 1.00
VE            41.1   (37.2-47.6)      41.1   ( -- )          rE = 0.71
Prevalence    25.0                    25.0
              (t1=0.6748)             (t2=0.6747)
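Under the liability-threshold model, a prevalence corresponds to the point on a standard normal liability scale above which individuals are affected. A short Python sketch of the conversion in both directions follows; it uses only the standard library, the function names are ours, and the standard normal liability scale is the usual assumption of the threshold model rather than something specific to this script.

```python
from statistics import NormalDist


def threshold_from_prevalence(prevalence):
    """Threshold t on a standard normal liability with P(Z > t) = prevalence."""
    return NormalDist().inv_cdf(1.0 - prevalence)


def prevalence_from_threshold(threshold):
    """Proportion of a standard normal liability lying above the threshold."""
    return 1.0 - NormalDist().cdf(threshold)


t = threshold_from_prevalence(0.25)
print(f"threshold for 25% prevalence: {t:.4f}")

# Estimated thresholds reported for the two waves.
for label, estimate in (("t1", 0.6748), ("t2", 0.6747)):
    implied = prevalence_from_threshold(estimate)
    print(f"{label} = {estimate}: implied prevalence {implied:.4f}")
```

The estimated thresholds t1 and t2 sit very close to the value implied by the simulated 25% prevalence.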

HIGH-RISK SAMPLING SCHEMES (II)

Next, we analyze the data-set that would arise under our two-stage sampling scheme, using ordinal_bivariate_hirisk.mx. This is exactly the same program as in the previous case, except that we have changed file names! In 90% of cases where neither twin was affected at stage one, the stage two phenotypic values are set to missing. What parameter estimates do we recover in this case?

Two-wave data simulating high-risk sampling

     TWIN A            TWIN B
Wave 1   Wave 2   Wave 1   Wave 2    MZHIRISK   DZHIRISK
  0        0        0        0         56         51.95
  0        .        0        .        504        467.55
  0        0        0        1          3.36       3.78
  0        .        0        .         30.24      34.02
  0        0        1        0         33.6       37.8
  0        0        1        1         60.35      92.4
  0        1        0        0          3.36       3.78
  0        .        0        .         30.24      34.02
  0        1        0        1          0.58       0.47
  0        .        0        .          5.22       4.23
  0        1        1        0          5.8        4.7
  0        1        1        1         17.25      15.3
  1        0        0        0         33.6       37.8
  1        0        0        1          5.8        4.7
  1        0        1        0          5.8        4.7
  1        0        1        1         17.25      15.3
  1        1        0        0         60.35      92.4
  1        1        0        1         17.25      15.3
  1        1        1        0         17.25      15.3
  1        1        1        1         92.7       64.6

TOTAL PAIRS - Wave 1:                1000       1000.1
            - Wave 2:                 430.30     460.275
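The high-risk table can be derived mechanically from the full two-wave table. The Python sketch below does this for the MZ column only: every cell in which both twins were unaffected at wave one is split into a 10% followed-up part and a 90% part with wave two set to missing, and all other cells are kept whole. The dictionary layout and function name are illustrative choices of ours; the actual files were produced with the simulation scripts named earlier.

```python
# MZ_FULL counts keyed by (A wave 1, A wave 2, B wave 1, B wave 2).
MZ_FULL = {
    (0, 0, 0, 0): 560.0,  (0, 0, 0, 1): 33.6,
    (0, 0, 1, 0): 33.6,   (0, 0, 1, 1): 60.35,
    (0, 1, 0, 0): 33.6,   (0, 1, 0, 1): 5.8,
    (0, 1, 1, 0): 5.8,    (0, 1, 1, 1): 17.25,
    (1, 0, 0, 0): 33.6,   (1, 0, 0, 1): 5.8,
    (1, 0, 1, 0): 5.8,    (1, 0, 1, 1): 17.25,
    (1, 1, 0, 0): 60.35,  (1, 1, 0, 1): 17.25,
    (1, 1, 1, 0): 17.25,  (1, 1, 1, 1): 92.7,
}

FOLLOW_UP_RATE = 0.10  # share of wave-one concordant unaffected pairs re-assessed


def high_risk_sample(full, rate):
    """Split each cell into pairs followed up at wave two and pairs left missing."""
    followed, missing = {}, {}
    for key, n in full.items():
        a1, _, b1, _ = key
        if a1 == 0 and b1 == 0:
            followed[key] = n * rate
            missing[key] = n * (1.0 - rate)
        else:
            followed[key] = n  # at least one twin affected at wave one: all kept
    return followed, missing


followed, missing = high_risk_sample(MZ_FULL, FOLLOW_UP_RATE)
wave1_total = sum(MZ_FULL.values())
wave2_total = sum(followed.values())
print(f"wave 1 pairs: {wave1_total:.2f}, wave 2 pairs: {wave2_total:.2f}")
```

The wave two total reproduces the 430.30 MZ pairs shown in the table.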

Estimated parameters:

                   WAVE 1                  WAVE 2
               %     95% CI            %     95% CI          r
VA            50.2   (39.7-55.6)      50.3   (32.7-67.6)     rG = 1.00
VC             8.8                     8.6                   rC = 1.00
VE            41.0                    41.1                   rE = 0.71
Prevalence    25.0                    25.0
              (t1=0.6745)             (t2=0.6737)

HIGH-RISK SAMPLING SCHEMES (III)

Notice that we included all pairs who were assessed at stage one. What happens if we focus on the stage two phenotype and include only those pairs who have data at stage two? Data are in mzlistwise.rec and dzlistwise.rec; the program is univariate_listwise.mx.

Estimated parameters:   WAVE 2

               %      95% CI
VA            23.0    4.8-35.6
VC             0.0    0.0-12.4
VE            77.0    64.4-90.1
Prevalence    50.0    49.7-50.3
(t1=0.004)            (-0.086-0.094)

When we ignore the wave one data, our estimates of population prevalence (not unexpectedly) and of genetic and environmental parameters are seriously biased!
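The size of the prevalence bias can be checked directly from the simulated cell counts. The Python sketch below compares the raw wave two prevalence among all MZ pairs with the prevalence among only those pairs that have wave two data. It works from the MZ_FULL counts of the simulated two-wave table, and it computes simple proportions rather than the Mx maximum-likelihood estimate; the names are ours.

```python
# MZ_FULL counts keyed by (A wave 1, A wave 2, B wave 1, B wave 2).
MZ_FULL = {
    (0, 0, 0, 0): 560.0,  (0, 0, 0, 1): 33.6,
    (0, 0, 1, 0): 33.6,   (0, 0, 1, 1): 60.35,
    (0, 1, 0, 0): 33.6,   (0, 1, 0, 1): 5.8,
    (0, 1, 1, 0): 5.8,    (0, 1, 1, 1): 17.25,
    (1, 0, 0, 0): 33.6,   (1, 0, 0, 1): 5.8,
    (1, 0, 1, 0): 5.8,    (1, 0, 1, 1): 17.25,
    (1, 1, 0, 0): 60.35,  (1, 1, 0, 1): 17.25,
    (1, 1, 1, 0): 17.25,  (1, 1, 1, 1): 92.7,
}


def wave2_prevalence(cells):
    """Proportion of twins affected at wave two, pooling Twin A and Twin B."""
    affected = sum(n * (key[1] + key[3]) for key, n in cells.items())
    twins = 2.0 * sum(cells.values())
    return affected / twins


# Listwise sample: 10% of pairs with both twins unaffected at wave one,
# plus every pair with at least one twin affected at wave one.
listwise = {
    key: (n * 0.10 if key[0] == 0 and key[2] == 0 else n)
    for key, n in MZ_FULL.items()
}

full_prev = wave2_prevalence(MZ_FULL)
listwise_prev = wave2_prevalence(listwise)
print(f"all pairs:            {full_prev:.3f}")
print(f"pairs with wave two:  {listwise_prev:.3f}")
```

The listwise sample is heavily enriched for affected twins, which is why the estimated threshold is pushed towards zero.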

HIGH-RISK SAMPLING SCHEMES (IV)

Suppose that instead we acknowledge that our sample is drawn from a population where the prevalence of the observed trait is 25%, and fix our estimate of the threshold value, t=0.67449. As in the previous example, we limit ourselves to twin pairs where wave two assessments occurred.

           WAVE 2

         %      95% CI
VA      45.8    22.9-55.2
VC       0.0    0.0-17.8
VE      54.2    44.8-64.3

The bias in the parameter estimates is substantially reduced! (There is still a bias, however: in particular, our estimate of the shared environmental variance is now zero.)

HIGH-RISK SAMPLING SCHEMES (V)

How do we explain these results?

“Missing data theory” is an active area of research in statistics which is concerned with how we should adjust for missing observations -- which may be missing because of subject non-response, or because of sampling design (e.g. our two-stage sampling design).

Missing data theory distinguishes between data that are

(i) MCAR -- missing completely at random, i.e. non-response is completely unrelated to the variable we are studying (plausible for variables such as finger ridge count).

(ii) MAR -- missing at random, i.e. non-response is random, but the probability of non-response may vary as a function of observed trait values (or underlying latent variables).

Suppose we have a 5-level variable with the following probabilities of missing data at subsequent follow-up:

Trait Value    Probability of missing data
     1                  60%
     2                  20%
     3                  35%
     4                  13%
     5                  50%

These data are certainly not MCAR, but they do meet the definition of MAR.
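To see why the 5-level example is not MCAR, it helps to work out who remains at follow-up. The Python sketch below applies the missing-data probabilities from the table to a population in which the five trait levels are equally common. That equal split is purely a hypothetical assumption of ours, since the example does not give the population distribution; the function name is also ours.

```python
# Missing-data probabilities by trait value, from the example table.
P_MISSING = {1: 0.60, 2: 0.20, 3: 0.35, 4: 0.13, 5: 0.50}

# ASSUMPTION (hypothetical): the five trait levels are equally common.
POPULATION = {level: 0.20 for level in P_MISSING}


def responder_distribution(population, p_missing):
    """Return the overall response rate and the trait distribution of responders."""
    retained = {lvl: population[lvl] * (1.0 - p_missing[lvl]) for lvl in population}
    total = sum(retained.values())
    return total, {lvl: weight / total for lvl, weight in retained.items()}


response_rate, responders = responder_distribution(POPULATION, P_MISSING)
print(f"overall response rate: {response_rate:.3f}")
for level in sorted(responders):
    print(f"level {level}: population {POPULATION[level]:.3f}, "
          f"responders {responders[level]:.3f}")
```

Responders over-represent levels 2 and 4 and under-represent levels 1 and 5, so they are not a random subsample. Because the probability of missingness depends only on a trait value that was itself observed, the pattern is still consistent with the MAR definition given above.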

HIGH-RISK SAMPLING SCHEMES (VI)

Under certain conditions, missingness is said to be ignorable, i.e. we can recover estimates of the underlying population parameters without needing to adjust for differential rates of non-response.

For our two-stage high-risk sampling scheme, where we assumed random sampling at the first stage, but that only 10% of concordant unaffected pairs are assessed at the second stage, the stage-two data are MAR. Provided that we use the Ordinal option in MX (or the Raw data option, for continuous variables), and analyze all pairs observed at stage one, we can recover correct estimates of population prevalence and of genetic and environmental parameters.
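The intuition behind ignorability here is that follow-up depends only on wave one status, so the distribution of wave two status given wave one status is the same among followed-up pairs as in the full sample. The Python sketch below illustrates this with the MZ cell counts of the simulated data: it combines wave one pattern frequencies from all pairs with conditional wave two rates from followed-up pairs only. This is a simple proportion-based illustration under our own naming, not the likelihood that Mx actually maximizes.

```python
# MZ_FULL counts keyed by (A wave 1, A wave 2, B wave 1, B wave 2).
MZ_FULL = {
    (0, 0, 0, 0): 560.0,  (0, 0, 0, 1): 33.6,
    (0, 0, 1, 0): 33.6,   (0, 0, 1, 1): 60.35,
    (0, 1, 0, 0): 33.6,   (0, 1, 0, 1): 5.8,
    (0, 1, 1, 0): 5.8,    (0, 1, 1, 1): 17.25,
    (1, 0, 0, 0): 33.6,   (1, 0, 0, 1): 5.8,
    (1, 0, 1, 0): 5.8,    (1, 0, 1, 1): 17.25,
    (1, 1, 0, 0): 60.35,  (1, 1, 0, 1): 17.25,
    (1, 1, 1, 0): 17.25,  (1, 1, 1, 1): 92.7,
}

# Pairs with wave two data: 10% of wave-one concordant unaffected pairs.
FOLLOWED = {
    key: (n * 0.10 if (key[0], key[2]) == (0, 0) else n)
    for key, n in MZ_FULL.items()
}


def naive_prevalence(followed):
    """Twin A wave two prevalence using followed-up pairs only."""
    affected = sum(n for key, n in followed.items() if key[1] == 1)
    return affected / sum(followed.values())


def recovered_prevalence(full, followed):
    """Weight P(A affected at wave 2 | wave-one pattern) by P(wave-one pattern).

    Wave-one pattern frequencies use every pair (wave one is always observed);
    the conditional rates use followed-up pairs only.
    """
    total_pairs = sum(full.values())
    prevalence = 0.0
    for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        n_pattern = sum(n for k, n in full.items() if (k[0], k[2]) == pattern)
        n_followed = sum(n for k, n in followed.items() if (k[0], k[2]) == pattern)
        n_affected = sum(n for k, n in followed.items()
                         if (k[0], k[2]) == pattern and k[1] == 1)
        prevalence += (n_pattern / total_pairs) * (n_affected / n_followed)
    return prevalence


naive = naive_prevalence(FOLLOWED)
recovered = recovered_prevalence(MZ_FULL, FOLLOWED)
print(f"followed-up pairs only: {naive:.3f}")
print(f"using wave one on all pairs: {recovered:.3f}")
```

Using the wave one data on every pair brings the wave two prevalence back to the simulated 25%, while the followed-up pairs alone give roughly 50%.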

If the probability of non-response is (i) determined by one or more correlated phenotypes that are not included in the analyses; or (ii) partly a function of the stage-two phenotype (such as would be the case if individuals who were unaffected at wave one but had become affected by the time of wave two were more likely not to agree to be assessed at wave two than individuals who remained unaffected throughout), missing data will be non-ignorable.

In the case where we analyzed only the wave two data, but fixed the prevalence at 25% (i.e. assuming that missingness is determined by the stage two phenotype), missingness was still strictly non-ignorable, since it was determined by wave one and not wave two phenotypic values. However, since we simulated a very high test-retest correlation between wave one and wave two data, analyzing the data as though they were MAR greatly reduced the bias in estimates of genetic and environmental parameters.

HIGH-RISK SAMPLING SCHEMES (VII)

Consider the following cases:

Example 1: We are conducting a twin study of smoking. Probability of non-response is much higher for smokers (30%) than for non-smokers (15%). We include data from singleton twins as well as complete pairs. Are missing data ignorable?

NO, since the probability of non-response is determined by the phenotype we wish to study.

Example 2: We are conducting a twin study of smoking. We know that the prevalence of smoking is 25%, both in MZ and in DZ twin pairs. We can identify every twin pair where at least one twin is a smoker, and determine the smoking status of each twin. However, we don’t have data on concordant non-smoking pairs. Are missing data ignorable?

Missing data theory provides a framework for thinking about several important classes of problems in behavior genetics:

(i) clinically ascertained samples;

(ii) cooperation or retention bias;

(iii) hierarchical or stage-dependent models of genetic and environmental influences on risk of psychopathology.

ASCERTAINED SAMPLES

Suppose we identify twins from a treatment series, and then assess the twins and their cotwins. Typically, we will ascertain no more than one twin from each pair (‘single ascertainment’). If we have an estimate of the prevalence of the disorder we are studying in the population from which they are drawn (e.g. treated alcohol dependence), provided that that prevalence estimate applies equally to MZ and to DZ pairs, then we can derive estimates of genetic and environmental parameters.

If we have “double proband” pairs, where each twin independently has come to clinical attention, we will count the pair twice, once with twin A and once with twin B as the proband.

For example, Gurling et al. (1984) reported that they failed to find significant heritability of alcohol dependence based on data from a small twin series, in which the number of unaffected and affected cotwins of alcohol dependent probands was as follows:

              Unaffected    Affected
MZ pairs:         10            5       (mzmasc.rec)
DZ pairs:         14            6       (dzmasc.rec)

Suppose the prevalence of treated alcohol dependence is 10%. We can fit an ACE model, fixing the threshold for MZ and DZ pairs to 1.28 -- see univariate_ascertained.mx.

ASCERTAINED SAMPLES (II)

For the Gurling data, assuming 10% prevalence, we estimate:

VA = 11.0% (95%CI 0-30.3%)
VC = 40.4% (95%CI 0-71.0%)
VE = 48.6% (95%CI 18.0-78.7%)

Of course, the assumption that one in ten British males receives inpatient treatment for alcoholism may be a little strong. Suppose the prevalence of treated alcoholism is only 2.5% (corresponding to a threshold value of 1.96). Then we would obtain:

VA = 7.5% (95%CI 0.0-76.3%)
VC = 62.8% (95%CI 4.1-82.9%)
VE = 29.8% (95%CI 10.4-51.3%)

In other words, with these data we cannot reject the hypothesis of 75% heritability of alcohol dependence.
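As a descriptive companion to the model fits, the Gurling cotwin counts can be turned into concordance rates and recurrence risk ratios (concordance divided by prevalence, as defined in Part One). The Python sketch below does this for the two assumed prevalences, 10% and 2.5%. Treating the proportion of affected cotwins as the probandwise concordance assumes single ascertainment; the names used are ours.

```python
# Cotwins of alcohol dependent probands, from the Gurling et al. table.
COTWINS = {
    "MZ": {"unaffected": 10, "affected": 5},
    "DZ": {"unaffected": 14, "affected": 6},
}

# Prevalences of treated alcohol dependence assumed in the text.
ASSUMED_PREVALENCES = (0.10, 0.025)


def cotwin_concordance(counts):
    """Proportion of probands' cotwins who are themselves affected."""
    return counts["affected"] / (counts["affected"] + counts["unaffected"])


def recurrence_risk_ratio(concordance, prevalence):
    """Risk to a proband's cotwin relative to the population prevalence."""
    return concordance / prevalence


concordance = {zyg: cotwin_concordance(c) for zyg, c in COTWINS.items()}
for prevalence in ASSUMED_PREVALENCES:
    for zyg, rate in concordance.items():
        ratio = recurrence_risk_ratio(rate, prevalence)
        print(f"prevalence {prevalence:.3f}  {zyg}: "
              f"concordance {rate:.3f}, risk ratio {ratio:.1f}")
```

The MZ and DZ concordances are close to each other at either prevalence, and with only 15 and 20 pairs they are imprecise, which fits the wide confidence intervals reported above.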
