Bayz Manual

Preparing data sets for using bayz in R

Three data sets can be specified using data=, geno= and covar=.

Missing values

Missing values are coded as NA, the R standard for missing data. A record with a missing response or a missing explanatory variable is removed. The genotype data can contain missing values, and these are replaced by the mean genotype. However, NA values in the data given with covar= are not replaced by the respective column means, so any missing value in these covariates also leads to loss of the record. When working with large sets of covariates, it is therefore advisable to replace any missing data beforehand.
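
One possible way to replace missing covariate values before passing a table to covar= is column-mean imputation in base R. This is only a sketch: the table, its column names and the choice of mean imputation are assumptions for illustration, and another replacement value may suit a given analysis better.

```r
# Hypothetical covariate table: first column is the sample ID,
# the remaining columns are continuous covariates with some NA values.
covar <- data.frame(
  id = c("s1", "s2", "s3", "s4"),
  x1 = c(1.0, NA, 3.0, 5.0),
  x2 = c(NA, 2.0, 4.0, 6.0),
  stringsAsFactors = FALSE
)

# Replace NA in every covariate column (not the ID column)
# by the mean of the non-missing values in that column.
covar_cols <- setdiff(names(covar), "id")
for (col in covar_cols) {
  col_mean <- mean(covar[[col]], na.rm = TRUE)
  covar[[col]][is.na(covar[[col]])] <- col_mean
}

# Count the missing values left in the covariates (should be 0).
n_missing <- sum(is.na(covar[covar_cols]))
```

After this step no record is lost because of NA in the covariates, which is the same treatment the geno table applies to missing genotypes.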

Using geno versus covar data

The data given with covar= can be any continuous covariates, and in principle this can also handle genotypes. There are, however, some advantages to using the special "geno" table when possible: genotypes are stored 8x more efficiently, which can become crucial for larger analyses; missing values are automatically replaced by the mean genotype; and edits are applied for Minor Allele Frequency (MAF) and missing rate. None of this is done when handling covariates. The geno table can handle biallelic genotypes (coded as 0, 1, 2 for the R interface); for any other genotypes or haplotypes the covar table should be used.
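
To decide whether a marker table fits the geno table, it can help to check the coding and look at the missing rate and MAF per marker beforehand. The base R sketch below uses a hypothetical table with invented marker names; it only previews these quantities and does not reproduce the edits or thresholds that Bayz applies itself.

```r
# Hypothetical genotype table: ID in the first column, one column per marker.
geno <- data.frame(
  id = c("s1", "s2", "s3", "s4"),
  m1 = c(0, 1, 2, NA),
  m2 = c(2, 2, 1, 0),
  m3 = c(0.5, 1, 2, 0),   # not coded 0/1/2, so not suitable for the geno table
  stringsAsFactors = FALSE
)

markers <- as.matrix(geno[, -1])

# TRUE for each marker whose non-missing values are all 0, 1 or 2.
is_biallelic <- apply(markers, 2, function(g) all(g[!is.na(g)] %in% c(0, 1, 2)))

# Fraction of missing genotypes per marker.
missing_rate <- colMeans(is.na(markers))

# Minor allele frequency for the 0/1/2-coded markers only:
# the allele frequency is half the mean genotype, MAF is the smaller of p and 1 - p.
allele_freq <- colMeans(markers[, is_biallelic, drop = FALSE], na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)
```

Markers for which is_biallelic is FALSE (here m3) would need to go in the covar table instead.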

Individual / sample ID

Most Bayz-R models need an individual or sample ID in the first column of each data set (data=, geno=, covar=). The exceptions are some simple models that do not use geno= or covar= data; see the overview of the models. IDs can be character or numeric strings of any length. They do not have to be in the same order, and they do not all need to match between the data= data and the geno= or covar= data. When samples do not all match, all samples from the data= data are used, and any non-matching samples from the geno= or covar= data are dropped. To remove a particular sample from the analysis, it is therefore enough to remove it from the data= data.
Note: a sample that is in the data= data, but not in the geno= or covar= data, will in principle be kept in the analysis, but with missing values for the information from the geno= or covar= data. When using covariates, this can ultimately still lead to loss of the sample because of missing explanatory variables for that sample.
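
The matching rules above can be previewed in base R before running a model. The two small tables below are hypothetical; the sketch only compares the ID columns and makes no Bayz calls.

```r
# Hypothetical data= and geno= tables, each with the ID in the first column.
pheno <- data.frame(
  id = c("s1", "s2", "s3"),
  y = c(10.2, 11.5, 9.8),
  stringsAsFactors = FALSE
)
geno <- data.frame(
  id = c("s3", "s1", "s9"),   # different order, one non-matching sample
  m1 = c(0, 1, 2),
  stringsAsFactors = FALSE
)

ids_data <- as.character(pheno[[1]])
ids_geno <- as.character(geno[[1]])

# In data= but not in geno=: kept, but with missing genotype information.
data_only <- setdiff(ids_data, ids_geno)

# In geno= but not in data=: dropped from the analysis.
geno_only <- setdiff(ids_geno, ids_data)

# Removing a sample from the analysis only requires removing it from data=.
pheno_reduced <- pheno[pheno$id != "s1", ]
```

Here data_only contains "s2" and geno_only contains "s9"; the order of the IDs in the two tables does not matter.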

Handling repeated observations

Repeated observations can be handled by some of the Bayz-R models, which have an appended 'r' (reps / repeated) in their name. See the overview of models for which ones have a special 'rep' version. The 'rep' versions are built to have a single record for each ID/sample in the data= data, and repeated samples in the geno= or covar= data. An extra variance component (with a name that contains resid.ID) is added to account for additional covariance between the repeated samples. This corresponds to a Permanent Environment effect in typical animal data, or to non-additive genetic effects in some typical plant data.

When both the data= data and the geno= or covar= data have repeated samples, use the regular non-'rep' versions, with sample IDs that combine the sample ID and the rep ID.
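
A combined sample/rep ID can be built in base R with paste(). The sketch below is an illustration under assumptions: the tables, the column name rep and the "_" separator are hypothetical, and any scheme that gives each sample/rep combination the same unique ID in both tables would serve.

```r
# Hypothetical tables in which both data= and geno= have repeated samples.
pheno <- data.frame(
  id = c("a", "a", "b", "b"),
  rep = c(1, 2, 1, 2),
  y = c(5.1, 5.3, 6.0, 6.2),
  stringsAsFactors = FALSE
)
geno <- data.frame(
  id = c("a", "a", "b", "b"),
  rep = c(1, 2, 1, 2),
  m1 = c(0, 0, 2, 2),
  stringsAsFactors = FALSE
)

# Build the combined ID the same way in both tables, e.g. "a_1".
make_id <- function(d) paste(d$id, d$rep, sep = "_")

# Put the combined ID in the first column and drop the separate id/rep columns.
pheno2 <- data.frame(id = make_id(pheno), y = pheno$y, stringsAsFactors = FALSE)
geno2 <- data.frame(id = make_id(geno), m1 = geno$m1, stringsAsFactors = FALSE)

# The combined IDs must be unique within each table.
stopifnot(anyDuplicated(pheno2$id) == 0, anyDuplicated(geno2$id) == 0)
```

The tables pheno2 and geno2 then have one record per combined ID, as the regular non-'rep' versions expect.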