Bayz Manual

Preparing data sets for using bayz in R

Three data sets can be specified using data=, geno= and covar=.

Missing values

Missing values are coded as NA, the R standard for missing data. Records with a missing response or explanatory variable are removed. The genotype data can contain missing values, which are replaced by the mean genotype. However, NA in the data given with covar= is not replaced by the respective column mean, so any missing value in these covariates also leads to loss of the record. When working with large sets of covariates it is therefore advisable to replace missing values before the analysis.
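One possible way to replace missing covariate values before passing the table with covar= is column-mean imputation in base R. This is only a sketch: the table `covar` and its column names are made up for illustration, and mean imputation is one choice among several, not something Bayz prescribes.

```r
# Hypothetical covariate table: sample ID in the first column, covariates after it.
covar <- data.frame(
  ID = c("s1", "s2", "s3", "s4"),
  x1 = c(1.0, NA, 3.0, 5.0),
  x2 = c(10, 20, NA, 30),
  stringsAsFactors = FALSE
)

# Replace NA in every numeric covariate column by the mean of that column.
num_cols <- names(covar)[sapply(covar, is.numeric)]
for (col in num_cols) {
  is_missing <- is.na(covar[[col]])
  covar[[col]][is_missing] <- mean(covar[[col]], na.rm = TRUE)
}

# After this step no record is lost because of NA in the covariates.
stopifnot(!anyNA(covar))
```

The ID column is left untouched because it is not numeric. A column that is entirely NA would still contain missing values after this step and is better removed.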

Using geno versus covar data

The data given with covar= can be any continuous covariates, and in principle this can also handle genotypes. There are, however, some advantages to using the special "geno" table when possible: genotypes are stored 8x more efficiently, which can become crucial for larger analyses; missing values are automatically replaced by the mean genotype; and edits are applied for Minor Allele Frequency (MAF) and missing rate. None of this is done when handling covariates. The geno table can handle biallelic genotypes (coded as 0, 1, 2 for the R interface); for any other genotypes or haplotypes the covar table should be used.
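To judge whether a marker table fits the geno table, it can help to check the coding and to look at missing rate and MAF per marker beforehand. The sketch below uses a made-up genotype matrix and plain base R; it illustrates what these quantities are and does not reproduce the edits Bayz applies internally, whose thresholds are not stated here.

```r
# Hypothetical genotype matrix: rows are samples, columns are markers, coded 0/1/2.
geno <- matrix(
  c(0, 1, 2, NA,   # marker m1
    0, 0, 1, 1,    # marker m2
    2, 2, 2, 2),   # marker m3
  nrow = 4,
  dimnames = list(paste0("s", 1:4), c("m1", "m2", "m3"))
)

# The geno table expects biallelic 0/1/2 coding (NA allowed for missing).
coding_ok <- all(geno %in% c(0, 1, 2, NA))

# Fraction of missing genotypes per marker.
missing_rate <- colMeans(is.na(geno))

# Frequency of the counted allele, and the minor allele frequency derived from it.
allele_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)
```

If `coding_ok` is FALSE, the table contains values other than 0, 1, 2 or NA and would have to go through covar= instead.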

Individual / sample ID

Most Bayz-R models need an individual or sample ID in the first column of the data sets (data=, geno=, covar=). Exceptions are some simple models that do not use geno= or covar= data; see the overview of the models. IDs can be character or numeric strings of any length. They do not have to be in the same order, and they do not all need to match between the data= data and the geno= or covar= data. When samples do not all match, all samples from the data= data are used, and any non-matching samples from the geno= or covar= data are dropped. Removing a particular sample from the analysis therefore only requires removing it from the data= data.
Note: a sample that is in the data= data but not in the geno= or covar= data will in principle be kept in the analysis, but with missing values for the information from the geno= or covar= data. When using covariates, this can ultimately still lead to loss of the sample, because its explanatory variables are missing.
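The matching rules above can be checked before running a model. The following base R sketch uses made-up IDs and object names (`pheno`, `geno_ids`) to list which samples would be kept without genotype or covariate information and which would be dropped.

```r
# Hypothetical phenotype data (for data=) and the IDs found in a geno= or covar= table.
pheno <- data.frame(
  ID = c("a1", "a2", "a3", "a4"),
  y  = c(2.1, 3.4, 1.9, 2.8),
  stringsAsFactors = FALSE
)
geno_ids <- c("a3", "a1", "a5", "a2")   # order does not need to match

# In data= only: kept, but with missing geno/covar information.
only_in_data <- setdiff(pheno$ID, geno_ids)

# In geno=/covar= only: dropped from the analysis.
only_in_geno <- setdiff(geno_ids, pheno$ID)

# Removing a sample from the analysis only requires removing it from the data= data.
pheno_reduced <- pheno[pheno$ID != "a2", ]
```

Here `only_in_data` contains "a4" and `only_in_geno` contains "a5". With covariates, a sample like "a4" may still be lost because its explanatory variables are missing.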

Handling repeated observations

Bayz can efficiently handle the case where phenotypes for an ID are replicated, but genotypes or other covariates are stored and used only once per unique ID. This feature is available for some of the Bayz-R models, which have an appended 'r' (reps / repeated). See the overview of models for which ones have a special 'rep' version. The layout of the data sets is as follows: the data= data has replicated IDs on multiple lines, and replicates can occur in any order; the geno= or covar= data has unique IDs, in any order.

The rep-versions fit an extra variance component (with a name that has resid.ID in it) that accounts for additional covariance between the repeated samples. This corresponds to a Permanent Environment effect in typical animal data, or to non-additive genetic effects in some typical plant data. Note that when rep-versions of models are accidentally used on data without repeats, this additional variance is not estimable.
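The expected layout and the caveat about data without repeats can be verified with a few base R checks. The tables and column names below are invented for illustration.

```r
# Hypothetical data= table: IDs repeated over several lines, in any order.
pheno <- data.frame(
  ID = c("p2", "p1", "p2", "p3", "p1", "p2"),
  y  = c(5.1, 4.8, 5.3, 6.0, 4.6, 5.2),
  stringsAsFactors = FALSE
)

# Hypothetical geno= table: one line per unique ID, in any order.
geno <- data.frame(
  ID = c("p3", "p1", "p2"),
  m1 = c(0, 1, 2),
  m2 = c(2, 1, 0),
  stringsAsFactors = FALSE
)

# The geno= or covar= table should not contain duplicated IDs.
geno_ids_unique <- !any(duplicated(geno$ID))

# Number of records per ID in the data= table.
reps_per_id <- table(pheno$ID)

# A rep-version only makes sense if at least one ID occurs more than once.
has_repeats <- any(reps_per_id > 1)
```

If `has_repeats` is FALSE, the extra resid.ID variance cannot be estimated and a regular non-'rep' model is the appropriate choice.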

When both the data= data and the geno= or covar= data have repeated samples, use the regular non-'rep' versions, with the combined sample/rep ID as the sample ID.
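A combined sample/rep ID can be built with paste() in base R. In this sketch the column names `sample` and `rep` and the "_" separator are assumptions; any scheme works as long as it is identical in both tables and gives unique IDs.

```r
# Hypothetical tables in which both phenotypes and covariates are recorded per replicate.
pheno <- data.frame(
  sample = c("p1", "p1", "p2", "p2"),
  rep    = c(1, 2, 1, 2),
  y      = c(4.8, 4.6, 5.1, 5.3),
  stringsAsFactors = FALSE
)
covar <- data.frame(
  sample = c("p2", "p1", "p1", "p2"),
  rep    = c(2, 1, 2, 1),
  x1     = c(0.7, 0.2, 0.4, 0.9),
  stringsAsFactors = FALSE
)

# Build the combined ID the same way in both tables.
pheno$ID <- paste(pheno$sample, pheno$rep, sep = "_")
covar$ID <- paste(covar$sample, covar$rep, sep = "_")

# Put the ID in the first column, as expected for data=, geno= and covar=.
pheno <- pheno[, c("ID", "y")]
covar <- covar[, c("ID", "x1")]

# With combined IDs every line is a unique sample in both tables.
stopifnot(!any(duplicated(pheno$ID)), !any(duplicated(covar$ID)))
```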