Bayz Manual

Data Reading Examples


On this page:

Basic Example

File names with path and spaces

Setting missing value

Using input file with header for field names

Merging data

Filtering (selecting) data

Pedigree data

Basic Example

A basic data reading statement, which will use the default NA for missing values:

     data
     file=ageweight.txt
     mouseID sex age weight
This reads 4 columns from the file "ageweight.txt" and assigns the names "mouseID", "sex", "age" and "weight" to them. In this basic example the file has no header line, and the values in the file must be separated by spaces or tabs. The input file could, for instance, look like:

Basic example input file

     ZE5939  M  44  91.7
     ID7708  F  33  68.2
     DF9401  F  40  74.9
     OG0605  M  34  75.5
     TR1140  M  33  90.9
     OH6564  F  33  64.2
     IG0596  F  47  67.6
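
Because the number of names in the script must match the number of columns in the file, it can help to check the file before running bayz. The following is a minimal Python sketch, not part of bayz; the function name `check_columns` and the third sample row (with a value left out on purpose) are made up for illustration.

```python
def check_columns(lines, n_fields):
    """Return (line_number, field_count) for each non-empty line with a wrong count."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if fields and len(fields) != n_fields:
            problems.append((number, len(fields)))
    return problems


# Two valid rows (space- and tab-separated) and one row that lacks the weight.
sample = [
    "ZE5939 M 44 91.7",
    "ID7708\tF\t33\t68.2",
    "DF9401 F 40",
]
print(check_columns(sample, 4))  # [(3, 3)]
```

To check a real file, pass the open file object instead of the list, for example `check_columns(open("ageweight.txt"), 4)`.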

File name with path and spaces

File names with a full or relative path work on Linux and macOS, and probably also on Windows. File names that contain a space cannot be used: bayz stops reading at the first space and does not get the complete name. Some examples:

     file=/usr/home/xxxx/datamouse/ageweight.txt
     file=../datamouse/ageweight.txt
     # not working because of spaces in the name:
     file=age and weight.txt
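
One way around this limitation is to give bayz a copy of the file under a name without spaces. The sketch below is a Python helper written for this purpose, not a bayz feature; the function name `copy_without_spaces` and the choice of underscores are assumptions. It only fixes the file name itself, so a directory name with spaces would still need to be avoided.

```python
import shutil
import tempfile
from pathlib import Path


def copy_without_spaces(path):
    """Copy `path` next to itself, with spaces in the file name replaced by underscores."""
    source = Path(path)
    target = source.with_name(source.name.replace(" ", "_"))
    if target != source:
        shutil.copyfile(source, target)
    return target


# Demonstration in a temporary directory, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "age and weight.txt"
    original.write_text("ZE5939 M 44 91.7\n")
    fixed = copy_without_spaces(original)
    print(fixed.name)  # age_and_weight.txt
```

The script would then refer to the copy, for example `file=age_and_weight.txt`.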

Setting missing value other than NA

     data
     file=ageweight.txt missing=-999
     mouseID sex age weight
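
Before choosing a missing value code it can be useful to see how often that code occurs in each column. The following Python sketch does this outside bayz. It assumes the code is compared as text, which may differ from how bayz compares values, and the `-999` entries in the sample rows are invented for illustration.

```python
def count_missing(lines, field_names, missing="NA"):
    """Count, per field, how many values are equal to the missing value code."""
    counts = {name: 0 for name in field_names}
    for line in lines:
        for name, value in zip(field_names, line.split()):
            if value == missing:
                counts[name] += 1
    return counts


# Sample rows with invented -999 entries.
rows = [
    "ZE5939 M 44 91.7",
    "ID7708 F -999 68.2",
    "DF9401 F 40 -999",
    "OG0605 M -999 75.5",
]
missing_counts = count_missing(rows, ["mouseID", "sex", "age", "weight"], missing="-999")
print(missing_counts)  # {'mouseID': 0, 'sex': 0, 'age': 2, 'weight': 1}
```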

Field names (header) in the input file

The field names that are in the bayz script for the Basic Example can also be in the data input file. This allows for use of files with a "header" line. This is specified as:

     data
     file=ageweight.txt -header
Now the input file could look like:

Input file with field names (header line)

     mouseID sex age weight
     ZE5939  M   44  91.7
     ID7708  F   33  68.2
     DF9401  F   40  74.9
     OG0605  M   34  75.5
     TR1140  M   33  90.9
     OH6564  F   33  64.2
     IG0596  F   47  67.6

Note: in version 2.5 the use of a header line does not work (well) when reading so-called "blocks-fields" (see...).

Merging data

Indicate the mergekey field after the data statement, and bayz merges the information from all files that use the same mergekey. After merging, models can be specified that include response and explanatory variables from different files. The mergekey field must match one of the fields in the field list (not necessarily the first field), and all values in that field must be unique.
     data mouseID
     file=weight.txt
     mouseID sex weight

     data mouseID
     file=dietinfo.txt
     mouseID diet

For example, the following two files are merged as shown, which makes it possible to fit a model with weight explained by diet (the header line with field names is not in the input file):

weight.txt

     mouseID weight
     SF1234  20.5
     PX5698  23.6
     BN9976  18.4

diet.txt

     mouseID diet
     BN9976  highcarb
     PX5698  control
     KA4533  control

merged

     mouseID weight diet
     SF1234  20.5   NA
     PX5698  23.6   control
     BN9976  18.4   highcarb
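
The result of the example merge can be reproduced with a few lines of Python. This is only a sketch of the behaviour visible in the example (records of the first file are kept, missing information becomes NA, and duplicate mergekey values are rejected). It is not bayz's implementation, and the names `read_table` and `merge` are made up.

```python
def read_table(lines, field_names, key):
    """Map each key value to its record; the mergekey must be unique."""
    table = {}
    for line in lines:
        record = dict(zip(field_names, line.split()))
        if record[key] in table:
            raise ValueError(f"duplicate {key}: {record[key]}")
        table[record[key]] = record
    return table


def merge(first, other, other_fields, key):
    """Keep every record of `first`; add the fields of `other`, or NA when absent."""
    merged = []
    for key_value, record in first.items():
        extra = other.get(key_value, {})
        row = dict(record)
        for name in other_fields:
            if name != key:
                row[name] = extra.get(name, "NA")
        merged.append(row)
    return merged


weight = read_table(["SF1234 20.5", "PX5698 23.6", "BN9976 18.4"],
                    ["mouseID", "weight"], "mouseID")
diet = read_table(["BN9976 highcarb", "PX5698 control", "KA4533 control"],
                  ["mouseID", "diet"], "mouseID")
merged_rows = merge(weight, diet, ["mouseID", "diet"], "mouseID")
for row in merged_rows:
    print(row["mouseID"], row["weight"], row["diet"])
# SF1234 20.5 NA
# PX5698 23.6 control
# BN9976 18.4 highcarb
```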

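Notes about merging: as the example shows, the records of the first file are kept (SF1234 gets NA for diet because it does not occur in diet.txt), and records that occur only in a later file (KA4533) are dropped.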

Filtering (selecting) data

The merging feature can be used to filter (select) data by making the first file in a merge sequence a file that contains only the list of IDs to keep, as in the following example.

     data mouseID
     file=keepids.txt
     mouseID

     data mouseID
     file=ageweight.txt
     mouseID sex age weight
  

See the example below for the effect of this kind of merging (header line with field names is not in the input files):

keepids.txt

     mouseID
     DF9401
     OG0605
     ZE5939
     IG0596

ageweight.txt

     mouseID sex age weight
     ZE5939  M   44  91.7
     ID7708  F   33  68.2
     DF9401  F   40  74.9
     OG0605  M   34  75.5
     TR1140  M   33  90.9
     OH6564  F   33  64.2
     IG0596  F   47  67.6

merged (selected based on keepids list)

     mouseID sex age weight
     DF9401  F   40  74.9
     OG0605  M   34  75.5
     ZE5939  M   44  91.7
     IG0596  F   47  67.6
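
The list of IDs to keep has to be produced outside bayz. As a sketch, the following Python code builds such a list from the ageweight data using a selection rule. The rule itself (females aged 40 or more) and the function name `ids_to_keep` are only illustrations.

```python
def ids_to_keep(lines, field_names, keep):
    """Return the mouseID of every record for which keep(record) is true."""
    selected = []
    for line in lines:
        record = dict(zip(field_names, line.split()))
        if keep(record):
            selected.append(record["mouseID"])
    return selected


ageweight = [
    "ZE5939 M 44 91.7",
    "ID7708 F 33 68.2",
    "DF9401 F 40 74.9",
    "OG0605 M 34 75.5",
    "TR1140 M 33 90.9",
    "OH6564 F 33 64.2",
    "IG0596 F 47 67.6",
]
fields = ["mouseID", "sex", "age", "weight"]

# Hypothetical selection rule: females aged 40 or more.
keep_ids = ids_to_keep(ageweight, fields,
                       lambda r: r["sex"] == "F" and int(r["age"]) >= 40)
print("\n".join(keep_ids))
# DF9401
# IG0596
```

Writing the printed lines to keepids.txt gives a file that can be used as the first file in the merge sequence.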

Using pedigree data

With one phenotype per ID (sample), use the same ID name at the data statement for the phenotype file and at the data statement for the pedigree file. With the special flag 'ped' at the data statement for the pedigree file, bayz does not merge these files as in the regular merging procedure, but keeps all records in both files. When phenotypes are repeated, a hierarchical model is needed. In the pedigree data the ID must be in the first column (this is not required for other files), and the parents must be in columns 2 and 3 (any names can be given to the parent columns). More columns can follow; these are ignored.

     data mouseID
     file=ageweight.txt
     mouseID sex age weight

     data mouseID ped
     file=mouse.ped
     mouseID father mother

By default bayz ignores inbreeding when constructing the A⁻¹ (inverse relationship) matrix. Add the -inbred flag after the file name to make bayz compute and include inbreeding:

     data mouseID ped
     file=mouse.ped -inbred
     mouseID father mother
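
The column layout described above can be checked before running bayz. The following Python sketch looks for three simple problems: fewer than 3 columns, an ID that occurs more than once, and an ID listed as its own parent. That bayz requires unique pedigree IDs is an assumption here, and the pedigree records in the example are invented.

```python
def check_pedigree(lines):
    """Return a list of problems found in pedigree records (ID, parent, parent, ...)."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 3:
            problems.append(f"line {number}: fewer than 3 columns")
            continue
        animal, father, mother = fields[:3]  # further columns are not inspected
        if animal in seen:
            problems.append(f"line {number}: duplicate ID {animal}")
        seen.add(animal)
        if animal in (father, mother):
            problems.append(f"line {number}: {animal} is its own parent")
    return problems


# Invented records: line 3 repeats an ID, line 4 lists the animal as its own father.
pedigree = [
    "ZE5939 TR1140 OH6564",
    "DF9401 TR1140 OH6564 extra-column",
    "ZE5939 OG0605 ID7708",
    "IG0596 IG0596 ID7708",
]
pedigree_problems = check_pedigree(pedigree)
print(pedigree_problems)
# ['line 3: duplicate ID ZE5939', 'line 4: IG0596 is its own parent']
```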

Reading genotype data

These examples assume that one phenotype is observed per ID (sample) and that the genotypes are available in a separate file. The standard bayz merge is used to combine phenotype and genotype data. When repeated phenotypes are available per ID, either use a hierarchical model or merge the phenotype and genotype data beforehand (copying the genotype data for every repeated ID in the phenotype data). The genotype data is defined as a block-field (a field name followed by two square brackets), which makes it possible to include all genotypes in the model with a single model term (add or dom).
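
For the repeated-phenotype case, merging beforehand means writing one combined file in which the genotypes of an ID are repeated on each of its phenotype records. A minimal Python sketch of that step follows. The function name `expand_genotypes`, the sample values, and the choice to skip phenotype records without genotypes are all assumptions.

```python
def expand_genotypes(phenotype_lines, genotype_lines):
    """Append the genotype fields of each ID to every phenotype record of that ID."""
    genotypes = {}
    for line in genotype_lines:
        sample_id, *calls = line.split()
        genotypes[sample_id] = calls
    combined = []
    for line in phenotype_lines:
        fields = line.split()
        calls = genotypes.get(fields[0])
        if calls is not None:  # records without genotypes are skipped in this sketch
            combined.append(" ".join(fields + calls))
    return combined


# Invented data: cow1 has two milk records, cow2 has one.
phenotypes = ["cow1 31.2", "cow1 29.8", "cow2 27.5"]
genotypes = ["cow1 0 1 2", "cow2 2 2 0"]
combined = expand_genotypes(phenotypes, genotypes)
print("\n".join(combined))
# cow1 31.2 0 1 2
# cow1 29.8 0 1 2
# cow2 27.5 2 2 0
```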

Biallelic genotype coding

Biallelic genotype coding requires the genotypes to be coded with two alleles that must be a single character or digit, e.g. "A T", or "1 5".
   data cow
   file=milkdata.txt
   cow milk fat prot

   data cow
   file=cowgenotypes.txt
   cow geno[] !biallelic

   data geno map
   file=snpnames.txt
   geno

Biallelic coding is the format of plink 'ped' files. The following example uses plink ped and map files, where the 'ped' file has 6 additional fields and the map file has chromosome, SNP name, genetic map position and physical (base pair) map position. Note that for bayz the column of SNP names in the map file must get the same name as the genotype block-field, in this example 'geno'.

   data cow
   file=milkdata.txt
   cow milk fat prot

   data cow
   file=cowgenotypes.ped
   famid cow sire dam sex pheno geno[] !biallelic

   data geno map
   file=cowgenotypes.map
   chrom geno cmdist bpdist

Genotype coding 0, 1, 2

Use the genot012 flag to indicate that the input file contains genotypes coded as 0, 1, 2 for homozygote, heterozygote and the other homozygote. The default code for a missing genotype is NA.

   cow geno[] !genot012
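
Genotypes stored as two alleles can be recoded to 0, 1, 2 before reading them this way. The following Python sketch recodes one marker by counting copies of the rarer allele. Which allele is counted is an arbitrary choice made here, not something bayz prescribes, and the function name `to_012` and the use of `None` for a missing genotype are assumptions.

```python
from collections import Counter


def to_012(genotypes, missing="NA"):
    """Recode one marker; `genotypes` holds (allele1, allele2) pairs or None."""
    alleles = Counter(a for pair in genotypes if pair for a in pair)
    if len(alleles) > 2:
        raise ValueError("more than two alleles at this marker")
    if not alleles:
        return [missing] * len(genotypes)
    # Count copies of the rarer allele; ties are broken alphabetically.
    counted = min(alleles, key=lambda a: (alleles[a], a))
    return [missing if pair is None else str(pair.count(counted))
            for pair in genotypes]


marker = [("A", "T"), ("T", "T"), ("A", "A"), None, ("T", "A"), ("T", "T")]
codes = to_012(marker)
print(" ".join(codes))  # 1 0 2 NA 1 0
```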

Modifying MAF and missing rate edits

Bayz has automatic edits for Minor Allele Frequency at 1% and missing rate at 20%. To modify the default settings add flags on the map file:
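
To see which markers the default thresholds would affect, the two statistics can also be computed by hand. The following Python sketch does this for genotypes coded 0, 1, 2. It is not bayz's own code; the function names are made up, and whether bayz treats values exactly on a threshold as passing is not known, so the examples avoid that case.

```python
def marker_stats(codes, missing="NA"):
    """Return (minor allele frequency, missing rate) for one marker coded 0/1/2."""
    observed = [int(c) for c in codes if c != missing]
    missing_rate = (len(codes) - len(observed)) / len(codes)
    if not observed:
        return None, missing_rate
    freq = sum(observed) / (2 * len(observed))  # frequency of the counted allele
    return min(freq, 1 - freq), missing_rate


def passes(codes, min_maf=0.01, max_missing=0.20):
    """Apply the 1% MAF and 20% missing rate thresholds to one marker."""
    maf, missing_rate = marker_stats(codes)
    return maf is not None and maf >= min_maf and missing_rate <= max_missing


results = [
    passes(["0", "1", "2", "1", "0"]),    # MAF 0.4, nothing missing
    passes(["0", "0", "0", "0", "0"]),    # monomorphic, MAF 0
    passes(["0", "1", "2", "NA", "NA"]),  # 40% missing
]
print(results)  # [True, False, False]
```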



Additional information in map file

The map file can be extended with any additional information that may be used in the modeling of SNP effects. The chromosome is often in the map file, but one can insert (and use) any other grouping.

Using large numbers of (continuous) covariates