MMAP Cheat Sheet
This Cheat Sheet is designed to facilitate creation of shell scripts for running MMAP.
Example soft links (shell variables) for the MMAP program and its input and output files are shown below. Example commands and options using these soft links are given in subsequent sections. Thus, you can copy the soft links into a shell script, adjust the definitions, and then copy/paste the desired commands.
mmap=/data4/datasets/mmap/mmap # path to mmap program
snps=marker_set.csv # single column with list of markers (use SNPNAMEs or RSNUMs)
subjects=subject_set.csv # single column with list of subject ids (EGO numbers)
genoSxMbin=genotypefilename.SxM.bin # mmap binary genotype file with Subject x Marker configuration
genoMxSbin=genotypefilename.MxS.bin # mmap binary genotype file with Marker x Subject configuration
genoMxScsv=genotypefilename.MxS.csv # mmap csv genotype file with Marker x Subject configuration
freqbin=genotypefilename.FREQ.mmap.bin # mmap binary allele frequency file
freqcsv=genotypefilename.FREQ.mmap.csv # mmap csv allele frequency file
output_root=myfile # filename root for plink-like file formats (.ped, .map, .info)
Creating a marker and/or subject reduced binary genotype file. The outfile is the new binary genotype file. This command may be applied to MxS binary files (resulting in an MxS output file) or to SxM formats (resulting in SxM output file).
$mmap --write_reduced_genotype_binary --binary_input_filename $input_genobin --binary_output_filename $output_genobin [marker and subject options]
--autosome # Extract the autosomal SNPs
--chromosome <numbers> # Extract SNPs on chromosomes in <numbers> e.g. --chromosome 1 9 22 X Y XY MT
--genomic_region <chr> <start bp> <stop bp> # Extract SNPs in the specified genomic region, e.g. --genomic_region 1 120000 129999
--marker_set <file> # Extract markers in <file>
--subject_set <file> # Extract subjects in <file>
--include_duplicate_markers # Use this option with --write_reduced_genotype_binary or --marker_by_subject_mmap2csv
# to ensure you get all desired markers when duplicate markers with the
# same SNPNAME are in the genotype file.
subject_set files: Use one column, no header, with a list of subject ids.
marker_set files: Use one column, no header, with a list of EITHER SNPNAMEs or RSNUMs.
(The file type for one-column files can be .txt or .csv.)
A one-column marker_set file does not need a header, and it can mix SNPNAMEs and RSNUMs from line to line. For each line, MMAP first looks for a matching SNPNAME in the genotype file and, if none is found, looks for a matching RSNUM. Thus, if the genotype file is populated with good RSNUM data, you can use rs numbers in the marker_set file. Genotype files always have SNPNAMEs, so SNPNAMEs are the safest choice.
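As a minimal sketch of the file layout described above, the snippet below writes a one-column, header-free marker_set file and checks that every line holds exactly one field. The marker names and the temporary directory are illustrative assumptions, not values from a real genotype file.

```shell
# Hypothetical marker names, mixing SNPNAMEs and an RSNUM for illustration
workdir=$(mktemp -d)
snps="$workdir/marker_set.csv"
printf '%s\n' Exm654050 rs123456 Exm654042 > "$snps"

# Sanity check: each line must hold exactly one field (no commas, no blanks)
bad_lines=$(awk 'NF != 1 || /,/ { n++ } END { print n + 0 }' "$snps")
n_markers=$(awk 'END { print NR }' "$snps")
echo "markers: $n_markers, malformed lines: $bad_lines"
```

A non-zero count of malformed lines usually means a header row, a stray second column, or an empty line slipped into the file.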
If chromosome-specific genotype files are created, they can then be combined into a single MMAP binary.
$mmap --combine_binary_genotype_files <file1> <file2> … <fileN> --binary_output_filename <file>
In the above command, the input and output files are MMAP marker-by-subject binary genotype files.
RECOMMENDATION: To reduce the size of the combined binary genotype file, run the allele frequency option once the chromosome-specific files are created. The output will contain the minor allele frequency and imputation quality score, which can be used to extract a marker set based on a minor allele frequency and/or imputation quality threshold. This marker set can then be used to create reduced binary genotype files before combining them into the full file.
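The filtering step in this recommendation could be sketched with awk as shown here. The column headers (SNPNAME, MAF, INFO), the thresholds, and the toy csv are assumptions for illustration; check the header of your own allele frequency csv and adjust the three `*_col` variables to match.

```shell
workdir=$(mktemp -d)
freqcsv="$workdir/chr1.FREQ.mmap.csv"
# Toy stand-in for an allele frequency csv; real column names may differ
cat > "$freqcsv" <<'EOF'
SNPNAME,MAF,INFO
snpA,0.004,0.95
snpB,0.120,0.25
snpC,0.310,0.88
EOF

id_col=SNPNAME; maf_col=MAF; info_col=INFO   # assumed header names
min_maf=0.01; min_info=0.3                   # example thresholds
snps="$workdir/marker_set.csv"

# Locate columns by header name, then keep markers passing both thresholds
awk -F',' -v id="$id_col" -v maf="$maf_col" -v info="$info_col" \
    -v min_maf="$min_maf" -v min_info="$min_info" '
NR == 1 {
    for (i = 1; i <= NF; i++) col[$i] = i
    if (!(id in col) || !(maf in col) || !(info in col)) {
        print "missing column" > "/dev/stderr"; exit 1
    }
    next
}
($(col[maf]) + 0 >= min_maf + 0) && ($(col[info]) + 0 >= min_info + 0) {
    print $(col[id])
}' "$freqcsv" > "$snps"
cat "$snps"
```

The resulting one-column file can be passed to --marker_set when writing the reduced binary genotype files.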
Transpose a binary genotype file from marker-by-subject (MxS) to subject-by-marker (SxM), or from SxM to MxS. Transposing twice returns the original file.
MxS format is useful for GWA, computation of LD matrices and allele frequency calculations.
SxM format is useful for computation of genetic covariance matrices and haplotype analysis.
$mmap --transpose_binary_genotype_file --binary_input_filename $genoMxSbin --binary_output_filename $genoSxMbin
The infile must be in MxS format. The outfile contains the results. Marker and subject options are valid with this command.
$mmap --marker_by_subject_mmap2csv --binary_input_filename $genoMxSbin --csv_output_filename $genoMxScsv
--include_duplicate_markers
Use this option with --write_reduced_genotype_binary and --marker_by_subject_mmap2csv to ensure you get all desired markers when duplicate markers with the same SNPNAME are in the genotype file.
For a sparse input file, use the following syntax. The marker_set option is valid; subject_set support is planned.
$mmap --mmap_sparse2csv --binary_input_filename $genoMxSbin --csv_output_filename $genoMxScsv
First create a binary allele frequency file, then convert it to csv.
The allele frequency files will contain the allele frequency for each SNP.
$mmap --write_binary_allele_frequency_file --binary_input_filename $genoMxSbin --binary_output_filename $freqbin
$mmap --allele_freq_binary2csv --binary_input_filename $freqbin --csv_output_filename $freqcsv
Alternatively, extract allele frequencies directly from a sparse binary:
$mmap --mmap_sparse2csv_allele_frequency --binary_input_filename <sparse> --csv_output_filename <csv>
Versions prior to 2017_03_06 used:
$mmap --mmap_vcf2csv_allele_frequency (even though input was a sparse file)
Creates .ped and .info files for input to the Haploview program.
$mmap --subject_by_marker_mmap2haploview --binary_input_filename $genoSxMbin --marker_set $snps --haploview_output_filename $output_root
OPTION: --use_snpname
will use the SNPNAME for the variant id in the haploview dataset (.info file). Default is to use RSNUM for the variant id.
mmap=/data/datasets/mmap/mmap
ped=/data/datasets/mmap/amish.pedigree.csv
kinbin=/data/datasets/mmap/amish.relationship.bin
genobin=/data/datasets/markers/exmChipAdj3836.MxS.bin
pheno=/data/jperry/phenotypes/some_phenotypes.csv
covariates="sex age Exm654050 Exm654042"
suffix=exmChip
stdoutfile=z.mmap.stdout
# Clear the stdout file
cat /dev/null > $stdoutfile
# Run the regression
for trait in `cat TRAIT_LIST.txt`; # TRAIT_LIST.txt is a one-column file of trait names
# Note the BACKTICK characters (they are NOT single quotes!)
do
echo "$trait"
$mmap --ped $ped --model add --read_binary_covariance_file $kinbin \
--phenotype_filename $pheno --binary_genotype_filename $genobin \
--covariates $covariates --trait $trait --file_suffix $suffix \
--binary_covariate_filename $genobin \
--marker_set $snps \
--subject_set $subjects \
--min_minor_allele_frequency 0.02 \
>> $stdoutfile # 0.02 = 2%; use --min_minor_allele_frequency ONLY for standard genotype files (see "imputation files" below)
done
exit
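A variant of the trait loop that reads TRAIT_LIST.txt line by line, instead of using backticks, might look like the sketch below. It is a dry run: `mmap` is set to `echo mmap` so the commands are only printed, the trait names are hypothetical, and the option list is shortened to keep the example small.

```shell
workdir=$(mktemp -d)
# Hypothetical trait list, including a blank line to show it is skipped
printf '%s\n' BMI HDL "" LDL > "$workdir/TRAIT_LIST.txt"

mmap="echo mmap"   # dry-run stand-in; set to the real mmap path to run the analysis
suffix=exmChip
stdoutfile="$workdir/z.mmap.stdout"
: > "$stdoutfile"  # clear the stdout file

while IFS= read -r trait; do
    [ -n "$trait" ] || continue   # skip blank lines
    echo "$trait"
    # Add the remaining options (--ped, --phenotype_filename, ...) as needed.
    # </dev/null keeps the command from consuming the trait list on stdin.
    $mmap --trait "$trait" --file_suffix "$suffix" < /dev/null >> "$stdoutfile"
done < "$workdir/TRAIT_LIST.txt"
```

Reading with `while read` avoids word splitting surprises and makes it easy to skip empty lines in the trait list.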
# Other allele frequency and allele count options:
--min_mac 10 <== minimum minor allele count
--max_mac 50 <== maximum minor allele count
--max_minor_allele_frequency 0.05 <== 5% maximum allele frequency
$mmap --ped $ped --model add --read_binary_covariance_file $kinbin \
--phenotype_filename $pheno --binary_genotype_filename $genobin \
--covariates $covariates --trait $trait --file_suffix $suffix \
--marker_set $snps \
--subject_set $subjects \
--output_marker_attribute INFO \
--min_imputation_quality 0.3 \
--chromosome 1 \
--min_dosage 0.02 \
--all_output \
>> $stdoutfile
# --output_marker_attribute INFO <== Add INFO column to the ...mle.pval.csv file
# --min_imputation_quality 0.3 <== minimum value for INFO
# --chromosome 1 <== no leading zero on single-digit chromosome numbers
# --min_dosage 0.02 <== for imputation genotype files (equiv. to --min_minor_allele_frequency)
--min_dosage and --min_minor_allele_frequency: The “ideal” is to apply the limit based on the frequency for the subject population actually included in a model run. However, the frequency for all subjects in the genotype file may be used instead.
--max_h2 0.98 <== sets limit on h2 (if h2 goes to 1.0, the p-values can look odd)
--snp_block_size 5000 \ <== may not be needed (check with Jeff)
If some covariates are not in your phenotype file, use one or more additional covariate files.
MMAP will look first in the phenotype_filename, then in the covariate_filename list (one or more files), and then in the binary_covariate_filename list (one or more files).
If you want to use SNPs as covariates, add the SNP name to the --covariates list and then include the option --binary_covariate_filename <list of one or more genotype binary files>.
A binary covariate file can be the same file used for --binary_genotype_filename.
Examples:
covarFile1=/data/jperry/phenotypes/more_covariates.csv
covarFile2=/data/jperry/phenotypes/other_covariates.csv
--covariate_filename $covarFile1 $covarFile2
--binary_covariate_filename $genobin
The subject ids are expected to be in the first column of the phenotype_filename and covariate_filename files. If this condition is met, the column header for the subject ids does not need to be specified. If the subject ids are not in the first column, you may specify the column header as shown below. When specified, the column header must be used in phenotype_filename AND IN ALL covariate_filename files.
--phenotype_id EGO
# column header for subject ids in phenotype_filename and covariate_filename files.
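One way to decide whether --phenotype_id is needed is to compare the first header field of the phenotype file with the subject id header, as in this sketch. It assumes a comma-delimited file with a header row; the toy file contents are made up for illustration.

```shell
workdir=$(mktemp -d)
pheno="$workdir/some_phenotypes.csv"
# Toy phenotype file in which the subject id (EGO) is NOT the first column
printf '%s\n' 'BMI,EGO,age' '24.1,1001,52' > "$pheno"

id_header=EGO
first_header=$(head -n 1 "$pheno" | cut -d',' -f1)
if [ "$first_header" = "$id_header" ]; then
    id_args=""                            # ids already in column 1
else
    id_args="--phenotype_id $id_header"   # tell MMAP which column holds the ids
fi
printf 'extra options: %s\n' "$id_args"
```

The same check would need to be repeated for each covariate_filename file, since the header must be present in all of them.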
Gene x Environment interactions (actually, SNP x covariate interactions) are available with MMAP.
This option is valid ONLY if you are using --binary_genotype_filename, so that MMAP is looping over a list of SNPs (the SNPs in the genotype file, subsetted by the optional marker_set).
This option creates an additional covariate, SNP*covariate (SNP*TREATMENT in the example below), where SNP is the SNP from the list of SNPs being looped over.
The covariate can be any item in the phenotype file or in a covariate file, or a SNP from a binary covariate file. This item does NOT have to be in the covariate list identified with --covariates (but typically would be).
--gxe_interaction TREATMENT (gives additional covariate: SNP*TREATMENT)
NOTE: There can be only one GxE term in the model. Typically, you would have TREATMENT in the list of covariates in addition to the --gxe_interaction term.
You may also specify covariate x covariate interactions with the --interactions option.
The items do NOT have to be in the covariate list identified with --covariates (but typically would be).
Each item can be any item in the phenotype file or in a covariate file, or a SNP from a binary covariate file.
--interactions age*sex age*sex*BMI gives 2 additional covariates: age*sex age*sex*BMI
--interactions age*rs123456 gives additional covariate: age*rs123456 where rs123456 is a specific SNP
rs123456 could be in the pheno file or a covarFile or it can be in a
binary_covariate_filename which might be the same as the
binary_genotype_filename or could be a different binary file
If you wanted to include every possible combination of 4 covariates (singles, doubles, triples, and the quadruple), specify them as shown below. If you didn’t need the singles, you could leave out the --covariates option.
--covariates age sex BMI rs123
--interactions age*sex age*BMI age*rs123 sex*BMI sex*rs123 BMI*rs123 age*sex*BMI age*sex*rs123 age*BMI*rs123 sex*BMI*rs123 age*sex*BMI*rs123
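Writing out every combination by hand is error-prone, so the list could instead be generated. The helper below is a hypothetical shell function (not part of MMAP) that enumerates every product of two or more covariates with a bitmask; for 4 covariates it yields 6 doubles, 4 triples, and 1 quadruple, 11 terms in total.

```shell
# Print every product of two or more of the given covariates, joined with "*"
all_interactions() {
    n=$#
    mask=1
    while [ "$mask" -lt $((1 << n)) ]; do
        term=""; count=0; i=0
        for cov in "$@"; do
            # Bit i of mask decides whether covariate i is in this term
            if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
                term="${term:+$term*}$cov"
                count=$((count + 1))
            fi
            i=$((i + 1))
        done
        [ "$count" -ge 2 ] && printf '%s\n' "$term"
        mask=$((mask + 1))
    done
}

interactions=$(all_interactions age sex BMI rs123 | tr '\n' ' ')
printf '%s\n' "--covariates age sex BMI rs123"
printf '%s\n' "--interactions $interactions"
```

Because each term contains `*`, an unquoted term on a command line can be expanded by the shell if a matching filename exists in the working directory; running `set -f` before the mmap command is one way to guard against that.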
There is general syntax to exclude data from the analysis based on fields in the phenotype/covariate file. The fields need not be used as covariates in the model itself.
examples:
--exclude_list 1EQ_SEX
--exclude_list AGE_LT${lower_age_limit} ${upper_age_limit}LT_AGE
--exclude_list AGE_LT50 70LT_AGE # exclude where AGE < 50 or 70 < AGE
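When the limits come from shell variables, the braces in `${...}` matter: without them the shell would look for a variable named `upper_age_limitLT_AGE`. A minimal sketch, using the variable names from the example above with assumed values of 50 and 70:

```shell
lower_age_limit=50
upper_age_limit=70

# Braces mark where each variable name ends, so the suffix LT_AGE is kept as text
exclude_args="--exclude_list AGE_LT${lower_age_limit} ${upper_age_limit}LT_AGE"
printf '%s\n' "$exclude_args"
```

Note that a literal limit is written without a dollar sign (AGE_LT50); the `$` is only needed when substituting a variable.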
--polygenic_adjusted_residuals <== Internally substitutes the "error residuals" for the trait and appends "_ADJ" to the trait name.
--x_male_coding_01 # treats chrX coding for males as 0 and 1 (default is 0 and 2)
--vcf2sparse --use_chr_pos_alt_ref --vcf_input_filename <vcf> --binary_output_filename <sparse>
Use --num_skip_fields 7 if we do NOT add extra columns for TYPE, GENE, AACHG:
$mmap --write_binary_genotype_file --csv_input_filename $genoMxScsv --binary_output_filename $genoMxSbin --num_skip_fields 7
Use --num_skip_fields 10 if we DO want to add 3 extra columns for TYPE, GENE, AACHG:
$mmap --write_binary_genotype_file --csv_input_filename $genoMxScsv --binary_output_filename $genoMxSbin --num_skip_fields 10 --additional_marker_attributes GENE C AACHG C TYPE C
If the genotype file has the above 3 additional attributes, you must use the option shown below (when you run the mmap analysis) to have them included in the mmap output files.
--output_marker_attribute TYPE GENE AACHG --snp_block_size 1
$mmap --transpose_binary_genotype_file --binary_input_filename $genoMxSbin --binary_output_filename $genoSxMbin
dense is the original MMAP binary format with file type …bin
bit format (…bit.bin) is 1/4th the size of a “dense” binary (…bin). When a binary_input_filename is required, both “bit” and “dense” formats can be used; MMAP determines the binary type, so the user does not need to specify it.
sparse format (…sparse.bin) is the most highly compressed format.
To convert from “dense” to “bit”:
$mmap --binary_genotype_file_dense2bit --binary_input_filename <dense> --binary_output_filename <bit>
To convert from “bit” to “dense”:
$mmap --binary_genotype_file_bit2dense --binary_input_filename <bit> --binary_output_filename <dense>
To convert from “sparse” to “dense”:
$mmap --binary_genotype_file_sparse2dense --binary_input_filename <sparse> --binary_output_filename <dense>
To convert from “sparse” to “bit”: (NOTE: the command says “dense”, but adding --use_bit_coding produces “bit” output)
$mmap --binary_genotype_file_sparse2dense --use_bit_coding --binary_input_filename <sparse> --binary_output_filename <bit>
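The four conversions above can be wrapped in a small lookup so a script picks the right flags from the source and target format names. This is a sketch only: `conversion_flags` is a hypothetical helper, the file names are placeholders, and `mmap` is set to `echo mmap` so the command is printed rather than run.

```shell
# Print the MMAP flags for converting between binary genotype formats
conversion_flags() {
    case "$1:$2" in
        dense:bit)    echo "--binary_genotype_file_dense2bit" ;;
        bit:dense)    echo "--binary_genotype_file_bit2dense" ;;
        sparse:dense) echo "--binary_genotype_file_sparse2dense" ;;
        sparse:bit)   echo "--binary_genotype_file_sparse2dense --use_bit_coding" ;;
        *)            echo "no conversion listed for $1 -> $2" >&2; return 1 ;;
    esac
}

mmap="echo mmap"   # dry-run stand-in; set to the real mmap path to convert files
flags=$(conversion_flags sparse bit) &&
    $mmap $flags --binary_input_filename in.sparse.bin --binary_output_filename out.bit.bin
```

Only the conversions listed in this section are covered; any other pair returns a non-zero status instead of guessing a flag.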
Convert from Plink binary to MMAP binary (assuming MxS in the Plink file):
$mmap --plink_bfile2mmap --swap_A1_A2 --plink_bfile $plinkBinaryFormat --binary_output_prefix $mmapFormat.MxS
By default, MMAP will set Plink A1 to the NON_CODED_ALLELE and A2 to the EFFECT_ALLELE.
However, Plink (by default) sets A1 to the allele with the lower allele frequency.
If you want the resulting MMAP file to have the Plink A1 allele as MMAP’s EFFECT_ALLELE, use --swap_A1_A2.
MMAP imports Plink binary format files into an SxM or MxS genotype binary file, depending on the Plink format, which is automatically detected.
$mmap --plink_bfile2mmap --plink_bfile <prefix> --binary_output_prefix <mmap prefix>
Converts files <prefix>.bim, <prefix>.bed, <prefix>.fam into binary genotype file <mmap prefix>.bin and MMAP pedigree <mmap prefix>.ped.csv extracted from the <prefix>.fam.
$mmap --subject_by_marker_mmap2plink --binary_input_filename $genoSxMbin --plink_output_prefix $output_root
OPTION: --use_snpname
will use the SNPNAME for the variant id in the plink dataset (.map file). Default is to use RSNUM for the variant id.
$mmap --subject_by_marker_mmap2plink --binary_input_filename <SxM binary genotype file> --plink_output_prefix <prefix>
Creates <prefix>.map and <prefix>.ped which can then be converted into Plink binary format with Plink commands.
Direct export to Plink binary format is not currently supported.
$mmap --marker_by_subject_mmap2tped --binary_input_filename <MxS binary genotype file> --plink_output_prefix <prefix>
Creates <prefix>.fam, <prefix>.bim and <prefix>.tped
$mmap --marker_by_subject_mmap2plink_dosage --binary_input_filename <MxS binary genotype file> --plink_output_filename <prefix>
NOTE: output option may change in future to be --plink_output_prefix
Creates <prefix>.fam, <prefix>.map and <prefix>.ped
plink --dosage $prefix.dose format=1 --fam $prefix.fam --map $prefix.map --write-dosage --out newOut
Eigenvector files - (to be updated soon) The eigen.bin file only depends on the subjects in the file. It is independent of trait and covariates. Create them by trait, as each trait typically has a different number of subjects. You can add any covariate to the model as long as there is no missing data for the subjects used to create the eigenvector file.
Genomic Relationship Matrix - (to be added soon)
$mmap --ped $ped --read_binary_covariance_file $kinbin --phenotype_filename $pheno --trait $trait --all_output --covariates $covariates --file_suffix $suff --subject_set $subject_set --transform_analysis_phenotype --binary_genotype_filename $genomxs
Takes the full set of options and exits once the file is created. There is NO adjustment for pedigree, which is standard. The file contains all data in the model, similar to the poly.model file.
The pedigree, covariates and genotypes are not used in the calculation except to create a squared off data set given all the data, and to provide a ready-made phenotype file for analysis.
Squared off means the data is not missing at any inputs in the model, be it pedigree, covariates and/or genotypes. When you create the transformed values, you specify the model (with pedigree, covariates, genotypes) that you plan to use the transformed values in. For example, if you use the full phenotype file of 4700 subjects and no genotype file, then you will get a transformed file with 4700 subjects. With a genotype file of 1100 subjects, then the squared off data set would have 1100 subjects. The transformed values for a given subject will likely be different if the transform is done with 4700 subjects vs 1100 subjects. If you were to take transformed data based on 4700 subjects and simply pull out the data for 1100 subjects, the histogram for those 1100 subjects might not look very normalized.
If the analysis plan calls for the inverse normal of the covariate-adjusted residuals, you would need 3 steps: run the linear regression to generate the residuals, then transform the residuals with no covariates, then run the mixed model.
The Omics Analysis Search and Information System (OASIS) is an information system for analyzing, searching and visualizing associations between phenotypes, genotypes, and other types of omics data (such as transcriptomics, metabolomics, etc.). It is designed to enable discovery by connecting to the thought processes of biological researchers in a way that allows them to search results from an initial GWAS (or other type of association study), ask follow up questions and get the answers in real-time.
OASIS accomplishes this with a web-based search system and a variety of real-time analysis tools including conditional & multi-covariates analysis, LD calculations, alternative data transformations, and customized SKAT analysis. On-demand visualizations are provided in the form of boxplots, histograms, LocusZoom & Haploview plots. The OASIS search reports contain a broad spectrum of annotation from Annovar and WGSA plus a variety of links to external resources such as gnomAD, GTEx, HaploReg, Roadmap, UCSC and NCBI. Because OASIS has a web-based user interface, an understanding of programming or the UNIX operating system is not required.
OASIS is powered by MMAP. The OASIS user interface coordinates the use of MMAP’s unique options and algorithms to provide repeated, custom computations in a fraction of the time normally required. Please visit the OASIS website for more information!
MMAP: Mixed Model Analysis for Pedigrees and Populations - Copyright © 2017