MMAP by Jeff O'Connell

    MMAP Cheat Sheet by Jim Perry

    This Cheat Sheet is designed to facilitate creation of shell scripts for running MMAP.

    Example soft links (defined as shell variables) for the MMAP program plus input and output files are shown below. Example commands and options using these soft links are given in subsequent sections. Thus, you can copy the soft links into a shell script, adjust the link definitions, and then copy/paste the desired commands.

    mmap=/data4/datasets/mmap/mmap			# path to mmap program
    snps=marker_set.csv				# single column with list of markers (use SNPNAME's or RSNUM's)
    subjects=subject_set.csv			# single column with list of subject id's (EGO numbers)
    
    genoSxMbin=genotypefilename.SxM.bin		# mmap binary genotype file with Subject x Marker configuration
    genoMxSbin=genotypefilename.MxS.bin		# mmap binary genotype file with Marker x Subject configuration
    genoMxScsv=genotypefilename.MxS.csv		# mmap csv genotype file with Marker x Subject configuration
    
    freqbin=genotypefilename.FREQ.mmap.bin  	# mmap binary allele frequency file
    freqcsv=genotypefilename.FREQ.mmap.csv  	# mmap csv allele frequency file
    
    output_root=myfile				# filename root for plink-like file formats (.ped, .map, .info)
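
    Because every later command depends on these definitions, it can help to confirm that the files exist before running anything. The helper below is a minimal sketch; check_inputs is a hypothetical name and is not part of MMAP.

```shell
# Hypothetical helper (not part of MMAP): succeeds only if every path exists.
check_inputs() {
    status=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example use near the top of a script, after the definitions above:
# check_inputs "$mmap" "$snps" "$subjects" "$genoMxSbin" || exit 1
```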
    

    Subset a genotype file

    Creates a marker- and/or subject-reduced binary genotype file. The outfile is the new binary genotype file. This command may be applied to MxS binary files (resulting in an MxS output file) or to SxM binary files (resulting in an SxM output file).
    $mmap --write_reduced_genotype_binary --binary_input_filename $input_genobin --binary_output_filename $output_genobin [marker and subject options]

    subsetting options

    --autosome			# Extract the autosomal SNPs
    --chromosome <numbers>		# Extract SNPs on chromosomes in <numbers>  e.g.  --chromosome 1 9 22 X Y XY MT
    
    --genomic_region <chr> <start bp> <stop bp>	# Extract SNPs in the genomic region(s) specified  e.g.  --genomic_region 1 120000 129999
    
    --marker_set <file>		# Extract markers in <file>
    --subject_set <file>		# Extract subjects in <file>
    
    --include_duplicate_markers     # Use this option with --write_reduced_genotype_binary or --marker_by_subject_mmap2csv
    			          to ensure you get all desired markers when duplicate markers with the 
    			          same SNPNAME are in the genotype file.
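
    As an illustration of how the subsetting options attach to the command, the sketch below wraps it in a small function. It uses only options listed above; subset_genotypes and the file names are hypothetical, and mmap is set to "echo mmap" so the command is printed rather than executed (a dry run).

```shell
mmap="echo mmap"   # dry run: prints the command; point this at the real program to execute

# Hypothetical wrapper: $1 = input binary, $2 = output binary,
# remaining arguments = marker and subject options from the list above.
subset_genotypes() {
    src=$1
    dst=$2
    shift 2
    $mmap --write_reduced_genotype_binary \
        --binary_input_filename "$src" \
        --binary_output_filename "$dst" "$@"
}

# Keep chromosomes 1 and 9 for the subjects listed in subject_set.csv
subset_genotypes genotypefilename.MxS.bin chr1_9.MxS.bin \
    --chromosome 1 9 --subject_set subject_set.csv
```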
    

    subject_set files: Use one column, no header, with a list of subject id’s.
    marker_set files: Use one column, no header, with a list of EITHER SNPNAMEs or RSNUMs.
        ( The file type for one-column files can be .txt or .csv )

    When marker_set files are only one column, then they don’t need a header. The one column can be SNPNAMEs or RSNUMs or even SNPNAMEs for some lines and RSNUMs for other lines. MMAP will read the line, look for a match with SNPNAME in the genotype file and, if not found, will look for a match with RSNUM in the genotype file. MMAP seems to do everything possible to find the SNP. Thus, if the genotype file is populated with good RSNUM data, then you can use rsNumbers in the marker_set file. Genotype files always have SNPNAMEs, so SNPNAMEs are the best bet.

    marker_set files can also be 2 columns (for non-sparse genotype files).
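
    A one-column marker_set file can be cut out of a larger table with standard tools. The sketch below assumes a comma-separated input with a header row and the marker names in the first column; that layout and the name make_marker_set are assumptions for illustration.

```shell
# Hypothetical helper: print a one-column, header-free marker list.
# $1 = CSV file with a header row and marker names in column 1 (assumed layout).
make_marker_set() {
    tail -n +2 "$1" |           # drop the header row
        cut -d, -f1 |           # keep the first column only
        tr -d '\r' |            # strip Windows line endings
        awk 'NF && !seen[$0]++' # skip blank lines and repeated names, keep order
}

# Example: make_marker_set annotated_markers.csv > marker_set.csv
```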


    Combine genotype files

    If chromosome-specific genotype files are created, they can then be combined into a single MMAP binary.
    $mmap --combine_binary_genotype_files <file1> <file2> … <fileN> --binary_output_filename <file>

    In the above command, the input and output files are MMAP marker-by-subject binary genotype files.
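
    The file list can be assembled in a loop. The sketch below assumes per-chromosome files named <prefix><N>.MxS.bin for chromosomes 1 to 22; that naming pattern and the function name combine_chromosomes are hypothetical. With mmap set to "echo mmap" the command is only printed.

```shell
mmap="echo mmap"   # dry run: prints the command instead of running it

# $1 = prefix of the per-chromosome files, $2 = combined output file
combine_chromosomes() {
    prefix=$1
    combined=$2
    set --                               # reuse the positional parameters as the file list
    chr=1
    while [ "$chr" -le 22 ]; do
        file="${prefix}${chr}.MxS.bin"   # hypothetical naming pattern
        if [ -e "$file" ]; then
            set -- "$@" "$file"
        fi
        chr=$((chr + 1))
    done
    if [ "$#" -eq 0 ]; then
        echo "no chromosome files found for prefix $prefix" >&2
        return 1
    fi
    $mmap --combine_binary_genotype_files "$@" --binary_output_filename "$combined"
}
```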
    RECOMMENDATION: To reduce the size of the combined binary genotype file, once the chromosome specific files are created, run the allele frequency option. The output will contain the minor allele frequency and imputation quality score, which can be used to extract a marker set based on minor allele frequency and/or imputation quality threshold. This marker set can then be used to create reduced binary genotype files before combining to the full file.
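
    One way to turn the allele frequency csv into a marker set is a small awk filter. This is a sketch under assumptions: the column names are passed in as arguments because the actual header names in the allele frequency csv are not given here, so check the header of your file first. The function name filter_markers is hypothetical.

```shell
# $1 = allele frequency csv, $2 = header of the marker name column,
# $3 = header of the minor allele frequency column, $4 = minimum frequency to keep
filter_markers() {
    awk -F, -v name="$2" -v maf="$3" -v min="$4" '
        NR == 1 {
            # locate the two columns by header name
            for (i = 1; i <= NF; i++) {
                if ($i == name) n = i
                if ($i == maf)  m = i
            }
            if (!n || !m) {
                print "column not found" | "cat 1>&2"
                exit 1
            }
            next
        }
        ($m + 0) >= (min + 0) { print $n }
    ' "$1"
}

# Example (column names are assumptions):
# filter_markers "$freqcsv" SNPNAME MAF 0.01 > marker_set.csv
```

    The same pattern applies to an imputation quality threshold by passing the header of that column instead.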


    Transpose a genotype file

    Transposes the file from marker-by-subject (MxS) to subject-by-marker (SxM), or from SxM to MxS. Transposing twice returns the original file.
    MxS format is useful for GWA, computation of LD matrices and allele frequency calculations.
    SxM format is useful for computation of genetic covariance matrices and haplotype analysis.

    $mmap --transpose_binary_genotype_file --binary_input_filename $genoMxSbin --binary_output_filename $genoSxMbin


    Genotype extract as csv file

    The infile must be in MxS format. The results are written to the outfile. Marker and subject options are valid with this command.
    $mmap --marker_by_subject_mmap2csv --binary_input_filename $genoMxSbin --csv_output_filename $genoMxScsv

    --include_duplicate_markers Use this option with --write_reduced_genotype_binary and --marker_by_subject_mmap2csv to ensure you get all desired markers when duplicate markers with the SAME SNPNAME are in the genotype file.

    For a sparse input file, use the following syntax. The marker_set option (and soon the subject_set option) is valid.
    $mmap --mmap_sparse2csv --binary_input_filename $genoMxSbin --csv_output_filename $genoMxScsv


    Allele frequency extract

    First create binary allele frequency file, then convert to csv.
    The allele frequency for each SNP will be in the allele frequency files.

    $mmap --write_binary_allele_frequency_file --binary_input_filename $genoMxSbin --binary_output_filename $freqbin
    $mmap --allele_freq_binary2csv --binary_input_filename $freqbin --csv_output_filename $freqcsv
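
    The two steps can be chained so that the csv conversion runs only if the binary file was written successfully. This is a sketch: allele_freq_csv is a hypothetical wrapper, the output names follow the FREQ.mmap naming used at the top of this sheet, and mmap is set to "echo mmap" so the commands are printed rather than run.

```shell
mmap="echo mmap"   # dry run: prints each command instead of running it

# $1 = MxS binary genotype file, $2 = filename root for the frequency files
allele_freq_csv() {
    geno=$1
    root=$2
    $mmap --write_binary_allele_frequency_file --binary_input_filename "$geno" \
          --binary_output_filename "$root.FREQ.mmap.bin" &&
    $mmap --allele_freq_binary2csv --binary_input_filename "$root.FREQ.mmap.bin" \
          --csv_output_filename "$root.FREQ.mmap.csv"
}
```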

    Alternative allele frequency extract from sparse binary:
    $mmap --mmap_sparse2csv_allele_frequency --binary_input_filename <sparse> --csv_output_filename <csv>

    Versions prior to 2017_03_06 used:
    $mmap --mmap_vcf2csv_allele_frequency (even though input was a sparse file)


    Haploview extract

    Creates the .ped and .info files for input to the Haploview program.

    $mmap --subject_by_marker_mmap2haploview --binary_input_filename $genoSxMbin --marker_set $snps --haploview_output_filename $output_root

    OPTION: --use_snpname will use the SNPNAME for the variant id in the haploview dataset (.info file). Default is to use RSNUM for the variant id.


    Analysis with standard (non-imputed) genotypes

    mmap=/data/datasets/mmap/mmap
    ped=/data/datasets/mmap/amish.pedigree.csv 
    kinbin=/data/datasets/mmap/amish.relationship.bin 
    
    genobin=/data/datasets/markers/exmChipAdj3836.MxS.bin
    pheno=/data/jperry/phenotypes/some_phenotypes.csv
    
    covariates="sex age Exm654050 Exm654042"
    suffix=exmChip
    stdoutfile=z.mmap.stdout
    
    # Clear the stdout file
    cat /dev/null > $stdoutfile
    
    # Run the regression
    # NOTE: use --min_minor_allele_frequency (0.02 = 2%) ONLY for standard genotype files (see "imputation files" below)
    for trait in `cat TRAIT_LIST.txt`; 	# TRAIT_LIST.txt is a one-column file of trait names
    					# Note the BACKTICK characters (they are NOT single quotes!)
    do 
    echo "$trait"
    $mmap --ped $ped --model add --read_binary_covariance_file $kinbin \
      --phenotype_filename $pheno  --binary_genotype_filename $genobin \
      --covariates $covariates  --trait $trait  --file_suffix  $suffix \
      --binary_covariate_filename $genobin \
      --marker_set $snps \
      --subject_set $subjects \
      --min_minor_allele_frequency 0.02 \
      >> $stdoutfile
    done
    exit
    
    # Other allele frequency and allele count options:
      --min_mac 10    # minimum minor allele count
      --max_mac 50    # maximum minor allele count
      --max_minor_allele_frequency 0.05   # maximum minor allele frequency (5%)
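
    If the trait list may contain blank lines or Windows line endings, a read loop is a more forgiving alternative to the backtick form. The sketch below is an assumption-labeled variant, not the loop used above: run_traits is a hypothetical name, and mmap is set to "echo mmap" so each command is printed rather than run.

```shell
mmap="echo mmap"   # dry run: prints each command instead of running it

# $1 = one-column trait list; remaining arguments are passed to every run
run_traits() {
    list=$1
    shift
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline
    while IFS= read -r trait || [ -n "$trait" ]; do
        trait=$(printf '%s' "$trait" | tr -d '\r')   # strip Windows line endings
        [ -z "$trait" ] && continue                  # skip blank lines
        # stdin is redirected so the program cannot consume the trait list
        $mmap --trait "$trait" "$@" < /dev/null
    done < "$list"
}

# Example: run_traits TRAIT_LIST.txt --ped $ped --model add >> $stdoutfile
```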
    
    
    

    Example for IMPUTED genotype files

    # --output_marker_attribute INFO   adds an INFO column to the ...mle.pval.csv file
    # --min_imputation_quality 0.3     minimum value for INFO
    # --chromosome 1                   no leading zero on single-digit chromosome numbers
    # --min_dosage 0.02                for imputed genotype files (equivalent to --min_minor_allele_frequency)
    $mmap --ped $ped --model add --read_binary_covariance_file $kinbin \
      --phenotype_filename $pheno  --binary_genotype_filename $genobin \
      --covariates $covariates  --trait $trait  --file_suffix  $suffix \
      --marker_set $snps \
      --subject_set $subjects \
      --output_marker_attribute INFO \
      --min_imputation_quality 0.3 \
      --chromosome 1 \
      --min_dosage 0.02 \
      --all_output \
      >> $stdoutfile
    

    Additional Options

    If some covariates are not in your phenotype file, use one or more additional covariate files.
    MMAP will look first in the phenotype_filename, then
            in the covariate_filename list (one or more files) and then
            in the binary_covariate_filename list (one or more files)
    If you want to use SNPs as covariates, add the SNPname in the --covariates list and then
            include the option --binary_covariate_filename <list of one or more genotype binary files>
            A binary covariate file can be the same file used for --binary_genotype_filename
    Examples:

    covarFile1=/data/jperry/phenotypes/more_covariates.csv  
    covarFile2=/data/jperry/phenotypes/other_covariates.csv  
    
    --covariate_filename $covarFile1 $covarFile2
    --binary_covariate_filename $genobin
    

    The subject ids are expected to be in the first column of the phenotype_filename and covariate_filename files. If this condition is met, then the column header for the subject ids does not need to be specified. If the subject ids are not in the first column, you may specify the column header as shown below. When specified, the column header must be used in phenotype_filename AND IN ALL covariate_filename files.

    --phenotype_id EGO # column header for subject ids in phenotype_filename and covariate_filename files.
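
    Since the same id header has to appear in every file, a quick header check can catch a mismatch before a run. The sketch below assumes plain comma-separated files with unquoted headers; has_column and check_id_column are hypothetical helper names.

```shell
# Succeeds if the header row of file $2 contains a column named exactly $1
has_column() {
    head -n 1 "$2" | tr -d '\r' | tr ',' '\n' | grep -qxF -- "$1"
}

# $1 = id column header; remaining arguments = phenotype and covariate files
check_id_column() {
    id=$1
    shift
    status=0
    for f in "$@"; do
        if ! has_column "$id" "$f"; then
            echo "no $id column in $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: check_id_column EGO "$pheno" "$covarFile1" "$covarFile2" || exit 1
```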


    Gene x Environment Interactions

    Gene x Environment interactions (actually, SNP x covariate interactions) are available with MMAP. This option is valid ONLY if you are using --binary_genotype_filename, so that MMAP is looping over a list of SNPs (the SNPs in the genotype file, subsetted by the optional marker_set). This option creates an additional covariate, SNP*covariate (for example, SNP*BMI), where SNP is the SNP from the list of SNPs being looped over.

    The covariate can be any item in the phenotype file or in a covariate file, or a SNP from a binary covariate file. This item does NOT have to be in the covariate list identified with --covariates (but typically would be).

    --gxe_interaction TREATMENT   (gives additional covariate: SNP*TREATMENT)
    

    NOTE: There can be only one GxE term in the model. Typically, you would have TREATMENT in the list of covariates in addition to the --gxe_interaction term.


    Covariate x Covariate Interactions

    You may also specify covariate x covariate interactions with the --interactions option. The items do NOT have to be in the covariate list identified with --covariates (but typically would be). Each item can be anything in the phenotype file or in a covariate file, or a SNP from a binary covariate file.

    --interactions age*sex age*sex*BMI	gives 2 additional covariates: age*sex  age*sex*BMI
    --interactions age*rs123456     	gives additional covariate: age*rs123456 where rs123456 is a specific SNP
    					rs123456 could be in the pheno file or a covarFile or it can be in a
    					binary_covariate_filename which might be the same as the
    					binary_genotype_filename or could be a different binary file
    

    If you wanted to do every possible combination of 4 covariates (singles, doubles, triples, quadruples) specify them as shown below. If you didn’t need the singles, you could leave out the --covariates option.

    --covariates age sex BMI rs123  
    --interactions  age*sex age*BMI age*rs123 sex*BMI sex*rs123 BMI*rs123  age*sex*BMI age*sex*rs123 age*BMI*rs123 sex*BMI*rs123  age*sex*BMI*rs123
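
    Writing out every combination by hand is error-prone (4 covariates give 6 doubles, 4 triples and 1 quadruple, 11 terms in all). The sketch below generates the terms instead; all_interactions is a hypothetical helper, not an MMAP feature.

```shell
# Print every product of two or more of the given names, one per line, joined by '*'
all_interactions() {
    n=$#
    total=$(( (1 << n) - 1 ))   # each bit of the mask selects one name
    mask=1
    while [ "$mask" -le "$total" ]; do
        term=""
        count=0
        i=0
        for name in "$@"; do
            if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
                term="${term:+${term}*}${name}"
                count=$((count + 1))
            fi
            i=$((i + 1))
        done
        if [ "$count" -ge 2 ]; then
            printf '%s\n' "$term"
        fi
        mask=$((mask + 1))
    done
}

# Example use. set -f turns off filename expansion so the '*' characters
# reach the program unchanged:
# set -f
# $mmap ... --covariates age sex BMI rs123 --interactions $(all_interactions age sex BMI rs123)
```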
    

    Exclusions based on phenotype/covariate values

    There is general syntax to exclude data from the analysis based on fields in the phenotype/covariate file. The fields need not be used as covariates in the model itself.

    examples:


    Special case options:

    --polygenic_adjusted_residuals   # Internally substitutes the "error residuals" for the trait and appends "_ADJ" to the trait name.
    --x_male_coding_01               # treats chrX coding for males as 0 and 1 (default is 0 and 2)


    Building binary genotype files

    Create a “sparse” formatted binary from vcf AND force SNPNAME to be chr:pos:ref:alt

    --vcf2sparse --use_chr_pos_alt_ref --vcf_input_filename <vcf> --binary_output_filename <sparse>

    Create the MxS.bin genotype file from a .csv file

    Use --num_skip_fields 7 if we do NOT add extra columns for TYPE, GENE, AACHG
    $mmap --write_binary_genotype_file --csv_input_filename $genoMxScsv --binary_output_filename $genoMxSbin --num_skip_fields 7

    Use --num_skip_fields 10 if we DO want to add 3 extra columns for TYPE, GENE, AACHG
    $mmap --write_binary_genotype_file --csv_input_filename $genoMxScsv --binary_output_filename $genoMxSbin --num_skip_fields 10 --additional_marker_attributes GENE C AACHG C TYPE C

    If the genotype file has the above 3 additional attributes, you must use the option shown below (when you run mmap analysis) to have them included in the mmap output files.
    --output_marker_attribute TYPE GENE AACHG --snp_block_size 1

    Create the SxM.bin file from the MxS.bin file:

    $mmap --transpose_binary_genotype_file --binary_input_filename $genoMxSbin --binary_output_filename $genoSxMbin


    MMAP binary formats

    To convert from “dense” to “bit”:
    $mmap --binary_genotype_file_dense2bit --binary_input_filename <dense> --binary_output_filename <bit>

    To convert from “bit” to “dense”:
    $mmap --binary_genotype_file_bit2dense --binary_input_filename <bit> --binary_output_filename <dense>

    To convert from “sparse” to “dense”:
    $mmap --binary_genotype_file_sparse2dense --binary_input_filename <sparse> --binary_output_filename <dense>

    To convert from “sparse” to “bit”: (NOTE: the command says “dense”, but we add --use_bit_coding )
    $mmap --binary_genotype_file_sparse2dense --use_bit_coding --binary_input_filename <sparse> --binary_output_filename <bit>


    Convert from Plink binary to MMAP binary (assuming MxS in the Plink file):
    $mmap --plink_bfile2mmap --swap_A1_A2 --plink_bfile $plinkBinaryFormat --binary_output_prefix $mmapFormat.MxS
    By default, MMAP will set Plink A1 to the NON_CODED_ALLELE and A2 to the EFFECT_ALLELE
    However, Plink (by default) sets A1 to the allele with the lower allele frequency.
    If you want the resulting MMAP file to have the Plink A1 allele in MMAP’s EFFECT_ALLELE,
    use: --swap_A1_A2

    MMAP imports Plink binary format files into an SxM or MxS genotype binary file, depending on the Plink format, which is automatically detected.
    $mmap --plink_bfile2mmap --plink_bfile <prefix> --binary_output_prefix <mmap prefix>
    Converts files <prefix>.bim, <prefix>.bed, <prefix>.fam into binary genotype file <mmap prefix>.bin and MMAP pedigree <mmap prefix>.ped.csv extracted from the <prefix>.fam.


    $mmap --subject_by_marker_mmap2plink --binary_input_filename $genoSxMbin --plink_output_prefix $output_root
    OPTION: --use_snpname will use the SNPNAME for the variant id in the plink dataset (.map file). Default is to use RSNUM for the variant id.

    $mmap --subject_by_marker_mmap2plink --binary_input_filename <SxM binary genotype file> --plink_output_prefix <prefix>
    Creates <prefix>.map and <prefix>.ped, which can then be converted into Plink binary format with Plink commands. Exporting directly into Plink binary format is not currently supported.

    $mmap --marker_by_subject_mmap2tped --binary_input_filename <MxS binary genotype file> --plink_output_prefix <prefix>
    Creates <prefix>.fam, <prefix>.bim and <prefix>.tped

    $mmap --marker_by_subject_mmap2plink_dosage --binary_input_filename <MxS binary genotype file> --plink_output_filename <prefix>
    NOTE: output option may change in future to be --plink_output_prefix
    Creates <prefix>.fam, <prefix>.map and <prefix>.ped


    Genomic Relationship Matrices and Eigenvector files

    Eigenvector files - (to be updated soon) The eigen.bin file only depends on the subjects in the file. It is independent of trait and covariates. Create them by trait as each trait typically has a different number of subjects. You can add any covariate to the model as long as there is no missing data for the subjects used to create the eigenvector file.

    Genomic Relationship Matrix - (to be added soon)


    Transformations of Phenotypes

    $mmap --ped $ped --read_binary_covariance_file $kinbin --phenotype_filename $pheno --trait $trait --all_output --covariates $covariates --file_suffix $suff --subject_set $subject_set --transform_analysis_phenotype --binary_genotype_filename $genomxs

    Takes the full set of options and exits once the file is created. There is NO adjustment for pedigree, which is standard. The file contains all data in the model, similar to the poly.model file.

    The pedigree, covariates and genotypes are not used in the calculation except to create a squared off data set given all the data and also to have a ready made phenotype file for analysis.

    Squared off means the data is not missing at any inputs in the model, be it pedigree, covariates and/or genotypes. When you create the transformed values, you specify the model (with pedigree, covariates, genotypes) that you plan to use the transformed values in. For example, if you use the full phenotype file of 4700 subjects and no genotype file, then you will get a transformed file with 4700 subjects. With a genotype file of 1100 subjects, then the squared off data set would have 1100 subjects. The transformed values for a given subject will likely be different if the transform is done with 4700 subjects vs 1100 subjects. If you were to take transformed data based on 4700 subjects and simply pull out the data for 1100 subjects, the histogram for those 1100 subjects might not look very normalized.

    If the analysis plan calls for inverse normal of the covariate adjusted residuals, you would need 3 steps. Run the linear regression to generate the residuals, then transform the residuals with no covariates, then run mixed model.


    Future Developments

    OASIS: Omics Analysis, Search and Information System

    The Omics Analysis Search and Information System (OASIS) is an information system for analyzing, searching and visualizing associations between phenotypes, genotypes, and other types of omics data (such as transcriptomics, metabolomics, etc.). It is designed to enable discovery by connecting to the thought processes of biological researchers in a way that allows them to search results from an initial GWAS (or other type of association study), ask follow up questions and get the answers in real-time.

    OASIS accomplishes this with a web-based search system and a variety of real-time analysis tools including conditional & multi-covariates analysis, LD calculations, alternative data transformations, and customized SKAT analysis. On-demand visualizations are provided in the form of boxplots, histograms, LocusZoom & Haploview plots. The OASIS search reports contain a broad spectrum of annotation from Annovar and WGSA plus a variety of links to external resources such as gnomAD, GTEx, HaploReg, Roadmap, UCSC and NCBI. Because OASIS has a web-based user interface, an understanding of programming or the UNIX operating system is not required.

    OASIS is powered by MMAP.  The OASIS user interface coordinates the use of MMAP’s unique options and algorithms to provide repeated, custom computations in a fraction of the time normally required. Please visit the OASIS website for more information!


    MMAP: Mixed Model Analysis for Pedigrees and Populations - Copyright © 2017