hiddenDomains Manual, Tutorial and Troubleshooting Guide

hiddenDomains is a suite of programs used to identify significant enrichment of ChIP-seq reads that span large domains, like HK27me3. The input data can be in BAM format, or in a tab-delimited "reads per bin" format described below. The output is a BED formatted file the lists the enriched domains and their posterior probabilities.

hiddenDomains requires samtools, so if you don't already have that installed, you'll need to do that.

This manual and tutorial will teach you how to use the hiddenDomains suite of programs. In it we will analyze a mouse ChIP-seq dataset for H3k27me3

Quickstart Tutorial

  1. Download and unpack hiddenDomains

    If you haven't already done this, download the latest version of hiddenDomains from the Sourceforge website: https://sourceforge.net/projects/hiddendomains/

    Now unpack it

    
    shell$ tar -xzvf hiddenDomains.VERSION.NUMBER.tar.gz
    
    Where VERSION and NUMBER represent the release that you downloaded.

  2. Install R and required R packages (if necessary)

    Download and install R if you don't already have it.

    If you don't already have them, you will need to install two hidden Markov model libraries:

    You can do this by starting R on the command line:
    
    shell$ R
    

    Or, if you prefer RStudio, you can use that.

    Regardless of how you started R, type the following commands to install the packages.

    
    > install.packages("depmixS4")
    > install.packages("HiddenMarkov")
    
  3. Analyze the Data

    Once you have unpacked hiddenDomains, the next step is to analyze it.

    hiddenDomains comes with two example data files:

    Both of these files only contain reads that mapped to chromosome 6 in order to limit the size of the files and to get you up and running as quickly as possible.

    To analyze the data with hiddenDomains.

    
    shell$ cd hiddenDomain.VERSION.NUMBER
    shell$ ./hiddenDomains -g ChromInfo_mm9.txt -b 200 -t k27me3_chr6.bam -c input_chr6.bam -o results
    

    hiddenDomains has three required parameters:

    In addition to these required parameters, hiddenDomains accepts many optional parameters (described in detail in the manual). The optional parameters that are used here are:

    NOTES:

    1. Additional parameters and usage information can be found by entering hiddenDomains -h

    2. Depending on the computer, this command should run in less than 5 minutes with ample notices printed to the terminal while it works.

    3. If you get an error that hiddenDomains could not run, you either need to make it executable, with the command

      shell$ chmod 755 hiddenDomains
      and then re-run the command above, or run it directly with Perl using the following command

      shell$ perl ./hiddenDomains -g ChromInfo_mm9.txt -b 200 -t k27me3_chr6.bam -c input_chr6.bam -o results

    To verify that hiddenDomains has worked correctly, we have provided bedgraph files for the original data and you can upload these to the UCSC Genome Browser along with the BED files that you just created. You can then peruse the results at your liberty, however, there are two locations that we would like to point out that illustrate the value of having a control dataset.

    These are the files you need to upload to the UCSC Mouse mm9 Genome Browser:

    1. input_chr6.bedgraph.gz - this will display the normalized reads from the "input". The reads should appear uniformly distributed. You may want to set the track "Vertical viewing range" from 0 to 200. You do this by clicking on the grey bar on the left side of the track, setting the Vertical viewing range "min" and "max" values to 0 and 200, and then selecting "use vertical viewing range setting" form the "Data view scaling" pulldown menu.
    2. k27me3_chr6.bedgraph.gz - this will display the normalized reads - this will display the normalized reads from the K27me3 ChIP-seq experiment. The reads should appear piled up near genes that are repressed. You may want to set the track "Vertical viewing range" from 0 to 200.
    3. k27_hiddenDomain_no_control.bed
    4. k27_hiddenDomain.bed

    Once you finish uploading the files, you can point the Genome Browser to the Hox gene cluster at chr6:52,000,664-52,512,712. Here you see that both methods called the rather broad domain of H3k27me3 reads that cover this cluster of genes.

    chr6:103,414,566-103,717,194 highlights the importance of using a control dataset to limit false positives. In this example there is a location with a lot of reads in both the K27me3 dataset and the input dataset. Without using the input dataset as a control, hiddenDomains calls a peak at this location. However, when run with the input dataset as a control, hiddenDomains does not call this peak.

Manual

hiddenDomains can be run two different ways. In the tutorial, we used hiddenDomains to do three things:

  1. Bin the reads

  2. Look for domains

  3. Generate BED files

However, for more control, you can do these things separately. The individual programs have additional options. For example, you can specify to only analyze certain chromosomes, or you can customize the track line in the "vis" BED file.

If you want to run the programs separately...

  1. binReads.pl bins reads
  2. hiddenDomains.R looks for domains
  3. domainsToBed.pl and domainsMergeToBed.pl create BED files.

binReads.pl

Usage: binReads.pl [options] bamFile.bam > binned_reads.txt

binReads.pl takes a BAM formatted file of aligned reads and outputs the number of reads that overlap each bin, where a "bin" is an equally sized, non-overlapping portion of a chromosome.

Options

-hPrint out the help information.
-b BIN_WIDTH The width of the bin. Default is 1000bp, which works well with domains and transcription factors. If the exact location of a transcription factor is of interest (rather than answering "does this transcription factor bind in this region"), you can measure the peak width in the UCSC genome browser and use that to define the bid width.
-BThe input file is in BED format.
-q MIN_MAPQ The minimum MAPQ score. Default is 30.
-MAssume all bins should be on mouse chromosomes. This is the default.
-H Assume all bin should be on human chromosomes.
-c "chr1 chr2 ..." Create a custom list of chromosomes to bin reads into.

domainsMergeToBed.pl

Usage: domainsMergeToBed.pl [options] domainFile.txt > domainFile.bed

domainsMergeToBed.pl takes the output from hiddenDomains.R and merges all consecutive domains with a posterior greater than a threshold (default is 0; all domains are merged) into a single domain and generates track of enriched domains for the UCSC Genome Browser.

Options

-hPrint out the help information.
-b BIN_WIDTH The width of the bin. Default is 1000bp.
-g ChromInfo.txt If you get an out of bounds error when uploading a bed file to the UCSC genome browser, you can use this option to trim the bounds to the size of the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10 and HG19, and it is easy to make a custom file for other genomes.

To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes:

    chr1    197195432
    chr2    181748087
    chr3    159599783    
-p MIN_POSTERIOR Toss out parts of domains that have posterior values that are less than MIN_POSTERIOR. You can set this to any value you want, but for reference, domainsToBed bins according the following scale:
    >= 0.9
    0.9 > posterior >= 0.8
    0.8 > posterior >= 0.7
    0.7 > posterior >= min posterior for significance
The default value is 0; everything is merged by default.
-t If set, a track line will be added to the start of the bed file
-n TRACK_NAME The name you want to give your track. The default value is the name of the domain file (without the .txt suffix if it has one).

domainsToBed.pl

Usage: domainsToBed.pl [options] domainFile.txt > domainFile.bed

domainsToBed.pl takes the output from hiddenDomains.R and generates track of enriched domains and their posterior probabilities for the UCSC Genome Browser.

Options

-hPrint out the help information.
-b BIN_WIDTH The width of the bin. Default is 1000bp.
-g ChromInfo.txt If you get an out of bounds error when uploading a bed file to the UCSC genome browser, you can use this option to trim the bounds to the size of the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10 and HG19, and it is easy to make a custom file for other genomes.

To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes:

    chr1    197195432
    chr2    181748087
    chr3    159599783    
-t If set, a track line will be added to the start of the bed file
-n TRACK_NAME The name you want to give your track. The default value is the name of the domain file (without the .txt suffix if it has one).

hiddenDomains

Usage: hiddenDomains [options] -g ChromInfo -t TreatmentReads -o OutputPrefix

Options

-hPrint out the help information.
-b BIN_WIDTH The width of the bin. Default is 1000bp, which works well with domains and transcription factors. If the exact location of a transcription factor is of interest (rather than answering "does this transcription factor bind in this region"), you can measure the peak width in the UCSC genome browser and use that to define the bid width.
-BThe input file is in BED format. NOTE: All read files have to have to same format (either BED or BAM). Use binReads.pl as a stand alone program if you have a more complicated set up.
-c ControlReads A BED or BAM file that contains aligned read reads. Use the -B option to specify BED format. BAM format is the default.
-g ChromInfo.txt If you get an out of bounds error when uploading a bed file to the UCSC genome browser, you can use this option to trim the bounds to the size of the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10 and HG19, and it is easy to make a custom file for other genomes.

To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes:

    chr1    197195432
    chr2    181748087
    chr3    159599783    
-o OutputPrefix hiddenDomains generates four or five files with names that start with OutputPrefix.
  1. OutputPrefix_treatment_bins.txt : A file with the read counts per bin.
  2. OutputPrefix_control_bins.txt : A file with the read counts per bin.
  3. OutputPrefix_domains.txt : A text file with all of the enriched domains and posterior probabilities.
  4. OutputPrefix_vis.bed : A BED file for visualization, which contains one line per significantly enriched bin - this allows for color coding based on the posterior probability.
  5. OutputPrefix_analysis.bed : The second BED file is for analysis, and this merges all consecutive bins with posterior probabilities greater than MIN_POSTERIOR (as specified with the -p option) or the default value, 0 - which merges all consecutive significant bins.
-p MIN_POSTERIOR Toss out parts of domains that have posterior values that are less than MIN_POSTERIOR. You can set this to any value you want, but for reference, domainsToBed bins according the following scale:
    >= 0.9
    0.9 > posterior >= 0.8
    0.8 > posterior >= 0.7
    0.7 > posterior >= min posterior for significance
The default value is 0; everything is merged by default.
-q MIN_MAPQ The minimum MAPQ score. Default is 30.
-t TreatmentReads A BED or BAM file that contains aligned read reads. Use the -B option to specify BED format. BAM format is the default.

hiddenDomains.R

Usage: hiddenDomains(treat.bin.file="TREATMENT_bins.txt", control.bin.file="CONTROL_bins.txt", out.file.name="OUTFILE.txt", chr.names=["mouse"|"human"|c("chr1","chr2", "etc"]))

hiddenDomains.R takes the output from binReads.pl and attempts to identify enriched domains using Hidden Markov Models (HMMs).

Options

treat.bin.file This should be set to the name of the file with the "treatment" read counts
control.bin.file This should be set to the name of the file with the "control" read counts
normalize TRUE or FALSE. Should the treatment and control files be normalized for depth? The default value is TRUE
chr.names This can be set to "mouse", "human" or a vector of chromosomes that you want to look for domains in.
skip.chrsUse this parameter to skip analysis of specific chromosomes. You can specify an individual chromosome with a string, like "chrX", or you can specify a vector of chromosomes, like c("chr1", "chr2").
min.prob Bins with a lower posterior probability than min.prob will be filtered out. The default value is 0.6.
out.flie.name Use this to provide the name of the output file you want hiddenDomains to create with the results
min.read.count This is a parameter that can be fiddled with if the HMM libraries fail to converge. If this happens, try setting this to -1.
data If you have already loaded in the "treatment" data into a variable, you can set data to that variable and omit treat.bin.file
control.data If you have already loaded in the "control" data into a variable, you can set data to that variable and omit control.bin.file
debug TRUE or FALSE. Print out a lot of intermediate messages. The default value is FALSE

Troubleshooting Guide