hiddenDomains Manual, Tutorial and Troubleshooting Guide

hiddenDomains is a suite of programs used to identify significant enrichment of ChIP-seq reads that span large domains, like HK27me3. The input data can be in BAM format, or in a tab-delimited "reads per bin" format described below. The output is a BED formatted file the lists the enriched domains and their posterior probabilities.

hiddenDomains requires samtools, so if you don't already have that installed, you'll need to do that.

This manual and tutorial will teach you how to use the hiddenDomains suite of programs. In it we will analyze a mouse ChIP-seq dataset for H3k27me3

Quickstart Tutorial

Download and unpack hiddenDomains

If you haven't already done this, download the latest version of hiddenDomains from the Sourceforge website: https://sourceforge.net/projects/hiddendomains/
Now unpack it
```
shell$ tar -xzvf hiddenDomains.VERSION.NUMBER.tar.gz
```
Where VERSION and NUMBER represent the release that you downloaded.
Install R and required R packages (if necessary)
Download and install R if you don't already have it.
If you don't already have them, you will need to install two hidden Markov model libraries:
- depmixS4
- HiddenMarkov
You can do this by starting R on the command line:
```
shell$ R
```
Or, if you prefer RStudio, you can use that.
Regardless of how you started R, type the following commands to install the packages.
```
> install.packages("depmixS4")
> install.packages("HiddenMarkov")
```
Analyze the Data
Once you have unpacked hiddenDomains, the next step is to analyze it.
hiddenDomains comes with two example data files:
- k27me3_chr6.bam, An H3K27me3 ChIP-seq dataset that we will use to identify enriched domains.
- input_chr6.bam, An "Input" ChIP-seq dataset that we can use to limit the number of false positives by ruling out domains that are enriched regardless of the antibody.
Both of these files only contain reads that mapped to chromosome 6 in order to limit the size of the files and to get you up and running as quickly as possible.

To analyze the data with hiddenDomains.
```
shell$ cd hiddenDomain.VERSION.NUMBER
shell$ ./hiddenDomains -g ChromInfo_mm9.txt -b 200 -t k27me3_chr6.bam -c input_chr6.bam -o results
```
hiddenDomains has three required parameters:
- -g ChromInfo_FILE.txt :A tab delimited text file that lists the chromosome names (these have to match the chromosome names in the BAM files) and their sizes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10, HG18 and HG19, and is easy to make a custom file for other genomes.
  To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes:
```
    chr1    197195432
    chr2    181748087
    chr3    159599783    
```
- -t CHIP_SEQ.bam :The ChIP-seq dataset you want to analyze. The -t stands for "treatment". NOTE: While you can run hiddenDomains without a control (or "input") file, we strongly recommend using one. This will help eliminate false positives in the output.
- -o OUTPUT_FILE_PREFIX :The prefix that you want all of the output filenames to have. hiddenDomains creates four or five output files. These are:
  1. OUTPUT_FILE_PREFIX_treatment_bins.txt :This file contains "binned" read counts.
  2. OUTPUT_FILE_PREFIX_control_bins.txt :This file contains "binned" read counts. hiddenDomains creates this only if you provide a control dataset.
  3. OUTPUT_FILE_PREFIX_domains.txt :This file contains the enriched domains and peaks, along with their posterior probabilities.
  4. OUTPUT_FILE_PREFIX_vis.bed :A BED file that is ready to be uploaded to the UCSC genome browser.
  5. OUTPUT_FILE_PREFIX_analysis.bed :A BED file that has all consecutive enriched bins merged into a single line, making it ideal for analysis with bedtools.
In addition to these required parameters, hiddenDomains accepts many optional parameters (described in detail in the manual). The optional parameters that are used here are:
- -b BIN_WIDTH :The width of the bins for analysis. The default size is 1000bp.
- -c INPUT_SEQ.bam :The "input" dataset you want to analyze. The -c stands for "control".
NOTES:
1. Additional parameters and usage information can be found by entering hiddenDomains -h
2. Depending on the computer, this command should run in less than 5 minutes with ample notices printed to the terminal while it works.
3. If you get an error that hiddenDomains could not run, you either need to make it executable, with the command
```
shell$ chmod 755 hiddenDomains
```
  and then re-run the command above, or run it directly with Perl using the following command
```
shell$ perl ./hiddenDomains -g ChromInfo_mm9.txt -b 200 -t k27me3_chr6.bam -c input_chr6.bam -o results
```
To verify that hiddenDomains has worked correctly, we have provided bedgraph files for the original data and you can upload these to the UCSC Genome Browser along with the BED files that you just created. You can then peruse the results at your liberty, however, there are two locations that we would like to point out that illustrate the value of having a control dataset.
These are the files you need to upload to the UCSC Mouse mm9 Genome Browser:
1. input_chr6.bedgraph.gz - this will display the normalized reads from the "input". The reads should appear uniformly distributed. You may want to set the track "Vertical viewing range" from 0 to 200. You do this by clicking on the grey bar on the left side of the track, setting the Vertical viewing range "min" and "max" values to 0 and 200, and then selecting "use vertical viewing range setting" form the "Data view scaling" pulldown menu.
2. k27me3_chr6.bedgraph.gz - this will display the normalized reads - this will display the normalized reads from the K27me3 ChIP-seq experiment. The reads should appear piled up near genes that are repressed. You may want to set the track "Vertical viewing range" from 0 to 200.
3. k27_hiddenDomain_no_control.bed
4. k27_hiddenDomain.bed
Once you finish uploading the files, you can point the Genome Browser to the Hox gene cluster at chr6:52,000,664-52,512,712. Here you see that both methods called the rather broad domain of H3k27me3 reads that cover this cluster of genes.

chr6:103,414,566-103,717,194 highlights the importance of using a control dataset to limit false positives. In this example there is a location with a lot of reads in both the K27me3 dataset and the input dataset. Without using the input dataset as a control, hiddenDomains calls a peak at this location. However, when run with the input dataset as a control, hiddenDomains does not call this peak.

Manual

hiddenDomains can be run two different ways. In the tutorial, we used hiddenDomains to do three things:

Bin the reads
Look for domains
Generate BED files

However, for more control, you can do these things separately. The individual programs have additional options. For example, you can specify to only analyze certain chromosomes, or you can customize the track line in the "vis" BED file.

If you want to run the programs separately...

binReads.pl bins reads
hiddenDomains.R looks for domains
domainsToBed.pl and domainsMergeToBed.pl create BED files.

binReads.pl

Usage: binReads.pl [options] bamFile.bam > binned_reads.txt

binReads.pl takes a BAM formatted file of aligned reads and outputs the number of reads that overlap each bin, where a "bin" is an equally sized, non-overlapping portion of a chromosome.

Options

-h	Print out the help information.
-b BIN_WIDTH	The width of the bin. Default is 1000bp, which works well with domains and transcription factors. If the exact location of a transcription factor is of interest (rather than answering "does this transcription factor bind in this region"), you can measure the peak width in the UCSC genome browser and use that to define the bid width.
-B	The input file is in BED format.
-q MIN_MAPQ	The minimum MAPQ score. Default is 30.
-M	Assume all bins should be on mouse chromosomes. This is the default.
-H	Assume all bin should be on human chromosomes.
-c "chr1 chr2 ..."	Create a custom list of chromosomes to bin reads into.

domainsMergeToBed.pl

Usage: domainsMergeToBed.pl [options] domainFile.txt > domainFile.bed

domainsMergeToBed.pl takes the output from hiddenDomains.R and merges all consecutive domains with a posterior greater than a threshold (default is 0; all domains are merged) into a single domain and generates track of enriched domains for the UCSC Genome Browser.

Options

-h	Print out the help information.
-b BIN_WIDTH	The width of the bin. Default is 1000bp.
-g ChromInfo.txt	If you get an out of bounds error when uploading a bed file to the UCSC genome browser, you can use this option to trim the bounds to the size of the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10 and HG19, and it is easy to make a custom file for other genomes. To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes: chr1 197195432 chr2 181748087 chr3 159599783
-p MIN_POSTERIOR	Toss out parts of domains that have posterior values that are less than MIN_POSTERIOR. You can set this to any value you want, but for reference, domainsToBed bins according the following scale: >= 0.9 0.9 > posterior >= 0.8 0.8 > posterior >= 0.7 0.7 > posterior >= min posterior for significance The default value is 0; everything is merged by default.
-t	If set, a track line will be added to the start of the bed file
-n TRACK_NAME	The name you want to give your track. The default value is the name of the domain file (without the .txt suffix if it has one).

domainsToBed.pl

Usage: domainsToBed.pl [options] domainFile.txt > domainFile.bed

domainsToBed.pl takes the output from hiddenDomains.R and generates track of enriched domains and their posterior probabilities for the UCSC Genome Browser.

Options

-h	Print out the help information.
-b BIN_WIDTH	The width of the bin. Default is 1000bp.
-g ChromInfo.txt	If you get an out of bounds error when uploading a bed file to the UCSC genome browser, you can use this option to trim the bounds to the size of the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10 and HG19, and it is easy to make a custom file for other genomes. To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes: chr1 197195432 chr2 181748087 chr3 159599783
-t	If set, a track line will be added to the start of the bed file
-n TRACK_NAME	The name you want to give your track. The default value is the name of the domain file (without the .txt suffix if it has one).

hiddenDomains

Usage: hiddenDomains [options] -g ChromInfo -t TreatmentReads -o OutputPrefix

Options

-h	Print out the help information.
-b BIN_WIDTH	The width of the bin. Default is 1000bp, which works well with domains and transcription factors. If the exact location of a transcription factor is of interest (rather than answering "does this transcription factor bind in this region"), you can measure the peak width in the UCSC genome browser and use that to define the bid width.
-B	The input file is in BED format. NOTE: All read files have to have to same format (either BED or BAM). Use binReads.pl as a stand alone program if you have a more complicated set up.
-c ControlReads	A BED or BAM file that contains aligned read reads. Use the -B option to specify BED format. BAM format is the default.
-g ChromInfo.txt	If you get an out of bounds error when uploading a bed file to the UCSC genome browser, you can use this option to trim the bounds to the size of the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9, MM10 and HG19, and it is easy to make a custom file for other genomes. To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes: chr1 197195432 chr2 181748087 chr3 159599783
-o OutputPrefix	hiddenDomains generates four or five files with names that start with OutputPrefix. `OutputPrefix_treatment_bins.txt :` A file with the read counts per bin. `OutputPrefix_control_bins.txt :` A file with the read counts per bin. `OutputPrefix_domains.txt :` A text file with all of the enriched domains and posterior probabilities. `OutputPrefix_vis.bed :` A BED file for visualization, which contains one line per significantly enriched bin - this allows for color coding based on the posterior probability. `OutputPrefix_analysis.bed :` The second BED file is for analysis, and this merges all consecutive bins with posterior probabilities greater than MIN_POSTERIOR (as specified with the -p option) or the default value, 0 - which merges all consecutive significant bins.
-p MIN_POSTERIOR	Toss out parts of domains that have posterior values that are less than MIN_POSTERIOR. You can set this to any value you want, but for reference, domainsToBed bins according the following scale: >= 0.9 0.9 > posterior >= 0.8 0.8 > posterior >= 0.7 0.7 > posterior >= min posterior for significance The default value is 0; everything is merged by default.
-q MIN_MAPQ	The minimum MAPQ score. Default is 30.
-t TreatmentReads	A BED or BAM file that contains aligned read reads. Use the -B option to specify BED format. BAM format is the default.

hiddenDomains.R

Usage: hiddenDomains(treat.bin.file="TREATMENT_bins.txt", control.bin.file="CONTROL_bins.txt", out.file.name="OUTFILE.txt", chr.names=["mouse"|"human"|c("chr1","chr2", "etc"]))

hiddenDomains.R takes the output from binReads.pl and attempts to identify enriched domains using Hidden Markov Models (HMMs).

Options

treat.bin.file	This should be set to the name of the file with the "treatment" read counts
control.bin.file	This should be set to the name of the file with the "control" read counts
normalize	`TRUE` or `FALSE`. Should the treatment and control files be normalized for depth? The default value is `TRUE`
chr.names	This can be set to "mouse", "human" or a vector of chromosomes that you want to look for domains in.
skip.chrs	Use this parameter to skip analysis of specific chromosomes. You can specify an individual chromosome with a string, like `"chrX"`, or you can specify a vector of chromosomes, like `c("chr1", "chr2")`.
min.prob	Bins with a lower posterior probability than `min.prob` will be filtered out. The default value is 0.6.
out.flie.name	Use this to provide the name of the output file you want hiddenDomains to create with the results
min.read.count	This is a parameter that can be fiddled with if the HMM libraries fail to converge. If this happens, try setting this to -1.
data	If you have already loaded in the "treatment" data into a variable, you can set data to that variable and omit `treat.bin.file`
control.data	If you have already loaded in the "control" data into a variable, you can set data to that variable and omit `control.bin.file`
debug	`TRUE` or `FALSE`. Print out a lot of intermediate messages. The default value is `FALSE`

Troubleshooting Guide

If the hiddenDomains.R script fails to find peaks/domains on a chromosome, but finds them on others...
1. Consider tweaking the max.read.count and possibly the min.read.count parameters. I've found that setting max.read.count to 10 will fix things a lot of the time. A more rational method for figuring out the optimal parameter is to look at normalized bigWig tracks in the UCSC genome browser to get a sense of how tall the peaks really are and use that number.
2. Try using a larger bin size (i.e. use -b 1000) for binReads.pl and domainsToBed.pl and re-running everything.