hiddenDomains is a suite of programs used to identify significant enrichment of ChIP-seq reads that span large domains, like HK27me3. The input data can be in BAM format, or in a tab-delimited "reads per bin" format described below. The output is a BED formatted file the lists the enriched domains and their posterior probabilities.
hiddenDomains requires samtools, so if you don't already have that installed, you'll need to do that.
This manual and tutorial will teach you how to use the hiddenDomains suite of programs. In it we will analyze a mouse ChIP-seq dataset for H3k27me3
Download and unpack hiddenDomains
If you haven't already done this, download the latest version of hiddenDomains from the Sourceforge website: https://sourceforge.net/projects/hiddendomains/
Now unpack it
shell$ tar -xzvf hiddenDomains.VERSION.NUMBER.tar.gz
Where VERSION
and NUMBER
represent the release
that you downloaded.
Download and install R if you don't already have it.
If you don't already have them, you will need to install two hidden Markov model libraries:
You can do this by starting R on the command line:
shell$ R
Or, if you prefer RStudio, you can use that.
Regardless of how you started R, type the following commands to install the packages.
> install.packages("depmixS4")
> install.packages("HiddenMarkov")
Analyze the Data
Once you have unpacked hiddenDomains, the next step is to analyze it.hiddenDomains comes with two example data files:
k27me3_chr6.bam
, An H3K27me3 ChIP-seq dataset that we will use to identify enriched domains.
input_chr6.bam
, An "Input" ChIP-seq dataset that we can use
to limit the number of
false positives by ruling out domains that are enriched regardless of the
antibody.
To analyze the data with
hiddenDomains
.
shell$ cd hiddenDomain.VERSION.NUMBER
shell$ ./hiddenDomains -g ChromInfo_mm9.txt -b 200 -t k27me3_chr6.bam -c input_chr6.bam -o results
hiddenDomains
has three required parameters:
-g ChromInfo_FILE.txt :
A tab delimited text file that lists
the chromosome names (these have to match the chromosome names in the BAM
files) and their sizes. hiddenDomains
comes with ChromInfo.txt files for
MM9, MM10, HG18 and HG19, and is easy to make a custom file for other genomes.
To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes:
chr1 197195432 chr2 181748087 chr3 159599783
-t CHIP_SEQ.bam :
The ChIP-seq dataset you want to
analyze. The -t
stands for "treatment". NOTE: While you
can run hiddenDomains
without a control (or "input") file, we
strongly recommend using one. This will help eliminate false positives in the
output.
-o OUTPUT_FILE_PREFIX :
The prefix that you want all
of the output filenames to have. hiddenDomains
creates four or five
output files. These are:
OUTPUT_FILE_PREFIX_treatment_bins.txt :
This file contains
"binned" read counts.
OUTPUT_FILE_PREFIX_control_bins.txt :
This file contains
"binned" read counts. hiddenDomains
creates this only if you
provide a control dataset.
OUTPUT_FILE_PREFIX_domains.txt :
This file contains
the enriched domains and peaks, along with their posterior probabilities.
OUTPUT_FILE_PREFIX_vis.bed :
A BED file that
is ready to be uploaded to the UCSC genome browser.
OUTPUT_FILE_PREFIX_analysis.bed :
A BED file that
has all consecutive enriched bins merged into a single line, making it ideal
for analysis with bedtools
.
In addition to these required parameters, hiddenDomains
accepts
many optional parameters (described in detail in the manual). The
optional parameters that are used here are:
-b BIN_WIDTH :
The width of the bins for analysis. The
default size is 1000bp.
-c INPUT_SEQ.bam :
The "input" dataset you want to
analyze. The -c
stands for "control".
NOTES:
hiddenDomains -h
hiddenDomains
could not run,
you either need to
make it executable, with the command
shell$ chmod 755 hiddenDomains
and then re-run the command above,
or run it directly with Perl using the following command
shell$ perl ./hiddenDomains -g ChromInfo_mm9.txt -b 200 -t k27me3_chr6.bam -c input_chr6.bam -o results
To verify that hiddenDomains has worked correctly, we have provided bedgraph files for the original data and you can upload these to the UCSC Genome Browser along with the BED files that you just created. You can then peruse the results at your liberty, however, there are two locations that we would like to point out that illustrate the value of having a control dataset.
These are the files you need to upload to the UCSC Mouse mm9 Genome Browser:
Once you finish uploading the files, you can point the Genome Browser to the Hox gene cluster at chr6:52,000,664-52,512,712. Here you see that both methods called the rather broad domain of H3k27me3 reads that cover this cluster of genes.
chr6:103,414,566-103,717,194 highlights the importance of using a control dataset to limit false positives. In this example there is a location with a lot of reads in both the K27me3 dataset and the input dataset. Without using the input dataset as a control, hiddenDomains calls a peak at this location. However, when run with the input dataset as a control, hiddenDomains does not call this peak.
hiddenDomains can be run two different ways. In the tutorial, we used
hiddenDomains
to do three things:
However, for more control, you can do these things separately. The individual programs have additional options. For example, you can specify to only analyze certain chromosomes, or you can customize the track line in the "vis" BED file.
If you want to run the programs separately...
binReads.pl
bins reads
hiddenDomains.R
looks for domains
domainsToBed.pl
and
domainsMergeToBed.pl
create BED files.
Usage: binReads.pl [options] bamFile.bam > binned_reads.txt
binReads.pl
takes a BAM formatted file of aligned reads and
outputs the number of reads that overlap each bin, where a "bin" is an equally
sized, non-overlapping portion of a chromosome.
Options
-h | Print out the help information. |
-b BIN_WIDTH | The width of the bin. Default is 1000bp, which works well with domains and transcription factors. If the exact location of a transcription factor is of interest (rather than answering "does this transcription factor bind in this region"), you can measure the peak width in the UCSC genome browser and use that to define the bid width. |
-B | The input file is in BED format. |
-q MIN_MAPQ | The minimum MAPQ score. Default is 30. |
-M | Assume all bins should be on mouse chromosomes. This is the default. |
-H | Assume all bin should be on human chromosomes. |
-c "chr1 chr2 ..." | Create a custom list of chromosomes to bin reads into. |
Usage: domainsMergeToBed.pl [options] domainFile.txt > domainFile.bed
domainsMergeToBed.pl
takes the output from hiddenDomains.R
and merges all consecutive domains with a posterior greater than
a threshold (default is 0; all domains are merged) into a single domain
and generates track of enriched domains
for the UCSC Genome Browser.
Options
-h | Print out the help information. |
-b BIN_WIDTH | The width of the bin. Default is 1000bp. |
-g ChromInfo.txt | If you get an out of bounds error when uploading a bed file to the UCSC
genome browser, you can use this option to trim the bounds to the size of
the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9,
MM10 and HG19, and it is easy to make a custom file for other genomes.
To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes: chr1 197195432 chr2 181748087 chr3 159599783 |
-p MIN_POSTERIOR | Toss out parts of domains that have posterior values that are less
than MIN_POSTERIOR. You can set this to any value you want, but for
reference, domainsToBed bins according the following scale:
>= 0.9 0.9 > posterior >= 0.8 0.8 > posterior >= 0.7 0.7 > posterior >= min posterior for significanceThe default value is 0; everything is merged by default. |
-t | If set, a track line will be added to the start of the bed file |
-n TRACK_NAME | The name you want to give your track. The default value is the name of the domain file (without the .txt suffix if it has one). |
Usage: domainsToBed.pl [options] domainFile.txt > domainFile.bed
domainsToBed.pl
takes the output from hiddenDomains.R
and generates track of enriched domains and their posterior probabilities
for the UCSC Genome Browser.
Options
-h | Print out the help information. |
-b BIN_WIDTH | The width of the bin. Default is 1000bp. |
-g ChromInfo.txt | If you get an out of bounds error when uploading a bed file to the UCSC
genome browser, you can use this option to trim the bounds to the size of
the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9,
MM10 and HG19, and it is easy to make a custom file for other genomes.
To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes: chr1 197195432 chr2 181748087 chr3 159599783 |
-t | If set, a track line will be added to the start of the bed file |
-n TRACK_NAME | The name you want to give your track. The default value is the name of the domain file (without the .txt suffix if it has one). |
Usage: hiddenDomains [options] -g ChromInfo -t TreatmentReads -o OutputPrefix
Options
-h | Print out the help information. |
-b BIN_WIDTH | The width of the bin. Default is 1000bp, which works well with domains and transcription factors. If the exact location of a transcription factor is of interest (rather than answering "does this transcription factor bind in this region"), you can measure the peak width in the UCSC genome browser and use that to define the bid width. |
-B | The input file is in BED format. NOTE: All read files have to have to same format (either BED or BAM). Use binReads.pl as a stand alone program if you have a more complicated set up. |
-c ControlReads | A BED or BAM file that contains aligned read reads. Use the -B option to specify BED format. BAM format is the default. | -g ChromInfo.txt | If you get an out of bounds error when uploading a bed file to the UCSC
genome browser, you can use this option to trim the bounds to the size of
the chromosomes. hiddenDomains comes with ChromInfo.txt files for MM9,
MM10 and HG19, and it is easy to make a custom file for other genomes.
To make your own ChromInfo.txt file, just create a tab delimited file with the chromosome names in the first column and their sizes in the second column. Here is an example ChromInfo.txt file for a genome with 3 chromosomes: chr1 197195432 chr2 181748087 chr3 159599783 |
-o OutputPrefix |
hiddenDomains generates four or five files with names that start with
OutputPrefix.
|
-p MIN_POSTERIOR | Toss out parts of domains that have posterior values that are less
than MIN_POSTERIOR. You can set this to any value you want, but for
reference, domainsToBed bins according the following scale:
>= 0.9 0.9 > posterior >= 0.8 0.8 > posterior >= 0.7 0.7 > posterior >= min posterior for significanceThe default value is 0; everything is merged by default. |
-q MIN_MAPQ | The minimum MAPQ score. Default is 30. |
-t TreatmentReads | A BED or BAM file that contains aligned read reads. Use the -B option to specify BED format. BAM format is the default. |
Usage: hiddenDomains(treat.bin.file="TREATMENT_bins.txt", control.bin.file="CONTROL_bins.txt", out.file.name="OUTFILE.txt", chr.names=["mouse"|"human"|c("chr1","chr2", "etc"]))
hiddenDomains.R
takes the output from binReads.pl
and attempts to identify enriched domains using Hidden Markov Models (HMMs).
Options
treat.bin.file | This should be set to the name of the file with the "treatment" read counts |
control.bin.file | This should be set to the name of the file with the "control" read counts |
normalize | TRUE or FALSE .
Should the treatment and control files be normalized for depth? The default
value is TRUE |
chr.names | This can be set to "mouse", "human" or a vector of chromosomes that you want to look for domains in. |
skip.chrs | Use this parameter to skip analysis of specific chromosomes. You can specify an individual chromosome with a string, like "chrX" , or you can specify a vector of chromosomes, like c("chr1", "chr2") . |
min.prob | Bins with a lower posterior probability than min.prob will
be filtered out. The default value is 0.6. |
out.flie.name | Use this to provide the name of the output file you want hiddenDomains to create with the results |
min.read.count | This is a parameter that can be fiddled with if the HMM libraries fail to converge. If this happens, try setting this to -1. |
data | If you have already loaded in the "treatment" data into a variable,
you can set data to that variable and omit treat.bin.file |
control.data | If you have already loaded in the "control" data into a variable,
you can set data to that variable and omit control.bin.file |
debug | TRUE or FALSE .
Print out a lot of intermediate messages. The default value is
FALSE |
max.read.count
and possibly the min.read.count
parameters.
I've found that setting max.read.count
to 10 will fix things
a lot of the time. A more rational method for figuring out the optimal
parameter is to look at normalized bigWig tracks in the UCSC genome browser
to get a sense of how tall the peaks really are and use that number.
-b 1000
) for
binReads.pl
and domainsToBed.pl
and re-running
everything.