ReplicationDomain - Detailed ProtocolWhat follows is an early draft of what is now published as: Ryba T, Battaglia D, Pope BD, Hiratani I, Gilbert DM. Genome-scale analysis of replication timing: from bench to bioinformatics.Nat Protoc. 2011 Jun;6(6):870-95. Genome-Scale Analysis of Replication Timing: from Bench to Bioinformatics
Tyrone Ryba, Dana Battaglia, Benjamin D. Pope, Ichiro Hiratani and David M. Gilbert* Running title: Genome-Scale Analysis of Replication Timing1. AbstractReplication timing profiles have been shown to be cell type-specific and reflect genome organization changes upon differentiation. In this work we describe details of the methodology used to analyze replication timing genome wide in mammalian cells. In this protocol, asynchronously cycling cells are pulse labeled with the nucleotide analog 5-bromo-2-deoxyuridine (BrdU) and then sorted into S-phase fractions based on DNA content using flow cytometry. BrdU-labeled DNA from each fraction is immunoprecipitated, amplified, differentially labeled, and co-hybridized to a whole-genome CGH microarray. We also present a guide to the computational analysis of array data. Subjects include initial normalization, scaling, and data quality measures, loess (local polynomial) smoothing of replication timing values, segmentation of data into domains, and assignment of timing values to gene promoters. After these preprocessing steps, we cover various common downstream analyses, such as k-means and hierarchical clustering, as well as methods to relate changes in the replication program to gene expression and other genetic and epigenetic datasets. We recently used these methods to describe the dynamics of widespread replication timing changes in mouse and human cell culture models of development1, 2, 3. IntroductionAlthough the mechanisms that specify the timing and placement of origin firing in higher eukaryotes remain a mystery, all eukaryotes have a defined replication timing program that is largely conserved between closely related species4, including human and mouse3, 5. Analyses of replication timing in various cell types have yielded insights into genome organization and repackaging events during development, suggesting an important role for the timing program itself or 3D genome organization in regulating developmental gene expression. In this article, we describe approaches for measuring replication timing genome-wide in mammalian cells. As data processing and analysis are often a bottleneck in these studies, the second half of the protocol covers several common methods for downstream analysis. Part I: The wet sideThis portion of the protocol describes how to derive raw data from mammalian cells for genome-wide replication timing analysis. Because DNA replication is a cell cycle regulated event, some form of synchronization is necessary. For multi-cellular organisms, most synchronization schemes are cumbersome and only applicable to a limited number of cell lines6, 7. Hence, in cases where multiple lines are to be compared, retroactive synchronization using a fluorescence activated cell sorter (FACS) to select cells based upon the increase in DNA content during S-phase is preferred because it can be applied to any proliferating cell population that can be dissociated into a single celled suspension. This method has been applied to Drosophila8, 9, many mouse and human cell types1, 2, 3, Arabidopsis10, and was recently used to compare replication timing across 14 yeast mutants11. The original conception of the method12, 13 labeled cells with BrdU for a fraction of S-phase and then sorted cells into several different time points during S-phase. BrdU-substituted DNA could then be isolated either based on its increased density or using anti-BrdU-antibodies and specific loci could be examined by hybridization or PCR. With microarray analysis, replication of the entire genome can be queried in a single array hybridization by limiting the analysis to two differentially labeled samples, allowing all probes to be assigned one internally normalized relative replication timing value. The two most popular permutations of the FACS method of retroactive synchronization are described. In one method, BrdU-labeled cells are sorted into early and late S-phase populations, BrdU labeled DNA is immunoprecipatated with an anti-BrdU antibody, and DNA synthesized either early or late is used as the microarray target. Since immunoprecipitation (BrdU-IP) substantially enriches for DNA synthesized in each half of S-phase, this method produces a high signal to noise ratio. However, the efficacy of the BrdU-IP can be variable and must be carefully monitored. In the second method, unlabeled cells are sorted into total S-phase vs. G1-phase populations and DNA from these stages is differentially labeled and used as the target. This obviates BrdU-IP but quantification relies entirely on the 2-fold copy number increase during S-phase. Both methods have been used and produce similar results; a direct comparison in the same cell line has been done in one study1. In both methods, DNA from each fraction is differentially labeled with Cy3 and Cy5 dyes and then cohybridized to a whole-genome oligonucleotide microarray (Nimblegen Systems). The ratio of the abundance of each probe in each fraction is then used to generate a replication-timing profile. BrdU Incorporation and FACs sortingThe nucleotide analog 5-bromo-2’-deoxyuridine (BrdU) can be used to pulse-label newly synthesized DNA during S-phase. For mammalian cell types that have 8-12 hour S phases, incubation with BrdU for two hours has been empirically determined to provide sufficient incorporation to ensure successful BrdU-IP in subsequent steps yet be short enough to identify even subtle differences in replication timing, such as between cells with one vs. two early replicating X chromosomes2. Success has also been achieved with BrdU labeling times as short as one hour, but BrdU-IP can be problematic as there is very little substituted DNA relative to background of unsubstituted DNA that will contribute to noise in the BrdU-IP. The BrdU-labeling times for cells with significantly different S-phase lengths, such as amphibian12 or fly9 cells, should be adjusted appropriately. For first time users, it is recommended that at least 5 x 106 cells be used; however, with experience suffice sufficient S phase cells, as few as 0.5-1 x 106 starting cells can be successfully profiled. The important parameter is to obtain 20,000 - 30,000 cells in each of the early and late S-phase fractions. As described in protocol 1, ethanol-fixed cells can be stained with propidium iodide (PI) and sorted based on DNA content. Alternative fluorochromes that do not require RNAse digestion, such as chromomycin A3, can also be used with ethanol fixed cells12, 13. Some cell types tend to clump or produce a lot of cellular debris when fixed in ethanol. For these cell types, the fixation step can be skipped and DNA can be stained in permeabilized cells with DAPI, as described in protocol 2. The advantage of protocol 1 is that cells can fixed in ethanol can be stored at -20°C (empirically determined to be the optimal temperature) or shipped to collaborators. Shipping should be done on dry ice, with a partition between the tube and the dry ice to prevent cell freezing. All steps, particularly storage, should be performed in the dark since BrdU-substituted DNA is light sensitive. REAGENTS
EQUIPMENT
REAGENT SETUP
PROCEDUREPROTOCOL 1. BrdU labeling and staining of cells for FACSThis protocol can be performed using PI staining and ethanol fixation prior to sorting (option A), or, for cells that break or clump in ethanol, by collecting cells following pulse labeling with BrdU and treating with a DAPI staining solution containing Triton X-100 to permeabilize cell membranes (option B). The drawback of option B is that cells need to be sorted immediately following staining. (A) Labeling and PI staining of cells for FACs following ethanol fixation. TIMING 3.5 hours
(B) BrdU labeling and DAPI staining of cells for FACS. TIMING 3 hours
PROTOCOL 3. FACS sorting fractions of S-phase. TIMING 1 hoursThis protocol describes the use of flow cytometry to sort cells based on DNA content. Electrostatic sorting is performed as described here on a BD (Becton-Dickinson) FACS Aria flow cytometer. During FACS analysis forward and side scatter analyses should be used to select an appropriate population of cells free of doublets or cell debris, both of which can hinder accurate sorting of desired populations. Lasers used in this protocol include 488 Blue to detect PI or 407 Violet to detect DAPI in cells that have been stained for DNA content. Two separate fractions of S phase, early and late, are chosen to be collected, but more can be collected if desired12, 13.
Fig.1 A typical cell cycle profile for a mammalian fibroblast population obtained during FACS analysis by plotting cell count vs. DNA content. In this example, cellular DNA was stained with PI, so DNA content is represented by PI intensity. A G1 peak, representing cells with 2N DNA content, and a G2/M peak, representing cells that have undergone replication and therefore possess a 4N DNA content are labeled. The area between these two peaks is representative of cells in S phase, and can be sorted into two fractions, as indicated here, to obtain early and late S phase samples. Immunoprecipitation of BrdU labeled DNAThis protocol begins with cells that have been pulse labeled with BrdU and sorted into different cell cycle stages by FACS. Here DNA from these cells is sonicated into fragments ranging from 250bp to 2kb and then immunoprecipitated using an anti-BrdU antibody. Immunoprecipitated DNA is then purified and ready for use in further steps of replication timing analysis for a time period of up to one month. If samples have been stored at -20°C prior to beginning the immunopreciptation, thaw samples in a 56°C water bath and add 200µl of SDS-PK Buffer pre-warmed to 56°C with 0.05mg/mL glycogen to each sample prior to performing the phenol-chloroform extraction in step 1. PROTOCOL 4. BrdU-IP. TIMING 2 daysThis protocol describes the use of flow cytometry to sort cells based on DNA content. Electrostatic sorting is performed as described here on a BD (Becton-Dickinson) FACS Aria flow cytometer. During FACS analysis forward and side scatter analyses should be used to select an appropriate population of cells free of doublets or cell debris, both of which can hinder accurate sorting of desired populations. Lasers used in this protocol include 488 Blue to detect PI or 407 Violet to detect DAPI in cells that have been stained for DNA content. Two separate fractions of S phase, early and late, are chosen to be collected, but more can be collected if desired12, 13.
Quality control check of early and late S-phase DNADue to the sensitivity and large number of steps involved, BrdU-IP is one of the trickiest parts of the protocol. Thus to ensure quality, BrdU-IPs are screened by PCR amplification using primers specific to DNA markers of known relative replication time (i.e. early or late). Although real-time PCR can be performed, we find gel electrophoresis to be sufficient to evaluate enrichment of DNA in each IP sample. Importantly, PCR results can vary between aliquots of the same sample. Furthermore, replication timing can vary between cell types2, 3. Consistency across multiple samples from the same cell type is therefore the best way to verify quality. We typically use the primer sets listed below to screen a few IPs from both early and late S phase fractions. Among the mouse sequences listed below, mitochondrial DNA replicates throughout the cell cycle14 and is equally represented in early and late S-phase fractions. Alpha-globin, Pou5f1 and Mmp15 are generally early replicating markers, whereas beta-globin, Zfp42, Dppa2, Ptn, Mash1, and Akt3 are generally late replicating markers. Note that some of these genes are known to switch replication timing at some point during development; for instance, Zfp42 and Dppa2 are early replicating in ESCs whereas they are late replicating in all somatic cell types examined to date. Therefore, as mentioned, consistency across multiple samples from the same cell type is usually the most reliable way to assess the quality of IP samples. Among the human sequences listed below, mitochondrial DNA is equally represented in early and late S-phase fractions, while alpha-globin, MMP15 and BMP1 are generally early replicating markers. PTGS2, NETO1, SLITRK6, ZFP42 and DPPA2 are generally late replicating. Markers amplified, sizes and primer sequences are as follows: Mouse test regions
Human test regions
PROTOCOL 5. PCR method for quality control of BrdU-immunoprecipitation TIMING 5 hours
*** Mitochondrial primer sets should be used at 1.0 uM concentration instead of 0.5 uM. For this master-mix add 0.63 uL F/R 20uM combined primers and 9.31 uL nuclease-free water per reaction. Amplification methods for immunoprecipitated single-stranded DNAOnce purified by immunoprecipitation and screened for sample quality, BrdU incorporated DNA must be amplified to obtain sufficient amounts for array hybridization. If multiple samples pass PCR screening, DNA from parallel immunoprecipitations can be pooled together and used as the starting material for whole genome amplification. Otherwise, a single screened immunoprecipitation is used. Whole genome amplification (WGA) is performed using GenomePlex Complete Whole Genome Amplification Kit*** (Sigma Catalog Number WGA2) followed by GenomePlex WGA Reamplification Kit (Sigma Catalog Number WGA3). Amplified samples are then loaded onto a gel to determine size range and screened once more by PCR to ensure no bias was introduced during amplification. ***GenomePlex Single Cell Whole Genome Amplification Kit (Sigma Catalog Number WGA4) could also be used15, but in our hands WGA2 IP samples "biased" after WGA2 did not improve after WGA4. Thus, we continue to use WGA2 for consistency. PROTOCOL 6. Whole Genome Amplification TIMING 8 hours
Alternative to Steps 1-57: S/G1 Method TIMING 1 dayIn this method, cells are sorted into two fractions, G1 phase and S phase, based on DNA content, and replication timing is derived from the 2-fold copy number increase for early vs. late replicating sequences in pure S-phase populations. DNA analysis using flow cytometry can be performed simply by the use of a single DNA-binding fluorescent dye such as PI or DAPI as originally described (woodfine, 2004). Although this method is adequate, simultaneous measurement of BrdU-labeled DNA by performing BrdU/PI double staining for cell cycle analysis, can discriminate G1 and early-S cells much more efficiently than by PI-only staining. In addition, some cell types, particularly those derived from differentiated stem cells or primary tissues, can produce debris that interferes with good S phase sorting and a short BrdU label described here can eliminate debris, which is not labeled with BrdU. The advantage of this method is that it eliminates the need for BrdU-IP and whole genome amplification steps (described below), which need to be carefully controlled. However, direct comparisons have shown that this method produces a lower signal to noise ratio than Protocol 11. A much shorter BrdU pulse label is used in this protocol at lower concentration, because we are only trying to identify the cells in S phase. With longer BrdU labeling times, G2/M cells become labeled. It should be noted that we originally used the standard protocol for BrdU/PI analysis provided by Becton-Dickenson, which is fine for analysis. However, we found that the high concentration of HCl in this method sheared S phase (BrdU-substituted) DNA to very small fragments that precluded subsequent steps of the protocol. By titrating the HCl, we found that 0.1M HCl produced the optimal compromise between good S vs. G1 separation and minimal DNA shearing. For BrdU/PI double staining, correction of spectral overlap is critical for successful experiments. Spectral overlap exists between emission spectra of PI and FITC/AlexaFluor488 (for BrdU). Without correction, the BrdU/PI plots typically look like Fig. xxA. For this correction, the adjustment of the ratio between PI and AlexaFluor488 (or FITC) gains can significantly reduce the skewing shown in Fig. xxA. Subtraction of the FITC signal from the PI signal (=compensation) may also be required. To perform these corrections, a “BrdU-only” control is required, prepared by staining BrdU-labeled cells without the addition of PI. A “PI-only” control also helps, prepared by staining non BrdU-labeled cells for BrdU and PI. [Note: BrdU-labeled specimen stained for PI only does not reflect background signals derived from the anti-BrdU antibody and thus is not as good as unlabeled cells.] This step can be time-consuming but is critical for successful sorting. We suggest that you first adjust the gains of FSC and SSC, then adjust the PI and AlexaFluor488 gains by trial and error to get the best possible BrdU/PI plot. You may be able to obtain a reasonable BrdU/PI plot without compensation, but if not, compensate by subtracting FITC signal from PI signal. The lower the percentage subtracted, the better.
Fig.2 2D Cell Cycle Sorting for S and G1 phases. Cells labeled with BrdU and stained as described in Protocol 7, and then analyzed on a FACS. A. A typical non-corrected BrdU/PI plot. Note how the plot is skewed to the right due to spectral overlap. B. A corrected BrdU/PI plot . Sorting windows for nicely separated G1, S and G2/M phases of the cell cycle are indicated. PROTOCOL 7. S/G1 FACS Sorting TIMING 4 hours
Labeling and hybridization of amplified samplesIn order to avoid the ambiguity inherent to generalized methods, this protocol is based specifically on NimbleGen products. During our earliest attempts to profile replication timing genome-wide, NimbleGen outperformed other platforms we tested and became our platform of choice. Others have successfully applied this method to additional platforms8, 9, including deep sequencing of BrdU-IP DNA17. Currently, mammalian replication timing data generated from microarray hybridization and deep sequencing is of equal quality1, 3, while the microarray method remains more cost-effective and the bioinformatics are considerably less demanding for the typical laboratory. Future advances reducing BrdU-labeling times and sequencing limitations may make this method more worthwhile and accessible18. Beyond platform choice, array design is also an important consideration. The nature of your study should be a guide in selecting between the variety of available standard and custom designs. For our genome-wide studies in both mouse and human, 385K and 3X720K Comparative Genomic Hybridization (CGH) tiling arrays have sufficient probe densities, showing no disadvantage compared to high density 2.1M CGH tiling arrays1, 2, but have considerable cost and convenience advantages. Once a platform is chosen, the hybridization step is fairly straightforward. Array scanning should be carried out according to manufacturer recommendations avoiding unnecessary laser exposure. Though care should be taken to align channels with respect to signal intensity frequencies, minor differences between channels usually do not impact smoothed timing profiles after normalization. PROTOCOL 8. Labeling and hybridizing TIMING 1-3 daysMethod
***In our hybridization experiments, we used NimbleGen scanner GenePix 4000B and the accompanying NimbleGen Arrays User’s Guide, CGH Analysis v5.1. Newer equipment is accompanied with a newer version of the User's Guide and is operated in a slightly different manner. In v5.1 of the User's Guide, Chapter 3 covers Sample Labeling, Chapter 4 covers Hybridization and Washing, Chapter 5 covers Two-Color Array Scanning, and step 6 of Chapter 6 gives details on the generation of .pair reports. Part II: Normalization and computational analysis of replication timing datasetsGeneral methods for normalizing and analyzing microarray experiments for chromatin modifications or transcription at gene promoters have been described in detail in other works1-3. In this section of the protocol we focus on the methods specifically useful for replication timing analysis using whole-genome comparative genome hybridization (CGH) microarrays22. Similar to two-color microarray designs comparing an experimental sample to a reference, the replication timing experiments we describe employ a two-channel design comparing early versus late fraction enrichment for each target. Typically, we include two dye-swap replicates per sample to address bias due to dye-specific effects, such as more rapid photobleaching of Cy5 dye than Cy3. All of the analysis described here uses the R framework for statistical computing5-7. Through user-submitted packages that facilitate a wide variety of methods, R has become an indispensible tool for many common computational tasks. Although R has an initially steep learning curve due to its command line interface, help is available in many locations and forms, including books8-10, online manuals (http://cran.r-project.org/), and mailing lists aggregated in the R Mailing Lists Archive (http://tolstoy.newcastle.edu.au/R/). Help can also be found within R itself; other than the commands presented here, str() is often helpful for viewing the structure of variables and datasets created along the way, and the ? operator (e.g., ?data.frame() ) provides a help page for the corresponding function. Finally, while we present methods for extracting information from microarray experiments using a particular set of commands, these are intended as a set of guidelines or starting points rather than a cookbook, and a vast array of alternatives are available at every step. Quality controlThe overall quality of a microarray experiment can be examined from many angles. For replication timing analysis on CGH arrays we focus on six qualities important for reliable results:
Here, we highlight steps 2-6 in verifying the array data in this section, with starting material verified as outlined in Quality Control of S-phase DNA. Many of the diagnostics for array quality control are most naturally checked in the course of normalizing a dataset, and are presented here in that context. Diagnostic plots and commands that examine the properties outlined above are covered in quality control sections 1-6 below. NormalizationNormalization — the process of transforming microarray datasets to remove systematic bias from true biological signal — is a complex but vital part of any microarray study. Normalization methods are tailored to the platform and design of each experiment, but even within a platform a wealth of alternatives are available. Here, we describe a flexible method for normalizing two-color NimbleGen tiling CGH arrays for replication timing analysis. Our philosophy is to minimize the number of transformations applied to the data and apply only minimally invasive global methods for removing bias and scaling datasets to allow comparisons between them. We use the R package LIMMA (linear models for microarray data), also available with a user interface through the limmaGUI package, for normalization and scaling11,12. The steps for this process are straightforward, and are outlined below. To illustrate the method, we use two biological replicate datasets of mouse L1210 lymphoblast cells, available at [address]. 1.) Create RGL files from the original NimbleGen .pair files. These "Red-Green lists" contain columns for both red (Cy5) and green (Cy3) channel signal intensities. When reading large tables in R, such as .pair files, explicitly noting the number of rows and data type of each column as illustrated in lines X-Y will save a substantial amount of memory and calculation time. Occasionally, line 2 below will set the genomic position columns of large datasets as an integer type, which lacks the memory space to store large numbers. If so, set the type manually with > classes[X] = "numeric" (where X is the column number containing position information) after line 2. The "nrows" in lines 3 and 4 can be a modest overestimate; the correct number of rows will be present in the final table, but an estimate allows the system to allocate the correct amount of memory.
1| tab5rows <- read.delim("318990_4L1210LymphoblastP1_532.pair", header = TRUE, nrows = 5, skip=1) 2.) Create a "targets" text file that describes the target files for normalization. We will name this file "T.txt". Note that, to be read correctly, the file should be tab-delimited and contain only one carriage return at the end of the final line. Place this file in the same directory as the raw .pair files and .rgl files generated above.
3.) Install the LIMMA package. Install a current version of the LIMMA package according to the instructions at http://bioinf.wehi.edu.au/limma/ , or using the command line interface:
9| source("http://www.bioconductor.org/biocLite.R") 4.) Perform loess and scale normalization using LIMMA. This process is more straightforward than many two-color normalization methods, since NimbleGen arrays do not have print tip groups, spot background areas, or mismatch spots that must be accounted for. Loess normalization (normalizeWithinArrays) corrects the internal dependence of red-green ratios on their intensity independently for each array, and is examined further in sections QC#1 and QC#3. Scale normalization (normalizeBetweenArrays) equalizes the distribution of timing values between multiple samples for comparisons, and is verified in section QC#2. As with ChIP-chip methods13,14, this type of scale normalization may not be appropriate for examining subsets of the genome where large unbalanced timing changes are expected (e.g., timing of the X chromosome before and after inactivation), but is ideal for whole-genome analyses.
12| library(limma) This process generates three 'MAList'-type data objects: one that stores the raw array data (r), one that stores the samples after within-array normalization (MA.l), and one that stores the same information after scale normalization between arrays (MA.q). We examine these further in the quality control sections below. QC # 1 — Distribution of spot intensities for red and green channels The distributions of red and green spot intensities should be fairly well aligned , and have tails with high (above 211, or 2048) signal. Experiments in which signal intensity drops off more sharply will often show higher levels of noise in the final dataset.
18| plotDensities(r)
QC #2 — Distribution of ratio values before and after normalization Boxplots of the Cy5/Cy3 ratios for each experiment may be slightly different before normalization (and after within-array normalization), but 1st and 3rd quartiles (the box boundaries) of all experiments should be equal after between-array normalization.
21| boxplot(MA.l$M~col(MA.l$M), names=colnames(MA.l$M))
QC #3 — Ratio vs. intensity plots Create MA plots to examine dependence of ratios on signal intensities. Points will often be skewed to low Cy5/Cy3 ratios at low intensities due to photobleaching of Cy5, but should be corrected after within-array loess normalization in Limma. This bias is the most common artifact for NimbleGen arrays, but refer to [ref] for discussion of other common types of bias diagnosed with MA plots.
23| plotMA(r, array=1) # Raw data
5.) Create an intermediate file containing the normalized datasets. At this stage we can write a file containing the results of normalization and remove the other data from memory. (Alternatively, "rm(list=ls())"at line X will clear all variables.) This tab-delimited text file will be further processed to sort and average the normalized datasets and check other quality control measures.
25| write.table(MA.q$M, file="Loess_mLymph_112909.txt", quote=F, row.names=F, sep="\t") 6.) Assign position and chromosome information to the normalized datasets. This can be accomplished using the original .pair files, which typically contain this information in columns "POSITION" and "SEQ_ID" respectively.
27| tab5rows <- read.table("Loess_mLymph_112909.txt", header = TRUE, nrows = 5) Some data formats, such as HD2 triplex arrays, contain a different format of SEQ_ID column with chromosome and chromosome endpoints combined (e.g., "chr11:1-134452384"). In these cases use the following lines to parse chromosome labels from the PROBE_ID column:
35| b = a$PROBE_ID 7.) Plot timing values across a chromosome. This serves to verify the orientation for early/late domains, as well as the overall technical quality of the experiments.
43| RTb = subset(RT, RT$CHR == "chr1")
Fig.X Replication timing values across chromosome 1. Notice in Figure X that, due to the photobleaching of Cy5 diagnosed in QC#1, timing is skewed toward early values in replicate 1 and late values in replicate 2, demonstrating the practical advantages of dye-swap replicates. In lines X, also note that 1 and 2 are the column indices of the first and second replicate; additional experiments can be plotted together using higher row numbers for mfrow in line X. Using known regions of early or late replication, we can now verify that the timing values are properly oriented. If not, we reverse them by multiplying the appropriate data columns by -1. 47| RT[,c(1,2)] = RT[,c(1,2)] * -1 # Reverses early/late orientation For dye-swap replicates, LIMMA also has more advanced functions for identifying probes affected by dye effects in topTable(), though differentially affected probes are generally distributed equally in early and late domains and contribute to noise rather than systematic error. 8.) Average replicate datasets. This is quite straightforward: 48| RT$mLymphAve = (RT[,1] + RT[,2]) / 2 QC #4 — Autocorrelation function Since nearby loci should replicate almost synchronously, the correlation of timing between neighboring probes, or autocorrelation, is a useful measure of overall data quality. The autocorrelation function (acf) describes the correlation between neighboring data points as a function of their genomic distance. High-quality datasets will have a correlation between nearest neighbor timing values of R = 0.60-0.80. This measure of signal-to-noise ratio will improve as more replicates with equivalent states are averaged. Before checking the autocorrelation, it is important to ensure that the data is sorted by chromosome and genomic position. This can be accomplished with 49| RT = RT[order(RT$CHR, RT$POSITION),] Note that the order of mouse chromosomes by the default sorting method will be 1, 10-19, 2-9, X, then Y. This order itself is unimportant, but should be consistent across experiments to prevent errors in downstream analysis. Next, plot the autocorrelation and output the second (nearest neighbor) autocorrelation to the console:
50| acf(RT[,1],lag=1000)$acf[2] # Replicate 1: R = 0.742
QC #5 — False-color array image to check for spatial artifacts Most probes on tiling microarray designs are not (collinear), but instead randomly distributed with respect to genomic location. Therefore, spatial artifacts in the scanned images should not much affect timing values in any particular location in the genome, but may reduce the overall signal to noise ratio of the experiment if they cover a substantial portion of the array. To check for spatial artifacts, we examine the original .tiff images, as shown below.
Downstream analysis
|
| Rep1 | Rep2 | Ave | |
| Lymphoblast Rep1 | 1.000 | 0.978 | 0.995 |
| Lymphoblast Rep2 | 0.978 | 1.000 | 0.994 |
| Lymphoblast Ave | 0.995 | 0.994 | 1.000 |
The cor() function defaults to Pearson correlation but other methods are available (see ?cor in R). If missing values are present, add "na.rm=T" to remove them.
2.) Segmentation
Segmentation is a useful way to define the boundaries of replication domains and determine their average timing. Biologically, these segments appear to correspond to domains of coordinately regulated, synchronously firing origins that may be part of replication factories. We perform segmentation using the DNACopy algorithm designed by Venkatraman et al.33, which has been shown to perform favorably compared to available alternatives for CGH copy number analysis34, 35, 36.
79| library(DNAcopy)
80| mLymph = CNA(RT$mLymphAve, RT$CHR, RT$POSITION, data.type="logratio", sampleid = "mLymph")
81| Seg.mLymph = segment(mLymph, nperm=10000, alpha=1e-15, undo.splits="sdundo", undo.SD=1.5, verbose=2)
82| write.table(Seg.mLymph$output, row.names=F, quote=F, file="Mouse lymphoblast segments.txt", sep="\t")
The result of this process is a 'CNA' object called Seg.mLymph, which contains the raw data and segmentation breakpoints assigned by circular binary segmentation37. The number of segments assigned can be examined using str(Seg.mLymph$output), and visualized using various functions built into DNAcopy.
83| par(ask=T,mar=c(3.1,4.1,1,1)) # Set figure margins
84| plot(Seg.mLymph, plot.type="c") # Plot each chromosome separately
85| plot(Seg.mLymph, plot.type="s") # Plot overview of all chromosomes
86| plot(subset(Seg.mLymph,chromlist="chr2"), pch=19, pt.cols=c("gray","gray"), xmaploc=T, ylim=c(-3,3), main="")
Fig.X3 Raw (gray) and segmented (red) replication timing data along chromosome 2.
As before, a tab-delimited text file containing segment, and average replication timing values can be created:
87| write.table(Seg.mLymph$output, row.names=F, quote=F, sep=”\t”)
Though the goal of separating the genome into regions of synchronous replication is valuable in theory and very useful in practice, applying segmentation brings several challenges. The first is that not all domains of replication identified by the algorithm are bonafide synchronously replicating domains; in particular, regions where replication timing transitions from early to late can create artifactual "transition segments" that arise from the slope between early and late domains. If necessary, these can be filtered out by virtue of the fact that such segments are much smaller than typical domains and have replication timing values that are intermediate between their neighboring segments.
The second segmentation challenge is the sensitivity of most algorithms to both the properties of the data and internal parameters that can significantly affect the results. For DNACopy, the average number and size of segments are affected by the signal-to-noise ratio of the dataset: high-quality data will naturally have more significant transitions in timing and be assigned more segments. To control for this, two approaches are (possible):
1) Perform segmentation with parameters adjusted to offset quality difference between datasets (For DNAcopy, adjusting "undo.SD" is the most direct method)
2) Add a small amount of Gaussian noise to higher-quality datasets to equalize their autocorrelation before performing segmentation with equivalent parameters
In general, approach #1 is more subjective unless the effect of different levels of noise (autocorrelation levels) is thoroughly tested, making approach #2 preferable when only small quality differences separate the arrays (indeed, it is usually easier to compare segmentation results from arrays with lower, but more uniform, quality). For this purpose, we adapt the "jitter" function of Chambers et al.27, substituting Gaussian noise for uniformly distributed noise.
88| addNoise = function(x, factor = 1, amount = NULL) {
89| if (length(x) == 0)
90| return(x)
91| if (!is.numeric(x))
92| stop("'x' must be numeric")
93| z <- diff(r <- range(x[is.finite(x)]))
94| if (z == 0)
95| z <- abs(r[1])
96| if (z == 0)
97| z <- 1
98| if (is.null(amount)) {
99| d <- diff(xx <- unique(sort.int(round(x, 3 -
100| floor(log10(z))))))
101| d <- if (length(d))
102| min(d)
103| else if (xx != 0)
104| xx/10
105| else z/10
106| amount <- factor/5 * abs(d)
107| }
108| else if (amount == 0)
109| amount <- factor * (z/50)
110| x + stats::rnorm(length(x), 0, amount)
111| }
112| RT$mLymphR1noise = addNoise(as.numeric(RT$mLymphR1), factor=300, amount=.848)
113| RT$mLymphR2noise = addNoise(as.numeric(RT$mLymphR2), factor=300, amount=.968)
114|acf(RT$mLymphR1noise)$acf[2] #0.8274651
115|acf(RT$mLymphR2noise)$acf[2] #0.8274654
3.) Domain size analysis (boxplots of sizes)
After segmentation, the sizes of replication domains can be extracted from the segmented dataset to examine average sizes examined together or separately for domains with early or late timing.
116| LymphR1 = Seg.mLymphR1$output; LymphR2 = Seg.mLymphR2$output
117| LymphR1$size = LymphR1$loc.end - LymphR1$loc.start
118| LymphR2$size = LymphR2$loc.end - LymphR2 $loc.start
119| LymphR1Early = subset(LymphR1, LymphR1$seg.mean > 0)
120| LymphR2Late = subset(LymphR2, LymphR2$seg.mean < 0)
121| boxplot(LymphR1Early$size, LymphR2Early$size, LymphR1Late$size, LymphR2Late$size)
Dynamic changes in the timing program
To examine changes in the replication program during differentiation, several useful methods leverage the segmentation and loess smoothing methods introduced above. Alternatives, such as support vector machines and hidden Markov models, are reviewed by Diaz-Uriarte et al.38. Since no single method is sufficient to fully describe the type, degree, and distribution of timing changes during development, we cover several complementary ways to measure these properties and explore the relationships between cell types. These include:
1) The amount of the genome changing replication timing (percent changes analysis)
2) The degree and relationships of RT changes between cell types (clustering approaches)
3) The properties of domains that change timing upon differentiation (switching domain analysis)
1.) Percent changes analysis: The amount of the genome with differential timing between two or more cell types can be analyzed using an arbitrary, percentile, or significance-based cutoff for RT changes. Since the significance approach is affected by quality differences between datasets, and percentile cutoffs necessarily include an arbitrary fraction of the genome, we recommend scaling datasets to equivalent ranges and applying an empirical cutoff for biologically measurable timing changes to quantify these genome-wide. As most methods for quantifying timing changes are sensitive to scale differences, datasets should be scaled and normalized together in LIMMA as shown for mouse lymphoblast replicates 1 and 2.
122| RTd1 = RT$mLymphR1 - RT$mLymphR2
123| mLength = length(RTd1) # Total number of probes
124| s = 0.67 # Cutoff for significant changes
125| sum(abs(RTd1)>s)/mLength # Percentage changing, R1 vs. R2
126| sum(RTd1 < -s)/mLength # EtoL subset: 1.6%
127| sum(RTd1 > s)/mLength # LtoE subset: 1.3%
2.) Clustering approaches. Cluster analysis is a traditional staple for microarrays, particularly for analysis of differentially expressed genes. For replication profiles, our two most common methods are hierarchical clustering, which agglomerates experiments bottom-up from the most similar pairs of replication profiles to gradually more inclusive clusters, and k-means clustering, which partitions a set of timing profiles into a pre-set k number of similar patterns. For k-means clustering, we have used the programs Cluster39 and TreeView (http://rana.lbl.gov/EisenSoftware.htm) and refer readers to the corresponding guides. For hierarchical clustering, to compute clusters based on the stability of connections between cell types, we use the "pvclust" package in R40, which performs hierarchical clustering with p-values ascribed to each cluster. Since many clustering algorithms are computationally intensive, and to improve the precision of individual RT measurements, we typically average timing datasets into windows of approximately 200kb prior to clustering:
128| mLymph.R1 = NULL; mLymph.R2 = NULL
129| for (x in 1:10994) {
130| z1 = x * 35 # For 385k arrays,
131| z2 = (x+1) * 35 # 5.8kb median spacing * 35 = 203kb
132| mLymph.R1[x] = mean(RT$mLymphR1[z1:z2])
133| mLymph.R2[x]= mean(RT$mLymphR2[z1:z2])
134| cat("Current window: ", x, "\n")
135| }
136| RT = data.frame(mLymph.R1, mLymph.R2)
137| library(pvclust)
138| cluster.bootstrap <- pvclust(RT, nboot=1000, method.dist="abscor")
139| plot(cluster.bootstrap)
140| pvrect(cluster.bootstrap)
There are several reasons to take care when interpreting the results of hierarchical clustering: 1) a wide variety of topologies are possible for a single dendrogram, since any node can be flipped without changing the connections between clusters, 2) agglomerative clusters can change substantially when new experiments are added, and 3) the exact connections produced (though usually not the overall structure of the dendrogram) often change for different clustering algorithms or distance metrics. By default, pvclust measures distances between cell types using the common Euclidean distance metric. A wide variety of other options and other clustering approaches are also available through R; see http://cran.r-project.org/web/views/Cluster.html for an overview.
3.) Properties of RT switching domains. It is often useful to define the boundaries of domains that switch to earlier or later replication timing (“switching domains”) and analyze the properties of genetic and epigenetic elements within them. To compute these domains, we first subtract the normalized (not loess smoothed) values of the two experiments to be compared, then segment the resulting profile of timing changes as shown below:
141| dRT = CNA(RT$NPCave-RT$ESCave, RT$CHR, RT$POSITION, data.type="logratio", sampleid= "NPC-ESC dRT")
142| Seg.dRT = segment(dRT, nperm=10000, alpha=1e-15, undo.splits = "sdundo", undo.SD=1.5, verbose=2)
143| dRTdom = Seg.dRT$output
144| quantile(dRTdom$seg.mean, probs = c(0.05, 0.95)) # Top 5% >
145| quantile(dRTdom$seg.mean, probs = c(0.40, 0.60)) # Middle 20%
146| LtoEdom = subset(dRTdom, dRTdom$seg.mean > 1.28552)
147| EtoLdom = subset(dRTdom, dRTdom$seg.mean < -1.32328)
148| middleDom = subset(dRTdom, dRTdom$seg.mean > -0.14808 & dRTdom$seg.mean < 0.23698)
Comparison and alignment to outside datasets
While much useful information can be derived from the properties of replication domains themselves, the vast amount of publicly accessible data made available through initiatives such as ENCODE41, 42, 43 and public repositories like GEO44, 45 allow the timing program to be compared to a wide array of genome-wide or gene-centric data. However, the differences in formats between ChIP-chip, ChIP-seq, and other approaches requires some care in processing, and almost any dataset has specific quirks Here, we cover methods that are generally useful for comparing replication timing to other epigenetic datasets.
1.) Assignment of replication timing values to gene promoters
Although the main purpose of loess is to derive an overall smoothed replication timing profile, the smoothed data object produced is valuable for other tasks as well, since it can be interrogated at any set of genomic coordinates. This is especially helpful for comparing datasets from different array platforms, since it allows us to calculate timing (or other) values at a shared set of coordinates. Using this approach, we can assign timing data to NCBI's RefSeq gene promoter locations (http://www.ncbi.nlm.nih.gov/RefSeq/) as outlined below:
149| chrs = levels(RefSeq$CHR) #[DEFINE RefSeq]
150| AllSm = NULL # Store smoothed data for all chromosomes
151| ChrSm = NULL # Store smoothed data for current chromosome
152| for(chr in chrs) {
153| RTc = subset(RT, CHR == chr)
154| RSc = subset(RefSeq, CHR == chr)
155| cat("Current chromosome: ", chr, "\n")
156| lspan = 300000/(max(RTc$POSITION)-min(RTc$POSITION))
157|
158| smLym1 = loess(RT$mLymphR1 ~ RT$POSITION, span = lspan)
159| smLym2 = loess(RT$mLymphR2 ~ RT$POSITION, span = lspan)
160| smLym3 = loess(RT$mLymphAve ~ RT$POSITION, span = lspan)
161|
162| Lym1 = predict(smLym1, RSc$TSS)
163| Lym2 = predict(smLym2, RSc$TSS)
164| Lym3 = predict(smLym3, RSc$TSS)
165|
166| ChrSm = data.frame(CHR=chr,POSITION= RSc$TSS, Lym1, Lym2, Lym3)
167| AllSm = rbind(AllSm, ChrSm)
168| }
169| write.table(AllSm, “Mouse lymphoblast RT at RefSeq gene positions.txt”, quote=F, sep=”\t”)
The “TSS” column in the RefSeq gene table represents transcription start sites. These are equivalent to the ‘txStart’ position in the original RefSeq table for genes with forward (“+”) orientation, and ‘txEnd’ for genes in reverse (“-“) orientation.
2.) Assignment of histone and other epigenetic marks to gene promoters
Through the ENCODE project and other initiatives, a wide array of epigenetic datasets have been made available at GEO, the UCSC Genome Browser, and similar data repositories. Just as timing can be assigned to gene promoters as outlined above, these datasets can be used to assign values at any set of genomic loci, including gene promoters or the boundaries of replication domains. However, in all cases, special care must be paid to ensure that datasets are compared in compatible genomic builds and equivalent cell types, and that the method of quantification is consistent with the methods and goals of the studies involved.
Below is a generic method for assigning values from epigenetic mark datasets to promoter regions surrounding gene transcription start sites. Since histone marks are regulated in much smaller units than replication timing, we typically assign individual points or averages to promoters rather than a loess-smoothed value (however, the latter may be feasible with sufficiently high resolution and small span). Two files are required: One with columns describing the genomic coordinate, orientation, and chromosome of each gene (RefSeq), and another with the coordinate, chromosome, and data value for each mark (Marks). For our example, we match several modifications from a generic genomewide ChIP-seq experiment to windows +500 to -2500 bases from RefSeq gene promoters.
170| chrs = levels(Marks$CHR); AllGenes = NULL; AllHist = NULL
171| for (chr in chrs) {
172| RSc = subset(RefSeq, CHR == chr)
173| RTc = subset(Marks, CHR == chr)
174| for(m in 1:nrow(RSc)) {
175| if(RSc[m,]$Strand == "+") {
176| RTcSub = subset(RTc, (RTc$Start < RSc[m,]$txStart +500) & (RTc$Start > RSc[m,]$txStart - 2500))
177| AllHist = rbind(AllHist, apply(RTcSub, 2, max)[3:12])
178| AllGenes = rbind(AllGenes, RSc[m,]$Gene)
179| }
180| if(RSc[m,]$Strand == "-") {
181| RTcSub = subset(RTc, (RTc$Start < RSc[m,]$txEnd +2500) & (RTc$Start > RSc[m,]$txEnd - 500))
182| AllHist = rbind(AllHist, apply(RTcSub, 2, max)[3:12])
183| AllGenes = rbind(AllGenes, RSc[m,]$Gene)
184|
185| }
186| cat("Chromosome:", chr, " Gene:", m, "/", nrow(RSc), "\n")
187| }
188| }
189| OutFile = data.frame(cbind(AllGenes, AllHist), stringsAsFactors=F)
190| write.table(OutFile, file="Histone modifications at RefSeq gene positions.txt", row.names=F, quote=F, sep="\t")
Note that line X assigns the highest value within the promoter window to the gene. Other approaches to assign histone modification values to genes include averaging the number of reads within the body of genes46, individually analyzing equally spaced bins across open reading frames47, and assessing promoters with significant binding above background48. Bear in mind that the transcription start site may not be the best target for all modifications; indeed, for H3K36me3 marking transcription elongation, values at the transcription endpoint or exon 5' ends may better represent overall enrichment49. For some biological questions, additional information can be gleaned by analyzing reads separately for forward and reverse strands. In general, the methods of quantification should be informed by the biological properties of the marks to be quantified and the questions to be addressed.
3.) Integration of epigenetic mark values over replication domains (density vs. timing)
Given that the magnitude of correlations between genetic properties generally increases when measured in larger windows [ref], it is important to quantify these relationships in windows consistent with biologically regulated unit sizes. The method below can be used to compute correlations between domainwide replication timing values and the average level of epigenetic marks within timing domains segmented above.
191| Seg.RT = read.table("Lymph-1 segments.txt",header=T)
192| chrs = levels(Seg.RT$chrom)
193| MarksData = NULL # Store averaged mark data in domains
194| RTData = NULL
195| dom = 0
196| for(chr in chrs) {
197| Seg.RTb = subset(Seg.RT, Seg.RT$chrom == chr)
198| MarksB = subset(Marks, Marks$CHR == chr)
199| for (i in 1:dim(Seg.RTb)[1]) {
200| # Find subset of marks in RT domain
201| cat("Current chr:", chr, " Domain:", i, "\n")
202| MarksD = subset(MarksB, MarksB$Start > Seg.RTb[i,]$loc.start & MarksB$Start < Seg.RTb[i,]$loc.end)
203| MarksD = MarksD[,3:12]
204| MarksD[,1:10] = MarksD[,1:10] - MarksD[,1]
205| MarksData = rbind(MarksData, apply(MarksD,2, "mean"))
206| dom = dom + 1
207| }
208| }
209| cor(Seg.RT$seg.mean, data.frame(MarksData))
210| plot(Seg.RT$seg.mean, data.frame(MarksData)[1])
- Hiratani, I. et al. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS biology 6, e245(2008).
- Hiratani, I. et al. Genome-wide dynamics of replication timing revealed by in vitro models of mouse embryogenesis. Genome research 20, 155-69(2010).
- Ryba, T. et al. Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome research 20, 761-70(2010).
- Pope, B.D., Hiratani, I. & Gilbert, D.M. Domain-wide regulation of DNA replication timing during mammalian development. Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology 18, 127-36(2010).
- Yaffe, E. et al. Comparative analysis of DNA replication timing reveals conserved large-scale chromosomal architecture. PLoS genetics 6, e1001011(2010).
- Farkash-Amar, S. et al. Global organization of replication time zones of the mouse genome. Genome research 18, 1562(2008).
- Karnani, N. et al. Pan-S replication patterns and chromosomal domains defined by genome-tiling arrays of ENCODE genomic areas. Genome research 17, 865-76(2007).
- Schübeler, D. et al. Genome-wide DNA replication profile for Drosophila melanogaster: a link between transcription and replication timing. Nature genetics 32, 438-42(2002).
- Schwaiger, M. et al. Chromatin state marks cell-type- and gender-specific replication of the Drosophila genome. Genes & development 23, 589-601(2009).
- Lee, T. et al. Arabidopsis thaliana chromosome 4 replicates in two phases that correlate with chromatin state. PLoS genetics 6, e1000982(2010).
- Koren, A., Soifer, I. & Barkai, N. MRC1-dependent scaling of the budding yeast DNA replication timing program. Genome research 20, 781-90(2010).
- Gilbert, D.M. Temporal order of replication of Xenopus laevis 5S ribosomal RNA genes in somatic cells. Proceedings of the National Academy of Sciences of the United States of America 83, 2924-8(1986).
- Gilbert, D.M. & Cohen, S.N. Bovine papilloma virus plasmids replicate randomly in mouse fibroblasts throughout S phase of the cell cycle. Cell 50, 59-68(1987).
- Aladjem, M.I. et al. Replication initiation patterns in the beta-globin loci of totipotent and differentiated murine cells: evidence for multiple initiation regions. Molecular and cellular biology 22, 442-52(2002).
- Acevedo, L.G. et al. Genome-scale ChIP-chip analysis using 10,000 human cells. BioTechniques 43, 791-7(2007).
- O'Geen, H. et al. Comparison of sample preparation methods for ChIP-chip assays. BioTechniques 41, 577-80(2006).
- Hansen, R.S. et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proceedings of the National Academy of Sciences of the United States of America 107, 139-44(2010).
- Pombo, A. & Gilbert, D.M. Nucleus and gene expression: the structure and function conundrum. Current opinion in cell biology (2010).doi:10.1016/j.ceb.2010.04.010
- Royce, T.E. et al. Extrapolating traditional DNA microarray statistics to tiling and protein microarray technologies. Methods in enzymology 411, 282-311(2006).
- Bolstad, B.M. et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England) 19, 185-93(2003).
- Yang, Y.H. et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic acids research 30, e15(2002).
- Pollack, J.R. et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature genetics 23, 41-6(1999).
- Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology 5, R80(2004).
- Team, R.D. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing Vienna Austria ISBN 3, 2673(R Foundation for Statistical Computing: 2008).
- Crawley, M.J. The R Book. 950(Wiley: 2007).
- Chambers, J.M. Software for Data Analysis: Programming with R. 498(Springer: 2008).
- Spector, P. Data Manipulation with R. 154(Springer Publishing Company, Incorporated: 2008).
- Wettenhall, J.M. & Smyth, G.K. limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics (Oxford, England) 20, 3705-6(2004).
- Smyth, G.K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology 3, 3(2004).
- Peng, S. et al. Normalization and experimental design for ChIP-chip data. BMC bioinformatics 8, 219(2007).
- Gottardo, R. Modeling and analysis of ChIP-chip experiments. Methods in molecular biology (Clifton, N.J.) 567, 133-43(2009).
- Venkatraman, E.S. & Olshen, A.B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics (Oxford, England) 23, 657-63(2007).
- Dellinger, A.E. et al. Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays. Nucleic acids research (2010).doi:10.1093/nar/gkq040
- Willenbrock, H. & Fridlyand, J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics (Oxford, England) 21, 4084-91(2005).
- Lai, W.R. et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics (Oxford, England) 21, 3763-70(2005).
- Olshen, A.B. et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5, 557-72(2004).
- Díaz-Uriarte, R. & Rueda, O.M. ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data. PloS one 2, e737(2007).
- Eisen, M.B. et al. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 95, 14863-8(1998).
- Suzuki, R. & Shimodaira, H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics (Oxford, England) 22, 1540-2(2006).
- The ENCODE (ENCyclopedia Of DNA Elements) Project. Science (New York, N.Y.) 306, 636-40(2004).
- Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799-816(2007).
- Rosenbloom, K.R. et al. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic acids research 38, D620-5(2010).
- Edgar, R., Domrachev, M. & Lash, A.E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research 30, 207-10(2002).
- Barrett, T. et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic acids research 37, D885-90(2009).
- Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823-37(2007).
- Salcedo-Amaya, A.M. et al. Dynamic histone H3 epigenome marking during the intraerythrocytic cycle of Plasmodium falciparum. Proceedings of the National Academy of Sciences of the United States of America 106, 9655-60(2009).
- Guenther, M.G. et al. A chromatin landmark and transcription initiation at most promoters in human cells. Cell 130, 77-88(2007).
- Hon, G., Wang, W. & Ren, B. Discovery and annotation of functional chromatin signatures in the human genome. PLoS computational biology 5, e1000566(2009).











