Sarah Louie, Hajime Ogino, Robert Grainger
Details of these steps are discussed below in conjunction with a "Genomial Tutorial" .pdf file.
Conserved non-coding regions of genomes are of great interest to many biologists because, in many cases, they function as gene regulatory sequences. Comparative genomics exploits the tendency of functional sequences to be preserved by selection during evolution, and so distinguishes them from nonfunctional orthologous sequences through multispecies comparisons. The evolutionary distance between the Xenopus tropicalis and human genomes is approximately 350 million years (Hedges, 2006), making the Xenopus genome well positioned for identifying conserved non-coding elements by phylogenetic analysis. For example, enhancer sequences can be identified through alignments between Xenopus and mammals (Ogino, 2008). However, many enhancers (and some exons) cannot be identified through alignments comparing fish and mammalian genomes because of the greater evolutionary distance and, in the case of zebrafish, genome duplication (Schwartz, 2003; and our observations). Because our lab is interested in the regulatory networks involved in eye formation, we have evaluated many bioinformatic methods for identifying potential regulatory elements of eye genes using the Xenopus genome. The particular set of programs and processes described here has enabled us to rapidly and accurately predict functional regulatory elements in genes involved in eye development, as validated with transgenic methods in Xenopus. What follows is a description of a bioinformatic protocol, which we continue to refine as tools emerge and are updated. Please send comments and suggestions to email@example.com.
The goal of phylogenetic analysis is to identify conserved non-coding sequences that are likely to function in gene regulation. Since enhancer sequences may be relatively short and distant from the coding region of a gene, they can be difficult to predict. By using genome alignment tools with appropriate levels of sensitivity and specificity, and incorporating genomes spanning several hundred million years of evolutionary pressure, it has become possible to resolve functional noncoding sequences from non-regulatory sequence within Mb-sized regions of genomic sequence. Several steps are involved in predicting evolutionarily conserved regulatory regions (ECRs) with bioinformatics: 1) identify large syntenic regions in multiple genomes of interest; 2) mask repeats and build sequence structure files; 3) perform pairwise and multiple alignments; and 4) extract short conserved regions for further study.
OVERVIEW (Slide 4)
(1) Extract genomic sequences of interest from genome assembly data.
The first step in generating an alignment with MultiPipMaker is creating all the sequence files needed by this tool (Slide 5). Raw sequence data is needed for each genome to be included in the alignment, all in the simple text format.
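Before uploading, it can help to confirm that each downloaded file really is a single plain-text nucleotide sequence. The following is a minimal sketch of such a check, not part of the protocol itself; the function names are our own illustrative choices, and it assumes each file holds one sequence, optionally preceded by a FASTA-style ">" header line.

```python
from pathlib import Path

# Characters accepted as nucleotide sequence (lowercase is kept because
# genome browsers often use it to mark repeats).
VALID = set("ACGTNacgtn")


def read_sequence(text):
    """Return (header, sequence) from FASTA-style or bare sequence text.

    header is None when the text has no '>' line.
    """
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or parts:
                raise ValueError("expected a single sequence per file")
            header = line[1:].strip()
            continue
        parts.append(line)
    return header, "".join(parts)


def unexpected_characters(seq):
    """Return the set of characters that are not A, C, G, T or N."""
    return set(seq) - VALID


def save_plain_text(seq, path, width=60):
    """Write the sequence as simple text, wrapped at `width` characters."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")
```

A typical use would be to call read_sequence on the contents of each downloaded file, check that unexpected_characters returns an empty set, and note the sequence length for the coordinate work in the next step.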
(2) Build accompanying text files.
The second step in preparing for the alignment is building the exon, repeat and underlay files to accompany the sequence you choose to be the base sequence for the alignment (Slide 12). The repeat and exon files are text files that indicate the locations of repetitive sequences and exons, respectively, in the base sequence of the alignment. The underlay file is a text file that instructs the program to shade specific gene structures (e.g., exons and introns) with colors in the output. First retrieve the exon and underlay files from PipHelper (http://pipmaker.bx.psu.edu/cgi-bin/piphelper) by entering the genome and location corresponding to the base sequence downloaded from the UCSC genome browser (Slide 13). Click on the exon and underlay file links (Slide 14), and save this information as separate simple text files as shown in the examples for the Pax6CE1 (Slides 15-16). The repeat file produced by PipHelper can be used; however, the RepeatMasker tool at the ISB website (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) (Slide 17) reports low complexity regions in addition to the simple repeats documented by PipHelper for the Pax6CE1 example. Simply enter the sequence file and source genome (Slide 18) and open the text file (Slide 19), then save the resulting documentation (Slide 20) as a simple text file. If these files for your base sequence are not available from PipHelper (as is occasionally the case for Xenopus), you have to make them yourself by analyzing gene structures with standard DNA analysis tools (such as Entrez and Vector NTI) and using the examples as templates to create the necessary files.
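When the exon file has to be made by hand, a short script can reduce coordinate mistakes. The sketch below is only an illustration under stated assumptions: it assumes the exon file layout is a gene line consisting of a direction character (">" or "<"), the gene start, the gene end and the gene name, followed by one "start end" line per exon, with all positions counted from the first base of the base sequence. Compare this layout with the PipHelper examples (Slides 15-16) before relying on it. The gene name and coordinates in the usage note are hypothetical.

```python
def to_local(genome_pos, region_start):
    """Convert a chromosome coordinate to a 1-based position in the
    extracted base sequence, whose first base is at region_start."""
    if genome_pos < region_start:
        raise ValueError("position lies upstream of the extracted region")
    return genome_pos - region_start + 1


def format_exons_file(genes):
    """Build exon file text from a list of gene dictionaries.

    Each gene has: name, strand ('+' or '-'), start, end, and
    exons as a list of (start, end) pairs, all in base-sequence
    coordinates. The output layout is an assumption (see lead-in).
    """
    lines = []
    for gene in genes:
        direction = ">" if gene["strand"] == "+" else "<"
        lines.append(f'{direction} {gene["start"]} {gene["end"]} {gene["name"]}')
        for start, end in sorted(gene["exons"]):
            if not (gene["start"] <= start <= end <= gene["end"]):
                raise ValueError(f"exon {start}-{end} lies outside {gene['name']}")
            lines.append(f"{start} {end}")
    return "\n".join(lines) + "\n"
```

For example, a hypothetical gene "geneX" on the plus strand spanning positions 101 to 900 of the base sequence, with exons at 101-250 and 500-620, would be written as a "> 101 900 geneX" line followed by "101 250" and "500 620". An underlay file could be generated in the same way once its layout has been checked against the PipHelper example.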
(3) Generate a multiple genomic sequence alignment.
To generate an alignment of the genomic sequences retrieved from the genome browser(s) (Slide 21) we use the more sensitive global alignment program PipMaker (http://pipmaker.bx.psu.edu/pipmaker/). The PipMaker website offers three programs: (basic) PipMaker, Advanced PipMaker, and MultiPipMaker (Slide 22). PipMaker and Advanced PipMaker generate pairwise alignments, and MultiPipMaker is used to align three or more sequences. Links to detailed instructions for these programs are found on the website. Click on MultiPipMaker and enter the number of sequences to be aligned on the next page (Slide 23). Select the “Generate nucleotide level view” option to see the raw sequence alignment data in addition to a schematic “Pip” view. Next, upload the DNA sequence, repeat, exon and underlay files for the base sequence (Slide 24) and check the “use as default in pip” box. An annotation file is not necessary for the analysis. Enter names and upload sequence files for the additional orthologous loci of interest in the spaces below and check the “Search both strands” and “High sensitivity and low time limit” options. Detailed instructions for the “Show all matches”, “Chaining”, and “Single coverage” selections are found on the Advanced PipMaker instructions website (http://pipmaker.bx.psu.edu/pipmaker/pip-instr3.html). “Show all matches” is the setting we often use, though it can lead to a somewhat complex output view if regions of the base sequence align with multiple sequences in the second genome.
(4) Perform a local realignment of ECR sequences.
It is helpful to realign the ECR region of interest and shade conserved residues with a more accurate local alignment program before attempting to phylogenetically footprint the TFBMs predicted in step (5) (Slide 29). Vector NTI (Invitrogen) is an easy way to extract, realign, and shade in one program. ClustalW (http://www.ebi.ac.uk/clustalw/) is a proven local alignment program, and can be used in combination with BOXSHADE to generate virtually the same results as with Vector NTI. To use ClustalW, first open the genome sequences that you used for PipMaker analysis with a standard DNA analysis tool (Vector NTI, etc.), and create new text files in FASTA format for just the ECR sequence in each genome. The Sequence Manipulation Suite also has a DNA sequence Range Extractor function that is a helpful online alternative (http://www.bioinformatics.org/sms2/range_extract_dna.html). An easy way to access the ClustalW and BOXSHADE programs in tandem is via the SDSC Biology WorkBench (http://workbench.sdsc.edu/) (Slide 30). Create a free account (Slide 31), enter the WorkBench, and click “Nucleic Tools”. In the next window, click “Add” to upload your ECR sequence files, then click “Save” to save each file to your account (Slide 32). Next select the sequence files and click “ClustalW-multiple sequence alignment”, then “Run” and “Save” to store your alignment in the WorkBench. Click on “Alignment Tools” (Slide 33), click on “BOXSHADE-color-coded plots of pre-aligned sequences”, select your saved alignment file, then click “run”. In the next window, shading options for conserved and similar residues are presented and can be adjusted as desired. Click “Submit” to show the shaded ECR alignment (Slide 34). This can be saved and opened as a pdf file.
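The extraction of the ECR sequence from each genome file can also be scripted. The sketch below is one possible way to do this, assuming the ECR boundaries are given as 1-based, inclusive positions in the same sequence that was submitted to PipMaker; the function names and the record name in the usage note are hypothetical.

```python
def extract_range(seq, start, end):
    """Return the subsequence from start to end (1-based, inclusive)."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError(f"range {start}-{end} is outside 1-{len(seq)}")
    return seq[start - 1:end]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record wrapped at `width` columns."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"
```

For instance, to_fasta("human_ECR", extract_range(human_seq, 1201, 1450)) would give a FASTA record for a hypothetical ECR at positions 1201-1450 of a sequence held in human_seq. Repeating this for each genome, with the coordinates of the aligned region in that genome, produces the set of files to load into ClustalW.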
(5) Identify putative transcription factor binding sites conserved in the ECR.
The final step of our protocol involves the in silico prediction of transcription factor binding motifs (TFBMs) that are conserved in the ECR of interest (Slide 35). Many programs that search for TFBMs in DNA sequences exist, and they fall into three groups: pattern search programs, weight matrix search programs, and HMM-based programs. Because HMM-based programs require some specialized mathematical knowledge, pattern and weight matrix search programs are more widely used for conventional analysis. The search may be done using TRANSFAC or other available databases, or using a user-defined list of motifs (Slide 36). Databases such as TRANSFAC give an extremely inclusive view of many possible TFBMs, and may provide excellent insight into unexpected mechanisms. However, the TFBMs included in the databases are not all selected by the most stringent means, or by criteria that accurately reflect in vivo binding sites. A more limited but highly useful strategy is to develop a user-defined list of TFBMs curated for a particular tissue, stage of development, etc. by searching the literature for TFBMs that have been validated functionally, preferably by chromatin immunoprecipitation (ChIP) analysis in vivo.
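To make the pattern search idea concrete, the sketch below scans sequences on both strands for motifs written as IUPAC consensus strings and reports which motifs occur in every species. It is a simplified illustration, not the method used by TRANSFAC-based tools: it tests only for presence of a motif in each ECR sequence, not for matching positions in the alignment, and the motif names and consensus strings in the usage note are placeholders rather than a curated list.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(consensus):
    """Reverse complement of an IUPAC consensus string."""
    return consensus.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq, consensus):
    """Return sorted (position, strand, matched_text) tuples.

    Positions are 1-based on the given sequence; '-' hits are matches
    to the reverse complement of the consensus. Overlapping matches
    are reported.
    """
    seq = seq.upper()
    hits = set()
    for strand, pattern in (("+", consensus.upper()),
                            ("-", reverse_complement(consensus))):
        regex = re.compile("(?=(" + "".join(IUPAC[c] for c in pattern) + "))")
        for match in regex.finditer(seq):
            hits.add((match.start() + 1, strand, match.group(1)))
    return sorted(hits)


def motifs_in_all(ecr_by_species, motifs):
    """Names of motifs with at least one hit in every species' ECR."""
    return sorted(
        name for name, consensus in motifs.items()
        if all(find_motif(seq, consensus) for seq in ecr_by_species.values())
    )
```

With a user-defined dictionary such as {"motif_1": "TAATCY", "motif_2": "WGATAR"} and a dictionary mapping species names to their ECR sequences, motifs_in_all returns the motif names found in all species. Candidate sites from find_motif can then be compared against the shaded alignment from step (4) to see whether they fall at corresponding positions.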
The protocol described here is only a starting point for the analysis of conserved elements, because the elements it identifies are only untested predictions of possible function. Once an ECR is identified, it is necessary to test whether it has gene regulatory activity. This can be done rapidly using transgenesis in Xenopus, following a protocol provided on our lab website (http://faculty.virginia.edu/xtropicalis/overview/transgen_protocol.html). Perhaps the greatest utility of the Xenopus system for this kind of study is the extraordinary efficiency with which one can make transgenic embryos, and thereby analyze the activity of putative enhancers in reporter constructs, either in their native form or after mutagenesis to test the significance of particular sites within a regulatory element. Finally, in vitro mutagenesis and in vivo ChIP assays are necessary to test the functionality of predicted TFBMs within a conserved regulatory element. This has been done successfully for the Lens1/Foxe3 enhancer by Hajime Ogino (Ogino, 2008), and the protocol for this is also provided on the Grainger lab website (http://faculty.virginia.edu/xtropicalis/chIP_analysis.htm).
Ivan Ovcharenko, Marcelo Nobrega, Gabriela Loots, Lisa Stubbs. ECR Browser: a Tool for Visualizing and Accessing Data from Comparisons of Multiple Vertebrate Genomes. Nucleic Acids Research (2004) 32: W280-W286.
Ivan Ovcharenko, Gabriela Loots, Ross Hardison, Webb Miller, Lisa Stubbs. zPicture: Dynamic Alignment and Visualization Tool for Analyzing Conservation Profiles. Genome Research (2004) 14:472-477.
Kleinjan DA, van Heyningen V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet. (2005) Jan;76(1):8-32.
Lettice LA, Hill AE, Devenney PS, Hill RE. Point mutations in a distant sonic hedgehog cis-regulator generate a variable regulatory output responsible for preaxial polydactyly. Hum Mol Genet. (2008) Apr 1;17(7):978-85.
Ogino H, Fisher M, Grainger RM. Convergence of a head-field selector Otx2 and Notch signaling: a mechanism for lens specification. Development. (2008) Jan;135(2):249-58.
Scott Schwartz, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs, Ross Hardison, and Webb Miller. PipMaker: A Web Server for Aligning Two Genomic DNA Sequences. Genome Research (2000) 10: 577-586.
Scott Schwartz, Laura Elnitski, Mei Li, Matt Weirauch, Cathy Riemer, Arian Smit, Eric Green, Ross Hardison, NISC Comparative Sequencing program, Webb Miller. MultiPipMaker and Supporting Tools: Alignments and Analysis of Multiple Genomic DNA sequences. Nucleic Acids Research (2003) 31: 3518-3524.
Scott Schwartz, W. James Kent, Arian Smit, Zheng Zhang, Robert Baertsch, Ross C. Hardison, David Haussler, and Webb Miller. Human-Mouse Alignments with BLASTZ. Genome Res. (2003) 13: 103-107.