
The system integrates numerous computational methods to comprehensively annotate the regulatory features of five mammalian genomes: human, mouse, rat, chimpanzee, and dog. The regulatory features described below are the Transcription Start Site (TSS), first exon end position, Transcription Factor Binding Site (TFBS), CpG island, G + C content, repeats (SINE, LINE, tandem repeats, and so on), TATA box, CCAAT box, GC box, statistically over-represented (OR) oligonucleotides, DNA stability, microRNA target sites, and Single Nucleotide Polymorphisms (SNPs).

 

*          Transcription Start Site

        The transcription start site (TSS) is the site at which the production of an mRNA molecule is initiated. Important regulatory elements are usually located near the TSS, in the so-called gene promoter region. The system collected the start sites of known genes for five mammals (human, mouse, rat, chimpanzee, and dog) from the Ensembl genome database; the numbers of known genes are 22774, 25420, 22159, 22475, and 18201, respectively. Because users input a sequence to be searched for homology with known gene promoter sequences, the Ensembl-annotated start sites are used to extract the promoter regions. By default, the region from 2000 bp upstream of the TSS (+1) to the first exon end is extracted and defined as the promoter region. In addition, GPMiner integrates DBTSS, a full-length cDNA library of experimentally determined human and mouse TSSs, to improve the annotation of gene start sites.

Users can also input a novel sequence to have putative TSSs annotated. The system integrates the TSS prediction tool Eponine, which detects transcription initiation sites near a TATA box together with flanking regions of G-C enrichment. A score threshold between 0 and 1.0 must be set; the value is set to 0.8 for high prediction accuracy. A lower score threshold yields more TSS predictions but also more false positives.

 

*          First Exon End Position (only for known genes)

The first exon end position is annotated only for known genes. Using the Ensembl core libraries, the most 5' exon of each known gene is extracted and used to determine the first exon end position. The average distances between the TSS (+1) and the first exon end in human, mouse, rat, chimpanzee, and dog are 308, 369, 300, 310, and 246 bp, respectively. The system uses the first exon end position to define the promoter regions of known genes.
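The promoter definition used for known genes (2000 bp upstream of the TSS through the first exon end) can be sketched as a small coordinate calculation. This is only an illustration, not GPMiner's code: the function name `promoter_region`, the 1-based inclusive coordinates, and the strand handling are assumptions made for the example.

```python
def promoter_region(tss, first_exon_end, strand, upstream=2000):
    """Return (start, end) of the promoter in 1-based inclusive coordinates.

    Hypothetical helper: assumes genomic coordinates increase along the
    forward strand, so on the reverse strand "upstream" means larger values.
    """
    if strand == "+":
        if first_exon_end < tss:
            raise ValueError("first exon end must not precede the TSS on '+'")
        # Clamp at position 1 so the region never runs off the sequence start.
        return max(1, tss - upstream), first_exon_end
    if strand == "-":
        if first_exon_end > tss:
            raise ValueError("first exon end must not exceed the TSS on '-'")
        # No clamp at the chromosome end: its length is not known here.
        return first_exon_end, tss + upstream
    raise ValueError("strand must be '+' or '-'")


# Forward-strand gene whose first exon is 308 bp long.
print(promoter_region(5000, 5307, "+"))
```

The sketch does not clamp at the chromosome end on the reverse strand, because the chromosome length is not part of the inputs assumed here.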

 

*          CpG Island

        In vertebrate genomes, CpG islands (CGIs) are involved in the regulation of gene transcription through DNA methylation. About 50-60% of human genes exhibit a CGI over the transcription start site (TSS), but not all CGIs are associated with promoter regions (Larsen et al., 1992). The CGIs associated with promoters can be identified a priori from their structural characteristics (greater size, higher G+C content, and higher CpG o/e ratio; Ioshikhes and Zhang, 2000; Ponger et al., 2001). CpGProD, which detects CGIs in promoter regions with a prediction specificity of about 70%, is integrated by GPMiner to search for CGIs in the input sequence. CGIs are defined as DNA regions longer than 500 nucleotides, with a moving average G+C frequency above 0.5 and a moving average CpG observed/expected (o/e) ratio greater than 0.6. Information about CpG islands can help improve the prediction of gene promoter regions.
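The three CGI criteria above can be checked for a single candidate region with a few lines of Python. This is a simplified sketch, not CpGProD itself: it evaluates one region as a whole rather than a moving average, it uses the common definition o/e = (CpG count × length) / (C count × G count), and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed/expected ratio) for one region."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    expected = c * g / n   # expected CpG count if C and G were independent
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def meets_cgi_criteria(seq, min_len=500, min_gc=0.5, min_oe=0.6):
    """Apply the length, G+C, and CpG o/e thresholds quoted in the text."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction > min_gc and obs_exp > min_oe


print(meets_cgi_criteria("CG" * 300))  # CpG-rich, 600 nt
print(meets_cgi_criteria("AT" * 300))  # no G or C at all
```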

 

*          G + C Content

        The G + C content represents the frequency of the nucleotides G and C in a given window. The default window size is 15 nt, sliding 1 nt at a time. The G + C content profile helps in observing CpG islands and GC boxes in the promoter region. Most genes have been found to have a high G + C content in their promoter regions.
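A sliding-window G + C profile with the default settings described above (15 nt window, 1 nt step) might look like the following. The function name and the choice to report one value per window start are assumptions for illustration.

```python
def gc_content_profile(seq, window=15, step=1):
    """Return the G+C fraction of each window, ordered by window start."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = seq.upper()
    profile = []
    # The last window starts at len(seq) - window; shorter inputs give [].
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        profile.append((chunk.count("G") + chunk.count("C")) / window)
    return profile


# Small window so the output is easy to follow by eye.
print(gc_content_profile("GGGGAAAA", window=4))
```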

 


*          Transcription Factor Binding Site

        The experimentally identified TF binding sites were obtained from TRANSFAC (Professional 8.1), which contains 5,711 transcription factors and 14,406 binding sites. In the system, 4,206 known binding sites are matched to the upstream regions of human, mouse, rat, chimpanzee, and dog genes. The program MATCH is used to match the consensus patterns of the known TRANSFAC binding sites to the input sequences. Two important MATCH parameters, the core score and the matrix score, represent the sequence matching scores of the core region and the whole region of a binding site, respectively. For high specificity in transcription factor binding site matching, the system sets the core score to 1.0 (perfect match) and the matrix score to 0.95. The known TF binding sites are used to scan both strands of the input sequence, and the positions of the homologs of each known site are then displayed in the graphical visualization.

 


*          TATA box, CCAAT box, and GC box

        Narang et al. used a computational method to reveal several important core and proximal promoter elements, such as the TATA box, CCAAT box, and GC box, along with their expected locations around the TSS. These oligonucleotides are types of transcription factor binding sites located near the transcription start site. As shown in Table S3, GPMiner uses the lists of TATA box, CCAAT box, and GC box oligonucleotides with their positional densities to help annotate the promoter region.

 

Table S3. The lists of TATA box, CCAAT box, and GC box with positional densities (Narang et al.).

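Consensus    Preferred Position         Corresponding oligonucleotides    Window Position    Probability

TATA box     -35 to -25                 TATAAA                            -40 to -20         0.564
TATA box     -35 to -25                 TATAAC                            -40 to -20         0.25
TATA box     -35 to -25                 TATAAG                            -40 to -20         0.473
TATA box     -35 to -25                 TATATA                            -40 to -20         0.365
TATA box     -35 to -25                 TAAAAG                            -40 to -20         0.364
TATA box     -35 to -25                 TAAAGG                            -40 to -20         0.299
TATA box     -35 to -25                 TAAATA                            -40 to -20         0.275
TATA box     -35 to -25                 TGTATA                            -40 to -20         0.307
TATA box     -35 to -25                 ATAAAA                            -40 to -20         0.299
TATA box     -35 to -25                 ATAAAG                            -40 to -20         0.348
TATA box     -35 to -25                 ATAAAT                            -40 to -20         0.285
TATA box     -35 to -25                 ATATAA                            -40 to -20         0.394
TATA box     -35 to -25                 CCTATA                            -40 to -20         0.437
TATA box     -35 to -25                 CTATAA                            -40 to -20         0.597
TATA box     -35 to -25                 CTATAT                            -40 to -20         0.413
TATA box     -35 to -25                 GCTATA                            -40 to -20         0.543
TATA box     -35 to -25                 GTATAA                            -40 to -20         0.568
TATA box     -35 to -25                 GTATAT                            -40 to -20         0.331

CCAAT box    -165 to -40 (-90 mean)     ACCAAT                            -140 to -80        0.259
CCAAT box    -165 to -40 (-90 mean)     CAATGG                            -140 to -80        0.201
CCAAT box    -165 to -40 (-90 mean)     CCAATC                            -140 to -80        0.201
CCAAT box    -165 to -40 (-90 mean)     CCAATG                            -140 to -80        0.279
CCAAT box    -165 to -40 (-90 mean)     GACCAA                            -140 to -80        0.209
CCAAT box    -165 to -40 (-90 mean)     GCCAAT                            -140 to -80        0.232

GC box       -164 to +1                 GGCGGG                            -140 to -80        0.203
GC box       -164 to +1                 GGGCGG                            -140 to -80        0.208
GC box       -164 to +1                 GGGGCG                            -140 to -80        0.218
GC box       -164 to +1                 CGGCGG                            -80 to -20         0.201
GC box       -164 to +1                 CGGGGC                            -80 to -20         0.256
GC box       -164 to +1                 GCGCCG                            -80 to -20         0.203
GC box       -164 to +1                 GCGGCG                            -80 to -20         0.201
GC box       -164 to +1                 GCGGGC                            -80 to -20         0.211
GC box       -164 to +1                 GCGGGG                            -80 to -20         0.253
GC box       -164 to +1                 GGCGGG                            -80 to -20         0.275
GC box       -164 to +1                 GGGGCG                            -80 to -20         0.266
GC box       -164 to +1                 CGGCGG                            -20 to +40         0.249
GC box       -164 to +1                 GCGGCG                            -20 to +40         0.251
GC box       -164 to +1                 GGCGGC                            -20 to +40         0.254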
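One way to use a table like Table S3 is to scan a promoter sequence for the listed hexamers and keep only the hits whose position falls inside the corresponding window. The sketch below is an illustration, not GPMiner's implementation. It assumes that the position of a hit is the position of the hexamer's first base, that the TSS is +1 with no position 0, and that only the given strand is scanned. The names `TABLE_S3_SUBSET`, `relative_position`, and `scan_core_elements` are hypothetical, and only a few rows of the table are included.

```python
# (element, oligonucleotide, window start, window end, probability)
# A small subset of Table S3, copied for illustration only.
TABLE_S3_SUBSET = [
    ("TATA box", "TATAAA", -40, -20, 0.564),
    ("TATA box", "CTATAA", -40, -20, 0.597),
    ("CCAAT box", "CCAATG", -140, -80, 0.279),
    ("GC box", "GGCGGG", -140, -80, 0.203),
    ("GC box", "GGCGGG", -80, -20, 0.275),
    ("GC box", "GGCGGC", -20, 40, 0.254),
]


def relative_position(index, tss_index):
    """Convert a 0-based index to a TSS-relative position (TSS = +1, no 0)."""
    offset = index - tss_index
    return offset + 1 if offset >= 0 else offset


def scan_core_elements(seq, tss_index, table=TABLE_S3_SUBSET):
    """Return (element, oligo, position, probability) for in-window hits."""
    seq = seq.upper()
    hits = []
    for element, oligo, lo, hi, prob in table:
        start = seq.find(oligo)
        while start != -1:
            pos = relative_position(start, tss_index)
            if lo <= pos <= hi:
                hits.append((element, oligo, pos, prob))
            start = seq.find(oligo, start + 1)  # also finds overlapping hits
    return sorted(hits, key=lambda hit: hit[2])


# "N" filler keeps the toy promoter free of accidental matches.
promoter = "N" * 70 + "TATAAA" + "N" * 54
print(scan_core_elements(promoter, tss_index=100))
```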

 

 

*          Over Represented (OR) Oligonucleotide

        The system applies a statistical method to discover statistically significant oligonucleotides in promoter regions, the so-called over-represented (OR) oligonucleotides, which are identified by comparing their frequencies of occurrence in the promoter regions with their background frequencies of occurrence throughout the whole genome. If $P_b(S)$ is the background occurrence probability of oligonucleotide $S$ in the whole genomic sequence, then $S$ would be expected to occur $\mu = T \times P_b(S)$ times in the promoter regions of genes, where $T$ represents the total number of possible matching positions of an oligonucleotide of length $w$ across both strands of the sequence set. Under the binomial distribution model, the standard deviation of the oligonucleotide occurrences is $\sigma = \sqrt{T \times P_b(S) \times (1 - P_b(S))}$. Let $n$ be the number of occurrences of $S$ in the promoter regions; the Z-score is given by $Z = (n - \mu) / \sigma$. The probability of observing at least $n$ successes, as given by Chebyshev's theorem, is less than or equal to $1/Z^2$. If $Z > 0$, then a lower p-value corresponds to a more over-represented oligonucleotide. If $Z < 0$, then a lower p-value corresponds to a more under-represented oligonucleotide. Based on statistical significance, oligonucleotides with a p-value (cumulative hypergeometric probability) < 0.01 are selected as the OR oligonucleotides.
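The Z-score calculation can be followed with a short numeric sketch. It assumes the usual binomial forms for the expected count (T × Pb(S)) and the standard deviation (the square root of T × Pb(S) × (1 − Pb(S))), and the usual Chebyshev bound 1/Z², capped at 1. The function name and the example numbers are hypothetical, and the final p-value step mentioned in the text is not reproduced here.

```python
import math


def or_statistics(n, T, p_background):
    """Return (expected count, standard deviation, Z-score, Chebyshev bound).

    n            -- observed occurrences in the promoter set
    T            -- number of possible matching positions (both strands)
    p_background -- background occurrence probability Pb(S)
    """
    if T <= 0 or not 0 < p_background < 1:
        raise ValueError("need T > 0 and 0 < p_background < 1")
    mu = T * p_background
    sigma = math.sqrt(T * p_background * (1 - p_background))
    z = (n - mu) / sigma
    # Chebyshev: P(|X - mu| >= |Z| * sigma) <= 1 / Z^2 (uninformative near 0).
    bound = min(1.0, 1.0 / z ** 2) if z != 0 else 1.0
    return mu, sigma, z, bound


mu, sigma, z, bound = or_statistics(n=150, T=10000, p_background=0.01)
print(f"expected={mu:.1f} sd={sigma:.2f} Z={z:.2f} bound={bound:.4f}")
```

A positive Z with a small bound points to over-representation; a negative Z with a small bound points to under-representation.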

 


*          Repeats

Repeats such as SINE, LINE, Alu, and L1 elements are extracted from the Ensembl database using the Ensembl core libraries. A previous study (Batzer et al.) found that repeats such as Alu and L1 elements can alter the distribution of methylation in the genome and possibly gene transcription. These repeats are displayed only for known gene promoter sequences. To find tandem repeats in the promoter region, the system integrates the program Tandem Repeats Finder. Parameters such as period size, copy number, consensus size, and score are set to the default values of Tandem Repeats Finder.

 


*          DNA Stability

        Aditi Kanhere et al. (13) devised a novel regulatory feature, DNA stability, for prokaryotic promoter prediction. DNA stability is a structural property of a DNA duplex fragment, calculated as a free energy based on the hydrogen bonding of A-T and C-G pairs. The standard free energy change ($\Delta G^{\circ}$) corresponding to the melting transition of an 'n' nucleotide (or 'n-1' dinucleotide) long DNA molecule, from double strand to single strand, is calculated as follows (7):

$$\Delta G^{\circ} = -\left(\Delta G^{\circ}_{ini} + \Delta G^{\circ}_{sym}\right) + \sum_{i=1}^{n-1} \Delta G^{\circ}_{i,i+1}$$

where $\Delta G^{\circ}_{ini}$ denotes the two types of initiation free energy, "initiation with terminal G·C" and "initiation with terminal A·T"; $\Delta G^{\circ}_{sym}$ is +0.43 kcal/mol and is applicable if the duplex is self-complementary; and $\Delta G^{\circ}_{i,i+1}$ represents the standard free energy change for the dinucleotide of type ij at step i. Table S1 lists the standard free energy changes for the ten Watson-Crick types ij.

 

Table S1. Watson-Crick type unified standard free energy change (23).

Dinucleotide type                 Standard free energy (kcal/mol)

AA/TT                             -1.00
AT/TA                             -0.88
TA/AT                             -0.58
CA/GT                             -1.45
GT/CA                             -1.44
CT/GA                             -1.28
GA/CT                             -1.30
CG/GC                             -2.17
GC/CG                             -2.24
GG/CC                             -1.84
Initiation with terminal G·C      0.98
Initiation with terminal A·T      1.03

 

 

In the present calculation, each promoter sequence is divided into overlapping windows of 15 bp (or 14 dinucleotide steps), and the free energy of each window is calculated as shown above. This study used the standard free energy change equation (given in the supplementary materials) to calculate the stability of the DNA duplex with a window size of 15 nt, sliding from -1000 to +201 relative to the TSS, in the experimentally determined human and mouse promoters from DBTSS. Figure S1 shows the distributions of the average free energy of DNA duplex formation and reveals a peak near the TSS, in the region between -10 and -30, which corresponds to the TATA box in eukaryotic promoter sequences. Aditi Kanhere et al. (13) demonstrated that the change in DNA stability appears to provide a much better clue than the usual sequence motifs.

Fig. S1. Distributions of the average free energy of DNA duplex formation in human and mouse promoters. The numbers of human and mouse promoter sequences are 10,607 and 10,480, respectively. To correct the promoter sequences, promoters for which more than 5 full-length cDNAs map to the indicated TSS are defined as the "filter" set. Notably, the distributions of the two sets, "all" and "filter", are almost equal.
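The windowed calculation can be sketched by summing the Table S1 dinucleotide values over the 14 steps of each 15-nt window. The sketch rests on three assumptions. First, an entry such as CA/GT is read as the 5'-CA-3' step on one strand, so its reverse complement (TG) receives the same value. Second, the initiation and self-complementarity terms are left out, so only the dinucleotide sum is computed. Third, each value is reported for the window's first base. The names are hypothetical, and bases other than A, C, G, and T raise a KeyError.

```python
# Table S1 values (kcal/mol), expanded so that each 5'->3' dinucleotide and
# its reverse complement share one entry (assumed reading of "XY/ZW").
NN_FREE_ENERGY = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def window_free_energy(window):
    """Sum the dinucleotide free energies over all steps in one window."""
    return sum(NN_FREE_ENERGY[window[i:i + 2]] for i in range(len(window) - 1))


def stability_profile(seq, window=15):
    """Return one free energy value per overlapping window, 1-nt step."""
    seq = seq.upper()
    return [
        round(window_free_energy(seq[i:i + window]), 2)
        for i in range(len(seq) - window + 1)
    ]


# An AT-rich stretch is less stable (less negative) than a GC-rich one.
print(stability_profile("TATAAATATATAAAT"))
print(stability_profile("GCGCGGCGCGCCGGC"))
```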

 

 

*          MicroRNA Target Site

Morris et al. found that small interfering RNA (siRNA) and microRNA (miRNA) silencing of gene transcription is associated with DNA methylation of the target sequence, and demonstrated that siRNA-directed transcriptional silencing is conserved in mammals, enabling the inhibition of mammalian gene function. The miRNA gene profiles are extracted from miRBase and used to detect miRNA target sites. The system integrates miRanda to detect microRNA target sites using two major parameters, the Minimum Free Energy (MFE) and the score. For high prediction accuracy, the MFE and score thresholds are set to -20 and 150, respectively.


 









Bid Lab, Institute of Bioinformatics, National Chiao Tung University, Taiwan.

Contact us: Tzong-Yi Lee, with questions or comments.