Related Works

 

Various promoter prediction methods have been developed for analyzing gene promoter regions (Table 1). CpGProD identifies CpG islands in mammalian promoter regions (1). DragonGSF predicts gene promoters from information on CpG islands, transcriptional start sites (TSSs), and signals downstream of the predicted TSSs (2). NNPP2.2 applies a time-delay neural network to promoter annotation in the Drosophila melanogaster genome (3). Eponine detects the transcriptional initiation site near the TATA box, together with the flanking regions of GC enrichment (4). For the identification of transcriptional start sites, McPromoter is a statistical method that identifies eukaryotic polymerase II TSSs in genomic DNA (5-7). FirstEF uses a set of discriminant functions that recognize both boundaries of the first exon (8). PromoSer computationally identifies transcription start sites by considering the alignments of numerous partial and full-length mRNA sequences to genomic DNA (9). PromH identifies promoters based on the conservation of regulatory features in pairs of human/mouse orthologous genes (10). Another regulatory feature of promoter regions, DNA stability, has been investigated for analyzing prokaryotic promoters (11). DNA stability is a structural property of a fragment of the DNA duplex; the minimum free energy of the duplex is calculated based on the hydrogen bonding of the A-T and C-G pairs. Kanhere and Bansal demonstrated that the DNA stability of promoter regions appears to provide a much better clue for determining the location of the transcriptional start site (11).
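To make the idea of a stability profile concrete, the following is a minimal sketch that scores each sliding window of a sequence by its mean number of Watson-Crick hydrogen bonds per base pair (2 for A-T, 3 for G-C). This is only a crude proxy chosen for illustration, not the free-energy calculation used in (11); the function name hbond_profile and the window size are assumptions.

```python
from typing import List

# Watson-Crick hydrogen bonds per base pair: A-T pairs form 2, G-C pairs form 3.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def hbond_profile(seq: str, window: int = 15) -> List[float]:
    """Mean hydrogen bonds per base pair for every sliding window of `seq`.

    Hypothetical helper: a simple stand-in for a duplex stability profile.
    Raises KeyError if the sequence contains symbols other than A, C, G, T.
    """
    seq = seq.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    scores = [H_BONDS[base] for base in seq]
    total = sum(scores[:window])
    profile = [total / window]
    # Slide the window one base at a time, updating the running sum.
    for i in range(window, len(seq)):
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile


# Toy sequence: GC-rich flanks around an AT-rich, TATA-like core.
profile = hbond_profile("GCGCGCTATAAAAGGCGC", window=6)
# 0-based start of the first window with the lowest score (least stable).
least_stable = min(range(len(profile)), key=profile.__getitem__)
```

Lower values mark AT-rich, less stable windows, which is the kind of signal a stability-based predictor would look for near a TSS.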

 

Table 1. Summary of previous promoter prediction programs. Sn., sensitivity; Sp., specificity; Acc., accuracy; a dash indicates that no value was reported.

| Tool | Method | Species | Features | Data source | Citation | Sn. | Sp. | False positive rate | Acc. |
|---|---|---|---|---|---|---|---|---|---|
| Eponine | Relevance Vector Machine (RVM) | Mammalian | TATA box in a G+C rich domain | EPD | Down and Hubbard 2002 | 53.5% | 73.5% | - | - |
| Promoter 2.0 | ANN | Vertebrate | Four TFBSs (TATA box, CCAAT box, GC box, Inr) | EPD | Ohler, Stemmer et al. 2000 | 68% | - | 8% | - |
| NNPP 2.2 | ANN | Drosophila | TATA box & Inr | EPD | Reese 2001 | 70% | - | 7.2% | - |
| CpGProD | Statistics based | Mammalian | CpG island | GenBank | Ponger and Mouchiroud 2002 | 56% | 39% | - | - |
| PromoterInspector | Statistics based | Vertebrate | IUPAC | EPD | Scherf, Klingenhoff et al. 2000 | 48% | 85% | - | - |
| Dragon PF | ANN | Human Chr. 22 | CpG island related | EPD | Knudsen 1999 | 60.17% | - | - | - |
| Dragon GSF | ANN | Human Chr. 4, 21, 22 | G+C rich & G+C poor | DBTSS | Bajic, Seah et al. 2002 | 65.10% | - | - | 77.80% |
| McPromoter | ANN, Interpolated Markov Model | Human Chr. 22 | Statistical properties of promoters versus nonpromoters | EPD | Ohler et al. 2002 | 52.1% | 40.3% | - | - |
| First Exon Finder | Quadratic discriminant analysis | Human Chr. 21, 22 | CpG island related | NCBI | Davuluri, Grosse et al. 2001 | 79.3% | 53.5% | - | - |
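The performance columns of Table 1 are not defined in the text. As a hedged illustration, the sketch below computes them from confusion-matrix counts under the standard definitions. Note that promoter and gene prediction papers sometimes report "specificity" as TP/(TP+FP), i.e. the positive predictive value, so both variants are returned. The function name and the counts in the usage line are hypothetical.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def prediction_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: true promoters predicted as promoters
    fp: non-promoters predicted as promoters
    tn: non-promoters predicted as non-promoters
    fn: true promoters that were missed
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),          # Sn. = TP / (TP + FN)
        "specificity": _ratio(tn, tn + fp),          # Sp. = TN / (TN + FP)
        "ppv": _ratio(tp, tp + fp),                  # alternative "Sp." in some papers
        "false_positive_rate": _ratio(fp, fp + tn),  # FPR = FP / (FP + TN)
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only; these are not taken from any of the cited tools.
metrics = prediction_metrics(tp=80, fp=20, tn=180, fn=20)
```

Because the tools in Table 1 were evaluated on different data sets and possibly under different definitions, their values are not directly comparable.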

 

Materials

 

DBTSS is a database of transcriptional start sites, established by gathering promoter regions that were experimentally identified with the oligo-capping method (12). The Eukaryotic Promoter Database (EPD) is an annotated, non-redundant collection of eukaryotic Pol II promoters for which the transcription start site has been determined experimentally (13). By default, the region from 2000 bp upstream to 200 bp downstream of the TSS (+1) is defined as the promoter region and extracted for sequence homology search. The experimentally identified human and mouse promoters collected from DBTSS (Table 2) were mapped to Ensembl genomic positions, and the flanking sequences from -3000 bp to +3000 bp around the mapped TSSs were extracted. The homologous promoter sequences between human and mouse were then aligned with CLUSTALW (14). Homologous promoter sequences with a sequence identity greater than 80% were extracted and defined as the training sequences. The training sequences were classified into two subgroups according to the presence or absence of CpG islands, as determined by CpGProD (1). The statistics of the classified training set are given in Table 3.
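The window extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: coordinates are taken to be 1-based and inclusive, the TSS is position +1 and counts as the first downstream base (there is no position 0), and the chromosome is held as an in-memory string. The names promoter_window and extract_promoter are hypothetical.

```python
from typing import Tuple

# Translation table for complementing a DNA strand (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def promoter_window(tss: int, strand: str, upstream: int = 2000,
                    downstream: int = 200) -> Tuple[int, int]:
    """Return 1-based inclusive genomic (start, end) of the promoter window.

    Assumption: the TSS itself is +1 and is included in the downstream count.
    """
    if strand == "+":
        return tss - upstream, tss + downstream - 1
    if strand == "-":
        # On the minus strand, upstream lies at higher genomic coordinates.
        return tss - downstream + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def extract_promoter(chrom_seq: str, tss: int, strand: str,
                     upstream: int = 2000, downstream: int = 200) -> str:
    """Cut the window out of `chrom_seq`, returned in 5'->3' promoter orientation."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    # Clip to the chromosome boundaries.
    start, end = max(start, 1), min(end, len(chrom_seq))
    sub = chrom_seq[start - 1:end]
    if strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]  # reverse complement
    return sub


# Default promoter region (-2000 .. +200) and the wider training window.
default_plus = promoter_window(5000, "+")
training_plus = promoter_window(5000, "+", upstream=3000, downstream=3000)
```

With these conventions the default region is 2200 bp long and the -3000 to +3000 training window is 6000 bp long; a different convention for the +1 position would shift the boundaries by one base.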

 

Table 2. Number of experimentally identified TSSs collected from DBTSS and EPD, by species.

| Database | Human | Mouse | Rat |
|---|---|---|---|
| DBTSS | 30,964 | 19,924 | N/A |
| EPD | 1,871 | 196 | 119 |

 

Table 3. Statistics of the training set (-3000 to +3000 bp around the TSS) collected from DBTSS. Homologous promoter sequences (sequence identity > 80%) between human and mouse were identified with CLUSTALW (14).

| Species | All | With CpG | Without CpG |
|---|---|---|---|
| Human | 6,464 | 4,898 | 1,566 |
| Mouse | 8,885 | 6,723 | 2,162 |
| Homology between human and mouse | 6,452 | 4,898 | 1,554 |
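The 80% identity threshold can be illustrated with a short sketch that scores an already aligned human/mouse pair. The identity definition here is an assumption: identical columns divided by the columns in which neither sequence has a gap. CLUSTALW or the original analysis may use a different denominator. The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two aligned sequences of equal length.

    Assumed definition: matching columns / columns without a gap ('-')
    in either sequence. Returns 0.0 if no column can be compared.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_training_pair(aligned_human: str, aligned_mouse: str,
                     threshold: float = 80.0) -> bool:
    """Keep a pair only if its identity is strictly greater than the threshold."""
    return percent_identity(aligned_human, aligned_mouse) > threshold


# Toy aligned pair: nine of ten ungapped columns match.
keep = is_training_pair("ACGTACGTAC-G", "ACGTACGTATTG")
```

The comparison is strict, matching the "> 80%" wording of the caption, so a pair at exactly 80% identity would be excluded.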

 

 

References

1. Ponger, L. and Mouchiroud, D. (2002) CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics, 18, 631-633.

2. Bajic, V.B. and Seah, S.H. (2003) Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Res, 13, 1923-1929.

3. Reese, M.G. (2001) Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput Chem, 26, 51-56.

4. Down, T.A. and Hubbard, T.J. (2002) Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res, 12, 458-461.

5. Ohler, U. (2000) Promoter prediction on a genomic scale--the Adh experience. Genome Res, 10, 539-542.

6. Ohler, U., Liao, G.C., Niemann, H. and Rubin, G.M. (2002) Computational analysis of core promoters in the Drosophila genome. Genome Biol, 3, RESEARCH0087.

7. Ohler, U., Harbeck, S., Niemann, H., Noth, E. and Reese, M.G. (1999) Interpolated Markov chains for eukaryotic promoter recognition. Bioinformatics, 15, 362-369.

8. Davuluri, R.V., Grosse, I. and Zhang, M.Q. (2001) Computational identification of promoters and first exons in the human genome. Nat Genet, 29, 412-417.

9. Halees, A.S., Leyfer, D. and Weng, Z. (2003) PromoSer: A large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res, 31, 3554-3559.

10. Solovyev, V.V. and Shahmuradov, I.A. (2003) PromH: Promoters identification using orthologous genomic sequences. Nucleic Acids Res, 31, 3540-3545.

11. Kanhere, A. and Bansal, M. (2005) A novel method for prokaryotic promoter prediction based on DNA stability. BMC Bioinformatics, 6, 1.

12. Yamashita, R., Suzuki, Y., Wakaguri, H., Tsuritani, K., Nakai, K. and Sugano, S. (2006) DBTSS: DataBase of Human Transcription Start Sites, progress report 2006. Nucleic Acids Res, 34, D86-89.

13. Zampieron, A., Elseviers, M., De Vos, J.Y., Favaretto, A., Geatti, S. and Harrington, M. (2005) The European practice database (EPD): results of the study in the North-East of Italy. Edtna Erca J, 31, 49-54.

14. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22, 4673-4680.