This work develops an integrated system, GPMiner, which combines a support vector machine (SVM) with nucleotide composition, over-represented hexamer nucleotides and DNA stability to identify mammalian proximal promoters. It also mines regulatory elements, including transcriptional start sites, transcription factor binding sites, CpG islands, tandem repeats, TATA box, CCAAT box, GC box, statistically over-represented sequence patterns, GC content (GC%) and DNA stability. Evaluated with the benchmark “Cross”, the predictive sensitivity and specificity of GPMiner are about 80%. Additionally, GPMiner allows users to input a group of genes in order to mine the co-occurrence of transcription factor binding sites, with statistical measures, in the promoter sequences. All the mined promoter regions and regulatory features in the user input sequence are visualized graphically to facilitate gene transcription analysis.
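As a rough illustration of the sequence-derived features named above (nucleotide composition, hexamer counts and GC%), the following minimal Python sketch computes them for a single DNA string. The function names `nucleotide_composition`, `kmer_counts` and `gc_content` are hypothetical and are not taken from GPMiner; how GPMiner actually encodes these features for the SVM, and how it scores over-representation or DNA stability, is not described here and is not reproduced.

```python
from collections import Counter


def nucleotide_composition(seq):
    """Return the fraction of A, C, G and T in seq (illustrative helper)."""
    seq = seq.upper()
    n = len(seq)
    return {base: (seq.count(base) / n if n else 0.0) for base in "ACGT"}


def gc_content(seq):
    """Return GC content as a percentage of the sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(seq, k=6):
    """Count overlapping k-mers (hexamers by default), skipping any
    window that contains a character other than A, C, G or T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    example = "ACGTACGTAC"  # toy sequence, not real promoter data
    print(nucleotide_composition(example))
    print(gc_content(example))
    print(kmer_counts(example).most_common(3))
```

Counts like these would typically be compared against a background set before being called over-represented; that comparison is left out because the statistical measure used is not stated in this text.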

Additionally, 1,871 human promoter sequences (-3000 to +3000) from EPD, which comprise the independent test set, were used to evaluate the predictive performance of GPMiner, NNPP2.2 (1), Eponine (2) and McPromoter (3), based on the same evaluation benchmark “Cross”. Regions of the test sequences lying within -200 to +100 relative to the TSS (+1) are defined as the positive set; the negative set is randomly extracted from regions outside the positive set. The predictive performance of GPMiner compared with NNPP2.2, Eponine and McPromoter is given in Table S7 (see Supplementary Materials). Furthermore, the distribution of promoter predictions of GPMiner compared with those of NNPP2.2, Eponine and McPromoter is shown in Fig. 1. The sensitivity of GPMiner appears better than that of the other methods; however, the predictive specificities of McPromoter and Eponine are better than that of GPMiner. Table 1 gives a functional comparison with several representative promoter annotation programs.
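The evaluation scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the benchmark code: the helper names are hypothetical, positions are passed as offsets relative to the TSS, and sensitivity and specificity use the textbook definitions TP/(TP+FN) and TN/(TN+FP), which this text does not spell out. Some gene-prediction papers report TP/(TP+FP) under the name "specificity", so the exact formula behind the reported figures should be checked against the benchmark definition.

```python
def in_positive_window(rel_pos, upstream=-200, downstream=100):
    """True if an offset relative to the TSS falls inside the
    -200..+100 window that defines the positive set."""
    return upstream <= rel_pos <= downstream


def sensitivity_specificity(labels, predictions):
    """Compute (sensitivity, specificity) from parallel boolean lists.

    Assumed definitions: sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP).
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Toy offsets and predictions, invented for illustration only.
    offsets = [-150, -50, 10, 90, -2500, -900, 400, 1500, 2800]
    labels = [in_positive_window(o) for o in offsets]
    predictions = [True, True, True, False, False, False, False, False, True]
    print(sensitivity_specificity(labels, predictions))
```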

 

Figure 1. Distribution of promoter predictions of GPMiner compared with NNPP2.2, Eponine and McPromoter.

 

 

Table 1. Comparison of GPMiner with several representative gene promoter annotation programs.

| Transcriptional regulatory features | PromoSer (4) | PromH (5) | DragonGSF (6) | McPromoter (3) | GPMiner |
|---|---|---|---|---|---|
| Species supported | Human, mouse and rat | Human and mouse | Mammalian | Eukaryote | Human, mouse, rat, chimp and dog |
| Promoter identification | Yes | Yes | Yes | Yes | Yes |
| Map to known gene promoters | Yes | - | - | - | DBTSS, EPD and Ensembl |
| Transcription factor binding site | - | Yes | Yes | - | TRANSFAC MATCH |
| TATA-box | - | Yes | - | Yes | Yes |
| Tandem repeat | Yes | - | - | Yes | Tandem Repeat Finder |
| CpG island | - | - | Yes | - | CpGProD |
| Over-represented pattern | - | - | - | - | Yes |
| DNA stability | - | - | - | - | Yes |
| GC content | - | - | Yes | - | Yes |
| Co-occurrence of TFBSs | - | Yes | - | - | Yes |
| Graphical view | Yes | - | - | Yes | Yes |

 

 

References:

1. Reese, M.G. (2001) Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput Chem, 26, 51-56.

2. Down, T.A. and Hubbard, T.J. (2002) Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res, 12, 458-461.

3. Ohler, U. (2000) Promoter prediction on a genomic scale--the Adh experience. Genome Res, 10, 539-542.

4. Halees, A.S., Leyfer, D. and Weng, Z. (2003) PromoSer: A large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res, 31, 3554-3559.

5. Solovyev, V.V. and Shahmuradov, I.A. (2003) PromH: Promoters identification using orthologous genomic sequences. Nucleic Acids Res, 31, 3540-3545.

6. Bajic, V.B. and Seah, S.H. (2003) Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Res, 13, 1923-1929.