GPMiner: a web server for mining transcriptional regulatory elements in mammalian gene promoter regions

 

Abstract

The sequence features located in promoter regions are involved in regulating the initiation of gene transcription. Although numerous computational methods have been proposed for predicting promoter regions and transcriptional start sites, they do not annotate regulatory features such as transcription factor binding sites, CpG islands, tandem repeats, the TATA box, the CCAAT box, the GC box, over-represented oligonucleotides, DNA stability and GC content. To facilitate the investigation of gene promoters, this work presents an integrated system for identifying promoter regions and annotating regulatory features in user-input sequences. The proposed promoter identification method, whose predictive sensitivity and specificity are both ~80%, combines a support vector machine (SVM) with nucleotide composition, over-represented hexamer nucleotides and DNA stability. Additionally, the input sequence can be analyzed for homogeneity with experimental mammalian promoter sequences. After the promoter regions are identified, the regulatory features are visualized graphically to facilitate the observation of gene promoters.

 

 

System Flow

Figure 1. The system flow of GPMiner.

 

Figure 1 illustrates the system flow of GPMiner, which identifies promoter regions and annotates transcriptional regulatory features in a user-input genomic sequence. The computational models for promoter identification were constructed by combining a support vector machine (SVM) with three kinds of features: nucleotide composition, over-represented (OR) hexamer nucleotides and DNA stability. Additionally, GPMiner allows users to input a group of genes to discover co-occurring transcription factor binding sites in their promoter sequences. All mined promoter regions and regulatory features in the input sequence are visualized graphically to facilitate the analysis of gene transcriptional regulation. The details of the method are described as follows.
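As a rough illustration of how a sequence window might be turned into numeric features for an SVM, the following minimal Python sketch computes nucleotide composition and counts of a chosen set of hexamers. It is not GPMiner's implementation: the function names, the use of raw counts, and the example hexamer list are assumptions made for illustration, and the DNA stability feature is omitted because its calculation is not described here.

```python
NUCLEOTIDES = "ACGT"

def nucleotide_composition(seq):
    """Return the fractions of A, C, G and T in seq (other symbols are ignored)."""
    seq = seq.upper()
    counts = [seq.count(n) for n in NUCLEOTIDES]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def hexamer_counts(seq, hexamers):
    """Count overlapping occurrences of each hexamer listed in `hexamers`."""
    seq = seq.upper()
    found = {h: 0 for h in hexamers}
    for i in range(len(seq) - 5):
        word = seq[i:i + 6]
        if word in found:
            found[word] += 1
    return [found[h] for h in hexamers]

def feature_vector(seq, hexamers):
    """Concatenate both feature groups into one list (hypothetical layout)."""
    return nucleotide_composition(seq) + hexamer_counts(seq, hexamers)

# Hypothetical usage: one short window and one hexamer of interest.
vector = feature_vector("ACGTACGTAC", ["ACGTAC"])
```

In practice the hexamer list would contain the oligonucleotides found to be over-represented in known promoters, and each window (for example -200 to +100 around a candidate site) would yield one such vector.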

 

 

Results

 

Performance of Promoter Identification

A benchmark named “Cross” is used to evaluate the prediction performance of GPMiner, which combines a support vector machine (SVM) with nucleotide composition, over-represented hexamer nucleotides and DNA stability for mammalian proximal promoter identification. The benchmark extracts positive and negative sets of equal size, constructs the SVM model, and evaluates the model with k-fold cross-validation (k=5). Table 1 shows the prediction performance of the SVM models trained with each of the three kinds of regulatory features at five window sizes. The training sequences are divided into two subgroups according to CpG islands, “with CpG” and “without CpG”; the prediction performance for “with CpG” is markedly higher than for “without CpG”. Furthermore, a larger window size generally gives higher prediction performance. However, considering both prediction performance and window size, the window from -200 to +100 is selected for identifying proximal promoter regions. Vertebrate gene expression has been reported to be often regulated by the proximal promoter, which is traditionally defined as the region between -200 bp and the TSS.
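The evaluation procedure can be sketched in a few lines of Python. This is an illustrative outline only, under two assumptions: that the abbreviations Pr., Sn., Sp. and Ac. in Table 1 stand for precision, sensitivity, specificity and accuracy with their usual confusion-matrix definitions, and that balancing is done by random sampling. The helper names `balanced_folds` and `evaluate` are hypothetical, and the SVM training step itself is left out.

```python
import random

def balanced_folds(positives, negatives, k=5, seed=0):
    """Sample equal numbers of positives and negatives, then split into k folds."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    data = [(x, 1) for x in rng.sample(positives, n)]
    data += [(x, 0) for x in rng.sample(negatives, n)]
    rng.shuffle(data)
    # Fold i takes every k-th example starting at index i.
    return [data[i::k] for i in range(k)]

def evaluate(true_labels, predicted_labels):
    """Compute Pr, Sn, Sp and Ac from binary labels (1 = promoter, 0 = non-promoter)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1

    def ratio(a, b):
        return a / b if b else 0.0

    return {
        "Pr": ratio(tp, tp + fp),              # precision
        "Sn": ratio(tp, tp + fn),              # sensitivity
        "Sp": ratio(tn, tn + fp),              # specificity
        "Ac": ratio(tp + tn, tp + fp + tn + fn),  # accuracy
    }
```

In a full cross-validation run, each fold would serve once as the test set while the model is trained on the other four, and the four measures would be averaged over the five runs.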

 

Table 1. The prediction performance of the constructed SVM models with three kinds of regulatory features based on five specified window sizes.

Nucleotide composition

Training set    Window size     Pr.    Sn.    Sp.    Ac.
all             -60 ~ +20       69%    69%    69%    69%
all             -100 ~ +50      70%    67%    71%    70%
all             -200 ~ +100     72%    69%    74%    71%
all             -300 ~ +150     74%    70%    75%    72%
all             -400 ~ +200     74%    72%    75%    74%
with CpG        -60 ~ +20       70%    74%    68%    71%
with CpG        -100 ~ +50      71%    75%    69%    72%
with CpG        -200 ~ +100     73%    76%    71%    74%
with CpG        -300 ~ +150     74%    77%    74%    75%
with CpG        -400 ~ +200     76%    80%    75%    77%
without CpG     -60 ~ +20       68%    61%    71%    66%
without CpG     -100 ~ +50      66%    62%    68%    65%
without CpG     -200 ~ +100     68%    65%    69%    67%
without CpG     -300 ~ +150     67%    63%    68%    66%
without CpG     -400 ~ +200     66%    63%    68%    66%

Over-represented hexamer oligonucleotides

Training set    Window size     Pr.    Sn.    Sp.    Ac.
all             -60 ~ +20       69%    51%    77%    64%
all             -100 ~ +50      71%    58%    76%    67%
all             -200 ~ +100     73%    62%    77%    70%
all             -300 ~ +150     76%    64%    80%    72%
all             -400 ~ +200     79%    65%    83%    74%
with CpG        -60 ~ +20       70%    56%    75%    66%
with CpG        -100 ~ +50      72%    67%    74%    71%
with CpG        -200 ~ +100     76%    75%    76%    76%
with CpG        -300 ~ +150     78%    77%    78%    78%
with CpG        -400 ~ +200     81%    79%    82%    80%
without CpG     -60 ~ +20       59%    22%    85%    53%
without CpG     -100 ~ +50      65%    28%    85%    56%
without CpG     -200 ~ +100     66%    32%    83%    58%
without CpG     -300 ~ +150     66%    32%    83%    58%
without CpG     -400 ~ +200     66%    35%    82%    58%

DNA stability

Training set    Window size     Pr.    Sn.    Sp.    Ac.
all             -60 ~ +20       69%    67%    69%    68%
all             -100 ~ +50      70%    68%    71%    69%
all             -200 ~ +100     73%    70%    74%    71%
all             -300 ~ +150     72%    70%    72%    71%
all             -400 ~ +200     73%    71%    74%    72%
with CpG        -60 ~ +20       71%    74%    70%    71%
with CpG        -100 ~ +50      72%    75%    71%    73%
with CpG        -200 ~ +100     73%    76%    73%    75%
with CpG        -300 ~ +150     75%    79%    73%    76%
with CpG        -400 ~ +200     75%    81%    74%    77%
without CpG     -60 ~ +20       66%    62%    68%    65%
without CpG     -100 ~ +50      66%    62%    69%    65%
without CpG     -200 ~ +100     67%    64%    68%    66%
without CpG     -300 ~ +150     67%    65%    69%    66%
without CpG     -400 ~ +200     68%    66%    70%    67%

 

 

Web Interface

Examples of the GPMiner web interfaces are shown in Fig. 2. In the submission interface, users first choose one of five mammals (human, mouse, rat, chimpanzee or dog) and input a genomic sequence or chromosomal location for identifying proximal promoter regions and mining regulatory features. Eight types of regulatory features are currently provided in GPMiner; by default, all of them are selected for annotation in the input sequence. In particular, users can input a chromosomal location to specify a region of interest and retrieve the genes located in it. During the mining process, the system runs each integrated tool individually to annotate the regulatory features in the input sequence. Each annotation tool has search parameters, such as the score threshold for NNPP2.2, Eponine and McPromoter, the core score and matrix score for TRANSFAC MATCH, and the Z-score for over-represented oligonucleotides. Default parameters are set, and the related documentation is provided on the help webpage. After the regulatory features are mined, a graphical visualization of the identified promoter regions and mined regulatory features is presented to the user. Furthermore, cross-species analysis of homologous gene promoters shows the conserved regions in the promoters, so conserved regulatory features in the promoter regions can also be observed.

 

Figure 2. The submission and result interface of GPMiner.
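To give a sense of what a Z-score threshold for over-represented oligonucleotides controls, the sketch below scores one oligonucleotide against a background frequency. The exact statistic used by GPMiner is not specified here, so this assumes a simple binomial background model; the function name and its parameters are hypothetical.

```python
import math

def oligo_z_score(observed, n_positions, background_prob):
    """Z-score of an oligonucleotide count under an assumed binomial background.

    observed        -- occurrences counted in the sequences of interest
    n_positions     -- number of positions at which the oligonucleotide could start
    background_prob -- probability of the oligonucleotide at any one position
    """
    expected = n_positions * background_prob
    variance = n_positions * background_prob * (1 - background_prob)
    if variance == 0:
        return 0.0
    return (observed - expected) / math.sqrt(variance)

# Hypothetical usage: 30 occurrences where 10 are expected by chance.
z = oligo_z_score(observed=30, n_positions=100, background_prob=0.1)
```

Under this model, raising the Z-score threshold keeps only oligonucleotides whose counts exceed the background expectation by a larger margin, so fewer but more strongly over-represented motifs are reported.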