DPRP: A database of phenotype-specific
regulatory programs derived from
transcription factor binding data
Tutorial and Help


About DPRP

Research and development laboratories:
If you use DPRP in your projects, please cite the following reference: 
David T.W. Tzeng†, Yu-Ting Tseng†, Matthew Ung, I-En Liao, Chun-Chi Liu*, Chao Cheng*. (2014) DPRP: A database of phenotype-specific regulatory programs derived from transcription factor binding data. Nucleic Acids Research 42: D178-183 (†co-first authors, *corresponding authors) [PubMed]

We developed an interactive network web interface to visualize the TF regulatory programs using CytoscapeWeb (http://cytoscapeweb.cytoscape.org/) (PMID: 20656902).

Background

The DPRP database provides three major functions:

  1. Gene expression database: the database contains 984 gene expression datasets comprising a total of 29,744 arrays. These datasets were originally generated to explore differential gene expression under different conditions or treatments, e.g. gene expression changes during development, or differential gene expression between subtypes of breast cancer. Thus, each dataset has several subsets, and each subset has a number of samples. To identify differentially expressed genes (DEGs) for each dataset, we selected the subsets with at least 3 samples and then performed a t-test for every pair of subsets with no overlapping samples. This resulted in 3,777 subset pairs.
  2. Phenotype annotation: to systematically annotate gene expression data and resolve synonymous terms, we used the Unified Medical Language System (UMLS), which provides a comprehensive catalog of medical concepts. To concentrate on human disease studies, we limited the UMLS concepts to three semantic types: "Pathologic Function", "Injury or Poisoning" and "Anatomical Abnormality". UMLS also provides the language processing tool MetaMap, which automatically maps text onto UMLS concepts. Given a GEO dataset, we determined its phenotypic context from the Medical Subject Headings of its corresponding PubMed record and from its dataset summary in GEO, and then parsed these texts with MetaMap to identify relevant UMLS concepts. These UMLS concepts provide the annotation at the GEO dataset level. The datasets are organized in the database in a searchable manner and provide a useful resource for specific biological or clinical research. For example, a user can type in “Breast Carcinoma” as a keyword to obtain a list of datasets related to breast cancer. To make text search user-friendly, we adopted the jQuery AutoComplete technique to guide the user in keyword selection. When a specific dataset is selected, the database lists a number of phenotype pairs (e.g. breast cancer subclasses) for comparing the regulatory activity of TFs.
  3. TF regulatory program: the database identifies the regulatory programs underlying each phenotype associated with a dataset. When two different phenotypes of a given dataset are specified (e.g. estrogen-treated versus untreated MCF7 cell lines), we infer the regulatory programs responsible for the differential gene expression between them. The database provides a list of TFs that show significant differential activity, together with the regulatory network consisting of these significant TFs, based on an integrative framework that combines expression data with ChIP-seq data.
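The subset-pair selection described in item 1 can be sketched as follows. This is a minimal illustration and not the DPRP implementation: the subset names, sample IDs and expression values are hypothetical, and the Welch form of the t statistic is an assumption, since the text does not say which t-test variant was used.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def eligible_pairs(subsets, min_samples=3):
    """Return pairs of subset names that each have >= min_samples samples
    and that share no sample with each other."""
    kept = {name: set(ids) for name, ids in subsets.items()
            if len(ids) >= min_samples}
    return [(a, b) for a, b in combinations(sorted(kept), 2)
            if kept[a].isdisjoint(kept[b])]


def t_score(x, y):
    """Welch t statistic for two lists of expression values (assumed variant)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical dataset: subset name -> sample IDs
subsets = {
    "ER_positive": ["s1", "s2", "s3"],
    "ER_negative": ["s4", "s5", "s6"],
    "grade_3": ["s3", "s5", "s7"],  # overlaps both ER subsets
    "normal": ["s8", "s9"],         # fewer than 3 samples
}
pairs = eligible_pairs(subsets)

# Hypothetical expression values of one gene, keyed by sample ID
expr = {"s1": 8.1, "s2": 7.9, "s3": 8.4, "s4": 6.0, "s5": 6.3, "s6": 5.8}
t = t_score([expr[s] for s in subsets["ER_positive"]],
            [expr[s] for s in subsets["ER_negative"]])
```

Only the two ER subsets form an eligible pair here; the other two are dropped because of shared samples or too few samples. A P value would additionally require the t distribution, which is left out of this sketch.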

An overview of the DPRP web interface. (A) Users can perform a query as follows: (i) Enter a disease name in the auto-complete keyword field, which provides a list of partially matching UMLS concepts for selection. Alternatively, enter a dataset ID in the keyword field to select a specific dataset. (ii) After a UMLS concept is selected, the datasets associated with it are shown in a dataset list, from which the user can select the dataset of interest. (iii) For the selected dataset, its subset pairs are displayed in a subset list, and the user can select a subset pair to search for TF regulatory programs. DPRP provides three different methods to rank the candidate TFs, and users can choose which one to use. In addition, users can upload their own gene expression data as a gene list with t-values from a t-test or log ratios between two subsets. (B) The database integrates gene expression data and ChIP-seq TF binding data to identify the regulatory programs underlying a selected phenotype pair. (C) The output webpages: DPRP generates a list of TFs and ranks them by their P values or Q values. In the TF table view, users can export the table of candidate TFs as a text file. Based on the ranked TF list, DPRP generates a regulatory network consisting of all significant TFs, which users can export as a PNG, SVG or XML file.


The main interface of the database


  • The user types “Breast Carcinoma” as the keyword in the “Keyword” text box. As a result, a number of breast cancer related datasets are listed in the “Dataset ID” box for the user to choose from. After the user specifies the dataset, a number of phenotype pairs are listed in the “Subset Pair” box. The user can then select the subset pair of interest and specify a cutoff for the P value.

A

Users can enter a disease name or dataset ID into the auto-complete field, which lets them quickly find and select a matching value. Entering more characters narrows the list to better matches. Because the suggestions are built from the disease names and dataset IDs in the database, if the auto-complete field shows no suggestion for the text entered, the search will return no results. Thus, the auto-complete function not only helps users search efficiently but also acts as a quick filter.

B

After the user enters a keyword, this scroll list shows all the datasets that have BASE regulatory activity scores in our database.

C

After the user chooses a dataset ID, this scroll list shows all the subset pairs associated with the selected dataset.

D

DPRP provides three methods, and users can choose which one to use as the basis for ranking.

E

Users can upload their own gene expression data as a gene list with t-values from a t-test or log ratios between two subsets.


Upload your gene expression data

1. Upload data file

2. Processing data file

3. Output result corresponding to user's gene expression data


Output of the database

Significant TF list

DPRP generates a list of TFs and ranks them by their P values or Q values. In the TF table view, users can export the table of candidate TFs as a text file. The columns of the table are:

RAS: Regulatory activity score

P (Up): P value of Fisher's exact test for up-regulated genes

P (Down): P value of Fisher's exact test for down-regulated genes

D value: the maximum difference between the two cumulative distributions in the KS test

TF regulatory network

Based on the ranked TF list, DPRP generates a regulatory network consisting of all significant TFs. Users can export the TF network as a PNG, SVG or XML file.


Discussion of the three algorithms

Fisher’s exact test: Given a subset pair, we select the up-regulated and the down-regulated DEGs with P < 0.01. If fewer than 500 DEGs have P < 0.01, we instead select the top 500 most significant genes, so that enough genes are included for stable results in the subsequent statistical analyses. To estimate the significance of differential TF activity, we perform Fisher’s exact tests on the overlap between the up- or down-regulated gene set and the TF target genes. This method requires two cut-off values: one to define the up- and down-regulated genes, and the other to define the TF target genes. A more detailed description of applying Fisher’s exact test to TF activity inference can be found in previous studies (PMID:16082366, PMID:20215436).
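As a rough sketch of this step, the code below selects genes by the P < 0.01 rule with the top-500 fallback and computes a one-sided Fisher’s exact P value for the overlap with a TF target set. The function names and the one-sided (enrichment) form are assumptions made for illustration; the text does not state which alternative hypothesis DPRP uses.

```python
from math import comb


def select_genes(p_values, cutoff=0.01, min_genes=500):
    """Genes with P < cutoff, or the min_genes most significant ones
    when fewer than min_genes pass the cutoff."""
    ranked = sorted(p_values, key=p_values.get)
    passing = [g for g in ranked if p_values[g] < cutoff]
    return passing if len(passing) >= min_genes else ranked[:min_genes]


def fisher_enrichment_p(n_overlap, n_deg, n_targets, n_genes):
    """One-sided Fisher's exact P value: the probability of observing at
    least n_overlap TF targets among n_deg genes drawn from n_genes."""
    total = comb(n_genes, n_deg)
    tail = sum(
        comb(n_targets, k) * comb(n_genes - n_targets, n_deg - k)
        for k in range(n_overlap, min(n_deg, n_targets) + 1)
    )
    return tail / total


# Hypothetical counts: 500 up-regulated genes, 500 TF targets,
# 40 genes in common, 20,000 genes on the array.
p_up = fisher_enrichment_p(40, 500, 500, 20000)
```

With these hypothetical counts the expected overlap by chance is 12.5 genes, so an overlap of 40 yields a very small P value. Running the same function on the down-regulated gene set would give the corresponding P (Down).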

Kolmogorov-Smirnov test: Given a subset pair, we calculated the t-scores for all genes by comparing their expression between the two subsets. For each TF, we performed a Kolmogorov-Smirnov test (KS test) to compare the distributions of the t-scores of target genes and non-target genes. To define the target gene set of a TF, we set the cutoff value at P < 0.01. If fewer than 500 target genes have P < 0.01, we select the top 500 most significant target genes for the regulatory program analysis. For each TF, the KS test yields a P value, indicating the significance of its activity change, and a D value, indicating the direction of its activity change. A positive D value indicates that the target genes of a TF have significantly higher expression levels than non-target genes, and a negative D value indicates the reverse. A similar KS test based method was proposed by Tsai et al. to identify cell cycle related TFs in yeast (PMID:16157877).
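The signed D value can be illustrated with a small two-sample KS sketch. It assumes that D is the largest-magnitude difference between the empirical distribution of non-target t-scores and that of target t-scores, so that D is positive when targets tend to have higher t-scores, and it uses the standard asymptotic approximation for the P value. The function name and example values are hypothetical.

```python
from bisect import bisect_right
from math import exp, sqrt


def signed_ks(target_t, other_t):
    """Two-sample KS test on t-scores; returns (signed D, asymptotic P)."""
    tgt, oth = sorted(target_t), sorted(other_t)
    n, m = len(tgt), len(oth)
    d = 0.0
    for x in tgt + oth:
        # ECDF of non-targets minus ECDF of targets, evaluated at x
        diff = bisect_right(oth, x) / m - bisect_right(tgt, x) / n
        if abs(diff) > abs(d):
            d = diff
    en = n * m / (n + m)  # effective sample size
    lam = (sqrt(en) + 0.12 + 0.11 / sqrt(en)) * abs(d)
    if lam < 0.3:
        # The series below converges slowly here; P is essentially 1
        return d, 1.0
    p = 2 * sum((-1) ** (j - 1) * exp(-2 * j * j * lam * lam)
                for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical t-scores: targets are shifted towards higher values
targets = [2.0, 2.5, 3.0, 3.5]
non_targets = [-1.0, -0.5, 0.0, 0.5, 1.0]
d_value, p_value = signed_ks(targets, non_targets)
```

Swapping the two arguments flips the sign of D while leaving the P value unchanged, which matches the interpretation of D as a direction indicator.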

BASE algorithm: The cut-off values for defining TF target genes and DEGs are usually arbitrary and hard to determine in advance. In contrast to Fisher’s exact test and the KS test, BASE is a nonparametric algorithm that requires no cut-off setting for TF target genes or DEGs (PMID:23885756). First, we calculated the t-scores for all genes by comparing their expression between a pair of subsets, and sorted them in decreasing order to obtain a ranked gene list. Each gene i in the list is associated with t_i, the t-score of this gene, and b_i, the binding affinity of the TF to this gene as calculated by the TIP algorithm (PMID:22121212). Then we calculated a cumulative distribution function by aggregating |t_i b_i| and a reference function by aggregating t_i. Finally, we calculated the maximum deviation between the two functions and applied a permutation-based method to normalize the score and estimate its significance. The normalized score is called the regulatory activity score (RAS), and it indicates the direction of the activity change of a TF. For a transcriptional activator, a positive/negative RAS indicates enhanced/reduced activity of the TF, while for a transcriptional repressor the reverse is true. A more detailed description of BASE can be found in PMID:23885756.
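The outline above can be sketched in a few lines. This is an illustration under stated assumptions, not the published BASE code: the reference function is assumed to aggregate |t_i| so that both functions are monotone, both are scaled to end at 1, the score is normalized by the mean absolute score over permutations of the binding affinities, and the P value is the fraction of permuted scores at least as extreme as the observed one. All function names and example values are hypothetical.

```python
import random


def max_deviation(t_scores, binding):
    """Signed maximum gap between the foreground and reference
    cumulative functions over genes sorted by decreasing t-score."""
    order = sorted(range(len(t_scores)),
                   key=lambda i: t_scores[i], reverse=True)
    fg = [abs(t_scores[i] * binding[i]) for i in order]
    bg = [abs(t_scores[i]) for i in order]
    fg_total, bg_total = sum(fg), sum(bg)
    best = f = g = 0.0
    for a, b in zip(fg, bg):
        f += a / fg_total
        g += b / bg_total
        if abs(f - g) > abs(best):
            best = f - g
    return best


def regulatory_activity_score(t_scores, binding, n_perm=1000, seed=0):
    """Return (normalized score, permutation P value) for one TF."""
    rng = random.Random(seed)
    observed = max_deviation(t_scores, binding)
    shuffled = list(binding)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the gene-to-affinity pairing
        null.append(abs(max_deviation(t_scores, shuffled)))
    ras = observed / (sum(null) / n_perm)
    p = (1 + sum(s >= abs(observed) for s in null)) / (n_perm + 1)
    return ras, p


# Hypothetical input: 20 genes, strong binding at the most up-regulated ones
t_scores = [5.0 - 0.5 * i for i in range(20)]
binding = [0.9] * 5 + [0.1] * 15
ras, p = regulatory_activity_score(t_scores, binding)
```

Because the strongly bound genes sit at the top of the ranked list, the foreground function rises faster than the reference and the score is positive; reversing the binding vector places them at the bottom and gives a negative deviation.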
