CellPred Help

Content

Main Model Equation

Basic model:
$$g = \beta_0 + \sum_{j=1}^{C} \beta_j p_j + e$$

A multivariate linear regression model is used to predict tissue components, where g is the expression value of a gene, pj is the percentage of tissue component j as determined by pathologists, and βj is the expression coefficient associated with cell type j. C is the number of tissue types under consideration. βj estimates the relative expression level in cell type j compared with the baseline expression level β0. Signals from cell types that are not part of the model are absorbed into β0 and e.

Genes that best predict the tissue percentages independently estimated by pathologists are identified in a training set. The fitted coefficients (β0 and βj) of each of these “predictive” genes are then applied to the expression data (g) in another data set to estimate the tissue percentages of each sample by solving for pj using linear regression.
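
The two steps above (train on samples with known percentages, then solve for the percentages of new samples) can be sketched as follows. This is a minimal illustration using NumPy ordinary least squares, not CellPred's actual implementation; the function names `fit_gene_coefficients` and `predict_percentages` and the array layout are assumptions made for the example.

```python
import numpy as np

def fit_gene_coefficients(expr, pct):
    """Fit g = b0 + sum_j b_j * p_j for every gene by ordinary least squares.

    expr: (n_samples, n_genes) expression values from the training set.
    pct:  (n_samples, C) tissue percentages estimated by pathologists.
    Returns (b0, B) with b0 of shape (n_genes,) and B of shape (C, n_genes).
    """
    design = np.column_stack([np.ones(len(pct)), pct])  # intercept + percentages
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return coef[0], coef[1:]

def predict_percentages(expr_new, b0, B):
    """Solve g - b0 = sum_j b_j * p_j for p, sample by sample.

    Each predictive gene contributes one equation; the unknowns are the
    C tissue percentages of the sample.
    """
    rhs = (expr_new - b0).T                          # (n_genes, n_new_samples)
    p, *_ = np.linalg.lstsq(B.T, rhs, rcond=None)    # B.T is (n_genes, C)
    return p.T                                       # (n_new_samples, C)
```

Because every predictive gene adds one equation per sample, the second step is overdetermined whenever more genes than tissue components are used, which is why least squares is a natural choice for this sketch.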

Extended model:
$$G_i = \alpha_i + \sum_{j=1}^{C} \beta_{ij} P_j + \sum_{k=1}^{N} \gamma_{ik} F_k + \epsilon_i$$

The basic model can be extended if Gi is also affected by other discontinuous factors, such as batch effects or hospital effects. Given N such factors, Fk represents the level of the kth factor, and γik determines the weight of this factor.

A matrix containing all parameters of the model (αi, βij, and γik) can be computed by fitting the model to Gi, Pj, and Fk in the training data sets. Conversely, once all parameters are known, the model can be fitted to Gi and Fk in an experimental data set to obtain Pj, the percentages of the different tissue components.
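
The prediction step can be written out explicitly. The rearrangement below is a sketch under stated assumptions: hats mark parameters already fitted on the training data, M denotes the number of predictive genes, the factor levels Fk are coded as known numeric values (for example indicator variables), and the least-squares objective is assumed rather than taken from the CellPred source.

```latex
% Move the known terms to the left-hand side for each predictive gene i:
G_i - \hat{\alpha}_i - \sum_{k=1}^{N} \hat{\gamma}_{ik} F_k
    = \sum_{j=1}^{C} \hat{\beta}_{ij} P_j + \epsilon_i ,
    \qquad i = 1, \dots, M

% The percentages of one sample are then the least-squares solution:
(\hat{P}_1, \dots, \hat{P}_C)
    = \arg\min_{P_1, \dots, P_C} \sum_{i=1}^{M}
      \Bigl( G_i - \hat{\alpha}_i - \sum_{k=1}^{N} \hat{\gamma}_{ik} F_k
             - \sum_{j=1}^{C} \hat{\beta}_{ij} P_j \Bigr)^{2}
```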

Intensity File

The intensity file contains gene expression values, either hybridization signals from a microarray or data from high-throughput sequencing. CellPred only accepts single-channel intensity files.
CellPred requires all intensity files used in training and testing to be from the same platform and in the same format: either Affymetrix CEL files or the Specific Format (tab-delimited text files). If necessary, users can preprocess data from multiple platforms outside the program and then import only those portions of each file that have an identical number of rows, in the same order and with the same IDs.


Affymetrix CEL File
    The CEL file of an Affymetrix expression GeneChip.

Specific Format
    Text files containing spot intensities. Each file must be a tab-delimited text file with the appropriate column names (case insensitive): “ID” (the unique identifier of each probe) and “ch1.Intensity” (the foreground intensity for channel 1, the only channel used by CellPred). All other columns are skipped by CellPred.

For example:
ID	ch1.Intensity
Probe1	564
Probe2	1005
Probe3	392
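
To make the column rules concrete, here is a minimal sketch of a reader for the Specific Format. It is not CellPred's own parser; the function name `read_intensities` and the rejection of duplicate probe IDs are assumptions based on the description of “ID” as a unique identifier.

```python
import csv
import io

def read_intensities(handle):
    """Return {probe ID: ch1 intensity} from a tab-delimited intensity file.

    Column names are matched case-insensitively; columns other than
    "ID" and "ch1.Intensity" are ignored.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip().lower() for name in next(reader)]
    try:
        id_col = header.index("id")
        value_col = header.index("ch1.intensity")
    except ValueError:
        raise ValueError('file needs "ID" and "ch1.Intensity" columns') from None
    intensities = {}
    for row in reader:
        if not row:                      # skip blank lines
            continue
        probe = row[id_col]
        if probe in intensities:         # assumption: duplicate IDs are an error
            raise ValueError(f"duplicate probe ID: {probe}")
        intensities[probe] = float(row[value_col])
    return intensities

example = "ID\tch1.Intensity\nProbe1\t564\nProbe2\t1005\nProbe3\t392\n"
print(read_intensities(io.StringIO(example)))
# {'Probe1': 564.0, 'Probe2': 1005.0, 'Probe3': 392.0}
```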

Cell-percentage File

This file contains the cell percentage information for every sample associated with the gene expression data files. The column "FileName" (case insensitive) must be provided. At least two other columns should be provided for cell percentages, and the values in each row must sum to no more than 100. The names of these columns can be chosen by the user, but each name must be unique. In this example, two types of cells are used: tumor epithelial cells ("Tumor cell") and stroma cells ("Stroma cell").

Here is an example of the cell-percentage file:
FileName	Tumor cell	Stroma cell
GSM102114.txt	75	20
GSM102116.txt	68	30
GSM102118.txt	45	50
GSM102120.txt	70	10
GSM102122.txt	82	10
GSM102115.txt	36	53
GSM102117.txt	65	32
GSM102119.txt	73	23
GSM102121.txt	25	70
GSM102123.txt	53	44
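
The rules for this file can be checked before training. The sketch below is illustrative only and assumes the file is tab-delimited like the intensity files; the function name `read_percentages`, the case-insensitive uniqueness check, and the rejection of negative values are assumptions rather than documented CellPred behavior.

```python
import csv
import io

def read_percentages(handle):
    """Return (cell_types, {file name: [percentages]}) after basic checks."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip() for name in next(reader)]
    lowered = [name.lower() for name in header]
    if "filename" not in lowered:
        raise ValueError('missing "FileName" column')
    if len(set(lowered)) != len(lowered):
        raise ValueError("column names must be unique")
    name_col = lowered.index("filename")
    cell_cols = [i for i in range(len(header)) if i != name_col]
    if len(cell_cols) < 2:
        raise ValueError("at least two cell-percentage columns are required")
    table = {}
    for row in reader:
        if not row:                      # skip blank lines
            continue
        values = [float(row[i]) for i in cell_cols]
        # Assumption: negative percentages are rejected as well.
        if any(v < 0 for v in values) or sum(values) > 100:
            raise ValueError(f"{row[name_col]}: percentages must sum to at most 100")
        table[row[name_col]] = values
    return [header[i] for i in cell_cols], table
```

The row sum is allowed to fall below 100 because cell types outside the model may account for the remainder.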

Factor File

If some factors are thought to affect the experimental data, e.g., the experiment batch or the hospital of origin of the samples, the user can include them in a factor file and use them in model training and percentage prediction. The variables in a factor file are used as fixed-effect discontinuous variables in the multiple linear regression model. If a model is trained with factors, the same factors must be provided when predicting with this model.

The structure of a factor file is very similar to that of a percentage file: one column for "FileName" and one or more other columns for the factors involved. If a factor file is used, each experiment (intensity file) involved must have a row in the factor file. Additional rows are ignored, so the training data set and the test data set can share one factor file.

Here is an example of the factor file:
FileName	Batch	Hospital
GSM102114.txt	1	scp
GSM102116.txt	1	mer
GSM102118.txt	2	scp
GSM102120.txt	2	mer
GSM102122.txt	1	scp
GSM102115.txt	2	mer
GSM102117.txt	3	scp
GSM102119.txt	3	scp
GSM102121.txt	3	mer
GSM102123.txt	3	mer
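
One common way to use discontinuous factors in a linear regression is to turn each factor into indicator (dummy) columns. The sketch below shows that encoding, together with the rule that extra rows in a shared factor file are ignored. It is an assumption about how such factors might be coded, not a description of CellPred's internals; the function name `encode_factors` and the choice of the first sorted level as the reference are hypothetical.

```python
def encode_factors(factor_rows, file_names):
    """Build indicator (dummy) columns for the listed intensity files.

    factor_rows: {file name: {factor name: level}} parsed from the factor file.
    file_names:  intensity files in the current data set, in order.
    Rows in factor_rows for other files are ignored; a missing file is an error.
    """
    missing = [name for name in file_names if name not in factor_rows]
    if missing:
        raise ValueError(f"no factor row for: {missing}")
    factors = sorted({f for name in file_names for f in factor_rows[name]})
    columns = []
    for factor in factors:
        levels = sorted({str(factor_rows[name][factor]) for name in file_names})
        # Drop the first level as the reference so the indicator columns
        # are not collinear with the intercept.
        columns += [(factor, level) for level in levels[1:]]
    matrix = [
        [1 if str(factor_rows[name][factor]) == level else 0
         for factor, level in columns]
        for name in file_names
    ]
    return columns, matrix
```

In this sketch the levels are derived from the files passed in, so for prediction the column list produced during training would need to be reused to keep the two encodings aligned.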

Train Model


Number of probes to be used in the model
    CellPred uses correlation coefficients, instead of the F statistics mentioned in the manuscript (to be published), to rank the probes that are most closely related to cell percentages. The two approaches generate similar prediction results; CellPred uses the correlation-coefficient approach because it runs faster. In this strategy, genes are ranked by the correlation coefficients between their expression values and the percentages of tissue components across all samples in the training data set, and only the probes with the highest correlations with any tissue component are used for regression.
    The user needs to define the number of probes to be used in the prediction model. We suggest trying different numbers of probes to assess the effect of the number of probes on prediction.
    Increasing the number of genes used in the prediction model does not necessarily increase the prediction power. In the example prostate cancer study, we found that the most accurate predictions could be reached with 250 genes or fewer.
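
The ranking strategy described in this section can be sketched as follows. This is an illustration rather than CellPred's code: it assumes Pearson correlation and ranks each probe by its largest absolute correlation with any tissue component, neither of which is stated explicitly in this help text. The names `pearson` and `top_probes` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def top_probes(expression, percentages, n_probes):
    """Return the n_probes probe IDs most correlated with any tissue component.

    expression:  {probe ID: [value per training sample]}
    percentages: {component name: [percentage per training sample]}
    """
    score = {
        # Assumption: strongly negative correlations are as useful as
        # strongly positive ones, so the absolute value is used.
        probe: max(abs(pearson(values, pct)) for pct in percentages.values())
        for probe, values in expression.items()
    }
    return sorted(score, key=score.get, reverse=True)[:n_probes]
```

Calling `top_probes` with several values of `n_probes` and comparing the resulting predictions is one way to follow the suggestion to try different numbers of probes.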

Predict

Use a trained model to predict cell percentages for other sets of expression data files.