Help of CellPred

Content

Main Model Equation
Intensity File
Cell-percentage File
Factor File
Train Model
Predict

Main Model Equation

Basic model:

$$ g = \beta_0 + \sum_{j=1}^{C} \beta_j p_j + e $$

A multivariate linear regression model is used to predict tissue components. Here g is the expression value of a gene, p_j is the percentage of tissue component j as determined by pathologists, β_j is the expression coefficient associated with cell type j, and C is the number of tissue types under consideration. β_j estimates the relative expression level in cell type j compared with the baseline expression level β₀. Signals from cell types that are not part of the model are absorbed into β₀ and e.

Genes whose expression best predicts the tissue percentages independently estimated by pathologists are identified in a training set. The coefficients fitted for each of these “predictive” genes are then applied to the expression data g of another data set, and the tissue percentages of each sample are estimated by solving for p_j using linear regression.
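
The two steps above can be sketched in Python with NumPy. This is only an illustration of the basic model, not CellPred's own implementation: the function names, the synthetic noise-free data, and the use of ordinary least squares for both steps are assumptions, and the selection of “predictive” genes is skipped.

```python
import numpy as np

def fit_gene_coefficients(expr, pct):
    """Training step: fit g = beta0 + sum_j beta_j * p_j for every gene.

    expr: (samples, genes) expression values
    pct:  (samples, C) pathologist-estimated percentages
    Returns beta0 with shape (genes,) and beta with shape (genes, C).
    """
    design = np.column_stack([np.ones(len(pct)), pct])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return coef[0], coef[1:].T

def predict_percentages(g, beta0, beta):
    """Prediction step: solve g - beta0 = beta @ p for p (one sample)."""
    p, *_ = np.linalg.lstsq(beta, g - beta0, rcond=None)
    return p

# Hypothetical synthetic data: 20 genes, 2 tissue components, 30 samples.
rng = np.random.default_rng(0)
true_beta0 = rng.uniform(5, 10, size=20)
true_beta = rng.uniform(-0.05, 0.05, size=(20, 2))
train_pct = rng.uniform(0, 50, size=(30, 2))
train_expr = true_beta0 + train_pct @ true_beta.T  # noise-free for clarity

beta0, beta = fit_gene_coefficients(train_expr, train_pct)

# A new sample whose true percentages are 40 and 25.
new_g = true_beta0 + true_beta @ np.array([40.0, 25.0])
estimated = predict_percentages(new_g, beta0, beta)
print(estimated)
```

Because the synthetic data contain no noise term e, the estimate matches the true percentages up to floating-point error; with real data the estimate is a least-squares compromise across all predictive genes.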

Extended model:

$$ G_i = \alpha_i + \sum_{j=1}^{C} \beta_{ij} P_j + \sum_{k=1}^{N} \gamma_{ik} F_k + \epsilon_i $$

The basic model can be extended if G_i, the expression value of gene i, is also affected by other discontinuous (categorical) factors, such as batch effects or hospital effects. Given N such factors, F_k represents the level of the kth factor, and γ_ik is the weight of this factor for gene i.

A matrix containing all parameters of the model (α_i, β_ij, and γ_ik) can be computed by fitting the model to G_i, P_j, and F_k in the training data sets. Conversely, once all parameters are known, the model can be fitted to G_i and F_k in an experimental data set, which yields estimates of P_j, the percentages of the different tissue components.
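
The fit-then-invert procedure for the extended model can be sketched as follows. This is a hedged illustration rather than CellPred's actual code: the function names and synthetic data are hypothetical, and the sketch assumes that each factor is dummy-coded with its first level as the reference, which the text above does not specify.

```python
import numpy as np

def one_hot(levels, categories):
    """Dummy-code one factor, using categories[0] as the reference level."""
    return np.array([[1.0 if lv == c else 0.0 for c in categories[1:]]
                     for lv in levels])

def fit_extended(G, P, F):
    """Fit alpha_i, beta_ij, gamma_ik for all genes at once.

    G: (samples, genes), P: (samples, C), F: (samples, K) dummy columns.
    """
    design = np.column_stack([np.ones(len(P)), P, F])
    coef, *_ = np.linalg.lstsq(design, G, rcond=None)
    c = P.shape[1]
    return coef[0], coef[1:1 + c].T, coef[1 + c:].T

def predict_extended(g, f, alpha, beta, gamma):
    """Remove the intercept and factor effects, then solve for P."""
    residual = g - alpha - gamma @ f
    p, *_ = np.linalg.lstsq(beta, residual, rcond=None)
    return p

# Hypothetical synthetic training data: 20 genes, 2 components, 3 batches.
rng = np.random.default_rng(1)
alpha_true = rng.uniform(5, 10, size=20)
beta_true = rng.uniform(-0.05, 0.05, size=(20, 2))
gamma_true = rng.uniform(-1, 1, size=(20, 2))
P_train = rng.uniform(0, 50, size=(30, 2))
F_train = one_hot([1, 2, 3] * 10, categories=[1, 2, 3])
G_train = alpha_true + P_train @ beta_true.T + F_train @ gamma_true.T

alpha, beta, gamma = fit_extended(G_train, P_train, F_train)

# Experimental sample from batch 3 with true percentages 60 and 30.
f_new = one_hot([3], categories=[1, 2, 3])[0]
g_new = alpha_true + beta_true @ np.array([60.0, 30.0]) + gamma_true @ f_new
p_hat = predict_extended(g_new, f_new, alpha, beta, gamma)
print(p_hat)
```

Dropping one level per factor keeps the design matrix full rank alongside the intercept α_i; other codings are possible and would only change how γ_ik is interpreted.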

Intensity File

An intensity file contains gene expression values, either hybridization signals from a microarray or data from high-throughput sequencing. CellPred accepts only single-channel intensity files.
CellPred requires all intensity files used in training and testing to come from the same platform and to be in the same format: either Affymetrix CEL files or the Specific Format (tab-delimited text files). If necessary, users can preprocess data from multiple platforms outside the program and then import only those portions of each file that have an identical number of rows, in the same order and with the same IDs.

Affymetrix CEL File: The CEL file of an Affymetrix expression GeneChip.
Specific Format: A tab-delimited text file containing spot intensities. The file must have the following column names (case-insensitive): “ID” (the unique identifier of each probe) and “ch1.Intensity” (the foreground intensity of channel 1, the only channel used by CellPred). All other columns are ignored by CellPred.

ID	ch1.Intensity
Probe1	564
Probe2	1005
Probe3	392
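
For users preparing or checking such files outside CellPred, the rules above (tab-delimited, case-insensitive column names, extra columns ignored) can be mirrored in a few lines of Python. The function name `read_intensity` and the extra `Flag` column in the sample are hypothetical; this is a sketch of the format, not CellPred's parser.

```python
import csv
import io

def read_intensity(handle):
    """Read a Specific Format file into a {probe ID: intensity} dict."""
    reader = csv.reader(handle, delimiter="\t")
    # Column names are matched case-insensitively.
    header = [name.strip().lower() for name in next(reader)]
    id_col = header.index("id")
    intensity_col = header.index("ch1.intensity")
    intensities = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        intensities[row[id_col]] = float(row[intensity_col])
    return intensities

# The example above, with differently cased headers and one extra column.
sample = (
    "id\tCH1.Intensity\tFlag\n"
    "Probe1\t564\t0\n"
    "Probe2\t1005\t0\n"
    "Probe3\t392\t1\n"
)
values = read_intensity(io.StringIO(sample))
print(values)
```

`header.index` raises a `ValueError` when a required column is missing, which makes a malformed file fail early.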

Cell-percentage File

FileName	Tumor cell	Stroma cell
GSM102114.txt	75	20
GSM102116.txt	68	30
GSM102118.txt	45	50
GSM102120.txt	70	10
GSM102122.txt	82	10
GSM102115.txt	36	53
GSM102117.txt	65	32
GSM102119.txt	73	23
GSM102121.txt	25	70
GSM102123.txt	53	44
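
A table like the one above can be loaded and sanity-checked before training. The sketch below is illustrative only: `read_percentages` is a hypothetical helper, and the check that the percentages of a sample are non-negative and sum to at most 100 is an assumption, not a documented CellPred requirement.

```python
import csv
import io

def read_percentages(handle):
    """Return (component names, {file name: [percentages]})."""
    reader = csv.DictReader(handle, delimiter="\t")
    components = [name for name in reader.fieldnames if name != "FileName"]
    table = {}
    for row in reader:
        values = [float(row[name]) for name in components]
        # Assumed sanity check: percentages are non-negative and sum to <= 100.
        if any(v < 0 for v in values) or sum(values) > 100:
            raise ValueError(f"implausible percentages for {row['FileName']}")
        table[row["FileName"]] = values
    return components, table

sample = (
    "FileName\tTumor cell\tStroma cell\n"
    "GSM102114.txt\t75\t20\n"
    "GSM102116.txt\t68\t30\n"
    "GSM102118.txt\t45\t50\n"
)
components, table = read_percentages(io.StringIO(sample))
print(components, table["GSM102114.txt"])
```

The components need not sum to exactly 100, because tissue types outside the model are absorbed into the intercept and error terms.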

Factor File

FileName	Batch	Hospital
GSM102114.txt	1	scp
GSM102116.txt	1	mer
GSM102118.txt	2	scp
GSM102120.txt	2	mer
GSM102122.txt	1	scp
GSM102115.txt	2	mer
GSM102117.txt	3	scp
GSM102119.txt	3	scp
GSM102121.txt	3	mer
GSM102123.txt	3	mer
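
Because factors such as Batch and Hospital are labels rather than quantities, one way to turn a table like the one above into numeric F_k columns is dummy coding. The sketch below assumes that every column other than FileName is categorical and that the alphabetically first level of each factor serves as the reference; both choices, and the helper name `read_factors`, are assumptions made for illustration.

```python
import csv
import io

def read_factors(handle):
    """Return (dummy column names, {file name: [0/1 indicators]})."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    factor_names = [name for name in rows[0] if name != "FileName"]
    # Levels are treated as text labels, so batch "1" is a label, not a number.
    levels = {name: sorted({row[name] for row in rows}) for name in factor_names}
    columns = [f"{name}={level}"
               for name in factor_names for level in levels[name][1:]]
    design = {
        row["FileName"]: [1 if row[name] == level else 0
                          for name in factor_names
                          for level in levels[name][1:]]
        for row in rows
    }
    return columns, design

sample = (
    "FileName\tBatch\tHospital\n"
    "GSM102114.txt\t1\tscp\n"
    "GSM102116.txt\t1\tmer\n"
    "GSM102118.txt\t2\tscp\n"
    "GSM102117.txt\t3\tscp\n"
)
columns, design = read_factors(io.StringIO(sample))
print(columns)
print(design["GSM102118.txt"])
```

A sample at the reference level of every factor (here batch 1, hospital mer) is encoded as all zeros, so its factor effects are carried by the gene's intercept.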