- Affymetrix CEL File
- The CEL file of an Affymetrix expression GeneChip.
- Specific Format
- A text file containing spot intensities. It must be a tab-delimited text file
with the appropriate column names (case insensitive): “ID” (the unique
identifier for each probe) and “ch1.Intensity” (the foreground intensity for
channel 1, the only channel used by CellPred). All other columns are skipped
by CellPred.
For example (download here):
ID | ch1.Intensity |
---|---|
Probe1 | 564 |
Probe2 | 1005 |
Probe3 | 392 |
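
The column rules above can be illustrated with a minimal Python sketch. It is not CellPred's own parser: the function name `read_intensities` and the sample data are hypothetical, and the sketch only assumes what the format description states (tab-delimited text, case-insensitive column names, extra columns ignored).

```python
import csv
import io

def read_intensities(handle):
    """Return {probe_id: intensity} from a tab-delimited intensity file."""
    reader = csv.reader(handle, delimiter="\t")
    # Column names are matched case-insensitively, as the format requires.
    header = [name.strip().lower() for name in next(reader)]
    if "id" not in header or "ch1.intensity" not in header:
        raise ValueError("file must contain 'ID' and 'ch1.Intensity' columns")
    id_col = header.index("id")
    ch1_col = header.index("ch1.intensity")
    intensities = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Every column other than ID and ch1.Intensity is ignored.
        intensities[row[id_col]] = float(row[ch1_col])
    return intensities

# Hypothetical sample with a differently cased header and one extra column.
sample = (
    "id\tCH1.Intensity\tch2.Intensity\n"
    "Probe1\t564\t12\n"
    "Probe2\t1005\t40\n"
    "Probe3\t392\t7\n"
)
print(read_intensities(io.StringIO(sample)))
```

Running a check like this before uploading can catch a missing or misspelled column name early.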
Cell-percentage File
This file contains the cell percentages for every sample that has a gene
expression data file. A "FileName" column (case insensitive) must be provided,
together with at least two other columns of cell percentages. The sum of the
percentages in each row must be no more than 100. The names of the percentage
columns can be defined by the user, but each name must be unique. In this
example two types of cells are used: tumor epithelial cells ("Tumor cell") and
stroma cells ("Stroma cell").
Here is an example of the cell-percentage file:
FileName | Tumor cell | Stroma cell |
---|---|---|
GSM102114.txt | 75 | 20 |
GSM102116.txt | 68 | 30 |
GSM102118.txt | 45 | 50 |
GSM102120.txt | 70 | 10 |
GSM102122.txt | 82 | 10 |
GSM102115.txt | 36 | 53 |
GSM102117.txt | 65 | 32 |
GSM102119.txt | 73 | 23 |
GSM102121.txt | 25 | 70 |
GSM102123.txt | 53 | 44 |
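
The constraints on the cell-percentage file can be checked with a short Python sketch. It is an illustration only: the function name `read_percentages` is hypothetical, and the tab delimiter is an assumption, since the text does not state the delimiter for this file.

```python
import csv
import io

def read_percentages(handle, delimiter="\t"):
    """Return {file_name: {cell_type: percentage}} after basic validation."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    lowered = [name.lower() for name in header]
    if "filename" not in lowered:
        raise ValueError("a 'FileName' column is required")
    name_col = lowered.index("filename")
    cell_cols = [i for i in range(len(header)) if i != name_col]
    if len(cell_cols) < 2:
        raise ValueError("at least two cell-percentage columns are expected")
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = {header[i]: float(row[i]) for i in cell_cols}
        # Each row may sum to at most 100 percent.
        if sum(values.values()) > 100:
            raise ValueError(f"percentages for {row[name_col]} exceed 100")
        table[row[name_col]] = values
    return table

# First two rows of the example table, written as tab-delimited text.
sample = (
    "FileName\tTumor cell\tStroma cell\n"
    "GSM102114.txt\t75\t20\n"
    "GSM102116.txt\t68\t30\n"
)
print(read_percentages(io.StringIO(sample)))
```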
Factor File
If factors such as the experimental batch or the hospital of origin of the
samples are thought to affect the experimental data, the user can include them
in a factor file and use them in model training and percentage prediction. The
variables in a factor file are used as fixed-effect categorical
(discontinuous) variables in the multiple linear regression model. If a model
is trained with factors, the same factors must be provided when predicting
with that model.
The structure of a factor file is very similar to that of a percentage file:
one column for "FileName" and one or more other columns for the factors
involved. If a factor file is used, each experiment (intensity file) involved
must have a row in the factor file. Additional rows are ignored, so the
training data set and the test data set can share one factor file.
Here is an example for the factor file:
FileName | Batch | Hospital |
---|---|---|
GSM102114.txt | 1 | scp |
GSM102116.txt | 1 | mer |
GSM102118.txt | 2 | scp |
GSM102120.txt | 2 | mer |
GSM102122.txt | 1 | scp |
GSM102115.txt | 2 | mer |
GSM102117.txt | 3 | scp |
GSM102119.txt | 3 | scp |
GSM102121.txt | 3 | mer |
GSM102123.txt | 3 | mer |
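
The row-matching rule and the use of factors as categorical variables can be sketched in Python as follows. This is a sketch under stated assumptions, not CellPred's implementation: the helper names `select_factor_rows` and `dummy_encode` are hypothetical, and the indicator coding with the first level as reference is one common way to enter a categorical factor into a linear regression. The text does not say which coding CellPred uses.

```python
def select_factor_rows(factor_table, intensity_files):
    """Keep one factor row per intensity file; extra rows are ignored."""
    missing = [name for name in intensity_files if name not in factor_table]
    if missing:
        raise ValueError(f"no factor row for: {', '.join(missing)}")
    return {name: factor_table[name] for name in intensity_files}

def dummy_encode(rows, factor):
    """Encode one factor as 0/1 indicators (assumed coding, first level = reference)."""
    levels = sorted({row[factor] for row in rows.values()})
    return {
        name: [1 if row[factor] == level else 0 for level in levels[1:]]
        for name, row in rows.items()
    }

# A shared factor file covering both training and test samples.
factor_table = {
    "GSM102114.txt": {"Batch": "1", "Hospital": "scp"},
    "GSM102116.txt": {"Batch": "1", "Hospital": "mer"},
    "GSM102118.txt": {"Batch": "2", "Hospital": "scp"},
    "GSM102117.txt": {"Batch": "3", "Hospital": "scp"},
}
training_files = ["GSM102114.txt", "GSM102118.txt", "GSM102117.txt"]

rows = select_factor_rows(factor_table, training_files)
print(dummy_encode(rows, "Batch"))
```

Batch numbers are kept as strings here so that they are treated as labels rather than as a numeric scale.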
Train Model
- Number of probes to be used in the model
- CellPred uses correlation coefficients, rather than the F statistics mentioned in the manuscript (to be published), to rank probes by how closely they are related to cell percentages. The two approaches generated similar prediction results; CellPred uses the correlation-coefficient approach because it runs faster. In this strategy, genes are ranked by the correlation coefficients between their expression values and the percentages of tissue components across all samples in the training data set, and only the probes with the highest correlations with any tissue component are used for regression.
The user needs to define the number of probes to be used in the prediction model. We suggest trying several different numbers of probes to assess the effect of the number of probes on prediction.
Increasing the number of genes used in the prediction model does not necessarily increase the prediction power. In the prostate cancer study used as an example, we found that the most accurate predictions could be reached with 250 genes or fewer.
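
The ranking strategy described above can be sketched in plain Python. This is a minimal illustration, not CellPred's code: the function names and the toy data are hypothetical, Pearson correlation is assumed as the correlation coefficient, and scoring each probe by its largest absolute correlation with any cell type is an assumption about what "higher correlations with any tissue component" means.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no ranking information
    return sxy / math.sqrt(sxx * syy)

def top_probes(expression, percentages, n_probes):
    """Rank probes by their best |r| with any cell type and keep the top n."""
    scores = {
        probe: max(abs(pearson(values, pct)) for pct in percentages.values())
        for probe, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_probes]

# Toy training data: one value per sample, in the same sample order.
expression = {
    "Probe1": [1.0, 2.0, 3.0, 4.0],
    "Probe2": [5.0, 1.0, 4.0, 2.0],
    "Probe3": [4.0, 3.0, 2.0, 1.5],
}
percentages = {
    "Tumor cell": [10, 20, 30, 40],
    "Stroma cell": [80, 70, 65, 50],
}
print(top_probes(expression, percentages, n_probes=2))
```

Calling `top_probes` with several values of `n_probes` is one way to try different numbers of probes, as suggested above.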
Predict
Use a trained model to predict cell percentages for other sets of expression data files.