Pipeline for building testing data
The software packages and models that have been evaluated are summarized in the following tables.
Software packages | Input | Algorithm | Features | Online analysis | Binary/source |
---|---|---|---|---|---|
CPC | Sequence | SVM | ORF, consv | http://cpc.cbi.pku.edu.cn/programs/run_cpc.jsp | http://cpc.cbi.pku.edu.cn |
CPC2 | Sequence | SVM | Fickett, ORF, pI | http://cpc2.cbi.pku.edu.cn/ | http://cpc2.cbi.pku.edu.cn/ |
CNCI | Sequence | SVM | MLCDS | NaN | http://www.bioinfo.org/software/cnci |
CPAT | Sequence/(GM and R) | LR | ORF, Fickett, hexamers | http://lilab.research.bcm.edu/cpat/ | https://sourceforge.net/projects/rna-cpat/files/ |
FEELnc | Sequence | RF | ORF; k-mer | NaN | https://github.com/tderrien/FEELnc |
hmmscan | Sequence | Cut-off | SS | https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan | http://hmmer.org/download.html |
longdist | Sequence | SVM | np of ORF; ORF | NaN | https://github.com/hugowschneider/longdist.py |
PLEK | Sequence | SVM | k-mer | NaN | https://sourceforge.net/projects/plek/files/ |
PLncPRO | Sequence | RF | ORF; consv | NaN | http://ccbb.jnu.ac.in/plncpro |
RNAplonc | Sequence | REPTree | k-mer; ORF | NaN | https://github.com/TatianneNegri/RNAplonc |
COME | GM | BRF | GC%, consv, SS | NaN | https://github.com/lulab/COME |
iSeeRNA | GM | LR | ORF, di-mer, tri-mer, consv | http://sunlab.cpy.cuhk.edu.hk/iSeeRNA/webserver.html | https://sunlab.cpy.cuhk.edu.hk/iSeeRNA/download.html |
lncRScan-SVM | GM | SVM | ORF, tri-mer, exon, consv | NaN | https://sourceforge.net/projects/lncrscansvm/files/ |
lncScore | Sequence and GM | LR | ORF, exon, MCSS | NaN | https://github.com/WGLab/lncScore |
Input: GM (gene model, mostly a GTF file), R (reference genome); Algorithm: SVM (support vector machine), LR (logistic regression), RF (random forest), REPTree (Reduced Error Pruning Tree), BRF (balanced random forest); Features: consv (sequence conservation), SS (secondary structures), np (nucleotide patterns), MCSS (maximum coding subsequence), MLCDS (most-like coding domain sequence), Fickett (Fickett TESTCODE score), pI (isoelectric point), socf (sequence-order correlation factors); Online analysis: URL of the web server, or NaN if the software has no web server.
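To make a few of the feature abbreviations above more concrete, here is a minimal Python sketch of three sequence-derived features: GC%, the longest ORF, and k-mer frequencies. It is an illustration only and not the implementation used by any of the listed packages. The function names, the ATG-to-stop ORF definition, and the restriction to the three forward frames are all assumptions made for this example.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_percent(seq):
    """Percentage of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq):
    """Length in nt of the longest ATG..stop ORF in the three forward frames.

    Assumption for this sketch: an ORF starts at the first ATG in a frame,
    ends at the next in-frame stop codon, and the stop codon is counted.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq, k=3):
    """Relative frequency of every overlapping k-mer observed in the sequence."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}
```

Each package defines and combines its features differently (for example, reverse-strand ORFs or ORFs without a stop codon), so these helpers should be read as a rough guide to what the abbreviations refer to.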
Name of model | Software | Attribute of model | Group |
---|---|---|---|
CPC | CPC | - | J & R |
CPC2 | CPC2 | - | J & R |
CNCI_ve | CNCI | vertebrate [a] | J & R |
CNCI_pl | CNCI | plant [a] | - |
CPAT_human | CPAT | human [a] | J & R |
CPAT_mouse | CPAT | mouse [a] | R |
CPAT_zebrafish | CPAT | zebrafish [a] | - |
CPAT_fly | CPAT | fruit fly [a] | - |
FEELnc_hm_cl | FEELnc | human; cl [b] | - |
FEELnc_hm_sf | FEELnc | human; sf [b] | - |
FEELnc_ms_cl | FEELnc | mouse; cl [b] | - |
FEELnc_ms_sf | FEELnc | mouse; sf [b] | - |
FEELnc_zf_cl | FEELnc | zebrafish; cl [b] | - |
FEELnc_zf_sf | FEELnc | zebrafish; sf [b] | - |
FEELnc_ff_cl | FEELnc | fruit fly; cl [b] | - |
FEELnc_ff_sf | FEELnc | fruit fly; sf [b] | - |
FEELnc_wm_cl | FEELnc | worm; cl [b] | - |
FEELnc_wm_sf | FEELnc | worm; sf [b] | - |
FEELnc_ab_cl | FEELnc | arabidopsis; cl [b] | - |
FEELnc_ab_sf | FEELnc | arabidopsis; sf [b] | - |
FEELnc_all_cl | FEELnc | combined six species data; cl [b] | J & R |
FEELnc_all_sf | FEELnc | combined six species data; sf [b] | R |
hmmscan_A | hmmscan | Pfam-A [c] | J & R |
hmmscan_B | hmmscan | Pfam-B [c] | - |
hmmscan_both | hmmscan | Pfam-A and Pfam-B [c] | - |
longdist_GRCh37 | longdist | human37 [a] | - |
longdist_GRCh37_GRCm38 | longdist | human37_mouse [a] | - |
longdist_GRCh38 | longdist | human38 [a] | - |
longdist_GRCh38_GRCm38 | longdist | human38_mouse [a] | - |
longdist_GRCm38 | longdist | mouse [a] | - |
longdist_GRCm38_GRCz10 | longdist | mouse_zebrafish [a] | - |
PLEK | PLEK | - | J & R |
PLncPRO_mono | PLncPRO | monocots [a] | - |
PLncPRO_dico | PLncPRO | dicots [a] | J & R |
RNAplonc_cut | RNAplonc | remove results missing label [d] | - |
RNAplonc_guess | RNAplonc | label the missing-label as lncRNA [d] | J & R |
COME_seq | COME * | multiple sequence-derived features only [e] | - |
COME_all | COME * | sequence-derived features, expression features and histone features [e] | R |
iSeeRNA | iSeeRNA * | - | R |
lncRScan-SVM | lncRScan-SVM * | - | R |
lncScore | lncScore * | - | R |
Software marked with "*" works only with a limited set of species. Attribute of model: the key attribute that distinguishes the models of one software package. [a] the species of the training data; [b] the species of the training data and the way the training data is used. Specifically, "cl" means that both coding and noncoding sequences are real transcripts, while "sf" means that the noncoding sequences are shuffled from coding sequences; [c] which database is used; [d] the way the results are processed; [e] the features used for training. Group: models are grouped to perform different comparisons in our research. "J" stands for J-models, which were used in joint prediction, and "R" stands for representative models.
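The two RNAplonc result-processing strategies in note [d] can be sketched as follows. This is a hedged illustration, not the code used in the pipeline: the function name `resolve_missing_labels`, the dictionary-based prediction format, and the label strings `"lncRNA"` and `"mRNA"` are hypothetical and chosen only for this example.

```python
def resolve_missing_labels(transcript_ids, predictions, strategy):
    """Apply the "cut" or "guess" strategy to transcripts without a label.

    transcript_ids: all transcripts submitted for prediction.
    predictions:    hypothetical mapping from transcript id to predicted label;
                    transcripts with no result are absent or mapped to None.
    strategy:       "cut" drops transcripts with a missing label,
                    "guess" labels them as lncRNA.
    """
    if strategy not in ("cut", "guess"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    resolved = {}
    for tid in transcript_ids:
        label = predictions.get(tid)
        if label is not None:
            resolved[tid] = label
        elif strategy == "guess":
            # RNAplonc_guess: treat a missing label as lncRNA.
            resolved[tid] = "lncRNA"
        # RNAplonc_cut: transcripts with a missing label are left out.
    return resolved
```

With three input transcripts of which one has no result, `"cut"` returns two entries and `"guess"` returns three, so the two models are evaluated on different numbers of transcripts.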