Pipeline for building testing data

Preparing testing data

LncRNA identification requires sequence data in FASTA or GTF format. Most sequence data in this study were downloaded directly from public databases; the rest were assembled from data generated by in silico simulation or actual sequencing.
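As a rough illustration of the FASTA input mentioned above, the following sketch reads FASTA text into a dictionary of transcript sequences. It is not part of the pipeline described here: the function name `read_fasta` and the example records `tx1` and `tx2` are hypothetical, and a real workflow would more likely rely on an established parser.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; join them and normalize the case.
            records[current_id] += line.upper()
    return records


# Hypothetical two-record input, used only to show the expected shape.
example = """>tx1 hypothetical transcript
ATGGCC
TTTtaa
>tx2
GGGCCC
""".splitlines()

transcripts = read_fasta(example)
```

Here `transcripts` maps each record ID to its full sequence, which is the form most sequence-based tools in Table 1 take as input after it is written back to a FASTA file.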

[Figure: prepare_data1]

The software and models that were evaluated are summarized in the following tables.

Table 1: Software for lncRNA identification
| Software package | Input | Algorithm | Features | Online analysis | Binary/source |
|---|---|---|---|---|---|
| CPC | Sequence | SVM | ORF, consv | http://cpc.cbi.pku.edu.cn/programs/run_cpc.jsp | http://cpc.cbi.pku.edu.cn |
| CPC2 | Sequence | SVM | Fickett, ORF, pI | http://cpc2.cbi.pku.edu.cn/ | http://cpc2.cbi.pku.edu.cn/ |
| CNCI | Sequence | SVM | MLCDS | NaN | http://www.bioinfo.org/software/cnci |
| CPAT | Sequence/(GM and R) | LR | ORF, Fickett, hexamers | http://lilab.research.bcm.edu/cpat/ | https://sourceforge.net/projects/rna-cpat/files/ |
| FEELnc | Sequence | RF | ORF; k-mer | NaN | https://github.com/tderrien/FEELnc |
| hmmscan | Sequence | Cut-off | SS | https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan | http://hmmer.org/download.html |
| longdist | Sequence | SVM | np of ORF; ORF | NaN | https://github.com/hugowschneider/longdist.py |
| PLEK | Sequence | SVM | k-mer | NaN | https://sourceforge.net/projects/plek/files/ |
| PLncPRO | Sequence | RF | ORF; consv | NaN | http://ccbb.jnu.ac.in/plncpro |
| RNAplonc | Sequence | REPTree | k-mer; ORF | NaN | https://github.com/TatianneNegri/RNAplonc |
| COME | GM | BRF | GC%, conservation, SS | NaN | https://github.com/lulab/COME |
| iSeeRNA | GM | LR | ORF, di-mer, tri-mer, consv | http://sunlab.cpy.cuhk.edu.hk/iSeeRNA/webserver.html | https://sunlab.cpy.cuhk.edu.hk/iSeeRNA/download.html |
| lncRScan-SVM | GM | SVM | ORF, tri-mer, exon, consv | NaN | https://sourceforge.net/projects/lncrscansvm/files/ |
| lncScore | Sequence and GM | LR | ORF, exon, MCSS | NaN | https://github.com/WGLab/lncScore |
Input: GM (gene model, mostly a GTF file), R (reference genome). Algorithm: SVM (support vector machine), LR (logistic regression), RF (random forest), REPTree (reduced error pruning tree), BRF (balanced random forest). Features: consv (sequence conservation), SS (secondary structures), np (nucleotide patterns), MCSS (maximum coding subsequence), MLCDS (most-like coding domain sequence), Fickett (Fickett TESTCODE score), pI (isoelectric point), socf (sequence-order correlation factors). Online analysis: URL of the web server; NaN means that no web server is available.
Table 2: Models for lncRNA identification
| Name of model | Software | Attribute of model | Group |
|---|---|---|---|
| CPC | CPC | - | J & R |
| CPC2 | CPC2 | - | J & R |
| CNCI_ve | CNCI | vertebrate [a] | J & R |
| CNCI_pl | CNCI | plant [a] | - |
| CPAT_human | CPAT | human [a] | J & R |
| CPAT_mouse | CPAT | mouse [a] | R |
| CPAT_zebrafish | CPAT | zebrafish [a] | - |
| CPAT_fly | CPAT | fruit fly [a] | - |
| FEELnc_hm_cl | FEELnc | human; cl [b] | - |
| FEELnc_hm_sf | FEELnc | human; sf [b] | - |
| FEELnc_ms_cl | FEELnc | mouse; cl [b] | - |
| FEELnc_ms_sf | FEELnc | mouse; sf [b] | - |
| FEELnc_zf_cl | FEELnc | zebrafish; cl [b] | - |
| FEELnc_zf_sf | FEELnc | zebrafish; sf [b] | - |
| FEELnc_ff_cl | FEELnc | fruit fly; cl [b] | - |
| FEELnc_ff_sf | FEELnc | fruit fly; sf [b] | - |
| FEELnc_wm_cl | FEELnc | worm; cl [b] | - |
| FEELnc_wm_sf | FEELnc | worm; sf [b] | - |
| FEELnc_ab_cl | FEELnc | arabidopsis; cl [b] | - |
| FEELnc_ab_sf | FEELnc | arabidopsis; sf [b] | - |
| FEELnc_all_cl | FEELnc | combined six species data; cl [b] | J & R |
| FEELnc_all_sf | FEELnc | combined six species data; sf [b] | R |
| hmmscan_A | hmmscan | Pfam-A [c] | J & R |
| hmmscan_B | hmmscan | Pfam-B [c] | - |
| hmmscan_both | hmmscan | Pfam-A and Pfam-B [c] | - |
| longdist_GRCh37 | longdist | human37 [a] | - |
| longdist_GRCh37_GRCm38 | longdist | human37_mouse [a] | - |
| longdist_GRCh38 | longdist | human38 [a] | - |
| longdist_GRCh38_GRCm38 | longdist | human38_mouse [a] | - |
| longdist_GRCm38 | longdist | mouse [a] | - |
| longdist_GRCm38_GRCz10 | longdist | mouse_zebrafish [a] | - |
| PLEK | PLEK | - | J & R |
| PLncPRO_mono | PLncPRO | monocots [a] | - |
| PLncPRO_dico | PLncPRO | dicots [a] | J & R |
| RNAplonc_cut | RNAplonc | remove results missing label [d] | - |
| RNAplonc_guess | RNAplonc | label the missing-label as lncRNA [d] | J & R |
| COME_seq | COME * | multiple sequence-derived features only [e] | - |
| COME_all | COME * | sequence-derived features, expression features and histone features [e] | R |
| iSeeRNA | iSeeRNA * | - | R |
| lncRScan-SVM | lncRScan-SVM * | - | R |
| lncScore | lncScore * | - | R |
Software marked with "*" works only with a limited set of species. Attribute of model: the key attribute that distinguishes the models of one software package. [a] the species of the training data; [b] the species of the training data and the way the training data are used: "cl" means that both coding and noncoding sequences are real transcripts, while "sf" means that the noncoding sequences are shuffled from coding sequences; [c] which database is used; [d] how the results are processed; [e] the features used for training. Group: models are grouped to perform different comparisons in our research. "J" stands for J-models, which were used in joint prediction, and "R" stands for representative models.
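The "sf" and [d] attributes in the note above can be made concrete with a small sketch. It is only an illustration under stated assumptions: `shuffle_sequence` is a plain random permutation of the bases, which may differ from the shuffling procedure FEELnc actually applies, and `resolve_labels`, together with the label strings "lncRNA" and "coding", are hypothetical names that are not taken from RNAplonc.

```python
import random
from typing import Dict, Optional


def shuffle_sequence(seq: str, seed: int = 0) -> str:
    """Return a random permutation of seq (same base composition, new order).

    Assumption: a simple mononucleotide shuffle stands in for the "sf" idea
    of deriving noncoding-like sequences from coding ones.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def resolve_labels(preds: Dict[str, Optional[str]], mode: str) -> Dict[str, str]:
    """Handle transcripts without a label (None) in one of two ways.

    mode="cut":   drop transcripts that have no label.
    mode="guess": label those transcripts as lncRNA.
    """
    if mode == "cut":
        return {tid: lab for tid, lab in preds.items() if lab is not None}
    if mode == "guess":
        return {tid: (lab if lab is not None else "lncRNA")
                for tid, lab in preds.items()}
    raise ValueError("mode must be 'cut' or 'guess'")


# Hypothetical predictions: tx3 received no label from the classifier.
predictions = {"tx1": "coding", "tx2": "lncRNA", "tx3": None}

cut_result = resolve_labels(predictions, "cut")
guess_result = resolve_labels(predictions, "guess")
shuffled = shuffle_sequence("ATGGCCTTTTAA")
```

The two modes correspond to the RNAplonc_cut and RNAplonc_guess rows of Table 2: "cut" shrinks the result set, whereas "guess" keeps every transcript and assigns the missing ones to the lncRNA class.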