Evaluation of lncRNA identification methods

Pipeline for building testing data

The software and models that were evaluated are summarized in the following tables.

Table 1: Software for lncRNA identification
Software packages	Input	Algorithm	Features	Online analysis	Binary/source
CPC	Sequence	SVM	ORF, consv	http://cpc.cbi.pku.edu.cn/programs/run_cpc.jsp	http://cpc.cbi.pku.edu.cn
CPC2	Sequence	SVM	Fickett, ORF, pI	http://cpc2.cbi.pku.edu.cn/	http://cpc2.cbi.pku.edu.cn/
CNCI	Sequence	SVM	MLCDS	NaN	http://www.bioinfo.org/software/cnci
CPAT	Sequence/(GM and R)	LR	ORF, Fickett, hexamers	http://lilab.research.bcm.edu/cpat/	https://sourceforge.net/projects/rna-cpat/files/
FEELnc	Sequence	RF	ORF; k-mer	NaN	https://github.com/tderrien/FEELnc
hmmscan	Sequence	Cut-off	SS	https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan	http://hmmer.org/download.html
longdist	Sequence	SVM	np of ORF; ORF	NaN	https://github.com/hugowschneider/longdist.py
PLEK	Sequence	SVM	k-mer	NaN	https://sourceforge.net/projects/plek/files/
PLncPRO	Sequence	RF	ORF; consv	NaN	http://ccbb.jnu.ac.in/plncpro
RNAplonc	Sequence	REPTree	k-mer; ORF	NaN	https://github.com/TatianneNegri/RNAplonc
COME	GM	BRF	GC%, conservation, SS	NaN	https://github.com/lulab/COME
iSeeRNA	GM	LR	ORF, di-mer, tri-mer, consv	http://sunlab.cpy.cuhk.edu.hk/iSeeRNA/webserver.html	https://sunlab.cpy.cuhk.edu.hk/iSeeRNA/download.html
lncRScan-SVM	GM	SVM	ORF, tri-mer, exon, consv	NaN	https://sourceforge.net/projects/lncrscansvm/files/
lncScore	Sequence and GM	LR	ORF, exon, MCSS	NaN	https://github.com/WGLab/lncScore

Input: GM (gene model, usually a GTF file), R (reference genome); Algorithm: SVM (support vector machine), LR (logistic regression), RF (random forest), REPTree (reduced error pruning tree), BRF (balanced random forest); Features: consv (sequence conservation), SS (secondary structure), np (nucleotide patterns), MCSS (maximum coding subsequence), MLCDS (most-like coding domain sequence), Fickett (Fickett TESTCODE score), pI (isoelectric point), socf (sequence-order correlation factors); Online analysis: URL of the web server, or NaN if the software has no web server.
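
Because Table 1 is tab-separated, it can be queried programmatically. The following is a minimal sketch, assuming a few rows of the table are pasted into a string; the helper names `load_table` and `local_only_tools` are hypothetical and are not part of any of the listed packages. It lists the tools that accept a given input type and have no web server (NaN in the "Online analysis" column).

```python
import csv
import io

# A few rows abridged from Table 1 (tab-separated); more rows can be pasted the same way.
TABLE1 = (
    "Software packages\tInput\tAlgorithm\tFeatures\tOnline analysis\tBinary/source\n"
    "CPC2\tSequence\tSVM\tFickett, ORF, pI\thttp://cpc2.cbi.pku.edu.cn/\thttp://cpc2.cbi.pku.edu.cn/\n"
    "CNCI\tSequence\tSVM\tMLCDS\tNaN\thttp://www.bioinfo.org/software/cnci\n"
    "hmmscan\tSequence\tCut-off\tSS\thttps://www.ebi.ac.uk/Tools/hmmer/search/hmmscan\thttp://hmmer.org/download.html\n"
    "PLEK\tSequence\tSVM\tk-mer\tNaN\thttps://sourceforge.net/projects/plek/files/\n"
)


def load_table(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def local_only_tools(rows, input_type="Sequence"):
    """Return names of tools with the given input type and no web server (NaN)."""
    return [
        row["Software packages"]
        for row in rows
        if row["Input"] == input_type and row["Online analysis"] == "NaN"
    ]


rows = load_table(TABLE1)
print(local_only_tools(rows))
```

Tools returned by this query must be downloaded from the "Binary/source" link and run locally.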

Table 2: Models for lncRNA identification
Name of model	Software	Attribute of model	Group
CPC	CPC	-	J & R
CPC2	CPC2	-	J & R
CNCI_ve	CNCI	vertebrate [a]	J & R
CNCI_pl	CNCI	plant [a]	-
CPAT_human	CPAT	human [a]	J & R
CPAT_mouse	CPAT	mouse [a]	R
CPAT_zebrafish	CPAT	zebrafish [a]	-
CPAT_fly	CPAT	fruit fly [a]	-
FEELnc_hm_cl	FEELnc	human; cl [b]	-
FEELnc_hm_sf	FEELnc	human; sf [b]	-
FEELnc_ms_cl	FEELnc	mouse; cl [b]	-
FEELnc_ms_sf	FEELnc	mouse; sf [b]	-
FEELnc_zf_cl	FEELnc	zebrafish; cl [b]	-
FEELnc_zf_sf	FEELnc	zebrafish; sf [b]	-
FEELnc_ff_cl	FEELnc	fruit fly; cl [b]	-
FEELnc_ff_sf	FEELnc	fruit fly; sf [b]	-
FEELnc_wm_cl	FEELnc	worm; cl [b]	-
FEELnc_wm_sf	FEELnc	worm; sf [b]	-
FEELnc_ab_cl	FEELnc	arabidopsis; cl [b]	-
FEELnc_ab_sf	FEELnc	arabidopsis; sf [b]	-
FEELnc_all_cl	FEELnc	combined six species data; cl [b]	J & R
FEELnc_all_sf	FEELnc	combined six species data; sf [b]	R
hmmscan_A	hmmscan	Pfam-A [c]	J & R
hmmscan_B	hmmscan	Pfam-B [c]	-
hmmscan_both	hmmscan	Pfam-A and Pfam-B [c]	-
longdist_GRCh37	longdist	human37 [a]	-
longdist_GRCh37_GRCm38	longdist	human37_mouse [a]	-
longdist_GRCh38	longdist	human38 [a]	-
longdist_GRCh38_GRCm38	longdist	human38_mouse [a]	-
longdist_GRCm38	longdist	mouse [a]	-
longdist_GRCm38_GRCz10	longdist	mouse_zebrafish [a]	-
PLEK	PLEK	-	J & R
PLncPRO_mono	PLncPRO	monocots [a]	-
PLncPRO_dico	PLncPRO	dicots [a]	J & R
RNAplonc_cut	RNAplonc	remove results missing label [d]	-
RNAplonc_guess	RNAplonc	label the missing-label as lncRNA [d]	J & R
COME_seq	COME *	multiple sequence-derived features only [e]	-
COME_all	COME *	sequence-derived features, expression features and histone features [e]	R
iSeeRNA	iSeeRNA *	-	R
lncRScan-SVM	lncRScan-SVM *	-	R
lncScore	lncScore *	-	R

Software marked with "*" works only with a limited set of species. Attribute of model: the key attribute that distinguishes the models of one software package. [a] the species of the training data; [b] the species of the training data and the way the training data is used: "cl" means that both coding and noncoding sequences are real transcripts, whereas "sf" means that the noncoding sequences are shuffled from coding sequences; [c] the database that is used; [d] the way the results are processed; [e] the features used for training. Group: models are grouped to perform different comparisons in our research. "J" stands for J-models, which were used in joint prediction, and "R" stands for representative models.
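
The Group column and the two RNAplonc result-processing options ([d]) can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names, the label strings "lncRNA" and "coding", and the transcript identifiers are hypothetical and do not reflect the actual output format of RNAplonc or the scripts used in this evaluation.

```python
# Group cells as they appear in Table 2 ("-" means the model belongs to neither group).
MODEL_GROUPS = {
    "CPC2": "J & R",
    "CPAT_mouse": "R",
    "CPAT_fly": "-",
    "RNAplonc_guess": "J & R",
    "COME_all": "R",
}


def parse_group(cell):
    """Turn a Group cell such as 'J & R' into a set of group letters."""
    return {part.strip() for part in cell.split("&") if part.strip() in {"J", "R"}}


def models_in(group, table=MODEL_GROUPS):
    """List the models that belong to the given group ('J' or 'R')."""
    return [name for name, cell in table.items() if group in parse_group(cell)]


def resolve_missing(predictions, strategy):
    """Handle transcripts without a label (None), mirroring the two options in [d].

    'cut'   removes transcripts that have no label;
    'guess' labels them as lncRNA.
    The label strings are assumptions made for this sketch.
    """
    if strategy == "cut":
        return {tid: lab for tid, lab in predictions.items() if lab is not None}
    if strategy == "guess":
        return {tid: ("lncRNA" if lab is None else lab) for tid, lab in predictions.items()}
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical predictions: tx2 received no label.
preds = {"tx1": "coding", "tx2": None, "tx3": "lncRNA"}
print(models_in("J"))
print(resolve_missing(preds, "cut"))
print(resolve_missing(preds, "guess"))
```

With "cut", the number of evaluated transcripts shrinks; with "guess", every input transcript keeps a prediction.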