FishCODE

Transcriptome upstream processing script
Epigenomics upstream processing script

Species-corresponding genomes and annotations

Species Name	Genome Version	download url
Danio rerio	GRCz11	Download
Protopterus annectens	GCF_019279795.1	Download
Monopterus albus	GCF_001952655.1	Download
Gadus morhua	GCF_902167405.1	Download
Clupea harengus	GCF_900700415.2	Download
Salmo salar	GCF_000233375.1_ICSASG_V2	Download
Lates calcarifer	ASM164080v1	Download
Collichthys lucidus	GCA_004119915.2	Download
Amia calva	GCA_016984155.1	Download
Ictalurus punctatus	GCF_001660625.1_IpCoco_1.2	Download
Cyprinus carpio	GCF_018340385.1_ASM1834038v1	Download
Perca fluviatilis	GCF_010015445.1	Download
Pimephales promelas	GCF_016745375.1	Download
Ctenopharyngodon idella	GC_v1	Download
Xiphophorus hellerii	GCF_003331165.1	Download
Coregonus clupeaformis	GCF_020615455.1	Download
Larimichthys crocea	GGCF_000972845.2	Download
Siniperca chuatsi	GCF_020085105.1	Download
Oryzias latipes	GCF_002234675.1	Download
Astyanax mexicanus	GCF_023375975.1	Download
Triplophysa tibetana	GCA_008369825.1	Download
Oreochromis niloticus	GCA_001858045.2	Download
Oncorhynchus mykiss	GCF_013265735.2_USDA_OmykA_1.1	Download
Labeo rohita	GCA_004120215.1_ASM412021v1	Download
Hypophthalmichthys molitrix	CNP0000974	Download
Channa argus	GCA_004786185.1	Download
Lepisosteus oculatus	GCF_000242695.1	Download
Tetraodon nigroviridis	GCA_000180735.1	Download
Pangasianodon hypophthalmus	GCF_009078355.1_GENO_Phyp_1.0	Download
Puntigrus tetrazona	GCF_018831695.1	Download
Takifugu rubripes	GCF_901000725.2	Download
Gasterosteus aculeatus	GCF_016920845.1	Download
Nothobranchius fuzeri	GCF_027789165.1	Download
Megalobrama amblycephala	GCF_901000725.2	Download
Tachysurus fulvidraco	GCF_003724035.1_ASM372403v1	Download

Gene basic information

Gene basic information: a: Gene basic information, species, chromosome, chromosome position, original database id and description b: Sequence quick pull, select nucleic acid or protein will be presented in fasta format c: Gene compared with Swiss-prot vertebrate data id corresponding information d-e: GO/KEGG annotation information.

RNA information

RNA information: a: Visualization of the gene transcript node, click on the transcript to browse the substructure and download the corresponding sequence b: Visualization of the transcriptome analysis on the gene information page (the process of selecting the data set is similar to the "data selection" in the previous "Omics Online Analysis" " part) c: Statistical test analysis parameter setting d: Analysis result picture display, click "picture ggplot file" below to download the corresponding ggplot2 file, upload it to the Retuning Plots part of Visual Omics, and you can perform online high-freedom degree of picture editing.

Protein information

protein information: a: Prediction and visualization of protein domains corresponding to genes, below are predicted result files, visualization pictures and corresponding ggplot2 files. b: Protein subcellular localization c: Protein interaction information

SNP information

SNP information: a: SNP information associated with the FishCODE gene page, users can click on the SNP form row to jump to the detailed information page of the SNP, b: basic information of the SNP site c: SNP flanking sequence (optional distance) d: multiple genome positions Point comparison information e: SNP genome annotation information. f: Population type, size, sampling location and other information. g: Article information h: SNP visualization information display. Click the target variation to display the position information of the SNP and its adjacent sites on the genome, as well as the detailed information of the SNP site.

Macro related information

Macro related information: a: Bisulfite-Seq DMRs analyzes the associated gene information, and associates the gene with the experimental conditions of the data set; b: the gene obtained by RNA-Seq differential analysis, associates the gene with the experimental conditions of the data set.

Transcriptome upstream script introduction

The purpose of this part of the code is to help users pre-process their own transcriptome raw data on their own machine. The output of this script is the gene count and TPM file. These output results can be input by the user as an input file to the "upload own data" section of FishCODE's online transcriptome analysis.

Considering the large-scale computing power required for the pre-processing of omics raw data, this series of processing scripts currently only supports Linux or Linux-like systems. More local software supported by the system are in our development list, so please pay attention to our webpage. It is worth noting that although FishCODE focuses on the analysis of fish omics data, this series of scripts is a general omics data processing flow, and users can use this processing flow to process any omics annotations (at least genome and gff files) Species.

Code download

Version Name	Modify Time	Download	Breif Introdunction
Transcriptome_linux_1.1	25 June 2023	download	The original version, users can use this series of scripts to process omics raw data on the linux system

Getting started

Before you start

The pp.conf file records the absolute paths of the dependent software mentioned below. Before you start you must configure this file and replace the execution path of the software with the absolute path to the software on your server!

First: create database index for analysis

Dependent software: gffread, salmon 1.4.0, bedtools v2.26.0, mashmap

$ bash First_bulid_index.sh ./build_faidx_example/GCF_901000725.2_fTakRub1.2_genomic.fna ./build_faidx_example/GCF_901000725.2_fTakRub1.2_genomic.gff 30

Note: 1. Your genome and corresponding annotation files must be in the same folder. 2. All the following working directories default to the decompressed directory.
3. The first parameter is the genome location, the second parameter is the gff file location, and the third parameter is the number of threads

Second: quality control

Dependent software: NGSQCToolkit_v2.3.3 (IlluQC.pl)

$ bash filter.sh 792045 SRR17641338

Note: The first parameter is the name of the upper folder where fastq.gz is located, or the project name, and the second parameter is the prefix of fastq.gz. For the specific file structure, please refer to the compressed package. The user must create a project under the parent directory 00_data, and store the sequencing files to be processed under the project name. "./00_data/projectname/*fastq.gz"

Third: count

Dependent software: salmon 1.4.0

$ bash salmon.sh 792045 SRR17641338 ./build_faidx_example/GCF_901000725.2_fTakRub1.2_genomic.fna ./build_faidx_example/GCF_901000725.2_fTakRub1.2_genomic.gff

Note: The first parameter is the project name of the second step, the second parameter is the prefix of fastq.gz, the third parameter is the genome location, and the fourth parameter is the location of the gff file (note that the location of this genome and annotation file must Strictly the input genome and gff file path for the first step)

Others

1. The output result of the third step will be in 01_result/792045/SRR17641338_quant/ under the same directory

2. All directory structures and demonstration data examples are included in the compressed package, allowing users to try it easily.

3. In order to facilitate users to generate transcriptome data corresponding to the species included in FishCODE, we provide a download package of the genome and annotation files of FishCODE's 35 fish species in the "Help->Usage->gene information" section.

Epigenomics upstream script introduction

The purpose of this part of the code is to help users preprocess their own methylome (only referring to the "Bisulfite-Seq" library construction strategy) raw data on their own machines. The output of this script is a file of methylation ratios for the sites. Users can directly import these output results into the methlKit package as input files.

Code download

Version Name	Modify Time	Download	Breif Introdunction
Methylome_linux_1.1	25 June 2023	download	The original version, users can use this series of scripts to process omics raw data on the linux system

First: quality control

Dependent software: FASTX Toolkit 0.0.13

$ bash filter_fastx.sh 555065 SRR9697472

Note: The first parameter of the quality control script is the project name, and the second parameter is the prefix name of fastq.gz. But before you run the script, you need to replace "@SRR" on line 20 of "filter_fastx.sh" with the special flag string of your sequencing data. What are special identifiers? Take NCBI's original sequencing file as an example. The special string of SRRXXXX.fastq.gz is "@SRR". You can zless open the fastq.gz file to view the first line and replace the "NCBI's" in line 20 of filter_fastx.sh. @SRR".

Second: count

Dependent software: BSMAP 2.9, samtools 1.3.1, Python 2.7.15 [sys, time, os, array, optparse]

$ bash methy.sh 555065 SRR9697472 ./zebrafish/GCF_000002035.6_GRCz11_genomic.fna

Note: The first parameter of the quality control script is the project name, the second is the prefix name of fastq.gz, and the third parameter is the absolute path of the genome of the corresponding species.

Others

1. The output result of the third step will be in 02methyPos/01methratio/555065 under the same directory

2. All directory structures and demonstration data examples are included in the compressed package, allowing users to try it easily.

3. *methratio.txt is the final output result, where *methratio is the methylation status of all sites, and the final result is the filtered site file with coverage

4. Considering that methylation computing resources consume a lot, we recommend that your running server has at least 100G of RAM, 200G of storage space, and 10 CPU cores. (actually depends on your genome size)