Help for loading data into WebArrayDB
1 Introduction
The data to be inputted into databases are organized in projects. The information for a project falls into 5 categories:
| (1) project | General information about the project. |
| (2) array | Detailed information about each array involved. |
| (3) sample | Information about new samples. |
| (4) platform | Information about new microarray platforms. |
| (5) protocol | Information about new protocols for array/sample manipulation. |
Each project, array, sample, platform or protocol has a record in the database. Logically, arrays, samples, platforms and protocols are associated with (or belong to) the project in which they were defined, but they can be referred to in other projects too. A typical example: an array can use a protocol defined in another project for hybridization.
WebArrayDB offers two interfaces for inputting data into its databases: a web interface and a command console.
2 Web interface
The web interface can be used by users from collaborating groups or by internet users. With the web interface, users can
- create new projects.
- complement existing projects.
- browse/search records
- view details of a record, and even edit it if the record was inputted by the same user.
3 Command console
A small shell was written for WebArrayDB to facilitate data access. It can be used not only for inputting data, but also for searching data. The command console is simple but flexible and powerful since it supports SQL syntax. It is ideal for experienced users.
A few commands were implemented. Commands like “fillwith” and “fillbatch” are designed to load data into WebArrayDB; “display” and “saveto” display and save search results. A limited set of SQL commands, including "SELECT" and "SHOW", can also be used to find data of interest in the database. The command “help” gives a complete list.
Enter the shell:
$ WebArrayDB
WebArrayDB−>
To load data into WebArrayDB, a file containing the necessary information is mandatory. This file is called the project file (the name “input.txt” is used for this file in the following discussion). To load data for one project, “fillwith” can be used with a compact “input.txt”:
WebArrayDB−> fillwith input.txt;
To load data for a batch of projects, the input.txt should contain more forms:
WebArrayDB−> fillbatch input.txt;
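SQL query commands are used to obtain data from WebArrayDB. For example, to get the array ID, foreground, background and flag values for the sample named 'Cell line genomic DNA', ordered by probe ID: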
WebArrayDB−> SELECT
          −>     it.array_id, fg, bg, flag
          −> FROM
          −>     intensity AS it, sample, sampxref
          −> WHERE
          −>     sample.name = 'Cell line genomic DNA'
          −>     AND sampxref.sample_id = sample.id
          −>     AND sampxref.array_id = it.array_id
          −> ORDER BY
          −>     it.probe_id;
Then the following commands display and save results in the file ’results.txt’:
WebArrayDB−> display;
WebArrayDB−> saveto results.txt;
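The join performed by the SELECT query above can be tried outside WebArrayDB. The following is a minimal sketch using Python's built-in sqlite3 module. The table layout is an assumption inferred only from the column names that appear in the query (the actual WebArrayDB schema may differ), and the rows are made up purely for illustration.

```python
import sqlite3

# Assumed minimal schema: only the columns named in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sampxref  (sample_id INTEGER, array_id INTEGER);
    CREATE TABLE intensity (array_id INTEGER, probe_id INTEGER,
                            fg REAL, bg REAL, flag INTEGER);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO sample VALUES (1, 'Cell line genomic DNA')")
conn.execute("INSERT INTO sample VALUES (2, 'Other sample')")
conn.executemany("INSERT INTO sampxref VALUES (?, ?)", [(1, 10), (2, 20)])
conn.executemany(
    "INSERT INTO intensity VALUES (?, ?, ?, ?, ?)",
    [(10, 2, 850.0, 40.0, 0), (10, 1, 1200.0, 35.0, 0), (20, 1, 500.0, 30.0, 1)],
)

# The same join as in the console example: sample -> sampxref -> intensity.
QUERY = """
    SELECT it.array_id, fg, bg, flag
    FROM intensity AS it, sample, sampxref
    WHERE sample.name = 'Cell line genomic DNA'
      AND sampxref.sample_id = sample.id
      AND sampxref.array_id = it.array_id
    ORDER BY it.probe_id
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Only the intensity rows of arrays linked to the named sample through sampxref are returned, sorted by probe ID.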
4 File types
Besides the project file (“input.txt”), intensity files should be supplied for new array data, while image files are optional. Probe files containing platform information are necessary for platforms new to WebArrayDB. A protocol file is optional when defining a new protocol.
In general, these files fall into 6 categories:
5 Image file
Optional file.
This file is the raw image file produced by microarray scanning. Usually it is in “TIFF” format for a two-color array or in the “.DAT” file format for an Affymetrix GeneChip. As a storage tool, WebArrayDB attempts to save all information related to a microarray experiment, including microarray image files. Raw image files can be searched in the database.
6 Intensity file
Required file for array data.
Intensity files contain the microarray hybridization intensity data, which is required for any array to be loaded into WebArrayDB. Different image processing software packages produce intensity files in their own specific formats. WebArrayDB can read intensity files from Affymetrix, Agilent, ArrayVision, Genepix, QuantArray, ImaGene (* the ImaGene format may not function in the web interface), SMD, SPOT or in a user-defined format. Note that the format name and the value used in WebArrayDB may differ. The following table lists the format names and corresponding values.
| Format name | Value in WebArrayDB |
| Affymetrix | CEL |
| Agilent | agilent |
| ArrayVision | arrayvision |
| BlueFuse | bluefuse |
| Genepix | genepix |
| GenePix Custom | genepix.custom |
| GenePix Median | genepix.median |
| ImaGene | imagene |
| QuantArray | quantarray |
| Scanarray Express | scanarrayexpress |
| SMD | smd |
| SMD Old | smd.old |
| SPOT | spot |
| user-defined | user.defined |
For Affymetrix, the intensity file means the “.CEL” file. All these file formats have been explained in WebArray, but the user-defined format differs slightly from that in WebArray; see below.
6.1 user-defined format
The user-defined format allows experiment data with multiple dyes (channels). For each channel, the foreground intensity value is required; the background value and the flag value are optional. For the i-th channel, the column names should be ch<i>.Intensity, ch<i>.Background and ch<i>.Flag (e.g. ch1.Intensity for the first channel). See the layout below:
| ID | ch1.Intensity+ | [ch1.Background] | [ch1.Flag] | [ch2.Intensity | [ch2.Background] | [ch2.Flag]] | [ch3.Intensity | [ch3.Background] | [ch3.Flag]] | ... |
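The column layout above can be checked mechanically. The following is a minimal sketch, not WebArrayDB's own code: the function name `count_channels` is hypothetical, and the requirement that channel numbers run consecutively from 1 is an assumption made for this example. Column names are matched case-insensitively, as this section describes for WebArrayDB.

```python
import re

# Matches "ch<i>.Intensity", "ch<i>.Background" or "ch<i>.Flag", ignoring case.
CHANNEL_COLUMN = re.compile(r"ch(\d+)\.(intensity|background|flag)", re.IGNORECASE)

def count_channels(column_names):
    """Return the number of channels described by a user-defined header.

    Raises ValueError if a channel lacks its required Intensity column or
    (assumption) if channel numbers are not 1, 2, 3, ... without gaps.
    Other columns, such as the optional "ID", are ignored.
    """
    fields_by_channel = {}
    for name in column_names:
        match = CHANNEL_COLUMN.fullmatch(name.strip())
        if match:
            channel = int(match.group(1))
            fields_by_channel.setdefault(channel, set()).add(match.group(2).lower())
    if not fields_by_channel:
        raise ValueError("no ch<i>.Intensity column found")
    channels = sorted(fields_by_channel)
    if channels != list(range(1, len(channels) + 1)):
        raise ValueError("channel numbers must run 1, 2, 3, ...")
    for channel in channels:
        if "intensity" not in fields_by_channel[channel]:
            raise ValueError(f"ch{channel}.Intensity is required")
    return len(channels)

print(count_channels(["ID", "ch1.Intensity", "ch1.Background", "CH2.intEnsiTY"]))
```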
The differences of the user-defined format between WebArrayDB and WebArray are:
- WebArrayDB supports 1, 2 or more channels while WebArray only recognizes two-color data;
- in WebArray, the “ID” column is a prerequisite and must be exactly the same as that in the Gene List File. In contrast, “ID” is optional in WebArrayDB, so the simplest intensity file can have just one column: “ch1.Intensity”.
If the “ID” column is present, it will be used to adjust the order of rows in the intensity data file to that of “unique_id” in the probe file, but only when all the following criteria are satisfied:
- it contains the same items as the “unique_id” (or “ID”) column in the probe file, although the order of the ID items can be different.
- there are no replicates of any ID item.
The column names are case-insensitive in WebArrayDB, e.g. “CH2.intEnsiTY” will be recognized as “ch2.Intensity”.
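The reordering rule for the “ID” column described above can be sketched as follows. This is an illustration under stated assumptions rather than WebArrayDB's implementation: the function name and the list-of-dicts data layout are hypothetical, and leaving the rows in their original order when the criteria fail is an assumption.

```python
def reorder_by_probe_ids(intensity_rows, probe_ids):
    """Reorder intensity rows to follow the probe file's ID order.

    intensity_rows: list of dicts, each with an "ID" key.
    probe_ids: list of "unique_id" values in probe-file order.
    Returns the reordered list, or the rows unchanged (assumption)
    when the criteria are not met.
    """
    ids = [row["ID"] for row in intensity_rows]
    has_replicates = len(set(ids)) != len(ids)
    same_items = sorted(ids) == sorted(probe_ids)
    if has_replicates or not same_items:
        return intensity_rows
    row_by_id = {row["ID"]: row for row in intensity_rows}
    return [row_by_id[probe_id] for probe_id in probe_ids]

rows = [{"ID": "p2", "ch1.Intensity": 5.0}, {"ID": "p1", "ch1.Intensity": 9.0}]
print(reorder_by_probe_ids(rows, ["p1", "p2"]))
```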
WebArrayDB supports “ratio” and “log-ratio” values as well as “intensity” values. Sometimes the output data from arrays are ratios of foreground intensity values to background intensity values, or such ratios after a base-2 logarithmic transformation. In such cases, the values in the data_type column in the [array] section are “ratio” and “log-ratio” respectively.
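As a small worked example of the two derived data types, assuming the ratio is taken exactly as described above (foreground over background) and the function names are hypothetical:

```python
import math

def to_ratio(foreground, background):
    # "ratio": foreground intensity divided by background intensity.
    return foreground / background

def to_log_ratio(foreground, background):
    # "log-ratio": the same ratio after a base-2 logarithm.
    return math.log2(foreground / background)

print(to_ratio(800.0, 200.0), to_log_ratio(800.0, 200.0))
```

A foreground of 800 over a background of 200 gives a ratio of 4 and a log-ratio of 2.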
Files for ratio values in user-defined format may have these columns:
Files for log-ratio values in user-defined format may have these columns:
7 Protocol file
Optional file.
A protocol file presents the details of an experimental protocol. Portable formats such as ASCII text, PDF, PS or RTF are strongly recommended, although other formats, such as MS Word, are acceptable as well.
8 Probe file
Required file for defining a NEW microarray platform.
Probe files are used to define a new microarray platform. Only the column “unique_id” (“ID” can be used as a replacement) must be provided in this file; all others are optional.
For the purpose of cross-platform alignment, users may consider including some columns for mapping to other platforms. Keep in mind that you need to define the column names as reference IDs before using those columns to map arrays (see the reference ID section for more details). In addition to the reference IDs, “unique_id” and “gene_symbol” can be used for cross-platform alignment too.
For Affymetrix GeneChips, their annotation files (downloadable from Affymetrix’s website) can be directly used as the probe files. For other array platforms, this may be a modified version of a gene list file (sometimes called GAL file). In some gene list files, there is a column “block” instead of “block_row” and “block_col”. For convenience of users, a column “block” can be offered in probe files. WebArrayDB will try to guess a reasonable “block_row” and “block_col” from “block” in case that only the “block” column is available. If none of the three columns is offered, both “block_row” and “block_col” will be set to 1.
| idx | block_row | block_col | row | col | unique_id or ID+ | gene_symbol | gene_title | chromosome | probe_start | probe_end | probe_strand | probe_sequence | bioseq_type | probe_purpose | designation | gene_start | gene_end | gene_strand | cpg_dist | user_attr1 | user_attr2 | ... |
- idx:
- This is the index or order number of all probes in the platform. If not provided, WebArrayDB will fill it automatically with the numbers from 1 to the number of probes, in order.
- block_row & block_col:
- For spotted arrays, each pin makes a block in a contiguous area. These two columns record the row and column location of each block. If the platform has no blocks (e.g. Affymetrix GeneChip), both values will be set to 1.
- row & col:
- The row and column location of a probe in the block, or in the whole microarray if there are no blocks. For Affymetrix GeneChips, “row” is set to 1 and “col” uses the same value as “idx”.
- unique_id or ID+:
- This is a name string given to a probe in a platform. Ideally, all the “unique_id” items are unique across a whole platform, but this is not mandatory since some users would like replicate probes to share the same “unique_id”.
- gene_symbol:
- A gene symbol, when one is available (from UniGene or other databases). Gene symbol is “A unique abbreviation of a gene name consisting of italicized uppercase Latin letters and Arabic numbers formally assigned by the HUGO Gene Nomenclature Committee after a gene has been identified (Note: a putative gene may be referred to by its locus name prior to its identification)” - copied from http://ghr.nlm.nih.gov/glossary=genesymbol.
In Affymetrix annotation files, gene symbols are derived by different organizations for different species. Affymetrix data comes from the UniGene record for UniGene-based arrays such as human, mouse, and rat. For arrays that are not based on the UniGene database, Affymetrix obtains the gene symbol from various sources including FlyBase, WormBase, and the Saccharomyces Genome Database.
- gene_title:
- Title of Gene represented by the probe. The gene title (or name) is usually extracted from the Gene or UniGene databases. In some cases, specialty databases (such as WormBase, etc.) may provide the gene title/name.
- chromosome:
- The chromosome from which the probe sequence comes
- probe_start:
- start position at the chromosome of the probe sequence
- probe_end:
- end position at the chromosome of the probe sequence
- probe_strand:
- The DNA strand from which the probe comes.
- probe_sequence:
- nucleic acid sequence of the probe
- bioseq_type:
- Indicates whether the sequence is an Exemplar, Consensus or Control sequence. An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence is a nucleotide sequence assembled by Affymetrix, based on one or more sequences taken from a public database.
- probe_purpose:
- The purpose of the probe. Use value “normal”, “control” or something else. “control” probes can be used as controls in within-array normalization method “composite” and “control”.
- designation:
- The designation features of the probe, e.g. how the mismatched bases are arranged in relation to location and number.
- gene_start:
- start position at the chromosome of the gene sequence
- gene_end:
- end position at the chromosome of the gene sequence
- gene_strand:
- The DNA strand from which the gene comes.
- cpg_dist:
- The minimal distance from the probe sequence to CpG island.
- user_attr1:
- Users can use this column to define a feature for their private purposes.
- user_attr2:
- Users can use this column to define another feature for their private purposes.
8.1 Reference IDs
Reference IDs refer to the IDs that can be used to map probes from different microarray platforms. The IDs should be included in the probe file and listed in columns whose column names should be claimed as "reference IDs" in advance.
We have already included the following IDs from biological databases as reference IDs. You may browse them or add more using the online function “Browse/Add external databases as reference” in the “Database Management” interface of WebArrayDB.
- AGI:
- AGI ID, a uniform gene nomenclature system for Arabidopsis created by the Arabidopsis Genome Initiative (AGI). AGI is an international effort to sequence the complete Arabidopsis genome. AGI IDs have the following format (e.g. At1g00010): “At” = organism; 1, 2, 3, 4 or 5 = chromosome; “g” = gene; 00010 = gene ID.
- Affymetrix:
- Affymetrix Probe Set ID.
- EC:
- Enzyme Commission number (EC number).
- Ensembl:
- EnsEMBL ID, a transcript identifier from the ENSEMBL project.
- Entrez:
- IDs and symbols are extracted from Entrez Gene.
- FlyBase:
- A locus name from FlyBase, a database of the Drosophila genome.
- GO:
- Gene Ontology term.
- InterPro:
- InterPro ID. InterPro is a database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences.
- MGI:
- MGI ID, a locus identifier from the Mouse Genome Informatics (MGI) database.
- OMIM:
- ID in OMIM, Online Mendelian Inheritance in Man. This gives a link to the gene's description in Online Mendelian Inheritance in Man, a hand-curated database of disease and genetic disorders, biomedical and biochemical information, and phenotypes associated with known human genes. OMIM indexes give the NetAffx user access to detailed descriptions of biomedical research associated with their genes of interest. Only available for probe sets of human genes.
- Pathway:
- Displays the GenMAPP pathway if the transcript has been found to play a role in a proteome functional pathway in the GenMAPP collection.
- QTL:
- (Quantitative Trait Loci) Genetic linkage data that provide disease associations for some loci. This data comes from RatMap at the Rat Genome Database; so these annotations only appear on Rat arrays.
- RGD:
- RGD ID, a locus from the Rat Genome Database (RGD).
- RefSeq Protein ID:
- ID of the protein sequence in the NCBI RefSeq database.
- RefSeq Transcript ID:
- References to multiple sequences in RefSeq.
- Representative Public ID:
- The accession number of the representative sequence on which the probe set is based. For UniGene based arrays, this is usually a GenBank, dbEST or RefSeq accession used for sequence selection. Refer to the Sequence Source field under the Sequence section to determine the database used.
- SGD:
- SGD accession number, a locus from the Saccharomyces Genome Database.
- SwissProt:
- SWISS-PROT (sometimes known as SWALL) accession numbers of the peptide sequences corresponding to the mRNA’s in the UniGene cluster represented by the probe set.
- Trans Membrane:
-
- UniGene:
- UniGene ID - The UniGene collection of sequences.
- WormBase:
- A locus name from Wormbase, a database of the genome and biology of C. elegans.
- XDB:
- Xenopus Gene Database provides mappings between XGD IDs and Affymetrix probe set IDs.
9 Project file
This section explains the contents of a project file for a single project.
The project file is required when operating on the “command console”, or when using the online function “create a new project (with online forms)”.
If the user adds or updates a project by online functions “create a new project (with online forms)” or “add more data to an existing project”, tables corresponding to sections of the project file will be automatically produced by WebArrayDB.
Whether it is required or not, the annotations for fields/columns in the following subsections are helpful for understanding the data content in WebArrayDB.
*Note:
- input.txt is a text file that consists of many sections.
- Sections are separated by one or more blank lines.
- Each section has a title line and a data table.
- The title is enclosed in brackets, and this line is followed by a table with column names and data. Columns are separated by the TAB character.
- fields with “+” as superscript must be provided a value in each row (record).
- fields with “/” as superscript might be a single value or multiple values joint by “ /// ”.
- blank line is not allowed before or within the table.
- Within a cell of a table, “\\”, “\t” and “\n” are used for the backslash, TAB and newline characters respectively. “ /// ” is used to separate items within a cell.
- The column keyword - a keyword is a character string that will be used for searching data. Multiple keywords should be separated by “;”; any leading or trailing blank characters will be removed.
- The column user_added_cols is used to define user-added features. It consists of name-value pairs separated by “;”. A name-value pair looks like
“AnUserColName = Value for this record”,
An alternative way to define a new feature is offered in batch loading mode - the mode for loading data by a project-definition file. In the project-definition file, the user is allowed to add new columns that are not listed in the sections below, provided that the names of those columns start with the token “[user_added]”, which is not considered part of the column name and will be removed before saving into the database.
Users are allowed to define three types of features: “string”, “integer” and “float”. When a user-added feature is used for the first time, WebArrayDB will determine its type from the first occurrence of its value - a character string. There are two ways to tell a value’s type - implicit or explicit.
The implicit way for determination of value type:
- it is an integer string if the value is an optional “+” or “-” followed by a string that consists of only digits, e.g. 3837, +54, -734;
- it is a float string if the value is an optional “+” or “-” followed by a string that consists of digits and a dot “.”, e.g. 23.4, .23, 45., -1.4, +.35;
- it is a float string if the value is an integer string or a float string followed by “E” or “e” and an integer string, e.g. 3e5, 2.4E-3;
- it is a character string if it is an integer string or a float string enclosed by a pair of quotation marks(’ or "), the quotation marks will be removed;
- it is a character string if it is neither an integer string nor a float string.
The explicit way will use a prefix for definition, “[int]” or “[integer]” for integer, “[float]” for float and “[str]” or “[string]” for character string. The prefix is case-insensitive. A value is:
- an integer string if the value is “[int]” or “[integer]” followed by an integer string defined above;
- a float string if the value is “[float]” followed by a float string defined above;
- a character string if the value starts with “[str]” or “[string]”.
Any case that doesn’t meet the above criteria will be treated as a character string. An integer string will be converted to an integer in the database, a float string will be converted to a double-precision float in the database, and a character string will be saved as a string in the database.
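The implicit and explicit typing rules above can be sketched as a single function. This is an illustration, not WebArrayDB's code: the name `infer_value` is hypothetical, and two points the rules leave open are treated as assumptions and marked in the comments (whether “[float]” accepts an integer string, and whether the “[str]” prefix is stripped before storage).

```python
import re

INT_BODY = r"[+-]?\d+"
FLOAT_BODY = r"[+-]?(?:\d+\.\d*|\.\d+)"
INT_RE = re.compile(INT_BODY)
# A float string: digits with a dot, or an integer/float string plus an exponent.
FLOAT_RE = re.compile(rf"(?:{FLOAT_BODY}|(?:{FLOAT_BODY}|{INT_BODY})[eE]{INT_BODY})")
PREFIX_RE = re.compile(r"\[(int|integer|float|str|string)\]", re.IGNORECASE)

def infer_value(raw):
    """Return (type_name, value) for the first occurrence of a user-added feature."""
    prefix = PREFIX_RE.match(raw)
    if prefix:  # explicit way
        kind = prefix.group(1).lower()
        body = raw[prefix.end():]
        if kind in ("int", "integer") and INT_RE.fullmatch(body):
            return "integer", int(body)
        # Assumption: "[float]" also accepts an integer string such as "[float]3".
        if kind == "float" and (FLOAT_RE.fullmatch(body) or INT_RE.fullmatch(body)):
            return "float", float(body)
        if kind in ("str", "string"):
            # Assumption: the prefix itself is not stored.
            return "string", body
        return "string", raw  # prefix present but criteria not met
    # implicit way
    if INT_RE.fullmatch(raw):
        return "integer", int(raw)
    if FLOAT_RE.fullmatch(raw):
        return "float", float(raw)
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        inner = raw[1:-1]
        if INT_RE.fullmatch(inner) or FLOAT_RE.fullmatch(inner):
            return "string", inner  # quoted number: the quotes are removed
    return "string", raw

for value in ["3837", "+.35", "2.4E-3", "'42'", "[INT]-734", "liver"]:
    print(value, infer_value(value))
```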
* IMPORTANT: Since the semicolon “;” is used to separate keywords or user-defined features, there is no way to include the semicolon character in a keyword or in the value for a user-defined column!
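The layout rules in the note above (bracketed title line, TAB-separated table, blank lines between sections, escapes inside cells) can be sketched as a small reader. This is a hedged illustration, not the loader used by “fillwith”: the function names are hypothetical, and the sketch assumes that column names in the file carry no “+” or “/” markers.

```python
import re

# Two-character escape sequences allowed inside a cell and what they stand for.
ESCAPES = {"\\\\": "\\", "\\t": "\t", "\\n": "\n"}

def decode_cell(cell):
    # Replace each escape in a single pass so that an escaped backslash
    # followed by "t" or "n" is not decoded twice.
    return re.sub(r"\\[\\tn]", lambda m: ESCAPES[m.group(0)], cell)

def split_items(cell):
    # Multi-valued cells join their items with " /// ".
    return cell.split(" /// ")

def parse_project_file(text):
    """Return {section title: list of row dicts} for a project file."""
    sections = {}
    # Sections are separated by one or more blank lines.
    for block in re.split(r"\n(?:[ \t]*\n)+", text.strip("\n")):
        lines = block.split("\n")
        title = lines[0].strip()
        if not (title.startswith("[") and title.endswith("]")) or len(lines) < 2:
            raise ValueError(f"malformed section starting with {title!r}")
        header = lines[1].split("\t")
        sections[title[1:-1]] = [
            dict(zip(header, map(decode_cell, line.split("\t"))))
            for line in lines[2:]
        ]
    return sections

example = "[sample]\nname\ttissue\nS1\tliver\nS2\tkidney\n"
print(parse_project_file(example))
```

Each section becomes a list of dictionaries keyed by column name, so a multi-valued cell can then be passed to `split_items`.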
9.1 [project]
| name | factors | tissue | design | QC | description | authors | journal | publish_year | pubmed_id | data_link | user_name | release_date | related_files/ | keyword | user_added_cols |
- name:
- Project name, should be unique among those from a single user.
- factors:
- Factors whose effects on samples the project aims to investigate.
- tissue:
- Sample tissue used in the project.
- design:
- A description of experiment design.
- QC:
- Measures or steps taken for quality control.
- description:
- A comprehensive description of the project.
- authors:
- people who carried out this project.
- journal:
- The journal in which the result of this project was published.
- publish_year:
- The year when the result of the project was published.
- pubmed_id:
- The pubmed ID of published papers on medline.
- data_link:
- Web URL links for published papers.
- user_name:
- The user’s ID who defines this project.
- release_date:
- The date after which the project will be open for public access.
- related_files:
- This column lists names of annotation files for the project
- keyword:
- see note.
- user_added_cols:
- see note.
This table contains project information, one line per project. Usually it should be present if the user is inputting array data (hybridization data).
9.2 [sample]
For generic samples
| name+ | organism | tissue | individual_id | gender | age | description | keyword | user_added_cols |
- name+:
- Name of the sample is needed and should be unique among all samples defined by the same user.
- organism:
- The organism name from which the sample was taken.
- tissue:
- The tissue name from which the sample was taken.
- individual_id:
- The identifying ID/number/name of the individual from which the sample was taken.
- gender:
- Gender of the sample individual. Valid values are “male”, “female” or “NA”.
- age:
- Age of the sample individual. It is an integer.
- description:
- A comprehensive description of the sample.
- keyword:
- see note.
- user_added_cols:
- see note.
This table is needed if samples are used that are not yet available in WebArrayDB. Each sample should have one line of data. Samples might be defined in the [array] section too.
9.3 [protocol]
| name+ | category | description | keyword | user_added_cols |
- name+:
- Name of the protocol is needed and should be unique among all protocols defined by the same user.
- category:
- A value from (“process”, “technique”, “label”, “hybridization”, “image”, “data”) for “sample growth/treatment/separation”, “DNA/RNA/protein extraction/purification”, “DNA/RNA/protein labeling”, “hybridization and washing method”, “scanning method” and “image quantification method” respectively
- description:
- A comprehensive description of the protocol
- keyword:
- see note.
- user_added_cols:
- see note.
This table should be present only if there are new protocols introduced. Protocols might be defined in the [array] section too.
9.4 [platform]
| name+ | probe_file+ | category+ | probe_num+ | replicate+ | space+ | manufacturer | organism | description | availability | keyword | user_added_cols |
- name+:
- Name of the platform is needed and should be unique among all platforms in the same database.
- probe_file+:
- See probe file.
- category+:
- a value from (“antibody”,“in situ oligonucleotide”,“MPSS”,“MS”,“oligonucleotide beads”,“other”,“RT-PCR”,“SAGE NlaIII”,“SAGE Sau3A”,“spotted DNA/cDNA”,“spotted oligonucleotide”,“spotted protein”).
- probe_num+:
- Number of probes in the platform.
- replicate+:
- Number of replicates of each probe.
- space+:
- Space between replicate spots, or the difference of replicate spots’ sites on array. e.g. if two replicate spots are adjacent on array, its space is 1.
- manufacturer:
- The manufacturer who produces this platform.
- organism:
- The organism for which the platform was designed.
- description:
- A comprehensive description of the platform.
- availability:
- A value from (“public”,“private”). The default is “public”, means that this platform is open for public use even if the project containing it is not opened yet.
- keyword:
- see note.
- user_added_cols:
- see note.
This table provides platform information.
9.5 [array]
| identifier | platform+ | channel_num | intensity_file+/ | intensity_format+ | hyb_date | protocol_hyb | protocol_image | protocol_data | data_type | description | sample(_chN)+ | organism(_chN) | tissue(_chN) | individual_id(_chN) | gender(_chN) | age(_chN) | description(_chN) | dye(_chN) | protocol_process(_chN) | protocol_tech(_chN) | protocol_label(_chN) | exp_factor(_chN) | image_file(_chN)/ | image_format(_chN) | keyword | user_added_cols |
- identifier:
- Should be unique among all arrays defined by the same user, for example, barcodes of slides.
- platform+:
- Platform name or ID.
- channel_num:
- 1 for single channel arrays, 2 for two-color arrays, etc.
- intensity_file+/:
- The file name of the intensity data. For ImaGene data, two-color arrays have two data files, and their names should be separated by “ /// ”.
- intensity_format+:
- The formats of data file are described in Intensity file
- hyb_date:
- The date on which the array was hybridized.
- protocol_hyb:
- protocol name or ID for hybridization and washing.
- protocol_image:
- protocol name or ID for image scanning.
- protocol_data:
- protocol name or ID for image quantification.
- data_type:
- a value from (“intensity”, “ratio”, “log-ratio”).
- description:
- A comprehensive description of the array.
- sample(_chN)+:
- Sample name for channel N. organism(_chN), tissue(_chN), individual_id(_chN), gender(_chN), age(_chN) and description(_chN) correspond to the fields in the [sample] section; they might have different names and numbers.
- dye(_chN):
- The dye used for channel N, a value from “Cy3”, “Cy5”, “biotin”, “sybr green”, “syto 61”, or other dyes added by users.
- protocol_process(_chN):
- protocol name or ID for sample growth/treatment/separation.
- protocol_tech(_chN):
- protocol name or ID for DNA/RNA/protein extraction/purification.
- protocol_label(_chN):
- protocol name or ID for DNA/RNA/protein sample labelling.
- exp_factor(_chN):
- The factor that is going to be investigated through this sample.
- image_file(_chN)/:
- The file name for scanned image
- image_format(_chN):
- The format of image file. Typically, it is one of “TIF”, “BMP”, “GIF”, “JPG”, or “DAT” for Affymetrix image file.
- keyword:
- see note.
- user_added_cols:
- see note.
This table describes information related to microarray slides or chips, one line for each slide/chip.
Simple protocols or samples (without keywords or user_added_cols items) can be defined in the [array] section.
For a new protocol, the definition can be given in the corresponding protocol cell, in the format:
[new_protocol_name]:protocol definitions
For a new sample, it is even simpler - just fill in one or more of the columns that can be found in the [sample] section, like organism(_chN), tissue(_chN), etc. Since a new sample definition is produced in this way, don’t fill in anything in these columns except the sample(_chN) (sample name) column if it is not a new sample!
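The inline protocol format shown above can be recognized with a short helper. This is a sketch under an assumption: the square brackets in “[new_protocol_name]:protocol definitions” are taken to be literal characters in the cell, which the text does not state explicitly. The function name is hypothetical.

```python
import re

# "[name]:definition" with literal brackets (assumption, see lead-in).
NEW_PROTOCOL = re.compile(r"\[([^\]]+)\]:(.*)", re.DOTALL)

def parse_protocol_cell(cell):
    """Classify a protocol cell from the [array] section.

    Returns ("new", name, definition) for "[name]:definition",
    otherwise ("existing", cell, None) for a protocol name or ID.
    """
    text = cell.strip()
    match = NEW_PROTOCOL.fullmatch(text)
    if match:
        return ("new", match.group(1).strip(), match.group(2).strip())
    return ("existing", text, None)

print(parse_protocol_cell("[Overnight hyb]:Hybridize overnight, then wash."))
print(parse_protocol_cell("standard_hyb"))
```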
9.6 An Example
Here is an example of input.txt.
10 Project annotation file
Optional file.
The project annotation file contains additional information for a certain project. Although WebArrayDB does not restrict its format, plain text format is recommended.
11 Probe-mapping file
Optional file.
This file can be used to map probes across different platforms. Four columns are required:
| Platform_A | unique_id_A | Platform_B | unique_id_B |
Alignment files downloaded from Affymetrix can be used directly to map probes between different types of GeneChip. In such cases, the four columns in the files: “A Array Name”, “A Probe Set Name”, “B Array Name” and “B Probe Set Name” will be used as the four required columns correspondingly.
Usually alignment by files provides a more comprehensive and more reliable map of probes. Based on such an understanding, when users use the option “automatic” to match probes, WebArrayDB will use such alignments if all involved platforms were aligned already, otherwise all available reference columns will be used for alignment.
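Reading a probe-mapping file with either set of column names can be sketched as below. The function name is hypothetical, and the TAB delimiter is an assumption (the file's delimiter is not specified here), so it is exposed as a parameter.

```python
import csv
import io

# Affymetrix alignment-file headers and the required columns they stand for.
AFFY_TO_REQUIRED = {
    "A Array Name": "Platform_A",
    "A Probe Set Name": "unique_id_A",
    "B Array Name": "Platform_B",
    "B Probe Set Name": "unique_id_B",
}
REQUIRED = ("Platform_A", "unique_id_A", "Platform_B", "unique_id_B")

def read_probe_mapping(handle, delimiter="\t"):
    """Yield (Platform_A, unique_id_A, Platform_B, unique_id_B) tuples."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    # Map each required column to the header actually present in the file.
    columns = {AFFY_TO_REQUIRED.get(name, name): name for name in fieldnames}
    missing = [name for name in REQUIRED if name not in columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        yield tuple(row[columns[name]] for name in REQUIRED)

sample = "A Array Name\tA Probe Set Name\tB Array Name\tB Probe Set Name\nArrayA\tp1\tArrayB\tq7\n"
print(list(read_probe_mapping(io.StringIO(sample))))
```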