Help for loading data into WebArrayDB

1  Introduction

The data to be loaded into the databases are organized into projects. The information for a project falls into 5 categories:

(1) project: General information about the project.
(2) array: Detailed information on each array involved.
(3) sample: Information on new samples.
(4) platform: Information on new microarray platforms.
(5) protocol: Information on new protocols for array/sample manipulation.

Each project, array, sample, platform or protocol has a record in the database. Logically, arrays, samples, platforms and protocols are associated with (or belong to) the project in which they were defined, but they can be referenced in other projects too. A typical example is an array that uses a protocol defined in another project for hybridization.

WebArrayDB provides two interfaces for loading data into the databases: a web interface and a command console.

2  Web interface

The web interface can be used by users from collaborating groups or by internet users. With the web interface, users can

3  Command console

A small shell was written for WebArrayDB to facilitate data access. It can be used not only for loading data but also for searching data. The command console is simple yet flexible and powerful because it supports SQL syntax, and it is ideal for experienced users.

A few commands are implemented. Commands such as “fillwith” and “fillbatch” load data into WebArrayDB; “display” and “saveto” display and save search results. A limited set of SQL commands, including "SELECT" and "SHOW", can also be used to find data of interest in the database. The command “help” gives a complete list.

Enter the shell:
$ WebArrayDB
WebArrayDB->

To load data into WebArrayDB, a file containing the necessary information is mandatory. This file is called the project file (the name “input.txt” is used for it in the following discussion). To load data for one project, use “fillwith” with a compact “input.txt”:
WebArrayDB-> fillwith input.txt;

To load data for a batch of projects, the input.txt should contain more forms:
WebArrayDB-> fillbatch input.txt;

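SQL query commands are used to obtain data from WebArrayDB. For example, the following query retrieves the array ID, fg, bg and flag values of the intensity records for arrays linked to the sample named 'Cell line genomic DNA', ordered by probe ID: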

WebArrayDB-> SELECT
-> it.array_id, fg, bg, flag
-> FROM
-> intensity AS it, sample, sampxref
-> WHERE
-> sample.name = 'Cell line genomic DNA'
-> AND sampxref.sample_id = sample.id
-> AND sampxref.array_id = it.array_id
-> ORDER BY
-> it.probe_id;

The following commands then display the results and save them in the file 'results.txt':

WebArrayDB-> display;

WebArrayDB-> saveto results.txt;
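
The join used in the query above can be tried outside WebArrayDB. The following is a minimal sketch using Python's built-in sqlite3 module. The table and column names are taken from the query, but the column types, the toy rows and the use of SQLite are assumptions made for illustration and do not describe the actual WebArrayDB schema.

```python
import sqlite3

# Toy in-memory database. The schema is an assumption that only mirrors
# the table and column names used in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sampxref  (sample_id INTEGER, array_id INTEGER);
    CREATE TABLE intensity (array_id INTEGER, probe_id INTEGER,
                            fg REAL, bg REAL, flag INTEGER);
    INSERT INTO sample VALUES (1, 'Cell line genomic DNA'), (2, 'Other sample');
    INSERT INTO sampxref VALUES (1, 10), (2, 20);
    INSERT INTO intensity VALUES
        (10, 2, 300.0, 40.0, 0),
        (10, 1, 500.0, 50.0, 0),
        (20, 1, 900.0, 60.0, 1);
""")

QUERY = """
    SELECT it.array_id, fg, bg, flag
    FROM intensity AS it, sample, sampxref
    WHERE sample.name = 'Cell line genomic DNA'
      AND sampxref.sample_id = sample.id
      AND sampxref.array_id = it.array_id
    ORDER BY it.probe_id
"""

# Only rows of array 10 (the array linked to the named sample) are returned,
# sorted by probe_id.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The sample table is joined to the intensity table through sampxref, so only intensity rows of arrays linked to the named sample are selected.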

4  File types

Besides the project file (“input.txt”), intensity files must be supplied for new array data, while image files are optional. Probe files containing platform information are necessary for platforms new to WebArrayDB. A protocol file is optional when defining a new protocol.

In general, these files fall into 6 categories:

Important: except for image files, intensity files and project annotation files, all other files MUST be in TAB-delimited ASCII format. EXCEL files MUST be saved in Text (Tab delimited) format for these file types!

5  Image file

Optional file. Usually in binary format.

This file is the raw image file produced by microarray scanning. Usually it is in “TIFF” format for a two-color array or in the “.DAT” file format for an Affymetrix GeneChip. As a storage tool, WebArrayDB attempts to save all information related to a microarray experiment, including microarray image files. Raw image files can be searched in the database.

6  Intensity file

Required file for array data. It can be in a variety of formats, ASCII or binary.

Intensity files contain the microarray hybridization intensity data, which are required for any array loaded into WebArrayDB. Different image-processing software packages produce intensity files in their own formats. WebArrayDB can read intensity files from Affymetrix, Agilent, ArrayVision, Genepix, QuantArray, ImaGene (note: the ImaGene format may not function in the web interface), SMD and SPOT, or in a user-defined format. Note that a format name and its value in WebArrayDB may differ. The following table lists the format names and the corresponding values.

Format name          Value in WebArrayDB
Affymetrix           CEL
Agilent              agilent
ArrayVision          arrayvision
BlueFuse             bluefuse
Genepix              genepix
GenePix Custom       genepix.custom
GenePix Median       genepix.median
ImaGene              imagene
QuantArray           quantarray
Scanarray Express    scanarrayexpress
SMD                  smd
SMD Old              smd.old
SPOT                 spot
user-defined         user.defined

For Affymetrix, the intensity file means the “.CEL” file. All these file formats have been explained in WebArray, but the user-defined format is a little different from that in WebArray, see below.
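
When project files are prepared by script, the table above can be kept as a lookup. The following is a minimal sketch; the dictionary content is copied from the table, while the function name and the case-insensitive lookup are conveniences assumed here, not part of WebArrayDB.

```python
# Format names and their WebArrayDB values, copied from the table above.
INTENSITY_FORMATS = {
    "Affymetrix": "CEL",
    "Agilent": "agilent",
    "ArrayVision": "arrayvision",
    "BlueFuse": "bluefuse",
    "Genepix": "genepix",
    "GenePix Custom": "genepix.custom",
    "GenePix Median": "genepix.median",
    "ImaGene": "imagene",
    "QuantArray": "quantarray",
    "Scanarray Express": "scanarrayexpress",
    "SMD": "smd",
    "SMD Old": "smd.old",
    "SPOT": "spot",
    "user-defined": "user.defined",
}


def intensity_format_value(format_name):
    """Return the WebArrayDB value for a format name (hypothetical helper).

    The lookup ignores case and surrounding whitespace; that tolerance is
    an assumption of this sketch.
    """
    lookup = {name.lower(): value for name, value in INTENSITY_FORMATS.items()}
    try:
        return lookup[format_name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown intensity format: {format_name!r}") from None


print(intensity_format_value("GenePix Median"))
```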

6.1  user-defined format

Note: User-defined intensity files MUST be in TAB-delimited ASCII format.

The user-defined format allows experiment data with multiple dyes (channels). For each channel, the foreground intensity value is required, while the background value and the flag value are optional. For channel N, the column names should be chN.Intensity, chN.Background and chN.Flag. In the column list below, columns marked with + are required and columns in square brackets are optional:

ID  ch1.Intensity+  [ch1.Background]  [ch1.Flag]  [ch2.Intensity [ch2.Background] [ch2.Flag]]  [ch3.Intensity [ch3.Background] [ch3.Flag]]  ...
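
A header in this layout can be generated by script for any number of channels. The following is a minimal sketch using Python's csv module; the function names, the probe name and the intensity values are hypothetical and serve only as illustration.

```python
import csv
import io


def user_defined_header(n_channels, with_id=True, background=True, flag=True):
    """Build the column names of a user-defined intensity file."""
    if n_channels < 1:
        raise ValueError("at least one channel is required")
    columns = ["ID"] if with_id else []
    for i in range(1, n_channels + 1):
        columns.append(f"ch{i}.Intensity")       # required for every channel
        if background:
            columns.append(f"ch{i}.Background")  # optional
        if flag:
            columns.append(f"ch{i}.Flag")        # optional
    return columns


def write_intensity_table(rows, columns):
    """Serialize rows as TAB-delimited text, header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Two-channel file with background columns but without flag columns.
header = user_defined_header(2, flag=False)
text = write_intensity_table([["probe_1", 520, 48, 610, 51]], header)
print(text)
```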

The differences of the user-defined format between WebArrayDB and WebArray are:

  1. WebArrayDB supports 1, 2 or more channels while WebArray only recognizes two-color data;
  2. In WebArray, the “ID” column is a prerequisite and must be exactly the same as that in the Gene List File. In contrast, “ID” is optional in WebArrayDB, so the simplest intensity file can have just one column: “ch1.Intensity”.

    If the “ID” column is present, it is used to adjust the order of rows in the intensity data file to that of “unique_id” in the probe file, but only when all of the following criteria are satisfied:

    • it contains the same items as the “unique_id” (or “ID”) column in the probe file, although the order of the ID items can differ;
    • no ID item is replicated.

  3. The column names are case-insensitive in WebArrayDB, e.g. “CH2.intEnsiTY” will be recognized as “ch2.Intensity”.
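
The case-insensitive column names and the reordering criteria from the list above can be sketched as follows. This is an illustration under assumptions, not WebArrayDB's code: the function names are hypothetical, rows are modelled as (ID, value) pairs, and the replicate check is applied to both the intensity file and the probe file.

```python
def normalize_column(name):
    """Map any capitalisation to the canonical form, e.g. 'CH2.intEnsiTY' -> 'ch2.Intensity'."""
    lowered = name.strip().lower()
    if lowered == "id":
        return "ID"
    prefix, _, suffix = lowered.partition(".")
    return f"{prefix}.{suffix.capitalize()}" if suffix else lowered


def reorder_by_probe_ids(intensity_rows, probe_ids):
    """Reorder (id, value) rows to follow probe_ids, or return them unchanged.

    Reordering happens only when both criteria hold: no ID is replicated,
    and the IDs are the same items as those in the probe file.
    """
    ids = [row_id for row_id, _ in intensity_rows]
    no_replicates = (len(set(ids)) == len(ids)
                     and len(set(probe_ids)) == len(probe_ids))
    same_items = set(ids) == set(probe_ids)
    if not (no_replicates and same_items):
        return list(intensity_rows)
    by_id = dict(intensity_rows)
    return [(probe_id, by_id[probe_id]) for probe_id in probe_ids]


print(normalize_column("CH2.intEnsiTY"))
print(reorder_by_probe_ids([("b", 2), ("a", 1)], ["a", "b"]))
```

When either criterion fails, the rows are left in file order, which matches the rule that the “ID” column is then not used for reordering.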

WebArrayDB supports “ratio” and “log-ratio” values as well as intensity values. Sometimes the output data from arrays are ratios of foreground intensity values to background intensity values, or such ratios after a base-2 logarithmic transformation. In these cases, the value in the data_type column of the [array] section is “ratio” or “log-ratio”, respectively.

Files for ratio values in user-defined format may have these columns:

ID  ratio+  flag

Files for log-ratio values in user-defined format may have these columns:

ID  log-ratio+  flag
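
The relation between the three data types can be shown with a short calculation. The sketch below follows the description in this section (a ratio of foreground to background intensity, and a base-2 logarithm of that ratio); the function names and the numbers are hypothetical.

```python
import math


def to_ratio(foreground, background):
    """Ratio of the foreground intensity to the background intensity."""
    return foreground / background


def to_log_ratio(ratio):
    """Base-2 logarithm of a ratio value."""
    return math.log2(ratio)


ratio = to_ratio(800.0, 100.0)
log_ratio = to_log_ratio(ratio)
print(ratio, log_ratio)
```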

7  Protocol file

Optional file.

Note: protocol files MUST be in TAB-delimited ASCII format.

A protocol file presents the details of an experimental protocol. Portable formats such as ASCII text, PDF, PS and RTF are strongly recommended, although other formats, such as MS Word, are acceptable as well.

8  Probe file

Required file for defining a NEW microarray platform.

Note: probe files MUST be in TAB-delimited ASCII format.

Probe files are used to define a new microarray platform. Only the column “unique_id” (“ID” can be used as a replacement) must be supplied in this file; all other columns are optional.

For the purpose of cross-platform alignment, users may consider including some columns for mapping to other platforms. Keep in mind that the column names must be defined as reference IDs before those columns are used to map arrays (see the reference ID section for more details). In addition to the reference IDs, “unique_id” and “gene_symbol” can be used for cross-platform alignment too.

For Affymetrix GeneChips, the annotation files (downloadable from Affymetrix’s website) can be used directly as probe files. For other array platforms, the probe file may be a modified version of a gene list file (sometimes called a GAL file). Some gene list files have a column “block” instead of “block_row” and “block_col”. For the convenience of users, a “block” column can be supplied in probe files: when only “block” is available, WebArrayDB tries to guess reasonable “block_row” and “block_col” values from it. If none of the three columns is supplied, both “block_row” and “block_col” are set to 1.

idx  block_row  block_col  row  col  unique_id+  gene_symbol  gene_title  chromosome  probe_start  probe_end  probe_strand  probe_sequence  bioseq_type  probe_purpose  designation  gene_start  gene_end  gene_strand  cpg_dist  user_attr1  user_attr2  ...


idx:
The index (order number) of each probe in the platform. If it is not supplied, WebArrayDB fills it in automatically with the numbers from 1 to the number of probes, in order.
block_row & block_col:
For spotted arrays, each pin prints a block in a contiguous area. These two columns record the row and column location of each block. If the platform has no blocks (e.g. Affymetrix GeneChip), both values are set to 1.
row & col:
The row and column location of a probe within its block, or within the whole microarray if there are no blocks. For Affymetrix GeneChips, “row” is set to 1 and “col” uses the same value as “idx”.
unique_id (or ID):
Required column. This is the name string given to a probe in a platform. Ideally, all “unique_id” items are unique across the whole platform, but this is not mandatory, since some users prefer replicate probes to share the same “unique_id”.
gene_symbol:
A gene symbol, when one is available (from UniGene or other databases). Gene symbol is “A unique abbreviation of a gene name consisting of italicized uppercase Latin letters and Arabic numbers formally assigned by the by HUGO Gene Nomenclature Committee after a gene has been identified (Note: a putative gene may be referred to by its locus name prior to its identification)” - copied from http://ghr.nlm.nih.gov/glossary=genesymbol.
In Affymetrix annotation files, gene symbols are derived by different organizations for different species. Affymetrix data come from the UniGene record for UniGene-based arrays such as human, mouse and rat. For arrays that are not based on the UniGene database, Affymetrix obtains the gene symbol from various sources, including FlyBase, WormBase and the Saccharomyces Genome Database.
gene_title:
Title of the gene represented by the probe. The gene title (or name) is usually extracted from the Gene or UniGene databases. In some cases, specialty databases (such as WormBase) may provide the gene title/name.
chromosome:
The chromosome from which the probe sequence comes.
probe_start:
Start position of the probe sequence on the chromosome.
probe_end:
End position of the probe sequence on the chromosome.
probe_strand:
The DNA strand from which the probe comes.
probe_sequence:
Nucleic acid sequence of the probe.
bioseq_type:
Indicates whether the sequence is an Exemplar, Consensus or Control sequence. An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence is a nucleotide sequence assembled by Affymetrix, based on one or more sequences taken from a public database.
probe_purpose:
The purpose of the probe. Use the value “normal”, “control” or something else. “control” probes can be used as controls in the within-array normalization methods “composite” and “control”.
designation:
The designation features of the probe, e.g. how the mismatched bases are arranged in terms of location and number.
gene_start:
Start position of the gene sequence on the chromosome.
gene_end:
End position of the gene sequence on the chromosome.
gene_strand:
The DNA strand from which the gene comes.
cpg_dist:
The minimal distance from the probe sequence to a CpG island.
user_attr1:
Users can use this column to define a feature for their private purposes.
user_attr2:
Users can use this column to define another feature for their private purposes.
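
The defaults described for “idx”, “block_row” and “block_col” can be sketched as follows. This is a minimal illustration with a hypothetical function name, in which probe rows are modelled as dictionaries. It covers only the documented defaults and does not attempt the guess of “block_row” and “block_col” from “block”, whose rule is not described here.

```python
def fill_probe_defaults(rows):
    """Fill the idx and block defaults in a list of probe-row dicts.

    Returns new dicts; the input rows are not modified.
    """
    has_block_info = any(
        key in row for row in rows for key in ("block", "block_row", "block_col")
    )
    filled = []
    for position, row in enumerate(rows, start=1):
        if "unique_id" not in row and "ID" not in row:
            raise ValueError(f"row {position}: unique_id (or ID) is required")
        new_row = dict(row)
        new_row.setdefault("idx", position)  # 1..number of probes, in file order
        if not has_block_info:
            # None of the three block columns is supplied: both default to 1.
            new_row["block_row"] = 1
            new_row["block_col"] = 1
        filled.append(new_row)
    return filled


probes = fill_probe_defaults([{"unique_id": "p1"}, {"unique_id": "p2"}])
print(probes)
```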

8.1  Reference IDs

Reference IDs are the IDs that can be used to map probes between different microarray platforms. The IDs should be included in the probe file, in columns whose names have been declared as "reference IDs" in advance.

The following IDs from biological databases are already included as reference IDs. You may browse them or add more using the online function “Browse/Add external databases as reference” in the “Database Management” interface of WebArrayDB.

AGI:
AGI ID, a uniform gene nomenclature system for Arabidopsis created by the Arabidopsis Genome Initiative (AGI), an international effort to sequence the complete Arabidopsis genome. AGI IDs have the following format: At = organism; 1, 2, 3, 4, 5 = chromosome; g = gene; 00010 = gene ID.
Affymetrix:
Affymetrix Probe Set ID.
EC:
Enzyme Commission number (EC number).
Ensembl:
EnsEMBL ID, a transcript identifier from the ENSEMBL project.
Entrez:
IDs and symbols are extracted from Entrez Gene.
FlyBase:
A locus name from FlyBase, a database of the Drosophila genome.
GO:
Gene Ontology term.
InterPro:
InterPro ID. InterPro is a database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences.
MGI:
MGI ID, a locus identifier from the Mouse Genome Informatics (MGI) database.
OMIM:
ID in OMIM, Online Mendelian Inheritance in Man. This gives a link to the gene's description in Online Mendelian Inheritance in Man, a hand-curated database of disease and genetic disorders, biomedical and biochemical information, and phenotypes associated with known human genes. OMIM indexes give the NetAffx user access to detailed descriptions of biomedical research associated with their genes of interest. Only available for probe sets for human genes.
Pathway:
Displays the GenMAPP pathway if the transcript has been found to play a role in a proteome functional pathway in the GenMAPP collection.
QTL:
(Quantitative Trait Loci) Genetic linkage data that provide disease associations for some loci. This data comes from RatMap at the Rat Genome Database; so these annotations only appear on Rat arrays.
RGD:
RGD ID, a locus from the Rat Genome Database (RGD).
RefSeq Protein ID:
ID of the protein sequence in the NCBI RefSeq database.
RefSeq Transcript ID:
References to multiple sequences in RefSeq.
Representative Public ID:
The accession number of the representative sequence on which the probe set is based. For UniGene based arrays, this is usually a GenBank, dbEST or RefSeq accession used for sequence selection. Refer to the Sequence Source field under the Sequence section to determine the database used.
SGD:
SGD accession number, a locus from the Saccharomyces Genome Database.
SwissProt:
SWISS-PROT (sometimes known as SWALL) accession numbers of the peptide sequences corresponding to the mRNAs in the UniGene cluster represented by the probe set.
Trans Membrane:
UniGene:
UniGene ID - The UniGene collection of sequences.
WormBase:
A locus name from Wormbase, a database of the genome and biology of C. elegans.
XDB:
Xenopus Gene Database provides mappings between XGD IDs and Affymetrix probe set IDs.
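
As an illustration of how a reference ID column might be checked before loading, the AGI format described in the list above can be expressed as a regular expression. This is a minimal sketch: the function name is hypothetical, the pattern covers only the format as described (At, a chromosome from 1 to 5, g, and a five-digit gene ID), and ignoring letter case is an assumption.

```python
import re

# "At", chromosome 1-5, "g", five digits. Identifiers outside this
# description are deliberately not covered by the sketch.
AGI_PATTERN = re.compile(r"^At([1-5])g(\d{5})$", re.IGNORECASE)


def parse_agi(agi_id):
    """Split an AGI ID into its parts, or return None if it does not match."""
    match = AGI_PATTERN.match(agi_id.strip())
    if match is None:
        return None
    return {
        "organism": "At",
        "chromosome": int(match.group(1)),
        "gene_id": match.group(2),
    }


print(parse_agi("At1g00010"))
```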

9  Project file

Note: project files MUST be in TAB-delimited ASCII format.

This section explains the contents of a project file for a single project.

The project file is required when operating on the “command console”, or when using the online function “create a new project (with online forms)”. If the user adds or updates a project by online functions “create a new project (with online forms)” or “add more data to an existing project”, tables corresponding to sections of the project file will be automatically produced by WebArrayDB. Whether it is required or not, the annotations for fields/columns in the following subsections are helpful for understanding the data content in WebArrayDB.

*Note:

9.1  [project]


name  factors  tissue  design  QC  description  authors  journal  publish_year  pubmed_id  data_link  user_name  release_date  related_files/  keyword  user_added_cols


name:
Project name, should be unique among those from a single user.
factors:
Factors whose effects on samples the project aims to investigate.
tissue:
Sample tissue used in the project.
design:
A description of experiment design.
QC:
Measures or steps taken for quality control.
description:
A comprehensive description of the project.
authors:
People who carried out this project.
journal:
The journal in which the results of this project were published.
publish_year:
The year in which the results of the project were published.
pubmed_id:
The PubMed ID of the published papers in MEDLINE.
data_link:
Web URL links for published papers.
user_name:
The ID of the user who defines this project.
release_date:
The date after which the project will be open for public access.
related_files:
This column lists the names of annotation files for the project.
keyword:
see note.
user_added_cols:
see note.


This table contains project information, one line per project. It should usually be present if the user is loading array data (hybridization data).

IMPORTANT: If the project file is written to add data to an existing project, this part (the “project” table) should be blank, while the other parts (“array”, “sample”, “platform” and “protocol”) should each have a column “project” listing the names of the projects to which they belong.

9.2  [sample]


For generic samples

name+  organism  tissue  individual_id  gender  age  description  keyword  user_added_cols


name:
Required column. Name of the sample is needed and should be unique among all samples defined by a same user.
organism:
The organism name from which the sample was taken.
tissue:
The tissue name from which the sample was taken.
individual_id:
The identifying ID/number/name of the individual from which the sample was taken.
gender:
Gender of the sample individual. Valid values are “male”, “female” or “NA”.
age:
Age of the sample individual. It is an integer.
description:
A comprehensive description of the sample.
keyword:
see note.
user_added_cols:
see note.


This table is needed if samples are used that are not yet available in WebArrayDB. Each sample should have one line of data. Samples can also be defined in the [array] section.
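
The constraints listed for the [sample] columns can be checked before loading. The following is a minimal sketch with a hypothetical function name; rows are modelled as dictionaries of strings, and treating an empty cell as "not specified" is an assumption of the sketch.

```python
VALID_GENDERS = {"male", "female", "NA"}


def validate_sample(row):
    """Return a list of problems found in one [sample] row (empty if none)."""
    problems = []
    if not row.get("name", "").strip():
        problems.append("name is required")
    gender = row.get("gender", "")
    if gender and gender not in VALID_GENDERS:
        problems.append(f"gender must be one of {sorted(VALID_GENDERS)}")
    age = row.get("age", "")
    if age:
        try:
            int(age)
        except ValueError:
            problems.append("age must be an integer")
    return problems


print(validate_sample({"name": "s1", "gender": "female", "age": "34"}))
print(validate_sample({"name": "", "age": "3.5"}))
```

Uniqueness of the sample name among a user's samples cannot be checked from a single row and is left out of the sketch.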

9.3  [protocol]


name+  category  description  keyword  user_added_cols


name:
Required column. Name of the protocol is needed and should be unique among all protocols defined by a same user.
category:
One of “process” (sample growth/treatment/separation), “technique” (DNA/RNA/protein extraction/purification), “label” (DNA/RNA/protein labeling), “hybridization” (hybridization and washing method), “image” (scanning method) or “data” (image quantification method).
description:
A comprehensive description of the protocol.
keyword:
see note.
user_added_cols:
see note.


This table should be present only if there are new protocols introduced. Protocols might be defined in the [array] section too.

9.4  [platform]


name+  probe_file+  category+  probe_num+  replicate+  space+  manufacturer  organism  description  availability  keyword  user_added_cols


name:
Required column. Name of the platform is needed and should be unique among all platforms in a same database.
probe_file:
Required column. See probe file.
category:
Required column. A value from (“antibody”,“in situ oligonucleotide”,“MPSS”,“MS”,“oligonucleotide beads”,“other”,“RT-PCR”,“SAGE NlaIII”,“SAGE Sau3A”,“spotted DNA/cDNA”,“spotted oligonucleotide”,“spotted protein”).
probe_num:
Required column. Number of probes in the platform.
replicate:
Required column. Number of replicates of each probe.
space:
Required column. Spacing between replicate spots, i.e. the difference between the positions of replicate spots on the array. For example, if two replicate spots are adjacent on the array, the space is 1.
manufacturer:
The manufacturer who produces this platform.
organism:
The organism for which the platform was designed.
description:
A comprehensive description of the platform.
availability:
A value from (“public”,“private”); the default is “private”. “private” means that this platform is visible to others only after the project containing it is released to the public. “public” means that this platform can be shared with and used by other users. We suggest that users set the value to “public” when they define commercial microarray platforms, or when they want to share their platforms with collaborators. This value can be reset later by the user.
keyword:
see note.
user_added_cols:
see note.


This table provides platform information.
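
The required columns, the category values and the default availability of the [platform] section can be checked with a few lines of code. The following is a minimal sketch; the function name and the example values (platform name, probe file name) are hypothetical, and rows are modelled as dictionaries of strings.

```python
# Category values and required columns, copied from the column list above.
PLATFORM_CATEGORIES = {
    "antibody", "in situ oligonucleotide", "MPSS", "MS",
    "oligonucleotide beads", "other", "RT-PCR", "SAGE NlaIII",
    "SAGE Sau3A", "spotted DNA/cDNA", "spotted oligonucleotide",
    "spotted protein",
}
REQUIRED_PLATFORM_COLUMNS = (
    "name", "probe_file", "category", "probe_num", "replicate", "space",
)


def check_platform(row):
    """Validate one [platform] row and return a copy with the default availability."""
    missing = [col for col in REQUIRED_PLATFORM_COLUMNS
               if not str(row.get(col, "")).strip()]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
    if row["category"] not in PLATFORM_CATEGORIES:
        raise ValueError(f"unknown category: {row['category']!r}")
    checked = dict(row)
    if not checked.get("availability"):
        checked["availability"] = "private"  # documented default
    elif checked["availability"] not in ("public", "private"):
        raise ValueError("availability must be 'public' or 'private'")
    return checked


platform = check_platform({
    "name": "demo platform", "probe_file": "probes.txt",
    "category": "spotted oligonucleotide",
    "probe_num": "100", "replicate": "2", "space": "1",
})
print(platform["availability"])
```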

9.5  [array]


identifier  platform+  channel_num+  intensity_file+/  intensity_format+  hyb_date  protocol_hyb  protocol_image  protocol_data  data_type  description  sample(_chN)+  organism(_chN)  tissue(_chN)  individual_id(_chN)  gender(_chN)  age(_chN)  description(_chN)  dye(_chN)  protocol_process(_chN)  protocol_tech(_chN)  protocol_label(_chN)  exp_factor(_chN)  image_file(_chN)/  image_format(_chN)  keyword  user_added_cols


identifier:
Should be unique among all arrays defined by the same user; for example, the barcodes of slides.
platform:
Required column. Platform name or ID.
channel_num:
Required column. 1 for single-channel arrays, 2 for two-color arrays, etc.
intensity_file+/:
The file name of the intensity data. For ImaGene data, two-color arrays have two data files, and their names should be separated by “ /// ”.
intensity_format:
Required column. The data file formats are described in the Intensity file section.
hyb_date:
The date on which the array was hybridized.
protocol_hyb:
Protocol name or ID for hybridization and washing.
protocol_image:
Protocol name or ID for image scanning.
protocol_data:
Protocol name or ID for image quantification.
data_type:
A value from (“intensity”, “ratio”, “log-ratio”).
description:
A comprehensive description of the array.
keyword:
see note.
user_added_cols:
see note.
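
The “ /// ” separator used for ImaGene file names in the intensity_file column can be handled as in the following sketch. The function name and the file names are hypothetical; splitting on the three slashes and trimming whitespace is a tolerant reading of the separator assumed here.

```python
def split_intensity_files(cell):
    """Split an intensity_file cell into its file names."""
    return [name.strip() for name in cell.split("///") if name.strip()]


print(split_intensity_files("slide1_cy3.txt /// slide1_cy5.txt"))
print(split_intensity_files("chip.CEL"))
```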

The following columns are for channels:
Important notice for assigning the channel number (chN): For two-color intensity files in “genepix” format (and possibly some other formats too), the rank number of a channel is determined by the wavelength of the laser used for this channel. Normally the channel number should increase with laser wavelength!

sample(_chN):
Required column. Sample name for channel N. The columns organism(_chN), tissue(_chN), individual_id(_chN), gender(_chN), age(_chN) and description(_chN) correspond to fields in the [sample] section, although their names and number may differ.
dye(_chN):
The dye used for channel N, a value from “Cy3”, “Cy5”, “biotin”, “sybr green”, “syto 61”, or other dyes added by users.
protocol_process(_chN):
Protocol name or ID for sample growth/treatment/separation.
protocol_tech(_chN):
Protocol name or ID for DNA/RNA/protein extraction/purification.
protocol_label(_chN):
Protocol name or ID for DNA/RNA/protein sample labelling.
exp_factor(_chN):
The factor that is going to be investigated through this sample.
image_file(_chN)/:
The file name of the scanned image.
image_format(_chN):
The format of image file. Typically, it is one of “TIF”, “BMP”, “GIF”, “JPG”, or “DAT” for Affymetrix image file.
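
The channel numbering rule noted above (channel number increasing with laser wavelength) can be sketched as follows. The function name is hypothetical, and the wavelengths in the example are assumed, typical values for Cy3 and Cy5 scanning lasers; the wavelengths of the scanner actually used should be substituted.

```python
def assign_channels(dye_wavelengths):
    """Number channels 1..N in order of increasing laser wavelength (nm)."""
    ordered = sorted(dye_wavelengths.items(), key=lambda item: item[1])
    return {dye: number for number, (dye, _) in enumerate(ordered, start=1)}


# Example wavelengths are assumptions, not values taken from WebArrayDB.
channels = assign_channels({"Cy5": 635, "Cy3": 532})
print(channels)
```

With these example values, the columns for the Cy3 sample would carry the suffix _ch1 and those for the Cy5 sample the suffix _ch2.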


This table describes information related to microarray slides or chips, with one line for each slide/chip.

Simple protocols or samples (with neither keywords nor user_added_cols items) can be defined in the [array] section.

A new protocol can be defined completely in the corresponding protocol cell, using the format:

[new_protocol_name]:protocol definitions
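
A protocol cell in this format can be told apart from a reference to an existing protocol with a small parser. The following is a minimal sketch under the assumption that the square brackets around the protocol name are literal characters; the function name and the example protocol are hypothetical.

```python
import re

# "[name]:definition", with literal square brackets (an assumption).
NEW_PROTOCOL = re.compile(r"^\[(?P<name>[^\]]+)\]:(?P<definition>.*)$", re.DOTALL)


def parse_protocol_cell(cell):
    """Classify a protocol cell.

    Returns ('new', name, definition) for an inline definition,
    otherwise ('existing', name_or_id, None).
    """
    match = NEW_PROTOCOL.match(cell.strip())
    if match:
        return ("new", match.group("name").strip(),
                match.group("definition").strip())
    return ("existing", cell.strip(), None)


print(parse_protocol_cell("[quick wash]:Wash twice in 0.1x SSC"))
print(parse_protocol_cell("standard hyb"))
```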

For a new sample, it is even simpler: just fill in one or more of the columns that can be found in the [sample] section, such as organism(_chN), tissue(_chN), gender(_chN), etc. Because filling in these columns creates a new sample definition, leave them empty (apart from the sample name column) when the sample is not new!

9.6  An Example

Here is an example of input.txt.

10  Project annotation file

Optional file.

Note: project annotation files can be in any format.

The project annotation file contains additional information for a certain project. Although WebArrayDB places no restriction on its format, plain text is recommended.

11  Probe-mapping file

Optional file.

Note: probe-mapping files MUST be in TAB-delimited ASCII format.

This file can be used to map probes across different platforms. Four columns are required:

Platform_A  unique_id_A  Platform_B  unique_id_B

Alignment files downloaded from Affymetrix can be used directly to map probes between different types of GeneChip. In such cases, the four columns in the files: “A Array Name”, “A Probe Set Name”, “B Array Name” and “B Probe Set Name” will be used as the four required columns correspondingly.
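
Although such alignment files can be used directly, the column correspondence described above can be made explicit with a short conversion. The following is a minimal sketch using Python's csv module; the function name, the array names and the probe set names are hypothetical, and the input is assumed to be TAB-delimited text.

```python
import csv
import io

# Alignment file column -> probe-mapping column, as described above.
COLUMN_MAP = {
    "A Array Name": "Platform_A",
    "A Probe Set Name": "unique_id_A",
    "B Array Name": "Platform_B",
    "B Probe Set Name": "unique_id_B",
}


def to_probe_mapping(alignment_text):
    """Convert TAB-delimited alignment text into probe-mapping text."""
    reader = csv.DictReader(io.StringIO(alignment_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMN_MAP.values())
    for record in reader:
        # Keep only the four required columns, in the required order.
        writer.writerow(record[source] for source in COLUMN_MAP)
    return out.getvalue()


sample_text = (
    "A Array Name\tA Probe Set Name\tB Array Name\tB Probe Set Name\tExtra\n"
    "ArrayA\tps_1\tArrayB\tps_9\tx\n"
)
mapping_text = to_probe_mapping(sample_text)
print(mapping_text)
```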

Alignment by files usually provides a more comprehensive and more reliable map of probes. Accordingly, when users choose the option “automatic” to match probes, WebArrayDB uses such alignments if all platforms involved have already been aligned; otherwise all available reference columns are used for alignment.

