Help for loading data into WebArrayDB
The data to be loaded into the database are organized in projects. The information for a project falls into 5 categories:
(1) project | General information about the project. |
(2) array | Detailed information about each array involved. |
(3) sample | Information about new samples. |
(4) platform | Information about new microarray platforms. |
(5) protocol | Information about new protocols for array/sample manipulation. |
Each project, array, sample, platform or protocol has a record in the database. Logically, arrays, samples, platforms and protocols are associated with (or belong to) the project in which they were defined, but they can be referred to in other projects too. A typical example is an array that uses, for hybridization, a protocol defined in another project.
There are two interfaces, web interface and command console, for inputting data into databases in WebArrayDB.
The web interface can be used by users from collaborating groups or by internet users. With the web interface, users can
A small shell was written for WebArrayDB to facilitate data access. It can be used not only for inputting data, but also for searching data. The command console is simple but flexible and powerful since it supports SQL syntax. It is ideal for experienced users.
A few commands were implemented. Commands like “fillwith” and “fillbatch” are designed to load data into WebArrayDB; “display” and “saveto” display and save the data found by a search. A limited set of SQL commands, including "SELECT" and "SHOW", can be used to find data of interest in the database too. The command “help” gives a complete list.
Enter the shell:
$ WebArrayDB
WebArrayDB−>
To load data into WebArrayDB, a file containing the necessary information is mandatory. This file is called the project file (the name “input.txt” is used for this file in the following discussion). To load data for one project, “fillwith” can be used with a compact “input.txt”:
WebArrayDB−> fillwith input.txt;
To load data for a batch of projects, the input.txt should contain more forms:
WebArrayDB−> fillbatch input.txt;
SQL query commands are used to obtain data from WebArrayDB. For example, to get the intensity data of the arrays hybridized with the sample 'Cell line genomic DNA':
WebArrayDB−> SELECT
          −> it.array_id, fg, bg, flag
          −> FROM
          −> intensity AS it, sample, sampxref
          −> WHERE
          −> sample.name = 'Cell line genomic DNA'
          −> AND sampxref.sample_id = sample.id
          −> AND sampxref.array_id = it.array_id
          −> ORDER BY
          −> it.probe_id;
Then the following commands display and save results in the file ’results.txt’:
WebArrayDB−> display;
WebArrayDB−> saveto results.txt;
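The SELECT query above can be tried outside WebArrayDB on a toy database. The following is a minimal sketch using Python's sqlite3 module with an in-memory database. The table layouts are assumptions that model only the columns named in the query, not the actual WebArrayDB schema, and all rows are invented for illustration.

```python
import sqlite3

# Toy schema: only the columns named in the query are modelled (an assumption,
# not the real WebArrayDB schema). All rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sampxref  (sample_id INTEGER, array_id INTEGER);
CREATE TABLE intensity (array_id INTEGER, probe_id INTEGER,
                        fg REAL, bg REAL, flag INTEGER);
INSERT INTO sample VALUES (1, 'Cell line genomic DNA'), (2, 'Other sample');
INSERT INTO sampxref VALUES (1, 10), (2, 20);
INSERT INTO intensity VALUES
    (10, 2, 850.0, 40.0, 0),
    (10, 1, 1200.0, 35.0, 0),
    (20, 1, 300.0, 50.0, 1);
""")

# Same joins and ordering as the console query above.
QUERY = """
SELECT it.array_id, fg, bg, flag
FROM intensity AS it, sample, sampxref
WHERE sample.name = 'Cell line genomic DNA'
  AND sampxref.sample_id = sample.id
  AND sampxref.array_id = it.array_id
ORDER BY it.probe_id;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Only the two rows of array 10 are returned, ordered by probe_id, because the sample cross-reference table links that array to the named sample.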
Besides the project file (“input.txt”), intensity files must be supplied for new array data, while image files are optional. Probe files containing platform information are necessary for platforms new to WebArrayDB. A protocol file is optional when defining a new protocol.
In general, these files fall into 6 categories:
Important: except for image files, intensity files, and project annotation files, all other files MUST be in TAB-delimited ASCII format. EXCEL files MUST be saved in Text (Tab delimited) format for these file types!
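When such a file is produced by a script rather than exported from Excel, the same format can be written with a general-purpose tool. The sketch below uses Python's csv module; the file name and the two example rows are hypothetical and serve only to exercise the function.

```python
import csv
import os
import tempfile

def write_tab_delimited(path, header, rows):
    """Write rows as TAB-delimited text, failing on non-ASCII characters."""
    # encoding="ascii" makes Python raise UnicodeEncodeError on non-ASCII text.
    with open(path, "w", encoding="ascii", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical content, used only to exercise the function.
header = ["unique_id", "gene_symbol"]
rows = [["P0001", "TP53"], ["P0002", "BRCA1"]]
path = os.path.join(tempfile.mkdtemp(), "probes.txt")
write_tab_delimited(path, header, rows)

with open(path, encoding="ascii") as handle:
    content = handle.read()
print(content)
```

Opening the file with the ASCII encoding is a deliberate choice: a stray non-ASCII character causes an immediate error instead of a file that fails later during loading.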
Optional file. Usually in binary format.
This file is the raw image file output by microarray scanning. Usually it is in “TIFF” format for a two-color array or in the “.DAT” file format for an Affymetrix GeneChip. As a storage tool, WebArrayDB attempts to save all information related to a microarray experiment, including microarray image files. Raw image files can be searched in the database.
Required file for array data. Can be in a variety of formats, ASCII or binary.
Intensity files contain the microarray hybridization intensity data, which is required for any array to be loaded into WebArrayDB. Different image processing software packages produce intensity files in their own specific formats. WebArrayDB can read intensity files from Affymetrix, Agilent, ArrayVision, Genepix, QuantArray, ImaGene (* the ImaGene format may not function in the web interface), SMD, SPOT or in a user-defined format. Notice that the format name and its value in WebArrayDB may differ. The following table lists the format names and corresponding values.
Format name | Value in WebArrayDB |
Affymetrix | CEL |
Agilent | agilent |
ArrayVision | arrayvision |
BlueFuse | bluefuse |
Genepix | genepix |
GenePix Custom | genepix.custom |
GenePix Median | genepix.median |
ImaGene | imagene |
QuantArray | quantarray |
Scanarray Express | scanarrayexpress |
SMD | smd |
SMD Old | smd.old |
SPOT | spot |
user-defined | user.defined |
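When project files are generated by a script, the table above can be kept as a lookup so that the intensity_format column always receives a valid value. This is a minimal sketch; the function name is hypothetical, and the case-insensitive matching is a convenience of the sketch, not a documented WebArrayDB behavior.

```python
# Format names and values copied from the table above.
INTENSITY_FORMATS = {
    "Affymetrix": "CEL",
    "Agilent": "agilent",
    "ArrayVision": "arrayvision",
    "BlueFuse": "bluefuse",
    "Genepix": "genepix",
    "GenePix Custom": "genepix.custom",
    "GenePix Median": "genepix.median",
    "ImaGene": "imagene",
    "QuantArray": "quantarray",
    "Scanarray Express": "scanarrayexpress",
    "SMD": "smd",
    "SMD Old": "smd.old",
    "SPOT": "spot",
    "user-defined": "user.defined",
}

def intensity_format_value(format_name):
    """Return the WebArrayDB value for a format name (case-insensitive)."""
    lookup = {name.lower(): value for name, value in INTENSITY_FORMATS.items()}
    try:
        return lookup[format_name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown intensity format: {format_name!r}") from None

print(intensity_format_value("GenePix Median"))
```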
For Affymetrix, the intensity file means the “.CEL” file. All these file formats have been explained in WebArray, but the user-defined format is a little different from that in WebArray, see below.
Note: User-defined intensity files MUST be in TAB-delimited ASCII format.
The user-defined format allows experiment data with multiple dyes (channels). For each channel, the foreground intensity value is required; the background value and the flag value are optional. For the i-th channel, the column names should be chi.Intensity, chi.Background and chi.Flag, where i is the channel number (for example, ch1.Intensity). See the example below:
ID | ch1.Intensity+ | [ch1.Background] | [ch1.Flag] | [ch2.Intensity | [ch2.Background] | [ch2.Flag]] | [ch3.Intensity | [ch3.Background] | [ch3.Flag]] | ... |
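The column rules above can be checked before a file is submitted. The sketch below validates the header of a user-defined intensity file; the function name is hypothetical, and the requirement that channels be numbered consecutively from 1 is an assumption of this sketch rather than a documented rule.

```python
import re

CHANNEL_COLUMN = re.compile(r"^ch(\d+)\.(Intensity|Background|Flag)$")

def check_user_defined_header(columns):
    """Return the number of channels described by a user-defined header."""
    channels = {}
    for column in columns:
        match = CHANNEL_COLUMN.match(column)
        if match:
            channels.setdefault(int(match.group(1)), set()).add(match.group(2))
    if not channels:
        raise ValueError("no chN.Intensity column found")
    # The foreground intensity is required for every channel that appears.
    for number in sorted(channels):
        if "Intensity" not in channels[number]:
            raise ValueError(f"ch{number}.Intensity is required")
    # Assumption: channels are numbered consecutively starting at 1.
    if sorted(channels) != list(range(1, len(channels) + 1)):
        raise ValueError("channel numbers should run 1, 2, 3, ...")
    return len(channels)

print(check_user_defined_header(
    ["ID", "ch1.Intensity", "ch1.Background", "ch2.Intensity", "ch2.Flag"]))
```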
The differences of the user-defined format between WebArrayDB and WebArray are:
If the “ID” column is present, it will be used to adjust the order of rows in the intensity data file to that of “unique_id” in the probe file only when all the following criteria are satisfied:
WebArrayDB supports “ratio” and “log-ratio” values as well as “intensity” values. Sometimes the output data from arrays are ratios of foreground intensity values to background intensity values, or such ratios after a base-2 logarithmic transformation. In such cases, the value in the data_type column of the [array] section is “ratio” or “log-ratio”, respectively.
Files for ratio values in user-defined format may have these columns:
ID | ratio+ | flag |
Files for log-ratio values in user-defined format may have these columns:
ID | log-ratio+ | flag |
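As a worked example of the two data types, the sketch below computes a ratio and its base-2 logarithm from one foreground and one background value, following the description above. The function names and input values are illustrative only.

```python
import math

def to_ratio(foreground, background):
    """Ratio of foreground to background intensity."""
    return foreground / background

def to_log_ratio(foreground, background):
    """Base-2 logarithm of the foreground/background ratio."""
    return math.log2(foreground / background)

print(to_ratio(800.0, 200.0))      # 4.0
print(to_log_ratio(800.0, 200.0))  # 2.0
```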
Optional file.
Note: protocol files MUST be in TAB-delimited ASCII format.
A protocol file presents the details of an experimental protocol. Portable formats such as ASCII text, PDF, PS or RTF are strongly recommended, although other formats, such as MS Word, are acceptable as well.
Required file for defining a NEW microarray platform.
Note: probe files MUST be in TAB-delimited ASCII format.
Probe files are used to define a new microarray platform. Only the column “unique_id” (“ID” can be used as a replacement) must be provided in this file; all others are optional.
For the purpose of cross-platform alignment, users may consider including some columns for mapping with other platforms. Keep in mind that you need to define the column names as reference IDs before using those columns to map arrays (see the reference ID section for more details). In addition to the reference IDs, “unique_id” and “gene_symbol” can be used for cross-platform alignment too.
For Affymetrix GeneChips, their annotation files (downloadable from Affymetrix’s website) can be directly used as the probe files. For other array platforms, this may be a modified version of a gene list file (sometimes called GAL file). In some gene list files, there is a column “block” instead of “block_row” and “block_col”. For convenience of users, a column “block” can be offered in probe files. WebArrayDB will try to guess a reasonable “block_row” and “block_col” from “block” in case that only the “block” column is available. If none of the three columns is offered, both “block_row” and “block_col” will be set to 1.
idx | block_row | block_col | row | col | unique_id+ | gene_symbol | gene_title | chromosome | probe_start | probe_end | probe_strand | probe_sequence | bioseq_type | probe_purpose | designation | gene_start | gene_end | gene_strand | cpg_dist | user_attr1 | user_attr2 | ... |
Reference IDs refer to the IDs that can be used to map probes from different microarray platforms. The IDs should be included in the probe file and listed in columns whose column names should be claimed as "reference IDs" in advance.
We have already included the following IDs from biological databases as reference IDs. You may browse them or add more using the online function “Browse/Add external databases as reference” in the “Database Management” interface of WebArrayDB.
Note: project files MUST be in TAB-delimited ASCII format.
This section explains the contents of a project file for a single project.
The project file is required when operating on the “command console”, or when using the online function “create a new project (with online forms)”. If the user adds or updates a project by online functions “create a new project (with online forms)” or “add more data to an existing project”, tables corresponding to sections of the project file will be automatically produced by WebArrayDB. Whether it is required or not, the annotations for fields/columns in the following subsections are helpful for understanding the data content in WebArrayDB.
“AnUserColName = Value for this record”,
An alternative way to define a new feature is offered in batch loading mode, the mode for loading data from a project-definition file. In the project-definition file, the user is allowed to add new columns that are not listed in the sections below, provided that the names of those columns start with the token “[user_added]”. The token is not considered part of the column name and is removed before saving into the database.
Users are allowed to define three types of features: “string”, “integer” and “float”. When a user-added feature is used for the first time, WebArrayDB determines its type from the first occurrence of its value, which is a character string. There are two ways to tell a value’s type: implicit or explicit.
The implicit way for determination of value type:
The explicit way will use a prefix for definition, “[int]” or “[integer]” for integer, “[float]” for float and “[str]” or “[string]” for character string. The prefix is case-insensitive. A value is:
Any value that doesn’t meet the above criteria will be treated as a character string. An integer string will be converted to an integer in the database, a float string will be converted to a double-precision float in the database, and a character string will be saved as a string in the database.
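The explicit prefixes described above can be illustrated with a small parser. This is only a sketch of the idea, not WebArrayDB's own code: the function name is hypothetical, the implicit rules are not modelled, and returning the raw text when a prefixed value cannot be converted is an assumption.

```python
import re

# Explicit type prefixes, matched case-insensitively as described above.
PREFIX = re.compile(r"^\[(int|integer|float|str|string)\](.*)$",
                    re.IGNORECASE | re.DOTALL)

def parse_explicit_value(text):
    """Convert a prefixed value; anything that does not fit stays a string."""
    match = PREFIX.match(text)
    if not match:
        return text  # no explicit prefix: implicit rules are not modelled here
    kind, body = match.group(1).lower(), match.group(2).strip()
    try:
        if kind in ("int", "integer"):
            return int(body)
        if kind == "float":
            return float(body)
    except ValueError:
        return text  # assumption: unparsable values fall back to the raw string
    return body  # [str] or [string]

for example in ("[INT]42", "[float]3.5", "[str]42", "[int]abc", "plain"):
    print(repr(parse_explicit_value(example)))
```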
* IMPORTANT: Since the semicolon “;” is used to separate keywords or user-defined features, there is no way to include the semicolon character in a keyword or in the value of a user-defined column!
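The consequence of the semicolon rule is easy to see in a small splitting routine. The sketch below assumes that a keyword cell holds semicolon-separated keywords and that a user-defined feature cell holds semicolon-separated “Name = Value” pairs; the function names and the whitespace trimming are assumptions of this sketch.

```python
def split_keywords(cell):
    """Split a keyword cell on semicolons, dropping empty entries."""
    return [part.strip() for part in cell.split(";") if part.strip()]

def split_user_features(cell):
    """Split 'Name = Value; Name2 = Value2' into a dict of strings."""
    features = {}
    for part in split_keywords(cell):
        name, separator, value = part.partition("=")
        if not separator:
            raise ValueError(f"missing '=' in {part!r}")
        features[name.strip()] = value.strip()
    return features

print(split_keywords("aCGH; breast cancer;cell line"))
print(split_user_features("passage = 12; medium = RPMI 1640"))
```

Because the split happens on every semicolon, a value such as "a;b" would always be read as two separate entries, which is why the character cannot appear inside a keyword or value.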
name | factors | tissue | design | QC | description | authors | journal | publish_year | pubmed_id | data_link | user_name | release_date | related_files/ | keyword | user_added_cols |
This table contains project information, one line per project. Usually it should be present if the user is inputting array data (hybridization data).
IMPORTANT: If the project file is written for adding data to an existing project, this part (the “project” table) should be blank, while the other parts (including “array”, “sample”, “platform”, and “protocol”) should have a column “project” listing the names of the projects to which they belong.
name+ | organism | tissue | individual_id | gender | age | description | keyword | user_added_cols |
This table is needed if samples are used that are not yet available in WebArrayDB. Each sample should have one line of data. Samples can be defined in the [array] section too.
name+ | category | description | keyword | user_added_cols |
This table should be present only if new protocols are introduced. Protocols can be defined in the [array] section too.
name+ | probe_file+ | category+ | probe_num+ | replicate+ | space+ | manufacturer | organism | description | availability | keyword | user_added_cols |
This table provides platform information.
identifier | platform+ | channel_num+ | intensity_file+/ | intensity_format+ | hyb_date | protocol_hyb | protocol_image | protocol_data | data_type | description | sample(_chN)+ | organism(_chN) | tissue(_chN) | individual_id(_chN) | gender(_chN) | age(_chN) | description(_chN) | dye(_chN) | protocol_process(_chN) | protocol_tech(_chN) | protocol_label(_chN) | exp_factor(_chN) | image_file(_chN)/ | image_format(_chN) | keyword | user_added_cols |
The following columns are for channels:
Important notice for assigning channel number (chN):
For two-color intensity files in “genepix” format (and maybe some other formats too), the rank number of a channel is determined by the wavelength of the laser used for this channel; normally the channel number should increase with laser wavelength!
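The ordering rule above amounts to sorting channels by laser wavelength. The sketch below assigns channel numbers that way; the function name is hypothetical, and the 532 nm and 635 nm values are commonly used laser wavelengths for Cy3 and Cy5 that should be checked against the actual scanner settings.

```python
def assign_channels(dye_wavelengths):
    """Map each dye to a channel number, shortest laser wavelength first."""
    ordered = sorted(dye_wavelengths.items(), key=lambda item: item[1])
    return {dye: number for number, (dye, _) in enumerate(ordered, start=1)}

# Commonly used laser wavelengths in nm (an assumption; check the scanner).
print(assign_channels({"Cy5": 635, "Cy3": 532}))
```

With these values Cy3 becomes channel 1 and Cy5 becomes channel 2, so the sample hybridized with Cy3 would go in the sample_ch1 column.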
This table describes information related to microarray slides or chips, one line for each slide/chip.
Simple protocols or samples (without keywords or user_added_cols items) can be defined in the [array] section.
A new protocol can be defined directly in the corresponding protocol cell, using the format:
[new_protocol_name]:protocol definitions
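A protocol cell in this format can be separated into its name and definition as sketched below. The sketch assumes that the square brackets are typed literally around the new protocol name and that a cell without them refers to an existing protocol by name; both points, and the function name, are assumptions.

```python
import re

# "[name]:definition", with the brackets assumed to be literal characters.
NEW_PROTOCOL = re.compile(r"^\[([^\]]+)\]:(.*)$", re.DOTALL)

def parse_protocol_cell(cell):
    """Return (name, definition); definition is None for an existing protocol."""
    match = NEW_PROTOCOL.match(cell.strip())
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return cell.strip(), None

print(parse_protocol_cell("[Hyb 16h]:Hybridize at 42C for 16 hours"))
print(parse_protocol_cell("Standard hyb"))
```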
For a new sample, it is even simpler: just fill in one or more of the columns that can be found in the [sample] section, like organism(_chN), tissue(_chN), gender(_chN), etc. Since a new sample definition is produced in this way, don’t fill in anything in these columns except the sample(name) column if it is not a new sample!
Here is an example of input.txt.
Optional file.
Note: project annotation files can be in any format.
The project annotation file contains additional information for a certain project. Although WebArrayDB places no restriction on its format, plain text format is recommended.
Optional file.
Note: probe-mapping files MUST be in TAB-delimited ASCII format.
This file can be used to map probes across different platforms. Four columns are required:
Platform_A | unique_id_A | Platform_B | unique_id_B |
Alignment files downloaded from Affymetrix can be used directly to map probes between different types of GeneChip. In such cases, the four columns in the files: “A Array Name”, “A Probe Set Name”, “B Array Name” and “B Probe Set Name” will be used as the four required columns correspondingly.
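The column correspondence described above can be written out explicitly. The sketch below reads a TAB-delimited alignment file with the Affymetrix column names and returns rows keyed by the four required names; the function name is hypothetical and the single data row is illustrative only.

```python
import csv
import io

# Affymetrix alignment columns and the required columns they correspond to.
AFFY_TO_REQUIRED = {
    "A Array Name": "Platform_A",
    "A Probe Set Name": "unique_id_A",
    "B Array Name": "Platform_B",
    "B Probe Set Name": "unique_id_B",
}
REQUIRED = ("Platform_A", "unique_id_A", "Platform_B", "unique_id_B")

def read_probe_mapping(handle):
    """Return mapping rows keyed by the four required column names."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        renamed = {AFFY_TO_REQUIRED.get(key, key): value
                   for key, value in row.items()}
        # A missing required column raises KeyError here.
        rows.append({name: renamed[name] for name in REQUIRED})
    return rows

# Illustrative content only.
example = io.StringIO(
    "A Array Name\tA Probe Set Name\tB Array Name\tB Probe Set Name\n"
    "HG-U133A\t1007_s_at\tHG-U133_Plus_2\t1007_s_at\n"
)
mapping = read_probe_mapping(example)
print(mapping)
```

Files that already use the four required column names pass through unchanged, since unknown column names are left as they are before the required ones are selected.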
Alignment by files usually provides a more comprehensive and more reliable map of probes. For this reason, when users choose the option “automatic” to match probes, WebArrayDB uses such alignments if all platforms involved have already been aligned; otherwise, all available reference columns are used for alignment.