Oracle® Data Mining Application Developer's Guide,
10g Release 2 (10.2)

Part Number B14340-01

9 Sequence Matching and Annotation (BLAST)

This chapter describes table functions included with ODM that permit you to perform similarity searches against nucleotide and amino acid sequence data stored in an Oracle database. You can use the table functions described in this chapter for ad hoc searches or you can embed them in applications. The inclusion of these table functions in ODM positions Oracle as a platform for bioinformatics.

This chapter discusses the following topics:

  • NCBI BLAST

  • Using ODM BLAST

  • Summary of BLAST Table Functions

9.1 NCBI BLAST

The National Center for Biotechnology Information (NCBI) developed one of the commonly used versions of the Basic Local Alignment Search Tool (BLAST).

Sequence alignments provide a way to compare new sequences with previously characterized sequences. Both functional and evolutionary information can be inferred from well-designed queries and alignments. BLAST provides a method for searching both nucleotide and protein databases. Because the BLAST algorithm detects local alignments, it can detect regions of similarity embedded in otherwise unrelated sequences.
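To illustrate what a local alignment is, the following Python sketch scores the best local alignment between two short sequences. It uses the Smith-Waterman dynamic programming method, not the faster BLAST heuristic, and it assumes a simple linear gap penalty. The function name, the sample sequences, and the gap value are illustrative assumptions; the match and mismatch values are borrowed from the BLASTN_MATCH defaults.

```python
def local_alignment_score(a, b, match=1, mismatch=-3, gap=-5):
    """Return the best Smith-Waterman local alignment score of a and b.

    Assumption: a linear gap penalty is used for simplicity.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A score never drops below 0, so an alignment can start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# The region GATTACA is embedded in otherwise unrelated flanking sequence.
query = "CCCCGATTACACCCC"
target = "TTTTGATTACATTTT"
print(local_alignment_score(query, target))
```

Because scores are clamped at zero, the unrelated flanking regions do not reduce the score of the shared region.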

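The BLAST algorithm searches nucleotide and amino acid query sequences against databases of nucleotide and amino acid sequences. Based on the nature of the query and the database sequences, NCBI BLAST provides the following variants:

  • blastn compares a nucleotide query sequence against a nucleotide sequence database.

  • blastp compares an amino acid query sequence against a protein sequence database.

  • blastx compares the six-frame translation of a nucleotide query sequence against a protein sequence database.

  • tblastn compares a protein query sequence against the six-frame translation of a nucleotide sequence database.

  • tblastx compares the six-frame translation of a nucleotide query sequence against the six-frame translation of a nucleotide sequence database.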

For more information about NCBI BLAST, see the NCBI BLAST Home Page at

http://www.ncbi.nlm.nih.gov/BLAST

The table functions described in this chapter implement some of the variants of NCBI BLAST version 2.0.

9.2 Using ODM BLAST

This section contains several examples of using the ODM BLAST table functions to perform searches on nucleotide or amino acid sequences.

Most table function parameters have defaults. The defaults were carefully chosen so that users who have limited experience with BLAST should obtain good results.

9.2.1 Using BLASTN_MATCH to Search DNA Sequences

The BLAST table functions accept the CLOB (Character Large OBject) data type as the query sequence. It is not possible to construct a CLOB in an ad hoc SQL query. One way to construct a CLOB is to create a table and insert the query sequence into the table. Another option, if the BLAST query is part of a larger program, is to construct a CLOB using the programmatic interface. Suppose that the table query_db has the schema (sequence_id VARCHAR2(32), sequence CLOB). The following SQL statement inserts the query sequence into query_db:

INSERT INTO query_db VALUES ('1', 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');

Suppose that the table GENE_DB stores DNA sequences. Suppose that GENE_DB has attributes seq_id, publication date, modification date, organism, and sequence, among other attributes. There is no required schema for the table that stores the sequences. The only requirement is that the table contain an identifier and the sequence and any number of other optional attributes.

The portion of the database to be used for the search can be specified using SQL. The full power of SQL can be used to perform sophisticated selections.

9.2.1.1 Searching for Good Matches in DNA Sequences

The following query does a BLAST search of the given query sequence against the human genome and returns the seq_id, score, and expect value of matches that score > 25:

SELECT t.t_seq_id, t.score, t.expect
FROM TABLE (
    BLASTN_MATCH (
       (SELECT sequence FROM query_db WHERE sequence_id = '1'),
       CURSOR (SELECT seq_id, sequence FROM GENE_DB
        WHERE organism = 'human'),
       1,
       -1,
       0,
       0,
       10,
       0,
       0,
       0,
       0,
       11,
       0,
       0)
) t WHERE t.score > 25;

Note: The parameter value of 0 invokes the default values in most cases. See the syntax for details.

9.2.1.2 Searching DNA Sequences Published After a Certain Date

The following query does the BLAST search against all sequences published after Jan 01, 2000:

SELECT t.t_seq_id, t.score, t.expect
FROM TABLE (
    BLASTN_MATCH (
       (SELECT sequence FROM query_db WHERE sequence_id = '1'),
       CURSOR (SELECT seq_id, sequence FROM GENE_DB
        WHERE publication_date > '01-JAN-2000'),
       1,
       -1,
       0,
       0,
       10,
       0,
       0,
       0,
       0,
       11,
       0,
       0)
) t WHERE t.score > 25;

You can obtain other attributes of the matching sequence by joining the BLAST result with the original sequence table as follows:

SELECT t.t_seq_id, t.score, t.expect, g.publication_date, g.organism
FROM GENE_DB g, TABLE (
    BLASTN_MATCH (
       (SELECT sequence FROM query_db WHERE sequence_id = '1'),
       CURSOR (SELECT seq_id, sequence FROM GENE_DB
        WHERE publication_date > '01-JAN-2000'),
       1,
       -1,
       0,
       0,
       10,
       0,
       0,
       0,
       0,
       11,
       0,
       0)
) t WHERE t.t_seq_id = g.seq_id AND t.score > 25;

9.2.2 Using BLASTP_MATCH to Search Protein Sequences

Suppose that the table PROT_DB stores protein sequences. Insert the protein query sequence to be used for the search into query_db.

9.2.2.1 Searching for Good Matches in Protein Sequences

The following query does a BLASTP search of the given query sequence against protein sequences in PROT_DB and returns the identifier, score, name, and expect value of matches that score > 25:

SELECT t.t_seq_id, t.score, t.expect, p.name
FROM PROT_DB p, TABLE(
       BLASTP_MATCH (
         (SELECT sequence FROM query_db WHERE sequence_id = '2'),
         CURSOR(SELECT seq_id, sequence FROM PROT_DB),
         1,
         -1,
         0,
         0,
         'BLOSUM62',
         10,
         0,
         0,
         0,
         0,
         0)
       )t WHERE t.t_seq_id = p.seq_id AND t.score > 25
          ORDER BY t.expect;

9.2.3 Using BLASTN_ALIGN to Search and Align DNA Sequences

Suppose that the table GENE_DB stores DNA sequences. Suppose that GENE_DB has attributes seq_id, publication date, modification date, organism, and sequence among other attributes.

9.2.3.1 Searching and Aligning for Good Matches in DNA Sequences

The following query does a BLAST search and alignment of the given query sequence against the sequences in GENE_DB published after Jan 01, 2000. It returns the publication_date, organism, and the alignment attributes of the matching sequences that score > 25 and where more than 50% of the sequence is conserved in the match:

SELECT t.t_seq_id, t.alignment_length, t.pct_identity,
       t.q_seq_start, t.q_seq_end, t.t_seq_start, t.t_seq_end,
       t.score, t.expect, g.publication_date, g.organism
FROM GENE_DB g, TABLE (
    BLASTN_ALIGN (
       (SELECT sequence FROM query_db WHERE sequence_id = '1'),
       CURSOR (SELECT seq_id, sequence FROM GENE_DB
        WHERE publication_date > '01-JAN-2000'),
       1,
       -1,
       0,
       0,
       10,
       0,
       0,
       0,
       0,
       11,
       0,
       0)
) t WHERE t.t_seq_id = g.seq_id AND t.score > 25
    AND t.pct_identity > 50;

You can use BLASTP_ALIGN and TBLAST_ALIGN in a similar way.
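The filter on pct_identity keeps alignments in which more than half of the aligned positions agree. The following Python sketch shows one common way to compute such a percentage from a pair of aligned strings. It assumes that the percentage is the number of identical positions divided by the alignment length, with "-" marking a gap; the exact definition used by the table functions is not stated here, and the function name and sample strings are illustrative.

```python
def percent_identity(aligned_query, aligned_target):
    """Return identical positions as a percentage of the alignment length.

    Assumption: both strings are already aligned and use '-' for gaps.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        return 0.0
    identical = sum(
        1 for q, t in zip(aligned_query, aligned_target) if q == t and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


# 7 of the 9 aligned positions are identical.
print(round(percent_identity("AGCT-TTCA", "AGCTATTGA"), 2))
```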

9.2.4 Output of BLAST Queries

The output of a BLAST query is a table. The columns of the output table are described in the output parameters table for the table function used in the query.

Here are two examples of queries and the resulting output tables.

Query 1 is as follows:

select T_SEQ_ID AS seq_id, score, EXPECT as evalue
  from TABLE(
       BLASTP_MATCH (
         (select sequence from query_db),
         CURSOR(SELECT seq_id, seq_data
                FROM swissprot
                WHERE organism = 'Homo sapiens (Human)'),
         1,
         -1,
         0,
         0,
         'BLOSUM62',
         10,
         0,
         0,
         0,
         0,
         0)
       ); 

The output for Query 1 is as follows:

SEQ_ID        SCORE     EVALUE
-------- ----------     ----------
P31946          205     5.8977E-18
Q04917          198     3.8228E-17
P31947          169     8.8130E-14
P27348          198     3.8228E-17
P58107           49     7.24297332

Query 2 is as follows:

select T_SEQ_ID AS seq_id, ALIGNMENT_LENGTH as len,
       Q_SEQ_START as q_strt, Q_SEQ_END as q_end, Q_FRAME, T_SEQ_START as t_strt,
       T_SEQ_END as t_end, T_FRAME, score, EXPECT as evalue
  from TABLE(
       BLASTP_ALIGN (
         (select sequence from query_db),
         CURSOR(SELECT seq_id, seq_data
                FROM swissprot
                WHERE organism = 'Homo sapiens (Human)' AND
                      creation_date > '01-Jan-90'),
         1,
         -1,
         0,
         0,
         'BLOSUM62',
         10,
         0,
         0,
         0,
         0,
         0)
       );  

The output for Query 2 is as follows:

SEQ_ID    LEN Q_STRT Q_END Q_FRAME T_STRT T_END T_FRAME   SCORE  EVALUE
-------- ---- ------ ----- ------- ------ ----- ------- ------- ----------
P31946     50      0    50       0     13    63       0     205  5.1694E-18
Q04917     50      0    50       0     12    62       0     198  3.3507E-17
P31947     50      0    50       0     12    62       0     169  7.7247E-14
P27348     50      0    50       0     12    62       0     198  3.3507E-17
P58107     21     30    51       0    792   813       0      49  6.34857645

9.2.5 Using BLASTN_COMPRESS to Improve Search Performance

If you perform frequent BLAST searches on nucleotide sequences, performance improves significantly when the data set of sequences is transformed into a compressed binary format, and the compressed data is used in the searches. The BLASTN_COMPRESS() function transforms a nucleotide data set represented as CLOBs into compressed binary format represented as BLOBs.
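To give a sense of why a binary representation is more compact, the following Python sketch packs four nucleotides into each byte by using two bits for each base. This is only an illustration of the general idea. The actual format produced by BLASTN_COMPRESS is not described in this chapter, and the function names and the bit layout are assumptions. The sketch handles only the bases A, C, G, and T.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a nucleotide string into bytes, 2 bits per base (illustrative)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from data produced by pack()."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


seq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT"
packed = pack(seq)
print(len(seq), "characters ->", len(packed), "bytes")
```

The original length must be kept alongside the packed bytes, because the last byte can contain unused bits.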

9.2.5.1 Compress Sequences

Suppose that the table GENE_DB contains DNA sequences upon which you will perform frequent searches. Suppose that GENE_DB has attributes (seq_id, publication date, modification date, organism, sequence) among other attributes. The following query stores all human DNA sequences in compressed binary format, in the table COMPRESSED_HUMAN_GENES.

create table COMPRESSED_HUMAN_GENES as
select seq_id, seq_data
from Table(BLASTN_COMPRESS (
 cursor(select seq_id, sequence
 from GENE_DB
 where organism = 'human')));

The portion of the database to be compressed can be specified using SQL. The full power of SQL can be used to perform more sophisticated selections involving joins.

9.2.5.2 Passing a Compressed Sequence to a BLAST Function

The compressed sequences can be directly passed to BLAST match and align functions as shown in the following example.

select t.t_seq_id, t.alignment_length, t.pct_identity,
       t.q_seq_start, t.q_seq_end, t.t_seq_start, t.t_seq_end,
       t.score, t.expect, g.publication_date, g.organism
from GENE_DB g, Table(BLASTN_ALIGN (
       (select sequence from query_db where sequence_id = '1'),
       seqdb_cursor => cursor(select seq_id, seq_data
                              from Table(BLASTN_COMPRESS (
                                     cursor(select seq_id, sequence
                                            from GENE_DB
                                            where organism = 'human')))),
       expect_value => 5,
       word_size => 12)) t
where t.t_seq_id = g.seq_id
AND t.score > 25
AND t.pct_identity > 50;

9.2.6 Sample Data for BLAST

We provide a few sample data sets and queries that can be used to verify that the BLAST functions work correctly after ODM is installed.

The DM_USER schema contains the following sequence data tables:

  • SWISSPROT

  • PROT_DB

  • ECOLI10

9.2.6.1 SWISSPROT Table

The SWISSPROT table contains the sequences in Release 40 of the SwissProt database. This table has the sequence identifier, creation_date, organism, and sequence_data attributes. It has 101,602 protein sequences.

SQL> describe SWISSPROT;
Name                                    Null?    Type
--------------------------------------- -------  -------------
SEQ_ID                                          VARCHAR2(32)
CREATION_DATE                                   DATE
ORGANISM                                        VARCHAR2(256)
SEQ_DATA                                        CLOB

9.2.6.2 PROT_DB Table

The PROT_DB table consists of 19 protein sequences from Release 40 of the SwissProt data set.

SQL> describe prot_db;
Name                                     Null?    Type
---------------------------------------- -------  -------------
SEQ_ID                                            VARCHAR2(32)
SEQ_DATA                                          CLOB

9.2.6.3 ECOLI10 Table

The ECOLI10 table contains 10 nucleotide sequences from the Escherichia coli data set.

SQL> describe ECOLI10;
Name                                      Null?    Type
----------------------------------------- -------- ---------------
SEQ_ID                                             VARCHAR2(32)
SEQ_DATA                                           CLOB

9.2.6.4 Genetic Codes and Names

Table 9-1 lists genetic codes and associated names.

Table 9-1 Table of Genetic Codes

Genetic Code  Name
------------  ------------------------------------------------------------
1             Standard
2             Vertebrate Mitochondrial
3             Yeast Mitochondrial
4             Mold Mitochondrial, Protozoan Mitochondrial, Coelenterate
              Mitochondrial, Mycoplasma, Spiroplasma
5             Invertebrate Mitochondrial
6             Ciliate Nuclear, Dasycladacean Nuclear, Hexamita Nuclear
9             Echinoderm Mitochondrial
10            Euplotid Nuclear
11            Bacterial and Plant Plastid
12            Alternative Yeast Nuclear
13            Ascidian Mitochondrial
14            Flatworm Mitochondrial
15            Blepharisma Macronuclear
16            Chlorophycean Mitochondrial
21            Trematode Mitochondrial
22            Scenedesmus Obliquus Mitochondrial
23            Thraustochytrium Mitochondrial Code
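A genetic code maps each three-base codon to an amino acid. The following Python sketch translates a short nucleotide sequence using genetic codes 1 and 2 from the table above. The amino acid strings follow the NCBI convention of listing codons in TCAG order. The function name and the sample sequence are illustrative assumptions, and only two of the codes are included.

```python
BASES = "TCAG"
# All 64 codons in the conventional order TTT, TTC, TTA, TTG, TCT, ...
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

GENETIC_CODES = {
    # 1: Standard
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    # 2: Vertebrate Mitochondrial
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def translate(dna, genetic_code=1):
    """Translate dna in reading frame 1; '*' marks a stop codon."""
    table = dict(zip(CODONS, GENETIC_CODES[genetic_code]))
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


# TGA is a stop codon in code 1 but encodes W in code 2;
# ATA encodes I in code 1 but M in code 2.
print(translate("ATGTGAATA", 1))
print(translate("ATGTGAATA", 2))
```

The same nucleotide sequence yields different protein sequences under different genetic codes, which is why translated searches take a genetic_code parameter.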


9.2.6.5 Sequence Databases

There are several public domain sequence databases available. One of them is the SwissProt database, which is a highly curated collection of protein sequences. SwissProt has recently been combined with other databases to create UniProt. The last release of the SwissProt database can be downloaded from

ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/sprot45.dat

In addition to the raw sequence data, the SwissProt database contains several other attributes of the sequence including organism, date published, date modified, published literature references, annotations, and so on. BLAST requires only the sequence identifier and the sequence data to be stored to perform searches.

Depending on the needs of your specific application, different sets of these attributes may be important. Therefore, the database schema required to store the data needs to be appropriately designed. You can use a scripting language to parse the required fields from the SwissProt data and format the fields so that they can be loaded into an Oracle database.

The following Perl script outputs the sequence identifier, creation_date, organism, and sequence data in the required format for SQL*Loader. (SQL*Loader is the utility that loads data into an Oracle database; it is described in detail in Oracle Database Utilities.)

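#!/bin/perl
# swissprot.pl < input > output
# Input: protein database as provided by SwissProt
# Output: one line for each entry, in the form
#         seq_id|creation_date|organism|sequence
use strict;
use warnings;

my $sq = 0;    # 1 while reading the sequence lines that follow an SQ line
my $ac = 0;    # 1 after the first accession number of an entry is printed
my $dt = 0;    # 1 after the first (creation) date of an entry is printed

while (<>)
{
    # "//" ends an entry: finish the output line and reset the state.
    if ( /^\/\// ) {
        print "\n";
        $sq = 0;
        $ac = 0;
        next;
    }
    # Sequence lines: print the residues without the embedded blanks.
    if ($sq == 1) {
        print join("", split);
        next;
    }
    # AC line: print only the first accession number of the entry.
    if ( /^AC\s+(\w+);/ ) {
        if ($ac == 0) {
            print "$1|";
            $sq = 0;
            $dt = 0;
            $ac = 1;
            next;
        }
    }
    # OS line: print the organism name without the trailing period.
    if ( /^OS\s+(.*)\./ ) {
        print "$1|";
        next;
    }
    # DT line: print only the first date, which is the creation date.
    if ( /^DT\s+(\S+)/ ) {
        if ($dt == 0) {
            print "$1|";
            $dt = 1;
        }
    }
    # SQ line: the lines that follow contain the sequence data.
    if ( /^SQ\s+/ ) {
        $sq = 1;
        next;
    }
}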
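For readers who prefer Python, the following sketch performs a similar transformation and can be checked against a small inline sample. It is not a line-for-line port of the Perl script: it keeps only the first AC, DT, and OS value of each entry and always writes the fields in the order seq_id, creation_date, organism, sequence. The sample entry is made up for illustration and is not a real SwissProt record.

```python
import re

PATTERNS = {
    "AC": re.compile(r"^AC\s+(\w+);"),   # first accession number
    "DT": re.compile(r"^DT\s+(\S+)"),    # first date (creation date)
    "OS": re.compile(r"^OS\s+(.*)\."),   # organism, without trailing period
}


def format_entries(lines):
    """Yield 'seq_id|creation_date|organism|sequence' for each entry."""
    fields, seq, in_seq = {}, [], False
    for line in lines:
        if line.startswith("//"):        # end of entry
            yield "|".join([fields.get("AC", ""), fields.get("DT", ""),
                            fields.get("OS", ""), "".join(seq)])
            fields, seq, in_seq = {}, [], False
        elif in_seq:                     # sequence data, blanks removed
            seq.append("".join(line.split()))
        elif line.startswith("SQ"):      # sequence data starts on next line
            in_seq = True
        else:
            for key, pattern in PATTERNS.items():
                m = pattern.match(line)
                if m and key not in fields:
                    fields[key] = m.group(1)


# A made-up entry in the flat-file layout that the script expects.
SAMPLE = """\
ID   TOY_HUMAN      STANDARD;      PRT;    12 AA.
AC   X00001;
DT   01-JAN-2000 (Rel. 40, Created)
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
""".splitlines()

for row in format_entries(SAMPLE):
    print(row)
```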

9.2.6.6 Loading Sequences into an Oracle Database

Follow these steps to download, parse, and save the SwissProt data in an Oracle database:

  1. Download SwissProt data to the file sprot45.dat.

  2. Save the Perl script in a file named swissprot.pl, and then type the command

    perl swissprot.pl sprot45.dat > sprot_formatted.txt

    This command reads the SwissProt data stored in sprot45.dat, formats it, and writes it to sprot_formatted.txt.

  3. In order to load the data using SQL*Loader, you must create a table to hold the data and a control file. Create the table swissprot using the following SQL statement:

    create table swissprot (SEQ_ID VARCHAR2(32), CREATION_DATE DATE,
    ORGANISM VARCHAR2(256), SEQ_DATA CLOB);
    
    
  4. Create a control file named sprot.ctl with the following contents:

    LOAD DATA
    INFILE sprot_formatted.txt
    INTO TABLE swissprot
    REPLACE
    FIELDS TERMINATED BY '|'
    TRAILING NULLCOLS
    (
    seq_id,
    creation_date,
    organism,
    seq_data char(100000)
    )
    
  5. Finally, load the data:

    sqlldr userid=<user_name>/<passwd> control=sprot.ctl log=sprot.log
    direct=TRUE data=sprot_formatted.txt

    The SwissProt data is now stored in the Oracle table swissprot.
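Before running SQL*Loader, it can help to check that each formatted line fits the table definition. The following Python sketch is one possible check. The length limits come from the CREATE TABLE statement and the control file in the steps above. The function name is hypothetical, and the date format DD-MON-YYYY is an assumption based on the DT lines of the SwissProt flat file.

```python
from datetime import datetime


def check_line(line, max_seq_len=100000):
    """Return None if the line looks loadable, else a short description."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 4:
        return "expected 4 fields, found %d" % len(parts)
    seq_id, created, organism, seq = parts
    if not seq_id or len(seq_id) > 32:          # SEQ_ID VARCHAR2(32)
        return "seq_id is empty or longer than 32 characters"
    try:
        datetime.strptime(created, "%d-%b-%Y")  # assumed DD-MON-YYYY
    except ValueError:
        return "creation_date is not in DD-MON-YYYY form"
    if len(organism) > 256:                     # ORGANISM VARCHAR2(256)
        return "organism is longer than 256 characters"
    if len(seq) > max_seq_len:                  # seq_data char(100000)
        return "sequence is longer than %d characters" % max_seq_len
    return None


print(check_line("X00001|01-JAN-2000|Homo sapiens (Human)|MKTAYIAKQRQI"))
print(check_line("X00001|Homo sapiens (Human)|MKTAYIAKQRQI"))
```

Lines that fail the check can be reviewed by hand before loading, rather than searched for afterward in sprot.log.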


Summary of BLAST Table Functions

The BLAST functionality is available as table functions; these table functions can be used in the FROM clause of a SQL query.

Table 9-2 BLAST Table Functions

Table Function    Description
---------------   ------------------------------------------------------------
BLASTN_COMPRESS   Compress nucleotide sequence data to improve performance of sequence searches.
BLASTN_MATCH      Perform a search of the given nucleotide sequence against the selected portion of the nucleotide database.
BLASTP_MATCH      Perform a search of the given amino acid sequence against the selected portion of the protein database.
TBLAST_MATCH      Perform a search involving translations of either the query sequence or the database of sequences.
BLASTN_ALIGN      Perform an alignment of the given nucleotide sequence against the selected portion of the nucleotide database.
BLASTP_ALIGN      Perform an alignment of the given amino acid sequence against the selected portion of the protein database.
TBLAST_ALIGN      Perform alignments involving translations of either the query sequence or the database of sequences.



BLASTN_COMPRESS Table Function

This table function compresses nucleotide sequence data. It takes as input a cursor of sequence identifier and sequence data represented as a CLOB and returns the sequence identifier and a BLOB representing the sequence data in compressed binary format. The result of BLASTN_COMPRESS can be either materialized in a table for future use or passed into the BLAST search functions that accept nucleotide sequence data.

Syntax

function BLASTN_COMPRESS (
  sequence_cursor REF CURSOR)
  return table of row (seq_id VARCHAR2, seq_data BLOB);

Parameters

Table 9-3 describes the input parameters for BLASTN_COMPRESS; Table 9-4, the output parameters.

Table 9-3 Input Parameters for BLASTN_COMPRESS Table Function

Parameter Description

sequence_cursor

The cursor of the sequences to be compressed. The cursor has two columns: the sequence identifier and the sequence string.


Table 9-4 Output Parameters for BLASTN_COMPRESS Table Function

Attribute Description

seq_id

The sequence identifier of the sequence. The value returned is the same as the sequence identifier in the input cursor.

seq_data

The compressed sequence represented as a BLOB.



BLASTN_MATCH Table Function

This table function performs a BLASTN search of the given nucleotide sequence against the selected portion of the nucleotide database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are listed in the following section. The match returns the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.

Syntax

function BLASTN_MATCH (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 5,
  extend_gap_cost NUMBER default 2,
  mismatch_cost NUMBER default -3,
  match_reward NUMBER default 1,
  word_size NUMBER default 11,
  xdropoff NUMBER default 30,
  final_x_dropoff NUMBER default 50)
  return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);

Parameters

Table 9-5 describes the input parameters for BLASTN_MATCH; Table 9-6, the output parameters.

Table 9-5 Input Parameters for BLASTN_MATCH Table Function

Parameter Description

query_seq

The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.

seqdb_cursor

The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string.

subsequence_from

Start position of a region of the query sequence to be used for the search. The default is 1.

subsequence_to

End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.

filter_low_complexity

TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default is FALSE.

mask_lower_case

TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default is FALSE.

expect_value

The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.

open_gap_cost

The cost of opening a gap. The default value is 5. Specifying 0 invokes default behavior.

extend_gap_cost

The cost of extending a gap. The default value is 2. Specifying 0 invokes default behavior.

mismatch_cost

The penalty for nucleotide mismatch. The default value is -3. Specifying 0 invokes default behavior.

match_reward

The reward for a nucleotide match. The default value is 1. Specifying 0 invokes default behavior.

word_size

The word size used for dividing the query sequence into subsequences during the search. The default value is 11. Specifying 0 invokes default behavior.

xdropoff

Dropoff for BLAST extensions in bits. The default value is 30. Specifying 0 invokes default behavior.

final_x_dropoff

The final X dropoff value for gapped alignments in bits. The default value is 50. Specifying 0 invokes default behavior.
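The examples in this chapter pass 0 for most numeric parameters, relying on the rule that 0 invokes the default. The following Python sketch shows the effective values that result under that rule. The default values are taken from the BLASTN_MATCH syntax; the function name is hypothetical, and the sketch models only the documented rule, not the actual implementation.

```python
BLASTN_DEFAULTS = {
    "expect_value": 10,
    "open_gap_cost": 5,
    "extend_gap_cost": 2,
    "mismatch_cost": -3,
    "match_reward": 1,
    "word_size": 11,
    "xdropoff": 30,
    "final_x_dropoff": 50,
}


def effective_params(**overrides):
    """Return the parameter values in effect; 0 means 'use the default'."""
    unknown = set(overrides) - set(BLASTN_DEFAULTS)
    if unknown:
        raise TypeError("unknown parameter(s): %s" % ", ".join(sorted(unknown)))
    params = dict(BLASTN_DEFAULTS)
    for name, value in overrides.items():
        if value != 0:
            params[name] = value
    return params


# expect_value is passed as 0, so the default of 10 applies.
params = effective_params(expect_value=0, word_size=12)
print(params["expect_value"], params["word_size"])
```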


Table 9-6 Output Parameters for BLASTN_MATCH Table Function

Attribute Description

t_seq_id

The sequence identifier of the returned match.

score

The score of the returned match.

expect

The expect value of the returned match.



BLASTP_MATCH Table Function

This table function performs a BLASTP search of the given amino acid sequence against the selected portion of the protein database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are listed in the following section. The match returns the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.

Syntax

function BLASTP_MATCH (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  sub_matrix VARCHAR2 default 'BLOSUM62',
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 11,
  extend_gap_cost NUMBER default 1,
  word_size NUMBER default 3,
  x_dropoff NUMBER default 15,
  final_x_dropoff NUMBER default 25)
  return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);

Parameters

Table 9-7 describes the input parameters for BLASTP_MATCH; Table 9-8, the output parameters.

Table 9-7 Input Parameters for BLASTP_MATCH Table Function

Parameter Description

query_seq

The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.

seqdb_cursor

The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string.

subsequence_from

Start position of a region of the query sequence to be used for the search. The default is 1.

subsequence_to

End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.

filter_low_complexity

TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default value is FALSE.

mask_lower_case

TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default value is FALSE.

sub_matrix

Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are PAM30, PAM70, BLOSUM80, BLOSUM62, and BLOSUM45. The default is BLOSUM62. See Table 9-9 for supported values of (open_gap_cost, extend_gap_cost) for each matrix.

expect_value

The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.

open_gap_cost

The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.

extend_gap_cost

The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.

word_size

The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.

x_dropoff

Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior.

final_x_dropoff

The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior.
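The word_size parameter controls how the query is divided into short subsequences (words) that seed the search. The following Python sketch lists the overlapping words of a sequence and finds positions where a query word occurs exactly in a target sequence. It is a simplification: it uses exact word matches only, and the function names and sample sequences are illustrative assumptions.

```python
def words(seq, word_size):
    """Return the overlapping words of length word_size in seq."""
    return [seq[i:i + word_size] for i in range(len(seq) - word_size + 1)]


def seed_hits(query, target, word_size):
    """Return (query_pos, target_pos) pairs, 0-based, of exact word matches."""
    index = {}
    for pos, word in enumerate(words(target, word_size)):
        index.setdefault(word, []).append(pos)
    return [
        (q_pos, t_pos)
        for q_pos, word in enumerate(words(query, word_size))
        for t_pos in index.get(word, [])
    ]


print(words("MKTAY", 3))
print(seed_hits("MKTAY", "AAKTAYG", 3))
```

A smaller word size produces more words and more seed hits, which tends to increase sensitivity at the cost of search time.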


Table 9-8 Output Parameters for BLASTP_MATCH Table Function

Attribute Description

t_seq_id

The sequence identifier of the returned match.

score

The score of the returned match.

expect

The expect value of the returned match.


For each substitution matrix (sub_matrix), only certain combinations of (open_gap_cost, extend_gap_cost) values are supported. Table 9-9 shows the supported combinations of values for each substitution matrix.

Table 9-9 Supported Combinations of (open_gap_cost, extend_gap_cost)

Substitution Matrix  Supported (open_gap_cost, extend_gap_cost) Values
-------------------  ------------------------------------------------------------
BLOSUM45             (13,3), (12,3), (11,3), (10,3), (16,2), (15,2), (14,2), (13,2), (12,2), (19,1), (18,1), (17,1), (16,1)
BLOSUM62             (11,2), (10,2), (9,2), (8,2), (7,2), (6,2), (13,1), (12,1), (11,1), (10,1), (9,1)
BLOSUM80             (25,2), (13,2), (9,2), (8,2), (7,2), (6,2), (11,1), (10,1), (9,1)
PAM30                (7,2), (6,2), (5,2), (10,1), (9,1), (8,1)
PAM70                (8,2), (7,2), (6,2), (11,1), (10,1), (9,1)
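Because only certain gap cost combinations are supported for each matrix, an application might validate its settings before it issues a query. The following Python sketch encodes Table 9-9 as a lookup. The function name is hypothetical, and how the table functions themselves respond to an unsupported combination is not described here.

```python
SUPPORTED_GAP_COSTS = {
    "BLOSUM45": {(13, 3), (12, 3), (11, 3), (10, 3), (16, 2), (15, 2),
                 (14, 2), (13, 2), (12, 2), (19, 1), (18, 1), (17, 1),
                 (16, 1)},
    "BLOSUM62": {(11, 2), (10, 2), (9, 2), (8, 2), (7, 2), (6, 2),
                 (13, 1), (12, 1), (11, 1), (10, 1), (9, 1)},
    "BLOSUM80": {(25, 2), (13, 2), (9, 2), (8, 2), (7, 2), (6, 2),
                 (11, 1), (10, 1), (9, 1)},
    "PAM30": {(7, 2), (6, 2), (5, 2), (10, 1), (9, 1), (8, 1)},
    "PAM70": {(8, 2), (7, 2), (6, 2), (11, 1), (10, 1), (9, 1)},
}


def is_supported(sub_matrix, open_gap_cost, extend_gap_cost):
    """Return True if the gap costs are listed for the matrix in Table 9-9."""
    allowed = SUPPORTED_GAP_COSTS.get(sub_matrix.upper())
    if allowed is None:
        raise ValueError("unknown substitution matrix: %s" % sub_matrix)
    return (open_gap_cost, extend_gap_cost) in allowed


# The defaults (11, 1) are valid for BLOSUM62 but not for PAM30.
print(is_supported("BLOSUM62", 11, 1))
print(is_supported("PAM30", 11, 1))
```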



TBLAST_MATCH Table Function

This table function performs BLAST searches involving translations of either the query sequence or the database of sequences. The available options are:

  • BLASTX

  • TBLASTN

  • TBLASTX

The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are listed in the following section. The match returns the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.

Syntax

function TBLAST_MATCH (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  translation_type VARCHAR2 default 'BLASTX',
  genetic_code NUMBER default 1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  sub_matrix VARCHAR2 default 'BLOSUM62',
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 11,
  extend_gap_cost NUMBER default 1,
  word_size NUMBER default 3,
  x_dropoff NUMBER default 15,
  final_x_dropoff NUMBER default 25)
  return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);

Parameters

Table 9-10 describes the input parameters for TBLAST_MATCH; Table 9-11, the output parameters.

Table 9-10 Input Parameters for TBLAST_MATCH Table Function

Parameter Description

query_seq

The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.

seqdb_cursor

The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string.

subsequence_from

Start position of a region of the query sequence to be used for the search. The default is 1.

subsequence_to

End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.

translation_type

Type of the translation involved. The options are BLASTX, TBLASTN, and TBLASTX. The default is BLASTX.

genetic_code

Used for translating nucleotide sequences to amino acid sequences. A genetic code acts as a mapping table from codons to amino acids. NCBI supports 17 different genetic codes. The supported genetic codes and their names are given in Table 9-1. The default genetic code is 1.

filter_low_complexity

TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default is FALSE.

mask_lower_case

TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default is FALSE.

sub_matrix

Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are PAM30, PAM70, BLOSUM80, BLOSUM62, and BLOSUM45. The default is BLOSUM62. See Table 9-9 for supported values of (open_gap_cost, extend_gap_cost) for each matrix.

expect_value

The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.

open_gap_cost

The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.

extend_gap_cost

The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.

word_size

The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.

x_dropoff

Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior.

final_x_dropoff

The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior.
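
The role of genetic_code as a codon-to-amino-acid mapping can be illustrated with a short Python sketch. It is not ODM code: the names STANDARD_CODE and translate are hypothetical, and only genetic code 1 (the standard code) is modeled. Codons that contain characters other than T, C, A, and G are reported as X, which is a choice made for this sketch.

```python
from itertools import product

# Standard genetic code (genetic code 1), with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, code: dict = STANDARD_CODE) -> str:
    """Translate a nucleotide string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    # Codons with characters outside T, C, A, G become 'X' (sketch-only choice).
    return "".join(code.get(dna[i:i + 3], "X") for i in range(0, usable, 3))
```

Supplying a different dictionary for the code argument would model one of the other genetic codes listed in Table 9-1.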


Table 9-11 Output Parameters for TBLAST_MATCH Table Function

Attribute Description

t_seq_id

The sequence identifier of the returned match.

score

The score of the returned match.

expect

The expect value of the returned match.
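
As an illustration of how a score relates to an expect value, the following Python sketch applies the standard relationship E = m * n * 2^(-S'), where S' is a bit score and m and n are the query and database lengths. This is a sketch under stated assumptions, not a description of ODM internals: whether the score column holds a raw score or a bit score is not specified here, effective-length corrections are ignored, and the function names are hypothetical.

```python
def expect_from_bit_score(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance matches scoring at least bit_score."""
    # Search space size (m * n) scaled down by 2 for every bit of score.
    return query_len * db_len * 2.0 ** (-bit_score)


def is_reported(expect: float, expect_value: float = 10.0) -> bool:
    """A match is kept when its expect value does not exceed the threshold."""
    return expect <= expect_value
```

A lower expect_value therefore reports fewer, more significant matches.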



BLASTN_ALIGN Table Function

This table function performs a BLASTN alignment of the given nucleotide sequence against the selected portion of the nucleotide database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are listed in the following section.

BLASTN_MATCH returns only the score and expect value of the match. It does not return information about the alignment. BLASTN_MATCH is typically used when a BLAST search will be followed by a more compute-intensive alignment, such as the Smith-Waterman alignment.

BLASTN_ALIGN does the BLAST alignment and returns the information about the alignment.
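
For readers unfamiliar with the Smith-Waterman alignment mentioned above, the following Python sketch computes the best local alignment score of two sequences. It is a minimal illustration, not part of ODM: the function name is hypothetical, and a simple linear gap penalty is assumed instead of the separate gap opening and extension costs used by BLAST.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -3, gap: int = -2) -> int:
    """Return the best local alignment score using a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below 0, which lets an alignment restart.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The cost grows with the product of the two sequence lengths, which is why a cheaper MATCH search is typically used first to narrow the set of candidate sequences.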

Syntax

function BLASTN_ALIGN (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 5,
  extend_gap_cost NUMBER default 2,
  mismatch_cost NUMBER default -3,
  match_reward NUMBER default 1,
  word_size NUMBER default 11,
  xdropoff NUMBER default 30,
  final_x_dropoff NUMBER default 50)
  return table of row ( 
    t_seq_id VARCHAR2,
    pct_identity NUMBER,
    alignment_length NUMBER,
    mismatches NUMBER,
    positives NUMBER,
    gap_openings NUMBER,
    gap_list [Table of NUMBER],
    q_seq_start NUMBER,
    q_frame NUMBER,
    q_seq_end NUMBER,
    t_seq_start NUMBER,
    t_seq_end NUMBER,
    t_frame NUMBER,   
    score NUMBER, 
    expect NUMBER);

Parameters

Table 9-12 describes the input parameters for BLASTN_ALIGN; Table 9-13, the output parameters.

Table 9-12 Input Parameters for BLASTN_ALIGN Table Function

Parameter Description

query_seq

The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.

seqdb_cursor

The cursor parameter supplied by the user when calling the function. Each row that it returns should contain two columns: the sequence identifier and the sequence string.

subsequence_from

Start position of a region of the query sequence to be used for the search. The default is 1.

subsequence_to

End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.

filter_low_complexity

TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default is FALSE.

mask_lower_case

TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default is FALSE.

expect_value

The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.

open_gap_cost

The cost of opening a gap. The default value is 5. Specifying 0 invokes default behavior.

extend_gap_cost

The cost of extending a gap. The default value is 2. Specifying 0 invokes default behavior.

mismatch_cost

The penalty for nucleotide mismatch. The default value is -3. Specifying 0 invokes default behavior.

match_reward

The reward for a nucleotide match. The default value is 1. Specifying 0 invokes default behavior.

word_size

The word size used for dividing the query sequence into subsequences during the search. The default value is 11. Specifying 0 invokes default behavior.

xdropoff

Dropoff for BLAST extensions in bits. The default value is 30. Specifying 0 invokes default behavior.

final_x_dropoff

The final X dropoff value for gapped alignments in bits. The default value is 50. Specifying 0 invokes default behavior.
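
The way match_reward, mismatch_cost, open_gap_cost, and extend_gap_cost combine into an alignment score can be sketched in Python as follows. This is an illustration, not ODM code. It assumes the common NCBI convention that a gap of length k costs open_gap_cost + k * extend_gap_cost, and it takes two already-aligned strings in which '-' marks a gap; the function name is hypothetical.

```python
def raw_alignment_score(aligned_query: str, aligned_target: str,
                        match_reward: int = 1, mismatch_cost: int = -3,
                        open_gap_cost: int = 5, extend_gap_cost: int = 2) -> int:
    """Score two aligned strings of equal length; '-' marks a gap position."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_side = None  # which sequence carries the current gap run, if any
    for q, t in zip(aligned_query, aligned_target):
        if q == "-" and t == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        side = "query" if q == "-" else "target" if t == "-" else None
        if side is None:
            score += match_reward if q == t else mismatch_cost
        else:
            if side != gap_side:
                score -= open_gap_cost  # charged once per gap run
            score -= extend_gap_cost    # charged for every gap position
        gap_side = side
    return score
```

With the BLASTN defaults, four matches followed by a two-position gap score 4 - (5 + 2 * 2) = -5, which shows why short alignments rarely justify opening a gap.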


Table 9-13 Output Parameters for BLASTN_ALIGN Table Function

Parameter Description

t_seq_id

Identifier (for example, the NCBI accession number) of the matched (target) sequence.

pct_identity

Percentage of the query sequence that identically matches with the database sequence.

alignment_length

Length of the alignment.

mismatches

Number of base-pair mismatches between the query and the database sequence.

positives

Number of base-pairs with a positive match score between the query and the database sequence.

gap_openings

Number of gaps opened in gapped alignment.

gap_list

List of offsets where a gap is opened.

q_seq_start, q_seq_end

The indexes of the portion of the query sequence that is aligned.

q_frame

Translation frame number of the query.

t_seq_start, t_seq_end

The indexes of the portion of the target sequence that is aligned.

t_frame

Translation frame number of the target sequence.

expect

Expect value of the alignment.

score

Score corresponding to the alignment.
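
To make the output columns concrete, the following Python sketch derives alignment_length, pct_identity, mismatches, gap_openings, and gap_list from a pair of aligned strings in which '-' marks a gap. It is an illustration only. Two points are assumptions of this sketch and may differ from ODM: pct_identity is computed over the alignment length (as in NCBI tabular output), and gap_list holds 1-based alignment column offsets. The function name is hypothetical.

```python
def alignment_stats(aligned_query: str, aligned_target: str) -> dict:
    """Summarize an alignment given as two equal-length strings with '-' gaps."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned strings must have equal length")
    identities = mismatches = 0
    gap_list = []
    prev_gap = False
    for offset, (q, t) in enumerate(zip(aligned_query, aligned_target), start=1):
        is_gap = q == "-" or t == "-"
        if is_gap:
            if not prev_gap:
                gap_list.append(offset)  # record where each gap run opens
        elif q.upper() == t.upper():
            identities += 1
        else:
            mismatches += 1
        prev_gap = is_gap
    length = len(aligned_query)
    return {
        "alignment_length": length,
        # Assumption: percentage over alignment columns, not query length.
        "pct_identity": 100.0 * identities / length if length else 0.0,
        "mismatches": mismatches,
        "gap_openings": len(gap_list),
        "gap_list": gap_list,
    }
```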



BLASTP_ALIGN Table Function

This table function performs a BLASTP alignment of the given amino acid sequences against the selected portion of the protein database. The database can be selected using a standard SQL select and passed into the function as a cursor. You can also use the standard BLAST parameters that are listed in the following section.

The BLASTP_MATCH function returns only the score and expect value of the match. It does not return information about the alignment. BLASTP_MATCH is typically used when a BLAST search will be followed by a more compute-intensive alignment, such as the Smith-Waterman alignment or a full FASTA alignment.

The BLASTP_ALIGN function does the BLAST alignment and returns the information about the alignment. The schema of the returned alignment is the same as that of BLASTN_ALIGN.

Syntax

function BLASTP_ALIGN (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  sub_matrix VARCHAR2 default 'BLOSUM62',
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 11,
  extend_gap_cost NUMBER default 1,
  word_size NUMBER default 3,
  x_dropoff NUMBER default 15,
  final_x_dropoff NUMBER default 25)
 return table of row ( 
    t_seq_id VARCHAR2,
    pct_identity NUMBER,
    alignment_length NUMBER,
    mismatches NUMBER,
    positives NUMBER,
    gap_openings NUMBER,
    gap_list [Table of NUMBER],
    q_seq_start NUMBER,
    q_frame NUMBER,
    q_seq_end NUMBER,
    t_seq_start NUMBER,
    t_seq_end NUMBER,
    t_frame NUMBER,   
    score NUMBER, 
    expect NUMBER);

Parameters

Table 9-14 describes the input parameters for BLASTP_ALIGN; Table 9-15, the output parameters.

Table 9-14 Input Parameters for BLASTP_ALIGN Table Function

Parameter Description

query_seq

The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.

seqdb_cursor

The cursor parameter supplied by the user when calling the function. Each row that it returns should contain two columns: the sequence identifier and the sequence string.

subsequence_from

Start position of a region of the query sequence to be used for the search. The default is 1.

subsequence_to

End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.

filter_low_complexity

TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default is FALSE.

mask_lower_case

TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default is FALSE.

sub_matrix

Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are PAM30, PAM70, BLOSUM80, BLOSUM62, and BLOSUM45. The default is BLOSUM62. See Table 9-9 for supported values of (open_gap_cost, extend_gap_cost) for each matrix.

expect_value

The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.

open_gap_cost

The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.

extend_gap_cost

The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.

word_size

The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.

x_dropoff

X-dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior.

final_x_dropoff

The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior.
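
The subsequence_from, subsequence_to, and mask_lower_case parameters can be illustrated with a short Python sketch. It is not ODM code. It assumes that positions are 1-based and inclusive (suggested by the default of 1 for subsequence_from), and it only locates the lowercase regions that a caller has marked for filtering; how ODM represents masked residues internally is not covered. Both function names are hypothetical.

```python
import re


def query_region(seq: str, subsequence_from: int = 1, subsequence_to: int = -1) -> str:
    """Return the part of seq used for the search (1-based, inclusive)."""
    # -1 means "up to the end of the sequence".
    end = len(seq) if subsequence_to == -1 else subsequence_to
    if not 1 <= subsequence_from <= end <= len(seq):
        raise ValueError("region is outside the sequence")
    return seq[subsequence_from - 1:end]


def lowercase_intervals(seq: str) -> list:
    """Return (start, end) pairs, 1-based inclusive, for each lowercase run."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", seq)]
```

For the sequence MKTayiaKQR, the lowercase run occupies positions 4 through 7, so those residues would be the ones excluded when mask_lower_case is TRUE.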


Table 9-15 Output Parameters for BLASTP_ALIGN Table Function

Parameter Description

t_seq_id

Identifier (for example, the NCBI accession number) of the matched (target) sequence.

pct_identity

Percentage of the query sequence that identically matches with the database sequence.

alignment_length

Length of the alignment.

mismatches

Number of residue mismatches between the query and the database sequence.

positives

Number of residue pairs with a positive match score between the query and the database sequence.

gap_openings

Number of gaps opened in gapped alignment.

gap_list

List of offsets where a gap is opened.

q_seq_start, q_seq_end

The indexes of the portion of the query sequence that is aligned.

q_frame

Translation frame number of the query.

t_seq_start, t_seq_end

The indexes of the portion of the target sequence that is aligned.

t_frame

Translation frame number of the target sequence.

score

Score corresponding to the alignment.

expect

Expect value of the alignment.



TBLAST_ALIGN Table Function

This table function performs BLAST alignments involving translations of either the query sequence or the database of sequences or both the query sequence and the database of sequences. The available translation options are BLASTX, TBLASTN, and TBLASTX. The schema of the returned alignment is the same as that of BLASTN_ALIGN and BLASTP_ALIGN.
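
The three translation options differ in which side of the comparison is translated from nucleotides to amino acids. The following Python sketch records that mapping; the names TRANSLATION_TYPES and translated_sides are hypothetical and serve only to illustrate the options.

```python
# translation_type -> (query is translated, database is translated)
TRANSLATION_TYPES = {
    "BLASTX": (True, False),   # nucleotide query against a protein database
    "TBLASTN": (False, True),  # protein query against a nucleotide database
    "TBLASTX": (True, True),   # nucleotide query against a nucleotide database
}


def translated_sides(translation_type: str = "BLASTX") -> dict:
    """Report which inputs are translated for a given translation_type."""
    try:
        query, database = TRANSLATION_TYPES[translation_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported translation_type: {translation_type}") from None
    return {"query": query, "database": database}
```

In every case the comparison itself is carried out on amino acid sequences, which is why the protein-oriented parameters such as sub_matrix apply.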

Syntax

function TBLAST_ALIGN (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  translation_type VARCHAR2 default 'BLASTX',
  genetic_code NUMBER default 1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  sub_matrix VARCHAR2 default 'BLOSUM62',
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 11,
  extend_gap_cost NUMBER default 1,
  word_size NUMBER default 3,
  x_dropoff NUMBER default 15,
  final_x_dropoff NUMBER default 25)
 return table of row ( 
    t_seq_id VARCHAR2,
    pct_identity NUMBER,
    alignment_length NUMBER,
    mismatches NUMBER,
    positives NUMBER,
    gap_openings NUMBER,
    gap_list [Table of NUMBER],
    q_seq_start NUMBER,
    q_frame NUMBER,
    q_seq_end NUMBER,
    t_seq_start NUMBER,
    t_seq_end NUMBER,
    t_frame NUMBER,   
    score NUMBER, 
    expect NUMBER);

Parameters

Table 9-16 describes the input parameters for TBLAST_ALIGN; Table 9-17, the output parameters.

Table 9-16 Input Parameters for TBLAST_ALIGN Table Function

Parameter Description

query_seq

The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.

seqdb_cursor

The cursor parameter supplied by the user when calling the function. Each row that it returns should contain two columns: the sequence identifier and the sequence string.

subsequence_from

Start position of a region of the query sequence to be used for the search. The default is 1.

subsequence_to

End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.

translation_type

Type of the translation involved. The options are BLASTX, TBLASTN, and TBLASTX. The default is BLASTX.

genetic_code

Used for translating nucleotide sequences to amino acid sequences. A genetic code acts as a mapping table from nucleotide codons to amino acids. NCBI supports 17 different genetic codes. The supported genetic codes and their names are given in Table 9-1. The default genetic code is 1.

filter_low_complexity

TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default is FALSE.

mask_lower_case

TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default is FALSE.

sub_matrix

Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are PAM30, PAM70, BLOSUM80, BLOSUM62, and BLOSUM45. The default is BLOSUM62. See Table 9-9 for supported values of (open_gap_cost, extend_gap_cost) for each matrix.

expect_value

The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.

open_gap_cost

The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.

extend_gap_cost

The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.

word_size

The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.

x_dropoff

Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior.

final_x_dropoff

The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior.
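
The rule that specifying 0 invokes default behavior can be sketched in Python as follows. The default values are taken from Table 9-16; the names TBLAST_DEFAULTS and effective_parameters are hypothetical, and the sketch only models the numeric parameters for which the table states this rule.

```python
# Defaults for the numeric TBLAST_ALIGN parameters, as listed in Table 9-16.
TBLAST_DEFAULTS = {
    "expect_value": 10,
    "open_gap_cost": 11,
    "extend_gap_cost": 1,
    "word_size": 3,
    "x_dropoff": 15,
    "final_x_dropoff": 25,
}


def effective_parameters(**overrides) -> dict:
    """Return the values a search would use, treating 0 as 'use the default'."""
    unknown = set(overrides) - set(TBLAST_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = dict(TBLAST_DEFAULTS)
    for name, value in overrides.items():
        if value != 0:  # 0 leaves the documented default in place
            params[name] = value
    return params
```

This makes it possible to pass 0 for every positional argument that should keep its default while overriding only the one or two that matter for a given search.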


Table 9-17 Output Parameters for TBLAST_ALIGN Table Function

Parameter Description

t_seq_id

Identifier (for example, the NCBI accession number) of the matched (target) sequence.

pct_identity

Percentage of the query sequence that identically matches with the database sequence.

alignment_length

Length of the alignment.

mismatches

Number of residue mismatches between the query and the database sequence.

positives

Number of residue pairs with a positive match score between the query and the database sequence.

gap_openings

Number of gaps opened in gapped alignment.

gap_list

List of offsets where a gap is opened.

q_seq_start, q_seq_end

The indexes of the portion of the query sequence that is aligned.

q_frame

Translation frame number of the query.

t_seq_start, t_seq_end

The indexes of the portion of the target sequence that is aligned.

t_frame

Translation frame number of the target sequence.

score

Score corresponding to the alignment.

expect

Expect value of the alignment.
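
The q_frame and t_frame columns identify the reading frame in which a nucleotide sequence was translated. The following Python sketch shows one way to extract the nucleotides read in a given frame. It assumes the common NCBI numbering, in which frames 1 to 3 are read on the given strand and frames -1 to -3 on its reverse complement; whether ODM reports frames with exactly this numbering is an assumption. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return dna.translate(COMPLEMENT)[::-1]


def reading_frame(dna: str, frame: int) -> str:
    """Return the whole codons read in the given frame (1..3 or -1..-3)."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    # Negative frames are read on the reverse complement strand.
    strand = dna if frame > 0 else reverse_complement(dna)
    offset = abs(frame) - 1
    # Keep only complete codons after skipping the frame offset.
    usable = max(0, (len(strand) - offset) // 3 * 3)
    return strand[offset:offset + usable]
```

For nucleotide-only or protein-only inputs no translation takes place, so a frame value carries information only for the translated side of a search.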