DNA                 package:mlbench                 R Documentation

_P_r_i_m_a_t_e _s_p_l_i_c_e-_j_u_n_c_t_i_o_n _g_e_n_e _s_e_q_u_e_n_c_e_s (_D_N_A)

_D_e_s_c_r_i_p_t_i_o_n:

     The dataset consists of 3,186 data points (splice junctions). Each
     data point is described by 180 binary indicator variables, and the
     problem is to recognize the 3 classes (ei, ie, neither), i.e., the
     boundaries between exons (the parts of the DNA sequence retained
     after splicing) and introns (the parts of the DNA sequence that
     are spliced out).

     The StatLog DNA dataset is a processed version of the Irvine
     database described below. The main difference is that the
     symbolic variables representing the nucleotides (only A, G, T, C)
     were each replaced by 3 binary indicator variables. Thus the
     original 60 symbolic attributes were changed into 180 binary
     attributes. The names of the examples were removed. The examples
     with ambiguities were also removed (there were very few of them,
     only 4). The StatLog version of this dataset was produced by Ross
     King at Strathclyde University. For original details see the
     Irvine database documentation.

     The nucleotides A,C,G,T were given indicator values as follows:

         A -> 1 0 0
         C -> 0 1 0
         G -> 0 0 1
         T -> 0 0 0
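
     The mapping above can be sketched in R as follows. This is only an
     illustration of the coding scheme, not the code that produced the
     StatLog files; `nucleotide_codes` and `encode_sequence()` are
     hypothetical names introduced here and are not part of mlbench.

```r
# Indicator codes for each nucleotide, as listed in the table above.
nucleotide_codes <- rbind(
  A = c(1L, 0L, 0L),
  C = c(0L, 1L, 0L),
  G = c(0L, 0L, 1L),
  T = c(0L, 0L, 0L)
)

# encode_sequence() is a hypothetical helper, not part of mlbench.
# It turns a string of nucleotides into a flat 0/1 vector with
# three indicators per sequence position.
encode_sequence <- function(sequence) {
  bases <- strsplit(toupper(sequence), "")[[1]]
  # Only unambiguous nucleotides are supported, as in the StatLog version.
  stopifnot(all(bases %in% rownames(nucleotide_codes)))
  # Look up one row per base, then flatten position by position.
  as.vector(t(nucleotide_codes[bases, , drop = FALSE]))
}

encoded <- encode_sequence("ACGT")
```

     Under this scheme a sequence of 60 nucleotides yields 60 x 3 = 180
     indicator values, which matches the number of binary attributes in
     the dataset.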

     Hint. Much better performance is generally observed if attributes
     closest to the junction are used. In the StatLog version, this
     means using attributes A61 to A120 only.
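
     With 3 indicators per position, attributes 61 to 120 correspond to
     sequence positions 21 to 40. A minimal sketch of this selection is
     given below. It picks columns by position and assumes that the 180
     indicator columns come first and the class label is the last
     column; `select_junction_window()` is a hypothetical helper, not
     part of mlbench.

```r
# select_junction_window() is a hypothetical helper, not part of mlbench.
# It keeps predictor columns `first` to `last` (by position) plus the
# final column, which is assumed to hold the class label.
select_junction_window <- function(data, first = 61, last = 120) {
  stopifnot(ncol(data) >= 181)
  data[, c(first:last, ncol(data)), drop = FALSE]
}

# Apply it to the real data only if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("DNA", package = "mlbench")
  dna_window <- select_junction_window(DNA)
  print(dim(dna_window))
}
```

     Selecting by position avoids relying on particular column names,
     which may differ between the StatLog files and the R data frame.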

_U_s_a_g_e:

     data(DNA)

_F_o_r_m_a_t:

     A data frame with 3,186 observations on 180 nominal variables and
     a target class.

_S_o_u_r_c_e:

        *  Source:
            - all examples taken from Genbank 64.1 (ftp site:
              genbank.bio.net)
            - categories "ei" and "ie" include every "split-gene" for
              primates in Genbank 64.1
            - non-splice examples taken from sequences known not to
              include a splicing site

        *  Donor: G. Towell, M. Noordewier, and J. Shavlik, 
           {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu

     These data have been taken from: 

        *  ftp.stams.strath.ac.uk/pub/Statlog

     and were converted to R format by
     Evgenia.Dimitriadou@ci.tuwien.ac.at.

_R_e_f_e_r_e_n_c_e_s:

     Machine learning:

     - M. O. Noordewier, G. G. Towell, and J. W. Shavlik, 1991;
     "Training Knowledge-Based Neural Networks to Recognize Genes in
     DNA Sequences", In Advances in Neural Information Processing
     Systems, volume 3, Morgan Kaufmann.

     - G. G. Towell, J. W. Shavlik, and M. W. Craven, 1991;
     "Constructive Induction in Knowledge-Based Neural Networks", In
     Proceedings of the Eighth International Machine Learning Workshop,
     Morgan Kaufmann.

     - G. G. Towell, 1991; "Symbolic Knowledge and Neural Networks:
     Insertion, Refinement, and Extraction", PhD Thesis, University of
     Wisconsin - Madison.

     - G. G. Towell and J. W. Shavlik, 1992; "Interpretation of
     Artificial Neural Networks: Mapping Knowledge-Based Neural
     Networks into Rules", In Advances in Neural Information Processing
     Systems, volume 4, Morgan Kaufmann.

