GenBank HOWTO

This is a quick synopsis of the steps needed to initialize a GBrowse database from a GenBank record. For the purposes of illustration, we will use the RefSeq record for Mycobacterium bovis (M. bovis), accession NC_002945.


Using the GBrowse in-memory database

1. Convert from GenBank format into GFF format

Download the GenBank record and convert it into GFF format. You can do this easily using the bp_genbank2gff.pl script, which is part of Bioperl (in the source distribution it is scripts/Bio-DB-GFF/genbank2gff.pl):

   bp_genbank2gff.pl -stdout -accession NC_002945 > mbovis.gff

This will download the record for M. bovis (RefSeq accession NC_002945), convert it, and save the result to the file mbovis.gff.

If you already have the GenBank record available as a file named NC_002945.gb, you can convert it like this:

   bp_genbank2gff.pl -stdout -file NC_002945.gb > mbovis.gff

The newly-converted file uses GFF3 format, which combines feature data with sequence/DNA data. This means that you do not need a separate FASTA file for the sequence.
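
A quick way to sanity-check the converted file is to tally its features by type. The following is only a sketch: the function name gff_feature_counts is made up for this illustration, and it assumes that feature lines are tab-delimited with nine columns, so that comment lines and embedded sequence lines are skipped.

```shell
# Count features per type (column 3 of a tab-delimited GFF line).
# Comment lines and any line without exactly nine tab-separated
# fields (such as embedded sequence) are ignored.
gff_feature_counts() {
    awk -F '\t' '!/^#/ && NF == 9 { n[$3]++ }
                 END { for (t in n) print n[t] "\t" t }' "$1" | sort -rn
}

# Usage: gff_feature_counts mbovis.gff
```

If the tally comes back empty, the conversion probably did not produce any feature lines and is worth repeating before going further.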

2. Install the GFF file into the databases directory

Create a subdirectory for this genome under your in-memory GFF databases directory, as described in the tutorial, and copy the file into it. We will assume that the databases directory is /usr/local/apache/htdocs/gbrowse/databases.

  mkdir /usr/local/apache/htdocs/gbrowse/databases/mbovis
  chmod o+rwx /usr/local/apache/htdocs/gbrowse/databases/mbovis
  cp mbovis.gff /usr/local/apache/htdocs/gbrowse/databases/mbovis

3. Set up the configuration file

Use the configuration file 08.genbank.conf as your starting template. This is located in contrib/conf_files:

  cp contrib/conf_files/08.genbank.conf /usr/local/apache/conf/gbrowse.conf/mb.conf

4. Edit the configuration file as appropriate

You will need to change the [GENERAL] section to use the in-memory adaptor and to point to the location of the M. bovis GFF file:

 [GENERAL]
 description   = Mycobacterium bovis In-Memory
 db_adaptor    = Bio::DB::GFF
 db_args       = -adaptor memory
                 -dir     /usr/local/apache/htdocs/gbrowse/databases/mbovis

You might also want to change the "examples" tag to introduce the accession number for the whole genome, and a few choice gene names and search terms:

  examples = NC_002945 Mb1800 galT glucose

That's all there is to it. However, because this is a pretty big chunk of DNA (> 4 Mbp), the in-memory adaptor uses a considerable amount of memory, and performance will be sluggish unless you have a fast machine with plenty of RAM. You may therefore prefer to serve the genome from a MySQL, PostgreSQL or Oracle database. The following are instructions for doing this.


Using the GBrowse GFF database with MySQL

We will assume that you are using a MySQL database.

1. Create the database

Create the database using mysqladmin:

  mysqladmin create mbovis

As described in the GBrowse tutorial, give yourself write permission for the database, and give the web server user (e.g. "nobody") select permission.
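
The permissions step might look like the following sketch. The helper name gbrowse_grants and the account name yourname are placeholders invented for this example, and the statements assume that both accounts connect from localhost. Recent MySQL releases do not create accounts implicitly through GRANT, so the accounts may need to be created (CREATE USER) beforehand.

```shell
# Print GRANT statements for a GBrowse database.
# $1 = database, $2 = account that loads data, $3 = web server account.
gbrowse_grants() {
    printf "GRANT ALL PRIVILEGES ON %s.* TO '%s'@'localhost';\n" "$1" "$2"
    printf "GRANT SELECT ON %s.* TO '%s'@'localhost';\n" "$1" "$3"
}

# Review the statements, then pipe them to the mysql client as an
# administrative user, for example:
#   gbrowse_grants mbovis yourname nobody | mysql -u root -p
```

Printing the statements before running them makes it easy to check the account names before anything is changed on the server.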

2. Convert from GenBank format into GFF format and load it into the database

The bp_genbank2gff.pl script can download the accession, convert it into GFF and load the database directly in one smooth step:

  bp_genbank2gff.pl -create -dsn mbovis -accession NC_002945

If you prefer, you can do this in two steps: first create the GFF file as described for the in-memory adaptor, and then load it with Bioperl's bp_bulk_load_gff.pl or bp_fast_load_gff.pl.
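
The two-step route could be sketched as follows. The wrapper function load_two_step is hypothetical, and the loader options shown (--create and --database) are the ones we understand bp_bulk_load_gff.pl to accept; check the script's own documentation for your Bioperl version before relying on them.

```shell
# Two-step load: write the GFF to disk, then bulk-load it.
# $1 = accession, $2 = database name (which must already exist).
load_two_step() {
    gff="$1.gff"
    # Step 1: the same conversion as for the in-memory adaptor.
    bp_genbank2gff.pl -stdout -accession "$1" > "$gff" || return 1
    # Step 2: --create (re)initializes the tables before loading.
    bp_bulk_load_gff.pl --create --database "$2" "$gff"
}

# Usage: load_two_step NC_002945 mbovis
```

Keeping the intermediate GFF file on disk lets you inspect or edit it before it goes into the database.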

If you are using a PostgreSQL or Oracle database, you must pass the appropriate adaptor (dbi::pg or dbi::oracle, respectively) to bp_genbank2gff.pl. For example, for Oracle:

  bp_genbank2gff.pl -create -dsn mbovis -adaptor dbi::oracle -accession NC_002945

3. Set up the configuration file

Use the configuration file 08.genbank.conf as your starting template. This is located in contrib/conf_files:

  cp contrib/conf_files/08.genbank.conf /usr/local/apache/conf/gbrowse.conf/mb.conf

4. Edit the configuration file as appropriate

You will need to change the [GENERAL] section to use the appropriate database adaptor:

 [GENERAL]
 description   = Mycobacterium bovis Database
 db_adaptor    = Bio::DB::GFF
 db_args       = -adaptor dbi::mysql
                 -dsn     dbi:mysql:database=mbovis;host=localhost
                 -user    nobody
                 -passwd  ""

You might also want to change the "examples" tag to introduce the accession number for the whole genome, and a few choice gene names and search terms:

  examples = NC_002945 Mb1800 galT glucose

That should be it!

NOTE

You can load as many accessions into the database as you like. Each one will appear as a "chromosome" named after the accession number of the entry.
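
Loading several accessions could be scripted along these lines. This is a sketch under stated assumptions: the function name load_accessions is invented for the example, and it assumes that -create reinitializes the database tables, so the flag is passed only for the first accession and omitted for the rest.

```shell
# Load one or more accessions into the same database.
# $1 = database name; remaining arguments = accession numbers.
load_accessions() {
    db="$1"
    shift
    create="-create"   # only the first load initializes the tables
    for acc in "$@"; do
        # $create is deliberately unquoted so that it vanishes when empty.
        bp_genbank2gff.pl $create -dsn "$db" -accession "$acc" || return 1
        create=""
    done
}

# Usage: load_accessions mbovis NC_002945 <other accessions...>
```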