Users Guide

Obtaining and Building the CATE Software

All CATE code is stored in CVS on the NESC gForge instance. The connection URL is forge.nesc.ac.uk:/cvsroot/cate

There are 15 modules held in CVS of which 13 are required to build parts of the CATE software & site. They are

  1. cate This is the parent java module for the CATE web application and associated libraries. It is packaged using the maven software project management and comprehension tool (http://maven.apache.org).
    1. cate-model This module contains the domain model, XML Serialization mappings and Object-Relational mappings.
    2. cate-persistence This module contains the persistence layer for the CATE web application, including versioning logic and DAOs.
    3. cate-service This module contains the service layer for the CATE web application, including business logic and workflow.
    4. cate-controller This module contains the controller code for the CATE web application.
    5. cate-js This module contains the custom javascript used by the CATE web application. It is based on the dojo javascript library version 0.4.3, and packaged using the Maven tools for javascript developers.
    6. cate-view This module contains the view layer for the CATE web application, including static resources such as CSS files and images.
    7. cate-web This module packages the 6 preceding modules into a Web application archive for a generic CATE web application.
    8. cate-tools This module contains code for three command line tools used by CATE. It contains code to import data from XML files into a relational database. It contains code to generate an XML sitemap for a given CATE site. It also contains code to generate a distributional map as a jpeg image for a taxon.
    9. cate-site This module contains code and resources that customize the look of the site generated by maven (deployed at http://www.cate-project.org).
  2. data-extraction This Perl module is used to extract data from the databases provided by the content teams and write the data out in a normalized form as XML according to the XML Schema produced by CATE.
  3. cate-server This module contains the server configuration files for the cate server instances (cate-dev.nhm.ac.uk & cate-live.nhm.ac.uk).

Building the CATE Java modules

The CATE Java modules are built using Apache Maven. They require Maven (I used maven 2.0.9), and a Java 5 SDK (I used the Java Development Kit from Sun version 1.5.0_16). Modules can be built by executing the command:

> mvn install

In most cases dependencies can be downloaded from existing Maven repositories. However, some dependencies are not available there, either because they have never been published to a Maven repository, because they needed to be patched, or because they are under a restrictive license (as is the case for some jar files from Sun).
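One common way to supply such a jar is to install it by hand into the local Maven repository with the install:install-file goal of the Maven install plugin. The Python sketch below only assembles that command line. The jar path and coordinates are hypothetical placeholders, not the actual list of CATE dependencies, and this guide does not record whether the CATE jars were installed this way.

```python
import shlex


def install_file_command(jar_path, group_id, artifact_id, version):
    """Build the argument list for Maven's install:install-file goal.

    All four arguments are supplied by the caller; nothing here is
    taken from the CATE build itself.
    """
    return [
        "mvn", "install:install-file",
        f"-Dfile={jar_path}",
        f"-DgroupId={group_id}",
        f"-DartifactId={artifact_id}",
        f"-Dversion={version}",
        "-Dpackaging=jar",
    ]


if __name__ == "__main__":
    # Hypothetical example: a licensed jar that has been downloaded manually.
    cmd = install_file_command(
        "lib/jta-1.0.1B.jar", "javax.transaction", "jta", "1.0.1B"
    )
    # Print the command so it can be reviewed before being run.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The command is printed rather than executed so that the coordinates can be checked against the dependency declared in the relevant pom.xml first.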

Building and installing the CATE Perl module & ancillary tools

The CATE Perl module (CATE.pm) is built using ActivePerl (I used v5.5.8), and also requires a version of the Make tool (I used nmake, which can be downloaded by following this link on the Microsoft support site http://support.microsoft.com/default.aspx?scid=kb;en-us;Q132084). CATE.pm itself uses several modules (Carp, utf8, Encode, Data::UUID, IO::File, XML::Writer and Text::CSV::Simple), which in most cases can be installed using the ActivePerl Perl Package Manager, or can be downloaded from CPAN and installed manually.

In addition, the two revision specific scripts use DBI and DBD::ODBC to connect to the Access databases and Factsheet fusion files (both of which can be installed using the Perl Package Manager).

To build CATE.pm, first build the makefile (in data-extraction/CATE):

> perl Makefile.PL

Then install the library (from the same directory):

> nmake install

Data Import

The process of getting data into a CATE site is as follows:

  1. Data is made available to the import code in the correct format by copying files into the correct directory on the importing machine. These directories are revision-specific and are described below.
  2. The import routine is run. This is a perl script. Each revision (sphingidae & araceae) has its own script which in turn uses the data-extraction library. CATE.pm expects to find a number of CSV files in a directory called "resources" (containing terms belonging to controlled vocabularies, available in CVS in the data-extraction module). To run the script, call it with no arguments e.g.

    > perl cate-araceae.pl

  3. This script should regenerate the XML files for the revision.
  4. Once this script has been run, the normalized data can be imported into a relational database using the cate-tools software, and the processed images ("web-friendly" and "zoomified") can be copied to a server using sftp. The cate-tools software is packaged as an executable jar file and can be run as follows (assuming that the cate-tools jar file is called cate-tools-0.8.jar and is in the current directory, and the XML file to be imported is called cate-sphingidae.xml and is also in the current directory):

    > java -Xmx1024m -jar cate-tools-0.8.jar import cate-sphingidae.xml

This assumes that there is a hibernate properties file in the current directory. An example of such a file is:

hibernate.connection.driver_class=com.mysql.jdbc.Driver
hibernate.connection.url=jdbc:mysql://localhost:3306/araceae?useUnicode=true&characterEncoding=UTF-8
hibernate.connection.username=aroid
hibernate.connection.password=ar01d
hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
hibernate.hbm2ddl.auto=create
hibernate.cache.provider_class=net.sf.ehcache.hibernate.SingletonEhCacheProvider
hibernate.search.default.directory_provider=org.hibernate.search.store.FSDirectoryProvider
hibernate.search.default.indexBase=.
hibernate.search.default.indexDir=indexes
hibernate.search.indexing_strategy=manual
hibernate.show_sql=false

You also need to place a valid log4j properties file in the current directory.

An explanation of the properties beginning with hibernate can be found in the Hibernate reference documentation.
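Because the import tool reads both files from the current directory, it can be worth checking that they are present before starting a long import. The following Python sketch is one way to do that. The file name log4j.properties and the list of required keys are assumptions made for illustration (the guide only asks for "a valid log4j properties file"), and the parser handles only simple key=value lines.

```python
from pathlib import Path

# Assumed minimum set of connection settings; adjust to taste.
REQUIRED_KEYS = (
    "hibernate.connection.driver_class",
    "hibernate.connection.url",
    "hibernate.connection.username",
    "hibernate.connection.password",
    "hibernate.dialect",
)


def parse_properties(text):
    """Parse simple key=value lines; lines starting with # or ! are comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first '=' only, so JDBC URLs with query strings survive.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def preflight(directory="."):
    """Return a list of problems found in the given working directory."""
    directory = Path(directory)
    problems = []
    hibernate_file = directory / "hibernate.properties"
    if not hibernate_file.is_file():
        problems.append("hibernate.properties not found")
    else:
        props = parse_properties(hibernate_file.read_text(encoding="utf-8"))
        problems += [f"missing property: {key}"
                     for key in REQUIRED_KEYS if key not in props]
    # Assumed file name; log4j looks for log4j.properties by default.
    if not (directory / "log4j.properties").is_file():
        problems.append("log4j.properties not found")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

An empty result means that both files were found and the listed keys are present; it says nothing about whether the database is reachable.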

If the import succeeds, the data is written to the database specified in the hibernate.properties file, overwriting the existing data. In addition, the tool generates the index files used by the web application for free-text searching (in a format understood by Apache Lucene). These files should be copied into the correct directory on the web server before the web application is started (cate-web, cate-araceae, and cate-sphingidae all use ${webAppRoot}/WEB-INF/).

Databases used: cate-araceae

CATE Araceae requires:

  1. Master database (described in data-extraction/documentation/cate-araceae/Araceae.rtf; available as dbi:ODBC:Araceae).
  2. Database containing the TDWG areas (described in data-extraction/documentation/TDWG_Geography.rtf; available as dbi:ODBC:TDWG-Geography). Downloaded from http://www.kew.org/gis/tdwg/.
  3. Several factsheet fusion databases (described in data-extraction/documentation/Factsheet_Fusion.rtf and available as dbi:ODBC:Flora for the floras database, dbi:ODBC:AraceaeGenera for the Araceae genera and dbi:ODBC:{GENUS_NAME} where {GENUS_NAME} is the name of the genus for subgeneric taxa - there is a Factsheet fusion database per genus). The Factsheet fusion databases do not need to be modified in any way (beyond being registered in the Microsoft ODBC Data Source Administrator).

    The topics in the Factsheet Fusion databases are:

    • Typification | Type of Type
    • Typification | Type Specimen Link
    • Typification | Type Citation
    • Typification | Type Citation Standardised
    • Typification | Typification Discussion
    • Typification | Type Species Identifier
    • Header Images | Header Image
    • Diagnosis | Distinguishing Features Text
    • Diagnosis | Distinguishing Features Source
    • Diagnosis | Distinguishing Features Images
    • Description | Description Text
    • Description | Description Source
    • Description | Altitude
    • Description | Altitude Source
    • Description | Phenology
    • Description | Phenology Source
    • Description | Habitat
    • Description | Habitat Source
    • Key | Title
    • Key | Source
    • Key | Images
    • Key | Text
    • Key | File Ref
    • Distribution Text
    • Discussion
    • Etymology
    • Common names
    • Uses
    • Uses Source
    • References
    • Conservation Status
    • Other Images

Databases used: cate-sphingidae

CATE Sphingidae requires:

  1. Master database (described in data-extraction/documentation/cate-sphingidae/Sphingidae.rtf; available as dbi:ODBC:Sphingidae).
  2. Database containing the TDWG areas (described in data-extraction/documentation/TDWG_Geography.rtf; available as dbi:ODBC:TDWG-Geography). Downloaded from http://www.kew.org/gis/tdwg/.
  3. Database containing the distributional data for the sphingid taxa (described in data-extraction/documentation/cate-sphingidae/Distribution.rtf; available as dbi:ODBC:SphingidDistribution).
  4. Several factsheet fusion databases (described in data-extraction/documentation/Factsheet_Fusion.rtf and available as dbi:ODBC:{GENUS_NAME} where {GENUS_NAME} is the name of the genus for subgeneric taxa - there is a Factsheet fusion database per genus). The Factsheet fusion databases do not need to be modified in any way (beyond being registered in the Microsoft ODBC Data Source Administrator).

The topics in the Factsheet Fusion databases are:

  • Common Name
  • Diagnosis | Aedeagus | Caption
  • Diagnosis | Aedeagus | Collection / Institution
  • Diagnosis | Aedeagus | Image
  • Diagnosis | Aedeagus | Specimen Number
  • Diagnosis | Aedeagus | Text
  • Diagnosis | Based on
  • Diagnosis | Comments
  • Diagnosis | Female | Body Underside
  • Diagnosis | Female | Body Upperside
  • Diagnosis | Female | Forewing Length
  • Diagnosis | Female | Forewing Underside
  • Diagnosis | Female | Forewing Upperside
  • Diagnosis | Female | Habitus
  • Diagnosis | Female | Hindwing Underside
  • Diagnosis | Female | Hindwing Upperside
  • Diagnosis | Female Genitalia | Caption
  • Diagnosis | Female Genitalia | Collection / Institution
  • Diagnosis | Female Genitalia | Image
  • Diagnosis | Female Genitalia | Specimen Number
  • Diagnosis | Female Genitalia | Text
  • Diagnosis | General | Body Underside
  • Diagnosis | General | Body Upperside
  • Diagnosis | General | Forewing Underside
  • Diagnosis | General | Forewing Upperside
  • Diagnosis | General | Habitus
  • Diagnosis | General | Hindwing Underside
  • Diagnosis | General | Hindwing Upperside
  • Diagnosis | Male | Body Underside
  • Diagnosis | Male | Body Upperside
  • Diagnosis | Male | Forewing Length
  • Diagnosis | Male | Forewing Underside
  • Diagnosis | Male | Forewing Upperside
  • Diagnosis | Male | Habitus
  • Diagnosis | Male | Hindwing Underside
  • Diagnosis | Male | Hindwing Upperside
  • Diagnosis | Male Genitalia | Caption
  • Diagnosis | Male Genitalia | Collection / Institution
  • Diagnosis | Male Genitalia | Image
  • Diagnosis | Male Genitalia | Specimen Number
  • Diagnosis | Male Genitalia | Text
  • Diagnosis | Valve | Caption
  • Diagnosis | Valve | Collection / Institution
  • Diagnosis | Valve | Image
  • Diagnosis | Valve | Specimen Number
  • Diagnosis | Valve | Text
  • Discussion | Flight time
  • Discussion | Ovum
  • Discussion | Pupa
  • Discussion | Adult biology
  • Discussion | Larva
  • Discussion | Nomenclature
  • Discussion | Taxonomy
  • Distribution | Codified
  • Distribution | Text
  • Header Image 1 | Collection / Institution
  • Header Image 1 | Image Caption
  • Header Image 1 | Image Ref
  • Header Image 1 | Specimen Number
  • Header Image 2 | Collection / Institution
  • Header Image 2 | Image Caption
  • Header Image 2 | Image Ref
  • Header Image 2 | Specimen Number
  • Header Image 3 | Collection / Institution
  • Header Image 3 | Image Caption
  • Header Image 3 | Image Ref
  • Header Image 3 | Specimen Number
  • Header Image 4 | Collection / Institution
  • Header Image 4 | Image Caption
  • Header Image 4 | Image Ref
  • Header Image 4 | Specimen Number
  • Synonym | Type of Synonym
  • Synoym | Discussion
  • Typification | Syntypes String
  • Typification | Type Comments
  • Typification | Type of Type
  • Typification | Type Specimens Data
  • Typification | Typification Discussion

Creating the JPEG2000 images

CATE now uses the Adore Djatoka JPEG 2000 Image server to dynamically process and serve images. Images were supplied as JPEG or TIFF images for the most part. Consequently they were processed using ImageMagick into smaller (less than 1600px wide and 1200px high) jpeg images, then processed into JPEG 2000 images using the compress command line tool supplied as part of the Djatoka distribution.
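The resizing step amounts to fitting each image inside a 1600 by 1200 pixel box without changing its aspect ratio. A minimal Python sketch of that arithmetic follows. It treats the limits as inclusive and leaves smaller images untouched; both choices are assumptions, since the exact ImageMagick options used for CATE are not recorded here.

```python
def fit_within(width, height, max_width=1600, max_height=1200):
    """Scale (width, height) down to fit the box, keeping the aspect ratio.

    Images that already fit are returned unchanged (the scale factor is
    capped at 1.0, so nothing is ever enlarged).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(3200, 2400), (4000, 1000), (1000, 4000), (800, 600)]:
        print(size, "->", fit_within(*size))
```

In ImageMagick the same "shrink only if larger" behaviour is normally requested with a geometry such as 1600x1200> passed to -resize.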

CATE-Tools command line syntax

Originally there were several tools used to generate distribution images etc. for CATE. Since we switched to using the EDIT Map REST Service, the Distribution Map Generator has become somewhat redundant. Only the DataSet Importer has been refactored to use the CDM. The remaining tools have been somewhat neglected due to lack of time and most likely do not work.

All tools are packaged into the cate-tools jar archive. The three tools are the DataSet Importer, the Sitemap Generator, and the Distribution Map Generator. Each tool expects its dependencies to be on its classpath (if they are packaged as jar files, they should be placed in a directory called lib in the current working directory).

DataSetImporter

This tool imports data in an xml file (conforming to the xml schema defined in cate-model) into a relational database. It overwrites any existing data in the database.

java -jar {path-to}cate-tools.jar import {file-uri}

The syntax is explained in the section above on data import. This tool expects a hibernate.properties file in the current directory, with some extra parameters (also explained above).

Sitemap Generator

This tool generates a sitemap (http://www.sitemaps.org) to enable easy indexing of a CATE site.

java -jar {path-to}cate-tools.jar sitemap {outputfile} {server-address}

where {outputfile} is the name of the sitemap file and {server-address} is the address (including the protocol, e.g. http://www.cate-project.org) of the server for which the map is being produced. As with the data set importer, this tool expects a hibernate.properties file in the current directory. Unlike the data set importer tool, this tool expects there to be data in the database, and will generate an empty sitemap if the database is empty.
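For readers who have not seen the format, the Python sketch below builds a minimal sitemap of the kind described at sitemaps.org: a urlset element containing one url/loc pair per page. It is not the Sitemap Generator's code; the page paths are hypothetical, whereas the real tool takes its URLs from the database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(server_address, paths):
    """Return sitemap XML listing server_address + path for each path."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for path in paths:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        # Join with exactly one slash between server address and path.
        loc.text = server_address.rstrip("/") + "/" + path.lstrip("/")
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical page paths, for illustration only.
    print(build_sitemap("http://www.cate-project.org", ["taxon/1", "taxon/2"]))
```

With an empty list of paths the function returns a urlset element with no children, which corresponds to the empty sitemap produced when the database holds no data.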

Distribution Map Generator

This tool generates a jpeg image of the distribution of a taxon according to the TDWG botanical regions.

java -jar {path-to}cate-tools.jar map {shapefile/directory} {outputfile} {level3 code 1} … {level3 code n}

where {shapefile/directory} is a directory containing the shape files supplied by the Kew Gardens GIS unit. The shapefile directory contains four subdirectories (level1, level2, level3 and level4), which contain the shapefiles for the TDWG botanical regions. {outputfile} is the name of the image file to be created. The remaining command line arguments should be TDWG codes from levels 1, 2 or 3.

The program will create an image where the TDWG level 3 areas are all outlined, and those areas supplied as command line arguments are filled in.
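Since a mistyped area code is easy to miss on a long command line, the Python sketch below checks the codes before assembling the arguments. The code shapes it accepts (a single digit for level 1, two digits for level 2, three capital letters for level 3) are an assumption based on the published TDWG geographical scheme, not something checked against the tool itself. The jar, directory and file names in the example are placeholders.

```python
import re

# Assumed shapes of TDWG area codes, by level.
_PATTERNS = (
    (1, re.compile(r"^[1-9]$")),         # continents, e.g. "1"
    (2, re.compile(r"^[1-9][0-9]$")),    # regions, e.g. "10"
    (3, re.compile(r"^[A-Z]{3}$")),      # botanical countries, e.g. "GRB"
)


def tdwg_level(code):
    """Return 1, 2 or 3 for a recognised code shape, otherwise None."""
    for level, pattern in _PATTERNS:
        if pattern.match(code):
            return level
    return None


def map_command(jar, shapefile_dir, output_file, codes):
    """Build the argument list for the map tool, rejecting odd-looking codes."""
    bad = [code for code in codes if tdwg_level(code) is None]
    if bad:
        raise ValueError(f"not a level 1-3 TDWG code: {', '.join(bad)}")
    return ["java", "-jar", jar, "map", shapefile_dir, output_file, *codes]


if __name__ == "__main__":
    # Placeholder file names and codes, for illustration only.
    print(" ".join(map_command("cate-tools.jar", "shapefiles", "out.jpg",
                               ["GRB", "IRE"])))
```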

Configuration and Administration of the CATE Servers

Hardware

The CATE servers are cate-dev.nhm.ac.uk and cate-live.nhm.ac.uk. Both are virtual machines running at the NHM. The dev site runs on the sandbox host, a single box with one dual-core 2.33 GHz Intel Xeon processor. The live site runs on a 2-node cluster (to be upgraded to 3 nodes); each node is a 2-processor quad-core machine. Each virtual host has 1024 MB of memory. Each server has two 8 GB blocks of storage on the SAN: /dev/sda1, mounted at /, and /dev/sdb1, mounted at /var/www/html/cate-sphingidae in order to hold all of the cate-sphingidae images. Current disk usage levels are XXX on sda1 (mounted on /) and YYY (mounted on /var/html/).

Installed Software

Both servers use identical configurations. The software stack is as follows:

  1. Fedora Core 4 (Linux Kernel 2.6.11-1.1369_FC4, http://fedoraproject.org).
  2. Sun Java Development Kit 6 update 6 (jdk-6u6-linux-i586 installed using rpm, http://java.sun.com).
  3. MySQL database server version 4.1 (mysql-4.1.20-1 & mysql-server-4.1.20-1 installed using rpm, http://www.mysql.com).
  4. Apache httpd web server version 2.0.54 (httpd-2.0.54-10 installed using rpm, http://httpd.apache.org).
    1. The AJP Connector Module version 1.2.27 (mod_jk-1.2.27, installed using the make build tool).
    2. The Secure Sockets Layer (SSL) module version 2.0.54 (mod_ssl-2.0.54, installed using rpm).
  5. Apache tomcat servlet container version 5.5.26 (installed from a gzipped tar archive to /home/tomcat/apache-tomcat-5.5.26, http://tomcat.apache.org).
  6. Monit 4.10 (installed from a gzipped tar archive to /usr/bin/monit http://mmonit.com/monit/).

A complete list of installed packages (required to, for example, build and install monit) is given here:

  • apr-devel-0.9.6-3.i386.rpm
  • apr-util-devel-0.9.6-2.i386.rpm
  • autoconf-2.59-5.noarch.rpm
  • bison-2.0-6.i386.rpm
  • compat-libstdc++-296-2.96-132.fc4.i386.rpm
  • compat-libstdc++-33-3.2.3-47.fc4.i386.rpm
  • cpp-4.0.0-8.i386.rpm
  • cyrus-sasl-devel-2.1.20-5.i386.rpm
  • db4-devel-4.3.27-3.i386.rpm
  • distcache-1.4.5-7.i386.rpm
  • e2fsprogs-1.38-0.FC4.1.i386.rpm
  • e2fsprogs-devel-1.38-0.FC4.1.i386.rpm
  • expat-devel-1.95.8-6.i386.rpm
  • flex-2.5.4a-34.i386.rpm
  • gcc-4.0.0-8.i386.rpm
  • glibc-devel-2.3.5-10.i386.rpm
  • glibc-headers-2.3.5-10.i386.rpm
  • glibc-kernheaders-2.4-9.1.94.i386.rpm
  • httpd-devel-2.0.54-10.i386.rpm
  • jdk-1_5_0_07-linux-i586.rpm
  • jdk-6u6-linux-i586.rpm
  • logrotate-3.7.1-11.i386.rpm
  • krb5-devel-1.4.1-5.i386.rpm
  • krb5-libs-1.4.1-5.i386.rpm
  • krb5-workstation-1.4.1-5.i386.rpm
  • mod_jk-1.2.6-3jpp_4fc.i386.rpm
  • mod_ssl-2.0.54-10.i386.rpm
  • mysql-4.1.20-1.FC4.1.i386.rpm
  • mysql-devel-4.1.20-1.FC4.1.i386.rpm
  • mysql-server-4.1.20-1.FC4.1.i386.rpm
  • openssl-0.9.7f-7.10.i686.rpm
  • openssl-devel-0.9.7f-7.10.i386.rpm
  • pcre-devel-5.0-4.i386.rpm
  • sun-javadb-client-10.3.1-4.1.i386.rpm
  • sun-javadb-common-10.3.1-4.1.i386.rpm
  • sun-javadb-core-10.3.1-4.1.i386.rpm
  • sun-javadb-demo-10.3.1-4.1.i386.rpm
  • sun-javadb-docs-10.3.1-4.1.i386.rpm
  • sun-javadb-javadoc-10.3.1-4.1.i386.rpm
  • sysstat-5.0.5-9.fc.i386.rpm
  • zlib-devel-1.2.2.2-5.fc4.i386.rpm
  • apache-tomcat-5.5.26.tar.gz
  • jdk-6u6-linux-i586-rpm.bin
  • jdk-1_5_0_07-linux-i586-rpm.bin
  • monit-4.10.tar.gz
  • tomcat-connectors-1.2.28-src.tar.gz

Configuration

Unless otherwise noted, configuration follows the standard install for the given package. Configuration files are stored in CVS under the cate-server module, following the directory layout of the server (i.e. files stored in /etc/httpd/ on the server are found under /etc/httpd in CVS, tomcat's server.xml is found under /home/tomcat/apache-tomcat-5.5.26/conf/).

MySQL

MySQL has two databases, one named sphingidae, the other named araceae. The server is configured to use UTF-8 as the default character set wherever possible.

Apache Tomcat

Tomcat is configured with two virtual hosts, named cate-araceae and cate-sphingidae. The root directories of the two virtual hosts are ${CATALINA_HOME}/cate-araceae and ${CATALINA_HOME}/cate-sphingidae respectively. Each virtual host has a single web application deployed in the root context (i.e. in ${CATALINA_HOME}/cate-araceae/ROOT and ${CATALINA_HOME}/cate-sphingidae/ROOT). Tomcat listens on the standard AJP/1.3 port (8009).

Tomcat is configured as a UNIX daemon, following the instructions at http://tomcat.apache.org/tomcat-5.5-doc/setup.html. The jsvc tool is installed in /home/tomcat/jakarta-commons/daemon/src/native/unix/. The init script is located at /etc/init.d/tomcat. This script has been customized to enforce the use of UTF-8 as the standard system file encoding (-Dfile.encoding=UTF-8), and to set the initial and maximum JVM heap sizes to 256 MB and 400 MB respectively (-Xms256m -Xmx400m).

Apache httpd

Apache is configured using /etc/httpd/conf/httpd.conf and the files in /etc/httpd/conf.d following the standard install. The following customizations have been made:

  1. Addition of mod_jk (LoadModule jk_module modules/mod_jk.so) with the following configuration:

       <IfModule mod_jk.c>
         JkWorkersFile "/etc/httpd/conf/workers.properties"
         JkLogFile     "/etc/httpd/logs/mod_jk.log"
         JkLogLevel    warn
       </IfModule>

  2. Disabling of ETags (FileETag None).
  3. The use of mod_deflate to compress output where possible:

       <IfModule mod_deflate.c>
         SetOutputFilter DEFLATE
         BrowserMatch ^Mozilla/4 gzip-only-text/html
         BrowserMatch ^Mozilla/4\.0[678] no-gzip
         BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
         SetEnvIfNoCase Request_URI \
           \.(?:gif|jpe?g|png|swf|lkc3|jar)$ no-gzip dont-vary
         Header append Vary User-Agent env=!dont-vary
       </IfModule>

  4. Setting up two name-based virtual hosts on ports 80 and 443:

       NameVirtualHost <-IP ADDRESS->:80
       NameVirtualHost <-IP ADDRESS->:443

       Include conf/vhosts/*.conf

    The two virtual hosts are configured using /etc/httpd/conf/vhosts/cate-araceae.conf and /etc/httpd/conf/vhosts/cate-sphingidae.conf respectively. These files set the document root to /var/www/html/cate-araceae and /var/www/html/cate-sphingidae respectively, and direct all requests that do not match a static resource held in the document root to tomcat. Each file also sets a 3 month expires header on static resources.

Monit

Monit is a monitoring daemon that monitors usage and existence of services. It is configured using /etc/monitrc. Monit is configured to check for

  1. mysql existence (/var/run/mysqld/mysql.pid), and to try to restart it (twice) if it crashes repeatedly.
  2. tomcat existence (/var/run/jsvc.pid), and to try to restart it (twice) if it crashes repeatedly.
  3. apache existence (/var/run/httpd.pid), and to try to restart it (twice) if it crashes repeatedly.
  4. A 200 response code from the web server on port 80, and to send an alert if it receives another response code.

Monit is also configured to send alerts if various system parameters (cpu usage, memory usage, and load average) exceed specified thresholds defined in monitrc.

Starting and stopping services

All services used by the CATE server (mysqld, tomcat, httpd, and monit) are configured as services and can be started and stopped manually using /sbin/service e.g.

> service servicename (start | stop | . . .)

In addition, these services are managed using chkconfig. Currently each server is configured to start the services in the correct order upon boot to a normal runlevel, and to shut them down in order also, so in theory restarting the server should not necessitate starting services manually.

To resolve problems with apache tomcat or to upgrade the web application, stop monit before stopping tomcat, and start monit again only after tomcat has started. Otherwise monit will start a second instance of tomcat, which will compete with the first for memory and ports and generally screw things up. Restarting httpd is not necessary; apache will display a human friendly error message until tomcat starts responding.
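The order of operations matters, so it may help to script it. The Python sketch below lists the four service commands in the order described above and prints them by default instead of running them. The service names are the ones given in this section; the script itself is an illustration and is not part of the cate-server module.

```python
import subprocess


def tomcat_restart_plan():
    """Commands for a safe tomcat restart: monit is stopped first, started last."""
    return [
        ["/sbin/service", "monit", "stop"],
        ["/sbin/service", "tomcat", "stop"],
        # Upgrade the web application here, while tomcat is down.
        ["/sbin/service", "tomcat", "start"],
        ["/sbin/service", "monit", "start"],
    ]


def run_plan(plan, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for command in plan:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True aborts the sequence if a command fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_plan(tomcat_restart_plan())
```

Running it with dry_run=False requires root privileges on the server. Because the sequence stops at the first failing command, monit is not restarted if tomcat fails to start.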