Cornell University

Software

Program	Description	Version
species.bat	Batch file that runs metadata program	2.1
metadata.txt	Maple program that sets input/output file definitions and global program options, and runs desired model-fitting program	2.1
poisson.txt	Fits Poisson/equal class sizes model	2.1
negbin.txt	Fits gamma mixed-Poisson/negative binomial model	2.1
invgauss.txt	Fits inverse Gaussian-mixed Poisson model	2.1
lognormal.txt	Fits lognormal mixed-Poisson model	2.1
pareto.txt	Fits Pareto mixed-Poisson model	2.1
mixed_expl.txt	Fits mixture of 2 exponentials mixed-Poisson model	2.1

Operational overview

Our computer programs are written in Maple. The basic structure of our algorithm is the same across different parametric models (although mathematical differences between the models require somewhat different specific computational strategies in each case). This allows precise comparison of different models fitted to the same dataset.

We have found that, while the graphical user interface (GUI) of Maple is convenient, the numerically intensive nature of these computations requires maximum speed. We have therefore elected to bypass the GUI in favor of a batch-processing system. Under this system, to run an analysis you simply submit the batch (.bat) file to a "command prompt window" or to the "run" window under Windows. Upon completion, the output is written to the location specified in the "metadata" file (see below for details).

Getting started

The following resources are required.

- A reasonably fast computer with a substantial amount of RAM. These programs perform a sequence of numerical searches, which can be time-consuming (high number of iterations), and they loop through (what can be) a large number of subsets of the data. We recommend at least a 1 GHz processor with 512 MB RAM (we are currently using a 3.4 GHz processor with 4 GB RAM).
- The current version of Maple. Many universities use the program for a variety of purposes, especially for teaching calculus in the mathematics department, so check to see if your institution has a site license.
- All of the programs in the table above (under "Downloadable code"). For convenience, copy all of these files to a single directory (folder).
- Your dataset in text (.txt) file format, structured exactly as described below.

To run an analysis, simply run the species.bat file (after specifying input and output files, and other program options, as described below). You can do this either in a "run" window, or at a command prompt in a DOS window.

The main program files

The species.bat file. This is a batch program that runs under DOS. It is very simple; in fact here it is in its entirety:
```
REM batch program for local machine with one processor

REM set echo on
echo on

REM set path for command-line Maple (cmaple) home directory
path C:\Program Files\Maple 8\bin.win

REM run metadata program
cmaple8 "C:\Documents and Settings\John Smith\My Documents\species\metadata.txt"
```
In this version of species.bat it is assumed that the command-line Maple program, cmaple, is in the directory C:\Program Files\Maple 8\bin.win. To find this directory on your computer, search for the filename "cmaple," and, if necessary, edit the pathname in the species.bat file accordingly, using a text editor such as Notepad. It is also assumed that the downloaded program files (from this website, above) are in a folder named "species" under the "My Documents" folder of user "John Smith." Again, you must edit this line in species.bat to reflect your installation.
The metadata file. This is a Maple program that sets the filenames for the input data file (the dataset to be analyzed), the desired model-fitting program, and the two output files. The structure of this file must be maintained exactly as it is given here, since it is a Maple program. It is in large part self-explanatory. You must edit the following options in metadata.txt, using a text editor such as Notepad:
- program_file: the complete filename for the program that fits the desired model.
- data_file: the complete filename for your dataset (see below for details on format).
- output_fits_file: the complete filename to contain the fitted values at each right truncation point.
- output_analysis_file: the complete filename to contain the full analysis, including estimated number of species, standard error, goodness-of-fit statistics, etc., at each right truncation point.
- fmin: the lowest right truncation point; that is, the smallest subset of the data to be analyzed will contain frequencies from 1 to fmin. Default: 5.
- fmax: the largest right truncation point; that is, the largest subset of the data to be analyzed will contain frequencies from 1 to fmax. Default: the maximum frequency occurring in the data.
The following options may be left at their defaults, or changed by the user:
- significant digits: the number of significant digits used by Maple in its computations. Default: 16.
- Subsequent program options are more technical.
Model-fitting programs. Each program fits a specific parametric model to the observed frequency data, via the method of maximum likelihood. The basic goals of the program are to
- Compute maximum likelihood estimates (MLE's) of the distribution parameters;
- Compute the "conditional maximum likelihood estimate" of the number of unobserved classes and of the total number of classes;
- Compute the standard error of these estimates;
- Compute the p-value of the classical chi-squared goodness-of-fit (GOF) statistic; and
- Output text files with (i) fitted values, and (ii) all relevant statistics and program error diagnostics.
These computations are done to a level of precision specified by the user (16 significant digits by default). The complete analysis is run on each of a sequence of subsets of the data: each subset consists of the frequency data from 1 up to a given right truncation point t, where t ranges from some minimum frequency specified by the user (5 by default), up to a maximum set by the user (by default, the maximum frequency encountered in the data). Each row or line in the output file contains the complete analysis at a given right truncation point t. Thus the user can compare analyses at different right truncation points; typically the fit will vary with t.
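
The loop over right truncation points can be sketched in Python as follows. This is only an illustration and not the Maple code: the function name truncated_subsets, the dictionary representation of the data, and the counts themselves are hypothetical, and the sketch assumes that every integer t from fmin to fmax is visited.

```python
def truncated_subsets(freq_counts, fmin=5, fmax=None):
    """Yield (t, subset) for each right truncation point t in fmin..fmax.

    freq_counts maps a frequency j to the number of classes observed j times.
    """
    if fmax is None:
        fmax = max(freq_counts)  # default: largest frequency in the data
    for t in range(fmin, fmax + 1):
        # Keep only the frequencies from 1 up to the truncation point t.
        subset = {j: n for j, n in sorted(freq_counts.items()) if 1 <= j <= t}
        yield t, subset


if __name__ == "__main__":
    # Hypothetical frequency counts, for illustration only.
    counts = {1: 40, 2: 18, 3: 9, 4: 5, 6: 2, 9: 1}
    for t, subset in truncated_subsets(counts):
        print(t, sum(subset.values()))  # t and the number of classes retained
```

Each subset would then be passed to the model-fitting step, giving one row of output per t.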

The general architecture is as follows:
- First, given a fixed set of starting values, the program attempts to find method-of-moments estimates of the unknown parameters. If this fails (which is unusual), the program abandons the current truncation point and moves on to the next right truncation point t.
- The GOF is computed based on the moment-method estimates. If the GOF p-value falls below a user-specified threshold (default: p < 10^(-6)), the program abandons the current truncation point and moves on to the next right truncation point t.
- Using the moment-method estimates as starting values, the program searches for the MLE's. This process continues through a number of steps, and yields values for the MLE's that are as precise as the program is able to compute (ideally exactly correct).
- The GOF is then computed based on the MLE's. If the GOF p-value falls below a user-specified threshold (default: p < 10^(-3)), the program abandons the current truncation point and moves on to the next right truncation point t.
- The standard error is computed using the MLE's.
Once all computations are complete at all right truncation points t, the output is formatted and written to a user-specified text file, which can then be read into Excel or any other package for editing and display.
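
The staged control flow described above can be sketched in Python. This is a rough outline under stated assumptions, not the authors' implementation: the four callables (moment_fit, gof_p_value, mle_fit, standard_error) are placeholders for model-specific routines, and the status strings are hypothetical (the actual programs report a numeric code).

```python
def analyze_truncation_point(subset, moment_fit, gof_p_value, mle_fit,
                             standard_error, moment_threshold=1e-6,
                             mle_threshold=1e-3):
    """Run the staged analysis for one right truncation point.

    The callables are hypothetical stand-ins for model-specific code:
      moment_fit(subset)            -> starting estimates, or None on failure
      gof_p_value(subset, params)   -> chi-squared goodness-of-fit p-value
      mle_fit(subset, start)        -> maximum likelihood estimates
      standard_error(subset, mle)   -> standard error of the estimate
    """
    start = moment_fit(subset)
    if start is None:
        return {"status": "moment estimation failed"}
    # Screen out hopeless fits before the expensive MLE search.
    if gof_p_value(subset, start) < moment_threshold:
        return {"status": "poor fit at moment estimates"}
    mle = mle_fit(subset, start)
    p_value = gof_p_value(subset, mle)
    if p_value < mle_threshold:
        return {"status": "poor fit at MLE"}
    return {"status": "ok", "mle": mle, "p_value": p_value,
            "se": standard_error(subset, mle)}
```

Calling this function once per truncation point and collecting the returned dictionaries would give the rows of the analysis output.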

Currently we have six parametric models. The output from each program is structured the same way. They are all mixed-Poisson models (see Basic Theory), with different mixing distributions:
- Poisson, with a point-mass mixing distribution, that is, the ordinary unmixed Poisson. Under this model the sampling intensity is constant or identical for all classes in the population.
- Negative binomial, or gamma-mixed Poisson. The mixing distribution, or the distribution of the sampling intensities, or the stochastic abundance distribution, is the gamma.
- Inverse Gaussian-mixed Poisson. The mixing distribution is the inverse Gaussian.
- Lognormal-mixed Poisson. The mixing distribution is the lognormal.
- Pareto-mixed Poisson. The mixing distribution is the Pareto.
- 2-mixed-exponential-mixed Poisson. The mixing distribution is a mixture of 2 exponentials.
We are continually searching for more families of mixing distributions that (i) have the potential to fit a wide variety of count data, particularly with high diversity and (some) large abundances; (ii) can be shown to satisfy the technical conditions required for the general theory (in particular asymptotic variances, i.e., standard errors) to be valid; and (iii) are feasibly computable. We will add programs for these as they become available.
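
For the simplest of these models, the unmixed Poisson, the conditional maximum likelihood calculation can be sketched in a few lines of Python. This is a hedged illustration of the standard zero-truncated Poisson calculation, not a transcription of poisson.txt: it uses all of the data (no right truncation), omits the standard error and goodness-of-fit steps, and the function name fit_poisson and the counts are hypothetical.

```python
import math


def fit_poisson(freq_counts, iterations=200):
    """Fit a zero-truncated Poisson and estimate the total number of classes.

    freq_counts maps a frequency j >= 1 to the number of classes seen j times.
    Returns (lam, p0, unobserved, total).
    """
    c = sum(freq_counts.values())                   # observed classes
    n = sum(j * f for j, f in freq_counts.items())  # observed individuals
    mean = n / c
    if mean <= 1.0:
        raise ValueError("every class is a singleton; no finite estimate")
    # The MLE of lam solves lam / (1 - exp(-lam)) = mean. The left side is
    # increasing in lam, and the root lies in (0, mean), so bisection works.
    lo, hi = 1e-12, mean
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    p0 = math.exp(-lam)      # estimated probability that a class is unobserved
    total = c / (1.0 - p0)   # estimated total number of classes
    return lam, p0, total - c, total


if __name__ == "__main__":
    # Hypothetical frequency counts, for illustration only.
    print(fit_poisson({1: 40, 2: 18, 3: 9, 4: 5, 6: 2, 9: 1}))
```

The mixed-Poisson models follow the same pattern, but with more parameters and a likelihood that generally has to be maximized numerically rather than by solving a single equation.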

The input and output files

Your dataset file. This must be a text (ASCII) file, with two columns, tab delimited, with a carriage return/new paragraph mark at the end of each line (note that there must not be an extra return after the last line). The first column contains the frequencies; the second contains the frequencies of frequencies, that is, the number of classes observed at each frequency. Here is a sample dataset, the same one discussed under Basic Theory. A file with this structure can be readily created using, e.g., Microsoft Excel.
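
If you prefer to generate the dataset file with a script rather than Excel, a minimal Python sketch follows. The function names and the counts are hypothetical, and the Windows-style line ending ("\r\n") is an assumption based on the description above; the key point is that no line ending is written after the last row.

```python
import os
import tempfile


def write_dataset(path, freq_counts):
    """Write two tab-delimited columns: frequency, frequency of frequency."""
    lines = [f"{j}\t{n}" for j, n in sorted(freq_counts.items())]
    # newline="" stops Python from translating the line endings we supply.
    with open(path, "w", newline="") as handle:
        handle.write("\r\n".join(lines))  # no return after the last line


def read_dataset(path):
    """Read the file back into a {frequency: count} dictionary."""
    with open(path, newline="") as handle:
        rows = [row.split("\t") for row in handle.read().splitlines() if row]
    return {int(j): int(n) for j, n in rows}


if __name__ == "__main__":
    # Hypothetical frequency counts, for illustration only.
    demo_path = os.path.join(tempfile.mkdtemp(), "sample_data.txt")
    write_dataset(demo_path, {1: 40, 2: 18, 3: 9, 4: 5, 6: 2, 9: 1})
    print(read_dataset(demo_path))
```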
The fitted values file. The structure of this file is as follows. The first, left-most column contains the integers from 1 up to the maximum frequency in the data, i.e., all (potentially) observed frequencies. The second column contains the actual observed frequency-of-frequency counts for each integer (some of these may be zero). Subsequent columns contain the values fitted by the model to the given frequency, from 1 to t; each column contains the fitted values for a given right truncation point t.
The analysis output file. Each row or line in the analysis output file contains the results of a complete analysis at a given right truncation point t. For a description of the analysis results see Basic Theory. From left to right, the statistics are:
- the right truncation point t;
- the MLE's of the parameters of the distribution;
- the MLE of the "non-coverage," i.e., p₀;
- the estimated number of unobserved species, i.e., s₀;
- the estimated number of species based only on the data up to the right truncation point, that is, excluding the species with observed frequencies greater than the right truncation point;
- the estimated total number of species, that is, including the species with observed frequencies greater than the right truncation point;
- the standard error of the estimate of the number of species (the standard errors for the estimate based on the subset and for the estimated total are the same);
- a lower bound for the standard error (an empirical version of the simple binomial SE; see Chao and Bunge (2002));
- the "naïve" p-value of the chi-squared goodness-of-fit test for the model, using all cells;
- the p-value for an asymptotically correct chi-squared goodness-of-fit test based on concatenating adjacent cells so that all expected cell counts are at least 5, to conform with asymptotic theory;
- the "program error report," which is actually a numeric code indicating the state of the program when it terminated (not necessarily an error).
Here is a Microsoft Excel template for the output file. To use it, open the template, and from within Excel, open the analysis output text file, and paste the results into the template under the header row. (If you have output files from several models, paste each output into the template in a vertical array (one below the other), labeling each with its model name in column A.)
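
The cell pooling behind the second goodness-of-fit p-value in the list of statistics above can be sketched in Python. This is one plausible left-to-right pooling rule, offered as an assumption rather than as the rule the Maple programs actually use; the function names and the observed and expected counts are hypothetical. The sketch stops at the chi-squared statistic; converting it to a p-value requires the chi-squared distribution with degrees of freedom that depend on the number of pooled cells and fitted parameters.

```python
def pool_cells(observed, expected, min_expected=5.0):
    """Merge adjacent cells until every pooled cell has expected >= min_expected.

    Returns a list of (observed, expected) pairs. A leftover tail that is
    still too small is merged into the last pooled cell.
    """
    pooled = []
    obs_acc = exp_acc = 0.0
    for o, e in zip(observed, expected):
        obs_acc += o
        exp_acc += e
        if exp_acc >= min_expected:
            pooled.append((obs_acc, exp_acc))
            obs_acc = exp_acc = 0.0
    if obs_acc > 0.0 or exp_acc > 0.0:
        if pooled:
            last_obs, last_exp = pooled.pop()
            pooled.append((last_obs + obs_acc, last_exp + exp_acc))
        else:
            pooled.append((obs_acc, exp_acc))
    return pooled


def chi_squared_statistic(cells):
    """Pearson chi-squared statistic over (observed, expected) cells."""
    return sum((o - e) ** 2 / e for o, e in cells)


if __name__ == "__main__":
    # Hypothetical observed counts and fitted values, for illustration only.
    cells = pool_cells([40, 18, 9, 5, 0, 2], [38.0, 20.0, 9.5, 4.0, 2.0, 1.5])
    print(cells, chi_squared_statistic(cells))
```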

Estimating the Number of Classes in a Population

How many goodly creatures are there here! - The Tempest, V, i
